Deduping Messages in Kafka Streams the Right Way

Most examples I found in the wild of how to deduplicate identical or unchanged messages, I’ve recently discovered, do it the wrong way. By the wrong way I mean they use groupByKey and aggregate to compare previous and current values and then filter out the unchanged values. This seemed like a creative way of leveraging the DSL functionality. The problem with this method is that, because this wasn’t the intent of the Kafka Streams DSL’s groupBy and aggregate (which is to perform aggregations), various features need to be turned off or worked around to prevent the ‘changed’ events being lost by the DSL’s optimisations. The “right way” is a bit subjective, so to quell the inevitable quibbles, let’s settle for “deduping messages in Kafka Streams: not the wrong way”. The Streams DSL performs optimisations on aggregations by caching the output before sending it downstream and also caching it before writing to the state store and the changelog topic it’s persisted to. Effectively doin

Running Jenkins on ECS

Running Jenkins using ECS tasks to run worker nodes has been documented before; however, there aren’t any up-to-date examples, nor any that provide separation of the master and slaves. This post describes a fairly up-to-date deployment using the newer deployment techniques offered by AWS. Even if you’re not using Jenkins, being able to create CloudFormation templates using the new EC2 launch options is helpful if you’re using many spot instances, as you will most likely be experiencing instance type availability fluctuations. Key features of this deployment: both worker nodes and the master node run on ECS, the master as a service and slaves as dynamically added tasks; the master node runs on its own dedicated cluster, and its file system store is only mounted on and accessible by the master; a job can launch, run and build Docker images from within an already running container; the worker nodes can also spawn build agents (Docker containers) using the "new" Jenkins pipeline syntax; where the

Use Instance Store With AWS Elastic Container Storage

Many EC2 instance types come with instance-attached storage (Instance Store), which provides local storage that is faster than an EBS volume. If you’re using the Amazon ECS-optimized AMI (Amazon Linux 1), its instance storage is a secondary EBS volume that is used for storing Docker containers and volumes. If you’re launching it on an EC2 instance with instance store, the instance store is ignored and only the one EBS volume is used. Update July 3, 2019: Added details for Amazon Linux 2. Amazon Linux 1: Amazon ECS-optimized AMI. The Amazon Linux 1 based version of the ECS AMI uses the Device Mapper storage driver for container storage, which uses a thin-pool volume (part of LVM). Here is a simplistic cloud-init script that detects the attached SSDs and NVMe SSDs and adds them to the LVM volume group. Just launch your EC2 instance with the following user data, or download the script from this gist if you’ve got a more complex init script already. Note: I have not done thorough performance testin

Create a Private Microservice Using an Application Load Balancer

Previously, if you wanted to create a REST API powered by a Lambda, you only had one choice: API Gateway. This has a few limitations, notably that its endpoints are always public, so you need to use IAM or similar to lock them down, and that you can only use a custom domain name once globally, meaning you can’t duplicate an implementation across multiple accounts with the same host endpoint. AWS recently announced another way to create a RESTful endpoint for Lambdas: Application Load Balancers.

Using an async iterator on Node.js + S3

There isn't support for async iterators (for await...of) in Node.js v8.9, which is AWS Lambda's runtime. It's a shame, as it’s a great feature that allows you to iterate over an iterable that returns its results asynchronously (e.g. retrieving another page from a database), using a compact for loop that feels synchronous but under the covers is actually asynchronous. This means that if you want to use a library that is written specifically to use it (e.g. Amazon DynamoDB QueryPaginator), you have to use an even more verbose syntax. However, with a bit of re-purposing you can use a generator function that yields a Promise on each iteration, and if you await each promise given in the loop it will behave like an async iterator.
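As a rough illustration of that re-purposing, here is a minimal sketch rather than the post's actual code. It assumes a paginated service that hands back a continuation token; `PAGES`, `fetchPage`, `pagePromises` and `listAllKeys` are hypothetical names, and the in-memory `fetchPage` merely stands in for a real call such as an S3 listing.

```javascript
// Hypothetical in-memory stand-in for a paginated service (not a real AWS call).
const PAGES = [['a.txt', 'b.txt'], ['c.txt'], ['d.txt', 'e.txt']];

// Resolves with one page of keys plus the token for the next page (undefined when finished).
function fetchPage(token) {
  const index = token === undefined ? 0 : token;
  const nextToken = index + 1 < PAGES.length ? index + 1 : undefined;
  return Promise.resolve({ keys: PAGES[index], nextToken });
}

// A plain (synchronous) generator that yields one Promise per page.
// The continuation state is updated when each Promise resolves, so the
// consumer must await the yielded Promise before asking for the next one.
function* pagePromises() {
  let token;
  let done = false;
  while (!done) {
    yield fetchPage(token).then((page) => {
      token = page.nextToken;
      done = token === undefined;
      return page.keys;
    });
  }
}

// An ordinary for...of loop with an await inside behaves much like for await...of.
async function listAllKeys() {
  const keys = [];
  for (const pending of pagePromises()) {
    const pageKeys = await pending;
    keys.push(...pageKeys);
  }
  return keys;
}
```

Calling `listAllKeys().then(console.log)` walks every page in order. The trade-off is that the loop only terminates correctly if each promise is awaited before the next iteration; collecting the promises first (for example with `Promise.all`) would break the paging state.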

Auto partition secondary EBS on CentOS 7

If you add an additional blank EBS volume to a CentOS 7 EC2 instance, it won’t be auto-partitioned in the same way that the primary volume gets resized on launch. Additionally, if you’re baking an image, you will almost certainly encounter problems with the volume not automounting on different instance types, and even trying to mount it manually can give misleading error messages.

Using an S3 Hive Metastore with EMR

When configuring Hive to use EMRFS (i.e. s3://) for the metastore instead of the implied HDFS cluster storage, which is vital if you want a persistent metastore that can survive clusters being destroyed and recreated, you might encounter this message: Access Denied (Service: Amazon S3; Status Code: 403;...).