Deduping Messages in Kafka Streams the Right Way
Most examples I found out in the wild of how to deduplicate identical or unchanged messages, I’ve recently discovered, do it the wrong way. By wrong way I mean they used a groupByKey & aggregate to compare previous/current values and then filter out the unchanged values. This seemed like a creative way of leveraging the DSL functionality. The problem with this method, is because this wasn’t the intent of the Kafka Streams DSL’s groupBy & aggregate (which is performing aggregations), various features need to be turned off or worked around to prevent the ‘changed’ events being lost by the DSL’s optimisations. The “right way” is a bit subjective, so to quell the inevitable quibbles, let’s settle for the “deduping messages in Kafka Streams: not the wrong way”. The streams DSL, performs optimisations on aggregations by caching the output before sending downstream and also caching it before writing to the state store and the changelog topic it’s persisted to. Effectively doin