10 Steps to Get Started Aggregating Data Using Kafka Streams

Amandeep Midha
5 min read · Aug 25, 2022

This post is for developers who know Apache Kafka and have a beginner-level awareness of the Kafka Streams DSL (in Java or Scala, possibly the only two languages that expose its full feature set) but have no prior hands-on experience with Kafka Streams. Here are 10 steps and considerations to get started.

1. What is a Topology & which Data Structure to use and when?

First things first: a topology is basically a set or sequence of processing operations applied to one or more streams to arrive at the expected output. A detailed explanation of topologies can be read here

Primarily, two kinds of data structures, KStream and KTable (plus GlobalKTable), are available for constructing such a topology in your streaming application

Getting to know the difference between these two main constructs in Kafka Streams is the first important step:

KStream is a log of messages where keys can repeat

KTable is a special form of store in which each newly arriving record replaces the old value for its key with the new one

Another important concept is the KStream-KTable duality: each can be converted to the other seamlessly

GlobalKTable, however, is a special case that requires more memory and can only be used on the right side of a join operation, because a GlobalKTable holds a full copy of the data on each instance, not just particular partitions as a KTable does. Use GlobalKTable sparingly, and keep in mind the original reason it was introduced
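To make the stream/table difference concrete, here is a minimal sketch in plain Java. It is only a model of the semantics, not the Kafka Streams API: the class name, the method names and the sample records are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the stream/table duality. This is NOT the Kafka Streams API;
// all names here are hypothetical.
final class StreamTableSketch {

    // Stream -> table: replay the log as upserts, so the latest value per key wins.
    static Map<String, Integer> toTable(List<Map.Entry<String, Integer>> stream) {
        Map<String, Integer> table = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> record : stream) {
            table.put(record.getKey(), record.getValue());
        }
        return table;
    }

    // Table -> stream: emit every current row of the table as one record.
    static List<Map.Entry<String, Integer>> toStream(Map<String, Integer> table) {
        return new ArrayList<>(table.entrySet());
    }

    public static void main(String[] args) {
        // A "KStream"-like log: the key "alice" repeats.
        List<Map.Entry<String, Integer>> log = List.of(
                Map.entry("alice", 1),
                Map.entry("bob", 5),
                Map.entry("alice", 3));

        System.out.println(log.size());              // 3 records in the stream view
        System.out.println(toTable(log));            // {alice=3, bob=5}
        System.out.println(toStream(toTable(log)));  // [alice=3, bob=5]
    }
}
```

The stream view keeps all three records, while the table view keeps one row per key. In the real DSL, the table-to-stream direction is exposed as `toStream()` on a KTable.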

2. Stateful or Stateless Operation?

Streaming operations can be stateful or stateless. Stateless operations in Kafka Streams are useful when you are filtering or doing any other operation that doesn’t need to remember state. Operations that require the history of previous messages as context (that is, state preservation) are termed stateful operations

For example: filter, map, mapValues, flatMap, flatMapValues, branch, peek, selectKey, etc. are stateless operations

Stateful operations include: count, join, aggregate, reduce, and all time window operations

The most common operation involving more than one stream is a join, and joining two streams is clearly a stateful operation
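The difference can be sketched in plain Java. Again, this is a model and not the Kafka Streams API: the class name, the method names and the sample data are assumptions for illustration. The `TreeMap` plays the role that a state store plays in Kafka Streams.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java model of stateless vs stateful processing. NOT the Kafka Streams API;
// all names here are hypothetical.
final class StatelessVsStatefulSketch {

    // Stateless (like filter): each record is judged on its own,
    // nothing is remembered from one record to the next.
    static List<Map.Entry<String, Integer>> filterAtLeast(
            List<Map.Entry<String, Integer>> records, int threshold) {
        return records.stream()
                .filter(r -> r.getValue() >= threshold)
                .collect(Collectors.toList());
    }

    // Stateful (like count): the running count per key has to survive
    // between records, so it needs a store.
    static Map<String, Long> countByKey(List<Map.Entry<String, Integer>> records) {
        Map<String, Long> store = new TreeMap<>();
        for (Map.Entry<String, Integer> r : records) {
            store.merge(r.getKey(), 1L, Long::sum);
        }
        return store;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> records = List.of(
                Map.entry("a", 10),
                Map.entry("b", 2),
                Map.entry("a", 7));

        System.out.println(filterAtLeast(records, 5)); // [a=10, a=7]
        System.out.println(countByKey(records));       // {a=2, b=1}
    }
}
```

If the process restarts, the filter loses nothing, but the count is gone unless the store is persisted somewhere. That is the practical reason stateful operations cost more.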

3. Do your Messages have Keys?

For any join operation to work, the input messages must first have keys. Imported messages normally do not have keys by default (unless keys are configured, possibly via Kafka Connect), and this is where a Kafka Streams DSL operation like “selectKey(…)” comes in useful

As a Kafka developer, you must also understand that changing message keys triggers re-partitioning of the messages once a key-based operation such as a join or aggregation follows. Rekeying has a computing cost, so such operations should be kept to a minimum unless absolutely necessary
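Why rekeying forces records to move can be sketched in plain Java. This is a model under stated assumptions, not the Kafka Streams API: `partitionFor` uses `String.hashCode()` purely as a stand-in (Kafka's default partitioner hashes the serialized key bytes instead), and the order and customer ids are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Plain-Java model of rekeying and partition assignment. NOT the Kafka Streams API;
// all names here are hypothetical.
final class RekeySketch {

    // Illustrative stand-in for partition assignment: same key, same partition.
    // String.hashCode() is used only to keep the sketch dependency-free.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Models what selectKey(...) does: derive a new key from each (key, value) pair.
    static List<Map.Entry<String, String>> selectKey(
            List<Map.Entry<String, String>> records,
            BiFunction<String, String, String> keySelector) {
        List<Map.Entry<String, String>> rekeyed = new ArrayList<>();
        for (Map.Entry<String, String> r : records) {
            String newKey = keySelector.apply(r.getKey(), r.getValue());
            rekeyed.add(Map.entry(newKey, r.getValue()));
        }
        return rekeyed;
    }

    public static void main(String[] args) {
        // Orders keyed by order id, with the customer id as the value.
        List<Map.Entry<String, String>> orders = List.of(
                Map.entry("o-1", "c-42"),
                Map.entry("o-2", "c-42"));

        // Keyed by order id, the two orders can land on different partitions.
        for (Map.Entry<String, String> r : orders) {
            System.out.println(r.getKey() + " -> partition " + partitionFor(r.getKey(), 4));
        }

        // Rekeyed by customer id, both orders share a key and therefore a partition.
        for (Map.Entry<String, String> r : selectKey(orders, (orderId, customerId) -> customerId)) {
            System.out.println(r.getKey() + " -> partition " + partitionFor(r.getKey(), 4));
        }
    }
}
```

A join or aggregation per customer needs all of one customer's records on the same partition. After the key changes, records have to be written out and redistributed by the new key, and that is the cost to keep in mind.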

4. Two Streams to Join at a Time

If you are new to streaming, this may not affect you, but if you have prior stream-processing experience in frameworks like Apache Spark, Kafka Streams might baffle you slightly. Unlike Apache Spark, where multi-way joins between streams are possible, Kafka Streams can join only two streams in a single step. Needless to say, those join operations can still all be put in one topology

For example, if there are 6 input topics to join, they will be joined two at a time, as Kafka Streams is designed (in contrast to Apache Spark). The series of join operations might look like this for arriving at one final joined and aggregated topic. Based on the actual join and aggregation needs of each of the four domain topics, be prepared to write one topology for arriving at each of the four topics
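The "two at a time" idea can be sketched in plain Java. This is a model, not the Kafka Streams API: the `join` helper below is hypothetical and works on in-memory maps, and the three inputs are made-up sample data. In the real DSL, each step would be one `join(...)` call that takes a single other stream or table plus a value joiner.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

// Plain-Java model of chaining pairwise joins. NOT the Kafka Streams API;
// all names here are hypothetical.
final class PairwiseJoinSketch {

    // Inner join of exactly two keyed inputs; keys missing on either side are dropped.
    static <L, R, O> Map<String, O> join(Map<String, L> left,
                                         Map<String, R> right,
                                         BiFunction<L, R, O> joiner) {
        Map<String, O> joined = new TreeMap<>();
        for (Map.Entry<String, L> e : left.entrySet()) {
            R other = right.get(e.getKey());
            if (other != null) {
                joined.put(e.getKey(), joiner.apply(e.getValue(), other));
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> names = Map.of("c1", "Ann", "c2", "Bob");
        Map<String, String> cities = Map.of("c1", "Oslo", "c2", "Rome");
        Map<String, String> tiers = Map.of("c1", "gold");

        // Three inputs, two join steps: the output of step 1 feeds step 2.
        Map<String, String> step1 = join(names, cities, (n, c) -> n + "@" + c);
        Map<String, String> step2 = join(step1, tiers, (nc, t) -> nc + "/" + t);

        System.out.println(step1); // {c1=Ann@Oslo, c2=Bob@Rome}
        System.out.println(step2); // {c1=Ann@Oslo/gold}
    }
}
```

With N inputs you end up with N-1 join steps, and the order matters: an inner join early in the chain drops keys that later steps never get to see.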

5. Have you looked at all DSL Operations Possible?

The exhaustive list of operations in the Streams DSL (Domain-Specific Language) presents everything available to you as a Kafka developer. Learning the different operations is ideally a continuous process. If you ever had an academic introduction to MapReduce, this is probably the right time to refresh it, to become proficient in Kafka Streams

6. Avoid Other Library Constructs (e.g. Spring) at least while Learning Kafka Streaming

Kafka Streams is extremely efficient and flexible in itself and goes through constant changes and improvements. Personally, I am not a big fan of cluttering this code with Spring constructs and annotations, which also bind you to the Spring context and application lifecycle to some extent. More importantly, while some of its operations can come in handy, Spring tends to dull your own understanding of Kafka Streams and, to some extent, your ability to inspect and debug materialized views and state stores. But the choice is still yours :-)

7. Get a UI Monitoring Tool for Kafka Streams

Unless you are using Confluent as your Apache Kafka provider, which comes bundled with the Confluent Control Center UI, you need to invest a little time in setting up a UI dashboard that helps you during development by showing messages, topics, internal topics, etc. You can pick one matching your requirements here, or, if you are comfortable with command-line tools like kcat (formerly kafkacat), that would also do

8. Does the Concept of Time play a Role in Streaming?

Yes. Time windows, and the kinds of time windows Kafka Streams provides, are considered an advanced streaming topic, but think of it like this: if your Kafka input topics receive, let's say, 30,000 records in one batch, each transformed into an input message, processing them will certainly take some amount of time.

The different notions of time are categorized as: (1) Event Time, (2) Processing Time, (3) Ingestion Time, and (4) Stream Time

9. Why does Time Windowing Matter?

Now suppose an aggregation has to wait for all relevant matching records before it can complete. If, in the worst case, the 1st record to arrive has to be aggregated with the 30,000th, the aggregation window must, heuristically speaking, stay open that long for the aggregation to succeed

Kafka Streams provides four kinds of windows: Tumbling Window, Hopping Window, Sliding Window, and Session Window. More can be read here
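How a timestamp maps to tumbling and hopping windows can be sketched in plain Java. This is a model under assumptions, not the Kafka Streams API: it assumes non-negative millisecond timestamps and windows aligned to time zero, with the window start inclusive and the end exclusive. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of window assignment. NOT the Kafka Streams API;
// all names here are hypothetical. Timestamps are assumed to be >= 0.
final class WindowSketch {

    // Tumbling windows: fixed size, no overlap, so each timestamp
    // falls into exactly one window.
    static long tumblingWindowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    // Hopping windows: fixed size, advancing by a smaller step, so windows
    // overlap and one timestamp can fall into several of them.
    static List<Long> hoppingWindowStarts(long timestampMs, long sizeMs, long advanceMs) {
        List<Long> starts = new ArrayList<>();
        // Earliest aligned window start whose window still contains the timestamp.
        long start = (Math.max(0, timestampMs - sizeMs + advanceMs) / advanceMs) * advanceMs;
        while (start <= timestampMs) {
            starts.add(start);
            start += advanceMs;
        }
        return starts;
    }

    // A windowed count: number of events per tumbling window, keyed by window start.
    static Map<Long, Integer> countPerTumblingWindow(List<Long> timestampsMs, long sizeMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestampsMs) {
            counts.merge(tumblingWindowStart(ts, sizeMs), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(tumblingWindowStart(12_000, 10_000));        // 10000
        System.out.println(hoppingWindowStarts(12_000, 10_000, 5_000)); // [5000, 10000]
        System.out.println(countPerTumblingWindow(
                List.of(1_000L, 4_000L, 12_000L, 19_999L, 20_000L), 10_000));
        // {0=2, 10000=2, 20000=1}
    }
}
```

Note how the record at 19,999 ms and the record at 20,000 ms end up in different tumbling windows even though they are one millisecond apart. Choosing the window size, and how long to wait for late records, is exactly the trade-off described above.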

10. Finally, how to Test my Streaming Topology?

Kafka Streams comes with a test-utils package for testing your streaming application via TopologyTestDriver. More details can be found here

Otherwise, Kafka-Streams-Viz is an extremely useful visualizer for mapping the journey of data through the streaming operations
