I have a requirement where we need to collect N different types of events and store them for analysis. I'm having trouble coming up with a general architecture for this.
FINAL REQUIREMENTS
The end goal of the system is to store the raw events, as they appear on the Kafka topic(s), for downstream processing.
Let's say I have three different types of events coming into my system [ eventA, eventB, eventC ]. Each of these events has its own unique schema.
The end goal is to store something like this in Azure Data Lake Gen 2 so that it can be processed later:
For 2022-02-01:
/2022-02-01/eventA/<rawDataHere.parquet>
/2022-02-01/eventB/<rawDataHere.parquet>
/2022-02-01/eventC/<rawDataHere.parquet>
For 2022-02-02:
/2022-02-02/eventA/<rawDataHere.parquet>
/2022-02-02/eventB/<rawDataHere.parquet>
/2022-02-02/eventC/<rawDataHere.parquet>
For 2022-02-03:
/2022-02-03/eventA/<rawDataHere.parquet>
/2022-03-03/eventC/<rawDataHere.parquet>
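Just to make that target concrete, the kind of write I'm picturing for one event type on one day is roughly the sketch below (PySpark; the storage account, container and sample rows are placeholders, not a final design):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder rows standing in for one day's worth of eventA.
df = spark.createDataFrame([("u-123", "2022-02-01T10:00:00")], ["userId", "ts"])

base = "abfss://events@mystorageaccount.dfs.core.windows.net"  # placeholder container/account

# Write into the explicit /<date>/<eventType>/ path layout shown above.
(df.write
   .mode("append")
   .parquet(f"{base}/2022-02-01/eventA/"))
```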
DETAILED REQUIREMENTS
- At any point in time, there can be N different types of events. Think of a type of event as a specific stream of events [ e.g., userClick is one type of event and userPageOpen is a different type of event. These events are independent of each other. ]
- For each event type, the structure is guaranteed to be consistent [ e.g., for event type userClick, it'll always have the same properties with the same schema ], but the schema for different event types will always be different.
- The minimum required partitioning for each event is date.
- There's no requirement to read multiple events at any point in time. During analysis, the user will only read one type of event.
- The end goal is to store event-wise, date-wise parquet files [ we're also considering Delta Lake by Databricks ] for downstream analysis, which is done in an ad-hoc fashion [ a sketch of such a read is right after this list ].
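For context, the downstream ad-hoc read I have in mind with this layout would be something like this (account, container and column name are placeholders):

```python
# Ad-hoc analysis sketch: read exactly one event type for one day, nothing else.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
base = "abfss://events@mystorageaccount.dfs.core.windows.net"  # placeholder container/account

clicks = spark.read.parquet(f"{base}/2022-02-01/eventA/")
clicks.groupBy("userId").count().show()  # "userId" is a placeholder column
```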
WHAT WE'RE PLANNING TO DO
- KAFKA
- Given the high ingest rate [ 25000 events / second ], it's important to stream the data through Kafka.
- The question here would be: should we use a separate topic for each event type, or stream them all into a single topic with an envelope structure like {"eventType": "userClick", "eventData": {}}? [ A sketch of a producer for this envelope is just below. ]
- Both of these approaches seem problematic on the consumer side [ Explained in the next section ]
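To make the single-topic option concrete, a producer for that envelope could look roughly like this (the broker address, topic name and payloads are made up; this is only a sketch of the message shape):

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "broker:9092"})  # placeholder broker

def publish(event_type, event_data):
    # Every event type shares the same envelope; the per-type payload stays opaque JSON.
    envelope = {"eventType": event_type, "eventData": event_data}
    producer.produce("events", value=json.dumps(envelope))  # "events" is the single topic

publish("userClick", {"userId": "u-123", "target": "buyButton"})
publish("userPageOpen", {"userId": "u-123", "page": "/home"})
producer.flush()
```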
- CONSUMER [ SPARK ]
- Again, Spark seems a decent choice here, since we'd have a high rate of ingestion which will only go up as time goes on.
- If we have a single topic, we should be able to group by eventType, but this seems to corrupt the schema inferred by Spark. Even if we can store the schema for each event in a metastore, I'm not sure how we can group by event, turn each event into a separate dataframe and pass in the schema. Even once this is done, there'll have to be N writeStreams [ one df.writeStream per event ], which doesn't seem like a wise thing to do with streaming. [ A sketch of what I've been trying is at the end of this section. ]
- If we have multiple topics, we'll need to have N different jobs [ one for each eventType ], where we'll manually need to add a new job with a schedule every time an event type is added. This also seems expensive, as we'll need N different clusters for processing.
- In this approach, we'll need to make sure we maximise file size [ 1 GB minimum ] while minimising the number of files, as storing an increasing number of small files negatively impacts query performance when processing later.
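For reference, this is roughly what I've been trying for the single-topic case (PySpark; the schemas, broker address and paths are placeholders, and the per-event schemas would really come from a metastore / schema registry):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_date
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("event-archiver").getOrCreate()

# Hypothetical per-event schemas; in practice these would be loaded from a metastore.
schemas = {
    "userClick": StructType([
        StructField("userId", StringType()),
        StructField("target", StringType()),
        StructField("ts", TimestampType()),
    ]),
    "userPageOpen": StructType([
        StructField("userId", StringType()),
        StructField("page", StringType()),
        StructField("ts", TimestampType()),
    ]),
}

# Envelope shared by every message on the single topic.
envelope = StructType([
    StructField("eventType", StringType()),
    StructField("eventData", StringType()),  # payload kept as a JSON string until the type is known
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
       .option("subscribe", "events")                     # single envelope topic
       .load())

parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), envelope).alias("e"))
          .select("e.*"))

# One filtered branch + one writeStream per event type -> N queries, but a single job/cluster.
# Note: partitionBy writes Hive-style "date=2022-02-01" directories rather than bare dates.
base = "abfss://events@mystorageaccount.dfs.core.windows.net"  # placeholder
for event_type, schema in schemas.items():
    (parsed.filter(col("eventType") == event_type)
     .select(from_json(col("eventData"), schema).alias("d"))
     .select("d.*")
     .withColumn("date", to_date(col("ts")))
     .writeStream
     .format("parquet")
     .option("path", f"{base}/{event_type}/")
     .option("checkpointLocation", f"{base}/_checkpoints/{event_type}/")
     .partitionBy("date")
     .trigger(processingTime="15 minutes")
     .start())

spark.streams.awaitAnyTermination()
```

The part that worries me is the loop at the bottom: it still spins up N writeStream queries, even if they share one cluster.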
- CONSUMER [ NON-SPARK ]
- Given the issues in Spark, what we could potentially do is write a raw consumer in Java and keep collecting data in 15-minute intervals. Once this is done, we can flush the data out to a raw file [ after grouping by eventType ] and write it to Azure Data Lake. We could then have a separate job running every day which reads all the data from the previous day, compacts it and writes it into parquet files. [ A rough sketch of this consumer is at the end of this section. ]
- This approach seems doable, but I'd like to have a simple architecture if possible. Every extra step in processing comes with additional overhead and complexity, and I'd like to figure out the simplest way to achieve my target.
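For completeness, the raw consumer I'm picturing would be roughly the following (sketched in Python rather than Java for brevity; broker, topic and base path are placeholders, and error handling is omitted):

```python
# Buffer events for 15 minutes, then flush one raw file per event type under /<date>/<eventType>/.
import json
import os
import time
from collections import defaultdict

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker:9092",   # placeholder
    "group.id": "raw-event-archiver",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,          # commit only after a successful flush
})
consumer.subscribe(["events"])

FLUSH_INTERVAL = 15 * 60                  # seconds
BASE = "/mnt/datalake"                    # e.g. ADLS Gen2 mounted, or swapped for the abfs SDK
buffers = defaultdict(list)
last_flush = time.time()

def flush(buffers):
    # Wall-clock date for simplicity; event-time bucketing would read a timestamp field instead.
    day = time.strftime("%Y-%m-%d")
    for event_type, rows in buffers.items():
        if not rows:
            continue
        out_dir = os.path.join(BASE, day, event_type)
        os.makedirs(out_dir, exist_ok=True)
        with open(os.path.join(out_dir, f"{int(time.time())}.jsonl"), "w") as f:
            f.writelines(json.dumps(row) + "\n" for row in rows)

while True:
    msg = consumer.poll(1.0)
    if msg is not None and msg.error() is None:
        env = json.loads(msg.value())
        buffers[env["eventType"]].append(env["eventData"])
    if time.time() - last_flush >= FLUSH_INTERVAL:
        flush(buffers)
        consumer.commit(asynchronous=False)  # offsets advance only after data is written out
        buffers.clear()
        last_flush = time.time()
```

The daily compaction job would then read the previous day's raw files per event type and rewrite them as large parquet files.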
I am very new to the world of data, so please forgive any silly mistakes. Any help would be appreciated.
Thanks in advance :D