Let us say we have:

- a web app with a Postgres DB that produces data over time,
- another DB optimized for analytics that we would like to populate over time.
My goal is to build and monitor an ETL pipeline that will transform the data and write it to the analytics DB (preferably using a streaming approach).
My idea was to:

- have the web app produce a JSON representation of the relevant data and write it to a message broker of some kind (say RabbitMQ), as in the sketch after this list;
- use Airflow to structure and monitor the ETL process.
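To make that first step concrete, here is roughly how I picture the publishing side. This is a minimal sketch assuming the web app is in Python and uses the pika client; the queue name and the payload shape are placeholders.

```python
import json

import pika


def publish_event(event: dict) -> None:
    """Serialize a domain event as JSON and push it onto a RabbitMQ queue."""
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="analytics_events", durable=True)
    channel.basic_publish(
        exchange="",
        routing_key="analytics_events",
        body=json.dumps(event),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )
    connection.close()


# Example: emitted by the web app right after it commits a row to Postgres.
publish_event({"table": "orders", "id": 42, "action": "insert"})
```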
The part I am confused about is:

1. How do I have Airflow subscribe to the message broker so that a given pipeline is triggered every time a message is received? (A sketch of the kind of consumer I have in mind follows this list.)
2. (If the above is possible) Does this approach make sense? In particular, does it make sense to use Airflow in a stream-like way rather than in a scheduled, batch way?
3. (If 1 is not possible) I have a few follow-up questions below.
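To make question 1 concrete, here is the kind of glue I imagine: a small standalone consumer that triggers one DAG run per message through Airflow's stable REST API. This is a minimal sketch assuming Airflow 2.x with the REST API enabled and basic auth; the DAG id, queue name, URL, and credentials are placeholders.

```python
import json

import pika
import requests

AIRFLOW_API = "http://localhost:8080/api/v1"
DAG_ID = "analytics_etl"  # placeholder DAG id


def trigger_dag_run(message_body: bytes) -> None:
    """Create one Airflow DAG run per broker message, passing the payload as conf."""
    payload = json.loads(message_body)
    response = requests.post(
        f"{AIRFLOW_API}/dags/{DAG_ID}/dagRuns",
        json={"conf": payload},
        auth=("airflow", "airflow"),  # placeholder credentials
    )
    response.raise_for_status()


def on_message(channel, method, properties, body):
    trigger_dag_run(body)
    channel.basic_ack(delivery_tag=method.delivery_tag)


connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="analytics_events", durable=True)
channel.basic_consume(queue="analytics_events", on_message_callback=on_message)
channel.start_consuming()
```

Inside the DAG, the message payload would then be available to tasks via the run's `conf`.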
Follow-up questions if the architecture above is not feasible:

- Why is it not possible?
- What would be an alternative way to construct a streaming approach that allows some form of monitoring/management of tasks out of the box?
- Would it be more standard to let Airflow schedule retrieval of the data in small batches, and what are the trade-offs between that approach and a streaming approach? (A sketch of what I understand the batch variant to be is below.)
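For reference, this is what I understand the scheduled, small-batch alternative would look like. A minimal sketch, assuming Airflow 2.x; the DAG id, the schedule, and the ETL logic itself are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_transform_load(**context):
    """Pull rows added to Postgres since the last run, transform them,
    and write them to the analytics DB. Left as a stub here."""
    pass


with DAG(
    dag_id="analytics_etl_batch",  # placeholder DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="*/5 * * * *",  # small batches every 5 minutes
    catchup=False,
) as dag:
    PythonOperator(
        task_id="extract_transform_load",
        python_callable=extract_transform_load,
    )
```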
Any feedback on these questions is very welcome.