Transform and load pipeline for a log system: Apache Airflow or Kafka Connect?


Introduction:

Currently, I have built and manage an industrial network inventory solution:

Data sources => ETL jobs (Python) orchestrated by Apache Airflow => OpenSearch => Dashboard


My second task is to implement a log centralization system:

Hosts (log agents) => Ingest pipeline => Kafka => Transform pipeline => OpenSearch

Logs: system, app & network.

Expected target: 15,000 hosts.
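To make the target load concrete, here is a rough sizing sketch. Only the host count comes from the requirement above; the per-host event rate and the average event size are placeholder assumptions that would need to be measured on real hosts.

```python
# Rough sizing sketch for the log pipeline.
HOSTS = 15_000                 # from the expected target above
EVENTS_PER_HOST_PER_SEC = 5    # ASSUMPTION: placeholder, not a measured value
AVG_EVENT_BYTES = 300          # ASSUMPTION: placeholder, not a measured value


def estimate_load(hosts: int, events_per_host_per_sec: float, avg_event_bytes: int) -> dict:
    """Return the aggregate event rate and raw data volume for the fleet."""
    events_per_sec = hosts * events_per_host_per_sec
    bytes_per_sec = events_per_sec * avg_event_bytes
    return {
        "events_per_sec": events_per_sec,
        "mb_per_sec": bytes_per_sec / 1e6,
        "gb_per_day": bytes_per_sec * 86_400 / 1e9,
    }


if __name__ == "__main__":
    print(estimate_load(HOSTS, EVENTS_PER_HOST_PER_SEC, AVG_EVENT_BYTES))
```

With these assumed inputs the stream is continuous rather than batch-shaped, which is the reason for asking whether an orchestrator like Airflow fits.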

The ingest & transform pipelines could be implemented with Logstash or other parsers.
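For illustration, a minimal sketch of what one transform step might look like in Python, whichever tool ends up running it. It assumes BSD-style (RFC 3164) syslog lines; the regular expression, the `transform` function and the output field names are hypothetical, not an existing schema.

```python
import re
from typing import Optional

# ASSUMPTION: input looks like a BSD-style syslog line, for example
# "<34>Oct 11 22:14:15 host01 sshd[4721]: Failed password for root"
SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>"
    r"(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<app>[^\s:\[]+)(?:\[(?P<pid>\d+)\])?: "
    r"(?P<message>.*)$"
)


def transform(raw: str) -> Optional[dict]:
    """Parse one raw log line into a document ready for indexing.

    Returns None when the line does not match, so the caller can route it
    to a dead-letter topic instead of dropping it silently.
    """
    match = SYSLOG_RE.match(raw)
    if match is None:
        return None
    pri = int(match.group("pri"))
    pid = match.group("pid")
    return {
        "timestamp": match.group("timestamp"),
        "host": match.group("host"),
        "app": match.group("app"),
        "pid": int(pid) if pid else None,
        # The syslog priority value encodes facility * 8 + severity.
        "facility": pri // 8,
        "severity": pri % 8,
        "message": match.group("message"),
    }
```

The function is stateless and works on a single record, so it could be scaled horizontally by running more consumers on the same Kafka topic.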


Requirements for the ingest & transform pipelines:

Must:

Manageable and monitorable

Scalable

Resilient

Secure by default

Should:

Reuse the same technology building blocks as much as possible to decrease maintenance cost.


First solution: Using Apache Airflow to orchestrate and manage the Ingest and Transform pipelines.

Questions:

Can Airflow handle that much log I/O activity?

Can Airflow scale up to the final load (15,000 hosts)?

Second solution:

If Airflow is not made for such an I/O-heavy pipeline (log system), I plan to use Kafka Connect instead (a new system component, hence more maintenance cost).

Note: I would like to avoid a discussion about maintenance cost, because it depends on the sector, the teams, and so on.
