
Deploy an ETLT Infrastructure with Airbyte, MinIO, ClickHouse, dbt, and Airflow

Tech stack: Airbyte · MinIO · ClickHouse · dbt · FastAPI · SQLAlchemy · Airflow · Python · PostgreSQL · Docker

This project sets up an analytics architecture using Docker containers for data extraction, storage, orchestration, and transformation.

  • Airbyte (Extract) retrieves data from sources and deposits it into MinIO (Data Lake).
  • In MinIO, an initial transformation (Python) is performed to clean or enrich the files (a sketch of this step is shown after the flowchart).
  • The data is then loaded (Load) from MinIO into ClickHouse, which is why "Load" appears under ClickHouse.
  • Finally, dbt (Transform) runs SQL models that transform the data directly in ClickHouse to make it ready for analysis.

Excalidraw flowchart of the architecture (Airbyte → MinIO → Python processing → ClickHouse → dbt)
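
The Python transformation step between MinIO and ClickHouse is project-specific. As an illustration only, the sketch below (bucket names, object paths, and field names are hypothetical) reads a raw JSONL file written by Airbyte, keeps a few fields of interest, and writes a cleaned copy back to MinIO using the minio Python client and the MinIO credentials from .env:

# Minimal sketch of the cleaning/enrichment step (bucket/object names and fields are hypothetical).
import io
import json
import os

from minio import Minio  # pip install minio

client = Minio(
    f"{os.getenv('MINIO_HOST', 'localhost')}:{os.getenv('MINIO_API_PORT', '9000')}",
    access_key=os.environ["AWS_ACCESS_KEY_ID"],
    secret_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    secure=False,  # local MinIO without TLS
)

# Read one raw JSONL object produced by an Airbyte sync.
raw = client.get_object("raw-weather", "airbyte/timelinemulti/2024_01_01.jsonl")
records = [json.loads(line) for line in raw.read().decode("utf-8").splitlines() if line]
raw.close()
raw.release_conn()

# Keep only the fields of interest (example cleaning).
cleaned = [
    {"address": r.get("address"), "datetime": r.get("datetime"), "temp": r.get("temp")}
    for r in records
]

payload = "\n".join(json.dumps(r) for r in cleaned).encode("utf-8")
client.put_object(
    "clean-weather",
    "weather/2024_01_01.jsonl",
    io.BytesIO(payload),
    length=len(payload),
    content_type="application/json",
)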

Prerequisites

  • Docker, Docker Compose, and DBeaver installed on your machine.
  • Internet access to download required images.

Included Services

  • Airbyte: Data integration tool, installed locally via Docker.
  • MinIO: Data lake for storing data extracted by Airbyte.
  • Airflow: ETLT job orchestrator.
  • ClickHouse: Data warehouse for analytics.
  • dbt: Modular data transformation in the warehouse.

Local Installation

1. Airbyte Installation

Airbyte is installed locally via Docker in this setup. For production or Kubernetes environments, refer to the Airbyte OSS installation guide.

It is also possible to install Airbyte on a Kubernetes cluster using Helm. This option is handy when you want to install a specific version of Airbyte.

Example Helm installation:

helm install \
  airbyte \
  airbyte/airbyte \
  --namespace airbyte \
  --values ./values.yaml

You can also take the quickest path, the Airbyte quickstart, by following the steps explained on the Airbyte website. Before that, make sure you have already installed Docker Desktop as explained on the Airbyte site.

Once Airbyte is installed, the first step is to connect it to your weather API (e.g., VisualCrossing). You can use Airbyte's API connector builder for this purpose. Review the custom_api_builder.yaml file for connection details: it outlines the configuration needed to connect to the VisualCrossing API, including authentication and endpoint setup. You can recreate the connector using Airbyte's no-code interface or import this YAML file. Here is a breakdown of its components; adjust the file to your needs.

To configure the connector, you will need:

  • API key from your VisualCrossing account (pay attention to how authentication is handled).
  • Base URL: https://weather.visualcrossing.com/VisualCrossingWebServices/rest/services/timelinemulti (free accounts support up to five locations per request, especially with the timelinemulti endpoint). See the VisualCrossing documentation for details.
  • Locations: specify locations in the format London,UK|Paris,France|Tokyo,Japan|Cape Town,South Africa.
  • unitGroup: metric, us, etc. The metric option returns the units used in France, such as °C and m. For guidance on building custom API connectors, refer to the Airbyte Connector Builder tutorial. A sketch of the underlying request is shown below.
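
To make these settings concrete, here is a hedged sketch of the request the connector ends up issuing; the query parameter names follow the settings above but should be verified against the VisualCrossing documentation:

# Hedged sketch of the underlying VisualCrossing timelinemulti request
# (verify parameter names against the VisualCrossing docs; locations are examples).
import os

import requests

BASE_URL = "https://weather.visualcrossing.com/VisualCrossingWebServices/rest/services/timelinemulti"

params = {
    "key": os.environ["API_KEY"],  # VisualCrossing API key (authentication)
    "locations": "London,UK|Paris,France|Tokyo,Japan|Cape Town,South Africa",
    "unitGroup": "metric",  # °C, m, etc.
    "contentType": "json",
}

resp = requests.get(BASE_URL, params=params, timeout=30)
resp.raise_for_status()
print(resp.json())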

To ensure compatibility between the parts of this project, it is important to pin their relevant versions.

Here is the breakdown of each piece:

Part Version
abctl v0.30.1
airbyte/connector-builder-server 1.8.0
airbyte/cron 1.8.0
airbyte/workload-api-server 1.8.0
airbyte/worker 1.8.0
airbyte/workload-launcher 1.8.0
airbyte/server 1.8.0
altinity/clickhouse-server 24.8.14.10501.altinitystable-alpine
airbyte/db 1.7.0-17
quay.io/minio/minio latest (to freeze)
temporalio/auto-setup 1.27.2
clickhouse/clickhouse-server 23.7.4.5-alpine
altinity/clickhouse-server 21.8.10.1.altinitystable

In this project, multiple source connections are created—each handling a chunk of locations (to cover all 99 departments of metropolitan France)—with MinIO as the destination.

Airbyte supports S3-compatible storage, and MinIO works seamlessly with Amazon S3 APIs. To set up the destination:

  • Obtain your MinIO endpoint URL, access key, and secret key.
  • In Airbyte, select the S3-compatible destination connector and enter your MinIO details.
  • Ensure your MinIO bucket has read/write permissions and encryption in transit is enabled.
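
Before entering these details in the Airbyte UI, you can sanity-check the endpoint, credentials, and bucket permissions from Python. A minimal sketch with boto3, assuming the MinIO variables from .env and the LAKE_BUCKET bucket:

# Quick check that the S3-compatible endpoint, credentials and bucket permissions work
# (endpoint and bucket come from .env; the test key name is arbitrary).
import os

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=f"http://{os.getenv('MINIO_HOST', 'localhost')}:{os.getenv('MINIO_API_PORT', '9000')}",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],         # set from the MinIO username
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],  # set from the MinIO password
)

print([b["Name"] for b in s3.list_buckets()["Buckets"]])

# Write and read back a tiny object to confirm read/write permissions.
s3.put_object(Bucket=os.environ["LAKE_BUCKET"], Key="airbyte_write_test.txt", Body=b"ok")
print(s3.get_object(Bucket=os.environ["LAKE_BUCKET"], Key="airbyte_write_test.txt")["Body"].read())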

The Airbyte UI will show something like the screens below:

1. Define the sources you need using the API connector already created.

Source

2. Define a MinIO destination (note that you have to add Airbyte to the MinIO network, as described in the connect_airbyte_to_containers_network script).

S3 destination config

3. Define how data is saved to the destination (the Manual schedule type allows Airflow to trigger a sync for each connection).

Create connection

If you encounter issues, consult the Airbyte S3 destination documentation or reach out to the Airbyte community.

2. Clone the repository:

Be aware that you need to populate the environment variables defined in .env.example with your own secrets.

git clone https://github.com/donat-konan33/etl-airbyte-minio-clickhouse-dbt-airflow.git
cd etl-airbyte-minio-clickhouse-dbt-airflow

3. Start the infrastructure:

Before starting the infrastructure, create the external volumes and networks used for data persistence by each service that needs them.

./scripts/docker_external_volumes.sh && ./scripts/docker_external_networks.sh
docker compose up -d
./scripts/airbyte_connect_to_containers_network.sh

Then go back to step 1 in the Airbyte UI and create the MinIO destination.

As a rule, Airbyte gets an IP like 172.x.x.n, and adding only MinIO to its network gives MinIO the next IP, 172.x.x.(n+1). Use that address as the MinIO endpoint when creating the destination.
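
Rather than guessing the address, you can list the container IPs on the shared network. The sketch below uses the Docker SDK for Python; the network name is an assumption, so replace it with the one created by the scripts:

# List container names and IPs on a Docker network (the network name is an assumption).
import docker  # pip install docker

client = docker.from_env()
network = client.networks.get("minio_network")  # replace with the network created by the scripts
network.reload()  # refresh attrs so connected containers are listed

for container in network.attrs["Containers"].values():
    print(container["Name"], container["IPv4Address"])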

4. Access web interfaces:

5. Connect to ClickHouse using DBeaver to inspect the dbt transformations.

Before connecting to ClickHouse with DBeaver, the database you want to reach must already have been created; otherwise you will connect to the default database. You may also need to set a user and password. The general settings are: Host: localhost, Port: 8123.
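
If the target database does not exist yet, you can create it before opening DBeaver. A minimal sketch with the clickhouse-connect Python client, assuming the host and port above and the credentials from .env:

# Create the target database if it does not exist yet (minimal sketch,
# assuming ClickHouse is reachable on localhost:8123 with the .env credentials).
import os

import clickhouse_connect  # pip install clickhouse-connect

client = clickhouse_connect.get_client(
    host="localhost",
    port=8123,
    username=os.environ["CLICKHOUSE_USER"],
    password=os.environ["CLICKHOUSE_PASSWORD"],
)

client.command(f"CREATE DATABASE IF NOT EXISTS {os.environ['CLICKHOUSE_DB']}")
print(client.query("SHOW DATABASES").result_rows)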


6. Airflow Connection and parameters

To connect Airflow to Airbyte, it is necessary to note the following information, as explained in the Airflow reference:

  1. With the Airbyte connection type:

There is documentation showing how to GET, POST, or otherwise interact with the Airbyte server API.

To customize an Airbyte deployment, see this link.

  2. With the HTTP connection type: the DAG extract_airbyte_sync_to_minio_dag.py, found at dags/extract_airbyte_sync_to_minio_dag.py, is typically used when working with an Airbyte Cloud environment.

We struggled to create both the user email and secret with auth enabled, and we could not launch Airbyte jobs from Airflow as the orchestrator, mostly when using the Airbyte connection type.

So we used the HTTP connection type:

airflow connection to airbyte

Below is an example graph of the interactions between Airbyte and Airflow when a sync job is triggered:

Extraction graph

Because we added Airbyte to the Airflow network, we were able to use the name airbyte-abctl-control-plane as the Airbyte hostname.
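
As an illustration of the HTTP approach, here is a hedged sketch of a task that triggers a sync through the Airbyte API. The endpoint path, the connection id, and the airbyte_http Airflow connection are assumptions; the real DAG lives in dags/extract_airbyte_sync_to_minio_dag.py:

# Hedged sketch of triggering an Airbyte sync over HTTP from Airflow
# (endpoint path, connection id and the "airbyte_http" connection are assumptions).
import json

import pendulum
from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator

with DAG(
    dag_id="trigger_airbyte_sync_example",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule="@daily",
    catchup=False,
) as dag:
    trigger_sync = SimpleHttpOperator(
        task_id="trigger_airbyte_sync",
        http_conn_id="airbyte_http",  # host: airbyte-abctl-control-plane, as noted above
        endpoint="api/v1/connections/sync",  # Airbyte OSS config API; verify for your version
        method="POST",
        headers={"Content-Type": "application/json"},
        data=json.dumps({"connectionId": "<your-connection-id>"}),
    )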

  3. Some relevant alternatives

Note that the creation of custom API connectors, sources, and destinations can be automated. See here for more.


Useful to know

Understanding ClickHouse Engines and Their Semantics

When working with ClickHouse, it is essential to understand the different table engines and their behavior, as well as how sorting keys affect data storage.

Common Engines:

MergeTree → does not guarantee uniqueness of rows.

ReplacingMergeTree → keeps only the latest version of a record based on a defined version column.

CollapsingMergeTree → allows insertion of rows with a sign column (+1 / -1) to represent row states and collapse them accordingly.

ReplicatedCollapsingMergeTree → same as CollapsingMergeTree, but supports replication across multiple nodes.

VersionedCollapsingMergeTree → similar to CollapsingMergeTree, but with a version column to resolve conflicts.

SummingMergeTree → automatically aggregates numeric columns during merges based on the sorting key.

Sorting keys in ClickHouse

ORDER BY tuple() → no indexing on insertion, faster inserts.

ORDER BY (col1, col2, ..., colN) → creates an index on the specified columns, improving query performance but making inserts slightly slower.
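
To make the engine and sorting-key semantics concrete, here is a hedged sketch (database, table, and column names are hypothetical) that creates a ReplacingMergeTree table ordered by a key, so later merges keep only the highest version per key:

# Hypothetical example: a ReplacingMergeTree table whose sorting key is (address, datetime);
# during background merges, rows sharing the same key are deduplicated, keeping the highest version.
import clickhouse_connect  # pip install clickhouse-connect

client = clickhouse_connect.get_client(host="localhost", port=8123)

client.command("CREATE DATABASE IF NOT EXISTS weather")
client.command("""
CREATE TABLE IF NOT EXISTS weather.daily_obs
(
    address  String,
    datetime Date,
    temp     Float64,
    version  UInt64
)
ENGINE = ReplacingMergeTree(version)
ORDER BY (address, datetime)
""")

# ORDER BY tuple() would skip the sparse index (faster inserts, slower reads);
# listing columns, as above, builds the index on them instead.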

Important Notes

The sorting key in ClickHouse behaves somewhat like a primary key in transactional databases, but it does not enforce uniqueness.

In many cases (e.g., appending tweets to a table), data is simply inserted as new rows instead of being updated in place.

But that is not the case here: for the moment we only query the table, since our API does not allow it.

So what we eventually use looks like db.query(models.Tweet).offset(skip).limit(limit).all(), where db is the database connection Session.
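
For context, here is a minimal sketch of where such a query typically lives, assuming a standard FastAPI + SQLAlchemy setup; the models.Tweet model and the database URL are placeholders:

# Minimal sketch of a read-only endpoint around that query
# (models.Tweet and the database URL are placeholders).
from fastapi import Depends, FastAPI
from sqlalchemy import create_engine
from sqlalchemy.orm import Session, sessionmaker

import models  # hypothetical module defining the Tweet model

engine = create_engine("postgresql://user:password@localhost:5432/app")  # placeholder URL
SessionLocal = sessionmaker(bind=engine, autocommit=False, autoflush=False)

app = FastAPI()

def get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()

@app.get("/tweets")
def read_tweets(skip: int = 0, limit: int = 100, db: Session = Depends(get_db)):
    # Read-only: rows are appended elsewhere and only queried here.
    return db.query(models.Tweet).offset(skip).limit(limit).all()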

Environment Variables

It is important to create a local .env file with the same contents as .env.example. If you have questions or need clarification about any part of this setup, feel free to reach out for assistance. Here are the variables you need for this project, referring to .env.example (a minimal loading sketch appears after the table):

Variable Description
API_KEY VisualCrossing API KEY needed to fetch weather data
MINIO_USER_NAME Username for MinIO user
MINIO_USER_PASSWORD Password for MinIO user
AWS_ACCESS_KEY_ID AWS access key, set from MinIO username
AWS_SECRET_ACCESS_KEY AWS secret key, set from MinIO password
MINIO_HOST Host address for MinIO server
MINIO_API_PORT Internal port for MinIO API
CLICKHOUSE_DB Name of ClickHouse database
CLICKHOUSE_USER Username for ClickHouse
CLICKHOUSE_PASSWORD Password for ClickHouse user
CLICKHOUSE_HOST Host address for ClickHouse server
CLICKHOUSE_HOSTNAME Hostname for ClickHouse server
POSTGRES_PASSWORD Password for PostgreSQL user
POSTGRES_USER Username for PostgreSQL
POSTGRES_DB Name of PostgreSQL database
POSTGRES_HOST Host address for PostgreSQL server
HOST Host address for Airbyte server
AIRBYTE_URL URL for Airbyte instance
AIRBYTE_USER Username for Airbyte
AIRBYTE_PASSWORD Password for Airbyte user
LAKE_BUCKET Name of the first data lake bucket, exposed to Airflow as an Airflow Variable
AIRFLOW_ADMIN_EMAIL Email address for Airflow admin user
AIRFLOW_ADMIN_USERNAME Username for Airflow admin user
AIRFLOW_ADMIN_PASSWORD Password for Airflow admin user
AIRFLOW_ADMIN_FIRST_NAME First name of Airflow admin user
AIRFLOW_ADMIN_LAST_NAME Last name of Airflow admin user
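
A minimal sketch of loading these variables in the Python steps, assuming python-dotenv is installed:

# Load the local .env file and read a few of the variables above
# (minimal sketch, assuming python-dotenv is installed).
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory

clickhouse_db = os.environ["CLICKHOUSE_DB"]
minio_endpoint = f"{os.getenv('MINIO_HOST', 'localhost')}:{os.getenv('MINIO_API_PORT', '9000')}"
print(clickhouse_db, minio_endpoint)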


Notes

This project is under development. Refer to each service's documentation for more details.

It will leverage Airflow 3.x.x, whose setup differs from the current one (2.10.3).

About

Deployment of a containerized ELT/ETLT architecture integrating Airbyte for data extraction, Python for data processing, and ClickHouse-dbt for analytical modeling and transformation. This project enables experimentation with new data extraction and processing use cases using Airbyte and Python, while ensuring portability through Docker.
