Managing and processing data are crucial tasks for businesses of all sizes. Extract, Transform, and Load (ETL) pipelines enable organizations to move data between systems and transform it into useful information. In this post, we’ll explore how to configure the connection between PostgreSQL and S3.
If you need more background on how to build a data pipeline, you can read my previous post.
Now, let’s build our ETL step by step.
Install and set up Airflow: First, install Apache Airflow on your system by following the official installation guide.
Initialize the Airflow database: After installation, initialize the Airflow database by running the following command:
airflow db init
Start the Airflow web server and scheduler: Run the following commands in separate terminals to start the Airflow web server and scheduler:
airflow webserver --port 8080
airflow scheduler
To use S3, you need to install the Amazon provider package with the following command; for more information, you can review the documentation:
pip install 'apache-airflow[amazon]'
Create a new DAG (Directed Acyclic Graph): In Airflow, workflows are represented as DAGs. Create a new Python file (e.g., my_etl_dag.py) in the dags folder within your Airflow home directory and define your DAG there.
Connect to a PostgreSQL database: To connect to a PostgreSQL database, set up a connection in the Airflow UI with the necessary credentials.
To do that, go to your Airflow UI, click on Admin > Connections, and then click on the blue icon as shown in the image.
Now you will see a screen where you must fill in your connection details; you can change the connection ID, but the one shown is the one I regularly use.
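If you prefer the command line to the UI, the same connection can be created with Airflow's connections add command. Every value below (host, login, password, schema) is a placeholder for illustration; substitute your own database credentials.

```shell
# Create a PostgreSQL connection named "postgres_default".
# All credential values here are placeholders; replace them with your own.
airflow connections add 'postgres_default' \
    --conn-type 'postgres' \
    --conn-host 'localhost' \
    --conn-login 'airflow' \
    --conn-password 'airflow' \
    --conn-schema 'mydb' \
    --conn-port 5432
```

Defining the connection this way is handy for scripted or containerized setups, where clicking through the UI on every deployment is impractical.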