In data engineering, idempotency is the idea that an ETL job or process produces the same end result no matter how many times you re-run it. If you have a DAG run for 6/15/2020 and you clear and re-run it 1,000 times, your data warehouse will still hold exactly the same data, with no duplicates. This concept is extremely important because it makes re-runs safe, and it will save you time in the long run.
Combining this idea with Airflow can be fairly easy if you know your dataset. Let’s take a look:
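Here is a rough sketch of the load step such a DAG might run, written so it works without an Airflow install. It uses Python's built-in `sqlite3` as a stand-in for the warehouse. The `orders` table, the `SOURCE_ROWS` dictionary (standing in for the day's S3 file), and the `load_partition` function are all hypothetical names chosen for illustration. In a real Airflow task, `ds` would be the execution date that Airflow passes in. The key move is to delete whatever a previous run wrote for that date and insert the fresh rows in the same transaction.

```python
import sqlite3

# Hypothetical rows standing in for the contents of one day's S3 file.
SOURCE_ROWS = {
    "2020-06-15": [("order-1", 20.0), ("order-2", 35.5)],
    "2020-06-16": [("order-3", 12.25)],
}


def load_partition(conn: sqlite3.Connection, ds: str) -> None:
    """Load the rows for execution date `ds`, replacing any earlier load."""
    rows = SOURCE_ROWS.get(ds, [])
    with conn:  # one transaction: the delete and the insert commit together
        # Remove whatever a previous run wrote for this date.
        conn.execute("DELETE FROM orders WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO orders (ds, order_id, amount) VALUES (?, ?, ?)",
            [(ds, order_id, amount) for order_id, amount in rows],
        )


# In-memory database used as a stand-in for the data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (ds TEXT, order_id TEXT, amount REAL)")

# Re-running the same date many times leaves exactly one copy of its rows.
for _ in range(1000):
    load_partition(conn, "2020-06-15")

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(row_count)
```

Because every run only touches rows for its own date, loading a different date (say `load_partition(conn, "2020-06-16")`) adds that day's rows without disturbing the others. A plain `INSERT` with no preceding `DELETE` would instead append a duplicate copy on every re-run.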
I hope this short sample DAG helps to outline an idempotent ETL process using Airflow. Since this DAG depends on the execution_date, we can run it a few hundred times and produce the same results (as long as the S3 file remains the same). Incorporating idempotency into your ETL processes will let you quickly rerun failed jobs or backfill.