Airflow Data Pipeline: Migrating and Cleansing Bigtable Data to Snowflake
This document describes a recent implementation of the scheduled extraction, validation, transformation, and loading of time-series data from Google Bigtable into Snowflake, using Apache Airflow 2.9 deployed on Kubernetes with Horizontal Pod Autoscaling (HPA). The goal was to incrementally process large daily batches of records with data cleansing and minimal operational overhead.
Context
The source data consisted of approximately 1.2 billion records stored in Google Bigtable, with daily deltas ranging from 3M to 10M rows. The destination was Snowflake, where downstream analytics and dashboards relied on data consistency and latency guarantees. Constraints included:
- Schema evolution across time
- Nested timestamp alignment for time-series
- Multi-region workload with throughput variability
- Need for cloud portability and low operational complexity
Architecture Overview
The pipeline was built using:
- Apache Airflow 2.9 with KubernetesExecutor
- EKS (Elastic Kubernetes Service) with HPA configured per pod template
- Cloud Functions and Pub/Sub for triggering
- Parquet and PyArrow for I/O
- Snowflake as the target, loaded via COPY INTO from an external stage (S3)
A shared logging stack (FluentBit → OpenSearch) provided audit logging and metrics for observability.
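For orientation, here is a minimal TaskFlow sketch of how such a DAG could be wired together. The task bodies, schedule, and names are placeholders rather than the production implementation.

```python
from datetime import datetime
from typing import List

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def bigtable_to_snowflake():
    @task
    def extract_batch() -> List[str]:
        # Scan Bigtable for the daily delta and write Parquet chunks to GCS;
        # returns the GCS object paths produced for this run.
        return []

    @task
    def cleanse(gcs_paths: List[str]) -> List[str]:
        # Validate, normalize, and deduplicate each chunk, then sync it to the
        # S3 staging area backing the Snowflake external stage.
        return gcs_paths

    @task
    def load_to_snowflake(staged_paths: List[str]) -> None:
        # Issue COPY INTO against the external stage via a Snowflake hook.
        pass

    load_to_snowflake(cleanse(extract_batch()))


bigtable_to_snowflake()
```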
Pipeline Steps
1. Extraction from Bigtable
Each run began with a Cloud Function triggered via Pub/Sub, notifying Airflow of a new batch. A Python-based Bigtable client then performed bounded scans and emitted compressed Parquet files into GCS using PyArrow.
- Extraction was parallelized by row key ranges.
- Retry and backoff logic were handled in custom Airflow Operators.
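A hedged sketch of one bounded scan is shown below; the project, instance, and table IDs, the column family and qualifier, and the output path are illustrative, not the actual source schema.

```python
import pyarrow as pa
import pyarrow.parquet as pq
from google.cloud import bigtable
from google.cloud.bigtable.row_set import RowSet


def extract_key_range(start_key: bytes, end_key: bytes, out_path: str) -> str:
    client = bigtable.Client(project="my-project", admin=False)
    table = client.instance("my-instance").table("events")

    # Bound the scan to a single row-key range so ranges can run in parallel.
    row_set = RowSet()
    row_set.add_row_range_from_keys(start_key=start_key, end_key=end_key)

    records = []
    for row in table.read_rows(row_set=row_set):
        # Column family "cf" and qualifier b"payload" are placeholders.
        cell = row.cells["cf"][b"payload"][0]
        records.append(
            {
                "row_key": row.row_key.decode(),
                "payload": cell.value.decode(),
                "event_ts": cell.timestamp.isoformat(),
            }
        )

    # Write a Snappy-compressed Parquet chunk; a gs:// path works when pyarrow
    # has GCS filesystem support, otherwise upload the file separately.
    pq.write_table(pa.Table.from_pylist(records), out_path, compression="snappy")
    return out_path
```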
2. Data Cleansing and Validation
Each Parquet chunk was pulled from GCS into a dedicated pod running the following logic:
- Drop rows with missing required fields
- Normalize datetime columns to UTC
- Enforce type coercion based on a dynamic JSON schema version
- Deduplicate using composite hash keys
- Add metadata fields for batch ID and processing timestamp
This step was implemented using pandas and pyarrow, executed within KubernetesExecutor pods that scaled horizontally via HPA (CPU and memory thresholds).
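A minimal pandas sketch of that cleansing logic might look like the following; the column names, schema map, and hash key composition are assumptions for illustration rather than the project's actual definitions.

```python
import hashlib
from datetime import datetime, timezone

import pandas as pd

REQUIRED = ["event_id", "region", "event_ts"]
SCHEMA = {"event_id": "string", "region": "string"}  # stand-in for a versioned JSON schema


def cleanse_chunk(df: pd.DataFrame, batch_id: str) -> pd.DataFrame:
    # Drop rows missing any required field.
    df = df.dropna(subset=REQUIRED)

    # Normalize timestamps to UTC (naive values are assumed to be UTC already).
    df["event_ts"] = pd.to_datetime(df["event_ts"], utc=True)

    # Coerce types from the (versioned) schema definition.
    df = df.astype(SCHEMA)

    # Deduplicate on a composite hash of the business-key columns.
    df["row_hash"] = (
        df["event_id"].astype(str) + "|" + df["region"].astype(str) + "|" + df["event_ts"].astype(str)
    ).map(lambda s: hashlib.sha256(s.encode()).hexdigest())
    df = df.drop_duplicates(subset="row_hash")

    # Attach batch metadata for lineage.
    df["batch_id"] = batch_id
    df["processed_at"] = datetime.now(timezone.utc)
    return df
```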
3. Staging and Loading into Snowflake
Cleansed data was moved to an S3 staging area using parallel gsutil rsync operations (multi-cloud). Snowflake’s external stage mechanism was then used, with COPY INTO jobs configured via Airflow hooks.
- Partitioning by date and region improved bulk-load performance.
- Monitoring was integrated via Airflow callbacks and custom Prometheus metrics.
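As a sketch, a COPY INTO job issued through the Snowflake provider hook could look like this; the table, stage, file-format options, and connection ID are assumptions, not the production configuration.

```python
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

COPY_SQL = """
COPY INTO analytics.events
FROM @analytics.ext_stage_s3/{ds}/{region}/
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
ON_ERROR = 'ABORT_STATEMENT';
"""


def load_partition(ds: str, region: str) -> None:
    # Each (date, region) partition is loaded independently for bulk throughput.
    hook = SnowflakeHook(snowflake_conn_id="snowflake_default")
    hook.run(COPY_SQL.format(ds=ds, region=region))
```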
Why Airflow 2.9 and Kubernetes HPA?
- Airflow 2.9 provided improved TaskFlow decorators, native KubernetesExecutor enhancements, and better dependency tracking.
- HPA enabled resource-efficient processing: DAGs could run at low cost overnight and scale up during high-volume intervals without pre-allocating nodes.
- Using Airflow on Kubernetes gave the flexibility to update operator images independently and isolate failures at the pod level.
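One way the pod-level resource shaping can be expressed is through per-task executor_config pod overrides, as in the sketch below; the image, resource values, and task wiring are illustrative assumptions. Accurate requests are what allow the autoscaling machinery to size capacity sensibly.

```python
from airflow.decorators import task
from kubernetes.client import models as k8s

cleanse_pod_override = k8s.V1Pod(
    spec=k8s.V1PodSpec(
        containers=[
            k8s.V1Container(
                name="base",
                image="registry.example.com/etl/cleanse:1.4.2",  # hypothetical image
                resources=k8s.V1ResourceRequirements(
                    requests={"cpu": "1", "memory": "4Gi"},
                    limits={"cpu": "2", "memory": "8Gi"},
                ),
            )
        ]
    )
)


@task(executor_config={"pod_override": cleanse_pod_override})
def cleanse_chunk_task(gcs_path: str) -> str:
    # The cleansing logic from the previous section would run here.
    return gcs_path
```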
Results
- Average rows processed per day: ~7 million
- Typical DAG duration: ~28 minutes
- Max parallel pods (via HPA): 96
- Data validation failure rate: < 0.04%
- Snowflake ingestion throughput: ~9–11 million rows/min (bulk load)
Lessons Learned
- Explicit type checking and schema versioning helped prevent ingestion errors as the upstream schema evolved.
- TaskGroup modularization improved DAG observability and retry targeting.
- Multi-cloud object sync (GCS → S3) was reliable but required planning for cross-region egress costs.
- Observability using OpenTelemetry/Prometheus exporters was helpful for autoscaling diagnostics.
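As an illustration of the schema-versioning point, a version-keyed lookup that fails fast on unknown versions might look like this; the version tags and field maps are placeholders, not the project's actual registry.

```python
SCHEMAS = {
    "v1": {"event_id": "string", "value": "float64"},
    "v2": {"event_id": "string", "value": "float64", "region": "string"},
}


def schema_for(batch_meta: dict) -> dict:
    # Reject unknown versions rather than silently coercing columns.
    version = batch_meta.get("schema_version", "v1")
    if version not in SCHEMAS:
        raise ValueError(f"Unsupported schema version: {version}")
    return SCHEMAS[version]
```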
Conclusion
This implementation demonstrates a practical pattern for scalable, resilient data migration from Bigtable to Snowflake using Airflow 2.9 on Kubernetes. The architecture supported large volumes, variable workloads, and downstream data quality expectations with minimal ongoing maintenance.
🔍 Need help designing resilient ETL pipelines on Kubernetes?
We support teams building data pipelines with Airflow, Snowflake, and multi-cloud environments. Contact us for architectural reviews or operator design assistance.