
Airflow Data Pipeline: Migrating and Cleansing Bigtable Data to Snowflake

This document outlines a recent implementation involving the scheduled extraction, validation, transformation, and loading of time-series data from Google Bigtable to Snowflake using Apache Airflow 2.9 deployed on Kubernetes with Horizontal Pod Autoscaling (HPA). The project goal was to incrementally process large batches of records daily with data cleansing and minimal operational overhead.


Context

The source data consisted of approximately 1.2 billion records stored in Google Bigtable, with daily deltas ranging from 3M to 10M rows. The destination was Snowflake, where downstream analytics and dashboards relied on data consistency and latency guarantees. Constraints included:

  • Schema evolution over time
  • Nested timestamp alignment for time-series
  • Multi-region workload with throughput variability
  • Need for cloud portability and low operational complexity

Architecture Overview

The pipeline was built using:

  • Apache Airflow 2.9 with KubernetesExecutor
  • EKS (Elastic Kubernetes Service) with HPA configured per pod template
  • Cloud Functions and Pub/Sub for triggering
  • Parquet and PyArrow for I/O
  • Snowflake as target, using COPY INTO from external stage (S3)

A shared logging stack (FluentBit → OpenSearch) provided audit logging and metrics for observability.
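To make the wiring concrete before the step-by-step walkthrough, here is a simplified DAG skeleton. The DAG id, schedule, bucket names, and the use of dynamic task mapping (rather than the production TaskGroups) are illustrative, and the extract/cleanse/load bodies are placeholders for the step logic described below.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="bigtable_to_snowflake",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    max_active_tasks=96,  # ceiling roughly matching the autoscaling limit discussed later
)
def bigtable_to_snowflake():
    @task
    def list_row_key_ranges() -> list[dict]:
        # Placeholder: compute bounded row-key ranges for the daily delta.
        return [{"start": "a", "end": "m"}, {"start": "m", "end": "z"}]

    @task
    def extract(key_range: dict) -> str:
        # Placeholder: Bigtable bounded scan -> Parquet in GCS (step 1 below).
        return f"gs://export-bucket/{key_range['start']}-{key_range['end']}.parquet"

    @task
    def cleanse(gcs_uri: str) -> str:
        # Placeholder: pandas/pyarrow cleansing (step 2 below).
        return gcs_uri.replace("export-bucket", "clean-bucket")

    @task
    def load(clean_uris: list[str]) -> None:
        # Placeholder: sync to S3 stage and COPY INTO Snowflake (step 3 below).
        pass

    ranges = list_row_key_ranges()
    cleaned = cleanse.expand(gcs_uri=extract.expand(key_range=ranges))
    load(cleaned)


bigtable_to_snowflake()
```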


Pipeline Steps

1. Extraction from Bigtable

Each run began with a Cloud Function triggered via Pub/Sub, notifying Airflow of a new batch. A Python-based Bigtable client then performed bounded scans and emitted compressed Parquet files into GCS using PyArrow; a sketch of the scan logic follows the list below.

  • Extraction was parallelized by row key ranges.
  • Retry and backoff logic were handled in custom Airflow Operators.
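A minimal sketch of one bounded-scan worker, assuming a single column family named metrics and an illustrative helper signature; the production operators additionally wrapped this logic in the retry/backoff handling mentioned above.

```python
import pyarrow as pa
import pyarrow.parquet as pq
from google.cloud import bigtable, storage


def extract_range(project_id: str, instance_id: str, table_id: str,
                  start_key: bytes, end_key: bytes, out_uri: str) -> None:
    """Scan one bounded row-key range from Bigtable and land it in GCS as Parquet."""
    client = bigtable.Client(project=project_id)
    table = client.instance(instance_id).table(table_id)

    records = []
    for row in table.read_rows(start_key=start_key, end_key=end_key):
        # Flatten one column family ("metrics" is an assumed name) into a flat record,
        # keeping the cell timestamp for later time-series alignment.
        record = {"row_key": row.row_key.decode()}
        for qualifier, versions in row.cells.get("metrics", {}).items():
            record[qualifier.decode()] = versions[0].value
            record[f"{qualifier.decode()}_ts"] = versions[0].timestamp
        records.append(record)

    # Write a compressed Parquet chunk locally, then upload it to the export bucket.
    arrow_table = pa.Table.from_pylist(records)
    local_path = "/tmp/chunk.parquet"
    pq.write_table(arrow_table, local_path, compression="snappy")

    bucket_name, blob_name = out_uri.removeprefix("gs://").split("/", 1)
    storage.Client().bucket(bucket_name).blob(blob_name).upload_from_filename(local_path)
```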

2. Data Cleansing and Validation

Each Parquet chunk was pulled from GCS into a dedicated pod running the following logic:

  • Drop rows with missing required fields
  • Normalize datetime columns to UTC
  • Enforce type coercion based on a dynamic JSON schema version
  • Deduplicate using composite hash keys
  • Add metadata fields for batch ID and processing timestamp

This step was implemented using pandas and pyarrow, executed within KubernetesExecutor pods that scaled horizontally via HPA (CPU and memory thresholds).
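The cleansing pass can be sketched roughly as below. The required-field list, key columns, and schema dictionary layout are hypothetical, and the real job read and wrote Parquet via pyarrow around this core.

```python
import hashlib

import pandas as pd

# Hypothetical required fields and business-key columns for deduplication.
REQUIRED = ["row_key", "event_ts", "value"]
KEY_COLUMNS = ["row_key", "event_ts"]


def cleanse(df: pd.DataFrame, schema: dict, batch_id: str) -> pd.DataFrame:
    # 1. Drop rows with missing required fields.
    df = df.dropna(subset=REQUIRED).copy()

    # 2. Normalize datetime columns to UTC (schema["datetime_columns"] is assumed).
    for col in schema.get("datetime_columns", []):
        df[col] = pd.to_datetime(df[col], utc=True, errors="coerce")

    # 3. Coerce types from the versioned schema, e.g. {"dtypes": {"value": "float64"}}.
    df = df.astype(schema["dtypes"])

    # 4. Deduplicate on a composite hash of the key columns.
    composite = df[KEY_COLUMNS].astype(str).agg("|".join, axis=1)
    df["_hash"] = composite.map(lambda s: hashlib.sha256(s.encode()).hexdigest())
    df = df.drop_duplicates(subset="_hash").drop(columns="_hash")

    # 5. Attach batch metadata.
    df["batch_id"] = batch_id
    df["processed_at"] = pd.Timestamp.now(tz="UTC")
    return df
```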

3. Staging and Loading into Snowflake

Cleansed data was moved to an S3 staging area using parallel gsutil rsync operations (GCS → S3). Snowflake's external stage mechanism was then used, with COPY INTO jobs configured via Airflow hooks; a sketch of the load task follows the list below.

  • Partitioning by date and region improved bulk-load performance.
  • Monitoring was integrated via Airflow callbacks and custom Prometheus metrics.
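A hedged sketch of the load task, assuming an external stage named raw_s3_stage already points at the S3 staging area, an Airflow connection snowflake_default exists, and analytics.raw_events stands in for the real target table:

```python
from airflow.decorators import task
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook


@task
def load_partition(load_date: str, region: str) -> None:
    # Bulk-load one date/region partition from the external stage into the target table.
    sql = f"""
        COPY INTO analytics.raw_events
        FROM @raw_s3_stage/{load_date}/{region}/
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
        ON_ERROR = 'ABORT_STATEMENT'
    """
    SnowflakeHook(snowflake_conn_id="snowflake_default").run(sql)
```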

Why Airflow 2.9 and Kubernetes HPA?

  • Airflow 2.9 provided improved TaskFlow decorators, native KubernetesExecutor enhancements, and better dependency tracking.
  • HPA enabled resource-efficient processing: DAGs could run at low cost overnight and scale up during high-volume intervals without pre-allocating nodes (see the resource sketch after this list).
  • Using Airflow on Kubernetes gave the flexibility to update operator images independently and isolate failures at the pod level.
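For reference, per-task pod resources under KubernetesExecutor can be declared with an executor_config pod override, which is what the CPU/memory thresholds mentioned above key on. The values below are placeholders, not the production settings.

```python
from kubernetes.client import models as k8s

# Illustrative resource requests/limits for a cleansing task pod; attach with
# @task(executor_config=cleanse_executor_config) or an operator's executor_config.
cleanse_executor_config = {
    "pod_override": k8s.V1Pod(
        spec=k8s.V1PodSpec(
            containers=[
                k8s.V1Container(
                    name="base",  # must match the Airflow worker container name
                    resources=k8s.V1ResourceRequirements(
                        requests={"cpu": "1", "memory": "2Gi"},
                        limits={"cpu": "2", "memory": "4Gi"},
                    ),
                )
            ]
        )
    )
}
```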

Results

  • Average rows processed per day: ~7 million
  • Typical DAG duration: ~28 minutes
  • Max parallel pods (via HPA): 96
  • Data validation failure rate: < 0.04%
  • Snowflake ingestion throughput: ~9–11 million rows/min (bulk load)

Lessons Learned

  • Explicit type checking and schema versioning helped prevent ingestion errors as the upstream schema evolved.
  • TaskGroup modularization improved DAG observability and retry targeting.
  • Using multi-cloud object sync (GCS → S3) was reliable but required planning for cross-cloud egress costs.
  • Observability using OpenTelemetry/Prometheus exporters was helpful for autoscaling diagnostics.

Conclusion

This implementation demonstrates a practical pattern for scalable, resilient data migration from Bigtable to Snowflake using Airflow 2.9 on Kubernetes. The architecture supported large volumes, variable workloads, and downstream data quality expectations with minimal ongoing maintenance.


🔍 Need help designing resilient ETL pipelines on Kubernetes?

We support teams building data pipelines with Airflow, Snowflake, and multi-cloud environments. Contact us for architectural reviews or operator design assistance.

