
Airflow Data Pipeline: Migrating and Cleansing Bigtable Data to Snowflake

This document outlines a recent implementation involving the scheduled extraction, validation, transformation, and loading of time-series data from Google Bigtable to Snowflake using Apache Airflow 2.9 deployed on Kubernetes with Horizontal Pod Autoscaling (HPA). The project goal was to incrementally process large batches of records daily with data cleansing and minimal operational overhead.


Context

The source data consisted of approximately 1.2 billion records stored in Google Bigtable, with daily deltas ranging from 3M to 10M rows. The destination was Snowflake, where downstream analytics and dashboards relied on data consistency and latency guarantees. Constraints included:

  • Schema evolution over time
  • Nested timestamp alignment for time-series
  • Multi-region workload with throughput variability
  • Need for cloud portability and low operational complexity

Architecture Overview

The pipeline was built using:

  • Apache Airflow 2.9 with KubernetesExecutor
  • EKS (Elastic Kubernetes Service) with HPA configured per pod template
  • Cloud Functions and Pub/Sub for triggering
  • Parquet and PyArrow for I/O
  • Snowflake as target, using COPY INTO from external stage (S3)

A shared logging stack (FluentBit → OpenSearch) provided audit logging and metrics for observability.
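To make the wiring concrete before the step-by-step walkthrough, here is a simplified DAG skeleton. The DAG id, schedule, bucket names, and the use of dynamic task mapping (rather than the production TaskGroups) are illustrative, and the extract/cleanse/load bodies are placeholders for the step logic described below.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="bigtable_to_snowflake",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    max_active_tasks=96,  # ceiling roughly matching the autoscaling limit discussed later
)
def bigtable_to_snowflake():
    @task
    def list_row_key_ranges() -> list[dict]:
        # Placeholder: compute bounded row-key ranges for the daily delta.
        return [{"start": "a", "end": "m"}, {"start": "m", "end": "z"}]

    @task
    def extract(key_range: dict) -> str:
        # Placeholder: Bigtable bounded scan -> Parquet in GCS (step 1 below).
        return f"gs://export-bucket/{key_range['start']}-{key_range['end']}.parquet"

    @task
    def cleanse(gcs_uri: str) -> str:
        # Placeholder: pandas/pyarrow cleansing (step 2 below).
        return gcs_uri.replace("export-bucket", "clean-bucket")

    @task
    def load(clean_uris: list[str]) -> None:
        # Placeholder: sync to S3 stage and COPY INTO Snowflake (step 3 below).
        pass

    ranges = list_row_key_ranges()
    cleaned = cleanse.expand(gcs_uri=extract.expand(key_range=ranges))
    load(cleaned)


bigtable_to_snowflake()
```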


Pipeline Steps

1. Extraction from Bigtable

Each run began with a Cloud Function triggered via Pub/Sub, notifying Airflow of a new batch. A Python-based Bigtable client then performed bounded scans and emitted compressed Parquet files into GCS using PyArrow; a sketch of the scan logic follows the list below.

  • Extraction was parallelized by row key ranges.
  • Retry and backoff logic were handled in custom Airflow Operators.
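A minimal sketch of one bounded-scan worker, assuming a single column family named metrics and an illustrative helper signature; the production operators additionally wrapped this logic in the retry/backoff handling mentioned above.

```python
import pyarrow as pa
import pyarrow.parquet as pq
from google.cloud import bigtable, storage


def extract_range(project_id: str, instance_id: str, table_id: str,
                  start_key: bytes, end_key: bytes, out_uri: str) -> None:
    """Scan one bounded row-key range from Bigtable and land it in GCS as Parquet."""
    client = bigtable.Client(project=project_id)
    table = client.instance(instance_id).table(table_id)

    records = []
    for row in table.read_rows(start_key=start_key, end_key=end_key):
        # Flatten one column family ("metrics" is an assumed name) into a flat record,
        # keeping the cell timestamp for later time-series alignment.
        record = {"row_key": row.row_key.decode()}
        for qualifier, versions in row.cells.get("metrics", {}).items():
            record[qualifier.decode()] = versions[0].value
            record[f"{qualifier.decode()}_ts"] = versions[0].timestamp
        records.append(record)

    # Write a compressed Parquet chunk locally, then upload it to the export bucket.
    arrow_table = pa.Table.from_pylist(records)
    local_path = "/tmp/chunk.parquet"
    pq.write_table(arrow_table, local_path, compression="snappy")

    bucket_name, blob_name = out_uri.removeprefix("gs://").split("/", 1)
    storage.Client().bucket(bucket_name).blob(blob_name).upload_from_filename(local_path)
```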

2. Data Cleansing and Validation

Each Parquet chunk was pulled from GCS into a dedicated pod running the following logic:

  • Drop rows with missing required fields
  • Normalize datetime columns to UTC
  • Enforce type coercion based on a dynamic JSON schema version
  • Deduplicate using composite hash keys
  • Add metadata fields for batch ID and processing timestamp

This step was implemented using pandas and pyarrow, executed within KubernetesExecutor pods that scaled horizontally via HPA (CPU and memory thresholds).
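The cleansing pass can be sketched roughly as below. The required-field list, key columns, and schema dictionary layout are hypothetical, and the real job read and wrote Parquet via pyarrow around this core.

```python
import hashlib

import pandas as pd

# Hypothetical required fields and business-key columns for deduplication.
REQUIRED = ["row_key", "event_ts", "value"]
KEY_COLUMNS = ["row_key", "event_ts"]


def cleanse(df: pd.DataFrame, schema: dict, batch_id: str) -> pd.DataFrame:
    # 1. Drop rows with missing required fields.
    df = df.dropna(subset=REQUIRED).copy()

    # 2. Normalize datetime columns to UTC (schema["datetime_columns"] is assumed).
    for col in schema.get("datetime_columns", []):
        df[col] = pd.to_datetime(df[col], utc=True, errors="coerce")

    # 3. Coerce types from the versioned schema, e.g. {"dtypes": {"value": "float64"}}.
    df = df.astype(schema["dtypes"])

    # 4. Deduplicate on a composite hash of the key columns.
    composite = df[KEY_COLUMNS].astype(str).agg("|".join, axis=1)
    df["_hash"] = composite.map(lambda s: hashlib.sha256(s.encode()).hexdigest())
    df = df.drop_duplicates(subset="_hash").drop(columns="_hash")

    # 5. Attach batch metadata.
    df["batch_id"] = batch_id
    df["processed_at"] = pd.Timestamp.now(tz="UTC")
    return df
```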

3. Staging and Loading into Snowflake

Cleansed data was moved to an S3 staging area using parallel gsutil rsync operations (GCS → S3). Snowflake's external stage mechanism was then used, with COPY INTO jobs configured via Airflow hooks; a sketch of the load task follows the list below.

  • Partitioning by date and region improved bulk-load performance.
  • Monitoring was integrated via Airflow callbacks and custom Prometheus metrics.
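A hedged sketch of the load task, assuming an external stage named raw_s3_stage already points at the S3 staging area, an Airflow connection snowflake_default exists, and analytics.raw_events stands in for the real target table:

```python
from airflow.decorators import task
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook


@task
def load_partition(load_date: str, region: str) -> None:
    # Bulk-load one date/region partition from the external stage into the target table.
    sql = f"""
        COPY INTO analytics.raw_events
        FROM @raw_s3_stage/{load_date}/{region}/
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
        ON_ERROR = 'ABORT_STATEMENT'
    """
    SnowflakeHook(snowflake_conn_id="snowflake_default").run(sql)
```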

Why Airflow 2.9 and Kubernetes HPA?

  • Airflow 2.9 provided improved TaskFlow decorators, native KubernetesExecutor enhancements, and better dependency tracking.
  • HPA enabled resource-efficient processing: DAGs could run at low cost overnight and scale up during high-volume intervals without pre-allocating nodes (see the resource sketch after this list).
  • Using Airflow on Kubernetes gave the flexibility to update operator images independently and isolate failures at the pod level.
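For reference, per-task pod resources under KubernetesExecutor can be declared with an executor_config pod override, which is what the CPU/memory thresholds mentioned above key on. The values below are placeholders, not the production settings.

```python
from kubernetes.client import models as k8s

# Illustrative resource requests/limits for a cleansing task pod; attach with
# @task(executor_config=cleanse_executor_config) or an operator's executor_config.
cleanse_executor_config = {
    "pod_override": k8s.V1Pod(
        spec=k8s.V1PodSpec(
            containers=[
                k8s.V1Container(
                    name="base",  # must match the Airflow worker container name
                    resources=k8s.V1ResourceRequirements(
                        requests={"cpu": "1", "memory": "2Gi"},
                        limits={"cpu": "2", "memory": "4Gi"},
                    ),
                )
            ]
        )
    )
}
```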

Results

  • Average rows processed per day: ~7 million
  • Typical DAG duration: ~28 minutes
  • Max parallel pods (via HPA): 96
  • Data validation failure rate: < 0.04%
  • Snowflake ingestion throughput: ~9–11 million rows/min (bulk load)

Lessons Learned

  • Explicit type checking and schema versioning helped prevent ingestion errors as the upstream schema evolved.
  • TaskGroup modularization improved DAG observability and retry targeting.
  • Using multi-cloud object sync (GCS → S3) was reliable but required planning for cross-cloud egress costs.
  • Observability using OpenTelemetry/Prometheus exporters was helpful for autoscaling diagnostics.

Conclusion

This implementation demonstrates a practical pattern for scalable, resilient data migration from Bigtable to Snowflake using Airflow 2.9 on Kubernetes. The architecture supported large volumes, variable workloads, and downstream data quality expectations with minimal ongoing maintenance.


🔍 Need help designing resilient ETL pipelines on Kubernetes?

We support teams building data pipelines with Airflow, Snowflake, and multi-cloud environments. Contact us for architectural reviews or operator design assistance.

