
Data Engineering With DLT Pipeline

I engineered a fully automated, cloud-native data platform on AWS to process and analyze Finland's national power grid data. This solution leverages a modern data stack to transform raw data from Fingrid's Open Data API into a governed, analytics-ready lakehouse, providing critical insights into energy markets.
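
As a rough illustration of the first hop, the sketch below pulls one dataset from the Fingrid Open Data API and lands the raw JSON in S3, where the lakehouse pipeline picks it up. The dataset ID, bucket, and key layout are hypothetical placeholders; the endpoint and x-api-key header follow Fingrid's public API documentation, but treat the whole snippet as a minimal sketch rather than the project's actual ingestion code.

```python
# Minimal sketch: pull one Fingrid Open Data dataset and land the raw JSON
# in S3. The dataset ID, bucket, and key layout are hypothetical placeholders;
# the endpoint and x-api-key header follow Fingrid's public API docs.
import datetime
import json
import os

import boto3
import requests

FINGRID_API = "https://data.fingrid.fi/api/datasets/{dataset_id}/data"
DATASET_ID = 124                  # placeholder dataset ID
BUCKET = "my-lakehouse-landing"   # placeholder landing bucket


def fetch_and_land(dataset_id: int, bucket: str) -> str:
    """Fetch the latest records for one dataset and write them to S3."""
    resp = requests.get(
        FINGRID_API.format(dataset_id=dataset_id),
        headers={"x-api-key": os.environ["FINGRID_API_KEY"]},
        timeout=30,
    )
    resp.raise_for_status()
    # Partition the landing zone by dataset and date so Auto Loader can
    # discover new files incrementally.
    now = datetime.datetime.now(datetime.timezone.utc)
    key = f"raw/fingrid/{dataset_id}/{now:%Y/%m/%d}/{now:%H%M%S}.json"
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(resp.json()).encode("utf-8"),
    )
    return key


if __name__ == "__main__":
    print(fetch_and_land(DATASET_ID, BUCKET))
```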

  • Infrastructure as Code (IaC) and CI/CD: The entire AWS infrastructure, including S3 buckets and Databricks workspace configurations, is provisioned and managed declaratively using Terraform. This is integrated into a CI/CD pipeline (using Bitbucket Pipelines/GitHub Actions) for automated, repeatable deployments of both infrastructure and data pipelines across development, test, and production environments.

  • Declarative ETL with Delta Live Tables: Data processing is orchestrated with Databricks Delta Live Tables (DLT). This declarative framework is used to build reliable, maintainable, and testable data processing workflows with built-in data quality management, incremental loading, schema evolution, and automated error handling. (A minimal pipeline sketch follows after this list.)

  • Packaged Deployments with Asset Bundles: The entire Databricks project—including DLT pipeline definitions, notebooks, and Python dependencies—is packaged using Databricks Asset Bundles. This IaC approach bundles all project components and settings into a single, version-controlled unit, enabling automated and reliable deployments through the CI/CD workflow.

  • Medallion Lakehouse on AWS S3: The solution implements a Medallion architecture on AWS S3, with data stored in the Delta Lake format. The DLT pipeline manages the flow of data from raw ingestion (bronze), through cleansing and enrichment (silver), to business-aggregated datasets (gold), ensuring ACID transactions and full data versioning. (The bronze and silver stages appear in the first sketch after this list.)

  • Analytics-Ready Data Modeling: In the gold layer, data is modeled into a star schema, providing optimized, high-performance datasets for downstream consumption. These curated datasets serve as a single source of truth for creating interactive dashboards and reports in business intelligence tools like Power BI. (A sketch of the gold-layer star schema follows the first sketch below.)
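
To make the declarative ETL and medallion bullets concrete, here is a minimal DLT sketch of the bronze and silver stages. Table names, the S3 path, source field names, and the expectation rules are illustrative assumptions, not the project's actual definitions; the dlt decorators and Auto Loader (cloudFiles) options are standard Databricks APIs.

```python
# Illustrative DLT pipeline: bronze ingestion with Auto Loader, silver
# cleansing with declarative expectations. Paths, table names, source field
# names, and quality rules are hypothetical; `spark` is provided by the DLT
# runtime.
import dlt
from pyspark.sql import functions as F

LANDING_PATH = "s3://my-lakehouse-landing/raw/fingrid/"  # placeholder path


@dlt.table(comment="Raw Fingrid records ingested incrementally from S3.")
def bronze_fingrid():
    return (
        spark.readStream.format("cloudFiles")  # Auto Loader
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", LANDING_PATH + "_schema")
        .load(LANDING_PATH)
    )


@dlt.table(comment="Validated, typed Fingrid measurements.")
@dlt.expect_or_drop("valid_timestamp", "event_time IS NOT NULL")
@dlt.expect_or_drop("valid_value", "value IS NOT NULL")
def silver_fingrid():
    # Field names below are assumptions about the raw JSON layout.
    return dlt.read_stream("bronze_fingrid").select(
        F.col("datasetId").alias("dataset_id"),
        F.to_timestamp("startTime").alias("event_time"),
        F.col("value").cast("double").alias("value"),
    )
```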
     
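The gold layer can then model the silver data into a star schema. Again, table and column names are hypothetical; the point is the shape: a small date dimension plus a pre-aggregated fact table that BI tools can join cheaply.

```python
# Illustrative gold layer: a small star schema with one dimension and one
# pre-aggregated fact table. Table and column names are hypothetical.
import dlt
from pyspark.sql import functions as F


@dlt.table(comment="Date dimension derived from observed event times.")
def dim_date():
    return (
        dlt.read("silver_fingrid")
        .select(F.to_date("event_time").alias("date"))
        .distinct()
        .withColumn("year", F.year("date"))
        .withColumn("month", F.month("date"))
        .withColumn("day_of_week", F.dayofweek("date"))
    )


@dlt.table(comment="Hourly measurements by dataset; joins to dim_date on date.")
def fact_grid_hourly():
    return (
        dlt.read("silver_fingrid")
        .groupBy(
            "dataset_id",
            F.to_date("event_time").alias("date"),
            F.hour("event_time").alias("hour"),
        )
        .agg(
            F.avg("value").alias("avg_value"),
            F.max("value").alias("max_value"),
        )
    )
```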

More details about the Terraform configuration and the DLT pipelines in Databricks can be found here.

Architecture


Technical Solution

 

This solution showcases best practices in modern data engineering through:

  • Databricks Declarative ETL, Asset Bundles, and Infrastructure as Code (IaC): The entire AWS infrastructure, including S3 buckets and Databricks configurations, is declaratively managed with Terraform. This is integrated with a CI/CD pipeline (Databricks Asset Bundles deployed via Bitbucket Pipelines/GitHub Actions) for automated, repeatable deployments. Data pipelines are built with Databricks Delta Live Tables (DLT), which takes a declarative approach to building reliable and maintainable workflows.

  • Efficient Incremental Processing: The DLT pipeline leverages Databricks Auto Loader for efficient, incremental data ingestion from S3, automatically picking up new files as they arrive. Ingestion state (which files have already been processed) is tracked automatically in a checkpoint, so each run reads only new data. (See the Auto Loader sketch after this list.)

  • Medallion Architecture with Delta Lake: The solution implements a Medallion Architecture using Delta Lake on AWS S3. Data is progressively refined through bronze (raw), silver (validated), and gold (business-ready) layers. This provides ACID transactions, schema enforcement, and data versioning for exceptional data reliability.

  • Unified Governance with Unity Catalog: Databricks Unity Catalog is integrated across the AWS environment to establish a centralized governance framework. This enforces fine-grained access control, provides end-to-end data lineage tracking, and creates a unified security model for all data assets. (See the grants sketch after this list.)

  • End-to-End Integration for BI: The platform creates a seamless connection from the source API to the final business intelligence layer (Power BI). In the gold layer, data is modeled into a star schema, ensuring that analytics and dashboards are performant and built upon a single source of truth.
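
For reference, the same Auto Loader mechanics outside DLT look like the sketch below: schema inference and evolution via the schema location, and exactly-once file tracking via the streaming checkpoint. Paths and the target table are placeholders; the cloudFiles options and availableNow trigger are standard Spark/Databricks APIs.

```python
# Stand-alone Auto Loader sketch showing the incremental mechanics the DLT
# pipeline relies on: schema inference/evolution via the schema location, and
# exactly-once file tracking via the streaming checkpoint. Paths and the
# target table are placeholders; assumes a Databricks environment where
# `spark` is defined.
(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://my-lakehouse/_schemas/fingrid")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("s3://my-lakehouse-landing/raw/fingrid/")
    .writeStream.format("delta")
    .option("checkpointLocation", "s3://my-lakehouse/_checkpoints/fingrid_raw")
    .trigger(availableNow=True)  # process all new files, then stop
    .toTable("dev.bronze.fingrid_raw")
)
```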

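Unity Catalog governance ultimately reduces to SQL on the three-level namespace. The catalog, schema, and group names below are placeholders; the privilege names are standard Unity Catalog SQL.

```python
# Illustrative Unity Catalog governance: a three-level namespace plus
# fine-grained grants. Catalog, schema, and group names are placeholders;
# assumes a Databricks environment where `spark` is defined.
spark.sql("CREATE CATALOG IF NOT EXISTS fingrid_lakehouse")
spark.sql("CREATE SCHEMA IF NOT EXISTS fingrid_lakehouse.gold")

# Analysts may read curated gold tables only; bronze/silver stay restricted.
spark.sql("GRANT USE CATALOG ON CATALOG fingrid_lakehouse TO `analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA fingrid_lakehouse.gold TO `analysts`")
spark.sql("GRANT SELECT ON SCHEMA fingrid_lakehouse.gold TO `analysts`")
```
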
Centralized Control Table


Databricks Ingestion & DLT Pipeline


ETL/DLT Pipelines in Databricks


 

More details about the Databricks Asset Bundles can be found here.

More details about the notebooks and DLT pipelines in Databricks can be found here.
