Data Engineer & BI Analyst

Fingrid Data Engineering Project
I developed a scalable, metadata-driven data engineering solution that harvests Finland's national power grid data from Fingrid's Open Data API and transforms it into actionable business intelligence. The system employs an enterprise architecture that processes electricity consumption and power generation forecast data through a configurable ETL pipeline.
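
As a rough illustration of the source side (the production ingestion is handled by ADF, not this script), the sketch below pulls one dataset from Fingrid's Open Data API with Python. The dataset id, query parameters, and response envelope are illustrative assumptions and should be checked against Fingrid's current API documentation.

```python
import os
import requests

# Illustrative only: the endpoint shape, parameter names, and dataset id
# below are assumptions, not verified against Fingrid's current API docs.
API_BASE = "https://data.fingrid.fi/api/datasets"
DATASET_ID = 165  # placeholder dataset id
API_KEY = os.environ["FINGRID_API_KEY"]  # Fingrid issues personal API keys


def fetch_dataset(dataset_id: int, start: str, end: str) -> list[dict]:
    """Return raw observations for one dataset and time window."""
    response = requests.get(
        f"{API_BASE}/{dataset_id}/data",
        headers={"x-api-key": API_KEY},
        params={"startTime": start, "endTime": end},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("data", [])  # assumed response envelope


rows = fetch_dataset(DATASET_ID, "2024-01-01T00:00:00Z", "2024-01-02T00:00:00Z")
print(f"Fetched {len(rows)} observations")
```
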
- Azure Data Factory (ADF) orchestrates a robust data ingestion pipeline, with CI/CD automation implemented via ARM templates and GitHub Actions.
- Databricks drives data processing with Autoloader and PySpark for efficient incremental loading and transformation (a minimal ingestion sketch follows this list).
- Delta tables enable reliable, versioned data management and time travel capabilities on data stored in ADLS Gen2, while Unity Catalog enforces centralized governance and fine-grained access control.
- The medallion data lakehouse pattern organizes data into progressive layers—bronze for raw ingestion, silver for cleansed and enriched data, and gold for analytics-ready datasets in star schema—ensuring data quality and streamlined data transformation.
- Azure Data Factory (ADF) manages incremental loading with configurable batch sizes and state management, while Databricks performs incremental data processing across all medallion layers (raw → bronze → silver → gold) to ensure efficient and reliable data transformation.
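
On the Databricks side, a minimal sketch of the Autoloader ingestion into the bronze layer is shown below; the storage paths, checkpoint location, and table name are hypothetical placeholders, and `spark` is the notebook's built-in session.

```python
# Minimal Auto Loader sketch: incrementally land raw Fingrid JSON files
# into a bronze Delta table. Paths and table names are placeholders;
# `spark` is the Databricks notebook session.
from pyspark.sql import functions as F

raw_path = "abfss://raw@<storage_account>.dfs.core.windows.net/fingrid/"
checkpoint_path = "abfss://meta@<storage_account>.dfs.core.windows.net/checkpoints/bronze_fingrid/"
bronze_table = "fingrid.bronze.observations"  # Unity Catalog three-level name

bronze_stream = (
    spark.readStream
    .format("cloudFiles")                                  # Auto Loader source
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)  # schema tracking and evolution
    .load(raw_path)
    .withColumn("ingested_at", F.current_timestamp())
)

(
    bronze_stream.writeStream
    .option("checkpointLocation", checkpoint_path)  # state for incremental file discovery
    .trigger(availableNow=True)                     # process only new files, then stop
    .toTable(bronze_table)
)
```

Running with `availableNow` keeps the job batch-like while the Auto Loader checkpoint ensures each source file is processed exactly once.
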
More details about the ADF pipeline and the Databricks notebooks can be found here.
Architecture
Technical Solution
This solution showcases best practices in modern data engineering through:

- Metadata-Driven ETL: Implemented parameterized Azure Data Factory pipelines controlled through a centralized configuration table, enabling dynamic processing without code changes (a control-table loop is sketched after this list).
- Incremental Data Loading (an incremental upsert sketch also follows the list):
  - ADF: Designed an efficient data ingestion process with configurable batch sizes and state tracking for optimal performance and resource utilization.
  - Databricks: Implemented an automated data ingestion pipeline using Autoloader, loading incrementally from the last refresh timestamp for efficient state tracking and optimized resource utilization.
- Medallion Architecture: Structured data through bronze (raw), silver (validated), and gold (business-ready) layers in Databricks for progressive data quality improvement.
- Unity Catalog Integration: Incorporated Databricks Unity Catalog to establish enterprise-grade governance across the entire data platform (example grants are sketched after this list).
- Delta Lake Implementation: Utilized Delta Lake as the core storage format throughout the data lakehouse, providing ACID transaction guarantees and schema enforcement for data integrity.
- End-to-End Integration: Seamlessly connected the source system (Fingrid API) to business intelligence tools (Power BI) with appropriate transformations at each stage.
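
To make the metadata-driven approach more concrete, here is a minimal sketch of how a control table could drive the Databricks side of the load; the table and column names and the `process_dataset` helper are hypothetical, not the project's actual schema.

```python
# Hypothetical sketch of metadata-driven processing on the Databricks side:
# a control table lists the datasets to load, and one parameterized function
# is driven by every active row. Table and column names are placeholders;
# `spark` is the Databricks notebook session.
from pyspark.sql import functions as F

control = spark.table("fingrid.meta.control_table").filter("is_active = true")


def process_dataset(dataset_id: int, target_table: str, watermark_column: str) -> None:
    """Append rows newer than the stored watermark for one configured dataset."""
    last_refresh = (
        spark.table("fingrid.meta.load_state")
        .filter(F.col("dataset_id") == dataset_id)
        .agg(F.max("last_refresh_ts"))
        .first()[0]
    )
    incoming = (
        spark.table("fingrid.bronze.observations")
        .filter(F.col("dataset_id") == dataset_id)
        .filter(F.col(watermark_column) > F.lit(last_refresh))  # assumes a prior state row exists
    )
    incoming.write.mode("append").saveAsTable(target_table)


# The control table is small, so collecting it to the driver is fine.
for row in control.collect():
    process_dataset(row["dataset_id"], row["target_table"], row["watermark_column"])
```
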
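
A second sketch shows an incremental bronze-to-silver upsert with a Delta Lake MERGE, which is where the ACID guarantees pay off; the key columns, table names, and watermark logic are assumptions.

```python
# Hypothetical bronze-to-silver incremental upsert with a Delta Lake MERGE.
# Table names, key columns, and the watermark logic are placeholders;
# `spark` is the Databricks notebook session.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Watermark from a (hypothetical) state table; assumes a prior run exists.
last_refresh = spark.sql(
    "SELECT max(last_refresh_ts) FROM fingrid.meta.load_state"
).first()[0]

updates = (
    spark.table("fingrid.bronze.observations")
    .filter(F.col("ingested_at") > F.lit(last_refresh))
    .dropDuplicates(["dataset_id", "start_time"])        # one row per dataset and timestamp
    .withColumn("value", F.col("value").cast("double"))  # basic cleansing for the silver layer
)

silver = DeltaTable.forName(spark, "fingrid.silver.observations")

(
    silver.alias("t")
    .merge(
        updates.alias("s"),
        "t.dataset_id = s.dataset_id AND t.start_time = s.start_time",
    )
    .whenMatchedUpdateAll()      # ACID upsert: refresh existing observations
    .whenNotMatchedInsertAll()   # insert new observations
    .execute()
)
```

The MERGE keeps reruns idempotent: reprocessing the same window updates existing rows instead of duplicating them.
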
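
Finally, Unity Catalog permissions on the gold layer can be granted with standard SQL from a notebook; the catalog, schema, and group names below are placeholders.

```python
# Placeholder Unity Catalog grants for the analytics-ready gold layer;
# catalog, schema, and group names are illustrative.
spark.sql("GRANT USE CATALOG ON CATALOG fingrid TO `bi_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA fingrid.gold TO `bi_analysts`")
spark.sql("GRANT SELECT ON SCHEMA fingrid.gold TO `bi_analysts`")
```
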
Centralized Control Table

ADF CI/CD deployment pipeline

Azure Data Factory Pipeline


Workflows in Databricks
