Build data pipelines you can trust
September 27, 2025
A compact view of ETL/ELT, data lakes, and streaming systems with a focus on quality and operational simplicity.
Model the business first, then the tables
Data projects fail when no one can answer “what does this field or metric actually mean?”
- Define entities and metrics in plain language
- Track metric ownership and calculation rules (one way to encode them is sketched after this list)
- Keep dimensional models and event models explicit
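A lightweight way to make a calculation rule explicit is to encode the metric as a documented view. This is only a sketch: the metrics schema, the curated.orders table, and the monthly_active_customers definition are illustrative assumptions, not a prescribed model.
-- Metric: monthly_active_customers
-- Owner: analytics; rule: distinct customers with a completed order in the month
create or replace view metrics.monthly_active_customers as
select date_trunc('month', order_completed_at) as month,
       count(distinct customer_id)             as active_customers
from curated.orders
where status = 'completed'
group by 1;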
ETL vs ELT: pick based on where transformations belong
Common patterns:
- ETL: transform before loading when the target enforces a strict schema or makes in-warehouse transforms expensive
- ELT: load raw data, transform in the warehouse/lakehouse for flexibility
In practice, you want both: raw ingestion plus curated, versioned transforms.
Example (incremental load pattern):
-- Watermark-based incremental load: copy only rows newer than the last load.
-- Assumes ingested_at never moves backwards; late-arriving rows need a separate backfill.
insert into curated.events
select *
from raw.events
where ingested_at > (select coalesce(max(ingested_at), timestamp '1970-01-01')
                     from curated.events);
Data lakes: organize for discovery and governance
A lake without structure becomes a dumping ground. Establish:
- Naming conventions and partitioning strategy (see the DDL sketch below)
- Schema evolution rules
- Access controls by dataset and role
Treat datasets like products: documented, owned, and monitored.
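As a rough sketch of the partitioning and access points above, Spark-style DDL might look like the following. The lake.events dataset, its columns, and analyst_role are assumptions, and the exact syntax varies by engine.
-- Partitioning by event date lets common queries prune files instead of scanning the lake
create table lake.events (
  event_id   string,
  user_id    string,
  event_type string,
  payload    string,
  event_date date
)
using parquet
partitioned by (event_date);

-- Access is granted per dataset and role, not per ad-hoc user
grant select on table lake.events to role analyst_role;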
Streaming: start simple, then grow sophistication
Real-time pipelines introduce new failure modes:
- Late or out-of-order events
- Duplicates and replays
- Backpressure and consumer lag
Design for idempotency and implement clear replay procedures early.
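One common way to get idempotency at the landing step is to upsert on a unique event key instead of appending blindly, so duplicates and replays collapse into no-ops. A minimal sketch, assuming each event carries a unique event_id and the engine supports standard SQL MERGE; staging.events_batch is a hypothetical staging table:
-- Duplicate deliveries and replays update-or-skip rather than inserting twice
merge into curated.events as tgt
using staging.events_batch as src
  on tgt.event_id = src.event_id
when matched then update set
  payload     = src.payload,
  ingested_at = src.ingested_at
when not matched then insert (event_id, payload, ingested_at)
  values (src.event_id, src.payload, src.ingested_at);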
Quality and observability are non-negotiable
If you can’t trust the data, it won’t be used:
- Data tests (nulls, ranges, uniqueness, referential checks; see the SQL sketch below)
- Freshness and completeness monitoring
- Lineage to answer “where did this number come from?”
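Such checks can start as plain SQL queries that return zero rows when the data is healthy, which is also how dbt-style tests behave. A minimal sketch; the curated.events table and the two-hour freshness threshold are assumptions:
-- Uniqueness test: any row returned is a duplicated event_id
select event_id, count(*) as occurrences
from curated.events
group by event_id
having count(*) > 1;

-- Freshness test: returns a row only if nothing has landed in the last two hours
select max(ingested_at) as last_ingested
from curated.events
having max(ingested_at) < current_timestamp - interval '2' hour;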
Hi, I'm Martin Duchev. You can find more about my projects on my GitHub page.