Data pipelines: the hard part is the data, not the pipeline

September 27, 2025

ETL, ELT, data lakes, and streaming — tools and patterns for pipelines you can trust, with an honest look at where they usually break.


The most common failure mode

Data projects fail when the numbers in the dashboard are wrong and nobody knows it. Users stop trusting the data. Decisions get made on gut feel anyway. The pipeline keeps running, the tables keep populating, and the whole system becomes expensive theater.

The root cause is almost always not the pipeline architecture. It's unclear metric definitions, missing data quality checks, and no lineage when something changes upstream.

Before thinking about streaming vs batch, Kafka vs Kinesis, or dbt vs custom transforms — define your metrics in plain language and write down who owns them. A metric that two people define differently will always produce two different numbers.

ETL vs ELT: a practical decision

The architectural question is simpler than the debates suggest: where is it cheaper and safer to transform?

ETL (transform before loading) makes sense when the destination is expensive to query (legacy data warehouses, constrained storage), or when PII must be removed before data enters the warehouse at all.

ELT (load raw, transform in warehouse) makes sense when compute is cheap in the warehouse (BigQuery, Snowflake, Redshift), you want to keep raw data for reprocessing, and transformation logic needs to evolve frequently.

In practice, most modern stacks do both: raw tables in one schema for ingestion, curated tables produced by dbt models that own the transformation logic. The raw layer is the safety net; the curated layer is what analysts use.

-- Incremental pattern: don't reload what you already have
INSERT INTO curated.events
SELECT *
FROM raw.events
WHERE ingested_at > (
    SELECT COALESCE(MAX(ingested_at), '1970-01-01') FROM curated.events
);

Data quality is not optional

If you don't test the data, you don't know if it's correct. dbt's built-in tests are the minimum:

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'processing', 'shipped', 'cancelled']
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: id

These catch data contract violations at the model level. Add freshness checks so you know when sources stop delivering. Add anomaly detection (row count drops, null rate spikes) for critical tables.

The moment a data issue is caught by a dashboard user rather than a pipeline check, you've already lost trust.

Streaming: start with the failure modes

Real-time pipelines introduce problems batch pipelines don't have:

Late-arriving events — an event timestamped 5 minutes ago arrives now. Does your aggregation window handle it? Do you reprocess, discard, or accumulate?

Exactly-once semantics — "at least once" delivery means you'll see duplicate events. Every consumer needs to be idempotent or you need deduplication before the consumer sees the data.

Consumer lag — if producers are outpacing consumers, lag builds up. You need to measure it and alert on it before it becomes hours of backlog at a critical time.

Design for idempotency from the start, even when it feels like gold-plating. The first time you need to replay a Kafka topic, you'll be glad you did.

Treat datasets like products

A data lake without ownership is a data swamp. Every dataset should have: an owner who answers questions about it, documentation of what it contains and how it's produced, a defined SLA for freshness, and a place to report issues.

This isn't bureaucracy. It's the difference between a dataset that gets used and one that accumulates in storage and slowly corrupts downstream models as the source system changes and nobody notices.

References

Hi, I'm Martin Duchev. You can find more about my projects on my GitHub.