
AI is making the manual writing of complex data pipelines a thing of the past! If AI is writing the pipelines, what tasks are left for data engineers to work on?
Conceptual knowledge is becoming king in 2025 and beyond!
In this article we will cover:
How AI will impact data engineering
What skills and responsibilities are more/less vulnerable to AI
How tools like Cursor and Windsurf make development much faster
What design patterns and best practices you should know as a data engineer to supercharge your career with AI
How AI will impact data engineering
This is the area data engineers are most worried about. Will AI take all the jobs? How will things look in the future?
You can think of data engineering responsibilities on a few axes:
Technical and Tactical (hard skills that have a day-to-day focus)
Writing Spark and SQL code - medium risk of disruption
Tools like Cursor and Windsurf make the codegen here much faster. You still need to test and review the code though, so it's not at high risk of disruption yet.
Fixing broken pipelines on call - high risk of disruption
A large majority of pipeline failures are actually false positives caused by badly tuned data quality checks or memory errors, exactly the kind of repetitive triage AI handles well. This should be a huge relief to data engineers who are currently 80%+ burnt out!
Technical and Strategic (hard skills that have a long-term focus)
Building data processing frameworks - low risk of disruption
Making improvements to things like Airflow, Spark, etc. will still need to be done by humans. AI is good at stitching together things that have already been done before; it's not as great at improving things that already exist. If you don't believe me, try having Cursor tackle a large tech-debt item in a big code base. It will hallucinate and fail even with the most state-of-the-art models.
Automated data quality - medium risk of disruption
Generating the simple Great Expectations or SQL Mesh quality check queries will be much easier with AI. Having the business context to make these checks really powerful will still be outside the reach of AI for quite some time.
Writing tests for your pipelines - medium risk of disruption
Generating fake input and output data for your pipeline unit tests is something AI is really good at doing. Being thorough with edge cases and understanding business-relevant tests is still something engineers need to do.
Soft skills and tactical (communication skills focused on day-to-day)
Sprint planning - medium risk of disruption
Sprint planning has a human negotiation element to it that will be harder for AI to replace. The opportunity sizing and organization elements will be greatly streamlined by AI though.
Writing documentation - medium risk of disruption
Documentation boilerplate will be greatly augmented by AI and the time it takes to maintain good documentation will be reduced. AI will still struggle with business context necessary to completely automate this task though.
Answering business questions - high risk of disruption
This is only high risk of disruption under the following conditions:
The data engineer models the data correctly, and the data documentation is stellar and accessible to AI.
If these conditions are met, AI will make quick work of 90-95% of business questions.
This part of the job will go from being a tactical conversation to a strategic knowledge-center-building task for data engineers!
Soft skills and strategic (communication skills focused long-term)
Creating better pipeline generation processes - low risk of disruption
What processes are required to create trustworthy data changes over time, and getting everybody to agree on them is the hard part. This consensus building within businesses will take a long time for AI to take over.
Conceptual data modeling - low risk of disruption
AI is good at brainstorming. Injecting AI with the necessary business context to be great at conceptual data modeling will be difficult. Conceptual data modeling involves a ton of conversations. AI could aid here but at the end of the day, agreeing on what to build is a human activity.
Creating data best practices - low risk of disruption
Anything consensus driven is something that will take much longer for AI to solve and even once it does it will take humans a long time to believe that “AI knows better.”

How tools like Windsurf and Cursor impact data engineering
Development speed has gone up dramatically. When developing pipelines with Windsurf, you mostly need to know the higher-level concepts and schemas, and Windsurf can fill in the details.
For example, imagine you had a user table that looks like:
CREATE TABLE users ( user_id BIGINT, country VARCHAR, date DATE)
And you want to track what country each user is in and when they were there. If you're an experienced data engineer, building a slowly changing dimension type 2 should be a bell that rings in your head! You also know that writing that SQL is a giant pain in the ass if you've ever done it!
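For reference, the output of a slowly changing dimension type 2 build on this data might look something like the sketch below. The start_date, end_date, and is_current column names are my own choice for illustration, not a standard:

CREATE TABLE users_scd (
    user_id BIGINT,
    country VARCHAR,
    start_date DATE,    -- first date this country value was observed
    end_date DATE,      -- last date this country value was observed
    is_current BOOLEAN  -- TRUE only for the row holding the latest value
)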
Instead of writing all that nasty SQL yourself, tell Windsurf this prompt:
Given this schema
CREATE TABLE users ( user_id BIGINT, country VARCHAR, date DATE)
create me an Airflow DAG using Trino that implements slowly changing dimension type 2 tracking on the country column. Make sure to include partition sensors, quality checks using write-audit-publish pattern and have the DAG be idempotent
If we examine this prompt, you’ll notice a few important things.
We start with the inputs. Always give the input schema first in the prompt.
Then we specify the orchestration technology (i.e. Airflow) and processing technology (i.e. Trino)
Then we specify the design pattern (i.e. slowly changing dimension type 2)
Then we specify any quality concerns and best practices
If you follow this pattern of inputs → technologies → design pattern → best practices, you’ll find that Windsurf does an impeccable job at giving you a DAG that just works!
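To give you a feel for what the heart of that generated DAG might contain, here is a minimal sketch of SCD2 logic in Trino-style SQL. It does a full-history rebuild from the daily snapshots rather than the incremental merge a production DAG would usually do, and the users_scd table and its column names are assumptions carried over from the sketch above:

-- Rebuild the full SCD2 history from the daily snapshots in users
-- ("date" is quoted because it is also a type keyword)
INSERT INTO users_scd
WITH changes AS (
    SELECT
        user_id,
        country,
        "date",
        -- flag the rows where country differs from the previous day's value
        CASE
            WHEN country = LAG(country) OVER (PARTITION BY user_id ORDER BY "date")
                THEN 0
            ELSE 1
        END AS did_change
    FROM users
),
streaks AS (
    SELECT
        *,
        -- a running sum of change flags gives each streak of identical values an id
        SUM(did_change) OVER (PARTITION BY user_id ORDER BY "date") AS streak_id
    FROM changes
)
SELECT
    user_id,
    country,
    MIN("date") AS start_date,
    MAX("date") AS end_date,
    MAX("date") = (SELECT MAX("date") FROM users) AS is_current
FROM streaks
GROUP BY user_id, country, streak_id

A real Windsurf-generated DAG would wrap a query like this in Airflow tasks, partition sensors, and write-audit-publish quality checks.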
So what does this mean for data engineers who used to pride themselves on writing these nasty SQL queries themselves?
LEARN THE CONCEPTS!
I am sure most of my readers know about schemas and technologies, so I'll be covering the design patterns that you might want to be thinking about.
Design patterns (in order of usefulness and probability of finding in the wild)
Traditional Dimensional and Fact Tables (i.e. Kimball Data Models)
Useful for generating the master data that is fed into OLAP cubes
Useful approach to modeling data for charting (a small star-schema sketch follows this list)
OLTP data modeling (i.e. third normal form data modeling)
Useful approach to modeling data in transactional systems
Slowly Changing Dimension Type 2
Tracking historical changes of a column on a dimension table
Useful when needing to do large scale analytics of many complex columns without shuffling. Also really powerful for NoSQL use cases!
Machine Learning feature store architecture
Useful when serving downstream machine learning models that need a centralized location for features
Kappa Architecture with Apache Flink
Extremely useful design pattern for anything processed in real-time
Microbatch / Hourly processing
Useful when deduplicating data in a low-latency way
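To make the first pattern above concrete, here is a tiny star-schema sketch in plain SQL. All table and column names here are made up for illustration:

-- A dimension table: one row per user, holding descriptive attributes
CREATE TABLE dim_users (
    user_id BIGINT,
    country VARCHAR,
    signup_date DATE
);

-- A fact table: one row per order, with foreign keys and measures
CREATE TABLE fct_orders (
    order_id BIGINT,
    user_id BIGINT,              -- joins to dim_users
    order_total DECIMAL(10, 2),
    order_date DATE
);

-- A typical rollup that could feed a chart or an OLAP cube
SELECT
    d.country,
    f.order_date,
    SUM(f.order_total) AS total_revenue
FROM fct_orders f
JOIN dim_users d ON f.user_id = d.user_id
GROUP BY d.country, f.order_date;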
But what about the best practices?
There’s really only a small set of best practices you need to consider:
Data modeling best practices
Proper naming of tables
Clear separation of dimension, fact, and aggregation tables
Proper naming of columns
Avoid overly long or unclear column names.
Partitioning
Properly partition your data sets. Almost every data set in your data lake should be partitioned (an example follows below).
Personally identifiable information (i.e. PII) management
Minimize the usage of PII and anonymize when possible
Don’t break any company privacy policies when building the pipeline
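Here is what partitioning looks like using Trino's Hive connector syntax, for example. The events table and the ds partition column are assumptions; note that the Hive connector requires partition columns to come last:

CREATE TABLE events (
    event_id BIGINT,
    user_id BIGINT,
    country VARCHAR,
    event_type VARCHAR,
    ds VARCHAR  -- the partition column, typically a date string like '2025-05-26'
)
WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['ds']
)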
Data Lake best practices
Compression
Make sure your data leverages run-length encoding compression as much as it can (an example follows below)
Store your data as Parquet files
Retention
Don’t hold onto too much or too little data. What determines how much is too much or too little is a balance of analytics capabilities, cloud costs, and legal/privacy constraints.
Data Duplication
Don’t recreate data sets that already exist. Even the big tech companies get this wrong many many times!
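On the compression point, run-length encoding pays off the most when rows are sorted on low-cardinality columns before being written out, because Parquet can then collapse long runs of repeated values. A minimal Spark SQL sketch, with made-up table and column names:

-- Write a Parquet table with rows sorted on low-cardinality columns
-- so run-length encoding can collapse long runs of repeated values
CREATE TABLE events_compressed
USING PARQUET
AS
SELECT *
FROM events
-- SORT BY sorts within each output task rather than globally
SORT BY country, event_type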
DAG best practices
A task executes properly and produces the same results whether it's backfilled or running in production
This is solved by setting up the right partition sensors and being mindful of whether your DAG is a cumulative DAG!
A task executes properly regardless of how many times it's run
This is mostly solved by using MERGE or INSERT OVERWRITE instead of INSERT INTO!
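For example, in Spark SQL an idempotent daily load overwrites its target partition instead of appending, so backfills and reruns land on the same result. Table and column names below are assumptions:

-- Rerunning this for the same ds always yields the same partition contents
INSERT OVERWRITE TABLE daily_user_metrics PARTITION (ds = '2025-05-26')
SELECT
    user_id,
    COUNT(*) AS event_count
FROM events
WHERE ds = '2025-05-26'
GROUP BY user_id

-- By contrast, INSERT INTO would append duplicate rows on every rerun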
Data serving / dashboarding best practices
Latency
Serve your dashboard data from a low-latency store like Druid or Snowflake
Pre-aggregate data before loading into low-latency store
Storage
Don’t put non-aggregated data into your low-latency stores
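As a sketch of that pre-aggregation step, with made-up table and metric names:

-- Pre-aggregate to one row per country per day before loading into the
-- low-latency store, instead of shipping raw event-level rows to it
INSERT OVERWRITE TABLE dashboard_country_metrics PARTITION (ds = '2025-05-26')
SELECT
    country,
    COUNT(DISTINCT user_id) AS daily_active_users,
    COUNT(*) AS total_events
FROM events
WHERE ds = '2025-05-26'
GROUP BY country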
I’m sure I’m missing some more best practices. Please comment other best practices you would like to include in this list!
We teach all of these best practices and patterns in the DataExpert.io Academy, both as self-paced content and in cohorts!
Our next cohort starts May 26th and will cover Databricks, AI, Iceberg, and Delta Live Tables! The first 5 people to use code AIREADY at DataExpert.io can get 30% off either the subscription product or the live boot camp!