GenAI has caught fire over the last two years. Some of us saw ChatGPT in 2022 and thought, this is a fad like NFTs.
Now that we are two years deep, most of us have concluded that this is not a fad at all but a fundamental shift in how we do work, one that will keep moving us toward more speed and more agility. Most of these changes should be celebrated, not feared!
In this newsletter we’ll talk about:
How GenAI impacts the development of pipelines
Boilerplate generation is a huge win here that allows us to focus on the business details that matter! I cover a lot of this in more detail here.
Implementing many quality checks and write-audit-publish can be done with a single prompt nowadays
How GenAI impacts the maintenance of pipelines
Troubleshooting false positives and out-of-memory errors will soon be a thing of the past!
How GenAI will move data engineers in two directions
Towards the business and analytics. I call this the merging of the data analyst and data engineer.
Towards the server and online application. I call this the merging of the data engineer and the software engineer.
What should we be doing today as data engineers to be ready for the future
How GenAI impacts the development of pipelines
If you haven’t been using GenAI at all in your pipeline workflow, whether that be with Copilot, ChatGPT, or other LLMs, you should be concerned. Think of it like being an engineer in 2015 who doesn’t use Google Search or Stack Overflow. You’re making your workflows unnecessarily complex and annoying.
After building a million-dollar business with GenAI over the last year, I’ve come to a conclusion about the “sweet spot” for generating code:
If you’re writing more than five lines of code but less than two hundred, GenAI is a very good option
Less than five, you should probably just write the code yourself
More than two hundred, and GenAI will introduce enough bugs that the development vs. debugging tradeoff stops being worth it
LLMs are really good at brainstorming because the output doesn’t have to be “correct”
Use it for inspiration in the following areas:
Data modeling
Give it your requirements and a small amount of business context, ask it for what schemas should look like
Example prompt:
I’m building a data pipeline that measures user growth. We care about measuring retention, churn, and new user growth. We’re a small data education startup with a website and an app. What should our data model look like?
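To make this concrete, here is a sketch of the kind of table an LLM might come back with for that prompt, written as PySpark DDL. The table name, columns, and activity states are illustrative assumptions, not a prescription:

```python
# Illustrative sketch only: one shape an LLM might propose for measuring
# retention, churn, and new user growth. All names here are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("user_growth_model").getOrCreate()

# A daily "growth accounting" snapshot: retention, churn, and new user
# growth all fall out of transitions between activity states.
spark.sql("""
    CREATE TABLE IF NOT EXISTS user_growth_accounting (
        user_id           STRING,
        first_active_date DATE,
        last_active_date  DATE,
        activity_state    STRING,  -- 'new', 'retained', 'churned', or 'resurrected'
        snapshot_date     DATE
    )
    USING PARQUET
    PARTITIONED BY (snapshot_date)
""")
```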
Data quality checks
Give it your pipeline code and a small amount of business context, ask it for what quality checks it would add. You’d be surprised how good these suggestions are!
Example prompt:
Here’s my pipeline code: <pipeline code>. We want to make sure we have comprehensive data quality checks on these columns <columns>. What SQL checks would you add to a write-audit-publish pattern to accomplish this?
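Here is a minimal sketch of what that write-audit-publish pattern can look like in PySpark. The table names, staging layout, and specific checks are assumptions for illustration:

```python
# A minimal write-audit-publish sketch; table and view names are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wap_pipeline").getOrCreate()

# WRITE: land the new batch in a staging table, never directly in production.
# "transformed_batch" is a placeholder for a temp view you have registered.
spark.sql("INSERT OVERWRITE TABLE staging.user_metrics SELECT * FROM transformed_batch")

# AUDIT: each query returns a violation count; anything above zero fails.
checks = {
    "null_user_ids": "SELECT COUNT(*) FROM staging.user_metrics WHERE user_id IS NULL",
    "duplicate_rows": """
        SELECT COUNT(*) FROM (
            SELECT user_id, snapshot_date
            FROM staging.user_metrics
            GROUP BY user_id, snapshot_date
            HAVING COUNT(*) > 1
        ) AS dups
    """,
    "empty_table": "SELECT CASE WHEN COUNT(*) = 0 THEN 1 ELSE 0 END FROM staging.user_metrics",
}

failed = [name for name, sql in checks.items() if spark.sql(sql).first()[0] > 0]
if failed:
    raise ValueError(f"Audit failed, refusing to publish: {failed}")

# PUBLISH: promote to production only after every audit passes.
spark.sql("INSERT OVERWRITE TABLE prod.user_metrics SELECT * FROM staging.user_metrics")
```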
Documentation
Give it your pipeline code and schemas with a small amount of business context. Ask it for a documentation boilerplate. You’d be surprised how far this gets you. You could try out projects like Mintlify to do this for you automatically as well!
Example prompt:
Here’s my pipeline code: <pipeline code>. Here are all the schemas we need documentation for: <schemas>. This pipeline measures retention, churn, and new user growth. Can you write documentation for the pipeline and schemas with detailed column descriptions for me?
How GenAI impacts the maintenance of pipelines
Data engineers are burning out. If you Google “data engineer burnout,” the numbers aren’t pretty. A 2021 Wakefield Research survey found that 97% of data engineers reported experiencing burnout in their day-to-day jobs.
During my time at Netflix and Airbnb, the number one cause of burnout for me was pipeline maintenance. Needing to troubleshoot a critical pipeline failure at 2 AM killed my morale to keep going. It also made me more hesitant to write more pipelines because that would just mean a higher probability of being woken up at 2 AM.
One of the big patterns I noticed during these 2 AM pain sessions was that 80% of the failures were caused by two issues:
false positive data quality checks
week-over-week row count checks need an arbitrary “cutoff” value, something like 15% or 20% more than normal being bad. And I’d wake up at 2 AM to see the count was 15.1% off.
The problem with these types of checks is that they need a cutoff that is “sensitive” enough to capture true quality errors but “not sensitive” enough to avoid false positives. These errors are notorious when the traffic pattern of your datasets genuinely changes (think Christmas and New Year’s).
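Concretely, the kind of check that causes these pages looks something like the sketch below. The threshold, dates, and table name are assumptions:

```python
# A sketch of the classic hard-cutoff week-over-week row count check.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def row_count(start_ds: str, end_ds: str) -> int:
    # prod.events and the ds partition column are assumptions.
    return spark.sql(
        f"SELECT COUNT(*) FROM prod.events WHERE ds BETWEEN '{start_ds}' AND '{end_ds}'"
    ).first()[0]

this_week = row_count("2024-09-23", "2024-09-29")
last_week = row_count("2024-09-16", "2024-09-22")
pct_change = abs(this_week - last_week) / last_week * 100

# The pain described above: 15.1% pages you at 2 AM, 14.9% doesn't, and a
# genuine holiday traffic shift blows past the cutoff with perfectly good data.
THRESHOLD_PCT = 15.0
if pct_change > THRESHOLD_PCT:
    raise ValueError(f"Row count moved {pct_change:.1f}% week-over-week")
```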
Over time, big tech switched from hard thresholds to ML algorithms that understand the “seasonality” of the datasets. These types of data quality checks are still sophisticated enough that most companies haven’t implemented them.
In the near future, we’ll have a quality check fire and then an LLM determine whether the failure is a false positive based on the previous data. This should minimize how often we are actually woken up at 2 AM to troubleshoot issues. This is one of the most exciting parts of LLMs and their impact on data engineering. Long live data engineers’ mental health!
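A hedged sketch of what that triage step could look like with the OpenAI Python client. The prompt, the one-word answer protocol, and the alerting hook are assumptions about a system that mostly does not exist yet:

```python
# Hypothetical sketch: ask an LLM to triage a failed check before paging anyone.
from openai import OpenAI

client = OpenAI()

def triage_check_failure(check_name: str, pct_change: float, history: list[float]) -> str:
    prompt = (
        f"The data quality check '{check_name}' failed with a {pct_change:.1f}% "
        f"week-over-week row count change. The changes for the previous eight "
        f"weeks were {history}, and today is December 26th. Considering seasonality, "
        "answer with exactly one word: REAL if this looks like a genuine data "
        "problem, or FALSE_POSITIVE if it looks like expected variation."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

verdict = triage_check_failure(
    "weekly_row_count", 15.1, [2.0, 3.1, 1.4, 14.8, 2.2, 0.9, 3.3, 12.7]
)
if verdict == "REAL":
    print("Paging on-call...")  # wire this to your real alerting system
```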
out-of-memory errors that will soon be extremely easy to troubleshoot
The number one type of failure I noticed on call in big tech was out-of-memory errors with Spark. The solution was usually one of two things:
bumping up the executor and/or driver memory. This would unblock 80% of pipelines. You do this by increasing spark.executor.memory and/or spark.driver.memory settings.
enabling adaptive query execution (set spark.sql.adaptive.enabled to true). Adaptive execution is great when the dataset is skewed.
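Both fixes come down to a few lines of session configuration in PySpark. The memory sizes below are illustrative, not recommendations:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("oom_prone_pipeline")
    # Fix 1: bump executor and/or driver memory.
    .config("spark.executor.memory", "8g")
    .config("spark.driver.memory", "4g")
    # Fix 2: let adaptive query execution re-plan around skew at runtime.
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)
```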
In the near future, an LLM will look at failures like this, diagnose them, and restart the job with adaptive execution or more memory automatically! That is another huge win that will drop pipeline maintenance costs in data engineering by an order of magnitude! Therapists should be concerned: data engineers getting enough sleep will hurt their business!
How GenAI will move data engineers in two directions
The GenAI promise for engineers is that they will get more done in less time. This sadly means that companies can get the same amount of work done with fewer engineers. “AI-enabled” engineers will crush the ones who refuse to learn these tools.
This means that engineers will be able to cover a larger area and have more breadth. This breadth will happen in two ways:
Data mesh architecture and GenAI will enable software engineers who own online systems to own the data their systems produce. If you’re a data engineer, you may consider learning how to build online systems and REST APIs. Being able to own the system “end-to-end” from online system to logging to pipeline to data sets to metrics will be a huge thing in the future.
Another flavor of data engineer will want to own more of the “business side” and start picking up more product-manager-like behaviors. Think of this as the merging of the data engineer and the data analyst. Pipeline work won’t consume as much of a data engineer’s time in the future, so they’ll be able to pick up work in experimentation, visualization, and predictive modeling.
This is why I believe the analytics engineer will be a more stable job title in the future than data engineer. And why I’m launching an analytics engineering boot camp on October 21st! You can join with code GENAI to get 25% off here!
What should we be doing today as data engineers to be ready for the future
Data engineers should mostly be celebrating these changes. They will make the painful parts of the job, like being on call, much easier.
In summary data engineers should be:
Using LLMs on a daily basis in their workflows. Googling was the hottest skill of the 2010s. “ChatGPTing” is the hottest skill of the 2020s.
Looking for places in their workflows where they can leverage that intelligence to get more done in less time, in areas like:
documentation, data modeling, data quality checks, etc.
Staying up to date on what’s happening in AI. This is the most important technological advancement we’ve seen since the internet. Treat it like that!
Looking into AI infrastructure like vector databases, RAG, fine-tuned models, auto prompt optimization, and LLM evaluation benchmarks.
Some technologies I recommend here are:
Vector databases
Fine-tuned models
RAG (retrieval augmented generation)
RAG is all about pulling in the most relevant data. You might end up querying vector databases, relational databases, or graph databases to get the right information here.
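Here is a minimal sketch of that retrieval loop, with a small in-memory list standing in for a real vector database. The embedding model name, documents, and metric definitions are assumptions:

```python
# Minimal RAG sketch: embed the question, retrieve the closest documents,
# and stuff them into the prompt as context.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

# Made-up internal docs; a real system would store these in a vector database.
docs = [
    "Churn is defined as a user with no activity for 30 consecutive days.",
    "Retention is measured as 7-day rolling weekly active users.",
]
doc_vectors = [embed(d) for d in docs]

def retrieve(question: str, k: int = 1) -> list[str]:
    q = embed(question)
    # Cosine similarity against every stored vector (a vector DB does this at scale).
    sims = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))) for v in doc_vectors]
    top = sorted(range(len(docs)), key=lambda i: sims[i], reverse=True)[:k]
    return [docs[i] for i in top]

question = "How do we define churn?"
context = "\n".join(retrieve(question))
answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}],
)
print(answer.choices[0].message.content)
```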
Auto prompt optimization
Prompt engineering is going to be one of the shortest-lived professions. Leveraging libraries like AdalFlow will save you all the time-consuming guesswork that is involved in prompt optimization right now.
LLM evaluation benchmarks
LLM tasks often have a single correct answer, so you can measure what percentage of the time a model produces it. Hugging Face has a huge list of LLM benchmarks you can check out here. Data engineers and other data professionals will be creating more of these in the future. These benchmarks are critical in establishing LLMOps.
Benchmarks come in a few different types, such as knowledge recall (e.g., MMLU), code generation (e.g., HumanEval), and mathematical reasoning (e.g., GSM8K). A toy scoring harness is sketched below:
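As a toy example, a benchmark can be as small as a list of question-answer pairs and an exact-match score. The questions and model wrapper here are made up for illustration:

```python
# A toy LLM evaluation benchmark scored by exact match.
from openai import OpenAI

client = OpenAI()

def ask_llm(question: str) -> str:
    # Hypothetical wrapper around whatever model you are evaluating.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question + " Answer with just the number."}],
    )
    return resp.choices[0].message.content.strip()

benchmark = [
    ("What is 17 * 23?", "391"),
    ("How many days are in a leap year?", "366"),
]

correct = sum(1 for question, expected in benchmark if ask_llm(question) == expected)
print(f"Accuracy: {correct}/{len(benchmark)} = {correct / len(benchmark):.0%}")
```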
How else do you think GenAI will impact data engineers in the future? Make sure to share this article with your friends who are interested in AI and data engineering!