Data engineering is changing very quickly in 2025! Decisions are no longer made only from analytical dashboards or experiments; now we have AI agents making decisions, and making them extremely fast!
AI agents live more in the unstructured world of binaries, videos, and blobs of text. This type of data makes the average data engineer look like a deer in headlights.
We will be using this open source code base as the basis for the rest of this article!
In this article we will talk about:
How to transcribe video into text transcripts
How to handle situations when you have LARGE videos
How to use metadata filtering to increase the relevance of your vector semantic search
How to build synchronous vs asynchronous RAG systems
Text transcriptions of videos
Video files are too rich in information for LLMs to consume directly, so we need to convert them to "language" first. If the video is short (under a minute or so), it is very easy to get the transcript with OpenAI:
import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

# The transcription endpoint accepts common audio/video formats like mp4 directly
with open("videos/five_transformations.mp4", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language='en'
    )

print(transcription.text)
You'll quickly realize that if you try to do this for larger video files, you will suffer: the API rejects uploads over roughly 25 MB. You'll need to use something like ffmpeg to split the video into manageable chunks. WARNING: this naive approach may cut words off at the beginning or end of each cut point, leading to a less accurate transcript!
import math
import os
import subprocess

VIDEO_FILE = "videos/data_contracts.mp4"
CHUNK_LENGTH_SECONDS = 60
VIDEO_LENGTH_SECONDS = 3600
TEMP_DIR = "temp_chunks"  # directory for the split files

os.makedirs(TEMP_DIR, exist_ok=True)

# Round up so a trailing partial chunk is not dropped
num_chunks = math.ceil(VIDEO_LENGTH_SECONDS / CHUNK_LENGTH_SECONDS)

chunk_file_paths = []
for i in range(num_chunks):
    start_time = i * CHUNK_LENGTH_SECONDS
    output_file = os.path.join(TEMP_DIR, f"chunk_{i}.mp4")
    chunk_file_paths.append(output_file)

    # ffmpeg command to extract a portion of the video
    cmd = [
        "ffmpeg",
        "-y",                             # overwrite output if it exists
        "-i", VIDEO_FILE,
        "-ss", str(start_time),           # where this chunk starts
        "-t", str(CHUNK_LENGTH_SECONDS),  # how long this chunk is
        "-c", "copy",                     # copy codec to avoid re-encoding (faster split)
        output_file
    ]
    subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)
    print(f"Created chunk: {output_file}")
Once you have split the video into these output files, you can transcribe each smaller chunk and "put it all back together." Remember, the arbitrary one-minute chunking of the video might not be the best logical or semantic chunking for your vectors! It's better to chunk by paragraph, or at least to chunk with some overlap.
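Here's a minimal sketch of that re-chunking step (assuming chunk_transcripts is the list of per-chunk transcript strings you got from transcribing each ffmpeg chunk; the sizes are just illustrative):

full_transcript = " ".join(chunk_transcripts)

def chunk_with_overlap(text, chunk_size=1000, overlap=200):
    # Slide a window across the text so words cut at one chunk's boundary
    # still show up intact in the neighboring chunk.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

text_chunks = chunk_with_overlap(full_transcript)

If your transcripts carry punctuation, splitting on sentence or paragraph boundaries before applying the overlap usually gives even better chunks.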
Metadata filtering
Semantic search is an imprecise science! It’s important to first cut down on the number of potential vector candidates before doing a vector match.
What are some ways we can make our vector search more relevant?
Metadata based on the source
Did it come from GitHub? LinkedIn? Workday? etc, etc
Metadata based on the type of content
Is it a meme? Is it a "how-to" guide? Is it an onboarding document?
Metadata based on the theme of the content
Is it a data engineering related document? A machine learning one? etc, etc
For some of these, you can use an additional LLM preprocessing step to determine which categories a chunk or document belongs to:
import json
import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

CATEGORIES = {
    'data modeling', 'airflow',
    'lessons', 'office hours',
    'spark', 'dbt', 'snowflake',
    'sql', 'python',
    'databricks', 'refund process', 'assignments',
    'capstone', 'certification', 'guest speakers'
}

def get_categories_from_chunk(text):
    combined_categories = ','.join(CATEGORIES)
    system_content = f'''You are a data engineer looking to categorize content based on what you see in the text chunk.
    The only valid categories are: {combined_categories} return only the relevant categories as a JSON array.
    Do not include any markdown!'''
    user_prompt = 'Find all the relevant categories for this chunk. Make sure the rarest category is first ' + text

    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[
            {"role": "system", "content": system_content},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.1
    )

    # The model is instructed to return a bare JSON array, e.g. ["spark", "sql"]
    answer = response.choices[0].message.content
    return json.loads(answer)
These categories should be loaded into your vector database's metadata fields so you can pre-filter on them before doing a vector match!
Once you have the correct categorization of the documents, you need to categorize the incoming prompts too for more relevant search hits!
WARNING: Segmenting like this will add latency into your RAG system because of the extra LLM call!
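Here's a minimal sketch of the whole flow using ChromaDB (an assumption; any vector database with metadata filtering works the same way, and the collection and field names are made up for illustration):

import chromadb

chroma = chromadb.Client()
collection = chroma.get_or_create_collection("course_content")

# Indexing time: store the rarest category from get_categories_from_chunk()
# alongside each chunk (Chroma metadata values must be scalars).
collection.add(
    ids=["chunk_42"],
    documents=["...transcript chunk text..."],
    metadatas=[{"category": "data modeling"}],
)

# Query time: categorize the incoming prompt with the same LLM step,
# then restrict the vector match to chunks tagged with those categories.
prompt = "How should I partition my Spark jobs?"
prompt_categories = get_categories_from_chunk(prompt)  # e.g. ["spark", "python"]

results = collection.query(
    query_texts=[prompt],
    n_results=5,
    where={"category": {"$in": prompt_categories}},
)

Because the metadata values here are scalars, this sketch keeps only the rarest category per chunk; storing one row per category (or a delimited string) is a common workaround if you need multiple tags.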
Synchronous vs Asynchronous RAG systems
Chatbots are not the only way we interface with AI! Sometimes AI can go off and do its own thing like sending emails, making an API request, etc, etc.
For the non-chatbot use cases, asynchronous RAG is often a better approach than synchronous RAG because it lets your system acknowledge the request right away and finish slow AI tasks in the background. The rule of thumb here is: if the task takes longer than 30 seconds, asynchronous is probably what you want!
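As a rough illustration (a generic sketch, not the article's codebase; a real system would use a proper task queue like Celery or SQS rather than an in-memory one), the asynchronous pattern looks like this: accept the request, hand the slow RAG work to a background worker, and hand back a job id immediately:

import queue
import threading
import uuid

jobs = {}                  # job_id -> result (None while still running)
task_queue = queue.Queue()

def run_rag_pipeline(prompt: str) -> str:
    # Placeholder for the slow part: retrieval + LLM call + action (email, API request, ...)
    return f"answer for: {prompt}"

def worker():
    while True:
        job_id, prompt = task_queue.get()
        jobs[job_id] = run_rag_pipeline(prompt)
        task_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

def submit(prompt: str) -> str:
    # Return immediately with a job id; the answer shows up in `jobs` later
    job_id = str(uuid.uuid4())
    jobs[job_id] = None
    task_queue.put((job_id, prompt))
    return job_id

The caller polls for the result (or gets a callback/webhook) instead of holding a connection open while the task runs.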


If you want to learn more about RAG systems and data engineering best practices, the first 5 people can use RAG at checkout at DataExpert.io for 30% off!
What else should we be considering in this unstructured world? Please comment below. Make sure to share this with your friends!