There are a lot of libraries and frameworks coming out in the LLM space, and it is overwhelming for developers to keep track of them and pick the most fitting ones for their LLM projects.
In this article, we’ll deep dive into an entire production LLM workflow, highlighting and critiquing these technologies:
Unsloth.ai (for fine tuning)
AdalFlow (for pre-production and optimizations)
vLLM (for model serving)
Production LLM Workflow
Step 1: Fine tuning
Hugging Face is the go-to starting point for fine-tuning on proprietary data. We chose Unsloth.ai for its optimized performance with larger models, despite it being a newer and slightly riskier option compared to alternatives like FastAI.
Step 2: Optimizations and pre-production
Although LangChain and LlamaIndex are popular for proofs of concept, we found a more lightweight framework - AdalFlow. It effectively combines pre-production work, optimization, and in-context learning, with superior debugging capabilities compared to tools like DsPy and Text-grad, which aren’t built on production-grade architecture.
Step 3: Model serving
For serving, we landed on vLLM, which gets dramatically more throughput out of the same GPUs through smarter memory management (details below).
Investigating Unsloth.ai for model fine-tuning
LLMs require a lot of GPUs and a lot of time. It can take months to train or even fine-tune certain models, making this a huge accessibility problem.
Unsloth is innovative in how it approaches model training. They’ve done the following things to minimize memory usage:
Unsloth doesn’t rely on the default autograd implementation for gradient computation; instead, it derives the gradients for the attention and LoRA adapter layers manually.
They’ve also rewritten kernels in OpenAI’s Triton language.
They’ve implemented other memory and matrix-multiplication optimizations.
All these optimizations make it possible to fine-tune LLMs up to 30 times faster while cutting memory usage by up to 90%! This is done without any extra hardware, and the techniques are accessible no matter what kind of GPU setup you have.
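To make this concrete, here is a minimal sketch of a LoRA fine-tune with Unsloth, assuming its FastLanguageModel API and delegating the training loop to TRL's SFTTrainer, as Unsloth's own notebooks do. The model name, dataset path, and hyperparameters are illustrative, and TRL versions differ on exactly where arguments like dataset_text_field go.

```python
# Minimal LoRA fine-tuning sketch with Unsloth.
# Model name, dataset, and hyperparameters are illustrative placeholders.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load a 4-bit quantized base model with Unsloth's patched kernels.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; only these small matrices are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Expects a JSONL file with a "text" field (placeholder path).
dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        max_steps=100,
        output_dir="outputs",
    ),
)
trainer.train()
```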
Unsloth feature overview:
- Speed, speed, and speed
Fine-tuning an LLM up to 30 times faster than traditional methods means you can train models in days that previously took months, without sacrificing accuracy!
A quick comparison graph is available on their website.
GPU support
Unsloth AI offers broad GPU support across NVIDIA (from T4 to H100), AMD, and Intel hardware, making it accessible whether you're using cloud infrastructure or local machines. The platform's 90% memory reduction enables flexible experimentation with different models and batch sizes, making it easier to find the optimal model configuration for specific use cases without requiring expensive hardware upgrades.
Flash Attention Implementation
To further enhance training efficiency, Unsloth AI incorporates Flash Attention through the use of xFormers and Tri Dao's implementation. Flash Attention is an optimized attention mechanism for transformer models that significantly reduces computational overhead and memory usage during training. By leveraging this advanced technique, Unsloth AI accelerates the training process and improves scalability, allowing for more efficient handling of large datasets and complex models.
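To illustrate the kind of kernel involved (this is not Unsloth's internal code), here is a small example of xFormers' memory-efficient attention, which implements the Flash Attention idea of computing the attention output in tiles without ever materializing the full attention matrix. The shapes and dtypes are arbitrary, and a CUDA GPU is assumed.

```python
# Illustration of memory-efficient (Flash-Attention-style) attention via xFormers.
# Shapes are arbitrary; a CUDA GPU is assumed.
import torch
import xformers.ops as xops

batch, seq_len, n_heads, head_dim = 2, 4096, 16, 64

# xFormers expects (batch, seq_len, n_heads, head_dim) tensors.
q = torch.randn(batch, seq_len, n_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Computes softmax(q @ k^T / sqrt(head_dim)) @ v in tiles,
# never materializing the full (seq_len x seq_len) attention matrix.
out = xops.memory_efficient_attention(q, k, v)
print(out.shape)  # torch.Size([2, 4096, 16, 64])
```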
vLLM for model serving
It’s well known that LLM training requires tons of GPUs; what is less well known is that LLMs require a lot of resources for serving too. For a model with “only” 13B parameters, a single A100 GPU can serve only around 1 request per second.
To manage the GPU memory used by the attention key-value (KV) cache more effectively, vLLM built PagedAttention, which borrows the idea of paging from operating-system virtual memory.
Most transformer serving implementations store each request's KV cache in one contiguous block of memory, which wastes space through fragmentation and over-reservation. By paging the cache into small blocks and efficiently sharing memory between different requests, significant space savings can be achieved, as demonstrated in their research paper.
vLLM also uses Automatic Prefix Caching (APC). APC keeps the KV cache of previous queries around, so when a new query shares a prefix with an earlier one, it can skip recomputing the shared part, making the computation more efficient.
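Here is a quick offline-inference sketch using vLLM's Python API with Automatic Prefix Caching turned on. The model name is illustrative; because the two prompts share the same long prefix, the KV cache for that prefix is computed once and reused.

```python
# Offline inference with vLLM, with Automatic Prefix Caching enabled.
# The model name is illustrative; any supported Hugging Face model works.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

shared_prefix = "You are a support agent. Here is our 2,000-word policy document: ..."
prompts = [
    shared_prefix + "\n\nQuestion: How do refunds work?",
    shared_prefix + "\n\nQuestion: What is the shipping time?",
]

# The KV cache for the shared prefix is computed once and reused across requests.
outputs = llm.generate(prompts, SamplingParams(temperature=0.2, max_tokens=128))
for out in outputs:
    print(out.outputs[0].text)
```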
Useful vLLM features:
High Throughput
vLLM achieves up to 24x higher throughput than Hugging Face Transformers through minimal KV-cache memory waste (around 4% vs. the traditional 60-80%), continuous request batching, and speculative decoding, which speeds up token generation by having a small draft model propose tokens that the main model then verifies in parallel.
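As a rough sketch of how speculative decoding is switched on, a small draft model proposes several tokens per step and the main model verifies them in one pass. Treat the argument names as assumptions: they follow older vLLM releases, and newer versions have moved them into a dedicated speculative config; the model names are illustrative.

```python
# Speculative decoding sketch: a small draft model proposes tokens,
# the main model verifies them in a single forward pass.
# Argument names follow older vLLM releases and may differ in newer versions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",      # main (verifier) model
    speculative_model="meta-llama/Llama-3.2-1B",   # small draft model
    num_speculative_tokens=5,                      # tokens proposed per step
)

out = llm.generate(["Write a haiku about GPUs."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```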
Quantization
vLLM also supports various model optimization techniques, including INT4/INT8/FP8 quantization and serving LoRA adapters, to reduce model size and speed up inference.
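A hedged sketch of serving a pre-quantized checkpoint with a LoRA adapter attached at request time: the model and adapter paths are placeholders, and the quantization value must match how the checkpoint was produced (e.g. "awq" or "gptq").

```python
# Serving a quantized checkpoint plus a LoRA adapter with vLLM.
# Model and adapter paths are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="TheBloke/Llama-2-13B-chat-AWQ",  # a pre-quantized (AWQ) checkpoint
    quantization="awq",
    enable_lora=True,
)

outputs = llm.generate(
    ["Summarize our Q3 sales report."],
    SamplingParams(max_tokens=128),
    lora_request=LoRARequest("sales-adapter", 1, "/path/to/lora_adapter"),
)
print(outputs[0].outputs[0].text)
```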
Deployment flexibility
vLLM offers comprehensive deployment options with both online and offline inference, OpenAI-compatible APIs, and support for diverse hardware including Nvidia, AMD, Intel, and AWS Neuron accelerators.
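For online serving, vLLM ships an OpenAI-compatible server, so existing OpenAI client code only needs its base URL changed. A sketch (the launch command is shown in a comment, and the model name is illustrative):

```python
# Query a vLLM OpenAI-compatible server with the standard OpenAI client.
# Start the server first, e.g.:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key is unused locally

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```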
AdalFlow for pre-production and in-context learning optimizations
The rise of LLMs created a community split: researchers produced hard-to-productionize code with zero abstraction for benchmarking, while engineers relied on heavily abstracted frameworks like LangChain and LlamaIndex that proved difficult to customize and felt like black boxes.
AdalFlow takes inspiration from PyTorch to enable engineers and researchers to work on what matters without having to build a framework from scratch while offering API flexibility and observability.
AdalFlow can make your prompts better via auto-prompt optimization!
Just like neural networks, LLM applications can be represented as Directed Acyclic Graphs (DAGs) of traceable primitive operations and optimized with an analogue of automatic differentiation, where textual feedback from an LLM plays the role of gradients, a concept first introduced by Text-grad.[4]
AdalFlow builds on Text-grad's foundation and enhances performance through meta-prompt optimization, token-efficient prompting, and few-shot learning support within a unified architecture, ultimately achieving higher accuracy than Text-grad, DsPy, and similar libraries.
So imagine all of this for your LLM tasks, but with great debugging, a lightweight design, just the right balance of abstraction, and the ability to use it in zero-shot and few-shot learning cases. This is what AdalFlow offers right now! Here’s a feel for what the auto-prompt optimization loop looks like:
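As a rough, framework-agnostic sketch (this is not AdalFlow's actual API; the call_llm helper below is a hypothetical stand-in for any chat-completion call), an optimizer model critiques the current prompt against failing examples and then rewrites it:

```python
# Hypothetical sketch of a text-gradient style prompt-optimization loop.
# `call_llm` is a stand-in for any chat-completion call (OpenAI, Anthropic, etc.).
from typing import Callable, List, Tuple

def optimize_prompt(
    call_llm: Callable[[str], str],
    prompt: str,
    failures: List[Tuple[str, str, str]],  # (input, model_output, expected)
    steps: int = 3,
) -> str:
    for _ in range(steps):
        # "Backward pass": ask an optimizer LLM to critique the prompt (the textual gradient).
        critique = call_llm(
            "You are a prompt engineer. Critique this system prompt given the failures.\n"
            f"Prompt:\n{prompt}\n\nFailures (input / got / expected):\n{failures}"
        )
        # "Update step": apply the critique to produce a revised prompt.
        prompt = call_llm(
            "Rewrite the system prompt to address the critique. Return only the new prompt.\n"
            f"Prompt:\n{prompt}\n\nCritique:\n{critique}"
        )
    return prompt
```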
Useful Adalflow features:
Model agnostic and token efficient
Adalflow eliminates vendor lock-in by supporting all major LLM providers (Mistral, OpenAI, Groq, Anthropic, Cohere) while optimizing token usage to minimize API costs and maximize prompt performance.
Unified optimization framework
Adalflow's powerful auto-optimization system enables comprehensive prompt optimization (both instructions and few-shot examples) through a simple Parameter-to-Generator pipeline, advancing research from DsPy, Text-grad, and OPRO, while maintaining full debugging, visualization, and training capabilities within a unified framework.
Modular code architecture (debugging and training)
Core Architecture
Adalflow's architecture is built on two fundamental classes - DataClass for LLM interactions and Component for pipeline management - which together provide standardized interfaces, unified visualization, automatic tracking, and comprehensive state management capabilities.
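As an illustrative sketch, a DataClass describes the structured output you want from an LLM so it can be rendered into a prompt and parsed back from the response. The field names here are made up, and the exact helper methods come from my reading of AdalFlow's docs, so treat them as assumptions.

```python
# Illustrative sketch of an AdalFlow DataClass for a structured LLM output.
# Field names are made up; exact method names are assumptions.
from dataclasses import dataclass, field
from adalflow.core import DataClass

@dataclass
class QAOutput(DataClass):
    answer: str = field(metadata={"desc": "The final answer to the user's question"})
    confidence: float = field(metadata={"desc": "Confidence score between 0 and 1"})

# The class carries its own field descriptions, so the same definition can drive
# both the output-format instructions in the prompt and the parsing of the reply.
example = QAOutput(answer="42", confidence=0.9)
print(example.to_dict())
```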
Standardized Interfaces
The Component class ensures consistency across all components through standardized methods for synchronous calls (call), asynchronous calls (acall), and initialization.
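A hedged sketch of a small pipeline component, assuming AdalFlow's Generator and Component API roughly as documented; the model choice and template are illustrative, and API details may vary by version.

```python
# Sketch of a minimal AdalFlow pipeline component with sync and async entry points.
# Model choice and template are illustrative; API details may vary by version.
from adalflow.core import Component, Generator
from adalflow.components.model_client import OpenAIClient

class QAComponent(Component):
    def __init__(self):
        super().__init__()
        self.generator = Generator(
            model_client=OpenAIClient(),
            model_kwargs={"model": "gpt-4o-mini"},
            template=r"<SYS>Answer concisely.</SYS> User: {{input_str}}",
        )

    def call(self, query: str):          # standardized synchronous entry point
        return self.generator.call(prompt_kwargs={"input_str": query})

    async def acall(self, query: str):   # standardized asynchronous entry point
        return await self.generator.acall(prompt_kwargs={"input_str": query})

qa = QAComponent()
print(qa.call("What is PagedAttention?"))
```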
Unified Visualization
Pipeline structure visualization is streamlined through the repr method, with extensibility via extra_repr for additional component-specific details.
Automatic Tracking
The system recursively monitors and integrates all subcomponents and parameters, creating a comprehensive framework for building and optimizing task pipelines.
State Management
Robust state handling is implemented through state_dict and load_state_dict methods, while to_dict enables serialization of all component attributes across various data types.
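Building on the hypothetical QAComponent sketch above, the state-handling methods the article names can be used roughly like this (a sketch, assuming PyTorch-style semantics for state_dict and load_state_dict):

```python
# Sketch of inspecting and restoring component state (PyTorch-style semantics assumed).
qa = QAComponent()

print(qa)                        # __repr__ shows the nested pipeline structure

state = qa.state_dict()          # capture parameters/state of the component tree
snapshot = qa.to_dict()          # serialize all attributes for logging or storage

restored = QAComponent()
restored.load_state_dict(state)  # restore the captured state into a fresh instance
```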
LLM application development is complex. We’ll be adding more powerful write-ups like this to the DataExpert.io open-source repo LLM-driven Data Engineering over the coming months! There are also many free long-form videos in that repo.
If you liked this content, make sure to share it with your friends and on LinkedIn or X!