
Open LLM Day: A Deep Dive Into Evaluating and Fine-Tuning Open LLMs

In the last week of May, the AI Brains team, MongoDB User Group Chennai, Boldcap, and Petavue hosted Open LLM Day! The event was an interactive day for AI leaders, developers, and data analysts to discuss the many challenges in evaluating and fine-tuning open-source LLMs.

In this blog post, let’s briefly walk through the key takeaways from the event!

 

Open LLM Day: The sessions that made our day informative and exciting!

Designed as an interactive event with presentations and workshops from experts in the AI domain, Open LLM Day featured three sessions to help participants understand LLMs better:

  1. Benchmarking LLMs: Presenting Petavue’s report on evaluating LLMs in NL-to-SQL translation
  2. Retrieval Augmented Generation (RAG): Hands-on workshop
  3. Fine-tuning LLMs: Exploring how effective LoRA and QLoRA are
 
Let’s look at the key takeaways from each of these sessions below!

Session 1: Petavue’s report on evaluating open-source LLMs in NL-to-SQL translation

Presented by: Jeyaraj Vellaisamy, Co-Founder & CTO, Petavue.

LLM training has advanced by leaps and bounds, and the ability of these models to understand natural language has improved considerably in recent times. With new open-source large language models emerging almost daily, it’s crucial for AI researchers and data scientists to benchmark their capabilities before adopting them for specific use cases.

In the first session of Open LLM Day, Petavue presented a detailed report on how it benchmarked LLMs for NL-to-SQL translation.

NL-to-SQL translation is one of the key research areas in NL-to-code generation. By letting users query vast databases in plain language, it improves human-machine interaction and makes data exploration far more accessible.

Petavue’s research team tested a range of LLMs, both on hosted platforms and in self-hosted setups, and benchmarked their performance on NL-to-SQL tasks using metrics such as execution accuracy, throughput, and total running cost.

They used the BIRD dataset for their experiments because its complexity makes it a strong test of how models handle challenging queries. With 360 question-query pairs categorized as simple, moderate, and complex, they evaluated models such as Code Llama, Mistral, and the Claude 3 family on platforms including Anyscale, Bedrock, and Anthropic.
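Of these metrics, execution accuracy is the most direct: run both the gold query and the model’s predicted query against the database and compare the results. Here’s a minimal sketch of how it can be computed, assuming a local SQLite copy of the benchmark database; the function below is purely illustrative, not Petavue’s actual harness:

```python
import sqlite3

def execution_accuracy(db_path: str, gold_sql: str, predicted_sql: str) -> bool:
    """Return True if the predicted query yields the same rows as the gold query."""
    conn = sqlite3.connect(db_path)
    try:
        gold_rows = conn.execute(gold_sql).fetchall()
        try:
            pred_rows = conn.execute(predicted_sql).fetchall()
        except sqlite3.Error:
            return False  # a query that fails to execute counts as a miss
        # Compare as multisets, since row order is usually not significant
        return sorted(map(repr, gold_rows)) == sorted(map(repr, pred_rows))
    finally:
        conn.close()
```

Averaging this boolean over all 360 question-query pairs (and per difficulty bucket) gives the execution-accuracy figures reported per model.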

 

From providing a systematic evaluation of LLM performance to comparing models holistically across platforms, Petavue’s benchmark paves the way for informed decision-making and strategic resource allocation.

Read the full report here: https://github.com/petavue/NL2SQL-Benchmark

 

Session 2: Retrieval Augmented Generation (RAG) Workshop

Presented by: Jeyaraj Vellaisamy, Co-Founder & CTO, Petavue.

Aiming to give participants hands-on experience with RAG, the second session explored its fundamentals and depths through an engaging workshop!

RAG, short for Retrieval Augmented Generation, is a technique that augments an existing LLM with a specialized knowledge base, giving the model better context so that it responds to queries accurately. It blends retrieval-based techniques with an LLM’s text-generation capability, helping it answer user queries with relevant, well-grounded responses.

 

A RAG setup involves a retriever and a knowledge base. The knowledge base is created in a four-step process: load the necessary files as plain text, chunk them into smaller pieces, translate those chunks into text embeddings, and load the embeddings into a vector database for embedding-based search. The chunking step is crucial because every LLM has a limited context window, so chunk sizes must be chosen with care. A sketch of this pipeline follows below.
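Here’s a minimal sketch of those four steps in Python; the embedding model, database, and collection names are assumptions for illustration, not the workshop’s exact setup:

```python
from pymongo import MongoClient
from sentence_transformers import SentenceTransformer

CHUNK_SIZE = 500  # characters; keep chunks well inside the LLM's context window

def build_knowledge_base(path: str, mongo_uri: str) -> None:
    # Step 1: load the source document as plain text
    with open(path, encoding="utf-8") as f:
        text = f.read()

    # Step 2: chunk the text into pieces small enough for the context window
    chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]

    # Step 3: translate each chunk into a text embedding
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(chunks)

    # Step 4: load chunks and embeddings into the vector database
    collection = MongoClient(mongo_uri)["rag_demo"]["chunks"]
    collection.insert_many(
        [{"text": c, "embedding": e.tolist()} for c, e in zip(chunks, embeddings)]
    )
```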

 

When a user query comes in, the retriever searches the knowledge base for relevant information and folds what it fetches into the prompt. This augmented prompt is then passed to the LLM so that it can generate an accurate, grounded response.
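Continuing the sketch above, the query path might look like this; the `search` helper stands in for whatever vector lookup the knowledge base exposes and is purely illustrative:

```python
def answer(question: str, embed_model, search, llm) -> str:
    # Retrieval step: embed the question and fetch the closest chunks
    query_vec = embed_model.encode([question])[0]
    context = "\n\n".join(search(query_vec, top_k=3))

    # Augmentation step: fold the retrieved context into the prompt
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # Generation step: the LLM answers grounded in the retrieved context
    return llm(prompt)
```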

 

That said, when we look at the challenges of getting reliable, crisp answers from an LLM on its own, some of the noteworthy limitations include:

  • Hallucination
  • Its inability to give accurate answers to niche, domain-specific questions
  • The difficulty of keeping the model’s training data up to date

 

So how does the RAG framework reduce these challenges and make room for contextually correct answers from LLMs?

This question set the agenda for the Open LLM Day session and unfolded into an informative workshop on RAG!

 

From explaining the steps involved in creating a knowledge base for a RAG model to teaching participants how to use RAG effectively for accurate answers, the workshop was informative and engaging.

 

The session used MongoDB Atlas and Google Colab to give participants hands-on experience. Side-by-side demonstrations with and without RAG showed how much the technique helps an LLM return relevant responses.
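For the MongoDB Atlas piece specifically, the retrieval step can be expressed as an aggregation pipeline using the $vectorSearch stage. A hedged sketch, reusing the collection and query embedding from the earlier snippets and assuming a vector index named "vector_index" on the embedding field (the workshop’s exact index configuration wasn’t shared):

```python
results = collection.aggregate([
    {
        "$vectorSearch": {
            "index": "vector_index",        # name of the Atlas vector index
            "path": "embedding",            # field holding the chunk embeddings
            "queryVector": query_vec.tolist(),
            "numCandidates": 100,           # breadth of the approximate search
            "limit": 3,                     # top-k chunks passed to the prompt
        }
    },
    {"$project": {"text": 1, "_id": 0}},    # keep only the chunk text
])
context_chunks = [doc["text"] for doc in results]
```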

 

 

Session 3: Fine-tuning LLMs: Exploring LoRA and QLoRA

Presented by: Pranav Reddy, Co-Founder & CIO, Xylem AI

The final session of Open LLM Day took participants through modern techniques for fine-tuning LLMs and discussed several methods in PEFT (Parameter-Efficient Fine-Tuning).

Beginning with a thorough walkthrough of fine-tuning fundamentals, the session covered key concepts including:

  • Introduction to fine-tuning 
  • Difference between RAG and Fine-tuning
  • When should we go for prompt tuning?
  • How effective are LoRA and QLoRA?

 

Full fine-tuning is expensive and resource-intensive, and there are several factors to weigh before attempting it. PEFT methods like LoRA and QLoRA help developers improve performance, reduce the computational power needed to fine-tune LLMs, and keep storage requirements manageable.

LoRA stands for Low-Rank Adaptation, a PEFT method that exploits low-rank matrix decomposition to fine-tune an LLM efficiently.
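To make the idea concrete, here is the standard LoRA formulation from the original paper (general background, not a derivation specific to this talk): instead of updating a pretrained weight matrix directly, LoRA learns two small matrices whose product approximates the update:

```latex
% LoRA forward pass: the pretrained weights W_0 stay frozen,
% and only the low-rank factors B and A are trained.
h = W_0 x + \Delta W\, x = W_0 x + B A\, x,
\qquad B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)
```

Because the rank r is tiny compared to d and k, the trainable parameter count drops from d × k to r × (d + k), which is what makes the method so cheap.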

The session explored how this rank-decomposition step works when LoRA is used to fine-tune an LLM. From the equation behind decomposing the model’s weight updates to how freezing the original weights shrinks checkpoint sizes, every aspect of LoRA was covered. A code sketch of what this looks like in practice follows below.
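As an illustration, here is a minimal sketch of attaching LoRA adapters with Hugging Face’s peft library; the base model checkpoint and hyperparameters are placeholders, not the values used in the session:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a base model (placeholder checkpoint, not the session's exact model)
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                                   # rank of the decomposition matrices
    lora_alpha=16,                         # scaling factor for the LoRA update
    target_modules=["q_proj", "v_proj"],   # attach adapters to attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the small A/B matrices are trainable
```

QLoRA follows the same recipe but loads the frozen base model in 4-bit precision (e.g., via load_in_4bit in transformers), cutting memory use further while training the same small adapter matrices.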



Conclusion

With sessions shedding light on everything from benchmarking open-source LLMs for NL-to-SQL tasks to fine-tuning them efficiently, everyone who attended Open LLM Day had an informative and engaging time. That said, we still have a long way to go in optimizing the resources needed to train, fine-tune, and benchmark an LLM for NL-to-SQL tasks!
