
LARGE LANGUAGE MODEL PLAYGROUND SLM (PT 4)

Silas Liu - Nov. 06, 2024

Large Language Models, Small Language Model

This section demonstrates the deployment of a Small Language Model (SLM) integrated with Retrieval-Augmented Generation (RAG) and SQLite, designed for edge computing environments. By leveraging SLMs, I've addressed the challenges of context window limitations, token constraints, and the need to keep tasks simple given the smaller model size. The solution runs entirely locally, ensuring cost-free operation, with quantized embeddings keeping retrieval efficient.


The main challenge in implementing the system was balancing performance against the constraints of small models. The chatbot was designed to answer questions about this portfolio website, maintaining a conversational flow while generating accurate responses without hallucinations or false information. Despite these efforts, the model still faces limitations and sometimes struggles with interpretation, but it shows significant potential for future improvements.

The rapid advancements in natural language processing have made it possible to integrate language models into countless applications, yet there remains a critical need for models that can operate in constrained environments, such as edge devices. This project demonstrates the process of deploying a Small Language Model (SLM) on my portfolio website, focusing on the unique challenges and opportunities that arise when using lightweight, on-device models without token-related costs.


For edge-based applications, the use of SLMs is essential for practical reasons, but these models come with inherent limitations. SLMs are constrained by their context window sizes, token limits, and overall capacity, which restrict them to simpler tasks. Complex multi-step flows, such as those seen in agent-based architectures, are challenging to implement due to these limitations. Consequently, the system I developed focuses on answering user questions in a constrained and controlled context within the model's capabilities.


Of all the SLMs I tested, three stood out as the most promising: Microsoft Phi 3.5 mini (3.8B parameters, released Aug 2024), Meta Llama 3.2 (3B parameters, released Sep 2024) and HuggingFace SmolLM2 (1.7B parameters, released Nov 2024). All three are multilingual, and the latter two feature agentic functionality for tool use. There are several even smaller models, such as Llama 3.2-1B and SmolLM2-135M, but these have proven better suited for agent functionality than for reasoning tasks. Below is a brief comparative table of the models.

| Benchmark          | Type of Test | Llama 3.2 3B | Phi 3.5 3.8B | SmolLM2 1.7B |
| ------------------ | ------------ | ------------ | ------------ | ------------ |
| Hellaswag          | Reasoning    | 69.8         | 81.4         | 66.1         |
| GPQA               | Reasoning    | 32.8         | 31.9         | -            |
| ARC Challenge      | Reasoning    | 78.6         | 87.4         | 51.7         |
| TLDR9+             | General      | 19.0         | 34.5         | -            |
| Open-rewrite eval  | General      | 40.1         | 12.8         | 44.9         |
| MMLU               | General      | 63.4         | 59.2         | -            |
| IFEval             | General      | 77.4         | 69.0         | 56.7         |

Sources: [Source 1] / [Source 2].

The need to strike a balance between model size and performance also extends to embeddings, which facilitate search and retrieval tasks within the application. To optimize performance in a local, resource-limited environment, I used embeddings with reduced floating-point precision through quantization. This adjustment allows the embeddings to run efficiently but may slightly impact accuracy, a trade-off necessary for ensuring that the system remains lightweight and functional in real time.
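As a rough sketch of the idea, the snippet below quantizes float32 embeddings to int8 with a per-vector scale factor. The embedding model name is illustrative only; the post does not specify which model is actually used.

```python
# Minimal sketch of int8 embedding quantization for lightweight retrieval.
# The model name "all-MiniLM-L6-v2" is an assumption, not the deployed model.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def quantize(vec: np.ndarray) -> tuple[np.ndarray, float]:
    """Map a float32 vector to int8 plus a scale factor for dequantization."""
    scale = max(float(np.abs(vec).max()) / 127.0, 1e-12)
    return np.round(vec / scale).astype(np.int8), scale

def dequantize(qvec: np.ndarray, scale: float) -> np.ndarray:
    return qvec.astype(np.float32) * scale

emb = model.encode("Tell me about Silas")  # float32, ~1.5 KB at 384 dims
q, s = quantize(emb)                       # int8, ~384 bytes (4x smaller)
approx = dequantize(q, s)
# Cosine similarity degrades only slightly after quantization:
cos = np.dot(emb, approx) / (np.linalg.norm(emb) * np.linalg.norm(approx))
```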


One key aspect of this solution is its use of Retrieval-Augmented Generation (RAG) to enhance response relevance. To keep responses tailored to my work and portfolio, I integrated a RAG setup using SQLite as the primary database. SQLite offers a lightweight, serverless solution that suits this application perfectly, enabling efficient storage and retrieval of context-specific information.
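A minimal sketch of that storage and lookup, reusing the quantization helpers from the previous snippet, might look like the following; the table name and schema are illustrative, not the site's actual database layout.

```python
# Illustrative RAG storage/lookup over SQLite; schema is an assumption.
import sqlite3
import numpy as np

conn = sqlite3.connect("portfolio_rag.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS chunks ("
    "id INTEGER PRIMARY KEY, text TEXT, embedding BLOB, scale REAL)"
)

def add_chunk(text: str, emb: np.ndarray) -> None:
    q, s = quantize(emb)  # int8 quantization from the earlier sketch
    conn.execute("INSERT INTO chunks (text, embedding, scale) VALUES (?, ?, ?)",
                 (text, q.tobytes(), s))
    conn.commit()

def retrieve(query_emb: np.ndarray, k: int = 3) -> list[str]:
    """Brute-force cosine search; fine at portfolio scale (hundreds of chunks)."""
    scored = []
    for text, blob, scale in conn.execute("SELECT text, embedding, scale FROM chunks"):
        emb = np.frombuffer(blob, dtype=np.int8).astype(np.float32) * scale
        score = float(np.dot(query_emb, emb) /
                      (np.linalg.norm(query_emb) * np.linalg.norm(emb)))
        scored.append((score, text))
    return [t for _, t in sorted(scored, reverse=True)[:k]]
```

SQLite fits well here because the corpus is small and read-heavy: a single file, no server process, and queries fast enough that a brute-force scan over all chunk embeddings is acceptable.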


Due to the limitations of available resources, I implemented a simple flow to handle user interactions. However, it became clear that integrating a reasoning step was essential. The reasoning loop helps mitigate the model's occasional inaccuracies, guiding it toward more appropriate responses. Modern SLMs, such as Llama 3.2 and SmolLM2, are capable of integrating agents with tool calls, opening the door to more advanced implementations in the future. Here is the flowchart of the current pipeline.

[Flowchart: pipe_llm-slm.drawio.png]
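A hedged sketch of this retrieve-generate-verify loop is shown below. Here generate() is a stand-in for whichever local runtime drives the SLM (for example llama-cpp-python or transformers), which the post does not specify, and it reuses model and retrieve() from the earlier sketches.

```python
# Sketch of the reasoning loop, under the assumption that generate(prompt)
# wraps the local SLM runtime and returns its completion as a string.

def answer(question: str, max_retries: int = 2) -> str:
    query_emb = model.encode(question)        # embedding model from the first sketch
    context = "\n".join(retrieve(query_emb))  # RAG lookup from the SQLite sketch
    prompt = (f"Answer using ONLY this context:\n{context}\n\n"
              f"Question: {question}\nAnswer:")
    for _ in range(max_retries):
        draft = generate(prompt)
        # Reasoning step: ask the model to verify its own draft against the context.
        verdict = generate(f"Context:\n{context}\n\nAnswer:\n{draft}\n\n"
                           "Is every claim supported by the context? Reply YES or NO.")
        if verdict.strip().upper().startswith("YES"):
            return draft
        prompt += "\nThe previous answer was not grounded in the context. Try again."
    return "I'm not sure about that based on the portfolio content."
```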

The solution consists of a thoughtfully constructed architecture where each component works to optimize efficiency. The system coordinates the LLM flow, integrates RAG with SQLite, and logs usage activity for further analysis. Although the system is hosted on an ordinary online server, the model itself runs entirely locally on that machine rather than through paid APIs, keeping operation cost-free while benefiting from a stable hosting environment. This architecture minimizes dependencies on external resources, which is ideal for small models operating within constrained environments.
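For the logging component specifically, a minimal version could look like this; the table and column names are illustrative, not the actual schema used in production.

```python
# Illustrative usage-logging component backed by the same SQLite engine.
import sqlite3
import time

log_db = sqlite3.connect("usage_log.db")
log_db.execute(
    "CREATE TABLE IF NOT EXISTS interactions ("
    "ts REAL, question TEXT, answer TEXT, latency_ms REAL)"
)

def log_interaction(question: str, answer_text: str, latency_ms: float) -> None:
    """Record one chatbot exchange for later offline analysis."""
    log_db.execute("INSERT INTO interactions VALUES (?, ?, ?, ?)",
                   (time.time(), question, answer_text, latency_ms))
    log_db.commit()
```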

 

A user-facing AI system requires careful consideration of ethical standards. Even in a lightweight setup, precautions against harmful or inappropriate output are crucial. Although the model's design inherently limits complexity, key concerns include guarding against prompt injection, preventing harmful language, and maintaining ethical guidelines.
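As one illustration, a first line of defense can be a simple pattern check on user input before it ever reaches the model. The patterns below are toy examples, not the filter actually deployed here; a real deployment would pair this with a moderation model or curated blocklists.

```python
# Toy guardrail pass: reject obvious prompt-injection attempts up front.
# Patterns are illustrative examples only.
import re

INJECTION_PATTERNS = [
    r"ignore (all|previous|the above) instructions",
    r"you are now",
    r"system prompt",
]

def is_suspicious(user_input: str) -> bool:
    text = user_input.lower()
    return any(re.search(p, text) for p in INJECTION_PATTERNS)

def safe_answer(question: str) -> str:
    if is_suspicious(question):
        return "Sorry, I can only answer questions about this portfolio."
    return answer(question)  # answer() from the reasoning-loop sketch
```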


Below you can test the implemented SLM. Try some inputs like:

- Tell me about projects involving LLM

- Tell me more about LLM Playground

- Tell me about Silas

- Tell me about his work 

