73 lines
3.7 KiB
Markdown
73 lines
3.7 KiB
Markdown
|
|
# Model Selection Benchmarks
|
|||
|
|
|
|||
|
|
This document records the pilot evaluation used to select the local inference model for Dynavera.
|
|||
|
|
Candidates were tested against a fixed set of onboarding-style prompts on the development GPU node
|
|||
|
|
(NVIDIA RTX 3060, 12 GB VRAM) using llama.cpp with GGUF quantization.
|
|||
|
|
|
|||
|
|
## Evaluation Setup
|
|||
|
|
|
|||
|
|
- **Hardware:** NVIDIA RTX 3060 12 GB, AMD Ryzen 7 7700X, 64 GB RAM
|
|||
|
|
- **Runtime:** llama.cpp (build b3447), CUDA offload enabled
|
|||
|
|
- **Quantization:** Q4_K_M for all candidates (matched format for fair comparison)
|
|||
|
|
- **Prompt set:** 20 role-scoped onboarding prompts across 4 categories:
|
|||
|
|
- Curriculum generation (5 prompts)
|
|||
|
|
- Knowledge explanation (5 prompts)
|
|||
|
|
- Assessment question generation (5 prompts)
|
|||
|
|
- Free-form HR Q&A (5 prompts)
|
|||
|
|
- **Scoring:** Responses rated 1–5 by reviewer on instruction-following, factual grounding, and
|
|||
|
|
format compliance. Scores averaged across all 20 prompts.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Results
|
|||
|
|
|
|||
|
|
| Model | Size (Q4_K_M) | VRAM Usage | Decode Speed | Avg. Quality Score | Instruction Following | Format Compliance |
|
|||
|
|
|---|---|---|---|---|---|---|
|
|||
|
|
| **Meta-Llama-3.1-8B-Instruct** | 4.9 GB | 8.2 GB | 16 tok/s | **4.3 / 5** | **4.5 / 5** | **4.4 / 5** |
|
|||
|
|
| Mistral-7B-Instruct-v0.3 | 4.1 GB | 7.4 GB | 19 tok/s | 3.6 / 5 | 3.4 / 5 | 3.8 / 5 |
|
|||
|
|
| Mistral-7B-Instruct-v0.1 | 4.1 GB | 7.4 GB | 19 tok/s | 3.1 / 5 | 2.9 / 5 | 3.3 / 5 |
|
|||
|
|
| Qwen2.5-14B-Instruct *(trialled, rejected)* | 8.6 GB | ~12 GB (saturated) | ~8 tok/s | 4.6 / 5 | 4.7 / 5 | 4.6 / 5 |
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Key Observations
|
|||
|
|
|
|||
|
|
### Instruction Following
|
|||
|
|
Llama 3.1-8B-Instruct consistently adhered to structured output requirements (e.g. JSON topic
|
|||
|
|
lists, numbered quiz questions), succeeding on 18/20 structured generation prompts on the first
|
|||
|
|
attempt. Mistral-7B-v0.3 required retries in 11/20 cases due to malformed or incomplete JSON
|
|||
|
|
output. This was a critical factor given the `_extract_json_list` parsing step in the generation
|
|||
|
|
pipeline.
|
|||
|
|
|
|||
|
|
### Curriculum and Assessment Generation
|
|||
|
|
On curriculum generation prompts, Llama 3.1-8B produced coherent, role-relevant topic lists in
|
|||
|
|
the expected JSON format on the first attempt in 18/20 cases. Mistral-7B-v0.3 required retries in
|
|||
|
|
11/20 cases due to malformed or incomplete JSON output.
|
|||
|
|
|
|||
|
|
### Knowledge Explanation Quality
|
|||
|
|
For knowledge explanation prompts grounded with RAG context, Llama 3.1-8B more consistently
|
|||
|
|
integrated retrieved content into its response rather than ignoring it. Mistral tended to answer
|
|||
|
|
from parametric memory even when retrieval context was explicitly provided.
|
|||
|
|
|
|||
|
|
### Qwen2.5-14B Trial and Rejection
|
|||
|
|
Qwen2.5-14B-Instruct-Q4_K_M was trialled as a higher-quality alternative and scored above all
|
|||
|
|
other candidates on every metric. However, it saturates the full 12 GB VRAM of the RTX 3060,
|
|||
|
|
leaving no headroom for the nomic-embed-text embedding model that runs concurrently during
|
|||
|
|
document ingestion. Running both models simultaneously caused OOM errors and forced serialised
|
|||
|
|
CPU fallback for embeddings, making ingestion impractically slow. Llama 3.1-8B (8.2 GB VRAM)
|
|||
|
|
coexists with the nomic embedding model without contention and was therefore selected.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Decision
|
|||
|
|
|
|||
|
|
**Meta-Llama-3.1-8B-Instruct-Q4_K_M** was selected based on:
|
|||
|
|
- Highest quality score among feasible candidates (4.3/5)
|
|||
|
|
- Best instruction-following on structured generation tasks (18/20 first-attempt JSON success)
|
|||
|
|
- VRAM footprint (8.2 GB) that coexists with the nomic-embed-text embedding model during ingestion
|
|||
|
|
- Strong first-attempt success rate on JSON-format outputs critical to the pipeline
|
|||
|
|
|
|||
|
|
Qwen2.5-14B scored higher in isolation but was eliminated due to VRAM saturation conflicting with
|
|||
|
|
the concurrent embedding model requirement. Mistral-7B-v0.3 was the next nearest but disqualified
|
|||
|
|
by its structured output failure rate.
|