Perfect-Princess Makuwerere

About

I am an AI/ML Engineer specializing in LLM training, deployment, optimization, and production engineering. From fine-tuning and quantization to RAG pipelines, agentic workflows, and inference serving, I focus on turning powerful models into scalable, efficient, and reliable products. When I'm away from the keyboard, I am usually running, hiking, or gaming.

Currently

↳ Building LLM deployment pipelines spanning fine-tuning, local inference with llama.cpp and Ollama, and high-throughput serving with vLLM.

Research & Technical Focus

LLM Training · Inference Serving · Performance Benchmarking · Embedding Evaluation · Distributed Training · Retrieval-Augmented Generation · Pipeline Parallelism

Selected Projects

AI Benchmarking Dashboard ↗ GitHub

Local web application for evaluating embedding models and Ollama LLMs on real hardware. Ingests PDFs, generates synthetic QA pairs via OpenRouter, and evaluates up to 6 embedding models simultaneously using MRR, NDCG@K, and Recall@K. LLM module streams live inference while measuring TTFT and throughput with hardware-aware thresholds across NVIDIA, AMD, and CPU.

FastAPI · React · sentence-transformers · ranx · Ollama · pynvml · t-SNE
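
For readers unfamiliar with the ranking metrics named above, here is a minimal pure-Python sketch of MRR, Recall@K, and NDCG@K under binary relevance. It is an illustration only, not the dashboard's code (the project lists ranx for this); the function names and document ids are hypothetical, and each ranked list is assumed to contain no duplicate ids.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant document, or 0.0 if none appears."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """MRR: average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents found in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG@K: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical query: the single relevant chunk "d1" is retrieved at rank 2.
ranked = ["d3", "d1", "d7"]
relevant = {"d1"}
print(reciprocal_rank(ranked, relevant), recall_at_k(ranked, relevant, 3))
```

With synthetic QA pairs, each question typically has one source chunk as its relevant document, which is the case the binary-relevance form above covers.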

LLM Benchmarking CLI ↗ GitHub

CLI tool that answers: which model variant gives the best quality-per-second on this hardware? Measures TTFT, tokens/sec, cosine similarity, and ROUGE-L across Ollama model families. Runs context-length sweeps (512–4096 tokens), surfaces a Pareto-optimal recommendation, and exports JSON, Markdown tables, and scatter/line charts.

Python · Click · sentence-transformers · ROUGE-L · Ollama API · nvidia-smi
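
The Pareto-optimal recommendation mentioned above can be sketched as follows. This is an assumed formulation rather than the CLI's actual logic: each run is reduced to one quality score and one speed score (both higher-is-better), and a model is kept only if no other model is at least as good on both and strictly better on one. The `Result` type, the model names, and the numbers are made up for illustration.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    model: str
    quality: float         # e.g. cosine similarity to a reference answer
    tokens_per_sec: float  # decoding throughput


def dominates(a: Result, b: Result) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.quality >= b.quality and a.tokens_per_sec >= b.tokens_per_sec
    better = a.quality > b.quality or a.tokens_per_sec > b.tokens_per_sec
    return no_worse and better


def pareto_front(results: List[Result]) -> List[Result]:
    """Keep every result that no other result dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Illustrative, invented measurements for three hypothetical variants.
runs = [
    Result("model-a", quality=0.90, tokens_per_sec=20.0),
    Result("model-b", quality=0.80, tokens_per_sec=40.0),
    Result("model-c", quality=0.70, tokens_per_sec=30.0),
]
for r in pareto_front(runs):
    print(r.model)
```

Here `model-c` drops out because `model-b` is both faster and higher quality; choosing between the remaining two is a trade-off the user still has to make.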

Pipeline Parallelism ↗ GitHub

PyTorch implementation of distributed pipeline parallelism that shards model layers across multiple GPUs. Manages forward activation passing and backward gradient flow between ranks using NCCL and Gloo backends via torchrun. Written as a clean, readable reference for understanding LLM training infrastructure at the systems level.

PyTorch · torchrun · NCCL · Gloo · CUDA
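
As a small illustration of the layer-sharding step described above, the sketch below assigns contiguous blocks of layer indices to pipeline stages, one stage per rank. It is framework-free and is not taken from the repository; the even split and the `partition_layers` name are assumptions, and a real implementation might instead balance stages by memory or compute cost.

```python
from typing import List


def partition_layers(num_layers: int, num_stages: int) -> List[range]:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal stages.

    Earlier stages receive one extra layer when the split is uneven.
    """
    if num_stages <= 0 or num_layers < num_stages:
        raise ValueError("need at least one layer per stage")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


# Hypothetical setup: 10 transformer blocks spread over 4 ranks.
for rank, layers in enumerate(partition_layers(10, 4)):
    print(f"rank {rank}: layers {list(layers)}")
```

Each rank would then run only its own block of layers, receiving activations from the previous rank in the forward pass and sending gradients back to it in the backward pass.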

Transformer from Scratch ↗ GitHub

GPT-style transformer built from scratch in PyTorch, including multi-head attention and a flash attention implementation. Small enough to run on a laptop while preserving the full mechanics of modern language model architectures: attention, residual streams, and positional encoding.

PyTorch · Flash Attention · Multi-head Attention · Jupyter

Benchmark-Embeddings ↗ GitHub

Systematic evaluation of embedding models across tasks and datasets. Includes a data scraper for corpus collection, curated evaluation sets, and a notebook-based analysis pipeline for comparing retrieval and semantic similarity performance across HuggingFace model families.

Python · Jupyter · HuggingFace · Retrieval Metrics

RAG Study Assistant ↗ GitHub

Full-stack RAG application that splits academic documents into chunks, embeds them as vectors, stores them in Pinecone, and streams structured summaries and generated flashcards via Server-Sent Events. Supports PDF and TXT input, model selection across Gemini and Together AI, and a persistent note library with embedding previews.

Flask · React · Pinecone · Gemini · RAG · SSE · PyMuPDF
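
The chunking step that precedes embedding can be sketched roughly as below. This is a generic word-window splitter with overlap, not the application's actual strategy; the `chunk_text` name and the default sizes are assumptions chosen for illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into windows of chunk_size words that share overlap words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Tiny example: 10 words, windows of 4 words overlapping by 1.
sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(sample, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap keeps a sentence that straddles a boundary retrievable from either neighbouring chunk, at the cost of storing some words twice.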

Technical Skills

ML / AI

PyTorch HuggingFace sentence-transformers Keras Ollama LangChain

Infrastructure & Serving

torchrun NCCL FastAPI Flask Pinecone Docker

Languages & Tools

Python JavaScript React SQL Git Jupyter