Perfect-Princess Makuwerere

AI Systems · LLM Infrastructure · Benchmarking
Stevens Institute of Technology  ·  New York
About

I am an AI/ML Engineer specializing in LLM training, deployment, optimization, and production engineering. From fine-tuning and quantization to RAG pipelines, agentic workflows, and inference serving, I focus on turning powerful models into scalable, efficient, and reliable products. When I'm away from the keyboard, I am usually running, hiking, or gaming.

Currently
Building LLM deployment pipelines spanning fine-tuning, local inference with llama.cpp and Ollama, and high-throughput serving with vLLM.
Research & Technical Focus
LLM Training · Inference Serving · Performance Benchmarking · Embedding Evaluation · Distributed Training · Retrieval-Augmented Generation · Pipeline Parallelism
Selected Projects
AI Benchmarking Dashboard ↗ GitHub

Local web application for evaluating embedding models and Ollama LLMs on real hardware. Ingests PDFs, generates synthetic QA pairs via OpenRouter, and evaluates up to six embedding models simultaneously using MRR, NDCG@K, and Recall@K. The LLM module streams live inference while measuring time to first token (TTFT) and throughput, with hardware-aware thresholds across NVIDIA, AMD, and CPU hardware.

FastAPI · React · sentence-transformers · ranx · Ollama · pynvml · t-SNE
LLM Benchmarking CLI ↗ GitHub

CLI tool that answers: which model variant gives the best quality-per-second on this hardware? Measures TTFT, tokens/sec, cosine similarity, and ROUGE-L across Ollama model families. Runs context-length sweeps (512–4096 tokens), surfaces a Pareto-optimal recommendation, and exports JSON, Markdown tables, and scatter/line charts.

Python · Click · sentence-transformers · ROUGE-L · Ollama API · nvidia-smi
Pipeline Parallelism ↗ GitHub

PyTorch implementation of distributed pipeline parallelism that shards model layers across multiple GPUs. Manages forward activation passing and backward gradient flow between ranks using NCCL and Gloo backends via torchrun. Written as a clean, readable reference for understanding LLM training infrastructure at the systems level.

PyTorch · torchrun · NCCL · Gloo · CUDA
Transformer from Scratch ↗ GitHub

GPT-style transformer built from scratch in PyTorch, including multi-head attention and a flash attention implementation. Small enough to run on a laptop while preserving the full mechanics of modern language model architectures: attention, residual streams, and positional encoding.

PyTorch · Flash Attention · Multi-head Attention · Jupyter
Benchmark-Embeddings ↗ GitHub

Systematic evaluation of embedding models across tasks and datasets. Includes a data scraper for corpus collection, curated evaluation sets, and a notebook-based analysis pipeline for comparing retrieval and semantic similarity performance across HuggingFace model families.

Python · Jupyter · HuggingFace · Retrieval Metrics
RAG Study Assistant ↗ GitHub

Full-stack RAG application that chunks academic documents, embeds the chunks as vectors stored in Pinecone, and streams structured summaries and generated flashcards via Server-Sent Events. Supports PDF and TXT input, model selection across Gemini and Together AI, and a persistent note library with embedding previews.

Flask · React · Pinecone · Gemini · RAG · SSE · PyMuPDF
Technical Skills

ML / AI

PyTorch · HuggingFace · sentence-transformers · Keras · Ollama · LangChain

Infrastructure & Serving

torchrun · NCCL · FastAPI · Flask · Pinecone · Docker

Languages & Tools

Python · JavaScript · React · SQL · Git · Jupyter