I am an AI/ML Engineer specializing in LLM training, deployment, optimization, and production engineering. From fine-tuning and quantization to RAG pipelines, agentic workflows, and inference serving, I focus on turning powerful models into scalable, efficient, and reliable products. When I'm away from the keyboard, I am usually running, hiking, or gaming.
Local web application for evaluating embedding models and Ollama LLMs on real hardware. Ingests PDFs, generates synthetic QA pairs via OpenRouter, and evaluates up to 6 embedding models simultaneously using MRR, NDCG@K, and Recall@K. The LLM module streams live inference while measuring time to first token (TTFT) and throughput against hardware-aware thresholds for NVIDIA, AMD, and CPU targets.
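For readers unfamiliar with the three retrieval metrics named above, here is a minimal sketch in plain Python. It assumes binary relevance (a chunk is either relevant to a question or not) and unique chunk ids in each ranking; the function names and the sample ids are hypothetical and are not taken from the application itself.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: the single relevant chunk is ranked second.
ranking = ["c2", "c7", "c1"]
relevant = {"c7"}
print(reciprocal_rank(ranking, relevant))  # 0.5
print(recall_at_k(ranking, relevant, 1))   # 0.0
print(recall_at_k(ranking, relevant, 3))   # 1.0
```

With synthetic QA pairs, each question typically has one source chunk, so the relevant set often has a single element and Recall@K reduces to a hit rate.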
CLI tool that answers: which model variant gives the best quality-per-second on this hardware? Measures TTFT, tokens/sec, cosine similarity, and ROUGE-L across Ollama model families. Runs context-length sweeps (512–4096 tokens), surfaces a Pareto-optimal recommendation, and exports JSON, Markdown tables, and scatter/line charts.
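To make "Pareto-optimal" concrete: a variant is on the Pareto front if no other variant is at least as fast and at least as good while being strictly better on one of the two. The sketch below is one simple way to compute that, not the tool's actual code; the `Result` type, the variant names, and the numbers are made up for illustration.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    model: str      # hypothetical variant label
    seconds: float  # lower is better (e.g. total response time)
    quality: float  # higher is better (e.g. ROUGE-L or cosine similarity)


def pareto_front(results: List[Result]) -> List[Result]:
    """Return the results that no other result dominates, fastest first."""
    front = []
    for cand in results:
        dominated = any(
            other.seconds <= cand.seconds
            and other.quality >= cand.quality
            and (other.seconds < cand.seconds or other.quality > cand.quality)
            for other in results
        )
        if not dominated:
            front.append(cand)
    return sorted(front, key=lambda r: r.seconds)


# Illustrative numbers only, not measurements.
runs = [
    Result("variant-a", 1.0, 0.70),
    Result("variant-b", 2.0, 0.80),
    Result("variant-c", 2.5, 0.75),  # slower and worse than variant-b
    Result("variant-d", 4.0, 0.90),
]
print([r.model for r in pareto_front(runs)])  # ['variant-a', 'variant-b', 'variant-d']
```

The pairwise check is quadratic in the number of variants, which is fine for the handful of models a single machine can benchmark.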
PyTorch implementation of distributed pipeline parallelism that shards model layers across multiple GPUs. Manages forward activation passing and backward gradient flow between ranks using NCCL and Gloo backends via torchrun. Written as a clean, readable reference for understanding LLM training infrastructure at the systems level.
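The first step in any pipeline-parallel setup is deciding which layers live on which rank. The sketch below shows one common choice, a contiguous and near-even split, in plain Python with no `torch.distributed` calls. It is an assumption about how such a split can be done, not a description of the repository's code, and `partition_layers` is a hypothetical name.

```python
from typing import List


def partition_layers(num_layers: int, world_size: int) -> List[range]:
    """Assign contiguous layer indices to each rank as evenly as possible.

    Earlier ranks receive one extra layer when the split is uneven.
    """
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    if num_layers < world_size:
        raise ValueError("need at least one layer per rank")
    base, extra = divmod(num_layers, world_size)
    stages = []
    start = 0
    for rank in range(world_size):
        size = base + (1 if rank < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


# 10 layers over 4 ranks: stage sizes 3, 3, 2, 2.
print([list(stage) for stage in partition_layers(10, 4)])
# [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```

In a real run each rank would build only its own slice of layers, receive activations from the previous rank in the forward pass, and send gradients back to it in the backward pass.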
GPT-style transformer built from scratch in PyTorch, including multi-head attention and a flash attention implementation. Small enough to run on a laptop while preserving the full mechanics of modern language model architectures: attention, residual streams, and positional encoding.
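The idea that makes flash attention possible is the online softmax: keys and values are processed block by block while a running maximum, a running normalizer, and a running weighted sum are rescaled as needed, so the full score row is never stored. The sketch below demonstrates only that idea for a single query vector in plain Python; it is not the project's PyTorch code or a tiled GPU kernel, and the function names and inputs are hypothetical.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_reference(q, keys, values):
    """Standard softmax(q.K / sqrt(d)) @ V for one query, all keys at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * _dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def attention_blockwise(q, keys, values, block_size=2):
    """Same result, computed one key/value block at a time (online softmax)."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * _dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        # On the first block this factor is exp(-inf) == 0.0.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]


# Arbitrary inputs chosen only to compare the two code paths.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [2.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
full = attention_reference(q, keys, values)
tiled = attention_blockwise(q, keys, values, block_size=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(full, tiled)))  # True
```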
Systematic evaluation of embedding models across tasks and datasets. Includes a data scraper for corpus collection, curated evaluation sets, and a notebook-based analysis pipeline for comparing retrieval and semantic similarity performance across HuggingFace model families.
Full-stack RAG application that chunks academic documents into vectors, stores them in Pinecone, and streams structured summaries with intelligent flashcard generation via Server-Sent Events. Supports PDF and TXT input, model selection across Gemini and Together AI, and a persistent note library with embedding previews.
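As an illustration of the chunking step, the sketch below splits a document into fixed-size word windows with overlap, so that a sentence straddling a boundary appears in both neighboring chunks. The chunk size, the overlap, and the name `chunk_text` are assumptions for this example; the application's real chunker, the embedding call, and the Pinecone upsert are not shown.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_text(sample, chunk_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Splitting on whitespace keeps the example dependency-free; a token-based splitter would track the embedding model's context limit more closely.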