Project

LiteLM: On-Device LLM Inference

Capstone project — GPT-2 inference on a Kria KV260 using AXI4 DMA and custom kernel drivers. 3rd place at the Queen's ECE Showcase.

Role

FPGA co-design, embedded Linux integration

Stack

Zynq UltraScale+, Vivado, Vitis HLS, AXI4, Linux

Repo

ELEC_498-Capstone-LiteLM

The Problem

LLMs usually need cloud GPUs. We wanted to run a small model entirely on edge hardware — no network, low power, booting off an SD card.

What I built

LiteLM runs GPT-2 small (124M parameters, INT8 quantized) on the Kria KV260's programmable logic. Matrix multiplications and attention ops run in the FPGA fabric, while precision-sensitive operations such as layer norm, softmax, and requantization run on the ARM Cortex-A53 in firmware. That split was the most important architectural decision we made. Earlier attempts that did everything in hardware lost too much precision during requantization and produced garbage output.
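To make the split concrete, here is a minimal sketch in plain Python of the two halves: an integer matrix multiply with wide accumulators (the kind of work that maps well to fabric) followed by floating-point requantization back to INT8 (the kind of work we kept on the CPU). The function names, scales, and matrix values are hypothetical and chosen only for illustration; this is not the project's actual kernel or firmware.

```python
def int8_matmul(a, b):
    """Multiply INT8 matrices, accumulating in wide integers (fabric-style work)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def requantize(acc, in_scale, w_scale, out_scale):
    """Rescale wide accumulators back to INT8 using floating point (CPU-style work)."""
    factor = in_scale * w_scale / out_scale
    # Round to nearest, then clamp to the signed 8-bit range.
    return [[max(-128, min(127, round(v * factor))) for v in row] for row in acc]


# Hypothetical activations, weights, and quantization scales.
x = [[12, -7, 100]]
w = [[3, -2], [5, 8], [-1, 4]]

acc = int8_matmul(x, w)                 # wide integer accumulators
q = requantize(acc, 0.5, 0.25, 4.0)     # back to INT8 for the next layer
```

The accumulators stay exact integers until the single rescale at the end, so rounding error enters only once per output element.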

I worked on the AXI4 DMA data paths between the PL and the ARM cores, and wrote the Linux kernel drivers that manage the accelerator. The firmware side handles loading quantized weights into DDR, streaming token embeddings to the PL via DMA, and reading back the output.
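As a rough illustration of the streaming step, the sketch below walks through one transfer in each direction against a fake register file. It assumes the Xilinx AXI DMA IP in direct register (simple) mode with 32-bit addressing, and uses that IP's documented register offsets; the project's real configuration may differ. The `FakeDma` class, the buffer addresses, and the transfer lengths are hypothetical, and a real driver would map the registers and wait on interrupts rather than poll.

```python
# Register offsets for the Xilinx AXI DMA IP, direct register mode (assumed setup).
MM2S_DMACR, MM2S_DMASR, MM2S_SA, MM2S_LENGTH = 0x00, 0x04, 0x18, 0x28
S2MM_DMACR, S2MM_DMASR, S2MM_DA, S2MM_LENGTH = 0x30, 0x34, 0x48, 0x58
DMACR_RUN = 0x1    # run/stop bit in the control registers
DMASR_IDLE = 0x2   # idle bit in the status registers


class FakeDma:
    """Hypothetical stand-in for memory-mapped registers; transfers finish instantly."""

    def __init__(self):
        self.regs = {}
        self.log = []

    def write(self, offset, value):
        self.regs[offset] = value
        self.log.append((offset, value))
        # On real hardware, writing LENGTH starts the transfer; here it also ends it.
        if offset == MM2S_LENGTH:
            self.regs[MM2S_DMASR] = DMASR_IDLE
        elif offset == S2MM_LENGTH:
            self.regs[S2MM_DMASR] = DMASR_IDLE

    def read(self, offset):
        return self.regs.get(offset, 0)


def wait_idle(dma, status_reg, max_polls=1000):
    """Poll a status register until the idle bit is set, with a bounded retry count."""
    for _ in range(max_polls):
        if dma.read(status_reg) & DMASR_IDLE:
            return
    raise TimeoutError(f"channel at status offset {status_reg:#x} never went idle")


def run_transfer(dma, src_addr, src_len, dst_addr, dst_len):
    """Arm S2MM (accelerator to DDR) first, then start MM2S (DDR to accelerator)."""
    dma.write(S2MM_DMACR, DMACR_RUN)
    dma.write(S2MM_DA, dst_addr)
    dma.write(S2MM_LENGTH, dst_len)
    dma.write(MM2S_DMACR, DMACR_RUN)
    dma.write(MM2S_SA, src_addr)
    dma.write(MM2S_LENGTH, src_len)
    wait_idle(dma, MM2S_DMASR)
    wait_idle(dma, S2MM_DMASR)


dma = FakeDma()
# Hypothetical DDR buffer addresses and byte counts.
run_transfer(dma, src_addr=0x10000000, src_len=768, dst_addr=0x10001000, dst_len=4)
```

Arming the receive channel before starting the send is a common ordering, so that output produced by the accelerator always has a buffer waiting for it.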

Highlights

3rd Place at the Queen's ECE Showcase for Top Computer Engineering Project.
AXI4 DMA paths: MM2S streams embeddings from DDR to the accelerator, S2MM writes output tokens back. The accelerator has two AXI master ports: one for weights, one for the KV cache.
Custom kernel drivers and device trees for loading the hardware overlay at boot.
Two abandoned architectures (end-to-end GPT-2 and Phi-3 Mini INT4) taught us that requantization in fixed-point logic loses too much precision for this class of model.

What I'd do differently

We jumped into the full design too early, partly because school reports demanded a complete architecture before we really understood the problem. If I did it again, I'd start with just a matrix multiplication accelerator: close the loop on something small, let people build intuition for the tools, then scale up. The accelerator design was also consolidated under one person, which created bottlenecks. Letting everyone own a piece of the design produces better work and more motivated people.


Media

Block design PDF
Vivado design PDF