Hey, I'm Ammar

AI Systems Engineer

I build systems that run LLMs in the real world—whether that's on a Raspberry Pi, a B200 cluster, or something in between. My focus: making models actually work within real constraints (latency, cost, hardware, privacy).


Projects I'm Working On

Chimera – Inference Engine

A custom SGLang fork where I experiment with kernel-level optimizations.

  • Exploring FlashAttention memory patterns and adaptive kernel selection
  • Practical focus: better throughput without sacrificing flexibility
  • Currently used in a few production deployments
  • Using CuTe DSL kernels instead of CUTLASS CuTe C++
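
To make "adaptive kernel selection" concrete, here is a minimal sketch of shape-based dispatch. The kernel names and thresholds are hypothetical illustrations, not Chimera's actual logic:

```python
# Hypothetical shape-based kernel dispatch, in the spirit of adaptive
# kernel selection. Names and thresholds are illustrative only.

def select_attention_kernel(batch_size: int, seq_len: int, head_dim: int) -> str:
    """Pick an attention kernel variant for the given problem shape."""
    if head_dim > 128:
        return "naive_attention"           # fallback for unusual head dims
    if seq_len <= 512:
        return "flash_attention_small"     # low-latency path for short contexts
    if batch_size * seq_len > 65_536:
        return "flash_attention_split_kv"  # split KV to keep all SMs busy
    return "flash_attention"

print(select_attention_kernel(batch_size=1, seq_len=8192, head_dim=128))
# → flash_attention
```

The point of keeping dispatch this explicit is that a new kernel variant only costs one branch, and the heuristic can be tuned per GPU without touching the kernels themselves.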

Helios-Engine – Rust Agent Framework

A lightweight framework for building reliable LLM agents.

  • Async I/O with Tokio, zero-copy patterns where it matters
  • Built for projects that need control without the Python overhead
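
The core loop such a framework manages is language-agnostic; here it is as a Python concept sketch (not Helios-Engine's actual API — every name below is hypothetical):

```python
# Concept sketch of a tool-calling agent loop (hypothetical names, not
# Helios-Engine's API): route the model's tool requests to registered
# tools until it produces a final answer.

def run_agent(model, tools: dict, prompt: str, max_steps: int = 5) -> str:
    history = [{"role": "user", "content": prompt}]
    for _ in range(max_steps):
        step = model(history)  # {"tool": ..., "args": ...} or {"answer": ...}
        if "answer" in step:
            return step["answer"]
        result = tools[step["tool"]](**step["args"])
        history.append({"role": "tool", "content": str(result)})
    raise RuntimeError("agent did not produce an answer in time")

# Stub model: asks for one tool call, then answers with the tool result.
def stub_model(history):
    if history[-1]["role"] == "user":
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"answer": history[-1]["content"]}

print(run_agent(stub_model, {"add": lambda a, b: a + b}, "What is 2+3?"))  # → 5
```

In Rust the same loop gets typed tool signatures and compile-time checks on the dispatch table, which is much of the reliability argument for leaving Python.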

Zeroum – vLLM-based Inference System

Zeroum is a fast and easy-to-use library for LLM inference and serving. It is based on vLLM but adds a Rust serving layer that sidesteps the Python layer's concurrency limits, enabling enterprise-level serving at roughly 1/6 the CPU usage.
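
One job of such a serving layer is request micro-batching: gathering concurrent requests into a single engine call instead of shepherding each one through Python. The toy asyncio version below illustrates the concept only — it is not Zeroum's Rust implementation, and the names, batch sizes, and wait windows are made up:

```python
import asyncio

# Toy micro-batcher illustrating the serving-layer role that a Rust
# front end can take over: gather concurrent requests into one batched
# engine call. All names and timings here are hypothetical.

class MicroBatcher:
    def __init__(self, max_batch: int = 8, max_wait_s: float = 0.005):
        self.queue: asyncio.Queue = asyncio.Queue()
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s

    async def submit(self, prompt: str) -> str:
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, fut))
        return await fut

    async def run(self, engine):
        while True:
            batch = [await self.queue.get()]
            deadline = asyncio.get_running_loop().time() + self.max_wait_s
            while len(batch) < self.max_batch:  # drain the wait window
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            outputs = engine([p for p, _ in batch])  # one batched call
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)

async def main():
    batcher = MicroBatcher()
    # Stand-in engine: "generates" by upper-casing each prompt.
    worker = asyncio.create_task(batcher.run(lambda ps: [p.upper() for p in ps]))
    results = await asyncio.gather(batcher.submit("hello"), batcher.submit("world"))
    worker.cancel()
    return results

print(asyncio.run(main()))  # → ['HELLO', 'WORLD']
```

Doing this in Rust avoids the GIL contention and per-request interpreter overhead that the Python event loop still pays, which is where the CPU savings come from.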

modular-rs – The Modular Platform

The Modular Platform is an open, fully integrated suite of AI libraries and tools that accelerates model serving and scales GenAI deployments. It abstracts away hardware complexity so you can run the most popular open models with industry-leading GPU and CPU performance, without code changes.


Let's Connect

If you're working on inference, kernels, or just trying to ship LLMs without burning a cloud budget—say hi. I'm always up for swapping notes.

⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣤⡶⠿⠿⠷⣶⣄⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣰⡿⠁⠀⠀⢀⣀⡀⠙⣷⡀⠀⠀⠀
⠀⠀⠀⡀⠀⠀⠀⠀⠀⢠⣿⠁⠀⠀⠀⠘⠿⠃⠀⢸⣿⣿⣿⣿
⠀⣠⡿⠛⢷⣦⡀⠀⠀⠈⣿⡄⠀⠀⠀⠀⠀⠀⠀⣸⣿⣿⣿⠟
⢰⡿⠁⠀⠀⠙⢿⣦⣤⣤⣼⣿⣄⠀⠀⠀⠀⠀⢴⡟⠛⠋⠁⠀
⣿⠇⠀⠀⠀⠀⠀⠉⠉⠉⠉⠉⠁⠀⠀⠀⠀⠀⠈⣿⡀⠀⠀⠀
⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢹⡇⠀⠀⠀
⣿⡆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣼⡇⠀⠀⠀
⠸⣷⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢠⡿⠀⠀⠀⠀
⠀⠹⣷⣤⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣰⡿⠁⠀⠀⠀⠀
⠀⠀⠀⠉⠙⠛⠿⠶⣶⣶⣶⣶⣶⠶⠿⠟⠛⠉⠀⠀⠀⠀⠀⠀

Pinned Repositories

  1. Helios-Engine (Rust · 43 stars · 6 forks)
     A powerful and flexible Rust framework for building LLM-powered agents with tool support, chat capabilities, and easy configuration management. Create intelligent agents that can i…

  2. Chimera (Python)
     A high-performance LLM serving stack spun off from SGLang, with a kernel strategy centered on CuTe DSL and CUTLASS 4.x.

  3. AI-Kernel-learning (Python · 59 stars · 10 forks)
     A comprehensive learning repository designed to transform software engineers into expert AI kernel developers, focusing on the cutting-edge technologies required for developing high-performan…

  4. modular-rs (Mojo)
     The Modular Platform (includes MAX & Mojo)

  5. Axion (Rust · 1 star)
     A high-performance LLM serving platform built with Rust that provides OpenAI-compatible APIs for chat completions, embeddings, and reranking. Designed for production environments, Axion de…

  6. Zeroum (Python)
     A fast and easy-to-use library for LLM inference and serving. Based on vLLM but enhanced with a Rust serving layer that bypasses concurrency limits and allows enterprise-level serving wit…