maeri-project/FEATHER

FEATHER

FEATHER: End-to-end deployable reconfigurable AI accelerator.

Open-Architecture, Open-ISA, Open-Compilation Flow

| paper | code | tutorial | slide | ISCA Talk | Deep Dive Talk |

FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching

License: MIT

What's FEATHER?

FEATHER is the first reconfigurable AI Accelerator that supports (dataflow, layout) co-switching per layer, with the architecture shown below.

FEATHER architecture

Functionality-wise, FEATHER supports arbitrary data reordering, which enables arbitrary layout changes.

Arbitrary reordering function

Performance-wise, FEATHER implements Reorder In Reduction (RIR) to hide the reordering latency behind the critical path.

Reorder In Reduction (RIR) to hide the reordering latency

  • For dataflow switching, FEATHER proposes a reconfigurable 2D compute array, termed NEST (Neural Engine for Spatial forwarding and Temporal reduction)
  • For layout switching, FEATHER proposes a reconfigurable NoC, termed BIRRD (Butterfly Interconnect for Reordering in Reduction of Dataflow)
  1. BIRRD supports arbitrary reordering in functionality.
  2. BIRRD implements data reordering in data reduction (RIR) to hide reordering latency behind the critical path.
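
As a rough functional illustration of the two BIRRD properties above, the following Python sketch sums groups of partial sums and writes each result directly to an arbitrary destination slot in the same pass, so no separate reordering step follows the reduction. It models behavior only, not the butterfly switches or their timing; `reduce_and_reorder`, `groups`, and `dest` are hypothetical names and are not taken from the FEATHER code base.

```python
def reduce_and_reorder(partial_sums, groups, dest):
    """Sum each group of partial sums and place the result at dest[g].

    partial_sums: one value per input port (hypothetical functional model).
    groups: list of index lists; groups[g] names the inputs reduced together.
    dest: dest[g] is the output slot for group g (any permutation is allowed).
    """
    if sorted(dest) != list(range(len(groups))):
        raise ValueError("dest must be a permutation of the output slots")
    out = [0] * len(groups)
    for g, members in enumerate(groups):
        # Reduction and placement happen in one step: the sum lands in its
        # final slot instead of being reordered afterwards.
        out[dest[g]] = sum(partial_sums[i] for i in members)
    return out


partial_sums = [1, 2, 3, 4, 5, 6, 7, 8]
groups = [[0, 1], [2, 3], [4, 5], [6, 7]]
dest = [2, 0, 3, 1]  # arbitrary output layout chosen for illustration
result = reduce_and_reorder(partial_sums, groups, dest)
print(result)
```

Changing `dest` changes the output layout without adding a second pass over the data, which is the idea behind hiding reordering inside reduction.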

News

We augmented LayoutLoop with more precise layout modeling and better usability. LayoutLoop is now integrated into the official NVlabs/Timeloop; check out this PR and this PR.

The MINISA toolchain (ISPASS'26) is now bundled in this repository under minisa/ — a parametric ISA, compiler, and verification flow on top of FEATHER+ that achieves 474–9,331× compression over direct micro-configuration.

Repository Contents

This repository hosts the full FEATHER stack — the original ISCA'24 hardware artifacts plus the MINISA ISPASS'26 ISA toolchain. The table below maps every top-level directory to its functionality.

| Path | Functionality |
| --- | --- |
| minisa/ | MINISA ISA toolchain. Parametric ISA spec, mapping–layout co-search, trace generation, and per-cycle config-stream expansion for FEATHER+. |
| End2end_Deployment/ | End-to-end FPGA (ZCU104) deployment of FEATHER running ResNet-50 — pre-built bitstream (FEATHER.bit), hardware handoff (FEATHER.hwh), and the orchestration notebook (feather.ipynb). |
| LayoutLoop/ | Layout-aware dataflow DSE. TimeLoop fork augmented with precise layout-based memory modeling and a layout–mapping co-search algorithm. Includes ready-to-run configurations for FEATHER, SIGMA, SIMBA, Eyeriss, Edge-TPU, NVDLA, MTIA-like, Medusa-like, and pre-run search results. |
| FEATHER_RTL/ | ASIC RTL for the FEATHER on-chip computation block plus pre-collected synthesis and PnR reports (Synopsys DC + Cadence Innovus) covering 4×4 through 64×128 PE arrays. |
| figure/ | Architecture and result figures referenced by the README and paper. |
| results_generation.py | One-shot script that regenerates every figure in the ISCA'24 paper from the pre-collected numbers (~3 minutes). |
| artifact_evaluation_isca.md | Step-by-step instructions for the ISCA'24 artifact evaluation (Figures 12, 13, 14). |

minisa/ — MINISA ISA Toolchain

The MINISA toolchain is the active software stack for compiling matrix-multiply and convolution workloads down to FEATHER+ control words. The full ISA specification, instruction formats, and 6-stage compilation pipeline are documented in minisa/README.md. At a glance:

What MINISA Provides

  • An 8-instruction, variable-width ISA (3-bit opcode): SetWVNLayout, SetIVNLayout, SetOVNLayout, ExecuteStreaming, ExecuteMapping, Load, Store, plus a reserved Activation opcode. Field widths scale as $O(\log\text{AH} + \log\text{AW} + \log\text{SRAM})$; DMA instructions are a fixed 33 bits.
  • A mapping–layout co-search compiler that lowers $C[M,N] = A[M,K] \times B[K,N]$ (and im2col-converted convolutions) into MINISA traces by jointly choosing tile sizes, VN groupings, PE-column combining, and the $\text{order}_W \times \text{order}_I \times \text{order}_O$ layout permutation that is bank-conflict-free.
  • A two-tier verification harness: Checking Point 1 simulates the ISA trace and compares against numpy.matmul; Checking Point 2 expands the trace to a per-cycle config stream and verifies cycle counts and outputs. 450 / 450 workload-config pairs pass in the bundled evaluation suite.
  • Cycle-accurate performance modeling of the 5-engine async FEATHER+ pipeline (DMA load, weight load, streaming, BIRRD drain, DMA store) and instruction-fetch overhead.
  • GPU/TPU baselines and analysis comparing FEATHER+ against TPUv6e8 (8 × 256×256 PEs) and Ampere+ GPUs across the same workload set, with adaptive workload partitioning.
  • Multi-layer search that resolves $\text{OVN}^{(i)} = \text{IVN}^{(i+1)}$ inter-layer constraints, falling back to off-chip re-layout when no consistent layout exists.
  • An interactive GUI (python -m minisa gui) for visualising the PE array, BIRRD switching, and per-instruction performance.
  • Publication-quality plotting scripts (minisa/figure_drawer/) for the instruction-reduction, latency-breakdown, speedup, and GPU/TPU comparison figures.
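
To make the lowering and the first verification tier more concrete, here is a minimal pure-Python sketch of the general idea: a convolution is unfolded with im2col into the $C[M,N] = A[M,K] \times B[K,N]$ form and the result is compared against a direct reference. It assumes a single channel, stride 1, and no padding, and it uses a hand-written reference in place of numpy.matmul to stay dependency-free. All function names are hypothetical and nothing here is taken from the MINISA sources.

```python
def im2col(image, kh, kw):
    """Unfold a 2D image into rows, one per output pixel (stride 1, no padding)."""
    h, w = len(image), len(image[0])
    rows = []
    for r in range(h - kh + 1):
        for c in range(w - kw + 1):
            rows.append([image[r + i][c + j] for i in range(kh) for j in range(kw)])
    return rows  # shape [M, K] with M = out_h * out_w and K = kh * kw


def matmul(a, b):
    """Plain reference for C[M,N] = A[M,K] x B[K,N]."""
    return [[sum(a[m][k] * b[k][n] for k in range(len(b)))
             for n in range(len(b[0]))] for m in range(len(a))]


def direct_conv(image, kernel):
    """Direct sliding-window convolution (cross-correlation, as used in DNNs)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)] for r in range(out_h)]


image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2]]
kernel = [[1, 0],
          [0, -1]]

a = im2col(image, 2, 2)                   # A[M=6, K=4]
b = [[v] for row in kernel for v in row]  # B[K=4, N=1]
c = matmul(a, b)                          # C[M=6, N=1]

# Fold the M dimension back into the 2D output shape before comparing.
out_h, out_w = len(image) - 2 + 1, len(image[0]) - 2 + 1
lowered = [[c[r * out_w + col][0] for col in range(out_w)] for r in range(out_h)]
reference = direct_conv(image, kernel)
print("match:", lowered == reference)
```

The same compare-against-a-reference pattern applies to matrix-multiply workloads directly, with the im2col step skipped.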

Quick Start

conda create --name minisa python=3.13 && conda activate minisa
pip install numpy pandas pyyaml matplotlib

# Single-workload search
python -m minisa search -M 24 -K 48 -N 512 --ah 16 --aw 16 --verify

# Full evaluation: 50 workloads × 9 hardware configs (96+ GB RAM for --jobs 16)
python -m minisa evaluate \
    --csv minisa/MINISA_Evaluation_Setup_Full.csv \
    --out-dir out_eval_full \
    --ah 4,8,16 --aw "4,16,64/8,32,128/16,64,256" \
    --verify --jobs 16

# Generate paper figures
python -m minisa plot \
    --bench-csv out_eval_full/benchmark_summary.csv \
    --inst-csv  out_eval_full/inst_compare.csv \
    --out-dir   out_eval_full

# Interactive visualization
python -m minisa gui

The CLI also exposes search, instcmp (MINISA vs explicit micro-instruction comparison), compare (GPU/TPU analysis), and a JSON template mode for ACT-style layout-constrained search. Supported PE-array configurations are $\text{AH} \in \{4, 8, 16\}$ paired with three $\text{AW}$ widths each, for nine total designs.
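
The nine designs can be enumerated from the `--ah` and `--aw` values used in the full-evaluation command. The sketch below assumes that each slash-separated `--aw` group pairs positionally with one `--ah` value; that reading is an interpretation of the flag format, not documented behavior, and `enumerate_designs` is a hypothetical helper rather than part of the minisa package.

```python
def enumerate_designs(ah_arg, aw_arg):
    """Pair each AH value with its slash-separated group of AW values."""
    heights = [int(x) for x in ah_arg.split(",")]
    width_groups = [[int(x) for x in grp.split(",")] for grp in aw_arg.split("/")]
    if len(heights) != len(width_groups):
        raise ValueError("expected one AW group per AH value")
    # Assumed pairing: the i-th AW group belongs to the i-th AH value.
    return [(ah, aw) for ah, widths in zip(heights, width_groups) for aw in widths]


designs = enumerate_designs("4,8,16", "4,16,64/8,32,128/16,64,256")
for ah, aw in designs:
    print(f"AH={ah}, AW={aw}")
print(len(designs), "designs")
```

Under that assumption, 50 workloads across these nine designs give the 450 workload-config pairs covered by the full evaluation.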

End2end_Deployment/ — FPGA Deployment

A self-contained Vivado bitstream and PYNQ notebook that run weight-shared ResNet-50 inference end-to-end on a Xilinx ZCU104. The artifact is exposed through a hosted Jupyter endpoint so reviewers can replay the FPGA experiment without local board access. See End2end_Deployment/README.md for credentials and the per-layer comparison flow against the Xilinx DPU baseline.

LayoutLoop/ — Dataflow + Layout Design Space Exploration

LayoutLoop extends TimeLoop with realistic layout-based memory modeling, per-dataspace physical ranks, and a layout–mapping co-search algorithm. The folder ships with:

  • A buildable LayoutLoop tree (layoutloop/) — scons -j<N> inside a TimeLoop-compatible toolchain.
  • Configuration packs (configurations/) for FEATHER and seven baselines (SIGMA, SIMBA, Edge-TPU systolic, Eyeriss-like, NVDLA-like, MTIA-like, and Medusa-like) covering both regular and depth-wise convolution constraint sets.
  • Layer-shape collections for ResNet-18/50, MobileNet-V3, BERT (incl. conv-form), AlexNet, and small/large VGG.
  • prerun_results/ containing the searched dataflows and the four CSVs (utilization.csv, cycle.csv, pj_compute.csv, and the merged interleave_layoutloop_search.csv) used to populate Figure 13.
  • A Docker image reference (feather_layoutloop) for one-shot reproducibility.

LayoutLoop has since been upstreamed into NVlabs/Timeloop (PRs #301 and #304).

Scope note. LayoutLoop is an analytical cost model used to compare FEATHER against baseline accelerators that have no deployable compilation flow of their own. It does not emit executable code. For realistic compilation targeting FEATHER+ hardware, use the MINISA toolchain in minisa/.

FEATHER_RTL/ — ASIC Verilog and Synthesis Reports

Verilog sources for the on-chip computation block (NEST + BIRRD + buffers + controller) used for ASIC synthesis and PnR, plus pre-collected report bundles. The headline area / power numbers (1 GHz target, all configurations):

| Config | Area (µm²) | Power (mW) |
| --- | --- | --- |
| 64×128 | 36,920,519.7 | 26,400.00 |
| 64×64 | 18,389,176.2 | 13,200.00 |
| 32×32 | 2,727,906.7 | 961.70 |
| 16×32 | 965,665.1 | 655.55 |
| 16×16 | 475,897.2 | 323.48 |
| 8×8 | 97,976.5 | 65.25 |
| 4×4 | 24,694.0 | 16.28 |
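
For a quick comparison across array sizes, the numbers above can be normalized by PE count. The sketch below copies the table values and divides by rows × columns. Because the reported block also contains BIRRD, buffers, and the controller, the result is a coarse per-PE share of the whole block rather than the cost of a single PE. The variable names are illustrative only.

```python
# (rows, cols) -> (area in um^2, power in mW), copied from the table above
reports = {
    (64, 128): (36_920_519.7, 26_400.00),
    (64, 64): (18_389_176.2, 13_200.00),
    (32, 32): (2_727_906.7, 961.70),
    (16, 32): (965_665.1, 655.55),
    (16, 16): (475_897.2, 323.48),
    (8, 8): (97_976.5, 65.25),
    (4, 4): (24_694.0, 16.28),
}

per_pe = {}
for (rows, cols), (area, power) in reports.items():
    pes = rows * cols  # number of PEs in the array
    per_pe[(rows, cols)] = (area / pes, power / pes)
    print(f"{rows}x{cols}: {area / pes:9.1f} um^2/PE, {power / pes:6.3f} mW/PE")
```

For example, the 4×4 row works out to roughly 1,543 µm² and 1.02 mW per PE.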

Reports include feather_top_area.rpt, feather_top_dw_area.rpt, feather_top_power.rpt, and feather_top_timing.rpt per configuration. Re-running synthesis requires Synopsys Design Compiler and Cadence Innovus (end-to-end ≈5 days).

results_generation.py

A single Python file that regenerates every quantitative figure in the ISCA'24 paper from the embedded pre-collected results. Each figure is a top-level function (figure_2(), figure_12(), figure_13(), …); call them individually if you only need one plot. Total runtime for all figures is approximately 3 minutes after the dependencies above are installed.
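
One way to call only selected figure functions is a small dispatcher like the sketch below. It assumes the figure functions take no arguments, as the `figure_2()` style above suggests, and that the module can be imported without triggering every figure. Neither point is confirmed here, and `run_figures` is a hypothetical helper. A stand-in object replaces the real module so the sketch runs anywhere.

```python
import types


def run_figures(module, names):
    """Call the named zero-argument figure functions on an imported module."""
    missing = [n for n in names if not callable(getattr(module, n, None))]
    if missing:
        raise AttributeError(f"no such figure function(s): {missing}")
    return [getattr(module, n)() for n in names]


# Stand-in for the real module; in the repository root you would pass
# importlib.import_module("results_generation") instead (an assumption).
stub = types.SimpleNamespace(figure_12=lambda: "figure 12 drawn")
outputs = run_figures(stub, ["figure_12"])
print(outputs)
```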

Artifact Evaluation

For the step-by-step ISCA'24 artifact evaluation flow (Pre-run reproduction, Experiment Sets 1–3, FPGA login, LayoutLoop DSE, ASIC synthesis), see artifact_evaluation_isca.md.

Maintainers

Citations

@inproceedings{tong2024FEATHER,
  author    = {Tong, Jianming and Itagi, Anirudh and Chatarasi, Prasanth and Krishna, Tushar},
  title     = {FEATHER: A Reconfigurable Accelerator with Data Reordering Support
               for Low-Cost On-Chip Dataflow Switching},
  booktitle = {Proceedings of the 51st Annual International Symposium on Computer Architecture},
  series    = {ISCA '24},
  year      = {2024},
  publisher = {Association for Computing Machinery},
  location  = {Argentina},
  keywords  = {flexible accelerator, dataflow-layout coswitching},
}

@inproceedings{tong2026MINISA,
  author    = {Tong, Jianming and Li, Yujie and Jain, Devansh and Mendis, Charith and Krishna, Tushar},
  title     = {MINISA: Minimal Instruction Set Architecture for Next-gen Reconfigurable Inference Accelerator},
  booktitle = {Proceedings of the 34th Annual International Symposium on Performance Analysis of Systems and Software},
  series    = {ISPASS '26},
  year      = {2026},
  location  = {Seoul, Korea},
  keywords  = {minimal instruction set architecture, reconfigurable accelerator, virtual neurons},
}
