Chimera: MoE + Differential Attention
Chimera is a mixture-of-experts model with differential attention, demonstrating sparse expert routing on edge accelerators. Each token is routed to top-k experts via a learned gating network, with differential attention replacing standard softmax attention.
Abstract
Background
Chimera is a mixture-of-experts (MoE) model with differential attention, built to demonstrate sparse expert routing on edge accelerators. Each token is routed to its top-k experts by a learned gating network, and differential attention replaces standard softmax attention.
Approach
The model was compiled from a trained AxonML checkpoint through the Hailo Dataflow Compiler (DFC) targeting Hailo-10H silicon. Post-training INT8 quantization was applied during the DFC compilation pass. The resulting Hailo Executable Format (HEF) binary executes on Hailo's fixed-function dataflow architecture with deterministic latency and zero framework overhead. No runtime model conversion, graph optimization, or JIT compilation is required at the edge.
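As a rough illustration of what post-training 8-bit quantization involves, the sketch below calibrates an affine scale and zero point from sample activations and then quantizes and dequantizes values. It is a generic textbook scheme, not the DFC's actual calibration algorithm; the function names and the unsigned 0..255 range are assumptions made for the example.

```python
def calibrate(samples, qmin=0, qmax=255):
    """Derive an affine (scale, zero_point) pair from calibration samples.

    The range is widened to include 0.0 so that zero is exactly representable.
    """
    lo = min(min(samples), 0.0)
    hi = max(max(samples), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero range
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a float to an 8-bit integer code, clamping to the representable range."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an integer code back to its approximate float value."""
    return (q - zero_point) * scale


calibration_samples = [-1.0, 0.0, 1.0, 3.0]  # illustrative values only
scale, zero_point = calibrate(calibration_samples)
codes = [quantize(x, scale, zero_point) for x in calibration_samples]
```

With this scheme each value is recovered to within one quantization step (`scale`) after a round trip, which is the error a calibration pass tries to keep small.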
Results
On production Hailo-10H hardware, Chimera achieves 360.64 FPS (about 361 FPS) throughput with 2.563 ms hardware latency and 3.299 ms end-to-end latency. Average die temperature during sustained inference was 53.5 °C.
Conclusion
These measurements indicate that Chimera's latency and throughput are compatible with real-time edge deployment of mixture-of-experts workloads. The model is production-ready as a single HEF binary with no external dependencies beyond the HailoRT runtime.
| Field | Value |
|---|---|
| Model | Chimera |
| Domain | Mixture of Experts |
| Architecture | MoE + Differential Attention |
| Target silicon | Hailo-10H |
| DFC compiler | 5.3.0 |
| Framework | AxonML (pure-Rust deep-learning framework) |
| Author | Andrew Jewell Sr. · ORCID 0009-0005-2158-7060 |
| Organization | AutomataNexus LLC · Fort Wayne, Indiana |
| Date | April 29, 2026 |
Executive overview
Chimera is a mixture-of-experts model compiled to Hailo-10H silicon via the AxonML framework and the Hailo Dataflow Compiler. The model runs as a single Hailo Executable Format (HEF) binary with deterministic latency on Hailo's fixed-function dataflow architecture.
Architecture
Chimera has 2 MoE layers with 4 experts each (top-2 routing). Differential attention is implemented as paired Conv1D heads whose outputs are subtracted, and each expert is a feed-forward network with ReLU activation.
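The top-2 routing over 4 experts can be sketched in plain Python as follows. This is a generic formulation under stated assumptions, not the compiled graph: the helper names, the tiny dimensions, and the softmax over the selected gate logits are all illustrative, and the on-device model may normalize gate weights differently.

```python
import math


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def top_k_route(gate_logits, k=2):
    """Return (expert_index, weight) pairs for the k largest gate logits.

    Weights are a softmax over the selected logits only (an assumption).
    """
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(x, gate_w, experts, k=2):
    """Sparse MoE forward pass for one token: only the k routed experts run."""
    routes = top_k_route(matvec(gate_w, x), k)
    out = [0.0] * len(x)
    for idx, weight in routes:
        w1, w2 = experts[idx]                # each expert: two dense layers
        y = matvec(w2, relu(matvec(w1, x)))  # feed-forward expert with ReLU
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, routes


# Four toy experts on 2-dimensional tokens; expert i scales its input by i + 1.
identity = [[1.0, 0.0], [0.0, 1.0]]
experts = [(identity, [[c, 0.0], [0.0, c]]) for c in (1.0, 2.0, 3.0, 4.0)]
gate_w = [[0.1, 0.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
output, routes = moe_layer([1.0, 0.0], gate_w, experts)
```

In this toy case the token is routed to experts 1 and 2, and the other two experts are never evaluated, which is where the compute saving of sparse routing comes from.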
Architecture note
All AxonML models targeting Hailo silicon are subject to the constraints of the fixed-function dataflow compiler: no dynamic control flow, no variable-length dimensions, and all activations must be representable in INT8 after calibration. Operations that require softmax hardware (e.g., standard transformer attention) are replaced with hardware-friendly equivalents (depthwise convolution, ReLU gating) where necessary.
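One way to read "paired Conv1D heads with subtraction" is two 1-D convolutions applied to the same sequence, with the second output subtracted from the first. The sketch below shows that idea only; the kernel values, the zero padding, the odd kernel length, and the `lam` weighting factor are assumptions for illustration and are not taken from the Chimera checkpoint.

```python
def conv1d_same(signal, kernel):
    """1-D cross-correlation with zero padding; output length equals input length.

    Assumes an odd kernel length so the padding is symmetric.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def differential_head(signal, kernel_a, kernel_b, lam=1.0):
    """Subtract the second head's response from the first, position by position."""
    a = conv1d_same(signal, kernel_a)
    b = conv1d_same(signal, kernel_b)
    return [ai - lam * bi for ai, bi in zip(a, b)]


sequence = [1.0, 2.0, 3.0]
result = differential_head(sequence, [0.0, 1.0, 0.0], [0.0, 0.5, 0.0])
```

Because the operation uses only multiply-accumulate and subtraction, it avoids the softmax that the constraints above rule out, and all shapes are fixed at compile time.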
Network I/O specification
| Direction | Stream name | Data type | Shape |
|---|---|---|---|
| INPUT | input_layer1 | UINT8 | NHWC(128x1x128) |
| OUTPUT | dense_conv53 | UINT8 | NC(101) |
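The per-frame buffer sizes follow directly from the shapes and the one-byte UINT8 element size. A minimal sketch, assuming the batch dimension is 1 and that the three input dimensions multiply out regardless of which one is height, width, or channel:

```python
from math import prod

INPUT_SHAPE = (128, 1, 128)   # input_layer1, per frame
OUTPUT_SHAPE = (101,)         # dense_conv53, per frame
BYTES_PER_ELEMENT = 1         # UINT8


def frame_bytes(shape, bytes_per_element=BYTES_PER_ELEMENT):
    """Size in bytes of one frame's buffer for the given shape."""
    return prod(shape) * bytes_per_element


input_bytes = frame_bytes(INPUT_SHAPE)
output_bytes = frame_bytes(OUTPUT_SHAPE)
```

Each inference therefore moves 16,384 input bytes to the device and 101 output bytes back, which is useful when sizing host-side buffers.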
Silicon performance
All measurements were taken on production Hailo-10H hardware using hailortcli run with real silicon profiling enabled. Throughput, latency, and thermal figures represent sustained inference under controlled conditions.
| Metric | Measured value |
|---|---|
| Throughput | 360.64 FPS |
| HW Latency (on-die) | 2.563 ms |
| Overall Latency (end-to-end) | 3.299 ms |
| Chip Temperature (min) | 53.0 °C |
| Chip Temperature (avg) | 53.5 °C |
| Chip Temperature (max) | 53.7 °C |
| Target Silicon | Hailo-10H |
| DFC Compiler Version | 5.3.0 |
| HEF Compatibility | HAILO15H, HAILO10H |
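The figures in the table can be related to each other with a few lines of arithmetic. The sketch below derives the frame period from throughput, the gap between end-to-end and on-die latency, and a Little's law estimate of how many frames are in flight. The estimate assumes steady state and that throughput and latency describe the same operating point, which the measurements do not state explicitly.

```python
THROUGHPUT_FPS = 360.64
HW_LATENCY_MS = 2.563
OVERALL_LATENCY_MS = 3.299

# Time between consecutive completed frames at the measured throughput.
frame_period_ms = 1000.0 / THROUGHPUT_FPS

# Latency spent outside the die (transfers and host-side handling).
off_die_latency_ms = OVERALL_LATENCY_MS - HW_LATENCY_MS

# Little's law: average items in flight = throughput x latency.
frames_in_flight = THROUGHPUT_FPS * (OVERALL_LATENCY_MS / 1000.0)
```

This gives a frame period of about 2.773 ms, about 0.736 ms of latency outside the die, and roughly 1.19 frames in flight. A frame period shorter than the end-to-end latency suggests that consecutive frames overlap in the pipeline.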
Deployment
The compiled HEF binary is deployed to edge devices equipped with Hailo-10H accelerator modules. No runtime model conversion, graph optimization, or JIT compilation is required. The HEF executes directly on the dataflow architecture with deterministic latency guarantees.
| Property | Value |
|---|---|
| Target silicon | Hailo-10H |
| HEF compatibility | HAILO15H, HAILO10H |
| DFC compiler version | 5.3.0 |
| Quantization | INT8 (post-training, DFC-optimized calibration) |
| Runtime dependency | HailoRT (hailort) — vendor runtime for PCIe / M.2 dispatch |
| Edge platforms | Raspberry Pi 5 + Hailo AI HAT+ · any M.2 key M host with Hailo-10H |
Deployment procedure
Copy the .hef binary to the target device. The AxonML inference runtime (or the standalone hailortcli run harness) loads the HEF directly into the Hailo-10H dataflow engine over PCIe. No ONNX runtime, TensorFlow Lite interpreter, or Python inference stack is required. Inference begins immediately after HEF load with deterministic per-frame latency.
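For scripted bring-up, the `hailortcli run` harness can be wrapped in a few lines. In the sketch below Python is only a convenience for invoking the CLI and is not part of the inference path; the helper names and the `chimera.hef` file name are hypothetical, and no additional CLI flags are assumed.

```python
import shutil
import subprocess
from pathlib import Path


def build_run_command(hef_path):
    """Return the argv list for `hailortcli run <hef>`."""
    path = Path(hef_path)
    if path.suffix != ".hef":
        raise ValueError(f"expected a .hef file, got {path.name!r}")
    return ["hailortcli", "run", str(path)]


def run_hef(hef_path):
    """Run the HEF through hailortcli; return None when the CLI is not installed."""
    if shutil.which("hailortcli") is None:
        return None
    return subprocess.run(
        build_run_command(hef_path),
        capture_output=True,
        text=True,
        check=False,
    )


command = build_run_command("chimera.hef")  # hypothetical file name
```

Calling `run_hef("chimera.hef")` on a device with HailoRT installed returns a `CompletedProcess` whose `stdout` holds the CLI report; on a machine without the CLI it returns `None` instead of raising.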
Generated from real Hailo-10H silicon measurements · AxonML Framework · April 29, 2026
DISCLAIMER: This document and its contents are proprietary to AutomataNexus LLC. Performance figures are measured on production silicon under controlled conditions. Reproduction or redistribution without written permission is prohibited.