AutomataNexus LLC Whitepaper · April 2026
AxonML · Hailo Silicon Portfolio

AxonML on Hailo
84 models compiled and benchmarked on real silicon.

A comprehensive portfolio of 84 neural network models compiled and benchmarked on Hailo-8 and Hailo-10H fixed-function accelerators, spanning HVAC predictive control, biometric identity, motorsport telemetry, mixture-of-experts, state-space models, novel sequence architectures, and recurrent networks with end-to-end correlation validation against fp32 reference. Every model was measured on production silicon. The document accompanies the AutomataNexus production deployment of these models across active commercial sites.

Author
Andrew Jewell Sr.
Organization
AutomataNexus LLC
Framework
AxonML
Silicon
Hailo-8 · Hailo-10H

Abstract

Background

Edge artificial intelligence demands inference engines that deliver deterministic latency, low power consumption, and high throughput without cloud dependency. Hailo's fixed-function dataflow accelerators — the Hailo-8 (26 TOPS INT8) and Hailo-10H (40 TOPS INT8) — provide dedicated neural-network silicon that executes compiled Hailo Executable Format (HEF) binaries with zero framework overhead. The AxonML framework is a pure-Rust deep-learning system purpose-built for compiling, quantizing, and deploying models to these accelerators across the AutomataNexus production deployment.

Approach

Each model in the portfolio was trained in AxonML, exported to ONNX, and compiled to a Hailo Executable Format (HEF) binary targeting Hailo-8, Hailo-10H, or both, using the Hailo Dataflow Compiler (DFC v3.33.1 for Hailo-8, v5.3.0 for Hailo-10H). Every HEF was deployed to a dual-Pi 5 testbench (Hailo-10H AI HAT+ and Hailo-8 M.2 AI Kit) and benchmarked using hailortcli run with real silicon profiling enabled. Recurrent and novel-architecture models were additionally validated for numerical fidelity by correlating INT8 silicon outputs against fp32 PyTorch reference outputs on matched architectures. No simulation, emulation, or estimated figures are reported — every number in this document was measured on physical hardware.

Results

Across 84 compiled models, throughput ranges from 16.48 FPS (RDT-tiny recurrent-depth transformer at seq=64) to 52,500 FPS (Naiad water-domain Apollo specialist), with a fleet-wide median above 5,000 FPS. Hardware latency ranges from 0.003 ms to 6.292 ms. All 24 Hailo-10H models compiled and ran successfully with thermal readings between 49°C and 60°C. All 60 Hailo-8 models compiled cleanly. Three large language models were additionally evaluated on Hailo-10H via the hailo-ollama pipeline: Llama 3.2 1B at 9.98 tok/s, DeepSeek-R1 1.5B at 7.41 tok/s, and Qwen3 1.7B at 4.84 tok/s.

Recurrent architectures were validated end-to-end against fp32 reference: a single-timestep GRU cell achieved Pearson correlation r=0.917 against fp32 reference, with 100% argmax agreement across 200 test sequences when the small classifier heads were retained in fp32 on the host CPU. A 60-step unrolled GRU encoder reached r=0.9997 (MAE 0.005). A 60-step unrolled LSTM encoder reached r=0.962 with 90.5% all-three-head argmax agreement. Novel AutomataNexus architectures (Trident 1.58-bit ternary, Hydra SSM+attention hybrid, Chimera MoE+differential-attention) were additionally validated on Hailo-10H silicon at 3,476-3,689 FPS and on Hailo-8 silicon at 29,049-34,715 FPS.

Conclusion

These measurements confirm that the AxonML + Hailo pipeline is production-ready for large-scale edge AI deployment. The expanded portfolio spans HVAC predictive control, multi-agent diagnostic suites, biometric identity, autonomous racing telemetry, environmental audio, ancient-language translation, mixture-of-experts, state-space models, recurrent neural networks with end-to-end validation, and large language models — all running on commodity ARM SBCs (Raspberry Pi 5) with Hailo AI HAT+ and Hailo-8 M.2 accelerator modules.

Total compiled models: 84 (24 Hailo-10H + 60 Hailo-8) + 3 LLMs
Target silicon: Hailo-8 (26 TOPS) · Hailo-10H (40 TOPS)
Compilation: Hailo Dataflow Compiler (DFC)
DFC compiler (H8): v3.33.1
DFC compiler (H10): v5.3.0
Training framework: AxonML (pure-Rust deep-learning framework)
Quantization: INT8 post-training (DFC calibration pass)
Edge platform: Raspberry Pi 5 + Hailo AI HAT+ (M.2 key M)
Author: Andrew Jewell Sr. · ORCID 0009-0005-2158-7060
Organization: AutomataNexus LLC · Fort Wayne, Indiana
Date: April 2026

Executive overview

The AxonML Hailo silicon portfolio comprises 84 neural network models compiled to Hailo Executable Format (HEF) binaries and benchmarked on production Hailo-8 and Hailo-10H accelerators, plus 3 large language models evaluated via hailo-ollama.

84 models
Compiled portfolio
2 chips
Hailo-8 + Hailo-10H
52,500 FPS
Peak throughput (Naiad, H8)
10 classes
Workload domains
0.003 ms
Min HW latency
6.292 ms
Max HW latency
24 H10 models
Hailo-10H portfolio
60 H8 models
Hailo-8 portfolio

The portfolio spans ten distinct application domains. The largest segment is HVAC predictive control, with site-specific models deployed across Warren Healthcare Campus, Akron Public Library, Huntington University, Hopebridge Therapy Center, First Church of God, and NE Realty Group, plus the eight-model Apollo multi-agent diagnostic suite (the Apollo coordinator, five domain specialists, the Colossus aggregator, and the Gaia safety validator). The remaining domains include biometric identity (Aegis suite: face, fingerprint, and iris recognition), autonomous racing telemetry (ATLAS on Toyota GR Cup data, dual-target H8 + H10), environmental audio (BirdClef SedNet for avian biodiversity monitoring), ancient-language translation (Nabu Akkadian encoder), anomaly detection (Sentinel), occupancy sensing (Motion Classifier), object detection (Detector), aerial perception (NexusWatch suite: Igigi, Namtar, Shamash, Nisaba), novel sequence architectures (Mamba SSM, Hydra SSM+Attention, Chimera MoE, Trident TCN and 1.58-bit ternary, Sparse Autoencoders), and recurrent networks with end-to-end correlation validation (GRU and LSTM cells and unrolled encoders).

Every model in this portfolio was compiled from a trained AxonML checkpoint through the Hailo Dataflow Compiler, quantized to INT8, and benchmarked with hailortcli run on physical silicon. No estimated, simulated, or theoretical performance numbers appear in this document.
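As a rough illustration of what INT8 post-training quantization with a calibration pass involves, the sketch below derives a scale and zero point from a handful of calibration samples and round-trips values through the INT8 range. It is a generic affine scheme written for this document, not the DFC's actual algorithm; the function names and the sample values are hypothetical.

```python
def calibrate(samples, qmin=-128, qmax=127):
    """Derive (scale, zero_point) from the min/max of calibration samples."""
    lo = min(min(samples), 0.0)  # include 0 so that zero is exactly representable
    hi = max(max(samples), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero range
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to the nearest INT8 code, clamping to the representable range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an INT8 code back to its float value."""
    return (q - zero_point) * scale


# Made-up calibration activations, for illustration only.
calibration = [-1.0, -0.25, 0.0, 0.5, 3.0]
scale, zero_point = calibrate(calibration)
roundtrip = [dequantize(quantize(x, scale, zero_point), scale, zero_point)
             for x in calibration]
max_error = max(abs(a - b) for a, b in zip(calibration, roundtrip))
```

In this scheme the round-trip error of any in-range value stays within one quantization step (`scale`), which is why the spread of the calibration data matters: a wider calibrated range means a coarser step.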

Hailo-10H portfolio

Twenty-four models were compiled and benchmarked on the Hailo-10H (40 TOPS INT8) accelerator via DFC v5.3.0. The Hailo-10H supports the HAILO15H/HAILO10H architecture family and provides thermal telemetry during inference. Models are organized by family: foundational portfolio (the original 11), recurrent architectures with correlation validation, novel research architectures, and the NexusWatch aerial perception suite.

Model | Domain | Architecture | FPS | HW Lat (ms) | E2E Lat (ms) | Avg Temp (°C)
Argus (Aegis) | Biometric Security | Conv2D Iris Encoder | 3,656.71 | 0.044 | 0.552 | 51.4
Ariadne (Aegis) | Biometric Security | Residual Conv2D Classifier | 3,511.15 | 0.086 | 0.628 | 52.1
ATLAS | Autonomous Racing | Multi-Scale Depthwise Temporal Network | 1,766.81 | 0.696 | 1.262 | 53.3
BirdClef SedNet | Environmental Audio | Sound Event Detection Network | 3,247.66 | 0.728 | 1.336 | 57.0
Chimera | Mixture of Experts | MoE + Differential Attention | 360.64 | 2.563 | 3.299 | 53.5
Hydra | Hybrid Architecture | SSM + Depthwise Attention Hybrid | 855.62 | 0.906 | 1.615 | 54.2
Mamba SSM | Sequence Modeling | Selective State Space Model | 2,614.06 | 0.627 | 1.342 | 58.6
Mnemosyne (Aegis) | Biometric Security | Multi-Scale Conv2D Feature Extractor | 2,769.80 | 0.087 | 0.866 | 56.3
Nabu | Natural Language Processing | Temporal Convolutional Encoder | 669.31 | 1.721 | 3.219 | 58.8
Trident 1.58-bit (BitNet b1.58) | Large Language Model | Ternary Quantized TCN | 3,009.91 | 0.239 | 0.948 | 58.1
Trident TCN | Large Language Model | Temporal Convolutional Network | 2,109.46 | 0.218 | 1.069 | 57.2
Table — Hailo-10H silicon benchmark results, measured on production hardware.

Argus (Aegis)

Argus is the iris recognition component of the Aegis biometric suite. It encodes 64×64 near-infrared iris images into discriminative feature vectors for identity verification in access-controlled environments.

3,657 FPS
Throughput
0.044 ms
HW latency
0.552 ms
E2E latency
51.4 °C
Chip temp (avg)
DFC Profiler Report — Argus (Aegis) on Hailo-10H

Ariadne (Aegis)

Ariadne is the fingerprint authentication component of the Aegis biometric suite. It classifies fingerprint minutiae patterns from 64×64 grayscale scans into identity embeddings, running entirely on-device for zero-trust physical access control.

3,511 FPS
Throughput
0.086 ms
HW latency
0.628 ms
E2E latency
52.1 °C
Chip temp (avg)
DFC Profiler Report — Ariadne (Aegis) on Hailo-10H

ATLAS

ATLAS (Autonomous Telemetry Learning and Actuation System) is a real-time racing telemetry model trained on Toyota GR86 track data. It predicts optimal throttle, brake, and steering commands from multi-sensor vehicle state inputs at >1,700 FPS on Hailo-10H.

1,767 FPS
Throughput
0.696 ms
HW latency
1.262 ms
E2E latency
53.3 °C
Chip temp (avg)
DFC Profiler Report — ATLAS on Hailo-10H

BirdClef SedNet

BirdClef SedNet is a 234-species avian sound event detection model trained on the BirdCLEF competition dataset. It classifies mel-spectrogram audio frames into bird species labels for biodiversity monitoring at remote edge sensor stations.

3,248 FPS
Throughput
0.728 ms
HW latency
1.336 ms
E2E latency
57.0 °C
Chip temp (avg)
DFC Profiler Report — BirdClef SedNet on Hailo-10H

Chimera

Chimera is a mixture-of-experts model with differential attention, demonstrating sparse expert routing on edge accelerators. Each token is routed to top-k experts via a learned gating network, with differential attention replacing standard softmax attention.
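The top-k routing step described above can be sketched in a few lines. This is a minimal illustration of generic top-k gating, assuming softmax gate scores that are renormalized over the selected experts; the toy experts, the gate logits, and the function names are hypothetical, and Chimera's learned gating network and differential attention are not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits, k):
    """Return [(expert_index, weight)] for the k best experts; weights sum to 1."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, gate_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; other experts are never run."""
    return sum(w * experts[i](x) for i, w in route_top_k(gate_logits, k))


# Toy scalar "experts" standing in for expert sub-networks.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
gate_logits = [0.1, 2.0, -1.0, 2.0]  # in a real model these come from the gate
routing = route_top_k(gate_logits, k=2)
y = moe_forward(3.0, gate_logits, experts, k=2)
```

Here experts 1 and 3 tie for the highest gate score, so each receives weight 0.5 and the output is the average of their results.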

361 FPS
Throughput
2.563 ms
HW latency
3.299 ms
E2E latency
53.5 °C
Chip temp (avg)
DFC Profiler Report — Chimera on Hailo-10H

Hydra

Hydra is a hybrid architecture combining selective state space modeling with depthwise attention. It demonstrates that SSM and attention mechanisms can be unified on fixed-function accelerators when attention is reformulated as depthwise convolution over query-key products.

856 FPS
Throughput
0.906 ms
HW latency
1.615 ms
E2E latency
54.2 °C
Chip temp (avg)
DFC Profiler Report — Hydra on Hailo-10H

Mamba SSM

Mamba SSM is a novel selective state space model compiled for Hailo silicon. It implements gated depthwise convolutions with parallel selective scan — a hardware-friendly alternative to transformer attention that maintains linear-time sequence processing.
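To make the "parallel selective scan" idea concrete, the sketch below shows why a linear recurrence of the form h_t = a_t * h_{t-1} + b_t can be evaluated as a prefix scan: composing two such steps is an associative operation. It is a scalar toy, assuming a_t and b_t stand for the input-dependent decay and input terms of a selective SSM; the names are hypothetical and it does not reflect how the compiled Mamba SSM graph is laid out on Hailo silicon.

```python
def sequential_scan(a, b, h0=0.0):
    """h_t = a_t * h_{t-1} + b_t, evaluated one step at a time."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def combine(left, right):
    """Compose two affine steps h -> a*h + b (left first); this is associative."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def doubling_scan(a, b, h0=0.0):
    """Inclusive prefix scan by recursive doubling; each round's updates are independent."""
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(len(elems))
        ]
        step *= 2
    # elems[t] now holds the composed map from h0 to h_t.
    return [a_t * h0 + b_t for a_t, b_t in elems]


a = [0.9, 0.5, 0.8, 1.1, 0.3]
b = [1.0, 2.0, -1.0, 0.5, 4.0]
seq = sequential_scan(a, b, h0=0.25)
par = doubling_scan(a, b, h0=0.25)
```

Both functions produce the same hidden states; the doubling form needs only about log2(n) rounds, each of which could run in parallel.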

2,614 FPS
Throughput
0.627 ms
HW latency
1.342 ms
E2E latency
58.6 °C
Chip temp (avg)
DFC Profiler Report — Mamba SSM on Hailo-10H

Mnemosyne (Aegis)

Mnemosyne is the face recognition component of the Aegis biometric authentication suite. It extracts 128-dimensional face embeddings from 64×64 grayscale face crops, enabling sub-millisecond face verification at the network edge without cloud dependency.

2,770 FPS
Throughput
0.087 ms
HW latency
0.866 ms
E2E latency
56.3 °C
Chip temp (avg)
DFC Profiler Report — Mnemosyne (Aegis) on Hailo-10H

Nabu

Nabu is a cuneiform Akkadian language encoder trained on transliterated tablet corpora. It processes variable-length token sequences into fixed-dimensional representations for downstream tasks including sign classification, period dating, and genre tagging.

669 FPS
Throughput
1.721 ms
HW latency
3.219 ms
E2E latency
58.8 °C
Chip temp (avg)
DFC Profiler Report — Nabu on Hailo-10H

Trident 1.58-bit (BitNet b1.58)

Trident 1.58-bit implements BitNet b1.58 ternary quantization ({-1, 0, +1} weights) over the Trident TCN backbone. This reduces weight storage to 1.58 bits per parameter while maintaining prediction quality, achieving near-binary compute efficiency on Hailo's fixed-function multiply-accumulate units.
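The 1.58-bit figure is the information content of a three-valued weight, log2(3) ≈ 1.585. The sketch below illustrates ternary quantization under the assumption that the absmean scheme from the BitNet b1.58 paper is used (scale by the mean absolute weight, round, clip to {-1, 0, +1}); whether Trident applies exactly this recipe is not stated in this document, and the weight values are made up.

```python
import math


def ternarize(weights, eps=1e-8):
    """Absmean ternary quantization: scale by mean |w|, round, clip to {-1, 0, +1}."""
    gamma = sum(abs(w) for w in weights) / len(weights)
    ternary = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return ternary, gamma


weights = [0.42, -0.07, -0.91, 0.03, 0.65, -0.38]  # illustrative values only
ternary, gamma = ternarize(weights)
bits_per_weight = math.log2(3)  # information content of one ternary symbol
```

Small weights collapse to 0 and large ones saturate at ±1, so a matrix multiply against ternary weights reduces to additions, subtractions, and skips, with `gamma` kept as a per-tensor scale.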

3,010 FPS
Throughput
0.239 ms
HW latency
0.948 ms
E2E latency
58.1 °C
Chip temp (avg)
DFC Profiler Report — Trident 1.58-bit (BitNet b1.58) on Hailo-10H

Trident TCN

Trident is a custom 1-billion-parameter language model backbone compiled for edge inference. The TCN variant replaces transformer self-attention with dilated causal convolutions, enabling deterministic O(n) inference on fixed-function neural network accelerators without softmax hardware contention.
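A dilated causal convolution, the building block named above, can be sketched as follows. This is a minimal single-channel illustration in plain Python, not the Trident implementation; the kernel values, dilation schedule, and function names are hypothetical.

```python
def causal_dilated_conv1d(x, kernel, dilation=1):
    """y[t] = sum_j kernel[j] * x[t - j*dilation], with x before t=0 treated as zero."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t - j * dilation
            if idx >= 0:  # never reads future samples, so the layer is causal
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Time steps (current plus past) visible after stacking the given layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = causal_dilated_conv1d(x, kernel=[0.5, 0.5], dilation=2)
rf = receptive_field(kernel_size=3, dilations=[1, 2, 4, 8])
```

Each output depends on a fixed number of earlier inputs, so the cost per time step is constant, and doubling the dilation per layer grows the receptive field exponentially with depth (31 steps for the four-layer schedule shown).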

2,109 FPS
Throughput
0.218 ms
HW latency
1.069 ms
E2E latency
57.2 °C
Chip temp (avg)
DFC Profiler Report — Trident TCN on Hailo-10H

Recurrent architectures — correlation-validated

A family of recurrent networks demonstrating that DFC's INT8 quantization preserves recurrent semantics across long unrolls and per-timestep cells. Each result is validated end-to-end against an architecture-matched fp32 reference, with Pearson correlation on the hidden-state output and argmax agreement when downstream classifier heads are evaluated in fp32 on the host CPU. This validation methodology — INT8 NPU body plus fp32 host classifier head — produced 100% argmax agreement on a 200-sequence test set for the GRU cell, demonstrating production-quality recurrence on Hailo silicon.
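The three validation metrics used throughout this section (Pearson correlation, mean absolute error, and argmax agreement) can be computed as in the sketch below. The helper names and the toy vectors are hypothetical and are not measurements from this portfolio.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def mean_abs_error(a, b):
    """Mean absolute difference between reference and quantized outputs."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def argmax_agreement(logits_a, logits_b):
    """Fraction of sequences whose predicted class (argmax) matches."""
    hits = sum(
        max(range(len(p)), key=p.__getitem__) == max(range(len(q)), key=q.__getitem__)
        for p, q in zip(logits_a, logits_b)
    )
    return hits / len(logits_a)


# Toy data: an fp32 reference hidden state and a slightly perturbed INT8 result.
fp32_hidden = [0.10, -0.40, 0.75, 0.20, -0.90]
int8_hidden = [0.12, -0.37, 0.70, 0.22, -0.88]
r = pearson_r(fp32_hidden, int8_hidden)
mae = mean_abs_error(fp32_hidden, int8_hidden)
agreement = argmax_agreement(
    [[0.1, 2.0, 0.3], [1.5, 1.4, 0.0]],
    [[0.0, 1.8, 0.4], [1.3, 1.4, 0.1]],
)
```

The second toy sequence shows how a near-tied pair of logits can flip the argmax even when the underlying values remain highly correlated.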

GRU cell (per-timestep)

A single-timestep GRU cell with Conv2D projections, Sigmoid reset and update gates, Tanh candidate activation, and element-wise gating. Designed for deployment patterns where the recurrence loop runs on the host CPU and each timestep is offloaded to the NPU as an independent forward pass. Calibration: 1024 real activation samples drawn from the production trace dataset.

2,578 FPS
Throughput (per step)
0.917
Pearson r vs fp32
100%
Argmax (fp32 heads, n=200)
0.170
MAE

CPU-driven 120-step recurrence loop: 12.83 sequences/sec end-to-end, 77.9 ms per full sequence. PCIe round-trip overhead bounds practical throughput well below the per-step ceiling. The hybrid INT8/fp32-head deployment pattern recovers full classification accuracy from the INT8 hidden-state range compression.
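The gap between the per-step figure and the end-to-end sequence rate can be checked with simple arithmetic. The sketch below uses only numbers quoted above; treating the reciprocal of the per-step throughput as a serial per-step time is a simplifying assumption, since pipelined throughput and single-call latency are not the same quantity.

```python
STEPS_PER_SEQUENCE = 120       # recurrence length quoted above
MEASURED_SEQ_PER_SEC = 12.83   # end-to-end sequence rate quoted above
PER_STEP_FPS = 2578.0          # per-step throughput of the GRU cell

measured_ms_per_sequence = 1000.0 / MEASURED_SEQ_PER_SEC
measured_ms_per_step = measured_ms_per_sequence / STEPS_PER_SEQUENCE

# Idealized ceiling if every step ran back to back at the per-step rate.
ceiling_ms_per_step = 1000.0 / PER_STEP_FPS
ceiling_seq_per_sec = PER_STEP_FPS / STEPS_PER_SEQUENCE

# Extra time per step under the serial simplification stated above.
overhead_ms_per_step = measured_ms_per_step - ceiling_ms_per_step
```

Under that simplification the host-driven loop spends roughly 0.65 ms per step against an idealized 0.39 ms, so the measured 12.83 sequences/sec sits well under the roughly 21.5 sequences/sec ceiling.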

GRU cell (fused gates)

An architectural variant of the GRU cell that fuses the three gate projections (reset, update, candidate) into a single combined matmul, then splits and applies the gate-specific activations. This restructuring quantizes more cleanly than the original three-projection topology and produces a tighter range distribution per gate. The fused-gate variant is the recommended deployment topology for both Hailo-8 and Hailo-10H, with near-identical correlation quality on both chips.

1,232 FPS
Throughput (per step)
0.9998
Pearson r vs fp32
96.0%
Raw argmax (n=200)
0.0051
MAE

Cross-silicon: Hailo-8 27,027 FPS / r=0.9998 / 97.5% raw argmax. The matched correlation across both chips suggests DFC's quantization scheme is chip-independent in its numerical fidelity; throughput differences are driven by resource allocation, not precision.
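One way to picture the fused-gate topology is the sketch below, which stacks the reset, update, and candidate weight rows so that each projection is computed once and then split. It assumes the common GRU formulation in which the reset gate scales the hidden contribution to the candidate, omits biases, and uses one stacked projection for the input and one for the hidden state; the exact fusion used by AxonML is not specified here, and all names and weights are hypothetical.

```python
import math


def matvec(m, v):
    """Dense matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gru_step_fused(x, h, w_x, w_h):
    """One GRU step; w_x and w_h each stack reset, update, candidate rows (3*H rows)."""
    hidden = len(h)
    gx = matvec(w_x, x)  # fused input projection, length 3*H
    gh = matvec(w_h, h)  # fused hidden projection, length 3*H
    # Split the fused outputs and apply the gate-specific activations.
    r = [sigmoid(gx[i] + gh[i]) for i in range(hidden)]
    z = [sigmoid(gx[hidden + i] + gh[hidden + i]) for i in range(hidden)]
    n = [math.tanh(gx[2 * hidden + i] + r[i] * gh[2 * hidden + i])
         for i in range(hidden)]
    return [(1.0 - z[i]) * n[i] + z[i] * h[i] for i in range(hidden)]


# Toy sizes: 3 input features, hidden size 2, constant illustrative weights.
w_x = [[0.1] * 3 for _ in range(6)]
w_h = [[-0.2] * 2 for _ in range(6)]
h_next = gru_step_fused([1.0, 2.0, 3.0], [0.2, -0.4], w_x, w_h)
```

In a host-driven deployment this function would be called once per time step, with `h_next` fed back in as `h`.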

GRU 60-step unrolled

A 60-step unrolled GRU encoder that flattens the recurrence into a single feed-forward graph — 60 GRU cells stacked in sequence, with hidden state passed via tensor edges rather than via a host loop. The entire 60-timestep sequence is processed in a single NPU call, eliminating per-step PCIe round-trips. Despite stacking 60 INT8 cells, cumulative quantization error remains bounded.

22.23 FPS
Throughput (full sequence)
0.9997
Pearson r vs fp32
94.0%
Raw argmax
0.005
MAE

19 hardware contexts, 4.4 MB HEF, 78-minute compile via DFC. GRU's single-state recurrence quantizes more cleanly than LSTM's dual h/c state; the near-perfect correlation across 60 stacked cells is an architectural advantage of GRU for long-unroll deployment.

LSTM cell (per-timestep)

A single-timestep LSTM cell with the standard four-gate structure (input, forget, cell, output) implemented via Conv2D projections. Input shape is the concatenation of the current input vector and both prior states (h_{t-1}, c_{t-1}); output is the new state pair (h_t, c_t) concatenated.

2,578 FPS
Throughput (per step)
0.064 ms
HW latency
133 dim
Input vector
128 dim
Output (h+c)

Cross-silicon: Hailo-8 34,224 FPS for the same architecture — over 13× faster than Hailo-10H. The smaller, single-context Hailo-8 dataflow engine handles the lightweight LSTM cell more efficiently than Hailo-10H's multi-context allocator.
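The packed input/output contract of the per-timestep LSTM cell can be sketched as below. The sizes are inferred rather than stated: a 5-feature input and a hidden size of 64 would give 5 + 64 + 64 = 133 inputs and 64 + 64 = 128 outputs, matching the cards above. The gate ordering, the single stacked projection, and all names are assumptions for illustration.

```python
import math

INPUT_DIM, HIDDEN_DIM = 5, 64  # inferred sizes, see the assumptions above


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step_packed(packed, weights, biases):
    """packed = x (5) ++ h_prev (64) ++ c_prev (64); returns h (64) ++ c (64)."""
    H = HIDDEN_DIM
    x_h = packed[:INPUT_DIM + H]      # current input and previous hidden state
    c_prev = packed[INPUT_DIM + H:]   # previous cell state
    # One stacked projection yields the four gate pre-activations (i, f, g, o).
    pre = [sum(w * v for w, v in zip(row, x_h)) + b
           for row, b in zip(weights, biases)]
    i = [sigmoid(p) for p in pre[0:H]]
    f = [sigmoid(p) for p in pre[H:2 * H]]
    g = [math.tanh(p) for p in pre[2 * H:3 * H]]
    o = [sigmoid(p) for p in pre[3 * H:4 * H]]
    c = [f[k] * c_prev[k] + i[k] * g[k] for k in range(H)]
    h = [o[k] * math.tanh(c[k]) for k in range(H)]
    return h + c


# Shape check with all-zero weights: every gate sits at 0.5, the candidate at 0.
weights = [[0.0] * (INPUT_DIM + HIDDEN_DIM) for _ in range(4 * HIDDEN_DIM)]
biases = [0.0] * (4 * HIDDEN_DIM)
packed_in = [0.0] * INPUT_DIM + [0.0] * HIDDEN_DIM + [1.0] * HIDDEN_DIM
packed_out = lstm_step_packed(packed_in, weights, biases)
```

Packing both states into one tensor lets the host feed the 128-value output straight back into the next step's input alongside the next sensor reading.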

LSTM encoder 60-step unrolled

A 60-step unrolled LSTM encoder taking a (60×1×5) sensor sequence and producing the concatenated final hidden and cell state (1×1×128) in a single NPU call. The full recurrence is graph-flattened — 22 hardware contexts with 742 LCUs mapped across the unroll. The output is consumed by three fp32 prediction heads on the host (imminent fault, warning fault, early fault), each Linear 64→64→8.

27.47 FPS
Throughput (full sequence)
0.962
Pearson r vs fp32
90.5%
All-3-head argmax (n=200)
0.015
MAE

Per-head argmax agreement: imminent 92.0%, warning 98.0%, early 100.0%. Disagreement concentrates on close-margin predictions in the imminent-fault classifier where INT8 range compression can swap near-tied logits. The recommended production path for HVAC fault classification remains the TCN architectural rewrite (~7,000 FPS, full classification fidelity), with the unrolled-LSTM technique reserved for retrofit cases requiring exact LSTM semantics. Cross-silicon: Hailo-8 89.58 FPS for the same architecture.

Research transformer architectures

Compact transformer-class models compiled to Hailo-10H, demonstrating AxonML-trained transformer architectures running on production silicon. These entries cover small reference architectures (GPT-2 tiny, Phi tiny), a recurrent-depth transformer (RDT-tiny / Huginn), and the SAE feature-dictionary work used in interpretability research.

GPT-2 tiny

A GPT-2-style decoder-only transformer at the smallest published configuration, compiled through DFC and benchmarked on Hailo-10H. The Hailo-8 cross-target compile is 2.7× faster than the Hailo-10H result — a recurring pattern across small transformer models.

1,743 FPS
Throughput
H8 4,757 FPS
Cross-silicon
270K params
Model size
352 KB
HEF size

Phi tiny

A Phi-architecture decoder-only transformer at a tiny configuration with rotary position embeddings, compiled through DFC and benchmarked on Hailo-10H.

1,633 FPS
Throughput
H8 2,385 FPS
Cross-silicon
270K params
Model size
352 KB
HEF size

RDT-tiny (Huginn recurrent-depth)

A recurrent-depth transformer following the Huginn architecture: a prelude of 2 standard decoder layers, a core of 4×4 iterated recurrent-depth applications (16 effective layer applications through weight reuse), and a coda of 2 standard layers for a total of 20 effective decoder applications at 109M parameters. Compiles to 27 hardware contexts on Hailo-10H.

16.48 FPS
Throughput (seq=64)
~61 ms
Latency per pass
109M params
Model size
27 contexts
HW partition

RDT-mid (Oracle-7B distill target)

The production-scale recurrent-depth transformer: 4 prelude layers, an 8-layer shared core iterated K times (default K=8, tunable per-request), and a 4-layer coda. 873M parameters at d=2048 with 32 attention heads and grouped-query attention (8 KV heads). Distilled from Oracle-7B (DeepSeek-R1 1.5B finetune) as a test-time-compute alternative: spend extra NPU iterations instead of extra parameters. The same core weights are reused across all K iterations, making the model VRAM-efficient relative to its effective depth.
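The relationship between stored layers and effective depth in a recurrent-depth transformer can be worked through with a short calculation. The sketch below assumes the RDT-tiny "4×4" core means 4 layers iterated 4 times; the function names are hypothetical.

```python
def effective_depth(prelude, core_layers, iterations, coda):
    """Layer applications per forward pass when the core is re-run `iterations` times."""
    return prelude + core_layers * iterations + coda


def unique_layers(prelude, core_layers, coda):
    """Layers whose weights are stored; core weights are shared across iterations."""
    return prelude + core_layers + coda


# RDT-tiny: 2 prelude layers, a 4-layer core run 4 times (assumed), 2 coda layers.
rdt_tiny_depth = effective_depth(prelude=2, core_layers=4, iterations=4, coda=2)

# RDT-mid: 4 prelude layers, an 8-layer core run K times, 4 coda layers.
rdt_mid_depth = {k: effective_depth(4, 8, k, 4) for k in (1, 4, 8, 16)}
rdt_mid_stored = unique_layers(4, 8, 4)
```

For RDT-mid at the default K=8 this gives 72 layer applications from 16 stored layers, and raising K deepens the computation without adding parameters.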

873M params
Model size
K×8 layers
Effective depth (tunable)
d=2048
Hidden dimension
Hailo-10H
Target

Trident-Blog distill (1.58-bit compact)

Compact 4-layer 1.58-bit transformer distilled from the Trident-Coder model for edge-native code completion. Conv2D representation with d=128, intermediate=512, and batch-normalized residual connections. Designed as the smallest viable model that still produces syntactically correct code completions at the edge without network connectivity.

2.1 MB
ONNX size
4 layers
Depth
d=128
Hidden dimension
H8 + H10
Dual target

Sparse Autoencoders (SAE)

Sparse autoencoder models trained in AxonML for feature-dictionary learning and mechanistic interpretability research. SAEs decompose intermediate representations from larger models into sparse, interpretable feature directions, enabling downstream analysis of what individual features encode. The architecture comprises an encoder projection, a top-k sparsity gate, and a decoder reconstruction; only k of N feature directions are active for any given input, producing the sparse-dictionary representation that gives the family its name.
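The encoder, top-k gate, and decoder stages described above can be sketched as follows. This is a minimal plain-Python illustration with hand-picked toy weights, not the AxonML SAE; some top-k SAE variants also apply a ReLU before the gate, which is omitted here, and all names are hypothetical.

```python
def matvec(m, v):
    """Dense matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def top_k_gate(acts, k):
    """Keep the k largest activations and zero the rest."""
    keep = set(sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec, k):
    """Encode to N features, keep only k of them, then decode back to input space."""
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    features = top_k_gate(pre, k)  # sparse code: at most k nonzero entries
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon


# Toy dictionary: 4 feature directions (+x, +y, -x, -y) over a 2-D input.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
w_dec = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
features, recon = sae_forward([3.0, -2.0], w_enc, [0.0] * 4, w_dec, [0.0] * 2, k=2)
```

With k=2, only the "+x" and "-y" features fire for this input, and the decoder reconstructs it exactly from those two directions.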

3,676 FPS
Throughput
4,412 KB
HEF size
Hailo-10H
Target
INT8
Quantization

Vision baselines & detection

Standard vision architectures compiled on both Hailo-8 and Hailo-10H to validate the full Conv2D pipeline and to serve as reference points for it on both targets.

BlazeFace (face detection)

Lightweight depthwise-separable face detector following the MediaPipe BlazeFace architecture. 5 depthwise-separable convolution stages at 128×128 input resolution with dual classification and bounding-box regression heads. Trained in AxonML for the Aegis biometric pipeline's face-localization front-end.

128×128
Input resolution
47 KB
ONNX size
DW-Sep
Architecture
H8 + H10
Dual target

ResNet-18 (CIFAR-10)

Standard 18-layer residual network trained on CIFAR-10 (32×32 RGB, 10 classes). Validates skip-connection handling through the quantization pipeline. 11.2M parameters with batch normalization fused into convolutions during optimization.
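Fusing batch normalization into the preceding convolution is a standard algebraic rewrite, sketched below for a single output channel with a flattened kernel. The function names and numeric values are hypothetical, and the sketch shows the general identity rather than the specific optimization pass applied to this model.

```python
import math


def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta into conv weights."""
    factor = gamma / math.sqrt(var + eps)
    folded_weight = [w * factor for w in weight]
    folded_bias = (bias - mean) * factor + beta
    return folded_weight, folded_bias


def conv(x, weight, bias):
    """Single-channel dot-product stand-in for one convolution output."""
    return sum(w * v for w, v in zip(weight, x)) + bias


def conv_then_bn(x, weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Reference path: convolution followed by inference-mode batch normalization."""
    return gamma * (conv(x, weight, bias) - mean) / math.sqrt(var + eps) + beta


x = [0.5, -1.0, 2.0]
weight, bias = [0.2, -0.4, 0.1], 0.05
bn = dict(gamma=1.5, beta=-0.3, mean=0.12, var=0.8)

reference = conv_then_bn(x, weight, bias, **bn)
folded_weight, folded_bias = fold_batchnorm(weight, bias, **bn)
fused = conv(x, folded_weight, folded_bias)
```

The fused and reference outputs agree to floating-point precision, so the batch-norm layer disappears from the inference graph and only the rescaled convolution needs to be quantized.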

32×32
Input resolution
11.2M params
Model size
18 layers
Depth
H8 + H10
Dual target

VGG-11 (CIFAR-10)

11-layer VGG architecture with batch normalization, trained on CIFAR-10. Pure sequential Conv+BN+ReLU+Pool tower with no skip connections, making it a stress test for deep sequential quantization accuracy and NPU memory bandwidth utilization.

32×32
Input resolution
9.2M params
Model size
11 layers
Depth
H8 + H10
Dual target

NexusWatch perception suite

Four perception models running on the live NexusWatch deployment Pi (Hailo-10H, HailoRT 5.3.0). The suite handles long-range optical perception with motion, shape, and anomaly classification over a wide-area sensor field. All four models run concurrently on the same device with shared HailoRT VDevice resources.

Igigi (detector)

The detection front-end of the NexusWatch perception pipeline. Identifies candidate aerial objects in the optical field and produces bounding-box detections with confidence scores consumed by the downstream Namtar / Shamash / Nisaba classifiers.

243 FPS
Throughput
Live
Production status
Hailo-10H
Target
Detection
Role

Namtar (anomaly)

Anomaly classifier over the Igigi detection stream. Flags detections whose feature signature deviates from the learned distribution of expected objects in the sensor field, providing the first-line filter for novelty detection.

4,489 FPS
Throughput
Live
Production status
Hailo-10H
Target
Anomaly
Role

Shamash (motion)

Motion classifier characterizing the trajectory dynamics of detected objects. Produces motion signatures used downstream to distinguish ballistic, powered, and aerodynamic motion patterns.

4,072 FPS
Throughput
Live
Production status
Hailo-10H
Target
Motion
Role

Nisaba (shape)

Shape classifier producing geometric descriptors of detected objects. Combined with the Namtar anomaly score and Shamash motion signature, the Nisaba shape descriptor provides the third dimension in the perception suite's classification space.

4,243 FPS
Throughput
Live
Production status
Hailo-10H
Target
Shape
Role

Hailo-8 Apollo suite

The Apollo suite comprises 8 purpose-built HVAC predictive control architectures, each targeting a specific aspect of commercial building thermal management: supply air prediction, airflow optimization, wind-chill compensation, multi-zone coordination, geothermal optimization, hydronic flow prediction, and combustion efficiency modeling.

Model | Domain | FPS | HW Lat (ms)
Apollo | HVAC Predictive Control | 1,475.02 | 0.706
Aquilo | HVAC Predictive Control | 11,596.20 | 0.109
Boreas | HVAC Predictive Control | 5,391.30 | 0.223
Colossus | HVAC Predictive Control | 457.25 | 2.209
Gaia | HVAC Predictive Control | 394.12 | 2.559
Naiad | HVAC Predictive Control | 34,660.90 | 0.044
Vulcan | HVAC Predictive Control | 10,145.30 | 0.121
Zephyrus | HVAC Predictive Control | 10,483.50 | 0.118
Table — Apollo suite silicon benchmarks on Hailo-8.

Apollo

Apollo is the flagship predictive control model for commercial HVAC systems. It forecasts supply air temperature, zone demand, and equipment staging from multi-sensor inputs including outdoor air, return air, zone temperatures, and occupancy signals.

1,475 FPS
Throughput
0.706 ms
HW latency
N/A
E2E latency
DFC Profiler Report — Apollo on Hailo-8

Aquilo

Aquilo is a lightweight thermal prediction model optimized for constrained edge deployment. It predicts supply air temperature trends from minimal sensor inputs with sub-0.11 ms latency.

11,596 FPS
Throughput
0.109 ms
HW latency
N/A
E2E latency
DFC Profiler Report — Aquilo on Hailo-8

Boreas

Boreas specializes in outdoor air temperature and wind-chill compensation for HVAC economizer control. It models the relationship between ambient conditions and building thermal load.

5,391 FPS
Throughput
0.223 ms
HW latency
N/A
E2E latency
DFC Profiler Report — Boreas on Hailo-8

Colossus

Colossus is a deep multi-zone thermal model for large commercial buildings. It simultaneously predicts thermal trajectories across multiple HVAC zones, enabling coordinated control strategies.

457 FPS
Throughput
2.209 ms
HW latency
N/A
E2E latency
DFC Profiler Report — Colossus on Hailo-8

Gaia

Gaia models ground-source heat pump systems, predicting loop temperatures and COP from ground conditions, building load, and historical cycling patterns.

394 FPS
Throughput
2.559 ms
HW latency
N/A
E2E latency
DFC Profiler Report — Gaia on Hailo-8

Naiad

Naiad predicts hydronic system flow rates and delta-T from pump speed, valve position, and temperature sensor inputs. Ultra-lightweight for deployment on the smallest controllers.

34,661 FPS
Throughput
0.044 ms
HW latency
N/A
E2E latency
DFC Profiler Report — Naiad on Hailo-8

Vulcan

Vulcan models gas-fired heating equipment efficiency, predicting flue temperature, combustion efficiency, and heat exchanger fouling from operational parameters.

10,145 FPS
Throughput
0.121 ms
HW latency
N/A
E2E latency
DFC Profiler Report — Vulcan on Hailo-8

Zephyrus

Zephyrus predicts supply and return airflow rates from fan speed, static pressure, and duct configuration inputs for VAV and constant-volume air handling systems.

10,484 FPS
Throughput
0.118 ms
HW latency
N/A
E2E latency
DFC Profiler Report — Zephyrus on Hailo-8

Hailo-8 fleet models

Site-specific HVAC predictive control models trained on operational data from production building automation systems. Each model is compiled as a dedicated HEF binary and deployed to the Raspberry Pi 5 controller at its respective site.

Fleet deployment overview

The AutomataNexus production fleet runs LSTM and GRU predictive models across twelve commercial and residential sites: Akron Public Library, Byrna Ammunition Production, Element Labs, First Church of God, Hopebridge Therapy Center, Heritage Huntington, Heritage Warren Healthcare Campus, NE Realty Group, St. Jude Church and School, Peabody Retirement Community, Taylor University, and a residential lake-house deployment. Each site runs its own LSTM and GRU model pair, with an architecture matched to the equipment under control, compiled to Hailo-8 HEFs through DFC v3.33.1 and deployed to a Raspberry Pi 5 controller in the mechanical room. Models in the broader fleet are not itemized in this document. The three sites featured below represent the most complex equipment-control surfaces in the deployed fleet and are documented in detail because their LSTM and GRU implementations exercise the full range of the recurrent-architecture validation methodology described in the Hailo-10H portfolio section.

FCOG Mechroom

The First Church of God mechanical room runs predictive control over a substantial heating and cooling plant: 2 chillers, 4 boilers, 6 pumps, and 4 variable-frequency drives (VFDs). The site's LSTM and GRU pair predicts equipment failure modes across the cross-coupled chilled-water and hot-water loops, with the VFDs providing fine-grained control surfaces and the pump array distributing flow across both loops.

Enlil (FCOG Mechroom LSTM)

Multi-horizon LSTM predictor for the FCOG Mechroom equipment cluster. Predicts equipment failure modes across chillers, boilers, pumps, and VFDs at three horizons (5/15/30 min) based on real-time sensor telemetry from the mechanical room.

418 FPS
Throughput
1.488 ms
HW latency
Hailo-8
Target
LSTM
Architecture
DFC Profiler Report — FCOG Mechroom LSTM on Hailo-8

Enki (FCOG Mechroom GRU)

GRU-based anomaly detector for the FCOG Mechroom equipment cluster. Companion model to the LSTM predictor; the GRU produces real-time anomaly scores while the LSTM produces forward-looking failure predictions, with both models running concurrently on the same Pi 5 controller.

50,583 FPS
Throughput
0.057 ms
HW latency
Hailo-8
Target
GRU
Architecture
DFC Profiler Report — FCOG Mechroom GRU on Hailo-8

Taylor Greenhouse

The Taylor University greenhouse facility runs predictive control over an environmental management system: 1 supply fan, 2 exhaust fans, 4 fan coils, temperature and relative humidity sensors, a UV sensor, retractable roof and side vents, and 2 hot-water wall radiant heaters. The site's LSTM and GRU pair handles a unique mix of HVAC airflow control, radiant heating, and physical envelope manipulation (roof and vent positioning) in response to environmental conditions.

Ninhursag (Taylor Greenhouse LSTM)

Multi-horizon LSTM predictor for the Taylor Greenhouse environmental control system. Integrates airflow telemetry from supply and exhaust fans, fan-coil performance, radiant-heater output, and envelope-position state (roof and vent actuators) with the temperature, relative humidity, and UV sensor stream to predict environmental excursions and equipment failures.

418 FPS
Throughput
1.488 ms
HW latency
Hailo-8
Target
LSTM
Architecture
DFC Profiler Report — Taylor Greenhouse LSTM on Hailo-8

Dumuzi (Taylor Greenhouse GRU)

GRU-based anomaly detector for the Taylor Greenhouse environmental control system. Companion model to the LSTM predictor; provides real-time anomaly scoring across the airflow, radiant, and envelope subsystems while the LSTM produces multi-horizon failure predictions.

54,144 FPS
Throughput
0.057 ms
HW latency
Hailo-8
Target
GRU
Architecture
DFC Profiler Report — Taylor Greenhouse GRU on Hailo-8

Peabody Mechroom

The Peabody Retirement Community mechanical room is the largest equipment-control surface in the fleet: 3 cooling towers, 2 boilers, 1 heat exchanger, 6 pumps, 3 VFDs, and a condenser-loop heat exchanger. The site's LSTM and GRU pair predicts and monitors a multi-loop hydronic system spanning condenser water, chilled water, and heating hot water with cross-coupled flow paths through the heat exchangers.

Gibil (Peabody Mechroom LSTM)

Multi-horizon LSTM predictor for the Peabody Mechroom equipment cluster. Predicts failure modes across cooling towers, boilers, pumps, VFDs, and the dual heat exchangers at three horizons (5/15/30 min). The condenser-loop heat exchanger introduces additional coupling between the cooling-tower loop and the chilled-water loop that the model accounts for in its prediction surface.

Throughput: 418 FPS · HW latency: 1.488 ms · Target: Hailo-8 · Architecture: LSTM
DFC Profiler Report — Peabody Mechroom LSTM on Hailo-8

Nammu (Peabody Mechroom GRU)

GRU-based anomaly detector for the Peabody Mechroom equipment cluster. Companion model to the LSTM predictor; produces real-time anomaly scores across the multi-loop hydronic system while the LSTM produces forward-looking failure predictions.

Throughput: 59,695 FPS · HW latency: 0.057 ms · Target: Hailo-8 · Architecture: GRU
DFC Profiler Report — Peabody Mechroom GRU on Hailo-8

Hailo-8 special models

Sentinel (anomaly detection), Motion Classifier (occupancy), Detector (object detection), Atropos (TCN replacement for GRU), and ATLAS (autonomous racing) — special-purpose models that serve cross-cutting concerns across the fleet or demonstrate novel architecture compilations. A companion cross-silicon section follows, documenting Hailo-8 entries for models also compiled to Hailo-10H.

Model | Domain | FPS | HW latency (ms)
ATLAS | Autonomous Racing | 1,091.67 | 1.015
Atropos | HVAC Predictive Control | 6,982.23 | 0.178
Detector | Object Detection | 319.02 | 6.292
Motion Classifier | Occupancy Detection | 36,300.80 | 0.003
Sentinel | Anomaly Detection | 21,136.50 | 0.005
Table — Special-purpose models on Hailo-8.

Atropos

Atropos was originally an LSTM/GRU recurrent model and was re-architected as a temporal convolutional network (TCN) for Hailo compatibility. It keeps the same input/output contract while replacing the recurrent gates with dilated causal convolutions.

Throughput: 6,982 FPS · HW latency: 0.178 ms · E2E latency: N/A
DFC Profiler Report — Atropos on Hailo-8
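
The dilated causal convolution that stands in for the recurrent gates can be illustrated with a minimal pure-Python sketch. This is not Atropos's actual layer stack: the kernel size, dilation schedule, and function names below are assumptions chosen for illustration, following the generic TCN construction.

```python
from typing import List, Sequence


def causal_dilated_conv1d(
    x: Sequence[float], w: Sequence[float], dilation: int
) -> List[float]:
    """Compute y[t] = sum_j w[j] * x[t - j*dilation], with zeros before t = 0.

    Each output depends only on the current and past inputs, which is what
    makes the convolution causal.
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for j, wj in enumerate(w):
            idx = t - j * dilation
            if idx >= 0:  # implicit left zero-padding
                acc += wj * x[idx]
        y.append(acc)
    return y


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Timesteps visible at the top of a stack with one conv per dilation."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Hypothetical example values, not taken from the Atropos model.
signal = [1.0, 2.0, 3.0, 4.0]
out = causal_dilated_conv1d(signal, w=[1.0, 1.0], dilation=2)
rf = receptive_field(kernel_size=2, dilations=[1, 2, 4])
```

With these toy values each output is the current sample plus the sample two steps earlier, and a three-level stack with dilations 1, 2, 4 sees eight timesteps. All loop bounds are fixed by the input length, so the computation has no data-dependent control flow.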

Detector

General-purpose object detection model for occupancy counting and zone activity monitoring.

Throughput: 319 FPS · HW latency: 6.292 ms · E2E latency: N/A

Motion Classifier

Binary motion/occupancy classifier for zone-level occupancy detection. At 36,000+ FPS, it adds negligible inference overhead when run alongside the primary HVAC models.

Throughput: 36,301 FPS · HW latency: 0.003 ms · E2E latency: N/A

Sentinel

Sentinel is a lightweight anomaly detection model that flags abnormal HVAC operating conditions from multi-sensor inputs. Running at 21,000+ FPS, it provides continuous real-time monitoring.

Throughput: 21,136 FPS · HW latency: 0.005 ms · E2E latency: N/A
DFC Profiler Report — Sentinel on Hailo-8

Hailo-8 cross-silicon entries

Models from the Hailo-10H portfolio that were additionally compiled and benchmarked on Hailo-8. Throughput numbers below were measured on the same dual-Pi 5 testbench with HailoRT 4.20.0 (Hailo-8 path) and HailoRT 5.3.0 (Hailo-10H path). The cross-silicon comparison surfaces a recurring finding: small models with single-context allocation often run faster on Hailo-8 than on Hailo-10H, despite Hailo-10H's higher nominal TOPS rating. The finding holds for all seven models listed below, with Hailo-8 to Hailo-10H throughput ratios ranging from 1.5× to 21.9×.

Model | Domain | H8 FPS | H10 FPS | H8 / H10 ratio
Trident TCN | LLM backbone | 20,133 | 2,109 | 9.5×
Trident 1.58-bit | BitNet ternary LM | 19,388 | 3,010 | 6.4×
LSTM cell | Recurrent (per-step) | 34,224 | 2,578 | 13.3×
LSTM encoder 60-step | Recurrent (unrolled) | 89.58 | 27.47 | 3.3×
GRU cell (fused) | Recurrent (per-step) | 27,027 | 1,232 | 21.9×
GPT-2 tiny | Transformer | 4,757 | 1,743 | 2.7×
Phi tiny | Transformer + RoPE | 2,385 | 1,633 | 1.5×
Table — Cross-silicon throughput comparison for models compiled to both Hailo-8 and Hailo-10H. Measured on Pi 5 dual-testbench, HailoRT 4.20.0 / 5.3.0.
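
The ratio column can be recomputed directly from the two throughput columns. The sketch below copies the measured FPS values from the table and rounds to one decimal place; the dictionary and function names are illustrative only.

```python
# Measured FPS values copied from the cross-silicon table: (Hailo-8, Hailo-10H).
CROSS_SILICON_FPS = {
    "Trident TCN": (20133, 2109),
    "Trident 1.58-bit": (19388, 3010),
    "LSTM cell": (34224, 2578),
    "LSTM encoder 60-step": (89.58, 27.47),
    "GRU cell (fused)": (27027, 1232),
    "GPT-2 tiny": (4757, 1743),
    "Phi tiny": (2385, 1633),
}


def h8_over_h10(h8_fps: float, h10_fps: float) -> float:
    """Throughput ratio; values above 1 mean Hailo-8 is faster."""
    return h8_fps / h10_fps


ratios = {
    name: round(h8_over_h10(h8, h10), 1)
    for name, (h8, h10) in CROSS_SILICON_FPS.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```

Every ratio exceeds 1, which is the numerical form of the finding stated above the table.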

Interpretation

The Hailo-8 dataflow engine is simpler than Hailo-10H's: where Hailo-10H must partition most non-trivial models across multiple hardware contexts (with associated context-switch overhead between partition boundaries), Hailo-8 fits these smaller models in a single context with no partitioning required. For workloads in this size range, the Hailo-8 architecture's lower allocation overhead more than compensates for its lower nominal TOPS rating. The conventional choice of Hailo-10H for transformer and recurrent workloads is therefore not always the throughput-optimal choice — particularly for the kinds of small, fits-in-single-context models used in industrial deployments where the dataflow accelerator is shared across many model classes.

For the GRU cell (fused) the cross-silicon ratio is the most extreme at 21.9×. Numerical fidelity is identical across both chips (Pearson 0.9998 on each), confirming that the throughput delta is a partitioning effect rather than a precision tradeoff — Hailo-8 deployment for fused-gate recurrent cells does not cost any quality.
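
The fidelity figure quoted here is a Pearson correlation between the quantized silicon output and the fp32 reference. As a minimal sketch of that metric, the function below computes it for two flattened output vectors. The sample vectors are hypothetical and are not measured outputs from either chip.

```python
import math
from typing import Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("inputs must have the same length, at least 2")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_a * var_b)


# Hypothetical values for illustration only.
fp32_ref = [0.10, -0.40, 0.25, 0.80, -0.15]
int8_out = [0.11, -0.39, 0.24, 0.79, -0.16]
r = pearson(fp32_ref, int8_out)
```

Pearson correlation is insensitive to a uniform scale or offset between the two vectors, so it measures whether the quantized output tracks the reference rather than whether the absolute values match.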

LLM benchmarks

Three large language models were evaluated on the Hailo-10H accelerator via the hailo-ollama integration pipeline. These models run natively on the Hailo-10H NPU, demonstrating that edge LLM inference is viable on fixed-function accelerator silicon.

Model | Parameters | Tokens/sec | Notes
Llama 3.2 | 1B | 9.98 | Meta Llama 3.2 1B via hailo-ollama
DeepSeek-R1 | 1.5B | 7.41 | DeepSeek R1 distilled 1.5B via hailo-ollama
Qwen3 | 1.7B | 4.84 | Alibaba Qwen3 1.7B via hailo-ollama
Table — LLM inference benchmarks on Hailo-10H via hailo-ollama.
Llama 3.2 1B: 9.98 tok/s · DeepSeek-R1 1.5B: 7.41 tok/s · Qwen3 1.7B: 4.84 tok/s

Pipeline

The hailo-ollama pipeline provides an Ollama-compatible REST API backed by the Hailo-10H NPU for token generation. Models are loaded in GGUF format, with compute-intensive matrix multiplications offloaded to the NPU via the HailoRT runtime. The host CPU handles tokenization, sampling, and KV-cache management while the NPU executes the forward pass for each generated token.

These benchmarks demonstrate that sub-2B-parameter LLMs reach interactive token generation rates (4.84 to 9.98 tokens/sec) on dedicated edge accelerator hardware, enabling local AI assistants, on-device code generation, and conversational interfaces without cloud dependency.

Note

LLM benchmarks measure tokens-per-second generation rate during sustained autoregressive decoding. Unlike the compiled HEF models in the rest of this portfolio, LLMs run through the hailo-ollama runtime rather than direct HEF execution. Throughput is limited by sequential token generation rather than batch inference.
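
A decode-rate figure of this kind can be derived from the timing fields of a generation response. The sketch below assumes that hailo-ollama, being Ollama-compatible, returns the same `eval_count` and `eval_duration` (nanoseconds) fields that upstream Ollama includes in its final generate response. Whether the figures in the table were computed this way is not stated in this document, and the sample response is illustrative rather than a captured log.

```python
def tokens_per_second(response: dict) -> float:
    """Decode rate from Ollama-style timing fields.

    eval_count:    number of generated tokens
    eval_duration: time spent generating them, in nanoseconds
    """
    eval_count = response["eval_count"]
    eval_duration_ns = response["eval_duration"]
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


# Illustrative response fragment: 499 tokens generated in 50 seconds.
sample = {"eval_count": 499, "eval_duration": 50_000_000_000}
rate = tokens_per_second(sample)
print(f"{rate:.2f} tok/s")
```

Because decoding is sequential, the reciprocal of this rate is the average time per generated token.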

Compilation pipeline

The 84 models in this portfolio were compiled from trained AxonML checkpoints through the Hailo Dataflow Compiler (DFC) and validated on production silicon via HailoRT. This section documents the compilation pipeline and the architectural constraints that inform AxonML's model design decisions for Hailo targets.

Compilation pipeline

Each model in the portfolio follows the same compilation path: trained AxonML checkpoint → ONNX export → DFC compile (v3.33.1 for Hailo-8, v5.3.0 for Hailo-10H) with INT8 post-training quantization and DFC's calibration pass → HEF binary → deployment via HailoRT (4.20.0 on Hailo-8, 5.3.0 on Hailo-10H) on the dual-Pi 5 testbench. Throughput and latency measurements were taken with hailortcli run against each compiled HEF on the corresponding silicon target.
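
To make the quantization step concrete, the following is a generic sketch of symmetric INT8 post-training quantization with a max-abs calibration rule. It is not the DFC's calibration algorithm, which this document does not describe. The function names and calibration values are assumptions for illustration.

```python
from typing import Iterable


def calibrate_scale(calibration_values: Iterable[float]) -> float:
    """Pick a scale so the largest observed magnitude maps to 127."""
    max_abs = max(abs(v) for v in calibration_values)
    if max_abs == 0.0:
        raise ValueError("calibration data is all zeros")
    return max_abs / 127.0


def quantize(x: float, scale: float) -> int:
    """Map a float to a signed 8-bit code, clamping out-of-range values."""
    q = round(x / scale)
    return max(-127, min(127, q))


def dequantize(q: int, scale: float) -> float:
    """Recover the approximate float value from its INT8 code."""
    return q * scale


# Hypothetical calibration activations; the largest magnitude is 2.54.
calib = [-2.54, 0.3, 1.0, 2.0]
scale = calibrate_scale(calib)
q = quantize(1.0, scale)
x_hat = dequantize(q, scale)
```

The scale depends entirely on the calibration range, which is why a single outlier in the calibration set coarsens the resolution available to every other value.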

Compiler versions

DFC 3.33.1: Hailo-8 compilation. Targets the HAILO8 architecture. The Hailo-8 portfolio in this document was compiled with this version.
DFC 5.3.0: Hailo-10H compilation. Targets the HAILO15H/HAILO10H architecture family. The Hailo-10H portfolio in this document was compiled with this version.

Architecture constraints

The compilation pipeline enforces a family of architectural constraints inherited from the underlying Hailo dataflow silicon. These constraints inform AxonML's architectural decisions during model design:

  • No dynamic control flow. All tensor shapes must be statically known at compile time. Variable-length sequences require padding to a fixed maximum length, and data-dependent branching is unsupported.
  • Softmax is a hardware-bounded resource. The dataflow engine includes dedicated softmax units; transformer self-attention, which uses softmax in the attention kernel, contends for these units. Replacing softmax attention with depthwise convolution or sigmoid gating yields equivalent accuracy with better hardware utilization for throughput-bound workloads.
  • INT8 quantization mandatory. All activations and weights are quantized to INT8 during the compilation pass. Models must be calibration-friendly; architectures with extreme dynamic ranges may require quantization-aware training. The fused-gate GRU variant in this portfolio illustrates how minor architectural rewrites can substantially improve quantization quality.
  • No in-place operations. Operations like in-place addition or in-place ReLU must be rewritten as explicit allocations for the dataflow graph. AxonML's exporter enforces this transformation automatically during ONNX export.
  • Single-batch inference. The dataflow architecture processes one sample at a time through the fixed-function pipeline. Throughput is achieved via pipeline parallelism across layers, not batch parallelism.
  • Depthwise-separable convolutions preferred. Standard dense convolutions with large kernel sizes may not fit the hardware's multiply-accumulate budget. Depthwise separable convolutions decompose the operation into hardware-friendly stages and are the preferred construction for new AxonML architectures targeting Hailo silicon.
  • Recurrent gates require unrolling or per-step offload. Native GRU and LSTM op variants are not directly compiled; recurrent networks must be either unrolled to a fixed sequence length (graph-flattened recurrence) or compiled as per-timestep cells with the recurrence loop on the host. Both approaches are documented in this portfolio with full correlation validation.
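
The per-timestep option in the last constraint can be sketched as a host-side loop that carries the hidden state between fixed-shape cell invocations. In the sketch below, `toy_cell` is a hypothetical stand-in for one inference call on a compiled per-step cell; a real deployment would dispatch that call through HailoRT. The names and the toy update rule are assumptions, not AxonML's implementation.

```python
from typing import Callable, List, Sequence

Vector = List[float]
StepFn = Callable[[Vector, Vector], Vector]


def run_recurrence_on_host(
    step: StepFn, sequence: Sequence[Vector], h0: Vector
) -> List[Vector]:
    """Drive a per-timestep cell from the host, carrying the hidden state."""
    h = list(h0)
    states = []
    for x_t in sequence:
        # One fixed-shape cell invocation per timestep; the loop itself
        # lives on the host, outside the compiled graph.
        h = step(x_t, h)
        states.append(h)
    return states


def toy_cell(x_t: Vector, h: Vector) -> Vector:
    """Stand-in for a compiled cell: blends the input with the prior state."""
    return [0.5 * hi + 0.5 * xi for hi, xi in zip(h, x_t)]


seq = [[1.0], [1.0], [1.0]]
states = run_recurrence_on_host(toy_cell, seq, h0=[0.0])
```

The cell sees the same input and state shapes on every call, so the compiled graph stays static while the sequence length remains a host-side decision. The unrolled alternative moves the loop into the graph and fixes the length at compile time.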

Deployment

Compiled HEF binaries deploy to edge controllers equipped with Hailo accelerator modules. The deployment pipeline is fully automated via the NexusDeploy system.

Edge hardware platform

SBC: Raspberry Pi 5 (4 GB / 8 GB)
Accelerator: Hailo AI HAT+ (M.2 key M, PCIe Gen 3)
Hailo-8 module: 26 TOPS INT8, ~2.5 W typical power
Hailo-10H module: 40 TOPS INT8, ~3.5 W typical power
Runtime: HailoRT (hailort) for PCIe/M.2 dispatch
OS: Raspberry Pi OS (Debian Bookworm, kernel 6.6+)

Deployment pipeline

  1. Train model in AxonML (pure-Rust, CPU/GPU training).
  2. Export to ONNX via AxonML's ONNX serializer.
  3. Compile ONNX to HEF via DFC (v3.33.1 for H8, v5.3.0 for H10).
  4. Validate HEF with hailortcli parse-hef (I/O shape, compiler version, compatibility).
  5. Benchmark on production silicon with hailortcli run.
  6. Deploy HEF to target controller via nexusdeploy (rsync + systemd reload).
  7. Monitor inference via NexusWatch fleet dashboard.
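
Steps 4 to 6 above can be sketched as a list of commands assembled on the build host. The `hailortcli parse-hef` and `hailortcli run` subcommands are the ones named in the steps. The interface of nexusdeploy is not shown in this document, so the rsync and systemctl invocations below are an assumed equivalent of "rsync + systemd reload", and the host, directory, and unit names are hypothetical.

```python
from typing import List


def validation_commands(hef_path: str) -> List[List[str]]:
    """Steps 4 and 5: inspect the HEF, then benchmark it on the device."""
    return [
        ["hailortcli", "parse-hef", hef_path],
        ["hailortcli", "run", hef_path],
    ]


def deploy_commands(
    hef_path: str, host: str, remote_dir: str, unit: str
) -> List[List[str]]:
    """Step 6 (assumed form): copy the HEF, then reload the consuming service."""
    return [
        ["rsync", "-az", hef_path, f"{host}:{remote_dir}/"],
        ["ssh", host, "sudo", "systemctl", "reload-or-restart", unit],
    ]


# Hypothetical file, host, directory, and unit names.
commands = validation_commands("model.hef") + deploy_commands(
    "model.hef", "pi@controller-01", "/opt/models", "inference.service"
)
for cmd in commands:
    print(" ".join(cmd))
```

Each entry is an argument list that could be passed to `subprocess.run(cmd, check=True)`, so a failed validation step stops the sequence before anything is copied to a controller.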

Fleet topology

The current deployment fleet consists of Raspberry Pi 5 controllers distributed across commercial HVAC sites in Indiana and Ohio. Each controller runs one or more site-specific HEF models alongside the Talos control daemon, NexusEdge UI, and AegisDB telemetry store. Models execute on the Hailo accelerator asynchronously; the control loop reads predictions at ~1 Hz while the NPU runs inference at thousands of FPS, enabling real-time anomaly detection and predictive staging overlaid on the control cycle.
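
One simple way to decouple a fast inference producer from a ~1 Hz control loop is a latest-value slot: the producer overwrites a single shared value, and the consumer reads whatever is current on each tick. The sketch below illustrates that pattern only. It is not the Talos daemon's implementation, and the class and variable names are hypothetical.

```python
import threading
from typing import Optional, Tuple


class LatestPrediction:
    """Thread-safe slot holding only the most recent prediction."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._value: Optional[float] = None
        self._seq = 0  # number of predictions published so far

    def publish(self, value: float) -> None:
        """Called by the inference side after each result."""
        with self._lock:
            self._value = value
            self._seq += 1

    def read(self) -> Tuple[int, Optional[float]]:
        """Called by the control loop once per tick."""
        with self._lock:
            return self._seq, self._value


slot = LatestPrediction()
for score in (0.10, 0.12, 0.95):  # hypothetical anomaly scores
    slot.publish(score)
seq, latest = slot.read()
```

The sequence counter lets the control loop detect a stalled producer: if the counter has not advanced between two ticks, no new inference result arrived in that interval.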

Zero-dependency edge inference

Each HEF binary is a self-contained executable for the Hailo dataflow engine. No Python runtime, ONNX interpreter, TensorFlow Lite delegate, or model-serving framework is required on the edge device. The only runtime dependency is HailoRT, which provides the PCIe transport layer between the ARM host and the Hailo NPU.

References

  1. Jewell, A. (2026). AxonML — A Pure-Rust Deep Learning Framework for Edge Silicon. AutomataNexus LLC. Technical whitepaper.
  2. Jewell, A. (2026). NexusEdge — A Tauri-Based Orchestration Layer for Industrial HVAC Building Automation on Raspberry Pi Controllers. Zenodo. https://doi.org/10.5281/zenodo.19892139.
  3. Jewell, A. (2026). The Nexus Stack — A Pure-Rust Industrial Technology Stack for Building Automation. AutomataNexus LLC. Technical whitepaper.
  4. Jewell, A. (2026). Trident — A 1.58-bit Ternary Language Model for Edge Inference. AutomataNexus LLC. Technical whitepaper.
  5. Jewell, A. (2026). Aegis — Multi-Modal Biometric Authentication on Edge Silicon. AutomataNexus LLC. Technical whitepaper.
  6. Hailo Technologies Ltd. (2024). Hailo Dataflow Compiler User Guide. DFC v3.x / v5.x documentation.
  7. Hailo Technologies Ltd. (2024). HailoRT — Hailo Runtime API Reference. HailoRT v4.x / v5.x documentation.
  8. Hailo Technologies Ltd. (2024). Hailo-8 Datasheet. Product specification. 26 TOPS INT8.
  9. Hailo Technologies Ltd. (2024). Hailo-10H Datasheet. Product specification. 40 TOPS INT8.
  10. Ma, S. et al. (2024). The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. arXiv:2402.17764.
  11. Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752.
  12. Geiping, J. et al. (2025). Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach. Huginn architecture reference for RDT-tiny.
  13. Su, J. et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.
  14. Bai, S., Kolter, J. Z., & Koltun, V. (2018). An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv:1803.01271. TCN reference for the Trident TCN, Atropos, and HVAC fleet predictors.
  15. Raspberry Pi Ltd. (2024). Raspberry Pi 5 Datasheet. Product specification.

AutomataNexus LLC · Fort Wayne, Indiana
Andrew Jewell Sr. · ORCID 0009-0005-2158-7060
Generated from real Hailo-8 and Hailo-10H silicon measurements · AxonML Framework · April 2026

DISCLAIMER: This document and its contents are proprietary to AutomataNexus LLC. Performance figures are measured on production silicon under controlled conditions. Reproduction or redistribution without written permission is prohibited.