AutomataNexus LLC Whitepaper · April 2026

AxonML · Hailo Silicon Portfolio

AxonML on Hailo —
84 models compiled and benchmarked on real silicon.

A comprehensive portfolio of 84 neural network models compiled and benchmarked on Hailo-8 and Hailo-10H fixed-function accelerators, spanning HVAC predictive control, biometric identity, motorsport telemetry, mixture-of-experts, state-space models, novel sequence architectures, and recurrent networks with end-to-end correlation validation against fp32 reference. Every model was measured on production silicon. The document accompanies the AutomataNexus production deployment of these models across active commercial sites.

Author

Andrew Jewell Sr.

Organization

AutomataNexus LLC

Framework

AxonML

Silicon

Hailo-8 · Hailo-10H

Abstract

Background

Edge artificial intelligence demands inference engines that deliver deterministic latency, low power consumption, and high throughput without cloud dependency. Hailo's fixed-function dataflow accelerators — the Hailo-8 (26 TOPS INT8) and Hailo-10H (40 TOPS INT8) — provide dedicated neural-network silicon that executes compiled Hailo Executable Format (HEF) binaries with zero framework overhead. The AxonML framework is a pure-Rust deep-learning system purpose-built for compiling, quantizing, and deploying models to these accelerators across the AutomataNexus production deployment.

Approach

Each model in the portfolio was trained in AxonML, exported to ONNX, and compiled to a Hailo Executable Format (HEF) binary targeting Hailo-8, Hailo-10H, or both, using the Hailo Dataflow Compiler (DFC v3.33.1 for Hailo-8, v5.3.0 for Hailo-10H). Every HEF was deployed to a dual-Pi 5 testbench (Hailo-10H AI HAT+ and Hailo-8 M.2 AI Kit) and benchmarked using hailortcli run with real silicon profiling enabled. Recurrent and novel-architecture models were additionally validated for numerical fidelity by correlating INT8 silicon outputs against fp32 PyTorch reference outputs on matched architectures. No simulation, emulation, or estimated figures are reported — every number in this document was measured on physical hardware.

Results

Across 84 compiled models, throughput ranges from 16.48 FPS (RDT-tiny recurrent-depth transformer at seq=64) to 52,500 FPS (Naiad water-domain Apollo specialist), with a fleet-wide median above 5,000 FPS. Hardware latency ranges from 0.003 ms to 6.292 ms. All 24 Hailo-10H models compiled and ran successfully with thermal readings between 49°C and 60°C. All 60 Hailo-8 models compiled cleanly. Three large language models were additionally evaluated on Hailo-10H via the hailo-ollama pipeline: Llama 3.2 1B at 9.98 tok/s, DeepSeek-R1 1.5B at 7.41 tok/s, and Qwen3 1.7B at 4.84 tok/s.

Recurrent architectures were validated end-to-end against an fp32 reference. A single-timestep GRU cell achieved Pearson correlation r=0.917, with 100% argmax agreement across 200 test sequences when the small classifier heads were retained in fp32 on the host CPU. A 60-step unrolled GRU encoder reached r=0.9997 (MAE 0.005). A 60-step unrolled LSTM encoder reached r=0.962 with 90.5% all-three-head argmax agreement. Novel AutomataNexus architectures (Trident 1.58-bit ternary, Hydra SSM+attention hybrid, Chimera MoE+differential-attention) were additionally validated on Hailo-10H silicon at 3,476-3,689 FPS and on Hailo-8 silicon at 29,049-34,715 FPS.

Conclusion

These measurements confirm that the AxonML + Hailo pipeline is production-ready for large-scale edge AI deployment. The expanded portfolio spans HVAC predictive control, multi-agent diagnostic suites, biometric identity, autonomous racing telemetry, environmental audio, ancient-language translation, mixture-of-experts, state-space models, recurrent neural networks with end-to-end validation, and large language models — all running on commodity ARM SBCs (Raspberry Pi 5) with Hailo AI HAT+ and Hailo-8 M.2 accelerator modules.

Total compiled models	84 (24 Hailo-10H + 60 Hailo-8) + 3 LLMs
Target silicon	Hailo-8 (26 TOPS) · Hailo-10H (40 TOPS)
Compilation	Hailo Dataflow Compiler (DFC)
DFC compiler (H8)	v3.33.1
DFC compiler (H10)	v5.3.0
Training framework	AxonML (pure-Rust deep-learning framework)
Quantization	INT8 post-training (DFC calibration pass)
Edge platform	Raspberry Pi 5 + Hailo-10H AI HAT+ · Hailo-8 M.2 AI Kit (M.2 key M)
Author	Andrew Jewell Sr. · ORCID 0009-0005-2158-7060
Organization	AutomataNexus LLC · Fort Wayne, Indiana
Date	April 2026

Part I · Overview

Executive overview

The AxonML Hailo silicon portfolio comprises 84 neural network models compiled to Hailo Executable Format (HEF) binaries and benchmarked on production Hailo-8 and Hailo-10H accelerators, plus 3 large language models evaluated via hailo-ollama.

84_models

Compiled portfolio

2_chips

Hailo-8 + Hailo-10H

52,500_FPS

Peak throughput (Naiad, H8)

10_classes

Workload domains

0.003_ms

Min HW latency

6.292_ms

Max HW latency

24_models

Hailo-10H portfolio

60_models

Hailo-8 portfolio

The portfolio spans ten distinct application domains. The largest segment is HVAC predictive control, with site-specific models deployed across Warren Healthcare Campus, Akron Public Library, Huntington University, Hopebridge Therapy Center, First Church of God, and NE Realty Group, plus the eight-model Apollo multi-agent diagnostic suite (the Apollo coordinator, five domain specialists, the Colossus aggregator, and the Gaia safety validator).

The remaining domains include biometric identity (Aegis suite: face, fingerprint, and iris recognition), autonomous racing telemetry (ATLAS on Toyota GR Cup data, dual-target H8 + H10), environmental audio (BirdClef SedNet for avian biodiversity monitoring), ancient-language translation (Nabu Akkadian encoder), anomaly detection (Sentinel), occupancy sensing (Motion Classifier), object detection (Detector), aerial perception (NexusWatch suite: Igigi, Namtar, Shamash, Nisaba), novel sequence architectures (Mamba SSM, Hydra SSM+Attention, Chimera MoE, Trident TCN and 1.58-bit ternary, Sparse Autoencoders), and recurrent networks with end-to-end correlation validation (GRU and LSTM cells and unrolled encoders).

Every model in this portfolio was compiled from a trained AxonML checkpoint through the Hailo Dataflow Compiler, quantized to INT8, and benchmarked with hailortcli run on physical silicon. No estimated, simulated, or theoretical performance numbers appear in this document.

Part II · Hailo-10H

Hailo-10H portfolio

Twenty-four models were compiled and benchmarked on the Hailo-10H (40 TOPS INT8) accelerator via DFC v5.3.0. The Hailo-10H supports the HAILO15H/HAILO10H architecture family and provides thermal telemetry during inference. Models are organized by family: foundational portfolio (the original 11), recurrent architectures with correlation validation, novel research architectures, and the NexusWatch aerial perception suite.

Table — Hailo-10H silicon benchmark results, measured on production hardware.
Model	Domain	Architecture	FPS	HW Lat (ms)	E2E Lat (ms)	Avg Temp (°C)
Argus (Aegis)	Biometric Security	Conv2D Iris Encoder	3,656.71	0.044	0.552	51.4
Ariadne (Aegis)	Biometric Security	Residual Conv2D Classifier	3,511.15	0.086	0.628	52.1
ATLAS	Autonomous Racing	Multi-Scale Depthwise Temporal Network	1,766.81	0.696	1.262	53.3
BirdClef SedNet	Environmental Audio	Sound Event Detection Network	3,247.66	0.728	1.336	57.0
Chimera	Mixture of Experts	MoE + Differential Attention	360.64	2.563	3.299	53.5
Hydra	Hybrid Architecture	SSM + Depthwise Attention Hybrid	855.62	0.906	1.615	54.2
Mamba SSM	Sequence Modeling	Selective State Space Model	2,614.06	0.627	1.342	58.6
Mnemosyne (Aegis)	Biometric Security	Multi-Scale Conv2D Feature Extractor	2,769.80	0.087	0.866	56.3
Nabu	Natural Language Processing	Temporal Convolutional Encoder	669.31	1.721	3.219	58.8
Trident 1.58-bit (BitNet b1.58)	Large Language Model	Ternary Quantized TCN	3,009.91	0.239	0.948	58.1
Trident TCN	Large Language Model	Temporal Convolutional Network	2,109.46	0.218	1.069	57.2

Argus (Aegis)

Argus is the iris recognition component of the Aegis biometric suite. It encodes 64×64 near-infrared iris images into discriminative feature vectors for identity verification in access-controlled environments.

3,657_FPS

Throughput

0.044_ms

HW latency

0.552_ms

E2E latency

51.4_°C

Chip temp (avg)

DFC Profiler Report — Argus (Aegis) on Hailo-10H

Ariadne (Aegis)

Ariadne is the fingerprint authentication component of the Aegis biometric suite. It classifies fingerprint minutiae patterns from 64×64 grayscale scans into identity embeddings, running entirely on-device for zero-trust physical access control.

3,511_FPS

Throughput

0.086_ms

HW latency

0.628_ms

E2E latency

52.1_°C

Chip temp (avg)

DFC Profiler Report — Ariadne (Aegis) on Hailo-10H

ATLAS

ATLAS (Autonomous Telemetry Learning and Actuation System) is a real-time racing telemetry model trained on Toyota GR86 track data. It predicts optimal throttle, brake, and steering commands from multi-sensor vehicle state inputs at >1,700 FPS on Hailo-10H.

1,767_FPS

Throughput

0.696_ms

HW latency

1.262_ms

E2E latency

53.3_°C

Chip temp (avg)

DFC Profiler Report — ATLAS on Hailo-10H

BirdClef SedNet

BirdClef SedNet is a 234-species avian sound event detection model trained on the BirdCLEF competition dataset. It classifies mel-spectrogram audio frames into bird species labels for biodiversity monitoring at remote edge sensor stations.

3,248_FPS

Throughput

0.728_ms

HW latency

1.336_ms

E2E latency

57.0_°C

Chip temp (avg)

DFC Profiler Report — BirdClef SedNet on Hailo-10H

Chimera

Chimera is a mixture-of-experts model with differential attention, demonstrating sparse expert routing on edge accelerators. Each token is routed to top-k experts via a learned gating network, with differential attention replacing standard softmax attention.

361_FPS

Throughput

2.563_ms

HW latency

3.299_ms

E2E latency

53.5_°C

Chip temp (avg)

DFC Profiler Report — Chimera on Hailo-10H
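The top-k expert routing described for Chimera can be sketched as follows. This is a minimal illustration in plain Python, not Chimera's implementation: the number of experts, the value of k, and the choice to renormalize gate weights over only the selected experts are assumptions, and the function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(gate_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Weights are renormalized over the selected experts only (an assumption).
    """
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([gate_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_forward(token, gate_logits, experts, k=2):
    """Weighted sum of the outputs of the k selected experts."""
    out = [0.0] * len(token)
    for idx, weight in top_k_route(gate_logits, k):
        y = experts[idx](token)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out
```

Only the k selected experts are evaluated per token, which is where the sparsity of the layer comes from; the remaining experts contribute nothing to that token's output.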

Hydra

Hydra is a hybrid architecture combining selective state space modeling with depthwise attention. It demonstrates that SSM and attention mechanisms can be unified on fixed-function accelerators when attention is reformulated as depthwise convolution over query-key products.

856_FPS

Throughput

0.906_ms

HW latency

1.615_ms

E2E latency

54.2_°C

Chip temp (avg)

DFC Profiler Report — Hydra on Hailo-10H

Mamba SSM

Mamba SSM is a novel selective state space model compiled for Hailo silicon. It implements gated depthwise convolutions with parallel selective scan — a hardware-friendly alternative to transformer attention that maintains linear-time sequence processing.

2,614_FPS

Throughput

0.627_ms

HW latency

1.342_ms

E2E latency

58.6_°C

Chip temp (avg)

DFC Profiler Report — Mamba SSM on Hailo-10H

Mnemosyne (Aegis)

Mnemosyne is the face recognition component of the Aegis biometric authentication suite. It extracts 128-dimensional face embeddings from 64×64 grayscale face crops, enabling sub-millisecond face verification at the network edge without cloud dependency.

2,770_FPS

Throughput

0.087_ms

HW latency

0.866_ms

E2E latency

56.3_°C

Chip temp (avg)

DFC Profiler Report — Mnemosyne (Aegis) on Hailo-10H

Nabu

Nabu is a cuneiform Akkadian language encoder trained on transliterated tablet corpora. It processes variable-length token sequences into fixed-dimensional representations for downstream tasks including sign classification, period dating, and genre tagging.

669_FPS

Throughput

1.721_ms

HW latency

3.219_ms

E2E latency

58.8_°C

Chip temp (avg)

DFC Profiler Report — Nabu on Hailo-10H

Trident 1.58-bit (BitNet b1.58)

Trident 1.58-bit implements BitNet b1.58 ternary quantization ({-1, 0, +1} weights) over the Trident TCN backbone. This reduces weight storage to 1.58 bits per parameter while maintaining prediction quality, achieving near-binary compute efficiency on Hailo's fixed-function multiply-accumulate units.

3,010_FPS

Throughput

0.239_ms

HW latency

0.948_ms

E2E latency

58.1_°C

Chip temp (avg)

DFC Profiler Report — Trident 1.58-bit (BitNet b1.58) on Hailo-10H
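For readers unfamiliar with ternary weights, the sketch below shows the absmean quantizer commonly associated with BitNet b1.58: scale by the mean absolute weight, round, and clip to {-1, 0, +1}. It is an illustration under stated assumptions, not Trident's training code. Per-tensor scaling, the `eps` value, and the function names are assumptions.

```python
import math

def ternary_quantize(weights, eps=1e-8):
    """Absmean ternary quantization of a flat list of weights.

    Returns (q, scale) where every entry of q is -1, 0, or +1.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    q = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction of the original weights."""
    return [scale * v for v in q]

# Three equally likely states carry log2(3) bits, which is where the
# "1.58-bit" name comes from.
BITS_PER_TERNARY_WEIGHT = math.log2(3)
```

For example, `ternary_quantize([0.9, -0.05, -1.2, 0.3])` maps the two large-magnitude weights to +1 and -1 and the two small ones to 0, with a scale of 0.6125.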

Trident TCN

Trident is a custom 1-billion-parameter language model backbone compiled for edge inference. The TCN variant replaces transformer self-attention with dilated causal convolutions, enabling deterministic O(n) inference on fixed-function neural network accelerators without softmax hardware contention.

2,109_FPS

Throughput

0.218_ms

HW latency

1.069_ms

E2E latency

57.2_°C

Chip temp (avg)

DFC Profiler Report — Trident TCN on Hailo-10H
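The dilated causal convolution that the TCN variant substitutes for self-attention can be sketched in a few lines. This is a generic single-channel illustration, not the Trident layer: the kernel size and dilation schedule are not given in this document, so the values used in the helper are placeholders and the function names are hypothetical.

```python
def causal_dilated_conv1d(x, kernel, dilation):
    """y[t] = sum_j kernel[j] * x[t - j * dilation], zero-padded on the left.

    Output position t never reads an input later than t (causality).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            src = t - j * dilation
            if src >= 0:
                acc += w * x[src]
        out.append(acc)
    return out

def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of causal convs, one layer per dilation."""
    return 1 + (kernel_size - 1) * sum(dilations)
```

Each layer costs a fixed number of multiply-accumulates per time step, so a stack of such layers scales linearly with sequence length, and doubling the dilation per layer grows the receptive field exponentially with depth.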

Recurrent architectures — correlation-validated

A family of recurrent networks demonstrating that DFC's INT8 quantization preserves recurrent semantics across long unrolls and per-timestep cells. Each result is validated end-to-end against an architecture-matched fp32 reference, with Pearson correlation on the hidden-state output and argmax agreement when downstream classifier heads are evaluated in fp32 on the host CPU. This validation methodology — INT8 NPU body plus fp32 host classifier head — produced 100% argmax agreement on a 200-sequence test set for the GRU cell, demonstrating production-quality recurrence on Hailo silicon.
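The three validation metrics named above (Pearson correlation, mean absolute error, and argmax agreement) can be sketched as below. This is a minimal illustration in plain Python, not the harness used for the measurements in this document. The function names are hypothetical; hidden states are assumed to be flat lists of floats, and classifier outputs are assumed to be lists of per-sample logit vectors.

```python
import math

def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def mean_abs_error(a, b):
    """Mean absolute difference between reference and test values."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def argmax(v):
    """Index of the largest element (first one wins on ties)."""
    return max(range(len(v)), key=lambda i: v[i])

def argmax_agreement(ref_logits, test_logits):
    """Fraction of samples where both logit vectors pick the same class."""
    hits = sum(argmax(r) == argmax(t) for r, t in zip(ref_logits, test_logits))
    return hits / len(ref_logits)
```

Correlation and MAE measure different things: a quantized output that is a scaled or compressed copy of the reference can correlate highly while still showing a large MAE, which is why the sections below report both alongside argmax agreement.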

GRU cell (per-timestep)

A single-timestep GRU cell with Conv2D projections, Sigmoid reset and update gates, Tanh candidate activation, and element-wise gating. Designed for deployment patterns where the recurrence loop runs on the host CPU and each timestep is offloaded to the NPU as an independent forward pass. Calibration: 1024 real activation samples drawn from the production trace dataset.

2,578_FPS

Throughput (per step)

0.917

Pearson r vs fp32

100_%

Argmax (fp32 heads, n=200)

0.170

MAE

CPU-driven 120-step recurrence loop: 12.83 sequences/sec end-to-end, 77.9 ms per full sequence. PCIe round-trip overhead bounds practical throughput well below the per-step ceiling. The hybrid INT8/fp32-head deployment pattern recovers full classification accuracy from the INT8 hidden-state range compression.
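The gap between the per-step ceiling and the measured sequence rate can be made explicit with a little arithmetic. The sketch below uses only the figures quoted above (2,578 FPS per step, 120 steps, 12.83 sequences/sec). It treats the per-step FPS figure as the best case for back-to-back dependent calls, which is an assumption, and the residual it computes lumps together PCIe transfer, host-side work, and scheduling rather than isolating any one of them. The function name is hypothetical.

```python
def sequence_budget(step_fps, steps, measured_seq_per_s):
    """Compare the ideal per-sequence time against the measured one."""
    ceiling_ms = steps / step_fps * 1000.0      # if every step ran at step_fps
    measured_ms = 1000.0 / measured_seq_per_s   # observed end-to-end time
    return {
        "ceiling_ms": ceiling_ms,
        "ceiling_seq_per_s": step_fps / steps,
        "measured_ms": measured_ms,
        "residual_per_step_ms": (measured_ms - ceiling_ms) / steps,
    }

budget = sequence_budget(step_fps=2578.0, steps=120, measured_seq_per_s=12.83)
for name, value in budget.items():
    print(f"{name}: {value:.3f}")
```

With these inputs the ideal sequence time is about 46.5 ms against roughly 77.9 ms measured, leaving a residual of about 0.26 ms per step that the host-driven loop adds on top of NPU execution.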

GRU cell (fused gates)

An architectural variant of the GRU cell that fuses the three gate projections (reset, update, candidate) into a single combined matmul, then splits and applies the gate-specific activations. This restructuring quantizes more cleanly than the original three-projection topology and produces a tighter range distribution per gate. The fused-gate variant is the recommended deployment topology for both Hailo-8 and Hailo-10H, with near-identical correlation quality on both chips.

1,232_FPS

Throughput (per step)

0.9998

Pearson r vs fp32

96.0_%

Raw argmax (n=200)

0.0051

MAE

Cross-silicon: Hailo-8 27,027 FPS / r=0.9998 / 97.5% raw argmax. The matched correlation across both chips suggests DFC's quantization scheme is chip-independent in its numerical fidelity; throughput differences are driven by resource allocation, not precision.
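The gate-fusion restructuring can be illustrated with a small float-precision sketch. It shows that stacking the reset, update, and candidate weight matrices into one projection and splitting the result gives the same output as three separate projections. Several things here are assumptions: the PyTorch-style candidate formulation (reset gate applied to the projected hidden state), one stacked projection per operand rather than a single one, omitted biases, and all function names. The sketch demonstrates the algebraic equivalence only; it does not reproduce the quantization behavior reported above.

```python
import math

def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step_separate(x, h, Wr, Wz, Wn, Ur, Uz, Un):
    """GRU step with three separate projections per operand."""
    r = [sigmoid(a + b) for a, b in zip(matvec(Wr, x), matvec(Ur, h))]
    z = [sigmoid(a + b) for a, b in zip(matvec(Wz, x), matvec(Uz, h))]
    n = [math.tanh(a + ri * b)
         for a, ri, b in zip(matvec(Wn, x), r, matvec(Un, h))]
    return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

def gru_step_fused(x, h, W_fused, U_fused):
    """Same math with stacked weights: rows are [reset | update | candidate]."""
    H = len(h)
    gx = matvec(W_fused, x)   # length 3H
    gh = matvec(U_fused, h)   # length 3H
    r = [sigmoid(gx[i] + gh[i]) for i in range(H)]
    z = [sigmoid(gx[H + i] + gh[H + i]) for i in range(H)]
    n = [math.tanh(gx[2 * H + i] + r[i] * gh[2 * H + i]) for i in range(H)]
    return [(1 - z[i]) * n[i] + z[i] * h[i] for i in range(H)]
```

Building the fused weights is just row-wise concatenation, for example `W_fused = Wr + Wz + Wn` with list-of-rows matrices.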

GRU 60-step unrolled

A 60-step unrolled GRU encoder that flattens the recurrence into a single feed-forward graph — 60 GRU cells stacked in sequence, with hidden state passed via tensor edges rather than via a host loop. The entire 60-timestep sequence is processed in a single NPU call, eliminating per-step PCIe round-trips. Despite stacking 60 INT8 cells, cumulative quantization error remains bounded.

22.23_FPS

Throughput (full sequence)

0.9997

Pearson r vs fp32

94.0_%

Raw argmax

0.005

MAE

19 hardware contexts, 4.4 MB HEF, 78-minute compile via DFC. GRU's single-state recurrence quantizes more cleanly than LSTM's dual h/c state; the near-perfect correlation across 60 stacked cells is an architectural advantage of GRU for long-unroll deployment.

LSTM cell (per-timestep)

A single-timestep LSTM cell with the standard four-gate structure (input, forget, cell, output) implemented via Conv2D projections. Input shape is the concatenation of the current input vector and both prior states (h_t-1, c_t-1); output is the new state pair (h_t, c_t) concatenated.

2,578_FPS

Throughput (per step)

0.064_ms

HW latency

133_dim

Input vector

128_dim

Output (h+c)

Cross-silicon: Hailo-8 34,224 FPS for the same architecture — over 13× faster than Hailo-10H. The smaller, single-context Hailo-8 dataflow engine handles the lightweight LSTM cell more efficiently than Hailo-10H's multi-context allocator.
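The 133-dimensional input and 128-dimensional output of the per-timestep LSTM cell imply a simple pack/unpack step on the host. The sketch below assumes a 5-feature input and a 64-wide hidden and cell state (5 + 64 + 64 = 133 and 64 + 64 = 128), inferred from the dimensions quoted in this document rather than stated for this cell. The concatenation order, the zero initial state, and the `cell` callable (a stand-in for one NPU inference call) are also assumptions.

```python
INPUT_DIM = 5     # assumption: inferred from the 5-feature sensor vector
HIDDEN_DIM = 64   # assumption: 128-dim output split evenly into h and c

def pack_input(x, h, c):
    """Concatenate the current input with both prior states (133 values)."""
    assert len(x) == INPUT_DIM
    assert len(h) == HIDDEN_DIM and len(c) == HIDDEN_DIM
    return list(x) + list(h) + list(c)

def unpack_output(out):
    """Split the 128-value cell output back into (h, c)."""
    assert len(out) == 2 * HIDDEN_DIM
    return out[:HIDDEN_DIM], out[HIDDEN_DIM:]

def run_sequence(cell, xs):
    """Host-driven recurrence: one call to `cell` per timestep."""
    h = [0.0] * HIDDEN_DIM
    c = [0.0] * HIDDEN_DIM
    for x in xs:
        h, c = unpack_output(cell(pack_input(x, h, c)))
    return h, c
```

Because the state makes a full round trip through the host on every step, this pattern pays the per-call transfer cost once per timestep, in contrast to the unrolled encoders that process the whole sequence in one call.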

LSTM encoder 60-step unrolled

A 60-step unrolled LSTM encoder taking a (60×1×5) sensor sequence and producing the concatenated final hidden and cell state (1×1×128) in a single NPU call. The full recurrence is graph-flattened — 22 hardware contexts with 742 LCUs mapped across the unroll. The output is consumed by three fp32 prediction heads on the host (imminent fault, warning fault, early fault), each Linear 64→64→8.

27.47_FPS

Throughput (full sequence)

0.962

Pearson r vs fp32

90.5_%

All-3-head argmax (n=200)

0.015

MAE

Per-head argmax agreement: imminent 92.0%, warning 98.0%, early 100.0%. Disagreement concentrates on close-margin predictions in the imminent-fault classifier where INT8 range compression can swap near-tied logits. The recommended production path for HVAC fault classification remains the TCN architectural rewrite (~7,000 FPS, full classification fidelity), with the unrolled-LSTM technique reserved for retrofit cases requiring exact LSTM semantics. Cross-silicon: Hailo-8 89.58 FPS for the same architecture.

Research transformer architectures

Compact transformer-class models compiled to Hailo-10H, demonstrating AxonML-trained transformer architectures running on production silicon. These entries cover small reference architectures (GPT-2 tiny, Phi tiny), a recurrent-depth transformer (RDT-tiny / Huginn), and the SAE feature-dictionary work used in interpretability research.

GPT-2 tiny

A GPT-2-style decoder-only transformer at the smallest published configuration, compiled through DFC and benchmarked on Hailo-10H. The Hailo-8 cross-target compile is 2.7× faster than the Hailo-10H result — a recurring pattern across small transformer models.

1,743_FPS

Throughput

H8 4,757_FPS

Cross-silicon

270K_params

Model size

352_KB

HEF size

Phi tiny

A Phi-architecture decoder-only transformer at a tiny configuration with rotary position embeddings, compiled through DFC and benchmarked on Hailo-10H.

1,633_FPS

Throughput

H8 2,385_FPS

Cross-silicon

270K_params

Model size

352_KB

HEF size

RDT-tiny (Huginn recurrent-depth)

A recurrent-depth transformer following the Huginn architecture: a prelude of 2 standard decoder layers, a core of 4×4 iterated recurrent-depth applications (16 effective layer applications through weight reuse), and a coda of 2 standard layers for a total of 20 effective decoder applications at 109M parameters. Compiles to 27 hardware contexts on Hailo-10H.

16.48_FPS

Throughput (seq=64)

~61_ms

Latency per pass

109M_params

Model size

27_contexts

HW partition

RDT-mid (Oracle-7B distill target)

The production-scale recurrent-depth transformer: 4 prelude layers, an 8-layer shared core iterated K times (default K=8, tunable per-request), and a 4-layer coda. 873M parameters at d=2048 with 32 attention heads and grouped-query attention (8 KV heads). Distilled from Oracle-7B (DeepSeek-R1 1.5B finetune) as a test-time-compute alternative: spend extra NPU iterations instead of extra parameters. The same core weights are reused across all K iterations, making the model memory-efficient relative to its effective depth.

873M_params

Model size

K×8_layers

Effective depth (tunable)

d=2048

Hidden dimension

Hailo-10H

Target
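The layer-count arithmetic for the recurrent-depth models can be written out as a short sketch. It uses only the prelude, core, and coda sizes stated in the RDT-tiny and RDT-mid descriptions; the function names are hypothetical. Note that the "K×8 layers" tile above counts the core alone, whereas this helper also adds the prelude and coda.

```python
def effective_applications(prelude, core_layers, iterations, coda):
    """Decoder-layer applications per forward pass, counting weight reuse."""
    return prelude + core_layers * iterations + coda

def unique_layers(prelude, core_layers, coda):
    """Distinct layers whose weights must actually be stored."""
    return prelude + core_layers + coda

# RDT-tiny: prelude 2, core 4 layers iterated 4 times, coda 2.
rdt_tiny = effective_applications(2, 4, 4, 2)

# RDT-mid at the default K=8: prelude 4, core 8 layers, coda 4.
rdt_mid = effective_applications(4, 8, 8, 4)
rdt_mid_stored = unique_layers(4, 8, 4)

print(rdt_tiny, rdt_mid, rdt_mid_stored)
```

Under these figures RDT-tiny performs 20 layer applications, matching the description above, while RDT-mid at K=8 performs 72 applications from 16 stored layers, and each additional iteration adds 8 applications without adding parameters.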

Trident-Blog distill (1.58-bit compact)

Compact 4-layer 1.58-bit transformer distilled from the Trident-Coder model for edge-native code completion. Conv2D representation with d=128, intermediate=512, and batch-normalized residual connections. Designed as the smallest viable model that still produces syntactically correct code completions at the edge without network connectivity.

2.1_MB

ONNX size

4_layers

Depth

d=128

Hidden dimension

H8 + H10

Dual target

Sparse Autoencoders (SAE)

Sparse autoencoder models trained in AxonML for feature-dictionary learning and mechanistic interpretability research. SAEs decompose intermediate representations from larger models into sparse, interpretable feature directions, enabling downstream analysis of what individual features encode. The architecture comprises an encoder projection, a top-k sparsity gate, and a decoder reconstruction; only k of N feature directions are active for any given input, producing the sparse-dictionary representation that gives the family its name.

3,676_FPS

Throughput

4,412_KB

HEF size

Hailo-10H

Target

INT8

Quantization
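The encoder, top-k gate, and decoder structure described for the sparse autoencoders can be sketched as follows. This is a generic top-k SAE forward pass in plain Python, not the AxonML model: dictionary size, k, the presence of biases, and the function names are assumptions, and some top-k SAE variants also apply a ReLU before the gate, which this sketch omits.

```python
def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def top_k_gate(acts, k):
    """Keep the k largest activations and zero the rest."""
    keep = set(sorted(range(len(acts)),
                      key=lambda i: acts[i], reverse=True)[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Return (sparse_code, reconstruction) for one input vector."""
    pre = [a + b for a, b in zip(matvec(W_enc, x), b_enc)]
    code = top_k_gate(pre, k)
    recon = [a + b for a, b in zip(matvec(W_dec, code), b_dec)]
    return code, recon
```

At most k entries of the code are non-zero for any input, so each reconstruction is a combination of only k decoder columns, which is what makes individual feature directions inspectable.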

Vision baselines & detection

Standard vision architectures compiled to validate the full Conv2D pipeline on both Hailo-8 and Hailo-10H, serving as reference points for both targets.

BlazeFace (face detection)

Lightweight depthwise-separable face detector following the MediaPipe BlazeFace architecture. 5 depthwise-separable convolution stages at 128×128 input resolution with dual classification and bounding-box regression heads. Trained in AxonML for the Aegis biometric pipeline's face-localization front-end.

128×128

Input resolution

47_KB

ONNX size

DW-Sep

Architecture

H8 + H10

Dual target

ResNet-18 (CIFAR-10)

Standard 18-layer residual network trained on CIFAR-10 (32×32 RGB, 10 classes). Validates skip-connection handling through the quantization pipeline. 11.2M parameters with batch normalization fused into convolutions during optimization.

32×32

Input resolution

11.2M_params

Model size

18_layers

Depth

H8 + H10

Dual target
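The batch-normalization fusion mentioned for ResNet-18 follows a standard identity: an inference-time BN layer after a convolution is an affine map per output channel, so it can be folded into the convolution's weights and bias. The sketch below illustrates that identity in plain Python. It is not the DFC's optimization pass; the function name, the flattened per-channel weight layout, and the `eps` default are assumptions.

```python
import math

def fold_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold per-channel BN statistics into conv weights and bias.

    weights: one flattened kernel (list of floats) per output channel.
    All other arguments: one float per output channel.
    """
    folded_w, folded_b = [], []
    for w_c, b_c, g, bt, m, v in zip(weights, bias, gamma, beta, mean, var):
        s = g / math.sqrt(v + eps)          # per-channel scale
        folded_w.append([w * s for w in w_c])
        folded_b.append((b_c - m) * s + bt)
    return folded_w, folded_b
```

After folding, the BN layer disappears from the graph, so the quantizer sees a single convolution per block rather than a convolution followed by a separate normalization.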

VGG-11 (CIFAR-10)

11-layer VGG architecture with batch normalization, trained on CIFAR-10. Pure sequential Conv+BN+ReLU+Pool tower with no skip connections—a stress test for deep sequential quantization accuracy and NPU memory bandwidth utilization.

32×32

Input resolution

9.2M_params

Model size

11_layers

Depth

H8 + H10

Dual target

NexusWatch perception suite

Four perception models running on the live NexusWatch deployment Pi (Hailo-10H, HailoRT 5.3.0). The suite handles long-range optical perception with motion, shape, and anomaly classification over a wide-area sensor field. All four models run concurrently on the same device with shared HailoRT VDevice resources.

Igigi (detector)

The detection front-end of the NexusWatch perception pipeline. Identifies candidate aerial objects in the optical field and produces bounding-box detections with confidence scores consumed by the downstream Namtar / Shamash / Nisaba classifiers.

243_FPS

Throughput

Live

Production status

Hailo-10H

Target

Detection

Role

Namtar (anomaly)

Anomaly classifier over the Igigi detection stream. Flags detections whose feature signature deviates from the learned distribution of expected objects in the sensor field, providing the first-line filter for novelty detection.

4,489_FPS

Throughput

Live

Production status

Hailo-10H

Target

Anomaly

Role

Shamash (motion)

Motion classifier characterizing the trajectory dynamics of detected objects. Produces motion signatures used downstream to distinguish ballistic, powered, and aerodynamic motion patterns.

4,072_FPS

Throughput

Live

Production status

Hailo-10H

Target

Motion

Role

Nisaba (shape)

Shape classifier producing geometric descriptors of detected objects. Combined with the Namtar anomaly score and Shamash motion signature, the Nisaba shape descriptor provides the third dimension in the perception suite's classification space.

4,243_FPS

Throughput

Live

Production status

Hailo-10H

Target

Shape

Role

Part III · Hailo-8

Hailo-8 Apollo suite

The Apollo suite comprises 8 purpose-built HVAC predictive control architectures, each targeting a specific aspect of commercial building thermal management: supply air prediction, airflow optimization, wind-chill compensation, multi-zone coordination, geothermal optimization, hydronic flow prediction, and combustion efficiency modeling.

Table — Apollo suite silicon benchmarks on Hailo-8.
Model	Domain	FPS	HW Lat (ms)
Apollo	HVAC Predictive Control	1,475.02	0.706
Aquilo	HVAC Predictive Control	11,596.20	0.109
Boreas	HVAC Predictive Control	5,391.30	0.223
Colossus	HVAC Predictive Control	457.25	2.209
Gaia	HVAC Predictive Control	394.12	2.559
Naiad	HVAC Predictive Control	34,660.90	0.044
Vulcan	HVAC Predictive Control	10,145.30	0.121
Zephyrus	HVAC Predictive Control	10,483.50	0.118

Apollo

Apollo is the flagship predictive control model for commercial HVAC systems. It forecasts supply air temperature, zone demand, and equipment staging from multi-sensor inputs including outdoor air, return air, zone temperatures, and occupancy signals.

1,475_FPS

Throughput

0.706_ms

HW latency

N/A_ms

E2E latency

DFC Profiler Report — Apollo on Hailo-8

Aquilo

Aquilo is a lightweight thermal prediction model optimized for constrained edge deployment. It predicts supply air temperature trends from minimal sensor inputs with sub-0.11 ms latency.

11,596_FPS

Throughput

0.109_ms

HW latency

N/A_ms

E2E latency

DFC Profiler Report — Aquilo on Hailo-8

Boreas

Boreas specializes in outdoor air temperature and wind-chill compensation for HVAC economizer control. It models the relationship between ambient conditions and building thermal load.

5,391_FPS

Throughput

0.223_ms

HW latency

N/A_ms

E2E latency

DFC Profiler Report — Boreas on Hailo-8

Colossus

Colossus is a deep multi-zone thermal model for large commercial buildings. It simultaneously predicts thermal trajectories across multiple HVAC zones, enabling coordinated control strategies.

457_FPS

Throughput

2.209_ms

HW latency

N/A_ms

E2E latency

DFC Profiler Report — Colossus on Hailo-8

Gaia

Gaia models ground-source heat pump systems, predicting loop temperatures and COP from ground conditions, building load, and historical cycling patterns.

394_FPS

Throughput

2.559_ms

HW latency

N/A_ms

E2E latency

DFC Profiler Report — Gaia on Hailo-8

Naiad

Naiad predicts hydronic system flow rates and delta-T from pump speed, valve position, and temperature sensor inputs. Ultra-lightweight for deployment on the smallest controllers.

34,661_FPS

Throughput

0.044_ms

HW latency

N/A_ms

E2E latency

DFC Profiler Report — Naiad on Hailo-8

Vulcan

Vulcan models gas-fired heating equipment efficiency, predicting flue temperature, combustion efficiency, and heat exchanger fouling from operational parameters.

10,145_FPS

Throughput

0.121_ms

HW latency

N/A_ms

E2E latency

DFC Profiler Report — Vulcan on Hailo-8

Zephyrus

Zephyrus predicts supply and return airflow rates from fan speed, static pressure, and duct configuration inputs for VAV and constant-volume air handling systems.

10,484_FPS

Throughput

0.118_ms

HW latency

N/A_ms

E2E latency

DFC Profiler Report — Zephyrus on Hailo-8

Hailo-8 fleet models

Site-specific HVAC predictive control models trained on operational data from production building automation systems. Each model is compiled as a dedicated HEF binary and deployed to the Raspberry Pi 5 controller at its respective site.

Fleet deployment overview

The AutomataNexus production fleet runs LSTM and GRU predictive models across twelve commercial and residential sites: Akron Public Library, Byrna Ammunition Production, Element Labs, First Church of God, Hopebridge Therapy Center, Heritage Huntington, Heritage Warren Healthcare Campus, NE Realty Group, St. Jude Church and School, Peabody Retirement Community, Taylor University, and a residential lake-house deployment. Each site runs a site-specific LSTM and GRU model pair whose architecture is matched to the equipment under control, compiled to Hailo-8 HEFs through DFC v3.33.1 and deployed to a Raspberry Pi 5 controller in the mechanical room. The broader fleet's models are not itemized in this document. The three sites featured below represent the most complex equipment-control surfaces in the deployed fleet; they are documented in detail because their LSTM and GRU implementations exercise the full range of the recurrent-architecture validation methodology described in the Hailo-10H portfolio section.

FCOG Mechroom

The First Church of God mechanical room runs predictive control over a substantial heating and cooling plant: 2 chillers, 4 boilers, 6 pumps, and 4 variable-frequency drives (VFDs). The site's LSTM and GRU pair predicts equipment failure modes across the cross-coupled chilled-water and hot-water loops, with the VFDs providing fine-grained control surfaces and the pump array distributing flow across both loops.

Enlil (FCOG Mechroom LSTM)

Multi-horizon LSTM predictor for the FCOG Mechroom equipment cluster. Predicts equipment failure modes across chillers, boilers, pumps, and VFDs at three horizons (5/15/30 min) based on real-time sensor telemetry from the mechanical room.

418_FPS

Throughput

1.488_ms

HW latency

Hailo-8

Target

LSTM

Architecture

DFC Profiler Report — FCOG Mechroom LSTM on Hailo-8

Enki (FCOG Mechroom GRU)

GRU-based anomaly detector for the FCOG Mechroom equipment cluster. Companion model to the LSTM predictor; the GRU produces real-time anomaly scores while the LSTM produces forward-looking failure predictions, with both models running concurrently on the same Pi 5 controller.

50,583_FPS

Throughput

0.057_ms

HW latency

Hailo-8

Target

GRU

Architecture

DFC Profiler Report — FCOG Mechroom GRU on Hailo-8

Taylor Greenhouse

The Taylor University greenhouse facility runs predictive control over an environmental management system: 1 supply fan, 2 exhaust fans, 4 fan coils, temperature and relative humidity sensors, a UV sensor, retractable roof and side vents, and 2 hot-water wall radiant heaters. The site's LSTM and GRU pair handles a unique mix of HVAC airflow control, radiant heating, and physical envelope manipulation (roof and vent positioning) in response to environmental conditions.

Ninhursag (Taylor Greenhouse LSTM)

Multi-horizon LSTM predictor for the Taylor Greenhouse environmental control system. Integrates airflow telemetry from supply and exhaust fans, fan-coil performance, radiant-heater output, and envelope-position state (roof and vent actuators) with the temperature, relative humidity, and UV sensor stream to predict environmental excursions and equipment failures.

418_FPS

Throughput

1.488_ms

HW latency

Hailo-8

Target

LSTM

Architecture

DFC Profiler Report — Taylor Greenhouse LSTM on Hailo-8

Dumuzi (Taylor Greenhouse GRU)

GRU-based anomaly detector for the Taylor Greenhouse environmental control system. Companion model to the LSTM predictor; provides real-time anomaly scoring across the airflow, radiant, and envelope subsystems while the LSTM produces multi-horizon failure predictions.

54,144_FPS

Throughput

0.057_ms

HW latency

Hailo-8

Target

GRU

Architecture

DFC Profiler Report — Taylor Greenhouse GRU on Hailo-8

Peabody Mechroom

The Peabody Retirement Community mechanical room is the largest equipment-control surface in the fleet: 3 cooling towers, 2 boilers, 1 heat exchanger, 6 pumps, 3 VFDs, and a condenser-loop heat exchanger. The site's LSTM and GRU pair predicts and monitors a multi-loop hydronic system spanning condenser water, chilled water, and heating hot water with cross-coupled flow paths through the heat exchangers.

Gibil (Peabody Mechroom LSTM)

Multi-horizon LSTM predictor for the Peabody Mechroom equipment cluster. Predicts failure modes across cooling towers, boilers, pumps, VFDs, and the dual heat exchangers at three horizons (5/15/30 min). The condenser-loop heat exchanger introduces additional coupling between the cooling-tower loop and the chilled-water loop that the model accounts for in its prediction surface.

Throughput	418 FPS
HW latency	1.488 ms
Target	Hailo-8
Architecture	LSTM

DFC Profiler Report — Peabody Mechroom LSTM on Hailo-8

Nammu (Peabody Mechroom GRU)

GRU-based anomaly detector for the Peabody Mechroom equipment cluster. Companion model to the LSTM predictor; produces real-time anomaly scores across the multi-loop hydronic system while the LSTM produces forward-looking failure predictions.

Throughput	59,695 FPS
HW latency	0.057 ms
Target	Hailo-8
Architecture	GRU

DFC Profiler Report — Peabody Mechroom GRU on Hailo-8

Part III · Hailo-8 · 07

Hailo-8 special models

Sentinel (anomaly detection), Motion Classifier (occupancy), Detector (object detection), Atropos (TCN replacement for GRU), and ATLAS (autonomous racing) — special-purpose models that serve cross-cutting concerns across the fleet or demonstrate novel architecture compilations. A companion cross-silicon section follows, documenting Hailo-8 entries for models also compiled to Hailo-10H.

Table — Special-purpose models on Hailo-8.
Model	Domain	Throughput (FPS)	HW latency (ms)
ATLAS	Autonomous Racing	1,091.67	1.015
Atropos	HVAC Predictive Control	6,982.23	0.178
Detector	Object Detection	319.02	6.292
Motion Classifier	Occupancy Detection	36,300.80	0.003
Sentinel	Anomaly Detection	21,136.50	0.005

Atropos

Atropos began as an LSTM/GRU recurrent model and was re-architected as a temporal convolutional network (TCN) for Hailo compatibility. It keeps the same input/output contract while replacing the recurrent gates with dilated causal convolutions.

Throughput	6,982 FPS
HW latency	0.178 ms
E2E latency	N/A

DFC Profiler Report — Atropos on Hailo-8

Detector

General-purpose object detection model for occupancy counting and zone activity monitoring.

Throughput	319 FPS
HW latency	6.292 ms
E2E latency	N/A

Motion Classifier

Binary motion/occupancy classifier for zone-level occupancy detection. At 36,000+ FPS, it adds negligible inference overhead when run alongside the primary HVAC models.

Throughput	36,301 FPS
HW latency	0.003 ms
E2E latency	N/A

Sentinel

Sentinel is a lightweight anomaly detection model that flags abnormal HVAC operating conditions from multi-sensor inputs. Running at 21,000+ FPS, it provides continuous real-time monitoring.

Throughput	21,136 FPS
HW latency	0.005 ms
E2E latency	N/A

DFC Profiler Report — Sentinel on Hailo-8

Part III · Hailo-8 · 08

Hailo-8 cross-silicon entries

Models from the Hailo-10H portfolio that were additionally compiled and benchmarked on Hailo-8. Throughput numbers below are measured on the same dual-Pi 5 testbench with HailoRT 4.20.0 (Hailo-8 path) and HailoRT 5.3.0 (Hailo-10H path). The cross-silicon comparison surfaces a recurring finding: small models with single-context allocation often run faster on Hailo-8 than on Hailo-10H, despite Hailo-10H's higher nominal TOPS rating. This is consistent across the seven models listed below.

Table — Cross-silicon throughput comparison for models compiled to both Hailo-8 and Hailo-10H. Measured on Pi 5 dual-testbench, HailoRT 4.20.0 / 5.3.0.
Model	Domain	H8 FPS	H10 FPS	H8 / H10 ratio
Trident TCN	LLM backbone	20,133	2,109	9.5×
Trident 1.58-bit	BitNet ternary LM	19,388	3,010	6.4×
LSTM cell	Recurrent (per-step)	34,224	2,578	13.3×
LSTM encoder 60-step	Recurrent (unrolled)	89.58	27.47	3.3×
GRU cell (fused)	Recurrent (per-step)	27,027	1,232	21.9×
GPT-2 tiny	Transformer	4,757	1,743	2.7×
Phi tiny	Transformer + RoPE	2,385	1,633	1.5×
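
The ratio column can be re-derived from the two throughput columns. The short Python sketch below does this for the figures in the table; the dictionary and function names are illustrative only and are not part of any AxonML or Hailo tooling.

```python
# Throughput figures (FPS) copied from the cross-silicon table: (Hailo-8, Hailo-10H).
MEASURED_FPS = {
    "Trident TCN": (20133, 2109),
    "Trident 1.58-bit": (19388, 3010),
    "LSTM cell": (34224, 2578),
    "LSTM encoder 60-step": (89.58, 27.47),
    "GRU cell (fused)": (27027, 1232),
    "GPT-2 tiny": (4757, 1743),
    "Phi tiny": (2385, 1633),
}


def h8_over_h10_ratio(h8_fps, h10_fps):
    """Return the Hailo-8 / Hailo-10H throughput ratio, rounded to one decimal."""
    return round(h8_fps / h10_fps, 1)


# Ratio per model, keyed by the model name used in the table.
RATIOS = {
    name: h8_over_h10_ratio(h8, h10) for name, (h8, h10) in MEASURED_FPS.items()
}

if __name__ == "__main__":
    for name, ratio in RATIOS.items():
        print(f"{name}: {ratio}x")
```

Every recomputed ratio is above 1.0, which matches the observation that all seven models run faster on Hailo-8 in this comparison.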

Interpretation

The Hailo-8 dataflow engine is simpler than Hailo-10H's: where Hailo-10H must partition most non-trivial models across multiple hardware contexts (with associated context-switch overhead between partition boundaries), Hailo-8 fits these smaller models in a single context with no partitioning required. For workloads in this size range, the Hailo-8 architecture's lower allocation overhead more than compensates for its lower nominal TOPS rating. The conventional choice of Hailo-10H for transformer and recurrent workloads is therefore not always the throughput-optimal choice — particularly for the kinds of small, fits-in-single-context models used in industrial deployments where the dataflow accelerator is shared across many model classes.

The fused-gate GRU cell shows the most extreme cross-silicon ratio, at 21.9×. Numerical fidelity is identical on both chips (Pearson 0.9998 on each), which indicates that the throughput delta is a partitioning effect rather than a precision tradeoff: deploying fused-gate recurrent cells on Hailo-8 costs no output quality.
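
For readers unfamiliar with the fidelity metric, the following is a minimal sketch of a Pearson correlation between a reference output vector and a dequantized accelerator output vector. It is not the validation harness used for this portfolio; the function name and the toy vectors are assumptions chosen for illustration, and the toy values are not measurements.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy stand-ins (NOT measured data): an fp32 reference output and an
# INT8 output after dequantization, differing by small rounding errors.
fp32_reference = [0.10, -0.42, 0.88, 0.03, -0.71]
int8_dequantized = [0.11, -0.41, 0.87, 0.03, -0.70]

r = pearson(fp32_reference, int8_dequantized)

if __name__ == "__main__":
    print(f"Pearson r = {r:.4f}")
```

A value close to 1.0 means the quantized output tracks the reference almost linearly. Comparing the coefficient obtained on each chip is what separates a precision effect from a scheduling or partitioning effect.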

Part IV · LLMs · 09

LLM benchmarks

Three large language models were evaluated on the Hailo-10H accelerator via the hailo-ollama integration pipeline. These models run natively on the Hailo-10H NPU, demonstrating that edge LLM inference is viable on fixed-function accelerator silicon.

Table — LLM inference benchmarks on Hailo-10H via hailo-ollama.
Model	Parameters	Tokens/sec	Notes
Llama 3.2	1B	9.98	Meta Llama 3.2 1B via hailo-ollama
DeepSeek-R1	1.5B	7.41	DeepSeek R1 distilled 1.5B via hailo-ollama
Qwen3	1.7B	4.84	Alibaba Qwen3 1.7B via hailo-ollama

Llama 3.2 1B	9.98 tok/s
DeepSeek-R1 1.5B	7.41 tok/s
Qwen3 1.7B	4.84 tok/s

Pipeline

The hailo-ollama pipeline provides an Ollama-compatible REST API backed by the Hailo-10H NPU for token generation. Models are loaded in GGUF format, with compute-intensive matrix multiplications offloaded to the NPU via the HailoRT runtime. The host CPU handles tokenization, sampling, and KV-cache management while the NPU executes the forward pass for each generated token.
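
Because the API is described as Ollama-compatible, a client can in principle be written against Ollama's public REST interface. The sketch below builds a non-streaming `/api/generate` request and derives a decode rate from the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama documents. That hailo-ollama exposes this exact endpoint and these timing fields is an assumption, and the base URL and model tag shown in the usage note are placeholders.

```python
import json
import urllib.request


def build_generate_request(base_url, model, prompt):
    """Build a non-streaming POST request in the style of Ollama's /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def decode_rate(response_json):
    """Tokens per second from Ollama-style timing fields.

    Assumes `eval_count` (generated tokens) and `eval_duration`
    (nanoseconds) are present, as in Ollama's documented response.
    """
    seconds = response_json["eval_duration"] / 1e9
    return response_json["eval_count"] / seconds


def generate(base_url, model, prompt, timeout_s=120):
    """Send the request and return the parsed JSON body (needs a live server)."""
    request = build_generate_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call might look like `decode_rate(generate("http://localhost:8000", "some-model-tag", "Hello"))`, where both the address and the model tag are hypothetical and depend on the local installation.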

These benchmarks show that sub-2B-parameter LLMs sustain roughly 5 to 10 tokens per second on dedicated edge accelerator hardware, an interactive rate that enables local AI assistants, on-device code generation, and conversational interfaces without cloud dependency.

Note

LLM benchmarks measure tokens-per-second generation rate during sustained autoregressive decoding. Unlike the compiled HEF models in the rest of this portfolio, LLMs run through the hailo-ollama runtime rather than direct HEF execution. Throughput is limited by sequential token generation rather than batch inference.
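
Since decoding is sequential, the tokens-per-second figures convert directly into a per-token latency and an approximate time to finish a reply. The sketch below performs that arithmetic on the three measured rates; the 100-token reply length is a hypothetical example, and prompt processing time is deliberately left out.

```python
# Measured decode rates from the LLM benchmark table (tokens per second).
TOKENS_PER_SECOND = {
    "Llama 3.2 1B": 9.98,
    "DeepSeek-R1 1.5B": 7.41,
    "Qwen3 1.7B": 4.84,
}


def ms_per_token(tokens_per_second):
    """Average time to generate one token, in milliseconds."""
    return 1000.0 / tokens_per_second


def seconds_for_reply(tokens_per_second, n_tokens):
    """Decode-only time for a reply of n_tokens (ignores prompt processing)."""
    return n_tokens / tokens_per_second


if __name__ == "__main__":
    reply_tokens = 100  # hypothetical reply length, for illustration only
    for name, tps in TOKENS_PER_SECOND.items():
        print(
            f"{name}: {ms_per_token(tps):.1f} ms/token, "
            f"{seconds_for_reply(tps, reply_tokens):.2f} s per {reply_tokens} tokens"
        )
```

At these rates a token arrives roughly every 100 to 207 ms, so a streamed reply appears progressively rather than after a single long wait.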

Part V · Infrastructure · 10

Compilation pipeline

The 84 models in this portfolio were compiled from trained AxonML checkpoints through the Hailo Dataflow Compiler (DFC) and validated on production silicon via HailoRT. This section documents the compilation pipeline and the architectural constraints that inform AxonML's model design decisions for Hailo targets.

Compilation pipeline

Each model in the portfolio follows the same compilation path: trained AxonML checkpoint → ONNX export → DFC compile (v3.33.1 for Hailo-8, v5.3.0 for Hailo-10H) with INT8 post-training quantization and DFC's calibration pass → HEF binary → deployment via HailoRT (4.20.0 on Hailo-8, 5.3.0 on Hailo-10H) on the dual-Pi 5 testbench. Throughput and latency measurements were taken with hailortcli run against each compiled HEF on the corresponding silicon target.

Compiler versions

DFC 3.33.1	Hailo-8 compilation. Targets the HAILO8 architecture. The Hailo-8 portfolio in this document was compiled with this version.
DFC 5.3.0	Hailo-10H compilation. Targets the HAILO15H/HAILO10H architecture family. The Hailo-10H portfolio in this document was compiled with this version.

Architecture constraints

The compilation pipeline enforces a family of architectural constraints inherited from the underlying Hailo dataflow silicon. These constraints inform AxonML's architectural decisions during model design:

No dynamic control flow. All tensor shapes must be statically known at compile time. Variable-length sequences require padding to a fixed maximum length, and data-dependent branching is unsupported.
Softmax is a hardware-bounded resource. The dataflow engine includes dedicated softmax units; transformer self-attention, which uses softmax in the attention kernel, contends for these units. Replacing softmax attention with depthwise convolution or sigmoid gating yields equivalent accuracy with better hardware utilization for throughput-bound workloads.
INT8 quantization mandatory. All activations and weights are quantized to INT8 during the compilation pass. Models must be calibration-friendly; architectures with extreme dynamic ranges may require quantization-aware training. The fused-gate GRU variant in this portfolio illustrates how minor architectural rewrites can substantially improve quantization quality.
No in-place operations. Operations like in-place addition or in-place ReLU must be rewritten as explicit allocations for the dataflow graph. AxonML's exporter enforces this transformation automatically during ONNX export.
Single-batch inference. The dataflow architecture processes one sample at a time through the fixed-function pipeline. Throughput is achieved via pipeline parallelism across layers, not batch parallelism.
Depthwise-separable convolutions preferred. Standard dense convolutions with large kernel sizes may not fit the hardware's multiply-accumulate budget. Depthwise separable convolutions decompose the operation into hardware-friendly stages and are the preferred construction for new AxonML architectures targeting Hailo silicon.
Recurrent gates require unrolling or per-step offload. Native GRU and LSTM op variants are not directly compiled; recurrent networks must be either unrolled to a fixed sequence length (graph-flattened recurrence) or compiled as per-timestep cells with the recurrence loop on the host. Both approaches are documented in this portfolio with full correlation validation.
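
Two of the constraints above, fixed sequence length and host-side recurrence, can be illustrated together. The sketch below pads a variable-length input to a fixed length and then drives a per-timestep cell from a host loop. `cell_step` stands in for one inference call on a compiled per-timestep cell; it and `toy_cell` are hypothetical and do not represent the AxonML or HailoRT API.

```python
def pad_to_fixed_length(seq, max_len, pad_value=0.0):
    """Truncate or right-pad a sequence so it has exactly max_len items."""
    clipped = list(seq[:max_len])
    return clipped + [pad_value] * (max_len - len(clipped))


def run_recurrence_on_host(cell_step, inputs, h0):
    """Drive a per-timestep cell from the host.

    cell_step(x_t, h_prev) -> h_next is a placeholder for one accelerator
    inference; the loop that carries the hidden state lives on the host.
    """
    h = h0
    states = []
    for x_t in inputs:
        h = cell_step(x_t, h)
        states.append(h)
    return states


def toy_cell(x_t, h_prev):
    """Stand-in cell (a leaky integrator), NOT a real GRU or LSTM."""
    return 0.5 * h_prev + 0.5 * x_t


if __name__ == "__main__":
    padded = pad_to_fixed_length([1.0, 1.0], max_len=4)
    print(padded)
    print(run_recurrence_on_host(toy_cell, padded, h0=0.0))
```

Right-padding keeps updating the hidden state after the real data ends, so a caller would typically read the state at the last real timestep rather than the final one. The unrolled alternative moves the same loop into the compiled graph at a fixed length.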

Part V · Infrastructure · 11

Deployment

Compiled HEF binaries deploy to edge controllers equipped with Hailo accelerator modules. The deployment pipeline is fully automated via the NexusDeploy system.

Edge hardware platform

SBC	Raspberry Pi 5 (4 GB / 8 GB)
Accelerator	Hailo AI HAT+ (M.2 key M, PCIe Gen 3)
Hailo-8 module	26 TOPS INT8, ~2.5 W typical power
Hailo-10H module	40 TOPS INT8, ~3.5 W typical power
Runtime	HailoRT (hailort) for PCIe/M.2 dispatch
OS	Raspberry Pi OS (Debian Bookworm, kernel 6.6+)

Deployment pipeline

Train model in AxonML (pure-Rust, CPU/GPU training).
Export to ONNX via AxonML's ONNX serializer.
Compile ONNX to HEF via DFC (v3.33.1 for H8, v5.3.0 for H10).
Validate HEF with hailortcli parse-hef (I/O shape, compiler version, compatibility).
Benchmark on production silicon with hailortcli run.
Deploy HEF to target controller via nexusdeploy (rsync + systemd reload).
Monitor inference via NexusWatch fleet dashboard.
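
The validation and benchmark steps in the list above can be scripted. The sketch below wraps the two `hailortcli` subcommands named in the pipeline (`parse-hef` and `run`) in a small Python helper. The HEF filename is hypothetical, no additional flags are assumed, and the helper defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def parse_hef_cmd(hef_path):
    """Command that inspects a compiled HEF (I/O shapes, compiler version)."""
    return ["hailortcli", "parse-hef", hef_path]


def run_cmd(hef_path):
    """Command that benchmarks a compiled HEF on the attached accelerator."""
    return ["hailortcli", "run", hef_path]


def validate_and_benchmark(hef_path, dry_run=True):
    """Validate, then benchmark; with dry_run=True only print the commands."""
    commands = [parse_hef_cmd(hef_path), run_cmd(hef_path)]
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            # check=True stops the sequence if validation fails.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    validate_and_benchmark("model.hef")  # hypothetical filename
```

Running validation before the benchmark means a HEF built for the wrong target or compiler version fails early, before any throughput figure is recorded.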

Fleet topology

The current deployment fleet consists of Raspberry Pi 5 controllers distributed across commercial HVAC sites in Indiana and Ohio. Each controller runs one or more site-specific HEF models alongside the Talos control daemon, NexusEdge UI, and AegisDB telemetry store. Models execute on the Hailo accelerator asynchronously; the control loop reads predictions at ~1 Hz while the NPU runs inference at thousands of FPS, enabling real-time anomaly detection and predictive staging overlaid on the control cycle.
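
One common way to decouple a fast producer from a slow consumer, as in the asynchronous arrangement described above, is a latest-value mailbox: the inference side overwrites a single slot and the control loop reads whatever is newest on each cycle. The sketch below shows that pattern in Python. It is an assumption about how such a handoff could work, not a description of the Talos daemon, and the class name is hypothetical.

```python
import threading


class LatestPrediction:
    """Single-slot mailbox: writers overwrite, readers get the newest value."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = None
        self._seq = 0  # counts how many predictions have been published

    def publish(self, value):
        """Called by the inference side after every completed inference."""
        with self._lock:
            self._value = value
            self._seq += 1

    def read(self):
        """Called by the control loop once per cycle (about 1 Hz in the text)."""
        with self._lock:
            return self._seq, self._value


# Illustration with made-up anomaly scores: three inferences complete
# before the control loop wakes up and reads once.
mailbox = LatestPrediction()
for score in (0.02, 0.03, 0.91):
    mailbox.publish(score)

seq, latest = mailbox.read()

if __name__ == "__main__":
    print(f"control loop sees prediction #{seq}: {latest}")
```

The sequence counter lets the control loop detect a stalled inference path: if the number has not advanced between two reads, no new prediction has arrived.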

Zero-dependency edge inference

Each HEF binary is a self-contained executable for the Hailo dataflow engine. No Python runtime, ONNX interpreter, TensorFlow Lite delegate, or model-serving framework is required on the edge device. The only runtime dependency is HailoRT, which provides the PCIe transport layer between the ARM host and the Hailo NPU.

Appendices · 12

References

Jewell, A. (2026). AxonML — A Pure-Rust Deep Learning Framework for Edge Silicon. AutomataNexus LLC. Technical whitepaper.
Jewell, A. (2026). NexusEdge — A Tauri-Based Orchestration Layer for Industrial HVAC Building Automation on Raspberry Pi Controllers. Zenodo. https://doi.org/10.5281/zenodo.19892139.
Jewell, A. (2026). The Nexus Stack — A Pure-Rust Industrial Technology Stack for Building Automation. AutomataNexus LLC. Technical whitepaper.
Jewell, A. (2026). Trident — A 1.58-bit Ternary Language Model for Edge Inference. AutomataNexus LLC. Technical whitepaper.
Jewell, A. (2026). Aegis — Multi-Modal Biometric Authentication on Edge Silicon. AutomataNexus LLC. Technical whitepaper.
Hailo Technologies Ltd. (2024). Hailo Dataflow Compiler User Guide. DFC v3.x / v5.x documentation.
Hailo Technologies Ltd. (2024). HailoRT — Hailo Runtime API Reference. HailoRT v4.x / v5.x documentation.
Hailo Technologies Ltd. (2024). Hailo-8 Datasheet. Product specification. 26 TOPS INT8.
Hailo Technologies Ltd. (2024). Hailo-10H Datasheet. Product specification. 40 TOPS INT8.
Ma, S. et al. (2024). The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. arXiv:2402.17764.
Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752.
Geiping, J. et al. (2025). Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach. Huginn architecture reference for RDT-tiny.
Su, J. et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.
Bai, S., Kolter, J. Z., & Koltun, V. (2018). An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv:1803.01271. TCN reference for the Trident TCN, Atropos, and HVAC fleet predictors.
Raspberry Pi Ltd. (2024). Raspberry Pi 5 Datasheet. Product specification.

AutomataNexus LLC · Fort Wayne, Indiana
Andrew Jewell Sr. · ORCID 0009-0005-2158-7060
Generated from real Hailo-8 and Hailo-10H silicon measurements · AxonML Framework · April 2026

DISCLAIMER: This document and its contents are proprietary to AutomataNexus LLC. Performance figures are measured on production silicon under controlled conditions. Reproduction or redistribution without written permission is prohibited.
