AutomataNexus LLC · Whitepaper · May 2026
Large Language Model

Trident 1.58-bit (BitNet b1.58)
Ternary Quantized Conv2D Transformer

Trident is the flagship AxonML LLM on Hailo-10H silicon. It uses BitNet b1.58 ternary weights trained from scratch in the AxonML framework. The 12-layer transformer (d=1024, intermediate=3072) is compiled via DFC 5.3.0 and achieves 3,362 FPS at 48.7 °C, demonstrating that billion-scale language model architectures can execute on Hailo silicon.

Author        Andrew Jewell Sr.
Organization  AutomataNexus LLC
Framework     AxonML (Rust)
Silicon       Hailo-10H

Abstract

Background

Trident is the flagship AxonML LLM on Hailo-10H silicon: a 12-layer transformer (d=1024, intermediate=3072) with BitNet b1.58 ternary weights, trained from scratch in the AxonML framework.

Approach

The model was trained in AxonML (a pure-Rust deep learning framework) and compiled through the Hailo Dataflow Compiler (DFC 5.3.0) targeting Hailo-10H silicon. Post-training INT8 quantization was applied during the DFC compilation pass with production telemetry calibration data. The resulting Hailo Executable Format (HEF) binary executes on Hailo’s fixed-function dataflow architecture with deterministic latency and zero framework overhead at the edge.
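The calibration step can be illustrated with a minimal sketch of symmetric post-training INT8 quantization. It assumes a simple max-abs calibration rule; the DFC's actual calibration algorithm is not described in this paper and may differ, and the function names and sample values below are hypothetical.

```python
def calibrate_scale(calibration_batches):
    """Return a symmetric INT8 scale from the largest magnitude seen in calibration data.

    Assumption: max-abs calibration. This is an illustrative rule, not the DFC's.
    """
    max_abs = max(abs(x) for batch in calibration_batches for x in batch)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize_int8(values, scale):
    """Map floats to integer codes in [-127, 127] by rounding and clipping."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize_int8(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


# Hypothetical activation samples standing in for calibration data.
calibration_batches = [[0.10, -0.80, 0.35], [1.27, -0.42, 0.05]]
scale = calibrate_scale(calibration_batches)

# 2.0 lies outside the calibrated range, so it saturates at the top code.
codes = quantize_int8([0.5, -1.27, 2.0], scale)
restored = dequantize_int8(codes, scale)
```

The saturation of the out-of-range value shows why calibration data should be representative of deployment inputs: anything beyond the calibrated range is clipped.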

Results

On production hardware (Hailo-10H, NexusWatch, FW 5.3.0), Trident 1.58-bit (BitNet b1.58) achieves 3,362 FPS at an average die temperature of 48.7 °C (max 49.1 °C).

Conclusion

Trident 1.58-bit (BitNet b1.58) is production-ready as a single HEF binary deployed to edge devices with no external dependencies beyond the HailoRT vendor runtime. The model meets real-time latency requirements for its target large language model application.

Model           Trident 1.58-bit (BitNet b1.58)
Domain          Large Language Model
Architecture    Ternary Quantized Conv2D Transformer
Target silicon  Hailo-10H
Measured on     Hailo-10H (NexusWatch, FW 5.3.0)
DFC compiler    5.3.0
Framework       AxonML v0.6 (pure-Rust, CUDA + CPU backends)
Author          Andrew Jewell Sr. · ORCID 0009-0005-2158-7060
Organization    AutomataNexus LLC · Fort Wayne, Indiana

Executive overview

Throughput  3,362 FPS
HW Latency  not reported
Die Temp    48.7 °C
Target      10H

Network I/O

Input: token embeddings [batch, d=1024, seq, 1]. Output: logits [batch, vocab, seq, 1].
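To make the layout concrete, the sketch below shows why a [batch, d, seq, 1] tensor suits a Conv2D transformer: with channels holding the model dimension and the sequence along the height axis, a 1x1 convolution is the same as applying one linear projection to every token independently. The toy sizes (d=2, seq=3) and the function name are illustrative assumptions, not values from the deployed model.

```python
def conv1x1(x, weight):
    """Apply a 1x1 convolution to x laid out as [batch][channel][seq][1].

    weight is [out_channels][in_channels]. A 1x1 kernel has no spatial extent,
    so each sequence position is projected on its own.
    """
    d_in = len(x[0])
    seq = len(x[0][0])
    out = []
    for sample in x:
        out_sample = []
        for w_row in weight:
            channel = []
            for s in range(seq):
                acc = sum(w_row[c] * sample[c][s][0] for c in range(d_in))
                channel.append([acc])  # keep the trailing width-1 axis
            out_sample.append(channel)
        out.append(out_sample)
    return out


# Toy input: batch=1, d=2, seq=3, width=1.
x = [[[[1.0], [2.0], [3.0]],
      [[4.0], [5.0], [6.0]]]]

# Toy projection d=2 -> 3 output channels.
weight = [[1.0, 0.0],
          [0.0, 1.0],
          [1.0, -1.0]]

y = conv1x1(x, weight)  # shape [1][3][3][1]
```

Because the sequence length is a fixed spatial dimension here, the whole projection compiles to a static-shape convolution.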

Architecture

Ternary Quantized Conv2D Transformer

12-layer Conv2D transformer with ternary {-1, 0, +1} weights. Each layer: QKV projection (1x1 conv, d→3d), output projection (1x1 conv, d→d), SwiGLU MLP approximation (1x1 conv d→4d, ReLU gate, 1x1 conv 4d→d), batch-normalized residuals. Split-halves RoPE positional encoding folded into the weight matrices at export time. The ternary weight constraint reduces multiply-accumulate to addition/subtraction, which maps efficiently to Hailo fixed-function compute units.
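The claim that ternary weights reduce multiply-accumulate to addition and subtraction can be sketched as follows. The quantizer follows the absmean rule published for BitNet b1.58 (scale by the mean absolute weight, round, clip to [-1, 1]); applying it per row rather than per weight matrix is a simplification, and whether Trident's export path matches this exactly is an assumption. All names and sample values are hypothetical.

```python
def absmean_ternarize(weights, eps=1e-8):
    """Quantize weights to {-1, 0, +1} using an absmean scale.

    gamma is the mean absolute weight; each weight is divided by gamma,
    rounded to the nearest integer, and clipped to [-1, 1].
    """
    gamma = sum(abs(w) for w in weights) / len(weights)
    ternary = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return ternary, gamma


def ternary_dot(ternary, activations):
    """Dot product with ternary weights using only additions and subtractions."""
    acc = 0
    for t, a in zip(ternary, activations):
        if t == 1:
            acc += a
        elif t == -1:
            acc -= a
        # t == 0 contributes nothing and is skipped
    return acc


weights = [0.9, -0.05, -1.2, 0.4, 0.02, -0.6]
ternary, gamma = absmean_ternarize(weights)

# Integer activations, as they would appear after INT8 quantization.
activations = [3, 7, -2, 5, 11, 4]
result = ternary_dot(ternary, activations)

# Reference result computed with ordinary multiplication.
reference = sum(t * a for t, a in zip(ternary, activations))
```

The add/subtract path and the multiply path agree, which is the property the architecture relies on; the float scale gamma is applied once to the accumulated result rather than per weight.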

Compilation constraints

All AxonML models targeting Hailo silicon are compiled under the following fixed-function dataflow constraints: no dynamic control flow, no variable-length dimensions, all activations representable in INT8 after calibration, and no operations requiring dedicated softmax hardware (these are replaced with ReLU gating or depthwise convolution equivalents where necessary).
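As an illustration of the softmax replacement, the sketch below scores one query against a fixed number of keys using ReLU in place of softmax. The specific form, relu(q·k / sqrt(d)) divided by the fixed sequence length, is one possible softmax-free variant and is an assumption here; the paper does not specify the exact gating used in Trident. Names and values are hypothetical.

```python
import math


def relu_attention_weights(q, keys):
    """Score one query against a fixed-length list of keys without softmax.

    Uses relu(q.k / sqrt(d)) / seq_len, so the computation needs no
    exponentials and no data-dependent normalization.
    """
    d = len(q)
    seq_len = len(keys)  # fixed at compile time in a static-shape graph
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    return [max(0.0, s) / seq_len for s in scores]


q = [1.0, 0.0, 1.0, 0.0]
keys = [
    [2.0, 0.0, 2.0, 0.0],    # strongly aligned with q
    [-1.0, 0.0, -1.0, 0.0],  # anti-aligned: gated to zero
    [0.0, 3.0, 0.0, 3.0],    # orthogonal: zero score
    [1.0, 1.0, 1.0, 1.0],    # partially aligned
]
attn = relu_attention_weights(q, keys)
```

Unlike softmax, these weights do not sum to one; the constant division by the sequence length only keeps their magnitude bounded, which also makes the output range easier to calibrate for INT8.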

Silicon performance

Measured on production hardware via hailortcli benchmark with 5-second sustained inference. Device: Hailo-10H (NexusWatch, FW 5.3.0).

Metric                  Measured Value
FPS (streaming)         3,361.85
Die Temperature (mean)  48.67 °C
Die Temperature (min)   46.52 °C
Die Temperature (max)   49.10 °C
Quantization            INT8 (post-training, DFC calibration)
DFC Compiler            5.3.0
HailoRT                 5.3.0
Measured On             Hailo-10H (NexusWatch, FW 5.3.0)
Table 03-1 — Production silicon measurements, 5s sustained inference.
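The table values can be turned into a few derived quantities with simple arithmetic. The sketch below uses only the measured figures from Table 03-1; the variable names are illustrative. Note that the per-frame interval derived from throughput is not the same as end-to-end latency, since a pipelined dataflow device can have several frames in flight at once.

```python
fps = 3361.85       # streaming FPS from Table 03-1
duration_s = 5.0    # sustained benchmark window
temps_c = {"min": 46.52, "mean": 48.67, "max": 49.10}

# Average time between completed frames, in microseconds.
frame_interval_us = 1_000_000 / fps

# Approximate number of frames processed during the benchmark window.
frames_in_window = fps * duration_s

# Die temperature swing observed during the run.
temp_swing_c = temps_c["max"] - temps_c["min"]

print(f"frame interval: {frame_interval_us:.1f} us")
print(f"frames in window: {frames_in_window:.0f}")
print(f"temperature swing: {temp_swing_c:.2f} C")
```

This works out to roughly 297 microseconds between frames, about 16,800 frames over the 5-second window, and a die temperature swing of about 2.6 °C.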

Deployment

Deployed as a single HEF binary. No ONNX runtime, TensorFlow Lite, or Python inference stack required at the edge.

Target silicon  Hailo-10H
Measured on     Hailo-10H (NexusWatch, FW 5.3.0)
DFC compiler    5.3.0
Quantization    INT8 (post-training, production telemetry calibration)
Runtime         HailoRT (vendor runtime)
Edge platform   Raspberry Pi 5 + Hailo AI HAT+ (M.2 Key M)

Deployment procedure

Copy the .hef binary to the target device. hailortcli run loads the HEF directly into the Hailo-10H dataflow engine over PCIe. Inference begins immediately with deterministic per-frame latency. No model conversion, graph optimization, or warmup phase required.
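The procedure can be scripted on the target device with a small wrapper around the HailoRT CLI. This is a minimal sketch: the file name trident.hef is hypothetical (the paper does not name the binary), the helper functions are illustrative, and only the basic `hailortcli run` and `hailortcli benchmark` invocations mentioned in this paper are used, with no additional flags assumed.

```python
import shutil
import subprocess

HEF_PATH = "trident.hef"  # hypothetical file name, not given in the paper


def build_run_command(hef_path):
    """Return the argv that loads a HEF and runs inference with the HailoRT CLI."""
    return ["hailortcli", "run", hef_path]


def build_benchmark_command(hef_path):
    """Return the argv for the benchmark mode used for the measurements."""
    return ["hailortcli", "benchmark", hef_path]


def run_if_available(argv):
    """Run the command only when its executable is on PATH; return None otherwise."""
    if shutil.which(argv[0]) is None:
        return None
    return subprocess.run(argv, check=False, capture_output=True, text=True)


# On a device with HailoRT installed this returns a CompletedProcess;
# on any other machine it returns None without raising.
outcome = run_if_available(build_run_command(HEF_PATH))
```

Guarding on the presence of the executable keeps the script safe to run on a development machine that has no Hailo device or runtime installed.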

AutomataNexus LLC · Fort Wayne, Indiana · andrew.jewellsr@automatanexus.com
Andrew Jewell Sr. · ORCID 0009-0005-2158-7060
May 2026 · All rights reserved.