AutomataNexus LLC · Whitepaper · May 2026

Large Language Model

Trident 1.58-bit (BitNet b1.58): Ternary Quantized Conv2D Transformer

The flagship AxonML LLM on Hailo-10H silicon. Trident uses BitNet b1.58 ternary weights and was trained from scratch in the AxonML framework. It is a 12-layer transformer (d=1024, intermediate=3072) compiled with DFC 5.3.0. On device it sustains 3,362 FPS at a mean die temperature of 48.7 °C, demonstrating that a BitNet b1.58 transformer, the architecture family used by billion-scale language models, can execute on Hailo silicon.

Author	Andrew Jewell Sr.
Organization	AutomataNexus LLC
Framework	AxonML (Rust)
Silicon	Hailo-10H

Front matter

Abstract

Background

Trident is the flagship AxonML LLM on Hailo-10H silicon. It uses BitNet b1.58 ternary weights and was trained from scratch in the AxonML framework. The model is a 12-layer transformer (d=1024, intermediate=3072) compiled with DFC 5.3.0. On device it sustains 3,362 FPS at a mean die temperature of 48.7 °C, demonstrating that a BitNet b1.58 transformer, the architecture family used by billion-scale language models, can execute on Hailo silicon.

Approach

The model was trained in AxonML (a pure-Rust deep learning framework) and compiled through the Hailo Dataflow Compiler (DFC 5.3.0) targeting Hailo-10H silicon. Post-training INT8 quantization was applied during the DFC compilation pass with production telemetry calibration data. The resulting Hailo Executable Format (HEF) binary executes on Hailo’s fixed-function dataflow architecture with deterministic latency and zero framework overhead at the edge.

Results

On production hardware (Hailo-10H, NexusWatch, FW 5.3.0), Trident 1.58-bit (BitNet b1.58) sustains 3,362 FPS at a mean die temperature of 48.7 °C (max 49.1 °C).

Conclusion

Trident 1.58-bit (BitNet b1.58) is production-ready as a single HEF binary deployed to edge devices, with no external dependencies beyond the HailoRT vendor runtime. At 3,362 FPS sustained, its throughput is suited to real-time use in the target large language model application; per-frame hardware latency is not reported.

Model	Trident 1.58-bit (BitNet b1.58)
Domain	Large Language Model
Architecture	Ternary Quantized Conv2D Transformer
Target silicon	Hailo-10H
Measured on	Hailo-10H (NexusWatch, FW 5.3.0)
DFC compiler	5.3.0
Framework	AxonML v0.6 (pure-Rust, CUDA + CPU backends)
Author	Andrew Jewell Sr. · ORCID 0009-0005-2158-7060
Organization	AutomataNexus LLC · Fort Wayne, Indiana


Part I · Model · 01

Executive overview

Throughput	3,362 FPS
HW latency	not reported
Die temperature	48.7 °C
Target	Hailo-10H

Network I/O

Input: token embeddings [batch, d=1024, seq, 1]. Output: logits [batch, vocab, seq, 1].
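The shapes above read as a Conv2D-style, channels-first layout: the model dimension d acts as the channel axis, the sequence runs along the height axis, and the width is 1. As a minimal sketch, assuming embeddings arrive in the more common [batch, seq, d] order (the source does not state the upstream layout), the helper below rearranges them into [batch, d, seq, 1]. The name `to_conv_layout` is hypothetical and is not part of AxonML or HailoRT.

```python
from typing import List

Tensor3 = List[List[List[float]]]        # [batch][seq][d]
Tensor4 = List[List[List[List[float]]]]  # [batch][d][seq][1]


def to_conv_layout(x: Tensor3) -> Tensor4:
    """Rearrange [batch, seq, d] embeddings into [batch, d, seq, 1]."""
    out: Tensor4 = []
    for sample in x:
        seq_len = len(sample)
        d_model = len(sample[0])
        # Channel c collects feature c of every token; width is a singleton axis.
        out.append([[[sample[t][c]] for t in range(seq_len)] for c in range(d_model)])
    return out


def shape(x) -> List[int]:
    """Return the nested-list shape, assuming a rectangular tensor."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return dims


# Toy sizes for illustration; the whitepaper's model uses d=1024.
batch, seq, d = 2, 3, 4
emb = [[[float(b * 100 + t * 10 + c) for c in range(d)] for t in range(seq)]
       for b in range(batch)]
conv_in = to_conv_layout(emb)
print(shape(conv_in))  # [2, 4, 3, 1]
```

With this layout, a 1x1 convolution over the channel axis applies the same linear map to every token position, which is how a per-token projection can be expressed as a convolution.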


Part I · Model · 02

Architecture

Ternary Quantized Conv2D Transformer

Trident is a 12-layer Conv2D transformer with ternary {-1, 0, +1} weights. Each layer consists of a QKV projection (1x1 conv, d→3d), an output projection (1x1 conv, d→d), a SwiGLU MLP approximation (1x1 conv d→intermediate, ReLU gate, 1x1 conv intermediate→d, with intermediate=3072), and batch-normalized residuals. Split-halves RoPE positional encoding is folded into the weight matrices at export time. Because every weight is -1, 0, or +1, each multiply-accumulate reduces to an addition, a subtraction, or a skip, which maps efficiently to Hailo fixed-function compute units.
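To make the ternary constraint concrete, the sketch below quantizes a small weight vector with the absmean rule described for BitNet b1.58 (scale by the mean absolute weight, round, clip to [-1, 1]) and then evaluates a dot product using only additions and subtractions. It assumes that rule for illustration; the source does not state the exact quantizer AxonML uses, and the function names are hypothetical.

```python
from typing import List, Tuple


def ternarize(weights: List[float], eps: float = 1e-8) -> Tuple[List[int], float]:
    """Absmean ternary quantization: scale by mean |w|, round, clip to [-1, 1]."""
    gamma = sum(abs(w) for w in weights) / len(weights)
    q = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return q, gamma


def ternary_dot(q: List[int], x: List[float]) -> float:
    """Dot product with ternary weights using only adds, subtracts, and skips."""
    acc = 0.0
    for w, v in zip(q, x):
        if w == 1:
            acc += v
        elif w == -1:
            acc -= v
        # w == 0 contributes nothing, so the input is skipped.
    return acc


w = [0.9, -0.05, -1.2, 0.4, 0.02, -0.7]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
q, gamma = ternarize(w)
y = gamma * ternary_dot(q, x)  # a single multiply per output, for the shared scale
print(q, round(y, 4))
```

The inner loop contains no multiplications; the only multiply is the per-tensor scale applied once to the accumulated sum.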

Compilation constraints

All AxonML models targeting Hailo silicon are compiled under fixed-function dataflow constraints: no dynamic control flow, no variable-length dimensions, all activations representable in INT8 after calibration, and no operations that require dedicated softmax hardware. Where necessary, softmax is replaced with ReLU gating or depthwise convolution equivalents.
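The INT8 constraint can be illustrated with a generic symmetric, per-tensor post-training scheme: a calibration set fixes the scale, and values outside the calibrated range saturate. This is a sketch of the general idea only, not the DFC's actual calibration algorithm, which the source does not describe; all names and sample values are hypothetical.

```python
from typing import List


def calibrate_scale(samples: List[List[float]]) -> float:
    """Pick a symmetric per-tensor scale so the largest |activation| maps to 127."""
    max_abs = max(abs(v) for batch in samples for v in batch)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize_int8(values: List[float], scale: float) -> List[int]:
    """Round to the nearest step and saturate to the signed 8-bit range."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map INT8 codes back to approximate real values."""
    return [v * scale for v in q]


calibration_set = [[0.1, -2.0, 0.7], [1.5, -0.3, 2.54]]
scale = calibrate_scale(calibration_set)    # 2.54 / 127, about 0.02
q = quantize_int8([0.5, -1.0, 3.0], scale)  # 3.0 exceeds the calibrated range and saturates
print(q, dequantize(q, scale))
```

The saturating third value shows why calibration data should cover the activation range seen in deployment: anything larger than the calibrated maximum is clipped.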


Part II · Silicon · 03

Silicon performance

Measured on production hardware via hailortcli benchmark with 5-second sustained inference. Device: Hailo-10H (NexusWatch, FW 5.3.0).

Table 03-1. Production silicon measurements, 5 s sustained inference.
Metric	Measured Value
FPS (streaming)	3,361.85
Die Temperature (mean)	48.67 °C
Die Temperature (min)	46.52 °C
Die Temperature (max)	49.10 °C
Quantization	INT8 (post-training, DFC calibration)
DFC Compiler	5.3.0
HailoRT	5.3.0
Measured On	Hailo-10H (NexusWatch, FW 5.3.0)
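A few derived figures follow directly from the table. The sketch below computes the mean interval between completed frames, the approximate frame count over the 5-second window, and the die temperature spread. The interval is a throughput-derived average under streaming, not a per-frame hardware latency, which the measurements do not include.

```python
fps = 3361.85          # FPS (streaming), Table 03-1
duration_s = 5.0       # sustained inference window
t_min, t_mean, t_max = 46.52, 48.67, 49.10  # die temperature, °C

mean_interval_ms = 1000.0 / fps        # average time between completed frames
frames_in_window = fps * duration_s    # approximate frames processed in the run
temp_swing_c = t_max - t_min           # spread of die temperature over the run

print(f"{mean_interval_ms:.4f} ms/frame, "
      f"{frames_in_window:.0f} frames, "
      f"{temp_swing_c:.2f} °C swing")
```

This works out to roughly 0.297 ms between frames, about 16,800 frames over the run, and a 2.58 °C spread between the minimum and maximum die temperature.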


Part II · Silicon · 04

Deployment

Deployed as a single HEF binary. No ONNX runtime, TensorFlow Lite, or Python inference stack required at the edge.

Target silicon	Hailo-10H
Measured on	Hailo-10H (NexusWatch, FW 5.3.0)
DFC compiler	5.3.0
Quantization	INT8 (post-training, production telemetry calibration)
Runtime	HailoRT (vendor runtime)
Edge platform	Raspberry Pi 5 + Hailo AI HAT+ (M.2 Key M)

Deployment procedure

Copy the .hef binary to the target device. The hailortcli run command then loads the HEF directly into the Hailo-10H dataflow engine over PCIe. Inference begins immediately with deterministic per-frame latency. No model conversion, graph optimization, or warmup phase is required.
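The procedure can be wrapped in a small launcher. The sketch below only builds and runs the `hailortcli run <hef>` invocation named above, with no additional flags assumed. The file name `trident.hef` and both function names are placeholders, not taken from the whitepaper.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_run_command(hef_path: str) -> List[str]:
    """Build the argv for `hailortcli run <hef>`; no extra flags are assumed."""
    path = Path(hef_path)
    if path.suffix != ".hef":
        raise ValueError(f"expected a .hef file, got {path.name!r}")
    return ["hailortcli", "run", str(path)]


def run_on_device(hef_path: str) -> int:
    """Run the HEF if the file and the hailortcli binary are both present."""
    if not Path(hef_path).is_file():
        raise FileNotFoundError(hef_path)
    if shutil.which("hailortcli") is None:
        raise RuntimeError("hailortcli not found on PATH; is HailoRT installed?")
    return subprocess.run(build_run_command(hef_path), check=False).returncode


# "trident.hef" is a placeholder file name, not taken from the whitepaper.
print(build_run_command("trident.hef"))
```

Keeping command construction separate from execution lets the argument list be checked on a development machine that has no Hailo device attached.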


