I've been doing a bit of ML work in Rust lately and have come across a few different libraries that provide tensor operations and other building blocks. The three most popular:
- Candle - Minimalist ML framework from Hugging Face. It supports low-level tensor operations and some higher-level operations like layer norm, softmax, etc.
- Burn - More of a full-stack ML framework that you can use for training and inference.
- tch-rs - Rust bindings on top of the C++ PyTorch API (libtorch).
And then we also have:
- Ndarray - The standard Rust crate for n-dimensional arrays and general numerics.
One of the things I ran into while working on gpt-rs was that the Rust tensor operations were dramatically slower than PyTorch. To be fair, that's expected given that PyTorch has a decade of extremely low-level optimizations behind it, but it raised the question: which native Rust library has the most performant tensor operations?
So I decided to run a benchmark against Candle, Burn, and Ndarray. I skipped tch-rs because it's not native Rust.
P.S. You can find all of the code and benchmarks here.
Let's dig in.
Rust Tensor Libraries Benchmark Results
Test Environment
- Hardware: MacBook M2 (CPU only; no GPU benchmarks included)
- Data Type: f32
- Optimization: Release mode with LTO enabled
- Measurement: The Criterion crate, reporting 95% confidence intervals (a minimal harness sketch follows this list)
- Limited Operations: Core operations only, no neural network layers
- System Dependent: Results may vary across different hardware
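To make the setup concrete, here's roughly what one Criterion benchmark case looks like, using ndarray matmul as the example. This is an illustrative sketch with my own names and sizes, not the exact harness from the repo.

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use ndarray::Array2;
use ndarray_rand::{rand_distr::Uniform, RandomExt};
use std::hint::black_box;

// One benchmark case: 512x512 f32 matrix multiplication with ndarray.
fn bench_matmul(c: &mut Criterion) {
    let a: Array2<f32> = Array2::random((512, 512), Uniform::new(0.0, 1.0));
    let b: Array2<f32> = Array2::random((512, 512), Uniform::new(0.0, 1.0));
    c.bench_function("ndarray_matmul_512", |bencher| {
        // black_box keeps the optimizer from discarding the result
        bencher.iter(|| black_box(a.dot(&b)))
    });
}

criterion_group!(benches, bench_matmul);
criterion_main!(benches);
```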
Overall Performance Overview
First, a TL;DR: different frameworks were good at different things, and there was no one clear winner. Each framework also has its own pros and cons. For example, if you want to build a training pipeline you're better off doing that with Burn than Candle, because Burn already has all of the building blocks in place; it probably doesn't make sense to rebuild backprop, SGD, and more just to use Candle.
| Operation | Winner | Performance Advantage | 
|---|---|---|
| Tensor Creation | NDArray | ~4.5x faster than Burn, ~8.2x faster than Candle | 
| Matrix Multiplication | Candle | ~1.7x faster than Burn, ~3.9x faster than NDArray | 
| Element-wise Operations | NDArray/Candle | Virtually identical performance | 
| Reduction Operations | Candle | ~1.7x faster than NDArray/Burn | 
| Vector Operations | NDArray | ~2.1x faster than Burn | 
Now into the details.
1. Tensor Creation (512×512 Random Tensors)
Performance for creating random tensors:
| Library | Mean Time (μs) | Std Dev (μs) | Relative Performance | 
|---|---|---|---|
| NDArray | 317.3 | 63.2 | 1.00x (baseline) | 
| Burn | 1,435.9 | 172.0 | 4.53x slower | 
| Candle | 2,605.6 | 85.7 | 8.22x slower | 
NDArray significantly outperforms both Burn and Candle for tensor creation, beating Burn by 4.5x and Candle by an impressive 8.2x. I was actually pretty surprised by this. I would have expected Burn and/or Candle to lean on optimized BLAS routines and even some hand-written assembly, but it doesn't look like it, or at least it didn't make a difference here. The other potential cause is how the random numbers are generated: with NDArray I used the ndarray_rand crate, while Candle and Burn ship their own wrappers around (if not their own implementations of) a random number generator.
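For reference, here's roughly how a random 512×512 tensor is created in NDArray versus Candle. This is a sketch, not the benchmark code itself; Burn's Tensor::random is left out because its signature has shifted between releases.

```rust
use candle_core::{Device, Tensor};
use ndarray::Array2;
use ndarray_rand::{rand_distr::Uniform, RandomExt};

fn main() -> candle_core::Result<()> {
    // ndarray: randomness comes from the separate ndarray_rand crate (rand_distr underneath)
    let nd: Array2<f32> = Array2::random((512, 512), Uniform::new(0.0, 1.0));

    // Candle: the framework ships its own random constructors
    let cd = Tensor::rand(0.0f32, 1.0f32, (512, 512), &Device::Cpu)?;

    println!("{:?} {:?}", nd.dim(), cd.dims());
    Ok(())
}
```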
2. Matrix Multiplication (512×512 × 512×512)
Performance for matrix multiplication:
| Library | Mean Time (μs) | Std Dev (μs) | Relative Performance | 
|---|---|---|---|
| Candle | 674.8 | 75.7 | 1.00x (baseline) | 
| Burn | 1,144.0 | 190.3 | 1.70x slower | 
| NDArray | 2,663.8 | 105.3 | 3.95x slower | 
Candle dominates matrix multiplication, outperforming Burn by 1.7x and NDArray by nearly 4x. More importantly, a 512×512 matmul takes roughly 2 × 512³ ≈ 268M floating-point operations, so Candle's 674.8 μs mean works out to roughly 397 GFLOP/s of sustained throughput. That points to a highly optimized GEMM (General Matrix Multiply) implementation, likely leveraging an optimized BLAS library. Matmul is obviously crucial to deep learning workloads, so it's not surprising that Candle and Burn have invested in it (at least relative to NDArray).
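For context, the operation under test is a one-liner in both libraries. A quick sketch (mine, not the benchmark code):

```rust
use candle_core::{Device, Tensor};
use ndarray::Array2;

fn main() -> candle_core::Result<()> {
    let dev = Device::Cpu;
    let a = Tensor::rand(0.0f32, 1.0f32, (512, 512), &dev)?;
    let b = Tensor::rand(0.0f32, 1.0f32, (512, 512), &dev)?;
    // Candle routes this through its GEMM backend
    let c = a.matmul(&b)?;

    // ndarray: dot() on two 2-D arrays is matrix multiplication
    let x: Array2<f32> = Array2::ones((512, 512));
    let y: Array2<f32> = Array2::ones((512, 512));
    let z = x.dot(&y);

    println!("{:?} {:?}", c.dims(), z.dim());
    Ok(())
}
```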
3. Vector Operations (Dot Product)
Performance for vector dot products (100K elements):
| Library | Mean Time (μs) | Performance Notes | 
|---|---|---|
| NDArray | ~11.2 | Optimized vector ops | 
| Burn | ~23.9 | 2.1x slower | 
I only ran dot products for NDArray and Burn because Candle doesn't support a native dot product, and it felt a little unfair to write my own and then compare it. But overall, Burn is much slower than NDArray here, which is interesting given that Burn is much faster than NDArray at matmul. I would have thought that some of the optimizations that improved matmul would have also improved the dot product.
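To illustrate the gap: ndarray exposes a 1-D dot() directly, whereas in Candle you'd have to compose a dot product yourself from an element-wise multiply and a full reduction, which is exactly what I didn't want to benchmark. A sketch of both paths (the Candle composition is my illustration, not something from the benchmark):

```rust
use candle_core::{DType, Device, Tensor};
use ndarray::Array1;

fn main() -> candle_core::Result<()> {
    // ndarray: dot() on two 1-D arrays is a true dot product
    let a = Array1::<f32>::ones(100_000);
    let b = Array1::<f32>::ones(100_000);
    let nd_dot = a.dot(&b);

    // Candle: no dedicated dot op, so compose multiply + full reduction
    let dev = Device::Cpu;
    let x = Tensor::ones(100_000, DType::F32, &dev)?;
    let y = Tensor::ones(100_000, DType::F32, &dev)?;
    let cd_dot = x.mul(&y)?.sum_all()?.to_scalar::<f32>()?;

    println!("{nd_dot} {cd_dot}");
    Ok(())
}
```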
4. Element-wise Addition (512×512 + 512×512)
Performance for element-wise addition:
| Library | Mean Time (μs) | Std Dev (μs) | Relative Performance | 
|---|---|---|---|
| Candle | 30.7 | 2.0 | 1.00x (baseline) | 
| NDArray | 30.9 | 0.9 | 1.01x slower | 
| Burn | 31.3 | 0.8 | 1.02x slower | 
All three libraries show nearly identical performance for element-wise operations.
5. Reduction Operations (Sum)
Performance for tensor sum operations (256×256 tensors):
| Library | Mean Time (μs) | Performance Notes | 
|---|---|---|
| Candle | ~4.2 | Fastest reduction | 
| NDArray | ~7.3 | Moderate performance | 
| Burn | ~7.8 | Slowest reduction | 
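For completeness, here's roughly what the element-wise addition and sum reduction calls look like in NDArray and Candle (a sketch, not the benchmark code; the sum benchmark itself used 256×256 tensors):

```rust
use candle_core::{DType, Device, Tensor};
use ndarray::Array2;

fn main() -> candle_core::Result<()> {
    let dev = Device::Cpu;

    // Element-wise addition (512x512 + 512x512)
    let a = Tensor::ones((512, 512), DType::F32, &dev)?;
    let b = Tensor::ones((512, 512), DType::F32, &dev)?;
    let candle_sum = a.add(&b)?;

    let x: Array2<f32> = Array2::ones((512, 512));
    let y: Array2<f32> = Array2::ones((512, 512));
    let nd_sum = &x + &y;

    // Full reduction: sum over every element
    let candle_total = candle_sum.sum_all()?.to_scalar::<f32>()?;
    let nd_total = nd_sum.sum();

    println!("{candle_total} {nd_total}");
    Ok(())
}
```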
Performance Scaling Analysis
Matrix Multiplication Scaling (64×64 to 512×512)
The libraries show different scaling characteristics:
- Candle: Excellent scaling, maintains performance advantage
- Burn: Good scaling but consistently slower than Candle
- NDArray: Poor scaling for larger matrices
Element-wise Operations Scaling
All libraries scale similarly for element-wise operations, maintaining competitive performance across different tensor sizes.
Memory and Throughput Analysis
Tensor Creation Throughput (512×512 matrices)
| Library | Elements/sec | Throughput Efficiency | 
|---|---|---|
| NDArray | 831M elements/sec | Highest throughput | 
| Burn | 183M elements/sec | Moderate throughput | 
| Candle | 101M elements/sec | Lowest throughput | 
Matrix Multiplication Throughput (512×512 × 512×512)
A 512×512 × 512×512 matrix multiplication takes roughly 2 × 512³ ≈ 268M floating-point operations; dividing that by each library's mean time gives the sustained throughput:
| Library | Throughput | Efficiency | 
|---|---|---|
| Candle | ~397 GFLOP/s | Best efficiency | 
| Burn | ~234 GFLOP/s | Moderate efficiency | 
| NDArray | ~101 GFLOP/s | Poor efficiency | 
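As a sanity check, here's how those throughput figures fall out of the mean times in the tables above. This is just my arithmetic, not code from the benchmark suite:

```rust
fn main() {
    // Matmul: 2 * 512^3 floating-point operations per multiply (~2.68e8 ops)
    let flops = 2.0 * 512f64.powi(3);
    println!("Candle:  {:.0} GFLOP/s", flops / 674.8e-6 / 1e9);   // ~398
    println!("Burn:    {:.0} GFLOP/s", flops / 1_144.0e-6 / 1e9); // ~235
    println!("NDArray: {:.0} GFLOP/s", flops / 2_663.8e-6 / 1e9); // ~101

    // Tensor creation: 512 * 512 = 262,144 elements per tensor
    let elems = 512.0 * 512.0;
    // ~826M elements/s, in the same ballpark as the ~831M reported above
    println!("NDArray: {:.0}M elements/s", elems / 317.3e-6 / 1e6);
}
```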
Conclusion
Each library has distinct performance characteristics:
- Candle excels at compute-intensive operations like matrix multiplication
- NDArray dominates memory-intensive operations like tensor creation
- Burn provides consistent, balanced performance with additional safety features
The choice really depends on your specific use case, with Candle being ideal for ML inference, NDArray for data processing, and Burn for comprehensive ML training pipelines. It would be great to be able to interop between them easily, but each one has its own tensor implementation, so you would need a translation layer to convert Candle Tensor types to Burn Tensors.
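As a rough idea of what such a translation layer could look like, here's a hypothetical bridge that copies a 2-D Candle tensor into a Burn tensor via a host Vec<f32>. The Burn constructor names follow recent releases (TensorData rather than the older Data) and may need adjusting for the version you're on:

```rust
use burn::backend::NdArray;
use burn::tensor::{backend::Backend, Tensor as BurnTensor, TensorData};
use candle_core::{Device, Tensor as CandleTensor};

// Hypothetical bridge: round-trip through a host Vec<f32>.
fn candle_to_burn(t: &CandleTensor) -> candle_core::Result<BurnTensor<NdArray, 2>> {
    let (rows, cols) = t.dims2()?;
    let host: Vec<f32> = t.flatten_all()?.to_vec1::<f32>()?;
    let device = <NdArray as Backend>::Device::default();
    Ok(BurnTensor::from_data(TensorData::new(host, [rows, cols]), &device))
}

fn main() -> candle_core::Result<()> {
    let c = CandleTensor::rand(0.0f32, 1.0f32, (4, 4), &Device::Cpu)?;
    let b = candle_to_burn(&c)?;
    println!("{:?}", b.dims());
    Ok(())
}
```

The copy makes this fine for gluing pipelines together but not something you'd want on a hot path; a zero-copy bridge would require the libraries to agree on a common buffer layout.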
As the ecosystem matures, we'll likely see consolidation around fewer libraries, each with clearer use case definitions and wider support.