# Modern Machine Learning Model Deployment on FPGA for KamLAND-Zen

#### Zepeng Li, University of Hawaii at Manoa

ML4FE workshop, May 19, 2025

# FastML for rare events search experiments



- Rare event search experiments: dark matter, neutrinos, and neutrinoless double beta decay.
- Experiments are built from low-background materials and sited underground.
- FastML could help extract rare events and suppress background.
- Latency requirements are less stringent, which makes more complex networks possible.

# KamLAND-Zen detector

- Particles interact in the liquid scintillator and deposit energy. The energy is converted into light, which is detected by photomultiplier tubes (PMTs).
- Energy resolution: $6.7\%/\sqrt{E\,(\mathrm{MeV})}$
- Vertex resolution: ~13.7 cm
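As a worked example of the resolution formula above, the sketch below evaluates it at the 2.46 MeV signal energy used in this talk. It assumes the quoted figure is the fractional resolution $\sigma_E/E$; the function name `energy_resolution` is illustrative only.

```python
import math

def energy_resolution(energy_mev, coeff=0.067):
    """Fractional resolution sigma_E / E, assuming 6.7% / sqrt(E [MeV])."""
    if energy_mev <= 0:
        raise ValueError("energy must be positive")
    return coeff / math.sqrt(energy_mev)

signal_energy_mev = 2.46  # signal energy quoted in the talk
frac = energy_resolution(signal_energy_mev)
sigma_mev = frac * signal_energy_mev  # absolute width under the same assumption

print(f"sigma/E at {signal_energy_mev} MeV: {frac:.2%}")
print(f"sigma at {signal_energy_mev} MeV: {sigma_mev:.3f} MeV")
```

Under this assumption the resolution at the signal energy comes out slightly above 4%.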





# Machine learning based background rejection in KamLAND-Zen

- Signal is 2.46 MeV electron events
- Primary backgrounds:
  - $2\nu\beta\beta$ decays
  - Long-lived cosmic muon spallation
- Minor backgrounds:
  - Radioactive background
  - Solar neutrinos
  - Short-lived cosmic muon spallation



# Machine learning based background rejection in KamLAND-Zen

- A novel deep learning model to distinguish backgrounds and signals
- KamNet takes a time-series of 2-D hit maps and returns a single-valued KamNetScore
- Convolutional LSTM (Long Short-Term Memory) layer with attention module
  - Learns to identify and focus on the important sections of the event
- Spherical Convolution
  - Utilizes spherical symmetry to learn complex features







# KamLAND2-Zen



Figure: effective Majorana mass $\langle m_{\beta\beta} \rangle$ versus lightest neutrino mass $m_{\text{lightest}}$ (eV), with inverted (IH) and normal (NH) hierarchy regions and labels for Ca, Ge, Zr, Mo, Nd, and KamLAND-Zen ($^{136}$Xe). KamLAND2-Zen target: $\langle m_{\beta\beta} \rangle \sim 20$ meV in 5 years.

We aim to increase light collection by more than 5 times! $\sigma$(2.6 MeV) = 4% $\rightarrow$ 2%

From H. Ozaki

Other options (scintillation film, imaging detector, pressurized xenon, ...) are in development.


We are developing new electronics with wide dynamic range!

We could explore modern ML algorithms deployed on advanced FPGAs in the front end for real-time triggering and data processing.

# Modern ML models on FPGA using CGRA4ML







#### Parameterizable Coarse-Grained Reconfigurable Array

PE: processing element. Off-chip memory is used for parameter storage.


https://arxiv.org/pdf/2408.15561

# Modern ML models on FPGA using CGRA4ML

Model development and optimization in python



- The workflow is similar to HLS4ML: network development and optimization in Python using TensorFlow/QKeras.
- The CGRA is reconfigurable for different tasks and FPGAs.
- Xilinx software is used for synthesis and verification.

### Results

| Model             | ResNet-50 | PointNet |
|-------------------|-----------|----------|
| Bits              | 4         | 4        |
| PEs               | (7,96)    | (32,32)  |
| Frequency (MHz)   | 250       | 250      |
| FFs               | 101706    | 69277    |
| LUTs              | 82200     | 100076   |
| BRAMs             | 6         | 4.5      |
| Static Power (W)  | 0.700     | 0.700    |
| Dynamic Power (W) | 3.847     | 3.840    |
| Total Power (W)   | 4.547     | 4.540    |
| GOPs/W            | 37.3      | 56.8     |

### TABLE III: Implementation of ResNet-50 and PointNet on ZCU104 FPGA
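The table entries can be cross-checked with a little arithmetic. The sketch below copies the power and efficiency rows and derives the PE count, the static plus dynamic power sum, and the throughput implied by the GOPs/W figure. Two assumptions are made here and are not stated in the table: that the PE entry is a (rows, columns) shape, and that GOPs/W is quoted relative to total power.

```python
# Values copied from Table III (ZCU104 implementation).
table = {
    "ResNet-50": {"pes": (7, 96), "static_w": 0.700, "dynamic_w": 3.847,
                  "total_w": 4.547, "gops_per_w": 37.3},
    "PointNet": {"pes": (32, 32), "static_w": 0.700, "dynamic_w": 3.840,
                 "total_w": 4.540, "gops_per_w": 56.8},
}

def summarize(row):
    rows, cols = row["pes"]  # assumption: PE entry is (rows, columns)
    return {
        "num_pes": rows * cols,
        "power_sum_w": round(row["static_w"] + row["dynamic_w"], 3),
        # Assumption: GOPs/W is quoted relative to total power.
        "implied_gops": row["gops_per_w"] * row["total_w"],
    }

summary = {name: summarize(row) for name, row in table.items()}
for name, s in summary.items():
    print(name, s)
```

The static and dynamic power add up to the listed totals for both models.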

# PointNET event reconstruction

- Data can be thought of as a <u>point cloud</u>
  - x, y, z, t, and q of each PMT
  - Geometric semantics
  - Invariant to permutations (x, y, z encoded)
- Use the PointNET architecture (Qi et al., 2017)
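The permutation invariance listed above comes from applying the same function to every point and then aggregating with a symmetric operation such as a maximum. The following is a minimal pure-Python sketch of that idea, not the network used in this work; the hit values, the weights, and the function names are all hypothetical.

```python
import random

def point_feature(point, weights):
    """Shared per-point layer: linear map followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, point))) for row in weights]

def global_feature(points, weights):
    """Symmetric aggregation: element-wise max over all per-point features."""
    features = [point_feature(p, weights) for p in points]
    return [max(column) for column in zip(*features)]

rng = random.Random(0)
# Hypothetical hits: (x, y, z, t, q) for each PMT, in arbitrary units.
hits = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(8)]
# Hypothetical shared weights: 4 output features from 5 inputs.
weights = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(4)]

shuffled = hits[:]
rng.shuffle(shuffled)  # reorder the PMT hits

original_out = global_feature(hits, weights)
shuffled_out = global_feature(shuffled, weights)
print(original_out == shuffled_out)
```

Because each point is processed independently and the maximum does not depend on ordering, the global feature is identical for the original and shuffled hit lists.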





# Model Quantization in cgra4ml

- Need to compress model
- Need to understand optimal hardware parameters
- Training with QKeras establishes a baseline for comparing hardware and software compression
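To illustrate what compressing to a few bits means, here is a sketch of generic signed fixed-point quantization. It is not the exact quantizer used by QKeras or cgra4ml; the arguments `bits` and `int_bits` only loosely mirror `sys_bits` and the `*_int_bits` parameters seen in the model code, and the weight values are made up.

```python
def quantize(value, bits, int_bits):
    """Round to a signed fixed-point grid with `bits` total bits.

    One bit is the sign, `int_bits` are integer bits, the rest are fractional.
    """
    frac_bits = bits - 1 - int_bits
    step = 2.0 ** -frac_bits
    lowest = -(2 ** (bits - 1)) * step
    highest = (2 ** (bits - 1) - 1) * step
    snapped = round(value / step) * step
    return min(max(snapped, lowest), highest)  # saturate out-of-range values

weights = [0.30, -0.72, 0.05, 0.99, -1.40]  # hypothetical weights
for bits in (8, 4):
    quantized = [quantize(w, bits, int_bits=0) for w in weights]
    worst = max(abs(w - q) for w, q in zip(weights, quantized))
    print(bits, quantized, round(worst, 4))
```

With 4 bits and no integer bits the grid step is 0.125 and values saturate at -1.0 and 0.875, which is why choosing the bit widths per layer matters for accuracy.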

<pre>
class UserModel(XModel):
    def __init__(self, sys_bits, x_int_bits, *args, **kwargs):
        super().__init__(sys_bits, x_int_bits, *args, **kwargs)

        self.b0 = XBundle(
            # Dense variant, left commented out as on the slide:
            # core=XDense(
            #     k_int_bits=0,
            #     b_int_bits=0,
            #     units=64,
            #     act=XActivation(sys_bits=sys_bits, o_int_bits=0, type='relu', slope=0)
            # )
            core=XConvBN(
                k_int_bits=0,
                b_int_bits=0,
                filters=64,
                kernel_size=1,
                act=XActivation(sys_bits=sys_bits, o_int_bits=0, type='relu', slope=0)
            ),
            # ... (the slide screenshot is cut off here)
</pre>

cgra4ml port of PointNET

# Software to Hardware: Key Tools

- cgra4ml
  - Converts a TensorFlow model for Vivado synthesis
  - Open-source and available on <u>GitHub</u>
- Vivado (AMD)
  - Model export verification and simulation
  - Synthesizes FPGA representation
- Vitis (AMD)
  - C-wrapper for FPGA model verification





Images courtesy of AMD

### PointNET cgra Port Accuracy Results



## **Reconstruction Results Summarized**

| Method          | X Error (cm) | Y Error (cm) | Z Error (cm) | E Error (MeV) |
|-----------------|--------------|--------------|--------------|---------------|
| Traditional KLZ | 17           | 17           | 17           | N/A           |
| QKeras          | 20           | 21           | 21           | 0.06          |
| cgra4ml         | 34           | 34           | 36           | 0.06          |

- PointNET on the FPGA (cgra4ml) reconstructs vertices with a per-axis error of 34 to 36 cm, about twice the 17 cm of the traditional offline reconstruction.
- This accuracy is good enough for a position-aware trigger in KamLAND-Zen.
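For a single-number comparison, the per-axis errors in the table can be combined into a 3-D vertex error. The sketch below adds them in quadrature, which assumes the x, y, and z errors are independent; that assumption is ours and is not stated in the slides.

```python
import math

# Per-axis vertex errors in cm, copied from the table above.
vertex_error_cm = {
    "Traditional KLZ": (17, 17, 17),
    "QKeras": (20, 21, 21),
    "cgra4ml": (34, 34, 36),
}

def combined_error(per_axis):
    """Quadrature sum of the x, y, z errors (assumes independent axes)."""
    return math.sqrt(sum(e * e for e in per_axis))

combined = {name: combined_error(err) for name, err in vertex_error_cm.items()}
baseline = combined["Traditional KLZ"]
for name, value in combined.items():
    print(f"{name}: {value:.1f} cm ({value / baseline:.2f}x traditional)")
```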

## RFSoC4x2

- Zynq UltraScale+ FPGA in the lab
- AMD Kit



Image courtesy of AMD

#### https://ml4physicalsciences.github.io/2024/files/NeurIPS_ML4PS_2024_153.pdf



Vivado synthesis of model

| Version   | Latency (ms/batch) @ 20 runs |
|-----------|------------------------------|
| Trained   | 6980.9                       |
| Untrained | 6996                         |

~436.3 ms inference per event
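The per-event figure follows from the batch latency once a batch size is fixed. The sketch below uses 16 events per batch, which is an inference from 6980.9 / 436.3 and is not stated on the slide, and also derives the corresponding maximum sustained event rate.

```python
# Batch latencies in ms, copied from the table above.
batch_latency_ms = {"Trained": 6980.9, "Untrained": 6996.0}

# Assumption: 16 events per batch, inferred from 6980.9 / 436.3 (not stated in the slides).
events_per_batch = 16

per_event_ms = {k: v / events_per_batch for k, v in batch_latency_ms.items()}
max_rate_hz = {k: 1000.0 / v for k, v in per_event_ms.items()}

for version in batch_latency_ms:
    print(f"{version}: {per_event_ms[version]:.1f} ms/event, "
          f"{max_rate_hz[version]:.2f} events/s")
```

At this latency a single accelerator would sustain only a few events per second.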

- Develop a simpler model (<<1 million parameters) without losing much accuracy.
- Use a PMT cluster, rather than a single PMT, as one input point.
- Optimize the quantization.



- CGRA4ML provides a framework for deploying modern ML models on FPGAs.
- PointNET is an effective way of reconstructing detector physics in KLZ.
- We can deploy PointNET onto an FPGA for single-event inference, which opens the possibility of a position-aware trigger.

## HLS4ML



Layer-by-layer implementation of neural networks in HLS4ML. Multipliers can be reused within a layer.
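The trade-off behind multiplier reuse can be sketched with idealized arithmetic: a dense layer needs `n_in * n_out` multiplications, and reusing each multiplier `reuse_factor` times divides the number of physical multipliers at the cost of more sequential passes. The layer size below is hypothetical, and real resource usage depends on precision, implementation strategy, and synthesis.

```python
import math

def dense_layer_cost(n_in, n_out, reuse_factor):
    """Idealized multiplier count for a dense layer with multiplier reuse."""
    if reuse_factor < 1:
        raise ValueError("reuse_factor must be >= 1")
    multiplications = n_in * n_out
    return {
        "multiplications": multiplications,
        "multipliers": math.ceil(multiplications / reuse_factor),
        "sequential_passes": reuse_factor,
    }

# Hypothetical 64 -> 64 dense layer at several reuse factors.
costs = {r: dense_layer_cost(64, 64, r) for r in (1, 8, 64)}
for reuse, cost in costs.items():
    print(reuse, cost)
```

A reuse factor of 1 is fully parallel (lowest latency, most multipliers), while larger values trade latency for area.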

# ZYNQ heterogeneous SoC



- System on Chip: Arm CPU + FPGA
- Advanced eXtensible Interface (AXI) provides high-bandwidth, low-latency connections between elements.