Armadillo ECDSA-113

WhoCares · #1 03-18-2026, 16:00

Perhaps we'd better upgrade CUDA toolkit from 12.x to 13.1.

For learning purposes, I asked an AI to optimize the GPU kernel function pollard_kernel(), mainly targeting the NVIDIA GeForce RTX 5090.

The optimization goal was to reduce register usage from 96 registers to 64 registers. This increases SM occupancy, allowing the number of blocks that can run concurrently on a single SM to increase from 5 to 8, yielding a theoretical performance improvement of around one third.

The actual performance gain should be evaluated using NVIDIA Nsight Compute together with real benchmark data.

By leveraging the SMRS compiler feature introduced in NVIDIA CUDA Toolkit 13.0, spilled registers can be replaced with accesses to shared memory, making it possible to ultimately achieve the 64-register optimization target.

Quote:

ptxas info : Compiling entry function '_Z14pollard_kernelP6worm_tP4dp_tPjjiyiyi' for 'sm_120'
ptxas info : Function properties for _Z14pollard_kernelP6worm_tP4dp_tPjjiyiyi
200 bytes stack frame, -36 bytes spill stores, -28 bytes spill loads
ptxas info : Used 64 registers, used 1 barriers, 200 bytes cumulative stack size, 7168 bytes smem
ptxas info : Compile time = 0.000 ms
ptxas info : Function properties for _Z10ec_canon_x4fe_t
0 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
ptxas info : Function properties for _Z10ld_madd_z1RK4fe_tS1_S1_S1_RS_S2_S2_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _Z10reset_wormiyR14worm_context_tP6walk_t
0 bytes stack frame, 124 bytes spill stores, 120 bytes spill loads
ptxas info : Function properties for _Z12iterate_stepiyR14worm_context_tjP4dp_tPjiP6walk_ty
0 bytes stack frame, 32 bytes spill stores, 16 bytes spill loads
ptxas info : Function properties for _Z12prepare_stepR4fe_tS0_R4sc_tS2_
0 bytes stack frame, 0 bytes spill stores, 8 bytes spill loads
ptxas info : Function properties for _Z6fe_inv4fe_t
0 bytes stack frame, 0 bytes spill stores, 4 bytes spill loads
ptxas info : Function properties for _Z9record_dpRK4fe_tRK4sc_tS4_P4dp_tPji
0 bytes stack frame, 0 bytes spill stores, 8 bytes spill loads
tmpxft_00006c68_00000000-7_solver_fast.compute_120.cudafe1.cpp
[100%] Linking CUDA executable solver_fast.exe
[100%] Built target solver_fast

cjack · #2 03-18-2026, 19:08

Quote:

Originally Posted by WhoCares

Perhaps we'd better upgrade CUDA toolkit from 12.x to 13.1.

For learning purposes, I asked an AI to optimize the GPU kernel function pollard_kernel(), mainly targeting the NVIDIA GeForce RTX 5090.

The optimization goal was to reduce register usage from 96 registers to 64 registers. This increases SM occupancy, allowing the number of blocks that can run concurrently on a single SM to increase from 5 to 8, yielding a theoretical performance improvement of around one third.

The actual performance gain should be evaluated using NVIDIA Nsight Compute together with real benchmark data.

By leveraging the SMRS compiler feature introduced in NVIDIA CUDA Toolkit 13.0, spilled registers can be replaced with accesses to shared memory, making it possible to ultimately achieve the 64-register optimization target.

Hey WhoCares,

Thanks for the detailed optimization work!
About the register optimization proposal — we actually went down this exact rabbit hole. Here's what we found on real silicon:

The kernel is throughput-bound, not latency-bound. We built and benchmarked an optimization called "Thunderstrike" (OPT-3) that used ILP to fuse operations and improve parallelism. Result on RTX 5090: 0% speedup. The ALU pipeline is already saturated by the 112 sequential GF(2^113) squarings in ec_canon_x (40% of step cost) and the 8 multiplications in fe_inv via Itoh-Tsujii (50% of step cost). More warps via higher occupancy just queue up behind the same ALU — there are no idle cycles to fill.

A few specific concerns with the 64-register approach:

fe_mul alone needs ~80 registers (table-free XOR accumulation across 113-bit field elements). Forcing 64 via __launch_bounds__ guarantees massive spills. Even with __noinline__ on hot functions like fe_inv and ec_canon_x, the call overhead and lost register context hurt throughput on the critical path.

CUDA 13.x / SMRS: we'd love to test it!.

The ptxas output shows negative spill values (-36 bytes spill stores, -28 bytes spill loads). Negative spills are unusual and suggest the compiler is reporting redirected spills rather than actual elimination. Without real Nsight Compute profiling data, it's hard to confirm whether this translates to actual throughput gains.

What we WILL adopt from your proposals for the next certificate:

Single-pass DP retrieval (clean simplification, ~1-2%)
cudaOccupancyMaxActiveBlocksPerMultiprocessor for self-tuning grid size
Benchmarking L1 cache vs shared memory for the walk table (your bank conflict analysis was spot-on theoretically)
Bottom line: fe_inv and ec_canon_x consume 90% of the step cost and are algorithmically irreducible for Koblitz curve canonicalization. No amount of occupancy optimization can reduce these costs.

Thread Tools
Show Printable Version Email this Page
Display Modes
Switch to Linear Mode Hybrid Mode Switch to Threaded Mode

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
Replacing ECDSA in Target (arma)	Mynotos	General Discussion	3	11-22-2019 00:49

The Following User Gave Reputation+1 to WhoCares For This Useful Post:
cjack (03-18-2026)

The Following 4 Users Say Thank You to WhoCares For This Useful Post:
cjack (03-18-2026), niculaita (03-19-2026), nulli (03-20-2026), wx69wx2023 (03-18-2026)

The Following 2 Users Say Thank You to cjack For This Useful Post:
niculaita (03-18-2026), nulli (03-20-2026)