Compiler Options

brainevent exposes nvcc optimization options directly through load_cuda_inline (and load_cuda_file / load_cuda_dir, which forward all keyword arguments).

Optimization Level

The optimization_level parameter controls the -O<n> flag passed to nvcc. It applies to both host-side C++ code (via the underlying host compiler) and device-side PTX generation.

import brainevent

# Default: -O3 (recommended for production)
mod = brainevent.load_cuda_inline(
    name="my_kernels",
    cuda_sources=CUDA_SRC,
    functions={"my_func": ["arg", "ret", "stream"]},
)

# Debug build: -O0 (preserves variable values for cuda-gdb / Nsight)
mod = brainevent.load_cuda_inline(
    ...,
    optimization_level=0,
)

# Explicit production build: -O3
mod = brainevent.load_cuda_inline(
    ...,
    optimization_level=3,
)

Each value corresponds directly to an nvcc -O flag:

Level	Effect
`0`	No optimization. Useful for debugging with cuda-gdb or Nsight.
`1`	Basic optimizations.
`2`	Standard optimizations.
`3`	Aggressive optimizations, including auto-vectorization and loop unrolling (default).
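One common pattern is to switch between the debug and production levels without editing every call site. The sketch below assumes a hypothetical environment variable, MY_CUDA_DEBUG, invented for this example (brainevent does not read it); only the optimization_level keyword comes from the text above.

```python
import os


def optimization_kwargs(environ=None):
    """Return keyword arguments selecting -O0 (debug) or -O3 (production).

    MY_CUDA_DEBUG is a hypothetical variable made up for this sketch.
    """
    env = os.environ if environ is None else environ
    value = env.get("MY_CUDA_DEBUG", "0").strip().lower()
    debug = value in {"1", "true", "yes"}
    return {"optimization_level": 0 if debug else 3}


# Usage (needs brainevent and a CUDA toolchain, so it is left commented out):
# mod = brainevent.load_cuda_inline(..., **optimization_kwargs())
print(optimization_kwargs({"MY_CUDA_DEBUG": "1"}))
```

Passing the environment as an argument keeps the helper easy to test without touching the real process environment.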

Fast Math

The use_fast_math flag passes --use_fast_math to nvcc. This is a compound flag that enables several device-code optimizations:

-ftz=true — flush denormal floats to zero
-prec-div=false — approximate (faster) division
-prec-sqrt=false — approximate (faster) square root
-fmad=true — fused multiply-add (FMA) contraction

These trade strict IEEE 754 compliance for speed. Typical speed-up is 10–30% on floating-point-heavy kernels (GEMM, reductions, activations), though the actual gain depends on the kernel and should be measured.

# Enable fast math — safe for most ML kernels
mod = brainevent.load_cuda_inline(
    name="my_fast_kernels",
    cuda_sources=CUDA_SRC,
    functions={"my_func": ["arg", "ret", "stream"]},
    use_fast_math=True,
)
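To make the flush-to-zero item concrete, here is a small CPU-side emulation for single precision. It only illustrates the arithmetic effect: the helper names to_f32 and flush_to_zero_f32 are made up for this example, and nothing here calls nvcc or brainevent.

```python
import math
import struct

F32_MIN_NORMAL = 2.0 ** -126  # smallest positive normal float32, about 1.18e-38


def to_f32(x):
    """Round a Python float (binary64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def flush_to_zero_f32(x):
    """Emulate -ftz=true: subnormal float32 values become a signed zero."""
    y = to_f32(x)
    if y != 0.0 and abs(y) < F32_MIN_NORMAL:
        return math.copysign(0.0, y)
    return y


tiny = 1e-40  # below F32_MIN_NORMAL, so float32 can only hold it as a subnormal
print(to_f32(tiny) > 0.0)       # without flushing, the subnormal is still nonzero
print(flush_to_zero_f32(tiny))  # with flushing, the value collapses to zero
```

A kernel that accumulates or divides by values in this range is the kind of code worth checking before enabling the flag.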

Warning

use_fast_math can change numerical results. Division and square root may differ from the correctly rounded IEEE 754 result by a few ULPs, and denormals are flushed to zero, so very small values underflow abruptly instead of gradually. Validate your kernel results before enabling this in production.
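One way to carry out that validation is to compare the output of a fast-math build against a reference build and report the worst-case difference in ULPs. The following is a minimal sketch in pure Python for float32 results. ulp_distance_f32 and max_ulp_error are hypothetical helper names, and the two lists stand in for values you would copy back from the two builds.

```python
import struct


def _ordered_bits_f32(x):
    """Map a float32 value onto an integer line that preserves ordering."""
    (u,) = struct.unpack("<I", struct.pack("<f", x))
    # Negative floats are sign-magnitude; mirror them below zero.
    return u if u < 0x80000000 else 0x80000000 - u


def ulp_distance_f32(a, b):
    """Number of representable float32 values between a and b."""
    return abs(_ordered_bits_f32(a) - _ordered_bits_f32(b))


def max_ulp_error(reference, candidate):
    """Worst-case ULP distance over two equal-length sequences.

    NaNs are not treated specially in this sketch.
    """
    if len(reference) != len(candidate):
        raise ValueError("sequences must have the same length")
    pairs = zip(reference, candidate)
    return max((ulp_distance_f32(r, c) for r, c in pairs), default=0)


# Placeholder data: the candidate is off by 1 ULP and 2 ULPs in two entries.
reference = [1.0, 0.5, 3.0]
candidate = [1.0 + 2.0 ** -23, 0.5, 3.0 - 2.0 ** -21]
print(max_ulp_error(reference, candidate))
```

An acceptable threshold depends on the application. Choosing it is a judgment call that this sketch does not make for you.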

Extra Flags

For flags not covered by the named parameters, use extra_cuda_cflags:

mod = brainevent.load_cuda_inline(
    ...,
    extra_cuda_cflags=[
        "--generate-line-info",         # source-line info for Nsight profiling
        "-maxrregcount=64",             # cap register usage to raise occupancy
        "--ptxas-options=-v",           # verbose PTX assembler stats
        "-Xcompiler", "-march=native",  # native CPU optimization for host code
    ],
)
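To show how the named parameters and extra_cuda_cflags could combine into a single nvcc flag list, here is a hypothetical helper. build_nvcc_flags is not part of brainevent, and the ordering it produces is an assumption made for illustration. Only the individual flags come from the sections above.

```python
def build_nvcc_flags(optimization_level=3, use_fast_math=False,
                     extra_cuda_cflags=()):
    """Assemble an nvcc flag list from the documented options (illustrative)."""
    valid = (0, 1, 2, 3)
    if isinstance(optimization_level, bool) or optimization_level not in valid:
        raise ValueError("optimization_level must be 0, 1, 2 or 3")
    flags = [f"-O{optimization_level}"]
    if use_fast_math:
        flags.append("--use_fast_math")
    # Extra flags go last so they are easy to spot in a build log.
    flags.extend(extra_cuda_cflags)
    return flags


flags = build_nvcc_flags(
    use_fast_math=True,
    extra_cuda_cflags=["--generate-line-info", "-maxrregcount=64"],
)
print(" ".join(flags))
```

Printing the assembled list before a build is a cheap way to confirm which flags a given configuration would request.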

CUDA Graph Support

CUDA Graphs (called command buffers in XLA) let the GPU driver record a sequence of kernel launches once and replay the recording cheaply, eliminating CPU-side kernel-launch overhead on every call. JAX captures CUDA graphs automatically for kernels registered with the COMMAND_BUFFER_COMPATIBLE XLA trait.

Pass allow_cuda_graph=True to opt in (this is the default):

mod = brainevent.load_cuda_inline(
    name="my_kernels",
    cuda_sources=CUDA_SRC,
    functions={"my_func": ["arg", "ret", "stream"]},
    allow_cuda_graph=True,   # default — no need to pass explicitly
)

Opt out only for kernels with host-side side effects during replay:

mod = brainevent.load_cuda_inline(
    ...,
    allow_cuda_graph=False,
)

Warning

Set allow_cuda_graph=False if the kernel has host-side side effects during replay: dynamic memory allocation (cudaMalloc), host callbacks, or non-deterministic resource usage. Plain element-wise, reduction, and GEMM kernels are all safe with the default True.
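The reason host-side work is a problem can be seen in a toy model of record-and-replay. The sketch below is pure Python and only mimics the idea: ToyCommandBuffer is a made-up class, not an XLA or CUDA API. Host code runs once while the sequence is recorded, whereas the recorded launches run again on every replay.

```python
class ToyCommandBuffer:
    """Records (kernel, args) pairs once and replays them on demand."""

    def __init__(self):
        self._launches = []

    def launch(self, kernel, *args):
        # Recording only stores the launch; nothing executes yet.
        self._launches.append((kernel, args))

    def replay(self):
        for kernel, args in self._launches:
            kernel(*args)


host_log = []
device_log = []


def scale_kernel(buf, factor):
    """Stand-in for a device kernel: scales the buffer in place."""
    device_log.append("scale")
    for i in range(len(buf)):
        buf[i] *= factor


def record_step(cb, buf):
    host_log.append("host-side setup")  # host side effect: record time only
    cb.launch(scale_kernel, buf, 2.0)


data = [1.0, 2.0]
cb = ToyCommandBuffer()
record_step(cb, data)
for _ in range(3):
    cb.replay()

print(len(host_log), len(device_log), data)
```

After three replays the stand-in kernel has run three times, but the host-side step has run only once. A kernel wrapper that relies on its host-side work happening on every call would therefore misbehave under replay.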

Combining All Options

mod = brainevent.load_cuda_inline(
    name="peak_perf",
    cuda_sources=CUDA_SRC,
    functions={"my_func": ["arg", "ret", "stream"]},
    optimization_level=3,    # -O3 (default)
    use_fast_math=True,      # ~10-30% faster FP ops, relaxed precision
    # allow_cuda_graph=True  # default — no need to pass explicitly
)

Caching Behaviour

optimization_level and use_fast_math are part of the cache key. Changing them for the same source triggers a recompilation and stores a separate cached binary — no need for force_rebuild=True.

allow_cuda_graph is a registration-only flag and does not affect the cache key or trigger a rebuild.
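This behaviour is what you would get from a cache key that hashes the source together with the compile-affecting options and leaves out registration-only flags. The sketch below illustrates that idea with hashlib. cache_key and its payload layout are assumptions for this example and say nothing about how brainevent actually computes its key.

```python
import hashlib
import json


def cache_key(cuda_source, optimization_level=3, use_fast_math=False,
              allow_cuda_graph=True):
    """Hash the inputs that change the compiled binary (illustrative only)."""
    # allow_cuda_graph is accepted but deliberately left out of the payload,
    # mirroring the statement that it does not affect the cache key.
    payload = json.dumps(
        {
            "source": cuda_source,
            "optimization_level": optimization_level,
            "use_fast_math": use_fast_math,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


SRC = "__global__ void k() {}"  # placeholder source string
base = cache_key(SRC)
print(base != cache_key(SRC, optimization_level=0))    # new key, new binary
print(base != cache_key(SRC, use_fast_math=True))      # new key, new binary
print(base == cache_key(SRC, allow_cuda_graph=False))  # same key, no rebuild
```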