LLM-based HipKittens/HIP kernel generator for AMD GPUs, targeting KernelBench GEMM problems.
```bash
cd /root/HipGenerator

# Set the API key
export LLM_GATEWAY_KEY='your_key_here'

# Test a single problem
python run_loop.py --problem /path/to/problem.py --max-attempts 3

# Run the batch GEMM test
./batch_test_gemm.sh
```

We developed prompts that enable LLMs to generate correct HipKittens GEMM kernels with:
- Adaptive tile sizing: 256x256 for large matrices, 128x128 for smaller ones
- Shared memory with double buffering
- Instruction scheduling optimizations
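The adaptive tile rule can be sketched in Python; the helper name below is illustrative, not taken from `run_loop.py`:

```python
def select_block_size(M, N):
    # 256x256 tiles pay off when both output dimensions are large;
    # smaller problems fall back to 128x128 tiles.
    return 256 if min(M, N) >= 512 else 128
```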
| Problem | Speedup | Status |
|---|---|---|
| 1_Square_matrix_multiplication_ | 0.85x | PASS |
| 2_Standard_matrix_multiplication_ | 0.66x | PASS |
| 6_Matmul_with_large_K_dimension_ | 0.00x | FAIL |
| 7_Matmul_with_small_K_dimension_ | 0.68x | PASS |
| 8_Matmul_with_irregular_shapes_ | 0.91x | PASS |
| 9_Tall_skinny_matrix_multiplication_ | 0.54x | PASS |
| 16_Matmul_with_transposed_A | 0.52x | PASS |
| 17_Matmul_with_transposed_B | 0.57x | PASS |
| 12_Gemm_Multiply_LeakyReLU | 0.50x | PASS |
| 29_Matmul_Mish_Mish | 0.00x | FAIL |
| 40_Matmul_Scaling_ResidualAdd | 0.00x | FAIL |
| 76_Gemm_Add_ReLU | 0.62x | PASS |
| 86_Matmul_Divide_GELU | 0.49x | PASS |
Summary: 10/13 tests pass (77%); best speedup 0.91x.
| Version | Description | Performance |
|---|---|---|
| Baseline | Global-to-register | 0.16x |
| + Shared memory | Double buffering | 0.40x |
| + sched_barrier | Instruction scheduling | 0.60x |
| + Bt caching | Avoid redundant transpose | 0.85x |
| + Adaptive tiles | 256/128 block selection | 0.91x peak |
```
HipGenerator/
├── run_loop.py              # Main generation loop
├── generate.py              # LLM code generator
├── eval.py                  # Evaluator with profiling
├── batch_test_gemm.sh       # Batch test script
├── test_gemm_reference.py   # Reference kernel implementation
├── prompts/
│   ├── config.json          # Prompt selection rules
│   ├── hipkittens_gemm.txt  # Main GEMM prompt
│   ├── hipkittens_base.txt  # Base prompt
│   └── elementwise_bf16.txt # Element-wise ops prompt
└── results/                 # Test outputs (gitignored)
```
To test a single problem:

```bash
python run_loop.py --problem /path/to/problem.py --max-attempts 3
```

To run the batch test, which covers all GEMM-related problems from KernelBench Level1 and Level2:

```bash
./batch_test_gemm.sh
```

To re-evaluate a generated kernel:

```bash
python eval.py --code results/problem_name/code_1.py --problem /path/to/problem.py
```

| Variable | Required | Default | Description |
|---|---|---|---|
| `LLM_GATEWAY_KEY` | Yes | - | API key for the LLM gateway |
| `PYTORCH_ROCM_ARCH` | No | `gfx950` | GPU architecture |
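A minimal sketch of how the evaluator might resolve these variables, with `PYTORCH_ROCM_ARCH` falling back to `gfx950` when unset (the helper name is hypothetical, not necessarily what `eval.py` uses):

```python
import os

def resolve_rocm_arch(env=os.environ):
    # Default to gfx950 unless the user overrides the architecture.
    return env.get("PYTORCH_ROCM_ARCH", "gfx950")
```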
Adaptive tile selection:

```python
if min(M, N) >= 512:
    BLOCK_SIZE = 256  # Large matrices
else:
    BLOCK_SIZE = 128  # Smaller matrices
```

Shared-memory allocation with double buffering:

```cpp
extern __shared__ alignment_dummy __shm[];
shared_allocator al((int*)&__shm[0]);
ST_A (&As)[2][2] = al.allocate<ST_A, 2, 2>();  // 2 buffers x 2 halves
```

Instruction scheduling around the MFMA inner loop:

```cpp
__builtin_amdgcn_sched_barrier(0);
__builtin_amdgcn_s_setprio(1);
mma_ABt(c_accum, a_reg, b_reg, c_accum);
__builtin_amdgcn_s_setprio(0);
```

Explicit wait counts for synchronization:

```cpp
// After global loads
asm volatile("s_waitcnt vmcnt(0)\n");
__builtin_amdgcn_s_barrier();

// After shared->register loads
asm volatile("s_waitcnt lgkmcnt(0)\n");
```

Prompts are auto-selected via `prompts/config.json`:
```json
{
  "default_prompt": "hipkittens_base.txt",
  "patterns": {
    "matmul|gemm": "hipkittens_gemm.txt",
    "relu|sigmoid": "elementwise_bf16.txt"
  }
}
```

Known limitations:

- Large K dimension (K > 4096): some accuracy issues
- Level2 nn.Linear problems: some produce NaN due to weight initialization
- Batched GEMM: not supported (falls back to the base prompt)
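The pattern-based prompt selection configured in `prompts/config.json` can be sketched as a regex scan over the problem name; `select_prompt` is a hypothetical name, not necessarily what `run_loop.py` defines:

```python
import re

# Mirrors the config.json shape: regex patterns mapped to prompt files,
# with a default when nothing matches.
CONFIG = {
    "default_prompt": "hipkittens_base.txt",
    "patterns": {
        "matmul|gemm": "hipkittens_gemm.txt",
        "relu|sigmoid": "elementwise_bf16.txt",
    },
}

def select_prompt(problem_name, config=CONFIG):
    for pattern, prompt_file in config["patterns"].items():
        if re.search(pattern, problem_name, re.IGNORECASE):
            return prompt_file
    return config["default_prompt"]
```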
Kernel-generation constraints enforced by the prompts:

- No PyTorch GEMM: `torch.mm`, `torch.matmul`, `torch.bmm`, and `F.linear` are forbidden
- nn.Linear handling: use `weight.contiguous()` (no transpose); `mma_ABt(x, weight)` computes x @ weight.T
- BF16 conversion: use `hip_bfloat16(float_val)` and `static_cast<float>(bf16_val)`
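The bf16 round-trip that the `hip_bfloat16` / `static_cast<float>` pair performs can be illustrated in plain Python: bf16 keeps the top 16 bits of an IEEE float32 (shown here with truncation; the real conversion may round to nearest even):

```python
import struct

def float_to_bf16_bits(x):
    # Reinterpret float32 as an integer and keep the high 16 bits.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_float(b):
    # Widen bf16 back to float32 by zero-filling the low 16 bits.
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x
```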
- GEMM kernels use HipKittens MFMA instructions (gfx950/CDNA4)
- Results are stored in the `results/` directory
- See `test_gemm_reference.py` for a working kernel example