Optimizing llama.cpp / ik_llama for Zen5 (Epyc & Threadripper) w/ CUDA (Blackwell)

I managed to get a very significant performance improvement with a number of compilation changes. These optimizations mostly apply to both llama.cpp and ik_llama, and I find that I get significantly better prefill and generation speeds with ik_llama.

This is a resource for tips and potentially overlooked details.
While much of it may be useful for a full setup, it is not a complete and comprehensive guide.

1. Use Zen5 & O3 Build Arguments

This will make compilation take a lot longer, but it's very much worth it. Set these environment variables when compiling anything (BLIS, llama.cpp, etc.) to enable -O3 optimizations, OpenMP acceleration, and Zen5 CPU instructions and optimizations:

export CFLAGS="-O3 -fopenmp -march=znver5"
export CXXFLAGS="-O3 -fopenmp -march=znver5"
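
Note that -march=znver5 needs a fairly recent toolchain (roughly GCC 14+ or Clang 19+). A quick sanity check, just a sketch that compiles an empty program from stdin with the flag:

echo 'int main(void){return 0;}' | gcc -O3 -march=znver5 -x c -o /dev/null - \
  && echo "znver5 is supported by this gcc"

If it fails with an unknown -march value, upgrade the compiler or fall back to -march=znver4 or -march=native.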

2. Compile a BLAS library, specifically AOCL-BLIS from AMD

This enables additional CPU compute optimizations through AMD's optimized fork of BLIS, a modern BLAS-compatible library.

Note: CBLAS support is required by llama.cpp, hence the --enable-cblas flag below.

git clone https://github.com/amd/blis
cd blis
./configure --enable-cblas auto
make -j"$(nproc)"
sudo make install
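
By default BLIS installs under /usr/local. A quick check, assuming that default prefix, that the library and the CBLAS header landed where the build and the runtime linker can find them:

ls /usr/local/lib/libblis* /usr/local/include/blis/cblas.h
sudo ldconfig   # refresh the linker cache so libblis.so is found at runtime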

3. Compile llama with build optimizations, BLIS, and CUDA options

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
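# If you are building ik_llama instead, clone https://github.com/ikawrakow/ik_llama.cpp here;
# some of the -D options below may exist in only one of the two forks, and CMake will simply
# warn about any option the project does not use.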

cmake -B build \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DGGML_CUDA_FORCE_CUBLAS=ON \
  -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=FLAME \
  -DGGML_NATIVE=ON \
  -DGGML_NUMA_MIRROR=ON \
  -DGGML_CUDA_IQK_FORCE_BF16=ON \
  -DGGML_CUDA_F16=ON \
  -DGGML_CUDA_MAX_CONTEXTS=256 \
  -DGGML_CUDA_MIN_BATCH_OFFLOAD=256 \
  -DGGML_AVX512=1 \
  -DGGML_AVX512_VBMI=1 \
  -DGGML_AVX512_VNNI=1 \
  -DGGML_AVX512_BF16=1 \
  -DCMAKE_C_FLAGS="-O3 -march=znver5" \
  -DCMAKE_CXX_FLAGS="-O3 -march=znver5"

# Note: CUDA_DMMV_X, CUDA_MMV_Y, and CUDA_PEER_MAX_BATCH_SIZE may also offer a benefit. I had CUDA sync stability issues that may or may not have been caused by higher values; YMMV.
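
If you do experiment with those, they are set at configure time like the other options. A sketch only: I am assuming they map to the GGML_-prefixed CMake variables shown below, their availability depends on your llama.cpp/ik_llama version, and the values are placeholders rather than recommendations:

cmake -B build \
  -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 \
  -DGGML_CUDA_DMMV_X=64 \
  -DGGML_CUDA_MMV_Y=2
# Re-running the configure step like this adds these variables to the existing build cache.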

cmake --build build --config Release --clean-first

# Note: Don't forget to set a -j flag with your CPU core count to compile with multiple threads; too high a value without sufficient RAM will make the build fail.
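
For example, assuming you have enough RAM for one compile job per core:

cmake --build build --config Release --clean-first -j "$(nproc)"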

4. Run ik_llama with optimizations

Your start command will vary, but here are some useful arguments (strip the inline comments before pasting):

  --jinja \ # use the model's Jinja chat template
  -mla 3 \ # MLA mode 3, often the fastest for supported models
  --flash-attn on \ # flash attention
  -ub 2048 \ # physical batch size, consumes VRAM
  -b 8192 \ # logical batch size
  -rtr \ # run-time tensor repacking optimizations
  -ns 2 \ # increase the number of parallel sequences
  --mlock \ # keep the model in RAM, don't swap to disk
  --cache-ram 400000 \ # prompt-cache budget held in system RAM as a supplement to VRAM
  --cont-batching \ # enable continuous batching
  --no-context-shift \ # disable trimming the start of the context
  --threads 32 # set this at or near your physical core count
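
Putting it together, a minimal sketch of a server launch using a subset of the flags above; the binary location, model path, and -ngl value are placeholders for illustration, not recommendations:

./build/bin/llama-server \
  -m /path/to/model.gguf \
  -ngl 99 \
  --jinja \
  -mla 3 \
  --flash-attn on \
  -ub 2048 -b 8192 \
  -rtr \
  --mlock \
  --cont-batching \
  --no-context-shift \
  --threads 32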
