I managed to get a very significant performance improvement with a number of compilation changes. These optimizations mostly apply to both llama.cpp and ik_llama, and I find that I get significantly better prefill and generation speeds with ik_llama.
This is a collection of tips and potentially overlooked details. While much of this guide may be usable for a full setup, it is not a complete, comprehensive guide.
1. Use Zen5 & O3 Build Arguments
This will make compilation take a lot longer, but it’s very much worth it. Set these environment variables when compiling anything (BLIS, llama, etc.) to enable O3 optimizations, OpenMP acceleration, and Zen5-specific CPU instructions:
export CFLAGS="-O3 -fopenmp -march=znver5"
export CXXFLAGS="-O3 -fopenmp -march=znver5"
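Note that znver5 is only recognized by fairly recent compilers (roughly GCC 14 / recent Clang). A quick check like the one below can save you a failed build; the fallback suggestions are mine, not from any official docs:

# Sanity check: does your compiler accept -march=znver5?
# If not, -march=znver4 or -march=native are reasonable fallbacks.
if gcc -march=znver5 -x c -c /dev/null -o /dev/null 2>/dev/null; then
    echo "znver5 supported"
else
    echo "znver5 not recognized; consider -march=znver4 or -march=native"
fi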
2. Compile a BLAS library, specifically AOCL-BLIS from AMD
This enables additional compute optimizations via AMD’s fork of BLIS, a modern BLAS-compatible library tuned for AMD CPUs.
Note: CBLAS compatibility is required for llama.
git clone https://github.com/amd/blis
cd blis
./configure --enable-cblas auto
make -j"$(nproc)"
sudo make install
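To confirm the library installed correctly and actually exposes the CBLAS interface llama needs, a quick check like this works (it assumes the default /usr/local install prefix; exact paths may differ on your system):

# Refresh the loader cache, then confirm the shared library exports CBLAS symbols
sudo ldconfig
ls /usr/local/lib/libblis*
nm -D /usr/local/lib/libblis.so | grep -i cblas_sgemm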
3. Compile llama with build optimizations, BLIS, and CUDA
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
-DGGML_CUDA=ON \
-DGGML_CUDA_FA_ALL_QUANTS=ON \
-DGGML_CUDA_FORCE_CUBLAS=ON \
-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=FLAME \
-DGGML_NATIVE=ON \
-DGGML_NUMA_MIRROR=ON \
-DGGML_CUDA_IQK_FORCE_BF16=ON \
-DGGML_CUDA_F16=ON \
-DGGML_CUDA_MAX_CONTEXTS=256 \
-DGGML_CUDA_MIN_BATCH_OFFLOAD=256 \
-DGGML_AVX512=1 \
-DGGML_AVX512_VBMI=1 \
-DGGML_AVX512_VNNI=1 \
-DGGML_AVX512_BF16=1 \
-DCMAKE_C_FLAGS="-O3 -march=znver5" \
-DCMAKE_CXX_FLAGS="-O3 -march=znver5"
# Note: CUDA_DMMV_X, CUDA_MMV_Y, and CUDA_PEER_MAX_BATCH_SIZE may also offer benefits. I had CUDA sync stability issues that may or may not have been caused by higher values, so YMMV.
cmake --build build --config Release --clean-first
# Note: Don’t forget to pass a -j flag with your CPU core count (e.g. -j$(nproc)) to compile in parallel; too high a value without sufficient RAM will cause the build to fail.
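Once the build finishes, it’s worth a quick sanity check that the binaries were actually linked against CUDA and BLIS. Something like the following works (paths assume the default build/bin layout, and the exact library names may differ on your system):

# Check the version banner and which acceleration libraries got linked in
./build/bin/llama-server --version
ldd ./build/bin/llama-server | grep -Ei 'blis|cublas'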
4. Run ik_llama with optimizations
Your start command will vary, but here are some useful arguments (a full example invocation follows the list):
--jinja \ # apply the model's Jinja chat template
-mla 3 \ # MLA 3, often fastest for supported models
--flash-attn on \ # flash attention
-ub 2048 \ # physical batching max, consumes VRAM
-b 8192 \ # batching size
-rtr \ # runtime tensor repack optimizations
-ns 2 \ # increase number of parallel sequences
--mlock \ # keep model in RAM, don’t swap to disk
--cache-ram 400000 \ # how many tokens can be cached to RAM as a supplement to VRAM
--cont-batching \ # enable continuous batching
--no-context-shift \ # disable trimming context start
--threads 32 # set this at or near your CPU core count
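Putting it together, a server launch might look roughly like the following. The model path, context size, and GPU layer count are placeholders for illustration only; substitute your own values and add or drop flags as your hardware allows:

./build/bin/llama-server \
  -m /models/your-model.gguf \
  -c 32768 \
  -ngl 99 \
  --jinja \
  -mla 3 \
  --flash-attn on \
  -ub 2048 -b 8192 \
  -rtr \
  --mlock \
  --cont-batching \
  --no-context-shift \
  --threads 32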