aitorrent/dolphin-2.9.2-qwen2-72b-gguf

Choosing the Right Model File: A Guide

When selecting a model file, it's essential to consider your system's resources and the tradeoffs between speed and quality. A helpful resource for understanding the performance differences between various models is Artefact2's write-up, which includes informative charts.

Step 1: Determine Your System's Capabilities

First, work out how much RAM and VRAM your system has; this determines the largest model file you can run. For maximum speed, fit the entire model in your GPU's VRAM by choosing a file 1-2GB smaller than the GPU's total VRAM. For maximum quality, add your system RAM and GPU VRAM together and choose a file 1-2GB smaller than that total; the model will then be split between RAM and VRAM, and will run more slowly than one that fits entirely on the GPU.
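To make the arithmetic concrete, here is a minimal sketch (not part of the original guide) that picks the largest quant fitting a given memory budget. The quant names and file sizes in the table are illustrative placeholders; use the actual sizes listed in this repo's file table.

```python
# Minimal sketch for picking the largest quant that fits your hardware.
# The file sizes below are illustrative placeholders; check the repo's file listing.

HEADROOM_GB = 2  # leave 1-2 GB free for context/KV cache and runtime overhead

# Hypothetical quant -> file size mapping (GB); replace with the real listing.
QUANT_SIZES_GB = {
    "Q8_0": 77.0,
    "Q6_K": 64.0,
    "Q5_K_M": 54.0,
    "Q4_K_M": 47.0,
    "IQ3_M": 35.0,
    "IQ2_M": 29.0,
}

def largest_fitting_quant(vram_gb: float, ram_gb: float, gpu_only: bool) -> str | None:
    """Return the biggest quant that fits the chosen memory budget, or None."""
    budget = (vram_gb if gpu_only else vram_gb + ram_gb) - HEADROOM_GB
    fitting = {name: size for name, size in QUANT_SIZES_GB.items() if size <= budget}
    return max(fitting, key=fitting.get) if fitting else None

# Example: 48 GB VRAM, 64 GB system RAM.
print(largest_fitting_quant(48, 64, gpu_only=True))   # fastest: fits entirely in VRAM
print(largest_fitting_quant(48, 64, gpu_only=False))  # max quality: spills into system RAM
```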

Step 2: Decide Between I-Quant and K-Quant

Next, decide between an 'I-quant' and a 'K-quant' model. If you prefer a straightforward choice, opt for a K-quant (e.g., Q5_K_M). If you want to dig into the details, refer to the llama.cpp feature matrix, which outlines backend support for the different quantization types.

I-Quant vs. K-Quant: Key Considerations

  • I-quants (e.g., IQ3_M) are newer and offer better performance for their size, especially below Q4. They work with cuBLAS (Nvidia) and rocBLAS (AMD), and can also run on CPU and Apple Metal, though more slowly there than the equivalent K-quants. They are not compatible with the Vulkan backend (also used for AMD cards), so if you have an AMD card, check whether you are using the rocBLAS build or the Vulkan build.
  • K-quants (e.g., Q5_K_M) are the simpler choice if you don't want to weigh these details, and they are broadly supported across backends; a small decision sketch follows this list.
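As a rough illustration of these rules, the sketch below encodes the backend and size considerations. The backend labels and the recommend_quant_family helper are made up for this example and are not part of llama.cpp.

```python
# Rough encoding of the guidance above; backend labels are for illustration only.

def recommend_quant_family(backend: str, target_bits: float) -> str:
    """Suggest 'I-quant' or 'K-quant' from the considerations described above."""
    backend = backend.lower()
    if backend == "vulkan":
        # I-quants are not supported on the Vulkan backend.
        return "K-quant"
    if target_bits < 4 and backend in {"cublas", "rocblas"}:
        # Below Q4 on cuBLAS/rocBLAS, I-quants give better quality for their size.
        return "I-quant"
    if backend in {"cpu", "metal"}:
        # I-quants run here too, but more slowly than equivalent K-quants,
        # so the simpler default is a K-quant.
        return "K-quant"
    return "K-quant"  # safe default when unsure

print(recommend_quant_family("rocblas", 3))  # -> I-quant
print(recommend_quant_family("vulkan", 3))   # -> K-quant
print(recommend_quant_family("cublas", 5))   # -> K-quant
```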

By considering your system's capabilities and the tradeoffs between I-quant and K-quant models, you can choose the optimal file for your needs.
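Once you have picked a file, a minimal loading sketch using the llama-cpp-python bindings (pip install llama-cpp-python) might look like the following; the filename is illustrative, and n_gpu_layers controls how many layers are offloaded to VRAM.

```python
# Minimal usage sketch with the llama-cpp-python bindings.
# The filename is illustrative; point model_path at whichever quant you downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="./dolphin-2.9.2-qwen2-72b-Q4_K_M.gguf",  # illustrative filename
    n_gpu_layers=-1,  # offload all layers if the file fits in VRAM; lower to split with system RAM
    n_ctx=4096,       # context window; larger values need more memory
)

out = llm("Explain the difference between I-quants and K-quants in one sentence.",
          max_tokens=128)
print(out["choices"][0]["text"])
```

Lowering n_gpu_layers trades speed for the ability to run a larger, higher-quality quant that does not fit entirely in VRAM.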

Credit: Original description by Artefact2.