bartowski/MadWizard-SFT-v2-Mistral-7b-v0.3-GGUF-torrent
This is a collection of quantized versions of the MadWizard-SFT-v2-Mistral-7b-v0.3 language model, covering a range of hardware configurations and performance requirements. All quantizations are based on Lumpen1/MadWizard-SFT-v2-Mistral-7b-v0.3 and were produced with the imatrix option in llama.cpp release b3152.
The models are available in various sizes and quantization levels, from Q8_1 (7.95GB file) down to IQ2_XS (2.20GB file). Quantization levels are named QX_K_X or IQX_X, where the leading number is the bit width: K-quants are the older format and work on all hardware, while I-quants are newer and generally better for their size, but are not compatible with Vulkan. For example, Q4_K_M is a 4-bit K-quant of medium size, and IQ3_M is a 3-bit I-quant.
To download a specific model, use the huggingface-cli tool with the following command:
huggingface-cli download bartowski/MadWizard-SFT-v2-Mistral-7b-v0.3-GGUF --include "MadWizard-SFT-v2-Mistral-7b-v0.3-Q4_K_M.gguf" --local-dir ./
Replace "MadWizard-SFT-v2-Mistral-7b-v0.3-Q4_K_M.gguf" with the desired model file name.
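If the huggingface-cli tool is not installed yet, it ships with the huggingface_hub Python package and can be installed with pip:
pip install -U "huggingface_hub[cli]"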
To choose the appropriate model for your hardware and performance needs, consider the following:
- Determine your available RAM and VRAM: for the largest model you can run, add your system RAM and GPU VRAM together to get the total available memory; for the fastest inference, count only the VRAM your GPU can dedicate to the model.
- Choose a model size: aim for a model file 1-2GB smaller than your total available memory, leaving headroom for the context and runtime overhead. For example, with 8GB to spend, target a file of roughly 6-7GB or less.
- Select a quantization level: for the best quality, choose a higher-bit quant (e.g., Q8_1 or Q8_0); for lower memory requirements, choose a lower-bit quant (e.g., Q3_K_L or IQ3_M).
- Decide between K-quant and I-quant: if you're running cuBLAS (Nvidia) or rocBLAS (AMD), consider an I-quant for models below Q4; for CPU, Apple Metal, or Vulkan, use a K-quant. (A sample run command is shown after this list.)
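As a minimal sketch of putting this together, assuming a llama.cpp build from around release b3152 (where the example binary is named llama-cli) and a GPU with enough VRAM to hold the whole Q4_K_M file, a fully offloaded run could look like:
llama-cli -m ./MadWizard-SFT-v2-Mistral-7b-v0.3-Q4_K_M.gguf -ngl 33 -c 4096 -p "Write a short story about a wizard."
Here -ngl 33 offloads all of the 7B model's layers to the GPU (lower or omit it to keep layers on the CPU), and -c sets the context size.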
For more detailed information on the quantization levels and performance comparisons, refer to the write-up by Artefact2 and the llama.cpp feature matrix.