aitorrent/Meta-Llama-3-8B-Instruct-GGUF-torrent
llama.cpp imatrix Quantizations of Meta-Llama-3-8B-Instruct
Using llama.cpp commit ffe6665 for quantization.
Original model: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
All quants were made using the imatrix option with a calibration dataset provided by Kalomaze.
Prompt format
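Meta-Llama-3-8B-Instruct uses the standard Llama 3 instruct template; the special tokens below are defined by the base model's tokenizer:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>


```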
Model Details
- Model Name: Meta-Llama-3-8B-Instruct
- Model Type: Large Language Model (LLM)
- Model Size: 8 billion parameters
- Quantization Options:
  - Q8_0: Extremely high quality, generally unneeded but max available quant. (8.54GB)
  - Q6_K: Very high quality, near perfect, recommended. (6.59GB)
  - Q5_K_M: High quality, recommended. (5.73GB)
  - Q5_K_S: High quality, recommended. (5.59GB)
  - Q4_K_M: Good quality, uses about 4.83 bits per weight, recommended. (4.92GB)
  - Q4_K_S: Slightly lower quality with more space savings, recommended. (4.69GB)
  - IQ4_NL: Decent quality, slightly smaller than Q4_K_S with similar performance, recommended. (4.67GB)
  - IQ4_XS: Decent quality, smaller than Q4_K_S with similar performance, recommended. (4.44GB)
  - Q3_K_L: Lower quality but usable, good for low RAM availability. (4.32GB)
  - Q3_K_M: Even lower quality. (4.01GB)
  - IQ3_M: Medium-low quality, new method with decent performance comparable to Q3_K_M. (3.78GB)
  - IQ3_S: Lower quality, new method with decent performance; recommended over Q3_K_S, same size with better performance. (3.68GB)
  - Q3_K_S: Low quality, not recommended. (3.66GB)
  - IQ3_XS: Lower quality, new method with decent performance, slightly better than Q3_K_S. (3.51GB)
  - IQ3_XXS: Lower quality, new method with decent performance, comparable to Q3 quants. (3.27GB)
  - Q2_K: Very low quality but surprisingly usable. (3.17GB)
  - IQ2_M: Very low quality, uses SOTA techniques to also be surprisingly usable. (2.94GB)
  - IQ2_S: Very low quality, uses SOTA techniques to be usable. (2.75GB)
  - IQ2_XS: Very low quality, uses SOTA techniques to be usable. (2.60GB)
  - IQ2_XXS: Very low quality, uses SOTA techniques to be usable. (2.39GB)
  - IQ1_M: Extremely low quality, not recommended. (2.16GB)
  - IQ1_S: Extremely low quality, not recommended. (2.01GB)
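
Once a quant is downloaded, it can be sanity-checked with the llama-cpp-python bindings. The sketch below is a minimal example; the local file name is illustrative, so substitute whichever quant you fetched:

```python
# Minimal sketch using the llama-cpp-python bindings (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3-8B-Instruct-Q4_K_M.gguf",  # illustrative file name; use the quant you downloaded
    n_gpu_layers=-1,  # offload all layers to the GPU; set to 0 for CPU-only
    n_ctx=8192,       # Llama 3's native context window
)

# create_chat_completion applies the chat template stored in the GGUF metadata,
# which should match the prompt format shown above.
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what an imatrix quant is in one sentence."}]
)
print(result["choices"][0]["message"]["content"])
```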
Choosing the Right Quantization
- Determine Your RAM and VRAM Availability:
  - GPU VRAM: For maximum speed, choose a quant with a file size 1-2GB smaller than your GPU's total VRAM.
  - System RAM + GPU VRAM: For maximum quality, choose a quant with a file size 1-2GB smaller than your combined system RAM and GPU VRAM (a worked sketch of this rule follows this list).
- Decide Between 'I-quant' and 'K-quant':
  - K-quants (e.g., Q5_K_M): Suitable for most users, offering a good balance between quality and size.
  - I-quants (e.g., IQ3_M): Newer, offering better performance for their size, especially below Q4. Best suited to cuBLAS (Nvidia) or rocBLAS (AMD) users.
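
A minimal sketch of the sizing rule above, using the file sizes listed in this card; the 1.5 GB headroom is an assumed midpoint of the 1-2GB guideline:

```python
# Pick the largest quant whose file fits in available memory with headroom.
# Sizes (GB) come from the quant list in this card; the 1.5 GB headroom is
# an assumed midpoint of the 1-2GB rule of thumb above.
QUANT_SIZES_GB = {
    "Q8_0": 8.54, "Q6_K": 6.59, "Q5_K_M": 5.73, "Q5_K_S": 5.59,
    "Q4_K_M": 4.92, "Q4_K_S": 4.69, "IQ4_NL": 4.67, "IQ4_XS": 4.44,
    "Q3_K_L": 4.32, "Q3_K_M": 4.01, "IQ3_M": 3.78, "IQ3_S": 3.68,
    "IQ3_XS": 3.51, "IQ3_XXS": 3.27, "Q2_K": 3.17, "IQ2_M": 2.94,
}

def pick_quant(memory_gb: float, headroom_gb: float = 1.5) -> str | None:
    """Return the largest quant that fits in memory_gb minus headroom."""
    budget = memory_gb - headroom_gb
    fitting = {q: s for q, s in QUANT_SIZES_GB.items() if s <= budget}
    return max(fitting, key=fitting.get) if fitting else None

# Example: an 8 GB GPU leaves a 6.5 GB budget; Q6_K (6.59GB) just misses,
# so Q5_K_M (5.73GB) is the pick.
print(pick_quant(8.0))  # -> Q5_K_M
```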
Additional Information
- Feature Chart: For detailed comparisons of quantization options, refer to the llama.cpp feature matrix.
- Compatibility: I-quants are not compatible with the Vulkan backend (relevant for AMD GPUs) and will run slower than the equivalent K-quants on CPU and Apple Metal.