aitorrent/Meta-Llama-3-8B-Instruct-GGUF-torrent

Llamacpp imatrix Quantizations of Meta-Llama-3-8B-Instruct

Using llama.cpp commit ffe6665 for quantization.

Original model: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

All quants were made using the imatrix option with a calibration dataset provided by Kalomaze.

Prompt format
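
This model uses the standard Llama 3 Instruct chat template, as documented on the original model card, with header tokens marking each turn:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```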

Model Details

  • Model Name: Meta-Llama-3-8B-Instruct
  • Model Type: Large Language Model (LLM)
  • Model Size: 8 billion parameters
  • Quantization Options:
    • Q8_0: Extremely high quality, generally unneeded but the largest quant available. (8.54GB)
    • Q6_K: Very high quality, near perfect, recommended. (6.59GB)
    • Q5_K_M: High quality, recommended. (5.73GB)
    • Q5_K_S: High quality, recommended. (5.59GB)
    • Q4_K_M: Good quality, uses about 4.83 bits per weight (see the size check after this list), recommended. (4.92GB)
    • Q4_K_S: Slightly lower quality with more space savings, recommended. (4.69GB)
    • IQ4_NL: Decent quality, slightly smaller than Q4_K_S with similar performance, recommended. (4.67GB)
    • IQ4_XS: Decent quality, smaller than Q4_K_S with similar performance, recommended. (4.44GB)
    • Q3_K_L: Lower quality but usable, good for low RAM availability. (4.32GB)
    • Q3_K_M: Even lower quality. (4.01GB)
    • IQ3_M: Medium-low quality, new method with decent performance comparable to Q3_K_M. (3.78GB)
    • IQ3_S: Lower quality, new method with decent performance; recommended over Q3_K_S (same size, better performance). (3.68GB)
    • Q3_K_S: Low quality, not recommended. (3.66GB)
    • IQ3_XS: Lower quality, new method with decent performance, slightly better than Q3_K_S. (3.51GB)
    • IQ3_XXS: Lower quality, new method with decent performance, comparable to Q3 quants. (3.27GB)
    • Q2_K: Very low quality but surprisingly usable. (3.17GB)
    • IQ2_M: Very low quality, uses SOTA techniques to also be surprisingly usable. (2.94GB)
    • IQ2_S: Very low quality, uses SOTA techniques to be usable. (2.75GB)
    • IQ2_XS: Very low quality, uses SOTA techniques to be usable. (2.60GB)
    • IQ2_XXS: Very low quality, uses SOTA techniques to be usable. (2.39GB)
    • IQ1_M: Extremely low quality, not recommended. (2.16GB)
    • IQ1_S: Extremely low quality, not recommended. (2.01GB)
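
As a rough sanity check (not from the original card), these file sizes follow from the bits-per-weight figures. Assuming roughly 8.03 billion parameters for Llama 3 8B, the Q4_K_M figure of about 4.83 bits per weight works out as follows:

```python
# Rough size estimate: file size ≈ parameters × bits-per-weight / 8.
# The ~8.03B parameter count is an assumption about Llama 3 8B; the real
# file is slightly larger because some tensors and the GGUF metadata are
# stored at higher precision.
params = 8.03e9
bits_per_weight = 4.83  # Q4_K_M, per the list above

est_bytes = params * bits_per_weight / 8
print(f"~{est_bytes / 1e9:.2f} GB")  # ~4.85 GB vs. the listed 4.92GB
```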

Choosing the Right Quantization

  1. Determine Your RAM and VRAM Availability:
    • GPU VRAM: For maximum speed, choose a quant with a file size 1-2GB smaller than your GPU's VRAM.
    • System RAM + GPU VRAM: For maximum quality, choose a quant with a file size 1-2GB smaller than your combined system RAM and GPU VRAM (a short sketch automating this rule follows this list).
  2. Decide Between 'I-quant' and 'K-quant':
    • K-quants (e.g., Q5_K_M): Suitable for most users, offering a balance between quality and size.
    • I-quants (e.g., IQ3_M): Newer, offering better performance for their size, especially below Q4. Suitable for cuBLAS (Nvidia) or rocBLAS (AMD) users.
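
The sizing rule above is mechanical enough to sketch in code. The quant names and sizes below come from the list earlier in this page; the function itself and the 1.5GB default headroom are illustrative assumptions, not part of any llama.cpp tooling:

```python
# Pick the largest quant whose file fits the memory budget, leaving
# 1-2GB of headroom for the KV cache and runtime overhead.
QUANT_SIZES_GB = {
    "Q8_0": 8.54, "Q6_K": 6.59, "Q5_K_M": 5.73, "Q5_K_S": 5.59,
    "Q4_K_M": 4.92, "Q4_K_S": 4.69, "IQ4_NL": 4.67, "IQ4_XS": 4.44,
    "Q3_K_L": 4.32, "Q3_K_M": 4.01, "IQ3_M": 3.78, "IQ3_S": 3.68,
    "Q3_K_S": 3.66, "IQ3_XS": 3.51, "IQ3_XXS": 3.27, "Q2_K": 3.17,
    "IQ2_M": 2.94, "IQ2_S": 2.75, "IQ2_XS": 2.60, "IQ2_XXS": 2.39,
    "IQ1_M": 2.16, "IQ1_S": 2.01,
}

def pick_quant(budget_gb: float, headroom_gb: float = 1.5) -> str | None:
    """Return the largest quant that fits in budget_gb minus headroom."""
    usable = budget_gb - headroom_gb
    fitting = {name: gb for name, gb in QUANT_SIZES_GB.items() if gb <= usable}
    return max(fitting, key=fitting.get) if fitting else None

# Speed: fit in VRAM alone, e.g. an 8GB GPU -> Q5_K_M (5.73GB).
print(pick_quant(8.0))
# Quality: fit in RAM + VRAM, e.g. 16GB RAM + 8GB VRAM -> Q8_0 (8.54GB).
print(pick_quant(24.0))
```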

Additional Information

  • Feature Chart: For detailed comparisons of quantization options, refer to the llama.cpp feature matrix.
  • Compatibility: I-quants are not compatible with the Vulkan backend (AMD users should confirm they are on the rocBLAS build rather than Vulkan) and will run slower on CPU and Apple Metal.
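
Once downloaded, any of these quants can be loaded the same way. A minimal sketch using the llama-cpp-python bindings (the model path is a hypothetical local file; `n_gpu_layers=-1` offloads all layers when the wheel was built with cuBLAS or ROCm support):

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3-8B-Instruct-IQ4_XS.gguf",  # hypothetical local path
    n_ctx=8192,        # Llama 3 context window
    n_gpu_layers=-1,   # offload everything that fits; use 0 for CPU-only
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(out["choices"][0]["message"]["content"])
```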