aitorrent/Meta-Llama-3-70B-Instruct-GGUF-torrent

Llama.cpp Imatrix Quantizations of Meta-Llama-3-70B-Instruct

This repository provides various GGUF quantizations of the Meta-Llama-3-70B-Instruct model, produced with llama.cpp release b2777 using the imatrix option and a calibration dataset provided by Kalomaze.

Original Model:
https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct

Quantization Options:

We offer a range of quantization options to suit different use cases and hardware configurations. The file sizes and descriptions are as follows:

  • Q8_0: Extremely high quality, generally unneeded but max available quant (74.97GB)
  • Q6_K: Very high quality, near perfect, recommended (57.88GB)
  • Q5_K_M: High quality, recommended (49.94GB)
  • Q5_K_S: High quality, recommended (48.65GB)
  • Q4_K_M: Good quality, uses about 4.83 bits per weight, recommended (42.52GB)
  • Q4_K_S: Slightly lower quality with more space savings, recommended (40.34GB)
  • IQ4_NL: Decent quality, slightly smaller than Q4_K_S with similar performance, recommended (40.05GB)
  • IQ4_XS: Decent quality, smaller than Q4_K_S with similar performance, recommended (37.90GB)
  • Q3_K_L: Lower quality but usable, good for low RAM availability (37.14GB)
  • Q3_K_M: Even lower quality (34.26GB)
  • IQ3_M: Medium-low quality, new method with decent performance comparable to Q3_K_M (31.93GB)
  • IQ3_S: Lower quality, new method with decent performance; same size as Q3_K_S but performs better, so recommended over it (30.91GB)
  • Q3_K_S: Low quality, not recommended (30.91GB)
  • IQ3_XS: Lower quality, new method with decent performance, slightly better than Q3_K_S (29.30GB)
  • IQ3_XXS: Lower quality, new method with decent performance, comparable to Q3 quants (27.46GB)
  • Q2_K: Very low quality but surprisingly usable (26.37GB)
  • IQ2_M: Very low quality, uses SOTA techniques to also be surprisingly usable (24.11GB)
  • IQ2_S: Very low quality, uses SOTA techniques to be usable (22.24GB)
  • IQ2_XS: Very low quality, uses SOTA techniques to be usable (21.14GB)
  • IQ2_XXS: Very low quality, uses SOTA techniques to be usable (19.09GB)
  • IQ1_M: Extremely low quality, not recommended (16.75GB)
  • IQ1_S: Extremely low quality, not recommended (15.34GB)
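
The first selection criterion is simply whether the file fits in memory. As an illustrative sketch (not part of the repository), the size column above can be turned into a small helper that picks the largest quant fitting a given memory budget; the headroom default is an assumption, since the KV cache and runtime overhead also need room:

```python
# File sizes in GB, copied from the quantization list above.
QUANT_SIZES_GB = {
    "Q8_0": 74.97, "Q6_K": 57.88, "Q5_K_M": 49.94, "Q5_K_S": 48.65,
    "Q4_K_M": 42.52, "Q4_K_S": 40.34, "IQ4_NL": 40.05, "IQ4_XS": 37.90,
    "Q3_K_L": 37.14, "Q3_K_M": 34.26, "IQ3_M": 31.93, "IQ3_S": 30.91,
    "Q3_K_S": 30.91, "IQ3_XS": 29.30, "IQ3_XXS": 27.46, "Q2_K": 26.37,
    "IQ2_M": 24.11, "IQ2_S": 22.24, "IQ2_XS": 21.14, "IQ2_XXS": 19.09,
    "IQ1_M": 16.75, "IQ1_S": 15.34,
}

def pick_quant(budget_gb, headroom_gb=2.0):
    """Return the largest quant whose file fits in budget_gb minus headroom.

    headroom_gb is an assumed reserve for KV cache and runtime overhead;
    returns None if nothing fits.
    """
    usable = budget_gb - headroom_gb
    fitting = {q: s for q, s in QUANT_SIZES_GB.items() if s <= usable}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)
```

For example, with a 48GB budget this selects Q4_K_M (42.52GB), and with 24GB it falls back to IQ2_XS (21.14GB).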

Downloading using Hugging Face CLI:

First, make sure you have the Hugging Face CLI installed (pip install -U huggingface_hub). Then, to download a specific file, use the following command:

huggingface-cli download bartowski/Meta-Llama-3-70B-Instruct-GGUF --include "Meta-Llama-3-70B-Instruct-Q4_K_M.gguf" --local-dir ./ --local-dir-use-symlinks False
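
To fetch a different quantization, only the filename in the --include flag changes. As a small convenience sketch (hf_download_cmd is a hypothetical helper, not part of huggingface_hub), the command above can be parameterized from Python:

```python
def hf_download_cmd(repo_id, include, local_dir="./"):
    """Build the huggingface-cli download command shown above
    for a given repository and file pattern."""
    return (
        f"huggingface-cli download {repo_id} "
        f'--include "{include}" '
        f"--local-dir {local_dir} --local-dir-use-symlinks False"
    )

# e.g. hf_download_cmd("bartowski/Meta-Llama-3-70B-Instruct-GGUF",
#                      "Meta-Llama-3-70B-Instruct-Q5_K_M.gguf")
```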

Choosing the Right Quantization:

To determine which quantization to use, consider the following factors:

  • Model size: Choose a quantization that fits within your available RAM and/or VRAM.
  • Speed vs. quality: If you want the absolute maximum quality, choose a larger quantization. If you want faster performance, choose a smaller quantization.
  • Hardware compatibility: If unsure, grab a K-quant; they run everywhere. Below Q4, the I-quants offer better quality for their size, but they need a cuBLAS (Nvidia) or rocBLAS (AMD) build to perform well, are slower on CPU, and are not compatible with Vulkan.

For more information on choosing the right quantization, refer to the original description.