aitorrent/dolphin-2.9.2-qwen2-7b-gguf
Llama.cpp Quantizations of Dolphin-2.9.2-qwen2-7b
This repository contains GGUF quantizations of the original Dolphin-2.9.2-qwen2-7b model, produced with llama.cpp release b2965. The original model can be found on Hugging Face at https://huggingface.co/cognitivecomputations/dolphin-2.9.2-qwen2-7b.
Quantization Details
All quantizations were made using the imatrix option with a dataset from [insert dataset source]. The prompt format for this model is as follows:
<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
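To see the prompt format in use end to end, here is a minimal sketch using the llama-cpp-python bindings. The model path, context size, and example messages are assumptions for illustration; `chat_format="chatml"` is passed explicitly to match the template above, although recent GGUF files usually carry the chat template in their metadata.

```python
# Minimal sketch using llama-cpp-python (pip install llama-cpp-python).
# The model path and generation settings below are illustrative assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="./dolphin-2.9.2-qwen2-7b-Q4_K_M.gguf",  # whichever quant you downloaded
    n_ctx=4096,            # context window; adjust to your memory budget
    n_gpu_layers=-1,       # offload all layers if you installed a GPU-enabled build
    chat_format="chatml",  # matches the <|im_start|>/<|im_end|> template shown above
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are Dolphin, a helpful assistant."},
        {"role": "user", "content": "Summarize what GGUF quantization does."},
    ],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```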
Downloading the Model
To download the model using huggingface-cli, first ensure you have the latest version installed:
pip install -U "huggingface_hub[cli]"
Then, target the specific file you want to download:
huggingface-cli download bartowski/dolphin-2.9.2-qwen2-7b-GGUF --include "dolphin-2.9.2-qwen2-7b-Q4_K_M.gguf" --local-dir ./
If the model is larger than 50GB, it will be split into multiple files. To download all files to a local folder, run:
huggingface-cli download bartowski/dolphin-2.9.2-qwen2-7b-GGUF --include "dolphin-2.9.2-qwen2-7b-Q8_0.gguf/*" --local-dir dolphin-2.9.2-qwen2-7b-Q8_0
Choosing the Right Quantization
To determine which quantization to use, consider the following factors:
- Speed: for the fastest performance, choose a quantization with a file size 1-2GB smaller than your GPU's total VRAM so the whole model fits on the GPU.
- Quality: if you prioritize quality over speed, add your system RAM and GPU VRAM together and choose a quantization with a file size 1-2GB smaller than that total (see the sketch below).
- I-quant vs. K-quant: K-quants (e.g., Q5_K_M) are a good starting point; I-quants (e.g., IQ3_M) offer better quality for their size, especially below Q4, and are a good choice for cuBLAS (Nvidia) or rocBLAS (AMD) users.
For more information on the feature matrix and performance charts, see [Artefact2's write-up](insert link).
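As a rough illustration of the sizing rules above, the sketch below picks the largest quant that fits a memory budget. The file sizes are illustrative placeholders rather than measured values; check the repository's file listing for the actual sizes.

```python
# Sketch: pick the largest quant that fits a memory budget, per the rules above.
# Sizes in GB are illustrative placeholders, not measured values.
quant_sizes_gb = {
    "Q8_0": 8.1,
    "Q6_K": 6.3,
    "Q5_K_M": 5.4,
    "Q4_K_M": 4.7,
    "IQ3_M": 3.6,
}

def pick_quant(vram_gb, ram_gb=0.0, headroom_gb=2.0, prefer_speed=True):
    """Return the largest quant whose file size fits the chosen budget."""
    budget = vram_gb - headroom_gb if prefer_speed else vram_gb + ram_gb - headroom_gb
    candidates = [(size, name) for name, size in quant_sizes_gb.items() if size <= budget]
    return max(candidates)[1] if candidates else None

print(pick_quant(vram_gb=8))                                  # fastest: fits entirely in VRAM
print(pick_quant(vram_gb=8, ram_gb=16, prefer_speed=False))   # highest quality: VRAM + RAM budget
```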
Original Credit
This description is based on the original model card for bartowski's llama.cpp quantizations of this model.