aitorrent/dolphin-2.9.2-qwen2-7b-gguf
Llama.cpp Quantizations of Dolphin-2.9.2-qwen2-7b
This repository contains GGUF quantizations of the original Dolphin-2.9.2-qwen2-7b model, produced with llama.cpp release b2965. The original model can be found on Hugging Face at https://huggingface.co/cognitivecomputations/dolphin-2.9.2-qwen2-7b.
Quantization Details
All quantizations were made using the imatrix option with a dataset from [insert dataset source]. The prompt format for this model is as follows:
<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
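To see the prompt format in use end to end, here is a minimal sketch using the llama-cpp-python bindings. The model path, context size, and example messages are assumptions for illustration; `chat_format="chatml"` is passed explicitly to match the template above, although recent GGUF files usually carry the chat template in their metadata.

```python
# Minimal sketch using llama-cpp-python (pip install llama-cpp-python).
# The model path and generation settings below are illustrative assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="./dolphin-2.9.2-qwen2-7b-Q4_K_M.gguf",  # whichever quant you downloaded
    n_ctx=4096,            # context window; adjust to your memory budget
    n_gpu_layers=-1,       # offload all layers if you installed a GPU-enabled build
    chat_format="chatml",  # matches the <|im_start|>/<|im_end|> template shown above
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are Dolphin, a helpful assistant."},
        {"role": "user", "content": "Summarize what GGUF quantization does."},
    ],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```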
Downloading the Model
To download the model using huggingface-cli, first ensure you have the latest version installed:
pip install -U "huggingface_hub[cli]"
Then, target the specific file you want to download:
huggingface-cli download bartowski/dolphin-2.9.2-qwen2-7b-GGUF --include "dolphin-2.9.2-qwen2-7b-Q4_K_M.gguf" --local-dir ./
If the model is larger than 50GB, it will be split into multiple files. To download all files to a local folder, run:
huggingface-cli download bartowski/dolphin-2.9.2-qwen2-7b-GGUF --include "dolphin-2.9.2-qwen2-7b-Q8_0.gguf/*" --local-dir dolphin-2.9.2-qwen2-7b-Q8_0
Choosing the Right Quantization
To determine which quantization to use, consider the following factors:
- Speed: for the fastest performance, choose a quantization with a file size 1-2GB smaller than your GPU's total VRAM so the whole model fits on the GPU.
- Quality: if you prioritize quality over speed, add your system RAM and GPU VRAM together and choose a quantization with a file size 1-2GB smaller than that total (see the sketch below).
- I-quant vs. K-quant: K-quants (e.g., Q5_K_M) are a good starting point; I-quants (e.g., IQ3_M) offer better quality for their size, especially below Q4, and are a good choice for cuBLAS (Nvidia) or rocBLAS (AMD) users.
For more information on the feature matrix and performance charts, see [Artefact2's write-up](insert link).
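As a rough illustration of the sizing rules above, the sketch below picks the largest quant that fits a memory budget. The file sizes are illustrative placeholders rather than measured values; check the repository's file listing for the actual sizes.

```python
# Sketch: pick the largest quant that fits a memory budget, per the rules above.
# Sizes in GB are illustrative placeholders, not measured values.
quant_sizes_gb = {
    "Q8_0": 8.1,
    "Q6_K": 6.3,
    "Q5_K_M": 5.4,
    "Q4_K_M": 4.7,
    "IQ3_M": 3.6,
}

def pick_quant(vram_gb, ram_gb=0.0, headroom_gb=2.0, prefer_speed=True):
    """Return the largest quant whose file size fits the chosen budget."""
    budget = vram_gb - headroom_gb if prefer_speed else vram_gb + ram_gb - headroom_gb
    candidates = [(size, name) for name, size in quant_sizes_gb.items() if size <= budget]
    return max(candidates)[1] if candidates else None

print(pick_quant(vram_gb=8))                                  # fastest: fits entirely in VRAM
print(pick_quant(vram_gb=8, ram_gb=16, prefer_speed=False))   # highest quality: VRAM + RAM budget
```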
Original Credit
This description is based on the original model card for bartowski's llama.cpp quantizations of this model.