Working with GPT-Fast on Windows - an Extended Setup Guide
General Setup
This is an additional setup guide for the GPT-Fast repository - you will need to clone this repository to your machine and then run all commands from there.
I'm running this on a Windows 11 machine with an RTX 4090, using Python 3.10.11 and pip 23.3.1. You can install all of this in a venv if you want to - note that if you are using conda you will have to be a bit more careful in the Installing PyTorch chapter. Note: I recommend having at least 16 GB of GPU memory available even if you are running the smallest supported model (meta-llama/Llama-2-7b-chat-hf), as I couldn't get the 4-bit quantization (both int4 and int4-gptq) to run and had to stick with 8-bit quantization. If you have the choice I would recommend either an RTX 4090 or one of the RTX Ada Generation cards (4500 and up) - I'd imagine that an RTX 5000 or better on the Ampere architecture should probably still outperform the RTX 4090, but the Ada architecture cards, with the new Tensor core generation, seem to be faster.
Please follow this guide from top to bottom and only skip steps if you are 100% certain that you have set them up before or the included tests work correctly.
General Requirements
For this to work you will need to have submitted a request for access to the Llama 2 weights and models over at https://ai.meta.com/resources/models-and-libraries/llama-downloads/.
Next you will need a Hugging Face account that uses the same e-mail address as the one you used for the above request - you can create a free account over at https://huggingface.co/join. This is quite fast - you will usually receive an e-mail within an hour.
Next, go to the https://huggingface.co/meta-llama/Llama-2-7b repository and request access - this can take up to 2 days, but usually only takes around an hour.
Installing PyTorch
What is PyTorch?
PyTorch is designed to provide flexibility and ease in building deep learning projects. PyTorch offers tools and functionalities to perform tensor computations with GPU acceleration, automatic differentiation for building and training neural networks, and a dynamic computational graph that allows for easy debugging and experimentation.
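To make these terms a bit more concrete, here is a minimal, optional sketch (not part of GPT-Fast itself) that shows a tensor computation on the GPU and automatic differentiation:
# Save as pytorchConcepts.py (optional)
import torch

# Tensor computation, on the GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.rand(3, 3, device=device)
b = torch.rand(3, 3, device=device)
print(a @ b)  # matrix multiplication

# Automatic differentiation: PyTorch records the operations on x
# and can compute the gradient of y with respect to x for us.
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x
y.backward()
print(x.grad)  # dy/dx = 2*x + 3 = 7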
Why do I need this?
Well, the whole premise of this exercise is to follow the example implementation from the Accelerating Generative AI with PyTorch blog post - which you can find here.
Setup
Go to https://pytorch.org/get-started/locally and select your platform accordingly.
For me the command looks like this - note that the torch install is easily above 2.5 GB:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Note that this also installs the following packages:
- mpmath
- urllib3
- typing-extensions
- sympy
- pillow
- numpy
- networkx
- MarkupSafe
- idna
- fsspec
- filelock
- charset-normalizer
- certifi
- requests
- jinja2
- torch
- torchvision
- torchaudio
After the install has finished test that PyTorch was successfully installed by running the following script:
# Save as testPyTorch.py
import torch
x = torch.rand(5, 3)
print(x)
# True if CUDA is enabled
print(torch.cuda.is_available())
python .\testPyTorch.py
If you see a True in the log then CUDA is enabled as well.
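If you want a bit more detail than just True, the following optional sketch prints the GPU that PyTorch sees and how much memory it has - handy for checking the 16 GB recommendation from the beginning of this guide:
# Save as testCudaDevice.py (optional)
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(torch.cuda.current_device())
    print(f"Device: {props.name}")
    print(f"Total memory: {props.total_memory / 1024**3:.1f} GiB")
    print(f"CUDA capability: {props.major}.{props.minor}")
else:
    print("CUDA is not available - check your driver and the installed torch build")
python .\testCudaDevice.py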
Sentencepiece and Huggingface_hub
What is sentencepiece and why do we need it?
SentencePiece is an unsupervised text tokenizer and detokenizer that is used to tokenize (split up) the input prompt before it is passed to the LLM.
For a deeper look at SentencePiece I found this article quite helpful. If you don't know anything about tokenizers in general I would recommend the following page https://huggingface.co/docs/transformers/tokenizer_summary.
A link to the original paper introducing SentencePiece can be found here.
What is huggingface_hub and why do we need it?
The Hugging Face Hub is an open-source platform provided by Hugging Face that serves as a centralized repository for sharing and exploring models, datasets, and other resources related to natural language processing (NLP) and artificial intelligence. It's essentially a platform where users can discover, publish, and use pre-trained models and datasets. It is where we will get the Llama 2 weights from.
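As a small, optional illustration (we will use the repository's own download script later), this is roughly how huggingface_hub lets you inspect and fetch files from a model repository - the repository and file name here are just examples:
# Save as testHuggingfaceHub.py (optional)
from huggingface_hub import HfApi, hf_hub_download

api = HfApi()
# List the files of a public model repository (no login required for this one)
print(api.list_repo_files("openlm-research/open_llama_7b"))

# Download a single file into the local Hugging Face cache and print its path
print(hf_hub_download(repo_id="openlm-research/open_llama_7b", filename="config.json"))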
Setup
pip install sentencepiece huggingface_hub
Note that this also installs the following packages:
- pyyaml
- packaging
- fsspec
- colorama
- tqdm
After the install has finished test that sentencepiece was successfully installed by running the following script:
# Save as testSentencepiece.py
import requests
import sentencepiece as spm
import os

# Get example text to train a tokenizer
file_path = 'botchan.txt'
response = requests.get('https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt')
if response.status_code == 200:
    with open(file_path, 'wb') as file:
        file.write(response.content)
    print(f"File downloaded and saved as '{file_path}'")
else:
    print('Failed to download the file')

# Example based off of https://colab.research.google.com/github/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb#scrollTo=ee9W6wGnVteW
# Train a sentencepiece model from `botchan.txt`, which creates `m.model` and `m.vocab`
# `m.vocab` is just a reference, it is not used in the segmentation
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')

# Make a segmenter instance and load the model file (m.model)
sp = spm.SentencePieceProcessor()
sp.load('m.model')

# Encode
print(sp.encode_as_pieces('This is a test'))
# Decode
print(sp.decode_pieces(['▁This', '▁is', '▁a', '▁t', 'est']))

# Clean up
def delete_file(fp):
    if os.path.exists(fp):
        os.remove(fp)
        print(f"File '{fp}' deleted successfully")
    else:
        print(f"File '{fp}' does not exist")

delete_file(file_path)
delete_file('m.model')
delete_file('m.vocab')
python .\testSentencepiece.py
If you see the following two lines in the log, among a lot of other output, it was successful:
# ['▁This', '▁is', '▁a', '▁t', 'est']
# This is a test
Login with the Huggingface Hub
Go to https://huggingface.co/settings/tokens, click New token, give it a name like GPTFast, leave the role at read and click Generate a token, then copy that token to a safe place for now - make sure to never share this token with anybody!
Then run the following and when prompted for a token paste in the value from the previous step and feel free to have it added as a git credential:
huggingface-cli login
A successful login here also serves as the test for a correct setup.
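If you want an additional check, this optional sketch asks the Hub which account the token stored by huggingface-cli login belongs to:
# Save as testHfLogin.py (optional)
from huggingface_hub import whoami

# whoami() uses the token saved by `huggingface-cli login`
print(f"Logged in as: {whoami()['name']}")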
Choosing your model
Please note the following download sizes for the different models - because of metadata files and checkpointing the full download can end up quite a bit larger (a quick disk-space check follows the list):
- openlm-research/open_llama_7b is ~ 13 GiB (14 GB)
- meta-llama/Llama-2-7b-chat-hf is ~ 13 GiB (14 GB)
- meta-llama/Llama-2-13b-chat-hf ~ 24 GiB (26 GB)
- meta-llama/Llama-2-70b-chat-hf ~ 128 GiB (137 GB)
- codellama/CodeLlama-7b-Python-hf ~ 12 GiB (13 GB)
- codellama/CodeLlama-34b-Python-hf ~ 22 GiB (24 GB)
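Since these downloads are large, here is an optional little sketch to check your free disk space first - the path is an assumption, point it at whatever drive your clone of the repository lives on:
# Save as checkDiskSpace.py (optional)
import shutil

# Check the drive the current directory (i.e. the repository clone) is on
total, used, free = shutil.disk_usage(".")
print(f"Free disk space: {free / 1024**3:.1f} GiB")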
I will be using the meta-llama/Llama-2-7b-chat-hf model as it is lightweight and can easily run on my system.
Downloading the model and setting everything up
Now we download your model of choice from Hugging Face - you will need the full model repository and model name from the previous step as well as your Hugging Face token (if you have lost your token you can go to https://huggingface.co/settings/tokens to retrieve it).
python .\scripts\download.py --repo_id <MODEL-REPO/MODEL-NAME> --hf_token <TOKEN> # e.g. --repo_id meta-llama/Llama-2-7b-chat-hf
python .\scripts\convert_hf_checkpoint.py --checkpoint_dir ./checkpoints/<MODEL-REPO>/<MODEL-NAME> # e.g. --checkpoint_dir ./checkpoints/meta-llama/Llama-2-7b-chat-hf
python .\quantize.py --checkpoint_path ./checkpoints/<MODEL-REPO>/<MODEL-NAME>/model.pth --mode int8 # e.g. --checkpoint_path ./checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth
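If you prefer to run these three steps in one go, here is an optional convenience sketch - the script paths and flags are exactly the ones above, the model name is just the example I am using:
# Save as prepareModel.py (optional)
import subprocess
import sys

REPO_ID = "meta-llama/Llama-2-7b-chat-hf"
HF_TOKEN = "<TOKEN>"  # paste your Hugging Face token here
CHECKPOINT_DIR = f"./checkpoints/{REPO_ID}"

steps = [
    [sys.executable, "./scripts/download.py", "--repo_id", REPO_ID, "--hf_token", HF_TOKEN],
    [sys.executable, "./scripts/convert_hf_checkpoint.py", "--checkpoint_dir", CHECKPOINT_DIR],
    [sys.executable, "./quantize.py", "--checkpoint_path", f"{CHECKPOINT_DIR}/model.pth", "--mode", "int8"],
]

for step in steps:
    print("Running:", " ".join(step))
    subprocess.run(step, check=True)  # abort if a step fails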
Running a Test
This is the final test to ensure that all previous steps were successful:
python generate.py --compile --checkpoint_path ./checkpoints/<MODEL-REPO>/<MODEL-NAME>/model.pth --prompt "Hello, my name is <NAME>" # e.g. --checkpoint_path ./checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth
Here is the response I got:
"Nice to meet you, David! How are you doing?"
The response took around 4 seconds - most of that time seems to be spent loading the actual model, so the next step would be to keep it loaded and run a continuous workload.
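As a very rough sketch of that idea - load_model and run_generation below are hypothetical placeholders, not real GPT-Fast APIs, standing in for the loading and generation logic inside generate.py - the point is simply that the expensive load happens once, outside the loop:
# Pseudocode-style sketch, not meant to run as-is
def load_model(checkpoint_path):
    raise NotImplementedError("placeholder for the model/tokenizer loading in generate.py")

def run_generation(model, tokenizer, prompt):
    raise NotImplementedError("placeholder for the token generation in generate.py")

def serve_prompts(checkpoint_path):
    # The expensive part - loading the weights - happens exactly once
    model, tokenizer = load_model(checkpoint_path)
    while True:
        prompt = input("Prompt (empty line to quit): ")
        if not prompt:
            break
        print(run_generation(model, tokenizer, prompt))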