
Serving Llama 3.1-8B on Compute - How to Guide

In this guide, you'll find a step-by-step walkthrough on how to set up and run Llama 3.1-8B seamlessly on Hivenet's Compute.

Published on December 17, 2024 by Mamoutou Diarra
Requirements
  • 1 Compute X-Small instance with Ubuntu v24.04 LTS template
  • A valid HuggingFace token with access to Llama 3.1
Dependencies
  • pip
  • miniconda
  • vllm
Installing dependencies

To avoid using ephemeral storage, it is important to install all your dependencies inside the /home/ubuntu/workspace folder.

To install the dependencies, first SSH into your Compute instance (replace the hostname below with your own instance's address):

ssh -i ~/.ssh/id_rsa -o "ProxyCommand=ssh bastion@ssh.hivecompute.ai %h" ubuntu@d348351b-a04c-4b98-9d1a-2e474623395b.ssh.hivecompute.ai
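Once connected, you can check that the workspace folder is the persistent volume (a quick sanity check, assuming the volume is mounted as its own filesystem):

df -h /home/ubuntu/workspace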

Installing miniconda

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O install_miniconda.sh
bash install_miniconda.sh -b -p /home/ubuntu/workspace/opt/conda
export PATH=/home/ubuntu/workspace/opt/conda/bin:$PATH
conda init

After installing miniconda, you need to disconnect (CTRL+D) and reconnect for the changes to take effect.
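After reconnecting, you can verify that conda is on your PATH:

conda --version
which python   # should point under /home/ubuntu/workspace/opt/conda once the base environment is active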

Installing vllm

pip install vllm
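To confirm that the installation succeeded, you can import vllm and print its version:

python -c "import vllm; print(vllm.__version__)"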

Serving Llama 3.1 8B

Note that Llama 3.1 models support a maximum context length of 128K tokens. The RTX 4090 has 24GB of VRAM; in BF16 precision, the model parameters take about 16GB (8B parameters × 2 bytes), leaving roughly 8GB for the KV cache. With 8GB, the context length cannot exceed about 59K tokens, which is sufficient for most use cases. However, if you need to serve your model with more than a 59K context length, you can either use FP8 quantization or a bigger Compute instance.
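The 59K figure follows from a back-of-the-envelope calculation, assuming Llama 3.1-8B's published architecture (32 transformer layers, 8 KV heads with a head dimension of 128):

# KV cache per token: K and V, across 32 layers, 8 KV heads, head_dim 128, 2 bytes (BF16)
echo $(( 2 * 32 * 8 * 128 * 2 ))                  # 131072 bytes, i.e. 128 KiB per token
# Tokens that fit in the ~8GB left after the weights:
echo $(( 8 * 1024**3 / (2 * 32 * 8 * 128 * 2) ))  # 65536; vLLM's runtime overhead brings the usable figure down to ~59K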

Serving in full precision with 59K context length

  1. Run the command below

export HF_TOKEN=<YOUR_HUGGING_FACE_TOKEN>
nohup vllm serve meta-llama/Llama-3.1-8B-Instruct --download-dir /home/ubuntu/workspace --gpu-memory-utilization 1 --max-model-len 59000 &

  2. Watch the log and wait until the OpenAI endpoints are exposed:

tail -f nohup.out
INFO 12-06 11:19:23 launcher.py:19] Available routes are:
INFO 12-06 11:19:23 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO 12-06 11:19:23 launcher.py:27] Route: /docs, Methods: GET, HEAD
INFO 12-06 11:19:23 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 12-06 11:19:23 launcher.py:27] Route: /redoc, Methods: GET, HEAD
INFO 12-06 11:19:23 launcher.py:27] Route: /health, Methods: GET
INFO 12-06 11:19:23 launcher.py:27] Route: /tokenize, Methods: POST
INFO 12-06 11:19:23 launcher.py:27] Route: /detokenize, Methods: POST
INFO 12-06 11:19:23 launcher.py:27] Route: /v1/models, Methods: GET
INFO 12-06 11:19:23 launcher.py:27] Route: /version, Methods: GET
INFO 12-06 11:19:23 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 12-06 11:19:23 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 12-06 11:19:23 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [93203]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on socket ('0.0.0.0', 8000) (Press CTRL+C to quit)

  3. Test the model locally

curl -X POST "http://localhost:8000/v1/chat/completions" -H "Content-Type: application/json" --data '{"model": "meta-llama/Llama-3.1-8B-Instruct","messages":[{"role": "user", "content": "what is AI?"}],"max_tokens": 50}'

{"id":"chat-c31b1784c32646d2ba146e72352b6fae","object":"chat.completion","created":1733491175,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. The term can also be applied to any machine that exhibits traits associated with a human mind such as learning and problem-solving.\n\nAI","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":39,"total_tokens":89,"completion_tokens":50},"prompt_logprobs":null}

Serving in FP8 precision with 128K context length
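If the full-precision server from the previous section is still running, stop it first, since both servers bind to port 8000. One way to do this, assuming no other vllm processes are running on the instance:

pkill -f "vllm serve"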
  1. Run the command below

export HF_TOKEN=<YOUR_HUGGING_FACE_TOKEN>
nohup vllm serve meta-llama/Llama-3.1-8B-Instruct --download-dir /home/ubuntu/workspace --gpu-memory-utilization 1 --max-model-len 128000 --dtype half --quantization fp8 --kv-cache-dtype fp8 &

  2. Watch the log and wait until the OpenAI endpoints are exposed:

tail -f nohup.out
The log output is the same as in the full-precision run above; wait for the final "Uvicorn running" line before testing.

  3. Test the model locally

curl -X POST "http://localhost:8000/v1/chat/completions" -H "Content-Type: application/json" --data '{"model": "meta-llama/Llama-3.1-8B-Instruct","messages":[{"role": "user", "content": "what is AI?"}],"max_tokens": 50}'

{"id":"chat-67dfc8a8c6904642a27fe8c889a6455d","object":"chat.completion","created":1733491601,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Artificial Intelligence (AI) is a broad field of computer science that focuses on creating intelligent machines that can perform tasks that typically require human intelligence. AI involves the development of algorithms, statistical models, and machine learning techniques to enable computers to think, learn","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":39,"total_tokens":89,"completion_tokens":50},"prompt_logprobs":null}

Expose the OpenAI endpoints to the external world

To expose the model endpoints externally, you have several options:

  • You can use FRP (fast reverse proxy) if you already own a publicly accessible server/VM
  • You can use ngrok if you don't own any publicly accessible instance

External access via ngrok

  1. Create a free account on ngrok (https://ngrok.com)
  2. Log in to the account and follow the instructions
  3. Copy your ngrok token from your ngrok account
  4. Download and install ngrok

wget https://bin.equinox.io/c/bNyj1mQVY4c/ngrok-v3-stable-linux-amd64.tgz
sudo tar -xvzf ngrok-v3-stable-linux-amd64.tgz -C /usr/local/bin
ngrok config add-authtoken <YOUR_NGROK_TOKEN>

  5. Claim an ngrok static domain from your ngrok account
  6. Launch ngrok with your static domain

nohup ngrok http --url=<YOUR_NGROK_STATIC_DOMAIN> 8000 &
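To check that the tunnel is up before testing from outside, you can query ngrok's local inspection API (exposed on port 4040 by default):

curl -s http://localhost:4040/api/tunnels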

  7. Test external access to the OpenAI endpoints (from your laptop)

curl https://<YOUR_NGROK_STATIC_DOMAIN>/v1/models
{"object":"list","data":[{"id":"meta-llama/Llama-3.1-8B-Instruct","object":"model","created":1733495850,"owned_by":"vllm","root":"meta-llama/Llama-3.1-8B-Instruct","parent":null,"max_model_len":128000,"permission":[{"id":"modelperm-ecfa0d0e5dd04d4c973e9fc134d00d98","object":"model_permission","created":1733495850,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

curl -X POST "https://<YOUR_NGROK_STATIC_DOMAIN>/v1/chat/completions" -H "Content-Type: application/json" --data '{"model": "meta-llama/Llama-3.1-8B-Instruct","messages":[{"role": "user", "content": "what is AI?"}],"max_tokens": 50}'
{"id":"chat-8f35de4b78ca4710b7d6badd26fc6439","object":"chat.completion","created":1733495952,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"AI, or Artificial Intelligence, refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. The term can refer to any machine that exhibits traits associated with a human mind such as learning, problem-solving, decision-making,","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":39,"total_tokens":89,"completion_tokens":50},"prompt_logprobs":null}
