
A step-by-step guide on how to deploy Llama 3.1 8B on Compute with Hivenet

A quick, no-nonsense setup guide for deploying Llama 3.1 8B on Hivenet Compute. Whether you need high-performance inference or extended context length, this guide walks you through installing dependencies, serving the model, and exposing OpenAI-compatible endpoints.

🚀 Get started now and unlock the full potential of Llama 3.1 on Hivenet Compute!

Requirements

  • 1 Compute X-Small instance with the Ubuntu 24.04 LTS template
  • Hugging Face token with access to Llama 3.1
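
Before you begin, you can confirm that your token actually has access to the gated Llama 3.1 repository. A minimal check using the Hugging Face Hub API (HTTP 200 with model metadata means access is granted; 401 or 403 means it is not):

curl -s -o /dev/null -w "%{http_code}\n" -H "Authorization: Bearer <YOUR_HUGGING_FACE_TOKEN>" https://huggingface.co/api/models/meta-llama/Llama-3.1-8B-Instruct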

Dependencies

Ensure the following dependencies are installed:

  • pip
  • miniconda
  • vllm

Installing dependencies

To avoid losing installed packages, install all dependencies inside the /home/ubuntu/workspace directory.

Connect to your Compute instance

First, establish an SSH connection to your Compute instance (replace the hostname below with your own instance's SSH address):

ssh -i ~/.ssh/id_rsa -o "ProxyCommand=ssh bastion@ssh.hivecompute.ai %h" ubuntu@d348351b-a04c-4b98-9d1a-2e474623395b.ssh.hivecompute.ai

Install miniconda

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O install_miniconda.sh
bash install_miniconda.sh -b -p /home/ubuntu/workspace/opt/conda
export PATH=/home/ubuntu/workspace/opt/conda/bin:$PATH
conda init

Disconnect (CTRL+D) and reconnect for the changes to take effect.
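
Once reconnected, you can confirm that conda is on your PATH and that python resolves inside the workspace (the expected path assumes the install prefix used above):

conda --version
which python   # expect /home/ubuntu/workspace/opt/conda/bin/python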

Install vllm

pip install vllm
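
As a quick sanity check, import the package and print its version (assumes the install above succeeded):

python -c "import vllm; print(vllm.__version__)"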

Serving Llama 3.1 8B

Llama 3.1 models have a maximum context length of 128K tokens. On an RTX 4090 (24GB VRAM):

  • 16GB is used for model parameters (BF16 precision)
  • 8GB is available for the KV cache
  • This allows for up to 59K tokens in context length (see the back-of-envelope calculation after this list)
  • For longer context lengths, use FP8 quantization or a larger Compute instance
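
As a rough check, assuming the published Llama 3.1 8B architecture (32 transformer layers, 8 KV heads under grouped-query attention, head dimension 128) and 2 bytes per element in BF16, each token of context costs 2 × 32 × 8 × 128 × 2 bytes of KV cache. An 8GB budget then holds about 65K tokens; vLLM's activation and scheduling overhead brings the practical figure down toward the 59K used below:

# bytes of KV cache per token: K and V, per layer, per KV head, per head dim, BF16
echo $(( 2 * 32 * 8 * 128 * 2 ))      # 131072 bytes (128 KiB)
# tokens that fit in an 8 GiB KV-cache budget
echo $(( 8 * 1024 ** 3 / 131072 ))    # 65536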

Running in full precision (59K tokens)

export HF_TOKEN=<YOUR_HUGGING_FACE_TOKEN>
nohup vllm serve meta-llama/Llama-3.1-8B-Instruct --download-dir /home/ubuntu/workspace --gpu-memory-utilization 1 --max-model-len 59000 &

Monitor logs to confirm endpoints are available:

tail -f nohup.out

Look for output similar to:
INFO 12-06 11:19:23 launcher.py:19] Available routes are:
INFO 12-06 11:19:23 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO 12-06 11:19:23 launcher.py:27] Route: /docs, Methods: GET, HEAD
INFO 12-06 11:19:23 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 12-06 11:19:23 launcher.py:27] Route: /redoc, Methods: GET, HEAD
INFO 12-06 11:19:23 launcher.py:27] Route: /health, Methods: GET
INFO 12-06 11:19:23 launcher.py:27] Route: /tokenize, Methods: POST
INFO 12-06 11:19:23 launcher.py:27] Route: /detokenize, Methods: POST
INFO 12-06 11:19:23 launcher.py:27] Route: /v1/models, Methods: GET
INFO 12-06 11:19:23 launcher.py:27] Route: /version, Methods: GET
INFO 12-06 11:19:23 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 12-06 11:19:23 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 12-06 11:19:23 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [93203]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on socket ('0.0.0.0', 8000) (Press CTRL+C to quit)
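
Once the routes are listed, the /health endpoint from the log above offers a quick readiness check; an HTTP 200 response means the server is accepting requests:

curl -i http://localhost:8000/health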

Testing the model locally

Send a test request:

curl -X POST "http://localhost:8000/v1/chat/completions" -H "Content-Type: application/json" --data '{"model": "meta-llama/Llama-3.1-8B-Instruct","messages":[{"role": "user", "content": "what is AI?"}],"max_tokens": 50}'

{"id":"chat-c31b1784c32646d2ba146e72352b6fae","object":"chat.completion","created":1733491175,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. The term can also be applied to any machine that exhibits traits associated with a human mind such as learning and problem-solving.\n\nAI","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":39,"total_tokens":89,"completion_tokens":50},"prompt_logprobs":null}

Running in FP8 precision (128K tokens)

If the full-precision server from the previous section is still running, stop it first; both configurations bind port 8000 and claim the entire GPU memory budget.
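
One way to stop it, assuming it was launched with nohup as above:

pkill -f "vllm serve"

Then relaunch with FP8 quantization for both the weights and the KV cache to unlock the full 128K context: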

export HF_TOKEN=<YOUR_HUGGING_FACE_TOKEN>
nohup vllm serve meta-llama/Llama-3.1-8B-Instruct --download-dir /home/ubuntu/workspace --gpu-memory-utilization 1 --max-model-len 128000 --dtype half --quantization fp8 --kv-cache-dtype fp8 &

Monitor the logs and wait until the OpenAI endpoints are exposed:

tail -f nohup.out
INFO 12-06 11:19:23 launcher.py:19] Available routes are:
INFO 12-06 11:19:23 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO 12-06 11:19:23 launcher.py:27] Route: /docs, Methods: GET, HEAD
INFO 12-06 11:19:23 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 12-06 11:19:23 launcher.py:27] Route: /redoc, Methods: GET, HEAD
INFO 12-06 11:19:23 launcher.py:27] Route: /health, Methods: GET
INFO 12-06 11:19:23 launcher.py:27] Route: /tokenize, Methods: POST
INFO 12-06 11:19:23 launcher.py:27] Route: /detokenize, Methods: POST
INFO 12-06 11:19:23 launcher.py:27] Route: /v1/models, Methods: GET
INFO 12-06 11:19:23 launcher.py:27] Route: /version, Methods: GET
INFO 12-06 11:19:23 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 12-06 11:19:23 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 12-06 11:19:23 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [93203]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on socket ('0.0.0.0', 8000) (Press CTRL+C to quit)

Testing the model locally

Send a test request:

curl -X POST "http://localhost:8000/v1/chat/completions" -H "Content-Type: application/json" --data '{"model": "meta-llama/Llama-3.1-8B-Instruct","messages":[{"role": "user", "content": "what is AI?"}],"max_tokens": 50}'

Expected response:

{"id":"chat-67dfc8a8c6904642a27fe8c889a6455d","object":"chat.completion","created":1733491601,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Artificial Intelligence (AI) is a broad field of computer science that focuses on creating intelligent machines that can perform tasks that typically require human intelligence. AI involves the development of algorithms, statistical models, and machine learning techniques to enable computers to think, learn","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":39,"total_tokens":89,"completion_tokens":50},"prompt_logprobs":null}

Exposing OpenAI-compatible endpoints

To allow external access, you have two options:

  1. FRP (if you have a publicly accessible server/VM)
  2. ngrok (if you don’t have a public instance)

Using ngrok for external access

  • Create an account at ngrok
  • Log in and get your authentication token
  • Download and install ngrok

wget https://bin.equinox.io/c/bNyj1mQVY4c/ngrok-v3-stable-linux-amd64.tgz
sudo tar -xvzf ngrok-v3-stable-linux-amd64.tgz -C /usr/local/bin

  • Add your token

ngrok config add-authtoken <YOUR_NGROK_TOKEN>

  • Claim a static domain from your ngrok account
  • Start ngrok

nohup ngrok http --url=<YOUR_NGROK_STATIC_DOMAIN> 8000 &
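
To confirm the tunnel is up before testing, the ngrok agent exposes a local inspection API (by default on port 4040):

curl -s http://localhost:4040/api/tunnels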

Testing external access

curl https://<YOUR_NGROK_STATIC_DOMAIN>/v1/models
{"object":"list","data":[{"id":"meta-llama/Llama-3.1-8B-Instruct","object":"model","created":1733495850,"owned_by":"vllm","root":"meta-llama/Llama-3.1-8B-Instruct","parent":null,"max_model_len":128000,"permission":[{"id":"modelperm-ecfa0d0e5dd04d4c973e9fc134d00d98","object":"model_permission","created":1733495850,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

curl -X POST "https://<YOUR_NGROK_STATIC_DOMAIN>/v1/chat/completions" -H "Content-Type: application/json" --data '{"model": "meta-llama/Llama-3.1-8B-Instruct","messages":[{"role": "user", "content": "what is AI?"}],"max_tokens": 50}'
{"id":"chat-8f35de4b78ca4710b7d6badd26fc6439","object":"chat.completion","created":1733495952,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"AI, or Artificial Intelligence, refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. The term can refer to any machine that exhibits traits associated with a human mind such as learning, problem-solving, decision-making,","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":39,"total_tokens":89,"completion_tokens":50},"prompt_logprobs":null}

That’s it! You now have Llama 3.1 8B running on Compute with OpenAI-compatible endpoints. If you need more context length, tweak precision settings or upgrade your instance.

🚀 Need help or want to explore more? Start an instance today and take your AI deployment to the next level!
