Groq API
Build applications with Groq's ultra-fast LLM inference (300-1000+ tokens/sec).
Quick Start
Installation
```bash
# Python
pip install groq

# TypeScript/JavaScript
npm install groq-sdk
```
Environment Setup
```bash
export GROQ_API_KEY=<your-api-key>
```
Basic Chat Completion
Python:
```python
from groq import Groq

client = Groq()  # Uses GROQ_API_KEY env var

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
```
TypeScript:
```typescript
import Groq from "groq-sdk";

const client = new Groq();

const response = await client.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [{ role: "user", content: "Hello" }],
});
console.log(response.choices[0].message.content);
```
Model Selection
| Use Case | Model | Notes |
|---|---|---|
| Fast + cheap | llama-3.1-8b-instant | Best for simple tasks |
| Balanced | llama-3.3-70b-versatile | Quality/cost balance |
| Highest quality | openai/gpt-oss-120b | Built-in tools + reasoning |
| Agentic | groq/compound | Web search + code exec |
| Reasoning | openai/gpt-oss-20b | Fast reasoning (low/med/high) |
| Vision/OCR | llama-4-scout-17b-16e-instruct | Image understanding |
| Audio STT | whisper-large-v3-turbo | Transcription |
| TTS | playai-tts | Text-to-speech |
See references/models.md for full model list and pricing.
Common Patterns
Streaming Responses
```python
stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
System Messages
```python
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"}
    ]
)
```
Async Client (Python)
```python
import asyncio
from groq import AsyncGroq

async def main():
    client = AsyncGroq()
    response = await client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": "Hello"}]
    )
    return response.choices[0].message.content

print(asyncio.run(main()))
```
JSON Mode
```python
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "List 3 colors as JSON array"}],
    response_format={"type": "json_object"}
)
```
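JSON mode guarantees JSON-formatted text in `message.content`, but it still arrives as a string; a minimal parsing sketch (the `raw` literal below stands in for a real response):

```python
import json

# Stand-in for response.choices[0].message.content under JSON mode
raw = '{"colors": ["red", "green", "blue"]}'

data = json.loads(raw)  # raises json.JSONDecodeError if the output is malformed
colors = data["colors"]
```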
Structured Outputs (JSON Schema)
Force output to match a schema. Two modes available:
| Mode | Guarantee | Models |
|---|---|---|
| `strict: true` | 100% schema compliance | openai/gpt-oss-20b, openai/gpt-oss-120b |
| `strict: false` | Best-effort compliance | All supported models |
Strict Mode (guaranteed compliance):
```python
response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Extract: John is 30 years old"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"}
                },
                "required": ["name", "age"],
                "additionalProperties": False
            }
        }
    }
)
```
With Pydantic:
```python
import json

from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Extract: John is 30"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,
            "schema": Person.model_json_schema()
        }
    }
)
person = Person.model_validate(json.loads(response.choices[0].message.content))
```
See references/structured-outputs.md for schema requirements, validation libraries, and examples.
Audio
Transcription (Speech-to-Text)
```python
with open("audio.mp3", "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="whisper-large-v3-turbo",
        file=f,
        language="en",  # Optional: ISO-639-1 code
        response_format="verbose_json",  # json, text, verbose_json
        timestamp_granularities=["word", "segment"]
    )
print(transcription.text)
```
Translation (to English)
```python
with open("french_audio.mp3", "rb") as f:
    translation = client.audio.translations.create(
        model="whisper-large-v3",
        file=f
    )
print(translation.text)  # English text
```
Text-to-Speech
```python
response = client.audio.speech.create(
    model="playai-tts",
    input="Hello, world!",
    voice="Fritz-PlayAI",
    response_format="wav",  # flac, mp3, mulaw, ogg, wav
    speed=1.0  # 0.5 to 5
)
response.write_to_file("output.wav")
```
Vision
Process images with Llama 4 multimodal models. Supports up to 5 images per request.
Models: meta-llama/llama-4-scout-17b-16e-instruct (faster), meta-llama/llama-4-maverick-17b-128e-instruct (higher quality)
Image from URL
```python
response = client.chat.completions.create(
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]
    }]
)
```
Local Image (Base64)
```python
import base64

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image('photo.jpg')}"}}
        ]
    }]
)
```
OCR / data extraction (combine vision with JSON mode; `base64_image` is a base64-encoded image string, e.g. from the helper above):
```python
response = client.chat.completions.create(
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract all text and data as JSON"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}}
        ]
    }],
    response_format={"type": "json_object"}
)
```
See references/vision.md for multi-image, tool use with images, and multi-turn conversations.
Tool Use
For tool calling patterns and examples, see references/tool-use.md.
Quick example:
```python
import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Weather in Paris?"}],
    tools=tools
)

if response.choices[0].message.tool_calls:
    for tc in response.choices[0].message.tool_calls:
        args = json.loads(tc.function.arguments)
        # Execute function and continue conversation
```
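To finish the loop, echo the assistant's tool calls back into the conversation, append each tool result as a `tool`-role message, and call the API again. A minimal sketch of the message bookkeeping (`get_weather` and the tool-call dict are stand-ins; the real SDK returns objects, not dicts):

```python
import json

def get_weather(location: str) -> dict:
    # Stand-in for a real weather lookup
    return {"location": location, "temp_c": 18}

messages = [{"role": "user", "content": "Weather in Paris?"}]

# Suppose the model returned one tool call (shape mirrors the SDK objects):
tool_call = {"id": "call_1", "name": "get_weather",
             "arguments": '{"location": "Paris"}'}

# 1. Echo the assistant turn that requested the tool
messages.append({"role": "assistant", "tool_calls": [{
    "id": tool_call["id"], "type": "function",
    "function": {"name": tool_call["name"], "arguments": tool_call["arguments"]},
}]})

# 2. Append the tool result, keyed by tool_call_id
result = get_weather(**json.loads(tool_call["arguments"]))
messages.append({"role": "tool", "tool_call_id": tool_call["id"],
                 "content": json.dumps(result)})

# 3. Call client.chat.completions.create(model=..., messages=messages,
#    tools=tools) again; the model now answers using the tool output.
```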
Built-in Tools
Use groq/compound or openai/gpt-oss-120b for built-in web search and code execution:
```python
response = client.chat.completions.create(
    model="groq/compound",
    messages=[{"role": "user", "content": "Search for latest Python news"}]
)
# Model automatically uses web search
```
Remote MCP Servers
Connect to third-party MCP servers for tools like Stripe, GitHub, web scraping. Use the Responses API:
```python
import os

import openai

client = openai.OpenAI(
    api_key=os.environ.get("GROQ_API_KEY"),
    base_url="https://api.groq.com/openai/v1"
)

response = client.responses.create(
    model="openai/gpt-oss-120b",
    input="What models are trending on Huggingface?",
    tools=[{
        "type": "mcp",
        "server_label": "Huggingface",
        "server_url": "https://huggingface.co/mcp"
    }]
)
```
See references/tool-use.md for MCP configuration and popular servers.
Reasoning Models
Control how models think through complex problems.
Models: openai/gpt-oss-20b, openai/gpt-oss-120b (low/medium/high), qwen/qwen3-32b (none/default)
GPT-OSS with Reasoning Effort
```python
response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "How many r's in strawberry?"}],
    reasoning_effort="high",  # low, medium, high
    temperature=0.6,
    max_completion_tokens=1024
)

print(response.choices[0].message.content)
print("Reasoning:", response.choices[0].message.reasoning)
```
Qwen3 with Parsed Reasoning
```python
response = client.chat.completions.create(
    model="qwen/qwen3-32b",
    messages=[{"role": "user", "content": "Solve: x + 5 = 12"}],
    reasoning_format="parsed"  # raw, parsed, hidden
)

print("Answer:", response.choices[0].message.content)
print("Reasoning:", response.choices[0].message.reasoning)
```
Hide Reasoning (GPT-OSS)
```python
response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What is 15% of 80?"}],
    include_reasoning=False  # Hide reasoning in response
)
```
See references/reasoning.md for streaming, tool use with reasoning, and best practices.
Batch Processing
For high-volume async processing (24h-7d completion window):
```python
# 1. Create JSONL file with requests
# 2. Upload file
# 3. Create batch
batch = client.batches.create(
    input_file_id=file_id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

# 4. Check status
batch = client.batches.retrieve(batch.id)
if batch.status == "completed":
    results = client.files.content(batch.output_file_id)
```
See references/api-reference.md for full batch API details.
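Steps 1-2 (building the request file and uploading it) can be sketched as follows. The per-line request shape shown here (`custom_id`/`method`/`url`/`body`) follows the OpenAI-compatible batch format; verify the exact fields against references/api-reference.md:

```python
import json

# One JSON object per line; custom_id lets you match results to requests
requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "llama-3.3-70b-versatile",
            "messages": [{"role": "user", "content": prompt}],
        },
    }
    for i, prompt in enumerate(["Hello", "What is 2+2?"])
]

with open("batch.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Then upload and use the returned id as input_file_id, e.g.:
# uploaded = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")
# file_id = uploaded.id
```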
Prompt Caching
Automatically reduce latency and costs by 50% for repeated prompt prefixes. No code changes required.
Supported models: moonshotai/kimi-k2-instruct-0905, openai/gpt-oss-20b, openai/gpt-oss-120b, openai/gpt-oss-safeguard-20b
How it works:
- Place static content (system prompts, tools, examples) at the beginning
- Place dynamic content (user queries) at the end
- Cache automatically matches prefixes and applies 50% discount
- Cache expires after 2 hours without use
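The ordering rule above can be expressed as a small message-builder sketch (the prompt text is illustrative): keep the large static prefix identical across calls and put only the per-request query at the end, so every call shares the longest possible cacheable prefix.

```python
# Large, unchanging prefix: system prompt, policies, few-shot examples, etc.
STATIC_SYSTEM_PROMPT = "You are a support agent.\n" + "Policy text line.\n" * 100

def build_messages(user_query: str) -> list[dict]:
    # Static content first maximizes the cacheable prefix; the dynamic
    # query goes last so it never breaks the prefix match.
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        {"role": "user", "content": user_query},
    ]
```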
Track cache usage:
```python
response = client.chat.completions.create(
    model="moonshotai/kimi-k2-instruct-0905",
    messages=[{"role": "system", "content": large_system_prompt}, ...]
)

cached = response.usage.prompt_tokens_details.cached_tokens
print(f"Cached tokens: {cached}")  # 50% discount applied to these
See references/prompt-caching.md for optimization strategies and examples.
Content Moderation
Detect and filter harmful content using safeguard models.
Llama Guard 4
General content safety classification. Returns `safe`, or `unsafe` followed by a category code (e.g. `S1`) on the next line.
```python
response = client.chat.completions.create(
    model="meta-llama/Llama-Guard-4-12B",
    messages=[{"role": "user", "content": user_input}]
)

if response.choices[0].message.content.startswith("unsafe"):
    # Block or handle unsafe content
    pass
```
GPT-OSS Safeguard 20B
Prompt injection detection with custom policies. Returns structured JSON.
```python
response = client.chat.completions.create(
    model="openai/gpt-oss-safeguard-20b",
    messages=[
        {"role": "system", "content": injection_detection_policy},
        {"role": "user", "content": user_input}
    ]
)
# Returns: {"violation": 1, "category": "Direct Override", "rationale": "..."}
```
See references/moderation.md for complete policies, harm taxonomy, and integration patterns.
Error Handling
```python
from groq import Groq, RateLimitError, APIConnectionError, APIStatusError

client = Groq()

try:
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": "Hello"}]
    )
except RateLimitError:
    # Wait and retry with exponential backoff
    pass
except APIConnectionError:
    # Network issue
    pass
except APIStatusError as e:
    # API error (check e.status_code)
    pass
```
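The "retry with exponential backoff" branch can be fleshed out with a small generic helper (a sketch, not a Groq SDK feature; in real code, narrow the `except` to `groq.RateLimitError`):

```python
import random
import time

def with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    """Call fn(); on failure, retry with jittered exponential backoff
    (base_delay, 2x, 4x, ... plus random jitter up to base_delay)."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:  # narrow to RateLimitError in real code
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

# Usage (illustrative):
# response = with_backoff(lambda: client.chat.completions.create(
#     model="llama-3.3-70b-versatile",
#     messages=[{"role": "user", "content": "Hello"}],
# ))
```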
See references/audio.md for complete audio API reference including file handling, metadata fields, and prompting guidelines.
Resources