# AudioCraft: Audio Generation
Comprehensive guide to using Meta's AudioCraft for text-to-music and text-to-audio generation with MusicGen, AudioGen, and EnCodec.
## When to use AudioCraft
Use AudioCraft when you:
- Need to generate music from text descriptions
- Are creating sound effects and environmental audio
- Are building music generation applications
- Need melody-conditioned music generation
- Want stereo audio output
- Require controllable music generation with style transfer
Key features:
- MusicGen: Text-to-music generation with melody conditioning
- AudioGen: Text-to-sound effects generation
- EnCodec: High-fidelity neural audio codec
- Multiple model sizes: Small (300M) to Large (3.3B)
- Stereo support: Full stereo audio generation
- Style conditioning: MusicGen-Style for reference-based generation
Use alternatives instead:
- Stable Audio: For longer commercial music generation
- Bark: For text-to-speech with music/sound effects
- Riffusion: For spectrogram-based music generation
- OpenAI Jukebox: For raw audio generation with lyrics
## Quick start

### Installation

```bash
# From PyPI
pip install audiocraft

# From GitHub (latest)
pip install git+https://github.com/facebookresearch/audiocraft.git

# Or use HuggingFace Transformers
pip install transformers torch torchaudio
```
### Basic text-to-music (AudioCraft)

```python
import torchaudio
from audiocraft.models import MusicGen

# Load model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Set generation parameters
model.set_generation_params(
    duration=8,       # seconds
    top_k=250,
    temperature=1.0
)

# Generate from text
descriptions = ["happy upbeat electronic dance music with synths"]
wav = model.generate(descriptions)

# Save audio
torchaudio.save("output.wav", wav[0].cpu(), sample_rate=32000)
```
### Using HuggingFace Transformers

```python
from transformers import AutoProcessor, MusicgenForConditionalGeneration
import scipy

# Load model and processor
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
model.to("cuda")

# Generate music
inputs = processor(
    text=["80s pop track with bassy drums and synth"],
    padding=True,
    return_tensors="pt"
).to("cuda")

audio_values = model.generate(
    **inputs,
    do_sample=True,
    guidance_scale=3,
    max_new_tokens=256
)

# Save
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("output.wav", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())
```
### Text-to-sound with AudioGen

```python
import torchaudio
from audiocraft.models import AudioGen

# Load AudioGen
model = AudioGen.get_pretrained('facebook/audiogen-medium')

model.set_generation_params(duration=5)

# Generate sound effects
descriptions = ["dog barking in a park with birds chirping"]
wav = model.generate(descriptions)

torchaudio.save("sound.wav", wav[0].cpu(), sample_rate=16000)
```
## Core concepts

### Architecture overview
```
AudioCraft Architecture:

┌──────────────────────────────────────────────────────────────┐
│                      Text Encoder (T5)                       │
│                            │                                 │
│                      Text Embeddings                         │
└────────────────────────┬─────────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────────┐
│                  Transformer Decoder (LM)                    │
│          Auto-regressively generates audio tokens            │
│          using efficient token interleaving patterns         │
└────────────────────────┬─────────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────────┐
│                   EnCodec Audio Decoder                      │
│           Converts tokens back to audio waveform             │
└──────────────────────────────────────────────────────────────┘
```
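The numbers behind this pipeline are easy to sanity-check: the 32 kHz EnCodec used by MusicGen emits 50 token frames per second across 4 RVQ codebooks, so the tokens the LM must generate grow linearly with clip length. A minimal sketch (constants are the published MusicGen/EnCodec values; the helper names are ours):

```python
# Token arithmetic for the MusicGen pipeline. Constants are the
# published values for MusicGen's 32 kHz EnCodec: 50 token frames
# per second and 4 parallel RVQ codebooks.
SAMPLE_RATE = 32_000
FRAME_RATE = 50
NUM_CODEBOOKS = 4

def lm_tokens(duration_s: float) -> int:
    """Total audio tokens the decoder must produce for a clip."""
    return int(duration_s * FRAME_RATE) * NUM_CODEBOOKS

def waveform_samples(duration_s: float) -> int:
    """Samples EnCodec reconstructs after decoding those tokens."""
    return int(duration_s * SAMPLE_RATE)

print(lm_tokens(8))         # 8 s clip -> 1600 tokens (400 steps x 4 codebooks)
print(waveform_samples(8))  # -> 256000 samples at 32 kHz
```

This is why duration dominates generation time and memory: a 30-second clip needs 6,000 tokens from the LM, decoded autoregressively.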
### Model variants

| Model | Size | Description | Use Case |
|---|---|---|---|
| musicgen-small | 300M | Text-to-music | Quick generation |
| musicgen-medium | 1.5B | Text-to-music | Balanced |
| musicgen-large | 3.3B | Text-to-music | Best quality |
| musicgen-melody | 1.5B | Text + melody | Melody conditioning |
| musicgen-melody-large | 3.3B | Text + melody | Best melody |
| musicgen-stereo-* | Varies | Stereo output | Stereo generation |
| musicgen-style | 1.5B | Style transfer | Reference-based |
| audiogen-medium | 1.5B | Text-to-sound | Sound effects |
### Generation parameters

| Parameter | Default | Description |
|---|---|---|
| duration | 8.0 | Length in seconds (1-120) |
| top_k | 250 | Top-k sampling |
| top_p | 0.0 | Nucleus sampling (0 = disabled) |
| temperature | 1.0 | Sampling temperature |
| cfg_coef | 3.0 | Classifier-free guidance strength |
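For intuition on `cfg_coef`: classifier-free guidance blends the model's text-conditioned and unconditional predictions at each decoding step, extrapolating toward the conditioned one. A toy sketch over plain lists (the real models do this over logit tensors; `cfg_mix` is an illustrative name, not an AudioCraft API):

```python
def cfg_mix(cond, uncond, cfg_coef=3.0):
    """Classifier-free guidance blend: uncond + coef * (cond - uncond).
    cfg_coef=1.0 reduces to the conditional prediction; larger values
    push harder toward the text prompt (stricter adherence)."""
    return [u + cfg_coef * (c - u) for c, u in zip(cond, uncond)]

cond   = [2.0, 0.0, -1.0]   # logits with the text prompt
uncond = [1.0, 0.5, -0.5]   # logits with the prompt dropped
print(cfg_mix(cond, uncond, cfg_coef=3.0))  # [4.0, -1.0, -2.0]
print(cfg_mix(cond, uncond, cfg_coef=1.0))  # [2.0, 0.0, -1.0]
```

This also shows the trade-off: very high `cfg_coef` amplifies the difference between conditioned and unconditioned predictions, which improves prompt adherence but can reduce diversity.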
## MusicGen usage

### Text-to-music generation

```python
from audiocraft.models import MusicGen
import torchaudio

model = MusicGen.get_pretrained('facebook/musicgen-medium')

# Configure generation
model.set_generation_params(
    duration=30,      # Up to 30 seconds per window
    top_k=250,        # Sampling diversity
    top_p=0.0,        # 0 = use top_k only
    temperature=1.0,  # Creativity (higher = more varied)
    cfg_coef=3.0      # Text adherence (higher = stricter)
)

# Generate multiple samples
descriptions = [
    "epic orchestral soundtrack with strings and brass",
    "chill lo-fi hip hop beat with jazzy piano",
    "energetic rock song with electric guitar"
]

# Generate (returns [batch, channels, samples])
wav = model.generate(descriptions)

# Save each
for i, audio in enumerate(wav):
    torchaudio.save(f"music_{i}.wav", audio.cpu(), sample_rate=32000)
```
### Melody-conditioned generation

```python
from audiocraft.models import MusicGen
import torchaudio

# Load melody model
model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=30)

# Load melody audio
melody, sr = torchaudio.load("melody.wav")

# Generate with melody conditioning
descriptions = ["acoustic guitar folk song"]
wav = model.generate_with_chroma(descriptions, melody, sr)

torchaudio.save("melody_conditioned.wav", wav[0].cpu(), sample_rate=32000)
```
### Stereo generation

```python
from audiocraft.models import MusicGen
import torchaudio

# Load stereo model
model = MusicGen.get_pretrained('facebook/musicgen-stereo-medium')
model.set_generation_params(duration=15)

descriptions = ["ambient electronic music with wide stereo panning"]
wav = model.generate(descriptions)

# wav shape: [batch, 2, samples] for stereo
print(f"Stereo shape: {wav.shape}")  # [1, 2, 480000] for 15 s at 32 kHz
torchaudio.save("stereo.wav", wav[0].cpu(), sample_rate=32000)
```
### Audio continuation

```python
import torchaudio
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-medium")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-medium")

# Load audio to continue (the processor expects the model's 32 kHz rate)
audio, sr = torchaudio.load("intro.wav")
if sr != 32000:
    audio = torchaudio.transforms.Resample(sr, 32000)(audio)
    sr = 32000

# Process with text and audio
inputs = processor(
    audio=audio.squeeze().numpy(),
    sampling_rate=sr,
    text=["continue with an epic chorus"],
    padding=True,
    return_tensors="pt"
)

# Generate continuation
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=512)
```
## MusicGen-Style usage

### Style-conditioned generation

```python
import torchaudio
from audiocraft.models import MusicGen

# Load style model
model = MusicGen.get_pretrained('facebook/musicgen-style')

# Configure generation with style
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=5.0  # Style influence
)

# Configure style conditioner
model.set_style_conditioner_params(
    eval_q=3,           # RVQ quantizers (1-6): more = closer to reference
    excerpt_length=3.0  # Style excerpt length in seconds
)

# Load style reference
style_audio, sr = torchaudio.load("reference_style.wav")

# Generate with text + style
descriptions = ["upbeat dance track"]
wav = model.generate_with_style(descriptions, style_audio, sr)
```
### Style-only generation (no text)

```python
# Generate matching style without a text prompt
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=None  # Disable double CFG for style-only
)

wav = model.generate_with_style([None], style_audio, sr)
```
## AudioGen usage

### Sound effect generation

```python
from audiocraft.models import AudioGen
import torchaudio

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=10)

# Generate various sounds
descriptions = [
    "thunderstorm with heavy rain and lightning",
    "busy city traffic with car horns",
    "ocean waves crashing on rocks",
    "crackling campfire in forest"
]

wav = model.generate(descriptions)

for i, audio in enumerate(wav):
    torchaudio.save(f"sound_{i}.wav", audio.cpu(), sample_rate=16000)
```
## EnCodec usage

### Audio compression

```python
import torch
import torchaudio
from audiocraft.models import CompressionModel

# Load EnCodec
model = CompressionModel.get_pretrained('facebook/encodec_32khz')

# Load audio
wav, sr = torchaudio.load("audio.wav")

# The 32 kHz model is mono: downmix and resample if needed
if wav.shape[0] > 1:
    wav = wav.mean(dim=0, keepdim=True)
if sr != 32000:
    wav = torchaudio.transforms.Resample(sr, 32000)(wav)

# Encode to tokens
with torch.no_grad():
    encoded = model.encode(wav.unsqueeze(0))
    codes = encoded[0]  # Audio codes [batch, codebooks, frames]

# Decode back to audio
with torch.no_grad():
    decoded = model.decode(codes)

torchaudio.save("reconstructed.wav", decoded[0].cpu(), sample_rate=32000)
```
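To see why token-based generation is tractable, compare EnCodec's token bitrate to raw PCM: 50 frames/s × 4 codebooks × 11 bits (2048-entry codebooks) is about 2.2 kbps, versus 512 kbps for mono 16-bit PCM at 32 kHz. The arithmetic:

```python
import math

FRAME_RATE = 50       # EnCodec frames per second (32 kHz model)
CODEBOOKS = 4         # RVQ codebooks used by MusicGen's codec
CODEBOOK_SIZE = 2048  # entries per codebook

bits_per_code = int(math.log2(CODEBOOK_SIZE))            # 11 bits
token_bitrate = FRAME_RATE * CODEBOOKS * bits_per_code   # 2200 bps
pcm_bitrate = 32_000 * 16                                # mono 16-bit PCM

print(f"tokens: {token_bitrate / 1000:.1f} kbps, "
      f"PCM: {pcm_bitrate / 1000:.0f} kbps, "
      f"~{pcm_bitrate / token_bitrate:.0f}x smaller")
```

That roughly 200x compression is what lets a transformer model full audio waveforms as a manageable sequence of discrete tokens.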
## Common workflows

### Workflow 1: Music generation pipeline

```python
import torch
import torchaudio
from audiocraft.models import MusicGen

class MusicGenerator:
    def __init__(self, model_name="facebook/musicgen-medium"):
        self.model = MusicGen.get_pretrained(model_name)
        self.sample_rate = 32000

    def generate(self, prompt, duration=30, temperature=1.0, cfg=3.0):
        self.model.set_generation_params(
            duration=duration,
            top_k=250,
            temperature=temperature,
            cfg_coef=cfg
        )
        with torch.no_grad():
            wav = self.model.generate([prompt])
        return wav[0].cpu()

    def generate_batch(self, prompts, duration=30):
        self.model.set_generation_params(duration=duration)
        with torch.no_grad():
            wav = self.model.generate(prompts)
        return wav.cpu()

    def save(self, audio, path):
        torchaudio.save(path, audio, sample_rate=self.sample_rate)

# Usage
generator = MusicGenerator()
audio = generator.generate(
    "epic cinematic orchestral music",
    duration=30,
    temperature=1.0
)
generator.save(audio, "epic_music.wav")
```
### Workflow 2: Sound design batch processing

```python
from pathlib import Path

import torchaudio
from audiocraft.models import AudioGen

def batch_generate_sounds(sound_specs, output_dir):
    """
    Generate multiple sounds from specifications.

    Args:
        sound_specs: list of {"name": str, "description": str, "duration": float}
        output_dir: output directory path
    """
    model = AudioGen.get_pretrained('facebook/audiogen-medium')
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)

    results = []

    for spec in sound_specs:
        model.set_generation_params(duration=spec.get("duration", 5))

        wav = model.generate([spec["description"]])

        output_path = output_dir / f"{spec['name']}.wav"
        torchaudio.save(str(output_path), wav[0].cpu(), sample_rate=16000)

        results.append({
            "name": spec["name"],
            "path": str(output_path),
            "description": spec["description"]
        })

    return results

# Usage
sounds = [
    {"name": "explosion", "description": "massive explosion with debris", "duration": 3},
    {"name": "footsteps", "description": "footsteps on wooden floor", "duration": 5},
    {"name": "door", "description": "wooden door creaking and closing", "duration": 2}
]

results = batch_generate_sounds(sounds, "sound_effects/")
```
### Workflow 3: Gradio demo

```python
import gradio as gr
import torch
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-small')

def generate_music(prompt, duration, temperature, cfg_coef):
    model.set_generation_params(
        duration=duration,
        temperature=temperature,
        cfg_coef=cfg_coef
    )
    with torch.no_grad():
        wav = model.generate([prompt])

    # Save to temp file
    path = "temp_output.wav"
    torchaudio.save(path, wav[0].cpu(), sample_rate=32000)
    return path

demo = gr.Interface(
    fn=generate_music,
    inputs=[
        gr.Textbox(label="Music Description", placeholder="upbeat electronic dance music"),
        gr.Slider(1, 30, value=8, label="Duration (seconds)"),
        gr.Slider(0.5, 2.0, value=1.0, label="Temperature"),
        gr.Slider(1.0, 10.0, value=3.0, label="CFG Coefficient")
    ],
    outputs=gr.Audio(label="Generated Music"),
    title="MusicGen Demo"
)

demo.launch()
```
## Performance optimization

### Memory optimization

```python
import torch
from audiocraft.models import MusicGen

# Use a smaller model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Clear cache between generations
torch.cuda.empty_cache()

# Generate shorter durations
model.set_generation_params(duration=10)  # instead of 30

# Use half precision for the language model (the MusicGen wrapper is
# not an nn.Module, so cast its LM; keep EnCodec in fp32 for decode quality)
model.lm = model.lm.half()
```
### Batch processing efficiency

```python
# Process multiple prompts at once (more efficient)
descriptions = ["prompt1", "prompt2", "prompt3", "prompt4"]
wav = model.generate(descriptions)  # Single batch

# Instead of:
for desc in descriptions:
    wav = model.generate([desc])  # Multiple calls (slower)
```
### GPU memory requirements
| Model | FP32 VRAM | FP16 VRAM |
|---|---|---|
| musicgen-small | ~4GB | ~2GB |
| musicgen-medium | ~8GB | ~4GB |
| musicgen-large | ~16GB | ~8GB |
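These figures are roughly parameter count × bytes per parameter, plus overhead for activations, the KV cache, and EnCodec. A back-of-envelope estimator for the weights alone (real usage lands above these numbers, as the table shows):

```python
def weight_gb(params_billion: float, bytes_per_param: int) -> float:
    """VRAM for model weights alone (excludes activations and KV cache)."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

# Parameter counts from the model variants table above
for name, params in [("small", 0.3), ("medium", 1.5), ("large", 3.3)]:
    print(f"musicgen-{name}: "
          f"fp32 ~{weight_gb(params, 4):.1f} GB, "
          f"fp16 ~{weight_gb(params, 2):.1f} GB")
```

Halving bytes per parameter (fp32 to fp16) halves the weight footprint, which is why the fp16 column in the table is roughly half the fp32 column.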
## Common issues
| Issue | Solution |
|---|---|
| CUDA OOM | Use smaller model, reduce duration |
| Poor quality | Increase cfg_coef, better prompts |
| Generation too short | Check max duration setting |
| Audio artifacts | Try different temperature |
| Stereo not working | Use stereo model variant |
## References
- Advanced Usage - Training, fine-tuning, deployment
- Troubleshooting - Common issues and solutions
## Resources
- GitHub: https://github.com/facebookresearch/audiocraft
- Paper (MusicGen): https://arxiv.org/abs/2306.05284
- Paper (AudioGen): https://arxiv.org/abs/2209.15352
- HuggingFace: https://huggingface.co/facebook/musicgen-small
- Demo: https://huggingface.co/spaces/facebook/MusicGen