Killer-Skills

audiocraft-audio-generation — text-to-music and text-to-audio generation with Meta's AudioCraft (MusicGen, AudioGen, EnCodec)

v1.0.0
GitHub

About this Skill

audiocraft-audio-generation is a comprehensive guide to using Meta's AudioCraft for text-to-music and text-to-audio generation with MusicGen, AudioGen, and EnCodec. It is perfect for creative agents that need advanced music and audio generation capabilities from text descriptions.

Features

Generates music from text descriptions using MusicGen
Creates sound effects and environmental audio with AudioGen
Supports melody-conditioned music generation
Produces stereo audio output
Allows controllable music generation with style transfer via MusicGen-Style, with EnCodec as the underlying audio codec

mohammedatiaa
Updated: 2/27/2026
Installation

> npx killer-skills add mohammedatiaa/ai-project-network/audiocraft-audio-generation

Agent Capability Analysis

The audiocraft-audio-generation MCP Server by mohammedatiaa is an open-source community integration for Claude and other AI agents, enabling seamless task automation and capability expansion.

Ideal Agent Persona

Perfect for Creative Agents needing advanced music and audio generation capabilities from text descriptions

Core Value

Empowers agents to generate music and audio from text descriptions using Meta's AudioCraft, enabling melody-conditioned music generation, style transfer, and stereo audio output with MusicGen, AudioGen, and EnCodec

Capabilities Granted for audiocraft-audio-generation MCP Server

Generating background music for videos from descriptive text
Creating sound effects and environmental audio for immersive experiences
Building music generation applications with controllable style transfer

Prerequisites & Limits

  • Requires text descriptions as input
  • Limited to music and audio generation tasks
  • Dependent on Meta's AudioCraft and its supported libraries
Project

  • SKILL.md (15.5 KB)
  • .cursorrules (1.2 KB)
  • package.json (240 B)

SKILL.md

AudioCraft: Audio Generation

Comprehensive guide to using Meta's AudioCraft for text-to-music and text-to-audio generation with MusicGen, AudioGen, and EnCodec.

When to use AudioCraft

Use AudioCraft when:

  • You need to generate music from text descriptions
  • You are creating sound effects and environmental audio
  • You are building music generation applications
  • You need melody-conditioned music generation
  • You want stereo audio output
  • You require controllable music generation with style transfer

Key features:

  • MusicGen: Text-to-music generation with melody conditioning
  • AudioGen: Text-to-sound effects generation
  • EnCodec: High-fidelity neural audio codec
  • Multiple model sizes: Small (300M) to Large (3.3B)
  • Stereo support: Full stereo audio generation
  • Style conditioning: MusicGen-Style for reference-based generation

Use alternatives instead:

  • Stable Audio: For longer commercial music generation
  • Bark: For text-to-speech with music/sound effects
  • Riffusion: For spectrogram-based music generation
  • OpenAI Jukebox: For raw audio generation with lyrics

Quick start

Installation

```bash
# From PyPI
pip install audiocraft

# From GitHub (latest)
pip install git+https://github.com/facebookresearch/audiocraft.git

# Or use HuggingFace Transformers
pip install transformers torch torchaudio
```

Basic text-to-music (AudioCraft)

```python
import torchaudio
from audiocraft.models import MusicGen

# Load model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Set generation parameters
model.set_generation_params(
    duration=8,  # seconds
    top_k=250,
    temperature=1.0
)

# Generate from text
descriptions = ["happy upbeat electronic dance music with synths"]
wav = model.generate(descriptions)

# Save audio
torchaudio.save("output.wav", wav[0].cpu(), sample_rate=32000)
```

Using HuggingFace Transformers

```python
from transformers import AutoProcessor, MusicgenForConditionalGeneration
import scipy.io.wavfile

# Load model and processor
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
model.to("cuda")

# Generate music
inputs = processor(
    text=["80s pop track with bassy drums and synth"],
    padding=True,
    return_tensors="pt"
).to("cuda")

audio_values = model.generate(
    **inputs,
    do_sample=True,
    guidance_scale=3,
    max_new_tokens=256
)

# Save
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("output.wav", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())
```

Text-to-sound with AudioGen

```python
import torchaudio
from audiocraft.models import AudioGen

# Load AudioGen
model = AudioGen.get_pretrained('facebook/audiogen-medium')

model.set_generation_params(duration=5)

# Generate sound effects
descriptions = ["dog barking in a park with birds chirping"]
wav = model.generate(descriptions)

torchaudio.save("sound.wav", wav[0].cpu(), sample_rate=16000)
```

Core concepts

Architecture overview

AudioCraft Architecture:
┌──────────────────────────────────────────────────────────────┐
│                    Text Encoder (T5)                          │
│                         │                                     │
│                    Text Embeddings                            │
└────────────────────────┬─────────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────────┐
│              Transformer Decoder (LM)                         │
│     Auto-regressively generates audio tokens                  │
│     Using efficient token interleaving patterns               │
└────────────────────────┬─────────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────────┐
│                EnCodec Audio Decoder                          │
│        Converts tokens back to audio waveform                 │
└──────────────────────────────────────────────────────────────┘
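The decoder's token interleaving can be illustrated with a "delay" pattern, where codebook k is shifted right by k steps so all codebooks can be advanced in a single autoregressive pass. The sketch below is a pure-Python illustration of that idea, not AudioCraft's implementation; `PAD` is a hypothetical placeholder token:

```python
PAD = -1  # hypothetical placeholder for "no token at this position"

def delay_pattern(codes):
    """Interleave K codebook streams by delaying codebook k by k steps.

    codes: list of K lists, one token stream per codebook, each of length T.
    Returns K lists of equal length T + K - 1, padded so rows stay aligned.
    """
    k_books = len(codes)
    return [
        [PAD] * k + list(stream) + [PAD] * (k_books - 1 - k)
        for k, stream in enumerate(codes)
    ]

# Two codebooks, two timesteps: the second stream lags one step behind.
print(delay_pattern([[1, 2], [3, 4]]))  # [[1, 2, -1], [-1, 3, 4]]
```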

Model variants

| Model | Size | Description | Use Case |
|---|---|---|---|
| musicgen-small | 300M | Text-to-music | Quick generation |
| musicgen-medium | 1.5B | Text-to-music | Balanced |
| musicgen-large | 3.3B | Text-to-music | Best quality |
| musicgen-melody | 1.5B | Text + melody | Melody conditioning |
| musicgen-melody-large | 3.3B | Text + melody | Best melody quality |
| musicgen-stereo-* | Varies | Stereo output | Stereo generation |
| musicgen-style | 1.5B | Style transfer | Reference-based generation |
| audiogen-medium | 1.5B | Text-to-sound | Sound effects |
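The table above can be wrapped in a small lookup helper so agent code picks a checkpoint by task rather than hard-coding names. This is a hypothetical convenience function (the task keys and `pick_checkpoint` name are not part of AudioCraft; the checkpoint strings come from the table):

```python
# Hypothetical task-to-checkpoint map, taken from the model variants table.
MODEL_FOR_TASK = {
    "music": "facebook/musicgen-medium",
    "music-best": "facebook/musicgen-large",
    "melody": "facebook/musicgen-melody",
    "stereo": "facebook/musicgen-stereo-medium",
    "style": "facebook/musicgen-style",
    "sound-effects": "facebook/audiogen-medium",
}

def pick_checkpoint(task: str) -> str:
    """Return the checkpoint for a task, falling back to the small model."""
    return MODEL_FOR_TASK.get(task, "facebook/musicgen-small")

print(pick_checkpoint("melody"))   # facebook/musicgen-melody
print(pick_checkpoint("unknown"))  # facebook/musicgen-small
```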

Generation parameters

| Parameter | Default | Description |
|---|---|---|
| duration | 8.0 | Length in seconds (1-120) |
| top_k | 250 | Top-k sampling |
| top_p | 0.0 | Nucleus sampling (0 = disabled) |
| temperature | 1.0 | Sampling temperature |
| cfg_coef | 3.0 | Classifier-free guidance |
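Since MusicGen outputs 32 kHz audio (see the `sample_rate=32000` save calls throughout this guide), the `duration` parameter maps directly to a waveform length. A small arithmetic helper makes the relationship explicit:

```python
def expected_samples(duration_s: float, sample_rate: int = 32000) -> int:
    """Waveform samples produced for a clip of the given duration.

    32000 Hz is MusicGen's output rate; AudioGen uses 16000 Hz.
    """
    return int(duration_s * sample_rate)

print(expected_samples(8))                      # 256000 samples (default duration)
print(expected_samples(30))                     # 960000 samples
print(expected_samples(5, sample_rate=16000))   # 80000 samples (AudioGen rate)
```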

MusicGen usage

Text-to-music generation

```python
from audiocraft.models import MusicGen
import torchaudio

model = MusicGen.get_pretrained('facebook/musicgen-medium')

# Configure generation
model.set_generation_params(
    duration=30,       # Up to 30 seconds
    top_k=250,         # Sampling diversity
    top_p=0.0,         # 0 = use top_k only
    temperature=1.0,   # Creativity (higher = more varied)
    cfg_coef=3.0       # Text adherence (higher = stricter)
)

# Generate multiple samples
descriptions = [
    "epic orchestral soundtrack with strings and brass",
    "chill lo-fi hip hop beat with jazzy piano",
    "energetic rock song with electric guitar"
]

# Generate (returns [batch, channels, samples])
wav = model.generate(descriptions)

# Save each
for i, audio in enumerate(wav):
    torchaudio.save(f"music_{i}.wav", audio.cpu(), sample_rate=32000)
```

Melody-conditioned generation

```python
from audiocraft.models import MusicGen
import torchaudio

# Load melody model
model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=30)

# Load melody audio
melody, sr = torchaudio.load("melody.wav")

# Generate with melody conditioning
descriptions = ["acoustic guitar folk song"]
wav = model.generate_with_chroma(descriptions, melody, sr)

torchaudio.save("melody_conditioned.wav", wav[0].cpu(), sample_rate=32000)
```

Stereo generation

```python
from audiocraft.models import MusicGen
import torchaudio

# Load stereo model
model = MusicGen.get_pretrained('facebook/musicgen-stereo-medium')
model.set_generation_params(duration=15)

descriptions = ["ambient electronic music with wide stereo panning"]
wav = model.generate(descriptions)

# wav shape: [batch, 2, samples] for stereo
print(f"Stereo shape: {wav.shape}")  # [1, 2, 480000]
torchaudio.save("stereo.wav", wav[0].cpu(), sample_rate=32000)
```

Audio continuation

```python
from transformers import AutoProcessor, MusicgenForConditionalGeneration
import torchaudio

processor = AutoProcessor.from_pretrained("facebook/musicgen-medium")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-medium")

# Load audio to continue
audio, sr = torchaudio.load("intro.wav")

# Process with text and audio
inputs = processor(
    audio=audio.squeeze().numpy(),
    sampling_rate=sr,
    text=["continue with an epic chorus"],
    padding=True,
    return_tensors="pt"
)

# Generate continuation
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=512)
```

MusicGen-Style usage

Style-conditioned generation

```python
from audiocraft.models import MusicGen
import torchaudio

# Load style model
model = MusicGen.get_pretrained('facebook/musicgen-style')

# Configure generation with style
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=5.0  # Style influence
)

# Configure style conditioner
model.set_style_conditioner_params(
    eval_q=3,            # RVQ quantizers (1-6)
    excerpt_length=3.0   # Style excerpt length
)

# Load style reference
style_audio, sr = torchaudio.load("reference_style.wav")

# Generate with text + style
descriptions = ["upbeat dance track"]
wav = model.generate_with_style(descriptions, style_audio, sr)
```

Style-only generation (no text)

```python
# Generate matching style without text prompt
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=None  # Disable double CFG for style-only
)

wav = model.generate_with_style([None], style_audio, sr)
```

AudioGen usage

Sound effect generation

```python
from audiocraft.models import AudioGen
import torchaudio

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=10)

# Generate various sounds
descriptions = [
    "thunderstorm with heavy rain and lightning",
    "busy city traffic with car horns",
    "ocean waves crashing on rocks",
    "crackling campfire in forest"
]

wav = model.generate(descriptions)

for i, audio in enumerate(wav):
    torchaudio.save(f"sound_{i}.wav", audio.cpu(), sample_rate=16000)
```

EnCodec usage

Audio compression

```python
from audiocraft.models import CompressionModel
import torch
import torchaudio

# Load EnCodec
model = CompressionModel.get_pretrained('facebook/encodec_32khz')

# Load audio
wav, sr = torchaudio.load("audio.wav")

# Ensure correct sample rate
if sr != 32000:
    resampler = torchaudio.transforms.Resample(sr, 32000)
    wav = resampler(wav)

# Encode to tokens
with torch.no_grad():
    encoded = model.encode(wav.unsqueeze(0))
    codes = encoded[0]  # Audio codes

# Decode back to audio
with torch.no_grad():
    decoded = model.decode(codes)

torchaudio.save("reconstructed.wav", decoded[0].cpu(), sample_rate=32000)
```
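The token representation is compact: the 32 kHz EnCodec checkpoint used by MusicGen runs at roughly 50 frames per second with 4 codebooks, so token counts can be estimated with back-of-envelope arithmetic (the frame rate and codebook count here are assumptions about this checkpoint, not values read from the model):

```python
def encodec_token_count(duration_s: float, frame_rate: int = 50, n_codebooks: int = 4) -> int:
    """Rough number of discrete tokens for a clip of the given duration."""
    return int(duration_s * frame_rate) * n_codebooks

print(encodec_token_count(8))   # ~1600 tokens for an 8-second clip
print(encodec_token_count(30))  # ~6000 tokens for 30 seconds
```

This also explains why `max_new_tokens=256` in the HuggingFace examples yields only about 5 seconds of audio: 256 frames / 50 frames per second.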

Common workflows

Workflow 1: Music generation pipeline

```python
import torch
import torchaudio
from audiocraft.models import MusicGen

class MusicGenerator:
    def __init__(self, model_name="facebook/musicgen-medium"):
        self.model = MusicGen.get_pretrained(model_name)
        self.sample_rate = 32000

    def generate(self, prompt, duration=30, temperature=1.0, cfg=3.0):
        self.model.set_generation_params(
            duration=duration,
            top_k=250,
            temperature=temperature,
            cfg_coef=cfg
        )

        with torch.no_grad():
            wav = self.model.generate([prompt])

        return wav[0].cpu()

    def generate_batch(self, prompts, duration=30):
        self.model.set_generation_params(duration=duration)

        with torch.no_grad():
            wav = self.model.generate(prompts)

        return wav.cpu()

    def save(self, audio, path):
        torchaudio.save(path, audio, sample_rate=self.sample_rate)

# Usage
generator = MusicGenerator()
audio = generator.generate(
    "epic cinematic orchestral music",
    duration=30,
    temperature=1.0
)
generator.save(audio, "epic_music.wav")
```

Workflow 2: Sound design batch processing

```python
from pathlib import Path
from audiocraft.models import AudioGen
import torchaudio

def batch_generate_sounds(sound_specs, output_dir):
    """
    Generate multiple sounds from specifications.

    Args:
        sound_specs: list of {"name": str, "description": str, "duration": float}
        output_dir: output directory path
    """
    model = AudioGen.get_pretrained('facebook/audiogen-medium')
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)

    results = []

    for spec in sound_specs:
        model.set_generation_params(duration=spec.get("duration", 5))

        wav = model.generate([spec["description"]])

        output_path = output_dir / f"{spec['name']}.wav"
        torchaudio.save(str(output_path), wav[0].cpu(), sample_rate=16000)

        results.append({
            "name": spec["name"],
            "path": str(output_path),
            "description": spec["description"]
        })

    return results

# Usage
sounds = [
    {"name": "explosion", "description": "massive explosion with debris", "duration": 3},
    {"name": "footsteps", "description": "footsteps on wooden floor", "duration": 5},
    {"name": "door", "description": "wooden door creaking and closing", "duration": 2}
]

results = batch_generate_sounds(sounds, "sound_effects/")
```

Workflow 3: Gradio demo

```python
import gradio as gr
import torch
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-small')

def generate_music(prompt, duration, temperature, cfg_coef):
    model.set_generation_params(
        duration=duration,
        temperature=temperature,
        cfg_coef=cfg_coef
    )

    with torch.no_grad():
        wav = model.generate([prompt])

    # Save to temp file
    path = "temp_output.wav"
    torchaudio.save(path, wav[0].cpu(), sample_rate=32000)
    return path

demo = gr.Interface(
    fn=generate_music,
    inputs=[
        gr.Textbox(label="Music Description", placeholder="upbeat electronic dance music"),
        gr.Slider(1, 30, value=8, label="Duration (seconds)"),
        gr.Slider(0.5, 2.0, value=1.0, label="Temperature"),
        gr.Slider(1.0, 10.0, value=3.0, label="CFG Coefficient")
    ],
    outputs=gr.Audio(label="Generated Music"),
    title="MusicGen Demo"
)

demo.launch()
```

Performance optimization

Memory optimization

```python
import torch
from audiocraft.models import MusicGen

# Use smaller model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Clear cache between generations
torch.cuda.empty_cache()

# Generate shorter durations
model.set_generation_params(duration=10)  # Instead of 30

# Use half precision
model = model.half()
```

Batch processing efficiency

```python
# Process multiple prompts at once (more efficient)
descriptions = ["prompt1", "prompt2", "prompt3", "prompt4"]
wav = model.generate(descriptions)  # Single batch

# Instead of:
for desc in descriptions:
    wav = model.generate([desc])  # Multiple batches (slower)
```
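When the prompt list is too large to fit in one batch, splitting it into fixed-size chunks keeps memory bounded while preserving most of the batching win. A minimal, library-free sketch (the right chunk size depends on your model and GPU):

```python
def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

prompts = ["p1", "p2", "p3", "p4", "p5"]
for batch in chunked(prompts, 2):
    # wav = model.generate(batch)  # one generate() call per chunk
    print(batch)
```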

GPU memory requirements

| Model | FP32 VRAM | FP16 VRAM |
|---|---|---|
| musicgen-small | ~4 GB | ~2 GB |
| musicgen-medium | ~8 GB | ~4 GB |
| musicgen-large | ~16 GB | ~8 GB |
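The table suggests a simple selection rule: pick the largest model whose footprint fits your card. A hypothetical helper (`largest_model_for` is not an AudioCraft API; the numbers come from the FP16 column above):

```python
# Approximate FP16 VRAM needs in GB, largest model first (from the table above).
FP16_VRAM_GB = [
    ("facebook/musicgen-large", 8),
    ("facebook/musicgen-medium", 4),
    ("facebook/musicgen-small", 2),
]

def largest_model_for(vram_gb: float) -> str:
    """Pick the largest checkpoint whose FP16 footprint fits in vram_gb."""
    for name, need in FP16_VRAM_GB:
        if vram_gb >= need:
            return name
    raise ValueError("Not enough VRAM for any MusicGen model in FP16")

print(largest_model_for(6))   # facebook/musicgen-medium
print(largest_model_for(24))  # facebook/musicgen-large
```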

Common issues

| Issue | Solution |
|---|---|
| CUDA OOM | Use a smaller model, reduce duration |
| Poor quality | Increase cfg_coef, write more specific prompts |
| Generation too short | Check the max duration setting |
| Audio artifacts | Try a different temperature |
| Stereo not working | Use a stereo model variant |
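For the CUDA OOM row, a common mitigation is a retry loop that steps down the duration until generation fits. The sketch below takes an injectable `generate_fn` so the fallback logic can be exercised without a GPU; in real use you would wrap `model.generate` and catch `torch.cuda.OutOfMemoryError` instead of `MemoryError`:

```python
def generate_with_fallback(generate_fn, prompt, durations=(30, 15, 8)):
    """Try progressively shorter durations until generation fits in memory."""
    last_err = None
    for d in durations:
        try:
            return generate_fn(prompt, d)
        except MemoryError as err:
            last_err = err  # too big for this device; retry with a shorter clip
    raise RuntimeError("All fallback durations failed") from last_err

# Fake backend for illustration: anything over 10 s "overflows" memory.
def fake_generate(prompt, duration):
    if duration > 10:
        raise MemoryError
    return f"{duration}s clip for {prompt!r}"

print(generate_with_fallback(fake_generate, "lofi beat"))  # 8s clip for 'lofi beat'
```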
