AudioCraft

Overview

AudioCraft is Meta AI's open-source framework for generative audio. It gives researchers, developers, and creators tools for synthesizing music, sound effects, and general audio from textual descriptions or reference audio. Released in August 2023, AudioCraft consolidates Meta's research in audio generation into a single framework, making high-quality audio synthesis accessible without expensive sample libraries and production tools.

The AudioCraft suite consists of three complementary models. MusicGen generates music in various styles from text prompts or melodic conditioning, producing coherent compositions with instrumentation, rhythm, and harmonic structure. AudioGen specializes in environmental sounds and sound effects, from ambient noise to specific audio events like footsteps or thunder. EnCodec is a neural audio codec that compresses audio to very low bitrates (1.5-12 kbps) while maintaining perceptual quality, and it supplies the discrete audio tokens that the two generative models predict.

What distinguishes AudioCraft is its focus on controllability and quality. MusicGen supports both text-only generation and melodic conditioning, where the user provides a reference melody that the model follows while applying the style described in the prompt. AudioGen generates sound effects from text with user-set duration and sampling parameters. The AudioCraft code is open-source under the MIT license, while the pretrained model weights are published under CC-BY-NC 4.0. Fine-tuning and integration are therefore unrestricted at the code level, but commercial use of the released checkpoints is limited by the weights license.

Key Features

MusicGen: Text-to-music generation with melodic conditioning
AudioGen: Environmental sounds and sound effects synthesis
EnCodec: High-quality neural audio compression at 1.5-12 kbps
Mono and stereo music output at 32kHz; sound effects output at 16kHz
Long-form generation up to several minutes
Controllable duration, style, and instrumentation
Melodic conditioning for consistent musical themes
Fine-tuning support for custom audio datasets
Real-time generation capabilities on modern GPUs
Open-source code under the MIT license (pretrained weights: CC-BY-NC 4.0)
Integration with Hugging Face Transformers
Gradio web interface for interactive experimentation
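
The Hugging Face Transformers integration listed above can be sketched as follows. This is a minimal example rather than a reference implementation: it assumes the `transformers` and `torch` packages are installed, uses the `facebook/musicgen-small` checkpoint, and introduces two illustrative helpers (`tokens_for_seconds` and `generate_clip`) that are not part of either library. The 50 tokens-per-second default is an assumption for the 32kHz checkpoints; the actual frame rate is read from the model config at generation time.

```python
import math

# Assumption: 50 audio tokens per second for the 32 kHz MusicGen checkpoints.
# generate_clip() reads the real value from the loaded model config.
ASSUMED_FRAME_RATE_HZ = 50


def tokens_for_seconds(seconds: float, frame_rate_hz: float = ASSUMED_FRAME_RATE_HZ) -> int:
    """Convert a target duration into a max_new_tokens budget."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return math.ceil(seconds * frame_rate_hz)


def generate_clip(prompt: str, seconds: float = 5.0,
                  model_id: str = "facebook/musicgen-small"):
    """Generate one clip; returns (mono waveform as a NumPy array, sampling rate)."""
    # Imported lazily so tokens_for_seconds() works without the heavy dependencies.
    from transformers import AutoProcessor, MusicgenForConditionalGeneration

    processor = AutoProcessor.from_pretrained(model_id)
    model = MusicgenForConditionalGeneration.from_pretrained(model_id)

    inputs = processor(text=[prompt], padding=True, return_tensors="pt")
    frame_rate = model.config.audio_encoder.frame_rate
    audio = model.generate(
        **inputs,
        do_sample=True,
        guidance_scale=3.0,  # classifier-free guidance strength
        max_new_tokens=tokens_for_seconds(seconds, frame_rate),
    )
    sampling_rate = model.config.audio_encoder.sampling_rate
    # audio has shape (batch, channels, samples); take the first clip and channel
    return audio[0, 0].cpu().numpy(), sampling_rate
```

Calling `generate_clip("lo-fi hip hop beat with warm piano", seconds=8)` downloads the checkpoint on first use and returns a waveform that can be written to disk with any audio library.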

Use Cases

Video game background music and adaptive soundtracks
Film and video production sound effects
Podcast intro/outro music and transitions
YouTube video background music without copyright issues
Game audio: footsteps, ambiences, UI sounds
Advertising and marketing audio branding
Music prototyping and composition ideation
E-learning content audio enhancement
Audio library creation for content creators
Accessibility: audio descriptions and soundscapes
Music therapy and wellness applications
Rapid audio prototyping for creative projects

Technical Specifications

AudioCraft's models are built on transformer architectures with varying parameter counts: MusicGen ranges from 300M to 3.3B parameters across small, medium, and large variants, trading generation speed for quality. AudioGen uses a similar architecture optimized for shorter, more precise sound events. The standalone EnCodec models operate at 24kHz or 48kHz and use residual vector quantization (RVQ) to produce highly compressed discrete representations; MusicGen relies on a 32kHz EnCodec variant. Generation speed on an NVIDIA A100 GPU ranges from roughly 1x realtime (1 second of audio generated per second) for small models to 0.1x realtime for large models. Memory requirements vary from 8GB VRAM for small models to 24GB+ for large stereo models. MusicGen generates audio at a 32kHz sample rate, and AudioGen at 16kHz. All components are implemented in PyTorch with support for FP16 inference for improved performance on modern GPUs.
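
To make the realtime factors above concrete, the short sketch below converts them into wall-clock estimates. The function name `estimated_generation_seconds` is illustrative, and the 1x and 0.1x figures are the ones quoted in this section rather than measured values; actual throughput depends on batch size, precision, and hardware.

```python
def estimated_generation_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock time needed to generate `audio_seconds` of audio.

    realtime_factor is seconds of audio produced per second of compute,
    so 1.0 means realtime and 0.1 means ten times slower than realtime.
    """
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


# Realtime factors quoted in the text (assumed, not benchmarked here)
QUOTED_FACTORS = {"small": 1.0, "large": 0.1}

# Example workload: five 45-second music tracks
total_audio = 5 * 45.0
for size, factor in QUOTED_FACTORS.items():
    minutes = estimated_generation_seconds(total_audio, factor) / 60
    print(f"{size}: about {minutes:.1f} minutes for {total_audio:.0f}s of audio")
```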

Model Variants

MusicGen is available in three sizes: Small (300M parameters) for fast prototyping, Medium (1.5B) balancing quality and speed, and Large (3.3B) for maximum quality. Stereo checkpoints are published alongside the mono ones. Melody-conditioned variants accept reference audio to guide musical structure while applying the style described in the text prompt. AudioGen provides a single model focused on sound effects and environmental audio. EnCodec offers multiple bitrate configurations from 1.5kbps (extreme compression) to 12kbps (high fidelity), allowing users to trade file size for audio quality based on application requirements.
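
The relationship between RVQ settings and bitrate can be worked through with a little arithmetic. The sketch below assumes the figures reported for the 24kHz EnCodec model (75 latent frames per second and 1024-entry codebooks, so 10 bits per code); these numbers are not stated in this article, and the helper names are illustrative.

```python
import math


def rvq_bitrate_kbps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bitrate of an RVQ token stream: frames/s x codebooks x bits per code."""
    bits_per_code = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_code / 1000


def pcm_bitrate_kbps(sample_rate_hz: int, bit_depth: int = 16, channels: int = 1) -> float:
    """Bitrate of uncompressed PCM audio, for comparison."""
    return sample_rate_hz * bit_depth * channels / 1000


# Assumed 24 kHz EnCodec settings: 75 frames per second, 1024 codes per codebook
FRAME_RATE_HZ = 75
CODEBOOK_SIZE = 1024

pcm = pcm_bitrate_kbps(24_000)
for codebooks in (2, 16):
    kbps = rvq_bitrate_kbps(FRAME_RATE_HZ, codebooks, CODEBOOK_SIZE)
    print(f"{codebooks} codebooks: {kbps} kbps, {pcm / kbps:.0f}x smaller than 16-bit PCM")
```

Under these assumptions, 2 codebooks correspond to the 1.5kbps setting and 16 codebooks to the 12kbps setting, which is why quality and file size both scale with the number of quantizer stages kept.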

Pricing and Licensing

AudioCraft's code is free and open-source under the MIT license, which permits commercial use, modification, and distribution without royalties, provided the copyright and license notice is retained. The pretrained model weights are released separately under CC-BY-NC 4.0, which restricts commercial use of the published checkpoints. Users can self-host on their own infrastructure, eliminating per-generation costs. Cloud providers offer hosted endpoints: Hugging Face Inference API charges approximately $0.50-1.00 per GPU hour, and Replicate charges roughly $0.0002-0.0008 per second of generated audio depending on model size and quality. For production workloads generating 50+ minutes daily, self-hosting on AWS p3.2xlarge (~$3/hour), GCP T4 (~$0.35/hour), or RunPod RTX 4090 (~$0.40/hour) can provide better economics than pay-per-generation cloud APIs.
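
A rough way to compare the two pricing models is to compute the daily cost of each for a given volume. The sketch below uses the approximate prices quoted above and assumes, for simplicity, that a self-hosted GPU is billed only while it is generating (no idle time, storage, or engineering cost). The function names and the realtime factors are illustrative assumptions.

```python
def api_cost_per_day(audio_seconds: float, price_per_audio_second: float) -> float:
    """Pay-per-generation cost, billed per second of generated audio."""
    return audio_seconds * price_per_audio_second


def self_host_cost_per_day(audio_seconds: float, gpu_price_per_hour: float,
                           realtime_factor: float) -> float:
    """GPU rental cost, assuming billing only for time spent generating."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    gpu_hours = audio_seconds / realtime_factor / 3600
    return gpu_hours * gpu_price_per_hour


daily_audio = 50 * 60  # 50 minutes of generated audio per day, in seconds

# Approximate per-second API prices quoted in the text
for price in (0.0002, 0.0008):
    print(f"API at ${price}/s: ${api_cost_per_day(daily_audio, price):.2f} per day")

# Approximate $0.40/hour GPU at two assumed realtime factors
for factor in (1.0, 0.1):
    cost = self_host_cost_per_day(daily_audio, 0.40, factor)
    print(f"Self-host at {factor}x realtime: ${cost:.2f} per day")
```

The comparison is sensitive to the realtime factor: a slow, large model on a rented GPU can cost more per day than the cheapest API tier, so the break-even volume should be recomputed with measured throughput.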

Code Example: Music and Sound Effects Generation

The example below wraps MusicGen and AudioGen in a single helper class for game-audio and video-production workflows: MusicGen produces the music tracks and AudioGen the sound effects. It demonstrates both text-based and melody-conditioned generation.

from pathlib import Path

import torchaudio
from audiocraft.models import MusicGen, AudioGen
from audiocraft.data.audio import audio_write

class AudioCraftGenerator:
    """
    Production audio generator using AudioCraft (MusicGen + AudioGen)
    """
    
    def __init__(self, music_model="facebook/musicgen-medium", device="cuda"):
        self.device = device
        print(f"Loading MusicGen model: {music_model}")
        self.music_model = MusicGen.get_pretrained(music_model, device=device)
        
        print("Loading AudioGen model...")
        self.audio_model = AudioGen.get_pretrained("facebook/audiogen-medium", device=device)
        
        self.output_dir = Path("audiocraft_output")
        self.output_dir.mkdir(parents=True, exist_ok=True)
        
        print("Models loaded successfully\n")
    
    def generate_music(
        self,
        prompt: str,
        duration: float = 10.0,
        melody_path: str = None,
        temperature: float = 1.0,
        top_k: int = 250,
        top_p: float = 0.0,
        cfg_coef: float = 3.0
    ) -> Path:
        """
        Generate music from text prompt with optional melody conditioning
        
        Args:
            prompt: Text description of desired music
            duration: Length in seconds (clips longer than 30s are produced by windowed extension)
            melody_path: Optional path to melody reference audio
            temperature: Sampling temperature (higher = more random)
            top_k: Top-k sampling parameter
            top_p: Top-p (nucleus) sampling parameter
            cfg_coef: Classifier-free guidance coefficient (higher = stronger prompt adherence)
        
        Returns:
            Path to generated audio file
        """
        print(f"Generating music: '{prompt}'")
        print(f"Duration: {duration}s, CFG: {cfg_coef}")
        
        # Set generation parameters
        self.music_model.set_generation_params(
            duration=duration,
            temperature=temperature,
            top_k=top_k,
            top_p=top_p,
            cfg_coef=cfg_coef
        )
        
        # Generate with or without melody conditioning
        if melody_path:
            print(f"Using melody conditioning from: {melody_path}")
            melody, sr = torchaudio.load(melody_path)
            
            # Resample if necessary
            if sr != self.music_model.sample_rate:
                resampler = torchaudio.transforms.Resample(sr, self.music_model.sample_rate)
                melody = resampler(melody)
            
            # Generate with melody conditioning
            output = self.music_model.generate_with_chroma(
                descriptions=[prompt],
                melody_wavs=melody[None],
                melody_sample_rate=self.music_model.sample_rate
            )
        else:
            # Generate from text only
            output = self.music_model.generate([prompt])
        
        # Save audio
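        # Build a filesystem-safe name from the first 30 characters of the prompt
        safe_prompt = "".join(c if c.isalnum() else "_" for c in prompt[:30])
        filename = f"music_{safe_prompt}"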
        output_path = self.output_dir / filename
        
        audio_write(
            str(output_path),
            output[0].cpu(),
            self.music_model.sample_rate,
            strategy="loudness",
            loudness_compressor=True
        )
        
        print(f"Music generated: {output_path}.wav\n")
        return Path(f"{output_path}.wav")
    
    def generate_sound_effect(
        self,
        prompt: str,
        duration: float = 5.0,
        temperature: float = 1.0,
        cfg_coef: float = 3.0
    ) -> Path:
        """
        Generate sound effects using AudioGen
        
        Args:
            prompt: Description of sound effect
            duration: Length in seconds
            temperature: Sampling temperature
            cfg_coef: Guidance strength
        
        Returns:
            Path to generated sound effect
        """
        print(f"Generating sound effect: '{prompt}'")
        print(f"Duration: {duration}s")
        
        # Set generation parameters
        self.audio_model.set_generation_params(
            duration=duration,
            temperature=temperature,
            cfg_coef=cfg_coef
        )
        
        # Generate sound effect
        output = self.audio_model.generate([prompt])
        
        # Save audio
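        # Build a filesystem-safe name from the first 30 characters of the prompt
        safe_prompt = "".join(c if c.isalnum() else "_" for c in prompt[:30])
        filename = f"sfx_{safe_prompt}"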
        output_path = self.output_dir / filename
        
        audio_write(
            str(output_path),
            output[0].cpu(),
            self.audio_model.sample_rate,
            strategy="loudness"
        )
        
        print(f"Sound effect generated: {output_path}.wav\n")
        return Path(f"{output_path}.wav")
    
    def generate_game_audio_pack(
        self,
        music_prompts: list[str],
        sfx_prompts: list[str],
        music_duration: float = 60.0,
        sfx_duration: float = 2.0
    ) -> dict:
        """
        Generate complete audio pack for game development
        
        Args:
            music_prompts: List of music descriptions
            sfx_prompts: List of sound effect descriptions
            music_duration: Duration for music tracks
            sfx_duration: Duration for sound effects
        
        Returns:
            Dict with paths to all generated files
        """
        results = {"music": [], "sfx": []}
        
        print(f"Generating game audio pack...")
        print(f"  Music tracks: {len(music_prompts)}")
        print(f"  Sound effects: {len(sfx_prompts)}\n")
        
        # Generate music tracks
        for i, prompt in enumerate(music_prompts, 1):
            print(f"Music {i}/{len(music_prompts)}")
            path = self.generate_music(prompt, duration=music_duration)
            results["music"].append(path)
        
        # Generate sound effects
        for i, prompt in enumerate(sfx_prompts, 1):
            print(f"SFX {i}/{len(sfx_prompts)}")
            path = self.generate_sound_effect(prompt, duration=sfx_duration)
            results["sfx"].append(path)
        
        print(f"\nGame audio pack complete!")
        print(f"  Total music: {len(results['music'])} tracks")
        print(f"  Total SFX: {len(results['sfx'])} effects")
        
        return results

# Example 1: Game background music
generator = AudioCraftGenerator(music_model="facebook/musicgen-medium")

# Generate adaptive game music
generator.generate_music(
    prompt="Epic orchestral battle music with intense drums and brass, heroic theme, fast tempo",
    duration=90.0,
    cfg_coef=4.0  # Strong prompt adherence
)

generator.generate_music(
    prompt="Peaceful ambient music for exploration, soft piano and strings, calm atmosphere",
    duration=120.0,
    cfg_coef=3.5
)

generator.generate_music(
    prompt="Suspenseful horror music, dark ambience, eerie sounds, tension building",
    duration=60.0,
    cfg_coef=4.5
)

# Example 2: Sound effects library
sfx_prompts = [
    "Footsteps on wooden floor, steady pace",
    "Door creaking open slowly",
    "Sword swing and metal clashing",
    "Magic spell casting with sparkles",
    "Explosion with debris falling",
    "Coin pickup sound, bright and clear",
    "UI button click, satisfying feedback",
    "Thunder and heavy rain",
    "Fire crackling",
    "Water splash"
]

for prompt in sfx_prompts:
    generator.generate_sound_effect(prompt, duration=3.0)

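# Example 3: Melody-conditioned music
# Melody conditioning needs a melody checkpoint; the text-only
# musicgen-medium model loaded above does not accept a reference melody.
# The melody checkpoint also handles the text-only prompts in Example 4.
generator.music_model = MusicGen.get_pretrained(
    "facebook/musicgen-melody", device=generator.device
)
generator.generate_music(
    prompt="Upbeat electronic dance music, energetic synthesizers, club atmosphere",
    duration=30.0,
    melody_path="reference_melody.wav",  # Reference melody file
    cfg_coef=3.5
)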

# Example 4: Complete game audio pack
game_music = [
    "Main menu theme, orchestral, majestic and inviting",
    "Level 1 music, upbeat adventure theme, playful instruments",
    "Boss battle music, intense rock with electric guitar",
    "Victory fanfare, triumphant brass and percussion",
    "Game over music, melancholic piano"
]

game_sfx = [
    "Jump sound effect",
    "Power-up collection",
    "Enemy hit sound",
    "Player damage",
    "Level complete jingle"
]

audio_pack = generator.generate_game_audio_pack(
    music_prompts=game_music,
    sfx_prompts=game_sfx,
    music_duration=45.0,
    sfx_duration=2.0
)

print(f"\nAll audio generation complete!")
print(f"Output directory: {generator.output_dir}")
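
The loops in the example above generate one prompt per call. Since the `generate` method of MusicGen and AudioGen accepts a list of descriptions, several prompts can be processed in one forward pass, which usually improves GPU utilization. The sketch below shows one way to do this; `chunked` and `generate_in_batches` are illustrative helpers, not part of AudioCraft, and the batch size of 4 is an arbitrary assumption that should be tuned to available VRAM.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def generate_in_batches(model, prompts: Sequence[str], batch_size: int = 4) -> list:
    """Generate one clip per prompt, several prompts per model call.

    `model` is expected to be a MusicGen or AudioGen instance on which
    set_generation_params() has already been called; every clip in a batch
    shares that duration and those sampling settings.
    """
    clips = []
    for batch in chunked(prompts, batch_size):
        wavs = model.generate(batch)  # tensor of shape [batch, channels, samples]
        clips.extend(wav.cpu() for wav in wavs)
    return clips
```

With the class above, `generate_in_batches(generator.audio_model, sfx_prompts)` would return the clips in prompt order, ready to be passed to `audio_write` one by one.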

Professional Integration Services by 21medien

Deploying AudioCraft for professional audio production requires expertise in model fine-tuning, audio post-processing, and production pipeline integration. 21medien offers comprehensive services to help game studios, film producers, and content creators leverage AudioCraft's music and sound generation capabilities for scalable audio content creation.

Our services include:

AudioCraft Self-Hosting Infrastructure: GPU-optimized deployment for cost-effective high-volume audio generation
Custom Model Fine-Tuning: training on your proprietary audio datasets to create brand-specific musical styles and sound signatures
Audio Pipeline Automation: integrating AudioCraft with game engines (Unity, Unreal), DAWs (Ableton, Logic Pro), and content management systems
Quality Enhancement Post-Processing: mastering, format conversion, and adaptive audio implementation for interactive applications
Adaptive Audio System Development: dynamic music systems that respond to gameplay or narrative events
Performance Optimization: batch generation, caching strategies, and GPU orchestration to maximize throughput
Training Programs: courses for audio designers and composers on prompt engineering, melodic conditioning, and iterative refinement techniques specific to AudioCraft

Whether you need a complete game audio production pipeline, film sound effects library generation system, or custom music synthesis integration, our team of audio engineers and AI specialists is ready to help. Schedule a free consultation call through our contact page to discuss your audio AI requirements and explore how AudioCraft can transform your audio production workflow.

Resources and Links

Official website: https://audiocraft.metademolab.com/
GitHub: https://github.com/facebookresearch/audiocraft
Documentation: https://github.com/facebookresearch/audiocraft/blob/main/docs/MUSICGEN.md
Demo: https://huggingface.co/spaces/facebook/MusicGen
Research paper: https://arxiv.org/abs/2306.05284

Related Technologies

Bark

Stable Audio

ElevenLabs

Whisper

Hugging Face
