A tool for content-based audio retrieval using natural language queries. Built on CLAP (Contrastive Language-Audio Pre-training), it finds specific segments within an audio file by semantically matching them against text descriptions.
semantic_audio_search/
│
├── src/
│ ├── __init__.py
│ ├── query_expansion.py
│ ├── audio_segmentation.py
│ ├── embedding.py
│ └── semantic_search.py
│
├── utils/
│ ├── __init__.py
│ ├── visualization.py
│ └── file_operations.py
│
├── main.py
│
├── data/ # Sample audio files
│ ├── music/
│ └── speech/
│
├── requirements.txt
├── README.md
└── .gitignore
- Intelligent Query Expansion: Automatically generate related search queries to improve search accuracy
- Advanced Audio Segmentation: Break down audio files into meaningful segments
- Semantic Embedding: Convert both audio and text into semantic vector representations
- Flexible Search: Support for various audio files and query types
- Visualization Tools: Segment boundary and search result visualization
- Python 3.11+
- CUDA-compatible GPU (recommended, but not required)
- Required libraries listed in `requirements.txt`
- Clone the repository:
git clone https://github.com/satyam-kr03/SemanticAudioSearch.git
cd SemanticAudioSearch
- Create a virtual environment:
python -m venv venv
source venv/bin/activate
# On Windows, use `venv\Scripts\activate`
- Install dependencies:
pip install -r requirements.txt

from src.semantic_search import SemanticAudioSearch
# Initialize the search engine
searcher = SemanticAudioSearch(use_cuda=True)
# Search for audio segments
audio_file = "path/to/your/audio.wav"
text_query = "piano solo"
results = searcher.query_with_expansion(
audio_file,
text_query,
visualize=True, # Optional: visualize segmentation
top_k=3 # Number of top matching segments
)

# Customize search parameters
results = searcher.query_with_expansion(
audio_file,
text_query,
num_expansions=5, # More query variations
segment_method='mfcc', # Alternative segmentation method
fusion_method='max', # Different similarity fusion strategy
top_k=5, # More top results
use_transcription=True
)

- Generates semantically related queries
- Supports rule-based and emotion-based expansion
- Improves search accuracy and recall
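
To make the rule-based side of query expansion concrete, here is a minimal sketch. It is not the project's `query_expansion.py`: the function name `expand_query` and the prompt templates are hypothetical, chosen only to show how one query can become several related ones, with the count mirroring the `num_expansions` parameter.

```python
def expand_query(query: str, num_expansions: int = 3) -> list[str]:
    """Return the original query followed by up to `num_expansions` variations.

    The templates are illustrative assumptions, not the project's actual rules.
    """
    templates = [
        "the sound of {q}",
        "a recording of {q}",
        "{q} playing",
        "audio containing {q}",
        "someone performing {q}",
    ]
    query = query.strip()
    # Fill each template with the query, keeping at most num_expansions of them.
    variations = [t.format(q=query) for t in templates[: max(num_expansions, 0)]]
    # The original query always comes first so it is never lost.
    return [query] + variations


if __name__ == "__main__":
    for q in expand_query("piano solo", num_expansions=2):
        print(q)
```

Each expanded query would then be embedded separately, and the per-query similarities combined by the chosen fusion method.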
- Multiple segmentation methods:
  - MFCC-based
  - Novelty-based
  - Beat-based
- Configurable segment length and threshold
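
As a rough illustration of novelty-based segmentation with a configurable threshold and minimum segment length, the sketch below places a boundary wherever consecutive feature frames differ sharply. It assumes the per-frame features (for example MFCC vectors) have already been computed; `novelty_boundaries`, `to_segments`, and their parameters are hypothetical names, not the project's API.

```python
import math


def novelty_boundaries(frames, threshold=0.5, min_len=2):
    """Return the frame indices at which a new segment starts.

    frames: sequence of equal-length feature vectors, one per frame.
    threshold: minimum distance between consecutive frames to count as a change.
    min_len: minimum number of frames a segment must span.
    """
    if not frames:
        return []
    boundaries = [0]
    for i in range(1, len(frames)):
        # Novelty here is simply the Euclidean distance to the previous frame.
        novelty = math.dist(frames[i - 1], frames[i])
        if novelty > threshold and i - boundaries[-1] >= min_len:
            boundaries.append(i)
    return boundaries


def to_segments(boundaries, n_frames, hop_seconds):
    """Convert boundary indices into (start, end) times in seconds."""
    edges = boundaries + [n_frames]
    return [(s * hop_seconds, e * hop_seconds) for s, e in zip(edges, edges[1:])]


if __name__ == "__main__":
    frames = [(0.0, 0.0), (0.1, 0.0), (0.1, 0.1),
              (2.0, 2.0), (2.1, 2.0), (2.0, 2.1),
              (5.0, 5.0), (5.0, 5.1)]
    starts = novelty_boundaries(frames, threshold=0.5, min_len=2)
    print(to_segments(starts, len(frames), hop_seconds=0.5))
```

Raising `threshold` or `min_len` yields fewer, longer segments; lowering them yields more, shorter ones.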
- Uses Qwen2-Audio-7B-Instruct
- Uses CLAP model for audio and text embeddings
- Supports CUDA acceleration
- Generates semantic vector representations
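
Once audio segments and text queries live in the same vector space, search reduces to comparing vectors. The sketch below shows one plausible way to score segments against several expanded queries and fuse the scores, in the spirit of the `fusion_method` and `top_k` parameters. The embeddings are plain lists standing in for CLAP outputs, and `cosine` and `rank_segments` are hypothetical helpers, not the project's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_segments(segment_embs, query_embs, fusion_method="max", top_k=3):
    """Return the top_k (segment_index, fused_score) pairs, best first."""
    fusers = {
        "max": max,                             # best-matching query wins
        "mean": lambda sims: sum(sims) / len(sims),  # average over all queries
    }
    fuse = fusers[fusion_method]
    scored = []
    for idx, seg in enumerate(segment_embs):
        sims = [cosine(seg, q) for q in query_embs]
        scored.append((idx, fuse(sims)))
    # Sort by fused score, highest first; ties keep their original order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    segments = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    queries = [[1.0, 0.0], [0.0, 1.0]]
    print(rank_segments(segments, queries, fusion_method="max", top_k=2))
    print(rank_segments(segments, queries, fusion_method="mean", top_k=1))
```

With `max`, a segment ranks highly if it matches any one query well; with `mean`, it must match the queries well on average, which favors segments consistent with the whole expansion set.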
- GPU acceleration recommended
- Large audio files may require more processing time
- Adjust segment size and query expansion for optimal results
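
Because a GPU is recommended but not required, a script may want to enable CUDA only when it is actually present. The following is a minimal sketch assuming PyTorch is the backend (an assumption; check `requirements.txt`). `cuda_available` is a hypothetical helper, and it falls back to `False` when PyTorch is not installed.

```python
def cuda_available() -> bool:
    """Return True only if PyTorch is installed and reports a usable CUDA device."""
    try:
        import torch  # assumed backend; not confirmed by this README
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    use_cuda = cuda_available()
    print(f"use_cuda={use_cuda}")
    # The flag could then be passed to the constructor shown earlier:
    # searcher = SemanticAudioSearch(use_cuda=use_cuda)
```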