Datasets you can browse, audit, and deploy.
Rights-cleared corpora across audio, image, video, and text. Every dataset ships with a card your compliance team will read before your engineering team does.
Most dataset pages sell volume.
We’d rather show you what’s inside each file: the modality, the locales, the licensing terms, the chain of custody, the sample clips. If it passes your review, it ships. If it doesn’t, it’s not for you.
Conversational speech, rights-cleared, globally balanced.
A two-speaker conversational corpus collected through our platform across 18 locales. Each file ships with word-level transcripts, speaker diarisation, and the full consent trail.
- Modality
- Conversational speech (two-speaker)
- Locales
- 18 globally balanced locales, including en-US, en-GB, zh-CN, pt-BR, pt-PT, es-MX, fr-FR, ja-JP, ko-KR, de-DE, and hi-IN
- Domain split
- Healthcare, meetings, and contact-centre
- Format
- WAV, PCM, 44.1 kHz, stereo
- Transcripts
- Word-level timestamps, speaker labels
- Licensing
- Rights-cleared for commercial model training; derivative works available on request
- Samples
- Public on HuggingFace and Datarade
- Availability
- Licensed per locale or as the full corpus
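If you want to check delivered audio against the format listed above, a minimal sketch using Python's standard `wave` module could look like this. The function name `check_wav_format` is our own illustration, not part of any delivery tooling, and the check covers only what the list states (PCM WAV, 44.1 kHz, stereo); bit depth is not specified above, so it is not tested.

```python
import wave

# Values taken from the format line above: WAV, PCM, 44.1 kHz, stereo.
EXPECTED_RATE_HZ = 44_100
EXPECTED_CHANNELS = 2


def check_wav_format(path):
    """Return a list of mismatches between a file and the listed format.

    An empty list means the file matched. The function name and the
    message wording are illustrative assumptions.
    """
    problems = []
    try:
        # The standard wave module reads uncompressed PCM WAV only and
        # raises wave.Error for anything it cannot parse.
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            channels = wav.getnchannels()
    except (wave.Error, EOFError) as exc:
        return [f"not a readable PCM WAV file: {exc}"]

    if rate != EXPECTED_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE_HZ} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"channel count is {channels}, expected {EXPECTED_CHANNELS}")
    return problems
```

Calling `check_wav_format("clip.wav")` on a sample clip returns an empty list when the file matches, which makes it easy to run across a whole review folder before sign-off.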
What’s coming next.
We publish new datasets as they clear QA. Here’s what is in production now.
- Image
- Object recognition and scene classification across retail, medical, and industrial domains
- Video
- Egocentric task demonstrations with synced audio narration
- Multimodal
- Paired voice-image-text datasets for instruction-following models
- Accented speech
- Extended single-locale corpora with fine-grained accent labelling
If you need one of these sooner than we’ll have it ready, that’s a custom collection. Talk to us.
Nothing in the catalog fits? Scope a project.
TELL US
The modality, language, domain, volume, and delivery timeline you need. Any special requirements for speaker profile, accent coverage, or capture environment.
WE SCOPE
Within 48 hours of the first call, you’ll have a scoped plan covering contributor sourcing, project timeline, pricing, delivery format, and rights scope.
WE COLLECT
Collection runs through the same platform, same consent framework, same multi-layer QA as every dataset in our catalog. The dataset card is built as we go.
YOU REVIEW
Early batches land in a review folder. Approve, flag for rework, or adjust scope. Final delivery happens on your preferred channel.
Sample clips and full corpora, in three places.
- HuggingFace
- Open samples from most datasets, searchable by modality and locale
- Datarade
- Full catalog listings with licensing terms and request-quote flow
- Direct
- Enterprise contracts and custom projects ship via signed direct access
All three paths point to the same source of truth: the dataset card for each corpus.
The ones we get most.
Can I listen to samples before requesting a spec?
Yes. Public samples are on HuggingFace and Datarade. For custom datasets or full-corpus evaluation, we’ll share a sample set directly after a short scoping call.
What does rights-cleared mean for these datasets?
Every contributor signed a consent form authorising commercial training use for the specific project their data contributed to. The consent scope is documented in each dataset card and retrievable per file. You buy datasets with the rights spelled out, not assumed.
Can I license just a subset of a corpus?
Usually yes. Most datasets are available per-locale, per-domain, or per-speaker-segment. Contact us with the slice you need and we will confirm licensing and pricing for that subset.
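To describe the slice you need, it can help to work out which files and how many hours it covers. The sketch below assumes a hypothetical per-file manifest with `locale`, `domain`, and `duration_s` fields; the actual manifest layout is defined by each dataset card, and the rows shown are invented for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestRow:
    # Hypothetical manifest fields; real field names may differ.
    path: str
    locale: str
    domain: str
    duration_s: float


def select_slice(rows, locales=None, domains=None):
    """Keep rows matching the requested locales and domains.

    Passing None for a filter means "do not filter on this field".
    """
    return [
        row
        for row in rows
        if (locales is None or row.locale in locales)
        and (domains is None or row.domain in domains)
    ]


def total_hours(rows):
    """Sum clip durations and convert seconds to hours."""
    return sum(row.duration_s for row in rows) / 3600.0


# Invented example rows, not real catalogue data.
manifest = [
    ManifestRow("a.wav", "en-GB", "healthcare", 1800.0),
    ManifestRow("b.wav", "en-GB", "meetings", 1800.0),
    ManifestRow("c.wav", "pt-BR", "healthcare", 3600.0),
]

healthcare_only = select_slice(manifest, domains={"healthcare"})
```

Here `total_hours(healthcare_only)` gives the audio volume of the healthcare slice, which is the figure a per-hour quote would be based on.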
Do you sell exclusive licenses?
For custom projects, yes. Catalog datasets are non-exclusive by default because they were collected with that intention. If you need exclusivity on a catalog dataset, we can discuss a carve-out.
What happens if we find an issue with a dataset we licensed?
Tell us. We investigate and, depending on the nature of the issue, replace files, refund, or adjust scope. A dataset is only as good as the way its next complaint gets handled.
How are datasets priced?
Per hour for audio, per asset for image and video, per token or per document for text. Exact pricing depends on locale, domain, rights scope, and volume. We quote on request.
How do you handle personally identifiable information?
Consent forms explicitly authorise or restrict identifying features. Anonymisation options include voice masking, face blurring, and metadata redaction. The dataset card specifies the anonymisation level for each file.
Can I buy data I can re-sell?
No. Redistribution rights are not granted by default. If you are a downstream marketplace or data broker and need redistribution, contact us for a separate license tier.