
The [Human] Standard.

How every file, every contributor, and every dataset passes through four stages before it earns the UsergyAI name.

A methodology is only useful if it ships with the data.

Most AI data providers have one. Their website describes it. Their sales deck walks through it. The problem starts when you ask for the paper trail, and there isn’t one.

The [Human] Standard is built to produce evidence at every stage, not stories.

Four stages. Every one produces evidence.

SOURCE

A contributor joins the network through a verified identity check. Before they ever touch a project, the platform confirms language proficiency, domain skill, and consent to the current collection scope. Skill testing happens on tasks that mirror real work, not synthetic prompts.

Contributors who clear the bar land in a pool that is visible to project managers by language, modality, and rating. Pools are scoped per project so a contributor working on a medical transcription project is not automatically eligible for a consumer voice collection.
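For a rough picture of how per-project scoping can work, here is a minimal sketch of an eligibility check; the field names and the matching rule are assumptions for illustration, not the platform's actual logic.

    # Illustrative pool scoping: a contributor is only eligible for projects they consented to
    # and whose language and modality they match. Field names are hypothetical.
    contributor = {
        "id": "c-1042",
        "languages": ["es-MX"],
        "modalities": ["audio"],
        "consented_projects": ["proj-medical-transcription"],
    }

    def eligible(contributor: dict, project_id: str, language: str, modality: str) -> bool:
        return (
            project_id in contributor["consented_projects"]  # consent is scoped per project
            and language in contributor["languages"]
            and modality in contributor["modalities"]
        )

    print(eligible(contributor, "proj-consumer-voice", "es-MX", "audio"))  # False: out of scope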

What you get at this stage: a named, identity-verified contributor whose skill match and consent scope are on file before they record a second.

CAPTURE

Collection happens inside the platform. No third-party apps, no private cloud uploads, no file-sharing links. Contributors record directly into a browser session that streams to our storage over a signed upload path.

Each file receives its metadata at the moment of creation: contributor ID, project ID, consent version, signal hash, and timestamp. The file and its paperwork are one object, not two.
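As a sketch of what "the file and its paperwork are one object" can look like, here is a minimal per-file record built at capture time; the field names, ID formats, and the choice of SHA-256 for the signal hash are assumptions, not the platform's actual schema.

    import hashlib
    import json
    from datetime import datetime, timezone

    def build_capture_record(audio_bytes: bytes, contributor_id: str,
                             project_id: str, consent_version: str) -> dict:
        """Attach provenance at the moment of creation (illustrative fields only)."""
        return {
            "contributor_id": contributor_id,    # identity-verified contributor
            "project_id": project_id,            # scopes consent and the contributor pool
            "consent_version": consent_version,  # version signed for this project
            "signal_hash": hashlib.sha256(audio_bytes).hexdigest(),  # assumed hash choice
            "captured_at": datetime.now(timezone.utc).isoformat(),
        }

    print(json.dumps(build_capture_record(b"raw pcm bytes", "c-1042", "proj-example", "v3.1"), indent=2))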

What you get at this stage: a raw file with provenance already attached, not reconstructed after the fact.

VERIFY

Every file passes through three layers of quality control before it reaches your review.

Layer one: automated scoring. Language detection, voice activity, diarisation, and signal quality run on every file. Scores attach to the file record.

Layer two: peer review. Verified contributors from the same language community listen to a sampled set of files from every project. Inter-rater agreement is tracked.

Layer three: centralised QA. A specialist team reviews flagged files, edge cases, and a random audit sample. Decisions are logged with actor and time.

What you get at this stage: a dataset where every file has cleared three independent quality layers, with a paper trail to prove it.

DELIVER

The final dataset ships with a dataset card, not just a zip file.

The card documents: modality, volume, locales, format specifications, contributor profiles (anonymised), consent versions in use, licensing scope, QC metrics, and the full provenance trail.
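In machine-readable form, a card along those lines might look roughly like this; the keys and the sample values are a hypothetical sketch, not the shipped format.

    # Hypothetical dataset-card structure; keys and values are illustrative only.
    dataset_card = {
        "modality": "audio",
        "volume_hours": 100,
        "locales": ["es-MX", "es-AR"],
        "format": {"container": "WAV", "encoding": "PCM",
                   "sample_rate_hz": 48000, "bit_depth": 24},
        "contributor_profiles": "anonymised summary statistics",
        "consent_versions": ["v3.0", "v3.1"],
        "licensing_scope": {"commercial_training": True,
                            "derivative_works": True,
                            "redistribution": False},
        "qc_metrics": {"automated_pass_rate": 0.97, "inter_rater_agreement": 0.89},
        "provenance": "per-file trail, keyed by file ID",
    }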

Delivery happens through your preferred channel: direct download, S3-compatible bucket, or API pull.
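For the S3-compatible option, a pull could look roughly like the snippet below, using standard boto3 calls; the endpoint, credentials, bucket, and prefix are placeholders, not real delivery details.

    import boto3

    # Placeholders only: the real endpoint, credentials, bucket, and prefix arrive with delivery.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://storage.example.com",
        aws_access_key_id="YOUR_KEY",
        aws_secret_access_key="YOUR_SECRET",
    )

    # List the delivered objects and download each file next to its metadata record.
    resp = s3.list_objects_v2(Bucket="example-delivery-bucket", Prefix="proj-example/")
    for obj in resp.get("Contents", []):
        s3.download_file("example-delivery-bucket", obj["Key"], obj["Key"].split("/")[-1])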

What you get at this stage: a dataset your compliance team can sign off on without a second meeting.

One answer works for three audiences.

When a model fails an evaluation, the first question is where the training data came from.

When a regulator opens a file, the first question is who consented to what.

When a legal review lands on a dataset, the first question is whether the paperwork holds up.

The [Human] Standard is built so all three questions have the same answer: it’s on file.

The questions we get most.

What makes a contributor "verified"?

Verification runs three checks: identity (a government-issued ID, confirmed against geolocation), skill (domain-appropriate testing that mirrors real work), and consent (signed agreement specifying rights and compensation for the current project). Contributors move through all three before they can record anything.

What does informed consent mean here?

Informed consent means the contributor signs a document that states, in their language, what the recording is for, who will have access to it, what commercial rights are being granted, how long the rights last, and what compensation they will receive. The signed document is timestamped and retrievable by file ID. No blanket agreements, no “future use” language.
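"Timestamped and retrievable by file ID" can be pictured as a lookup along these lines; the record shape and field names are hypothetical.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ConsentRecord:
        """Illustrative shape of a signed consent record; field names are assumptions."""
        file_id: str
        contributor_id: str
        consent_version: str
        language: str                # the language the document was signed in
        purpose: str                 # what the recording is for
        access: str                  # who will have access to it
        commercial_rights: bool      # whether commercial training rights are granted
        rights_duration_months: int  # how long the rights last
        compensation: str            # what the contributor receives
        signed_at: str               # timestamp of signature

    def consent_for(file_id: str, index: dict[str, ConsentRecord]) -> ConsentRecord:
        # In practice this would be a platform query; here it is a plain dictionary lookup.
        return index[file_id]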

How is quality control actually done?

Three layers. Every file runs through automated model-based scoring for language, signal quality, voice activity, and speaker diarisation. Peer review samples files across contributors within each project. Centralised QA reviews flagged files, edge cases, and an audit sample. A file that fails any layer is returned to the queue or rejected.
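A minimal sketch of the gate those three layers imply, assuming simple pass/fail results per layer; the names and logging fields are invented for illustration.

    from datetime import datetime, timezone

    def qc_gate(layer_results: dict[str, bool], actor: str) -> dict:
        """Ship a file only if every layer passed; log the decision with actor and time (sketch)."""
        passed = all(layer_results.values())
        return {
            "decision": "accepted" if passed else "returned_or_rejected",
            "layers": layer_results,
            "decided_by": actor,
            "decided_at": datetime.now(timezone.utc).isoformat(),
        }

    print(qc_gate({"automated": True, "peer_review": True, "central_qa": False}, "qa-specialist-07"))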

What file formats and specs do datasets ship in?

Audio: WAV (PCM), with configurable sample rates up to 48 kHz and bit depths up to 24-bit. Mono or stereo. Transcripts as JSON with word-level timestamps and speaker labels. Image and video: common formats per project spec. Custom formats on request.
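The transcript spec above might translate into JSON along these lines; the key names and sample words are assumptions, while the word-level timestamps and speaker labels are what the spec calls for.

    import json
    import wave
    from pathlib import Path

    # Hypothetical transcript shape: word-level timestamps (seconds) plus speaker labels.
    transcript = {
        "file_id": "example-0001",
        "language": "es-MX",
        "words": [
            {"speaker": "spk_0", "word": "hola",   "start": 0.42, "end": 0.71},
            {"speaker": "spk_0", "word": "buenos", "start": 0.75, "end": 1.02},
        ],
    }
    print(json.dumps(transcript, ensure_ascii=False, indent=2))

    # Optional spot check on a delivered WAV: 48 kHz, 24-bit PCM (sample width of 3 bytes).
    wav_path = Path("example-0001.wav")  # placeholder filename
    if wav_path.exists():
        with wave.open(str(wav_path), "rb") as w:
            assert w.getframerate() == 48000 and w.getsampwidth() == 3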

How long does a custom project take?

A typical project is scoped within 48 hours of the first call. Contributor sourcing and pipeline setup take 3 to 7 days. First files arrive in the second week. The full delivery timeline depends on volume and modality: a 100-hour audio project ships in 3 to 5 weeks; a 10,000-image project ships in 2 to 4 weeks.

How is UsergyAI different from Scale AI or Appen?

Scale AI and Appen are annotation-first. They label data that was collected elsewhere, usually by anonymous crowds on third-party platforms. UsergyAI is a collection platform. We run the pipeline end to end: sourcing, identity verification, consent, capture, QC, delivery. We also stay independent of any model lab, which matters if you care about neutrality.

Can the data be used to train commercial models?

Yes, when the project is scoped for commercial rights. Every file carries its own rights flags: commercial training, derivative works, redistribution. You buy only the rights you need, and the scope is on file.
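"Every file carries its own rights flags" can be modelled as simply as the sketch below; the three flags mirror the ones named above, and the rest is an assumption.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class RightsFlags:
        """Per-file rights scope (illustrative model)."""
        commercial_training: bool
        derivative_works: bool
        redistribution: bool

    # You buy only the rights you need; anything not granted stays False.
    flags = RightsFlags(commercial_training=True, derivative_works=False, redistribution=False)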

What languages and regions do you cover?

Our contributor network spans 20+ countries and supports common and underrepresented languages across major families: Indo-European, Sino-Tibetan, Afro-Asiatic, Austronesian, Niger-Congo, Dravidian, and more. Project matching prioritises native speakers.

How are contributors compensated?

Fairly, above regional minimum wage floors, scaled to task difficulty and contributor expertise. Compensation is documented in the consent form each contributor signs and visible in the platform throughout a project.

What happens if a dataset fails QC?

It does not ship. If individual files fail, they go back to the queue for re-capture or rejection. If a whole project fails quality thresholds, we re-scope or, in rare cases, refund. A dataset ships clean, or it does not ship at all.

The standard is only as good as the work behind it.