The Core Problem: Why Ears Fail, Why AI Doesn't

Human voice recognition is pattern matching against a stored mental model. Your brain remembers what someone sounds like and compares new audio against that memory. The problem is that AI voice cloning replicates the exact features your brain uses to recognize people — pitch, timbre, speech cadence, accent.

But there are features of genuine human speech that AI voice cloning does not replicate — features that are imperceptible to humans but mathematically detectable. This is why a deepfake phone call can fool your ears but not a properly designed verification system. This is the foundation of AI voice clone detection: finding signals in the audio that distinguish real human speech from synthesized or converted speech.

VeriCall's on-device speaker verification model is trained to find and act on exactly these signals — producing a confidence score in real time that tells you, before any damage is done, whether the voice on the call is real or an AI clone.

Component 1: Speaker Verification vs. Speaker Recognition

It's important to distinguish between two related but different problems:

Speaker Recognition
Question asked: "Who is speaking?"
Use case: identifying an unknown speaker from a database of known voices.

Speaker Verification
Question asked: "Is this the person they claim to be?"
Use case: confirming that a caller is the specific person expected. This is what VeriCall does.

VeriCall performs speaker verification, not speaker recognition. The system doesn't try to identify who is calling — it answers a binary question: is this voice the same person whose biometric voiceprint is stored for this contact? This is a fundamentally easier and more accurate problem, and it is perfectly suited for the phone call use case.
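The distinction can be sketched in a few lines. This is an illustrative sketch, not VeriCall's implementation: the function names, the cosine-similarity comparison, and the 0.7 threshold are all assumptions made for the example.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two voice embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def recognize(embedding: np.ndarray, database: dict) -> str:
    """Speaker recognition: 'Who is speaking?' -- search an entire database."""
    return max(database, key=lambda name: cosine_similarity(embedding, database[name]))

def verify(embedding: np.ndarray, voiceprint: np.ndarray, threshold: float = 0.7) -> bool:
    """Speaker verification: 'Is this the claimed person?' -- one comparison, one threshold."""
    return cosine_similarity(embedding, voiceprint) >= threshold
```

The contrast is visible in the shape of the code: recognition is a search over every enrolled speaker, while verification is a single comparison against one stored voiceprint, which is why it is the easier and more accurate problem.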

Component 2: Biometric Voiceprints

A biometric voiceprint is a mathematical representation of a person's unique vocal characteristics — a high-dimensional vector that encodes the features that distinguish that person's voice from any other speaker, including an AI clone of that voice.

The voiceprint captures features across multiple levels:

- Spectral: pitch, formant structure, and MFCC-style features shaped by the speaker's vocal tract
- Prosodic: rhythm, cadence, and intonation habits
- Deep embeddings: high-dimensional acoustic features learned by the neural model

VeriCall builds this voiceprint for each contact automatically from real, genuine calls. The more real calls with a contact, the more complete and accurate the voiceprint — and the sharper the detection for that contact. The voiceprint is stored encrypted on your device only. It never leaves your phone.
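One simple way a voiceprint can improve with each genuine call is a running average of length-normalized embeddings. This is a sketch under that assumption; VeriCall's actual update rule is not described here.

```python
import numpy as np

def update_voiceprint(voiceprint, new_embedding, n_calls):
    """Fold the embedding from one new genuine call into the stored voiceprint.

    voiceprint: current unit-length voiceprint vector, or None before the first call
    n_calls:    number of genuine calls already folded in
    """
    e = new_embedding / np.linalg.norm(new_embedding)   # length-normalize first
    if voiceprint is None:
        return e
    updated = (voiceprint * n_calls + e) / (n_calls + 1)  # running average
    return updated / np.linalg.norm(updated)              # keep it unit length
```

Each genuine call nudges the stored vector toward the speaker's true center in embedding space, which is why more calls mean a sharper biometric model.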

Component 3: Why Voice Clones Fail Verification

An AI voice clone may replicate the features your ears use to recognize a person. But it cannot replicate the full biometric signature that speaker verification models use:

Biometric Divergence

Even acoustically perfect voice clones differ from the original speaker in the biometric feature space. The mathematical distance between a clone's voice embedding and the real speaker's voiceprint is detectably larger than the distance between two genuine utterances from the real speaker. This difference is invisible to human perception but measurable to the verification model.
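A synthetic illustration of this margin: the noise scales (0.1 for genuine utterance scatter, 0.5 for the clone's offset) and the 192-dimensional embedding size are invented for the demo and do not reflect any real embedding model.

```python
import numpy as np

rng = np.random.default_rng(42)
DIM = 192  # assumed embedding dimensionality

def distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance between length-normalized embeddings."""
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
    return float(np.linalg.norm(a - b))

voiceprint = rng.normal(size=DIM)

# Genuine utterances scatter tightly around the stored voiceprint...
genuine = [voiceprint + 0.1 * rng.normal(size=DIM) for _ in range(20)]
# ...while a clone sits at a systematically larger offset in embedding space.
clone = voiceprint + 0.5 * rng.normal(size=DIM)

genuine_dists = [distance(u, voiceprint) for u in genuine]
clone_dist = distance(clone, voiceprint)
```

The point of the toy numbers is the gap itself: every genuine utterance lands closer to the voiceprint than the clone does, so a threshold placed between the two clusters separates them even when a human ear cannot.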

Liveness Detection

Real human speech contains liveness signals that AI-generated audio does not faithfully reproduce: natural breathing, irregular micro-pauses, and the small moment-to-moment variations in pitch and energy that come from a live vocal tract.
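One such signal is easy to illustrate: genuine conversational speech contains frequent low-energy pauses and breaths. The pause-ratio measure below is a deliberately simple stand-in; production liveness models use far richer cues.

```python
import numpy as np

def pause_ratio(x: np.ndarray, sr: int = 8000, frame_ms: int = 20,
                thresh_db: float = -30.0) -> float:
    """Fraction of frames whose energy sits well below the loudest frame."""
    frame = sr * frame_ms // 1000
    n = len(x) // frame
    energy = np.array([np.mean(x[i * frame:(i + 1) * frame] ** 2) for i in range(n)])
    db = 10 * np.log10(energy / (energy.max() + 1e-12) + 1e-12)
    return float(np.mean(db < thresh_db))
```

Audio with natural gaps scores a higher pause ratio than a continuous synthetic tone; a real detector would combine many such statistics rather than rely on one.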

Double Encoding Artifacts

Real-time voice conversion introduces a double encoding signature: the original audio is encoded by the voice cloning model, then transmitted over phone compression (G.711, Opus, etc.), creating a distinctive artifact pattern that differs from genuine speech transmitted over the same channel.
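A toy simulation of the effect: the mu-law companding below follows G.711's general shape, while `vocoder_stub` is a crude stand-in for a cloning model's lossy resynthesis, not any real system. The double-encoded path accumulates measurably more residual error than audio that passed through the phone codec alone.

```python
import numpy as np

MU = 255.0  # G.711 mu-law companding constant

def mulaw_codec(x: np.ndarray, bits: int = 8) -> np.ndarray:
    """One pass through a G.711-style mu-law codec: compand, quantize, expand."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    q = np.round(y * 2 ** (bits - 1)) / 2 ** (bits - 1)
    return np.sign(q) * np.expm1(np.abs(q) * np.log1p(MU)) / MU

def vocoder_stub(x: np.ndarray, bits: int = 6) -> np.ndarray:
    """Crude stand-in for a cloning model's lossy resynthesis (NOT a real vocoder)."""
    return np.round(x * 2 ** (bits - 1)) / 2 ** (bits - 1)

t = np.linspace(0, 1, 8000, endpoint=False)
speech = 0.6 * np.sin(2 * np.pi * 220 * t)      # toy "speech" signal

genuine = mulaw_codec(speech)                   # single encoding: phone codec only
cloned = mulaw_codec(vocoder_stub(speech))      # double encoding: model, then codec

residual_genuine = np.mean((speech - genuine) ** 2)
residual_cloned = np.mean((speech - cloned) ** 2)
```

In practice detectors look at the spectral shape of these residuals rather than raw error energy, but the underlying idea is the same: two lossy encoders in series leave a signature that one encoder alone does not.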

Temporal Inconsistencies

Real-time voice conversion operates with a buffer — typically 20–200ms — that introduces subtle temporal artifacts in the relationship between speech events. Speaker verification models trained on these patterns can detect the signature of real-time conversion.

Component 4: On-Device Neural Inference

Running speaker verification in real time on a live phone call requires two things: very low latency inference and complete audio privacy. Both require on-device processing.

Under 1 second: VeriCall delivers a verification confidence score within one second of a call connecting, before any significant conversation has taken place. On-device inference on Apple's Neural Engine makes this latency possible.

VeriCall uses Apple's Neural Engine — the dedicated machine learning accelerator built into every modern iPhone — and Apple's CoreML framework to run the speaker verification model. The Neural Engine is designed for exactly this type of real-time, low-latency inference on high-dimensional audio data.

The benefits of on-device processing:

- Privacy: call audio and biometric data never leave the phone
- Latency: sub-second inference with no network round-trip
- Reliability: verification works even without a data connection
- Security: no central server of voiceprints exists to be compromised

Component 5: Continuous and Adaptive Monitoring

VeriCall does not perform a single check at the start of the call and stop. It monitors continuously throughout the call — an important capability because some attacks use a real human voice at the start of the call and switch to a clone mid-conversation. High-stakes attacks like the grandparent voice cloning scam depend on this sustained deception throughout the entire call.

The adaptive learning component means the voiceprint improves over time as more genuine calls with a contact are recorded. Early calls with a new contact provide the foundation; subsequent genuine calls refine and sharpen the biometric model, making detection increasingly accurate and reducing false alerts.
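The mid-call switch scenario reduces to watching the confidence trace over time. A minimal sketch, in which the 0.25 drop threshold is invented for illustration:

```python
ALERT_DROP = 0.25  # assumed alert threshold, not VeriCall's real value

def mid_call_alerts(scores):
    """Return indices where confidence falls sharply below its running peak,
    as it would if a real voice were swapped for a clone mid-conversation."""
    peak, alerts = 0.0, []
    for t, s in enumerate(scores):
        peak = max(peak, s)
        if peak - s > ALERT_DROP:
            alerts.append(t)
    return alerts
```

Comparing each score to the running peak, rather than to a fixed floor, is what makes a sudden drop stand out even for speakers whose baseline confidence varies.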

The Full Detection Pipeline

01. Audio capture
Incoming call audio is captured on-device. No recording is transmitted externally. The audio stream is processed in real time by the VeriCall analysis pipeline.

02. Feature extraction
The on-device model extracts acoustic features from the incoming audio — MFCCs, spectral features, deep embeddings — building a representation of the caller's voice characteristics in real time.

03. Voiceprint comparison
The extracted features are compared against the stored voiceprint for the contact using a neural speaker verification model running on the Neural Engine. The mathematical distance between the two is computed.

04. Liveness analysis
Simultaneously, liveness signals are analyzed — detecting artifacts consistent with AI voice synthesis, real-time voice conversion, or replay attacks that would indicate the voice is not from a live human speaker.

05. Confidence score output
Within one second, a live confidence score is displayed. The score updates continuously throughout the call. A significant drop in confidence mid-call triggers an immediate alert.
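The five steps can be sketched end-to-end. Everything in this sketch is a stand-in: the random-projection "feature extractor", the trivial liveness heuristic, and the invented 70/30 score weighting exist only to show the data flow, not how VeriCall's models work.

```python
import numpy as np

rng = np.random.default_rng(0)
PROJ = rng.normal(size=(64, 160))  # stand-in feature extractor (random projection)

def extract(chunk: np.ndarray) -> np.ndarray:
    """Step 02: map a 160-sample audio chunk to a unit-length feature vector."""
    e = PROJ @ chunk
    return e / np.linalg.norm(e)

def verify_score(features: np.ndarray, voiceprint: np.ndarray) -> float:
    """Step 03: cosine similarity against the stored voiceprint (both unit length)."""
    return float(features @ voiceprint)

def liveness_score(chunk: np.ndarray) -> float:
    """Step 04: trivial stand-in that rewards natural short-term energy variation."""
    e = chunk.reshape(8, -1).var(axis=1)
    return float(np.clip(e.std() / (e.mean() + 1e-12), 0.0, 1.0))

def run_pipeline(chunks, voiceprint):
    """Steps 01-05: per-chunk features, voiceprint match, liveness, running score."""
    return [0.7 * verify_score(extract(c), voiceprint) + 0.3 * liveness_score(c)
            for c in chunks]
```

Because the score is computed per chunk, the output is a trace rather than a single verdict, which is what makes the mid-call alerting in step 05 possible.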


Frequently Asked Questions

How does AI voice clone detection work?

AI voice clone detection uses biometric speaker verification — comparing the caller's voice against a stored mathematical voiceprint of the real person. VeriCall builds voiceprints from genuine calls, then runs a speaker verification model on-device during live calls to detect whether the incoming voice matches or diverges from the real person's biometric. The detection takes under 1 second.

What is a biometric voiceprint?

A biometric voiceprint is a mathematical vector encoding a person's unique vocal characteristics — pitch, formant structure, prosodic patterns, and deep acoustic features. VeriCall builds a voiceprint for each contact from real calls, stores it encrypted on-device only, and uses it to verify callers in real time. The voiceprint improves in accuracy with more genuine calls.

Why does VeriCall process everything on-device?

On-device processing keeps all audio and biometric data private — no voice data ever leaves your phone. It also enables sub-second inference latency without network round-trips, works without a data connection, and eliminates any central server that could be compromised. VeriCall uses Apple's Neural Engine and CoreML for all inference.


Real-Time Detection.
Zero Cloud.

VeriCall's on-device AI detects voice clones in under 1 second. Join the private beta and be among the first to use it.

Private beta · No spam · Founding members only