On-Device Disfluency Detection with Voice Stress Association Analysis: A Mobile Framework Validated on SEP-28K

Nazar Kozak

Submitted to IEEE Signal Processing Letters, 2026

Abstract

We present DisfluoSDK, an on-device framework for real-time speech disfluency detection and voice stress analysis, motivated by privacy-sensitive speech therapy applications. The system classifies five disfluency types (prolongation, block, sound repetition, word repetition, and interjection) from 3-second mel-spectrogram clips using CoreML on iOS. We evaluate two architectures on the SEP-28K dataset (20,131 clips, 5-fold episode-grouped cross-validation): a custom 4-block convolutional neural network (617K parameters, 1.2 MB) achieving a macro-F1 of 0.382, and an adapted ResNet-18 (11.2M parameters, 21 MB) achieving a macro-F1 of 0.404. Both models achieve sub-millisecond inference (0.62–0.79 ms) on an Apple M1 processor, enabling more than 3,000× real-time processing. We additionally examine the relationship between voice stress markers (jitter, shimmer, fundamental frequency variability) and disfluency labels across a subset of 14,645 clips with valid pitch estimates, finding no practically meaningful linear associations (all absolute correlations below 0.05, negligible effect sizes). To our knowledge, this is the first framework to perform multi-type disfluency classification entirely on-device.
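For readers unfamiliar with the voice stress markers mentioned above, the sketch below illustrates Praat-style local jitter and shimmer (mean absolute cycle-to-cycle variation in pitch period and peak amplitude, normalized by their means) and the kind of per-clip linear association test the abstract refers to. This is a minimal illustration, not code from the paper; the function names and toy data are assumptions of ours.

```python
import numpy as np

def jitter_shimmer(periods, amplitudes):
    """Local jitter and shimmer: mean absolute difference between
    consecutive pitch periods (resp. cycle peak amplitudes),
    normalized by the mean period (resp. amplitude)."""
    periods = np.asarray(periods, dtype=float)
    amplitudes = np.asarray(amplitudes, dtype=float)
    jitter = np.mean(np.abs(np.diff(periods))) / np.mean(periods)
    shimmer = np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)
    return jitter, shimmer

def pearson(x, y):
    """Pearson correlation between a per-clip stress marker x and
    (binary) disfluency labels y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    x, y = x - x.mean(), y - y.mean()
    return float((x @ y) / np.sqrt((x @ x) * (y @ y)))

# Toy voiced clip: ~125 Hz pitch with small cycle-to-cycle perturbation.
rng = np.random.default_rng(0)
periods = 0.008 + rng.normal(0.0, 1e-4, 300)   # seconds per glottal cycle
amps = 0.5 + rng.normal(0.0, 5e-3, 300)        # per-cycle peak amplitude
j, s = jitter_shimmer(periods, amps)
```

In the paper's setting, one jitter/shimmer/F0-variability value would be computed per clip (over its voiced frames) and correlated against each of the five disfluency labels; the reported result is that all such correlations stay below 0.05 in absolute value.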

Keywords
Disfluency detection, stuttering, on-device inference, CoreML, voice stress analysis, SEP-28K, mobile speech processing
Status
Under peer review at IEEE Signal Processing Letters (submitted March 31, 2026)
Models
DisfluencyCNN (617K params, 1.2 MB, 0.62 ms) · ResNet-18 Adapted (11.2M params, 21 MB, 0.79 ms)
Dataset
SEP-28K (20,131 clips, 5-fold episode-grouped cross-validation)
Author
Nazar Kozak — Kozak Technologies Inc., Los Angeles, CA, USA
Contact
nzrkzk@gmail.com · ORCID