Submitted to IEEE Signal Processing Letters, 2026
We present DisfluoSDK, an on-device framework for real-time speech disfluency detection and voice stress analysis, motivated by privacy-sensitive speech therapy applications. The system classifies five disfluency types (prolongation, block, sound repetition, word repetition, and interjection) from 3-second mel-spectrogram clips using Core ML on iOS. We evaluate two architectures on the SEP-28K dataset (20,131 clips, 5-fold episode-grouped cross-validation): a custom 4-block convolutional neural network (617K parameters, 1.2 MB) that achieves a macro-F1 of 0.382, and an adapted ResNet-18 (11.2M parameters, 21 MB) that achieves a macro-F1 of 0.404. Both models run inference in under a millisecond (0.62–0.79 ms) on an Apple M1 processor, corresponding to more than 3,000× real-time processing for 3-second clips. We additionally examine the relationship between voice stress markers (jitter, shimmer, and fundamental-frequency variability) and disfluency labels on a subset of 14,645 clips with valid pitch estimates, finding no practically meaningful linear associations (all absolute correlations below 0.05, negligible effect sizes). To our knowledge, this is the first framework to perform multi-type disfluency classification entirely on-device.
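The stress-marker analysis summarized above can be sketched in a few lines. The jitter definition (mean absolute difference of consecutive pitch periods, normalized by the mean period) and the point-biserial correlation (Pearson's r with a binary label) are standard; the data below is synthetic and the function names are illustrative, not the paper's actual pipeline.

```python
import random
import statistics

def jitter(periods):
    """Mean absolute difference of consecutive pitch periods,
    normalized by the mean period (a standard jitter definition)."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / statistics.mean(periods)

def pearson_r(x, y):
    """Pearson correlation; with a binary y this equals the
    point-biserial correlation between a marker and a label."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Synthetic stand-in data: per-clip pitch periods around 120 Hz,
# and binary disfluency labels drawn independently of the marker,
# so the correlation should be near zero by construction.
rng = random.Random(0)
jitters, labels = [], []
for _ in range(2000):
    periods = [1 / 120 + rng.gauss(0, 2e-4) for _ in range(50)]
    jitters.append(jitter(periods))
    labels.append(rng.randint(0, 1))

r = pearson_r(jitters, labels)
print(f"point-biserial r = {r:+.3f}")
```

A shimmer column would follow the same pattern with per-cycle amplitudes in place of periods; the paper's reported "absolute correlations below 0.05" is the same kind of statistic computed on the real 14,645-clip subset.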