|Joaquin Gonzalez-Rodriguez, Juana Gil, Rubén Pérez and Javier Franco-Pedroso|
Speaker comparison, as stressed by the current NIST i-vector Machine Learning Challenge where the speech signals are not available, can be effectively performed through pattern recognition algorithms comparing compact representations of the speaker identity information in a given utterance. However, this i-vector representation ignores relevant segmental (non-cepstral) and supra-segmental speaker information present in the original speech signal that could significantly improve the decision-making process. In order to confirm this hypothesis in the context of NIST SRE trials, two experienced phoneticians have performed a detailed perceptual and instrumental analysis of 18 i-vector-based falsely accepted trials from NIST HASR 2010 and SRE 2010, trying to find noticeable differences between the two utterances in each given trial. Remarkable differences were found in all trials under detailed analysis, with the combination of observed differences varying per trial, as expected: specific significant differences appeared in voice quality (creakiness, breathiness, etc.), rhythmic and tonal features, and pronunciation patterns, some of them compatible with possible variations across recording sessions and others highly incompatible with the same-speaker hypothesis. The results of this analysis suggest the value of developing banks of non-cepstral segmental and supra-segmental attribute detectors, imitating some of the trained abilities of a non-native phonetician. Those detectors can contribute in a bottom-up decision approach to speaker recognition and provide descriptive information on the different contributions to identity in a given speaker comparison.
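As a point of reference for the i-vector comparison the abstract describes, the sketch below shows one common baseline for scoring a trial: cosine similarity between two fixed-length i-vectors. The dimensionality, threshold, and function names are illustrative assumptions, not details taken from the paper, and real systems typically add channel compensation (e.g. LDA/WCCN) and PLDA scoring before thresholding.

```python
import numpy as np

def cosine_score(w1: np.ndarray, w2: np.ndarray) -> float:
    """Cosine similarity between two i-vectors.

    Higher scores indicate greater support for the same-speaker
    hypothesis; scores lie in [-1, 1].
    """
    return float(np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2)))

def same_speaker(w1: np.ndarray, w2: np.ndarray, threshold: float = 0.5) -> bool:
    """Hard accept/reject decision for one trial.

    The threshold is a hypothetical calibration point for illustration;
    in practice it is tuned on development trials.
    """
    return cosine_score(w1, w2) >= threshold

# Illustrative usage with random 400-dimensional "i-vectors"
# (400 is a typical dimensionality, assumed here, not from the paper).
rng = np.random.default_rng(0)
enroll = rng.standard_normal(400)
test = rng.standard_normal(400)
score = cosine_score(enroll, test)
```

A trial falsely accepted by such a scorer is exactly the case the paper examines: the compact cepstral representation scores high even though segmental and supra-segmental cues audible to a phonetician point to different speakers.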