INTREPIDO: an automated forensic voice comparison tool

INTREPIDO provides the user with an operational environment for conducting forensic voice comparison, using an automated approach.

The primary users of the software are professionals in the investigative and judicial sectors: experts, technical consultants, personnel from judicial police, and law enforcement officers in general.

In its current version, INTREPIDO relies on the extraction of voice recording features known as “embeddings,” through a component implementing an ECAPA-TDNN neural network (Emphasized Channel Attention, Propagation and Aggregation – Time Delay Neural Network). The characteristics of this neural network are described in Desplanques et al. (2020). The first international publication applying this neural network to forensic voice comparison is by the authors: Sigona & Grimaldi (2024).

Metrics for Monitoring the Reliability of Voice Comparison

System performance is commonly described using both numerical and graphical evaluation metrics. The metrics considered are the following (cf. Brümmer & du Preez, 2006; van Leeuwen & Brümmer, 2007; González-Rodríguez et al., 2007; Morrison, 2011; Drygajlo et al., 2015; Meuwly et al., 2017):

  • Numerical metrics: CLLR (log-likelihood-ratio cost), pooled and mean; discrimination loss; calibration loss; 95% Credible Interval (CI); Equal Error Rate (EER)
  • Graphs: Calibration Plot, Tippett Plot; DET (Detection Error Tradeoff) Plot; Receiver Operating Characteristics (ROC) Plot; Validation Scatter Plot; Empirical Cross Entropy (ECE) Plot

For further information about the software: impavido@biometriaforense.it

Further Insights

CLLR.pooled (Log Likelihood Ratio Cost): a numerical parameter summarizing the overall system quality, given by Eq. (1):

$$
C_{llr} = \frac{1}{2}\left[\frac{1}{N_{ss}}\sum_{i=1}^{N_{ss}}\log_2\!\left(1+\frac{1}{LR_{ss,i}}\right)+\frac{1}{N_{ds}}\sum_{j=1}^{N_{ds}}\log_2\!\left(1+LR_{ds,j}\right)\right] \tag{1}
$$

where N.ss and N.ds are the numbers of same-speaker and different-speaker comparisons, respectively.

LR.ss denotes the likelihood-ratio values from same-speaker comparisons, while LR.ds denotes those from different-speaker comparisons. Since high LR.ss values correctly support the same-speaker hypothesis and low LR.ds values correctly support the different-speakers hypothesis, smaller CLLR.pooled values indicate better performance. A system that provides no useful information, always responding with a likelihood ratio of 1, has a CLLR.pooled of exactly 1.
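As a sketch, Eq. (1) can be computed directly from the two sets of likelihood ratios (the function name `cllr_pooled` is illustrative and not part of INTREPIDO):

```python
import math

def cllr_pooled(lr_ss, lr_ds):
    """Pooled log-likelihood-ratio cost (Cllr), as in Eq. (1).

    lr_ss: likelihood ratios from same-speaker comparisons
    lr_ds: likelihood ratios from different-speaker comparisons
    """
    ss_term = sum(math.log2(1 + 1 / lr) for lr in lr_ss) / len(lr_ss)
    ds_term = sum(math.log2(1 + lr) for lr in lr_ds) / len(lr_ds)
    return 0.5 * (ss_term + ds_term)

# An uninformative system that always outputs LR = 1 scores Cllr = 1.
print(cllr_pooled([1.0, 1.0], [1.0, 1.0]))  # → 1.0
```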

95% Credible Interval (CI): a measure of the precision (reliability) of the system’s output. It quantifies the variability among the multiple likelihood-ratio values obtained when a recording of a speaker is compared against all available recordings (if any) belonging to one other speaker, who may be the same speaker or a different one. This metric is calculated using the parametric procedure described in Morrison (2011) and is reported as ± orders of magnitude (log10 scale).
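A minimal sketch of the parametric procedure, assuming a normal approximation (±1.96 standard deviations) over the pooled within-group deviations of log10(LR); the function name and the group layout are illustrative, and the exact estimator used by INTREPIDO follows Morrison (2011):

```python
import math

def ci95_orders_of_magnitude(groups):
    """95% credible interval (± orders of magnitude) of log10(LR).

    `groups` is a list of lists; each inner list holds the LR values from
    comparing one recording against all recordings of one other speaker.
    Pools the within-group deviations from each group mean, estimates
    their standard deviation, and reports ±1.96 sd.
    """
    deviations = []
    for group in groups:
        logs = [math.log10(lr) for lr in group]
        mean = sum(logs) / len(logs)
        deviations.extend(x - mean for x in logs)
    # degrees of freedom: one mean estimated per group
    dof = len(deviations) - len(groups)
    sd = math.sqrt(sum(d * d for d in deviations) / dof)
    return 1.96 * sd
```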

CLLR.mean (Likelihood Ratio Cost, accuracy only): a measure of the accuracy (validity) of the system’s output. Following Morrison and Enzinger (2016), this is the same as the CLLR.pooled metric, except that whereas CLLR.pooled aggregates all test results, CLLR.mean is calculated on the averages of the groups defined in the 95% CI metric (each group consisting of multiple likelihood-ratio values, as described above).
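The group-averaging step can be sketched as follows; whether the averages are taken over LRs or over log-LRs is a detail of the published procedure, and this sketch (with an illustrative function name) averages the LR values directly:

```python
import math

def cllr_mean(ss_groups, ds_groups):
    """Cllr computed on group-mean LRs rather than pooled LRs (a sketch).

    Each element of ss_groups / ds_groups is one group of LR values,
    as defined for the 95% CI metric; the groups are first averaged,
    then the usual Cllr formula is applied to the group means.
    """
    ss_means = [sum(g) / len(g) for g in ss_groups]
    ds_means = [sum(g) / len(g) for g in ds_groups]
    ss_term = sum(math.log2(1 + 1 / lr) for lr in ss_means) / len(ss_means)
    ds_term = sum(math.log2(1 + lr) for lr in ds_means) / len(ds_means)
    return 0.5 * (ss_term + ds_term)
```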

CLLR.min (Discrimination Loss): a measure of the quality of the embedding extraction phase, i.e., the quality of the score values. It is a CLLR calculated after the LR values from the test results have been optimized using the non-parametric pool-adjacent-violators (PAV) procedure, which involves training and testing on the same data. Therefore, this metric is not representative of expected performance on new test data. According to Meuwly et al. (2017), discrimination power represents the ability to distinguish between forensic comparisons where different propositions are true.
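A self-contained sketch of the PAV optimization: scores are sorted, same-speaker/different-speaker labels are isotonically regressed onto them (pool-adjacent-violators), the resulting oracle posteriors are converted back to LRs at the empirical prior odds, and Cllr is recomputed. The posterior clipping (`eps`) is a simplification of the smoothing used in production toolkits, and all names are illustrative:

```python
import math

def pav(y):
    """Pool-adjacent-violators: non-decreasing isotonic fit to y."""
    blocks = []  # each block is [sum, count]
    for v in y:
        blocks.append([v, 1])
        # merge while the running block means decrease
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    out = []
    for s, n in blocks:
        out.extend([s / n] * n)
    return out

def cllr_min(scores_ss, scores_ds, eps=1e-6):
    """Discrimination loss: Cllr after PAV-optimal calibration (a sketch)."""
    data = sorted([(s, 1) for s in scores_ss] + [(s, 0) for s in scores_ds])
    posteriors = pav([label for _, label in data])
    prior_odds = len(scores_ss) / len(scores_ds)
    ss_lrs, ds_lrs = [], []
    for (_, label), p in zip(data, posteriors):
        p = min(max(p, eps), 1 - eps)  # avoid infinite log-LRs
        lr = (p / (1 - p)) / prior_odds
        (ss_lrs if label == 1 else ds_lrs).append(lr)
    ss_term = sum(math.log2(1 + 1 / lr) for lr in ss_lrs) / len(ss_lrs)
    ds_term = sum(math.log2(1 + lr) for lr in ds_lrs) / len(ds_lrs)
    return 0.5 * (ss_term + ds_term)
```

Because the PAV mapping is trained and tested on the same scores, perfectly separable scores yield a CLLR.min near zero, which is exactly why the metric is optimistic about new data.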

CLLR.cal (Calibration Loss): equal to the difference CLLR.pooled - CLLR.min. It measures the quality of the presentation phase, i.e., the calibration of the likelihood ratio.

EER (Equal Error Rate): another widely used metric for evaluating the system’s discrimination power. Likelihood-ratio test values can be combined with prior probabilities to obtain posterior probabilities; these can then be compared to a threshold to classify a test comparison as same speaker (prosecution hypothesis) or different speakers (defense hypothesis). False-identification and false-rejection error rates can then be calculated as the proportions of misclassifications. The EER is the common error rate obtained when the prior probabilities and the threshold are adjusted so that the two error rates are equal.
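On a finite test set the two error-rate curves are step functions, so a common approximation is to sweep the threshold over the observed scores and take the point where false rejection and false acceptance are most nearly equal. A sketch (illustrative name, not INTREPIDO's implementation):

```python
def eer(scores_ss, scores_ds):
    """Approximate equal error rate from same- and different-speaker scores.

    Sweeps a decision threshold over the observed scores and returns the
    mean of the false-rejection and false-acceptance rates at the point
    where they are closest to each other.
    """
    best = None
    for t in sorted(scores_ss + scores_ds):
        fr = sum(s < t for s in scores_ss) / len(scores_ss)   # false rejections
        fa = sum(s >= t for s in scores_ds) / len(scores_ds)  # false acceptances
        if best is None or abs(fr - fa) < best[0]:
            best = (abs(fr - fa), (fr + fa) / 2)
    return best[1]
```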

Essential Bibliography

  • Brümmer, N., & Du Preez, J. A. (2006). Application-independent evaluation of speaker detection. Computer Speech & Language, 20(2–3), 230–275. https://doi.org/10.1016/j.csl.2005.08.001
  • Desplanques, B., Thienpondt, J., Demuynck, K. (2020). ECAPA-TDNN: emphasized channel attention, propagation and aggregation in TDNN based speaker verification. In: Proc. Interspeech 2020, pp. 3830–3834. https://doi.org/10.21437/Interspeech.2020-2650
  • Meuwly, D. (2001). Reconnaissance de locuteurs en sciences forensiques: l’apport d’une approche automatique. PhD dissertation, University of Lausanne.
  • Meuwly, D., Ramos, D., & Haraksim, R. (2017). A guideline for the validation of likelihood ratio methods used for forensic evidence evaluation. Forensic Science International, 276, 142–153. https://doi.org/10.1016/j.forsciint.2016.03.048
  • Morrison, G. S. (2010). Forensic voice comparison. In: Freckelton, I., Selby, H. (Eds.), Expert Evidence. Thomson Reuters, Sydney, Australia, p. 99. https://expertevidence.forensicvoice-comparison.net
  • Morrison, G. S. (2011). Measuring the validity and reliability of forensic likelihood-ratio systems. Science & Justice, 51(3), 91–98. https://doi.org/10.1016/j.scijus.2011.03.002
  • Morrison, G. S., & Enzinger, E. (2016). Multi-laboratory evaluation of forensic voice comparison systems under conditions reflecting those of a real forensic case (forensic_eval_01) – Introduction. Speech Communication, 85, 119–126. https://doi.org/10.1016/j.specom.2016.07.006
  • Parzen, E. (1962). On estimation of a probability density function and mode. Ann. Math. Stat. 33(3), 1065–1076. https://doi.org/10.1214/aoms/1177704472
  • Sigona, F., & Grimaldi, M. (2024). Evaluation of Emphasized Channel Attention, Propagation and Aggregation in TDNN based automatic speaker recognition software under conditions reflecting those of a real forensic voice comparison case (forensic_eval_01). Speech Communication. https://doi.org/10.1016/j.specom.2024.103045