
		<paper>
			<loc>https://jjcit.org/paper/223</loc>
			<title>BEYOND WORDS: HARNESSING SPEECH SOUND FOR SPEAKER AGE AND GENDER DETECTION USING 1D CNN ARCHITECTURE WITH SELF-ATTENTION MECHANISM</title>
			<doi>10.5455/jjcit.71-1703265368</doi>
			<authors>Umniah Hameed Jaid,Alia Karim Abdulhasan</authors>
			<keywords>Speaker age,Speaker gender,Speaker profiling,Wav2vec embedding,Attention mechanism</keywords>
			<citation>1</citation>
			<views>4716</views>
			<downloads>1171</downloads>
			<received_date>22-Dec.-2023</received_date>
			<revised_date>9-Mar.-2024</revised_date>
			<accepted_date>20-Mar.-2024</accepted_date>
			<abstract>Beyond the immediate content of speech, the voice carries rich information about a speaker's demographics,
including age and gender. Estimating a speaker's age and gender has a wide range of applications, from voice
forensic analysis to personalized advertising, healthcare monitoring and human-computer interaction.
However, pinpointing a speaker's precise age remains difficult because of age ambiguity: utterances from
speakers of adjacent ages are frequently indistinguishable. To address this, we propose a novel, end-to-end
approach that transforms raw audio from Mozilla's Common Voice dataset into high-quality feature representations
using Wav2Vec2.0 embeddings. These embeddings are then fed into our self-attention-based convolutional neural
network (CNN) model. To address age ambiguity, we evaluate the effects of different loss functions, such as focal
loss and Kullback-Leibler (KL) divergence loss. Additionally, we evaluate the estimation accuracy at different
speech durations. Experimental results on the Common Voice dataset demonstrate the efficacy of our approach,
with an age-estimation accuracy of 87% for male speakers, 91% for female speakers and 89% overall, as well
as an accuracy of 99.1% for gender prediction.</abstract>
		</paper>


