
		<paper>
			<loc>https://jjcit.org/paper/266</loc>
			<title>JORDANIAN ARABIC TO MODERN STANDARD ARABIC TRANSLATION USING A LARGE MODEL TUNED ON A PURPOSE-BUILT DATASET AND SYNTHETIC ERROR INJECTION</title>
			<doi>10.5455/jjcit.71-1740933141</doi>
			<authors>Gheith A. Abandah,Moath R. Khaleel,Iyad F. Jafar,Mohammad R. Abdel-Majeed,Yousef H. Hamdan,Ashraf E. Suyyagh,Asma A. Abdel-Karim,Shorouq M. AlAwawdeh</authors>
			<keywords>Jordanian Arabic,Modern Standard Arabic,Dialectal translation,Large language models,Synthetic error injection,Natural-language processing,ByT5</keywords>
			<citation>2</citation>
			<views>2459</views>
			<downloads>727</downloads>
			<received_date>2-Mar.-2025</received_date>
			<revised_date>7-Jun.-2025</revised_date>
			<accepted_date>8-Jun.-2025</accepted_date>
			<abstract>This paper addresses the challenge of accurately translating Jordanian Arabic into Modern Standard Arabic 
(MSA) and correcting common linguistic errors. Although MSA is the formal standard for Arabic communication, 
the widespread use of local dialects in social media and everyday interactions often results in texts laden with 
spelling and grammatical issues. To overcome these challenges, we present an end-to-end system based on a newly 
constructed Jordanian Arabic dataset (JODA) comprising 59,135 sentences, as well as the Tashkeela dataset 
perturbed through synthetic error injection. We employ ByT5, a large pre-trained language model that processes 
text at the byte level, making it resilient to spelling variations and morphological complexities common in Arabic 
dialects. Our experimental results show that fine-tuning ByT5 on JODA and a 10% error-injected Tashkeela subset 
notably improves both BLEU score and character error rate (CER). Combining JODA with the synthetically 
modified Tashkeela data reduces the CER to 4.64% on the Test-200 test set and 1.65% on the TSMTS test set. 
Moreover, manual inspection shows that the model produces correct or near-correct translations in most cases. 
Finally, we developed a custom smartphone keyboard and a web portal to demonstrate how the system can be 
made easily accessible to interested users, offering a practical solution for millions of Arabic speakers seeking to 
produce accurate, diacritized MSA text. This solution is currently limited to the Jordanian dialect; future work 
will focus on developing similar datasets and solutions for other Arabic dialects.</abstract>
		</paper>


