Part of speech (POS) tagging in Roman Urdu: datasets and models
Roman Urdu is widely used on social media, news websites, and in text messages across the subcontinent, making it a valuable data source for social media and text analytics, particularly in the Indo-Pak context. Despite this potential, little work has been done on Roman Urdu text analytics because of complexities such as the lack of a standard lexicon, the informal nature of the text, and the lack of text processing tools. A Roman Urdu Part-of-Speech (POS) dataset and a robust tagger are therefore important for text analytics in Roman Urdu. In this work, we created a comprehensive, large-scale Roman Urdu POS dataset and developed a Roman Urdu POS tagger, laying the foundation for more advanced text analysis. Our approach used Hidden Markov Models, Neural Networks, state-of-the-art transformer models, and Large Language Models as baselines. We curated two distinct test datasets, one with lexical variation and one without, which allowed us to test the model’s robustness to the linguistic challenges posed by lexical variation. Our tagger achieves an accuracy of 96% on test data without lexical variation and 86% on test data with lexical variation. We also evaluated state-of-the-art Large Language Models (GPT-4o and Llama-3-8B) in zero-shot and few-shot settings; GPT-4o achieved up to 53.78% accuracy in the few-shot configuration, a substantial performance gap compared with specialized models. This work establishes a comprehensive framework for Roman Urdu POS tagging that addresses lexical variation, providing essential resources and benchmarks for advancing Roman Urdu natural language processing research.
Low resource, Part of speech, Roman Urdu
Faheem, Ali
24ea442f-f33b-4fcd-ae7c-b34e634aec09
Azam, Ubaid
243c228b-8e17-4bba-9b3f-c788c0f9e858
Ayub, Muhammad Sohaib
5aa0e601-e192-4cfc-8000-c5a6de73dd0c
Karim, Asim
d5c5d31c-3712-403b-8449-c60a2a82fb75
Faheem, Ali, Azam, Ubaid, Ayub, Muhammad Sohaib and Karim, Asim
(2025)
Part of speech (POS) tagging in Roman Urdu: datasets and models.
Language Resources and Evaluation.
(doi:10.1007/s10579-025-09865-w).
Text: Version of Record (restricted to repository staff only)
More information
Accepted/In Press date: 7 July 2025
e-pub ahead of print date: 30 July 2025
Identifiers
Local EPrints ID: 507092
URI: http://eprints.soton.ac.uk/id/eprint/507092
PURE UUID: 11047520-e339-4781-8bbc-815b40dfa46e
Catalogue record
Date deposited: 26 Nov 2025 17:48
Last modified: 28 Nov 2025 17:32
Contributors
Author: Ali Faheem
Author: Ubaid Azam
Author: Muhammad Sohaib Ayub
Author: Asim Karim