Part of speech (POS) tagging in Roman Urdu: datasets and models

Roman Urdu is widely used on social media, news websites, and in text messages across the subcontinent, which makes it a valuable data source for text analytics, particularly in the Indo-Pak context. Despite this potential, Roman Urdu text analytics has received limited attention because of several complexities: the lack of a standard lexicon, the informal nature of the text, and the scarcity of text processing tools. A Roman Urdu part-of-speech (POS) dataset and a robust tagger are therefore important building blocks for text analytics in Roman Urdu. In this work, we created a comprehensive, large-scale Roman Urdu POS dataset and developed a Roman Urdu POS tagger, laying the foundation for more advanced text analysis. We used Hidden Markov Models, neural networks, state-of-the-art transformer models, and Large Language Models as baselines. We curated two distinct test datasets, one with lexical variation and one without, which allowed us to test the model's robustness to the linguistic challenges posed by lexical variation. Our tagger achieves an accuracy of 96% on test data without lexical variation and 86% on test data with lexical variation. We also evaluated state-of-the-art Large Language Models (GPT-4o and Llama-3-8B) in zero-shot and few-shot settings; GPT-4o reached up to 53.78% accuracy in the few-shot configuration, a substantial performance gap compared to specialized models. This work establishes a comprehensive framework for Roman Urdu POS tagging that addresses the challenges of lexical variation, providing essential resources and benchmarks for advancing Roman Urdu natural language processing research.
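As an illustration of the evaluation setup described in the abstract, the following is a minimal sketch of a Hidden Markov Model tagger with Viterbi decoding, scored by token-level accuracy on two test sets (one without and one with spelling variants). It is not the authors' implementation: the toy sentences, the tag labels, the add-one smoothing, the `START` marker, and all function names are assumptions made for this example only.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical start-of-sentence marker


def train_hmm(tagged_sentences):
    """Count tag transitions and word emissions from (word, tag) sentences."""
    trans, emit, vocab = defaultdict(Counter), defaultdict(Counter), set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, vocab


def viterbi(words, trans, emit, vocab):
    """Return the most likely tag sequence under add-one smoothed counts."""
    if not words:
        return []
    tags = sorted(emit)

    def log_trans(prev, tag):
        total = sum(trans[prev].values())
        return math.log((trans[prev][tag] + 1) / (total + len(tags)))

    def log_emit(tag, word):
        # The extra +1 in the denominator reserves mass for unseen words.
        total = sum(emit[tag].values())
        return math.log((emit[tag][word] + 1) / (total + len(vocab) + 1))

    best = {t: log_trans(START, t) + log_emit(t, words[0]) for t in tags}
    backpointers = []
    for word in words[1:]:
        step, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[p] + log_trans(p, t))
            step[t] = best[prev] + log_trans(prev, t) + log_emit(t, word)
            pointers[t] = prev
        best = step
        backpointers.append(pointers)

    # Follow the backpointers from the best final tag to recover the path.
    tag = max(tags, key=best.get)
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


def token_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    pairs = [(g, p) for gs, ps in zip(gold, predicted) for g, p in zip(gs, ps)]
    return sum(g == p for g, p in pairs) / len(pairs)


# Toy data, invented for illustration; not taken from the paper's dataset.
train = [
    [("main", "PRON"), ("ghar", "NOUN"), ("ja", "VERB"), ("raha", "AUX"), ("hoon", "AUX")],
    [("wo", "PRON"), ("kitab", "NOUN"), ("parh", "VERB"), ("raha", "AUX"), ("hai", "AUX")],
]
test_clean = [[("wo", "PRON"), ("ghar", "NOUN"), ("ja", "VERB"), ("raha", "AUX"), ("hai", "AUX")]]
test_variant = [[("woh", "PRON"), ("ghr", "NOUN"), ("ja", "VERB"), ("rha", "AUX"), ("hai", "AUX")]]

trans, emit, vocab = train_hmm(train)
for name, test_set in [("without variation", test_clean), ("with variation", test_variant)]:
    gold = [[tag for _, tag in sent] for sent in test_set]
    pred = [viterbi([w for w, _ in sent], trans, emit, vocab) for sent in test_set]
    print(name, token_accuracy(gold, pred))
```

Reporting the two test sets separately is what makes the effect of spelling variation visible. On data this small the scores say nothing about real performance; the sketch only shows the shape of the comparison.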

Keywords: Low resource, Part of speech, Roman Urdu

DOI: 10.1007/s10579-025-09865-w

Pages: 4285-4312

Authors: Faheem, Ali; Azam, Ubaid; Ayub, Muhammad Sohaib; Karim, Asim

Published date: 30 July 2025

Faheem, Ali, Azam, Ubaid, Ayub, Muhammad Sohaib and Karim, Asim (2025) Part of speech (POS) tagging in Roman Urdu: datasets and models. Language Resources and Evaluation, 59 (4), 4285-4312. (doi:10.1007/s10579-025-09865-w).

Record type: Article

Text

Part of speech (POS) tagging in Roman Urdu: datasets and models - Version of Record

Restricted to Repository staff only

More information

Accepted/In Press date: 7 July 2025

e-pub ahead of print date: 30 July 2025

Published date: 30 July 2025

Keywords: Low resource, Part of speech, Roman Urdu

Identifiers

Local EPrints ID: 507092

URI: http://eprints.soton.ac.uk/id/eprint/507092

DOI: 10.1007/s10579-025-09865-w

PURE UUID: 11047520-e339-4781-8bbc-815b40dfa46e

Catalogue record

Date deposited: 26 Nov 2025 17:48

Last modified: 27 Jan 2026 18:07

Contributors

Author: Ali Faheem

Author: Ubaid Azam

Author: Muhammad Sohaib Ayub

Author: Asim Karim
