
Language barrier

Recent grad’s thesis project becomes a mission to better represent Urdu in software


Zeerak Ahmed hopes to build a programming infrastructure that developers could use to produce software in Urdu, or other non-Latin languages. (Photo by Sabeen Sheikh)

One afternoon, two of Zeerak Ahmed’s Spanish-speaking classmates noticed he was using his computer in English, rather than in his native language.

When they asked why the Pakistan native hadn’t set the computer to Urdu, they were surprised to learn that hardly any applications offer more than rudimentary support for the 39-letter Urdu alphabet.

That conversation served as the impetus for Ahmed’s thesis project for the master in design engineering program, offered jointly by the Harvard John A. Paulson School of Engineering and Applied Sciences and the Graduate School of Design. He set out to develop an Urdu keyboard for smartphones.

“There is an entire generation of children growing up in Pakistan who, because of the way technology is built, are losing touch with their native languages. And for many people who are immigrants like me, our difficulty in communicating in non-English languages, even though we want to, is becoming manifest,” said Ahmed, M.D.E. ’18. “I am trying to fight back against this loss of our own cultural heritage.”

With that inspiration in mind, Ahmed, who earned an undergraduate computer science degree from Princeton University in 2013, got to work. One of his biggest challenges at the beginning of the project was determining how to lay out the 39 letters in a way that would make sense for users.

Languages written in the Arabic script share a set of 21 basic shapes; in Urdu, the 39 cursive characters are created from those 21 shapes with the addition of dots or other symbols. The letters also change shape depending on where they are positioned in a word.


Ahmed overcame many unique challenges to produce this Urdu smartphone keyboard. (Image courtesy of Zeerak Ahmed)

Ahmed determined it would be much simpler for smartphone users to select from among the 21 shapes and then have the software add the dots and symbols afterward to complete each character.
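The idea of typing shapes rather than finished letters can be sketched in a few lines of Python. This is only an illustration, not Ahmed's actual layout or code: the five shape groups below, the key names such as `BEH_SHAPE`, and the helper functions are assumptions chosen to show how one shape key can stand for several dotted letters.

```python
from itertools import product

# Hypothetical shape groups (NOT the keyboard's real 21-key layout).
# Each key names one base shape; the value lists Urdu letters that share it
# and differ only in their dots or small marks.
SHAPE_GROUPS = {
    "BEH_SHAPE": "\u0628\u067e\u062a\u0679\u062b",   # beh, peh, teh, tteh, theh
    "HAH_SHAPE": "\u062c\u0686\u062d\u062e",         # jeem, tcheh, hah, khah
    "DAL_SHAPE": "\u062f\u0688\u0630",               # dal, ddal, thal
    "REH_SHAPE": "\u0631\u0691\u0632\u0698",         # reh, rreh, zain, jeh
    "SEEN_SHAPE": "\u0633\u0634",                    # seen, sheen
}

def letter_to_shape(letter):
    """Return the shape key a letter belongs to, or the letter itself if ungrouped."""
    for shape, letters in SHAPE_GROUPS.items():
        if letter in letters:
            return shape
    return letter

def candidates(keys):
    """Expand a sequence of pressed keys into every letter sequence they could mean."""
    # A key that is not a shape group (for example alef) stands only for itself.
    options = [SHAPE_GROUPS.get(key, key) for key in keys]
    return ["".join(combo) for combo in product(*options)]
```

Under these assumptions, pressing `BEH_SHAPE` followed by alef (`"\u0627"`) yields five possible two-letter strings, one for each letter in the group. The software then has to decide which one the user meant.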

But reducing the alphabet to 21 keys created unique algorithmic challenges. 

“The uncertainty in what you were typing is no longer a matter of whether or not you pressed the wrong key. Even if you pressed the correct key, the software has to guess the correct letter, because we are reducing the accuracy of the input,” he explained. “To get around that, we have to ensure the accuracy of our algorithm, and the text that it is learning from has to be even more rock solid.”
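A rough sketch of why the training text matters: if each key press is ambiguous, one simple way to guess the intended word is to keep only candidates that appear in a corpus and prefer the most frequent. The article does not describe Ahmed's algorithm, so the frequency ranking, the function name `rank_candidates`, and the four-word toy corpus below are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Letters sharing one base shape (an illustrative group, not the real layout).
BEH_SHAPE = "\u0628\u067e\u062a\u0679\u062b"  # beh, peh, teh, tteh, theh
ALEF = "\u0627"

def rank_candidates(key_options, corpus_counts):
    """Return candidate words found in the corpus, most frequent first.

    key_options: one string per key press, listing the letters that key may mean.
    corpus_counts: a Counter of word frequencies from the training text.
    """
    words = ["".join(combo) for combo in product(*key_options)]
    seen = [w for w in words if corpus_counts[w] > 0]
    return sorted(seen, key=lambda w: corpus_counts[w], reverse=True)

# Toy corpus, invented for this example: "baat" three times, "baab" once.
toy_corpus = ["\u0628\u0627\u062a"] * 3 + ["\u0628\u0627\u0628"]
counts = Counter(toy_corpus)

# Three key presses: beh-shape, alef, beh-shape (25 possible spellings).
ranked = rank_candidates([BEH_SHAPE, ALEF, BEH_SHAPE], counts)
```

With this toy corpus, only two of the 25 spellings survive, and the more frequent one is ranked first. A misspelled or mislabeled word in the corpus would surface as a wrong suggestion, which is why the quality of the training text is so important.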

Finding, and then cleaning, a large enough corpus of Urdu text to train the algorithm proved to be a major hurdle, one that took Ahmed far more time than he had originally anticipated.

And with each problem he solved, it seemed that a host of new technical roadblocks appeared, such as how to perform common string operations when programming with Arabic-script characters.

“When you have a string of text in any Latin character set, any programming language worth its salt can do a number of operations with it built in. You can just tell it what you want and it will give you what you want,” he said. “But for us, we had to build everything from the ground up. We went through every letter in the Arabic script that has ever been put out in Unicode and found a way to deal with it.”
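As one small example of the kind of operation he describes, comparing two Arabic-script words while ignoring optional vowel marks takes explicit Unicode handling. The sketch below uses Python's standard `unicodedata` module; the helper names `strip_marks` and `same_word` are hypothetical, and this is not a description of the Matnsaz code.

```python
import unicodedata

def strip_marks(text):
    """Remove combining marks (Unicode category Mn) from a string.

    Note: this also folds precomposed letters onto their base letter,
    for example alef with madda above (U+0622) becomes plain alef (U+0627).
    """
    decomposed = unicodedata.normalize("NFD", text)
    kept = [ch for ch in decomposed if unicodedata.category(ch) != "Mn"]
    return unicodedata.normalize("NFC", "".join(kept))

def same_word(a, b):
    """Compare two words while ignoring optional vowel marks."""
    return strip_marks(a) == strip_marks(b)
```

Here `same_word("\u0628\u064e\u0627\u062a", "\u0628\u0627\u062a")` treats a word written with a vowel mark and the same word written without it as equal. Whether folding marked letters this way is acceptable depends on the application, which is part of why each character needs a deliberate decision.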

While he continues to put finishing touches on the keyboard, and hopes to have a version ready for a public beta test soon, he has expanded his thesis work into an interdisciplinary research project, Matnsaz (which means “text instrument” in Urdu). He and his collaborators are now working to build a programming infrastructure that developers could use to produce software in Urdu, or other non-Latin languages.

He has published the massive corpus of Urdu text he spent months cleaning so other developers can use it to train their own algorithms. And as he continues to refine that corpus, he is also exploring additional functions of the keyboard, such as adding unique symbols that are commonly used in Urdu texts.

His hope is to scale the project up and find additional applications for this groundwork, as part of an effort to encourage more developers to pick up the mantle and keep the 1,000-year-old language alive in software.

“Our goal is not to just build one keyboard; our goal is to usher in a new kind of software era,” he said. “We’ve always assumed that tech progress is a question of who gets there first. What worries me is, with Urdu software, will we get there at all? I feel a responsibility to keep working on it and share what we’re producing with the world—we’ve waited too long for Urdu software. To let it go now would just be heartbreaking.”

