Researchers at NYUAD will release text prediction software for Gulf Arabic using a 200,000 word collection of Gulf vocabulary completed last year. Mathew Kurian / The National

How do you say 'Hey Siri' in spoken Arabic? NYUAD lab is working on Gulf dialects to find the answer



How do you talk to a robot in Arabic? Is it best to address it in the formal language of news broadcasters or as you would speak to a friend?

It is a philosophical question at the heart of what language means and what Arabic is today.

“I usually joke about how even in the Arab imagination we have a black hole,” said Nizar Habash, programme head of computer science at New York University Abu Dhabi.

“How do I talk to a computer or a robot? What would I actually say to it? In what dialect? How would it answer back?”

Mr Habash leads a lab with a seemingly simple goal: to make everyday Arabic understandable to machines.

Arabic has two forms: the formal literary language called Fusha, and myriad dialects, which are often mutually unintelligible. Dialect is the language of daily life but has a lower status.

This second-class standing means everyday technologies such as predictive text and speech recognition still do not work well in spoken Arabic.

The NYUAD lab plans to change this.

This year, it will release text prediction software for Gulf Arabic using a collection of 200,000 words compiled last year. The collection, called the Gumar Corpus, opens the door for predictive text, speech recognition and speech synthesis in the dialect.

This is good news for Arabic speakers who want Alexa’s Arabic voice to sound like a neighbour instead of a literature professor.

Nizar Habash, the programme head of computer science at NYUAD, heads a lab that makes everyday Arabic understandable to machines. Pawan Singh / The National

The development of dialect in computing has not been welcomed by everybody. Formal Arabic still lags behind English and many believe it should be the priority, not dialect.

“People, just by default, think dialects are just bad Arabic,” Mr Habash said. “It’s such an insult to all of this wonderful culture that is celebrated and enjoyed but at the same time denied status.”

There are also technological barriers. Machines can learn languages by comparing identical documents in two languages or similar texts in different languages about the same topic, such as news stories. But news stories and government papers are written in formal Arabic and there are few comparative texts in dialects.

The variety of spellings in dialect is another obstacle.

For Mr Habash, the need for more programming in dialect was self-evident. He was raised in several countries in which different dialects of Arabic are spoken.

The Palestinian was born in Iraq and grew up in Lebanon, Syria, the Soviet Union and Tunisia. At 17, he moved to the US to study linguistics and computer engineering as an undergraduate.

Programming in dialect was common sense to Mr Habash because it is the language of daily life.

Social media increased the use of written dialect, because it is the language of choice for texting.

“And of course, you know, when it comes to people who cannot read or write, they only have dialect,” he said.

“It is the dominant form in the spoken space, so we have to deal with whatever that means.

“Our goal is to develop a better understanding of the data to build better applications. It’s not to make political statements. Our goal, from a technology point of view, is just to try and catch up with what’s happening in other languages technologically.”

The building blocks of language, found in romantic novels

To do this, the building blocks of language are needed: words.

Each word must be manually labelled, or annotated, with descriptors such as tense and gender. With hundreds of thousands of examples, a computer can teach itself the language.

The more examples are used, the better the prediction.
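The idea that a computer learns to predict from counted examples can be shown with a minimal sketch. This is not the NYUAD lab's method, which the article does not describe in detail. It is a simple count-based bigram predictor, and the function names and the tiny English corpus are hypothetical placeholders standing in for annotated dialect text.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count, for each word, which words follow it in the training text."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        # Pair each word with the word that comes directly after it.
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following


def predict_next(model, word, k=3):
    """Return up to k of the most frequent followers of `word`."""
    counts = model.get(word, Counter())  # unseen words yield no suggestions
    return [candidate for candidate, _ in counts.most_common(k)]


# Hypothetical toy corpus; a real system would train on far more text.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
model = train_bigram_model(corpus)
print(predict_next(model, "the", k=1))  # most frequent word seen after "the"
```

With only three sentences the suggestions are crude. Adding more sentences changes the counts and therefore the ranking, which is the sense in which more examples give better predictions.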

“People are so fixated about algorithms when they do AI but they don’t ask where the data for algorithms comes from,” Mr Habash said.

“If your data is not done in a proper, consistent way, you’re going to get garbage in and garbage out.”

Formal Arabic has about a million annotated words. The Egyptian dialect, spoken by about 98 million people and a vast diaspora, has 400,000 annotated words.

Nizar Habash works at his office at the NYUAD campus on Saadiyat Island. Pawan Singh / The National

Levantine Arabic has about 50,000 annotated words and Gulf Arabic has 200,000 annotated words, thanks to the NYUAD project.

To compile its collection of words, the Gumar project had to find non-copyrighted text in dialect, and a lot of it.

Researchers hit the jackpot when they found a directory of 1,200 romantic novels written by anonymous women. The genre was popular in the blogosphere before the rise of social media.

The public directory had more than 100 million words in Gulf Arabic.

The task of annotation began. This is a long process in Arabic, because most vowels are not written and readers decipher words by context. A single written word in Arabic, on average, has three meanings, seven pronunciations and 12 interpretations.

For a computer to guess a word’s vowels and pronunciation, it must first derive meaning from context.

Annotating 200,000 words took three Egyptian linguists in Alexandria, all former Gulf residents, eight months. The work was finished last August. Meanwhile, NYUAD researchers began to train computers to distinguish and translate between dialects.

The politics of language equality

The Madar programme, a collaboration with researchers at Carnegie Mellon University in Qatar, creates comparable data for different dialects.

Its creators have built a 47,000-word lexicon for dialects from 25 different cities, sourcing material from travel books.

"All we need is the funding," says Khaled Shaalan, a professor of Computer Science at the British University in Dubai. Antonie Robertson / The National

Resources from the Gumar and Madar projects are free to university researchers and available for commercial licensing.

Dialect databases matter because they make technology accessible to all, said Mona Diab, a computer science professor at George Washington University.

“You’re basically giving people first-hand access to information, so I think that’s one of the most important and impactful aspects of dialect and technology,” said Prof Diab, a specialist in natural language processing.

“You won’t need to have an education to understand what’s happening.”

This hit home for Prof Diab when she was a girl in Egypt. Her uncles lived on the Arabian Peninsula during the First Gulf War, and her illiterate grandmother could not understand the formal Arabic of the televised news about the conflict. She relied on her grandchildren to translate the broadcasts into her dialect.

“How do you guarantee fairness and equality in the data that you’re using?” she asked.

“How do you use that to create better technology and how do you use that to democratise knowledge?”

Technological investment in dialect requires government support. Otherwise, the Arab world could be left behind.

AI Arabic research is led by the West. If Arabs do not do it themselves, there can be unintended consequences, Prof Diab said.

“There’s always a cultural dimension and a nuance that is going to be missed if you’re not native to the culture. It’s not just about language, it’s about identity. It’s now an opportunity to define our identity outside an occidental or outside perspective.”

Funding for Arabic dropped as western countries reduced their military presence in the Middle East, said Khaled Shaalan, a professor of computer science at the British University in Dubai.

“We are behind because Arabic needs a lot of resources, a lot of investment, and this has become very low,” Prof Shaalan said.

“For example, the United States and many other places stopped funding projects. At the time that there was war, yes, they were interested. But now they have switched to other languages.

“We have the technology now, the computer capacity to do language processing. All we need is the funding to train the career researchers who will work on this. It needs effort.”
