An ancient language has defied translation for 100 years. Can AI crack the code?

 

Illustration by Somnath Bhatt

 

Machine learning can translate between two known languages, but could it ever decipher those that remain a mystery to us?


Jiaming Luo grew up in mainland China thinking about neglected languages. When he was younger, he wondered why the different languages his mother and father spoke were often lumped together as Chinese “dialects.”

When he became a computer science doctoral student at MIT in 2015, his interest collided with his advisor’s long-standing fascination with ancient scripts. After all, what could be more neglected — or, to use Luo’s more academic term, “lower resourced” — than a long-lost language, left to us as enigmatic symbols on scattered fragments? “I think of these languages as mysteries,” Luo told Rest of World over Zoom. “That’s definitely what attracts me to them.”

In 2019, Luo made headlines when, working with a team of fellow MIT researchers, he brought his machine-learning expertise to the decipherment of ancient scripts. He and his colleagues developed an algorithm informed by patterns in how languages change over time. They fed their algorithm words in a lost language and in a known related language; its job was to align words from the lost language with their counterparts in the known language. Crucially, the same algorithm could be applied to different language pairs.
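For readers who want a concrete picture of what "aligning" words can mean, here is a minimal sketch. It is not the MIT team's model, which learns how characters in one script correspond to characters in another. This toy version assumes both word lists have already been transliterated into a shared alphabet and simply pairs each lost-language word with the known-language word that needs the fewest single-character edits. The function names and word lists are invented for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def align(lost_words, known_words):
    """Pair each lost-language word with its closest known-language word."""
    return {
        w: min(known_words, key=lambda k: edit_distance(w, k))
        for w in lost_words
    }


# Hypothetical, made-up transliterations (not real Ugaritic or Linear B data).
lost = ["kato", "miru", "sela"]
known = ["sella", "kado", "mirus"]

for lost_word, match in align(lost, known).items():
    print(lost_word, "->", match)
```

Plain edit distance stands in here for the learned patterns of language change that the researchers describe; a real system would also have to work out which symbols correspond to which sounds.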

Luo and his colleagues tested their model on two ancient scripts that had already been deciphered: Ugaritic, which is related to Hebrew, and Linear B, which was first discovered among Bronze Age–era ruins on the Greek island of Crete. It took professional and amateur epigraphists (people who study ancient written matter) nearly six decades of mental wrangling to decode Linear B. Officially, the 30-year-old British architect Michael Ventris is primarily credited with its decipherment, although the private efforts of classicist Alice Kober laid the groundwork for his breakthrough. Sitting night after night at her dining table in Brooklyn, New York, Kober compiled a makeshift database of Linear B symbols, comprising 180,000 paper slips filed in cigarette boxes, and used it to draw important conclusions about the nature of the script. She died in 1950, two years before Ventris cracked the code. Linear B is now recognized as the earliest attested form of Greek.

Luo and his team wanted to see if their machine-learning model could get to the same answer, but faster. The algorithm yielded what was called “remarkable accuracy”: it was able to correctly translate 67.3% of Linear B’s words into their modern-day Greek equivalents. According to Luo, it took between two and three hours to run the algorithm once it had been built, cutting out the days or weeks — or months or years — that it might take to manually test out a theory by translating symbols one by one. The results for Ugaritic showed an improvement on previous attempts at automatic decipherment.

The work raised an intriguing proposition. Could machine learning assist researchers in their quests to crack other, as-yet undeciphered scripts — ones that have so far resisted all attempts at translation? What historical secrets might be unlocked as a result? 

Read the rest of the story in Rest of World…