With 10 verb tenses, eight noun cases, three grammatical genders and a strong predilection for compound words, Sanskrit is not an easy language to teach a human — let alone an AI model.
But Indologist Oliver Hellwig is undertaking the challenge, training deep learning models that can analyze Sanskrit texts up to 4,000 years old. A digital repository of Sanskrit works parsed word by word would enable researchers to more easily search for information and better identify passages with parallel context.
AI is being used to interpret historical texts in German and Italian, as well as classical Japanese literature. But most existing NLP models are geared towards Western languages that follow similar rules of grammar, punctuation and formatting.
That presents a challenge for researchers developing software to transcribe and analyze scripts that are read right to left, are pictographic instead of phonetic, or, like Sanskrit, often don’t use breaks between words.
Unlike English, Sanskrit is a highly inflected language, which means words change their form depending on their function in a sentence. Some Sanskrit verbs have more than 200 forms depending on the context. The language also has an extensive vocabulary, with more than 50 words for terms like “sun” or “moon” — making it essential that an AI model be trained on a large, diverse dataset of text.
Hellwig, a postdoctoral researcher at the University of Zurich, Switzerland, knew 15 years ago that computational tools could open new possibilities for his linguistics research, but found that just a fraction of Sanskrit manuscripts had been digitized into machine-readable text.
For half an hour almost every day since, he’s been changing that bit by bit, painstakingly parsing Sanskrit works and adding them to a database that now consists of 4.5 million manually labeled words.
Hellwig began building Sanskrit-parsing tools from scratch — starting with statistical models before advancing to more complex optical character recognition and NLP models. Using an NVIDIA Quadro GPU, he’s now training deep learning models that can identify characters and find word endings in Sanskrit texts.
AI tools that transcribe Sanskrit could help digitize a vast corpus of historical manuscripts, spanning epic poetry, religious texts and Ayurvedic medicine.
When training an AI model for texts based on the Latin alphabet, researchers can teach the neural network to detect white spaces to determine where one word ends and another begins.
That’s not the case for Sanskrit manuscripts, where one line of text can be made up of multiple words merged together into just one or two compound strings. The word sandhi, meaning “connection,” is used to describe the phonetic process of joining these words together.
An effective NLP model for Sanskrit texts must be able to split a sandhied line back into its individual words, which is a significant challenge for researchers.
“Any algorithm has to a certain degree understand the semantics of a line of text to generate a valid split form of it,” said Hellwig. “What’s quite trivial for English is actually the most problematic step in Sanskrit.”
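To make the contrast concrete, here is a minimal Python sketch of the problem. It is not Hellwig's deep learning model. It is a toy, dictionary-based splitter that undoes three well-known vowel sandhi rules (a/ā + a/ā gives ā, a/ā + i/ī gives e, a/ā + u/ū gives o) on romanized text. The six-word lexicon, the function name `split_sandhi` and the rough English glosses in the comments are assumptions made for illustration only.

```python
import unicodedata


def nfc(s: str) -> str:
    """Normalize to composed Unicode so that 'ā' is a single character."""
    return unicodedata.normalize("NFC", s)


# Hypothetical toy lexicon in romanized Sanskrit (rough glosses):
# na "not", asti "is", ca "and", iti "thus",
# agataḥ "not gone", āgataḥ "arrived".
LEXICON = {nfc(w) for w in ("na", "asti", "ca", "iti", "agataḥ", "āgataḥ")}

# Merged vowel -> (final vowel of left word, initial vowel of right word).
# Only three vowel sandhi rules are covered; real sandhi has many more.
_RULES = {
    "ā": [("a", "a"), ("a", "ā"), ("ā", "a"), ("ā", "ā")],
    "e": [("a", "i"), ("a", "ī"), ("ā", "i"), ("ā", "ī")],
    "o": [("a", "u"), ("a", "ū"), ("ā", "u"), ("ā", "ū")],
}
REVERSE_SANDHI = {
    nfc(merged): [(nfc(left), nfc(right)) for left, right in pairs]
    for merged, pairs in _RULES.items()
}


def split_sandhi(text: str, lexicon: set[str] = LEXICON) -> list[list[str]]:
    """Return every segmentation of `text` into lexicon words."""
    text = nfc(text)
    if not text:
        return [[]]  # one way to split the empty string: no words
    results = []
    for i in range(1, len(text) + 1):
        head, rest = text[:i], text[i:]
        # Case 1: a plain word boundary with no sound change.
        if head in lexicon:
            for tail in split_sandhi(rest, lexicon):
                results.append([head] + tail)
        # Case 2: the last character of `head` is a merged vowel.
        # Requiring i >= 2 keeps the remaining string strictly shorter.
        if i >= 2:
            for left_end, right_start in REVERSE_SANDHI.get(head[-1], []):
                left = head[:-1] + left_end
                if left in lexicon:
                    for tail in split_sandhi(right_start + rest, lexicon):
                        results.append([left] + tail)
    return results


# English: white space already marks the word boundaries.
print("the sun rises".split())

# Sanskrit: boundaries are hidden inside merged vowels.
print(split_sandhi("nāsti"))    # one reading: na + asti
print(split_sandhi("cāgataḥ"))  # two readings: ca + agataḥ, ca + āgataḥ
```

The last call returns two candidate splits with roughly opposite meanings, "and not gone" versus "and arrived". The rules alone cannot choose between them, which is why a splitter needs some grasp of context, as the quote above suggests.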
The deep learning tool Hellwig developed to split lines of Sanskrit into individual words is 10 to 15 percent more accurate than previous methods.
“I was surprised that it worked so well,” he said, “because it’s a complicated task, even for human readers using the original forms of these texts.”
Using an NVIDIA GPU helped Hellwig train his AI models 10 times faster. This speed allows him to evaluate errors sooner and develop more accurate models more efficiently. His sandhi-splitting tool is now being used on a large Sanskrit corpus dubbed GRETIL.
Many historians debate the age of key Sanskrit texts — particularly religious works like the Bhagavad Gita. To contribute to this academic conversation, Hellwig wants to use neural networks and NVIDIA GPUs to analyze the grammatical structure and language patterns in ancient Sanskrit texts.
By connecting this linguistic evidence with a model of how Sanskrit changed over time, he hopes to help determine when some of these major texts were composed.
Main image shows a leaf from a manuscript of the Mahabharata, a 100,000-verse Sanskrit epic poem that includes the Bhagavad Gita — a foundational Hindu text. Image from Miami University Libraries Digital Collections, available in the public domain.