DzoSEM — Bilingual Embeddings

The Challenge

A language at the edge of digital legibility.

Six quiet pressures shape the moment Dzongkha is in — a language carrying centuries of Himalayan knowledge, asking to be read by machines.

A low-resource language

Unlike English or Mandarin, Dzongkha lacks large digitised corpora, translation tooling, and core NLP infrastructure.

Roughly 600,000 speakers

A uniquely Bhutanese problem — too small for global model coverage to solve, too vital to leave unattended.

Written meets spoken

A wide gap separates classical written orthography from the modern spoken form, complicating computational modelling.

Economic gravity

The difficulty of Dzongkha and the pull of English in high-paying work nudge a generation away from the mother tongue.

Heritage at risk

Most manuscripts, court records, and Himalayan medical knowledge live in Dzongkha — and quietly grow more distant.

Digitised, yet unreachable

His Majesty's digital initiative has scanned vast archives, but they remain hard to search and harder to read.

Impact

Four places this becomes useful.

A shared embedding space is quiet infrastructure — the value shows up where people meet the archive.

Historical archives & land records

Sifting through thousands of pages of monastic intelligence, old property records, and legal texts in milliseconds using everyday English phrasing.

Sowa Rigpa & medicine

Mapping topics and surfacing patterns inside centuries of Himalayan medical knowledge currently locked in complex traditional script.

A bridge for the next generation

Giving a generation that increasingly thinks and texts in English a frictionless way to explore their mother tongue, keeping the language alive and relevant.

Limitless downstream potential

This isn't just a search tool. It's the foundational infrastructure needed to build reliable AI assistants, bilingual chatbots, and real-time translation for Bhutan.

Interactive Retrieval

Search in English. Read in Dzongkha.

Type an English idea — the model returns the closest Dzongkha documents by meaning, not by exact keyword matching.
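
As a rough illustration of ranking by meaning, the sketch below scores a query vector against document vectors with cosine similarity. The three-dimensional vectors, the document names, and the `search` helper are hypothetical stand-ins; a real system would use the encoder's embeddings and, most likely, a vector index.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def search(query_vec, doc_vecs, top_k=2):
    """Return (doc_id, score) pairs sorted by descending similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors standing in for real embeddings (hypothetical values).
doc_vecs = {
    "land_record": [0.9, 0.1, 0.0],
    "medical_text": [0.1, 0.9, 0.2],
    "court_ruling": [0.7, 0.2, 0.6],
}
query_vec = [1.0, 0.0, 0.1]  # pretend embedding of an English query

results = search(query_vec, doc_vecs)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Because the score depends only on vector direction, a query and a document can match without sharing a single keyword.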

Architecture Roadmap

From corpus to deployment.

Nine movements that take raw parallel sentences and end with an encoder ready for document-level retrieval.

01. Gathering the corpus

Collect and filter roughly 700,000 high-quality Dzongkha–English parallel sentence pairs.

02. Setting the architecture

Adopt an encoder-only model that projects language representations into a stable 1024-dimensional space.

03. Pre-training

Masked-language modelling on massive monolingual corpora to establish core syntactic understanding.
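
To make the masked-language objective concrete, here is a minimal sketch of the masking step only. It assumes a simplified recipe in which every selected token is replaced by a `[MASK]` symbol; the 15% rate, the `mask_tokens` helper, and the placeholder tokens are illustrative assumptions, not the project's actual settings.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] holds the original token where a mask was applied, else None.
    Simplification: every selected token becomes MASK; BERT-style recipes
    also keep or randomly replace some of the selected tokens.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

# Placeholder tokens, not real Dzongkha subwords.
tokens = ["tok%d" % i for i in range(20)]
masked, labels = mask_tokens(tokens)
print(masked)
```

The model is then trained to predict each hidden token from its context, which is where the syntactic signal comes from.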

04. Fine-tuning

Contrastive learning on the 700K pairs, optimizing semantic alignment between English and Dzongkha vectors.
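
The page does not say which contrastive objective is used. One common choice is an in-batch InfoNCE loss, sketched below under that assumption: each English vector should score highest against its own Dzongkha translation, with the rest of the batch acting as negatives. The `info_nce` name, the temperature of 0.05, and the two-dimensional vectors are all hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(en_vecs, dz_vecs, temperature=0.05):
    """Mean in-batch contrastive loss, English -> Dzongkha direction.

    Row i of en_vecs is assumed to be the translation of row i of dz_vecs;
    every other row in the batch acts as a negative.
    """
    losses = []
    for i, en in enumerate(en_vecs):
        logits = [dot(en, dz) / temperature for dz in dz_vecs]
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

aligned_en = [[1.0, 0.0], [0.0, 1.0]]
aligned_dz = [[1.0, 0.0], [0.0, 1.0]]
swapped_dz = [[0.0, 1.0], [1.0, 0.0]]  # translations paired with the wrong row

loss_aligned = info_nce(aligned_en, aligned_dz)
loss_misaligned = info_nce(aligned_en, swapped_dz)
print(loss_aligned, loss_misaligned)
```

The loss is near zero when each pair points the same way and large when the pairing is scrambled, which is the pressure that pulls translations together in the shared space.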

05. Post-training

Aligning representation styles and token vocabularies with deployment runtime environments.

06. Matryoshka pruning

Shrinking dense representations to smaller nested dimensions (e.g. 512, 256) to speed up low-latency search.
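
With Matryoshka-style embeddings, the serving-time step is typically just keeping a prefix of the vector and rescaling it. The sketch below shows that step alone, assuming the encoder was trained so that the leading coordinates carry most of the signal; `truncate_embedding` and the toy four-dimensional vector are hypothetical.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and rescale to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return prefix  # nothing to rescale
    return [x / norm for x in prefix]

full = [3.0, 4.0, 12.0, 0.0]  # toy stand-in for a 1024-dimensional vector
half = truncate_embedding(full, 2)
print(half)
```

Renormalizing after the cut keeps cosine and dot-product scores comparable across the shortened vectors.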

07. Evaluation

Rigorous benchmarking using cross-lingual document recall and human qualitative alignment reviews.
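
Cross-lingual recall can be measured in several ways; a simple variant is recall@k with one known relevant document per query, sketched here. The document IDs and the `runs` structure are invented for illustration and do not come from any real benchmark.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant document appears in the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def mean_recall_at_k(runs, k):
    """runs: list of (ranked_doc_ids, relevant_doc_id) pairs, one per query."""
    hits = [recall_at_k(ranked, relevant, k) for ranked, relevant in runs]
    return sum(hits) / len(hits)

# Hypothetical ranked results for three English queries.
runs = [
    (["d3", "d1", "d7"], "d3"),  # relevant document ranked first
    (["d2", "d5", "d9"], "d9"),  # relevant document ranked third
    (["d4", "d8", "d6"], "d1"),  # relevant document not retrieved
]

r1 = mean_recall_at_k(runs, 1)
r3 = mean_recall_at_k(runs, 3)
print(r1, r3)
```

Reporting the metric at several values of k shows whether the right document is merely retrieved or actually ranked near the top.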

08. Document tuning

Optimizing model weights for long-context retrieval, prepping the encoder for production archives.

09. Production Scaling & APIs

Deploy the model behind latency-optimized caching layers and high-throughput vector retrieval APIs for production access.

Research Division

Building Enterprise AI For Himalayan Languages.

We develop Gen AI solutions designed for domain-specific and multilingual landscapes.

Semantic Search & Infra

Building the core retrieval engines and robust managed infrastructure necessary to deploy large-scale semantic search for specialized Bhutanese datasets.

Multilingual Voice AI

Expanding our embedding space beyond text. We are mapping out semantic search for spoken Tshangla (also known as Sharchopkha) and Bumthangkha to capture oral traditions.

Vision and OCR AI

Developing a highly accurate, Dzongkha-optimized vision model capable of reading and digitizing historical manuscripts, pechas, and modern documents from any era.

Collaborate

Build the future of Himalayan NLP together.

If you are interested in contributing parallel data, evaluating models, or exploring custom enterprise search pipelines, let's connect.

DzoSEM.ai.lab@gmail.com

A shared space for Dzongkha and English.

To graduate Dzongkha from a low-resource language into a stable one.
