DzoSEM is a bilingual sentence-embedding model that places Dzongkha and English meaning side by side, opening Bhutan's written and oral heritage to anyone who can phrase the question.
Six quiet pressures shape the moment Dzongkha is in — a language carrying centuries of Himalayan knowledge, asking to be read by machines.
Unlike English or Mandarin, Dzongkha lacks large digitized corpora, translation tooling, and core NLP infrastructure.
A uniquely Bhutanese problem — too small for global model coverage to solve, too vital to leave unattended.
A wide gap separates classical written orthography from the modern spoken form, complicating computational modelling.
The difficulty of Dzongkha and the pull of English in high-paying work nudge a generation away from the mother tongue.
Most manuscripts, court records, and Himalayan medical knowledge live in Dzongkha — and quietly grow more distant.
His Majesty's digital initiative has scanned vast archives, but they remain hard to search and harder to read.
A shared embedding space is quiet infrastructure — the value shows up where people meet the archive.
Sifting through thousands of pages of monastic intelligence, old property records, and legal texts in milliseconds using everyday English phrasing.
Mapping topics and surfacing patterns inside centuries of Himalayan medical knowledge currently locked in complex traditional script.
Giving a generation that increasingly thinks and texts in English a frictionless way to explore their mother tongue, keeping the language alive and relevant.
This isn't just a search tool. It's the foundational infrastructure needed to build reliable AI assistants, bilingual chatbots, and real-time translation for Bhutan.
Type an English idea — the model returns the closest Dzongkha documents by meaning, not by exact keyword matching.
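A minimal sketch of what "closest by meaning" can look like in practice: rank documents by cosine similarity between a query vector and stored document vectors. The three-dimensional vectors, the document ids, and the helper names `cosine` and `search` are illustrative assumptions; real vectors would come from the encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, doc_vecs, top_k=2):
    """Return the ids of the top_k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:top_k]

# Toy vectors standing in for real embeddings (hypothetical values).
doc_vecs = {
    "land_record": [0.9, 0.1, 0.0],
    "medical_text": [0.1, 0.9, 0.1],
    "court_ruling": [0.7, 0.2, 0.6],
}
query_vec = [1.0, 0.0, 0.1]  # pretend embedding of an English query about property

results = search(query_vec, doc_vecs)
print(results)  # ['land_record', 'court_ruling']
```

Because ranking depends only on vector geometry, the query and the documents never need to share a single keyword, or even a script.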
Eight movements that take raw parallel sentences and end with an encoder ready for document-level retrieval.
Collect and filter roughly 700,000 high-quality Dzongkha–English parallel sentence pairs.
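The filtering criteria are not spelled out here, so the following is only a sketch of the kind of sanity checks often applied to parallel data: drop empty sides, require Tibetan-script characters on the Dzongkha side (Dzongkha is written in the Tibetan Unicode block, U+0F00 to U+0FFF), reject pairs with wildly different lengths, and remove exact duplicates. The function names and the `max_ratio` threshold are hypothetical.

```python
def has_tibetan_script(text):
    """True if any character falls in the Tibetan Unicode block (U+0F00 to U+0FFF)."""
    return any("\u0f00" <= ch <= "\u0fff" for ch in text)

def filter_pairs(pairs, max_ratio=3.0):
    """Keep (dzongkha, english) pairs that pass simple sanity checks."""
    seen = set()
    kept = []
    for dz, en in pairs:
        dz, en = dz.strip(), en.strip()
        if not dz or not en:
            continue  # drop pairs with an empty side
        if not has_tibetan_script(dz):
            continue  # the Dzongkha side should contain Tibetan-script characters
        ratio = max(len(dz), len(en)) / min(len(dz), len(en))
        if ratio > max_ratio:
            continue  # very different lengths suggest a misaligned pair
        if (dz, en) in seen:
            continue  # drop exact duplicates
        seen.add((dz, en))
        kept.append((dz, en))
    return kept

# Sample strings for illustration only, not taken from the actual corpus.
pairs = [
    ("བཀྲ་ཤིས་བདེ་ལེགས།", "Good luck and greetings"),
    ("བཀྲ་ཤིས་བདེ་ལེགས།", "Good luck and greetings"),  # exact duplicate
    ("hello", "hello"),                               # no Tibetan script
    ("བཀྲ་", ""),                                      # empty English side
]
kept = filter_pairs(pairs)
```

Character-length ratio is a crude proxy across two very different scripts; a production filter would likely add language identification and alignment scoring.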
Set up an encoder-only model that projects language representations into a stable 1024-dimensional embedding space.
Masked-language modelling on massive monolingual corpora to establish core syntactic understanding.
Contrastive learning on the 700K pairs, optimizing semantic alignment between English and Dzongkha vectors.
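The exact training objective is not named, so as an assumption the sketch below uses InfoNCE with in-batch negatives, a common choice for this kind of alignment. Each English vector should score highest against its own Dzongkha translation and lower against every other Dzongkha vector in the batch. The `temperature` value and the two-dimensional toy vectors are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def info_nce(en_vecs, dz_vecs, temperature=0.05):
    """Mean InfoNCE loss where en_vecs[i] and dz_vecs[i] form a translation pair
    and all other Dzongkha vectors in the batch act as negatives."""
    en = [normalize(v) for v in en_vecs]
    dz = [normalize(v) for v in dz_vecs]
    total = 0.0
    for i, e in enumerate(en):
        logits = [dot(e, d) / temperature for d in dz]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # negative log-softmax at the true pair
    return total / len(en)

english = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(english, [[0.9, 0.1], [0.1, 0.9]])   # pairs point the same way
shuffled = info_nce(english, [[0.1, 0.9], [0.9, 0.1]])  # pairs are mismatched
```

The loss is near zero when translations sit close together and large when they do not, which is the signal that pulls the two languages into one space.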
Aligning representation styles and token vocabularies with deployment runtime environments.
Reducing dense embeddings to smaller dimensions (e.g. 512 or 256) to speed up low-latency search systems.
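How the smaller vectors are produced is not stated. One common serving-side approach, shown here purely as an assumption, is to keep the leading components of each embedding and rescale to unit length. This only preserves quality if the model was trained so that the leading dimensions carry most of the meaning. The name `truncate_embedding` and the synthetic input vector are hypothetical.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    if dim > len(vec):
        raise ValueError("target dimension exceeds embedding size")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]

# Synthetic stand-in for a 1024-dimensional embedding.
full = [math.sin(i + 1) for i in range(1024)]
small = truncate_embedding(full, 256)
```

Halving the dimension roughly halves index memory and per-comparison cost, which is where the latency gain comes from.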
Rigorous benchmarking using cross-lingual document recall and human qualitative alignment reviews.
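Cross-lingual document recall is typically reported as recall@k: the fraction of queries whose correct document appears among the top k results. A minimal sketch, with hypothetical document ids and function names:

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the correct document is within the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def mean_recall_at_k(results, k):
    """results: list of (ranked document ids, id of the correct document)."""
    hits = sum(recall_at_k(ranked, gold, k) for ranked, gold in results)
    return hits / len(results)

# Each entry: the ranking returned for one English query, and the
# Dzongkha document that should have been found (made-up ids).
results = [
    (["d3", "d1", "d7"], "d1"),
    (["d2", "d9", "d4"], "d4"),
    (["d5", "d6", "d8"], "d0"),
    (["d1", "d2", "d3"], "d1"),
]
r1 = mean_recall_at_k(results, 1)  # 0.25: only the last query hits at rank 1
r3 = mean_recall_at_k(results, 3)  # 0.75: three of four queries hit within the top 3
```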
Optimizing model weights for long-context retrieval, prepping the encoder for production archives.
Deploy the model behind latency-optimized caching layers and high-throughput vector retrieval APIs for production access.
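As one illustration of a caching layer, repeated queries can skip the encoder entirely by memoising embeddings in process. This sketch uses Python's standard `functools.lru_cache`. The stand-in encoder `embed_uncached` and the cache size are hypothetical, and a production deployment would more likely use a shared cache service.

```python
from functools import lru_cache

calls = {"count": 0}

def embed_uncached(text):
    """Stand-in for an expensive encoder call (hypothetical)."""
    calls["count"] += 1
    return tuple(float(ord(ch) % 7) for ch in text[:4])

@lru_cache(maxsize=10_000)
def embed(text):
    """Return the embedding for `text`, reusing a cached result when available."""
    return embed_uncached(text)

first = embed("old land records")
second = embed("old land records")  # served from the cache; the encoder is not called again
```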
We develop Gen AI solutions designed for domain-specific and multilingual landscapes.
Building the core retrieval engines and robust managed infrastructure necessary to deploy large-scale semantic search for specialized Bhutanese datasets.
Expanding our embedding space beyond text. We are mapping out semantic search for spoken Tshangla, Bumthangkha, and Sharchopkha to capture oral traditions.
Developing a highly accurate, Dzongkha-optimized vision model capable of reading and digitizing historical manuscripts, pechas, and modern documents from any era.
If you are interested in contributing parallel data, evaluating models, or exploring custom enterprise search pipelines, let's connect.