Natural Language Processing (NLP) for Japanese is a complex domain due to the language's unique grammatical structures and high context dependency. Developing NLP applications for this market requires locale-aware custom algorithms and evaluation strategies. In this post, I will outline the challenges the language poses, the latest innovations in Japanese Large Language Models (LLMs), and applications taking advantage of new business opportunities in Japan.

The multi-faceted complexity of the Japanese language

Let’s go to the root and examine why it is so difficult to parse the Japanese language, both for humans and computers.

Structural complexity

Japanese language’s inherent structure is characterized by:

  • Mixed scripts in writing: Kanji (漢字) is a set of ideographic characters carrying semantic weight, but a single character may have several readings and meanings; Hiragana (平仮名 / ひらがな) provides the phonetic syllabary for native words and grammar, and Katakana (片仮名 / カタカナ) is used predominantly for loanwords and emphasis. These three scripts are often intermingled in a single sentence.

    mixed script

    This simple sentence shows a mix of scripts. Image credit: Hugging Face

In fact, a University of British Columbia course has claimed that the combination of these writing systems makes Japanese writing among the most difficult to learn. There were actually numerous attempts throughout Japanese history to either adopt a Western-style alphabet or abolish Chinese kanji completely, but none of these attempts ever came to fruition.

  • Massive use of emoticons: these are composed from a wide range of Unicode characters, including Cyrillic and Greek letters and punctuation marks. All of these need to be preserved for sentiment analysis, yet they require extra care to decipher and contextualize.
  • Lack of explicit word boundaries: unlike English, which relies on spaces to indicate word boundaries, Japanese text is a continuous string of characters. Without clear delimiters, traditional tokenization methods run into ambiguity, making it difficult to differentiate between compound words and distinct lexemes.
  • Contextual omissions and politeness nuances: Japanese speakers frequently omit subjects and other elements, relying on context and shared understanding. Additionally, levels of politeness in the form of keigo add another layer of complexity in tone and meaning.
  • Flexible word order: rather than adhering strictly to the default subject–object–verb order, Japanese sentences allow the positions of the subject, predicate, and object to vary for emphasis. A study on narrative discourse examines how this flexible ordering contributes to semantic richness and computational complexity. It complicates syntactic parsing because the parser must account for numerous constituent arrangements during real-time interpretation.
  • Use of particles to impart structure and meaning: small, functional markers such as “は” (wa), “が” (ga), and “を” (wo) serve to delineate grammatical relationships and impart nuanced meaning to a sentence. The specific particle employed can drastically alter the sentence’s interpretation. For example, while “は” commonly marks the topic, “が” frequently indicates the subject or highlights contrast, and “を” typically marks the object, though their roles are not always fixed. Siegel’s research on the syntactic processing of particles in spoken language further underscores the challenges these elements pose for both human and computational parsing systems.
  • Onomatopoeic and colloquial words: onomatopoeia mimics sounds, emotions, conditions, and even atmospheric qualities. Colloquial expressions and regional dialects are frequently used online and evolve rapidly. Both are hard to segment and less consistently represented in training corpora, yet essential for understanding in chatbot and interactive scenarios.

These unique structures force us to rethink standard methodologies adopted for Latin-based languages. It means that an essential upstream task is to correctly identify, segment and make sense of words. The following sections will explore each of these challenges and their mitigations.

Tokenization

NLP begins with tokenization. For languages like English, spaces within a sentence make segmentation a manageable process. For Japanese, which is written as a continuous stream of characters, tokenization often requires sophisticated morphological analyzers, which break text into morphemes: the minimal units of language that carry meaning, larger than a character yet unbreakable into smaller structures. Such analyzers help handle text that can be segmented in multiple valid ways. Lacking a language-specific tokenizer with a local dictionary also prevents downstream tasks from understanding and breaking down the common terms seen in the language.

For example, the same sequence of characters could denote vastly different meanings depending on how it is segmented. Compound words such as “日本海” (Sea of Japan) must be correctly identified as one single concept rather than disjointed fragments (日本 and 海, “Japan” and “Sea” respectively).

The hiragana script also introduces another level of ambiguity, as the same phonetic sound can have different meanings. Learners of Japanese may have come across this issue already: when attempting to learn the language by reading e-hon (children's books), which are often written entirely in hiragana, a beginner might find it hard to work out the meaning by breaking phrases into the correct grammatical structures.

Existing tools

Several robust tools have a long history of processing and analyzing Japanese text:

  • MeCab, Kuromoji and Sudachi: these are widely used open-source morphological analyzers that break strings into morphemes at configurable levels of granularity, while providing other linguistic analysis tools such as part-of-speech tagging, NER, dependency parsing, lemmatization, etc.
  • Janome: as demonstrated in my post on topic modeling, it is an easy-to-use and versatile library that offers not just tokenization but also text-processing functions such as case conversion, token filters, regular-expression operations, compound-noun filters, keeping or removing words by POS, word-frequency counting, dictionary customization, etc. (see the sketch after this list).
  • spaCy: a statistics-based natural language analysis library (not generative) with Japanese support. It was the first library I used for NLP development in Japanese, before the advent of the latest LLMs.
  • Japanese dictionaries: the foundation of tokenizers. The same tokenizer with different dictionaries will yield different results.
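
To make the segmentation problem concrete, here is a minimal sketch using Janome (chosen because it is pure Python and pip-installable). The example sentence is my own, and whether a compound like 日本海 surfaces as one morpheme or two depends on the dictionary in use:

```python
# A minimal sketch of Japanese morphological analysis with Janome.
# pip install janome; results vary with the bundled dictionary.
from janome.tokenizer import Tokenizer

tokenizer = Tokenizer()

# 日本海で泳ぐのが好きです ("I like swimming in the Sea of Japan")
for token in tokenizer.tokenize("日本海で泳ぐのが好きです"):
    # surface form, part-of-speech chain, and dictionary (base) form
    print(token.surface, token.part_of_speech, token.base_form)
```

Swapping in a different dictionary (e.g., a NEologd variant for MeCab) can change how compounds and neologisms are split, which is exactly why dictionaries are described above as the foundation of tokenizers.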

Subword tokenization

Modern NLP models have also adopted subword tokenization techniques such as Byte Pair Encoding (BPE) or SentencePiece to better capture new words, compound words, and domain-specific terminology.
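
As a rough sketch of how this looks in practice (the corpus file name and hyperparameters below are illustrative assumptions):

```python
# A minimal sketch of training a BPE subword model with SentencePiece.
# "ja_corpus.txt" is a hypothetical file of raw Japanese text, one sentence per line.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="ja_corpus.txt",      # raw text; no pre-segmentation required
    model_prefix="ja_bpe",      # writes ja_bpe.model and ja_bpe.vocab
    vocab_size=8000,
    model_type="bpe",
    character_coverage=0.9995,  # keep rare kanji instead of mapping them to <unk>
)

sp = spm.SentencePieceProcessor(model_file="ja_bpe.model")
print(sp.encode("日本海で泳ぐのが好きです", out_type=str))
# Unseen compounds fall back to smaller subword pieces instead of <unk>.
```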

Linguistic ambiguity and contextual nuances

Japanese inherently relies on context. Sentences may omit subjects and objects, with the assumption that the reader or listener can infer them. For instance, in a sentence like “行ってきます” (“I’m off”), there is no explicit mention of the subject, yet the meaning is clear to native speakers.

In addition, Japanese includes finely graded politeness levels (keigo), honorific expressions, and culturally specific idioms that introduce additional layers of meaning.

This high-context dependency demands models that can understand long-range dependencies and infer implicit information.

Mitigation using Transformer models

Recent advancements in contextual embeddings from Transformers (e.g., BERT and task-specific embedding models) have vastly improved context parsing. Fine-tuning these models on large Japanese corpora enables them to capture Japanese text's unique subtleties. However, high-quality contextual models need ample annotated data. For Japanese, the relative scarcity of such resources (only 5.1% of web content is in Japanese, compared with 49.2% in English) poses a significant challenge. Even though this is a far cry from LLMs for Low Resource Languages, most mainstream globally developed LLMs (commercial and open-source) fare much better with English content than Japanese. From my experience, mainstream LLMs perform acceptably on general knowledge but hallucinate in more niche areas such as ryokans and traditional arts. This has spurred the rise of LLMs built from scratch with Japanese training data, a topic we will examine in the next section on Japanese LLMs.
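
To see contextual inference in action, here is a minimal sketch using one publicly available Japanese BERT checkpoint (cl-tohoku/bert-base-japanese-v3, which additionally requires the fugashi and unidic-lite packages); the example sentence is my own:

```python
# A minimal sketch of masked-token prediction with a Japanese BERT.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="cl-tohoku/bert-base-japanese-v3")

# The model must infer the masked element purely from context:
# 京都で[MASK]を食べました。 ("I ate [MASK] in Kyoto.")
for candidate in fill_mask("京都で[MASK]を食べました。")[:3]:
    print(candidate["token_str"], round(candidate["score"], 3))
```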

Locale-specific colloquial expressions and onomatopoeic words

Japanese onomatopoeic words are unique word forms. They don’t just mimic sounds, but also describe emotions, conditions, and even atmospheric qualities. This multifunctional use means that the same onomatopoeia can convey different meanings depending on context. For instance, an onomatopoeic term used to describe a sound in one context might be employed metaphorically in a narrative. Research by Fukushima et al. illustrates that effective disambiguation often requires analyzing neighboring nouns and verbs to determine the intended meaning, yet this dependency on surrounding context makes it challenging for LLMs to process such expressions accurately.

Colloquial expressions, including slang, abbreviations, and regional dialects, tend to deviate from standardized language and evolve rapidly. This variability means that common LLM tokenization strategies and pretrained models may struggle to recognize or correctly interpret these expressions, leading to inaccuracies in downstream tasks.

Moreover, onomatopoeia is frequently written in Katakana for stylistic emphasis (especially in manga). The same phenomenon occurs with colloquial expressions, which might mix standard forms with creative abbreviations or phonetic spellings. Such heterogeneity challenges NLP preprocessing, making it difficult to consistently create semantically meaningful representations.

While large-scale datasets exist for formal or literary Japanese, there are few well-annotated corpora that capture the full spectrum of colloquial language and onomatopoeia. Without sufficient annotated data, models lack exposure to the diverse contexts in which these expressions are used, hindering the learning of accurate representations. This scarcity is a barrier to developing models that can reliably distinguish and generate appropriate colloquial forms or interpret onomatopoeic nuances.

Latest developments in Japanese LLMs

Let’s take a look at recent advancements in the Japanese-specific LLM ecosystem. In the last two years, more models have been trained from scratch with original data curated by local Japanese teams, instead of fine-tuning an existing English model with Japanese data. As we know, the performance of a language model in understanding a language is positively correlated with the volume of training data in that particular language. We would expect locally developed models using mostly Japanese training data to perform better on Japanese-specific tasks than global models.

Collaborative ecosystem

The National Institute of Informatics (NII) set up the LLM-jp project as a collaborative ecosystem bringing together more than 1,500 contributors from academic, private, and governmental sectors to develop fully open Japanese LLMs and benchmarks. Development spans the entire pipeline, from data preprocessing through token embedding to model training, fine-tuning, and guardrailing.

Ultra-large open-source models

The National Institute of Informatics (NII) has released llm-jp-3-172b, a model boasting 172 billion parameters, with training data drawn mostly from Japanese text in Common Crawl, WARP, Japanese Wikipedia, and summary texts of research topics in KAKEN. It has outperformed comparable systems such as GPT-3.5 on Japanese-specific evaluation metrics, tailoring its architecture to capture the subtleties inherent in Japanese text, such as mixed scripts and context-dependent nuances.

Leveraging supercomputing for distributed training

Fugaku-LLM is a 13-billion-parameter LLM developed on the RIKEN supercomputer Fugaku. Due to the GPU shortage, the team opted to train on Fugaku's CPUs (manufactured by Fujitsu) instead of GPUs. 60% of the training data is locally curated Japanese text, combined with English, mathematics, and code.

The researchers used distributed training methods, porting the Megatron-DeepSpeed deep learning framework to the Fugaku supercomputer to accelerate matrix multiplication and parallelize communication. The model achieves high performance on JA MT-Bench; in particular, its benchmark score for humanities and social sciences tasks reached a remarkably high 9.18. Its creators expect it to perform well in natural dialogue involving keigo (honorific speech) and other features of the Japanese language.

Advanced architectures with Mixture-of-Experts (MoE)

Rakuten’s RakutenAI-2.0 utilizes a Mixture-of-Experts (MoE) architecture in which eight 7-billion-parameter experts are combined. During inference and training, each token is sent to the two most relevant experts, as decided by a router. The experts and the router are trained together on both Japanese and English corpora to capture Japan’s unique linguistic, cultural, aesthetic, and regulatory nuances. Rakuten claims that evaluation with JA MT-Bench shows it outperforming other Japanese-specific models.
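
To make the routing mechanism concrete, here is a minimal PyTorch sketch of top-2 token routing. It illustrates the general MoE pattern described above, not Rakuten's actual implementation; the layer sizes are arbitrary:

```python
# A minimal sketch of top-2 token routing in a Mixture-of-Experts layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, d_model: int, n_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # trained jointly with the experts
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                           nn.Linear(4 * d_model, d_model)) for _ in range(n_experts)]
        )

    def forward(self, x):  # x: (tokens, d_model)
        gate_logits = self.router(x)
        weights, idx = gate_logits.topk(2, dim=-1)  # two most relevant experts per token
        weights = F.softmax(weights, dim=-1)        # normalize over the selected pair
        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e            # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = Top2MoE(d_model=64)
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```

Because only two of the eight experts run per token, inference cost stays close to that of a much smaller dense model while total capacity remains large.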

Locally-developed multimodal models

Even though the clip-ViT-B-32 multimodal model works well out of the box in understanding images with Japanese elements (e.g., ramen, castles, Japanese signs, Tokyo), it still fails to distinguish finer nuances such as certain food items and interior decor. Locally developed CLIP models such as CLIP-Japanese-Base by LINE Corp are trained on curated and cleansed Japanese-only data: 1 billion image-text pairs from Common Crawl that are applicable only to Japan, plus an additional 500k annotated images from the Japanese domain. It has achieved higher scores on the STAIR Captions and Recruit image classification evaluation tests, both of which test culturally aware semantic understanding of image content.
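
For reference, scoring image-text similarity with the baseline clip-ViT-B-32 via sentence-transformers looks roughly like the sketch below. The image file and captions are hypothetical; this is the kind of fine-grained comparison where a locally trained CLIP should pull ahead:

```python
# A minimal sketch of image-text similarity scoring with clip-ViT-B-32.
from sentence_transformers import SentenceTransformer, util
from PIL import Image

model = SentenceTransformer("clip-ViT-B-32")

img_emb = model.encode(Image.open("ryokan.jpg"))  # hypothetical local image
txt_emb = model.encode([
    "a traditional Japanese ryokan room with tatami flooring",
    "a Western-style hotel room",
])

print(util.cos_sim(img_emb, txt_emb))  # nuanced interiors often score ambiguously
```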

Open benchmarking and performance evaluation

Global evaluation suites often fall short in testing Japanese LLMs due to their lack of locale-specific datasets. Researchers have deployed a variety of Japanese-specific benchmarks to rank LLMs on language understanding, task performance, domain-specific knowledge, and alignment.

For example, the Rakuda benchmark provides a question set designed to evaluate how well AI assistants can answer Japanese-specific questions. The Open Japanese LLM Leaderboard from LLM-jp on Hugging Face provides a centralized benchmark that automatically evaluates Japanese LLMs across 26 evaluation datasets. The leaderboard uses the llm-jp-eval suite to automatically evaluate submitted models on 16 tasks ranging from language understanding to code generation and math reasoning. It provides transparency in evaluation, establishes clear performance baselines, and drives reproducibility.

In my ryokan recommender work, I used the JMTEB leaderboard and this post to evaluate embedding model performance on various Japanese text-processing tasks. In this way, I can start with better textual representations before moving to downstream tasks. I also used the vision model benchmarks to find a substitute for clip-ViT-B-32 when its performance fell short in capturing nuanced differences in local architecture and food images. Since I rely heavily on RAG, I also find this benchmark helpful.

Business Opportunities in Japanese NLP

From challenges come exciting solutions! After reviewing the intrinsic issues in the language and the latest developments to overcome them, let’s appreciate some new applications that take advantage of the new LLMs. Since my interest is in the travel industry, we will look at several scenarios where AI, and especially LLMs, are deployed to remove the language barrier and facilitate a smooth and memorable experience for foreign travelers in Japan.

Chatbots

Japanese consumer interactions demand a high degree of politeness and cultural awareness. By developing NLP-powered voice assistants and chatbots tuned specifically to Japanese nuances, businesses can offer a level of personalization and empathy that off-the-shelf English models cannot match. These systems can adapt their dialogue style based on context, for example differentiating between standard inquiries and requests requiring keigo (honorific language).

In addition, the post-pandemic influx of foreign travelers has put heavy demand on the aging labor force, both in capacity and in foreign-language training. For example, NHK news cites that fewer than 1% of Japan’s 400k taxi drivers are fluent in English. By leveraging the latest NLP advancements, traveler-facing industries can build interactive and engaging experiences that provide accurate, relevant, and consistent service to users in their own languages. These NLP systems can also handle inquiries 24/7 and scale service delivery without additional labor.

Multilingual and multimodal digital kiosks

An exciting application is the integration of LLMs with digital kiosks found in airports, train stations, and tourist centers. These systems combine natural language understanding with visual information to assist travelers with real-time travel information (e.g., train schedules, platform changes, and route navigation), local recommendations, and real-time translation services. By leveraging LLMs that understand both the text and context in multiple languages, these multimodal systems create a more interactive and engaging experience for users, helping reduce language barriers for foreign visitors while providing rich, culturally relevant information.

When utilizing predictive analytics and up-to-date service data, these systems can also dynamically adjust the information displayed to meet evolving passenger needs. For instance, during peak hours or in the event of service disruptions, the kiosks can provide personalized suggestions and alternative route planning, ensuring that travelers are kept well-informed with minimal effort on their part.

Transportation companies can also leverage multilingual LLMs with RAG to let foreign travelers search for company-specific information in their own language via phone or an onsite kiosk. A common scenario would be querying the eligibility and coverage of the JR East rail pass for someone who has just landed at Narita Airport and, upon confirmation, buying and activating it on the spot. With the kiosk’s guidance, the same user can also immediately reserve shinkansen seats for their upcoming trips using the just-bought pass.

One example is the face-to-face translation panel deployed by Seibu Railway in Shinjuku, the busiest transit hub in the world, with a total of 53 platforms across 12 train lines and 200 different exits. Of the 3.6 million passengers passing through Shinjuku every day, a significant proportion are surely first-time visitors who will summon the help of this panel. Developed by Toppan using a local translation engine in the VoiceBiz app, it features a transparent screen displaying real-time translated text in bubbles. In addition to translation between different languages, the device also offers a virtual keyboard for people with hearing or speech disabilities.

Both sides can speak their native languages, and the receiving party will be able to read the words in their own language. In addition, the conversion is simultaneous and continuous, meaning that both parties can keep talking without needing to signal the machine to start or stop processing, enhancing the natural flow of conversation. The merit of this glass panel over a tablet app is that both parties can talk naturally with eye contact and observe nuances such as facial and bodily expressions that aid understanding. It certainly beats taking out your phone, finding cellular/wifi signal, opening Google Translate and showing it to the other party while staring at a shared screen together!

live translation

Transparent screen that translates 12 languages. Photo credit: Asahi Shimbun

Following Seibu Railway’s lead, Keikyu has installed the panel at its Haneda Airport and Shinagawa stations. Tokyo Subway and the Takashimaya department store in Osaka are following suit.

Toppan started development with a phone app, then extended it to a physical display to scale its usage for more immersive interaction. The website/app version is deployed at Expo 2025 for announcements, orientation, and crowd control. The glass version, as in Seibu Railway’s usage above, is expected to be deployed at the 2025 Deaflympics and in other business and public spaces.

Customized chatbots using RAG

The JR EAST Travel Concierge, based on Google Gemini, supports travel planning for international visitors to Japan. It aims to use an interactive chatbot to recommend tourist attractions tailored to individual preferences, deliver real-time travel information (e.g., Japan’s cultural norms, practices, and travel-related rules) during a trip to reduce anxiety and friction, and create tailored itineraries based on interests and schedule. It also promises to allow editing an existing itinerary during the trip by adding new locations. As of this writing (2025/6), the system is unfinished: it only has a web interface, users must state their travel preferences and demographics via preset Q&A, and it then picks from a set of fixed spots.

Osaka Convention & Tourism Bureau also launched a GPT4-based chatbot on its Osaka-Info website.

It is a multi-language (supporting more than 20 languages) generative AI system developed and provided by Kotozna in partnership with JTB, offering travel recommendations, disaster-prevention guidance, and the latest news, such as on the Osaka-Kansai Expo 2025. Per the developer’s system diagram, this is a RAG-based system that retrieves information from the Osaka Tourism Bureau homepage and database to construct a prompt together with the user query. This reduces the chance of hallucination and ensures the response is crafted from relevant and up-to-date content from the Travel Bureau.
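
The overall pattern is easy to sketch. The snippet below is my own minimal illustration of the retrieve-then-prompt flow, not Kotozna's implementation; the embedding model and document snippets are assumptions:

```python
# A minimal sketch of the RAG pattern: embed curated documents, retrieve the
# closest match for a query, and build a grounded prompt for the LLM.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

docs = [  # stand-ins for curated tourism-board content
    "国立文楽劇場では毎月文楽公演が行われています。",  # "The National Bunraku Theatre holds performances every month."
    "大阪城天守閣の開館時間は9時から17時です。",       # "Osaka Castle's main tower is open 9:00-17:00."
]
doc_emb = embedder.encode(docs, normalize_embeddings=True)

query = "Where can I watch bunraku in Osaka?"
q_emb = embedder.encode(query, normalize_embeddings=True)

best = util.cos_sim(q_emb, doc_emb)[0].argmax().item()  # nearest document
prompt = (
    "Answer using only the context below.\n"
    f"Context: {docs[best]}\n"
    f"Question: {query}"
)
print(prompt)  # this grounded prompt is then sent to the underlying LLM
```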

The website and chatbot detect my browser setting and automatically display in English.

osaka

Here I am asking the chatbot to create an itinerary around watching a bunraku performance, a local cultural specialty

Bebot, a chatbot service offered by Bespoke, is deployed at Narita Airport, hotels, public institutions (e.g., online city halls, disaster response, resident support, etc.) and Kyoto city to handle multi-language inquiries regardless of the time of day and location, reducing the need for staff language training while improving response time and quality.

These chatbots make it feel like we have a trusted guide in our pocket. For example, when it rains, we can quickly ask them to recommend indoor attractions, or we can ask for help recommending or booking a local restaurant. Their strength over something like generic GPT is that, since these are either RAG systems or fine-tuned on local data, they are more up-to-date and specific to the particular scenario a traveler is in. For example, with the Travel Bureau’s curated content as the base for its RAG, I’d expect the Osaka chatbot to know much more about every nook and cranny in Osaka than the plain-vanilla GPT-4 it is built on.

Virtual hotel staff and concierge

Across Japan, several hotels are now deploying AI-driven virtual concierge systems to handle guest inquiries in real time. These systems can answer frequently asked questions, provide recommendations for local attractions, and even manage in-room service requests.

Kotozna, the developer of Osaka-Info’s RAG system, has built another in-room virtual assistant, deployed in 500 hotels across Japan. It handles room service and other requests in multiple languages and tracks the data flow for further analysis.

Henn Na Hotels’ robotic staff

Henn Na (which literally means “strange”) Hotels pioneered the concept of a “robotic hotel”, where many front-of-house functions (e.g., check-in, concierge services, and room service) are supported or entirely handled by automated systems. For example, the reception staff are mostly humanoids that greet guests in various languages using NLP. These humanoids also take up concierge tasks such as reservations and local spot recommendations. This not only cuts operating costs but also creates a strong talking point for adventurous travelers, boosting the hotel chain’s online presence.

henn

Humanoid reception staff. Photo credit: Henn-na Hotel official website

henn

An even more engaging reception staff! Photo credit: Henn-na Hotel official website

AI-based traffic flow analysis

At JR Kitakami Station, an innovative project is underway where Electric Engineering Co. and Cyber Core have collaborated to implement an AI-based video analysis system . Strategically positioned cameras capture real-time video feeds, which are then processed using advanced AI algorithms to monitor and analyze passenger flows. While at first glance this technology supports operational improvements—by identifying congestion areas and enabling better crowd management—it also has a direct impact on the passenger experience. Information distilled from the system can be used to power dynamic digital signage that displays real-time updates about platform congestion, waiting times, and service adjustments. This means travelers receive timely, data-driven guidance while navigating busy stations.

Sentiment analysis and market research

In Japan, where consumer feedback is often nuanced rather than overtly direct and critical (rooted in the deep-seated cultural norm that direct confrontation is to be avoided), traditional sentiment analysis tools may fail to capture the subtleties. NLP models trained or fine-tuned on labelled Japanese customer feedback data (rather than merely translated global data), and evaluated on Japanese data, should understand context and tone much better.
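
A fine-tuning setup for this could look roughly like the sketch below. The checkpoint choice and the two toy feedback lines are illustrative assumptions; a real project would use a proper labelled corpus and a full training loop:

```python
# A minimal sketch of fine-tuning a Japanese BERT for sentiment classification.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cl-tohoku/bert-base-japanese-v3"  # needs fugashi + unidic-lite
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy labelled feedback (hypothetical): 1 = positive, 0 = negative.
# Note the hedged negative: "the room may have been a little small".
texts = ["スタッフの対応がとても丁寧でした", "部屋が少し狭かったかもしれません"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
loss = model(**batch, labels=labels).loss
loss.backward()  # one gradient step; in practice use Trainer or a full loop
```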

Conclusion

In this post, we have explored the following areas in developing Japanese NLP applications:

  • The challenge of handling mixed scripts, implicit contextual cues and local expressions
  • Mitigating language processing challenges through both traditional and modern approaches
  • Latest advancement of local LLM development
  • Deployment of NLP applications in the travel sector to improve traveler experience

Developing robust Japanese NLP applications demands a deep understanding of linguistic complexity, careful algorithm design, and a constant balancing act between user experience, performance, and cost. As the tourism boom continues on its upward slope with government support, I see a huge opportunity in deploying LLM-backed services to enhance the experience of both inbound and domestic tourists. I look forward to new developments that deliver truly localized, context-aware services for better engagement in the travel sector!