Today, information overload has become a daily problem. We often need to dig into a large body of information and uncover the common underlying themes so we can better understand what we are facing. For example, we might want to quickly grasp user reviews to see whether customers love or hate us, and for each category, pinpoint the biggest areas where we should focus our resources. A researcher might want to find the most recurring themes in this year's economic news so they can zoom in on an area of study. A marketer might want to know what kinds of products and usage patterns have been trending on social media over the past quarter. And how about finding all the unique features of a hotel from existing text (or images), a problem we wanted to solve in our ryokan recommender FMR? For all these scenarios, topic modeling, a classic NLP (natural language processing) technique, is very helpful.
what is topic modeling?
Topic modeling uses unsupervised machine learning algorithms to extract latent themes (called topics) from a collection of documents by analyzing word distributions and clustering keywords or phrases that commonly appear together into specific topics. Done right, you get well-separated topics, each containing a set of words that are most important to that topic only (i.e., not to other topics).
As the goal of topic modeling is to give you an easier way to organize and understand large amounts of textual data, I will wrap up this post in the last section with a bird's-eye view of all the major themes in a group of news articles, and then zoom into a specific theme to find a relevant document.
meet BERTopic, our topic modeling tool
In this post, I am going to analyze Japanese news articles with BERTopic, a library that leverages Transformer encoder models and clustering to perform topic modeling.
BERTopic pipeline. Image credit: Maarten Grootendorst, creator of BERTopic.
BERTopic generates document embeddings with pre-trained encoder models, reduces the embeddings' dimensionality, clusters these embeddings, and finally generates topic representations with the class-based TF-IDF (c-TF-IDF) procedure.
One thing I love about BERTopic is that each step it carries out is very modular, allowing the user to modify the underlying algorithm without affecting other components, or even swap in another algorithm or model as desired.
The modular BERTopic architecture. Image credit: Maarten Grootendorst, creator of BERTopic.
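To make this modularity concrete, here is a minimal sketch of my own (using BERTopic's documented constructor parameters with its default sub-models) showing how each step can be supplied, and therefore swapped, explicitly:

```python
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

topic_model = BERTopic(
    embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),           # 1. embed documents
    umap_model=UMAP(n_components=5, metric="cosine"),                  # 2. reduce dimensionality
    hdbscan_model=HDBSCAN(min_cluster_size=10, prediction_data=True),  # 3. cluster the embeddings
    vectorizer_model=CountVectorizer(),                                # 4. tokenize and count terms
    ctfidf_model=ClassTfidfTransformer(),                              # 5. weigh terms per topic (c-TF-IDF)
)
```

Any of these components can be replaced without touching the others, which is exactly what we will do later for Japanese.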
As we will see in this blog post, Japanese content carries unique challenges in NLP. I am going to show you how taking advantage of this modularity to customize two steps of the BERTopic algorithm can vastly improve the output quality when topic modeling Japanese text.
prepare the environment and dataset
First we need to install BERTopic:
!pip install bertopic
from bertopic import BERTopic
For this blog post, we are going to analyze this Japanese wikinews dataset hosted on Hugging Face.
!pip install datasets
from datasets import load_dataset
dataset_id = "izumi-lab/wikinews-ja-20230728"
dset = load_dataset(dataset_id)
news = dset['train']['text']
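Before modeling, it's worth a quick sanity check on the data (a small illustrative snippet of my own; the exact counts depend on the dataset snapshot):

```python
print(len(news))      # number of news articles in the dataset
print(news[0][:100])  # peek at the first 100 characters of the first article
```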
use the default BERTopic settings
The first step in BERTopic is to generate embeddings for the input text. Embedding our dataset is essential, as it transforms each document into a vector carrying contextual meaning, which can then be compared with other documents and with the generated topics for similarity. These embeddings are used in the later steps of the algorithm.
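To make "vectors carrying contextual meaning" concrete, here is a tiny illustration of my own using the sentence-transformers library (not part of the pipeline code):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([
    "The cat sat on the mat",
    "A kitten rested on the rug",
    "Stock markets fell sharply today",
])
print(util.cos_sim(embeddings[0], embeddings[1]))  # high: similar meaning
print(util.cos_sim(embeddings[0], embeddings[2]))  # low: unrelated meaning
```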
If we leave everything at its defaults in the constructor, BERTopic will use all-MiniLM-L6-v2 as the embedding model when it builds the model. Let's see what it comes up with.
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(news)
After that, we can check the most frequent topics in our news set with topic_model.get_topic_info(). The output is:
| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 1913 | -1_editprotected_utc_11_10 | [editprotected, utc, 11, 10, 18, 20, utc9, nhk… | [産経新聞が、九州地方(沖縄県除く)と山口県に向けた「九州・山口特別版」という新聞を10月1… |
| 1 | 0 | 199 | 0_2004200635_1220114635_200412_47news | [2004200635, 1220114635, 200412, 47news, wortn… | [47NEWS(共同通信)、時事通信、産経新聞、山陽新聞、日経新聞、中央日報、ロイター通信に… |
| 2 | 1 | 145 | 1_eu_2014725_12_512 | [eu, 2014725, 12, 512, 1215, 100, utc3, 13, 37… | [16日、日本政府は閣議でモンテネグロを国家として承認することを決めた。これにより日本が国家… |
| 3 | 2 | 128 | 2_40_bysa_cc_de | [40, bysa, cc, de, jra, g1, 2015, 15000, 82124… | [11月17日のサンケイスポーツ、日経ラジオ社(ラジオNIKKEI)など報道機関各社によると… |
| 4 | 3 | 115 | 3_21durian_chanchu_13shanshan_12ioke | [21durian, chanchu, 13shanshan, 12ioke, utc9, … | [気象庁によると、台風1号「チャンチー」 (Chanchu) は、広がりをまして「大型」の台… |
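As an aside, we can inspect a single topic's term weights directly with topic_model.get_topic(), which returns the topic's terms as (term, weight) pairs sorted by their c-TF-IDF scores:

```python
# Top terms and their c-TF-IDF weights for topic 0
print(topic_model.get_topic(0))
```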
We can also execute topic_model.get_document_info() to check the assigned topic for each document:
| | Document | Topic | Name | Representation | Representative_Docs | Top_n_words | Probability | Representative_document |
|---|---|---|---|---|---|---|---|---|
| 0 | 台湾の自由時報によると、3日、陳水扁総統夫人の呉淑珍氏と馬永成元総統府副秘書長、林徳訓総統府… | -1 | -1_editprotected_utc_11_10 | [editprotected, utc, 11, 10, 18, 20, utc9, nhk… | [産経新聞が、九州地方(沖縄県除く)と山口県に向けた「九州・山口特別版」という新聞を10月1… | editprotected - utc - 11 - 10 - 18 - 20 - utc9… | 0.000000 | False |
| 1 | カナダ・トロント市のジョン・トーリー氏が今年2月に市長を辞職したことを受け、今月26日に実施… | 0 | 0_2004200635_1220114635_200412_47news | [2004200635, 1220114635, 200412, 47news, wortn… | [47NEWS(共同通信)、時事通信、産経新聞、山陽新聞、日経新聞、中央日報、ロイター通信に… | 2004200635 - 1220114635 - 200412 - 47news - wo… | 0.816134 | False |
| 2 | 岩波書店は約10年ぶりの大改訂となる「広辞苑 第六版」を1月11日に発売した。\nJ-CAS… | 53 | 53_7811_67_oriconstylenews24_d06d902i200511d90… | [7811, 67, oriconstylenews24, d06d902i200511d9… | [毎日新聞によると、日本の文部科学省の幹部らに対して「刺殺する」とする内容の書き込みがインタ… | 7811 - 67 - oriconstylenews24 - d06d902i200511… | 1.000000 | True |
| 3 | 4日、長野県安曇野市の長野県立こども病院は、心臓疾患を持って1,100gの低出生体重で生まれ… | 28 | 28_neanderthalensis_520828_homo_82919 | [neanderthalensis, 520828, homo, 82919, 102552… | [4日、長野県安曇野市の長野県立こども病院は、心臓疾患を持って1,100gの低出生体重で生まれ… | neanderthalensis - 520828 - homo - 82919 - 102… | 0.939870 | True |
| 4 | ニッポン ニュース ネットワーク (NNN) によると27日、山形県庄内町の羽越線の脱線事故… | 3 | 3_21durian_chanchu_13shanshan_12ioke | [21durian, chanchu, 13shanshan, 12ioke, utc9, … | [気象庁によると、台風1号「チャンチー」 (Chanchu) は、広がりをまして「大型」の台… | 21durian - chanchu - 13shanshan - 12ioke - utc… | 0.317053 | False |
We can examine the terms in the biggest topic clusters with topic_model.visualize_barchart().
As we can see from the 3 outputs above, the topics and their terms aren't very helpful, right? They are all in English even though our data is Japanese! And for topic 5, even though we can kinda grasp that this topic is about earthquake forecasting, a lot of the words here are not meaningful at all.
switch to a multilingual embedding model
Given the poor performance of the default embedding model on Japanese, let's see if switching to the default multilingual model helps. This is done by passing language="multilingual" to the BERTopic constructor, which tells it to use the paraphrase-multilingual-MiniLM-L12-v2 embedding model under the hood.
topic_model = BERTopic(language="multilingual")
topics, _ = topic_model.fit_transform(news)
Doing another round of topic_model.visualize_barchart() gives us this chart, which is much better, as the model generates more semantically correct embeddings for Japanese text. However, the output still feels quite unclean, as some terms are just phrases or particles without any specific meaning. For example, topics 0 and 7 both contain similar terms drawn from the names of the 3 biggest daily newspapers, but these don't add much value to understanding what each topic is about. Also recall that a good topic model will produce clusters whose words are exclusive to that cluster, yet the most popular words in #0 and #7 are in fact similar. In addition, topic 5 hints at jra, which stands for the Japan Racing Association (horseracing), but the other words are not relevant at all. Topics 3 and 6 are about the Meteorological Agency's forecasts and the USGS respectively, but they contain a lot of stopwords, those frequently used words with little semantic value such as "now", "then", "later".
customize with a Japanese tokenizer
Now that we have a multilingual embedding model that can generate meaningful representations for our input Japanese documents, let's do more customization to produce better topic representations. Our goal this round is to create more distinct and easier-to-understand topics than what we got last time (long phrases, numbers, particles, similar terms across topics, etc.).
To do this, we will use a tokenizer created specifically to handle Japanese text. A tokenizer (equipped with a dictionary) breaks down a stream of text into smaller, meaningful units called tokens. These tokens can be individual words or phrases, depending on the level of granularity required.
The reason we need a Japanese tokenizer is that, unlike in English, Japanese words in a sentence are not separated by whitespace. Lacking a language-specific tokenizer with a local dictionary also prevents the model from recognizing and breaking down the common terms of the language. That's why we got long, meaningless, shared phrases in the last step. Just like the embedding model, the default tokenizer in BERTopic doesn't tokenize Japanese (or other East Asian languages). Using a tokenizer built for Japanese text processing will help us extract the individual terms as tokens so our topics carry more insight.
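Here is a tiny illustration of the problem (my own toy example, not from the pipeline): splitting on whitespace, which generic tokenizers effectively rely on, gets us nowhere with Japanese.

```python
text = "気象庁によると台風が上陸した"  # "According to the JMA, a typhoon made landfall"
print(text.split())  # ['気象庁によると台風が上陸した']: the whole sentence comes back as one "word"
```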
We will then pass this tokenizer to the CountVectorizer from the scikit-learn machine learning library. CountVectorizer counts the occurrences of words (tokens) in a collection of text documents and creates a matrix where each row represents a document and each column represents a unique token. This bag-of-words matrix is then passed on to the c-TF-IDF algorithm that BERTopic implements, to weigh each token's importance in the topic cluster and create accurate term distributions for the topic.
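To see what CountVectorizer produces, here is a toy example of my own (English words, just to keep the illustration readable):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat ran", "the dog ran"]
vec = CountVectorizer()
matrix = vec.fit_transform(docs)    # rows = documents, columns = unique tokens
print(vec.get_feature_names_out())  # ['cat' 'dog' 'ran' 'sat' 'the']
print(matrix.toarray())             # per-document token counts
```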
Since CountVectorizer needs the words rendered as meaningful tokens to create the count matrix, we can now see the importance of a good tokenizer. Just like the embedding model, we can plug in our own tokenizer, thanks to the modular architecture of BERTopic.
The tokenizer I chose is Janome, an easy-to-use and versatile library that offers not just tokenization, but also text processing tasks such as lowercase/uppercase conversion, specific token filters, regular expression operations, compound noun filters, removing or keeping words with specific parts of speech (POS), counting word frequencies, etc.
We will be using the wakati mode in Janome, which returns only the surface form of each token. All the extra information for a token (POS, pronunciation, etc.) is not loaded from the dictionary, which reduces memory usage.
!pip install janome
from janome.tokenizer import Tokenizer
from sklearn.feature_extraction.text import CountVectorizer

# Create the tokenizer once, in wakati mode, instead of on every call
t = Tokenizer(wakati=True)

def tokenize_jp(text):
    # Return the surface-form tokens of the input text as a list
    return list(t.tokenize(text))
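A quick check that the tokenizer behaves as expected (illustrative; the exact token boundaries come from Janome's dictionary):

```python
print(tokenize_jp("気象庁によると台風が上陸した"))
# A list of surface-form tokens, e.g. ['気象庁', 'に', 'よる', 'と', '台風', 'が', '上陸', 'し', 'た']
```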
We instantiate a new CountVectorizer with our tokenizer, then call .update_topics() with this CountVectorizer instance to recreate the term distributions without re-training our model.
vectorizer = CountVectorizer(tokenizer=tokenize_jp)
topic_model.update_topics(news, vectorizer_model=vectorizer)
topic_model.visualize_barchart()
Now we see that our topics are gradually taking shape. Topics 2 through 7 begin to surface more meaning: train transport, typhoons, media stations, horseracing, earthquake observation, and food retailing, respectively. We still get some empty and particle-like terms, which we will remove in the next step.
remove stopwords and punctuations
In this improvement, we are going to use a regular expression to remove punctuation and numbers from the topics. We will also remove common stopwords that provide little semantic value for understanding the input text. The stopword list I use is consolidated from here and here.
We will revise our tokenizer with the following code and update our model:
import re

# Load the stopword list once
with open('stopwords-ja.txt', encoding='utf-8') as f:
    stopwords = set(f.read().split())

t = Tokenizer(wakati=True)

def tokenize_jp(text):
    # Remove punctuation and numbers before tokenizing
    text = re.sub(r'[^\w\s]|\d+', '', text)
    tok = list(t.tokenize(text))
    # Drop stopwords that carry little semantic value
    return [w for w in tok if w not in stopwords]
vectorizer = CountVectorizer(tokenizer=tokenize_jp)
topic_model.update_topics(news, vectorizer_model=vectorizer)
topic_model.visualize_barchart()
The result this time is way better. Each topic carries a strong, distinct theme that sets it apart from the others. The 8 most popular topics are: arresting criminal suspects, deaths of notable people, train transport, typhoons, media broadcasts, horseracing, earthquake/tsunami observation, and food manufacturing/retail. Moreover, the most popular terms in each topic cluster are clear, meaningful, and add value to understanding the topic.
We can also visualize the hierarchical structure of the topics and find potential ones that can be merged later. This can be done by the following code:
hierarchical_topics = topic_model.hierarchical_topics(news)
tree = topic_model.get_topic_tree(hierarchical_topics)
print(tree)
Expand the following to visualize the topic tree generated:
.
├─選手_試合_戦_リーグ_チーム
│ ├─試合_リーグ_戦_選手_チーム
│ │ ├─j_リーグ_チーム_昇格_クラブ
│ │ │ ├─j_リーグ_チーム_昇格_jfl
│ │ │ │ ├─■──j_リーグ_加盟_jfl_チーム ── Topic: 55
│ │ │ │ └─■──j_リーグ_降格_昇格_試合 ── Topic: 73
│ │ │ └─■──j_リーグ_優勝_ガンバ_レッズ ── Topic: 76
│ │ └─試合_回_選手_戦_野球
│ │ ├─回_試合_戦_点_大会
│ │ │ ├─回_高校_甲子園_大会_点
│ │ │ │ ├─■──部員_生徒_高校_野球_同校 ── Topic: 59
│ │ │ │ └─回_高校_点_試合_甲子園
│ │ │ │ ├─■──回_甲子園_高校_点_大会 ── Topic: 48
│ │ │ │ └─■──回_駒大_苫小牧_選手_高校 ── Topic: 61
│ │ │ └─戦_試合_優勝_対_チーム
│ │ │ ├─大会_戦_選手_ブラジル_点
│ │ │ │ ├─■──ブラジル_戦_大会_ワールドカップ_点 ── Topic: 27
│ │ │ │ └─■──優勝_回_選手_大会_キューバ ── Topic: 35
│ │ │ └─試合_戦_チーム_シリーズ_優勝
│ │ │ ├─■──優勝_シリーズ_回_日本ハム_千葉ロッテ ── Topic: 19
│ │ │ └─■──試合_チーム_戦_対_年度 ── Topic: 36
│ │ └─監督_野球_球団_プロ_ドラフト
│ │ ├─監督_球団_ドラフト_野球_プロ
│ │ │ ├─■──ドラフト_指名_巡_プロ_野球 ── Topic: 45
│ │ │ └─監督_球団_球場_契約_命名
│ │ │ ├─■──監督_就任_オシム_氏_リーグ ── Topic: 25
│ │ │ └─■──球場_命名_大阪_球団_権 ── Topic: 16
│ │ └─安打_達成_選手_記録_大リーグ
│ │ ├─■──引退_清原_現役_球団_選手 ── Topic: 79
│ │ └─■──安打_達成_記録_イチロー_大リーグ ── Topic: 41
│ └─選手_獲得_位_オリンピック_メダル
│ ├─選手_オリンピック_獲得_メダル_金メダル
│ │ ├─■──秒_記録_世界_選手_男子 ── Topic: 81
│ │ └─選手_オリンピック_獲得_メダル_金メダル
│ │ ├─■──選手_獲得_オリンピック_メダル_金メダル ── Topic: 44
│ │ └─■──選手_位_メダル_フィギュア_スケート ── Topic: 77
│ └─■──位_駅伝_区_区間_トップ ── Topic: 99
└─ _よる_日_者_人
├─選挙_党_投票_議席_候補
│ ├─選挙_投票_議席_党_候補
│ │ ├─投票_選挙_市長_知事_氏
│ │ │ ├─■──投票_用紙_区_票_選挙 ── Topic: 94
│ │ │ └─■──選挙_市長_知事_投票_氏 ── Topic: 15
│ │ └─議席_選挙_党_党首_連立
│ │ ├─議席_選挙_党_候補_区
│ │ │ ├─■──議席_選挙_区_比例_民主党 ── Topic: 87
│ │ │ └─■──党_議席_候補_選挙_国民党 ── Topic: 78
│ │ └─■──spd_cdu_党首_連邦_首相 ── Topic: 80
│ └─大臣_自民党_内閣_安倍_議員
│ ├─内閣_総裁_大臣_首相_麻生
│ │ ├─■──首相_辞任_党_政権_内閣 ── Topic: 51
│ │ └─■──総裁_票_麻生_内閣_大臣 ── Topic: 54
│ └─■──解散_自民党_衆議院_総理_議員 ── Topic: 75
└─ _よる_日_者_人
├─ _者_よる_年_月日
│ ├─馬_競馬_競走_騎手_レース
│ │ ├─馬_競馬_騎手_競走_レース
│ │ │ ├─■──馬_ユニック_オグリキャップ_引退_牡馬 ── Topic: 93
│ │ │ └─馬_競馬_騎手_競走_レース
│ │ │ ├─■──競馬_騎手_馬_賞_レース ── Topic: 22
│ │ │ └─■──馬_競馬_ばん_騎手_馬術 ── Topic: 24
│ │ └─■──競馬_開催_馬_競走_インフルエンザ ── Topic: 60
│ └─ _者_よる_さん_年
│ ├─容疑_者_逮捕_事件_歳
│ │ ├─容疑_事件_警察_男_者
│ │ │ ├─容疑_事件_男_警察_者
│ │ │ │ ├─■──火災_放火_階_出火_店 ── Topic: 8
│ │ │ │ └─■──容疑_事件_警察_者_男 ── Topic: 1
│ │ │ └─■──脅迫_書き込み_容疑_妨害_逮捕 ── Topic: 66
│ │ └─容疑_逮捕_者_覚醒剤_酒井
│ │ ├─容疑_逮捕_覚醒剤_酒井_者
│ │ │ ├─■──容疑_覚醒剤_酒井_所持_逮捕 ── Topic: 18
│ │ │ └─■──容疑_者_逮捕_府警_円 ── Topic: 33
│ │ └─■──容疑_監視_暴行_者_員 ── Topic: 82
│ └─ _年_よる_氏_月日
│ ├─円_億_証券_株式_取引
│ │ ├─証券_取引_株_株式_東証
│ │ │ ├─■──証券_取引_東証_株_株式 ── Topic: 84
│ │ │ └─■──取引_証券_ライブ_ドア_上場 ── Topic: 67
│ │ └─円_億_経営_会社_統合
│ │ ├─円_億_経営_会社_統合
│ │ │ ├─■──等_当せん_億_toto_くじ ── Topic: 74
│ │ │ └─円_経営_億_会社_統合
│ │ │ ├─■──円_兆_人口_上昇_給与 ── Topic: 26
│ │ │ └─経営_億_会社_統合_買収
│ │ │ ├─■──統合_ビール_買収_経営_会社 ── Topic: 7
│ │ │ └─■──申請_破産_億_経営_負債 ── Topic: 14
│ │ └─■──発行_休刊_新聞_夕刊_朝刊 ── Topic: 91
│ └─ _年_氏_さん_よる
│ ├─氏_年_さん_ _大統領
│ │ ├─大統領_政府_北朝鮮_よれ_投票
│ │ │ ├─大統領_政府_武装_人_派
│ │ │ │ ├─■──ロシア_大統領_ウクライナ_ボルソナロ_プーチン ── Topic: 86
│ │ │ │ └─政府_大統領_武装_派_人
│ │ │ │ ├─政府_大統領_武装_投票_よれ
│ │ │ │ │ ├─■──大統領_投票_選挙_候補_よれ ── Topic: 30
│ │ │ │ │ └─■──政府_デモ_武装_スーダン_軍 ── Topic: 4
│ │ │ │ └─■──爆発_爆弾_負傷_テロ_ラマダーン ── Topic: 72
│ │ │ └─北朝鮮_モンテネグロ_ミサイル_韓国_核
│ │ │ ├─■──モンテネグロ_独立_セルビア_eu_加盟 ── Topic: 89
│ │ │ └─北朝鮮_ミサイル_韓国_核_発射
│ │ │ ├─■──北朝鮮_ミサイル_韓国_核_発射 ── Topic: 20
│ │ │ └─■──議定_量_排出_ガス_目標 ── Topic: 97
│ │ └─氏_年_さん_歳_死去
│ │ ├─判決_被告_参拝_死刑_裁判
│ │ │ ├─■──参拝_靖国神社_小泉_判断_平和 ── Topic: 53
│ │ │ └─判決_被告_死刑_裁判_控訴
│ │ │ ├─■──判決_原告_訴訟_地裁_側 ── Topic: 43
│ │ │ └─■──死刑_被告_判決_弁護_控訴 ── Topic: 21
│ │ └─氏_年_さん_死去_fifa
│ │ ├─氏_年_さん_死去_fifa
│ │ │ ├─氏_さん_年_死去_歳
│ │ │ │ ├─氏_さん_年_死去_歳
│ │ │ │ │ ├─■──さん_結婚_女優_沢尻_交際 ── Topic: 39
│ │ │ │ │ └─■──氏_年_死去_さん_歳 ── Topic: 0
│ │ │ │ └─■──ノーベル_賞_受賞_氏_授賞 ── Topic: 52
│ │ │ └─fifa_会長_サッカー_連盟_ブラッター
│ │ │ ├─漢字_字_今年_貫主_清水寺
│ │ │ │ ├─■──漢字_字_今年_貫主_清水寺 ── Topic: 88
│ │ │ │ └─■──入場_博覧_万博_開館_ディズニーランド ── Topic: 100
│ │ │ └─fifa_会長_サッカー_連盟_ブラッター
│ │ │ ├─■──開催_地_競技_五輪_オリンピック ── Topic: 64
│ │ │ └─■──fifa_会長_サッカー_連盟_ブラッター ── Topic: 12
│ │ └─相撲_力士_青龍_親方_琴
│ │ ├─■──親方_相撲_力士_協会_時津 ── Topic: 31
│ │ └─■──青龍_朝_琴_場所_横綱 ── Topic: 68
│ └─放送_ _感染_ _記事
│ ├─放送_番組_視聴_テレビ_nhk
│ │ ├─放送_番組_視聴_nhk_テレビ
│ │ │ ├─■──番組_放送_関西テレビ_フジテレビ_ドラマ ── Topic: 9
│ │ │ └─■──放送_視聴_中継_nhk_率 ── Topic: 23
│ │ └─■──放送_アナログ_受信_デジタル_デジ ── Topic: 70
│ └─ _感染_ _記事_確認
│ ├─感染_確認_ _インフルエンザ_商品
│ │ ├─宇宙_遺産_打ち上げ_飛行_世界
│ │ │ ├─遺産_登録_世界_発見_公園
│ │ │ │ ├─■──遺産_登録_世界_公園_知床 ── Topic: 85
│ │ │ │ └─発見_動物_ゲノム_化石_伐採
│ │ │ │ ├─■──伐採_脱走_発見_動物_園 ── Topic: 37
│ │ │ │ └─■──ゲノム_化石_発見_絶滅_解読 ── Topic: 69
│ │ │ └─宇宙_打ち上げ_飛行_探査_iss
│ │ │ ├─■──惑星_冥王星_天体_ub_粒子 ── Topic: 90
│ │ │ └─■──宇宙_打ち上げ_飛行_探査_iss ── Topic: 11
│ │ └─感染_インフルエンザ_商品_販売_確認
│ │ ├─感染_インフルエンザ_確認_新型_型
│ │ │ ├─■──インフルエンザ_新型_感染_細胞_アレルギー ── Topic: 42
│ │ │ └─■──感染_インフルエンザ_鳥_who_養鶏 ── Topic: 10
│ │ └─商品_販売_食品_製造_問題
│ │ ├─建築_手術_計算_移植_出産
│ │ │ ├─手術_移植_出産_不正_患者
│ │ │ │ ├─■──移植_出産_手術_さま_秋篠宮 ── Topic: 46
│ │ │ │ └─不正_試験_流出_職員_アドレス
│ │ │ │ ├─■──接種_合格_検査_処分_患者 ── Topic: 57
│ │ │ │ └─■──流出_アドレス_パスワード_情報_記述 ── Topic: 63
│ │ │ └─■──建築_計算_設計_構造_物件 ── Topic: 96
│ │ └─商品_販売_食品_製造_製品
│ │ ├─商品_販売_食品_製造_製品
│ │ │ ├─■──商品_食品_販売_製造_牛 ── Topic: 5
│ │ │ └─■──製品_電池_事故_産業_石綿 ── Topic: 58
│ │ └─発売_windows_ゲーム_netscape_xbox
│ │ ├─■──発売_windows_ゲーム_netscape_xbox ── Topic: 29
│ │ └─■──事業_放送_デジタル_終了_フィルム ── Topic: 92
│ └─ _ブロック_記事_カテゴリ_語
│ ├─ _ブロック_記事_カテゴリ_語
│ │ ├─■──カテゴリ_含ま_ページ_表示_以下 ── Topic: 13
│ │ └─ブロック_語_年月日_ _記事
│ │ ├─語_年月日_ _記事_
│ │ │ ├─■──年月日_語_ _記事_ ── Topic: 49
│ │ │ └─■──財団_版_ウィキメディア_ウィキペディア_ウィキニュース ── Topic: 50
│ │ └─■──ブロック_返信_ip_アカウント_投稿 ── Topic: 95
│ └─■──____ ── Topic: 62
└─地震_台風_気象庁_駅_日
├─駅_事故_運転_列車_線
│ ├─駅_列車_運転_線_jr
│ │ ├─■──運行_列車_特急_系_車両 ── Topic: 71
│ │ └─駅_運転_線_列車_jr
│ │ ├─■──駅_開業_線_鉄道_新幹線 ── Topic: 17
│ │ └─駅_運転_列車_jr_事故
│ │ ├─■──運転_駅_列車_電車_線 ── Topic: 6
│ │ └─■──復旧_駅_jr_区間_事故 ── Topic: 32
│ └─事故_墜落_航空_乗客_人
│ ├─事故_コースター_乗用車_車線_雪崩
│ │ ├─■──コースター_事故_ジェット_アトラクション_両目 ── Topic: 98
│ │ └─■──乗用車_車線_事故_雪崩_トラック ── Topic: 38
│ └─墜落_航空_事故_乗客_機
│ ├─■──船_衝突_タンカー_事故_航行 ── Topic: 83
│ └─墜落_航空_機_乗客_事故
│ ├─■──ヘリコプター_墜落_事故_自衛隊_ヘリ ── Topic: 56
│ └─■──航空_墜落_乗客_機_乗員 ── Topic: 28
└─地震_台風_気象庁_日_時
├─台風_号_時_気象庁_上陸
│ ├─■──ハリケーン_半島_フロリダ_カテゴリー_時 ── Topic: 34
│ └─■──台風_号_気象庁_上陸_進ん ── Topic: 2
└─地震_津波_観測_震度_気象庁
├─地震_津波_震度_震源_観測
│ ├─地震_津波_震度_震源_観測
│ │ ├─■──噴火_火山_火口_区域_噴煙 ── Topic: 47
│ │ └─■──地震_津波_震度_震源_観測 ── Topic: 3
│ └─■──竜巻_突風_気象台_礼文_棟 ── Topic: 101
└─梅雨_平年_気温_豪雨_ミリ
├─■──梅雨_豪雨_平年_ミリ_災害 ── Topic: 40
└─■──気温_度_猛暑_観測_気象庁 ── Topic: 65
Notice from the tree that some words should be combined for better understanding. For example, the 3rd sub-branch in the tree shows j_リーグ, which is the words "J" and "League" split apart. Since the nearby branches contain the words for athlete, match, and win (in Japanese, of course), this should be a topic around soccer. Thus the right tokenized term should be the compound noun "J League", Jリーグ. Janome has a filter that finds compound nouns in its dictionary and tokenizes them correctly. See how it works in the following snippet, in which we retrieve compound words, remove punctuation and particles, and convert English characters to lowercase.
from janome.analyzer import Analyzer
from janome.tokenfilter import *
from janome.tokenizer import Tokenizer

s = '日本サッカー協会(JFA)とJリーグは4月20日、22日に行われる「JFA/Jリーグポストユースマッチ」に参加する選抜メンバー18人を発表した。'

# Plain tokenization in wakati mode: compound nouns get split apart
t = Tokenizer()
tok = list(t.tokenize(s, wakati=True))
print(tok)

# Analyzer pipeline: merge compound nouns, drop symbols (記号) and particles (助詞), lowercase
token_filters = [CompoundNounFilter(), POSStopFilter(['記号', '助詞']), LowerCaseFilter()]
a = Analyzer(token_filters=token_filters)
for token in a.analyze(s):
    print(token)
As we can see in the following output, when we use the plain Janome tokenizer, "Jリーグ" (J League) is broken up into two separate tokens, 'J' and 'リーグ', just like what we saw in the tree above. And 日本サッカー協会 (Japan Football Association) is broken up into three tokens: '日本', 'サッカー', '協会'. Notice that with the CompoundNounFilter(), these two terms are tokenized correctly.
['日本', 'サッカー', '協会', '(', 'JFA', ')', 'と', 'J', 'リーグ', 'は', '4', '月', '20', '日', '、', '22', '日', 'に', '行わ', 'れる', '「', 'JFA', '/', 'J', 'リーグポストユースマッチ', '」', 'に', '参加', 'する', '選抜', 'メンバー', '18', '人', 'を', '発表', 'し', 'た', '。']
日本サッカー協会(jfa) 名詞,複合,,,,,日本サッカー協会(jfa),ニッポンサッカーキョウカイ***,ニッポンサッカーキョーカイ***
jリーグ 名詞,複合,,,,,jリーグ,リーグ,リーグ
4月20日 名詞,複合,,,,,4月20日,ツキニチ,ツキニチ
22日 名詞,複合,,,,,22日,ニチ,ニチ
行わ 動詞,自立,,,五段・ワ行促音便,未然形,行う,オコナワ,オコナワ
れる 動詞,接尾,,,一段,基本形,れる,レル,レル
jfa/jリーグポストユースマッチ 名詞,複合,,,,,jfa/jリーグポストユースマッチ,,
参加 名詞,サ変接続,,,,,参加,サンカ,サンカ
する 動詞,自立,,,サ変・スル,基本形,する,スル,スル
選抜メンバー18人 名詞,複合,,,,,選抜メンバー18人,センバツメンバーニン,センバツメンバーニン
発表 名詞,サ変接続,,,,,発表,ハッピョウ,ハッピョー
し 動詞,自立,,,サ変・スル,連用形,する,シ,シ
た 助動詞,,,*,特殊・タ,基本形,た,タ,タ
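If we wanted these compound nouns to flow into the topic representations, we could wire the Analyzer into the vectorizer the same way as before. The post stops at demonstrating the filter, so treat this as a sketch of my own, assuming the news and topic_model objects from earlier:

```python
from janome.analyzer import Analyzer
from janome.tokenfilter import CompoundNounFilter, POSStopFilter, LowerCaseFilter
from sklearn.feature_extraction.text import CountVectorizer

# Build the Analyzer once: merge compound nouns, drop symbols/particles, lowercase
analyzer = Analyzer(token_filters=[CompoundNounFilter(),
                                   POSStopFilter(['記号', '助詞']),
                                   LowerCaseFilter()])

def tokenize_jp_compound(text):
    # Each analyzed token carries its (possibly compounded) surface form
    return [token.surface for token in analyzer.analyze(text)]

vectorizer = CountVectorizer(tokenizer=tokenize_jp_compound)
topic_model.update_topics(news, vectorizer_model=vectorizer)
```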
switch to other pre-trained embedding models
My curiosity piqued by the JMTEB leaderboard, which measures the performance of different embedding models on various Japanese text processing tasks, as well as this post, I decided to switch to the Ruri v3 embedding model, which has a very high rating on clustering (relevant because BERTopic clusters the embeddings after the embedding stage). I retrained the model with Ruri and used the same Janome tokenizer code as above to refine the topic representation.
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("cl-nagoya/ruri-v3-130m")
vectorizer = CountVectorizer(tokenizer=tokenize_jp)
topic_model = BERTopic(embedding_model=sentence_model, vectorizer_model=vectorizer)
topics, _ = topic_model.fit_transform(news)
topic_model.visualize_barchart()
Here are the topic clusters generated. As we can see, the topic representation is very similar to what we got with the default multilingual embedding model; not much improvement is gained here.
visualize topics and documents to see their relationship
Finally, let's visualize the topics and the documents inside them to check relevancy, using topic_model.visualize_documents(news).
Each dot is a document. Documents close together are similar in meaning and topic (because their position is based on text embeddings). Choosing the topic on the right highlights where that topic is on the plot.
We get the following nice rendition showing all the topic clusters and documents. Each news article is represented as a colored dot (the color corresponding to its topic). Just looking at this high-level representation, we can quickly spot the most popular topics, namely topic 0 (orange, obituaries), topic 1 (green, criminal suspects), and topic 2 (red, typhoons), as they are the biggest clusters with the most documents assigned.
Additionally, the 3 dense blue clusters at the bottom of the graph, #29, #38, and #79, are all related to infection (COVID and bird flu). Recall that we first used a context- and language-aware embedding model to generate text embeddings for all the news articles before clustering them; thus articles that are similar in meaning and topic are positioned closer together.
We can also zoom in to closely examine the documents within a topic. Clicking #3 (earthquake) on the topic list to the right highlights where that topic sits on the plot. Then we can use the + sign on the upper-right toolbar to zoom into topic 3 and hover over a dot in the cluster. Doing so shows us the content of a document about an earthquake in Chiba prefecture.
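One practical tip based on BERTopic's documented visualize_documents() parameters: for a large corpus you can precompute the embeddings and their 2D reduction once, then reuse them so repeated plotting doesn't re-embed everything. A sketch, assuming the sentence_model from the previous section:

```python
from umap import UMAP

# Embed once, reduce to 2D once, then reuse for any number of plots
embeddings = sentence_model.encode(news, show_progress_bar=True)
reduced_embeddings = UMAP(n_neighbors=10, n_components=2,
                          min_dist=0.0, metric="cosine").fit_transform(embeddings)
topic_model.visualize_documents(news, reduced_embeddings=reduced_embeddings)
```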
save and load models
I like to save models for further benchmarking and to avoid expending GPU time on rebuilding them. Fortunately, BERTopic makes it easy to save and load models offline.
If you are using Google Colab, run the following code to authorize access to your Google Drive. A window will pop up asking which Google account to sign in with and what access to grant. Strangely, it expects you to authorize access to EVERYTHING; merely unchecking one item, like Photos, will fail the authorization.
from google.colab import drive
drive.mount('/content/drive/')
After that, you can save your trained model by:
topic_model.save("/content/drive/My Drive/your_model_name", serialization="pickle")
Loading it back can be done with:
loaded_model = BERTopic.load("/content/drive/My Drive/your_model_name")
From there, you can again examine the topics, documents, or update the model as desired.
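As an aside, recent BERTopic versions also support safetensors serialization, which avoids pickling arbitrary Python objects. A sketch, using the parameters from BERTopic's saving/loading documentation (treat the exact values as an assumption for this setup):

```python
# Save with safetensors; the embedding model is stored as a pointer to its name
topic_model.save(
    "/content/drive/My Drive/your_model_name",
    serialization="safetensors",
    save_ctfidf=True,
    save_embedding_model="cl-nagoya/ruri-v3-130m",
)
```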
conclusion
In this blog post, we walked through extracting important and common topics from a Japanese dataset using BERTopic, so that we can understand the most popular themes in the news sample. To produce better representations of BOTH the input and output text, we used models built to handle the Japanese language:
- We employed a language-aware embedding model to generate contextual vectors for the input text. Since this is the first step in the whole BERTopic workflow, we should be judicious in selecting the embedding model (garbage in, garbage out, as our first try with the default model showed).
- We customized a Japanese tokenizer to refine the topic representations, splitting the text into correct tokens and removing meaningless words and punctuation so our topics stand out.
- We visualized the relationships between topics and examined the documents inside topics to check the quality of our model.
We haven't yet touched parameter tuning for the dimensionality reduction, clustering, and tokenization steps. I look forward to digging deeper into them and sharing my findings in a later post.