Gairaigo Segmentation and Translation

This server attempts to segment Japanese loanwords (gairaigo) into their component parts and to generate basic translations for them. Two methods are used: the CLST (Corpus-based Loanword Segmentation and Translation) method, and the MeCab Japanese morphological analysis package combined with the Unidic lexicon. Both are described below. (Try a couple: クーリングパイプ, アルキレンオキシドポリマ.)

CLST Method
This method is described in a recent paper. In summary, it:

  1. uses a large list of known gairaigo to find all possible ways of segmenting the input term;
  2. looks up each possible segment in a large gairaigo-English dictionary and generates all the combinations of possible meanings of the term;
  3. checks each possible translation against the Google English N-gram database to see whether that sequence of words occurs in web pages, and with what frequency. (As this database is about 50 GB, checking a large number of combinations may take a few seconds.)
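
The three steps above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the server's actual code: the word list, the dictionary entries and the n-gram counts below are tiny invented stand-ins for the real resources, and the function names are hypothetical.

```python
from itertools import product

# Toy stand-ins for the real resources. All entries and counts are
# invented for illustration; they are not taken from the actual data.
KNOWN_GAIRAIGO = {"クーリング", "リング", "パイプ"}
DICTIONARY = {
    "クーリング": ["cooling"],
    "リング": ["ring"],
    "パイプ": ["pipe", "tube"],
}
NGRAM_COUNTS = {"cooling pipe": 1200, "cooling tube": 300}


def segmentations(term, known):
    """Step 1: return every way of splitting term into known gairaigo."""
    if not term:
        return [[]]
    results = []
    for i in range(1, len(term) + 1):
        head = term[:i]
        if head in known:
            # Keep this prefix only if the remainder can also be segmented.
            for rest in segmentations(term[i:], known):
                results.append([head] + rest)
    return results


def candidate_translations(segments, dictionary):
    """Step 2: combine the possible meanings of each segment."""
    options = [dictionary.get(seg, []) for seg in segments]
    return [" ".join(words) for words in product(*options)]


def rank_translations(term, known, dictionary, ngram_counts):
    """Step 3: order candidates by n-gram frequency, most frequent first."""
    scored = []
    for segs in segmentations(term, known):
        for candidate in candidate_translations(segs, dictionary):
            scored.append((ngram_counts.get(candidate, 0), candidate, segs))
    return sorted(scored, key=lambda item: item[0], reverse=True)


if __name__ == "__main__":
    for count, text, segs in rank_translations(
        "クーリングパイプ", KNOWN_GAIRAIGO, DICTIONARY, NGRAM_COUNTS
    ):
        print(count, text, segs)
```

With a real word list the recursive enumeration can produce many segmentations for a long term, which is one reason the frequency check in step 3 may take a few seconds.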

MeCab/Unidic
MeCab is a major open-source Japanese morphological analysis and part-of-speech tagging package. It uses a statistical model (conditional random fields) together with a large morpheme lexicon to find the boundaries between morphemes in Japanese text. In the past, Japanese morphological analysis systems did not handle gairaigo very effectively. The latest version of the Unidic morpheme lexicon (2.1.1) has been extended with a large number of gairaigo (about 95,000), and has simple English translations attached to about half of them.
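
For readers who want to try MeCab on a term themselves, here is a minimal sketch that runs the mecab command-line tool from Python and parses its default output (one "surface, TAB, comma-separated features" line per morpheme, followed by a line reading EOS). It assumes that mecab is on the PATH and that its dictionary uses UTF-8. The function names are hypothetical, and the content of the feature fields depends on which lexicon is installed.

```python
import shutil
import subprocess


def parse_mecab_output(text):
    """Parse MeCab's default output into (surface, [features]) pairs."""
    morphemes = []
    for line in text.splitlines():
        # "EOS" marks the end of a sentence; skip it and any blank lines.
        if line == "EOS" or not line.strip():
            continue
        surface, _, features = line.partition("\t")
        morphemes.append((surface, features.split(",")))
    return morphemes


def segment_with_mecab(term):
    """Run mecab on one term; return None if mecab is not installed."""
    if shutil.which("mecab") is None:
        return None
    result = subprocess.run(
        ["mecab"],
        input=term + "\n",
        capture_output=True,
        text=True,
        encoding="utf-8",  # assumption: the installed dictionary is UTF-8
        check=True,
    )
    return parse_mecab_output(result.stdout)


if __name__ == "__main__":
    morphemes = segment_with_mecab("クーリングパイプ")
    if morphemes is None:
        print("mecab not found on PATH")
    else:
        for surface, features in morphemes:
            print(surface, features)
```

The surfaces returned give the segmentation; where the lexicon supplies an English gloss, it appears among the feature fields.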

Warning
The server code is still a bit buggy (well, all code is...). Occasionally it will hang or abort. I'm chasing a couple of over-run issues, so if you find it misbehaves, email me the offending gairaigo.

Jim Breen
March 2013/May 2018