Sentence-Dictionary Linking

From EDRDG Wiki
Revision as of 22:45, 15 March 2013 by JimBreen (talk | contribs) (Index Format)

To enable dictionary systems, apps, etc. to use the Japanese-English sentences from the Tanaka Corpus/Tatoeba as examples, a set of word-level indices has been compiled and associated with each sentence (at present about 150,000 sentences have indices). These indices are maintained within the Tatoeba system (there is a special GUI for this), and are periodically downloaded for use with dictionary systems. The indices are particularly associated with the JMdict/EDICT2 dictionary files, but may also be used elsewhere.

Index Format

The indices for a sentence consist of a line of text made up of space-delimited index elements, one for each word in the sentence. The following is an example:

Sentence: その家はかなりぼろ屋になっている。

Indices: 其の[01]{その} 家(いえ)[01] は 可也{かなり} ぼろ屋[01]~ になる[01]{になっている}

The format of the index elements is as follows:

  • the usual headword as it appears in the dictionary. Even if the word is usually written in kana, the kanji form must be used if it is available. This field is mandatory; however, it may be omitted for proper names not found in the dictionary.
  • the reading of the word, in round parentheses. This is optional; however, it must be used if there are several different dictionary entries with the same headword.
  • a sense number. This is used when the word has multiple senses in the JMdict/EDICT2 file, and indicates which sense applies in the sentence. It is a two-digit number in square brackets. The field is optional.
  • the form in which the word appears in the sentence. This may differ from the indexing word, e.g. if it is an inflected verb or adjective, if the word is usually written in a different way, etc. This field is in curly brackets. It is not mandatory, but should be included where appropriate.
  • a "~" character to indicate that the sentence pair is a good and checked example of the usage of the word. Words are marked to enable appropriate sentences to be selected by dictionary software. Typically only one instance per sense of a word will be marked. (The WWWJDIC server displays these sentences below the display of the related dictionary entry.)

Some indices are followed by a "|" character and a digit. These are an artefact from a former maintenance system, and can be safely ignored.

The fields after the indexing headword, i.e. the reading in (), the sense number in [], the sentence form in {} and the ~ marker, must appear in that order.
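
The element format described above can be parsed with a single regular expression. The following Python sketch is only an illustration: the names IndexElement and parse_indices are hypothetical, and it assumes that the legacy "|" plus digit marker can simply be stripped wherever it occurs and that an omitted headword leaves just the remaining fields.

```python
import re
from typing import List, NamedTuple, Optional

# Legacy "|" + digit marker left over from the former maintenance system.
LEGACY_RE = re.compile(r"\|\d")

# headword, then optional (reading), [sense], {form} and ~ in that fixed order.
# The headword may be empty (assumed form of an omitted proper-name headword).
ELEMENT_RE = re.compile(
    r"(?P<headword>[^\s()\[\]{}~|]*)"
    r"(?:\((?P<reading>[^)]*)\))?"
    r"(?:\[(?P<sense>\d{2})\])?"
    r"(?:\{(?P<form>[^}]*)\})?"
    r"(?P<good>~)?"
)


class IndexElement(NamedTuple):
    headword: str
    reading: Optional[str]
    sense: Optional[int]
    form: Optional[str]
    good_example: bool


def parse_indices(line: str) -> List[IndexElement]:
    """Split a line of indices into its space-delimited elements and parse each."""
    elements = []
    for token in line.split():
        token = LEGACY_RE.sub("", token)  # the marker can be safely ignored
        match = ELEMENT_RE.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognised index element: {token!r}")
        sense = match.group("sense")
        elements.append(
            IndexElement(
                headword=match.group("headword"),
                reading=match.group("reading"),
                sense=int(sense) if sense is not None else None,
                form=match.group("form"),
                good_example=match.group("good") is not None,
            )
        )
    return elements
```

When the {} field is absent the word appears in the sentence as the headword itself, so the text actually found in the sentence can be taken as `element.form or element.headword`.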

File Format

A file of the Japanese-English sentence pairs with the indices can be downloaded from the Tatoeba site. This file, which is generated once each week, is in UTF-8 encoding, and has the following format:

Jpn_seq_no[TAB]Eng_seq_no[TAB]Japanese sentence[TAB]English sentence[TAB]Indices
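
A line in this format can be split on the TAB characters. The following Python sketch assumes exactly five fields per line and numeric sequence numbers; the names SentencePair, parse_pair_line and read_pairs are hypothetical, and the path of the downloaded file is left to the caller.

```python
from typing import Iterator, NamedTuple


class SentencePair(NamedTuple):
    jpn_seq: int
    eng_seq: int
    japanese: str
    english: str
    indices: str


def parse_pair_line(line: str) -> SentencePair:
    """Parse one TAB-separated line of the weekly sentence-pair file."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 TAB-separated fields, got {len(fields)}")
    jpn_seq, eng_seq, japanese, english, indices = fields
    return SentencePair(int(jpn_seq), int(eng_seq), japanese, english, indices)


def read_pairs(path: str) -> Iterator[SentencePair]:
    """Yield the sentence pairs from a downloaded file (UTF-8 encoded)."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines, if any
                yield parse_pair_line(line)
```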

Another version, which is used by the WWWJDIC servers, has the sentences and indices on separate lines. The format is:

A: Japanese sentence[TAB]English sentence#ID=Engseq_Jpnseq
B: Indices

This file can be downloaded in either EUC-JP or UTF-8 encoding.
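
The two-line format can be read by pairing each "A: " line with the "B: " line that follows it. The following Python sketch is an illustration only: the names Example and parse_ab_lines are hypothetical, and it assumes that the ID suffix always closes the A line and that its two numbers are in the order given in the format line above.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# "#ID=Engseq_Jpnseq" at the end of the A line, in the order the format gives.
ID_RE = re.compile(r"#ID=(\d+)_(\d+)$")


class Example(NamedTuple):
    japanese: str
    english: str
    eng_seq: int
    jpn_seq: int
    indices: str


def parse_ab_lines(lines: Iterable[str]) -> Iterator[Example]:
    """Pair each 'A: ' sentence line with the 'B: ' index line after it."""
    pending = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("A: "):
            japanese, _, rest = line[3:].partition("\t")
            match = ID_RE.search(rest)
            if match is None:
                raise ValueError(f"A line without an #ID= suffix: {line!r}")
            english = rest[: match.start()]
            pending = (japanese, english, int(match.group(1)), int(match.group(2)))
        elif line.startswith("B: ") and pending is not None:
            yield Example(*pending, line[3:])
            pending = None
```

The function accepts any iterable of lines, so an open file object can be passed directly, opened with `encoding="euc_jp"` or `encoding="utf-8"` to match the version downloaded.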