This zip file contains two files:

  kale-p-*.csv
  kale-q-*.csv

where * is "u" if the file is utf-8 encoded, or "w" if it is
shift-jis encoded.  The xxx-u.csv files are packaged together
in file kale-u.zip, and the xxx-w.csv files in kale-w.zip.
The zip files are available at http://www.edrdg.org/~smg/.

The csv files combine the Google page count data generated for
JMdict reading and kanji text strings by Kale Stutzman(*) with
"P" tag information for the strings extracted from JMdict.
This allows comparison of these two frequency-of-use metrics.

All JMdict data is from the 2007-01-14 version of
ftp://ftp.cc.monash.edu.au/pub/nihongo/JMdict.gz

kale-p-*.csv
  Gives all the text strings and page counts in Kale's data,
  matches them to JMdict entries, and indicates whether or not
  the string is marked P in the entry.  See below for a detailed
  description of the format.

kale-q-*.csv
  For all reading and kanji text strings in JMdict that are
  marked with a P tag in edict/wwwjdict, gives the Google page
  counts from Kale's data, with a visual flag if the page count
  value is less than 1M.  See below for a detailed description
  of the format.

(*) Generated around 2007-01-14 by Kale Stutzman.  See the email
to the edict-JMdict mailing list at:
  http://tech.groups.yahoo.com/group/edict-JMdict/message/1076
Kale's data file is (at time of writing) at:
  http://www.samuraifight.com/edict-gfreq.txt

General Notes:

I have not checked these results much yet.  There may be goofs,
possibly major.

The files were created using queries on the experimental JMdict
database, which was loaded from the full JMdict file rather than
the English-only version.  Hence the glosses are rather long and
contain non-English characters.

These files may be updated in the future based on feedback from
the edict-JMdict mailing list.
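The "u"/"w" naming convention described above can be used to pick the right
text encoding when reading the files programmatically.  The following is only
a minimal sketch in Python.  The helper names encoding_for and read_rows are
hypothetical and are not part of the distribution.  It also assumes that the
"w" files decode with Python's "shift_jis" codec; if they were written on
Windows, the "cp932" codec (Microsoft's Shift-JIS superset) may be needed
instead.

```python
import csv
import os

# Assumption: "-u" files are UTF-8 and "-w" files are Shift-JIS, as this
# README states. Swap "shift_jis" for "cp932" if decoding errors occur.
ENCODINGS = {"u": "utf-8", "w": "shift_jis"}


def encoding_for(path):
    """Return the codec name implied by a kale-?-u.csv / kale-?-w.csv name."""
    stem = os.path.splitext(os.path.basename(path))[0]  # e.g. "kale-p-u"
    suffix = stem.rsplit("-", 1)[-1]                     # "u" or "w"
    return ENCODINGS[suffix]


def read_rows(path):
    """Read every CSV record from the file as a list of string fields."""
    with open(path, newline="", encoding=encoding_for(path)) as f:
        return list(csv.reader(f))
```

For example, read_rows("kale-p-u.csv") would return one list of strings per
record.  A file name with any other suffix raises KeyError rather than
guessing an encoding.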
============================================
kale-p-*.csv
============================================
The data is arranged in CSV (comma separated value) format
as a table with 7 columns:

  txt -- A text string from Kale's file.  The original file
    had multiple search results on a single line, and many
    search strings occurred multiple times in the file.  I
    coalesced all occurrences of the same search string into
    a single one by averaging the hit counts for all
    occurrences of that string.

  hits -- Hit count for "txt".  As mentioned above, if there
    were multiple occurrences of the text in Kale's file,
    this value is the average of all of them.

  seq -- The sequence number of the JMdict entry with a
    reading or kanji that matches "txt".

  case -- Is "P" if the reading or kanji that matches "txt"
    would have a P tag in edict/wwwjdict.  Blank otherwise.

  kanji, rdng, gloss -- These columns give a summary of the
    JMdict entry.

Notes:

Because a "txt" string may match readings or kanji in multiple
JMdict entries, it is repeated on a separate line for each
JMdict entry matched.

Records are ordered by descending hit count.  If you can load
the data into Excel or similar, you can re-sort however you wish.

============================================
kale-q-*.csv
============================================
The data is arranged in CSV (comma separated value) format
as a table with 7 columns:

  txt -- A P-marked reading or kanji text string from JMdict.

  hits -- Google page count for this string from Kale's data.

  low -- Contains "*" if "hits" is < 1M, blank otherwise.

  seq -- The sequence number of the JMdict entry that
    contains the reading or kanji in "txt".

  kanji, rdng, gloss -- These columns give a summary of the
    JMdict entry.

Records are ordered by entry seq number.

============================================
Stuart McGraw, 2007-01-16
Revised 2007-01-17, minor edits, spelling and typo corrections.
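
Example (illustrative only):

The column descriptions above mention three simple operations: averaging the
hit counts of repeated strings, flagging counts below 1M, and re-sorting by
hit count.  The Python sketch below shows one way these could be written.  It
is not the query that produced the files (those came from the experimental
JMdict database), and the function names are hypothetical.  It assumes "hits"
is the second column and that averaged values are not rounded.

```python
from collections import defaultdict


def average_hits(pairs):
    """Coalesce (txt, hits) pairs into one mean hit count per txt."""
    totals = defaultdict(lambda: [0, 0])  # txt -> [sum of hits, occurrences]
    for txt, hits in pairs:
        totals[txt][0] += hits
        totals[txt][1] += 1
    return {txt: total / n for txt, (total, n) in totals.items()}


def low_flag(hits, threshold=1_000_000):
    """Return "*" when hits is below the threshold (1M in kale-q), else ""."""
    return "*" if hits < threshold else ""


def sort_by_hits(rows, hits_col=1):
    """Order CSV rows by descending hit count.

    Assumption: rows are lists of strings with the count in column hits_col.
    """
    return sorted(rows, key=lambda row: float(row[hits_col]), reverse=True)
```

With made-up input, average_hits([("a", 10), ("a", 20), ("b", 5)]) gives
{"a": 15.0, "b": 5.0}.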