How to interpret MeCab UniDic CSV columns

Here are some sample entries from MeCab UniDic:
ネコ - 名詞,普通名詞,一般,,,,ネコ,猫,ネコ,ネコ,ネコ,ネコ,和,,,,,,,体,ネコ,ネコ,ネコ,ネコ,1,C4,,7918141644612096,28806
が - 助詞,格助詞,,,,,ガ,が,が,ガ,が,ガ,和,,,,,,,格助,ガ,ガ,ガ,ガ,,動詞%F2#0,名詞%F1,,2168520431510016,7889
蚊 - 名詞,普通名詞,一般,,,,カ,蚊,蚊,カ,蚊,カ,和,,,,,,,体,カ,カ,カ,カ,0,C4,,1536851034907136,5591
を - 助詞,格助詞,,,,,ヲ,を,を,オ,を,オ,和,,,,,,,格助,ヲ,ヲ,ヲ,ヲ,,動詞%F2#0,名詞%F1,形容詞%F2#-1,,11381878116459008,41407
As you can see, there are 30 CSV columns in those UniDic entries. What do they all represent?

You can see a list of the Japanese names of all columns at the UniDic FAQ. Most of the columns are pretty obvious once you see the name.
The UniDic manual explains all the fields in more detail, though some of them - mainly the *ConType and *ModType fields - are pretty complicated. These fields are mostly related to the pronunciation of compound words.
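If you want to work with the entries programmatically, a sketch like the one below can help. The field names are an assumption on my part: they follow the convention used for unidic-mecab 2.x in tools like the fugashi library, so verify them against the UniDic FAQ for your dictionary version.

import java.util.Arrays;
import java.util.List;

public class UnidicColumns {
    // Assumed feature names for unidic-mecab 2.x, in CSV order --
    // check these against the UniDic FAQ for your dictionary version.
    static final List<String> NAMES = Arrays.asList(
            "pos1", "pos2", "pos3", "pos4",         // part-of-speech hierarchy
            "cType", "cForm",                       // conjugation type / form
            "lForm", "lemma",                       // lemma reading / lemma
            "orth", "pron", "orthBase", "pronBase", // orthography / pronunciation
            "goshu",                                // word origin (e.g. 和 = native)
            "iType", "iForm", "fType", "fForm",     // initial / final sound change
            "iConType", "fConType",                 // initial / final connection type
            "type", "kana", "kanaBase", "form", "formBase",
            "aType", "aConType", "aModType",        // accent type / connection / modification
            "lid", "lemma_id");                     // IDs

    public static void main(String[] args) {
        // NOTE: splitting on "," is naive -- in the dictionary source these
        // lines are proper CSV, and fields such as aConType can themselves
        // contain commas inside quotes (as in the を entry above).
        String entry = "名詞,普通名詞,一般,,,,ネコ,猫,ネコ,ネコ,ネコ,ネコ,"
                + "和,,,,,,,体,ネコ,ネコ,ネコ,ネコ,1,C4,,7918141644612096,28806";
        String[] values = entry.split(",", -1);
        for (int i = 0; i < values.length && i < NAMES.size(); i++) {
            System.out.println(NAMES.get(i) + " = " + values[i]);
        }
    }
}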

Related

Underscore and dash in column names after JSON import

I've been using OpenRefine very successfully for a couple of years, working solely with CSV (and TSV) source files. Recently I had some tables from an SQL database that I wanted to bring into OpenRefine, so I exported them as JSON and then used OpenRefine's JSON import feature. It works beautifully, except that the column names all begin with _ - . For example, my JSON records start with
{"ID":"97247",
and OpenRefine made the first column name _ - ID instead of just ID (which I'd prefer - I know I can edit the names later, but I have hundreds of fields). I can't see any settings on the parsing page that might help with this. Does anyone know if there is a way to import without the extra characters (or if there's an explanation for the underscore and dash)? I'm considering submitting a feature request, but I thought I'd check to see what other users may know.
This is a known issue.
There has also been a proposal to switch to a standard representation for JSON paths.
Feel free to comment on either ticket to indicate which solution you would prefer.

Lucene difference between Term and Fields

I've read a lot about Lucene indexing and searching and still can't understand what a term is. What is the difference between terms and fields?
A very rough analogy would be that fields are like columns in a database table, and terms are like the contents in each database column.
More specifically to Lucene:
Terms
Terms are indexed tokens. See here:
Lucene Analyzers are processing pipelines that break up text into indexed tokens, a.k.a. terms
So, for example, if you have the following sentence in a document...
"This is a list of terms"
...and you pass it through a whitespace tokenizer, this will generate the following terms:
This
is
a
list
of
terms
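Here's a minimal sketch of reproducing that term list with Lucene's WhitespaceAnalyzer. It assumes a reasonably recent Lucene version (in the 4.0 era the analyzer constructor also took a Version argument):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TermsDemo {
    public static void main(String[] args) throws Exception {
        // WhitespaceAnalyzer only splits on whitespace -- no lowercasing,
        // no stop-word removal -- so the output matches the list above.
        try (WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
             TokenStream ts = analyzer.tokenStream(
                     "body", new StringReader("This is a list of terms"))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                               // required before incrementToken()
            while (ts.incrementToken()) {
                System.out.println(term.toString());  // one indexed term per line
            }
            ts.end();
        }
    }
}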
Terms are therefore also what you place into queries when performing searches. See here for a definition of how they are used in the classic query parser.
Fields
A field is a section of a document.
A simple example is the title of a document versus the body (the remaining text/content) of the document. These can be defined as two separate Lucene fields within a Lucene index.
(You obviously need to be able to parse the source document so that you can separate the title from the body - otherwise you cannot populate each separate field correctly while building your Lucene index.)
You can then place all of the title's terms into the title field; and the body's terms into the body field.
Now you can search title data separately from body data.
You can read about fields here and here. There are various types of fields, specific to the kind of data (terms) they will hold.
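To make that concrete, here is a minimal sketch that indexes a title field and a body field separately, then searches the title field only. It assumes a recent Lucene version (ByteBuffersDirectory, no-arg StandardAnalyzer); older releases spell these slightly differently:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class FieldsDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();  // in-memory index
        try (IndexWriter writer = new IndexWriter(
                dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // Two separate fields: each holds its own terms.
            doc.add(new TextField("title", "A list of terms", Field.Store.YES));
            doc.add(new TextField("body", "This is the remaining text of the document",
                    Field.Store.YES));
            writer.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Search the title field only; the same term in the body would not match.
            TopDocs hits = searcher.search(new TermQuery(new Term("title", "terms")), 10);
            System.out.println("title hits: " + hits.totalHits);
        }
    }
}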

SQL - Referencing a table at a specific position in content?

I am not sure if this is possible, and I'm also not sure if I am using the correct terminology here, so forgive and correct me if I get it wrong. Also, this question is really about database design more generally.
Say I have something like Article:
Title: Stem cells are soon being used for stuff
Text: "Here is the content for an article about stuff. Here is some more info on stemm cells and stuff. [To the uninitiated, here comes an Info Box on Stem Cells in general, you can expand it!] Now some more text about stem cells and stuff"
In my app I would like to display the article, and then at an exact position (here after sentence no. 2, but this will vary from article to article) insert an info-box on stem cells, which is in its own SQL table.
I know that the idea of SQL in general is that I reference InfoBox in my Article and simply point to it. That would be the relation between article and infoBox.
But how do I specify that the infoBox should come exactly after sentence no. 2 (as in the example)? And this will not always be the case: sometimes there might be no infoBox for an article, or multiple ones, and sometimes one will come after sentence no. 25 or 100, etc.
I don't want to mix relations/fields and content, but I don't understand how I would realise something like this in SQL.
A table (base table or query result) holds rows of values that participate in a relation(ship)/association. Those are the rows that make the table's associated (characteristic) predicate (sentence template parameterized by column names) into a true proposition (statement).
You seem to want a base table for triples where
infoBox [i] should come exactly after Sentence No. [n] of article [a]
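As an illustrative sketch only - the table and column names are made up, and it uses an in-memory SQLite database via the sqlite-jdbc driver - such a triple table, and the query an app would run when rendering an article, could look like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class InfoBoxDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite::memory:")) {
            try (Statement st = conn.createStatement()) {
                // One row per triple: "info box [infobox_id] comes exactly
                // after sentence [sentence_no] of article [article_id]".
                st.execute("CREATE TABLE article_infobox ("
                        + " article_id INTEGER NOT NULL,"
                        + " infobox_id INTEGER NOT NULL,"
                        + " sentence_no INTEGER NOT NULL,"
                        + " PRIMARY KEY (article_id, sentence_no))");
                // Zero, one, or many rows per article are all fine.
                st.execute("INSERT INTO article_infobox VALUES (1, 42, 2)");
            }
            // When rendering an article, fetch its info boxes ordered by
            // position and splice each in after the given sentence number.
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT infobox_id, sentence_no FROM article_infobox"
                    + " WHERE article_id = ? ORDER BY sentence_no")) {
                ps.setLong(1, 1);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println("info box " + rs.getLong("infobox_id")
                                + " after sentence " + rs.getLong("sentence_no"));
                    }
                }
            }
        }
    }
}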
PS Time to read a published academic textbook on information modeling & database design. (Manuals for languages & tools to record & use designs are not textbooks on doing information modeling & database design.)

Clean unstructured place name to a structured format

I have around 300k rows of unstructured data, as in the screenshot below. I'm trying to use Google Refine / OpenRefine to clean this up, but I'm unable to find a proper way to do it; I'm new to this tool. Anyone's help would be greatly appreciated. Also, this tool is quite slow at processing 300k records: if I try something out, it takes a lot of time to process and give an output.
Or, please suggest any other open-source tools and techniques to do this.
As Owen said in the comments, your question is probably too broad and cannot receive an acceptable answer. We can just provide you with a general procedure to follow.
In OpenRefine, you'll need to create a column based on the messy column and apply transformations to delete unwanted characters. You'll have to use regular expressions. But for that, it's necessary to be able to identify patterns. It's not clear to me why the "ST" of "Nat.secu ST." is important, but not the "US" in "Massy Intertech US", nor the "36" in "Plowk 36" (Google doesn't know this word, so I'm not sure it is an organisation name).
On the basis of your fifteen lines, however, we can distinguish some clear patterns. For example, it looks like you'll have to remove the tokens (character sequences without spaces) at the end of the string that contain a #. For that, the GREL formula in OpenRefine could look like this:
value.trim().replace(/\b\w+#\w+\b$/,'')
Here is a screencast if it's not clear to you.
But sometimes a company name may contain a #, in which case you will need to create more complex rules. For example, remove the token only if the string contains more than two words.
if(value.split(' ').length() > 2, value.replace(/\b\w+#\w+\b$/, ''), value)
And so on for the other patterns that you'll find (for example, any sequence at the end of the string made up of more than four digits with a single - between them).
Feel free to check out the OpenRefine documentation in case of doubt.

Fulltext Solr statistical search

Suppose I have a couple of documents indexed with Solr 4.0. Each has two fields: a unique ID and a text DATA field. The DATA field contains a few paragraphs of text. Could anyone advise me on what kind of analyzers/parsers I should use, and how to build a statistical query to find a sorted list of the most frequently used words across the DATA fields of all documents?
For the most frequent terms, look into the Terms Component and the Stats Component.
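For example, assuming the /terms request handler that ships in the stock Solr 4.0 example solrconfig.xml (adjust the host and core name to your setup), a request like this returns the top terms of the DATA field by document count:
http://localhost:8983/solr/terms?terms.fl=DATA&terms.sort=count&terms.limit=20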
Besides the answers mentioned here, you can use the HighFreqTerms class: it's in the lucene-misc-4.0 jar (which is bundled with Solr).
This is a command-line application which lets you see the top terms for any field, either by document frequency or by total term frequency (the -t option).
Here is the usage:
java org.apache.lucene.misc.HighFreqTerms [-t] [number_terms] [field]
-t: include totalTermFreq
Here's the original patch, which is committed and in the 4.0 (trunk) and branch_3x codebases: https://issues.apache.org/jira/browse/LUCENE-2393
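If you'd rather call it from your own code than from the command line, a sketch along the following lines should work. It assumes the getHighFreqTerms signature and TermStats fields of the Lucene misc module from the 4.x/5.x era, and the index path is a placeholder, so check both against your version (4.x's FSDirectory.open also takes a File rather than a Path):

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.misc.HighFreqTerms;
import org.apache.lucene.misc.TermStats;
import org.apache.lucene.store.FSDirectory;

public class TopTerms {
    public static void main(String[] args) throws Exception {
        // Placeholder path -- point this at your Solr core's data/index directory.
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/path/to/core/data/index")))) {
            // Top 20 terms of the DATA field by total term frequency,
            // the programmatic equivalent of the -t command line option.
            TermStats[] stats = HighFreqTerms.getHighFreqTerms(
                    reader, 20, "DATA", new HighFreqTerms.TotalTermFreqComparator());
            for (TermStats t : stats) {
                System.out.println(t.termtext.utf8ToString()
                        + " docFreq=" + t.docFreq
                        + " totalTermFreq=" + t.totalTermFreq);
            }
        }
    }
}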
For the ID field, use an analyzer based on the keyword tokenizer: it will treat the entire content of the field as a single token.
For the DATA field, use a language-specific analyzer. Note that it is possible to auto-detect the language of the text (there's a patch for that).
I'm not sure if it's possible to find the most frequent words with Solr itself, but if you can use Lucene directly, pay attention to this question. My own suggestion for it is to use the HighFreqTerms class from the Luke project.