I am trying to find a standard format for the WordNet database that I can use with many languages (starting with English and French). I was only able to find the French version (WOLF) in VisDic XML format, and the English one in the original (non-XML) format defined on the Princeton website.
I found the Open Multilingual Wordnet, which seems to offer most languages in a standard XML format, but its files do not contain relationships between terms (e.g. hypernyms).
Does the English database exist in the VisDic XML format?
or
Does the French database exist in the original non-XML format?
or
Is there a better standard format that I am not aware of?
I am trying to connect two API queries. The first one retrieves the description of an article:
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exintro=&explaintext=&titles=Albert+Einstein&format=json
The second one retrieves the name of the article in another language (here German):
https://en.wikipedia.org/w/api.php?action=query&prop=langlinks&format=json&lllang=de&titles=Companion%20dog
Is there a way to connect them to retrieve description data both in English and German?
I have tried connecting them via "generators", but I do not understand how to apply them here.
I also tried issuing a second query for the descriptions after extracting the article names in both languages. However, the names are sometimes formatted in a way that I cannot reuse directly in the query.
No. The description is a snippet from the start of the article. If you want a German description, you need to get it from the German Wikipedia (i.e. a different API endpoint).
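A rough sketch of the two-request approach in Python, in case it helps. The helper names (`api_url`, `first_page`, `fetch`, `extracts_in_two_languages`) and the User-Agent string are made up for illustration; the query parameters are the ones from the question. It assumes the default JSON format, where each langlinks entry carries the foreign title under the `"*"` key. Letting `urlencode` build the query string should also take care of titles with spaces or non-ASCII characters, which may be what broke the reuse of names.

```python
import json
import urllib.parse
import urllib.request


def api_url(lang, **params):
    """Build a MediaWiki API URL for the given language edition."""
    query = {"action": "query", "format": "json", **params}
    return f"https://{lang}.wikipedia.org/w/api.php?" + urllib.parse.urlencode(query)


def first_page(response):
    """Return the single page object from an action=query response."""
    return next(iter(response["query"]["pages"].values()))


def fetch(url):
    # Network call. The User-Agent value is a placeholder; use your own.
    req = urllib.request.Request(url, headers={"User-Agent": "extract-demo/0.1 (example)"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def extracts_in_two_languages(title, source="en", target="de"):
    # Request 1: intro extract plus the target-language title, from the source wiki.
    page = first_page(fetch(api_url(
        source, prop="extracts|langlinks", exintro="", explaintext="",
        lllang=target, titles=title)))
    result = {source: page.get("extract")}
    links = page.get("langlinks", [])
    if links:
        # Request 2: the same extract query, sent to the target-language wiki.
        target_title = links[0]["*"]
        other = first_page(fetch(api_url(
            target, prop="extracts", exintro="", explaintext="", titles=target_title)))
        result[target] = other.get("extract")
    return result
```

Calling `extracts_in_two_languages("Companion dog")` would then return a dict keyed by language code, with the German entry missing when the article has no German counterpart.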
For now it supports data only in English.
For example, I have a title and a description column, and I want their contents to be in Russian or Hebrew. But when I type in these languages, the text is converted to question marks.
Any solutions?
I'm working on an internationalized database application that supports multiple locales in a single instance. When international users sort data in the applications built on top of the database, the database theoretically sorts the data using a collation appropriate to the locale associated with the data the user is viewing.
I'm trying to find sorted lists of words that meet two criteria:
the sorted order follows the collation rules for the locale
the words listed will allow me to exercise most / all of the specific collation rules for the locale
I'm having trouble finding such trusted test data. Are such sort-testing datasets currently available, and if so, what / where are they?
"words.en.txt" is an example text file containing American English text:
Andrew
Brian
Chris
Zachary
I am planning on loading the list of words into my database in randomized order, and checking to see if sorting the list conforms to the original input.
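To make the plan concrete, here is a minimal sketch of that check in Python. `check_collation` and its `sort_fn` parameter are hypothetical names; `sort_fn` stands in for the round trip through the database (insert the shuffled rows, then read them back with ORDER BY). In the example Python's built-in `sorted` plays that role, which is only a code-point sort.

```python
import random


def check_collation(expected, sort_fn, seed=0):
    """Shuffle the expected list, sort it with sort_fn, and compare with the original order."""
    shuffled = list(expected)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps failures reproducible
    actual = sort_fn(shuffled)
    return actual == list(expected), actual


# Stand-in for the database: Python's code-point sort.
words_en = ["Andrew", "Brian", "Chris", "Zachary"]
ok, actual = check_collation(words_en, sorted)
print(ok, actual)
```

Returning the actual order alongside the pass/fail flag makes it easier to see which rule was violated when a locale fails.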
Because I am not fluent in any language other than English, I do not know how to create sample datasets like the following sample one in French (call it "words.fr.txt"):
cote
côte
coté
côté
French collation traditionally compares diacritical marks from right to left, so the last accent difference in a word decides the order. If you sorted that list in code-point order, it would likely come out like this (which is an incorrect collation):
cote
coté
côte
côté
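The difference between the two orderings can be reproduced with a toy sort key in Python. This is only a sketch of the right-to-left accent rule, not an implementation of the full collation rules for French, and `french_sort_key` is a name invented here. It compares base letters first, then compares accents starting from the end of the word.

```python
import unicodedata


def french_sort_key(word):
    """Two-level key: base letters first, then accents compared from the end of the word."""
    decomposed = unicodedata.normalize("NFD", word)
    base = []      # letters with combining marks stripped (primary level)
    accents = []   # one tuple of combining marks per base letter (secondary level)
    for ch in decomposed:
        if unicodedata.combining(ch):
            if accents:
                accents[-1] += (ord(ch),)
        else:
            base.append(ch.lower())
            accents.append(())
    # Reversing the accent list makes the last accent in the word the most significant.
    return ("".join(base), tuple(reversed(accents)))


words = ["côté", "coté", "côte", "cote"]
print(sorted(words))                       # code-point order
print(sorted(words, key=french_sort_key))  # ['cote', 'côte', 'coté', 'côté']
```

Case differences, ligatures, and punctuation are ignored here, so it is only useful for illustrating this one rule.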
Thank you for the help,
Chris
Here's what I found.
The Unicode Common Locale Data Repository (CLDR) is pretty much the authority on collations for international text. I was able to find several lists of words conforming to the rules found in CLDR in the ICU Project's ICU Demonstration - Locale Explorer tool. It turns out that ICU (International Components for Unicode) uses CLDR rules to help solve common internationalization problems. It's a great library; check it out.
In some cases, it was useful to construct some nonsense terms by reverse-engineering the CLDR rules directly. Search engines available in the United States were not suited for finding foreign terms with the case/diacritic/other nuances I was interested in for this testing (in retrospect, I wonder if international search engines would have been better-suited for this task).
I am using WordNet 3.0. The WordNet documentation shows how to find synsets of a given word such as:
wn car -synsn
But is there a way to find terms with:
a) no noun synsets
b) at least one noun synset, and so on?
Thanks,
Sony
The short answer is:
"NO! There is no way to search based on the existence or count of words in a synset."
Neither the command-line interface nor the library API provides the ability to apply this kind of predicate to a search.
That said, it is possible to import the WordNet files into a more relational type of storage and run this type of query against the resulting database.
The most direct way to import WordNet data is to tap directly into the WordNet files themselves (see in particular these two files) and parse out the desired data.
An alternative is to build some kind of scanner over the data based on the library API, thereby leveraging all of the library's WordNet format parsing capability, and to output the desired fields to a text file more suitable for database import.
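As a rough illustration of the import-and-query idea, here is a Python sketch that loads index-style lines into SQLite. It assumes the `index.<pos>` line layout described in the WordNet database documentation (lemma, part of speech, then the synset count as the third field, with header lines starting with two spaces). The function names, the `idx` table, and the sample lines are all made up for this example; the offsets in the sample are placeholders, not real WordNet data.

```python
import sqlite3


def parse_index_line(line):
    """Return (lemma, pos, synset_cnt) for a data line, or None for header/blank lines."""
    if line.startswith("  ") or not line.strip():
        return None
    fields = line.split()
    return fields[0], fields[1], int(fields[2])


def load_index(conn, lines):
    conn.execute("CREATE TABLE IF NOT EXISTS idx (lemma TEXT, pos TEXT, synset_cnt INTEGER)")
    rows = [row for row in map(parse_index_line, lines) if row]
    conn.executemany("INSERT INTO idx VALUES (?, ?, ?)", rows)


def lemmas_by_noun_count(conn, minimum, maximum=None):
    """Lemmas whose number of noun synsets lies between minimum and maximum (inclusive)."""
    noun_total = "SUM(CASE WHEN pos = 'n' THEN synset_cnt ELSE 0 END)"
    sql = f"SELECT lemma FROM idx GROUP BY lemma HAVING {noun_total} >= ?"
    args = [minimum]
    if maximum is not None:
        sql += f" AND {noun_total} <= ?"
        args.append(maximum)
    return [row[0] for row in conn.execute(sql + " ORDER BY lemma", args)]


# Placeholder lines in the assumed layout; in practice, read the index files instead.
SAMPLE = [
    "  1 header lines in the real files start with two spaces",
    "car n 5 4 @ ~ #p %p 5 5 00000001 00000002 00000003 00000004 00000005",
    "run n 2 2 @ ~ 2 1 00000006 00000007",
    "run v 3 2 @ ~ 3 3 00000008 00000009 00000010",
    "sing v 1 1 @ 1 1 00000011",
]
conn = sqlite3.connect(":memory:")
load_index(conn, SAMPLE)
print(lemmas_by_noun_count(conn, 0, 0))  # a) no noun synsets
print(lemmas_by_noun_count(conn, 1))     # b) at least one noun synset
```

Loading the index files for all parts of speech into the same table is what makes case a) answerable, since a word with no noun synsets never appears in the noun index at all.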
Is there somewhere on the web where one can get the XML for the English thesaurus (for MS SQL, that is)? I'd really hate to populate it by hand...
Here is a free one from Project Gutenberg, although I think it is in TXT format:
http://www.gutenberg.org/dirs/etext02/mthes10.zip