Using WordNet to find terms with no noun synsets or at least one noun synset

I am using WordNet 3.0. The WordNet documentation shows how to find synsets of a given word such as:
wn car -synsn
But, is there a way to find terms with
a) no noun synsets
b) with at least one noun synset and so on.
Thanks,
Sony

The short answer is:
"NO! There is no way to search based on existence or count of words in synset"
Neither the Command Line interface nor the Library API provide the ability to apply this kind of predicates to a search.
This said, it is possible to import WordNet files to a more relational type of storage, and perform this type of queries in the resulting database.
The more direct way to import WordNet data is by tapping directly into the WordNet files themselves (see in particular these two files and parsing out the desired data.
An alternative is to build some kind of scanner of the data based on the Library API, hence leveraging all the WordNet format parsing capability of the library, and to output the desired Fields to a text file more suitable for database import.
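For example, a minimal Python sketch of the file-parsing route might look like the following. It reads the index.<pos> files of a local WordNet 3.0 installation; the dict path is an assumption to adjust for your setup.

import os

WN_DICT = "/usr/local/WordNet-3.0/dict"   # assumed install location; adjust as needed

def lemmas(index_file):
    """Return the set of lemmas listed in one index.<pos> file."""
    words = set()
    with open(os.path.join(WN_DICT, index_file)) as f:
        for line in f:
            if line.startswith(" "):          # skip the license header lines
                continue
            words.add(line.split(" ", 1)[0])
    return words

nouns = lemmas("index.noun")
others = lemmas("index.verb") | lemmas("index.adj") | lemmas("index.adv")

has_noun_synset = nouns                 # (b) terms with at least one noun synset
no_noun_synset = others - nouns         # (a) known terms with no noun synset

print(len(has_noun_synset), len(no_noun_synset))

The same sets could of course be loaded into database tables and filtered with SQL, which is the import route described above.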

SpaCy different Language Models

I'm making some progress :) developing my little OCR project.
I was wondering if my idea is possible in this case!
After extracting the text from images (OCR), I use NLP (spaCy) to identify two entity types (LOC and PER). I write them to a dictionary and later to a JSON file. That works well.
Now I'm wondering if I can improve my identified entities.
One way I can imagine is to use the right language model for the text.
I have various texts in German, English, Spanish and French.
At the moment I'm using the
But now I have no idea how to fit langdetect into this.
Have a great week!
Greets
Here is a link that you might find useful when it comes to detecting a language (there are multiple options, including langdetect): How to detect language
You can create a dictionary that maps the languages you plan to detect to their corresponding models and match it against langdetect's output. I guess you have the rest sorted out.
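For example, a small sketch of that idea in Python, assuming the standard small spaCy models for the four languages are installed (e.g. via python -m spacy download de_core_news_sm):

import spacy
from langdetect import detect

# map langdetect's ISO codes to spaCy model names (assumed to be installed)
MODELS = {
    "de": "de_core_news_sm",
    "en": "en_core_web_sm",
    "es": "es_core_news_sm",
    "fr": "fr_core_news_sm",
}

def extract_entities(text):
    lang = detect(text)                                   # e.g. "de", "en", "es", "fr"
    nlp = spacy.load(MODELS.get(lang, "en_core_web_sm"))  # fall back to English
    doc = nlp(text)
    # the English model uses PERSON/GPE, the other models use LOC/PER
    wanted = {"LOC", "PER", "PERSON", "GPE"}
    return [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in wanted]

print(extract_entities("Angela Merkel besuchte gestern Berlin."))

In practice you would want to load each model only once (e.g. cache the nlp objects) rather than on every call.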

fetching wikidata labels in other languages from reconciled column

I want to use wikidata reconciliation to translate a column of terms into various languages by fetching the labels in those languages. Using SPARQL, I'd filter a query for label by language (this is the approach suggested in various similar cases). I don't see how to do the same using OpenRefine reconciliation, however.
Maybe the problem is that the Wikidata API is language-specific?
Say that you want to fetch labels in Italian, whose language code is it. You can do that by entering Lit in the property input. You can also fetch descriptions with Dit or aliases with Ait. To fetch these terms in other languages, replace it with the relevant language code.
This is only documented at https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation so far. I acknowledge that we need more visible documentation for this (ideally it should be easily accessible from OpenRefine's user interface, given that the reconciliation service comes preconfigured in OpenRefine).
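For comparison, the SPARQL route mentioned in the question would filter labels by language code. A small sketch with SPARQLWrapper against the public Wikidata query service (Q42 is just an example item):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery("""
    SELECT ?label WHERE {
      wd:Q42 rdfs:label ?label .
      FILTER(LANG(?label) = "it")
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["label"]["value"])

The reconciliation approach above saves you from writing such a query for every cell, since the Lit/Dit/Ait columns are fetched directly from the reconciled items.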

Using an ontology to produce semantic full information from the raw data

Problem definition: store sensor data (temperature readings, sensor description) in RDF form using an ontology, and then use SPARQL to perform queries on the stored data.
My approach: I am not an expert in this domain, but I have some basic understanding, and accordingly I am taking this approach: 1. create an ontology, 2. convert the data according to the ontology vocabulary, 3. store the converted data in a triple store, 4. perform SPARQL queries. I am not sure whether this is the right way; any comments from your side will be valuable.
So far, I have done the following:
I have created an ontology in Protégé 5.0.0 for representing a temperature sensor. This ontology only represents one part of the full ontology.
I have collected data in a CSV file which includes date, time and temperature readings.
Now, I want to use this ontology to store the CSV file in RDF form in some data store. I have been stuck at this step for the last three days. I have found some links like link1 and link2, but I am still finding it difficult to proceed. Do I need a script which reads the CSV file and maps it to the given ontology concepts? If yes, is there a sample script which does this? The outcome might possibly look like:
<datetime>valX</datetime>
<tempvalue>valY</tempvalue>
Can anyone guide me on the following:
1. Am I taking the correct steps to solve the problem?
2. How should I solve step 3, i.e., storing the data according to the ontology?
P.S.: I have posted this question on answers.semanticweb.com as well; this is only to get a response as soon as possible.
Actually, this is a great use case for the D2RQ mapping language and D2RQ Server.
Go install D2RQ, then start it up with a connection to your relational database. Then generate a mapping file using the generator that comes with the software. You'll then have a mapping file; edit it and swap out the automatically generated ontology prefixes for your own. Their website has a page that explains how the mapping language works.
Once you've done that and there are no errors in the mapping file, you can actually query your whole relational dataset with SPARQL without even having to export it and load it into a real triple store.
However, if you do want to export and load into a triple store, you'd just run D2RQ's generate-triples functionality (also included in D2RQ Server) and then import the resulting triples file into a triple store like Jena Fuseki.
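If you do want the script route the question asks about (reading the CSV directly and emitting RDF, without a relational database or D2RQ), a minimal sketch with rdflib could look like this. The ontology terms (ex:Observation, ex:dateTime, ex:temperatureValue) are placeholders; swap in whatever classes and properties your Protégé ontology actually defines.

import csv
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/sensor#")    # placeholder ontology namespace
g = Graph()
g.bind("ex", EX)

with open("readings.csv") as f:
    for i, row in enumerate(csv.DictReader(f)):  # columns assumed: date, time, temperature
        obs = EX[f"observation{i}"]
        g.add((obs, RDF.type, EX.Observation))
        g.add((obs, EX.dateTime, Literal(f"{row['date']}T{row['time']}", datatype=XSD.dateTime)))
        g.add((obs, EX.temperatureValue, Literal(row["temperature"], datatype=XSD.decimal)))

g.serialize(destination="readings.ttl", format="turtle")

The resulting readings.ttl file can then be loaded into a triple store such as Jena Fuseki and queried with SPARQL (step 4).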

How to store custom token attribute in Lucene Index

I want to create a Lucene analyzer for RDF nodes. RDF nodes can have multiple types (URI, bnode, plain literal, plain literal with language, typed literal with datatype). While analyzing a term, I want to create an RDFNodeTypeAttribute, a LanguageAttribute and a DatatypeAttribute to store, respectively, the type of the RDF node, the language of the literal and the datatype. My question is how these attributes can be stored in the Lucene index. Do I have to write a custom Codec? Do I have to use the PayloadAttribute? How can I leverage these attributes, once stored in the index, for my search?
Thank you for your help.
I could not exactly get your requirements, but you would use Codecs if you are not happy with the way a Lucene index is encoded and decoded. Codecs give you the flexibility to have your own PostingsFormat, SegmentInfosFormat, LiveDocsFormat, etc. So let's say you want a different PostingsFormat from the default Lucene codec, which, roughly speaking, stores for every term all the docIds it occurs in, how many times it occurs in each doc, at what positions, and so on, in a particular format. If you want this information stored in a different format, you would need a codec.
I do not think you need to write any Codec or PostingsFormat for this. Perhaps writing your own Analyzer and Similarity classes would be sufficient. If you give more information about your problem, I can think about it further.
A payload is at the term level, and the typical use case is to store metadata for every term. So a case like "this term is written in bold" or "this term is a noun" is metadata for the term and should be stored in a payload. Payloads are actually used in scoring the docs; they matter when giving a term some weight.
Though RDF is metadata for a web resource, you are probably talking about indexing the RDF itself. Even if it is part of the web document you are indexing, putting the RDF info on every term in the web document would not be a viable approach, as there are better ways to allocate weights to a document than that.

Searching Natural Language Sentence Structure

What's the best way to store and search a database of natural language sentence structure trees?
Using OpenNLP's English Treebank Parser, I can get fairly reliable sentence structure parsings for arbitrary sentences. What I'd like to do is create a tool that can extract all the doc strings from my source code, generate these trees for all sentences in the doc strings, store these trees and their associated function name in a database, and then allow a user to search the database using natural language queries.
So, given the sentence "This uploads files to a remote machine." for the function upload_files(), I'd have the tree:
(TOP
(S
(NP (DT This))
(VP
(VBZ uploads)
(NP (NNS files))
(PP (TO to) (NP (DT a) (JJ remote) (NN machine))))
(. .)))
If someone entered the query "How can I upload files?", equating to the tree:
(TOP
(SBARQ
(WHADVP (WRB How))
(SQ (MD can) (NP (PRP I)) (VP (VB upload) (NP (NNS files))))
(. ?)))
how would I store and query these trees in a SQL database?
I've written a simple proof-of-concept script that can perform this search using a mix of regular expressions and network graph parsing, but I'm not sure how I'd implement this in a scalable way.
And yes, I realize my example would be trivial to retrieve using a simple keyword search. The idea I'm trying to test is how I might take advantage of grammatical structure, so I can weed out entries with similar keywords but a different sentence structure. For example, with the above query, I wouldn't want to retrieve the entry associated with the sentence "Checks a remote machine to find a user that uploads files.", which has similar keywords but is obviously describing completely different behavior.
Relational databases cannot store knowledge in a natural way; what you actually need is a knowledge base or ontology (though it may be built on top of a relational database). It holds data as triples <subject, predicate, object>, so your phrase would be stored as <upload_files(), upload, file>. There are a lot of tools and methods for searching such KBs (for example, Prolog is a language that was designed to do it). So all you have to do is translate sentences from natural language into KB triples/an ontology graph, translate the user query into incomplete triples (your question would look like <?, upload, file>) or conjunctive queries, and then search your KB. OpenNLP will help you with the translation, and the rest depends on the concrete techniques and technologies you decide to use.
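To make the triple idea concrete, here is a toy Python sketch (a stand-in for a real triple store or Prolog engine) where None plays the role of the "?" in an incomplete query:

# facts extracted from the doc strings: <subject, predicate, object>
facts = [
    ("upload_files()", "upload", "file"),
    ("check_machine()", "check", "machine"),
]

def query(subj=None, pred=None, obj=None):
    """Return all facts matching the pattern; None acts as a wildcard."""
    return [f for f in facts
            if (subj is None or f[0] == subj)
            and (pred is None or f[1] == pred)
            and (obj is None or f[2] == obj)]

# "How can I upload files?"  ->  <?, upload, file>
print(query(pred="upload", obj="file"))    # [('upload_files()', 'upload', 'file')]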
I agree with ffriend that you need to take a different approach that builds on existing work on knowledge bases and natural language search. Storing context-free parse trees in a relational database isn't the problem, but it is going to be very difficult to do a meaningful comparison of parse trees as part of a search. When you are just interested in taking advantage of a little knowledge about grammatical relations, parse trees are really too complicated. If you simplify the parse into dependency triples, you can make the search problem much easier and get at the grammatical relations you were interested in in the first place. For instance, you could use the Stanford dependency parser, which generates a context-free parse and then extracts dependency triples from it. It produces output like this for "This function uploads files to a remote machine":
det(function-2, This-1)
nsubj(uploads-3, function-2)
dobj(uploads-3, files-4)
det(machine-8, a-6)
amod(machine-8, remote-7)
prep_to(uploads-3, machine-8)
In your database, you could store a simplified subset of these triples associated with the function, e.g.:
upload_file(): subj(uploads, function)
upload_file(): obj(uploads, file)
upload_file(): prep(uploads, machine)
When people search, you can find the function that has the most overlapping triples or something along those lines, where you probably also want to weight the different dependency relations or allow partial matches, etc. You probably also want to reduce the words in the triples to lemmas, and maybe to POS tags, depending on what you need.
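As a hedged sketch of that matching scheme in Python, here is the same idea using spaCy's dependency parser instead of the Stanford parser (the relation names differ slightly, but the triples play the same role):

import spacy

nlp = spacy.load("en_core_web_sm")
KEEP = {"nsubj", "dobj", "obj", "pobj", "prep"}   # grammatical relations we care about

def dep_triples(text):
    """Reduce a sentence to a set of (relation, head lemma, dependent lemma) triples."""
    doc = nlp(text)
    return {(tok.dep_, tok.head.lemma_, tok.lemma_) for tok in doc if tok.dep_ in KEEP}

# index built from the doc strings
index = {"upload_files()": dep_triples("This uploads files to a remote machine.")}

def search(query):
    q = dep_triples(query)
    # rank functions by the number of overlapping triples
    return sorted(index, key=lambda fn: len(index[fn] & q), reverse=True)

print(search("How can I upload files?"))   # ['upload_files()']

Weighting relations, allowing partial matches, or storing the triples in a database table instead of an in-memory dict would all slot into this skeleton.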
There are plenty of people who have worked on natural language search (like Powerset), so be sure to search for existing approaches. My proposed approach here is really minimal and I can think of tons of examples where it will have problems, but I think something along these lines could work reasonably well for a restricted domain.
This is not a complete answer, but if you want to perform linguistically sophisticated queries on your trees, the best bet is to pre-process your parser output and search it with tgrep2:
http://www.stanford.edu/dept/linguistics/corpora/cas-tut-tgrep.html
Tgrep/tgrep2 are, as far as I know, the most flexible and full-featured packages for searching parse trees. This is not a MySQL-based solution as you requested, but I thought you might be interested to know about this option.
Tgrep2 allows you to ask questions about parents, descendants and siblings, whereas other solutions would not retain the full tree structure of the parse or allow such sophisticated queries.
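As a small illustration of the kind of dominance query tgrep2 supports (e.g. "a VP that dominates an NP containing the word files"), here is the concept spelled out by hand with NLTK's Tree class; this is not tgrep2 syntax, just a sketch of why keeping the full bracketing around is useful:

from nltk import Tree

parse = Tree.fromstring(
    "(TOP (S (NP (DT This)) (VP (VBZ uploads) (NP (NNS files))"
    " (PP (TO to) (NP (DT a) (JJ remote) (NN machine)))) (. .)))")

# find every VP subtree that dominates an NP containing the leaf "files"
for vp in parse.subtrees(lambda t: t.label() == "VP"):
    if any(np.label() == "NP" and "files" in np.leaves()
           for np in vp.subtrees()):
        print(vp)

Tgrep2 expresses this kind of pattern far more concisely and over whole treebanks, which is why it is worth the extra pre-processing step.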