Mapping from fine-grained POS tags to coarse-grained tags - spaCy

I have a bunch of documents already POS-tagged with fine-grained, English-specific POS tags.
I would like to map those tags to coarse-grained tags that are more universal across different languages.
Is there a mapping defined in spacy for that?
For instance, something that maps all the following fine-grained tags to NOUN.
"NN": "noun, singular or mass",
"NNP": "noun, proper singular",
"NNPS": "noun, proper plural",
"NNS": "noun, plural",
I know that spaCy can tag documents with both types of tags, but I don't want to re-tag the documents.

spaCy already does what you describe in the pretrained pipelines, using an AttributeRuler component in the pipeline. I would recommend looking at the AttributeRuler documentation.
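As a minimal sketch (assuming spaCy v3 and an installed en_core_web_sm pipeline), you can recover the fine-grained-to-coarse-grained mapping from the pipeline's attribute_ruler patterns instead of re-tagging anything:

```python
import spacy

# Assumes spaCy v3 and the en_core_web_sm pipeline are installed.
nlp = spacy.load("en_core_web_sm")
ruler = nlp.get_pipe("attribute_ruler")

# Rebuild the fine-grained (TAG) -> coarse-grained (POS) mapping from the
# ruler's stored patterns, so already-tagged text never has to be re-tagged.
tag_to_pos = {}
for entry in ruler.patterns:
    attrs = entry["attrs"]
    if "POS" not in attrs:
        continue
    for pattern in entry["patterns"]:
        for token_spec in pattern:
            tag = token_spec.get("TAG")
            if tag is not None:
                tag_to_pos[tag] = attrs["POS"]

print(tag_to_pos["NN"])   # NOUN
print(tag_to_pos["NNS"])  # NOUN
print(tag_to_pos["NNP"])  # PROPN
```

Note that in the Universal POS scheme the pipelines use, NNP and NNPS map to PROPN rather than NOUN.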

Related

Does Named Entity Recognition only work on nouns?

I am considering training spaCy to recognize a custom named entity, but I am curious whether this really only works for nouns or whether it would work equally well with other parts of speech, such as adjectives.
For example, I want to train on words like depressed, anxious, paranoid, etc. I'm trying to curate a list of adjectives that are considered clinically relevant, separating them from other irrelevant adjectives like happy, sad, unwell.
Is NER the right approach here, or would it make more sense to just manually maintain a list of clinical adjectives and use a custom extension (e.g. ent._.clinical_adj) to mark them?
NER is typically used on nouns. The approach itself is not that sensitive to part of speech, but picking up just adjectives would be an unusual use.
Since it sounds like you do have a specific, finite list of words you're interested in, it probably makes sense for you to just use that word list and an extension to mark them.
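A minimal sketch of that approach (the word list and the extension name clinical_adj are placeholders; a real list would come from your curated clinical vocabulary):

```python
import spacy
from spacy.tokens import Token

# Hypothetical curated list of clinically relevant adjectives.
CLINICAL_ADJS = {"depressed", "anxious", "paranoid"}

# Register a custom token-level extension (default: not clinical).
Token.set_extension("clinical_adj", default=False)

nlp = spacy.load("en_core_web_sm")
doc = nlp("The anxious patient seemed depressed but not unwell.")

for token in doc:
    # Mark tokens that are adjectives and appear in the curated list.
    if token.pos_ == "ADJ" and token.lower_ in CLINICAL_ADJS:
        token._.clinical_adj = True

print([t.text for t in doc if t._.clinical_adj])  # ['anxious', 'depressed']
```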
You might also want to look at the DependencyMatcher. If there are some nouns you are interested in, you can use the DependencyMatcher to get all adjectives that modify them, for example.
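For instance, here is a sketch that pulls out adjectival modifiers of a hypothetical target noun "patient", assuming spaCy v3's DependencyMatcher:

```python
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

# Anchor on the noun "patient" (hypothetical target), then match any
# direct child with the adjectival-modifier dependency relation.
pattern = [
    {"RIGHT_ID": "target", "RIGHT_ATTRS": {"LOWER": "patient"}},
    {
        "LEFT_ID": "target",
        "REL_OP": ">",
        "RIGHT_ID": "modifier",
        "RIGHT_ATTRS": {"DEP": "amod"},
    },
]
matcher.add("ADJ_OF_PATIENT", [pattern])

doc = nlp("The anxious patient shared a room with a depressed patient.")
for match_id, token_ids in matcher(doc):
    # token_ids are ordered like the pattern: [target, modifier]
    print(doc[token_ids[1]].text, "->", doc[token_ids[0]].text)
```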

Wikipedia data extraction

I am trying to populate some tables with Hindi Wikipedia data. I have to populate them with article titles, their categories, and their corresponding English URLs.
Right now I am finding the category and English URL by parsing the HTML of each page and locating a particular div tag. This is taking a lot of time. Is there a more direct and efficient way to populate the categories?
I have downloaded the Hindi Wikipedia dump from: ftp://wikipedia.c3sl.ufpr.br/wikipedia/hiwiki/20131201/
You could either use a parsing engine like Wikiprep: http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/
Or you could use the MediaWiki engine itself to handle the wiki markup language:
http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps
There may be other options relevant to your case; see also:
http://en.wikipedia.org/wiki/Wikipedia:Database_download#Help_importing_dumps_into_MySQL
(I've personally used options #1 and #2.)
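As an alternative sketch, instead of parsing rendered HTML you can stream the XML dump itself and pull titles, category links, and interlanguage links out of the wikitext. The file name, the export namespace URI, and the presence of en: links in the wikitext are assumptions to verify against your particular dump:

```python
import re
import xml.etree.ElementTree as ET

# Assumption: pages-articles dump file name; adjust to your download.
DUMP = "hiwiki-20131201-pages-articles.xml"
# Assumption: the export namespace version varies by dump; check the file header.
NS = "{http://www.mediawiki.org/xml/export-0.8/}"

# Hindi category links use the "श्रेणी:" prefix ("Category:" also appears).
CATEGORY_RE = re.compile(r"\[\[(?:Category|श्रेणी):([^\]|]+)")
# Interlanguage link to English, if present in this dump's wikitext.
EN_LINK_RE = re.compile(r"\[\[en:([^\]]+)\]\]")

for _, elem in ET.iterparse(DUMP):
    if elem.tag == NS + "page":
        title = elem.findtext(NS + "title")
        text = elem.findtext(f"{NS}revision/{NS}text") or ""
        categories = CATEGORY_RE.findall(text)
        en_links = EN_LINK_RE.findall(text)
        print(title, categories, en_links)
        elem.clear()  # free memory while streaming the large dump
```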

How can I implement hierarchical tags (tags belonging to other tags) with acts_as_taggable_on?

On our website for a cancer-related organization, we have a flat tag structure with tags like "Leukemia" but also "Chronic Myelogenous Leukemia" and "Acute Lymphoblastic Leukemia". We have a rule that anything tagged with a specific kind of leukemia should also be tagged with the main "leukemia" tag, but there is no programmatic link between them.
It'd be nice if there were such a link, some relation between tags describing one as the parent or child of another, so that on, say, the "Leukemia" page we could have some links to the sub-topics: AML, CML, etc.
It looks like the developers don't plan on supporting this (according to a Jan 2011 GitHub issue), but it seems like a common enough use case that maybe someone's found a workaround, perhaps by modifying the Tag model to make tags themselves taggable (insert Xzibit photo).

Using WordNet to find terms with no noun synsets or at least one noun synset

I am using WordNet 3.0. The WordNet documentation shows how to find synsets of a given word such as:
wn car -synsn
But is there a way to find terms with:
a) no noun synsets
b) at least one noun synset
and so on?
The short answer is:
"NO! There is no way to search based on existence or count of words in synset"
Neither the Command Line interface nor the Library API provide the ability to apply this kind of predicates to a search.
This said, it is possible to import WordNet files to a more relational type of storage, and perform this type of queries in the resulting database.
The more direct way to import WordNet data is by tapping directly into the WordNet files themselves (see in particular these two files and parsing out the desired data.
An alternative is to build some kind of scanner of the data based on the Library API, hence leveraging all the WordNet format parsing capability of the library, and to output the desired Fields to a text file more suitable for database import.
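As one concrete instance of the library-API route (using NLTK's WordNet corpus reader, which is an assumption; the answer above refers to the original C library API), such predicates are easy to apply term by term:

```python
from nltk.corpus import wordnet as wn  # assumes nltk + the wordnet corpus data

def has_noun_synset(term):
    """True if the term has at least one noun synset in WordNet."""
    return len(wn.synsets(term, pos=wn.NOUN)) > 0

words = ["car", "quickly", "beautiful"]

# (b) terms with at least one noun synset
print([w for w in words if has_noun_synset(w)])      # ['car']
# (a) terms with no noun synsets
print([w for w in words if not has_noun_synset(w)])  # ['quickly', 'beautiful']
```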

Is there a way to extract semantic information from a PDF? (converting PDF to pure XHTML)

I'm looking for a way to extract semantic structural information (like titles, headings, paragraphs, or lists) from a PDF, because I want purely structural data from it.
Ultimately, I want to create pure XHTML from the PDF, with only the structural information and no design or layout.
I know a PDF can be created without any structural information; I'm not considering those PDFs, only regular, well-structured ones.
I'm new to PDF, so I don't know whether it offers a regular semantic structure. I want to know whether the PDF spec includes that kind of information and, if it does, the best way to get at it.
I would highly recommend reading through the PDF spec:
http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf
There isn't a "semantic structure" to the document like you might find in an HTML file; it's much more complicated.
The file format is largely based on a COS Object Tree, which is essentially a set of objects referencing each other in various manners, but not in any particular order (with some exceptions).
Some of these objects contain what you are likely after (document tags, etc.). Moreover, these objects can be encoded in various ways.
Very complicated.
I would recommend looking at some of the well developed PDF libraries out there like iText:
http://itextpdf.com/
What do you mean by 'well-structured'?
If the PDFs contain marked content, you can get an almost perfect extraction of semantic data. Otherwise, it simply does not exist, though it might be 'guessed' in some cases.
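A minimal sketch of checking for that marked content, assuming the pypdf library (the answers themselves point to iText; pypdf here is just an illustration, and "example.pdf" is a placeholder). A tagged PDF advertises itself through the MarkInfo and StructTreeRoot entries in the document catalog:

```python
from pypdf import PdfReader  # assumption: pypdf is installed

reader = PdfReader("example.pdf")
catalog = reader.trailer["/Root"]

# Tagged (marked-content) PDFs set /Marked in the /MarkInfo dictionary
# and expose their logical structure tree under /StructTreeRoot.
mark_info = catalog.get("/MarkInfo")
is_tagged = bool(mark_info and mark_info.get("/Marked"))
has_structure_tree = "/StructTreeRoot" in catalog

print("Tagged PDF:", is_tagged)
print("Has structure tree:", has_structure_tree)
```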