WOLF (Wordnet Libre du Français, Free French Wordnet) specifications - wordnet

I am trying to create an interface for WOLF (Wordnet Libre du Français, Free French Wordnet). The goal is to replicate the AWNDatabaseManger for the Arabic Wordnet (http://www.talp.upc.edu/index.php/technology/resources/multilingual-lexicons-and-machine-translation-resources/multilingual-lexicons/72-awn), but for WOLF.
The problem I am facing is that I cannot find proper data specifications for WOLF (http://alpage.inria.fr/~sagot/wolf-en.html) or WoNeF (another French translated Wordnet, http://wonef.fr/).
For the Arabic Wordnet they have given detailed Data Specifications which can be found at http://globalwordnet.org/arabic-wordnet/awn-data-spec/
I am trying to find the same for either WOLF or WoNeF.
Otherwise, how do I map the two files?
For example, a word and its relations in AWN look like:
<item itemid="$ajarap_AlS~amog_n1AR" offset="111586059" lexfile="" name="شَجَرَة الصَّمْغ " type="synset" headword="" POS="n" source="" gloss="" authorshipid="80" />
<word wordid="$ajarap__1" value="شَجَرَة الصَّمْغ " synsetid="$ajarap_AlS~amog_n1AR" frequency="" corpus="" authorshipid="11461" />
<link type="has_hyponym" link1="$ajarap_AlS~amog_n1AR" link2=">ukAlibotws_n1AR" authorshipid="35038" />
<link type="has_hyponym" link1="$ajarap_n1AR" link2="$ajarap_AlS~amog_n1AR" authorshipid="35041" />
The word definition (item) and its relations (link) are separate elements with different attributes,
whereas in WOLF a word and its relations look like:
<SYNSET>
<ILR type="near_antonym">eng-30-00002098-a</ILR>
<ILR type="be_in_state">eng-30-05200169-n</ILR>
<ILR type="be_in_state">eng-30-05616246-n</ILR>
<ILR type="eng_derivative">eng-30-05200169-n</ILR>
<ILR type="eng_derivative">eng-30-05616246-n</ILR>
<ID>eng-30-00001740-a</ID>
<SYNONYM>
<LITERAL lnote="2/2:fr.csbgen,fr.csen">comptable</LITERAL>
</SYNONYM>
<DEF>(usually followed by `to') having the necessary means or skill or know-how or authority to do something
</DEF>
<USAGE>able to swim</USAGE>
<USAGE>she was able to program her computer</USAGE>
<USAGE>we were at last able to buy a car</USAGE>
<USAGE>able to get a grant for the project</USAGE>
<BCS>3</BCS>
<POS>a</POS>
</SYNSET>
I could assume that the AWN attribute gloss corresponds to the WOLF tag USAGE, and that the AWN attribute POS corresponds to the WOLF tag POS.
But the point is that I don't want to make assumptions; I am looking for proper documentation from which I can confirm the mappings between the two files.
Could anyone please point me to the right docs?

Depending on your needs, a workaround could be to use the NLTK Python library, which integrates some French synsets that probably come from WOLF (the session below is Python 2):
>>> from nltk.corpus import wordnet as wn
>>> [synset.lemma_names('fra') for synset in wn.synsets(u'chien'.decode('utf-8'), lang='fra')]
[[u'canis_familiaris', u'chien'], [u'aboyeur', u'chien', u'chienchien', u'clébard', u'toutou'], [u'chien', u'chien_de_chasse'], [u'chien'], [u'chien', u'clic', u'cliquer', u'cliquet'], [u'chien', u'franc', u'hot-dog'], [u'achille', u'chien', u'quignon', u'talon'], [u'chien'], [u'chien']]

The WOLF database uses the VisDic XML format, which is described here:
https://nlp.fi.muni.cz/trac/deb2/wiki/WordNetFormat
The XSD is available here: http://deb.fi.muni.cz/debvisdic.xsd
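
As a rough illustration of reading that format, here is a minimal Python 3 sketch that flattens one SYNSET element into a dictionary using only the standard library. The sample is a shortened copy of the SYNSET from the question; the function name, the dictionary keys and the shape of the link triples are hypothetical, and whether an ILR corresponds one-to-one to an AWN link is an assumption that the format documentation would need to confirm.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the SYNSET shown in the question.
SAMPLE = """<SYNSET>
<ILR type="near_antonym">eng-30-00002098-a</ILR>
<ILR type="be_in_state">eng-30-05200169-n</ILR>
<ID>eng-30-00001740-a</ID>
<SYNONYM>
<LITERAL lnote="2/2:fr.csbgen,fr.csen">comptable</LITERAL>
</SYNONYM>
<USAGE>able to swim</USAGE>
<POS>a</POS>
</SYNSET>"""


def parse_synset(elem):
    """Flatten one <SYNSET> element into a plain dict (key names are hypothetical)."""
    synset_id = elem.findtext("ID")
    return {
        "id": synset_id,
        "pos": elem.findtext("POS"),
        "literals": [lit.text for lit in elem.findall("SYNONYM/LITERAL")],
        "usages": [usage.text for usage in elem.findall("USAGE")],
        # One (type, source, target) triple per ILR. Treating this as the
        # equivalent of an AWN <link> (and this direction) is an assumption.
        "links": [(ilr.get("type"), synset_id, ilr.text) for ilr in elem.findall("ILR")],
    }


synset = parse_synset(ET.fromstring(SAMPLE))
print(synset["id"], synset["pos"], synset["literals"])
for link in synset["links"]:
    print(link)
```

For the full WOLF file, the same function could be applied to every SYNSET element found under the document root.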

Related

How can I have named entities in asciidoctor?

I'm using asciidoctor with the docbook backend for books. In the past I wrote DocBook, which allows me to declare named entities that I use throughout the book:
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE book [
<!ENTITY class "Galactic TOP SECRET">
<!ENTITY project "World Domination">
<!ENTITY product "Illuminati Mind Control Chemtrail Spray System CSS-2020">
]>
<book ...>
...
What about our &class; &project;?
Is our &product; working?
...
</book>
:-)
I haven't found a way to tell asciidoctor to insert the DOCTYPE declaration between the XML processing instruction and the <book> element. So I resorted to --no-header-footer and prepending the header and footer lines. Is there a better way to do this? Something like a named entity definition directive? An include mechanism?
Do you have to use Docbook entity declarations? Asciidoctor has "attributes" that can serve the same purpose: https://asciidoctor.org/docs/user-manual/#attributes
For example, you can define an attribute within your document:
:class: Galactic TOP SECRET
Then later in your document, you can use the attribute:
"Billy, come up to the front and address the {class}." said the teacher.
When you transform your document to Docbook, you would see:
<simpara>"Billy, come up to the front and address the Galactic TOP SECRET."
said the teacher.</simpara>
If you do have to use Docbook entity declarations, you might use some XSL to transform the XML you get into the XML you want.
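
If XSL feels too heavy, a plain post-processing step is another option. The following is only a sketch in Python 3: it takes a DocBook string and inserts an internal DOCTYPE subset after the XML declaration. The function name and its parameters are hypothetical, and it does not address whether entity references in your source survive the Asciidoctor conversion.

```python
def add_doctype(docbook_xml, root_name, entities):
    """Insert a DOCTYPE with internal entity declarations after the XML declaration.

    Assumes entity values contain no double quote, percent sign or ampersand,
    since those would need escaping inside an entity declaration.
    """
    declarations = "\n".join(
        '<!ENTITY {} "{}">'.format(name, value) for name, value in entities.items()
    )
    doctype = "<!DOCTYPE {} [\n{}\n]>\n".format(root_name, declarations)
    if docbook_xml.startswith("<?xml"):
        # Split right after the closing "?>" of the XML declaration.
        end = docbook_xml.index("?>") + 2
        return docbook_xml[:end] + "\n" + doctype + docbook_xml[end:].lstrip("\n")
    return doctype + docbook_xml


source = "<?xml version='1.0' encoding='utf-8'?>\n<book>\n<para>&class;</para>\n</book>\n"
result = add_doctype(source, "book", {"class": "Galactic TOP SECRET"})
print(result)
```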

How to validate xml:lang ATTLIST inside XML with DTD?

Many articles on the internet (like this one) suggest using xml:lang or some custom attribute to encode meta-information about language inside XML tags. They mention that these codes have to comply with BCP47 standard.
Let's see what would happen if I encode language attribute as articles suggest:
Inside DTD: <!ATTLIST text xml:lang NMTOKEN #IMPLIED>
Inside XML: <text xml:lang="YODU991Yklew-e-ijsw02ijwk">...</text>
What is the expected result?
The DTD validator would check whether YODU991Yklew-e-ijsw02ijwk is a real BCP47 language code and whether the country and script exist, and would mark the attribute red if any of those codes is incorrect, exactly the same way http://schneegans.de/ helps validate these codes (WRONG code vs. CORRECT code).
What happens instead?
The validator perceives this attribute only as some text and does not check whether it is a real language code or gibberish.
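
Since NMTOKEN only constrains the characters of the value, a check like the one expected here has to happen outside the DTD. As a minimal sketch in Python 3, the regular expression below tests whether a tag is syntactically well formed according to the langtag grammar of BCP47. It is an approximation: it ignores the grandfathered tags, and it does not look anything up in the IANA subtag registry, so a well-formed tag with a nonexistent language or region still passes. The function name is hypothetical.

```python
import re

# Approximation of the BCP47 "langtag" grammar (grandfathered tags are omitted).
_LANGTAG = re.compile(
    r"""
    (?:
        (?:[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4,8})   # language, optional extlangs
        (?:-[a-z]{4})?                                # script
        (?:-(?:[a-z]{2}|[0-9]{3}))?                   # region
        (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*      # variants
        (?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*          # extensions
        (?:-x(?:-[a-z0-9]{1,8})+)?                    # private use
      | x(?:-[a-z0-9]{1,8})+                          # private-use-only tag
    )
    """,
    re.VERBOSE | re.IGNORECASE,
)


def is_well_formed(tag):
    """Return True if the tag matches the syntax above (no registry lookup)."""
    return _LANGTAG.fullmatch(tag) is not None


for candidate in ("en-US", "zh-Hant-TW", "YODU991Yklew-e-ijsw02ijwk"):
    print(candidate, is_well_formed(candidate))
```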

Creating Tagged PDF Document in Seam iText

I am trying to create an accessible PDF using Seam and their iText implementation. I cannot find any references to whether or not this is possible. It seems that iText itself can handle it; the PDF on this example is tagged. But all of the PDFs that we create aren't and I can't seem to figure out how to add it.
Here's some sample code from one of our documents:
<?xml version="1.0" encoding="UTF-8"?>
<p:document xmlns:p="http://jboss.com/products/seam/pdf" xmlns:ui="http://java.sun.com/jsf/facelets" xmlns:f="http://java.sun.com/jsf/core" xmlns:s="http://jboss.com/products/seam/taglib" xmlns:h="http://java.sun.com/jsf/html" type="PDF" pageSize="letter" title="Letter" margins="15.0 40.0 20.0 10.0">
<f:facet name="header">
<p:font size="10" name="TIMES-ROMAN" style="bold">
<p:header borderWidth="0"/>
<p:footer borderWidthTop="0" borderWidthBottom="0" alignment="center">
FY #{handler.form.year}<p:text value=" #{handler.form.name}"/><p:text value=" "/>CAN #{handler.form.number}<p:text value=" "/>Object Class #{handler.form.class}<p:text value=" "/>#{handler.form.time}
</p:footer>
</p:font>
</f:facet>
<p:font size="10" name="TIMES-ROMAN">
<p:table columns="3" widthPercentage="100" widths="1 2 1">
<p:cell borderWidth="0">
<p:image alignment="left" value="/assets/img/logo.PNG" scalePercent="5"/>
</p:cell>
<p:cell borderWidth="0" horizontalAlignment="center" paddingTop="30">
<p:paragraph>
WORKSHEET
</p:paragraph>
</p:cell>
... snip ...
I realize that's not the best code (I'm just pulling from a document I'll need to clean up). Still, any ideas on if Seam can actually put in PDF tags?
Out-of-the-box tagged PDF is supported since iText 5.4.0 (which is the most recent version).
When you use the high-level objects such as Paragraph, PdfPTable, etc... and you use PdfWriter.setTagged(), then you get good quality Tagged PDF. You can even choose your own roles.
It would surprise me if jBoss/SEAM would be using such a recent version of iText. I've reached out to them to upgrade and the SEAM team never responded. (Who am I? I'm the CEO of iText Software.)

Docbook publishing for different target audiences

I would like to have one DocBook XML document that has content for several target audiences. Is there a filter that enables me to filter out the stuff only needed for "advanced" users?
The level attribute is invented by me to express what I have in mind.
<?xml version="1.0"?>
<book>
<title lang="en">Documentation</title>
<chapter id="introduction" level="advanced">
<title>Introduction for advanced users</title>
</chapter>
<chapter id="introduction" level="basic">
<title>Introduction for basic users</title>
</chapter>
<chapter id="ch1">
<para level="basic">Just press the button</para>
<para level="advanced">
Go to preferences to set your
needs and then start the process
by pressing the button.
</para>
</chapter>
</book>
DocBook does not have a level attribute. Perhaps you meant userlevel?
If you are using the DocBook XSL stylesheets to transform your documents, they have built-in support for profiling (conditional text). To use it you need to
use the profiling-enabled version of the stylesheet (e.g. use html/profile-docbook.xsl instead of the usual html/docbook.xsl), and
specify the attribute values you want to profile on via a parameter (e.g. set profile.userlevel to basic).
Chapter 26 of Bob Stayton's DocBook XSL: The Complete Guide has all the details.
Two ways, off the top of my head:
1. Write a quick script that takes the level as a parameter and, using XPath or regular expressions, spits out only the XML you want.
2. Write an XSLT transformation that will spit out the XML you want.
(2) is cleaner, but (1) is probably faster to write up.
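
The quick-script route could look roughly like the following Python 3 sketch. It drops every element whose level attribute is present and differs from the requested value, and keeps elements without the attribute. The function name and its attr parameter are hypothetical, and the sample is a shortened version of the document in the question.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<book>
<chapter id="ch1">
<para level="basic">Just press the button</para>
<para level="advanced">Go to preferences first.</para>
</chapter>
</book>"""


def filter_by_level(xml_text, level, attr="level"):
    """Return the XML with elements for other levels removed."""
    root = ET.fromstring(xml_text)
    # Snapshot the element list first, because the tree is modified while walking it.
    for parent in list(root.iter()):
        for child in list(parent):
            value = child.get(attr)
            if value is not None and value != level:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


basic = filter_by_level(SAMPLE, "basic")
print(basic)
```

Passing attr="userlevel" would apply the same filter to the standard DocBook attribute instead of the invented one.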

list=alllinks confusion

I'm doing a research project for the summer and I need to get some data from Wikipedia, store it and then do some analysis on it. I'm using the Wikipedia API to gather the data and I've got that down pretty well.
My question is about the links-alllinks option in the API doc here
After reading the description, both there and in the API itself (it's down a bit and I can't link directly to the section), I think I understand what it's supposed to return. However, when I ran a query it gave me back something I didn't expect.
Here's the query I ran:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=google&rvprop=ids|timestamp|user|comment|content&rvlimit=1&list=alllinks&alunique&allimit=40&format=xml
Which in essence says: Get the last revision of the Google page, include the id, timestamp, user, comment and content of each revision, and return it in XML format.
The alllinks list (I thought) should give me back a list of Wikipedia pages which point to the Google page (in this case the first 40 unique ones).
I'm not sure what the policy is on swears, but this is the result I got back exactly:
<?xml version="1.0"?>
<api>
<query><normalized>
<n from="google" to="Google" />
</normalized>
<pages>
<page pageid="1092923" ns="0" title="Google">
<revisions>
<rev revid="366826294" parentid="366673948" user="Citation bot" timestamp="2010-06-08T17:18:31Z" comment="Citations: [161]Tweaked: url. [[User:Mono|Mono]]" xml:space="preserve">
<!-- The page content, I've replaced this cos its not of interest -->
</rev>
</revisions>
</page>
</pages>
<alllinks>
<!-- offensive content removed -->
</alllinks>
</query>
<query-continue>
<revisions rvstartid="366673948" />
<alllinks alfrom="!2009" />
</query-continue>
</api>
The <alllinks> part is just a load of random gobbledygook and offensive comments. Not nearly what I thought I'd get. I've done a fair bit of searching but I can't seem to find a direct answer to my question.
What should the list=alllinks option return?
Why am I getting this crap in there?
You don't want a list; a list is something that iterates over all pages. With list=alllinks you simply "enumerate all links that point to a given namespace", regardless of the titles you passed.
You want a property associated with the Google page, so you need prop=links instead of the alllinks crap.
So your query becomes:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions|links&titles=google&rvprop=ids|timestamp|user|comment|content&rvlimit=1&format=xml
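
If you build the request in code, keeping the parameters in a dictionary makes it easier to see which ones belong to a prop module and which to a list module. This is a small Python 3 sketch that only assembles the URL above and does not send any request; the function name is hypothetical, and the parameter values are taken from the query shown.

```python
from urllib.parse import urlencode

API_ENDPOINT = "http://en.wikipedia.org/w/api.php"


def build_query_url(title):
    """Build the revisions-plus-links query for a single page title."""
    params = {
        "action": "query",
        "prop": "revisions|links",  # properties of the page itself, not a list module
        "titles": title,
        "rvprop": "ids|timestamp|user|comment|content",
        "rvlimit": 1,
        "format": "xml",
    }
    # urlencode percent-encodes the "|" separators as %7C.
    return API_ENDPOINT + "?" + urlencode(params)


url = build_query_url("google")
print(url)
```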