How can I decode WordNet entities?

In my experiment, I am using WordNet18 data, which contains triples of the form (subject, predicate, object). Some examples are as follows:
03964744 _hyponym 04371774
00260881 _hypernym 00260622
02199712 _member_holonym 02188065
I would like to know what the entity IDs, like 03964744, stand for. Does anybody know how to decode these entities?
Thank you in advance. 

The 8-digit numbers you see are probably the byte offsets of the entries in the WordNet data files. See http://wordnet.princeton.edu/wordnet/man/wnintro.5WN.html
After quite a bit of hunting around, I think you are looking at offsets from WordNet 3.0 (the byte offsets for a given synset differ between versions; 3.1 is the latest version).
Your first entry seems to be saying that swing is a type of toy:
http://compling.hss.ntu.edu.sg/omw/cgi-bin/wn-gridx.cgi?usrname=&gridmode=grid&synset=04371774-n&lang=eng&lang2=eng
http://compling.hss.ntu.edu.sg/omw/cgi-bin/wn-gridx.cgi?synset=03964744-n
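If you have NLTK installed, you can also resolve these offsets programmatically. Below is a minimal sketch, assuming the IDs are WordNet 3.0 noun-synset byte offsets (NLTK bundles WordNet 3.0, so the numbers should line up); the two offsets are simply the ones from your first triple:

# Minimal sketch: resolve WN18-style IDs against NLTK's bundled WordNet 3.0.
# Assumes the 8-digit IDs are byte offsets of noun synsets ('n'); adjust the
# part of speech if your triples also involve verbs, adjectives, or adverbs.
from nltk.corpus import wordnet as wn

for offset in ("03964744", "04371774"):
    synset = wn.synset_from_pos_and_offset("n", int(offset))
    print(offset, synset.name(), "-", synset.definition())

If your NLTK version predates the public synset_from_pos_and_offset method, older releases expose the same lookup as the private wn._synset_from_pos_and_offset.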

Related

Reading Fortran binary file in Python

I'm having trouble reading an unformatted F77 binary file in Python.
I've tried the scipy.io.FortranFile method and the numpy.fromfile method, both to no avail. I have also read the file in IDL, which works, so I have a benchmark for what the data should look like. I'm hoping that someone can point out a silly mistake on my part -- there's nothing better than having an idiot moment and then washing your hands of it...
The data, bcube1, has dimensions 101x101x101x3 and is of type real*8. There are 3090903 entries in total. They are written using the following statement (not my code, copied from source).
open (unit=21, file=bendnm, status='new'
. ,form='unformatted')
write (21) bcube1
close (unit=21)
I can successfully read it in IDL using the following (also not my code, copied from colleague):
bcube=dblarr(101,101,101,3)
openr,lun,'bcube.0000000',/get_lun,/f77_unformatted,/swap_if_little_endian
readu,lun,bcube
free_lun,lun
The returned data (bcube) is double precision, with dimensions 101x101x101x3, so the header information for the file is aware of its dimensions (not flattened).
Now I try to get the same effect using Python, but no luck. I've tried the following methods.
In [30]: f = scipy.io.FortranFile('bcube.0000000', header_dtype='uint32')
In [31]: b = f.read_record(dtype='float64')
which returns the error Size obtained (3092150529) is not a multiple of the dtypes given (8). Changing the dtype changes the size obtained but it remains indivisible by 8.
Alternately, using fromfile results in no errors but returns one more value than is in the array (a footer perhaps?), and the individual array values are wildly wrong (they should all be of order unity).
In [38]: f = np.fromfile('bcube.0000000')
In [39]: f.shape
Out[39]: (3090904,)
In [42]: f
Out[42]: array([ -3.09179121e-030, 4.97284231e-020, -1.06514594e+299, ...,
8.97359707e-029, 6.79921640e-316, -1.79102266e-037])
I've tried using byteswap to see if this makes the floating point values more reasonable but it does not.
It seems to me that the np.fromfile method is very close to working but there must be something wrong with the way it's reading the header information. Can anyone suggest how I can figure out what should be in the header that allows IDL to know about the array dimensions and datatype? Is there a way to pass header information to fromfile so that it knows how to treat the leading entry?
I played a bit around with it, and I think I have an idea.
How Fortran stores unformatted data is not standardized, so you have to experiment a bit, but you need three pieces of information:
The format of the data. You suggest it is 64-bit reals, or 'f8' in Python.
The type of the header. That is an unsigned integer, but you need the length in bytes. If unsure, try 4.
The header usually stores the length of the record in bytes, and is repeated at the end.
Then again, it is not standardized, so no guarantees.
The endianness, little or big.
Technically for both header and values, but I assume they're the same.
Python defaults to little endian, so if that were the correct setting for your data, I think you would have already solved it.
When you open the file with scipy.io.FortranFile, you need to give the data type of the header. So if the data is stored big-endian, and you have a 4-byte unsigned integer header, you need this:
from scipy.io import FortranFile
ff = FortranFile('data.dat', 'r', '>u4')
When you read the data, you need the data type of the values. Again, assuming big-endian, you want type >f8:
vals = ff.read_reals('>f8')
Look in the numpy documentation on data type objects (dtype) for a description of the syntax of these data type strings.
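Putting the pieces together for the file in the question, a sketch might look like the following. The 4-byte big-endian header and the Fortran-order reshape are assumptions suggested by the IDL /swap_if_little_endian flag and the dblarr call, not certainties:

from scipy.io import FortranFile

# Sketch: read one unformatted record of 101*101*101*3 float64 values.
# '>u4' = 4-byte big-endian record markers; '>f8' = big-endian doubles.
ff = FortranFile('bcube.0000000', 'r', '>u4')
flat = ff.read_reals('>f8')
ff.close()

# Fortran writes column-major data, so reshape with order='F' to recover
# the (101, 101, 101, 3) layout used by the IDL dblarr call.
bcube = flat.reshape((101, 101, 101, 3), order='F')
print(bcube.shape, bcube.dtype)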
If you have control over the program that writes the data, I strongly suggest you write it as a plain stream (for example with access='stream' in modern Fortran), which can be more easily read by Python.
Fortran adds record demarcations to unformatted (binary) files, and they are poorly documented.
So every write to an unformatted file:
integer*4 Test1
real*4 Matrix(3,3)
open(78, form='unformatted')
write(78) Test1
write(78) Matrix
close(78)
Each such write should ultimately be bracketed by np.int32 values. (I've seen references saying these tell you the record length, but I haven't verified that personally.)
The above could be read in Python via numpy as:
import numpy as np

input_file = open(file_location, 'rb')
datum = np.dtype([('P1', np.int32), ('Test1', np.int32), ('P2', np.int32),
                  ('P3', np.int32), ('MatrixT', (np.float32, (3, 3))), ('P4', np.int32)])
data = np.fromfile(input_file, datum)
This should fully populate the data array with the individual data sets in the format above. Do note that numpy expects data to be packed in C order (row major) while Fortran data is column major. For square matrix shapes like the one above, this means getting the data out of the matrix requires a transpose before use. For non-square matrices, you will need to reshape and transpose:
Matrix = np.transpose(data[0]['MatrixT'])
Transposing your 4-D data structure is going to need to be done carefully. You might look into SciPy for automated ways to do so; the SciPy package seems to have Fortran-related utilities which I have not fully explored.
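For what it's worth, the same structured-dtype idea applied to the bcube file from the question might look like this sketch; the 4-byte markers and big-endian byte order are assumptions carried over from the IDL call:

import numpy as np

# Sketch: one record = 4-byte marker, 101*101*101*3 big-endian doubles, 4-byte marker.
n = 101 * 101 * 101 * 3
record = np.dtype([('head', '>i4'),
                   ('data', '>f8', (n,)),
                   ('tail', '>i4')])

raw = np.fromfile('bcube.0000000', dtype=record, count=1)[0]
assert raw['head'] == raw['tail']  # both markers should agree on the record length

# Reshape column-major (Fortran order) to match the IDL dblarr(101,101,101,3).
bcube = raw['data'].reshape((101, 101, 101, 3), order='F')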

Schema of codec is part of the data stream

I'm currently evaluating if scodec is the right tool for my task. I have to parse an InputStream (file or network) which is structured the following:
| Header - FieldDesc1 - FieldDesc2 - ... - FieldDescM - Record1 - Record2 - ... - RecordN |
This means the stream starts with some metadata, which describes what will follow. Each element is separated by a delimiter ( - ) which identifies what type it is. The M field descriptions contain the information about the structure and size that each of the N records will have.
I was readily able to parse the header as well as the sequence of field descriptions, because I was able to formulate a codec which is known at compile time. But I'm puzzled about how to build a codec at runtime from the information in the field descriptions.
Is that possible? If yes, perhaps you can point me to an example?
Here is a starting point if it's still relevant:
Use a DiscriminatorCodec (and a potentially related question), or
Use consume() on the codec decoding the type identifier (which I guess is a simple number), and pass the type to a function returning the desired Codec.
For example using consume(), you can determine what codec to use at decode time:
def variableTypeC =
  int8.consume(tid => selectCodec(tid))(selectTypeId(_))
I had to work on a similar problem and went for the consume() solution (as I had the feeling it gave me a bit more flexibility, and I was only discovering scodec at the time).
I'd be happy to build an example using DiscriminatorCodec if there is any need for it :).
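For readers coming from outside Scala, the consume()-style approach boils down to: decode the type identifier first, then hand the remaining bytes to whichever decoder that identifier selects. A rough sketch of the same idea in plain Python (not scodec; the tag values and payload formats below are invented purely for illustration):

import struct

# Read a one-byte type id, then dispatch to the decoder for the rest of the record.
DECODERS = {
    1: lambda buf: struct.unpack_from('>i', buf)[0],   # tag 1: 32-bit int
    2: lambda buf: struct.unpack_from('>d', buf)[0],   # tag 2: 64-bit float
}

def decode_record(buf):
    tag = buf[0]
    return DECODERS[tag](buf[1:])

print(decode_record(b'\x01' + struct.pack('>i', 42)))   # -> 42
print(decode_record(b'\x02' + struct.pack('>d', 3.5)))  # -> 3.5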

How do you flatten and unflatten an array of doubles in LabVIEW?

I have created a simple LabVIEW program, shown below, that attempts to flatten an array [1,0,3] and then unflatten it and print out the contents.
However, I am unsuccessful in doing so. What am I doing wrong?
What am I doing wrong?
You're not going through tutorials or you're not reading the context help for the unflatten function (Ctrl+H) or you're not reading the full help for the function (right click>>Help) or you're not looking at the examples (from the help or Help>>Find Examples). Take your pick (preferably all four).
If you want an actual answer, it is that LabVIEW is strictly typed, and therefore you need to tell the unflatten function which data type you want it to output (a 1D DBL array), which you're not doing. But the real answer is what's in the previous paragraph - you should use those tools to learn how to find such an answer yourself.
The string returned by Flatten to String only contains the data, not the description of what data type was passed in, so in order to unflatten it again you need to tell Unflatten from String what type it was. You do this by wiring some data of the appropriate type (any data - if it's an array it can be an empty one) to the Type terminal.
I don't think this is immediately obvious from the LabVIEW 2012 help but I think it's fairly clear if you follow the link from the Unflatten from String help page to one of the examples. The Read Flattened Data.vi example has an array wired to the Type input.
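For anyone reading this outside LabVIEW, the underlying point is that a flattened buffer is just bytes with no type description attached, so the reader has to be told the type. A rough Python analogue, assuming LabVIEW's default flattened layout for a 1D DBL array (big-endian values with an I32 length prefix; check the Flattened Data documentation before relying on the exact layout):

import struct

# Rough analogue of Flatten To String / Unflatten From String for [1.0, 0.0, 3.0].
values = [1.0, 0.0, 3.0]
flattened = struct.pack('>i', len(values)) + struct.pack('>%dd' % len(values), *values)

# The buffer carries no type information, so the reader must supply it --
# which is exactly what the Type terminal on Unflatten From String is for.
(count,) = struct.unpack_from('>i', flattened, 0)
restored = struct.unpack_from('>%dd' % count, flattened, 4)
print(list(restored))  # [1.0, 0.0, 3.0]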

How Do I Convert a Byte Stream to a Text String?

I'm working on a licensing system for my application. I'd like to put all licensing information (licensee name, expiration date, and enabled features) into an object, encrypt that object with a private key, then represent the encrypted data as a single text string which I can send via email to my customers.
I've managed to get the encrypted data into a byte stream, but I don't know how to convert that byte stream into a text value -- something that contains no control characters or whitespace. Can anyone offer advice on how to do that? I've been researching the Encoding class, but I can't find a text-only encoding.
I'm using .NET 2.0 -- mostly VB, but I can do C# also.
Use Base64 encoding (Convert.ToBase64String and Convert.FromBase64String in .NET) to convert it to a text string that can easily be decoded again. It is great for representing arbitrary binary data in a text-friendly manner, using only the letters A-Z and a-z, the digits 0-9, and the characters '+', '/' and '=' for padding.
BinHex is an example of one way to do that. It may not be exactly what you want -- for example, you might want to encode your data such that it's impossible to inadvertently spell words in your string, and you may or may not care about maximizing the density of information. But it's an example that may help you come up with your own encoding.
I've found Base32 useful for license keys before. There are some C# implementations linked from this answer. My own license code is based on this implementation, which avoids ambiguous characters to make it easier to retype the keys.
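To see the trade-off concretely, here is a small sketch (in Python for brevity; in .NET the Base64 side is Convert.ToBase64String / Convert.FromBase64String, and Base32 comes from a third-party implementation like the ones linked above):

import base64

# Sketch: turn an encrypted byte blob into an email-safe text token and back.
# The 16 bytes below just stand in for your encrypted license payload.
blob = bytes(range(16))

b64 = base64.b64encode(blob).decode('ascii')   # letters, digits, '+', '/', '=' padding
b32 = base64.b32encode(blob).decode('ascii')   # A-Z and 2-7 only: easier to retype

print(b64)
print(b32)
assert base64.b64decode(b64) == blob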

Extract terms from query for highlighting

I'm extracting terms from the query by calling ExtractTerms() on the Query object that I get as the result of QueryParser.Parse(). I get a Hashtable, but each item is present as:
Key - term:term
Value - term:term
Why are the key and the value the same? And moreover, why is the term value duplicated and separated by a colon?
Do highlighters only insert tags, or do they do anything else? I want not only to get text fragments but to highlight the whole source text (it's quite big). I am trying to get the terms and insert tags by hand using their offsets, but I am not sure whether this is the right solution.
I think the answer to this question may help.
It is because .NET 2.0 doesn't have an equivalent to Java's HashSet, so the port to .NET uses Hashtables with the same value stored as both key and value. The colon you see is just the result of Term.ToString(): a Term is a field name plus the term text, and your field name is probably "term".
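To make the key/value duplication concrete, here is a small Python analogue of that Hashtable-as-set workaround (the field and term values are made up for illustration):

# Emulating a set with a plain hash table: each item is stored as both key and
# value, so enumerating the table shows every entry twice. Each entry is the
# string produced by Term.ToString(), i.e. fieldname:text.
terms = {}
for field, text in [("term", "lucene"), ("term", "highlight")]:
    term_string = "%s:%s" % (field, text)
    terms[term_string] = term_string

for key, value in terms.items():
    print("Key -", key, "| Value -", value)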
To highlight an entire document using the Highlighter contrib, use the NullFragmenter.