Creating, Visualizing and Querying simple Data Structures - sql

Simple and common tree-like data structures
Data structure example:
Animated Cartoons have 4 extremities (arm, leg, limb...)
Humans have 4 ext.
Insects have 6 ext.
Arachnids have 6 ext.
Animated Cartoons have 4 fingers per extremity
Humans have 5 fingers per ext.
Insects have 1 finger per ext.
Arachnids have 1 finger per ext.
Some kind of implementation:
Level/Table 0: Quantity, Item
Level/Table 1: ItemName, Kingdom
Level/Table 2: Kingdom, NumberOfExtremities
Level/Table 3: ExtremityName, NumberOfFingers
Example dataset:
1 Homer Simpson, 1 Ralph Wiggum, 2 jon skeet, 3 Atomic ant, 2 Shelob (spider)
Querying "number of fingers":
Number = 1*4*4 + 1*4*4 + 2*4*5 + 3*6*1 + 2*6*1 = 102 fingers (let Jon be a Human)
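For concreteness, the same computation as a minimal Python sketch (the structure and numbers come straight from the example above; the dict layout is just one possible encoding):

# Kingdom definitions: (extremities, fingers per extremity)
kingdoms = {
    "AnimatedCartoons": (4, 4),
    "Human": (4, 5),
    "Insects": (6, 1),
    "Arachnids": (6, 1),
}

# Dataset: (quantity, name, kingdom)
subjects = [
    (1, "Homer Simpson", "AnimatedCartoons"),
    (1, "Ralph Wiggum", "AnimatedCartoons"),
    (2, "jon skeet", "Human"),
    (3, "Atomic ant", "Insects"),
    (2, "Shelob", "Arachnids"),
]

total = sum(qty * kingdoms[k][0] * kingdoms[k][1] for qty, _, k in subjects)
print(total)  # 102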
I wonder if there is any tool that lets me define this structure in a parseable way, automatically create the inherited data, and draw this kind of tree (with the bonus of making this kind of data access possible).
It could be drawn manually with FreeMind, for example, but AFAIK it doesn't let you define datatypes or structures that automatically create inherited branches of items, so it's really annoying to have to repeat a structure over and over by copying (with the risk of mistakes). Repeated work over repeated data (a human running repeated code) is a bug waiting to happen.
So I would like to write the data in the right language, one that lets me reuse it for queries and visualization. If all the data is in XML, or Java classes, or a database file, etc., is there some tool for viewing the tree and making the query?
PS: Creating nested folders in a filesystem and using Norton Commander in tree view is not an option, I hope (just because it would have to be built manually).

Your answer is mostly going to depend on what programming skills you already have and what skills you are willing to acquire. I can tell you what I would do with what I know.
I think for drawing trees you want a LaTeX package like qtree. If you don't like that one, there are a bunch of others out there. You'd have to write a script in whatever your favorite scripting language is to parse your input into the LaTeX code that generates the trees, but this could easily be done in less than 100 lines in most languages, if I properly understand your intentions. I would definitely recommend storing your data in an XML format, using a library like Ruby's REXML or whatever your favorite scripting language has.
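For illustration, here is a minimal Python sketch of such a script (the input XML layout and names are invented for the example; qtree's \Tree [.Label child child ] notation is the output target):

import xml.etree.ElementTree as ET

def to_qtree(node):
    """Recursively render an XML element as LaTeX qtree source."""
    children = list(node)
    label = node.get("name", node.tag)
    if not children:
        return label
    return "[.{%s} %s ]" % (label, " ".join(to_qtree(c) for c in children))

tree = ET.fromstring("""
<node name="Life">
  <node name="AnimatedCartoons"/>
  <node name="Human"/>
  <node name="Insects"/>
  <node name="Arachnids"/>
</node>
""")
print("\\Tree " + to_qtree(tree))
# \Tree [.{Life} AnimatedCartoons Human Insects Arachnids ]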
If you are looking to generate more interactive trees, check out the Adobe Flex Framework. Again, if you don't like this specific framework, there are bunches of others out there (I recommend the blog FlowingData).
Hope this helps and I didn't miserably misunderstand your question.

The data structure you are describing looks like it can fit the XML format. Take a look at the eXist XML database; if I may say so, it is the most complete XML database. It comes with many tools to get you started fast, like the XQuery Sandbox option in the admin HTTP interface.
Example Dataset
1 Homer Simpson, 1 Ralph Wiggum, 2 jon skeet, 3 Atomic ant, 2 Shelob (spider)
I am assuming that there are 2 instances of jon skeet, 3 instances of Atomic ant and 2 instances of Shelob.
Here is an XQuery example:
let $doc :=
  <root>
    <definition>
      <AnimatedCartoons>
        <extremities>4</extremities>
        <fingers_per_ext>4</fingers_per_ext>
      </AnimatedCartoons>
      <Human>
        <extremities>4</extremities>
        <fingers_per_ext>5</fingers_per_ext>
      </Human>
      <Insects>
        <extremities>6</extremities>
        <fingers_per_ext>1</fingers_per_ext>
      </Insects>
      <Arachnids>
        <extremities>6</extremities>
        <fingers_per_ext>1</fingers_per_ext>
      </Arachnids>
    </definition>
    <subject><name>Homer Simpson</name><kind>AnimatedCartoons</kind></subject>
    <subject><name>Ralph Wiggum</name><kind>AnimatedCartoons</kind></subject>
    <subject><name>jon skeet</name><kind>Human</kind></subject>
    <subject><name>jon skeet</name><kind>Human</kind></subject>
    <subject><name>Atomic ant</name><kind>Insects</kind></subject>
    <subject><name>Atomic ant</name><kind>Insects</kind></subject>
    <subject><name>Atomic ant</name><kind>Insects</kind></subject>
    <subject><name>Shelob</name><kind>Arachnids</kind></subject>
    <subject><name>Shelob</name><kind>Arachnids</kind></subject>
  </root>
let $definitions := $doc/definition/*
let $subjects := $doc/subject
(: here goes some query logic :)
let $fingers := fn:sum(
  for $subject in $subjects
  return (
    for $x in $definitions
    where fn:name($x) = $subject/kind
    return $x/extremities * $x/fingers_per_ext
  )
)
return $fingers
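Against the dataset above, this query returns 102 (2×4×4 + 2×4×5 + 3×6×1 + 2×6×1), matching the hand computation in the question.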

An XML Schema editor with visualization is perhaps what I am searching for:
http://en.wikipedia.org/wiki/XML_Schema_Editor
Checking it out...

Related

Generating similar named entities/compound nouns

I have been trying to create distractors (false answers) for multiple choice questions. Using word vectors, I was able to get decent results for single-word nouns.
When dealing with compound nouns (such as "car park" or "Donald Trump"), my best attempt was to compute similar words for each part of the compound and combine them. The results are very entertaining:
Car park -> vehicle campground | automobile zoo
Fire engine -> flame horsepower | fired motor
Donald Trump -> Richard Jeopardy | Jeffrey Gamble
Barrack Obama -> Obamas McCain | Auschwitz Clinton
Unfortunately, these are not very convincing. Especially in the case of named entities, I want to produce other named entities that appear in similar contexts, e.g.:
Fire engine -> Fire truck | Fireman
Donald Trump -> Barrack Obama | Hillary Clinton
Niagara Falls -> American Falls | Horseshoe Falls
Does anyone have any suggestions of how this could be achieved? Is there a way to generate similar named entities/noun chunks?
I managed to get some good distractors by searching for the named entities on Wikipedia, then extracting similar entities from the summary. Though I'd prefer to find a solution using just spaCy.
If you haven't seen it yet, you might want to check out sense2vec, which allows learning context-sensitive vectors by including part-of-speech tags or entity labels. A quick usage example of the spaCy extension:

import spacy
from sense2vec import Sense2VecComponent

nlp = spacy.load('en_core_web_sm')  # any English spaCy model
s2v = Sense2VecComponent('/path/to/reddit_vectors-1.1.0')
nlp.add_pipe(s2v)

doc = nlp(u"A sentence about natural language processing.")
# doc[3] is the merged noun phrase "natural language processing"
most_similar = doc[3]._.s2v_most_similar(3)
# [(('natural language processing', 'NOUN'), 1.0),
#  (('machine learning', 'NOUN'), 0.8986966609954834),
#  (('computer vision', 'NOUN'), 0.8636297583580017)]
See here for the interactive demo using a sense2vec model trained on Reddit comments. Using this model, "car park" returns things like "parking lot" and "parking garage", and "Donald Trump" gives you "Sarah Palin", "Mitt Romney" and "Barack Obama". For ambiguous entities, you can also include the entity label: for example, "Niagara Falls|GPE" will show terms similar to the geopolitical entity (GPE), i.e. the city, as opposed to the actual waterfalls. The results obviously depend on what was present in the data, so for even more specific similarities, you could also experiment with training your own sense2vec vectors.
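For entity-level lookups from code rather than the demo, something along these lines should work with the component set up above; a sketch, assuming the same older sense2vec extension API (._.in_s2v and the entity-merging behaviour), with only indicative output:

doc = nlp(u"Donald Trump gave a speech at Niagara Falls.")
# The component merges recognised entities into single tokens,
# keyed with their entity label, e.g. "Donald_Trump|PERSON"
for token in doc:
    if token._.in_s2v:  # the key exists in the vector table
        print(token.text, token._.s2v_most_similar(3))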

offline GTFS connection from A to B

The project I am working on involves static offline GTFS data in a mobile app. All the GTFS data is available inside realm-objects (or SQLite if needed).
Now, I would like to establish all train- or bus-connections from A to B (starting after a certain departure-time).
How do I query the GTFS data in order to get a connection from A to B?
I managed to get all trips leaving from A.
I managed to get all station names along those trips, including times.
But I find it very hard to get the connection information between two locations A and B. What SQL queries do I have to set up in order to get that information?
Any help appreciated!
If you just want to dynamically calculate the shortest travel routes between static hubs in an offline application, you can use Dijkstra's algorithm:
[image of example graph and shortest-path formula] (Source)
Here's the pseudocode:
function Dijkstra(Graph, source):

    create vertex set Q

    for each vertex v in Graph:              // Initialization
        dist[v] ← INFINITY                   // Unknown distance from source to v
        prev[v] ← UNDEFINED                  // Previous node in optimal path from source
        add v to Q                           // All nodes initially in Q (unvisited nodes)

    dist[source] ← 0                         // Distance from source to source

    while Q is not empty:
        u ← vertex in Q with min dist[u]     // Node with the least distance
                                             // will be selected first
        remove u from Q

        for each neighbor v of u:            // where v is still in Q
            alt ← dist[u] + length(u, v)
            if alt < dist[v]:                // A shorter path to v has been found
                dist[v] ← alt
                prev[v] ← u

    return dist[], prev[]
(Source: Wikipedia, Dijkstra's algorithm)
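As a runnable companion to the pseudocode, here is a minimal Python sketch; it uses a heap instead of the linear minimum scan, and the toy graph of "stations" and travel times is made up:

import heapq

def dijkstra(graph, source):
    """graph: dict mapping node -> list of (neighbor, edge_length) pairs."""
    dist = {v: float("inf") for v in graph}
    prev = {v: None for v in graph}
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:              # stale entry; u was already settled
            continue
        for v, length in graph[u]:
            alt = d + length
            if alt < dist[v]:        # a shorter path to v has been found
                dist[v] = alt
                prev[v] = u
                heapq.heappush(heap, (alt, v))
    return dist, prev

# Toy network: stations A..D, edge weights as minutes of travel
graph = {
    "A": [("B", 5), ("C", 2)],
    "B": [("D", 1)],
    "C": [("B", 1), ("D", 7)],
    "D": [],
}
dist, prev = dijkstra(graph, "A")
print(dist["D"])  # 4 (A -> C -> B -> D)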
Alright, I'll admit so far my answer was a tad facetious...
I suspect you don't realize just how complicated it is to do what you're asking, especially in an offline environment. Companies like Google and Microsoft have spent millions on research with huge teams of data scientists.
If this is something you are serious about, I'd encourage you to start with a 10×10 grid and work on the logic of getting from Point A to Point B as you add barriers in random places (simulating roads beginning and ending). I was recently surprised by how complicated a seemingly simple, somewhat related Stack Overflow question about pipe-size conversions turned out to be when I answered it (linked below).
If you didn't have the "offline" condition, I would've suggested looking into getting a [free] API key for the Google Web Services Directions API. Basically you tell it where Points A and B are, and it gives you detailed route information, via transit or other methods.
Another of Google's many APIs that could be helpful is the Snap to Roads API, which turns partial or error-ridden paths into drivable ones.
The study of route logic and shortest-path logic is actually fascinating stuff, and there are some amazing resources for learning the related theories (see below), including a video from Google explaining how they went about it.
...but unfortunately this isn't something that's going to be accomplished with a simple SQL Query.
Actual Resources:
YouTube : Google Tech Talk: Fast Route Planning
Wikipedia : Shortest path problem
Wikipedia : Dijkstra's algorithm
...and some slightly Lighter Reading:
Wikipedia : Travelling Salesman Problem
Why UPS drivers don’t turn left and you probably shouldn’t either
Google Web Services : Directions API Developer's Guide
Stack Overflow : Shortest Route of converters between two different pipe sizes?
Wikipedia : Seven Bridges of Königsberg

ssis import xml attributes as elements

I have the following XML (this is just a sample) that is received from a third party (we have no influence on changing the structure) and that we need to import into SQL Server. Each of these files has multiple top-level nodes (excuse me if the terminology is incorrect, but I mean elements like "CardAuthorisation"), so some are CardFee, Financial, etc.
The issue is that the detail is in attributes. This file is from a new vendor. There is an XML file currently being received from another vendor which is a lot easier to import, as the data is in elements and not in attributes.
Here is a sample:
<CardAuthorisation>
<RecType>ADV</RecType>
<AuthId>32397275</AuthId>
<AuthTxnID>11606448</AuthTxnID>
<LocalDate>20140612181918</LocalDate>
<SettlementDate>20140612</SettlementDate>
<Card PAN="2009856214560271" product="MCRD" programid="DUMMY1" branchcode=""></Card>
<Account no="985621456" type="00"></Account>
<TxnCode direction="debit" Type="atm" Group="fee" ProcCode="30" Partial="NA" FeeWaivedOff="0"></TxnCode>
<TxnAmt value="0.0000" currency="826"></TxnAmt>
<CashbackAmt value="0.00" currency="826"></CashbackAmt>
<BillAmt value="0.00" currency="826" rate="1.00"></BillAmt>
<ApprCode>476274</ApprCode>
<Trace auditno="305330" origauditno="305330" Retrefno="061200002435"></Trace>
<MerchCode>BOIA </MerchCode>
<Term code="S1A90971" location="PO NORFOLK STR 3372308 CAMBRIDGESHI3 GBR" street="" city="" country="GB" inputcapability="5" authcapability="7"></Term>
<Schema>MCRD</Schema>
<Txn cardholderpresent="0" cardpresent="yes" cardinputmethod="5" cardauthmethod="1" cardauthentity="1"></Txn>
<MsgSource value="74" domesticMaestro="yes"></MsgSource>
<PaddingAmt value="0.00" currency="826"></PaddingAmt>
<Rate_Fee value="0.00"></Rate_Fee>
<Fixed_Fee value="0.20"></Fixed_Fee>
<CommissionAmt value="0.20" currency="826"></CommissionAmt>
<Classification RCC="" MCC="6011"></Classification>
<Response approved="YES" actioncode="0" responsecode="00" additionaldesc=" PO NORFOLK STR 3372308 CAMBRIDGESHI3 GBR"></Response>
<OrigTxnAmt value="0.00" currency="826"></OrigTxnAmt>
<ReversalReason></ReversalReason>
</CardAuthorisation>
And what we need to do is import this into various tables (one for each top-level element type).
So, for example, CardAuthorisation should be imported into the "Authorisation" table, CardFinancial should go into the "Financial" table, etc.
So the question is: what is the best method to employ to import this data?
Having read a bit, I understand XSLT can be used for this and would be able to turn the above into:
<CardAuthorisation>
  <RecType>ADV</RecType>
  <AuthId>32397275</AuthId>
  <AuthTxnID>11606448</AuthTxnID>
  <LocalDate>20140612181918</LocalDate>
  <SettlementDate>20140612</SettlementDate>
  <PAN>2009856214560271</PAN>
  <product>MCRD</product>
  <programid>DUMMY1</programid>
  <branchcode></branchcode>
  <Accountno>985621456</Accountno>
  <type>00</type>
  <TxnCodedirection>debit</TxnCodedirection>
  <TxnCodeType>atm</TxnCodeType>
  <TxnCodeGroup>fee</TxnCodeGroup>
  <TxnCodeProcCode>30</TxnCodeProcCode>
  <TxnCodePartial>NA</TxnCodePartial>
  <TxnCodeFeeWaivedOff>0</TxnCodeFeeWaivedOff>
  <TxnAmtvalue>0.0000</TxnAmtvalue>
  <TxnAmtcurrency>826</TxnAmtcurrency>
  <CashbackAmtvalue>0.00</CashbackAmtvalue>
  <CashbackAmtcurrency>826</CashbackAmtcurrency>
  <BillAmtvalue>0.00</BillAmtvalue>
  <BillAmtcurrency>826</BillAmtcurrency>
  <BillAmtrate>1.00</BillAmtrate>
  <ApprCode>476274</ApprCode>
  etc. etc.
</CardAuthorisation>
But the information I read was quite old (4-5 years old), and I know SSIS is always being improved, so I'm not sure whether that advice is still valid today.
Thanks in advance for your thoughts.
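As a rough illustration of the attribute-to-element transform itself (not SSIS-specific), here is a minimal Python sketch; the wrapper file name is hypothetical, and the parent-plus-attribute naming rule is an assumption based on the hand-written target above:

import xml.etree.ElementTree as ET

def promote_attributes(record):
    """Return a copy of a record with child attributes promoted to elements."""
    flat = ET.Element(record.tag)
    for child in record:
        if child.attrib:
            # <Card PAN="..."/> becomes <CardPAN>...</CardPAN>, and so on
            for name, value in child.attrib.items():
                ET.SubElement(flat, child.tag + name).text = value
        else:
            flat.append(child)  # already element-style, keep as-is
    return flat

# Hypothetical wrapper file: the vendor's top-level records
# (CardAuthorisation, CardFee, Financial, ...) under a single root
root = ET.parse("vendor_feed.xml").getroot()
for record in root:
    print(ET.tostring(promote_attributes(record), encoding="unicode"))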

Hadoop Pig - Replace strings in a relation with their corresponding values in a map

I have a relation called conversations_grouped made up of bags of tuples of varying sizes, like so:
DUMP conversations_grouped:
...
({(L194),(L195),(L196),(L197)})
({(L198),(L199)})
({(L200),(L201),(L202),(L203)})
({(L204),(L205),(L206)})
({(L207),(L208)})
({(L271),(L272),(L273),(L274),(L275)})
({(L276),(L277)})
({(L280),(L281)})
({(L363),(L364)})
({(L365),(L366)})
({(L666256),(L666257)})
({(L666369),(L666370),(L666371),(L666372)})
({(L666520),(L666521),(L666522)})
Each L[0-9]+ is a tag corresponding to a string. For example, L194 might be "Hello, how are you doing?" and L195 might be "fine, how are you?". This correspondence is maintained by a map called line_map. Here's a sample:
DUMP line_map;
...
([L666324#Do you think she might be interested in someone?])
([L666264#Well that's typical of Her Majesty's army. Appoint an engineer to do a soldier's work.])
([L666263#Um. There are rumours that my Lord Chelmsford intends to make Durnford Second in Command.])
([L666262#Lighting COGHILL' 5 cigar: Our good Colonel Dumford scored quite a coup with the Sikali Horse.])
([L666522#So far only their scouts. But we have had reports of a small Impi farther north, over there. ])
([L666521#And I assure you, you do not In fact I'd be obliged for your best advice. What have your scouts seen?])
([L666520#Well I assure you, Sir, I have no desire to create difficulties. 45])
([L666372#I think Chelmsford wants a good man on the border Why he fears a flanking attack and requires a steady Commander in reserve.])
([L666371#Lord Chelmsford seems to want me to stay back with my Basutos.])
([L666370#I'm to take the Sikali with the main column to the river])
([L666369#Your orders, Mr Vereker?])
([L666257#Good ones, yes, Mr Vereker. Gentlemen who can ride and shoot])
([L666256#Colonel Durnford... William Vereker. I hear you 've been seeking Officers?])
What I'm trying to do now is parse through each line and replace the L[0-9]+ tags with their corresponding text from line_map. Is it possible to make references to line_map from within a Pig FOREACH statement, or is there something else I have to do?
The first issue with this is that in a map the key must be a quoted string, so you can't use a schema value to access the map. E.g., this will not work:
C: {foo: chararray, M: [value:chararray]}
D = FOREACH C GENERATE M#foo ;
The solution that comes to mind is to FLATTEN conversations_grouped, then do a JOIN between conversations_grouped and line_map on the L[0-9]+ tag. You'll probably want to project out some of the extra fields (like the L[0-9]+ tag after the join) to make the next step faster. After that you'll have to regroup the data and massage it into the correct format.
This won't work unless each bag has its own unique ID for the regrouping, but if each of the L[0-9]+ tags appears in only one bag (conversation), you can use a tag to create a unique id.
-- A is dumped conversations_grouped
B = FOREACH A {
    -- Pulls out an element from the bag to use as the id
    id = LIMIT tags 1 ;
    -- Flattens B into id, tag form. Each group of tags will have the same id.
    GENERATE FLATTEN(id), FLATTEN(tags) ;
}
The schema and output for B is:
B: {id: chararray,tags::tag: chararray}
(L194,L194)
(L194,L195)
(L194,L196)
(L194,L197)
(L198,L198)
(L198,L199)
(L200,L200)
(L200,L201)
(L200,L202)
(L200,L203)
(L204,L204)
(L204,L205)
(L204,L206)
(L207,L207)
(L207,L208)
(L271,L271)
(L271,L272)
(L271,L273)
(L271,L274)
(L271,L275)
(L276,L276)
(L276,L277)
(L280,L280)
(L280,L281)
(L363,L363)
(L363,L364)
(L365,L365)
(L365,L366)
(L666256,L666256)
(L666256,L666257)
(L666369,L666369)
(L666369,L666370)
(L666369,L666371)
(L666369,L666372)
(L666520,L666520)
(L666520,L666521)
(L666520,L666522)
Assuming that the tags are unique, the rest is done like this:
-- A2 is line_map, loaded in tag/message pairs instead of a map
-- Joins conversations_grouped and line_map on tag
C = FOREACH (JOIN B BY tags::tag, A2 BY tag)
    -- This GENERATE removes the tag
    GENERATE id, message ;

-- Regroups C on the id created in B
D = FOREACH (GROUP C BY id)
    -- This step limits the output to just messages
    GENERATE C.(message) AS messages ;
Schema and output from D:
D: {messages: {(A2::message: chararray)}}
({(Colonel Durnford... William Vereker. I hear you 've been seeking Officers?),(Good ones, yes, Mr Vereker. Gentlemen who can ride and shoot)})
({(Your orders, Mr Vereker?),(I'm to take the Sikali with the main column to the river),(Lord Chelmsford seems to want me to stay back with my Basutos.),(I think Chelmsford wants a good man on the border Why he fears a flanking attack and requires a steady Commander in reserve.)})
({(Well I assure you, Sir, I have no desire to create difficulties. 45),(And I assure you, you do not In fact I'd be obliged for your best advice. What have your scouts seen?),(So far only their scouts. But we have had reports of a small Impi farther north, over there. )})
NOTE: In the worst case (if the L[0-9]+ tags aren't unique), you can give each line of the input file(s) a sequential integer id before you load it into Pig.
UPDATE: If you are using Pig 0.11, then you can also use the RANK operator.

Need to extract information from free text, information like location, course etc

I need to write a text parser for the education domain that can extract information like institute, location, course, etc. from free text.
Currently I am doing it with Lucene. The steps are as follows:
Index all the data related to institutes, courses and locations.
Make shingles of the free text and search each shingle in the location, course and institute index directories, then try to find out which part of the text represents a location, course, etc.
With this approach I am missing a lot of cases, e.g. B.tech can be written as btech, b-tech or b.tech.
I want to know whether there is anything available that can do all these kinds of things. I have heard about LingPipe and GATE but don't know how efficient they are.
You definitely need GATE. GATE has two main, most frequently used features (among thousands of others): rules and dictionaries. Dictionaries (gazetteers in GATE's terms) allow you to put all possible cases like "B.tech", "btech" and so on in a single text file and let GATE find and mark them all. Rules (more precisely, JAPE rules) allow you to define patterns in text. For example, here's a pattern to catch MIT's postal address ("77 Massachusetts Ave., Building XX, Cambridge MA 02139"):
{Token.kind == number}(SP){Token.orth == uppercase}(SP){Lookup.majorType == avenue}(COMMA)(SP)
{Token.string == "Building"}(SP){Token.kind == number}(COMMA)(SP)
{Lookup.majorType == city}(SP){Lookup.majorType == USState}(SP){Token.kind == number}
where (SP) and (COMMA) are macros (just to make the text shorter), {Something} is an annotation, {Token.kind == number} is an annotation "Token" with feature "kind" equal to "number" (i.e. just a number in the text), and {Lookup} is an annotation that captures values from a dictionary (BTW, GATE already has dictionaries for such things as US cities). This is a quite simple example, but you should see how easily you can cover even very complicated cases.
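To make the dictionary side concrete: the degree-name variants from the question would simply go, one per line, into a plain-text gazetteer list file (the file name courses.lst and the major type "course" here are hypothetical):

B.tech
btech
b-tech
b.tech

The list is then registered in the gazetteer's lists.def (e.g. courses.lst:course), after which {Lookup.majorType == course} matches any of the variants.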
I haven't used Lucene, but in your case I would leave the different forms of the same keyword as they are and just keep a link table or some such, in which I'd record the relation between these different forms.
You may need to write a regular expression to cover each possible form of your vocabulary.
Be careful about your choice of analyzer/tokenizer, because words like B.tech can easily be split into two different words (i.e. B and tech).
You may want to check out UIMA. Like LingPipe and GATE, this framework features text annotation, which is what you are trying to do. Here is a tutorial which will help you write an annotator for UIMA:
http://uima.apache.org/d/uimaj-2.3.1/tutorials_and_users_guides.html#ugr.tug.aae.developing_annotator_code
UIMA has add-ons, in particular one for Lucene integration.
You can try http://code.google.com/p/graph-expression/
Here is an example of address parsing rules:
GraphRegExp.Matcher Token = match("Token");
GraphRegExp.Matcher Country = GraphUtils.regexp("^USA$", Token);
GraphRegExp.Matcher Number = GraphUtils.regexp("^\\d+$", Token);
GraphRegExp.Matcher StateLike = GraphUtils.regexp("^([A-Z]{2})$", Token);
GraphRegExp.Matcher Postoffice = seq(match("BoxPrefix"), Number);
GraphRegExp.Matcher Postcode =
mark("Postcode", seq(GraphUtils.regexp("^\\d{5}$", Token), opt(GraphUtils.regexp("^\\d{4}$", Token))))
;
//mark(String, Matcher) -- means creating chunk over sub matcher
GraphRegExp.Matcher streetAddress = mark("StreetAddress", seq(Number, times(Token, 2, 5).reluctant()));
//without new lines
streetAddress = regexpNot("\n", streetAddress);
GraphRegExp.Matcher City = mark("City", GraphUtils.regexp("^[A-Z]\\w+$", Token));
Chunker chunker = Chunkers.pipeline(
Chunkers.regexp("Token", "\\w+"),
Chunkers.regexp("BoxPrefix", "\\b(POB|PO BOX)\\b"),
new GraphExpChunker("Address",
seq(
opt(streetAddress),
opt(Postoffice),
City,
StateLike,
Postcode,
Country
)
).setDebugString(true)
);
B.tech can be written as btech, b-tech or b.tech
Lucene will let you do fuzzy searches based on the Levenshtein distance. A query for roam~ (note the ~) will find terms like foam and roams.
That might allow you to match the different cases.
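Outside Lucene, the same idea is easy to prototype. Here is a minimal Python sketch using string similarity from difflib (not true Levenshtein distance, but the same spirit); the vocabulary and cutoff are made up for the example:

import difflib

vocabulary = ["B.tech", "M.tech", "B.Sc", "MBA"]

def norm(s):
    """Lowercase and strip punctuation so variants collapse to one form."""
    return "".join(ch for ch in s.lower() if ch.isalnum())

def best_match(term, vocab, cutoff=0.7):
    """Return the closest vocabulary entry to term, or None."""
    matches = difflib.get_close_matches(norm(term), [norm(v) for v in vocab],
                                        n=1, cutoff=cutoff)
    if not matches:
        return None
    return next(v for v in vocab if norm(v) == matches[0])

for form in ["btech", "b-tech", "b.tech"]:
    print(form, "->", best_match(form, vocabulary))
# each variant maps back to "B.tech"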