How to parse XML data through Hive using a SerDe?

I have an .xml file with data like this:
<book>
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price discount="0.15">44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications with XML.</description>
</book>
<book>
<author>Knorr, Stefan</author>
<title>Creepy Crawlies</title>
<genre>Horror</genre>
<price discount="0.15">4.95</price>
<publish_date>2000-12-06</publish_date>
<description>An anthology of horror stories about roaches, centipedes, scorpions and other insects.</description>
</book>
<book>
<author>Galos, Mike</author>
<title>Visual Studio 7: A Comprehensive Guide</title>
<genre>Computer</genre>
<price discount="0.15">49.95</price>
<publish_date>2001-04-16</publish_date>
<description>Microsoft Visual Studio 7 is explored in depth, looking at how Visual Basic, Visual C++, C#, and ASP+ are integrated into a comprehensive development environment.</description>
</book>
I am trying to parse the XML through Hive by creating an external table on top of the XML file on HDFS using a SerDe. My code is below.
I first added the jar:
add jar hdfs://xtlinno1vftsnxg:8020/user/hdfs/hivexmlserde-1.0.5.3.jar;
CREATE EXTERNAL TABLE hive_test_xml(
  col1 string,
  col2 string,
  col3 string)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  "column.xpath.col1"="/book/author/text()",
  "column.xpath.col2"="/book/title/text()",
  "column.xpath.col3"="/book/genre/text()"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 'hdfs://xtlinno1vftsnxg:8020/user/poctest2/testxml2.xml'
TBLPROPERTIES (
  "xmlinput.start"="<book",
  "xmlinput.end"="</book>");
The error I am getting is:
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org/apache/hadoop/hive/serde2/SerDe (state=08S01,code=1)
I am not sure how to resolve this error. Please help!

Most likely the metastore_db is corrupted. Set path validation to ignore and repair the table in the metastore:
set hive.msck.path.validation=ignore;
MSCK REPAIR TABLE table_name;
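
Independently of the error, it can help to see what the DDL in the question is asking for. The following is a minimal Python sketch, not the SerDe's actual implementation: it cuts the text into records between the "xmlinput.start" and "xmlinput.end" markers and pulls one value per column. The helper names split_records and to_row are hypothetical, and ElementTree's findtext stands in for the XPath expressions.

```python
import re
import xml.etree.ElementTree as ET

# Two records from the question, trimmed to the three mapped elements
SAMPLE = """<book>
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
</book>
<book>
<author>Knorr, Stefan</author>
<title>Creepy Crawlies</title>
<genre>Horror</genre>
</book>"""


def split_records(text, start="<book", end="</book>"):
    """Cut raw text into records delimited by the start/end markers."""
    pattern = re.escape(start) + r".*?" + re.escape(end)
    return re.findall(pattern, text, flags=re.DOTALL)


def to_row(record):
    """Map one <book> record to (col1, col2, col3)."""
    book = ET.fromstring(record)
    # findtext approximates /book/<tag>/text() for a direct child element
    return tuple(book.findtext(tag) for tag in ("author", "title", "genre"))


rows = [to_row(record) for record in split_records(SAMPLE)]
for row in rows:
    print(row)
```

Each record becomes one row, which is the shape the external table is meant to expose once the SerDe class loads correctly.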

Related

How to store conditional data in RDF

I want to store some facts in pure RDF data (ttl), something like this pseudocode:
:Person :hasGender :Male, :Female ;
:drink :Liquor, :softDrinks .
if (:someone :hasGender :Male) then
:someone :drink :Liquor ;
else
:someone :drink :softDrinks .
:Susan a :Person ;
:hasGender :Female .
and then read this RDF data (ttl) with SPARQL or another tool (rdf4j or rdflib), and get:
:Susan :drink :softDrinks .
I want to use pure RDF as much as possible to store data, rather than OWL, N3, RDF-star or SHACL, but I can reconstruct a new N3 or SHACL file from this RDF data and then get the inferred result.
I wonder if this is possible in RDF, and how I can modify this RDF data? Thanks.
I got it done. Here is data.ttl:
@prefix : <http://example.org/#> .
:Person :drink :someLiquid .
:someLiquid :hasRule :rule1, :rule2.
:rule1 :forGender :Male ;
:beverage :liquor .
:rule2 :forGender :Female ;
:beverage :softDrinks .
:Susan a :Person ;
:hasGender :Female .
:John a :Person ;
:hasGender :Male .
Here is the SPARQL query:
PREFIX : <http://example.org/#>
SELECT ?person ?gender ?beverage
WHERE {
:Person :drink :someLiquid .
:someLiquid :hasRule ?rule .
?rule :forGender ?forGender ;
:beverage ?beverage .
?person a :Person;
:hasGender ?gender .
FILTER ( ?gender = ?forGender )
}
and I get the result:
person, gender, beverage
:Susan, :Female, :softDrinks
:John, :Male, :liquor
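
The query is essentially a join between the rule triples and the person triples. As a rough illustration only, here is the same lookup over plain Python tuples instead of a triple store. The objects, subjects and beverages helpers are hypothetical names, and prefixes are dropped for brevity.

```python
# Triples from data.ttl as (subject, predicate, object) tuples, prefixes dropped
TRIPLES = {
    ("Person", "drink", "someLiquid"),
    ("someLiquid", "hasRule", "rule1"),
    ("someLiquid", "hasRule", "rule2"),
    ("rule1", "forGender", "Male"),
    ("rule1", "beverage", "liquor"),
    ("rule2", "forGender", "Female"),
    ("rule2", "beverage", "softDrinks"),
    ("Susan", "a", "Person"),
    ("Susan", "hasGender", "Female"),
    ("John", "a", "Person"),
    ("John", "hasGender", "Male"),
}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the data."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the data."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def beverages():
    """Join each rule with every person whose gender matches the rule."""
    results = []
    for rule in objects("someLiquid", "hasRule"):
        for for_gender in objects(rule, "forGender"):
            for beverage in objects(rule, "beverage"):
                for person in subjects("a", "Person"):
                    # Same test as FILTER(?gender = ?forGender)
                    if for_gender in objects(person, "hasGender"):
                        results.append((person, for_gender, beverage))
    return sorted(results)


print(beverages())
```

The "conditional" lives entirely in the data (one rule resource per branch), so plain RDF plus a query is enough and no inference engine is involved.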

SPARQL property paths based on a new property defined in a CONSTRUCT subquery

Given the following schema, "driver-passenger" lineages can be easily seen:
tp:trip a owl:Class ;
    rdfs:label "trip"@en ;
    rdfs:comment "an 'asymmetric encounter' where someone is driving another person."@en .
tp:driver a owl:ObjectProperty ;
    rdfs:label "driver"@en ;
    rdfs:comment "has keys."@en ;
    rdfs:domain tp:trip ;
    rdfs:range tp:person .
tp:passenger a owl:ObjectProperty ;
    rdfs:label "passenger"@en ;
    rdfs:comment "has drinks."@en ;
    rdfs:domain tp:trip ;
    rdfs:range tp:person .
Consider the following data:
<alice> a tp:person .
<grace> a tp:person .
<tim> a tp:person .
<ruth> a tp:person .
<trip1> a tp:trip ;
tp:participants <alice> , <grace> ;
tp:driver <alice> ;
tp:passenger <grace> .
<trip2> a tp:trip ;
tp:participants <alice> , <tim> ;
tp:driver <alice> ;
tp:passenger <tim> .
<trip3> a tp:trip ;
tp:participants <tim> , <grace> ;
tp:driver <tim> ;
tp:passenger <grace> .
<trip4> a tp:trip ;
tp:participants <grace> , <ruth> ;
tp:driver <grace> ;
tp:passenger <ruth> .
<trip5> a tp:trip ;
tp:participants <grace> , <tim> ;
tp:driver <grace> ;
tp:passenger <tim> .
Now let a "driver-passenger descendent" be any tp:passenger at the end of a trip sequence in which the tp:passenger of one trip is the tp:driver of the next trip.
For example, <ruth> is a descendent of <alice> according to the following sequence of trips:
<trip2> -> <trip3> -> <trip4>.
Question:
How to get the (ancestor,descendent) pairs of all driver-passenger lineages?
Attempt 1:
I initially tried the following CONSTRUCT subquery to define an object property: tp:drove, which can be easily used in a property path. However, this did not work on my actual data:
SELECT ?originalDriver ?passengerDescendent
WHERE {
?originalDriver tp:drove+ ?passengerDescendent .
{
CONSTRUCT { ?d tp:drove ?p . }
WHERE { ?t a tp:trip .
?t tp:driver ?d .
?t tp:passenger ?p .}
}
}
Attempt 2:
I tried to create a property path that expresses an ancestor as the driver of a passenger, but I don't think I've properly understood how this is supposed to work:
(tp:driver/^tp:passenger)+
Regarding MWE: Is there some kind of RDF sandbox that would allow me to create an MWE by defining a simple ontology like tp above, along with some sample data? The following "playgrounds" are available but none of them seem to support defining a toy ontology: SPARQL Playground, SPARQL Explorer.
Notes on related content:
This question is directly related to a previous question, but no longer requires saving the paths themselves, a feature not directly supported by SPARQL 1.1.
This answer by Joshua Taylor seems relevant, but doesn't address the identification of specific types of paths, such as the lineages defined above.
This one seems to do the trick:
select ?driver ?passenger where {
?driver (^tp:driver/tp:passenger)+ ?passenger .
filter( ?driver != ?passenger)
}
The filter condition can be removed if you want to also see relationships that lead back to the same person.
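
To see what the path (^tp:driver/tp:passenger)+ computes, here is a small sketch that treats each trip as a "drove" edge from driver to passenger and follows those edges one or more times. It runs over the sample trips from the question. The function name lineage_pairs is hypothetical, and the final comparison plays the role of the filter.

```python
# (driver, passenger) per trip, taken from the sample data in the question
TRIPS = {
    "trip1": ("alice", "grace"),
    "trip2": ("alice", "tim"),
    "trip3": ("tim", "grace"),
    "trip4": ("grace", "ruth"),
    "trip5": ("grace", "tim"),
}


def lineage_pairs(trips):
    """Return every (ancestor, descendent) pair reachable in one or more hops."""
    # One hop of ^tp:driver/tp:passenger: driver -> trip -> passenger
    drove = {}
    for driver, passenger in trips.values():
        drove.setdefault(driver, set()).add(passenger)

    pairs = set()
    for start in drove:
        seen = set()
        stack = list(drove[start])
        while stack:
            person = stack.pop()
            if person in seen:
                continue
            seen.add(person)
            stack.extend(drove.get(person, ()))
        # Same effect as filter(?driver != ?passenger)
        pairs.update((start, person) for person in seen if person != start)
    return pairs


for ancestor, descendent in sorted(lineage_pairs(TRIPS)):
    print(ancestor, "->", descendent)
```

Because <grace> and <tim> have driven each other, each is reachable from itself. Those are the self-pairs the filter removes.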

Length of path between nodes, allowing for multiple relation types (SPARQL)

I have to write a SPARQL query that returns the length of the path between two nodes (:persA and :persD) that are connected by these relationships:
@prefix : <http://www.example.org/> .
:persA :knows :knowRelation1 .
:knowRelation1 :hasPerson :persB .
:persB :knows :knowRelation2 .
:knowRelation2 :hasPerson :persC .
:persC :knows :knowRelation3 .
:knowRelation3 :hasPerson :persD .
I tried with this query:
PREFIX : <http://www.example.org/>
SELECT (COUNT(?mid) AS ?length)
WHERE
{
:persA (:knows | :hasPerson)* ?mid .
?mid (:knows | :hasPerson)+ :persD .
}
The result seems to be an infinite loop.
Any advice/examples of how this can be done?
After fixing some syntax errors in your initial post, the provided triples and query work for me in GraphDB Free 8.2 and BlazeGraph 2.1.1. I have since applied these edits to your post itself:
added a trailing / to your definition of the empty prefix
added a trailing . to your prefix definition line (required if the line starts with @prefix rather than PREFIX)
fixed the spelling of length (OK, that's just a cosmetic fix)
@prefix : <http://www.example.org/> .
:persA :knows :knowRelation1 .
:knowRelation1 :hasPerson :persB .
:persB :knows :knowRelation2 .
:knowRelation2 :hasPerson :persC .
:persC :knows :knowRelation3 .
:knowRelation3 :hasPerson :persD .
PREFIX : <http://www.example.org/>
SELECT (COUNT(?mid) AS ?length)
WHERE
{ :persA (:knows|:hasPerson)* ?mid .
?mid (:knows|:hasPerson)+ :persD
}
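
What this query counts can be checked by hand. A node qualifies as ?mid if it is reachable from :persA in zero or more steps and can reach :persD in one or more steps. The sketch below uses the six triples from the question, with prefixes dropped and hypothetical helper names. In this linear chain the number of such nodes happens to equal the number of edges between the two people.

```python
# Edges from the sample triples; :knows and :hasPerson are treated alike,
# mirroring the alternative (:knows|:hasPerson) in the property path
EDGES = [
    ("persA", "knowRelation1"),
    ("knowRelation1", "persB"),
    ("persB", "knowRelation2"),
    ("knowRelation2", "persC"),
    ("persC", "knowRelation3"),
    ("knowRelation3", "persD"),
]


def successors(node):
    return [target for source, target in EDGES if source == node]


def one_or_more(start):
    """Nodes reachable from start in at least one step (the '+' operator)."""
    seen, stack = set(), successors(start)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(successors(node))
    return seen


def zero_or_more(start):
    """Nodes reachable from start in any number of steps (the '*' operator)."""
    return {start} | one_or_more(start)


mids = sorted(m for m in zero_or_more("persA") if "persD" in one_or_more(m))
print(len(mids), mids)
```

On a graph with branches or cycles the count of intermediate nodes would no longer equal a path length, so this trick is specific to chains like the one here.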
result:
length
"6"^^xsd:integer

Query evaluation took too long

I am using GraphDB Free 7.1 and have created a repository with the default settings. I have uploaded a ttl file with 2.7 million triples. I am trying to run a moderately complex query that should return 200k answers, but the Workbench displays just 1k answers and the GraphDB log shows an exception:
10:52:19.580 [repositories/PaaSport] INFO c.o.f.sesame.RepositoryController - POST query -1325396809
10:52:29.594 [repositories/PaaSport] ERROR o.o.h.s.r.TupleQueryResultView - Query interrupted
org.openrdf.query.QueryInterruptedException: Query evaluation took too long
...
10:52:29.594 [repositories/PaaSport] INFO o.o.h.s.r.TupleQueryResultView - Request for query -1325396809 is finished
The query I'm using is:
SELECT DISTINCT ?offering ?Value
WHERE {
?offering a paasport:Offering ;
DUL:satisfies ?groundDescription .
?groundDescription paasport:offers ?characteristic .
?characteristic a paasport:Storage ;
DUL:hasParameter ?par .
?par a paasport:StorageCapacity ;
DUL:hasParameterDataValue ?Value ;
DUL:parametrizes ?qualityValue .
?qualityValue uomvocab:measuredIn ?Units .
?Units a ?AppParMeasureUnitType .
ucum:GB a ?AppParMeasureUnitType .
?Units a uomvocab:SimpleDerivedUnit .
ucum:GB a uomvocab:SimpleDerivedUnit .
ucum:GB uomvocab:derivesFrom ?BasicUnit .
?Units uomvocab:derivesFrom ?BasicUnit .
ucum:GB uomvocab:modifierPrefix ?prefix1 .
?Units uomvocab:modifierPrefix ?prefix2 .
?prefix1 uomvocab:factor ?Factor1 .
?prefix2 uomvocab:factor ?Factor2 .
FILTER( xsd:double(?Factor2)*?Value = xsd:double(?Factor1)*4)
}
Since the query timeout is set to 0, I am not sure what causes the query interruption exception; most probably memory problems?
Very simple queries (e.g. return all instances of a certain class) work OK.
Are there any hints? Any help would be appreciated.
I can provide more details if needed.
Best,
Nick
I have managed to cut the query down to the minimum needed for it to be answered. The problem was mainly due to the following triple patterns:
ucum:GB rdf:type ?AppParMeasureUnitType .
ucum:GB rdf:type uomvocab:SimpleDerivedUnit .
ucum:GB uomvocab:derivesFrom ?BasicUnit .
If these are omitted and the corresponding variables in the original query are replaced by constant resources, then the query is answered.
Here is the resulting query:
SELECT DISTINCT ?offering ?Value
WHERE {
?offering rdf:type paasport:Offering .
?offering DUL:satisfies ?groundDescription .
?groundDescription paasport:offers ?characteristic .
?characteristic rdf:type paasport:Storage .
?characteristic DUL:hasParameter ?par .
?par rdf:type paasport:StorageCapacity .
?par DUL:hasParameterDataValue ?Value .
?par DUL:parametrizes ?qualityValue .
?qualityValue uomvocab:measuredIn ?Units .
?Units rdf:type ucum:UnitOf-infotech .
?Units rdf:type uomvocab:SimpleDerivedUnit .
?Units uomvocab:derivesFrom <http://purl.oclc.org/NET/muo/ucum/unit/amount-of-information/byte> .
ucum:GB uomvocab:modifierPrefix ?prefix1 .
?Units uomvocab:modifierPrefix ?prefix2 .
?prefix1 uomvocab:factor ?Factor1 .
?prefix2 uomvocab:factor ?Factor2 .
FILTER( xsd:double(?Factor2)*?Value >= xsd:double(?Factor1)*2)
}
Do you really need DISTINCT? That always slows things down, since the engine has to fetch all results into memory, sort them and deduplicate them before it starts to serve any.
Do you need = or >= in the FILTER? If =, replace the filter with a BIND(... AS ?Factor2) and put it before the patterns that search for ?prefix2.
Have you profiled the query? See http://graphdb.ontotext.com/documentation/standard/explain-plan.html
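
The BIND suggestion amounts to computing the one factor that can satisfy the equality and looking it up, instead of testing every candidate. The sketch below shows that idea in plain Python. All data in it (the prefix names, factors and values) is hypothetical and chosen only so the arithmetic is exact; it is not taken from the repository in the question.

```python
# Hypothetical prefix factors and stored values, for illustration only
PREFIX_FACTORS = {"giga": 1e9, "mega": 1e6, "kilo": 1e3}
FACTOR1 = PREFIX_FACTORS["giga"]  # factor of the reference unit (GB in the query)
TARGET = FACTOR1 * 4              # right-hand side of the FILTER
VALUES = [4.0, 4000.0, 2.0]


def by_filter(values):
    """Test every (value, prefix) combination, as the FILTER does."""
    return sorted(
        (value, name)
        for value in values
        for name, factor in PREFIX_FACTORS.items()
        if factor * value == TARGET
    )


def by_bind(values):
    """Compute the only factor that can match, then look it up directly."""
    name_by_factor = {factor: name for name, factor in PREFIX_FACTORS.items()}
    matches = []
    for value in values:
        needed = TARGET / value  # plays the role of BIND(... AS ?Factor2)
        # Exact float comparison is safe for these values, not in general
        if needed in name_by_factor:
            matches.append((value, name_by_factor[needed]))
    return sorted(matches)


print(by_filter(VALUES))
print(by_bind(VALUES))
```

Both functions return the same matches. The second does one lookup per value rather than one comparison per value and prefix, which is the saving the rewrite is after.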

SQL query to return rows, starting with the latest, whose values sum to value in another table

I have two tables, Table1 has historical transactions and Table2 has a field that stores the balance of those transactions for each account.
I need a SQL query that will return the transactions in Table1 for a specific account, starting from the latest transaction and going back until their amounts sum up to the current balance in Table2.
Any help would be greatly appreciated.
Table1
UniqID  AcctNum  TranType  TranDate  TranAmt
1       1001123  A         11/1/13       100
2       1010877  B         12/2/13        10
7       1010877  C         12/2/13        22
10      1001123  A         12/2/14      -100
11      1001123  B         12/6/13       145
12      1003699  A         12/8/13       250
13      1001123  B         1/2/14        145
14      1003699  C         1/4/14        110
15      1003699  C         1/4/14       -110
19      1003699  B         1/8/14         25
21      1001123  B         1/2/14         80
22      1001123  B         1/8/14         45
26      1001123  A         1/21/14      -145
Table2
AcctNum  TranBal
1001123      270
1003699      275
1010877       32
Expected Result for account 1001123
UniqID  AcctNum  TranType  TranDate  TranAmt
11      1001123  B         12/6/13       145
13      1001123  B         1/2/14        145
21      1001123  B         1/2/14         80
22      1001123  B         1/8/14         45
26      1001123  A         1/21/14      -145
OK, I think I got it now. First you need a running total of TranAmt per AcctNum. In SQL Server 2012 this can be done conveniently with SUM(...) OVER (PARTITION BY ... ORDER BY ...):
SELECT t.AcctNum, t.TranAmt, t.TranDate,
       SUM(t.TranAmt) OVER (PARTITION BY t.AcctNum ORDER BY t.TranDate, t.UniqID) AS running_total
FROM Transactions t
I used UniqID because TranDate is not unique per AcctNum. In SQL Server 2008 you can't write the running total this way, because ORDER BY inside the OVER clause of an aggregate function was only added in SQL Server 2012, so we must use a self-join here:
SELECT t1.AcctNum, t1.TranDate, t1.TranAmt, SUM(t2.TranAmt) running_total
FROM Transactions t1,
Transactions t2
WHERE t1.TranDate >= t2.TranDate AND t1.AcctNum=t2.AcctNum AND t1.UniqID>=t2.UniqID
GROUP BY t1.AcctNum, t1.TranDate, t1.TranAmt
ORDER BY t1.AcctNum, t1.TranDate
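
Both queries above aim at the same thing: a cumulative sum of TranAmt within one account. As a quick sanity check of the idea, here is that running total computed in Python for account 1003699 from Table1. The function name running_totals is hypothetical, and ordering by UniqID alone is an assumption of this sketch.

```python
from itertools import accumulate

# (UniqID, AcctNum, TranAmt) rows for account 1003699, copied from Table1
ROWS = [
    (12, 1003699, 250),
    (14, 1003699, 110),
    (15, 1003699, -110),
    (19, 1003699, 25),
]


def running_totals(rows):
    """Pair each UniqID with the cumulative TranAmt up to and including it."""
    ordered = sorted(rows)  # assumption: UniqID order stands in for time order
    totals = accumulate(amount for _, _, amount in ordered)
    return [(uniq_id, total) for (uniq_id, _, _), total in zip(ordered, totals)]


for uniq_id, total in running_totals(ROWS):
    print(uniq_id, total)
```

The last running total, 275, matches TranBal for this account in Table2.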
Now we need only the rows where the running_total is 0 because they give us the threshold date before which the transaction data can be discarded:
SELECT t1.AcctNum, t1.TranDate, t1.TranAmt, SUM(t2.TranAmt) running_total
FROM Transactions t1,
Transactions t2
WHERE t1.TranDate >= t2.TranDate AND t1.AcctNum=t2.AcctNum AND t1.UniqID>=t2.UniqID
GROUP BY t1.AcctNum, t1.TranDate, t1.TranAmt
HAVING SUM(t2.TranAmt)=0
ORDER BY t1.AcctNum, t1.TranDate
Finally we have to link those threshold dates to the Transactions table again. I use a LEFT JOIN to attach the threshold row to every row in Transactions that falls on or before that date (per AcctNum). Then we discard those and keep only the rows without a match, i.e. the rows with a date newer than the threshold date.
SELECT *
FROM Transactions t
LEFT JOIN (SELECT t1.AcctNum, t1.TranDate ThresholdDate, t1.TranAmt, SUM(t2.TranAmt) running_total
FROM Transactions t1,
Transactions t2
WHERE t1.TranDate >= t2.TranDate AND t1.AcctNum=t2.AcctNum AND t1.UniqID>=t2.UniqID
GROUP BY t1.AcctNum, t1.TranDate, t1.TranAmt
HAVING SUM(t2.TranAmt)=0) td ON t.AcctNum=td.AcctNum AND t.TranDate<=td.ThresholdDate
WHERE td.AcctNum is null
ORDER BY t.AcctNum, t.TranDate
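
The threshold logic can be sketched outside SQL as well: find the last point where the running total returns to 0 and keep only what comes after it. The example below does this for account 1001123 from the question. The function name rows_after_last_zero is hypothetical, and ordering by UniqID alone (rather than by TranDate, then UniqID) is an assumption of this sketch.

```python
# (UniqID, TranAmt) rows for account 1001123, copied from Table1
ROWS = [
    (1, 100),
    (10, -100),
    (11, 145),
    (13, 145),
    (21, 80),
    (22, 45),
    (26, -145),
]


def rows_after_last_zero(rows):
    """Drop everything up to the last row where the running total is 0."""
    ordered = sorted(rows)  # assumption: UniqID order stands in for time order
    total = 0
    cut = 0
    for index, (_, amount) in enumerate(ordered):
        total += amount
        if total == 0:
            cut = index + 1  # this row and all earlier rows cancel out
    return ordered[cut:]


kept = rows_after_last_zero(ROWS)
print(kept, sum(amount for _, amount in kept))
```

The kept rows are UniqID 11, 13, 21, 22 and 26, which sum to the balance of 270 in Table2. The approach relies on the balance being the sum of all of an account's transactions, so that everything before the last zero can be dropped.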