Extract data from XML string in Hive Table without using XPath - sql

I am trying to use a view to extract a string(value) from a large XML string that sits in a single column in a hive table. I need to get the associated FOO_STRING_VALUE for COMPANY_ID, SALE_IND, and CLOSING_IND.
<Message>
<Header>
<FOO_STRING>
<FOO_STRING_NAME>COMPANY_ID</FOO_STRING_NAME>
<FOO_STRING_VALUE>44-1235</FOO_STRING_VALUE>
</FOO_STRING>
<FOO_STRING>
<FOO_STRING_NAME>SALE_IND</FOO_STRING_NAME>
<FOO_STRING_VALUE>Y</FOO_STRING_VALUE>
</FOO_STRING>
<FOO_STRING>
<FOO_STRING_NAME>CLOSING_IND</FOO_STRING_NAME>
<FOO_STRING_VALUE>Y</FOO_STRING_VALUE>
</FOO_STRING>
</Header>
</Message>
The XML file can have up to 50 "FOO_STRINGS" and there is no guarantee in what order they will be in so I can not use XPATH unless I have 50 xpath_string calls for each Name/Value pair and matched them up later. I am using xpath like this .....
xpath_string(xml_txt, '/Message/Header/FOO_STRING[1]/FOO_STRING_VALUE') AS String_Val_1
xpath_string(xml_txt, '/Message/Header/FOO_STRING[2]/FOO_STRING_VALUE') AS String_Val_2
xpath_string(xml_txt, '/Message/Header/FOO_STRING[3]/FOO_STRING_VALUE') AS String_Val_3
However, if the order changes than it doesn't work. I'm wondering if there is a quick way to get to find the FOO_STRING_NAME needed the and get the corresponding Value using regexp_extract() or some other way? I am not familiar with Regex so any help or suggestions would be helpful, Thank you a ton

" if the order changes than it doesn't work "
Don't use position, then.
xpath_string(xml_txt, '/Message/Header/FOO_STRING[FOO_STRING_NAME="COMPANY_ID"]/FOO_STRING_VALUE') AS String_Val_1
xpath_string(xml_txt, '/Message/Header/FOO_STRING[FOO_STRING_NAME="SALE_IND"]/FOO_STRING_VALUE') AS String_Val_2
xpath_string(xml_txt, '/Message/Header/FOO_STRING[FOO_STRING_NAME="CLOSING_IND"]/FOO_STRING_VALUE') AS String_Val_3

Related

Check if XML nodes are empty in SQL

Hi I am new to XML manipulation, my question would be if there is a possibility of detecting if the XML node is an empty node like this: <gen:nodeName />
I am able to manipulate single nodes however I would be interested if there is an approach like a loop or recursive function that could save some time doing manual labor looking trough every single node. I have no idea how to approach this problem though.
Thanks for help.
You did not specify the dialect of SQL ([sql] is not enough, please specify always the RDBMS incl. version).
This is for SQL-Server, but the semantics should be the same.
DECLARE #xml XML=
N'<root>
<SelfClosing />
<NoContent></NoContent>
<BlankContent> </BlankContent>
<HasContent>blah</HasContent>
<HasContent>other</HasContent>
</root>';
SELECT #xml.query(N'/root/*') AS AnyBelowRoor --All elements
,#xml.query(N'/root/*[text()]') AS AnyWithTextNode --blah and other
,#xml.query(N'/root/*[not(text())]') AS NoText --no text
,#xml.query(N'/root/*[text()="blah"]') AS AnyWithTextNode--blah only
The <SelfClosing /> is semantically the same as the <NoContent><NoContent>. There is no difference.
It might be a surprise, but a blank as content is taken as empty too.
So the check for empty or not empty is the check for the existance of a text() node. one can negate this with not() to find all without a text().
Interesting: The result for NoText comes back as this (SQL-Server)
<SelfClosing />
<NoContent />
<BlankContent />
The three elements are implicitly returned in the shortest format.

Export SQL XML field to grid [duplicate]

I have something like the following XML in a column of a table:
<?xml version="1.0" encoding="utf-8"?>
<container>
<param name="paramA" value="valueA" />
<param name="paramB" value="valueB" />
...
</container>
I am trying to get the valueB part out of the XML via TSQL
So far I am getting the right node, but now I can not figure out how to get the attribute.
select xmlCol.query('/container/param[#name="paramB"]') from LogTable
I figure I could just add /#value to the end, but then SQL tells me attributes have to be part of a node. I can find a lot of examples for selecting the child nodes attributes, but nothing on the sibling atributes (if that is the right term).
Any help would be appreciated.
Try using the .value function instead of .query:
SELECT
xmlCol.value('(/container/param[#name="paramB"]/#value)[1]', 'varchar(50)')
FROM
LogTable
The XPath expression could potentially return a list of nodes, therefore you need to add a [1] to that potential list to tell SQL Server to use the first of those entries (and yes - that list is 1-based - not 0-based). As second parameter, you need to specify what type the value should be converted to - just guessing here.
Marc
Depending on the the actual structure of your xml, it may be useful to put a view over it to make it easier to consume using 'regular' sql eg
CREATE VIEW vwLogTable
AS
SELECT
c.p.value('#name', 'varchar(10)') name,
c.p.value('#value', 'varchar(10)') value
FROM
LogTable
CROSS APPLY x.nodes('/container/param') c(p)
GO
-- now you can get all values for paramB as...
SELECT value FROM vwLogTable WHERE name = 'paramB'

How to query database using saxon sql:query where db.id='value of xml attribute'

I have a requirement where i need to query database using saxon sql;query by applying where clause, where database_table.ProductID should match with incoming xml input productId
Here is what i tried so far:
<sql:query connection="$sql.conn" table="table_name" column="Product_ID" row-tag="row" column-tag="col" where="Product_ID="<xsl:value-of select="ProductItem/ProductItemId/text()"/>"" />
I am getting following Exception:
SXXP0003: Error reported by XML parser: Element type "sql:query" must be followed by either attribute specifications, ">" or "/>".
I am finding it difficult to format the where clause in XPath, can any one suggest what would be correct format. Thanks in Advance.
Try to use to use an attribute value template where you put the XPath expression in curly braces {...}, as in
<sql:query connection="$sql.conn" table="table_name" column="Product_ID" row-tag="row" column-tag="col" where="Product_ID="{ProductItem/ProductItemId}""/>

Get an XML Element via XPath when attributes are irrelevant

I'm looking for a way to receive a XML Element (the id of an entry) from a YouTube feed (e.g. http://gdata.youtube.com/feeds/api/users/USERNAME/uploads).
The feed looks like this:
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:gd="http://schemas.google.com/g/2005" xmlns:yt="http://gdata.youtube.com/schemas/2007" gd:etag="W/"DUcFQncyfCp7I2A9WhVUFE4."">
<id>tag:youtube.com,2008:user:USERNAME:uploads</id>
<updated>2012-05-19T14:16:53.994Z</updated>
...
<entry gd:etag="W/"DE8NSX47eCp7I2A9WhVUFE4."">
<id>tag:youtube.com,2008:video:MfPpj7f6Jj0</id>
<published>2012-05-18T13:30:38.000Z</published>
...
I want to get the first tag in entry (tag:youtube.com, 2008 ...).
After googling for some hours and looking through the GDataXML wiki, I'm clueless because neither XPath nor GData could deliver the right element.
My first guess is, they can't ignore the attributes in the feed and entry tags.
A solution using XPath would be great, but one in Objective-C is equally welcome.
You might be having an issue trying to get XPath to work because of the default namespace.
If you just want the first tag in entry, you can use this:
/*/*[name()='entry']/*[1]
If you want the first id specifically, you can use this:
/*/*[name()='entry']/*[name()='id'][1]
Also if you can use XPath 2.0, you can skip the predicate entirely and use * for the namespace prefix:
/*/*:entry/*:id[1]

XPath to search for elements in a sequence

With xml like:
<a>
<b>
<c>1</c>
<c>2</c>
</c>3</c>
</b>
I'm trying to create an xpath expression (for a postgresql query) that will return if is a particular value and not that is all three values. What I currently have (which does not work) is:
select * from someTable where xpath ('//uim:a/text()', job, ARRAY[ ARRAY['uim','http://www.cmpy.com/uim'] ])::text[] IN (ARRAY['1','3']);
If I try with ARRAY['1'] this will not return any values but with ARRAY['1','2','3'] it will return all three.
How can I select based on a single element in a sequence?
Thanks.
If you're asking how to get the value of a 1 or more XML elements within your XML segment the easiest way is likely to simply utilize a custom SQL CLR library and XPath analysis from within it to assemble and return whatever information you desire. At least that would be my approach.