Extracting dates from HTML meta data in FAST-ESP

During document processing, I want to extract all dates from the HTML meta data and identify the latest one, which will be used to populate a date field (dtgeneric1).
<meta name="OriginalPublicationDate" content="2010/04/21 12:06:36" />
<meta name="LastModificationDate" content="2010/04/22 14:10:16" />
+ other non-date meta data
Inspection using spy stages shows that our pipeline already adds meta_* attributes but the meta data names will be different across documents from different sources.
#### ATTRIBUTE meta_originalpublicationdate <class 'docproc.DocumentAttributes.TextChunks'>: 2010/04/21 12:06:36
#### ATTRIBUTE meta_lastmodificationdate <class 'docproc.DocumentAttributes.TextChunks'>: 2010/04/22 14:10:16
+ other non-date meta attributes
Ideally, we would like to pass all the meta_* attributes to a Python stage and use it to work out which are dates and which is the latest, but there seems to be no way of specifying "all meta attributes" as input.
Has anyone done something similar, and can you offer any advice on the best way to do this?
Thanks
Neil

I suppose a custom stage would do the job: it takes all the needed date attributes as input, compares them to find the newest date, and writes that value to the output field.
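The comparison step itself can be sketched in plain Python. This is only a sketch: it does not use the FAST ESP docproc API, so how the stage obtains the attribute names and values is left out. The function name `latest_meta_date`, the `meta_` prefix argument, and the sample dictionary are made up for illustration, and the timestamp format is assumed from the two sample meta tags in the question.

```python
from datetime import datetime

# Assumed timestamp layout, taken from the sample meta tags in the question.
DATE_FORMAT = "%Y/%m/%d %H:%M:%S"


def latest_meta_date(attributes, prefix="meta_"):
    """Return the newest datetime among attributes whose name starts with prefix, or None."""
    dates = []
    for name, value in attributes.items():
        if not name.startswith(prefix):
            continue
        try:
            dates.append(datetime.strptime(str(value).strip(), DATE_FORMAT))
        except ValueError:
            # Not a date in the assumed format, so skip this attribute.
            continue
    return max(dates) if dates else None


# Hypothetical attribute dump, standing in for what the spy stage shows.
sample = {
    "meta_originalpublicationdate": "2010/04/21 12:06:36",
    "meta_lastmodificationdate": "2010/04/22 14:10:16",
    "meta_author": "someone",
    "title": "ignored, no meta_ prefix",
}
latest = latest_meta_date(sample)
print(latest)
```

Trying to parse every meta_* value and skipping the failures avoids having to know the attribute names in advance, which matters when they differ between sources. Other date layouts would need additional formats.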

Related

Extract data from XML string in Hive Table without using XPath

I am trying to use a view to extract a string value from a large XML string that sits in a single column in a Hive table. I need to get the associated FOO_STRING_VALUE for COMPANY_ID, SALE_IND, and CLOSING_IND.
<Message>
<Header>
<FOO_STRING>
<FOO_STRING_NAME>COMPANY_ID</FOO_STRING_NAME>
<FOO_STRING_VALUE>44-1235</FOO_STRING_VALUE>
</FOO_STRING>
<FOO_STRING>
<FOO_STRING_NAME>SALE_IND</FOO_STRING_NAME>
<FOO_STRING_VALUE>Y</FOO_STRING_VALUE>
</FOO_STRING>
<FOO_STRING>
<FOO_STRING_NAME>CLOSING_IND</FOO_STRING_NAME>
<FOO_STRING_VALUE>Y</FOO_STRING_VALUE>
</FOO_STRING>
</Header>
</Message>
The XML can contain up to 50 FOO_STRING elements, and there is no guarantee of their order, so I cannot use XPath unless I make 50 xpath_string calls, one for each name/value pair, and match them up later. I am using XPath like this:
xpath_string(xml_txt, '/Message/Header/FOO_STRING[1]/FOO_STRING_VALUE') AS String_Val_1
xpath_string(xml_txt, '/Message/Header/FOO_STRING[2]/FOO_STRING_VALUE') AS String_Val_2
xpath_string(xml_txt, '/Message/Header/FOO_STRING[3]/FOO_STRING_VALUE') AS String_Val_3
However, if the order changes then it doesn't work. Is there a quick way to find the needed FOO_STRING_NAME and get the corresponding value, using regexp_extract() or some other way? I am not familiar with regex, so any help or suggestions would be appreciated. Thanks a ton.
"if the order changes then it doesn't work"
Don't use position, then. Select each FOO_STRING by its name with an XPath predicate instead:
xpath_string(xml_txt, '/Message/Header/FOO_STRING[FOO_STRING_NAME="COMPANY_ID"]/FOO_STRING_VALUE') AS String_Val_1
xpath_string(xml_txt, '/Message/Header/FOO_STRING[FOO_STRING_NAME="SALE_IND"]/FOO_STRING_VALUE') AS String_Val_2
xpath_string(xml_txt, '/Message/Header/FOO_STRING[FOO_STRING_NAME="CLOSING_IND"]/FOO_STRING_VALUE') AS String_Val_3
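To see the name-based predicate working regardless of element order, here is a small sketch outside Hive, using Python's xml.etree.ElementTree, which supports the same `[child='text']` predicate form. The helper name `value_for` and the reordered sample are illustrative only. It shows the selection logic, not how Hive evaluates xpath_string.

```python
import xml.etree.ElementTree as ET

# Same structure as the question, with the entries deliberately reordered.
xml_txt = """<Message>
  <Header>
    <FOO_STRING>
      <FOO_STRING_NAME>SALE_IND</FOO_STRING_NAME>
      <FOO_STRING_VALUE>Y</FOO_STRING_VALUE>
    </FOO_STRING>
    <FOO_STRING>
      <FOO_STRING_NAME>COMPANY_ID</FOO_STRING_NAME>
      <FOO_STRING_VALUE>44-1235</FOO_STRING_VALUE>
    </FOO_STRING>
  </Header>
</Message>"""

root = ET.fromstring(xml_txt)  # root is the <Message> element


def value_for(name):
    # Pick the FOO_STRING whose FOO_STRING_NAME child equals `name`,
    # then return the text of its FOO_STRING_VALUE (None if absent).
    path = "./Header/FOO_STRING[FOO_STRING_NAME='%s']/FOO_STRING_VALUE" % name
    return root.findtext(path)


company_id = value_for("COMPANY_ID")
sale_ind = value_for("SALE_IND")
closing_ind = value_for("CLOSING_IND")  # not present in this sample
```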

Transforming a single xml to many documents

I have to create a mapping between two xsd schemas, where the input document contains a list (sequence) of elements, each of which maps to a single output document. Moreover, each output document should include top level input data that is not a part of the list. To illustrate the problem, the input document contains data about a customer (contact info, etc) and list of invoices for them, and the output should be multiple documents, each containing one invoice and the customer data.
Can I somehow do this using DataMapper or some other approach? If I create a mapping between the input list elements and the output document, DataMapper will output an aggregation of all the created output documents. It also seems that I can not refer to the input top level elements from inside the "list element to output document" mapping.
Supposing the root element in your source XSD contains a list of "Item" elements, you could first split the document into Items:
<splitter expression="#[xpath('//Item')]" doc:name="Splitter" enableCorrelation="IF_NOT_SET"/>
And then after the splitter use a DataMapper to map the Item elements to a target element in your other XSD. DataMapper requires that "Item" also be a root element in your source XSD in order to do the mapping from XSD to XSD. If it's not possible/desirable to make "Item" a root element in the source XSD, then you could create a sample XML and use DataMapper to generate an XSD from that. Otherwise you could roll your own transformer or use the XSLT transformer.
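If you end up rolling your own transformer, the split-and-copy logic is roughly the following. This is a sketch in Python rather than a Mule component, and every element name in it (Customer, Invoices, Invoice, InvoiceDocument) is invented for illustration, since the real schemas are not shown in the question.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical input: customer data at the top level plus a list of invoices.
source = """<Customer>
  <Name>ACME</Name>
  <Email>billing@example.com</Email>
  <Invoices>
    <Invoice><Number>1</Number></Invoice>
    <Invoice><Number>2</Number></Invoice>
  </Invoices>
</Customer>"""


def split_invoices(xml_text):
    """Return one output document (as a string) per Invoice element."""
    root = ET.fromstring(xml_text)
    # Top-level data shared by every output document (everything but the list).
    shared = [child for child in root if child.tag != "Invoices"]
    documents = []
    for invoice in root.iter("Invoice"):
        out = ET.Element("InvoiceDocument")
        customer = ET.SubElement(out, "Customer")
        # Deep-copy so each output document owns its own elements.
        customer.extend([copy.deepcopy(el) for el in shared])
        out.append(copy.deepcopy(invoice))
        documents.append(ET.tostring(out, encoding="unicode"))
    return documents


docs = split_invoices(source)
```

Each output document carries a copy of the customer data next to exactly one invoice, which is the part a plain list-to-list mapping does not give you.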

Get an XML Element via XPath when attributes are irrelevant

I'm looking for a way to retrieve an XML element (the id of an entry) from a YouTube feed (e.g. http://gdata.youtube.com/feeds/api/users/USERNAME/uploads).
The feed looks like this:
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:gd="http://schemas.google.com/g/2005" xmlns:yt="http://gdata.youtube.com/schemas/2007" gd:etag="W/&quot;DUcFQncyfCp7I2A9WhVUFE4.&quot;">
<id>tag:youtube.com,2008:user:USERNAME:uploads</id>
<updated>2012-05-19T14:16:53.994Z</updated>
...
<entry gd:etag="W/&quot;DE8NSX47eCp7I2A9WhVUFE4.&quot;">
<id>tag:youtube.com,2008:video:MfPpj7f6Jj0</id>
<published>2012-05-18T13:30:38.000Z</published>
...
I want to get the first tag in entry (tag:youtube.com, 2008 ...).
After googling for some hours and looking through the GDataXML wiki, I'm clueless because neither XPath nor GData could deliver the right element.
My first guess is, they can't ignore the attributes in the feed and entry tags.
A solution using XPath would be great, but one in Objective-C is equally welcome.
You might be having an issue trying to get XPath to work because of the default namespace.
If you just want the first tag in entry, you can use this:
/*/*[name()='entry']/*[1]
If you want the first id specifically, you can use this:
/*/*[name()='entry']/*[name()='id'][1]
Also if you can use XPath 2.0, you can skip the predicate entirely and use * for the namespace prefix:
/*/*:entry/*:id[1]
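The default-namespace effect is easy to reproduce. The sketch below uses Python's xml.etree.ElementTree rather than GDataXML or Objective-C, with a cut-down copy of the feed. The prefix `atom` is an arbitrary local name mapped to the Atom namespace URI; it does not need to appear in the feed itself.

```python
import xml.etree.ElementTree as ET

# Cut-down feed: only the default Atom namespace is kept.
feed_xml = """<feed xmlns="http://www.w3.org/2005/Atom">
  <id>tag:youtube.com,2008:user:USERNAME:uploads</id>
  <entry>
    <id>tag:youtube.com,2008:video:MfPpj7f6Jj0</id>
    <published>2012-05-18T13:30:38.000Z</published>
  </entry>
</feed>"""

ns = {"atom": "http://www.w3.org/2005/Atom"}
root = ET.fromstring(feed_xml)

# An unqualified name does not match, because <entry> lives in the Atom namespace.
unqualified = root.find("entry")

# Qualifying the names with a prefix bound to that namespace does match.
first_entry = root.find("atom:entry", ns)
entry_id = first_entry.findtext("atom:id", namespaces=ns)

# First child of the entry, whatever its name is.
first_child = first_entry[0]
```

Binding a prefix is the usual alternative to the name() predicates above when the XPath engine lets you register namespaces.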

Sharepoint 2010 Client Object Model - Large Library - Find item without iteration

I have a large document library (at the moment ~6000 documents) and I need to find a document based on a custom field value (custom column on the library).
Is there a way of getting this document back without iterating through all 6000 documents?
I understand that an iteration must occur at some point, but I would prefer it to happen on the SharePoint server side, rather than transfer them all to the client side then cherry pick the document.
Thanks
You can query SharePoint. You issue a CAML query, which is executed on the server and brings back only the items that match the criteria you specified. You specify the name of the custom column to search on and the value to find. For efficiency, you can ask for only a few fields back (the document URL, for example). So you do not need to iterate over the documents in the list to find the item.
You can find some discussion here:
http://msdn.microsoft.com/en-us/library/ee956524.aspx and you can also find examples of how to do it from JavaScript or Silverlight.
Example CAML:
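// Requires: using Microsoft.SharePoint.Client;
CamlQuery camlQuery = new CamlQuery();
// Matches on the file name here; use your custom column's internal name and value instead.
camlQuery.ViewXml =
    @"<View>
        <Query>
          <Where>
            <Eq>
              <FieldRef Name='FileLeafRef'/>
              <Value Type='Text'>Test.docx</Value>
            </Eq>
          </Where>
        </Query>
        <RowLimit>1</RowLimit>
      </View>";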
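If the column name and value come from user input, it may be safer to build the view XML with an XML library than to concatenate strings, so that special characters get escaped. The sketch below does this in Python purely to illustrate the shape of the XML. The function name `build_caml_view` and the column name `CustomerCode` are hypothetical. RowLimit is placed directly under View, next to Query, which is where the CAML View schema expects it.

```python
import xml.etree.ElementTree as ET


def build_caml_view(field_name, value, value_type="Text", row_limit=1):
    """Build a CAML <View> string that matches field_name == value."""
    view = ET.Element("View")
    query = ET.SubElement(view, "Query")
    where = ET.SubElement(query, "Where")
    eq = ET.SubElement(where, "Eq")
    ET.SubElement(eq, "FieldRef", Name=field_name)
    val = ET.SubElement(eq, "Value", Type=value_type)
    val.text = value  # escaped automatically on serialization
    ET.SubElement(view, "RowLimit").text = str(row_limit)
    return ET.tostring(view, encoding="unicode")


# Hypothetical custom column and value.
view_xml = build_caml_view("CustomerCode", "ABC & Co")
print(view_xml)
```

The resulting string is what would be assigned to ViewXml. The field name must be the column's internal name, which can differ from its display name.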

XML Import with "alternate" form or xml formatting

I have successfully imported an XML file, parsing elements into table attributes, using this XML data format:
<PN>
<guid>aaaa</guid>
<dataInput>0</dataInput>
<deleted>false</deleted>
<customField1></customField1>
<customField2></customField2>
<customField3></customField3>
<description></description>
<name>name1</name>
<ccid>CC007814</ccid>
<productIds>bbbb</productIds>
</PN>
but it errors when I input XML in this format:
<PN guid="aaaa"
deleted="false"
customField1=""
customField2=""
customField3=""
description=""
modified="2010-10-20T00:00:00.001"
created="2010-05-20T18:07:10.416"
name="name1"
ccid="CC006035"
productIds="bbbb"/>
Is this latter form usable? Any help would be appreciated. Thanks.
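It's usable, but you're looking at the difference between using tags (your first example) and attributes (your second example). The processing differs slightly: you read each value from an attribute of the PN element itself rather than from the text of a child element.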
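The question doesn't say which import tool or language is in use, so the following is only a sketch of the difference, written in Python with xml.etree.ElementTree. The helper `read_pn` and the shortened field list are illustrative. It accepts either form by checking for an attribute first and falling back to a child element.

```python
import xml.etree.ElementTree as ET

# Shortened versions of the two formats from the question.
element_form = "<PN><guid>aaaa</guid><name>name1</name><ccid>CC007814</ccid></PN>"
attribute_form = '<PN guid="aaaa" name="name1" ccid="CC006035"/>'

FIELDS = ("guid", "name", "ccid")


def read_pn(xml_text):
    """Read the PN fields from either the element form or the attribute form."""
    pn = ET.fromstring(xml_text)
    record = {}
    for field in FIELDS:
        value = pn.get(field)            # attribute form: <PN guid="...">
        if value is None:
            value = pn.findtext(field)   # element form: <guid>...</guid>
        record[field] = value
    return record


from_elements = read_pn(element_form)
from_attributes = read_pn(attribute_form)
```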