Camel do-while loop

I am trying to use Camel's do-while loop.
I have two tables (e.g. A and B) in a database.
The goal is to copy the data from table A into table B in batches of, for example, 500 rows. (The table is too big to insert in one statement.)
I want the loop to continue until the insert statement returns 0 rows.
<loop doWhile="true">
    <!-- Store the last id of table B -->
    <to uri="sql:{{export.table.b.last.fetched.id.query}}?dataSource=testDataSource&amp;outputType=SelectOne" />
    <setHeader headerName="last_fetched_id">
        <simple>${body}</simple>
    </setHeader>
    <!-- Insert using the last id, limited to 500 rows -->
    <to uri="sql:{{export.table.a.insert.fetch.query}}?dataSource=testDataSource" />
    <simple>${body} != 0</simple>
</loop>
I don't know what I am doing wrong. I get this error:
Caused by: org.xml.sax.SAXParseException: cvc-complex-type.3.2.2:
Attribute 'doWhile' is not allowed to appear in element 'loop'.

This looks like an XML validation error. Double-check that you've defined your loop correctly inside the <route> and </route> tags, after the <from> tag. The XML validation tests the document against the schemas (.xsd) declared in the XML document.
Also, the <simple>${body} != 0</simple> predicate should come right after the <loop doWhile="true"> tag, as shown in the documentation.
XML DSL - Documentation:
XML-DSL for Spring
XML-DSL for OSGi-Blueprint
As for the SQL queries: select operations return a body of type List<Map<String, Object>> containing the query results, and the number of rows is stored in the CamelSqlRowCount header. Update operations return a null body, and the number of rows affected is stored in the CamelSqlUpdateCount header. See the SQL component documentation:
Result of the query
Headers
As for the loop, you could try setting the body to the value of CamelSqlUpdateCount:
<setBody>
    <header>CamelSqlUpdateCount</header>
</setBody>
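Putting both fixes together, here is a sketch of a corrected route (the direct: endpoint name is made up, and this assumes a Camel version whose schema supports doWhile on loop; per the Camel docs it was added in 2.17):
<route>
    <from uri="direct:copyTableAToB" />
    <loop doWhile="true">
        <!-- the predicate must come first, right after the loop tag -->
        <simple>${body} != 0</simple>
        <!-- store the last id of table B -->
        <to uri="sql:{{export.table.b.last.fetched.id.query}}?dataSource=testDataSource&amp;outputType=SelectOne" />
        <setHeader headerName="last_fetched_id">
            <simple>${body}</simple>
        </setHeader>
        <!-- insert the next batch of 500 rows -->
        <to uri="sql:{{export.table.a.insert.fetch.query}}?dataSource=testDataSource" />
        <!-- the update returns a null body; copy the affected row count
             into the body so the predicate above has something to test -->
        <setBody>
            <header>CamelSqlUpdateCount</header>
        </setBody>
    </loop>
</route>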
Alternatively if you're using direct consumer endpoint you could remove the loop altogether and use recursion instead by just calling the route again at the end until CamelSqlUpdateCount is zero.


Extract data from XML string in Hive Table without using XPath

I am trying to use a view to extract a string (value) from a large XML string that sits in a single column of a Hive table. I need to get the associated FOO_STRING_VALUE for COMPANY_ID, SALE_IND, and CLOSING_IND.
<Message>
    <Header>
        <FOO_STRING>
            <FOO_STRING_NAME>COMPANY_ID</FOO_STRING_NAME>
            <FOO_STRING_VALUE>44-1235</FOO_STRING_VALUE>
        </FOO_STRING>
        <FOO_STRING>
            <FOO_STRING_NAME>SALE_IND</FOO_STRING_NAME>
            <FOO_STRING_VALUE>Y</FOO_STRING_VALUE>
        </FOO_STRING>
        <FOO_STRING>
            <FOO_STRING_NAME>CLOSING_IND</FOO_STRING_NAME>
            <FOO_STRING_VALUE>Y</FOO_STRING_VALUE>
        </FOO_STRING>
    </Header>
</Message>
The XML file can have up to 50 FOO_STRING elements and there is no guarantee what order they will be in, so I cannot use positional XPath unless I write 50 xpath_string calls for each name/value pair and match them up later. I am using xpath like this:
xpath_string(xml_txt, '/Message/Header/FOO_STRING[1]/FOO_STRING_VALUE') AS String_Val_1
xpath_string(xml_txt, '/Message/Header/FOO_STRING[2]/FOO_STRING_VALUE') AS String_Val_2
xpath_string(xml_txt, '/Message/Header/FOO_STRING[3]/FOO_STRING_VALUE') AS String_Val_3
However, if the order changes then it doesn't work. I'm wondering if there is a quick way to find the FOO_STRING_NAME I need and get the corresponding value, using regexp_extract() or some other way? I am not familiar with regex, so any help or suggestions would be appreciated. Thank you a ton.
" if the order changes than it doesn't work "
Don't use position, then.
xpath_string(xml_txt, '/Message/Header/FOO_STRING[FOO_STRING_NAME="COMPANY_ID"]/FOO_STRING_VALUE') AS String_Val_1
xpath_string(xml_txt, '/Message/Header/FOO_STRING[FOO_STRING_NAME="SALE_IND"]/FOO_STRING_VALUE') AS String_Val_2
xpath_string(xml_txt, '/Message/Header/FOO_STRING[FOO_STRING_NAME="CLOSING_IND"]/FOO_STRING_VALUE') AS String_Val_3
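In context, with the XML sitting in the xml_txt column (the table name here is made up), the full query would look something like:
SELECT
    xpath_string(xml_txt, '/Message/Header/FOO_STRING[FOO_STRING_NAME="COMPANY_ID"]/FOO_STRING_VALUE')  AS company_id,
    xpath_string(xml_txt, '/Message/Header/FOO_STRING[FOO_STRING_NAME="SALE_IND"]/FOO_STRING_VALUE')    AS sale_ind,
    xpath_string(xml_txt, '/Message/Header/FOO_STRING[FOO_STRING_NAME="CLOSING_IND"]/FOO_STRING_VALUE') AS closing_ind
FROM my_xml_table;
This stays correct no matter where each FOO_STRING appears, because the predicate selects by name rather than by position.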

Check if XML nodes are empty in SQL

Hi, I am new to XML manipulation. My question is whether there is a way of detecting that an XML node is an empty node like this: <gen:nodeName />
I am able to manipulate single nodes, but I would be interested in an approach such as a loop or recursive function that could save the manual labor of looking through every single node. I have no idea how to approach this problem, though.
Thanks for the help.
You did not specify the dialect of SQL ([sql] alone is not enough; please always specify the RDBMS, including the version).
This is for SQL-Server, but the semantics should be the same.
DECLARE @xml XML=
N'<root>
    <SelfClosing />
    <NoContent></NoContent>
    <BlankContent> </BlankContent>
    <HasContent>blah</HasContent>
    <HasContent>other</HasContent>
</root>';

SELECT @xml.query(N'/root/*') AS AnyBelowRoot            --all elements
      ,@xml.query(N'/root/*[text()]') AS AnyWithTextNode --blah and other
      ,@xml.query(N'/root/*[not(text())]') AS NoText     --no text
      ,@xml.query(N'/root/*[text()="blah"]') AS BlahOnly  --blah only
The <SelfClosing /> is semantically the same as <NoContent></NoContent>. There is no difference.
It might be a surprise, but content consisting of a single blank is treated as empty too.
So the check for empty or not empty is a check for the existence of a text() node. One can negate this with not() to find all elements without a text() node.
Interesting: the result for NoText comes back like this (SQL Server):
<SelfClosing />
<NoContent />
<BlankContent />
The three elements are implicitly returned in the shortest format.
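If you just need a count or a yes/no instead of the nodes themselves, the same predicate works with value() and exist() (a sketch reusing the variable above):
SELECT @xml.value(N'count(/root/*[not(text())])', N'int') AS EmptyElementCount --3 in this sample
      ,@xml.exist(N'/root/*[not(text())]')                AS HasEmptyElements; --bit: 1 if any exist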

XmlPeek empty string causes failure

So in my targets file, I've got a line that looks like this:
<XmlPeek Namespaces="" XmlInputPath="file.xml" Query="/data/@AttributeOne">
    <Output TaskParameter="Result" ItemName="my_AttributeOne" />
</XmlPeek>
in "file.xml", I have:
<data AttributeOne="abc" AttributeTwo="def" />
It also reads a few other attributes.
When the attribute has data, everything works fine... but when I leave AttributeOne as an empty string (""), XmlPeek blows chunks with the following error:
The "XmlPeek" task's outputs could not be retrieved from the "Result" parameter. Parameter "includeEscaped" cannot have zero length.
If I remove the attribute ENTIRELY, it works fine (the resulting item is obviously and understandably blank).
The question is... how can I DETERMINE, WITHOUT blowing chunks, the value of a blank attribute... whether by pre-testing for a value, by correctly handling the blank, or by some other means?
CONSTRAINT: the only real requirement is to stick to the built-in tasks (XmlPeek)... I'm aware of XmlRead in the community tasks... for various reasons, I want to use out-of-the-box tasks.
Thanks in advance!
The error happens because an empty string is being used as the item identifier. I guess identifiers cannot be the empty string. If you remove the attribute then the result is null and no item is created, which is why that doesn't throw an error.
Maybe try returning the result as a Property instead of an Item.
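A sketch of that suggestion, using the same query as above (whether this survives the empty-attribute case would need testing; unlike items, a property can hold an empty value):
<XmlPeek Namespaces="" XmlInputPath="file.xml" Query="/data/@AttributeOne">
    <Output TaskParameter="Result" PropertyName="my_AttributeOne" />
</XmlPeek>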
If you do not need to distinguish between the attribute being omitted versus having an empty value, you can prevent the error by inserting the condition [@AttributeOne!=''] into the query as follows.
<XmlPeek Namespaces="" XmlInputPath="file.xml" Query="/data[@AttributeOne!='']/@AttributeOne">
    <Output TaskParameter="Result" ItemName="my_AttributeOne" />
</XmlPeek>

Is Apache Camel's idempotent consumer pattern scalable?

I'm using Apache Camel 2.13.1 to poll a database table which will have upwards of 300k rows in it. I'm looking to use the Idempotent Consumer EIP to filter rows that have already been processed.
I'm wondering, though, whether the implementation is really scalable or not. My Camel context is:
<camelContext xmlns="http://camel.apache.org/schema/spring">
    <route id="main">
        <from
            uri="sql:select * from transactions?dataSource=myDataSource&amp;consumer.delay=10000&amp;consumer.useIterator=true" />
        <transacted ref="PROPAGATION_REQUIRED" />
        <enrich uri="direct:invokeIdempotentTransactions" />
        <!-- Any processors here will be executed on all messages -->
    </route>
    <route id="idempotentTransactions">
        <from uri="direct:invokeIdempotentTransactions" />
        <idempotentConsumer
            messageIdRepositoryRef="jdbcIdempotentRepository">
            <ognl>#{request.body.ID}</ognl>
            <!-- Anything here will only be executed for non-duplicates -->
            <log message="non-duplicate" />
            <to uri="stream:out" />
        </idempotentConsumer>
    </route>
</camelContext>
It would seem that the full 300k rows are going to be processed every 10 seconds (via the consumer.delay parameter), which seems very inefficient. I would expect some sort of feedback loop as part of the pattern, so that the query that feeds the filter could take advantage of the set of rows already processed.
However, the messageid column in the CAMEL_MESSAGEPROCESSED table has the pattern of
{1908988=null}
where 1908988 is the request.body.ID I've set the EIP to key on, so this doesn't make it easy to incorporate into my query.
Is there a better way of using the CAMEL_MESSAGEPROCESSED table as a feedback loop into my select statement so that the SQL server is performing most of the load?
Update:
So, I've since found out that it was my OGNL expression that was causing the odd messageid column value. Changing it to
<el>${in.body.ID}</el>
has fixed it. So, now that I have a usable messageid column, I can change my 'from' SQL query to
select * from transactions tr where tr.ID NOT IN (select cmp.messageid from CAMEL_MESSAGEPROCESSED cmp where cmp.processor = 'transactionProcessor')
but I still think I'm corrupting the Idempotent Consumer EIP.
Does anyone else do this? Any reason not to?
Yes, it is. But you need to use scalable storage for holding the set of already processed messages. You can use either Hazelcast - http://camel.apache.org/hazelcast-idempotent-repository-tutorial.html - or Infinispan - http://java.dzone.com/articles/clustered-idempotent-consumer - depending on which solution is already in your stack. Of course, a JDBC repository would also work, but only if it meets the performance criteria.
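For the JDBC option, a minimal Spring bean sketch (the datasource ref is the question's myDataSource; 'transactionProcessor' is the processor name the question's query filters on):
<bean id="jdbcIdempotentRepository"
      class="org.apache.camel.processor.idempotent.jdbc.JdbcMessageIdRepository">
    <!-- constructor args: the DataSource and the processor name used in CAMEL_MESSAGEPROCESSED -->
    <constructor-arg ref="myDataSource" />
    <constructor-arg value="transactionProcessor" />
</bean>
This is the repository the route already references via messageIdRepositoryRef="jdbcIdempotentRepository".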

Quickest method for matching nested XML data against database table structure

I have an application which creates datarequests which can be quite complex. These need to be stored in the database as tables. An outline of a datarequest (as XML) would be...
<datarequest>
    <datatask view="vw_ContractData" db="reporting" index="1">
        <datefilter modifier="w0">
            <filter index="1" datatype="d" column="Contract Date" param1="2009-10-19 12:00:00" param2="2012-09-27 12:00:00" daterange="" operation="Between" />
        </datefilter>
        <filters>
            <alternation index="1">
                <filter index="1" datatype="t" column="Department" param1="Stock" param2="" operation="Equals" />
            </alternation>
            <alternation index="2">
                <filter index="1" datatype="t" column="Department" param1="HR" param2="" operation="Equals" />
            </alternation>
        </filters>
        <series column="Turnaround" aggregate="avg" split="0" splitfield="" index="1">
            <filters />
        </series>
        <series column="Requested 3" aggregate="avg" split="0" splitfield="" index="2">
            <filters>
                <alternation index="1">
                    <filter index="1" datatype="t" column="Worker" param1="Malcom" param2="" operation="Equals" />
                </alternation>
            </filters>
        </series>
        <series column="Requested 2" aggregate="avg" split="0" splitfield="" index="3">
            <filters />
        </series>
        <series column="Reqested" aggregate="avg" split="0" splitfield="" index="4">
            <filters />
        </series>
    </datatask>
</datarequest>
This encodes a datarequest comprising a date range, main filters, series, and series filters. Basically, any element which has the index attribute can occur multiple times within its parent element - the exception being the filter within datefilter.
But the structure of this is kind of academic, the problem is more fundamental:
When a request comes through, XML like this is sent to SQL Server as a parameter to a stored proc. This XML is shredded into a de-normalised table and then written iteratively to normalised tables such as tblDataRequest (DataRequestID PK), tblDataTask, tblFilter, tblSeries. This is fine.
The problem occurs when I want to match a given XML definition against one already held in the DB. I currently do this by...
Shredding the XML into a de-normalised table
Using a CTE to pull all the existing data in the database into that same de-normalised form
Matching using a huge WHERE condition (34 lines long)
...This will return me any DataRequestID which exactly matches the given XML. I fear that this method will end up being painfully slow - partly because I don't believe the CTE will do any clever filtering; it will pull all the data every single time before applying the huge WHERE.
I have thought there must be better solutions to this, e.g.:
When storing a datarequest, also store a hash of the datarequest somehow and simply match on that. In the case of collision, use the current method. I wanted however to do this using set-logic. And also, I'm concerned about irrelevant small differences in the XML changing the hash - spurious spaces etc.
Somehow perform the matching iteratively from the bottom up. Eg produce a list of filters which match on the lowest level. Use this as part of an IN to match Series. Use this as part of an IN to match DataTasks etc etc. The trouble is, I start to black-out when I think about this for too long.
Basically - Has anyone ever encountered this kind of problem before (they must have). And what would be the recommended route for tackling it? example (pseudo)code would be great :)
To get rid of the possibility of minor variances, I'd run the request through an XML transform (XSLT).
Alternatively, since you've already got the code to parse this out into a denormalized staging table, that's fine too. I would then simply use FOR XML to create a new XML doc.
Your goal here is to create a standardized XML document that respects ordering where appropriate and removes inconsistencies where it is not.
Once that is done, store this in a new table. Now you can run a direct comparison of the "standardized" request XML against existing data.
To do the actual comparison, you can use a hash, store the XML as a string and do a direct string comparison, or do a full XML comparison like this: http://beyondrelational.com/modules/2/blogs/28/posts/10317/xquery-lab-36-writing-a-tsql-function-to-compare-two-xml-values-part-2.aspx
My preference, as long as the XML is never over 8000 bytes, would be to create a unique string (either VARCHAR(8000), or NVARCHAR(4000) if you need special character support) and create a unique index on the column.
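A minimal sketch of that idea combined with the hash option (all table and column names here are made up):
-- Store the standardized request XML as a string, plus a persisted hash;
-- the unique index makes duplicate detection an index seek instead of a scan.
CREATE TABLE dbo.DataRequestCanonical
(
    DataRequestID int            NOT NULL PRIMARY KEY,
    CanonicalXml  nvarchar(4000) NOT NULL,
    CanonicalHash AS HASHBYTES('SHA1', CanonicalXml) PERSISTED
);

CREATE UNIQUE INDEX UX_DataRequestCanonical_Hash
    ON dbo.DataRequestCanonical (CanonicalHash);

-- Lookup: compare hashes first, then confirm with a full string comparison
-- to guard against (unlikely) hash collisions.
DECLARE @incoming nvarchar(4000) = N'<datarequest>...</datarequest>';
SELECT DataRequestID
FROM   dbo.DataRequestCanonical
WHERE  CanonicalHash = HASHBYTES('SHA1', @incoming)
  AND  CanonicalXml  = @incoming;
Note that this only works if the XML really has been standardized first; two requests differing by whitespace or element order would otherwise hash differently.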