Get a value of the kettle pentaho flow - pentaho

I'm working with Pentaho Data Integration (Spoon).
Short description: for each row read from the flow, I want to know how many times a value has already appeared in the flow.
Long description: I am building the transformation for a fact table. Each row I read from a CSV file says that a client has traveled on a certain airplane at a specific time. I want to add a column, available seats, so that whenever a row says a customer travels on a given airplane, I look up the number of seats still available on that airplane in the preceding rows and subtract 1.
Example.
Initially Flight 1 has 160 seats available and Flight 2 has 320 seats available.
CSV
Flight | Client
1 | 1
2 | 2
1 | 3
2 | 4
I can add a column whose value is the total number of available seats.
Flight | Customer | Available seats
1 | 1 | 160
2 | 2 | 320
1 | 3 | 160
2 | 4 | 320
But after that I do not know how to obtain, for each row read from the flow, the lowest number of available seats reached so far for that row's flight.
This is the final output I want in my flow:
Flight | Customer | Available seats
1 | 1 | 159
2 | 2 | 319
1 | 3 | 158
2 | 4 | 318
Many thanks in advance for taking the time to read my question.

You can use the "Add value fields changing sequence" step (available under the "Transform" steps group) to generate a counter for each flight. The step requires the input to be sorted by the Flight field, and you need to specify the Flight field in the step so that the counter is reset whenever a new group of flights starts.
You can then subtract the counter from the Available seats field to get the current value, using a Calculator, JavaScript, or Java formula step, or any other suitable step.
Here is an example, which you can copy and paste onto Spoon canvas:
<?xml version="1.0" encoding="UTF-8"?>
<transformation-steps>
<steps>
<step>
<name>Add value fields changing sequence</name>
<type>FieldsChangeSequence</type>
<description/>
<distribute>Y</distribute>
<custom_distribution/>
<copies>1</copies>
<partitioning>
<method>none</method>
<schema_name/>
</partitioning>
<start>1</start>
<increment>1</increment>
<resultfieldName>counter</resultfieldName>
<fields>
<field>
<name>Flight</name>
</field>
</fields>
<cluster_schema/>
<remotesteps>
<input>
</input>
<output>
</output>
</remotesteps>
<GUI>
<xloc>352</xloc>
<yloc>96</yloc>
<draw>Y</draw>
</GUI>
</step>
<step>
<name>Data Grid</name>
<type>DataGrid</type>
<description/>
<distribute>Y</distribute>
<custom_distribution/>
<copies>1</copies>
<partitioning>
<method>none</method>
<schema_name/>
</partitioning>
<fields>
<field>
<name>Flight</name>
<type>Integer</type>
<format/>
<currency/>
<decimal/>
<group/>
<length>-1</length>
<precision>-1</precision>
<set_empty_string>N</set_empty_string>
</field>
<field>
<name>Customer</name>
<type>Integer</type>
<format/>
<currency/>
<decimal/>
<group/>
<length>-1</length>
<precision>-1</precision>
<set_empty_string>N</set_empty_string>
</field>
<field>
<name>Total available seats</name>
<type>Integer</type>
<format/>
<currency/>
<decimal/>
<group/>
<length>-1</length>
<precision>-1</precision>
<set_empty_string>N</set_empty_string>
</field>
</fields>
<data>
<line> <item>1</item><item>1</item><item>160</item> </line>
<line> <item>2</item><item>2</item><item>320</item> </line>
<line> <item>1</item><item>3</item><item>160</item> </line>
<line> <item>2</item><item>4</item><item>320</item> </line>
</data>
<cluster_schema/>
<remotesteps>
<input>
</input>
<output>
</output>
</remotesteps>
<GUI>
<xloc>80</xloc>
<yloc>96</yloc>
<draw>Y</draw>
</GUI>
</step>
<step>
<name>Sort rows (by flight)</name>
<type>SortRows</type>
<description/>
<distribute>Y</distribute>
<custom_distribution/>
<copies>1</copies>
<partitioning>
<method>none</method>
<schema_name/>
</partitioning>
<directory>%%java.io.tmpdir%%</directory>
<prefix>out</prefix>
<sort_size>1000000</sort_size>
<free_memory/>
<compress>N</compress>
<compress_variable/>
<unique_rows>N</unique_rows>
<fields>
<field>
<name>Flight</name>
<ascending>Y</ascending>
<case_sensitive>N</case_sensitive>
<collator_enabled>N</collator_enabled>
<collator_strength>0</collator_strength>
<presorted>N</presorted>
</field>
</fields>
<cluster_schema/>
<remotesteps>
<input>
</input>
<output>
</output>
</remotesteps>
<GUI>
<xloc>192</xloc>
<yloc>96</yloc>
<draw>Y</draw>
</GUI>
</step>
<step>
<name>Calculator</name>
<type>Calculator</type>
<description/>
<distribute>Y</distribute>
<custom_distribution/>
<copies>1</copies>
<partitioning>
<method>none</method>
<schema_name/>
</partitioning>
<calculation>
<field_name>Available seats</field_name>
<calc_type>SUBTRACT</calc_type>
<field_a>Total available seats</field_a>
<field_b>counter</field_b>
<field_c/>
<value_type>Integer</value_type>
<value_length>-1</value_length>
<value_precision>-1</value_precision>
<remove>N</remove>
<conversion_mask/>
<decimal_symbol/>
<grouping_symbol/>
<currency_symbol/>
</calculation>
<cluster_schema/>
<remotesteps>
<input>
</input>
<output>
</output>
</remotesteps>
<GUI>
<xloc>496</xloc>
<yloc>96</yloc>
<draw>Y</draw>
</GUI>
</step>
</steps>
<order>
<hop>
<from>Add value fields changing sequence</from>
<to>Calculator</to>
<enabled>Y</enabled>
</hop>
<hop>
<from>Data Grid</from>
<to>Sort rows (by flight)</to>
<enabled>Y</enabled>
</hop>
<hop>
<from>Sort rows (by flight)</from>
<to>Add value fields changing sequence</to>
<enabled>Y</enabled>
</hop>
</order>
<notepads>
</notepads>
<step_error_handling>
</step_error_handling>
</transformation-steps>
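If you want to check the logic outside Spoon, the same idea (a per-flight counter that is subtracted from the total) can be sketched in Python. This is only an illustration and not part of the transformation; the function name `remaining_seats` and the tuple layout are assumptions. Unlike the step above, it keeps one counter per flight in a dictionary, so the rows do not have to be sorted first.

```python
from collections import defaultdict


def remaining_seats(rows):
    """rows: iterable of (flight, customer, total_available_seats).

    Returns the rows in their original order, with the seats still
    available after each booking appended as a fourth value.
    """
    counter = defaultdict(int)  # bookings seen so far, per flight
    result = []
    for flight, customer, total in rows:
        counter[flight] += 1  # plays the role of the "counter" field
        result.append((flight, customer, total, total - counter[flight]))
    return result


# Sample data taken from the question.
rows = [(1, 1, 160), (2, 2, 320), (1, 3, 160), (2, 4, 320)]
for row in remaining_seats(rows):
    print(row)
```

With the sample rows from the question, the fourth value comes out as 159, 319, 158, 318.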

Related

How to use tokens from 2 time range inputs in single Splunk dashboard query?

I'm using Splunk classic dashboards where I have 2 time range inputs. I want to compare data for 2 time frames in a single table. Essentially, I want to perform query which counts errors by type for period A and B, then join the searches by error type so that I can see how many errors of each type there were in period A as opposed to period B.
I added a single panel, because I want to use tokens from both time inputs in the query:
(index=myindex) earliest="$runATimeInput.earliest$" latest="$runATimeInput.latest$" environment="$runAEnvironment$" level=ERROR
| spath input=message
| stats count by logIdentifier
| sort count desc
| join left=L right=R where L.logIdentifier = R.logIdentifier
[| search (index=myindex) earliest="$runBTimeInput.earliest$" latest="$runBTimeInput.latest$" environment="$runBEnvironment$" level=ERROR
| spath input=message
| stats count by logIdentifier ]
The problem is that the query doesn't return any results although it should. The main query returns results:
(index=myindex) earliest="$runATimeInput.earliest$" latest="$runATimeInput.latest$" environment="$runAEnvironment$" level=ERROR
| spath input=message
| stats count by logIdentifier
| sort count desc
However the subsearch query doesn't return any results (although a separate search for the same period in a new tab returns results):
[| search (index=myindex) earliest="$runBTimeInput.earliest$" latest="$runBTimeInput.latest$" environment="$runBEnvironment$" level=ERROR
| spath input=message
| stats count by logIdentifier ]
When I click on Run Search in the Splunk panel to open the search in a new tab, I see strange values for the earliest/latest tokens. For the main query the values are earliest="1669500000" latest="1669506493.677", where 1669500000 is Tue Jan 20 1970 09:45:00 and 1669506493.677 is Sun Nov 27 2022 01:48:13, whereas the time frame for period 1 was Sun Nov 27 2022 00:00:00 - Sun Nov 27 2022 01:48:13. That said, the main query works and respects the original time frame.
The values for the second query are earliest="1669813200" latest="1669816444.909", where 1669813200 is Tue Jan 20 1970 09:45:00 and 1669816444.909 is Wed Nov 30 2022 15:54:04, whereas the period 2 time frame was Wed Nov 30 2022 15:00:04 - Wed Nov 30 2022 15:54:04.
Am I doing something wrong in the panel settings or the query? Or maybe there's another way to do this in Splunk?
Below is the dashboard XML:
<form>
<label>My Dashboard</label>
<description>My Dashboard</description>
<fieldset submitButton="false" autoRun="true">
<input type="time" token="runATimeInput" searchWhenChanged="true">
<label>Run A</label>
<default>
<earliest>-24h@h</earliest>
<latest>now</latest>
</default>
</input>
<input type="dropdown" token="runAEnvironment" searchWhenChanged="true">
<label>Run A Environment</label>
<choice value="prod">prod</choice>
<default>prod</default>
</input>
<input type="time" token="runBTimeInput" searchWhenChanged="true">
<label>Run B</label>
<default>
<earliest>-24h@h</earliest>
<latest>now</latest>
</default>
</input>
<input type="dropdown" token="runBEnvironment" searchWhenChanged="true">
<label>Run B Environment</label>
<choice value="prod">prod</choice>
<default>prod</default>
</input>
</fieldset>
<row>
<panel>
<title>Top Exceptions</title>
<table>
<title>Top Exceptions</title>
<search>
<query>(index=distapps) earliest="$runATimeInput.earliest$" latest="$runATimeInput.latest$" environment="$runAEnvironment$" level=ERROR | spath input=message
| stats count by logIdentifier
| sort count desc
| join left=L right=R where L.logIdentifier = R.logIdentifier
[| search (index=myindex) earliest="$runBTimeInput.earliest$" latest="$runBTimeInput.latest$" environment="$runBEnvironment$" level=ERROR
| spath input=message
| stats count by logIdentifier ]</query>
<earliest>$runATimeInput.earliest$</earliest>
<latest>$runBTimeInput.latest$</latest>
</search>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
</row>
</form>
Here's a test dashboard I created that uses two timepickers. It produces results for both time periods. How is yours different? Could it be the count field is used in both the main and subsearches?
<form version="1.1">
<label>test</label>
<fieldset submitButton="false">
<input type="time" token="runATimeInput">
<label>A</label>
<default>
<earliest>-24h@h</earliest>
<latest>now</latest>
</default>
</input>
<input type="time" token="runBTimeInput">
<label>B</label>
<default>
<earliest>-48h@h</earliest>
<latest>-24h@h</latest>
</default>
</input>
</fieldset>
<row>
<panel>
<table>
<search>
<query>(index=_internal) earliest="$runATimeInput.earliest$" latest="$runATimeInput.latest$"
| stats count as countA by component
| join component [| search (index=_internal) earliest="$runBTimeInput.earliest$" latest="$runBTimeInput.latest$"
| stats count as countB by component ]</query>
<earliest>$runATimeInput.earliest$</earliest>
<latest>$runATimeInput.latest$</latest>
<sampleRatio>1</sampleRatio>
</search>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
</row>
</form>
Don't use any tokens or a time selector on the panel itself.
You should be able to reference the .earliest and .latest values of your two time tokens just fine in any search on the dashboard.
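A side note on the "strange" earliest/latest values in the question: they appear to be Unix epoch times in seconds. The January 1970 dates are what you get when a converter reads the same number as milliseconds. A quick Python check (the helper names are made up for this sketch, and everything is printed in UTC):

```python
from datetime import datetime, timezone


def from_epoch_seconds(value):
    # Interpret the token value as seconds since 1970-01-01 UTC.
    return datetime.fromtimestamp(value, tz=timezone.utc)


def from_epoch_millis(value):
    # Interpret the same number as milliseconds instead.
    return datetime.fromtimestamp(value / 1000, tz=timezone.utc)


print(from_epoch_seconds(1669500000))  # 2022-11-26 22:00:00+00:00
print(from_epoch_millis(1669500000))   # 1970-01-20 07:45:00+00:00
print(from_epoch_seconds(1669813200))  # 2022-11-30 13:00:00+00:00
```

Read as seconds, both earliest values fall at the start of the selected periods once a UTC+2 local time zone is assumed (the time zone is a guess based on the times quoted in the question), so the tokens themselves look fine.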

Select from parent-children xml content

I have been looking for solution for selecting parent-child relation from table [Group] which contains a xml column.
[Group] table has the following structure:
ID - int
Content - xml
There is xml data - parent-child relation in column Content
<root>
<person name="John">
<device name="notebook" />
<device name="xbox" />
</person>
<person name="Jane">
<device name="TV" />
</person>
<person name="Mark">
</person>
</root>
I would like to select data in the following format:
Group Id | PersonName | DeviceName
1 | John | notebook
1 | John | xbox
1 | Jane | TV
Because Mark has no device assigned, there is no row for Mark in result.
Is it possible to achieve this result in a SELECT query?
As I mentioned, you can use XQuery for this. As you don't want any rows for Mark, I go straight to the device node, in the nodes method, as this means that no rows for Mark will be found. Then you can go back up one level to get the person's name:
SELECT V.ID AS GroupID,
p.d.value('../@name','nvarchar(50)') AS PersonName,
p.d.value('@name','nvarchar(50)') AS DeviceName
FROM(VALUES(1,CONVERT(xml,'<root>
<person name="John">
<device name="notebook" />
<device name="xbox" />
</person>
<person name="Jane">
<device name="TV" />
</person>
<person name="Mark">
</person>
</root>')))V(ID, Content)
CROSS APPLY V.Content.nodes('root/person/device') p(d);
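If it helps to see the same traversal outside SQL Server, here is a rough Python sketch using the standard library's ElementTree. It is not a replacement for the query; the function name `person_devices` is made up for the illustration. ElementTree has no parent axis like `../@name`, so the sketch walks person by person and then device by device, which gives the same rows: a person without devices produces nothing.

```python
import xml.etree.ElementTree as ET

# Same XML as in the question.
CONTENT = """<root>
  <person name="John">
    <device name="notebook" />
    <device name="xbox" />
  </person>
  <person name="Jane">
    <device name="TV" />
  </person>
  <person name="Mark">
  </person>
</root>"""


def person_devices(group_id, content):
    """Return one (group_id, person_name, device_name) tuple per device."""
    root = ET.fromstring(content)
    rows = []
    for person in root.findall("person"):
        # A person with no <device> children contributes no rows.
        for device in person.findall("device"):
            rows.append((group_id, person.get("name"), device.get("name")))
    return rows


for row in person_devices(1, CONTENT):
    print(row)
```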

How to Select distinct values in FOR XML PATH?

Given the following tables, where T_DATA.ID = PARENT_ID or CHILD_ID
Name: T_DATA
+----+------+--------+
| ID | CODE | VALUE |
+----+------+--------+
| 1 | 3186 | value1 |
| 2 | 3186 | value2 |
| 3 | 3189 | value3 |
| 4 | 3189 | value4 |
| 5 | 3190 | value5 |
+----+------+--------+
Name: T_DATA_LINK
+-----------+----------+
| PARENT_ID | CHILD_ID |
+-----------+----------+
| 1 | 3 |
| 1 | 4 |
+-----------+----------+
I want to return an xml structure like this:
<ITEM_LIST>
<ITEM>
<CODE>3186</CODE>
<ROWS>
<ROW>
<ID>1</ID>
<ROW_INDEX>0</ROW_INDEX>
<VALUE>value1</VALUE>
</ROW>
<ROW>
<ID>2</ID>
<ROW_INDEX>1</ROW_INDEX>
<VALUE>value2</VALUE>
</ROW>
</ROWS>
</ITEM>
<ITEM>
<CODE>3189</CODE>
<ROWS>
<ROW>
<ID>3</ID>
<ROW_INDEX>0</ROW_INDEX>
<VALUE>value3</VALUE>
</ROW>
<ROW>
<ID>4</ID>
<ROW_INDEX>1</ROW_INDEX>
<VALUE>value4</VALUE>
</ROW>
</ROWS>
</ITEM>
<ITEM>
<CODE>3190</CODE>
<VALUE>value5</VALUE>
</ITEM>
</ITEM_LIST>
The ROW_INDEX is incremented by 1 for every ROW.
I need the T_DATA_LINK table to know whether an ITEM has a parent or not.
If it has a parent it means that there is more than one record with that CODE value and they need to be displayed as ROWS, otherwise it has to be displayed as a single ITEM.
UPDATE
I actually need to check the T_DATA_LINK table, since there may be cases where an ITEM has a parent and only one record, but it still needs to be displayed as a ROW.
@Shnugo I tried your solution. I now get the correct values inside the ROWS, but I get duplicates for each ITEM that has more than one record.
This is probably because I had to add to the GROUP BY the other fields I need to return with the SELECT, which I left out of the example to keep it simpler.
For example, the ID needs to be displayed at the ITEM level for the items which don't have any ROWS.
UPDATE 2
@Shnugo you are correct. Items 3 and 4 are the children of Item 1, but you don't see the relationship in the xml.
All the items are unique, always.
The items that are referenced in T_DATA_LINK are still unique, but are linked to each other in my application where they are displayed inside a table.
Basically the PARENT is the first column of the table and the children are the other columns.
This is the updated output I want to get.
ID should be always -1 for the items that have rows.
PARENT_CODE should be the CODE of the parent (if the item is a parent then it is equal to the CODE)
<ITEM_LIST>
<ITEM>
<ID>-1</ID>
<CODE>3186</CODE>
<PARENT_CODE>3186</PARENT_CODE>
<ROWS>
<ROW>
<ID>1</ID>
<ROW_INDEX>0</ROW_INDEX>
<VALUE>value1</VALUE>
</ROW>
<ROW>
<ID>2</ID>
<ROW_INDEX>1</ROW_INDEX>
<VALUE>value2</VALUE>
</ROW>
</ROWS>
</ITEM>
<ITEM>
<ID>-1</ID>
<CODE>3189</CODE>
<PARENT_CODE>3186</PARENT_CODE>
<ROWS>
<ROW>
<ID>3</ID>
<ROW_INDEX>0</ROW_INDEX>
<VALUE>value3</VALUE>
</ROW>
<ROW>
<ID>4</ID>
<ROW_INDEX>1</ROW_INDEX>
<VALUE>value4</VALUE>
</ROW>
</ROWS>
</ITEM>
<ITEM>
<ID>5</ID>
<CODE>3190</CODE>
<VALUE>value5</VALUE>
</ITEM>
</ITEM_LIST>
This is a new answer... Please try to put all needed information into the initial question...
DECLARE @t_data TABLE(ID INT,CODE INT,VALUE VARCHAR(100));
INSERT INTO @t_data VALUES
(1,3186,'value1')
,(2,3186,'value2')
,(3,3189,'value3')
,(4,3189,'value4')
,(5,3190,'value5');
DECLARE @t_data_link TABLE(PARENT_ID INT, CHILD_ID INT);
INSERT INTO @t_data_link VALUES
(1,3)
,(1,4);
--The CTE links the two tables and allows to handle them as one derived table
WITH Combined AS
(
SELECT d.*
,d2.CODE AS PARENT_CODE
,COUNT(*) OVER(PARTITION BY d.CODE) AS CountRows
FROM @t_data AS d
LEFT JOIN @t_data_link AS dl ON d.ID=dl.CHILD_ID
LEFT JOIN @t_data AS d2 ON dl.PARENT_ID=d2.ID
)
SELECT CASE WHEN c.CountRows>1 THEN -1 END AS ID
,CASE WHEN c.CountRows>1 THEN c.CODE END AS CODE
,CASE WHEN c.CountRows>1 THEN ISNULL(c.PARENT_CODE,c.CODE) END AS PARENT_CODE
--This part for elements with just one row per code
,(
SELECT d2.ID
,d2.CODE
,d2.VALUE
FROM @t_data AS d2
WHERE c.CODE=d2.CODE
AND c.CountRows=1
FOR XML PATH(''),TYPE
)
--This part for elements with more rows per code
,(
SELECT d2.ID
,ROW_NUMBER() OVER(ORDER BY (SELECT NULL))-1 AS ROW_INDEX
,d2.VALUE
FROM @t_data AS d2
WHERE c.CODE=d2.CODE
AND c.CountRows>1
FOR XML PATH('ROW'),ROOT('ROWS'),TYPE
)
FROM Combined AS c
GROUP BY c.CODE,c.CountRows,c.PARENT_CODE
FOR XML PATH('ITEM'),ROOT('ITEM_LIST');
The result
<ITEM_LIST>
<ITEM>
<ID>-1</ID>
<CODE>3186</CODE>
<PARENT_CODE>3186</PARENT_CODE>
<ROWS>
<ROW>
<ID>1</ID>
<ROW_INDEX>0</ROW_INDEX>
<VALUE>value1</VALUE>
</ROW>
<ROW>
<ID>2</ID>
<ROW_INDEX>1</ROW_INDEX>
<VALUE>value2</VALUE>
</ROW>
</ROWS>
</ITEM>
<ITEM>
<ID>-1</ID>
<CODE>3189</CODE>
<PARENT_CODE>3186</PARENT_CODE>
<ROWS>
<ROW>
<ID>3</ID>
<ROW_INDEX>0</ROW_INDEX>
<VALUE>value3</VALUE>
</ROW>
<ROW>
<ID>4</ID>
<ROW_INDEX>1</ROW_INDEX>
<VALUE>value4</VALUE>
</ROW>
</ROWS>
</ITEM>
<ITEM>
<ID>5</ID>
<CODE>3190</CODE>
<VALUE>value5</VALUE>
</ITEM>
</ITEM_LIST>
FOR XML omits any NULL value. A subselect whose WHERE clause finds nothing returns NULL, so its element simply does not appear in the output.

Querying SQL for grouped data and receiving it in BizTalk?

I have a query that returns data organized by group. I want the results to come out as grouped XML similar to the format below. I am planning to pass it as an XML message into BizTalk using the WCF-SQL port adapter.
Data:
ID GroupID Item ID FileName
1 1 1 File001.txt
2 1 2 File001.txt
3 2 3 File002.txt
4 2 4 File002.txt
5 2 5 File002.txt
6 3 6 File003.txt
7 3 7 File003.txt
8 null 8 File004.txt
9 null 9 File005.txt
XML Output (input to BizTalk)
<GroupInfo ID="1" FileName="File001.txt">
<Items>
<Item ID="1" />
<Item ID="2" />
</Items>
</GroupInfo>
<GroupInfo ID="2" FileName="File002.txt">
<Items>
<Item ID="3" />
<Item ID="4" />
<Item ID="5" />
</Items>
</GroupInfo>
<GroupInfo ID="3" FileName="File003.txt">
<Items>
<Item ID="6" />
<Item ID="7" />
</Items>
</GroupInfo>
<GroupInfo FileName="File004.txt">
<Items>
<Item ID="8" />
</Items>
</GroupInfo>
<GroupInfo FileName="File005.txt">
<Items>
<Item ID="9" />
</Items>
</GroupInfo>
I'm not sure what to do to get the output in the required format. Please help.
To accomplish this, the stored procedure has to return its results using FOR XML.
It is all documented here: http://technet.microsoft.com/en-us/library/ms178107.aspx
In this case, you might find it easier to join the table to itself on GroupID, as FOR XML AUTO will automatically create child records for the joined rows.

How to convert nested hierarchy of xml to sql table

Using MSSQL 2008 and XQUERY
Consider the following XML stored in a table:
<ROOT>
<WrapperElement>
<ParentElement ID="1">
<Title>parent1</Title>
<Description />
<ChildElement ID="6">
<Title>Child 4</Title>
<Description />
<StartDate>2010-01-25T00:00:00</StartDate>
<EndDate>2010-01-25T00:00:00</EndDate>
</ChildElement>
<ChildElement ID="0">
<Title>Child1</Title>
<Description />
<StartDate>2010-01-25T00:00:00</StartDate>
<EndDate>2010-01-25T00:00:00</EndDate>
</ChildElement>
<ChildElement ID="8">
<Title>Child6</Title>
<Description />
<StartDate>2010-01-25T00:00:00</StartDate>
<EndDate>2010-01-25T00:00:00</EndDate>
</ChildElement>
</ParentElement>
</WrapperElement>
</ROOT>
I want to decompose this xml into something like
PE!ID | PE!Title | PE!Description | CE!ID | CE!Title | CE!StartDate |...
1 | parent1 | | 6 | child 4 | 2010-... |
1 | parent1 | | 0 | child1 | 2010-... |
etc.
Note: there may be many ChildElements per ParentElement, in this example.
I've been experimenting with XQuery, but I've not been able to navigate through complex elements like these.
Basically, I'm trying to do the exact opposite of what FOR XML does to a table, only with a much simpler set of data to work with.
Any ideas on where to go next or how to accomplish this?
Thanks
How about this (I declared @input to be a XML datatype variable with your XML content - replace accordingly):
SELECT
Parent.Elm.value('(@ID)[1]', 'int') AS 'ID',
Parent.Elm.value('(Title)[1]', 'varchar(100)') AS 'Title',
Parent.Elm.value('(Description)[1]', 'varchar(100)') AS 'Description',
Child.Elm.value('(@ID)[1]', 'int') AS 'ChildID',
Child.Elm.value('(Title)[1]', 'varchar(100)') AS 'ChildTitle',
Child.Elm.value('(StartDate)[1]', 'DATETIME') AS 'StartDate',
Child.Elm.value('(EndDate)[1]', 'DATETIME') AS 'EndDate'
FROM
@input.nodes('/ROOT/WrapperElement/ParentElement') AS Parent(Elm)
CROSS APPLY
Parent.Elm.nodes('ChildElement') AS Child(Elm)
You basically iterate over all the /ROOT/WrapperElement/ParentElement nodes (as the Parent(Elm) pseudo table), and for each of those entries you then do a CROSS APPLY over the child elements contained inside that ParentElement and pluck out the necessary information.
Should work - I hope!
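To make the "outer loop over parents, inner loop over children" idea concrete, here is a rough equivalent in Python with ElementTree. It is only a sketch of the iteration pattern, not of SQL Server's XQuery support; the function name `shred` and the shortened sample document (with the ID attribute quoted so that it parses) are assumptions for the illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Shortened version of the XML from the question.
SAMPLE = """<ROOT>
  <WrapperElement>
    <ParentElement ID="1">
      <Title>parent1</Title>
      <Description />
      <ChildElement ID="6">
        <Title>Child 4</Title>
        <Description />
        <StartDate>2010-01-25T00:00:00</StartDate>
        <EndDate>2010-01-25T00:00:00</EndDate>
      </ChildElement>
      <ChildElement ID="0">
        <Title>Child1</Title>
        <Description />
        <StartDate>2010-01-25T00:00:00</StartDate>
        <EndDate>2010-01-25T00:00:00</EndDate>
      </ChildElement>
    </ParentElement>
  </WrapperElement>
</ROOT>"""


def shred(xml_text):
    """Return one flat tuple per (ParentElement, ChildElement) pair."""
    root = ET.fromstring(xml_text)
    rows = []
    # Outer loop: plays the role of the .nodes() call on ParentElement.
    for parent in root.findall("WrapperElement/ParentElement"):
        # Inner loop: plays the role of the CROSS APPLY over ChildElement.
        for child in parent.findall("ChildElement"):
            rows.append((
                int(parent.get("ID")),
                parent.findtext("Title"),
                parent.findtext("Description"),
                int(child.get("ID")),
                child.findtext("Title"),
                datetime.fromisoformat(child.findtext("StartDate")),
                datetime.fromisoformat(child.findtext("EndDate")),
            ))
    return rows


for row in shred(SAMPLE):
    print(row)
```

As with CROSS APPLY, a ParentElement without any ChildElement yields no rows here.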