Pentaho to convert tree structure data - pentaho

I have a stream of data from a CSV. It is a flat structured database.
E.g.:
a,b,c,d
a,b,c,e
a,b,f
This essentially transforms into:
Node id,Nodename,parent id,level
100, a , 0 , 1
200, b , 100 , 2
300, c , 200 , 3
400, d , 300 , 4
500, e , 300 , 4
600, f , 200 , 3
Can this be done using Pentaho? I have gone through the transformation steps. But nothing strikes me as usable for this purpose. Please let me know if there is any step that I may have missed.

Your CSV file contains a graph or tree definition. The output format is richer than the input (node_id needs to be generated, parent_id needs to be resolved, level needs to be set). There are a few issues you will face when processing this kind of CSV file in Pentaho Data Integration:
Data loading & processing:
Rows do not have the same length (sometimes 4 nodes, sometimes 3 nodes).
Load whole rows, then split each row into nodes and process one node per record stream item.
You can calculate the output values in the same step where the nodes are split.
Solution Steps:
CSV file input: Load data from CSV. Settings: No header row; Delimiter = ';' (so that each comma-separated line is read whole into a single field); One output column named rowData
Modified Java Script Value: Split rowData to nodes and calculate output values: nodeId, nodeName, parentId, nodeLevel [See the code below]
Sort rows: Sort rows by nodeName. [a,b,c,d,a,b,c,e,a,b,f >> a,a,a,b,b,c,c,d,e,f]
Unique rows: Delete duplicate rows by nodeName. [a,a,a,b,b,c,c,d,e,f >> a,b,c,d,e,f]
Text file output: Write out results.
Modified Java Script Value Code:
// Emit one output row per node: the input fields followed by the four new fields.
function writeRow(nodeId, nodeName, parentId, nodeLevel) {
  var newRow = createRowCopy(getOutputRowMeta().size());
  var rowIndex = getInputRowMeta().size();
  newRow[rowIndex++] = nodeId;
  newRow[rowIndex++] = nodeName;
  newRow[rowIndex++] = parentId;
  newRow[rowIndex++] = nodeLevel;
  putRow(newRow);
}
// Fixed lookup from node name to node id; extend it for your own node names.
var nodeIdsMap = {
  a: "100",
  b: "200",
  c: "300",
  d: "400",
  e: "500",
  f: "600",
  g: "700",
  h: "800"
};
// rowData from record stream (CSV input step)
var nodes = rowData.split(",");
for (var i = 0; i < nodes.length; i++) {
  var nodeId = nodeIdsMap[nodes[i]];
  // The first node in a row is the root, so its parent id is "0".
  var parentNodeId = (i == 0) ? "0" : nodeIdsMap[nodes[i - 1]];
  var level = i + 1;
  writeRow(nodeId, nodes[i], parentNodeId, level);
}
// Drop the original input row; only the rows written with putRow() continue.
trans_Status = SKIP_TRANSFORMATION;
Modified Java Script Value Field Settings:
Fieldname; Type; Replace value 'Fieldname' or 'Rename to'
nodeId; String; N
nodeName; String; N
parentId; String; N
nodeLevel; String; N
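To make the split-and-resolve logic easier to follow outside of Pentaho, here is a minimal Python sketch of the same idea. It is not part of the transformation above: the function name paths_to_nodes and the id_step parameter are made up for illustration, and ids are generated on first sight instead of coming from a fixed map. Like the Sort rows / Unique rows steps, it assumes a node name appears at only one place in the tree.

```python
def paths_to_nodes(lines, id_step=100):
    """Turn comma-separated paths into (node_id, name, parent_id, level) rows."""
    ids = {}    # node name -> generated id (assumes names are unique in the tree)
    rows = []
    for line in lines:
        parent_id = 0  # the first node of every path is the root; its parent is 0
        for level, name in enumerate(line.strip().split(","), start=1):
            name = name.strip()
            if name not in ids:
                # First time this node is seen: assign the next id and emit a row.
                ids[name] = (len(ids) + 1) * id_step
                rows.append((ids[name], name, parent_id, level))
            parent_id = ids[name]
    return rows


rows = paths_to_nodes(["a,b,c,d", "a,b,c,e", "a,b,f"])
for row in rows:
    print(row)
```

Because a node is emitted only the first time it is seen, no separate de-duplication pass is needed in this version.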


How to remove all JSON attributes with certain value in PostgreSQL

Given this table:
parent | payload
1      | { a: 7, b: 3 }
2      | { a: 7, c: 3 }
1      | { d: 3, e: 1, f: 3 }
I want to update children of 1 and remove any attribute X where payload->X is 3.
After executing the query the records should look like this:
parent | payload
1      | { a: 7 }
2      | { a: 7, c: 3 }
1      | { e: 1 }
update records set payload=?? where parent = 1 and ??
There is no built-in function for this, but you can write your own:
create function remove_keys_by_value(p_input jsonb, p_value jsonb)
returns jsonb
as
$$
select jsonb_object_agg(t.key, t.value)
from jsonb_each(p_input) as t(key, value)
where value <> p_value;
$$
language sql
immutable;
Then you can do:
update records
set payload = remove_keys_by_value(payload, to_jsonb(3))
where parent = 1;
This assumes that payload is defined as jsonb (which it should be). If it's not, you have to cast it: payload::jsonb
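If it helps to see what the function does to a single payload, here is a rough Python model of the same filter, using a plain dict in place of a jsonb object. It is only an illustration, not something you would run against the database. One difference worth knowing: when every key is removed the dict version returns an empty object, whereas jsonb_object_agg over zero rows yields NULL.

```python
def remove_keys_by_value(payload, value):
    """Return a copy of payload without the keys whose value equals `value`."""
    return {key: val for key, val in payload.items() if val != value}


print(remove_keys_by_value({"a": 7, "b": 3}, 3))
print(remove_keys_by_value({"d": 3, "e": 1, "f": 3}, 3))
```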
Try this (it only handles a single known key, here 'x'):
update records
set payload = payload - 'x'
where parent = 1 and (payload->>'x')::int = 3

Iterate through a column in Dataset which have array of key value pairs and find out a pair with max value

I have data in a DataFrame that was obtained from Azure Event Hubs.
I then convert this data to JSON objects and store the required data in a Dataset, as shown below.
Code for obtaining the data from the event hub and storing it in a DataFrame:
val connectionString = ConnectionStringBuilder(<ENDPOINT URL>)
.setEventHubName(<EVENTHUB NAME>).build
val currTime = Instant.now
val ehConf = EventHubsConf(connectionString)
.setConsumerGroup("<CONSUMER GRP>")
.setStartingPosition(EventPosition
.fromEnqueuedTime(currTime.minus(Duration.ofMinutes(30))))
.setEndingPosition(EventPosition.fromEnqueuedTime(currTime))
val reader = spark.read.format("eventhubs").options(ehConf.toMap).load()
var SIGNALS = reader
.select(get_json_object(($"body").cast("string"),"$.NUM").alias("NUM"),
get_json_object(($"body").cast("string"),"$.SIG1").alias("SIG1"),
get_json_object(($"body").cast("string"),"$.SIG2").alias("SIG2"),
get_json_object(($"body").cast("string"),"$.SIG3").alias("SIG3"),
get_json_object(($"body").cast("string"),"$.SIG4").alias("SIG4")
)
val SIGNALSFiltered = SIGNALS.filter(col("SIG1").isNotNull &&
col("SIG2").isNotNull && col("SIG3").isNotNull && col("SIG4").isNotNull)
The data obtained at SIGNALSFiltered is shown below.
+-----------------+--------------------+--------------------+--------------------+--------------------+
| NUM| SIG1| SIG2| SIG3| SIG4|
+-----------------+--------------------+--------------------+--------------------+--------------------+
|XXXXX01|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX02|[{"TIME":15695604780...|[{"TIME":15695604780...|[{"TIME":15695604780...|[{"TIME":15695604780...|
|XXXXX03|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX04|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX05|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX06|[{"TIME":15695605340...|[{"TIME":15695605340...|[{"TIME":15695605340...|[{"TIME":15695605340...|
|XXXXX07|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX08|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
If we check entire data for a single row it will be as below.
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825},{"TIME":1569560475000,"VALUE":3.7812},{"TIME":1569560483000,"VALUE":3.7812},{"TIME":1569560491000,"VALUE":34.7875}]|
[{"TIME":1569560537000,"VALUE":3.7825},{"TIME":1569560481000,"VALUE":34.7825},{"TIME":1569560489000,"VALUE":34.7825},{"TIME":1569560497000,"VALUE":34.7825}]|
[{"TIME":1569560505000,"VALUE":34.7825},{"TIME":1569560513000,"VALUE":34.7825},{"TIME":1569560521000,"VALUE":34.7825},{"TIME":1569560527000,"VALUE":34.7825}]|
[{"TIME":1569560535000,"VALUE":34.7825},{"TIME":1569560479000,"VALUE":34.7825},{"TIME":1569560487000,"VALUE":34.7825}]
I want only the pair with the highest TIME from each column, not all of the TIME/VALUE pairs. The output should be as shown below.
+-----------------+-----------------------------+---------------------------------------+---------------------------------------+----------------------------------------+
| NUM| SIG1| SIG2| SIG3| SIG4|
+-----------------+-----------------------------+---------------------------------------+---------------------------------------+----------------------------------------+
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":4.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":5.7825}]|
|XXXXX02|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":6.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":7.7825}]|
|XXXXX03|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":9.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":8.7825}]|
How can I iterate through each column in each row and get the TIME/VALUE pair with the highest TIME?
After getting the highest pair in each column (SIG1, ..., SIG4), I have to update the TIME in all columns to the highest TIME among them.
Is there any way to convert the base dataset as shown below? Each element in a column should be converted to a new row.
+-----------------+-----------------------------+---------------------------------------+---------------------------------------+----------------------------------------+
| NUM| SIG1| SIG2| SIG3| SIG4|
+-----------------+-----------------------------+---------------------------------------+---------------------------------------+----------------------------------------+
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]| null |[{"TIME":1569560531000,"VALUE":3.7825}]|
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|
|XXXXX02|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|
|XXXXX02|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|
|XXXXX02|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|
|XXXXX02|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|
Any leads or help is appreciated! Thanks in Advance.
You have to write a user-defined function like the one below, which loops over your data and gets the pair with the maximum TIME.
Note: the UDF is just for reference; you can change it as required.
How to Iterate through each column in each row and get the highest TIME-VALUE pair?
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}
import scala.util.parsing.json.JSON

// Returns a one-element JSON array holding the pair with the highest TIME.
def MaxTime: UserDefinedFunction = udf((json: String) => {
  val pars = JSON.parseFull(json)
  var output = ""
  pars.foreach { x =>
    val y = x.asInstanceOf[List[Any]]
    var i = 1
    val TimeMap = scala.collection.mutable.Map[String, Long]()
    val ValueMap = scala.collection.mutable.Map[String, Double]()
    y.foreach { zz =>
      val z = zz.asInstanceOf[Map[String, Double]]
      TimeMap(i.toString) = z("TIME").toLong
      ValueMap(i.toString) = z("VALUE")
      i = i + 1
    }
    val latest = TimeMap.maxBy(_._2) // (index, TIME) of the most recent pair
    output = """[{"TIME" : """ + latest._2.toString + """ ,"VALUE": """ + ValueMap(latest._1) + """}]"""
  }
  output
})

SIGNALSFiltered
  .withColumn("SIG1", MaxTime(col("SIG1")))
  .withColumn("SIG2", MaxTime(col("SIG2")))
  .withColumn("SIG3", MaxTime(col("SIG3")))
  .withColumn("SIG4", MaxTime(col("SIG4")))
  .show(false)
After getting highest in each columns (SIG1,....SIG4) have to update only the value of TIME in all columns with highest among them.
Write a UDF like the one above, but pass the complete row as a parameter. Then parse each column value into a Map and take the maximum across all columns.
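As a rough illustration of both steps outside of Spark, here is a small Python sketch. It is not the UDF itself: the function names max_time_pair and align_times are made up, and the row is assumed to be a plain dict that maps each signal column (without NUM) to its JSON array string.

```python
import json


def max_time_pair(signal_json):
    """Return the {TIME, VALUE} pair with the largest TIME from a JSON array string."""
    pairs = json.loads(signal_json)
    return max(pairs, key=lambda pair: pair["TIME"])


def align_times(row):
    """Keep the latest pair per signal, then set every TIME to the largest of those."""
    latest = {name: max_time_pair(raw) for name, raw in row.items()}
    newest = max(pair["TIME"] for pair in latest.values())
    return {name: [{"TIME": newest, "VALUE": pair["VALUE"]}]
            for name, pair in latest.items()}


row = {
    "SIG1": '[{"TIME":1569560531000,"VALUE":3.7825},{"TIME":1569560475000,"VALUE":3.7812}]',
    "SIG2": '[{"TIME":1569560537000,"VALUE":3.7825},{"TIME":1569560481000,"VALUE":34.7825}]',
}
print(align_times(row))
```

The first function corresponds to the per-column UDF, the second to the row-level UDF that needs to see all signal columns at once.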

How to iterate on selected rows with DataTables + Select extension

I use the Select extension and try to 'alert' the ids of the selected rows.
The following code fails:
let sels = jqTable.api().rows({ selected: true });
let st = '';
sels.each(function (value, index) {
st += ',' + sels.row(value).id();
});
alert(st);
The function is called once, regardless of how many rows are selected:
0 row: value = [], index = 0
>=1 : value = [0, 2], index = 0
The following code succeeds:
let sels = jqTable.api().rows({ selected: true });
let st = '';
for (let i = 0; i < sels.count(); i++) {
st += ',' + sels.row(sels[0][i]).id();
}
alert(st);
What do I misunderstand about each()? Its description reads:
Iterate over the contents of the API result set.
I notice that the following code runs:
sels.data().each(function (value, index) {
st += ',' + value.IdFile;
});
But using it cancels the advantage of rowId : 'IdFile' in the datatable configuration.
each() iterates over the top-level entries of the API result set. For rows() the result set has a single entry, which happens to be an array containing the indexes of the selected rows.
Your first code block fails because there is only one iteration (the result set is a single array).
Your second block works because you are iterating over that single array (sels[0]).
And your third also works, as rows().data() does produce a result set with one entry per selected row, each holding that row's data.
This example will hopefully help!
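If the nesting is hard to picture, here is a toy model in Python. It is not DataTables code; the list simply mirrors the shape reported in the question for two selected rows (value = [0, 2], index = 0), and the variable names are made up.

```python
# Hypothetical model of the rows({ selected: true }) result set:
# one top-level entry, which is itself the array of selected row indexes.
selected = [[0, 2]]

# What each() does: one callback per top-level entry, so a single call.
visited_by_each = []
for value in selected:
    visited_by_each.append(value)

# What the working for-loop does: walk the inner array, one step per row.
visited_by_loop = []
for row_index in selected[0]:
    visited_by_loop.append(row_index)

print(visited_by_each)
print(visited_by_loop)
```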

Lua get index name of table as table

Is there any way to get every index value of a table?
Example:
local mytbl = {
["Hello"] = 123,
["world"] = 321
}
I want to get this:
{"Hello", "world"}
local t = {}
for k, v in pairs(mytbl) do
table.insert(t, k) -- or t[#t + 1] = k
end
Note that the order of how pairs iterates a table is not specified. If you want to make sure the elements in the result are in a certain order, use:
table.sort(t)

Hive combine column values based upon condition

I was wondering if it is possible to combine column values based upon a condition. Let me explain...
Let say my data looks like this
Id name offset
1 Jan 100
2 Janssen 104
3 Klaas 150
4 Jan 160
5 Janssen 164
And my output should be this
Id fullname offsets
1 Jan Janssen [ 100, 160 ]
I would like to combine the name values from two rows where the offsets of the two rows are no more than 1 character apart.
My question is whether this type of data manipulation is possible with Hive and, if it is, could someone share some code and an explanation?
Please be gentle, but this little piece of code returns somewhat what I want...
ArrayList<String> persons = new ArrayList<String>();
// write your code here
String _previous = "";
//Sample output form entities.txt
//USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Berkowitz,PERSON,9,10660
//USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,10685
File file = new File("entities.txt");
try {
//
// Create a new Scanner object which will read the data
// from the file passed in. To check if there are more
// line to read from it we check by calling the
// scanner.hasNextLine() method. We then read line one
// by one till all line is read.
//
Scanner scanner = new Scanner(file);
while (scanner.hasNextLine()) {
if(_previous == null || _previous.isEmpty())
_previous = scanner.nextLine();
String _current = scanner.nextLine();
//Compare the lines, if there offset is = 1
int x = Integer.parseInt(_previous.split(",")[3]) + Integer.parseInt(_previous.split(",")[4]);
int y = Integer.parseInt(_current.split(",")[4]);
if(y-x == 1){
persons.add(_previous.split(",")[1] + " " + _current.split(",")[1]);
if(scanner.hasNextLine()){
_current = scanner.nextLine();
}
}else{
persons.add(_previous.split(",")[1]);
}
_previous = _current;
}
} catch (Exception e) {
e.printStackTrace();
}
for(String person : persons){
System.out.println(person);
}
Running it on this sample data:
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Richard,PERSON,7,2732
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,2740
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,2756
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,3093
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,3195
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Berkowitz,PERSON,9,3220
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Berkowitz,PERSON,9,10660
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,10685
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Lea,PERSON,3,10858
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Lea,PERSON,3,11063
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Ken,PERSON,3,11186
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,11234
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Berkowitz,PERSON,9,17073
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Lea,PERSON,3,17095
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Stephanie,PERSON,9,17330
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Putt,PERSON,4,17340
Which produces this output
Richard Marottoli
Marottoli
Marottoli
Marottoli
Berkowitz
Berkowitz
Marottoli
Lea
Lea
Ken
Marottoli
Berkowitz
Lea
Stephanie Putt
Kind regards
Load the table using the create table statement below:
drop table if exists default.stack;
create external table default.stack
(junk string,
name string,
cat string,
len int,
off int
)
ROW FORMAT DELIMITED
FIELDS terminated by ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 'hdfs://nameservice1/....';
Use the query below to get your desired output.
select max(name), off from (
select CASE when b.name is not null then
concat(b.name," ",a.name)
else
a.name
end as name
,Case WHEN b.off1 is not null
then b.off1
else a.off
end as off
from default.stack a
left outer join (select name
,len+off+ 1 as off
,off as off1
from default.stack) b
on a.off = b.off ) a
group by off
order by off;
I have tested this; it generates your desired result.
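For anyone who wants to follow the join logic without a Hive cluster, here is a small Python sketch of the same idea. It is only an approximation of the query: the function name merge_adjacent is made up, rows are (name, len, off) tuples, and a dict lookup stands in for the left outer join, so it keeps at most one predecessor per offset.

```python
def merge_adjacent(rows):
    """rows: (name, length, offset). Join a name to the one that ends just before it."""
    # Index each row by the offset where a directly following name would start
    # (the len + off + 1 expression from the subquery).
    next_start = {off + length + 1: (name, off) for name, length, off in rows}
    best = {}  # offset -> lexicographically greatest name, like max(name) ... group by off
    for name, length, off in rows:
        if off in next_start:
            # A predecessor ends right before this name: combine them and
            # report the pair under the predecessor's offset.
            prev_name, prev_off = next_start[off]
            name, off = prev_name + " " + name, prev_off
        best[off] = max(best.get(off, ""), name)
    return [best[off] for off in sorted(best)]


sample = [
    ("Richard", 7, 2732),
    ("Marottoli", 9, 2740),
    ("Marottoli", 9, 2756),
    ("Stephanie", 9, 17330),
    ("Putt", 4, 17340),
]
print(merge_adjacent(sample))
```

The max per offset is what removes the single-name row ("Richard") once the combined row ("Richard Marottoli") exists for the same offset.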