Merging 2 rows in a Pentaho Kettle transformation

My KTR is as follows. The MongoDB JSON Input step gives this JSON:
{ "_id" : { "$oid" : "525cf3a70fafa305d949ede0"} , "asset" :
"RO2500AS1" , "Salt Rejection" : "82%" , "Salt Passage" : "18%" ,
"Recovery" : "56.33%" , "Concentration Factor" : "2.3" , "status" :
"critical" , "Flow Alarm" : "High Flow"}
And one Table input step which returns 2 rows.
In the Stream lookup step, the key to look up is configured as asset = AssetName.
My final output is returning 2 JSONs:
{"data":[{"Estimated Cost":"USD
15","AssetName":"RO2500AS1","Description":"Pump
Maintenance","Index":1,"json":"{ \"_id\" : { \"$oid\" :
\"525cf3a70fafa305d949ede0\"} , \"asset\" : \"RO2500AS1\" , \"Salt
Rejection\" : \"82%\" , \"Salt Passage\" : \"18%\" , \"Recovery\" :
\"56.33%\" , \"Concentration Factor\" : \"2.3\" , \"status\" :
\"critical\" , \"Flow Alarm\" : \"High
Flow\"}","Type":"Service","DeadLine":"13 November 2013"}]}
{"data":[{"Estimated Cost":"USD
35","AssetName":"RO2500AS1","Description":"Heat
Sensor","Index":2,"json":"{ \"_id\" : { \"$oid\" :
\"525cf3a70fafa305d949ede0\"} , \"asset\" : \"RO2500AS1\" , \"Salt
Rejection\" : \"82%\" , \"Salt Passage\" : \"18%\" , \"Recovery\" :
\"56.33%\" , \"Concentration Factor\" : \"2.3\" , \"status\" :
\"critical\" , \"Flow Alarm\" : \"High
Flow\"}","Type":"Replacement","DeadLine":"26 November 2013"}]}
I want my final JSON output to merge the rows and show a result something like:
{"data": [{"Estimated Cost":"USD 15", "AssetName":"RO2500AS1",
"Description":"Pump Maintenance", "Index":1, "Type":"Service",
"DeadLine":"13 November 2013"}, {"Estimated Cost":"USD 35",
"AssetName":"RO2500AS1", "Description":"Heat Sensor", "Index":2,
"Type":"Replacement", "DeadLine":"26 November 2013"}],
"json":{ "_id" : "525cf3a70fafa305d949ede0"} , "asset" : "RO2500AS1"
, "Salt Rejection" : "82%" , "Salt Passage" : "18%" , "Recovery" :
"56.33%" , "Concentration Factor" : "2.3" , "status" : "critical" ,
"Flow Alarm" : "High Flow"}
which means merging the 2 rows into one document.
Can anybody help, please?

You can use a Merge Join step after the Table input. That will merge the rows coming from the MySQL output, and you will have only one JSON as output.

You would want to use the Merge step for your purpose. Don't forget to sort the input streams.
Note: in this step, rows are expected to be sorted on the specified key fields. When you use the Sort step this works fine; if you sorted the data outside of PDI, you may run into issues with the internal case-sensitive/case-insensitive flag.
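Independent of which step you use, the merge itself is just grouping the looked-up rows by asset, collecting the per-row fields into one "data" array, and attaching the shared MongoDB document once. A minimal plain-Java sketch of that logic (Jackson and the RowMerger class are illustrative assumptions, not PDI API; the field names come from the question):

import java.util.List;
import java.util.Map;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class RowMerger {
    // rows: the two lookup results for one asset; sharedJson: the MongoDB
    // document they both carry. Returns the single merged output document.
    public static String merge(List<Map<String, Object>> rows, String sharedJson)
            throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        ObjectNode root = mapper.createObjectNode();
        ArrayNode data = root.putArray("data");
        for (Map<String, Object> row : rows) {
            ObjectNode item = data.addObject();
            item.put("Estimated Cost", (String) row.get("Estimated Cost"));
            item.put("AssetName", (String) row.get("AssetName"));
            item.put("Description", (String) row.get("Description"));
            item.put("Index", (Integer) row.get("Index"));
            item.put("Type", (String) row.get("Type"));
            item.put("DeadLine", (String) row.get("DeadLine"));
        }
        // Attach the MongoDB document once, parsed instead of as an escaped string.
        root.set("json", mapper.readTree(sharedJson));
        return mapper.writeValueAsString(root);
    }
}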

Related

Data not being populated after altering the Hive table

I have a Hive table which is populated from an underlying Parquet file in an HDFS location. I have altered the table schema by changing a column name, but the column is now populated with NULL instead of the original data from the Parquet file.
Give this a try. Open up the .avsc file; for the column, you will find something like:
{
  "name" : "start_date",
  "type" : [ "null", "string" ],
  "default" : null,
  "columnName" : "start_date",
  "sqlType" : "12"
}
Add an alias so that the name the data was written with still resolves to the same field:
{
  "name" : "start_date",
  "type" : [ "null", "string" ],
  "default" : null,
  "columnName" : "start_date",
  "aliases" : [ "sta_dte" ],
  "sqlType" : "12"
}
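This works because Avro resolves fields by name and consults aliases during that match, so data stored under the previous column name is matched to the renamed field again instead of resolving to null, which is where the NULLs came from.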

Hive: create a JSON array that does not contain duplicates

I want to create an array of JSON objects that does not contain duplicates. I have used LATERAL VIEW EXPLODE to break up the initial array, and now I want to group the JSON strings I received and create merged JSON objects based on a key.
For example, if I have:
Col1:
{"key" : ke , "value" : 1 }
{"key" : ke , "value" : 2 }
{"key" : ke1 , "value" : 5 }
I would like to get:
{"key" : ke , "value" : 3 }
{"key" : ke1 , "value" : 5 }
Can you help me?
select concat('{"key":"',jt.key,'","value":',sum(jt.value),'}')
from mytable t
lateral view json_tuple(Col1, 'key', 'value') jt as key,value
group by jt.key
;
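For the sample rows this produces one merged object per key, e.g. {"key":"ke","value":3} and {"key":"ke1","value":5}. Note that json_tuple returns strings, so sum() relies on Hive's implicit numeric cast and may print 3.0; add a cast to bigint if you need integers. If you want the result as a single JSON array instead of separate rows, here is a sketch wrapping the same query with collect_set (table and column names as above; collect_set also drops duplicate objects):

select concat('[', concat_ws(',', collect_set(j)), ']')
from (
  select concat('{"key":"', jt.key, '","value":', sum(jt.value), '}') as j
  from mytable t
  lateral view json_tuple(Col1, 'key', 'value') jt as key, value
  group by jt.key
) merged;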

Return 0 instead of nothing for an attribute value that is not available in the collection

I have a document as follow:
{
"_id" : ObjectId("5491d65bf315c2726a19ffe0"),
"tweetID" : NumberLong(535063274220687360),
"tweetText" : "19 RT Toronto #SunNewsNetwork: WATCH: When it comes to taxes, regulations, and economic freedom, is Canada more \"American\" than America? http://t.co/D?",
"retweetCount" : 1,
"Added" : ISODate("2014-11-19T04:00:00.000Z"),
"tweetLat" : 0,
"tweetLon" : 0,
"url" : "http://t.co/DH0xj0YBwD ",
"sentiment" : 18
}
Now I want to get all documents like this where Added is between 2014-11-19 and 2014-11-23. Note that there might be no data for some date, for example 2014-11-21, and that is where the problem starts: when this happens, I want 0 as the sum of sentiment for that date instead of getting nothing back (I know I can check this in Java, but it is not reasonable). My code is as follows; it works fine except that for a date that is not available it returns nothing instead of 0:
andArray.add(new BasicDBObject("Added", new BasicDBObject("$gte", startDate)));
andArray.add(new BasicDBObject("Added", new BasicDBObject("$lte", endDate)));
DBObject where = new BasicDBObject("$match", new BasicDBObject("$and", andArray));
stages.add(where);

DBObject groupFields = new BasicDBObject("_id", "$Added");
groupFields.put("value", new BasicDBObject("$avg", "$sentiment"));
DBObject groupBy = new BasicDBObject("$group", groupFields);
stages.add(groupBy);

DBObject project = new BasicDBObject("_id", 0);
project.put("value", 1);
project.put("Date", "$_id");
stages.add(new BasicDBObject("$project", project));

DBObject sort = new BasicDBObject("$sort", new BasicDBObject("Date", 1));
stages.add(sort);

AggregationOutput output = collectionG.aggregate(stages);
Now I want the value 0 for the dates that are not available in the collection. For example, consider 2014-11-21 in the following:
[ { "value" : 6.0 , "Date" : { "$date" : "2014-11-19T04:00:00.000Z"}} , { "value" : 20.0 , "Date" : { "$date" : "2014-11-20T04:00:00.000Z"}},{ "value" : 0 , "Date" : { "$date" : "2014-11-21T04:00:00.000Z"}}]
instead of :
[ { "value" : 6.0 , "Date" : { "$date" : "2014-11-19T04:00:00.000Z"}} , { "value" : 20.0 , "Date" : { "$date" : "2014-11-20T04:00:00.000Z"}}}]
Is it possible to do that?
Why is checking in Java not reasonable, when setting the average to 0 for 'nothing' is considered reasonable?
Depending on the context of your problem, one solution is to insert dummy records with 0 sentiment.
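If you take the dummy-record approach, the padding can be done with the same legacy driver classes the question already uses. A minimal sketch (the padDays helper and the placeholder marker field are illustrative assumptions):

import java.util.Calendar;
import java.util.Date;
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;

public class SentimentPadding {
    // Insert one zero-sentiment placeholder per day in [startDate, endDate] so
    // the $group stage emits a document even for days that have no tweets.
    // Note: on days that do have data, the extra 0 skews $avg, so either mark
    // and exclude placeholders in the $match stage, or only pad empty days.
    public static void padDays(DBCollection collectionG, Date startDate, Date endDate) {
        Calendar day = Calendar.getInstance();
        day.setTime(startDate);
        while (!day.getTime().after(endDate)) {
            BasicDBObject placeholder = new BasicDBObject("Added", day.getTime())
                    .append("sentiment", 0)
                    .append("placeholder", true); // marker for later exclusion
            collectionG.insert(placeholder);
            day.add(Calendar.DAY_OF_MONTH, 1);
        }
    }
}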

JSON Bulk load with Apache Phoenix

I have a problem with loading data from JSON files. How can I load data from JSON files into a table in HBase?
Here is the JSON structure:
{ "_id" : { "$oid" : "53ba5e86eb07565b53374901"} , "_api_method" : "database.getSchools" , "id" : "0" , "date_insert" : "2014-07-07 11:47:02" , "unixdate" : 1404722822 , "city_id" : "1506490" , "response" : [ 1 , { "id" : 354053 , "title" : "шк. Аджамская"}]};
Help me please!
For your JSON format you cannot use importtsv. I suggest you write a MapReduce job to parse your JSON data and put it into HBase.
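A minimal mapper sketch for that MapReduce route, assuming one JSON document per input line, Jackson for parsing, and a target table with column family "cf" (class name and field mapping are illustrative; wire the job to the table with TableMapReduceUtil):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonToHBaseMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    private final ObjectMapper jsonParser = new ObjectMapper();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line holds one JSON document; strip a trailing ';' if present.
        String line = value.toString().trim();
        if (line.endsWith(";")) {
            line = line.substring(0, line.length() - 1);
        }
        JsonNode doc = jsonParser.readTree(line);

        // Use the Mongo ObjectId as the HBase row key.
        byte[] rowKey = Bytes.toBytes(doc.path("_id").path("$oid").asText());
        Put put = new Put(rowKey);

        // Store a couple of scalar fields; extend the mapping as needed.
        // (On HBase versions before 1.0, use put.add(...) instead of addColumn.)
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("city_id"),
                Bytes.toBytes(doc.path("city_id").asText()));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("date_insert"),
                Bytes.toBytes(doc.path("date_insert").asText()));

        context.write(new ImmutableBytesWritable(rowKey), put);
    }
}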

How to update a document by replacing it with a new document in a MongoDB collection

I am using ObjCMongoDB as a Cocoa wrapper for accessing MongoDB. I am facing difficulty in a scenario where I have to find a document and replace it with a new one. Can anyone help by pointing out the ObjCMongoDB code/API to use?
For example:
{
  "_id" : { "$oid" : "51de4ed737965b2d233f4862"} ,
  "milestone" : "Application 7.1 release" ,
  "pendingtasklist" : [ task1 , task2 , task3]
}
Here I have to replace pendingtasklist with a new list, and the result should be:
{
  "_id" : { "$oid" : "51de4ed737965b2d233f4862"} ,
  "milestone" : "Application 7.1 release" ,
  "someotherlist" : [ task12 , task33 , task32]
}
I have attached the code I am using to achieve this, but without success:
NSError *connectionError = nil;
MongoConnection *dbConn = [MongoConnection connectionForServer:@"127.0.0.1:27017" error:&connectionError];
MongoDBCollection *collection = [dbConn collectionWithName:@"mydb.milestones"];
MongoKeyedPredicate *predicate = [MongoKeyedPredicate predicate];
[predicate keyPath:@"milestone" matches:@"Application 7.1 release"];
MongoUpdateRequest *updateReq = [MongoUpdateRequest updateRequestWithPredicate:predicate firstMatchOnly:YES];
NSDictionary *milestoneDict = @{@"problemlist": @[@"12345", @"112244", @"55543", @"009009"], @"milestone": @"Application 7.1 release"};
[updateReq replaceDocumentWithDictionary:milestoneDict];
BOOL result = [collection updateWithRequest:updateReq error:&connectionError];
Before the update, my collection has documents like this:
{ "_id" : { "$oid" : "51de4ed737965b2d233f4862"} , "milestone" : "Application 7.1 Release" , "problemlist" : [ 12345 , 112244 , 55543]}
{ "_id" : { "$oid" : "51de4ed737965b2d233f4864"} , "milestone" : "Application 7.1 UAT" , "problemlist" : [ 33545 , 7654 , 8767]}
If the value were staying the same, you would just rename the key:
-[MongoUpdateRequest keyPath:renameToKey:]
But since the values are changing, you should just unset the old key and set the new one:
-[MongoUpdateRequest unsetValueForKeyPath:]
-[MongoUpdateRequest keyPath:setValue:]
As I mentioned above you can do this with a single update request.
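Putting both calls into one request against the documents above (a sketch reusing the predicate, collection, and connectionError from the question; check the exact selector spellings against the ObjCMongoDB headers):

MongoUpdateRequest *updateReq = [MongoUpdateRequest updateRequestWithPredicate:predicate firstMatchOnly:YES];
// Drop the old list and create the new one in the same update.
[updateReq unsetValueForKeyPath:@"pendingtasklist"];
[updateReq keyPath:@"someotherlist" setValue:@[@"task12", @"task33", @"task32"]];
BOOL result = [collection updateWithRequest:updateReq error:&connectionError];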
In order to rename a field, you need to remove the old one and add the new one. In this case, you would have to run two separate queries to do this.