JSON Bulk load with Apache Phoenix

I have a problem with loading data from JSON files. How can I load data from JSON files into a table in HBase?
Here is the JSON structure:
{ "_id" : { "$oid" : "53ba5e86eb07565b53374901"} , "_api_method" : "database.getSchools" , "id" : "0" , "date_insert" : "2014-07-07 11:47:02" , "unixdate" : 1404722822 , "city_id" : "1506490" , "response" : [ 1 , { "id" : 354053 , "title" : "шк. Аджамская"}]};
Help me please!

For your JSON format, you cannot use ImportTsv. I suggest you write a MapReduce job to parse your JSON data and put it into HBase.
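If a full MapReduce job feels heavy, here is a minimal sketch of the same idea using the happybase Python client instead (this is an alternative illustration, not something the answer above describes); the table name 'schools', the column family 'd', the one-JSON-document-per-line input format, and the choice of fields to write are all assumptions:
import json

import happybase  # Thrift-based HBase client; needs the HBase Thrift server running

# Assumed layout: an HBase table named 'schools' with a single column family 'd'.
connection = happybase.Connection('localhost')
table = connection.table('schools')

with open('input.json') as f:
    for line in f:
        line = line.strip().rstrip(';')   # the sample line ends with a stray ';'
        if not line:
            continue
        doc = json.loads(line)
        row_key = doc['_id']['$oid']      # reuse the Mongo ObjectId as the row key
        # Only a few of the sample fields are written here, for brevity.
        table.put(row_key.encode('utf-8'), {
            b'd:api_method': doc['_api_method'].encode('utf-8'),
            b'd:city_id': doc['city_id'].encode('utf-8'),
            b'd:date_insert': doc['date_insert'].encode('utf-8'),
            b'd:response': json.dumps(doc['response'], ensure_ascii=False).encode('utf-8'),
        })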

Related

Is there a way to match an Avro schema with BigQuery and Bigtable?

I'd like to import BigQuery data into Bigtable using Google Cloud Composer.
Exporting BigQuery rows in Avro format to GCS was successful. However, importing the Avro data into Bigtable was not.
The error says
Caused by: org.apache.avro.AvroTypeException: Found Root, expecting com.google.cloud.teleport.bigtable.BigtableRow, missing required field key
I guess the schema between bigquery and bigtable should match each other. But I have no idea how to do this.
For every record read from the Avro files:
Attributes present in the files and in the table are loaded into the table.
Attributes present in the file but not in the table are subject to ignore_unknown_fields.
Attributes that exist in the table but not in the file will use their default value, if there is one set.
The links below are helpful.
[1] https://cloud.google.com/dataflow/docs/guides/templates/provided-batch#cloud-storage-avro-to-bigtable
[2] https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/resources/schema/avro/bigtable.avsc
[3] Avro to BigTable - Schema issue?
For those of you who still have this problem because, like me, you are not familiar with Avro, here is one working schema transformation that I found after some tinkering.
For example, suppose you have a BigQuery table with a user_id column and feature columns such as channel, zip_code, and history.
If you want to use user_id as the Bigtable row key and ingest all the other columns, here is example code to encode them into an Avro file.
import json
from datetime import datetime

from avro.schema import Parse
from avro.io import DatumWriter
from avro.datafile import DataFileWriter

# Avro schema expected by the "Cloud Storage Avro to Bigtable" Dataflow template
bigtable_schema = {
    "name": "BigtableRow",
    "type": "record",
    "namespace": "com.google.cloud.teleport.bigtable",
    "fields": [
        {"name": "key", "type": "bytes"},
        {"name": "cells",
         "type": {
             "type": "array",
             "items": {
                 "name": "BigtableCell",
                 "type": "record",
                 "fields": [
                     {"name": "family", "type": "string"},
                     {"name": "qualifier", "type": "bytes"},
                     {"name": "timestamp", "type": "long", "logicalType": "timestamp-micros"},
                     {"name": "value", "type": "bytes"}
                 ]
             }
         }}
    ]
}
parsed_schema = Parse(json.dumps(bigtable_schema))

row_key = 'user_id'
family_name = 'feature_name'
feature_list = ['channel', 'zip_code', 'history']

# df is assumed to be a pandas DataFrame holding the BigQuery table contents
with open('features.avro', 'wb') as f:
    writer = DataFileWriter(f, DatumWriter(), parsed_schema)
    for item in df.iterrows():
        row = item[1]
        ts = int(datetime.now().timestamp()) * 1000 * 1000  # current time in microseconds
        for feat in feature_list:
            writer.append({
                "key": row[row_key].encode('utf-8'),
                "cells": [{"family": family_name,
                           "qualifier": feat.encode('utf-8'),
                           "timestamp": ts,
                           "value": str(row[feat]).encode('utf-8')}]
            })
    writer.close()
Then you can use the Dataflow template job to run the ingestion.
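As a rough sketch of that last step (my own illustration, not code from the answer), the "Cloud Storage Avro to Cloud Bigtable" template can be launched from Python through the Dataflow REST API; the project, instance, table, and bucket names below are placeholders, and the template path and parameter names are assumed to match the template documentation linked above as [1]:
from googleapiclient.discovery import build

dataflow = build('dataflow', 'v1b3')
request = dataflow.projects().templates().launch(
    projectId='my-project',  # placeholder
    gcsPath='gs://dataflow-templates/latest/GCS_Avro_to_Cloud_Bigtable',
    body={
        'jobName': 'avro-to-bigtable',
        'parameters': {
            'bigtableProjectId': 'my-project',    # placeholder
            'bigtableInstanceId': 'my-instance',  # placeholder
            'bigtableTableId': 'my-table',        # placeholder
            'inputFilePattern': 'gs://my-bucket/features.avro',  # the file written above
        },
    },
)
print(request.execute())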
Complete code can be found here: https://github.com/mitbal/sidu/blob/master/bigquery_to_bigtable.ipynb

BigQuery: --[no]use_avro_logical_types flag doesn't work

I am trying to use the bq command with the --[no]use_avro_logical_types flag to load Avro files into a BigQuery table which does not exist before executing the command. The Avro schema contains a timestamp-millis logical type value. When the command is executed, a new table is created, but the type of the column becomes INTEGER.
This is a recently released feature, so I cannot find examples and I don't know what I am missing. Could anyone give me a good example?
My Avro schema looks like the following:
...
}, {
  "name" : "timestamp",
  "type" : [ "null", "long" ],
  "default" : null,
  "logicalType" : [ "null", "timestamp-millis" ]
}, {
...
And the command being executed is this:
bq load --source_format=AVRO --use_avro_logical_types <table> <path/to/file>
To use the timestamp-millis logical type, you can specify the field in the following way:
{
  "name" : "timestamp",
  "type" : {"type": "long", "logicalType" : "timestamp-millis"}
}
In order to provide an optional 'null' value, you can try out the following spec:
{
  "name" : "timestamp",
  "type" : ["null", {"type" : "long", "logicalType" : "timestamp-millis"}]
}
For a full list of supported Avro logical types please refer to the Avro spec: https://avro.apache.org/docs/1.8.0/spec.html#Logical+Types.
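If you want to sanity-check that second spec before loading (a sketch I am adding, not part of the answer), you can parse a schema that uses it with the same avro Python library shown earlier; the record name 'Event' is just an example:
import json

from avro.schema import Parse

# An optional timestamp-millis field, with the logicalType nested inside the
# type as described above rather than declared at the field level.
schema = {
    "type": "record",
    "name": "Event",
    "fields": [
        {
            "name": "timestamp",
            "type": ["null", {"type": "long", "logicalType": "timestamp-millis"}],
            "default": None,
        }
    ],
}
parsed = Parse(json.dumps(schema))
print(parsed)  # prints the canonical form of the schema if it is valid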
According to https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro, the Avro type timestamp-millis is converted to an INTEGER once loaded into BigQuery.

Hive: create a JSON array that does not contain duplicates

I want to create an array of JSONs that does not contain duplicates. I used LATERAL VIEW EXPLODE to break up the initial array, and now I want to group the JSON strings I received and create merged JSONs based on a key.
For example, if I have:
Col1 :
{"key" : ke , "value" : 1 }
{"key" : ke , "value" : 2 }
{"key" : ke1 , "value" : 5 }
I would like to have
{"key" : ke , "value" : 3 }
{"key" : ke1 , "value" : 5 }
Can you help me?
select concat('{"key":"',jt.key,'","value":',sum(jt.value),'}')
from mytable t
lateral view json_tuple(Col1, 'key', 'value') jt as key,value
group by jt.key
;
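For reference only (this plain-Python illustration is my own addition, not part of the Hive answer), the merge logic that the query implements, summing value per key and emitting one merged JSON per key, looks like this:
import json
from collections import defaultdict

# Sample rows as they would appear in Col1 after the LATERAL VIEW EXPLODE
rows = [
    '{"key": "ke", "value": 1}',
    '{"key": "ke", "value": 2}',
    '{"key": "ke1", "value": 5}',
]

totals = defaultdict(int)
for raw in rows:
    record = json.loads(raw)
    totals[record["key"]] += record["value"]

# One merged JSON per key, mirroring what the Hive query's GROUP BY produces
merged = [json.dumps({"key": k, "value": v}) for k, v in totals.items()]
print(merged)  # ['{"key": "ke", "value": 3}', '{"key": "ke1", "value": 5}']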

Merging 2 rows in a Pentaho Kettle transformation

My KTR is:
MongoDB Json Input gives the JSON as follows:
{ "_id" : { "$oid" : "525cf3a70fafa305d949ede0"} , "asset" :
"RO2500AS1" , "Salt Rejection" : "82%" , "Salt Passage" : "18%" ,
"Recovery" : "56.33%" , "Concentration Factor" : "2.3" , "status" :
"critical" , "Flow Alarm" : "High Flow"}
And one Table input which returns 2 rows:
In the Stream Lookup step, the key to look up is configured as asset = AssetName.
My final output is returning 2 JSONs:
{"data":[{"Estimated Cost":"USD 15","AssetName":"RO2500AS1","Description":"Pump Maintenance","Index":1,"json":"{ \"_id\" : { \"$oid\" : \"525cf3a70fafa305d949ede0\"} , \"asset\" : \"RO2500AS1\" , \"Salt Rejection\" : \"82%\" , \"Salt Passage\" : \"18%\" , \"Recovery\" : \"56.33%\" , \"Concentration Factor\" : \"2.3\" , \"status\" : \"critical\" , \"Flow Alarm\" : \"High Flow\"}","Type":"Service","DeadLine":"13 November 2013"}]}
{"data":[{"Estimated Cost":"USD 35","AssetName":"RO2500AS1","Description":"Heat Sensor","Index":2,"json":"{ \"_id\" : { \"$oid\" : \"525cf3a70fafa305d949ede0\"} , \"asset\" : \"RO2500AS1\" , \"Salt Rejection\" : \"82%\" , \"Salt Passage\" : \"18%\" , \"Recovery\" : \"56.33%\" , \"Concentration Factor\" : \"2.3\" , \"status\" : \"critical\" , \"Flow Alarm\" : \"High Flow\"}","Type":"Replacement","DeadLine":"26 November 2013"}]}
I want my final JSON output to merge the results and show something like:
{"data": [{"Estimated Cost":"USD 15", "AssetName":"RO2500AS1", "Description":"Pump Maintenance", "Index":1, "Type":"Service", "DeadLine":"13 November 2013"}, {"Estimated Cost":"USD 35", "AssetName":"RO2500AS1", "Description":"Heat Sensor", "Index":2, "Type":"Replacement", "DeadLine":"26 November 2013"}], "json":{ "_id" : "525cf3a70fafa305d949ede0" , "asset" : "RO2500AS1" , "Salt Rejection" : "82%" , "Salt Passage" : "18%" , "Recovery" : "56.33%" , "Concentration Factor" : "2.3" , "status" : "critical" , "Flow Alarm" : "High Flow"}}
which means merging the 2 rows.
Can anybody help, please?
You can use a Merge Join step after the Table input. That will merge the rows from the MySQL output and you will have only one JSON as output.
You would want to use the Merge step for your purpose. Don't forget to sort the input streams.
Note: in this step, rows are expected to be sorted on the specified key fields. When using the Sort step, this works fine. When you sort the data outside of PDI, you may run into issues with the internal case-sensitive/insensitive flag.
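Purely to make the intended merge concrete (this Python sketch is my own addition and is not what the Merge Join step does internally), the reshaping of the two row-level JSONs into the single desired document looks like this:
import json

# The two rows produced by the current transformation (the embedded "json" string field is omitted).
row1 = {"Estimated Cost": "USD 15", "AssetName": "RO2500AS1", "Description": "Pump Maintenance",
        "Index": 1, "Type": "Service", "DeadLine": "13 November 2013"}
row2 = {"Estimated Cost": "USD 35", "AssetName": "RO2500AS1", "Description": "Heat Sensor",
        "Index": 2, "Type": "Replacement", "DeadLine": "26 November 2013"}

# The MongoDB document shared by both rows.
mongo_doc = {"_id": "525cf3a70fafa305d949ede0", "asset": "RO2500AS1", "Salt Rejection": "82%",
             "Salt Passage": "18%", "Recovery": "56.33%", "Concentration Factor": "2.3",
             "status": "critical", "Flow Alarm": "High Flow"}

# Both rows go into one "data" array; the shared document appears once under "json".
merged = {"data": [row1, row2], "json": mongo_doc}
print(json.dumps(merged, indent=2))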

How to update a document by replacing it with a new document in a MongoDB collection

I am using ObjCMongoDB as a Cocoa wrapper for accessing MongoDB. I am facing difficulty in a scenario where I have to find a document and replace it with a new document. Can anyone help me by pointing out the ObjCMongoDB code/API to use?
For example:
{
"_id" : { "$oid" : "51de4ed737965b2d233f4862"} ,
"milestone" : "Application 7.1 release" ,
"pendingtasklist" : [ task1 , task2 , task3]
}
Here I have to replace pendingtasklist with a new list, and the result should be:
{
"_id" : { "$oid" : "51de4ed737965b2d233f4862"} ,
"milestone" : "Application 7.1 release" ,
"someotherlist" : [ task12 , task33 , task32]
}
I have attached the code I am using to achieve this, but without success:
NSError *connectionError = nil;
MongoConnection *dbConn = [MongoConnection connectionForServer:@"127.0.0.1:27017" error:&connectionError];
MongoDBCollection *collection = [dbConn collectionWithName:@"mydb.milestones"];
MongoKeyedPredicate *predicate = [MongoKeyedPredicate predicate];
[predicate keyPath:@"milestone" matches:@"Application 7.1 release"];
MongoUpdateRequest *updateReq = [MongoUpdateRequest updateRequestWithPredicate:predicate firstMatchOnly:YES];
NSDictionary *milestoneDict = @{@"problemlist": @[@"12345", @"112244", @"55543", @"009009"], @"milestone": @"Application 7.1 release"};
[updateReq replaceDocumentWithDictionary:milestoneDict];
BOOL result = [collection updateWithRequest:updateReq error:&connectionError];
Before the update, my collection has documents like this:
{ "_id" : { "$oid" : "51de4ed737965b2d233f4862"} , "milestone" : "Application 7.1 Release" , "problemlist" : [ 12345 , 112244 , 55543]}
{ "_id" : { "$oid" : "51de4ed737965b2d233f4864"} , "milestone" : "Application 7.1 UAT" , "problemlist" : [ 33545 , 7654 , 8767]}
If the value were staying the same, you would just rename the key:
-[MongoUpdateRequest keyPath:renameToKey:]
But since the values are changing, you should just unset the old key and set the new one.
-[MongoUpdateRequest unsetValueForKeyPath:]
-[MongoUpdateRequest keyPath:setValue:]
As I mentioned above, you can do this with a single update request.
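For reference only (the question is about ObjCMongoDB, not Python), the underlying MongoDB update operators that those calls map to look like this in pymongo, using the database, collection, and field names from the question:
from pymongo import MongoClient

collection = MongoClient('127.0.0.1', 27017).mydb.milestones

# Unset the old list and set the new one in a single update request.
collection.update_one(
    {'milestone': 'Application 7.1 release'},
    {'$unset': {'pendingtasklist': ''},
     '$set': {'someotherlist': ['task12', 'task33', 'task32']}},
)

# If the values were staying the same, a plain rename would do instead.
collection.update_one(
    {'milestone': 'Application 7.1 release'},
    {'$rename': {'pendingtasklist': 'someotherlist'}},
)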
In order to rename a field, you need to remove the old one, and add the new one. In this case, you would have to run two separate queries for this.