How to load JSON data as a map in Spark SQL?

I have JSON data as shown below:
"vScore": {
"300x600": {
"v1": "0.50",
"v2": "0.67",
"v3": "ATF",
"v4": "H2",
"v5": "0.11"
},
"728x90": {
"v1": "0.48",
"v2": "0.57",
"v3": "Unknown",
"v4": "H2",
"v5": "0.51"
},
"300x250": {
"v1": "0.64",
"v2": "0.77",
"v3": "ATF",
"v4": "H2",
"v5": "0.70"
},
I want to load this JSON data as a map, i.e. load vScore into a map so that 300x250 becomes the key and the nested v1...v5 object becomes the value.
How can I do this in Spark SQL with Scala?

You need to load your data using
val data = sqlContext.read.json("file")
You can check how your data was loaded with
data.printSchema()
and then pull out the fields you need with a select query, using
data.select(...)
More: How to parse a JSON file with Spark
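A minimal sketch of the map approach, assuming the snippet above sits under a top-level vScore key in a file named vscore.json (the file name and the string value type are assumptions) and that you are on a Spark version with SparkSession:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
val spark = SparkSession.builder().appName("vscore").getOrCreate()
// Declare vScore as map<string, map<string, string>> so that "300x600",
// "728x90", "300x250", ... become map keys instead of inferred struct fields.
val schema = new StructType()
  .add("vScore", MapType(StringType, MapType(StringType, StringType)))
val df = spark.read
  .schema(schema)
  .option("multiLine", "true") // the sample JSON spans multiple lines
  .json("vscore.json")
df.printSchema()
df.selectExpr("map_keys(vScore) AS sizes").show(false)
df.selectExpr("vScore['300x250'] AS v300x250").show(false)
From there vScore['300x250'] is itself a map, so vScore['300x250']['v1'] gives a single score.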

Related

How to convert JSON to CSV and send it to BigQuery or a Google Cloud bucket

I'm new to NiFi and I want to convert a large amount of JSON data to CSV format.
This is what I am doing at the moment, but it is not giving the expected result.
These are the steps:
Processes to create the access_token and send the request body using InvokeHTTP (this part works fine, so I won't name those processes since they produce the expected result) and get the response body as JSON.
Example of the JSON response:
[
{
"results":[
{
"customer":{
"resourceName":"customers/123456789",
"id":"11111111"
},
"campaign":{
"resourceName":"customers/123456789/campaigns/222456422222",
"name":"asdaasdasdad",
"id":"456456546546"
},
"adGroup":{
"resourceName":"customers/456456456456/adGroups/456456456456",
"id":"456456456546",
"name":"asdasdasdasda"
},
"metrics":{
"clicks":"11",
"costMicros":"43068982",
"impressions":"2079"
},
"segments":{
"device":"DESKTOP",
"date":"2021-11-22"
},
"incomeRangeView":{
"resourceName":"customers/456456456/incomeRangeViews/456456546~456456456"
}
},
{
"customer":{
"resourceName":"customers/123456789",
"id":"11111111"
},
"campaign":{
"resourceName":"customers/123456789/campaigns/222456422222",
"name":"asdasdasdasd",
"id":"456456546546"
},
"adGroup":{
"resourceName":"customers/456456456456/adGroups/456456456456",
"id":"456456456546",
"name":"asdasdasdas"
},
"metrics":{
"clicks":"11",
"costMicros":"43068982",
"impressions":"2079"
},
"segments":{
"device":"DESKTOP",
"date":"2021-11-22"
},
"incomeRangeView":{
"resourceName":"customers/456456456/incomeRangeViews/456456546~456456456"
}
},
....etc....
]
}
]
Now I am using:
===> SplitJson ($[*].results[*]) ==> JoltTransformJSON with this spec:
[{
"operation": "shift",
"spec": {
"customer": {
"id": "customer_id"
},
"campaign": {
"id": "campaign_id",
"name": "campaign_name"
},
"adGroup": {
"id": "ad_group_id",
"name": "ad_group_name"
},
"metrics": {
"clicks": "clicks",
"costMicros": "cost",
"impressions": "impressions"
},
"segments": {
"device": "device",
"date": "date"
},
"incomeRangeView": {
"resourceName": "keywords_id"
}
}
}]
==>> MergeContent (here is the problem, which I don't know how to fix)
Merge Strategy: Defragment
Merge Format: Binary Concatenation
Attribute Strategy: Keep Only Common Attributes
Maximum number of Bins: 5 (I tried 10, same result)
Delimiter Strategy: Text
Header: [
Footer: ]
Demarcator: ,
What is the result I get?
I get a JSON file that contains only part of the JSON data.
Example: I have 50k customer_ids, and I would like to send this data into a BigQuery table with all the ids under the same field "customer_id".
MergeContent combines the split JSON files, but I still get about 10k customer_ids per file, i.e. I end up with 5 files and not 1 file with 50k customer_ids.
After the MergeContent I use ==>> ConvertRecord with these settings:
Record Reader: JsonTreeReader (Schema Access Strategy: Infer Schema)
Record Writer: CsvRecordWriter (
Schema Write Strategy: Do Not Write Schema
Schema Access Strategy: Inherit Record Schema
CSV Format: Microsoft Excel
Include Header Line: true
Character Set: UTF-8
)
==>> UpdateAttribute (custom property filename: ${filename}.csv) ==>> PutGCSObject, which puts the data into the Google bucket (this step works fine; I am able to put files there).
With this approach I am UNABLE to send the data to BigQuery. After MergeContent I tried using PutBigQueryBatch and ran this command in the bq shell to get the schema I need:
bq show --format=prettyjson some_data_set.some_table_in_that_data_set | jq '.schema.fields'
I filled in all the fields as needed, and for Load file type I tried NEWLINE_DELIMITED_JSON, or CSV when I converted the data to CSV. I get no errors, but no data is uploaded into the table.
What am I doing wrong? I basically want to map the data in such a way that each field's data ends up under the same field name.
The trick you are missing is using Records.
Instead of using X>SplitJson>JoltTransformJson>Merge>Convert>X, try just X>JoltTransformRecord>X with a JSON Reader and a CSV Writer. This avoids a lot of unnecessary splitting and merging work.
If you really need to split (and you should avoid splitting and merging unless totally necessary), you can use MergeRecord instead - again with a JSON Reader and CSV Writer. This would make your flow X>Split>Jolt>MergeRecord>X.
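As a rough sketch, that JoltTransformRecord step could be configured like this (treat the property names as assumptions to verify against your NiFi version, and note the spec may need an extra "results": { "*": { ... } } level around it, because each record read from the raw response is a whole {"results": [...]} object rather than a single result):
JoltTransformRecord
Record Reader: JsonTreeReader (Schema Access Strategy: Infer Schema)
Record Writer: the same CsvRecordWriter settings you already use (Include Header Line: true)
Jolt Transformation DSL: Chain
Jolt Specification: the shift spec shown above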

How to load a jsonl file into BigQuery when the file has mixed data fields as columns

During my workflow, after extracting the data from the API, the JSON has the following structure:
[
{
"fields":
[
{
"meta": {
"app_type": "ios"
},
"name": "app_id",
"value": 100
},
{
"meta": {},
"name": "country",
"value": "AE"
},
{
"meta": {
"name": "Top"
},
"name": "position",
"value": 1
}
],
"metrics": {
"click": 1,
"price": 1,
"count": 1
}
}
]
It is then stored as .jsonl and put on GCS. However, when I load it into BigQuery for further extraction, the automatic schema inference returns the following error:
Error while reading data, error message: JSON parsing error in row starting at position 0: Could not convert value to string. Field: value; Value: 100
I want to convert it into the following structure:
app_type | app_id | country | position | click | price | count
ios      | 100    | AE      | Top      | 1     | 1     | 1
Is there a way to define a manual schema on BigQuery to achieve this result? Or do I have to preprocess the jsonl file before putting it into BigQuery?
One of the limitations when loading JSON data from GCS to BigQuery is that it does not support maps or dictionaries in JSON.
An invalid example would be:
"metrics": {
"click": 1,
"price": 1,
"count": 1
}
Your jsonl file should be something like this:
{"app_type":"ios","app_id":"100","country":"AE","position":"Top","click":"1","price":"1","count":"1"}
I already tested it and it works fine.
So wherever you handle the conversion of the JSON files to jsonl files and their storage on GCS, you will have to do some preprocessing (a sketch of one way to do this follows below).
You probably have two options:
precreate the target table with app_id as an INTEGER field
preprocess the JSON file and enclose 100 in quotes, like "100"
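As an illustration only, here is a minimal preprocessing sketch in Scala with circe; the choice of Scala/circe, the rawApiResponse placeholder, and the rule of preferring meta.name over value (which is how the sample row above gets position = Top) are all assumptions:
import io.circe._
import io.circe.parser._
// placeholder: the raw JSON array returned by the API, shaped like the sample above
val rawApiResponse: String = ???
// Flatten one record: each entry of "fields" becomes a top-level key (taking meta.name
// over value when present, and meta.app_type as its own column), and the "metrics"
// object is merged in as-is.
def flatten(record: Json): Json = {
  val c = record.hcursor
  val fromFields: List[(String, Json)] =
    c.downField("fields").as[List[Json]].getOrElse(Nil).flatMap { field =>
      val fc    = field.hcursor
      val name  = fc.get[String]("name").getOrElse("unknown")
      val value = fc.downField("meta").downField("name").focus
        .orElse(fc.downField("value").focus)
        .getOrElse(Json.Null)
      val appType = fc.downField("meta").downField("app_type").as[String].toOption
        .map(t => "app_type" -> Json.fromString(t)).toList
      (name -> value) :: appType
    }
  val fromMetrics: List[(String, Json)] =
    c.downField("metrics").focus.flatMap(_.asObject).map(_.toList).getOrElse(Nil)
  // numeric values stay numeric here; either pre-create the table with INTEGER
  // columns or convert them to strings as in the example jsonl line above
  Json.fromFields(fromFields ++ fromMetrics)
}
// one flat object per line: the NEWLINE_DELIMITED_JSON shape BigQuery can load
val jsonl: String = parse(rawApiResponse)
  .getOrElse(Json.Null)
  .asArray.getOrElse(Vector.empty)
  .map(r => flatten(r).noSpaces)
  .mkString("\n")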

BigQuery: Convert a JSON stored as string to a struct dynamically

I have JSONs stored in a string field in BQ in the following manner:
select 12345 some_id, '{"163": {"sign": 1, "14": {"20": {"sign": 0}}, "13": {"28": {"sign": 1}}},"154": {"sign": 1, "12": {"21": {"sign": 1}}}}' as x
I want to export this data to a JSON file, but the JSON string is exported as follows:
[
{
"some_id": "12345",
"x": "{\"163\": {\"sign\": 1, \"14\": {\"20\": {\"sign\": 0}}, \"13\": {\"28\": {\"sign\": 1}}},\"154\": {\"sign\": 1, \"12\": {\"21\": {\"sign\": 1}}}}"
}
]
I tried the BQ JSON functions, as well as the JS temp function that makes use of stringify, which appears in many threads, but none of them handles dynamic JSON keys. Creative (or even simple) solutions, anyone?

Filter an object array to modify json with circe

I am evaluating Circe and couldn't figure out how to filter arrays in order to transform a JSON document. I read the guide on its website and the API docs, but still have no clue. Help much appreciated.
Sample data:
{
"Department" : "HR",
"Employees" :[{ "name": "abc", "age": 25 }, {"name":"def", "age" : 30 }]
}
Task:
How can I filter Employees to transform this JSON into another JSON, for example one that keeps only the employees older than 50?
In case you ask: for some reason I can't filter at the data source before the JSON is generated.
Thanks
One possible way of doing this is:
import io.circe._
import io.circe.parser._
val data = """{"Department" : "HR","Employees" :[{ "name": "abc", "age": 25 }, {"name":"def", "age":30}]}"""
// keep only the employees older than the threshold (26 here, so the sample data gives a non-empty result)
def ageFilter(j: Json): Json = j.withArray { x =>
  Json.fromValues(x.filter(_.hcursor.downField("age").as[Int].map(_ > 26).getOrElse(false)))
}
// focus on "Employees", apply the filter there, then climb back to the full document
val y: Either[ParsingFailure, Json] =
  parse(data).map(_.hcursor.downField("Employees").withFocus(ageFilter).top.get)
println(s"$y") // Right(...) now contains only the employee "def"

Is it possible to turn an array returned by the Mongo GeoNear command (using Ruby/Rails) into a Plucky object?

As a total newbie I have been trying to get the geoNear command working in my Rails application, and it appears to be working fine. The major annoyance for me is that it returns a structure with plain string keys rather than an object with methods I can call to pull out data.
Having dug around, I understand that MongoMapper uses Plucky to turn a query result into a friendly object which can be handled easily, but I haven't been able to find out how to transform the result of my geoNear query into a Plucky object.
My questions are:
(a) Is it possible to turn this into a Plucky object, and how do I do that?
(b) If it is not possible, how can I most simply and systematically extract each record and each field?
Here is the query in my controller:
@mult = 3963 * (3.14159265 / 180) # scale to miles on earth
@results = @db.command({'geoNear' => "places", 'near' => @search.coordinates, 'distanceMultiplier' => @mult, 'spherical' => true})
Here is the object I'm getting back (with document content removed for simplicity):
{"ns"=>"myapp-development.places", "near"=>"1001110101110101100100110001100010100010000010111010", "results"=>[{"dis"=>0.04356444023196527, "obj"=>{"_id"=>BSON::ObjectId('4ee6a7d210a81f05fe000001'),...}}], "stats"=>{"time"=>0, "btreelocs"=>0, "nscanned"=>1, "objectsLoaded"=>1, "avgDistance"=>0.04356444023196527, "maxDistance"=>0.0006301239824196907}, "ok"=>1.0}
Help is much appreciated!!
OK, so let's say you store the results in a variable called places_near:
places_near = t.command({'geoNear' => "places", 'near' => [50,50], 'distanceMultiplier' => 1, 'spherical' => true})
This command returns a hash that has a key (results) which maps to a list of results for the query. The returned document looks like this:
{
"ns": "test.places",
"near": "1100110000001111110000001111110000001111110000001111",
"results": [
{
"dis": 69.29646421910687,
"obj": {
"_id": ObjectId("4b8bd6b93b83c574d8760280"),
"y": [
1,
1
],
"category": "Coffee"
}
},
{
"dis": 69.29646421910687,
"obj": {
"_id": ObjectId("4b8bd6b03b83c574d876027f"),
"y": [
1,
1
]
}
}
],
"stats": {
"time": 0,
"btreelocs": 1,
"btreelocs": 1,
"nscanned": 2,
"nscanned": 2,
"objectsLoaded": 2,
"objectsLoaded": 2,
"avgDistance": 69.29646421910687
},
"ok": 1
}
To iterate over the responses, just iterate as you would over any list in Ruby:
places_near['results'].each do |result|
  # each result is a plain hash; pull out the distance and the stored document
  distance = result['dis']
  place    = result['obj']   # e.g. place['_id'], place['category']
end