AWS Glue / Hive struct with undetermined struct - hive

Adding data to a AWS Glue table where one of the columns is a struct where one of the values has undetermined form.
More specifically there's a known key called 'name', that is a string and another called 'metadata' that can be a dict with any structure.
Ex:
# Row 1
{
"name": "Jane",
"metadata": {
"foo": 123,
"bar": "something"
}
}
# Row 2
{
"name": "Bill",
"metadata": {
"baz": "something else"
}
}
Note how metadata is a different dictionary in the two entries.
How can this be specified as a struct?
struct<
name:string,
metadata:?
>

Ended up doing what I mentioned in the comment, which is to make the column a string and have the JSON blob serialized to string.
SQL queries will then need to deserialize the JSON blob, which is supported in several different implementations, including AWS Athena (the one I'm using).

Related

How to extract a field from an array of JSON objects in AWS Athena?

I have the following JSON data structure in a column in AWS Athena:
[
{
"event_type": "application_state_transition",
"data": {
"event_id": "-3368023833341021830"
}
},
{
"event_type": "application_state_transition",
"data": {
"event_id": "5692882176024811076"
}
}
]
I would like to somehow extract the values of event_id field, e.g. in the form of a list:
["-3368023833341021830", "5692882176024811076"]
(Though I don't insist on exactly this as long as I can get my event IDs.)
I wanted to use the JSON_EXTRACT function and thought it uses the very same syntax as jq. In jq, I can easily get what I want using the following query syntax:
.[].data.event_id
However, in AWS Athena this results in an error, as apparently the syntax is not entirely compatible with jq. Is there an alternative way to achieve the result I want?
JSON_EXTRACT supports quite limited set of json paths. Depending on Athena engine version you can either process column by casting it to array of maps and processing this array via array functions:
-- sample data
with dataset(json_col) as (
values ('[
{
"event_type": "application_state_transition",
"data": {
"event_id": "-3368023833341021830"
}
},
{
"event_type": "application_state_transition",
"data": {
"event_id": "5692882176024811076"
}
}
]')
)
-- query
select transform(
cast(json_parse(json_col) as array(map(varchar, json))),
m -> json_extract(m['data'], '$.event_id'))
from dataset;
Output:
_col0
["-3368023833341021830", "5692882176024811076"]
Or for 3rd Athena engine version you can try using Trino's json_query:
-- query
select JSON_QUERY(json_col, 'lax $[*].data.event_id' WITH ARRAY WRAPPER)
from dataset;
Note that return type of two will differ - in first case you will have array(json) and in the second one - just varchar.

How to map Elasticsearch Spring Data AggregationsContainer contents to custom model?

I am using Elsaticsearch Spring Data. I have a custom repository that uses ElasticsearchOperations based on examples on docs. I need some aggregation query results and I successfully get the intended results. but I need to map those results to a model. But currently I'm unable to access contents of AggregationsContainer.
override fun getStats(startTime: Long, endTime: Long, pageable: Pageable): AggregationsContainer<*>?
{
val query: Query = NativeSearchQueryBuilder()
.withQuery(QueryBuilders.rangeQuery("time").from(startTime).to(endTime))
.withAggregations(AggregationBuilders.sum("discount").field("discount"))
.withAggregations(AggregationBuilders.sum("price").field("price"))
.withPageable(pageable)
.build()
val searchHits: SearchHits<Product> = operations.search(query, Product::class.java)
return searchHits.aggregations
}
I return the result of the following code:
val stats = repository.getTotalStats(before, currentTime, pageable)?.aggregations()
the result is :
{
"asMap": {
"discount": {
"name": "discount",
"metadata": null,
"value": 8000.0,
"valueAsString": "8000.0",
"type": "sum",
"fragment": true
},
"price": {
"name": "price",
"metadata": null,
"value": 9000.0,
"valueAsString": "9000.0",
"type": "sum",
"fragment": true
}
},
"fragment": true
}
How can I convert above output to an intended output model like following? as I tested contents of aggregations() are inaccessible and the type is Any :
{
"priceSum":9000.0,
"discountSum":8000
}
There is no data model in the Elasticsearch RestHighLevelClient classes for aggregations, and there is no on in Spring Data Elasticsearch. Therefore the original Aggregations object is returned to the caller (contained in that AggregationContainer, because that will change with new new client implementation, and then the container will hold a different object).
You have to parse this by yourself, I had something in the answer of another question (https://stackoverflow.com/a/63105356/4393565). The interesting thing for you is the last codeblock where the aggregations are passed. You basically have to iterate over the elements, cast them to the appropriate type and evaluate them.

How to convert Json to CSV and send it to big query or google cloud bucket

I`m new to nifi and I want to convert big amount of json data to csv format.
This is what I am doing at the moment but it is not the expected result.
These are the steps:
processes to create access_token and send request body using InvokeHTTP(This part works fine I wont name the processes since this is the expected result) and getting the response body in json.
Example of json response:
[
{
"results":[
{
"customer":{
"resourceName":"customers/123456789",
"id":"11111111"
},
"campaign":{
"resourceName":"customers/123456789/campaigns/222456422222",
"name":"asdaasdasdad",
"id":"456456546546"
},
"adGroup":{
"resourceName":"customers/456456456456/adGroups/456456456456",
"id":"456456456546",
"name":"asdasdasdasda"
},
"metrics":{
"clicks":"11",
"costMicros":"43068982",
"impressions":"2079"
},
"segments":{
"device":"DESKTOP",
"date":"2021-11-22"
},
"incomeRangeView":{
"resourceName":"customers/456456456/incomeRangeViews/456456546~456456456"
}
},
{
"customer":{
"resourceName":"customers/123456789",
"id":"11111111"
},
"campaign":{
"resourceName":"customers/123456789/campaigns/222456422222",
"name":"asdasdasdasd",
"id":"456456546546"
},
"adGroup":{
"resourceName":"customers/456456456456/adGroups/456456456456",
"id":"456456456546",
"name":"asdasdasdas"
},
"metrics":{
"clicks":"11",
"costMicros":"43068982",
"impressions":"2079"
},
"segments":{
"device":"DESKTOP",
"date":"2021-11-22"
},
"incomeRangeView":{
"resourceName":"customers/456456456/incomeRangeViews/456456546~456456456"
}
},
....etc....
]
}
]
Now I am using:
===>SplitJson ($[].results[])==>JoltTransformJSON with this spec:
[{
"operation": "shift",
"spec": {
"customer": {
"id": "customer_id"
},
"campaign": {
"id": "campaign_id",
"name": "campaign_name"
},
"adGroup": {
"id": "ad_group_id",
"name": "ad_group_name"
},
"metrics": {
"clicks": "clicks",
"costMicros": "cost",
"impressions": "impressions"
},
"segments": {
"device": "device",
"date": "date"
},
"incomeRangeView": {
"resourceName": "keywords_id"
}
}
}]
==>> MergeContent( here is the problem which I don`t know how to fix)
Merge Strategy: Defragment
Merge Format: Binary Concatnation
Attribute Strategy Keep Only Common Attributes
Maximum number of Bins 5 (I tried 10 same result)
Delimiter Strategy: Text
Header: [
Footer: ]
Demarcator: ,
What is the result I get?
I get a json file that has parts of the json data
Example: I have 50k customer_ids in 1 json file so I would like to send this data into big query table and have all the ids under the same field "customer_id".
The MergeContent uses the split json files and combines them but I will still get 10k customer_ids for each file i.e. I have 5 files and not 1 file with 50k customer_ids
After the MergeContent I use ==>>ConvertRecord with these settings:
Record Reader JsonTreeReader (Schema Access Strategy: InferSchema)
Record Writer CsvRecordWriter (
Schema Write Strategy: Do Not Write Schema
Schema Access Strategy: Inherit Record Schema
CSV Format: Microsoft Excel
Include Header Line: true
Character Set UTF-8
)
==>>UpdateAttribute (custom prop: filename: ${filename}.csv) ==>> PutGCSObject(and put the data into the google bucket (this step works fine- I am able to put files there))
With this approach I am UNABLE to send data to big query(After MergeContent I tried using PutBigQueryBatch and used this command in bq sheel to get the schema I need:
bq show --format=prettyjson some_data_set.some_table_in_that_data_set | jq '.schema.fields'
I filled all the fields as needed and Load file type: I tried NEWLINE_DELIMITED_JSON or CSV if I converted it to CSV (I am not getting errors but no data is uploaded into the table)
)
What am I doing wrong? I basically want to map the data in such a way that each fields data will be under the same field name
The trick you are missing is using Records.
Instead of using X>SplitJson>JoltTransformJson>Merge>Convert>X, try just X>JoltTransformRecord>X with a JSON Reader and a CSV Writer. This skips a lot of inefficiency.
If you really need to split (and you should avoid splitting and merging unless totally necessary), you can use MergeRecord instead - again with a JSON Reader and CSV Writer. This would make your flow X>Split>Jolt>MergeRecord>X.

BigQuery: Get field names of a STRUCT

I have some data in a STRUCT in BigQuery. Below I have visualised an example of the data as JSON:
{
...
siblings: {
david: { a: 1 }
sarah: { b: 1, c: 1 }
}
...
}
I want to produce a field from a query that resembles ["david", "sarah"]. Essentially I just want to get the keys from the STRUCT (object). Note that every user will have different key names in the siblings STRUCT.
Is this possible in BigQuery?
Thanks,
A
Your structs schema must be consistent throughout the table. They can't change keys because they're part of the table schema. To get the keys you simply take a look at the table schema.
If values change, they're probably values in an array - I guess you might have something like this:
WITH t AS (
SELECT 1 AS id, [STRUCT('david' AS name, 33 as age), ('sarah', 42)] AS siblings
union all
SELECT 2, [('ken', 19), ('ryu',21), ('chun li',23)]
)
SELECT * FROM t
If you tried to introduce new keys in the second row or within the array, you'd get an error Array elements of types {...} do not have a common supertype at ....
The first element of the above example in json representation looks like this:
{
"id": "1",
"siblings": [
{
"name": "david",
"age": "33"
},
{
"name": "sarah",
"age": "42"
}
]
}

Cloudant search document by attributes of nested objects

My documents in cloudant have the following structure
{
"_id" : "1234",
"name" : "test",
"objects" : [
{
"type" : "TYPE1"
"time" : "1215
},
{
"type" : "TYPE2"
"time" : "1115"
}
]
}
Now I need to query my documents by a list of types.
Examples
1) If I would query with TYPE1 then all the documents where there is an object with this type would return. (The example doc would return)
2) If I would query with TYPE1 and TYPE3 it would return all documents which contain either of them (The example doc would return)
3) If I would query with TYPE3, TYPE4 and TYPE5 it would return all documents which contain either of them (The example doc would not return)
How would the code in the _design document look like and how would my API request look like?
One option is to use Cloudant Search.
Sample design document named types, which indexes each type property in your objects array
{
"_id": "_design/types",
"views": {},
"language": "javascript",
"indexes": {
"one-of": {
"analyzer": "standard",
"index": "function (doc) {\n for(var i in doc.objects) {\n index(\"type\", doc.objects[i].type); \n }\n}"
}
}
}
Query examples:
Search for one key (type=val)
GET https://$HOST/$DATABASE/_design/$DDOC/_search/one-of?q=type%3ATYPE1
Search for multiple keys (type=val1 OR type=val2)
GET https://$HOST/$DATABASE/_design/$DDOC/_search/one-of?q=type%3ATYPE1%20OR%20type%3ATYPE2
Search for multiple keys (type=val1 AND type=val2)
GET https://$HOST/$DATABASE/_design/$DDOC/_search/one-of?q=type%3ATYPE1%20AND%20type%3ATYPE2
To include the documents in the response append &include_docs=true.