Azure Data Factory JSON syntax - azure-data-factory-2

In Azure Data Factory, I have a copy activity. The data source is the response body from a REST API POST request.
The sink is a SQL table. The problem is that, even though my JSON data contains multiple rows, only the first row is getting copied.
The source data looks like the following:
{
"offset": 0,
"limit": 1000,
"total": 65,
"loaded": 34,
"unloaded": 31,
"cubeCaches": [
{
"id": "MxMUVDN0Q1MzAk5MDg6RDkxREQxMUU5RDBDNzR2NMTk6YWNsZGxwMTJtc3QuY2952aXppZW50aW5==",
"projectId": "15D91DD11E9D0C74B3319",
"source": {
"name": "12302021",
"id": "07EF95111EC7F954158",
"type": "cube"
},
"state": {
"active": true,
"dirty": false,
"infoDirty": false,
"persisted": true,
"processing": false,
"loadedState": "loaded"
},
"lastUpdateTime": "2022-01-24T14:22:30Z",
"lastHitTime": "2022-02-14T20:02:02Z",
"hitCount": 1,
"size": 798720,
"creatorId": "D4E8BFD56085",
"lastUpdateJob": 18937,
"openViewCount": 0,
"creationTime": "2022-01-24T15:07:24Z",
"historicHitCount": 22,
"dataLanguages": [],
"rowCount": 2726,
"columnCount": 9
},
{
"id": "UYwMTIxMUFNjkxMUU5RDBDMTRCNkMwMDgwRUYzNUQ0MUI6YWNsZjLmNvbQ==",
"projectId": "120D0C1480EF35D41B",
"source": {
"name": "All Clients (YTD)",
"id": "49E5B13466251CD0B54E8F",
"type": "cube"
},
"state": {
"active": true,
"dirty": false,
"infoDirty": false,
"persisted": true,
"processing": false,
"loadedState": "loaded"
},
"lastUpdateTime": "2022-01-03T01:00:01Z",
"hitCount": 0,
"size": 82488152,
"creatorId": "1E2AFB011E80EF35FF14",
"lastUpdateJob": 364091,
"openViewCount": 0,
"creationTime": "2022-02-14T01:04:55Z",
"historicHitCount": 0,
"dataLanguages": [],
"rowCount": 8146903,
"columnCount": 13
}
]
}
I want to add a row in the Sink table (SQL) for every "id" in the JSON. However, when I run the activity, only the first record gets copied. It's mapped correctly, but I want it to copy all rows in the JSON, not just 1.
My Mapping tab in Azure Data Factory looks like this:
What am I doing wrong here? I'm thinking there is something wrong with my "Source" syntax for each of the columns...

In $cubeCaches[0][...] you're explicitly mapping the first element of that array into columns, which is why only one row lands in the Sink.
I don't know a way to achieve what you intend with the Copy activity alone. I would use a Mapping Data Flow here, and inside it flatten the data (Flatten transformation) so that you get one row per object in the cubeCaches array.
From that flattened stream you can use a Derived Column to map the JSON fields into the columns of your target, a Select to remove the unwanted original fields, and a Sink to write to your target location.

Related

How to load a jsonl file into BigQuery when the file has mixed data fields as columns

During my workflow, after extracting the data from the API, the JSON has the following structure:
[
{
"fields":
[
{
"meta": {
"app_type": "ios"
},
"name": "app_id",
"value": 100
},
{
"meta": {},
"name": "country",
"value": "AE"
},
{
"meta": {
"name": "Top"
},
"name": "position",
"value": 1
}
],
"metrics": {
"click": 1,
"price": 1,
"count": 1
}
}
]
Then it is stored as .jsonl and put on GCS. However, when I load it into BigQuery for further extraction, the automatic schema inference returns the following error:
Error while reading data, error message: JSON parsing error in row starting at position 0: Could not convert value to string. Field: value; Value: 100
I want to convert it into the following structure:
app_type | app_id | country | position | click | price | count
ios      | 100    | AE      | Top      | 1     | 1     | 1
Is there a way to define a manual schema in BigQuery to achieve this result? Or do I have to preprocess the jsonl file before putting it into BigQuery?
One of the limitations of loading JSON data from GCS to BigQuery is that it does not support maps or dictionaries in JSON.
An invalid example would be:
"metrics": {
"click": 1,
"price": 1,
"count": 1
}
Your jsonl file should be something like this:
{"app_type":"ios","app_id":"100","country":"AE","position":"Top","click":"1","price":"1","count":"1"}
I already tested it and it works fine.
So wherever you convert the json files to jsonl files and store them on GCS, you will have to do some preprocessing.
You probably have two options:
precreate the target table with an app_id field as an INTEGER
preprocess the json file and enclose 100 in quotes, like "100"
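For illustration, a rough Python sketch of that preprocessing, assuming the API response is a JSON array shaped like the sample above; the input/output file names are placeholders, and the rule that maps "position" to the display name in meta ("Top") is inferred from the expected output rather than documented anywhere:
import json

def flatten_record(record):
    """Flatten one record of the API response into a single-level dict."""
    flat = {}
    for field in record["fields"]:
        meta = field.get("meta", {})
        # Carry over meta attributes such as app_type when present.
        if "app_type" in meta:
            flat["app_type"] = meta["app_type"]
        # The expected output maps "position" to the display name in meta
        # ("Top"); fall back to the raw value for the other fields.
        flat[field["name"]] = meta.get("name", field["value"])
    flat.update(record.get("metrics", {}))
    # Cast everything to string so schema inference doesn't trip over mixed types.
    return {key: str(value) for key, value in flat.items()}

with open("api_response.json") as src, open("output.jsonl", "w") as dst:
    for record in json.load(src):          # the response is a JSON array
        dst.write(json.dumps(flatten_record(record)) + "\n")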

bigquery nested object : No such field

I have a table with this schema :
I'm trying to upload some data from Google Cloud Storage using the Python client. The file is newline-delimited JSON. Most of my lines don't have the field "passenger_origin.accuracy", but when the field is present I get the following error:
Error while reading data, error message: JSON parsing error in row starting at position 2122510: No such field: driver_origin.accuracy. (error code: invalid)
Error while reading data, error message: JSON parsing error in row starting at position 2126317: No such field: passenger_origin.accuracy. (error code: invalid)
Example of an invalid row:
{
"id": 1479443,
"is_obsolete": 0,
"seat_count": 1,
"is_ticket_checked": 0,
"score": 0.3709318902,
"is_multimodal": 0,
"fake_paths": 0,
"passenger_origin": {
"id": 2204,
"poi_uuid": "15b4e52c-7c58-442c-98df-1eb06079f6bb",
"user_id": 1987,
"accuracy": 250.0,
"disabled": 0,
"last_update": "2017-03-10T15:15:39",
"created": "2016-02-05T17:06:26",
"modified_by_user": 1,
"is_recurrent": 0,
"source": 1,
"hidden_by_user": 0,
"kind": 2,
},
"driver_origin": {
"id": 412491,
"poi_uuid": "47e90b6d-e178-4e02-9f02-f4ea5f8beaa1",
"user_id": 71471,
"disabled": 0,
"last_update": "2017-11-02T10:09:09",
"created": "2017-11-02T10:09:09",
"modified_by_user": 0,
"is_recurrent": 0,
"source": 1,
"hidden_by_user": 0,
"kind": 2,
},
"passenger_destination": {
"id": 2203,
"poi_uuid": "c531c3ca-47f0-4003-8098-1272fee8d018",
"user_id": 1987,
"accuracy": 250.0,
"disabled": 0,
"last_update": "2017-03-10T15:12:42",
"created": "2016-02-05T17:06:19",
"modified_by_user": 1,
"is_recurrent": 0,
"source": 1,
"hidden_by_user": 0,
"kind": 1,
}
}
The table is created before the upload of the data and has not been modified since. I don't understand why the upload is failing on these fields. Do the RECORD fields have to be REPEATED?
To ignore the fields that aren't present in the schema, use a combination of:
configuration.load.ignoreUnknownValues
configuration.load.maxBadRecords
Setting the first to true and the second to some arbitrarily-high number, e.g. 100000, will enable the load to succeed even if there are extra fields.
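For reference, with the google-cloud-bigquery Python client (the question mentions using the Python client), those two settings correspond to ignore_unknown_values and max_bad_records on the load job configuration; a minimal sketch, where the GCS path and table id are placeholders:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    ignore_unknown_values=True,   # drop fields that are not in the table schema
    max_bad_records=100000,       # tolerate rows that still fail to parse
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/rides.jsonl",        # placeholder GCS uri
    "my-project.my_dataset.rides",       # placeholder destination table
    job_config=job_config,
)
load_job.result()   # wait for the load to finish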
The problem was that configuration.load.autodetect was set to True. I set it to False and the problem was fixed.

Append to Specific Array Index from another table dynamically SQL Server

I have a JSON string that I need to modify:
{"RecordCount":3,"Top":10,"Skip":0,"SelectedSort":"Seed asc","value":[{"AccountProductListId":22091612871138,"Name":"April 4th 2018","AccountId":256813438078643,"IsPublic":false,"Comment":"Test order sheet","Quantity":3},{"AccountProductListId":166305848801939,"Name":"test","AccountId":256813438078643,"IsPublic":false,"Comment":"","Quantity":1},{"AccountProductListId":21177711287586,"Name":"Test Order sheet","AccountId":256813438078643,"IsPublic":true,"Comment":"the very first sheet","Quantity":2}]}
Inside value the array looks like this:
"value": [{
"AccountProductListId": 22091612871138,
"Name": "April 4th 2018",
"IsPublic": false,
"Comment": "Test order sheet",
"Quantity": 3
}, {
"AccountProductListId": 166305848801939,
"Name": "test",
"IsPublic": false,
"Comment": "",
"Quantity": 1
}, {
"AccountProductListId": 21177711287586,
"Name": "Test Order sheet",
"IsPublic": true,
"Comment": "the very first sheet",
"Quantity": 2
}],
What I need to do is append some data from another table:
AccountProductListId ProductID
21177711287586 97096131867163|32721319938943
22091612871138 97096131867163|145461009584740|130005306921282
166305848801939 8744071222157
As you can see, the AccountProductListId is already in the JSON result, so I know which array element each ProductID belongs to. The only problem is I don't know the syntax to merge the ProductID data into its specific array index. The JSON array could have more than 3 items.
Essentially ending up with something like this:
"value": [{
"AccountProductListId": 22091612871138,
"Name": "April 4th 2018",
"IsPublic": false,
"Comment": "Test order sheet",
"Quantity": 3,
"ProductID": "97096131867163|145461009584740|130005306921282"
}, {
"AccountProductListId": 166305848801939,
"Name": "test",
"IsPublic": false,
"Comment": "",
"Quantity": 1,
"ProductID": "8744071222157"
}, {
"AccountProductListId": 21177711287586,
"Name": "Test Order sheet",
"IsPublic": true,
"Comment": "the very first sheet",
"Quantity": 2,
"ProductID": "97096131867163|32721319938943"
}],
Any information would be greatly appreciated. Thanks.
Process with SQL Server
Prior to SQL Server 2016, there is no built-in support to read or write JSON.
Starting with SQL Server 2016, you can use the OPENJSON rowset function to read JSON and the FOR JSON clause to write JSON.
See https://learn.microsoft.com/en-us/sql/relational-databases/json/json-data-sql-server
The approach would be to use OPENJSON to read the JSON string as a rowset, join it with the table to pick up ProductID, and use FOR JSON to convert back to JSON.
Process outside SQL Server
Depending on your situation, it might be simpler to parse the JSON outside SQL Server. If going that route, you could (a rough sketch follows the list):
Collect all the AccountProductListIDs from the parsed JSON
Send the collected ids to SQL Server via a stored procedure that takes a TVP input and returns the AccountProductListID -> ProductID mapping
Inject the ProductIDs into the JSON object and serialize it back to a string
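A rough Python sketch of that second route, where the id-to-ProductID mapping returned by SQL Server is stubbed as a plain dict and the source file name is a placeholder:
import json

# AccountProductListId -> ProductID pairs, as they would come back from the
# stored procedure (stubbed here with the values from the table above).
product_ids = {
    22091612871138: "97096131867163|145461009584740|130005306921282",
    166305848801939: "8744071222157",
    21177711287586: "97096131867163|32721319938943",
}

with open("account_product_lists.json") as f:   # placeholder source of the JSON string
    doc = json.load(f)

for item in doc["value"]:
    pid = product_ids.get(item["AccountProductListId"])
    if pid is not None:
        item["ProductID"] = pid        # append ProductID to the matching element

updated_json = json.dumps(doc)         # serialize back to a string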

Turned off dynamic mapping in Elasticsearch, but the custom mapping still does not work?

My problem is: I have a JsonObject like this:
{
"success": true,
"type": "message",
"body": {
"_id": "5215bdd32de81e0c0f000005",
"id": "411c79eb-a725-4ad9-9d82-2db54dfc80ee",
"type": "metaModel",
"title": "testchang",
"authorId": "5215bd552de81e0c0f000001",
"drawElems": [
{
"type": "App.draw.metaElem.ModelStartPhase",
"id": "27re7e35-550j",
"x": 60,
"y": 50,
"width": 50,
"height": 50,
"title": "problem engagement",
"isGhost": true,
"pointTo": "e88e2845-37a4-4c45-a030-d02a3c3e03f9",
"bindingId": "90f79d70-0afc-11e3-98d2-83967d2ad9a6",
"model": "meta",
"entityType": "phase",
"domainId": "411c79eb-a725-4ad9-9d82-2db54dfc80ee",
"authorId": "5215bd552de81e0c0f000001",
"userData": {},
"_id": "5215f4c5d89f629c1700000d"
},
{...}
]
}
}
And I tried to define a mapping as follows to index only parts of this object.
String mapping = XContentFactory.jsonBuilder()
.startObject()
.startObject("domaindata").field("dynamic","false")
.startObject("properties")
.startObject("id").field("type","string").field("store","yes").endObject()
.startObject("type").field("type","string").field("store","yes").endObject()
.startObject("title").field("type","integer").field("store","yes").endObject()
.startObject("drawElems")
.startObject("properties")
.startObject("type").field("store","yes").field("type","string").endObject()
.startObject("title").field("store","yes").field("type","string").endObject()
.endObject().endObject().endObject().endObject().endObject().string();
After adding this mapping to my type with:
node.client().admin()
.indices().prepareCreate("test")
.addMapping("domaindata", mapping)
.execute().actionGet();
I still got the entire JSON object back in my IndexResponse; it seems that my mapping does not work.
Could anybody help me? Thanks a lot!
The problem here is that setting dynamic to false only means that fields not already present in the mapping won't be added to it, and thus won't be indexed either. But as they are part of the source document that you sent, they are still returned as part of the _source field.
The same goes if you disable a specific object in the mapping ("enabled": false) as mentioned here. That object won't be parsed nor indexed, but it will still be part of the stored _source field.
If you want to avoid storing part of the _source you can use the source includes/excludes feature as described here.
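For illustration only, roughly what that could look like when creating the index, sketched here with the Python client rather than the Java builder used above (assumes an older Elasticsearch that still has mapping types; the excluded path is just an example):
from elasticsearch import Elasticsearch

es = Elasticsearch()

mapping = {
    "mappings": {
        "domaindata": {
            "dynamic": False,
            # Paths listed here are dropped from the stored _source,
            # so they are no longer returned with search hits.
            "_source": {"excludes": ["drawElems.*"]},
            "properties": {
                "id": {"type": "string", "store": "yes"},
                "type": {"type": "string", "store": "yes"},
                "title": {"type": "string", "store": "yes"},
            },
        }
    }
}

es.indices.create(index="test", body=mapping)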

Freebase search_api and excluding results by specified type

Does anyone know how to exclude some topics with specified type(s) using the search API and MQL?
For example, I'm trying to find all topics "Voodoo People", exclude only those that have the composition and release types, and sort the result by score descending: http://tinyurl.com/3tjkb7y.
Sorting works perfectly, but I can't find functionality for excluding :(
I tried to use mql_filter: http://tinyurl.com/644xkow, but releases are still there.
And one more question: I see the type_strict param has the possible values "all", "any", and "should", but there is no "not" or "not in" value. Can the needed result be obtained in any other way?
The syntax that you're looking for is "optional" : "forbidden". In your query that would look like this:
[{
"search": {
"query": "Voodoo People",
"score": null,
"mql_filter": [{
"type": {
"id": "/music/release",
"optional": "forbidden"
}
}]
},
"name": null,
"id": null,
"type": [],
"/common/topic/notable_for": {
},
"limit": 15,
"sort": "-search.score"
}]