GCP Dataflow JOB REST response add displayData object with { "key":"datasetName", ...} - google-bigquery

Why doesn't this line of code generate a displayData object with { "key":"datasetName", ...}, and how can I generate it if it is not produced by default when using the BigQuery source from Apache Beam?
bigqcollection = p | 'ReadFromBQ' >> beam.io.Read(beam.io.BigQuerySource(project=project,query=get_java_query))
[UPDATE] Adding the result I am trying to produce:
"displayData": [
{
"key": "table",
"namespace": "....",
"strValue": "..."
},
{
"key": "datasetName",
"strValue": "..."
}
]

From reading the implementation of display_data() for a BigQuerySource in the most recent version of Beam, it does not extract the table and dataset from a query, which is what your example uses. More significantly, it does not create any field named datasetName.
I would recommend writing a subclass of _BigQuerySource which adds the fields you need to the display data, while preserving all the other behavior.
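As a rough illustration of that subclassing idea, here is a minimal Python sketch. To keep it runnable without Beam installed, StandInQuerySource is a hypothetical stand-in for _BigQuerySource that only models display_data(); QuerySourceWithDataset, dataset_name and table_name are likewise made-up names. In a real pipeline you would subclass Beam's class instead, and whether the extra keys show up in the Dataflow REST response exactly as shown in the question is an assumption I have not verified.

```python
class StandInQuerySource:
    """Hypothetical stand-in for Beam's _BigQuerySource (display_data only)."""

    def __init__(self, query):
        self.query = query

    def display_data(self):
        # Like the query-based source discussed above, this reports only
        # the query and no table or dataset fields.
        return {"query": self.query}


class QuerySourceWithDataset(StandInQuerySource):
    """Adds explicit table and dataset entries, keeping the parent's fields."""

    def __init__(self, query, dataset_name, table_name):
        super().__init__(query)
        self.dataset_name = dataset_name
        self.table_name = table_name

    def display_data(self):
        # Start from whatever the parent reports, then add the extra keys.
        data = super().display_data()
        data["datasetName"] = self.dataset_name
        data["table"] = self.table_name
        return data


source = QuerySourceWithDataset(
    query="SELECT 1", dataset_name="my_dataset", table_name="my_table"
)
display = source.display_data()
```

The important part is calling super().display_data() first, so all existing display data is preserved and only the new keys are added.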

Proper way to convert Data type of a field in MongoDB

Possible duplicate of How to change the type of a field?
I am new to MongoDB and I am having a problem converting the data type of a field value to another data type.
Below is an example of my document
[
{
"Name of Restaurant": "Briyani Center",
"Address": " 336 & 338, Main Road",
"Location": "XYZQWE",
"PriceFor2": "500.0",
"Dining Rating": "4.3",
"Dining Rating Count": "1500",
},
{
"Name of Restaurant": "Veggie Conner",
"Address": " New 14, Old 11/3Q, Railway Station Road",
"Location": "ABCDEF",
"PriceFor2": "1000.0",
"Dining Rating": "4.4",
}]
I have 12k documents like the ones above. Notice that the data type of PriceFor2 is a string. I would like to convert it to an integer.
I have read the many excellent answers in the question linked above, but when I try to run the query I get a ".save() is not a function" error. Please advise what the problem is.
Below is the code I used:
db.chennaiData.find().forEach( function(x){ x.priceFor2= new NumberInt(x.priceFor2);
db.chennaiData.save(x);
db.chennaiData.save(x);});
This is the error I am getting..
TypeError: db.chennaiData.save is not a function
From MongoDB's save documentation:
Starting in MongoDB 4.2, the db.collection.save() method is deprecated. Use db.collection.insertOne() or db.collection.replaceOne() instead.
You are likely running MongoDB 4.2 or later, so the save function is no longer available. Consider migrating to insertOne and replaceOne as suggested.
For your specific scenario, it is actually preferable to use a single update, as mentioned in another SO answer. It makes only one database call, whereas your approach fetches every document in the collection to the application level and then performs n database calls to save them back.
db.collection.update({},
[
{
$set: {
PriceFor2: {
$toDouble: "$PriceFor2"
}
}
}
],
{
multi: true
})
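If you are driving this from Python rather than the shell, the same update can be sketched as below. build_price_update is a hypothetical helper name, and the connection URI and database name in the commented part are placeholders. The PyMongo call is left commented out because it needs a running server; as far as I know update_many accepts an aggregation pipeline on MongoDB 4.2+ with a reasonably recent PyMongo.

```python
def build_price_update():
    """Return the (filter, pipeline) pair for the single bulk conversion."""
    match_all = {}  # empty filter: apply to every document
    pipeline = [
        # Same stage as the shell example: overwrite PriceFor2 with its
        # numeric value, computed server-side for each document.
        {"$set": {"PriceFor2": {"$toDouble": "$PriceFor2"}}}
    ]
    return match_all, pipeline


match_all, pipeline = build_price_update()

# With PyMongo and a reachable server (not executed here):
# from pymongo import MongoClient
# collection = MongoClient("mongodb://localhost:27017")["mydb"]["chennaiData"]
# result = collection.update_many(match_all, pipeline)
# print(result.modified_count)
```

Note that the field name is case-sensitive: the documents use PriceFor2, while the forEach code in the question assigns to priceFor2.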

How to map Elasticsearch Spring Data AggregationsContainer contents to custom model?

I am using Spring Data Elasticsearch. I have a custom repository that uses ElasticsearchOperations, based on the examples in the docs. I need some aggregation query results, and I successfully get the intended results, but I need to map those results to a model. Currently I am unable to access the contents of the AggregationsContainer.
override fun getStats(startTime: Long, endTime: Long, pageable: Pageable): AggregationsContainer<*>?
{
val query: Query = NativeSearchQueryBuilder()
.withQuery(QueryBuilders.rangeQuery("time").from(startTime).to(endTime))
.withAggregations(AggregationBuilders.sum("discount").field("discount"))
.withAggregations(AggregationBuilders.sum("price").field("price"))
.withPageable(pageable)
.build()
val searchHits: SearchHits<Product> = operations.search(query, Product::class.java)
return searchHits.aggregations
}
I return the result of the following code:
val stats = repository.getTotalStats(before, currentTime, pageable)?.aggregations()
The result is:
{
"asMap": {
"discount": {
"name": "discount",
"metadata": null,
"value": 8000.0,
"valueAsString": "8000.0",
"type": "sum",
"fragment": true
},
"price": {
"name": "price",
"metadata": null,
"value": 9000.0,
"valueAsString": "9000.0",
"type": "sum",
"fragment": true
}
},
"fragment": true
}
How can I convert the above output to an intended output model like the following? As far as I can tell from testing, the contents of aggregations() are inaccessible and the type is Any:
{
"priceSum":9000.0,
"discountSum":8000
}
There is no data model for aggregations in the Elasticsearch RestHighLevelClient classes, and there is none in Spring Data Elasticsearch. Therefore the original Aggregations object is returned to the caller. It is wrapped in that AggregationsContainer because this will change with the new client implementation, and then the container will hold a different object.
You have to parse this yourself. I had something along these lines in the answer to another question (https://stackoverflow.com/a/63105356/4393565). The interesting part for you is the last code block, where the aggregations are passed. You basically have to iterate over the elements, cast them to the appropriate type and evaluate them.
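To make the "iterate, check the type, evaluate" step concrete, here is a small Python sketch that works on the serialized JSON shown in the question rather than on the Java Aggregations object. map_sum_aggregations is a hypothetical helper name; in the Kotlin code the same loop would run over the aggregations inside the container and cast each one to its sum type before reading the value.

```python
def map_sum_aggregations(container):
    """Map each sum aggregation in 'asMap' to a '<name>Sum' entry."""
    result = {}
    for name, aggregation in container["asMap"].items():
        # Only handle the aggregation type we know how to evaluate.
        if aggregation.get("type") != "sum":
            continue
        result[f"{name}Sum"] = float(aggregation["value"])
    return result


# Trimmed copy of the output shown in the question.
container = {
    "asMap": {
        "discount": {"name": "discount", "value": 8000.0, "type": "sum"},
        "price": {"name": "price", "value": 9000.0, "type": "sum"},
    },
    "fragment": True,
}

stats = map_sum_aggregations(container)
```

For the sample above, stats holds priceSum 9000.0 and discountSum 8000.0, which matches the intended output model.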

How to convert Json to CSV and send it to big query or google cloud bucket

I'm new to NiFi and I want to convert a large amount of JSON data to CSV format.
This is what I am doing at the moment, but it does not give the expected result.
These are the steps:
Processors that create the access_token and send the request body using InvokeHTTP, then receive the response body as JSON. (This part works fine, so I won't name the processors.)
Example of json response:
[
{
"results":[
{
"customer":{
"resourceName":"customers/123456789",
"id":"11111111"
},
"campaign":{
"resourceName":"customers/123456789/campaigns/222456422222",
"name":"asdaasdasdad",
"id":"456456546546"
},
"adGroup":{
"resourceName":"customers/456456456456/adGroups/456456456456",
"id":"456456456546",
"name":"asdasdasdasda"
},
"metrics":{
"clicks":"11",
"costMicros":"43068982",
"impressions":"2079"
},
"segments":{
"device":"DESKTOP",
"date":"2021-11-22"
},
"incomeRangeView":{
"resourceName":"customers/456456456/incomeRangeViews/456456546~456456456"
}
},
{
"customer":{
"resourceName":"customers/123456789",
"id":"11111111"
},
"campaign":{
"resourceName":"customers/123456789/campaigns/222456422222",
"name":"asdasdasdasd",
"id":"456456546546"
},
"adGroup":{
"resourceName":"customers/456456456456/adGroups/456456456456",
"id":"456456456546",
"name":"asdasdasdas"
},
"metrics":{
"clicks":"11",
"costMicros":"43068982",
"impressions":"2079"
},
"segments":{
"device":"DESKTOP",
"date":"2021-11-22"
},
"incomeRangeView":{
"resourceName":"customers/456456456/incomeRangeViews/456456546~456456456"
}
},
....etc....
]
}
]
Now I am using:
===>SplitJson ($[].results[])==>JoltTransformJSON with this spec:
[{
"operation": "shift",
"spec": {
"customer": {
"id": "customer_id"
},
"campaign": {
"id": "campaign_id",
"name": "campaign_name"
},
"adGroup": {
"id": "ad_group_id",
"name": "ad_group_name"
},
"metrics": {
"clicks": "clicks",
"costMicros": "cost",
"impressions": "impressions"
},
"segments": {
"device": "device",
"date": "date"
},
"incomeRangeView": {
"resourceName": "keywords_id"
}
}
}]
==>> MergeContent (here is the problem, which I don't know how to fix)
Merge Strategy: Defragment
Merge Format: Binary Concatenation
Attribute Strategy: Keep Only Common Attributes
Maximum number of Bins: 5 (I tried 10, same result)
Delimiter Strategy: Text
Header: [
Footer: ]
Demarcator: ,
What is the result I get?
I get a JSON file that has only part of the JSON data.
Example: I have 50k customer_ids in 1 JSON file, and I would like to send this data into a BigQuery table with all the ids under the same field, "customer_id".
MergeContent takes the split JSON files and combines them, but I still get 10k customer_ids per file, i.e. I have 5 files rather than 1 file with 50k customer_ids.
After the MergeContent I use ==>>ConvertRecord with these settings:
Record Reader JsonTreeReader (Schema Access Strategy: InferSchema)
Record Writer CsvRecordWriter (
Schema Write Strategy: Do Not Write Schema
Schema Access Strategy: Inherit Record Schema
CSV Format: Microsoft Excel
Include Header Line: true
Character Set UTF-8
)
==>>UpdateAttribute (custom prop: filename: ${filename}.csv) ==>> PutGCSObject (this puts the data into the Google bucket; this step works fine and I am able to put files there)
With this approach I am UNABLE to send data to BigQuery. After MergeContent I tried using PutBigQueryBatch, and I used this command in the bq shell to get the schema I need:
bq show --format=prettyjson some_data_set.some_table_in_that_data_set | jq '.schema.fields'
I filled in all the fields as needed. For Load file type I tried NEWLINE_DELIMITED_JSON, or CSV when I had converted the data to CSV. I am not getting errors, but no data is uploaded into the table.
What am I doing wrong? I basically want to map the data so that each field's data ends up under the same field name.
The trick you are missing is using Records.
Instead of using X>SplitJson>JoltTransformJson>Merge>Convert>X, try just X>JoltTransformRecord>X with a JSON Reader and a CSV Writer. This skips a lot of inefficiency.
If you really need to split (and you should avoid splitting and merging unless totally necessary), you can use MergeRecord instead - again with a JSON Reader and CSV Writer. This would make your flow X>Split>Jolt>MergeRecord>X.
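If it helps to check the target shape outside NiFi, the following Python sketch produces the CSV that the Jolt spec in the question plus a CSV writer should yield: one row per element of results, with the renamed columns. It is not NiFi code; FIELD_PATHS, flatten and response_to_csv are hypothetical names, and the sample is a single result copied from the question.

```python
import csv
import io

# Output column -> (parent object, child key), mirroring the Jolt shift spec.
FIELD_PATHS = {
    "customer_id": ("customer", "id"),
    "campaign_id": ("campaign", "id"),
    "campaign_name": ("campaign", "name"),
    "ad_group_id": ("adGroup", "id"),
    "ad_group_name": ("adGroup", "name"),
    "clicks": ("metrics", "clicks"),
    "cost": ("metrics", "costMicros"),
    "impressions": ("metrics", "impressions"),
    "device": ("segments", "device"),
    "date": ("segments", "date"),
    "keywords_id": ("incomeRangeView", "resourceName"),
}


def flatten(result):
    """Turn one nested result object into a flat row dict."""
    return {
        column: result.get(parent, {}).get(child)
        for column, (parent, child) in FIELD_PATHS.items()
    }


def response_to_csv(response):
    """Write every element of every 'results' array as one CSV row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(FIELD_PATHS), lineterminator="\n"
    )
    writer.writeheader()
    for batch in response:
        for result in batch.get("results", []):
            writer.writerow(flatten(result))
    return buffer.getvalue()


sample = [{"results": [{
    "customer": {"resourceName": "customers/123456789", "id": "11111111"},
    "campaign": {"name": "asdaasdasdad", "id": "456456546546"},
    "adGroup": {"id": "456456456546", "name": "asdasdasdasda"},
    "metrics": {"clicks": "11", "costMicros": "43068982", "impressions": "2079"},
    "segments": {"device": "DESKTOP", "date": "2021-11-22"},
    "incomeRangeView": {
        "resourceName": "customers/456456456/incomeRangeViews/456456546~456456456"
    },
}]}]

csv_text = response_to_csv(sample)
```

Because everything is written through one writer, all ids end up in a single file under the same customer_id column, which is the result the split-and-merge flow was failing to produce.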

Pentaho: Getting only one row from JSON input file

I am getting a JSON file from SFTP and trying to insert it into Oracle, but in the preview section I see only one row, and only one row is inserted into the table. I already tried raising the number of rows to preview to 10000, but nothing changed.
{"postal_code":"XX","build_id":"XX","categories":[],"closed":false,"closed_reasons":[],"email":["XX"],"external_link":{"facebook":[],"yelp":[""]},"hq":false,"location":{"lat":xxxxx,"lon":xxxxx},"metro":"Cxxxxx, IL","naics_codes":[{"category_code":"XX","category_description":"XX"},{"category_code":"XX","category_description":"xxxxx "},{"category_code":"XX","category_description":"xxxxx "},{"category_code":xxxxx","category_description":"XX"}],"name":"XX","place_id":"XX","place_ids":["xxx","xx"],"sic_codes":[{"category_code":"XX","category_description":"XX"},{"category_code":"XX","category_description":"XX"}]}
Your example json is:
{
"postal_code":"XX",
"build_id":"XX",
"categories":[],
"closed":false,
"closed_reasons":[],
"email":["XX"],
"external_link":{
"facebook":[],
"yelp":[""]
},
"hq":false,
"location":{
"lat":xxxxx,
"lon":xxxxx
},
"metro":"Cxxxxx, IL",
"naics_codes":[
{
"category_code":"XX",
"category_description":"XX"
},
{
"category_code":"XX",
"category_description":"xxxxx "
},
{
"category_code":"XX",
"category_description":"xxxxx "
},
{
"category_code":xxxxx",
"category_description":"XX"
}
],
"name":"XX",
"place_id":"XX",
"place_ids":["xxx","xx"],
"sic_codes":[
{
"category_code":"XX",
"category_description":"XX"
},
{
"category_code":"XX",
"category_description":"XX"
}
]
}
So, if that is everything in the response you are getting from the SFTP server, Pentaho is behaving correctly, because it is only one record. If you want PDI to recognize the fields inside that JSON and split its content, you need to specify the path to each field in the "Path" column on the "Fields" tab of the "Json Input" step.
Try using the input json file as follows:
{
"data" : [
{"key":"val","key":"val"},
{"key":"val","key":"val"},
{"key":"val","key":"val"},
{"id":"666","name":"jnit"}
]
}
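To see why the wrapper array matters, here is a small Python sketch of the row-counting logic. It only mimics the idea of a path that walks the data array (something like $.data[*].id in the Path column, which is my assumption about the exact expression); rows_from is a hypothetical helper and the sample values are made up.

```python
import json


def rows_from(document):
    """One row per element of 'data' if present, else the object itself."""
    data = document.get("data")
    if isinstance(data, list):
        return data
    # A bare top-level object is a single record, hence a single row.
    return [document]


single = json.loads('{"id": "1", "name": "a"}')
wrapped = json.loads(
    '{"data": [{"id": "1", "name": "a"}, {"id": "666", "name": "jnit"}]}'
)

single_rows = rows_from(single)
wrapped_rows = rows_from(wrapped)
```

The bare object yields one row, while the wrapped document yields one row per array element, which is the behaviour you are after.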

Possible to use angular-datatables with serverside array sourced data instead of object sourced data

I'm trying to use angular-datatables with serverside processing. However, it seems that angular-datatables expects that the data from the server is in object format (object vs array data described) with column names preceding each table datapoint. I'd like to configure angular-datatables to accept array based data since I can't modify my server side output which only outputs data in array format.
I'm configuring Datatables in my javascript like so:
var vm = this;
vm.dtOptions = DTOptionsBuilder.newOptions()
.withOption('ajax', {
url: 'output/ss_results/' + $routeParams.uuid,
type: 'GET'
})
.withDataProp('data')
.withOption('processing', true)
.withOption('serverSide', true);
My data from the server looks like this in array format:
var data = [
[
"Tiger Nixon",
"System Architect",
"$3,120"
],
[
"Garrett Winters",
"Director",
"$5,300"
]
]
But as far as I can tell, angular-datatables is expecting the data in object format like so:
[
{
"name": "Tiger Nixon",
"position": "System Architect",
"extn": "5421"
},
{
"name": "Garrett Winters",
"position": "Director",
"extn": "8422"
}
]
I tried not defining dtColumns or setting it to an empty array like vm.dtColumns = []; but I get an error message when I do that. When I configure dtColumns with a promise to load the column data via ajax I get datatables error #4 because it can't find the column name preceding my table datapoints in the data retrieved from the server.
Is it possible to configure angular-datatables to accept array based data? I can't find anything on the angular-datatables website that indicates it can be configured this way.
Edit: I removed the .withDataProp('data') call, which I think was causing the problem. The table works a little better now but it is still broken. After it loads, I get the message "No matching records found", even though right below it the table says "Showing 1 to 10 of 60,349 entries" and the pager shows "Previous 1 … 4 5 6 … 6035 Next". Does anyone know why this might be?
If you want to use an array of arrays instead of an array of objects, simply refer to the array indexes instead of the object property names:
$scope.dtColumns = [
DTColumnBuilder.newColumn(0).withTitle('Name'),
DTColumnBuilder.newColumn(1).withTitle('Position'),
DTColumnBuilder.newColumn(2).withTitle('Office'),
DTColumnBuilder.newColumn(3).withTitle('Start date'),
DTColumnBuilder.newColumn(4).withTitle('Salary')
]
demo using the famous "Tiger Nixon" array loaded via AJAX -> http://plnkr.co/edit/16UoRqF5hvg2YpvAP8J3?p=preview