Loading data into Google BigQuery

My question is the following:
Let's say I have a JSON file that I want to load into BigQuery.
It contains these two lines of data.
{"value":"123"}
{"value": 123 }
I have defined the following schema for my data.
[
{ "name":"value", "type":"String"}
]
When I try to load the JSON file into BigQuery, it fails with the following error:
Field:value: Could not convert value to string
Is there a way to get around this issue other than transforming the data in the JSON file?
Thanks!

You can set the maxBadRecords property on the load job to skip a given number of bad records while still loading the data.
Following your example, you could still load the data if you set it as:
"configuration": {
"load": {
"maxBadRecords": 1,
}
}
This is a way to get around the issue while still loading your JSON data into the table; the erroneous rows will simply be skipped. If you are loading a list of files, you could set it as a function of the number of files you are loading (e.g. maxBadRecords = 20 * fileCount).
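For reference, the same property is exposed on the bq command line as --max_bad_records, so a sketch of an equivalent load invocation would look like this (the dataset, table, and file names here are placeholders):

bq load --source_format=NEWLINE_DELIMITED_JSON --max_bad_records=1 \
    mydataset.mytable ./data.json ./schema.json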

Related

Need Pentaho JSON without array

I wanted to output JSON data not as an array object, and I made the changes mentioned in the Pentaho documentation, but the output is always an array, even for a single set of values. I am using PDI 9.1 and I tested using the ktr from the link below:
https://wiki.pentaho.com/download/attachments/25043814/json_output.ktr?version=1&modificationDate=1389259055000&api=v2
The statement below is from https://wiki.pentaho.com/display/EAI/JSON+output:
Another special case is when 'Nr. rows in a block' = 1.
If used with empty json block name output will looks like:
{
"name" : "item",
"value" : 25
}
My output comes like below
{ "": [ {"name":"item","value":25} ] }
I have resolved this myself. I added another JSON Input step and defined it as below:
$.wellDesign[0] to get the array as a string object

Can't figure out how to insert keys and values of nested JSON data into SQL rows with NiFi

I'm working on a personal project and am very new (learning as I go) to JSON, NiFi, SQL, etc., so forgive any confusing language used here or a potentially really obvious solution. I can clarify as needed.
I need to take the JSON output from a website's API call and insert it into a table in my MariaDB local server that I've set up. The issue is that the JSON data is nested, and two of the key pieces of data that I need to insert are used as variable key objects rather than values, so I don't know how to extract it and put it in the database table. Essentially, I think I need to identify different pieces of the JSON expression and insert them as values, but I'm clueless how to do so.
I've played around with the EvaluateJsonPath, SplitJson, and FlattenJson processors in particular, but I can't make it work. All I can ever do is get the result of the whole expression rather than each piece of it.
{"5381":{"wind_speed":4.0,"tm_st_snp":26.0,"tm_off_snp":74.0,"tm_def_snp":63.0,"temperature":58.0,"st_snp":8.0,"punts":4.0,"punt_yds":178.0,"punt_lng":55.0,"punt_in_20":1.0,"punt_avg":44.5,"humidity":47.0,"gp":1.0,"gms_active":1.0},
"1023":{"wind_speed":4.0,"tm_st_snp":26.0,"tm_off_snp":82.0,"tm_def_snp":56.0,"temperature":74.0,"off_snp":82.0,"humidity":66.0,"gs":1.0,"gp":1.0,"gms_active":1.0},
"5300":{"wind_speed":17.0,"tm_st_snp":27.0,"tm_off_snp":80.0,"tm_def_snp":64.0,"temperature":64.0,"st_snp":21.0,"pts_std":9.0,"pts_ppr":9.0,"pts_half_ppr":9.0,"idp_tkl_solo":4.0,"idp_tkl_loss":1.0,"idp_tkl":4.0,"idp_sack":1.0,"idp_qb_hit":2.0,"humidity":100.0,"gp":1.0,"gms_active":1.0,"def_snp":23.0},
"608":{"wind_speed":6.0,"tm_st_snp":20.0,"tm_off_snp":53.0,"tm_def_snp":79.0,"temperature":88.0,"st_snp":4.0,"pts_std":5.5,"pts_ppr":5.5,"pts_half_ppr":5.5,"idp_tkl_solo":4.0,"idp_tkl_loss":1.0,"idp_tkl_ast":1.0,"idp_tkl":5.0,"humidity":78.0,"gs":1.0,"gp":1.0,"gms_active":1.0,"def_snp":56.0},
"3396":{"wind_speed":6.0,"tm_st_snp":20.0,"tm_off_snp":60.0,"tm_def_snp":70.0,"temperature":63.0,"st_snp":19.0,"off_snp":13.0,"humidity":100.0,"gp":1.0,"gms_active":1.0}}
This is a snapshot of an output with a couple thousand lines. Each of the numeric keys you see above (5381, 1023, 5300, etc.) is a player ID for the stats that follow. I have a table set up with three columns: Player ID, Stat ID, and Stat Value. For example, I need that first snippet to be inserted into my table as such:
Player ID | Stat ID    | Stat Value
5381      | wind_speed | 4.0
5381      | tm_st_snp  | 26.0
5381      | tm_off_snp | 74.0
And so on, for each piece of data. But I don't know how to have NiFi select the right pieces of data to insert in the right columns.
I believe it's possible to use Jolt to transform your JSON into a format like:
[
{"playerId":"5381", "statId":"wind_speed", "statValue": 0.123},
{"playerId":"5381", "statId":"tm_st_snp", "statValue": 0.456},
...
]
then use PutDatabaseRecord with a JSON record reader.
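A chained shift spec along these lines might work (an untested sketch: the first pass pivots every playerId/statId pair into its own object, the second flattens those objects into a top-level array):

[
  {
    "operation": "shift",
    "spec": {
      "*": {
        "*": {
          "$1": "&2.&1.playerId",
          "$": "&2.&1.statId",
          "@": "&2.&1.statValue"
        }
      }
    }
  },
  {
    "operation": "shift",
    "spec": {
      "*": {
        "*": "[]"
      }
    }
  }
]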
Another approach is to use the ExecuteGroovyScript processor.
Add a new parameter to it with the name SQL.mydb and link it to your DBCP controller service.
Then use the following script as the Script Body parameter:
import groovy.json.JsonSlurper
import groovy.json.JsonBuilder

def ff = session.get()
if (!ff) return

// read flow file content and parse it
def body = ff.read().withReader("UTF-8") { reader ->
    new JsonSlurper().parse(reader)
}

def results = []
// use the defined sql connection to create a batch
SQL.mydb.withTransaction {
    def cmd = 'insert into mytable(playerId, statId, statValue) values(?,?,?)'
    results = SQL.mydb.withBatch(100, cmd) { statement ->
        // run through all keys/subkeys in the flow file body
        body.each { pid, keys ->
            keys.each { k, v ->
                statement.addBatch(pid, k, v)
            }
        }
    }
}

// write results as the new flow file content
ff.write("UTF-8") { writer ->
    new JsonBuilder(results).writeTo(writer)
}

// transfer to success
REL_SUCCESS << ff

JMeter pass JSON response value to next request

I am using JMeter to test a web app.
First I perform an HTTP GET request which returns a JSON array such as:
[
  {
    "key1": {
      "subKey": [
        9.120968,
        39.255417
      ]
    },
    "key2": 1
  },
  {
    "key1": {
      "subKey": [
        9.123852,
        39.243237
      ]
    },
    "key2": 10
  }
]
Basically I want to take one element at random, take the elements of key1, and create two variables in JMeter to be used in the next query (if random selection is not possible, then just the first element).
I tried using the JSON Extractor with the following settings (the example shows the single-variable case), then referenced the parameter in the next HTTP GET request as ${var1}.
How do I set up the JSON Extractor to extract a value and save it into a JMeter variable to be used in the next HTTP GET request?
The correct JSON Path query would be something like:
$..key1.subKey[${__Random(0,1,)}]
You need to switch the "Apply to" value to either "Main sample only" or "Main sample and sub-samples".
In the above setup:
Match No.: 0 - tells JMeter to take a random match out of all key1.subKey values
${__Random(0,1,)} - picks a random element from the subKey array, i.e. 9.120968 or 39.255417
More information:
Jayway Jsonpath
API Testing With JMeter and the JSON Extractor
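If the GUI extractor keeps misbehaving, the same selection can be scripted in a JSR223 PostProcessor instead. A minimal Groovy sketch (untested; var1 and var2 are the variable names assumed here):

import groovy.json.JsonSlurper

// parse the JSON array returned by the previous sampler
def response = new JsonSlurper().parseText(prev.getResponseDataAsString())

// pick one element of the array at random
def element = response[new Random().nextInt(response.size())]

// expose both key1.subKey coordinates as JMeter variables
vars.put('var1', element.key1.subKey[0].toString())
vars.put('var2', element.key1.subKey[1].toString())

The next HTTP GET request can then reference them as ${var1} and ${var2}.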
"JMeter variable name to use" option that you've switched on there means that you'd be examining the content of this variable INSTEAD of Sample result.
So the fix is obvious: if you intend to extract whatever you extracting from Sample result - change it back to it.
PS If you intend the opposite (process the variable content, not the sample result) - let me know please.

Error in bq load "Could not convert value to string"

I tried to load logs from Google Cloud Storage into BigQuery with the bq command,
and got this error: "Could not convert value to string".
My example data:
{"ids":"1234,5678"}
{"ids":1234}
My example schema:
[
{ "name":"ids", "type":"string" }
]
It seems the single ID can't be converted because it has no quotes around it.
The data is produced by fluent-plugin-s3: when more than one ID is joined by a comma, the value is wrapped in quotes, but a single ID is emitted unquoted.
How can I load this data into BigQuery?
Thanks in advance
Check the various fluentd plugins that might help you, for example:
https://github.com/lob/fluent-plugin-json-transform
https://github.com/tarom/fluent-plugin-typecast
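If changing the fluentd pipeline isn't an option, you could also normalize the file before loading it. A minimal Groovy sketch (untested; file names are placeholders) that coerces ids to a string on every line:

import groovy.json.JsonSlurper
import groovy.json.JsonOutput

def slurper = new JsonSlurper()

// rewrite each newline-delimited JSON record with "ids" forced to a string
new File('logs_fixed.json').withWriter('UTF-8') { writer ->
    new File('logs.json').eachLine('UTF-8') { line ->
        def record = slurper.parseText(line)
        record.ids = record.ids.toString()
        writer << JsonOutput.toJson(record) << '\n'
    }
}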

Extracting tfidf-vectors by key without destroying the fileformat

I have about 200000 tfidf-vectors in the output format seq2sparse delivers. Now I need to extract 500 of them, but not randomly as with the split function: I know the keys of 500 of them, and I need them in the same data format as the one from seq2sparse.
When I open the sequencefile with the 200000 entries I can see that the keys are coded with
org.apache.hadoop.io.Text and the values with org.apache.mahout.math.VectorWritable.
But when I try to use
https://github.com/kevinweil/elephant-bird/blob/master/mahout/src/main/java/com/twitter/elephantbird/pig/mahout/VectorWritableConverter.java
and
https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/store/SequenceFileStorage.java
in Pig Latin for reading and writing them the output has org.apache.hadoop.io.Text for both key and value.
I really need exactly those 500 entries in this format because I want to use them in trainnb and testnb.
Basically it would be enough to know how I can do something like the reverse of mahout seqdumper.
While there's no specific Mahout command to do this, you could write a relatively simple utility function using Mahout's:
org.apache.mahout.common.Pair;
org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
org.apache.mahout.math.VectorWritable;
and:
org.apache.hadoop.io.SequenceFile;
org.apache.hadoop.io.Text;
com.google.common.io.Closeables;
You could do something like the following:
// load up the 500 desired keys with some function
Vector<Text> desiredKeys = getDesiredKeys();

// create a new SequenceFile writer for the 500 desired vectors
SequenceFile.Writer writer =
    SequenceFile.createWriter(fs, conf, output500filePath,
        Text.class,
        VectorWritable.class);

try {
    // create an iterator over the tfidf vector sequence file
    SequenceFileIterable<Text, VectorWritable> seqFileIterable =
        new SequenceFileIterable<Text, VectorWritable>(
            tfidfVectorPath, true, conf);

    // loop over the tfidf sequence file and write out only the pairs
    // whose keys are contained in the desiredKeys Vector
    for (Pair<Text, VectorWritable> pair : seqFileIterable) {
        if (desiredKeys.contains(pair.getFirst())) {
            writer.append(pair.getFirst(), pair.getSecond());
        }
    }
} finally {
    Closeables.close(writer, false);
}
Then use the path to the "output500file" as the input to trainnb. Using Vector.contains() is not the most efficient way to do it, but this conveys the general idea.
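For instance, a HashSet over the keys' string forms gives constant-time lookups; a sketch of that variation (load500Keys() is a hypothetical loader returning the 500 keys as strings):

// assumption: load500Keys() returns the desired keys as java.lang.String values
Set<String> desiredKeys = new HashSet<String>(load500Keys());

// inside the loop, compare on the Text key's string form
for (Pair<Text, VectorWritable> pair : seqFileIterable) {
    if (desiredKeys.contains(pair.getFirst().toString())) {
        writer.append(pair.getFirst(), pair.getSecond());
    }
}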