Cannot insert new value into BigQuery table after adding a new column using the streaming API

I'm seeing some strange behaviour with my BigQuery table. I've just added a new column to the table, and it looks good in the interface and when fetching the schema via the API.
But when adding a value to the new column I get the following error:
{
  "insertErrors" : [ {
    "errors" : [ {
      "message" : "no such field",
      "reason" : "invalid"
    } ],
    "index" : 0
  } ],
  "kind" : "bigquery#tableDataInsertAllResponse"
}
I'm using the Java client and the streaming API; the only thing I added is:
tableRow.set("server_timestamp", 0)
Without that line it works correctly :(
Do you see anything wrong with it? (The column is named server_timestamp and is defined as an INTEGER.)

Updating this answer since BigQuery's streaming system has seen significant updates since Aug 2014 when this question was originally answered.
BigQuery's streaming system caches the table schema for up to 2 minutes. When you add a field to the schema and then immediately stream new rows to the table, you may encounter this error.
The best way to avoid this error is to delay streaming rows with the new field for 2 minutes after modifying your table.
If that's not possible, you have a few other options:
Use the ignoreUnknownValues option. This flag tells the insert operation to ignore unknown fields and accept only the fields it recognizes. Setting it lets you start streaming records with the new field immediately while avoiding the "no such field" error during the 2 minute window, but note that the new field's values will be silently dropped until the cached table schema updates! (A sketch of this flag and the next one follows these options.)
Use the skipInvalidRows option. This flag will tell the insert operation to insert as many rows as it can, instead of failing the entire operation when a single invalid row is detected. This option is useful if only some of your data contains the new field, since you can continue inserting rows with the old format, and decide separately how to handle the failed rows (either with ignoreUnknownValues or by waiting for the 2 minute window to pass).
If you must capture all values and cannot wait for 2 minutes, you can create a new table with the updated schema and stream to that table. The downside is that you then need to manage multiple tables. Note that you can query these tables conveniently using TABLE_QUERY, and you can run periodic cleanup queries (or table copies) to consolidate your data into a single table.
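For the first two options, here's a minimal sketch, assuming the google-cloud-bigquery Java client and placeholder dataset/table names; the same ignoreUnknownValues and skipInvalidRows flags also exist on the lower-level tabledata.insertAll request if you're using the older API client:

import java.util.HashMap;
import java.util.Map;

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;

public class StreamWithNewField {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId tableId = TableId.of("my_dataset", "my_table"); // placeholder names

        Map<String, Object> row = new HashMap<>();
        row.put("server_timestamp", 0); // the freshly added column

        InsertAllRequest request = InsertAllRequest.newBuilder(tableId)
                // Drop fields the cached schema doesn't know about yet, instead of
                // failing the whole insert with "no such field".
                .setIgnoreUnknownValues(true)
                // Insert the valid rows even if some rows in the batch are rejected.
                .setSkipInvalidRows(true)
                .addRow(row)
                .build();

        InsertAllResponse response = bigquery.insertAll(request);
        if (response.hasErrors()) {
            response.getInsertErrors().forEach((index, errors) ->
                    System.err.println("row " + index + ": " + errors));
        }
    }
}

Keep in mind that with ignoreUnknownValues the server_timestamp values are silently discarded until the cached schema catches up.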
Historical note: A previous version of this answer suggested that users stop streaming, move the existing data to another table, re-create the streaming table, and restart streaming. However, due to the complexity of this approach and the shortened window for the schema cache, this approach is no longer recommended by the BigQuery team.

I was running into this error. It turned out that I was building the insert object as if I were in "raw" mode but had forgotten to set the flag raw: true. This caused BigQuery to take my insert data and nest it again under a json: {} node.
In other words, I was doing this:
table.insert({
  insertId: 123,
  json: {
    col1: '1',
    col2: '2',
  }
});
when I should have been doing this:
table.insert({
  insertId: 123,
  json: {
    col1: '1',
    col2: '2',
  }
}, {raw: true});
The Node.js BigQuery library didn't realize the data was already in raw form and was then trying to insert this:
{
  insertId: '<generated value>',
  json: {
    insertId: 123,
    json: {
      col1: '1',
      col2: '2',
    }
  }
}
So in my case the errors were referring to the fact that the insert expected my schema to have two columns (insertId and json).

Related

AWS DynamoDB global secondary index cannot query existing items

E.g., myTable:
id  field1
1   guy
id as the primary key and field1 as the global secondary index:
- AttributeName: id
  KeyType: HASH
GlobalSecondaryIndexes:
  - IndexName: Field1Index
    KeySchema:
      - AttributeName: field1
        KeyType: HASH
When I use
dynamodb.scan()
or
dynamodb.get({TableName: 'myTable', Key: {id: 1}})
the record can be found.
But when I query through the index:
dynamodb.query({
  TableName: 'myTable',
  IndexName: 'Field1Index',
  ExpressionAttributeNames: {
    '#field1': 'field1'
  },
  ExpressionAttributeValues: {
    ':field1': field1
  },
  KeyConditionExpression: '#field1 = :field1'
})
With this query, older data can be found, but recently written data cannot.
I don't know whether it has something to do with the table's item count exceeding 8,000, but I should mention it up front, and add that data is always written into this table through batch uploads.
At present, my suspicion is that this data was never registered in the secondary index. There are 8,357 items in the table and 7,599 items in the secondary index, so it seems the items missing from the index are exactly the ones that can't be queried.
I also tried deleting the problem data first and then adding it again, but the problem still isn't solved.
Does anyone know what the reason could be?
It looks like this is the relevant issue, though the detection tool may produce various warnings due to Java version issues:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.OnlineOps.ViolationDetection.html
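For what it's worth, one way to script the count comparison described in the question (8,357 items in the table vs 7,599 in the index) is a paginated COUNT scan of both the base table and the GSI. A minimal sketch, assuming the AWS SDK for Java v2; the table and index names come from the question:

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.ScanRequest;
import software.amazon.awssdk.services.dynamodb.model.ScanResponse;
import software.amazon.awssdk.services.dynamodb.model.Select;

public class GsiCountCheck {
    public static void main(String[] args) {
        DynamoDbClient dynamodb = DynamoDbClient.create();

        long tableCount = countItems(dynamodb,
                ScanRequest.builder().tableName("myTable").select(Select.COUNT).build());
        long indexCount = countItems(dynamodb,
                ScanRequest.builder().tableName("myTable").indexName("Field1Index").select(Select.COUNT).build());

        // Items whose field1 attribute is missing or has a non-matching type are not
        // projected into the GSI, which would explain a gap between the two counts.
        System.out.println("base table: " + tableCount + ", Field1Index: " + indexCount);
    }

    private static long countItems(DynamoDbClient dynamodb, ScanRequest request) {
        long total = 0;
        // The paginator follows LastEvaluatedKey across scan pages automatically.
        for (ScanResponse page : dynamodb.scanPaginator(request)) {
            total += page.count();
        }
        return total;
    }
}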

Unnesting a large number of columns in BigQuery and BigTable

I have a table in BigTable, with a single column family, containing some lead data. I was following the Google Cloud guide to querying BigTable data from BigQuery (https://cloud.google.com/bigquery/external-data-bigtable) and so far so good.
I've created the table definition file, as the docs require:
{
  "sourceFormat": "BIGTABLE",
  "sourceUris": [
    "https://googleapis.com/bigtable/projects/{project_id}/instances/{instance_id}/tables/{table_id}"
  ],
  "bigtableOptions": {
    "readRowkeyAsString": "true",
    "columnFamilies": [
      {
        "familyId": "leads",
        "columns": [
          {
            "qualifierString": "Id",
            "type": "STRING"
          },
          {
            "qualifierString": "IsDeleted",
            "type": "STRING"
          },
          ...
        ]
      }
    ]
  }
}
But then, things started to go south...
This is how the BigQuery "table" ended up looking:
Each row corresponds to a rowkey, and inside each column there's a nested cell structure, where the only value I need is the value from leads.Id.cell (in this case).
After a bit of searching I found a solution to this:
https://stackoverflow.com/a/70728545/4183597
So in my case it would be something like this:
SELECT
ARRAY_TO_STRING(ARRAY(SELECT value FROM UNNEST(leads.Id.cell)), "") AS Id,
...
FROM xxx
The problem is that I'm dealing with a dataset with more than 600 columns per row. It is unfeasible (and, given BigQuery's subquery limits, impossible) to repeat this expression 600+ times per row/query.
I couldn't think of a way to automate this query or of other methods to unnest this many cells (my SQL knowledge stops here).
Is there any way to do an unnesting like this for 600+ columns with a SQL/BigQuery query, preferably in a more efficient way? If not, I'm thinking of a daily batch process using a simple Python connector from BigTable to BigQuery, but I'm afraid of the costs this will incur.
Any documentation, reference or idea will be greatly appreciated.
Thank you.
In general, you're setting yourself up for a world of pain when you try to query a NoSQL database (like BigTable) using SQL. Unnesting data is a very expensive operation in SQL because you're effectively performing a cross join (which is many-to-many) every time UNNEST is called, so trying to do that 600+ times will give you either a query timeout or a huge bill.
The BigTable API will be way more efficient than SQL since it's designed to query NoSQL structures. A common pattern is to have a script that runs daily (such as a Python script in a Cloud Function) and uses the API to get that day's data, parse it, and then output that to a file in Cloud Storage. Then you can query those files via BigQuery as needed. A daily script that loops through all the columns of your data without requiring extensive data transforms is usually cheap and definitely less expensive than trying to force it through SQL.
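For illustration, here's a minimal sketch of that daily-export idea, assuming the Cloud Bigtable client library for Java; the project, instance, and table IDs are placeholders, output goes to stdout here, and in practice you would filter by timestamp range and write to a file in Cloud Storage instead:

import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.Query;
import com.google.cloud.bigtable.data.v2.models.Row;
import com.google.cloud.bigtable.data.v2.models.RowCell;

public class ExportLeads {
    public static void main(String[] args) throws Exception {
        // Placeholder project and instance IDs.
        try (BigtableDataClient client = BigtableDataClient.create("my-project", "my-instance")) {
            for (Row row : client.readRows(Query.create("leads-table"))) {
                StringBuilder line = new StringBuilder(row.getKey().toStringUtf8());
                // Every cell in the "leads" family is returned; emit qualifier=value
                // pairs so the 600+ columns are flattened without any UNNEST.
                for (RowCell cell : row.getCells("leads")) {
                    line.append(',')
                        .append(cell.getQualifier().toStringUtf8())
                        .append('=')
                        .append(cell.getValue().toStringUtf8());
                }
                System.out.println(line);
            }
        }
    }
}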
That being said, if you're really set on using SQL, you might be able to use BigQuery's JSON functions to extract the nested data you need. It's hard to visualize what your data structure is without sample data, but you may be able to read the whole row in as a single column of JSON or a string. Then if you have a predictable path for the values you are looking to extract, you could use a function like JSON_EXTRACT_STRING_ARRAY to extract all of those values into an array. A Regex function could be used similarly as well. But if you need to do this kind of parsing on the whole table in order to query it, a batch job to transform the data first will still be much more efficient.

Ignite/GridGain putAllIfAbsent

I am using GridGain as my persistence database. I have the following requirements:
1. Insert multiple records only if the keys do not already exist.
2. Put multiple records whether or not the keys exist.
For 1, I saw the cache.putIfAbsent(key, value) method for inserting a single record if it does not exist, but I didn't find a cache.putAllIfAbsent(Map<key, value>)-style method. I could use a loop to insert the records one by one, but will that cause a performance issue?
For 2, I think I can use the cache.putAll(Map<key, value>) method. Is that the proper way?
I run the servers in Google Cloud Kubernetes Engine and connect as thick clients.
putAll always overwrites existing records.
putIfAbsent in a loop will be slower than putAll; measure your specific use case to see by how much.
If there is no requirement for ordering and atomicity, a DataStreamer is a good choice. When the allowOverwrite flag is false (the default), you get putIfAbsent behaviour and good performance.
try (IgniteDataStreamer<Integer, String> stmr = ignite.dataStreamer("myCache")) {
    stmr.allowOverwrite(false); // don't overwrite existing data
    Map<Integer, String> entries = getMydata();
    stmr.addData(entries);
}
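For comparison with the plain cache API, here's a minimal sketch of the two approaches discussed above (the cache name and sample data are made up):

import java.util.Map;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class PutAllVsPutIfAbsent {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            IgniteCache<Integer, String> cache = ignite.getOrCreateCache("myCache");
            Map<Integer, String> entries = Map.of(1, "one", 2, "two"); // sample data

            // Requirement 2: put whether or not the keys exist; sent as one batch,
            // but it overwrites any existing values.
            cache.putAll(entries);

            // Requirement 1: insert only the missing keys; one operation per entry,
            // so expect it to be noticeably slower than putAll on large maps.
            entries.forEach(cache::putIfAbsent);
        }
    }
}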

BQ Switching to TIMESTAMP Partitioned Table

I'm attempting to migrate from ingestion-time (_PARTITIONTIME) partitioned tables to TIMESTAMP-partitioned tables in BQ. In doing so, I also need to add several required columns. However, when I flip the switch and redirect my dataflow to the new TIMESTAMP-partitioned table, it breaks. Things to note:
Approximately two million rows (likely one batch) are successfully inserted. The job continues to run but doesn't insert anything after that.
The job runs in batches.
My project is entirely in Java.
When I run it as streaming, it appears to work as intended. Unfortunately, it's not practical for my use case and batch is required.
I've been investigating the issue for a couple of days and tried to break down the transition into the smallest steps possible. It appears that the step responsible for the error is introducing REQUIRED variables (it works fine when the same variables are NULLABLE). To avoid any possible parsing errors, I've set default values for all of the REQUIRED variables.
At the moment, I get the following combination of errors and I'm not sure how to address any of them:
The first error, repeats infrequently but usually in groups:
Profiling Agent not found. Profiles will not be available from this
worker
Occurs a lot and in large groups:
Can't verify serialized elements of type BoundedSource have well defined equals method. This may produce incorrect results on some PipelineRunner
Appears to be one very large group of these:
Aborting Operations. java.lang.RuntimeException: Unable to read value from state
Towards the end, this error appears every 5 minutes, surrounded only by the mild parsing errors described below.
Processing stuck in step BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables) for at least 20m00s without outputting or completing in state finish
Due to the sheer volume of data my project parses, there are several parsing errors, such as Unexpected character. They're rare but shouldn't break data insertion. If they do, I have a bigger problem, as the data I collect changes frequently and I can only adjust the parser after I see the error and, therefore, the new data format. Additionally, these errors don't break the ingestion-time table (or my other timestamp-partitioned tables). That being said, here's an example of a parsing error:
Error: Unexpected character (',' (code 44)): was expecting double-quote to start field name
EDIT:
Some relevant sample code:
public PipelineResult streamData() {
    try {
        GenericSection generic = new GenericSection(options.getBQProject(), options.getBQDataset(), options.getBQTable());
        Pipeline pipeline = Pipeline.create(options);
        pipeline.apply("Read PubSub Events", PubsubIO.readMessagesWithAttributes().fromSubscription(options.getInputSubscription()))
                .apply(options.getWindowDuration() + " Windowing", generic.getWindowDuration(options.getWindowDuration()))
                .apply(generic.getPubsubToString())
                .apply(ParDo.of(new CrowdStrikeFunctions.RowBuilder()))
                .apply(new BigQueryBuilder().setBQDest(generic.getBQDest())
                        .setStreaming(options.getStreamingUpload())
                        .setTriggeringFrequency(options.getTriggeringFrequency())
                        .build());
        return pipeline.run();
    }
    catch (Exception e) {
        LOG.error(e.getMessage(), e);
        return null;
    }
}
Writing to BQ. I did try to set the partitioning field here directly, but it didn't seem to affect anything:
BigQueryIO.writeTableRows()
    .to(BQDest)
    .withMethod(Method.FILE_LOADS)
    .withNumFileShards(1000)
    .withTriggeringFrequency(this.triggeringFrequency)
    .withTimePartitioning(new TimePartitioning().setType("DAY"))
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER);
After a lot of digging, I found the error. I had parsing logic (a try/catch) that returned nothing (essentially a null row) whenever there was a parsing error. This broke BigQuery because my schema had several REQUIRED columns.
Since my job ran in batches, even one null row would cause the entire batch job to fail and insert nothing. This also explains why streaming inserted just fine. I'm surprised that BigQuery didn't throw an error saying I was attempting to insert a null into a required field.
In reaching this conclusion, I also realized that setting the partition field in my code was necessary, as opposed to setting it only in the schema. It can be done using
.setField(partitionField)
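For anyone hitting the same issue, here is a hedged sketch of the parsing fix, assuming a Beam DoFn that turns JSON strings into TableRows; the class and method names are placeholders rather than the original project's RowBuilder:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.DoFn;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class SafeRowBuilder extends DoFn<String, TableRow> {
    private static final Logger LOG = LoggerFactory.getLogger(SafeRowBuilder.class);

    @ProcessElement
    public void processElement(@Element String json, OutputReceiver<TableRow> out) {
        try {
            out.output(parseToTableRow(json)); // placeholder for the real parsing logic
        } catch (Exception e) {
            // Emit nothing on failure. Outputting a null or empty row poisons the
            // whole FILE_LOADS batch when the schema has REQUIRED columns.
            LOG.warn("Dropping unparseable record", e);
        }
    }

    private TableRow parseToTableRow(String json) {
        // ... build the TableRow from the payload here ...
        return new TableRow();
    }
}

The partition column itself is set on the write, mirroring the note above, e.g. .withTimePartitioning(new TimePartitioning().setType("DAY").setField(partitionField)).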

Elasticsearch document sorting and indexing issue

I have 9,000 documents in my Elasticsearch index.
I want to sort by an analyzed string field. To do that, I learned (through Google) that I must update the mapping so the field is not-analyzed, and then re-index the data to reflect the mapping change.
The re-indexing process took about 20 minutes on my machine.
The strange thing is that the same re-indexing took about 2 hours on a very powerful production server.
I checked the memory and processor usage on that server and everything was normal.
What I want to know is:
1. Is there a way to sort documents by an analyzed, tokenized field without re-indexing all the documents?
2. If I must re-index all the documents, why does it take such a huge amount of time on the server, and how can I trace the cause of the slowness?
As long as the field is stored in _source, I'm pretty sure you could use a script to create a custom sort field every time you search.
{
  "query" : { "query_string" : { "query" : "*:*" } },
  "sort" : {
    "_script" : {
      "script" : "<some sorting field>",
      "type" : "number",
      "params" : {},
      "order" : "asc"
    }
  }
}
This has the downside of re-evaluating the sorting script on the server side each time you search, but I think it solves (1).