BQ Switching to TIMESTAMP Partitioned Table - google-bigquery

I'm attempting to migrate IngestionTime (_PARTITIONTIME) to TIMESTAMP partitioned tables in BQ. In doing so, I also need to add several required columns. However, when I flip the switch and redirect my dataflow to the new TIMESTAMP partitioned table, it breaks. Things to note:
Approximately two million rows (likely one batch) is successfully inserted. The job continues to run but doesn't insert anything after that.
The job runs in batches.
My project is entirely in Java
When I run it as streaming, it appears to work as intended. Unfortunately, it's not practical for my use case and batch is required.
I've been investigating the issue for a couple of days and tried to break down the transition into the smallest steps possible. It appears that the step responsible for the error is introducing REQUIRED variables (it works fine when the same variables are NULLABLE). To avoid any possible parsing errors, I've set default values for all of the REQUIRED variables.
At the moment, I get the following combination of errors and I'm not sure how to address any of them:
The first error, repeats infrequently but usually in groups:
Profiling Agent not found. Profiles will not be available from this
worker
Occurs a lot and in large groups:
Can't verify serialized elements of type BoundedSource have well defined equals method. This may produce incorrect results on some PipelineRunner
Appears to be one very large group of these:
Aborting Operations. java.lang.RuntimeException: Unable to read value from state
Towards the end, this error appears every 5 minutes only surrounded by mild parsing errors described below.
Processing stuck in step BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables) for at least 20m00s without outputting or completing in state finish
Due to the sheer volume of data my project parses, there are several parsing errors such as Unexpected character. They're rare but shouldn't break data insertion. If they do, I have a bigger problem as the data I collect changes frequently and I can adjust the parser only after I see the error, and therefore, see the new data format. Additionally, this doesn't cause the ingestiontime table to break (or my other timestamp partition tables to break). That being said, here's an example of a parsing error:
Error: Unexpected character (',' (code 44)): was expecting double-quote to start field name
EDIT:
Some relevant sample code:
public PipelineResult streamData() {
try {
GenericSection generic = new GenericSection(options.getBQProject(), options.getBQDataset(), options.getBQTable());
Pipeline pipeline = Pipeline.create(options);
pipeline.apply("Read PubSub Events", PubsubIO.readMessagesWithAttributes().fromSubscription(options.getInputSubscription()))
.apply(options.getWindowDuration() + " Windowing", generic.getWindowDuration(options.getWindowDuration()))
.apply(generic.getPubsubToString())
.apply(ParDo.of(new CrowdStrikeFunctions.RowBuilder()))
.apply(new BigQueryBuilder().setBQDest(generic.getBQDest())
.setStreaming(options.getStreamingUpload())
.setTriggeringFrequency(options.getTriggeringFrequency())
.build());
return pipeline.run();
}
catch (Exception e) {
LOG.error(e.getMessage(), e);
return null;
}
Writing to BQ. I did try to set the partitoning field here directly, but it didn't seem to affect anything:
BigQueryIO.writeTableRows()
.to(BQDest)
.withMethod(Method.FILE_LOADS)
.withNumFileShards(1000)
.withTriggeringFrequency(this.triggeringFrequency)
.withTimePartitioning(new TimePartitioning().setType("DAY"))
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER);
}

After a lot of digging, I found the error. I had parsing logic (a try/catch) that returned nothing (essentially a null row) in the event there was a parsing error. This would break BigQuery as my schema had several REQUIRED rows.
Since my job ran in batches, even one null row would cause the entire batch job to fail and not insert anything. This also explains why streaming inserted just fine. I'm surprised that BigQuery didn't throw an error claiming that I was attempting to insert a null into a required field.
In reaching this conclusion, I also realized that setting the partition field in my code was also necessary as opposed to just in the schema. It could be done using
.setField(partitionField)

Related

Superset with Impala - Invalid Session id/No protocol version header

I have Superset using Impala as the main data source. Most of the times, every query runs smoothly and I can build charts and dashboards with ease. I need to generate a Table Chart, containing around 100k records and 30+ columns, but I am having some issues. It is basically a SELECT *, no aggregations, filtering or ordering
are being used.
When the data is relatively big, Superset just throws a bunch of errors (It appears to be that the errors are coming from Impala). But I cannot find any
information regarding those errors. I have tried paginating the results, but it did not worked. Also, when I run the query in Superset Chart page, it doesn't take long, it just
displays the error. The only way some information gets displayed in the Table Chart is when I limit the rows at the "Row limit" option to 10 records. But, this will not work out for me.
Those are the errors that keep ocurring:
impala error: Invalid session id: f344bf1aa2a42e2b:ad1df0047d7f909c
impala error: No protocol version header
When I use the Oracle connection that I also have, I can generate a table chart from a large amount of records with no problem.
My setup is the following:
Impala v3.2.0-cdh6.3.3
Superset v0.36.0
So, is that a problem with Superset or Impala? Could have something to with configuration in Superset?

Librato composite error: What does: Unable to execute composite: ["error": "Requested MD data from SD endpoint"]. mean?

I want to create an alert that triggers whenever one of the following counter statistics is not zero:
a.b.c.failed
a.b.e.failed
I already use these statistics separately on a dashboard page, but as they occur rarely, I'd like an alert.
It appears I have to make a sum composite so that I can trigger the alert when the sum is above zero. I think the composite would look something like:
sum(series("a.b.*.failed",{}))
However, every attempt I make gives the error:
Unable to execute composite: ["error": "Requested MD data from SD endpoint"]
There is another thread that suggested replacing the {} with "*" (including the quotes). This no longer gives an error, but gives a bizarre result (it's above zero all the time, even though there only very rarely any 'failed' statistics above zero).
The correct expression for my case is:
sum(derive(series("a.b.*.failed","*")))
Using "*" works to select the source.
Derive gives the change of each statistic instead of the cumulative total (but I'm not sure why the cumulative total was showing up - it is not shown normally for these statistics).
Sum adds the change of the different statistics.
I don't understand why {} doesn't work - I think that is related to the mystery of the meaning the error message that uses undocumented terminology (MD and SD endpoints). Librato documentation of their composite statistics function language is very minimal and provides few examples and few explanations of the meaning of terms and technical foundations.

Pentaho "Return value id can't be found in the input row"

I have a pentaho transformation, which is used to read a text file, to check some conditions( from which you can have errors, such as the number should be a positive number). From this errors I'm creating an excel file and I need for my job the number of the lines in this error file plus to log which lines were with problem.
The problem is that sometimes I have an error " the return values id can't be found in the input row".
This error is not every time. The job is running every night and sometimes it can work without any problems like one month and in one sunny day I just have this error.
I don't think that this is from the file, because if I execute the job again with the same file it is working. I can't understand what is the reason to fail, because it is saying the value "id", but I don't have such a value/column. Why it is searching a value, which doesn't exists.
Another strange thing is that normally the step, which fails should be executed at all( as far as I know), because no errors were found, so we don't have rows at all to this step.
Maybe the problem is connected with the "Prioritize Stream" step? Here I'm getting all errors( which use exactly the same columns). I tried before the grouping steps to put a sorting, but it didn't help. Now I'm thinking to try with "Blocking step".
The problem is that I don't know why this happen and how to fix it. Any suggestions?
see here
Check if all your aggregates ins the Group by step have a name.
However, sometimes the error comes from a previous step: the group (count...) request data from the Prioritize Stream, and if that step has an error, the error gets reported mistakenly as coming from the group rather than from the Prioritze.
Also, you mention a step which should not be executed because there is no data: I do not see any Filter which would prevent rows with missing id to flow from the Prioritize to the count.
This is a bug. It happens randomly in one of my transformations that often ends up with empty stream (no rows). It mostly works, but once in a while it gives this error. It seems to only fail when the stream is empty though.

Validating rows before inserting into BigQuery from Dataflow

According to
How do we set maximum_bad_records when loading a Bigquery table from dataflow? there is currently no way to set the maxBadRecords configuration when loading data into BigQuery from Dataflow. The suggestion is to validate the rows in the Dataflow job before inserting them into BigQuery.
If I have the TableSchema and a TableRow, how do I go about making sure that the row can safely be inserted into the table?
There must be an easier way of doing this than iterating over the fields in the schema, looking at their type and looking at the class of the value in the row, right? That seems error-prone, and the method must be fool-proof since the whole pipeline fails if a single row cannot be loaded.
Update:
My use case is an ETL job that at first will run on JSON (one object per line) logs on Cloud Storage and write to BigQuery in batch, but later will read objects from PubSub and write to BigQuery continuously. The objects contain a lot of information that isn't necessary to have in BigQuery and also contains parts that aren't even possible to describe in a schema (basically free form JSON payloads). Things like timestamps also need to be formatted to work with BigQuery. There will be a few variants of this job running on different inputs and writing to different tables.
In theory it's not a very difficult process, it takes an object, extracts a few properties (50-100), formats some of them and outputs the object to BigQuery. I more or less just loop over a list of property names, extract the value from the source object, look at a config to see if the property should be formatted somehow, apply the formatting if necessary (this could be downcasing, dividing a millisecond timestamp by 1000, extracting the hostname from a URL, etc.), and write the value to a TableRow object.
My problem is that data is messy. With a couple of hundred million objects there are some that don't look as expected, it's rare, but with these volumes rare things still happen. Sometimes a property that should contain a string contains an integer, or vice-versa. Sometimes there's an array or an object where there should be a string.
Ideally I would like to take my TableRow and pass it by TableSchema and ask "does this work?".
Since this isn't possible what I do instead is I look at the TableSchema object and try to validate/cast the values myself. If the TableSchema says a property is of type STRING I run value.toString() before adding it to the TableRow. If it's an INTEGER I check that it's a Integer, Long or BigInteger, and so on. The problem with this method is that I'm just guessing what will work in BigQuery. What Java data types will it accept for FLOAT? For TIMESTAMP? I think my validations/casts catch most problems, but there are always exceptions and edge cases.
In my experience, which is very limited, the whole work pipeline (job? workflow? not sure about the correct term) fails if a single row fails BigQuery's validations (just like a regular load does unless maxBadRecords is set to a sufficiently large number). It also fails with superficially helpful messages like 'BigQuery import job "dataflow_job_xxx" failed. Causes: (5db0b2cdab1557e0): BigQuery job "dataflow_job_xxx" in project "xxx" finished with error(s): errorResult: JSON map specified for non-record field, error: JSON map specified for non-record field, error: JSON map specified for non-record field, error: JSON map specified for non-record field, error: JSON map specified for non-record field, error: JSON map specified for non-record field'. Perhaps there is somewhere that can see a more detailed error message that could tell me which property it was and what the value was? Without that information it could just as well have said "bad data".
From what I can tell, at least when running in batch mode Dataflow will write the TableRow objects to the staging area in Cloud Storage and then start a load once everything is there. This means that there is nowhere for me to catch any errors, my code is no longer running when BigQuery is loaded. I haven't run any job in streaming mode yet, but I'm not sure how it would be different there, from my (admittedly limited) understanding the basic principle is the same, it's just the batch size that's smaller.
People use Dataflow and BigQuery, so it can't be impossible to make this work without always having to worry about the whole pipeline stopping because of a single bad input. How do people do it?
I'm assuming you deserialize the JSON from the file as a Map<String, Object>. Then you should be able to recursively type-check it with a TableSchema.
I'd recommend an iterative approach to developing your schema validation, with the following two steps.
Write a PTransform<Map<String, Object>, TableRow> that converts your JSON rows to TableRow objects. The TableSchema should also be a constructor argument to the function. You can start off making this function really strict -- require that JSON parsed input as Integer directly, for instance, when a BigQuery INTEGER schema was found -- and aggressively declare records in error. Basically, ensure that no invalid records are output by being super-strict in your handling.
Our code here does something somewhat similar -- given a file produced by BigQuery and written as JSON to GCS, we recursively walk the schema and do some type conversions. However, we do not need to validate, because BigQuery itself wrote the data.
Note that the TableSchema object is not Serializable. We've worked around by converting the TableSchema in a DoFn or PTransform constructor to a JSON String and back. See the code in BigQueryIO.java that uses the jsonTableSchema variable.
Use the "dead-letter" strategy described in this blog post to handle bad records -- side output the offending Map<String, Object> rows from your PTransform and write them to a file. That way, you can inspect the rows that failed your validation later.
You might start with some small files and use the DirectPipelineRunner rather than the DataflowPipelineRunner. The direct runner runs the pipeline on your computer, rather than on Google Cloud Dataflow service, and it uses the BigQuery streaming writes. I believe when those writes fail you will get better error messages.
(We use the GCS->BigQuery Load Job pattern for Batch jobs because it's much more efficient and cost-effective, but BigQuery streaming writes in Streaming jobs because they are low-latency.)
Finally, in terms of logging information:
Definitely check Cloud Logging (by following the Worker Logs link on the logs panel.
You may get better information about why the load jobs triggered by your Batch Dataflows fail if you run the bq command-line utility: bq show -j PROJECT:dataflow_job_XXXXXXX.

Big query throwing invalid errors on load jobs

I have a regularly scheduled load job that runs and imports data into bigQuery via the json data format every hour. This process has been working fine for months,now all of a sudden bigQuery has started to throw me back errors about missing required fields.
Naturally the first thing I did was review my schema and compare to one of the JSON files and all required fields are indeed there. Bigquery doesn't throw much information back beyond that, and I have checked and re-checked my data 20 times because I'm usually missing something.
Is this a back-end issue? or perhaps formatting requirements have changed? A perfect example would be JOB # job_2ee5a4be176c421985d7c3eaa84abf4b.It tells me "missing required field(s)", of which there are only 4 in my schema - I check my JSON for this particular job and they are all there.
Any light shed on this would be tremendously helpful, thanks in advance!!
A sample of the json, only the first 4 fields are required in my schema, and they are all there! I have also double-checked to make sure no extra fields are in the json, and each json is on a new line etc.:
{"date":"2013-05-31 20:56:41","sdate":1370033801,"type":"0","act":"1","cid":"139","chain":"5156","hotel":"21441","template":"default","arrival":"2013-08-04 00:00:00","depart":"2013-08-05 00:00:00","window":"64","nights":"1","total":"0.0000","dailyrate":"0.0000","session":"1530894334","source":"google","keyword":"the carolina hotel chapel hill nc","campaign":"organic","medium":"organic","visits":"2","device":"pc","language":"en-us","ip":"gc.synxis.com","cookies":"2","base_total":"0.0000","base_rate":"0.0000","batch":"batch_1370045767"}
I am a Google engineer who works on BigQuery. Sorry for the trouble; it appears that you're missing a required RECORD field called currencies.
It appears that the old code was accepting this due to a bug. It was creating empty RECORD fields even if one was not specified in the JSON. As a result, a RECORD field that was REQUIRED could be omitted without triggering an error. However, the correct behavior is to trigger an error, which is what the current code does.
It is unfortunate that the error message does not tell you which required field was missing. This is a TODO in the current version of the code.