BigQuery maximum row size - google-bigquery

Recently we've started to get errors about "Row larger than the maximum allowed size".
Although the documentation states a 2MB limit for JSON, we have also successfully loaded 4MB (and larger) records (see job job_Xr8vR3Fyp6rlH4zYaZFbZSyQsyI for an example of a 4.6MB record).
Has there been any change in the maximum allowed row size?
The erroneous job is job_qt_sCwokO2PWKNZsGNx6mK3cCWs. Unfortunately, the error messages produced don't specify which record(s) are the problematic ones.

There hasn't been a change in the maximum row size (I double checked and went back through change lists and didn't see anything that could affect this). The maximum is computed from the encoded row, rather than the raw row, which is why you sometimes can get larger rows than the specified maximum into the system.
From looking at your failed job in the logs, it looks like the error was on line 1. Did that information not get returned in the job errors? Or is that line not the offending one?
It did look like there was a repeated field with a lot of entries that looked like "Person..durable".
Please let me know if you think that you received this in error or what we can do to make the error messages better.
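For anyone debugging a similar failure today, here is a minimal sketch of pulling a job's error details with the current google-cloud-bigquery Python client (which postdates this thread); the job ID shown is the one from the question and is purely illustrative:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Job ID from the question above; substitute your own failed load job.
job = client.get_job("job_qt_sCwokO2PWKNZsGNx6mK3cCWs")

# error_result holds the fatal error; errors lists every problem the backend
# reported, including line/record locations when they are available.
print(job.error_result)
for err in job.errors or []:
    print(err.get("reason"), err.get("location"), err.get("message"))
```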

Related

CloudWatchLogs line length limit

I was wondering if CloudWatchLogs has a limit on the length of a single line of logging. I checked the CloudWatchLogs Limits documentation page, but it does not specify anything regarding a line length limit.
They do mention the event size limit (256 KB), which is the maximum size of one event, but that does not tell me anything about the length of the line. A log event can contain more information than only the #message field.
Looking into this a bit (since I was curious about the same thing). The boto3 python client documentation refers to log lines as events. The event consists of the timestamp and the message. Within various AWS tools, the message can be broken into various fields, but I believe the timestamp and message are the only actual fields in the log event.
So this would suggest that about 256K is the maximum size for each line (minus the size of the timestamp and probably some overhead as well).
This is not to say that the AWS web console will handle lines that long well though.
The maximum event size is ~256 KB; events longer than that will fail (they are not truncated). This size includes 26 bytes of metadata (10 bytes for the timestamp and 16 for the field names).
This can be verified with a short boto3 script like the sketch below.
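The original script link is gone, so this is a minimal reconstruction of the kind of probe it described, assuming a boto3 CloudWatch Logs client and throwaway group/stream names, and using the 26-byte overhead figure quoted above:

```python
import time
import boto3
from botocore.exceptions import ClientError

logs = boto3.client("logs")

GROUP = "size-limit-test"   # hypothetical names created just for this probe
STREAM = "probe"
LIMIT = 256 * 1024          # documented maximum event size in bytes
OVERHEAD = 26               # per-event metadata counted against that limit

logs.create_log_group(logGroupName=GROUP)
logs.create_log_stream(logGroupName=GROUP, logStreamName=STREAM)

def put(message):
    # Recent API versions no longer require sequence tokens; on older ones,
    # pass the nextSequenceToken returned by the previous call.
    return logs.put_log_events(
        logGroupName=GROUP,
        logStreamName=STREAM,
        logEvents=[{"timestamp": int(time.time() * 1000), "message": message}],
    )

# A message that exactly fills the limit should be accepted...
put("x" * (LIMIT - OVERHEAD))

# ...while one byte more should be rejected by the service, not truncated.
try:
    put("x" * (LIMIT - OVERHEAD + 1))
except ClientError as exc:
    print("rejected as expected:", exc.response["Error"]["Code"])
```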
I tried to console.log a big file (around 800 KB); in CloudWatch I can see 4 console.log messages, the first 3 around 250 KB each and the rest in the last one.
So from my experience the number of lines does not matter; only the total size of each event matters.

Pentaho "Return value id can't be found in the input row"

I have a Pentaho transformation which is used to read a text file and check some conditions (which can produce errors, such as a number that should be positive). From these errors I'm creating an Excel file, and for my job I need the number of lines in this error file, plus to log which lines had problems.
The problem is that sometimes I get the error "the return value id can't be found in the input row".
This error does not happen every time. The job runs every night; sometimes it works without any problems for a month, and then one day I just get this error.
I don't think this is caused by the file, because if I execute the job again with the same file it works. I can't understand the reason for the failure, because it mentions the value "id", but I don't have such a value/column. Why is it searching for a value that doesn't exist?
Another strange thing is that normally the failing step shouldn't be executed at all (as far as I know), because no errors were found, so no rows reach this step at all.
Maybe the problem is connected with the "Prioritize Stream" step? This is where I'm collecting all the errors (which use exactly the same columns). I tried putting a sort before the grouping steps, but it didn't help. Now I'm thinking of trying a "Blocking step".
The problem is that I don't know why this happens or how to fix it. Any suggestions?
Check if all your aggregates in the Group by step have a name.
However, sometimes the error comes from a previous step: the group (count...) requests data from the Prioritize Stream, and if that step has an error, the error gets mistakenly reported as coming from the group rather than from the Prioritize.
Also, you mention a step which should not be executed because there is no data: I do not see any Filter which would prevent rows with a missing id from flowing from the Prioritize to the count.
This is a bug. It happens randomly in one of my transformations that often ends up with an empty stream (no rows). It mostly works, but once in a while it gives this error. It seems to fail only when the stream is empty, though.

In OpenOffice 3.3 why am I getting a "maximum number of rows has been exceeded" message?

I am trying to open a .csv file generated by an Oracle SQL query in OpenOffice 3.3, and it is spitting back this message:
The maximum number of rows has been exceeded. Excess rows were not
imported!
I have looked at this question but it did not help. This is so strange because, of the 21,088 rows, every single one loaded.
The more bizarre part is that some of the rows have been printed with proper delimitation and some haven't. In particular, seemingly random rows are truncated, including the first one, about the last 1000, and even the very last row that normally says "PL/SQL procedure successfully completed." just says "PL/SQL".
I have two databases identical in structure that I have been running the query on, and the other one works with no issues. I'm floored. Any ideas?
I solved the problem. I hadn't noticed that it was opening the file with the delimiter set to the space character. I do not know why it was giving that error message, because all the data was still there (hidden many pages over; it's a LONG string), but at the very least that solved the problem. Thank you for your help.
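If it helps anyone else hitting the same delimiter mix-up, here is a small, hypothetical Python check for what delimiter an export actually uses before opening it in Calc; the file name is a placeholder:

```python
import csv

# Hypothetical file name; substitute the exported query results.
PATH = "query_results.csv"

with open(PATH, newline="") as f:
    sample = f.read(64 * 1024)

# Sniffer guesses the delimiter from a sample; restricting the candidates
# keeps a space inside the data from being mistaken for the separator.
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
print("detected delimiter:", repr(dialect.delimiter))

# Count the rows and the widest row to see whether the file itself is intact.
with open(PATH, newline="") as f:
    rows = list(csv.reader(f, dialect))
print(len(rows), "rows; widest row has", max(len(r) for r in rows), "fields")
```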

What is the maximum permitted response data size?

In the API Docs section Browsing Table Data, there is a reference to the "permitted response data size"; however, that link is dead. Experimentation revealed that requests with maxResults=50000 are usually successful, but as I near maxResults=100000 I begin to get errors from the BigQuery server.
This is happening while I page through a large table (or set of query results), so after each page is received, I request the next one; it thus doesn't matter to me what the page size is, but it does affect the communication with BigQuery.
What is the optimal value for this parameter?
Here is some explanation: https://developers.google.com/bigquery/docs/reference/v2/jobs/query?hl=en
The maximum number of rows of data to return per page of results. Setting this flag to a small value such as 1000 and then paging through results might improve reliability when the query result set is large. In addition to this limit, responses are also limited to 10 MB. By default, there is no maximum row count, and only the byte limit applies.
To sum up: max size is 10MB, no row count limit.
You can choose the value of the maxResults parameter based on how your app uses the data.
If you want to show data in a report, set a low value so the first page is returned quickly.
If you need to load data into another app, you can use the maximum possible value (record size * row count < 10 MB).
As you say, when you manually set maxResults=100000 to page through the result set, you get errors from the BigQuery server. Which errors do you get? Could you paste the error message?
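For reference, here is a minimal paging sketch with the current google-cloud-bigquery Python client (which postdates this thread); the table reference is a placeholder:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table reference; substitute your own project, dataset and table.
table = client.get_table("my-project.my_dataset.my_table")

# page_size plays the role of maxResults in the REST API: the client asks for
# that many rows per page and follows page tokens automatically. The 10 MB
# response limit still applies per page, so keep record size * page size
# comfortably below it.
total = 0
for row in client.list_rows(table, page_size=1000):
    total += 1
print("fetched", total, "rows")
```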

SSIS: importing files some with column names, some without

Presumably due to inconsistent configuration of logging devices, I need to load a collection of CSV files via SSIS that will sometimes have a first row with column names and sometimes not. The file format is otherwise identical.
There seems to be a chance that the logging configuration can be standardized, so I don't want to waste programming time on a script task that opens each file, determines whether it has a header row, and then processes it differently depending on the result.
Rather, I would like to specify something like Destination.MaxNumberOfErrors, that would allow up to one error row per file (so if the only problem in the file was the header, it would not fail). The Flat File Source error is fatal though, so I don't see a way of getting it to keep going.
The meaning of the failure code is defined by the component, but the
error is fatal and the pipeline stopped executing. There may be error
messages posted before this with more information about the failure.
My best choice seems to be to simply ignore the first data row for now and wait to see if a more uniform configuration can be achieved. Of course, the dataset is invalid while this strategy is in place. I should add that the data is very big, so the ETL routines need to be as efficient as possible. In my opinion this contraindicates any file parsing or conditional splitting if there is any alternative.
The question is whether there is a way to configure the Flat File Source to continue past this fatal error.
Yes there is!
In the "Error Output" page in the editor, change the Error response for each row to "Redirect row". Then you can trap the problem rows (the headers, in your case) by taking them as a single column through the error output of your source.
If you can assume the values for header names would never appear in your data, then define your flat file connection manager as having no headers. The first step inside your data flow would check the values of column 1-N vs the header row values. Only let the data flow through if the values don't match.
Is there something more complex to the problem than that?
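To illustrate the header-check idea from the previous answer outside of SSIS, here is a hypothetical Python sketch of the same logic (treat the file as headerless, then drop the first row only if it matches the known column names); the column names and file name are placeholders:

```python
import csv

# Hypothetical column names; in SSIS these would be the flat file columns.
EXPECTED_HEADER = ["device_id", "timestamp", "value"]

def data_rows(path):
    """Yield data rows, skipping a leading header row only if one is present."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        first = next(reader, None)
        if first is None:
            return
        # Skip the first row only when it matches the known header names;
        # otherwise it is real data and must be kept.
        if [c.strip().lower() for c in first] != EXPECTED_HEADER:
            yield first
        yield from reader

for row in data_rows("device_log.csv"):  # placeholder file name
    print(row)
```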