Pentaho "Return value id can't be found in the input row"

I have a Pentaho transformation that reads a text file and checks some conditions (which can produce errors, such as a number that should be positive but isn't). From these errors I create an Excel file, and in my job I need the number of lines in this error file, plus a log of which lines had problems.
The problem is that sometimes I get the error "the return value id can't be found in the input row".
This error doesn't happen every time. The job runs every night, and sometimes it works without any problems for a month, and then one sunny day I just get this error.
I don't think the file is the cause, because if I execute the job again with the same file, it works. I can't understand why it fails: the message mentions the value "id", but I don't have such a value/column. Why is it searching for a value that doesn't exist?
Another strange thing is that the step which fails shouldn't normally be executed at all (as far as I know), because no errors were found, so no rows reach that step.
Maybe the problem is connected with the "Prioritize Stream" step? That is where I collect all the errors (which use exactly the same columns). I tried putting a sort before the grouping steps, but it didn't help. Now I'm thinking of trying a "Blocking step".
The problem is that I don't know why this happens or how to fix it. Any suggestions?

Check if all your aggregates in the Group by step have a name.
However, sometimes the error comes from a previous step: the Group by (count...) requests data from the Prioritize Stream step, and if that step has an error, the error gets reported mistakenly as coming from the Group by rather than from the Prioritize Stream.
Also, you mention a step which should not be executed because there is no data: I do not see any Filter which would prevent rows with a missing id from flowing from the Prioritize Stream to the count.

This is a bug. It happens randomly in one of my transformations, which often ends up with an empty stream (no rows). It mostly works, but once in a while it gives this error; it seems to fail only when the stream is empty.

Related

BQ Switching to TIMESTAMP Partitioned Table

I'm attempting to migrate IngestionTime (_PARTITIONTIME) to TIMESTAMP partitioned tables in BQ. In doing so, I also need to add several required columns. However, when I flip the switch and redirect my dataflow to the new TIMESTAMP partitioned table, it breaks. Things to note:
Approximately two million rows (likely one batch) are successfully inserted. The job continues to run but doesn't insert anything after that.
The job runs in batches.
My project is entirely in Java.
When I run it as streaming, it appears to work as intended. Unfortunately, streaming is not practical for my use case, and batch is required.
I've been investigating the issue for a couple of days and tried to break down the transition into the smallest steps possible. It appears that the step responsible for the error is introducing REQUIRED variables (it works fine when the same variables are NULLABLE). To avoid any possible parsing errors, I've set default values for all of the REQUIRED variables.
At the moment, I get the following combination of errors and I'm not sure how to address any of them:
The first error repeats infrequently, but usually in groups:
Profiling Agent not found. Profiles will not be available from this worker
Occurs a lot and in large groups:
Can't verify serialized elements of type BoundedSource have well defined equals method. This may produce incorrect results on some PipelineRunner
Appears to be one very large group of these:
Aborting Operations. java.lang.RuntimeException: Unable to read value from state
Towards the end, this error appears every 5 minutes, surrounded only by the mild parsing errors described below.
Processing stuck in step BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables) for at least 20m00s without outputting or completing in state finish
Due to the sheer volume of data my project parses, there are occasional parsing errors such as "Unexpected character". They're rare but shouldn't break data insertion. If they do, I have a bigger problem: the data I collect changes frequently, and I can only adjust the parser after I see the error (and therefore the new data format). Additionally, this doesn't cause the ingestion-time table to break (or my other timestamp-partitioned tables). That being said, here's an example of a parsing error:
Error: Unexpected character (',' (code 44)): was expecting double-quote to start field name
EDIT:
Some relevant sample code:
public PipelineResult streamData() {
    try {
        GenericSection generic = new GenericSection(options.getBQProject(), options.getBQDataset(), options.getBQTable());
        Pipeline pipeline = Pipeline.create(options);
        pipeline.apply("Read PubSub Events", PubsubIO.readMessagesWithAttributes().fromSubscription(options.getInputSubscription()))
                .apply(options.getWindowDuration() + " Windowing", generic.getWindowDuration(options.getWindowDuration()))
                .apply(generic.getPubsubToString())
                .apply(ParDo.of(new CrowdStrikeFunctions.RowBuilder()))
                .apply(new BigQueryBuilder().setBQDest(generic.getBQDest())
                        .setStreaming(options.getStreamingUpload())
                        .setTriggeringFrequency(options.getTriggeringFrequency())
                        .build());
        return pipeline.run();
    }
    catch (Exception e) {
        LOG.error(e.getMessage(), e);
        return null;
    }
}
Writing to BQ. I did try to set the partitioning field here directly, but it didn't seem to affect anything:
BigQueryIO.writeTableRows()
    .to(BQDest)
    .withMethod(Method.FILE_LOADS)
    .withNumFileShards(1000)
    .withTriggeringFrequency(this.triggeringFrequency)
    .withTimePartitioning(new TimePartitioning().setType("DAY"))
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER);
}
After a lot of digging, I found the error. I had parsing logic (a try/catch) that returned nothing (essentially a null row) in the event of a parsing error. This would break BigQuery, as my schema had several REQUIRED fields.
Since my job ran in batches, even one null row would cause the entire batch load job to fail and insert nothing. This also explains why streaming inserts worked just fine. I'm surprised that BigQuery didn't throw an error claiming that I was attempting to insert a null into a required field.
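The shape of the fix can be sketched in plain Java: on a parse failure, skip the element entirely instead of emitting a null row. The parser and row type here are hypothetical stand-ins for the real ones, not the code from this project:

```java
import java.util.ArrayList;
import java.util.List;

public class RowFilter {
    // Hypothetical parser: returns null when the input cannot be parsed.
    static String parseRow(String raw) {
        return raw.startsWith("{") ? raw : null;
    }

    // Skip unparseable rows instead of emitting nulls. A single null row
    // in a FILE_LOADS batch makes the whole load job fail when the
    // schema has REQUIRED fields.
    public static List<String> buildRows(List<String> rawLines) {
        List<String> rows = new ArrayList<>();
        for (String raw : rawLines) {
            String row = parseRow(raw);
            if (row != null) { // drop failures; do not add null
                rows.add(row);
            }
        }
        return rows;
    }
}
```

In a Beam ParDo, the equivalent is simply not calling output() in the catch branch.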
In reaching this conclusion, I also realized that setting the partition field in my code was necessary, as opposed to just in the schema. It can be done using
.setField(partitionField)
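Concretely, the partition column can be named on the write itself. A sketch of just the relevant configuration, assuming partitionField holds the name of the schema's TIMESTAMP column (this mirrors the write shown earlier; it is not a confirmed complete setup):

```java
BigQueryIO.writeTableRows()
    .to(BQDest)
    .withMethod(Method.FILE_LOADS)
    .withTriggeringFrequency(this.triggeringFrequency)
    // setField names the TIMESTAMP column used for column-based
    // partitioning; with setType("DAY") alone, no partition column
    // is specified on the load.
    .withTimePartitioning(new TimePartitioning()
        .setType("DAY")
        .setField(partitionField))
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER);
```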

In OpenOffice 3.3 why am I getting a "maximum number of rows has been exceeded" message?

I am trying to open a .csv file generated by an Oracle SQL query in OpenOffice 3.3, and it is spitting back this message:
The maximum number of rows has been exceeded. Excess rows were not imported!
I have looked at this question but it did not help. This is so strange because all 21,088 rows loaded, every single one.
The more bizarre part is that some of the rows have been printed with proper delimitation and some haven't. In particular, seemingly random rows are truncated, including the first one, about the last 1000, and even the very last row that normally says "PL/SQL procedure successfully completed." just says "PL/SQL".
I have two databases identical in structure that I have been running the query on, and the other one works with no issues. I'm floored. Any ideas?
I solved the problem. I hadn't noticed that the file was being opened with the delimiter set to the space character. I don't know why it gave that error message, because all the data was still there (hidden many pages over; it's a LONG string), but at the very least that solved the problem. Thank you for your help.

BigQuery maximum row size

Recently we've started to get errors about "Row larger than the maximum allowed size".
Although the documentation states a limit of 2MB for JSON, we have successfully loaded 4MB (and larger) records as well (see job job_Xr8vR3Fyp6rlH4zYaZFbZSyQsyI for an example of a 4.6MB record).
Has there been any change in the maximum allowed row size?
The erroneous job is job_qt_sCwokO2PWKNZsGNx6mK3cCWs. Unfortunately, the error messages produced don't specify which record(s) are problematic.
There hasn't been a change in the maximum row size (I double checked and went back through change lists and didn't see anything that could affect this). The maximum is computed from the encoded row, rather than the raw row, which is why you sometimes can get larger rows than the specified maximum into the system.
From looking at your failed job in the logs, it looks like the error was on line 1. Did that information not get returned in the job errors? Or is that line not the offending one?
It did look like there was a repeated field with a lot of entries that looked like "Person..durable".
Please let me know if you think that you received this in error or what we can do to make the error messages better.
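Since the limit applies to the encoded row rather than the raw row, a rough client-side pre-check is to measure each serialized JSON line's UTF-8 byte length before loading. The 2MB figure is the documented JSON limit; the internal encoded size can differ in either direction, so treat this as a heuristic for flagging suspicious records, not an exact check:

```java
import java.nio.charset.StandardCharsets;

public class RowSizeCheck {
    // Documented maximum JSON row size at the time of this question.
    static final int MAX_ROW_BYTES = 2 * 1024 * 1024;

    // Approximate a row's size by its UTF-8 byte length. The backend
    // measures the *encoded* row, which may differ from this, so the
    // result is only an approximation.
    public static boolean isSuspiciouslyLarge(String jsonLine) {
        return jsonLine.getBytes(StandardCharsets.UTF_8).length > MAX_ROW_BYTES;
    }
}
```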

Big query throwing invalid errors on load jobs

I have a regularly scheduled load job that runs every hour and imports data into BigQuery in the JSON data format. This process had been working fine for months; now, all of a sudden, BigQuery has started throwing back errors about missing required fields.
Naturally, the first thing I did was review my schema and compare it to one of the JSON files, and all required fields are indeed there. BigQuery doesn't throw much information back beyond that, and I have checked and re-checked my data 20 times, because I'm usually missing something.
Is this a back-end issue? Or have the formatting requirements changed? A perfect example is job job_2ee5a4be176c421985d7c3eaa84abf4b. It tells me "missing required field(s)", of which there are only 4 in my schema; I checked my JSON for this particular job and they are all there.
Any light shed on this would be tremendously helpful, thanks in advance!!
A sample of the JSON: only the first 4 fields are required in my schema, and they are all there! I have also double-checked to make sure no extra fields are in the JSON, each JSON record is on a new line, etc.:
{"date":"2013-05-31 20:56:41","sdate":1370033801,"type":"0","act":"1","cid":"139","chain":"5156","hotel":"21441","template":"default","arrival":"2013-08-04 00:00:00","depart":"2013-08-05 00:00:00","window":"64","nights":"1","total":"0.0000","dailyrate":"0.0000","session":"1530894334","source":"google","keyword":"the carolina hotel chapel hill nc","campaign":"organic","medium":"organic","visits":"2","device":"pc","language":"en-us","ip":"gc.synxis.com","cookies":"2","base_total":"0.0000","base_rate":"0.0000","batch":"batch_1370045767"}
I am a Google engineer who works on BigQuery. Sorry for the trouble; it appears that you're missing a required RECORD field called currencies.
It appears that the old code was accepting this due to a bug. It was creating empty RECORD fields even if one was not specified in the JSON. As a result, a RECORD field that was REQUIRED could be omitted without triggering an error. However, the correct behavior is to trigger an error, which is what the current code does.
It is unfortunate that the error message does not tell you which required field was missing. This is a TODO in the current version of the code.
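Until the error message names the missing field, a crude client-side pre-check can scan each newline-delimited JSON line for the required field names before submitting the load job. The field list below is an assumption based on this question (the first four fields of the sample plus the currencies RECORD the answer identifies); a simple substring check like this is only a heuristic, and a real JSON parser would be more robust:

```java
import java.util.ArrayList;
import java.util.List;

public class RequiredFieldCheck {
    // Assumed required fields for this question's schema; "currencies"
    // is the REQUIRED RECORD the load job was rejecting rows for.
    static final List<String> REQUIRED =
        List.of("date", "sdate", "type", "act", "currencies");

    // Heuristic: a required key should appear quoted somewhere in the line.
    public static List<String> missingFields(String jsonLine) {
        List<String> missing = new ArrayList<>();
        for (String field : REQUIRED) {
            if (!jsonLine.contains("\"" + field + "\"")) {
                missing.add(field);
            }
        }
        return missing;
    }
}
```

Running this over each line of the load file before submission would have pointed at the omitted currencies record.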

SSIS: importing files some with column names, some without

Presumably due to inconsistent configuration of logging devices, I need to load a collection of csv files via SSIS that will sometimes have a first row with column names and will sometimes not. The file format is otherwise identical.
There seems a chance that the logging configuration can be standardized, so I don't want to waste programming time with a script task that opens each file and determines whether it has a header row and then processes it differently depending.
Rather, I would like to specify something like Destination.MaxNumberOfErrors, that would allow up to one error row per file (so if the only problem in the file was the header, it would not fail). The Flat File Source error is fatal though, so I don't see a way of getting it to keep going.
The meaning of the failure code is defined by the component, but the error is fatal and the pipeline stopped executing. There may be error messages posted before this with more information about the failure.
My best choice seems to be to simply ignore the first data row for now and wait to see if a more uniform configuration can be achieved. Of course, the dataset is invalid while this strategy is in place. I should add that the data is very big, so the ETL routines need to be as efficient as possible. In my opinion this contraindicates any file parsing or conditional splitting if there is any alternative.
The question is if there is a way to configure the File Source to continue from this fatal error?
Yes there is!
In the "Error Output" page in the editor, change the Error response for each row to "Redirect row". Then you can trap the problem rows (the headers, in your case) by taking them as a single column through the error output of your source.
If you can assume the values for header names would never appear in your data, then define your flat file connection manager as having no headers. The first step inside your data flow would check the values of column 1-N vs the header row values. Only let the data flow through if the values don't match.
Is there something more complex to the problem than that?