Presumably due to inconsistent configuration of logging devices, I need to load a collection of CSV files via SSIS that will sometimes have a first row with column names and sometimes will not. The file format is otherwise identical.
There seems to be a chance that the logging configuration can be standardized, so I don't want to spend programming time on a script task that opens each file, determines whether it has a header row, and then processes it accordingly.
Rather, I would like to specify something like Destination.MaxNumberOfErrors that would allow up to one error row per file (so if the only problem in the file were the header, it would not fail). The Flat File Source error is fatal, though, so I don't see a way of getting it to keep going:
The meaning of the failure code is defined by the component, but the error is fatal and the pipeline stopped executing. There may be error messages posted before this with more information about the failure.
My best option seems to be to simply ignore the first row of every file for now and wait to see whether a more uniform configuration can be achieved. Of course, the dataset is invalid while this strategy is in place, since files without headers lose a real data row. I should add that the data volume is very large, so the ETL routines need to be as efficient as possible; in my opinion, that contraindicates any file parsing or conditional splitting if there is any alternative.
The question is: is there a way to configure the Flat File Source to continue past this fatal error?
Yes there is!
In the "Error Output" page in the editor, change the Error response for each row to "Redirect row". Then you can trap the problem rows (the headers, in your case) by taking them as a single column through the error output of your source.
If you can assume that the header-name values would never appear in your data, then define your flat file connection manager as having no headers. The first step inside your data flow would check the values of columns 1-N against the header row values and only let the data through if the values don't match, along the lines sketched below.
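For example, with the connection manager defined as headerless (so columns get generic names like Column0), a Conditional Split condition something like the following would catch the header rows (the string literals here are placeholders for your actual header values):

    [Column0] == "DeviceId" && [Column1] == "Timestamp"

Rows matching the condition go to an unused output; the default output carries the real data on to the destination.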
Is there something more complex to the problem than that?
Related
I have a Pentaho transformation that reads a text file and checks some conditions (which can produce errors, such as a number that should be positive). From these errors I'm creating an Excel file, and for my job I need the number of lines in this error file, plus a log of which lines had problems.
The problem is that sometimes I get the error "the return values id can't be found in the input row".
The error does not occur every time. The job runs every night, and sometimes it works without any problems for a month, and then one fine day I just get this error.
I don't think it comes from the file, because if I execute the job again with the same file, it works. I can't understand the reason for the failure: the message refers to the value "id", but I don't have such a value/column. Why is it searching for a value that doesn't exist?
Another strange thing is that the step which fails should not normally be executed at all (as far as I know), because no errors were found, so no rows reach that step in the first place.
Maybe the problem is connected with the "Prioritize Stream" step? That is where I collect all the errors (which use exactly the same columns). I tried putting a sort before the grouping steps, but it didn't help. Now I'm thinking of trying a "Blocking step".
The problem is that I don't know why this happens or how to fix it. Any suggestions?
Check that all your aggregates in the Group by step have a name.
However, sometimes the error comes from a previous step: the group (count...) requests data from the Prioritize Stream, and if that step has an error, the error is mistakenly reported as coming from the group rather than from the Prioritize Stream.
Also, you mention a step which should not be executed because there is no data: I do not see any filter that would prevent rows with a missing id from flowing from the Prioritize Stream to the count.
This is a bug. It happens randomly in one of my transformations, which often ends up with an empty stream (no rows). It mostly works, but once in a while it gives this error. It seems to fail only when the stream is empty, though.
According to "How do we set maximum_bad_records when loading a Bigquery table from dataflow?", there is currently no way to set the maxBadRecords configuration when loading data into BigQuery from Dataflow. The suggestion is to validate the rows in the Dataflow job before inserting them into BigQuery.
If I have the TableSchema and a TableRow, how do I go about making sure that the row can safely be inserted into the table?
There must be an easier way of doing this than iterating over the fields in the schema, looking at their type and looking at the class of the value in the row, right? That seems error-prone, and the method must be fool-proof since the whole pipeline fails if a single row cannot be loaded.
Update:
My use case is an ETL job that at first will run on JSON (one object per line) logs on Cloud Storage and write to BigQuery in batch, but later will read objects from PubSub and write to BigQuery continuously. The objects contain a lot of information that isn't necessary to have in BigQuery and also contains parts that aren't even possible to describe in a schema (basically free form JSON payloads). Things like timestamps also need to be formatted to work with BigQuery. There will be a few variants of this job running on different inputs and writing to different tables.
In theory it's not a very difficult process, it takes an object, extracts a few properties (50-100), formats some of them and outputs the object to BigQuery. I more or less just loop over a list of property names, extract the value from the source object, look at a config to see if the property should be formatted somehow, apply the formatting if necessary (this could be downcasing, dividing a millisecond timestamp by 1000, extracting the hostname from a URL, etc.), and write the value to a TableRow object.
My problem is that data is messy. With a couple of hundred million objects there are some that don't look as expected, it's rare, but with these volumes rare things still happen. Sometimes a property that should contain a string contains an integer, or vice-versa. Sometimes there's an array or an object where there should be a string.
Ideally I would like to take my TableRow, check it against the TableSchema, and ask "does this work?".
Since this isn't possible, what I do instead is look at the TableSchema object and try to validate/cast the values myself. If the TableSchema says a property is of type STRING, I run value.toString() before adding it to the TableRow. If it's an INTEGER, I check that it's an Integer, Long, or BigInteger, and so on. The problem with this method is that I'm just guessing what will work in BigQuery. What Java data types will it accept for FLOAT? For TIMESTAMP? I think my validations/casts catch most problems, but there are always exceptions and edge cases.
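For illustration, my per-type checks look roughly like this (a simplified sketch; the accepted Java types are my guesses, not a documented contract):

    import java.math.BigInteger;

    // Simplified sketch of the guesswork: coerce a raw JSON value into a
    // Java type that BigQuery will hopefully accept for the schema type.
    public class ValueCoercer {
      public static Object coerce(String bqType, Object value) {
        switch (bqType) {
          case "STRING":
            return value.toString();
          case "INTEGER":
            if (value instanceof Integer || value instanceof Long
                || value instanceof BigInteger) {
              return value;
            }
            throw new IllegalArgumentException("not an integer: " + value);
          case "FLOAT":
            if (value instanceof Number) {
              return ((Number) value).doubleValue();
            }
            throw new IllegalArgumentException("not a float: " + value);
          case "TIMESTAMP":
            // Guessing that epoch seconds work; our input is milliseconds.
            if (value instanceof Number) {
              return ((Number) value).longValue() / 1000.0;
            }
            throw new IllegalArgumentException("not a timestamp: " + value);
          default:
            throw new IllegalArgumentException("unhandled type: " + bqType);
        }
      }
    }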
In my experience, which is very limited, the whole pipeline (job? workflow? not sure about the correct term) fails if a single row fails BigQuery's validations (just like a regular load does unless maxBadRecords is set to a sufficiently large number). It also fails with superficially helpful messages like 'BigQuery import job "dataflow_job_xxx" failed. Causes: (5db0b2cdab1557e0): BigQuery job "dataflow_job_xxx" in project "xxx" finished with error(s): errorResult: JSON map specified for non-record field, error: JSON map specified for non-record field, error: JSON map specified for non-record field, error: JSON map specified for non-record field, error: JSON map specified for non-record field, error: JSON map specified for non-record field'. Perhaps there is somewhere I can see a more detailed error message that could tell me which property it was and what its value was? Without that information it could just as well have said "bad data".
From what I can tell, at least when running in batch mode, Dataflow will write the TableRow objects to a staging area in Cloud Storage and then start a load job once everything is there. This means there is nowhere for me to catch any errors; my code is no longer running by the time BigQuery loads the data. I haven't run any job in streaming mode yet, but I'm not sure how it would be different there; from my (admittedly limited) understanding, the basic principle is the same, it's just the batch size that's smaller.
People use Dataflow and BigQuery, so it can't be impossible to make this work without always having to worry about the whole pipeline stopping because of a single bad input. How do people do it?
I'm assuming you deserialize the JSON from the file as a Map<String, Object>. Then you should be able to recursively type-check it with a TableSchema.
I'd recommend an iterative approach to developing your schema validation, with the following two steps.
Write a PTransform<Map<String, Object>, TableRow> that converts your JSON rows to TableRow objects. The TableSchema should also be a constructor argument to the function. You can start off making this function really strict -- for instance, require that the JSON input parsed as an Integer directly when the schema says BigQuery INTEGER -- and aggressively declare records in error. Basically, ensure that no invalid records are output by being super-strict in your handling.
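A minimal sketch of such a converter, assuming each row arrives as a Map<String, Object> from a JSON parser (the class and its error handling are illustrative, not a library API; REPEATED fields are omitted for brevity):

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import java.util.List;
    import java.util.Map;

    // Illustrative strict converter: walk the schema and fail loudly on
    // any mismatch so no questionable record reaches BigQuery.
    public class StrictConverter {
      @SuppressWarnings("unchecked")
      public static TableRow convert(List<TableFieldSchema> fields,
                                     Map<String, Object> json) {
        TableRow row = new TableRow();
        for (TableFieldSchema field : fields) {
          Object value = json.get(field.getName());
          if (value == null) {
            if ("REQUIRED".equals(field.getMode())) {
              throw new IllegalArgumentException("missing: " + field.getName());
            }
            continue;
          }
          if ("RECORD".equals(field.getType())) {
            if (!(value instanceof Map)) {
              throw new IllegalArgumentException("not a record: " + field.getName());
            }
            row.set(field.getName(),
                    convert(field.getFields(), (Map<String, Object>) value));
          } else if ("INTEGER".equals(field.getType())) {
            if (!(value instanceof Integer || value instanceof Long)) {
              throw new IllegalArgumentException("not an integer: " + field.getName());
            }
            row.set(field.getName(), value);
          } else {
            // ... one equally strict branch per remaining type
            // (STRING, FLOAT, TIMESTAMP, BOOLEAN, ...).
            throw new IllegalArgumentException("unhandled type: " + field.getType());
          }
        }
        return row;
      }
    }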
Our code here does something somewhat similar -- given a file produced by BigQuery and written as JSON to GCS, we recursively walk the schema and do some type conversions. However, we do not need to validate, because BigQuery itself wrote the data.
Note that the TableSchema object is not Serializable. We've worked around this by converting the TableSchema to a JSON String in a DoFn or PTransform constructor and converting it back when needed. See the code in BigQueryIO.java that uses the jsonTableSchema variable.
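A sketch of that workaround (the holder class is illustrative; JacksonFactory comes from the google-api-client library the SDK already depends on):

    import com.google.api.client.json.jackson2.JacksonFactory;
    import com.google.api.services.bigquery.model.TableSchema;
    import java.io.IOException;
    import java.io.Serializable;

    // Illustrative workaround: serialize the schema to a JSON String on
    // the submitting machine and parse it back lazily on the workers.
    public class SchemaHolder implements Serializable {
      private final String jsonSchema;
      private transient TableSchema schema;  // rebuilt on each worker

      public SchemaHolder(TableSchema schema) throws IOException {
        this.jsonSchema = JacksonFactory.getDefaultInstance().toString(schema);
      }

      public TableSchema getSchema() throws IOException {
        if (schema == null) {
          schema = JacksonFactory.getDefaultInstance()
              .fromString(jsonSchema, TableSchema.class);
        }
        return schema;
      }
    }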
Use the "dead-letter" strategy described in this blog post to handle bad records -- side output the offending Map<String, Object> rows from your PTransform and write them to a file. That way, you can inspect the rows that failed your validation later.
You might start with some small files and use the DirectPipelineRunner rather than the DataflowPipelineRunner. The direct runner executes the pipeline on your own computer rather than on the Google Cloud Dataflow service, and it uses the BigQuery streaming writes. I believe when those writes fail you will get better error messages.
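If your main() builds its options with PipelineOptionsFactory.fromArgs (an assumption about your setup), switching runners is just a command-line flag:

    --runner=DirectPipelineRunner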
(We use the GCS->BigQuery Load Job pattern for Batch jobs because it's much more efficient and cost-effective, but BigQuery streaming writes in Streaming jobs because they are low-latency.)
Finally, in terms of logging information:
Definitely check Cloud Logging (by following the Worker Logs link on the logs panel).
You may get better information about why the load jobs triggered by your Batch Dataflows fail if you run the bq command-line utility: bq show -j PROJECT:dataflow_job_XXXXXXX.
I have a regularly scheduled load job that runs every hour and imports data into BigQuery via the JSON data format. This process has been working fine for months; now, all of a sudden, BigQuery has started to throw back errors about missing required fields.
Naturally, the first thing I did was review my schema and compare it to one of the JSON files, and all the required fields are indeed there. BigQuery doesn't give much information back beyond that, and I have checked and re-checked my data 20 times, because usually I am the one missing something.
Is this a back-end issue, or have the formatting requirements perhaps changed? A perfect example would be job # job_2ee5a4be176c421985d7c3eaa84abf4b. It tells me "missing required field(s)", of which there are only 4 in my schema; I checked my JSON for this particular job, and they are all there.
Any light shed on this would be tremendously helpful, thanks in advance!!
A sample of the JSON follows; only the first 4 fields are required in my schema, and they are all there! I have also double-checked that there are no extra fields in the JSON, that each JSON object is on its own line, etc.:
{"date":"2013-05-31 20:56:41","sdate":1370033801,"type":"0","act":"1","cid":"139","chain":"5156","hotel":"21441","template":"default","arrival":"2013-08-04 00:00:00","depart":"2013-08-05 00:00:00","window":"64","nights":"1","total":"0.0000","dailyrate":"0.0000","session":"1530894334","source":"google","keyword":"the carolina hotel chapel hill nc","campaign":"organic","medium":"organic","visits":"2","device":"pc","language":"en-us","ip":"gc.synxis.com","cookies":"2","base_total":"0.0000","base_rate":"0.0000","batch":"batch_1370045767"}
I am a Google engineer who works on BigQuery. Sorry for the trouble; it appears that you're missing a required RECORD field called currencies.
It appears that the old code was accepting this due to a bug. It was creating empty RECORD fields even if one was not specified in the JSON. As a result, a RECORD field that was REQUIRED could be omitted without triggering an error. However, the correct behavior is to trigger an error, which is what the current code does.
It is unfortunate that the error message does not tell you which required field was missing. This is a TODO in the current version of the code.
I have an ETL that produces text file output, and I have to check whether the text content contains the word "error" or "bad", using Pentaho.
Is there any simple way to do this?
If you are trying to process a number of files, you can use a Get Filenames step to get all the filenames. Then, if your text files are small, you can use a Get File Content step to read each whole file as one row, then use a Java Filter or other matching step (e.g., RegEx) to search for the words.
If your text files are too big but line-based or otherwise in a fixed format (which it likely is if you used a text file output step), you can use a Text File Input step to get the lines, then a matcher step (see above) to find the words in the line. Then you can use a Filter Rows step to choose just those rows that contain the words, then Select Values to choose just the filename, then a Sort Rows on the filename, then a Unique Rows step. The result should be a list of filenames whose contents contain the search words.
This may seem like a lot of steps, but Pentaho Data Integration or PDI (aka Kettle) is designed as a flow of steps with distinct (and very reusable) functionality. A more compact but less "PDI" method is to write a User Defined Java Class (or other scripting) step to do all the work, as sketched below. That solution has fewer steps but is not very configurable or reusable.
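For illustration, the body of such a UDJC step might look like this (a sketch; it assumes the incoming stream has a file_content field, e.g. from the Get File Content step above, and that no output fields are added, so the input row metadata passes through unchanged):

    // processRow() body of a User Defined Java Class step.
    public boolean processRow(StepMetaInterface smi, StepDataInterface sdi)
        throws KettleException {
      Object[] r = getRow();
      if (r == null) {           // no more input rows
        setOutputDone();
        return false;
      }
      String content = get(Fields.In, "file_content").getString(r);
      // Pass through only rows whose content mentions the search words.
      if (content != null
          && (content.contains("error") || content.contains("bad"))) {
        putRow(data.outputRowMeta, r);
      }
      return true;
    }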
If you're writing these files out yourself, don't you already know the content? Scan the fields at the point where you already have them in memory.
If you're trying to see if Pentaho has written an error to the file, then you should use error handling on the output step.
Finally, PDI is not a text-searching tool. If you really need to do this, your best bet is probably good old grep.
I would like to ignore the errors that may occur when a batch is committed; in my case, they are violations of unique columns.
The OLE DB Destination Error Output is set to "Ignore failure", but it is still failing. The Data Flow's "stop on failure" properties are set to false, and MaximumErrorCount is set to 0.
I don't want to use row redirection, so that I can keep fast load mode.
Thank you
A few comments:
You can't use "Ignore failure", because ignoring row errors still passes the records to the destination. You have to use redirection to get rid of the bad rows.
If you don't want to keep a copy of the bad rows, you can send them to a Row Count transformation, since that has minimal performance impact. Alternatively, you can output the bad rows to a flat file or to another table so you can review the errors at a later date.
Fast load options are properties of the destination, not the source. You can use fast load even if you redirect error rows from your source to somewhere else. I just ran a performance test on a million-row data set with the fast load ORDER option, and performance was basically identical when I added error redirection and redirected 500K rows to a Row Count transformation. I also verified that performance was slower when I removed the fast load option, so I'm certain error redirection has zero impact on fast load.
I finally redirected the error stream into a test node (to check the ErrorCode and confirm it is a row-insertion error), which redirects those rows into an "OLE DB Query" node where I do nothing (SELECT 1), simply to ignore them.