Making sure data is loaded - google-bigquery

I use the following command to load data.
/home/bigquery/bq load --max_bad_record=30000 -F '^' company.junelog entry.gz country:STRING,telco_name:STRING,datetime:STRING, ...
Sometimes the command returns a non-zero exit code even though the data was loaded. How can I tell whether the load actually succeeded? Checking the return code does not seem to help: there have been times when I loaded the same file again because I got an error, but the data was already available in BigQuery.

You can run bq show -j on the load job and check the job status.
If the load is done from code and you would not otherwise know the job id, you can pass your own job id to the load operation (as long as it is unique), so you know which job to check.
For instance you can run
/home/bigquery/bq load --job_id=some_unique_job_id --max_bad_record=30000 -F '^' company.junelog entry.gz country:STRING,telco_name:STRING,datetime:STRING, ...
then
/home/bigquery/bq show -j some_unique_job_id
Note that if you are creating a new table for every load (as opposed to appending), you could use the write disposition WRITE_EMPTY so that the load only happens if the table is empty, which prevents adding the same data twice. This isn't directly supported in bq.py, but you could use the underlying bigquery_client.py to make this call, or use the REST API directly.
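As a rough illustration of the check described above, here is a minimal Python sketch. The helper names (make_job_id, job_outcome, fetch_job_json) are made up for this example. It assumes that bq --format=json show -j prints the job resource with a status.state field and, for failed jobs, a status.errorResult field, so verify that against your bq version before relying on it.

```python
import json
import subprocess
import uuid


def make_job_id(prefix: str = "load") -> str:
    """Return a job id that is unique per load attempt (hypothetical naming scheme)."""
    return f"{prefix}_{uuid.uuid4().hex}"


def job_outcome(job_json: str) -> str:
    """Classify a job resource as 'running', 'failed', or 'succeeded'.

    Assumption: the JSON carries status.state and, only when the job
    failed, status.errorResult.
    """
    status = json.loads(job_json).get("status", {})
    if status.get("state") != "DONE":
        return "running"
    return "failed" if "errorResult" in status else "succeeded"


def fetch_job_json(job_id: str, bq_path: str = "/home/bigquery/bq") -> str:
    """Ask bq for the job as JSON; raises CalledProcessError if bq exits non-zero."""
    result = subprocess.run(
        [bq_path, "--format=json", "show", "-j", job_id],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A wrapper script would generate the id with make_job_id(), pass it to bq load via --job_id, and reload the file only when job_outcome(fetch_job_json(job_id)) reports 'failed', regardless of what the exit code of bq load was.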

runtimeservice.getVariables does not work because it can't find process instance id

I'm new to Flowable and I'm trying to start a process instance with variables. Here params is the Map<String, Object> that I use to start the process. It all goes well, but when I try to get my variables back it tells me
"execution 22f42f67-5f88-11e9-9df0-d46d6dbfea92 doesn't exist"
But if I search for it in my process instances list, it is there. This is what I do:
pi = runtimeService.startProcessInstanceById(processDefinitionId, params);
runtimeService.getVariables(pi.getId());
I'm stuck with this problem and I do not understand why it keeps doing this. What am I missing?
Flowable has the concept of RuntimeService and HistoryService. The first one contains only the runtime data (what is currently active) and the second one has all the data. The runtime data is a subset of the history data.
You can't find the variables via the RuntimeService because the process instance has already completed.
If you use the HistoryService instead (for example, historyService.createHistoricVariableInstanceQuery().processInstanceId(pi.getId()).list()), it works as expected.

Set variables in Javascript job entry at root level

I need to set variables in root scope in one job to be used in a different job. The first job has a Javascript job entry, with the statements:
parent_job.setVariable("customers_full_path", "C:\\customers22.csv", "r");
true;
But the compilation fails with:
Couldn't compile javascript:
org.mozilla.javascript.EvaluatorException: Can't find method
org.pentaho.di.job.Job.setVariable(string,string,string). (#2)
How to set a variable at root level in a Javascript job entry?
Sorry for being passive-aggressive, but:
I don't know if you are new to Pentaho, but the most common mistake among new users who already know how to program is being somewhat 'addicted' to familiar methods: you are using JavaScript for functionality that is built into the tool. Both transformations (KTR) and jobs (KJB) have a similar step, and you can manipulate this better in a KTR.
JavaScript steps slow down the flow considerably, so try to stay away from those as much as possible.
EDIT:
Reading this article, it seems the only thing you're doing wrong is the syntax of the command.
Correct usage :
parent_job.setVariable([name_of_variable], "Desired Value");
The command you described has 3 parameters, when it should have 2. If you have more than one variable to set, call the command once per variable. Try it out and see if it works.

Wait.on(signals) use in Apache Beam

Is it possible, in a batch pipeline, to write to a second BigQuery table only after the write to the first one has finished, using the Wait.on() method (a new feature in Apache Beam 2.4)? The example given in the Apache Beam documentation is:
PCollection<Void> firstWriteResults = data.apply(ParDo.of(...write to first database...));
data.apply(Wait.on(firstWriteResults))
// Windows of this intermediate PCollection will be processed no earlier than when
// the respective window of firstWriteResults closes.
.apply(ParDo.of(...write to second database...));
But why would I write to database from within ParDo? Can we not do the same by using the I/O transforms given in Dataflow?
Thanks.
Yes, this is possible, although there are some known limitations and work is under way to support it further.
In order to make this work you can do something like the following:
WriteResult writeResult = data.apply(BigQueryIO.write()
...
.withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
);
data.apply(Wait.on(writeResult.getFailedInserts()))
.apply(...some transform which writes to second database...);
Note that this only works with streaming inserts and won't work with file loads. Work is currently being done to better support this use case, which you can follow here
Helpful references:
http://moi.vonos.net/cloud/beam-send-pubsub/
http://osdir.com/apache-beam-users/msg02120.html

Loop over file names in sub job (Kettle job)

The task is to get file names from the folder and then loop the same task (job) over all the files one by one.
I created a simple job with a transformation (get file names) followed by a job with the "Execute for each row" flag set (for now it just logs the name of the file).
Did it the same way it is described here: http://ramathoughts.blogspot.ch/2010/08/processing-group-of-files-with-kettle.html
However, the path of each received file is not passed to the sub-job (the log doesn't display the variable's value), even though the sub-job is executed once for every file in the input folder. So it looks like something is passed, but for some reason it is not available as a variable.
Image with log details, as seen the variable is displayed as ${path} instead of value of the path:
http://i.imgur.com/pK1iHtl.png?1
The sample code is below as archive with jobs and transformation and also sample input files. Any help is appreciated, as I may be missing something simple here https://www.hightail.com/download/bXBhL0dNcklCMTVsQXNUQw
The issue is that the second job (i.e. j_log_file_names.kjb) is unable to detect the parameter path. Try defining the parameter on this job, as in the image below:
This will make sure that the parameter coming from the previous step is correctly fetched into the job. The rest of your job looks absolutely fine.
Hope this helps :)

SSIS 2005 flat file source - partial row which isn't actually a partial row

I'm currently working on an SSIS package to load mainframe logs from multiple server/file sources into a database.
As it stands at the moment I'm using a foreach loop container to loop through a recordset containing filenames and load the files using a Data Flow task from a Flat File Source and File connection to an OLE DB Destination through a Derived column.
I've built in error handling on the Data Flow task to allow for the fact that there won't always be a log file in the location specified (ie. because the server was down for maintenance during a specific period as the files are generated on an hourly basis), but the problems start after it finishes handling these errors.
If a file that does exist immediately follows an attempt to load one that wasn't found, the package begins to load it but then throws the following warning and doesn't load all of the records in that file: [Message Log File Source (NORDXSL) [57]] Warning: There is a partial row at the end of the file.
However, when I remove the files I know won't exist from the recordset (so that it only attempts to load files that do exist, including the one with the alleged "partial row"), everything works fine and all files/rows are loaded without a problem. It just doesn't seem to load the first file correctly after it has failed on a missing file, and I can't for the life of me work out why.
I've tried calling Dispose() and ReleaseConnection() on the file connection after the Data Flow task has finished processing but this makes no difference and I'm now completely out of ideas.
Any help would be really appreciated as this is the last bug in this project and I want to get it out the door. PLEASE!!
Thanks,
James
I've now found a workaround for this problem...
I've added a Script Task before the Data Flow Task that checks whether the file I want to read exists:
If (System.IO.File.Exists(Dts.Variables("MQLogMessagePath").Value.ToString)) Then
Dts.TaskResult = Dts.Results.Success
Else
Dts.TaskResult = Dts.Results.Failure
End If
If the file doesn't exist, the task fails that iteration of the Foreach Loop container and the loop continues on to the next file.
BINGO!