BigQuery Double Imports - google-bigquery

I am using Google BigQuery from App Engine. I have a cron job that runs every 15 minutes to do an export to BigQuery. Randomly, though, the import runs twice, even though the App Engine logs do not reflect this. I maintain a set of blobs that I write data to for the BigQuery import, and duplicate data is not being written to them. Has anyone else had BigQuery problems with duplicate imports? Again, my App Engine logs show the imports happening only one time, and I'm at a loss for how to troubleshoot.

One way to troubleshoot is to look at your import jobs. You can do this with the bq tool by running bq ls -j to list the jobs you've run, and bq show -j <job_id> to show details about a particular job.
We've not heard of any other cases of duplicate loads. One idea to prevent this is to give your import jobs an id ... by default one gets created for you (it will look like job_). Job ids are enforced to be unique within a project, so if you generate an id for each import you intend to do and a double import is somehow triggered, the second one will fail immediately because the job id will already exist.
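Not part of the original answer, but here is a minimal sketch of that job-id trick using the current google-cloud-bigquery Python client (the original App Engine app may well be calling the REST API directly); the bucket, table, and id scheme below are made-up placeholders:
from google.cloud import bigquery

client = bigquery.Client()

# Deterministic job id, e.g. one per 15-minute cron slot. If the same import
# is triggered twice, the second submission is rejected because the id exists.
job_id = "export_import_20240101_0915"  # hypothetical naming scheme

# If this job_id was already used, the submission below fails with a 409
# "Already Exists" error (google.api_core.exceptions.Conflict) instead of
# loading the data a second time.
load_job = client.load_table_from_uri(
    "gs://my-bucket/export-20240101-0915.json",   # placeholder source blob
    "my_project.my_dataset.my_table",             # placeholder destination table
    job_id=job_id,
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    ),
)
load_job.result()  # wait for the load to finish
One design note: the id should cover exactly one logical import (for example, one cron slot), so that a genuine retry after a transient failure is still possible by choosing a new id.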

I am facing the same problem: the jobs seem to have imported twice even though our logs show they were submitted only once.
I also looked into the jobs, and the above command shows each one was successfully processed only once.
Note that since the job was only submitted once, I'm not sure how controlling the job_id will help in this case. It seems to be something internal to BigQuery that might have caused the jobs to duplicate?
Let me know if you need anything from my end to investigate.
Thanks,

Related

How to pre-process BigQuery data coming from Stackdriver

I am currently exporting logs from Stackdriver to BigQuery using sinks, but I am only interested in the jsonPayload. I would like to ignore pretty much everything else.
But since the table creation and data insertion happen automatically, I could not do this.
Is there a way to preprocess data coming from the sink to store only what matters?
If the answer is no, is there a way to run a cron job each day to copy yesterday's data into a separate table and then remove it? (Knowing that the tables are named using timestamps, which makes it possible to query them by day.)
As far as I know, both options mentioned are currently not possible in the GCP platform. On my end I've also tried to create an internal reproduction of your request and noticed that there isn't a way to solely filter the jsonPayload.
I would therefore suggest creating a feature request regarding your ask on the public issue tracker. Note that feature requests do not have an ETA as to when they'll be processed, or whether they'll be implemented at all.

BigQuery Google Analytics Export Processing Time Management

Our company has many scheduled reports in BigQuery that generate aggregation tables of Google Analytics data. Because we cannot control when Google Analytics data is imported into our BigQuery environment, we keep getting days with no data.
This means we then have to manually run the reports for the missing days.
I have edited my scheduled query to keep pushing back the time of day it runs; however, it is now running around 8 AM. These queries produce reports for stakeholders, and stakeholders are requesting them earlier. Is there any way to get a guaranteed processing time for the Google Analytics export to BigQuery?
You may also think about a scheduled query solution that reruns at a later time if the requested table isn't available yet.
You can't currently add a conditional trigger to a BigQuery scheduled query.
You could manually add a fail-safe to your query to check for yesterday's table, using a combination of the code below and DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY):
SELECT
  MAX(FORMAT_TIMESTAMP('%F %T',
      TIMESTAMP(PARSE_DATE('%Y%m%d',
        REGEXP_EXTRACT(_TABLE_SUFFIX, r'^\d\d\d\d\d\d\d\d')))))
FROM `DATASET.ga_sessions_*` AS ga_sessions
Obviously this will fail if the conditions are not met and will not retry, which I understand is not an improvement on your current setup.
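Not from the answer above, but one way to wire that check together is from your own scheduler (cron, Cloud Scheduler, etc.) rather than inside the scheduled query; here is a sketch with the google-cloud-bigquery Python client, where the dataset name and the aggregation query are placeholders:
from google.cloud import bigquery

client = bigquery.Client()

# Find the newest daily ga_sessions_* table and compare it to yesterday.
check_sql = r"""
SELECT
  MAX(PARSE_DATE('%Y%m%d',
      REGEXP_EXTRACT(_TABLE_SUFFIX, r'^\d\d\d\d\d\d\d\d'))) AS latest_day,
  DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) AS expected_day
FROM `DATASET.ga_sessions_*`
"""
row = list(client.query(check_sql).result())[0]

if row.latest_day is not None and row.latest_day >= row.expected_day:
    # Yesterday's GA export has landed, so it is safe to build the report.
    aggregation_sql = "SELECT 1"  # replace with the existing aggregation query
    client.query(aggregation_sql).result()
else:
    # Export not there yet; exit and let the next scheduled attempt retry.
    print("ga_sessions table for yesterday is not available yet")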
I've encountered this many times in the past and eventually had to move my data pipelines to another solution, as scheduled queries are still quite simplistic.
I would recommend you take a look at CRMint for simple pipelines into BigQuery:
https://github.com/google/crmint
If you still find this too simplistic, then you should look at Google Cloud Composer, where you can check that a table exists before running a particular job in a pipeline.
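For illustration, a rough sketch of what that can look like in Cloud Composer (Airflow), assuming the Google provider package's BigQueryTableExistenceSensor and BigQueryInsertJobOperator are available; the project, dataset, schedule, and query below are placeholders:
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.sensors.bigquery import BigQueryTableExistenceSensor

with DAG(
    dag_id="ga_daily_report",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 5 * * *",  # try early; the sensor waits for the export
    catchup=False,
) as dag:
    # For a daily run, {{ ds_nodash }} is the date of the data interval,
    # i.e. the ga_sessions_ table for "yesterday" relative to the run time.
    wait_for_export = BigQueryTableExistenceSensor(
        task_id="wait_for_ga_export",
        project_id="my-project",                # placeholder
        dataset_id="DATASET",                   # placeholder
        table_id="ga_sessions_{{ ds_nodash }}",
        poke_interval=600,                      # re-check every 10 minutes
        timeout=6 * 60 * 60,                    # give up after 6 hours
    )

    build_report = BigQueryInsertJobOperator(
        task_id="build_daily_report",
        configuration={
            "query": {
                "query": "SELECT 1",  # replace with the aggregation query
                "useLegacySql": False,
            }
        },
    )

    wait_for_export >> build_report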

Pentaho Logging specify Job or Trans for each line

I am running Pentaho Kettle 6.1 through a Java application. All of the Pentaho logs are directed through the Java app and written to the same log file at the Java level.
When a job starts or finishes the logs indicate which job is starting or finishing, but when the job is in the middle of running the log output only indicates the specific step it is on without any indication of which job or trans is executing.
This causes confusion and is difficult to follow when there is more than one job running simultaneously. Does anyone know of a way to prepend the name of the job or trans to each log entry?
Not that I know of, and I doubt there is, for the simple reason that the same transformation/job may be split to run on more than one machine, by more than one user, and/or launched in parallel in different job hierarchies of callers.
The general answer is to log to a database (right-click anywhere, Parameters, Logging, define the logging table and what you want to log). All the logging will be copied to a database table together with a channel_id. This is a unique number attributed to each "run" that links together all the logging information coming from all the dependent jobs/transformations. You can then view this info with a SELECT...WHERE channel_id=...
However, your case seems to be simpler. Use the database logging with a log interval of, say, 2 seconds and SELECT TRANSNAME/JOBNAME, LOG_FIELD FROM LOG_TABLE continuously on your terminal.
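Not from the original answer, but a tiny sketch of that "watch the log table" idea, written as a standalone Python poller; the driver, credentials, and log table name are placeholders (TRANSNAME and LOG_FIELD are the usual Kettle transformation-log columns, JOBNAME the job-log equivalent):
import time
import pymysql  # or whichever DB-API driver matches your logging database

conn = pymysql.connect(host="loghost", user="etl", password="secret",
                       database="etl_logs", autocommit=True)  # placeholders

while True:
    with conn.cursor() as cur:
        # Mirrors the suggested SELECT: one block of accumulated log text per run.
        cur.execute("SELECT CHANNEL_ID, TRANSNAME, LOG_FIELD FROM TRANS_LOG")
        for channel_id, transname, log_field in cur.fetchall():
            print(f"=== {transname} (channel {channel_id}) ===")
            print(log_field or "")
    time.sleep(2)  # matches the 2-second log interval suggested above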
You can also follow a specific job/transformation by logging in a specific table, but this means you know in advance which is the job/transformation to debug.

Abort a table import stuck in 'pending'

Similar questions have been asked but not exactly what I am looking for.
The problem: on some occasions importing a table from Google Cloud Storage to BigQuery gets stuck in a 'pending' state for hours, if not days. Tables that get stuck in this state never seem to come out of it, or at least we didn't bother waiting that long. I know it's not a queue issue, since in the meantime we can import other tables just fine. No errors are returned by BigQuery.
My question: in this situation, and in general, how can we safely abort/cancel an import to BigQuery without the table quietly importing on us later? This would apply to any table regardless of its state, as long as it hasn't finished importing.
Thanks.
You may be hitting load job rate limits. For example, if you try to start more than two load jobs per minute for the same table, the load jobs against that table will be deferred, while load jobs against other tables may continue at normal speed.
There are per-project limits on the rate at which load jobs can be started and on the number of load jobs that can be running per project at any one time. If you send jobs faster than this, we'll queue them, but as you've noticed, our queueing is not a fair queue and can start newer jobs before older ones.
Aborting pending jobs is a commonly requested feature. If you file a feature request here, that will help us prioritize it.

Start SQL Server Jobs when field = specific value

I don't know if this is even possible, so I would appreciate any ideas, even those outside of SQL Server 2005, on how this might be accomplished. I have a linked server set up to a remote mainframe, and I have a simple import job that runs overnight. The problem is that the table on the mainframe that the import needs to come from is just a temporary report file that gets overwritten each time a user runs that report, sometimes with different parameters, so the data is always changing. One request was that the SQL job should run only when a specific user runs the report. The user is stored as a field in the same mainframe report table that the import is coming from. Setting up a scheduled run on the mainframe is not an option, since we don't control it, and having the owners set it up would be costly (don't ask me why).
Any ideas that would keep me from forcing the user to run the mainframe report at a specific time would be helpful.
Well, the only thing that you could do from this side is to poll periodically and detect a change. You may try to set up a job that queries only the report version, timestamp, and author. The job runs every 5 minutes and triggers the import job when it detects changes. Not elegant, but it may be good enough.
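Not from the answer above, but a rough sketch of that polling pattern, written as a standalone Python script with pyodbc rather than a SQL Server Agent schedule; the linked-server name, report columns, target user, and Agent job name are all made-up placeholders:
import time
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sqlhost;DATABASE=msdb;Trusted_Connection=yes"
)
conn.autocommit = True

last_seen = None
while True:
    # Report version / timestamp / author live in the same mainframe report table,
    # reached through the existing linked server (the four-part name is a placeholder).
    row = conn.execute(
        "SELECT TOP 1 report_version, report_ts, run_by "
        "FROM [MAINFRAME].[catalog].[schema].[REPORT_TABLE]"
    ).fetchone()
    current = (row.report_version, row.report_ts, row.run_by) if row else None

    # Trigger the import only when the report changed and the specific user ran it.
    if current and current != last_seen and row.run_by == "TARGET_USER":
        conn.execute("EXEC msdb.dbo.sp_start_job @job_name = N'NightlyMainframeImport'")
        last_seen = current

    time.sleep(300)  # poll every 5 minutes, as suggested above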