BigQuery handling dependencies and retries in query scheduling - sql

I'm looking for some best (read: simplest ;)) practices here.
I have Google Analytics data that is sent to BigQuery on a daily basis, and a query that runs daily using the data from the previous day's table.
However, I can't be sure the table and its data are there at the time the query runs, and I'd like to check that they are. If they aren't, I want to retry later.
Ideally I have some monitoring/alerting around this as well.
Of course this can be done within Google Cloud in many ways; I'm looking for best practices on how others do this.
I'm used to working with Airflow, but using Composer just for this seems a bit over the top. Cloud Run would be an option, and I'm sure there are others. I've also seen this question discussing how to handle a dependency in SQL; I'm just not sure whether I could have it retry using just SQL as well.
EDIT:
I've got the check for the table working in SQL. I guess I just have to see if BigQuery has a way to build in a delay, like SQL Server's WAITFOR.
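
In case it helps, a minimal sketch of doing the check-and-retry outside SQL with the google-cloud-bigquery Python client, e.g. as a Cloud Run job. The project, dataset, and query below are placeholder assumptions, not anything from the question:

    # Hedged sketch: poll for yesterday's daily GA export table and retry
    # with a fixed delay before running the main query. All names below
    # (project, dataset, query) are placeholders.
    import time
    from datetime import date, timedelta

    from google.api_core.exceptions import NotFound
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # hypothetical project

    DAILY_QUERY = "SELECT 1  -- replace with the real daily query"

    def table_exists(table_id: str) -> bool:
        """True if the table can be fetched, False on NotFound."""
        try:
            client.get_table(table_id)
            return True
        except NotFound:
            return False

    suffix = (date.today() - timedelta(days=1)).strftime("%Y%m%d")
    table_id = f"my-project.analytics.ga_sessions_{suffix}"  # GA export naming

    for attempt in range(6):                  # up to ~1 hour of retries
        if table_exists(table_id):
            client.query(DAILY_QUERY).result()
            break
        time.sleep(600)                       # WAITFOR-style delay: 10 minutes
    else:
        raise RuntimeError(f"{table_id} never showed up; time to alert")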

Related

How to pre-process BigQuery data coming from Stackdriver

I am currently exporting logs from Stackdriver to BigQuery using sinks, but I am only interested in the jsonPayload. I would like to ignore pretty much everything else.
Since the table creation and data insertion happen automatically, I could not do this.
Is there a way to preprocess data coming from a sink to store only what matters?
If the answer is no, is there a way to run a cron job each day to copy yesterday's data into a separate table and then remove the original? (Knowing that the tables are named using timestamps, which makes it possible to query them by day.)
As far as I know, both options mentioned are currently not possible out of the box in the GCP platform. On my end I've also tried to create an internal reproduction of your request and noticed that there isn't a way to solely filter the jsonPayload.
I would therefore suggest creating a feature request regarding your ask on the public issue tracker. Note that feature requests do not have an ETA as to when they'll be processed, or whether they'll be implemented.
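
That said, the DIY cron-copy idea from the question can be approximated with the BigQuery client. A rough sketch (dataset and table names are placeholder assumptions), which you would schedule daily with cron or Cloud Scheduler:

    # Hedged sketch: keep only jsonPayload from yesterday's sink table in a
    # slimmer table, then drop the original. All names are placeholders.
    from datetime import date, timedelta

    from google.cloud import bigquery

    client = bigquery.Client()
    suffix = (date.today() - timedelta(days=1)).strftime("%Y%m%d")
    source = f"my_dataset.stackdriver_logs_{suffix}"  # hypothetical sink table
    target = f"my_dataset.payload_only_{suffix}"

    # Materialize just the jsonPayload column into the new table.
    client.query(
        f"CREATE TABLE `{target}` AS SELECT jsonPayload FROM `{source}`"
    ).result()

    client.delete_table(source)  # remove the full-width original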

BigQuery keeps on running queries; account deactivated

I've been trying out BigQuery for a week or so. I've linked BigQuery to one of our organization's Firebase projects, using my free trial of Google Cloud Platform for this test. I didn't want to use BQ anymore, so I deactivated the link between Firebase and BQ. I also downgraded our account from Blaze back to Spark. I assumed that, with the deactivation of the Blaze subscription and the link between Firebase and BQ, no queries would be run in BQ. This happened almost a week ago. However, queries kept running; the last one ran yesterday. I am not sure if I will be billed for these queries.
How do I cancel all queries in BQ in the future? I can't seem to find a(n easy) way to do that.
Thanks in advance.
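
For reference, a hedged sketch of listing and cancelling a project's running query jobs with the google-cloud-bigquery client (the project id is a placeholder; recurring queries may instead be scheduled queries, which are paused separately in the BigQuery console):

    # Hedged sketch: list this project's running jobs and cancel them.
    # Assumes the google-cloud-bigquery client library.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # hypothetical project

    for job in client.list_jobs(state_filter="running"):
        print(f"cancelling {job.job_id} ({job.job_type})")
        job.cancel()  # best-effort cancellation request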

How to store weekly data from Google Analytics

I have some simple weekly aggregates from Google Analytics that I'd like to store somewhere. The reason for storing them is that if I run a query against too much data in Google Analytics, it becomes sampled, and I want it to be totally accurate.
What is the best way to solve this?
My thoughts are:
1) Write a process in BigQuery to append the data each week to a permanent dataset
2) Use an API that gets the data each week and stores it in a Google Spreadsheet (appending a line each time)
What is the best recommendation for my problem - and how do I go about executing it?
Checking your previous questions, we see that you already use BigQuery.
When you run a query against the Google Analytics tables in BigQuery, the result is not sampled, as those tables contain all the data. There is no need to store anything, as you can query it every time you need to.
If you do want to store the results (and pay for the additional table), you can write them to a destination table.
If you want quick access, try creating a view.
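
A minimal sketch of the destination-table approach with the google-cloud-bigquery client; the query, table, and column names are placeholder assumptions:

    # Hedged sketch: run the weekly aggregate and append it to a permanent
    # roll-up table via a destination table with WRITE_APPEND.
    from google.cloud import bigquery

    client = bigquery.Client()

    WEEKLY_QUERY = """
    SELECT DATE_TRUNC(date, WEEK) AS week, SUM(sessions) AS sessions
    FROM `my-project.my_dataset.ga_daily`   -- hypothetical source table
    GROUP BY week
    """

    job_config = bigquery.QueryJobConfig(
        destination="my-project.my_dataset.weekly_rollup",
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    client.query(WEEKLY_QUERY, job_config=job_config).result()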
I suggest the following:
1) Make a roll-up table for your weekly data. You can do that either by writing a query for it and running it manually, or with a script in a Google Spreadsheet that runs the same query (using the API) and is scheduled to run every week. I tried a bunch of the tutorials out there, and this one is the simplest to implement.
2) Depending on the data points you want, you can even use the Google Analytics API without having to go through BigQuery for this request; try pulling this report of yours from here (see the sketch after this list). If it works, there are a bunch of Google Sheets extensions that can make it a lot quicker to set up a weekly report. Or you can just code it yourself.
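
A hedged sketch of pulling a weekly metric straight from the Google Analytics Reporting API v4; the view ID, credentials file, and the chosen metric/dimension are placeholder assumptions:

    # Hedged sketch: fetch weekly sessions from the GA Reporting API v4.
    # Requires google-api-python-client and google-auth.
    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    creds = service_account.Credentials.from_service_account_file(
        "service-account.json",  # placeholder credentials file
        scopes=["https://www.googleapis.com/auth/analytics.readonly"],
    )
    analytics = build("analyticsreporting", "v4", credentials=creds)

    response = analytics.reports().batchGet(
        body={
            "reportRequests": [{
                "viewId": "12345678",  # placeholder GA view id
                "dateRanges": [{"startDate": "7daysAgo", "endDate": "yesterday"}],
                "metrics": [{"expression": "ga:sessions"}],
                "dimensions": [{"name": "ga:week"}],
            }]
        }
    ).execute()

    for row in response["reports"][0]["data"].get("rows", []):
        print(row["dimensions"], row["metrics"][0]["values"])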
Would that work for you?
Thanks!

How to proceed with query automation using Import.io

I've successfully created a query with the Extractor tool found in Import.io. It does exactly what I want it to do; however, I now need to run it once or twice a day. Is the purpose of Import.io as an API to allow me to build logic such as data storage and scheduled tasks (running queries multiple times a day) in my own application, or are there ways to schedule queries and make use of long-term storage of my results completely within the Import.io service?
I'm happy to create a Laravel or Rails app to make requests to the API and store the information elsewhere, but if I'm reinventing the wheel by doing so and they provide the means to address this, then that is a true time saver.
Thanks for using the new forum! Yes, we have moved this over to Stack Overflow to maximise the community atmosphere.
At the moment, Import does not have the ability to schedule crawls. However, this is something we are going to roll out in the near future.
For the moment, there is the ability to set up a cron job to run when you specify.
Another solution, if you are using the free version, is to use a CI tool like Travis or Jenkins to schedule your API scripts.
You can query the extractors live, so you don't need to run them manually every time. This will consume one of the requests from your limit.
The endpoint you can use is:
https://extraction.import.io/query/extractor/extractor_id?_apikey=apikey&url=url
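
A hedged sketch of calling that endpoint with Python's requests library; the extractor id, API key, and target URL are placeholders:

    # Hedged sketch: live-query an Import.io extractor via the endpoint above.
    import requests

    EXTRACTOR_ID = "your-extractor-id"  # placeholder
    API_KEY = "your-api-key"            # placeholder
    TARGET_URL = "https://example.com/page-to-extract"

    resp = requests.get(
        f"https://extraction.import.io/query/extractor/{EXTRACTOR_ID}",
        params={"_apikey": API_KEY, "url": TARGET_URL},
    )
    resp.raise_for_status()
    print(resp.json())  # extracted rows come back as JSON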
Unfortunately the script will not be a very simple one, since most websites return very different response structures to import.io, and as you may already know, the premium version of the tool now provides scheduling capabilities.

Solr Filter for Logging/Analysing Queries

I'm using Solr and wish to be able to log queries/tokens to a database for analytics, for both marketing and suggested-search functionality.
I'm guessing that a Lucene filter would be a good way of doing this, maybe putting a filter in the query filter chain for a given field type that logs to a specified database. I'm wondering what the best way to do this is, so as to minimize the impact on Solr response times.
If anybody has done this in the past and is willing to share their solution, that would be fantastic.
Perhaps the easiest way is to just analyze the access log and consume that data into a report. This could be done offline and would have no effect on response times.
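
A hedged sketch of that offline approach, parsing a servlet-container access log for /select requests and counting the q parameter. The log path and regex are assumptions; adjust them to your container's log format:

    # Hedged sketch: tally search terms from Solr /select requests in an
    # access log. Path and log format are placeholder assumptions.
    import re
    from collections import Counter
    from urllib.parse import parse_qs

    QUERY_RE = re.compile(r"GET /solr/\w+/select\?(\S+)")

    counts = Counter()
    with open("/var/log/solr/access.log") as log:  # hypothetical path
        for line in log:
            m = QUERY_RE.search(line)
            if not m:
                continue
            params = parse_qs(m.group(1))  # handles percent-decoding
            for q in params.get("q", []):
                counts[q.strip().lower()] += 1

    # Top 20 queries for the report.
    for term, n in counts.most_common(20):
        print(f"{n:6d}  {term}")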