How to store weekly data from Google Analytics - API

I have some simple weekly aggregates from Google Analytics that I'd like to store somewhere. The reason for storing them is that if I run a query against too much data in Google Analytics, it becomes sampled, and I want it to be totally accurate.
What is the best way to solve this?
My thoughts are:
1) Write a process in BigQuery to append the data each week to a permanent dataset
2) Use an API that gets the data each week and stores it in a Google Spreadsheet (appending a line each time)
What is the best recommendation for my problem - and how do I go about executing it?

Checking your previous questions, we see that you already use BigQuery.
When you run a query against the Google Analytics tables in BigQuery, it is not sampled, as those tables contain all the data. There is no need to store anything separately, as you can query them every time you need to.
In case you do want to store the results, and pay for the additional table, you can go ahead and write them to a destination table.
If you want quick access, try creating a view.
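If you go the destination-table route, a minimal sketch with the google-cloud-bigquery Python client might look like this (the project, dataset, table names, and the roll-up query are placeholders, not your actual setup):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder destination - replace with your own project.dataset.table.
job_config = bigquery.QueryJobConfig(
    destination="my_project.reporting.ga_weekly_rollup",
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # append each week
)

# Placeholder roll-up query against the GA daily export tables.
query = """
    SELECT
      DATE_TRUNC(PARSE_DATE('%Y%m%d', date), WEEK) AS week,
      SUM(totals.visits) AS sessions
    FROM `my_project.my_ga_dataset.ga_sessions_*`
    GROUP BY week
"""

# Runs the query and appends the result rows to the destination table.
client.query(query, job_config=job_config).result()

Scheduling that script weekly (cron, Cloud Scheduler, etc.) would give you the append behaviour described in option 1.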

I suggest the following:
1) Make a roll-up table for your weekly data. You can do that either by writing a query for it and running it manually, or with a script in a Google Spreadsheet that runs the same query (using the API) and is scheduled to run every week. I tried a bunch of the tutorials out there and this one is the simplest to implement.
2) Depending on the data points you want, you can even use the Google Analytics API without having to go through BigQuery for this request; try pulling this report of yours from here. If it works, there are a bunch of Google Sheets extensions that can make it a lot quicker to set up a weekly report, or you can just code it yourself (a rough sketch follows below).
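If you do code it yourself, a sketch with the Google Analytics Reporting API v4 Python client could look like the following (the key file, view ID, metrics, and dimensions are placeholders; appending the rows to a sheet or table is left out):

from google.oauth2 import service_account
from googleapiclient.discovery import build

# Placeholder service-account key and scope.
credentials = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/analytics.readonly"],
)
analytics = build("analyticsreporting", "v4", credentials=credentials)

# Pull last week's sessions and users, one row per day.
response = analytics.reports().batchGet(body={
    "reportRequests": [{
        "viewId": "XXXXXXXX",  # placeholder GA view ID
        "dateRanges": [{"startDate": "7daysAgo", "endDate": "yesterday"}],
        "metrics": [{"expression": "ga:sessions"}, {"expression": "ga:users"}],
        "dimensions": [{"name": "ga:date"}],
    }]
}).execute()

for row in response["reports"][0]["data"].get("rows", []):
    print(row["dimensions"], row["metrics"][0]["values"])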
Would that work for you?
Thanks!


How to pre-process BigQuery data coming from Stackdriver

I am currently exporting logs from Stackdriver to BigQuery using sinks, but I am only interested in the jsonPayload. I would like to ignore pretty much everything else.
But since the table creation and data insertion happen automatically, I could not do this.
Is there a way to preprocess data coming from the sink so that only what matters is stored?
If the answer is no, is there a way to run a cron job each day that copies yesterday's data into a separate table and then removes it? (Knowing that the tables are named using timestamps, which makes it possible to query them by day.)
As far as I know, both options mentioned are currently not possible on the GCP platform. On my end, I've also tried to create an internal reproduction of your request and noticed that there isn't a way to filter for only the jsonPayload.
I would therefore suggest creating a feature request for this on the following public issue tracker link. Note that feature requests do not have an ETA as to when they'll be processed, or whether they'll be implemented at all.

BigQuery Google Analytics Export Processing Time Management

Our company has many scheduled reports in BigQuery that generate aggregation tables of Google Analytics data. Because we cannot control when Google Analytics data is imported into our BigQuery environment, we keep getting days with no data.
This means we then have to manually run the queries for the missing days.
I have edited my scheduled query to keep pushing back the time of day it runs; however, it is now running around 8 AM. These queries feed reports for stakeholders, and the stakeholders are requesting them earlier. Is there any way to guarantee the processing times of the Google Analytics export to BigQuery?
You may also think about a Scheduled Query solution that reruns at a later time if the requested table isn't available yet.
You can't currently add a conditional trigger to a BigQuery scheduled query.
You could manually add a fail-safe to your query that checks for yesterday's table, using a combination of the code below and DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY):
-- Timestamp of the most recent daily ga_sessions_* export table,
-- derived from the table suffix.
SELECT
  MAX(FORMAT_TIMESTAMP('%F %T',
      TIMESTAMP(PARSE_DATE('%Y%m%d',
        REGEXP_EXTRACT(_TABLE_SUFFIX, r'^\d\d\d\d\d\d\d\d')))))
FROM
  `DATASET.ga_sessions_*` AS ga_sessions
Obviously this will fail if the conditions are not met and will not retry, which I understand is not an improvement on your current setup.
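If you do want retry behaviour, one option is to move the trigger outside of scheduled queries, e.g. a small script run on a cron or Cloud Scheduler schedule that only fires the aggregation once yesterday's export table exists. A rough sketch, assuming the google-cloud-bigquery Python client and placeholder project/dataset names:

from datetime import date, timedelta

from google.api_core.exceptions import NotFound
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder dataset - replace with your own GA export dataset.
suffix = (date.today() - timedelta(days=1)).strftime("%Y%m%d")
table_id = f"my_project.my_ga_dataset.ga_sessions_{suffix}"

try:
    client.get_table(table_id)  # raises NotFound if the export hasn't landed yet
except NotFound:
    # Table not there yet: exit and let the next scheduled run try again.
    raise SystemExit(f"{table_id} not available yet, will retry later")

# Export has landed: run the aggregation (placeholder query).
client.query(f"SELECT COUNT(*) AS sessions FROM `{table_id}`").result()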
I've encountered this many times in the past and eventually had to move my data pipelines to another solution, as scheduled queries are still quite simplistic.
I would recommend you take a look at CRMint for simple pipelines into BigQuery:
https://github.com/google/crmint
If you still find this too simplistic, then you should look at Google Cloud Composer, where you can check that a table exists before running a particular job in a pipeline.

Google Cloud SQL or BigQuery for Big Data Actively Updated Every Second

I'm currently using Google Cloud SQL for my needs.
I'm collecting data from user activities. Every day the number of rows in my table increases by around 9-15 million, and the data is updated every second. The data includes several main parameters like user location (latitude/longitude), timestamp, user activities, conversations, and more.
I need to constantly pull a lot of insights from this user activity, like "how many users between latitude/longitude A and latitude/longitude B used my app per hour over the last 30 days?".
Because my table gets bigger every day, it's hard to manage the performance of SELECT queries against it. (I have already added indexes, especially for the most commonly used parameters.)
All my inserts, selects, updates, and so on are executed from an API that I wrote in PHP.
So my question is: would I get a much better result if I used Google BigQuery for my needs?
If yes, how can I do this? Isn't Google BigQuery (forgive me if I'm wrong) designed to be used for static data rather than constantly updated data? How can I connect my Cloud SQL data to BigQuery in real time?
Which one is better: optimizing my table in Cloud SQL to speed up the SELECTs, or using BigQuery (if possible)?
I'm also open to other alternatives or suggestions to optimize my Cloud SQL performance :)
Thank you
Sounds like BigQuery would be far better suited to your use case. I can think of a good solution:
Migrate existing data from Cloud SQL to BigQuery.
Stream events directly to BigQuery (using an async queue).
Use a time-partitioned table in BigQuery.
If you use BigQuery, you don't need to worry about performance or scaling. That's all handled for you by Google.
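As a rough sketch of the streaming-insert side of point 2 (the queue/worker plumbing is omitted), using the google-cloud-bigquery Python client with a placeholder table that is assumed to be time-partitioned on a timestamp column:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table - assumed to exist and be partitioned on `timestamp`.
table_id = "my_project.analytics.user_activity"

rows = [{
    "user_id": "u123",
    "lat": 52.52,
    "lng": 13.40,
    "activity": "open_app",
    "timestamp": "2023-01-01T12:00:00Z",
}]

# Streaming insert: rows become queryable within seconds.
errors = client.insert_rows_json(table_id, rows)
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")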

Google BigQuery: Stop running query

I ran a query on Google BigQuery several hours ago, and the query is still running. I clicked "abandon", but it appears there is no way to stop a query. What can I do? Can I contact Google somehow, so they stop the query?
I've been working on a project for a company which analyzes Google Analytics data with BigQuery, so I don't want to run up a big bill for them or anything.
(Maybe StackOverflow is not the right place to ask this question, but I've tried to find another place, and I couldn't. On the BigQuery support page, it is said that questions should be asked here, with the google-bigquery tag, so I'm doing that).
I've written a query (which I don't want to paste or describe here, as someone might abuse it to block BigQuery or something, I don't know). Let's just say it includes inner joins. After I wrote it, and before running it, the console message was something like "This will analyze 674KB of data", which looked OK, given that the table only has 10,000 rows. I got the same message after clicking "abandon": something like "You can abandon this, but you will still be billed for 674KB of data".
I try very hard to make sure what I do doesn't cause problems for anyone, so I actually ran the same query on a local PostgreSQL database (with the exact same data - 10,000 rows) as in BigQuery, and the query there finishes in a second or two.
How can I cancel this query, and can I (the company I've worked for) be billed for something more than 674KB of data?
For the time being, there is no way to stop a BigQuery job once it has started, either via the web interface or via API calls.
According to this, this feature may be added in the future.
As BigQuery shards the query across multiple machines, even a large query (terabyte level) will not have a large impact on any individual machine, let alone a query of 674KB. However, according to this, that is the amount you will be charged.
Here are some tips to save money in BigQuery.
The first thing to know is that, unlike a traditional RDBMS, BigQuery is column-based, and you are charged by the amount of data in the columns you reference rather than by the number of rows.
That means: don't include columns that you do not need in the query. This may sound trivial, but people coming from an RDBMS sometimes write queries like this:
SELECT
  *
FROM
  [Dataset.Table]
The query is absolutely correct, but with SELECT * Google actually bills you for the whole table, instead of only the columns you need (for example, just user_id). Therefore it's a good idea to explicitly specify the column names.
Break the tables into smaller chunks. Instead of having a single table that contains all the data, it's a good idea to split the table by date and use table wildcard functions to stitch the tables back together at query time. That way, you won't be billed for rows you don't need.
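An easy way to see how many bytes a query would bill before actually paying for it is a dry run. A minimal sketch with the google-cloud-bigquery Python client, using standard SQL and a placeholder table name, comparing the two styles above:

from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

for sql in (
    "SELECT * FROM `my_project.my_dataset.events`",        # scans every column
    "SELECT user_id FROM `my_project.my_dataset.events`",  # scans one column
):
    job = client.query(sql, job_config=config)
    # A dry run returns immediately with the bytes the query would process.
    print(sql, "->", job.total_bytes_processed, "bytes")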
BigQuery supports canceling query jobs.
You can do this via the bq command line utility:
bq cancel <job_id>
or from the API via the jobs.cancel method (documented here)
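For example, with the google-cloud-bigquery Python client (the job ID and location below are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Ask BigQuery to cancel the running job; cancellation is best-effort
# and may take a moment to be reflected in the job state.
job = client.cancel_job("my_job_id", location="US")
print(job.job_id, job.state)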

How do I automate getting BigQuery billing data?

We are using BigQuery rather heavily now and I've been tasked with keeping track of how much we are spending on queries each day. There seems to be no easy way to do this within BigQuery. Has anyone else done this already?
I started trying to scrape it myself, but it's a real mess. Retrieving the data involves a POST to https://bpui0.google.com/billing/ui/batchservice which sends the entire contents of my about:plugins to Google for every new request.
There are two components to BigQuery pricing: data storage, and the data processed by each query.
https://developers.google.com/bigquery/pricing#table
To keep track of daily spend, you'd want to track how much data is being processed. An easy way to do this is to look at the 'bytes_processed' field that comes with each API query response.
You could even pipe this data back to BigQuery, to further dice and analyze usage :).
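A minimal sketch of that idea with the google-cloud-bigquery Python client (the usage table, query, and price constant are placeholders; check the current on-demand rate for your region):

from datetime import datetime, timezone

from google.cloud import bigquery

client = bigquery.Client()

PRICE_PER_TIB = 5.0  # placeholder - set to your region's on-demand price per TiB

job = client.query("SELECT user_id FROM `my_project.my_dataset.events`")
job.result()  # wait for completion so job statistics are populated

estimated_cost = job.total_bytes_billed / float(2 ** 40) * PRICE_PER_TIB

# Pipe the usage back into BigQuery for later analysis (placeholder table).
client.insert_rows_json("my_project.ops.query_costs", [{
    "job_id": job.job_id,
    "run_at": datetime.now(timezone.utc).isoformat(),
    "bytes_billed": job.total_bytes_billed,
    "estimated_cost_usd": estimated_cost,
}])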