BigQuery queries validation

I am currently developing some BigQuery queries for data from Google Analytics (GA), and need a method to assess the validity of the results they return, i.e. to confirm that the results are correct.
What I am doing so far for validation is to use a combination of the following:
Running the equivalent queries in GA Query Explorer and comparing the results;
For the queries that I cannot check directly in GA Query Explorer, I have a Python script which calculates the expected results.
What are the good practices for this kind of validation? Do you have any further suggestions on how to improve this?
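A minimal sketch of how such an automated comparison might look, assuming the google-cloud-bigquery client library; the project, dataset, date range and expected figures below are placeholders, and the expected values would come from GA Query Explorer or the Python calculation:

    # Hedged sketch: run the BigQuery query and assert it matches independently
    # obtained figures. All names and numbers below are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()

    QUERY = """
        SELECT date, SUM(totals.visits) AS sessions
        FROM `my_project.my_ga_dataset.ga_sessions_*`
        WHERE _TABLE_SUFFIX BETWEEN '20230101' AND '20230107'
        GROUP BY date
        ORDER BY date
    """
    actual = {row.date: row.sessions for row in client.query(QUERY).result()}

    # Figures obtained independently (GA Query Explorer or the validation script).
    expected = {"20230101": 1234, "20230102": 1187}

    for day, value in expected.items():
        assert actual.get(day) == value, f"Mismatch on {day}: {actual.get(day)} != {value}"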

Related

What is the best way to persist SQL queries

I am developing an application which works somewhat like a notebook. Users log in to the web application, connect to a data source (database/CSV) and then write a series of SQL queries. These are typically queries used to compute metrics.
The workflow is subsequently to run these queries on a periodic basis to compute the metrics and persist them as time-series data.
Since the users can write SQL queries here, what would be the suggested approach to persisting these queries in a backend store?
You can store SQL queries as text.
But you must not allow users to define SQL queries and then run them verbatim without first having a human vet each query to make sure it is not malicious (or simply erroneous or unwise).
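A minimal sketch of such a store, assuming a relational backend and invented table/column names (SQLite here purely for illustration): keep the raw SQL as text next to a review status, and only let the periodic metric runs pick up queries a human has approved.

    # Illustrative sketch only: persist user-submitted SQL verbatim as text,
    # plus a review status; the schema and names are assumptions, not a standard.
    import sqlite3

    conn = sqlite3.connect("queries.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS saved_queries (
            id INTEGER PRIMARY KEY,
            owner TEXT NOT NULL,
            sql_text TEXT NOT NULL,                         -- the query, stored verbatim
            status TEXT NOT NULL DEFAULT 'pending_review',  -- 'pending_review' | 'approved' | 'rejected'
            created_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
    """)

    def submit_query(owner, sql_text):
        conn.execute("INSERT INTO saved_queries (owner, sql_text) VALUES (?, ?)", (owner, sql_text))
        conn.commit()

    def runnable_queries():
        # Only queries a reviewer has approved are eligible for the periodic runs.
        return conn.execute("SELECT id, sql_text FROM saved_queries WHERE status = 'approved'").fetchall()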

How can I customize a data set for Google BigQuery? Can I export a file? How do I test it to see if it meets my needs?

I would like to improve the quality of existing data by using the Google BigQuery API to help validate the accuracy of existing data.
I don't see information on the types of data elements contained in BigQuery, and I don't understand how to use an API if I just want to see what types of data are in there.
I tried looking for instructions and data elements in the Google Health Care API and Google BigQuery documentation and only saw how to set up a payment option.
I am a newbie at programming and wanted to do some preliminary research on these data sets prior to bringing them to our technical team.
I expect to see a list of relevant results based on a custom query.
You can see the data types supported by Google BigQuery here, and the conversions between different types here.
Also, you can try out the BigQuery APIs in the OAuth Playground.
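If the goal is just to see which fields and types a table exposes before handing it to your technical team, the client libraries can also list a table's schema directly. A minimal Python sketch (it uses a public sample table; swap in whichever table you have access to):

    # Sketch: print the column names and data types of a BigQuery table.
    # Requires the google-cloud-bigquery package and valid credentials.
    from google.cloud import bigquery

    client = bigquery.Client()
    table = client.get_table("bigquery-public-data.samples.natality")  # any table you can read

    for field in table.schema:
        print(f"{field.name}: {field.field_type}")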

How to avoid errors when querying BigQuery with Data Studio?

I have a Google Data Studio report that fetches data from a Google BigQuery view. I am running into "Quota Error: Too many concurrent queries." I was thinking of getting around this by using batched queries.
Any other solutions to get around the error are welcome.
Thank you.
Batch queries won't help in this case - Data Studio would not know how to retrieve the batched results.
My preferred option for these cases is to copy the query results out of BigQuery into temporary storage (sheets, GCS, MySQL...) and have Data Studio read the results from there. The best place depends on the shape of your data and the results you are trying to visualize.
Other options - depending on your exact use case:
Turn on caching in Data Studio, which will prefetch data and run queries against the cache.
Materialize the view, so that the queries will run faster (a sketch of this follows below).
Reduce the number of components on the page so that they don't generate as many queries.
(This answer might change depending on future Data Studio updates)
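For the "materialize the view" option, one hedged sketch is to rewrite the view's results into an ordinary table on a schedule (cron, Cloud Scheduler, etc.) and point Data Studio at that table instead of the view. The project, dataset and table names here are placeholders:

    # Sketch: periodically materialize a view into a plain table so Data Studio
    # queries precomputed results. Names are placeholders; requires the
    # google-cloud-bigquery package and credentials.
    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.QueryJobConfig(
        destination="my_project.reporting.dashboard_table",          # table Data Studio reads
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # replace on each run
    )
    client.query("SELECT * FROM `my_project.reporting.dashboard_view`", job_config=job_config).result()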

Solution to host 200GB of data and provide JSON API with aggregates?

I am looking for a solution that will host a nearly-static 200GB, structured, clean dataset, and provide a JSON API onto the data, for querying in a web app.
Each row of my data looks like this, and I have about 700 million rows:
parent_org,org,spend,count,product_code,product_name,date
A31,A81001,1003223.2,14,QX0081,Rosiflora,2014-01-01
The data is almost completely static - it updates once a month. I would like to support straightforward aggregate queries like:
get total spending on product codes starting QX, by organisation, by month
get total spending by parent org A31, by month
And I would like these queries to be available over a RESTful JSON API, so that I can use the data in a web application.
I don't need to do joins, I only have one table.
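For reference, the first of these aggregates is a single GROUP BY over the one table. A sketch of how it might look on the current Postgres setup, using the column names from the sample row (connection details and the table name are placeholders):

    # Sketch: "total spending on product codes starting QX, by organisation, by month".
    # Requires the psycopg2 package; the table is assumed to be called "spending".
    import psycopg2

    conn = psycopg2.connect(dbname="spend", user="app")
    cur = conn.cursor()
    cur.execute("""
        SELECT org,
               date_trunc('month', date) AS month,
               SUM(spend) AS total_spend
        FROM spending
        WHERE product_code LIKE 'QX%'
        GROUP BY org, month
        ORDER BY org, month
    """)
    for org, month, total_spend in cur.fetchall():
        print(org, month, total_spend)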
Solutions I have investigated:
To date I have been using Postgres (with a web app to provide the API), but I am starting to reach the limits of what I can do with indexing and materialized views without dedicated hardware and more skills than I have
Google Cloud Datastore: is suitable for structured data of about this size, and has a baked-in JSON API, but doesn't do aggregates (so I couldn't support my "total spending" queries above)
Google BigTable: can definitely handle data of this size, can do aggregates, and I could build my own API using App Engine? Might need to convert the data to HBase format to import it.
Google BigQuery: fast at aggregating, would need to roll my own API as with BigTable, easy to import data
I'm wondering if there's a generic solution for my needs above. If not, I'd also be grateful for any advice on the best setup for hosting this data and providing a JSON API.
Update: It seems that BigQuery and Cloud SQL support SQL-like queries, but Cloud SQL may not be big enough (see comments) and BigQuery gets expensive very quickly because you're paying by the query, so it isn't ideal for a public web app. Datastore is good value, but doesn't do aggregates, so I'd have to pre-aggregate and have multiple tables.
Cloud SQL is likely sufficient for your needs. It certainly is capable of handling 200GB, especially if you use Cloud SQL Second Generation.
The only reason why a conventional database like MySQL (the database Cloud SQL uses) might not be sufficient is if your queries are very complex and not indexed. I recommend you try Cloud SQL, and if the performance isn't sufficient, make sure you have appropriate indexes (hint: use the EXPLAIN statement to see how your queries are being executed).
If your queries cannot be indexed in a useful way, or your queries are so CPU intensive that they are slow regardless of indexing, you might want to graduate up to BigQuery. BigQuery is parallelised so that it can handle pretty much as much data as you throw at it; however, it isn't optimized for real-time use and isn't as convenient as Cloud SQL's "MySQL in a box".
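As a hedged illustration of the indexing hint, assuming the single table from the question lands in Cloud SQL (MySQL) with the same column names, a composite index covering the filter and grouping columns is the kind of thing EXPLAIN should show being used (connection details and names are placeholders):

    # Sketch: add a composite index for the monthly aggregates, then check the
    # plan with EXPLAIN. Requires the mysql-connector-python package.
    import mysql.connector

    conn = mysql.connector.connect(host="127.0.0.1", user="app", password="CHANGE_ME", database="spend")
    cur = conn.cursor()

    cur.execute("CREATE INDEX idx_code_org_date ON spending (product_code, org, date)")
    cur.execute("""
        EXPLAIN
        SELECT org, EXTRACT(YEAR_MONTH FROM date) AS month, SUM(spend)
        FROM spending
        WHERE product_code LIKE 'QX%'
        GROUP BY org, month
    """)
    for row in cur.fetchall():
        print(row)  # the 'key' column of the plan should name the new index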
Take a look at ElasticSearch. It's JSON, REST, cloud, distributed, quick on aggregate queries and so on. It may or may not be what you're looking for.

Tableau data extract refresh from Google BigQuery takes very long

We are very pleased with the combination of BigQuery <-> Tableau Server with a live connection. However, we now want to work with a data extract (500MB) on Tableau Server (since this data source is not too big and is used very frequently). The refresh takes too much time (1.5h+). We noticed that only 0.1% of this is query time and the rest is data export. Since the Tableau Server is on the same platform and in the same location, latency should not be a problem.
This is similar to the slow export of a BigQuery table to a single file, which can be solved by using the "daisy chain" option (wildcards). Unfortunately we can't use similar logic for a Google BigQuery data extract refresh in Tableau...
We have identified some approaches, but are not pleased with our current ideas:
Working with incremental refresh: our existing BigQuery table rows can change, and these changes can only be applied in Tableau with a full refresh
Exporting the BigQuery table to GCS using the daisy chain option and making a Tableau data extract using the Tableau SDK: this would result in quite some overhead...
Writing a Dataflow job using a custom sink for Tableau Server (data extracts).
Experimenting with a Tableau web connector that communicates directly with the BigQuery API: I don't think this will be faster? I didn't see anything about parallelizing calls with the Tableau web connector, and I haven't tried this approach yet.
We would prefer a non-technical option, to limit maintenance... Is there a way to modify the Tableau connector to make use of the "daisy chain" option for BigQuery?
You've uploaded the data into BigQuery. Can't you just use the input for that load job (a CSV, perhaps) as input for Tableau?
When we use Tableau and BigQuery we also notice that extracts are slow, but we generally don't do that because you lose BigQuery's power. We start with a live data connection at first, and then (if needed) convert this into a custom query that aggregates the data into a much smaller dataset, which extracts in just a few seconds.
Another way to achieve higher performance with BigQuery and Tableau is aggregating or joining tables beforehand. JOINs on huge tables can be slow, so if you use a lot of those you might consider generating a denormalised dataset which does all of the joining first. You will get a dataset with a lot of duplicates and a lot of columns, but if you select only what you need in Tableau (hide unused fields!) those extra columns won't count towards your query cost.
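As a hedged sketch of that pre-joining step (all table and column names are invented for illustration), a scheduled CREATE OR REPLACE TABLE ... AS SELECT does the JOIN once so neither the extract nor the live connection ever has to:

    # Sketch: materialize a denormalised table so Tableau never runs the JOIN itself.
    # All identifiers are illustrative placeholders; requires google-cloud-bigquery.
    from google.cloud import bigquery

    client = bigquery.Client()
    client.query("""
        CREATE OR REPLACE TABLE `my_project.reporting.sales_denormalised` AS
        SELECT s.*, c.customer_name, c.customer_region
        FROM `my_project.raw.sales` s
        JOIN `my_project.raw.customers` c USING (customer_id)
    """).result()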
One recommendation I have seen is similar to your point 2, where you export the BQ table to Google Cloud Storage and then use the Tableau Extract API to create a .tde from the flat files in GCS.
This was from an article on the Google Cloud site so I'd assume it would be best practice:
https://cloud.google.com/blog/products/gcp/the-switch-to-self-service-marketing-analytics-at-zulily-best-practices-for-using-tableau-with-bigquery
There is an article here which provides a step-by-step guide to achieving the above.
https://community.tableau.com/docs/DOC-23161
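The export step in those guides boils down to an extract job with a wildcard destination URI, so BigQuery "daisy chains" the table into many files instead of one slow single file; the .tde/.hyper is then built from those files with the Tableau Extract API. A minimal sketch of the export step (table and bucket names are placeholders):

    # Sketch: export a BigQuery table to GCS with a wildcard URI so the export is
    # sharded across many files. Requires the google-cloud-bigquery package.
    from google.cloud import bigquery

    client = bigquery.Client()
    extract_job = client.extract_table(
        "my_project.my_dataset.my_table",
        "gs://my-bucket/exports/my_table-*.csv",  # the '*' triggers the sharded ("daisy chain") export
    )
    extract_job.result()  # wait for completion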
It would be nice if Tableau optimised the BQ connector for extract refresh using the BigQuery Storage API. We too have our Tableau Server environment in the same GCP zone as our BQ datasets and experience slow refresh times.