How to avoid errors when querying BigQuery with Data Studio? - google-bigquery

I have a Google Data Studio dashboard that fetches data from a Google BigQuery view. I am running into "Quota Error: Too many concurrent queries." I was thinking of getting around this by using batch queries.
Any other solutions to get around the error are welcome.
Thank you

Batch queries won't help in this case - Data Studio would not know how to retrieve the batched results.
My preferred option for these cases is to copy the query results out of BigQuery into temporary storage (sheets, GCS, MySQL...) and have Data Studio read the results from there. The best place depends on the shape of your data and the results you are trying to visualize.
Other options - depending on your exact use case:
Turn on caching in Data Studio, which will prefetch data and run queries against the cache.
Materialize the view so that the queries run faster (see the sketch below).
Reduce the number of components on the page so that they don't generate as many queries.
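For the "materialize the view" option, one simple pattern is a scheduled query that rewrites the view's results into an ordinary table, and pointing the Data Studio data source at that table. A minimal sketch in BigQuery Standard SQL, where reporting.my_view and reporting.my_view_materialized are hypothetical names to replace with your own:
-- Rewrite the view's results into a plain table on a schedule (e.g. a BigQuery scheduled query).
CREATE OR REPLACE TABLE `reporting.my_view_materialized` AS
SELECT *
FROM `reporting.my_view`;
Run it on whatever schedule matches how fresh the dashboard needs to be; each dashboard component then scans a precomputed table instead of re-running the view logic on every interaction.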
(This answer might change depending on future Data Studio updates)

Related

What's the best method of creating an SSRS report that will be run manually many times with different parameters?

I have an SSRS Sales report that will be run many times a day by users, but with different parameters selected for the branch and product types.
The SQL query uses some large tables and is quite complex, therefore, running it many times is going to have a performance cost.
I assumed the best solution would be to create a dataset for the report with all permutations, run once overnight, and then apply filters when the users run the report.
I tried creating a snapshot in SSRS which doesn’t consider the parameters and therefore has all the required data, then filtering the Tablix using the parameters that the users selected. The snapshot works fine but it appears to be refreshed when the report is run with different parameters.
My next solution would be to create a table for the dataset which the report would then point to. I could recreate the table every night using a stored procedure. With a couple of small indexes the report would be lightning fast.
This solution would seem to work really well but my knowledge of SQL is limited, and I can’t help thinking this is not the right solution.
Is this suitable? Are there better ways? Can anybody confirm either way?
SSRS datasets have caching capabilities. I think you'll find this more useful than having to create extra db tables and such.
Please see here https://learn.microsoft.com/en-us/sql/reporting-services/report-server/cache-shared-datasets-ssrs?view=sql-server-ver15
If the rate of change of the data is low enough, and SSRS caching doesn't suit your needs, then you could manually cache the record set from the report query (without the filtering) into its own table, and then modify the report to query from that table.
Oracle and most data warehouse implementations have a formal mechanism specifically for this called materialized views. There is no such luck in SQL Server, though you can easily implement the same pattern yourself.
There are 2 significant drawbacks to this:
The data in the new table is a snapshot at the point in time that it was loaded, so this technique is better suited to slow moving datasets or reports where it is not critical that the data is 100% accurate.
You will need to manage the lifecycle of the data in this table; ideally you should set up a job or scheduled task to automate this refresh, but you could trigger a refresh as part of the logic in your report (not recommended, but possible).
Though it is possible, you should NOT consider using a TRIGGER to update the data: as you have already indicated, the query takes some time to execute, and this could have a major impact on the rest of your LOB application.
If you do go down this path you should write the refresh logic into a stored procedure so that it can be executed when needed and from other internal and external automation mechanisms.
You should also add a column that records the date and time of when the dataset was executed, then replace any references in your report that display the date and time the report was printed, with the time the data was prepared.
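A minimal sketch of such a refresh procedure in T-SQL; the cache table, source table, and column names are all hypothetical stand-ins for your real report query:
-- Hypothetical objects: rpt.SalesReportCache and the SELECT are placeholders for the real, expensive report query.
CREATE PROCEDURE rpt.RefreshSalesReportCache
AS
BEGIN
    SET NOCOUNT ON;
    TRUNCATE TABLE rpt.SalesReportCache;
    INSERT INTO rpt.SalesReportCache (Branch, ProductType, SalesAmount, LoadedAt)
    SELECT s.Branch, s.ProductType, SUM(s.Amount), GETDATE()  -- LoadedAt records when the data was prepared
    FROM dbo.Sales AS s
    GROUP BY s.Branch, s.ProductType;
END;
Schedule it with a SQL Server Agent job overnight; the report's dataset then becomes a simple SELECT from rpt.SalesReportCache filtered by the branch and product type parameters, and LoadedAt gives you the "data as of" timestamp to show on the report.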
It is also worth pointing out that performance issues with expensive queries in SSRS reports can often be overcome by reducing the functions and value formatting in the SQL query itself and moving that logic into the report definition. This goes for filtering operations too: you can easily add additional computed columns in the dataset definition or in the design surface, and you can implement filtering directly in the tablix as well. There is no requirement that every record from the SQL query be displayed in the report at all, just as we do not need to show every column.
Sometimes a few well-crafted indexes can help too; for complicated reports we can often find a balance between what the SQL engine can do efficiently and what the RDL can do for us.
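For example, building on the hypothetical cache table sketched above, a covering index keyed on the report parameters keeps the filtered reads cheap:
-- Hypothetical index on the hypothetical cache table; column names are illustrative only.
CREATE NONCLUSTERED INDEX IX_SalesReportCache_Branch_ProductType
ON rpt.SalesReportCache (Branch, ProductType)
INCLUDE (SalesAmount, LoadedAt);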
Disclaimer: This is hypothetical advice, you should evaluate each report on a case by case basis.

Is there a way to find out how much data a Google Data Studio dashboard consumes from BigQuery?

I would like to know how much data Data Studio consumes when querying a view in BigQuery.
For example, if I have a dashboard that is getting its data from a View in BigQuery, how much data would it be using?
I have tried to look at the usage logs of BigQuery, but due to my lack of experience with the tools I have not been able to find a solution. I have been able to find the bytes processed for the data in question (the view in BigQuery), but I don't know how much of that comes from Data Studio.
If you look in BigQuery at the Query History, you can search for queries that have used that view by typing in its name.
Queries from Data Studio will have strange-looking names, e.g.
COUNT(DISTINCT t0.yourVariable) AS t0_qt_QqufHrYw
If you click on the query you will see the amount of data processed and billed (Bytes processed & Bytes billed).
Bear in mind that each component of your report will have its own query (all at a similar time), so you may need to add them up to find the total bytes queried.
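If you would rather total this up with a query than click through the history, BigQuery's INFORMATION_SCHEMA jobs views expose the same numbers. A rough sketch, where the region qualifier, the time window, and the view name my_dataset.my_view are assumptions you will need to adapt:
-- Assumes the US multi-region and a hypothetical view name; adjust both to match your project.
SELECT
  user_email,
  creation_time,
  total_bytes_processed,
  total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND query LIKE '%my_dataset.my_view%'
ORDER BY creation_time DESC;
Summing total_bytes_billed over the rows whose user_email matches the account your Data Studio data source uses gives that dashboard's share.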

Performance enhancement when using Direct Query to get data from SQL server in Power BI

I am using PBI Desktop to create PBIX files which I later upload to Azure Power BI Embedded (PaaS solution). The PBIX gets data from Azure SQL server in Direct Query mode. I want to increase the efficiency of queries that Power BI Embedded sends to SQL for getting my data.
My PBIX contains relationships between many tables and has RLS (Row-Level Security) configured, and it takes a lot of time to load. Please advise whether the following options will help me increase the efficiency of the queries, thus reducing the time taken by the PBIX to load:
Using Advanced Options in the Get Data dialog box: Inserting a SQL statement here will get only specific data instead of the entire table (an illustrative statement is sketched below). This will reduce the data I see in PBI Desktop, but will it really increase the efficiency of the queries sent to SQL for the creation of charts? E.g., say the PBIX needs to create a join between two tables. If I use the advanced options, will the join be done on the reduced data?
Using Filters to filter out unwanted rows of the table: Again, like the above option, this will reduce the data I see in PBI Desktop, but will it really increase the efficiency of the queries sent to SQL for the creation of charts? E.g., if I use filters, will the join be done on the reduced data?
[EDIT - I have added one more option below]
Are the queries for charts on different pages of a PBIX file sent to SQL only when the page is loaded? If this is true, then I can separate my charts into different pages to reduce the number of queries sent at once to SQL.
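For clarity, by "inserting a SQL statement" in option 1 I mean something like the following; the table and column names here are made up:
-- Hypothetical narrowed source query pasted into the SQL statement box of the Get Data dialog.
SELECT s.SaleId, s.StoreId, s.Amount, s.SaleDate
FROM dbo.Sales AS s
WHERE s.SaleDate >= DATEADD(YEAR, -1, GETDATE());
The question is whether Power BI then builds its chart queries (including the joins) on top of this reduced statement.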

Solution to host 200GB of data and provide JSON API with aggregates?

I am looking for a solution that will host a nearly-static 200GB, structured, clean dataset, and provide a JSON API onto the data, for querying in a web app.
Each row of my data looks like this, and I have about 700 million rows:
parent_org,org,spend,count,product_code,product_name,date
A31,A81001,1003223.2,14,QX0081,Rosiflora,2014-01-01
The data is almost completely static - it updates once a month. I would like to support straightforward aggregate queries like:
get total spending on product codes starting QX, by organisation, by month
get total spending by parent org A31, by month
And I would like these queries to be available over a RESTful JSON API, so that I can use the data in a web application.
I don't need to do joins, I only have one table.
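In SQL terms those aggregates look roughly like this (the column names come from the sample row above; the table name spending is just a placeholder, and the DATE_TRUNC spelling is BigQuery-flavoured, so it varies between engines):
-- Total spend on product codes starting 'QX', by organisation, by month.
SELECT org, DATE_TRUNC(date, MONTH) AS month, SUM(spend) AS total_spend
FROM spending
WHERE product_code LIKE 'QX%'
GROUP BY org, month;
-- Total spend for parent org 'A31', by month.
SELECT DATE_TRUNC(date, MONTH) AS month, SUM(spend) AS total_spend
FROM spending
WHERE parent_org = 'A31'
GROUP BY month;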
Solutions I have investigated:
To date I have been using Postgres (with a web app to provide the API), but am starting to reach the limits of what I can do with indexing and materialized views, without dedicated hardware + more skills than I have
Google Cloud Datastore: is suitable for structured data of about this size, and has a baked-in JSON API, but doesn't do aggregates (so I couldn't support my "total spending" queries above)
Google BigTable: can definitely do data of this size, can do aggregates, could build my own API using App Engine? Might need to convert data to hbase to import.
Google BigQuery: fast at aggregating, would need to roll my own API as with BigTable, easy to import data
I'm wondering if there's a generic solution for my needs above. If not, I'd also be grateful for any advice on the best setup for hosting this data and providing a JSON API.
Update: It seems that BigQuery and Cloud SQL support SQL-like queries, but Cloud SQL may not be big enough (see comments), and BigQuery gets expensive very quickly because you're paying per query, so it isn't ideal for a public web app. Datastore is good value, but doesn't do aggregates, so I'd have to pre-aggregate and have multiple tables.
Cloud SQL is likely sufficient for your needs. It certainly is capable of handling 200GB, especially if you use Cloud SQL Second Generation.
The only reason why a conventional database like MySQL (the database Cloud SQL uses) might not be sufficient is if your queries are very complex and not indexed. I recommend you try Cloud SQL, and if the performance isn't sufficient, try ensuring you have sufficient indexes (hint: use the EXPLAIN statement to see how the queries are being executed).
If your queries cannot be indexed in a useful way, or your queries are so CPU intensive that they are slow regardless of indexing, you might want to graduate up to BigQuery. BigQuery is parallelised so that it can handle pretty much as much data as you throw at it; however, it isn't optimized for real-time use and isn't as convenient as Cloud SQL's "MySQL in a box".
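If you do try Cloud SQL first, here is a minimal sketch of the index-then-EXPLAIN approach in MySQL, assuming the table is called spending (index names are hypothetical):
-- Composite indexes matching the two aggregate query shapes described above.
CREATE INDEX idx_spending_code_org_date ON spending (product_code, org, date);
CREATE INDEX idx_spending_parent_date ON spending (parent_org, date);
-- Check that the planner actually uses them.
EXPLAIN
SELECT org, DATE_FORMAT(date, '%Y-%m-01') AS month, SUM(spend) AS total_spend
FROM spending
WHERE product_code LIKE 'QX%'
GROUP BY org, month;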
Take a look at ElasticSearch. It's JSON, REST, cloud, distributed, quick on aggregate queries and so on. It may or may not be what you're looking for.

Query speed up strategies

At the company I am working at, we have an application built on JBoss/Apache/Hibernate with an MS SQL 2005 database.
We have a page that loads a bunch of transactions. We timed this during loading of the page and it takes about 15-20 seconds to load the files; this is because the queries being built (not sure if these are built by Hibernate) join a large number of tables.
To rectify the issue we changed some left joins to inner joins and added indexes to the tables. However, this doesn't really solve the issue; it gets better, but not significantly.
Any ideas?
You can move your read-only database instance to its own server, use solid state drives, and adjust your indexes. Another way to optimize this would be to run a query that creates a table you can access with a simple query, instead of running a bunch of queries at run time.
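A sketch of that pre-built table idea in T-SQL; dbo.TransactionSummary and the joined tables are hypothetical stand-ins for whatever the page query actually touches:
-- Rebuild a flattened summary table on a schedule so the page can read it with one simple query.
IF OBJECT_ID('dbo.TransactionSummary', 'U') IS NOT NULL
    DROP TABLE dbo.TransactionSummary;
SELECT t.TransactionId, t.TransactionDate, c.CustomerName, a.AccountNumber, t.Amount
INTO dbo.TransactionSummary
FROM dbo.Transactions AS t
INNER JOIN dbo.Customers AS c ON c.CustomerId = t.CustomerId
INNER JOIN dbo.Accounts AS a ON a.AccountId = t.AccountId;
CREATE INDEX IX_TransactionSummary_Date ON dbo.TransactionSummary (TransactionDate);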
What did you do to determine which indexes to add? I've always had great luck with the MSSQL Index Tuning Wizard - you can use SQL Profiler to trace the database activity during a page load, and then tell the tuning wizard to suggest new indexes and statistics based on that activity. It will generally suggest a handful of indexes that can make a huge difference.
Are the databases on high-contention disks? Maybe the queries would be faster if the databases were on their own physical disks. Given the size of your underlying tables, maybe the database server is under-powered - does it have enough spare resources to handle the file loading?
How many records are being returned by the query?
If there are a lot of records, you may want to do some sort of custom paging and only return the number of records that are on the current page (i.e., a page size of 50 will only return records 1-50).
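On SQL Server 2005 that kind of custom paging is usually done with ROW_NUMBER(); a minimal sketch with hypothetical table and column names:
-- Return one page (rows 1-50 here) instead of the whole result set.
WITH Paged AS (
    SELECT t.TransactionId, t.TransactionDate, t.Amount,
           ROW_NUMBER() OVER (ORDER BY t.TransactionDate DESC) AS RowNum
    FROM dbo.Transactions AS t
)
SELECT TransactionId, TransactionDate, Amount
FROM Paged
WHERE RowNum BETWEEN 1 AND 50;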