Does BigQuery BI Engine support LEFT JOIN or not?

As per the documentation, BI Engine is supposed to accelerate LEFT JOIN:
https://cloud.google.com/bi-engine/docs/optimized-sql#unsupported-features
I tried this dummy query as a view and connected it to Data Studio:
SELECT xx.country_region, yy._1_22_20
FROM `bigquery-public-data.covid19_jhu_csse.deaths` xx
LEFT JOIN `bigquery-public-data.covid19_jhu_csse.deaths` yy
  ON xx.country_region = yy.country_region
My question is: is LEFT JOIN supported or not?
Bug report here: https://issuetracker.google.com/issues/154786936
Data Studio report: https://datastudio.google.com/reporting/25710c42-acda-40a3-a3bf-68571c314650
Edit: it seems BI Engine is still under heavy development and needs more time to be feature complete. I just materialized my view, but it has a cost: 4 small tables (< 10 MB) that change every 5 minutes cost 11 GB/day. I guess it is worth it; Data Studio is substantially faster now. You can check it here (public report):
https://nemtracker.github.io/

Don't try JOINs, don't try sub-SELECTs, don't rely on complex queries for BI Engine.
The best practice is to CREATE OR REPLACE a table dedicated to the dashboard you're building. Make sure to not have nested/repeated data there either. Then BI Engine will make your reports shine.
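For example, a minimal sketch of that pattern, reusing the public table from the question (the destination table name and the column alias are hypothetical):
-- Flat, dashboard-specific table: no JOINs and no nested/repeated fields left for query time.
CREATE OR REPLACE TABLE `my_dataset.dashboard_deaths` AS
SELECT
  country_region,
  _1_22_20 AS deaths_jan_22_2020
FROM `bigquery-public-data.covid19_jhu_csse.deaths`;
Point Data Studio at this table instead of a view that joins at query time, and BI Engine can accelerate it.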
Related, check out this video I just made with the best practices for BI Engine with BigQuery:
https://youtu.be/zsm8FYrOfGs?t=307

Related

Google BigQuery Query exceeded resource limits

I'm setting up a crude data warehouse for my company and I've successfully pulled contact, company, deal and association data from our CRM into BigQuery, but when I join these together into a master table for analysis via our BI platform, I continually get this error:
Query exceeded resource limits. This query used 22602 CPU seconds but would charge only 40M Analysis bytes. This exceeds the ratio supported by the on-demand pricing model. Please consider moving this workload to the flat-rate reservation pricing model, which does not have this limit. 22602 CPU seconds were used, and this query must use less than 10200 CPU seconds.
As such, I'm looking to optimise my query. I've already removed all GROUP BY and ORDER BY clauses, and have tried using WHERE clauses to do additional filtering, but this seems illogical to me as it would add processing demands.
My current query is:
SELECT
  coy.company_id,
  cont.contact_id,
  deals.deal_id,
  {another 52 fields}
FROM `{contacts}` AS cont
LEFT JOIN `{assoc-contact}` AS ac
  ON cont.contact_id = ac.to_id
LEFT JOIN `{companies}` AS coy
  ON CAST(ac.from_id AS int64) = coy.company_id
LEFT JOIN `{assoc-deal}` AS ad
  ON coy.company_id = CAST(ad.from_id AS int64)
LEFT JOIN `{deals}` AS deals
  ON ad.to_id = deals.deal_id;
FYI, {assoc-contact} and {assoc-deal} are both separate views I created from the associations table to make it easier to associate those tables with the companies table.
It should also be noted that this query has occasionally run successfully, so I know it does work; it just fails about 90% of the time because the query is so big.
TLDR;
Check your join keys. 99% of the time the cause of the problem is a combinatoric explosion.
I can't know for sure since I don't have access to the underlying data, but I will give a general resolution method which, in my experience, has worked every time to find the root cause.
Long Answer
Investigation method
Say you are joining two tables
SELECT
cols
FROM L
JOIN R ON L.c1 = R.c1 AND L.c2 = R.c2
and you run into this error. The first thing you should do is check for duplicates in both tables.
SELECT
  c1, c2, COUNT(1) AS nb
FROM L
GROUP BY c1, c2
ORDER BY nb DESC
And the same thing for each table involved in a join.
I bet that you will find that your join keys are duplicated. BigQuery is very scalable, so in my experience this error happens when you have a join key that repeats more than 100,000 times in both tables. It means that after your join, you will have 100,000^2 = 10 billion rows!
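If that turns out to be the case, one common remedy is to deduplicate (or pre-aggregate) the offending side before joining. A sketch in BigQuery Standard SQL, reusing the L/R/c1/c2/cols names from above and assuming you only need one row per key:
SELECT
  cols
FROM L
JOIN (
  SELECT * EXCEPT (rn)
  FROM (
    SELECT
      *,
      -- Without an ORDER BY the kept row per key is arbitrary; aggregating instead may fit your data better.
      ROW_NUMBER() OVER (PARTITION BY c1, c2) AS rn
    FROM R
  )
  WHERE rn = 1
) AS R
  ON L.c1 = R.c1 AND L.c2 = R.c2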
Why BigQuery gives this error
In my experience, this error message means that your query does too much computation compared to the size of its inputs.
No wonder you're getting this if you end up with 10 billion rows after joining tables with a few million rows each.
BigQuery's on-demand pricing model is based on the amount of data read from your tables. This means that people could try to abuse it by, say, running CPU-intensive computations while reading small datasets. To give an extreme example, imagine someone makes a JavaScript UDF to mine Bitcoin and runs it on BigQuery:
SELECT MINE_BITCOIN_UDF()
The query will be billed $0 because it doesn't read anything, but will consume hours of Google's CPU. Of course they had to do something about this.
So this ratio exists to make sure that users don't do anything sketchy by using hours of CPU while processing a few MB of input.
Other MPP platforms with a different pricing model (e.g. Azure Synapse, which charges based on the amount of bytes processed, not bytes read like BQ does) would perhaps have run the query without complaining, and then billed you 10 TB for processing that 40 MB table.
P.S.: Sorry for the late and long answer; it's probably too late for the person who asked, but hopefully it will help whoever runs into this error.

How to optimise Google BigQuery with 17+ tables which contain approx. 55 GB of data?

I have a huge data store which contains 20+ tables; all tables contain data in the GBs.
So basically I'm exporting all the data into CSV for analysis. I have 17+ tables in a join query which processes billions of records. Google says it will process 10 GB of data.
Now the problem is that the query takes too much time and too many resources, and sometimes it fails with a resource limit error. How can I optimize such a query?
FYI: I'm using LEFT JOIN.
The best way to optimize your query is to implement partitioning and clustering, ideally on the fields used in the join conditions.
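A minimal sketch of what that can look like in BigQuery (the dataset, table, and column names here are hypothetical):
-- Recreate a large table partitioned by a date column and clustered by the join key.
CREATE TABLE `my_dataset.orders_partitioned`
PARTITION BY DATE(created_at)
CLUSTER BY customer_id
AS
SELECT *
FROM `my_dataset.orders`;
Joining on customer_id, and filtering on the partition column wherever possible, then lets BigQuery prune most of the data instead of scanning every table in full.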

Google BigQuery move to SQL Server, Big Data table optimisation

I have a curious question and, as my name suggests, I am a novice, so please bear with me. Oh, and hi to you all; I have learned so much using this site already.
I have an MSSQL database for customers where I am trying to track their status on a daily basis, with various attributes being recorded in several tables, which are then joined together using a data table to create a master table that yields approximately 600 million rows.
As you can imagine, querying this beast on a middling server (Intel i5, SSD for the OS, 2 TB 7200 RPM HD, SQL Server 2017 Standard) is really slow. I was using Google BigQuery, but that got expensive very quickly. I have implemented indexes which have somewhat sped up the process, but it is still not fast enough. A simple SELECT DISTINCT on customer ID for a given attribute still takes 12 minutes on average for a first run.
The whole point of having a daily view is to make it easier to have something like Tableau or Qlik connect to a single table, so the end user can create reports by just dragging in the required columns. I have thought of taking the main query that creates the master table and parameterizing it, but visualization tools aren't great for passing many variables.
This is a snippet of the table: there are approximately 300,000 customers, and a row per day is created for each customer who joined between 2010 and 2017. They fall off the list if they leave.
My questions are:
1) Should I even bother creating a flat file, or should I just parameterize the query?
2) Are there any techniques I can use, aside from setting the smallest data types for each column, to keep the DB size to a minimum?
3) There are in fact over a hundred attribute columns; a lot of them, once they are set to either 0 or 1, seldom change. Is there another way to achieve this and save space?
4) What types of indexes should I have on the master table if many of the attributes are binary?
Any ideas would be gratefully received.

Slow spatial comparison when using cross join

I'm using U-SQL to select all objects which are inside one or more of the shapes. The code works but is really slow. Is there some way to make it more performant?
#rs1 =
SELECT DISTINCT aisdata.imo,
portshape.unlocode
FROM #lsaisdata AS aisdata
CROSS JOIN
#portsshape AS portshape
WHERE Geometry.STMPolyFromText(new SqlChars(portshape.shape.Replace("Z", "").ToCharArray()), 0).STContains(Geometry.Point(aisdata.lon, aisdata.lat, 0)).IsTrue;
Added more information about my issue:
I've registered Microsoft.SqlServer.Types.dll and SqlServerSpatial130.dll to be able to use spatial functions in U-SQL.
I'm running my job in Data Lake Analytics using two AUs. Initially I used 10 AUs, but the Diagnostics tab stated that the job was 8 AUs over-allocated and that the max useful AUs was 2.
The job takes about 27 minutes to run with the UDT code below, and the cross join takes almost all of this time.
The input is one CSV file (66 MB) and one WKT file (2.4 MB).
I'm using Visual Studio 2015 with Azure Data Lake Tools v2.2.5000.0.
I tried encapsulating some of the spatial code in UDTs and that improved the performance to 27 minutes:
#rs1 =
SELECT DISTINCT aisdata.imo,
portshape.unlocode
FROM #lsaisdata AS aisdata
CROSS JOIN
#portsshape AS portshape
WHERE portshape.geoShape.GeoObject.STContains(SpatialUSQLApp.CustomFunctions.GetGeoPoint(aisdata.lon, aisdata.lat).GeoObject).IsTrue;
First, a CROSS JOIN will always explode your data into an N×M matrix. Depending on the number of rows, this may make it very expensive and possibly hard to estimate the correct degree of parallelism.
Secondly, I assume that the spatial join you do is an expensive operation. For example, if you use SQL Server 2012's spatial capabilities (2016 has native implementations of the types that may be a bit faster), I assume you would probably see similar performance behavior. Most of the time you need a spatial index to get better performance. Now, U-SQL does not support spatial indices, but you could probably approximate the same behavior by using an abstraction (like a tessellation of the objects and determining whether they overlap) to provide a faster pre-filter/join before you then test the exact condition to weed out the false positives.
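A conceptual sketch of that pre-filter in generic SQL (not U-SQL): the min_lon/max_lon/min_lat/max_lat columns are assumed to be bounding boxes precomputed per port shape, and ExactContains is a placeholder for the STContains UDT call from the question.
-- Step 1: a cheap bounding-box test replaces the blind CROSS JOIN and discards most pairs.
-- Step 2: the expensive exact containment test runs only on the surviving candidates.
SELECT DISTINCT candidates.imo,
                candidates.unlocode
FROM (
    SELECT aisdata.imo, aisdata.lon, aisdata.lat,
           portshape.unlocode, portshape.shape
    FROM lsaisdata AS aisdata
    JOIN portsshape AS portshape
      ON  aisdata.lon BETWEEN portshape.min_lon AND portshape.max_lon
      AND aisdata.lat BETWEEN portshape.min_lat AND portshape.max_lat
) AS candidates
WHERE ExactContains(candidates.shape, candidates.lon, candidates.lat);  -- placeholder for the exact geometry test
In U-SQL the range conditions would likely have to live in a WHERE clause rather than the ON clause (its joins expect equality predicates), but the shape of the optimization is the same: cheap coarse filter first, exact geometry test last.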

Using PowerBI to visualize large amounts of data on a SQL Data Warehouse

I have a SQL DW which is about 30 GB. I want to use PowerBI to visualize this data, but I noticed PowerBI Desktop only supports file sizes up to 250 MB. What is the best way to connect PowerBI to visualize this data?
You have a couple of choices depending on your use case:
Direct query of the source data
View based aggregations of the source data
Direct Query
For smaller datasets (think in the thousands of rows), you can simply connect PowerBI directly to Azure SQL Data Warehouse and use the table view to pull in the data as necessary.
View Based Aggregations
For larger datasets (think millions, billions, even trillions of rows) you're better served by running the aggregations within SQL Data Warehouse. This can take the shape of a view that creates the aggregations (think sales by hour instead of every individual sale), or you can create a permanent table at data loading time through a CTAS operation that contains the aggregations your users commonly query against. This latter CTAS model leaves the user with a simple select-plus-filter operation (say, aggregated sales greater than today - 90 days). Once the view or reporting table is created, you can simply connect to PowerBI as you normally would.
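A minimal sketch of such a CTAS reporting table (the table, column, and distribution choices here are hypothetical):
-- Build a pre-aggregated reporting table at load time: sales by store and hour.
CREATE TABLE dbo.rpt_SalesByHour
WITH (DISTRIBUTION = HASH(StoreId), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT StoreId,
       DATEADD(HOUR, DATEDIFF(HOUR, 0, SaleDateTime), 0) AS SaleHour,
       SUM(SaleAmount) AS TotalSales,
       COUNT(*) AS SaleCount
FROM dbo.FactSales
GROUP BY StoreId,
         DATEADD(HOUR, DATEDIFF(HOUR, 0, SaleDateTime), 0);
PowerBI then connects to dbo.rpt_SalesByHour like any other table, and the user's common filter (for example, the last 90 days of SaleHour) stays a cheap WHERE clause over far fewer rows.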
The PowerBI team has a blog post - Exploring Azure SQL Data Warehouse with PowerBI - that covers this as well.
You could also create a query (Power Query / M) that retrieves only the required level of data (i.e. groups, joins, filters, etc.). If done right, the queries are translated to T-SQL and only a limited amount of data is downloaded into the Power BI designer.