BigQuery. Long execution time on small datasets


I created a new Google Cloud project and set up a BigQuery database. I tried several different queries, and they all take too long to execute. We don't have much data yet, so I expected high performance.
Below are some examples of queries and their execution time.
Query #1 (Job Id bquxjob_11022e81_172cd2d59ba):
select date(installtime) regtime
,count(distinct userclientid) users
,sum(fm.advcost) advspent
from DWH.DimUser du
join DWH.FactMarketingSpent fm on fm.date = date(du.installtime)
group by 1
The query failed after more than an hour with the error "Query exceeded resource limits. 14521.457814668494 CPU seconds were used, and this query must use less than 12800.0 CPU seconds."
Query execution plan: https://prnt.sc/t30bkz
Query #2 (Job Id bquxjob_41f963ae_172cd41083f):
select fd.date
,sum(fd.revenue) adrevenue
,sum(fm.advcost) advspent
from DWH.FactAdRevenue fd
join DWH.FactMarketingSpent fm on fm.date = fd.date
group by 1
Execution took 59.3 sec, 7.7 MB processed, which is too slow.
Query Execution plan: https://prnt.sc/t309t4
Query #3 (Job Id bquxjob_3b19482d_172cd31f629)
select date(installtime) regtime
,count(distinct userclientid) users
from DWH.DimUser du
group by 1
Execution time: 5.0 sec elapsed, 42.3 MB processed, which is not terrible but should be faster for such small volumes of data.
Tables used :
DimUser - Table size 870.71 MB, Number of rows 2,771,379
FactAdRevenue - Table size 6.98 MB, Number of rows 53,816
FactMarketingSpent - Table size 68.57 MB, Number of rows 453,600
The question is: what am I doing wrong that makes query execution time so long? If everything is OK, I would be glad to hear any advice on how to reduce execution time for such simple queries. If anyone from Google reads my question, I would appreciate it if the job IDs were checked.
Thank you!
P.s. Previously I had experience using BigQuery for other projects and the performance and execution time were incredibly good for tables of 50+ TB size.

Posting the same reply I gave in the GCP Slack workspace:
Both of your first two queries look like they have one particular worker that is overloaded. You can see this because, in the compute section, the max time is very different from the avg time. This could be for a number of reasons, but I can see that you are joining a table of 700k+ rows (looking at the 2nd input) to a table of ~50k rows (looking at the first input). This is not good practice; you should switch it so the larger table is the leftmost table. See https://cloud.google.com/bigquery/docs/best-practices-performance-compute?hl=en_US#optimize_your_join_patterns
You may also have heavy skew in your join keys (e.g. 90% of rows are on 1/1/2020, or NULL). Check this.
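One way to picture the skew check is to count rows per join key on each side and see how much of the joined output a single key produces. The sketch below is only an illustration in Python: the key lists are made-up data, and `join_fanout` is a hypothetical helper, not part of BigQuery.

```python
from collections import Counter


def join_fanout(left_keys, right_keys):
    """Return (total joined rows, share produced by the heaviest key)."""
    left = Counter(left_keys)
    right = Counter(right_keys)
    # An inner equi-join emits left_count * right_count rows per shared key.
    per_key = {k: left[k] * right[k] for k in left.keys() & right.keys()}
    total = sum(per_key.values())
    heaviest = max(per_key.values(), default=0)
    share = heaviest / total if total else 0.0
    return total, share


# Hypothetical data: three dates, one of which dominates both tables.
left_keys = ["2020-01-01"] * 90 + ["2020-01-02"] * 5 + ["2020-01-03"] * 5
right_keys = ["2020-01-01"] * 80 + ["2020-01-02"] * 10 + ["2020-01-03"] * 10

total, share = join_fanout(left_keys, right_keys)
print(total, round(share, 3))  # 7300 0.986
```

Here 100 rows joined to 100 rows produce 7,300 output rows, and one date accounts for almost all of them. Whichever worker handles that key does nearly all of the work, which matches the pattern of a max time far above the avg time.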
For the third query, that time is expected; try an approximate count instead to speed it up. Also note that BQ tends to get faster if you run the same query over and over, so this will get quicker.

Related

BigQuery - how to decrease slot time of Coalesce execution step?

I have a pretty complex query, with about 70 execution steps. The query was somewhat optimized for performance, so most of the steps in the execution plan run pretty fast, except the Coalesce steps, which take about 10 to 100 times more slot time than the others. As far as I understand, a Coalesce step prepares data for the following Join step, but why does it take so long even if the actual number of records processed by this step is low? The most extreme case I saw looks like this (ZERO records processed by this step, but it still takes 8 seconds!):
S46: Coalesce
Slot time: 8223 ms
Duration: 92 ms
Bytes Shuffled: 0 B
I wasn't able to find any hints regarding this "Coalesce" step and ways to optimize it in Google documentation, so perhaps you can give me some advice about it or point to actual documentation that explains it.

Check the execution time of a query accurate to the microsecond

I have a query in SQL Server 2019 that does a SELECT on the primary key fields of a table. This table has about 6 million rows of data in it. I want to know exactly how fast my query is, down to the microsecond (or at least to within 100 microseconds). My query is faster than a millisecond, but all I can find in SQL Server is query measurements accurate to the millisecond.
What I've tried so far:
SET STATISTICS TIME ON
This only shows milliseconds
Wrapping my query like so:
DECLARE @Start DATETIME2, @End DATETIME2
SELECT @Start=SYSDATETIME()
SELECT TOP 1 b.COL_NAME FROM BLAH b WHERE b.key = 0x1234
SELECT @End=SYSDATETIME();
SELECT DATEDIFF(MICROSECOND,@Start,@End)
This shows that no time has elapsed at all. But this isn't accurate, because if I add WAITFOR DELAY '00:00:00.001', which should add a measurable millisecond of delay, it still shows 0 for the datediff. Only if I wait for 2 milliseconds do I see it show up in the datediff.
Looking up the execution plan and getting the total_worker_time from the sys.dm_exec_query_stats table.
Here I see 600 microseconds; however, the Microsoft docs seem to indicate that this number cannot be trusted:
total_worker_time ... Total amount of CPU time, reported in microseconds (but only accurate to milliseconds)
I've run out of ideas and could use some help. Does anyone know how I can accurately measure my query in microseconds? Would extended events help here? Is there another performance monitoring tool I could use? Thank you.

This is too long for a comment.
In general, you don't look for performance measurements measured in microseconds. There is just too much variation, based on what else is happening in the database, on the server, and in the network.
Instead, you set up a loop and run the query thousands -- or even millions -- of times and then average over the executions. There are further nuances, such as clearing caches if you want to be sure that the query is using cold caches.
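A minimal sketch of that loop, written in Python and timing from the client side. `fake_query` is a hypothetical stand-in for the real database call, and a client-side measurement like this also includes driver and network overhead, so treat the numbers as an upper bound rather than pure server time.

```python
import statistics
import time


def time_repeated(run_query, repetitions=10_000):
    """Run `run_query` repeatedly; return (mean, stdev) in microseconds."""
    samples_us = []
    for _ in range(repetitions):
        start = time.perf_counter_ns()
        run_query()
        samples_us.append((time.perf_counter_ns() - start) / 1_000)
    return statistics.mean(samples_us), statistics.stdev(samples_us)


def fake_query():
    # Hypothetical placeholder; replace with the real query execution.
    sum(range(1_000))


mean_us, stdev_us = time_repeated(fake_query, repetitions=1_000)
print(f"mean {mean_us:.1f} us, stdev {stdev_us:.1f} us")
```

Reporting the standard deviation next to the mean shows how much of the measurement is noise, which is the variation the answer warns about.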

BigQuery Count Appears to be Processing Data

I noticed that running a SELECT count(*) FROM myTable on my larger BQ tables yields long running times, upwards of 30/40 seconds despite the validator claiming the query processes 0 bytes. This doesn't seem quite right when 500 GB queries run faster. Additionally, total row counts are listed under details -> Table Info. Am I doing something wrong? Is there a way to get total row counts instantly?

When you run a count BigQuery still needs to allocate resources (such as: slot units, shards etc). You might be reaching some limits which cause a delay. For example, the slots default per project is 2,000 units.
The BigQuery execution plan provides very detailed information about the process, which can help you better understand the source of the delay.
One way to overcome this is to use an approximate method described in this link
This Slide by Google might also help you
For more details see this video about how to understand the execution plan

How to understand statistics of trace file in Oracle. Such as CPU, elapsed time, query...etc

I am learning query optimization in Oracle, and I know that a trace file contains statistics about the query execution and the EXPLAIN PLAN of the query.
At the bottom of the trace file is the EXPLAIN PLAN of the query. My first question is: does the part "time = 136437 us" show the time taken by each step of query execution? What does "us" mean? Is it a unit of time?
In addition, can anyone explain what statistics such as count, cpu, elapsed, disk and query mean? I have already googled and read the Oracle docs about them, but I still cannot understand them. Can anyone explain the meaning of those stats more clearly?
Thanks in advance. I am new and sorry for my English.

The smallest unit of data access in Oracle Database is a block. Not a row.
Each block can store many rows.
The database can access a block in current or consistent mode.
Current = as the block exists "right now".
Consistent = as the block existed at the time your query started.
The query and current columns report how many times the database accessed a block in consistent (query) and current mode.
When accessing a block it may already be in the buffer cache (memory). If so, no disk access is needed. If not, it has to do a physical read (pr). The disk column is a count of the total physical reads.
The stats for each line in the plan are the figures for that operation. Plus the sum of all its child operations.
In simple terms, the database processes the plan by accessing the first child first. Then passes the rows up to the parent. Then all the other child ops of that parent in order. Child operations are indented from their parent in the display.
So the database processed your query like so:
Read 2,000 rows from CUSTOMER. This required 749 consistent block gets and 363 disk reads (cr and pr values on this row). This took 10,100 microseconds.
Read 112,458 rows from BOOKING. This did 8,203 consistent reads and zero disk reads. This took 337,595 microseconds
Joined these two tables together using a hash join. The CR, PR, PW (physical writes) and time values are the sum of the operations below this. Plus whatever work this operation did. So the hash join:
did 8,952 - ( 749 + 8,203 ) = zero consistent reads
did 363 - ( 363 + 0 ) = zero physical reads
took 1,363,447 - ( 10,100 + 337,595 ) = 1,015,752 microseconds to execute
Notice that the CR & PR totals for the hash join match the query and disk totals in the fetch line?
The count column reports the number of times that operation happened. A fetch is a call to the database to get rows. So the client called the database 7,499 times. Each time it received ceil( 112,458 / 7,499 ) = 15 rows.
CPU is the total time in seconds the server's processors were executing that step. Elapsed is the total wall clock time. This is the CPU time + any extra work. Such as disk reads, network time, etc.
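The subtractions above can be checked with a few lines of Python. This is only a worked-arithmetic sketch: the figures are the ones quoted in this answer, and `self_cost` is a hypothetical helper name, not an Oracle function.

```python
import math


def self_cost(total, children):
    """Work done by an operation itself: its reported total minus its children's totals."""
    return total - sum(children)


# Hash join totals versus its two children (CUSTOMER and BOOKING reads).
hash_join_cr = self_cost(8_952, [749, 8_203])            # consistent reads
hash_join_pr = self_cost(363, [363, 0])                  # physical reads
hash_join_us = self_cost(1_363_447, [10_100, 337_595])   # microseconds

# Rows returned per fetch call, rounded up.
rows_per_fetch = math.ceil(112_458 / 7_499)

print(hash_join_cr, hash_join_pr, hash_join_us, rows_per_fetch)
# 0 0 1015752 15
```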

Understanding the Sql Server execution time

I'm using SQL Server 2012 and SET STATISTICS TIME ON to measure the CPU time for my SQL statements. I use this because I only want to get the time the database needs to execute the statement.
When returning large amounts of data from a SELECT, I noticed the CPU time going up pretty high: using TOP 2000 needs about 400 ms, but without it the query needs about 10000 ms of CPU time.
What I'm not sure about is:
Is it possible that the CPU time I get back includes something like the time needed to display the millions of returned rows in SQL Server Management Studio? That would be a pretty bad situation.
Update:
The time I want to get is the execution time of SQL Server without the time SSMS needs to display the rows. There are several time statistics displayed in the Client Statistics, but after searching for a long time, it's really hard to find good references explaining what they are. Any suggestions?
Idea: elapsed time(sql server execution time) - client processing time (Client statistics)
Maybe this is an option?

In a multi-threaded world, CPU time is increasingly less helpful for simple tuning. Execution time is worth looking at.
To see if execution time (elapsed time) spent on displaying results is included you could SELECT TOP 2000 * INTO #temp to compare execution times.
Update:
My quick tests suggest the overhead of creating/inserting into a #temp table outweighs that of displaying results (at 5000). When I go to 50,000 results, the SELECT INTO runs more quickly. The count at which the two become equivalent depends on how many fields are returned and what type they are. I tested with:
SET STATISTICS TIME ON
SELECT TOP 50000 NEWID()
FROM master..spt_values v1, master..spt_values v2
WHERE v1.number > 100
SET STATISTICS TIME OFF
-- CPU time = 32 ms, elapsed time = 121 ms.
SET STATISTICS TIME ON
SELECT TOP 50000 NEWID() col1
INTO #test
FROM master..spt_values v1, master..spt_values v2
WHERE v1.number > 100
SET STATISTICS TIME OFF
-- CPU time = 15 ms, elapsed time = 87 ms.

CPU time in SET STATISTICS TIME ON only counts the time that SQL Server needs to execute the query. It doesn't include any time the client takes to render the results. It also excludes any time SQL Server spends waiting for buffers to clear. In short, it really is pretty independent of the client.
