Is there any performance benefit when we use LIMIT? - google-bigquery

For example:
SELECT company_ID, totalRevenue
FROM `BigQuery.BQdataset.companyperformance`
ORDER BY totalRevenue LIMIT 10
The only difference I can see between using and not using LIMIT 10 is the amount of data returned for display to the user.
The system still orders all the data first before performing a LIMIT.

The following applies to BigQuery.
It is not necessarily 100% technically correct, but it is close enough to give you an idea of why LIMIT N is extremely important to consider in BigQuery.
Assume you have 1,000,000 rows of data and 8 workers to process a query like the one below:
SELECT * FROM table_with_1000000_rows ORDER BY some_field
Round 1: To sort this data, each worker gets 125,000 rows, so now you have 8 sorted sets of 125,000 rows each.
Round 2: Worker #1 sends its sorted data (125,000 rows) to worker #2, #3 sends to #4, and so on. Now we have 4 workers, each producing an ordered set of 250,000 rows.
Round 3: The above logic is repeated, and now we have just 2 workers, each producing an ordered list of 500,000 rows.
Round 4: And finally, just one worker produces the final ordered set of 1,000,000 rows.
Of course, depending on the number of rows and the number of available workers, the number of rounds can differ from the example above.
In summary, what we have here:
a. A huge amount of data being transferred between workers, which can be a significant factor in performance going down.
b. A chance that one of the workers will not be able to process the amount of data distributed to it. This can happen earlier or later and usually manifests as a "Resources exceeded ..." type of error.
Now, if you have LIMIT as part of the query, as below:
SELECT * FROM table_with_1000000_rows ORDER BY some_field LIMIT 10
Now, Round 1 is going to be the same. But starting with Round 2, ONLY the top 10 rows are sent to the next worker; thus in each round after the first one, only 20 rows are merged and only the top 10 of them are sent on for further processing.
Hopefully you can see how different these two processes are in terms of the volume of data being sent between workers and how much work each worker needs to do to sort its share of the data.
To summarize:
Without LIMIT 10:
• Initial rows moved (Round 1): 1,000,000
• Initial rows ordered (Round 1): 1,000,000
• Intermediate rows moved (Rounds 2-4): 1,500,000
• Overall merged ordered rows (Rounds 2-4): 1,500,000
• Final result: 1,000,000 rows
With LIMIT 10:
• Initial rows moved (Round 1): 1,000,000
• Initial rows ordered (Round 1): 1,000,000
• Intermediate rows moved (Rounds 2-4): 70
• Overall merged ordered rows (Rounds 2-4): 140
• Final result: 10 rows
Hopefully the numbers above clearly show the performance difference you gain by using LIMIT N, and in some cases even the ability to run the query successfully at all without a "Resources exceeded ..." error.
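If you want to see this on your own data, a hedged way to compare the two variants (using the table and columns from the question) is to run both and look at the per-stage shuffle statistics in the query's Execution details:
-- Without LIMIT: the final stage has to merge and output every ordered row.
SELECT company_ID, totalRevenue
FROM `BigQuery.BQdataset.companyperformance`
ORDER BY totalRevenue;
-- With LIMIT: each stage only forwards its local top 10, so later stages stay tiny.
SELECT company_ID, totalRevenue
FROM `BigQuery.BQdataset.companyperformance`
ORDER BY totalRevenue
LIMIT 10;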

This answer assumes you are asking about the difference between the following two variants:
ORDER BY totalRevenue
ORDER BY totalRevenue LIMIT 10
In many databases, if a suitable index existed on totalRevenue, the LIMIT query could stop scanning once it had found the top 10 records.
In the absence of any index, as you pointed out, both versions would have to do a full sort, and therefore should perform the same.
Also, there is a potentially major performance difference between the two if the table is large: in the LIMIT version, BigQuery only has to send across 10 records, while in the non-LIMIT version potentially much more data has to be sent.
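As a hedged illustration of the index case, for a database that does support conventional secondary indexes (the index and table names here are made up; BigQuery itself works differently, as described in the other answers):
CREATE INDEX idx_totalRevenue ON companyperformance (totalRevenue);
-- The top 10 can then be read straight off the index, with no full sort.
SELECT company_ID, totalRevenue
FROM companyperformance
ORDER BY totalRevenue
LIMIT 10;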

There is no performance gain. BigQuery still has to go through all the records in the table.
You can partition your data in order to cut the number of records that BigQuery has to read. That will improve performance. You can read more information here:
https://cloud.google.com/bigquery/docs/partitioned-tables
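As a hedged sketch only (the partition column report_date is an assumption, not something from the question), a date-partitioned copy of the table lets BigQuery read just the partitions a query touches:
-- report_date is assumed to be a DATE column in the source table.
CREATE TABLE `BigQuery.BQdataset.companyperformance_partitioned`
PARTITION BY report_date
AS SELECT * FROM `BigQuery.BQdataset.companyperformance`;
-- This query reads only the partitions in the date range, not the whole table.
SELECT company_ID, totalRevenue
FROM `BigQuery.BQdataset.companyperformance_partitioned`
WHERE report_date BETWEEN '2018-01-01' AND '2018-01-31';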

See the difference in the statistics shown in the BigQuery UI between the two queries below:
SELECT * FROM `bigquery-public-data.hacker_news.comments` LIMIT 1000
SELECT * FROM `bigquery-public-data.hacker_news.comments` LIMIT 10000
As you can see, BQ returns to the UI as soon as the limit criterion is reached, which results in better performance and less traffic on the network.

Related

Does SQL Server Table-Scan Time depend on the Query?

I observed that doing a full table scan takes a different amount of time depending on the query. I believed that under similar conditions (same set of selected columns, same column data types) a table scan should take roughly the same time. It seems that is not the case. I just want to understand the reason behind that.
I have used "CHECKPOINT" and "DBCC DROPCLEANBUFFERS" before querying to make sure there is no impact from the query cache.
Table:
10 Columns
10M rows; each column has a different density, ranging from 0.1 to 0.000001
No indexes
Queries:
Query A: returned 100 rows, time taken: ~900 ms
SELECT [COL00]
FROM [TEST].[dbo].[Test]
WHERE COL07 = 50000
Query B: returned 910,595 rows, time taken: ~15,000 ms
SELECT [COL00]
FROM [TEST].[dbo].[Test]
WHERE COL01 = 5
** Where column COL07 was randomly populated with integers ranging from 0 to 100000 and column COL01 was randomly populated with integers ranging from 0 to 10
Time Taken:
Query A: around 900 ms
Query B: around 18000 ms
What's the point I'm missing here?
Query A: (returned 100 rows, time taken: ~900 ms)
Query B: (returned 910,595 rows, time taken: ~15,000 ms)
I believe that what you are missing is that there are roughly 9,000 times more rows to fetch in the second query (910,595 vs 100). That alone could explain why it took about 20 times longer.
The two columns have different data densities:
Query A, COL07: 10,000,000 / 100,000 = 100 matching rows
Query B, COL01: 10,000,000 / 10 = 1,000,000 matching rows
The fact that both search parameters are in the middle of the data range doesn't necessarily impact the speed of the search. What matters is how many rows the engine has to read and return for the search predicate.
In order to see if this is indeed the case, I would try the following:
COL04: 10,000,000 / 1,000 = 10,000 matching rows. Filter on WHERE COL04 = 500.
COL08: 10,000,000 / 10,000 = 1,000 matching rows. Filter on WHERE COL08 = 5000.
Considering the times from the initial test, you would expect to see COL04 at ~7200 ms and COL08 at ~3600 ms.
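A hedged sketch of that test, reusing the cache-clearing steps and table name from the question (the expected row counts come from the densities worked out above):
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
SELECT [COL00] FROM [TEST].[dbo].[Test] WHERE COL04 = 500;   -- expect roughly 10,000 rows
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
SELECT [COL00] FROM [TEST].[dbo].[Test] WHERE COL08 = 5000;  -- expect roughly 1,000 rows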
There is an interesting article on this: SQL Server COUNT() Function Performance Comparison.
A full table scan (also known as a sequential scan) is a scan made on the database where each row of the table under scan is read in sequential (serial) order. (Reference)
In your case, the full table scan reads sequentially (in an ordered way), so it does not need to scan the whole table in order to advance to the next record, because COL07 is ordered. In Query B that is not the case: COL01 is randomly distributed, so a full table scan is needed.
Query A is an optimistic scan, whereas Query B is a pessimistic scan.

Get filtered row count using dm_db_partition_stats

I'm using paging in my app but I've noticed that paging has gone very slow and the line below is the culprit:
SELECT COUNT (*) FROM MyTable
On my table, which has only 9 million rows, it takes 43 seconds to return the row count. I read another article which states that returning the row count for 1.4 billion rows takes over 5 minutes. This obviously cannot be used with paging, as it is far too slow, and the only reason I need the row count is to calculate the number of available pages.
After a bit of research I found out that I get the row count pretty much instantly (and accurately) using the following:
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTable')
AND (index_id=0 or index_id=1)
But the above returns the count for the entire table, which is fine if no filters are applied. How do I handle this if I need to apply filters such as a date range and/or a status?
For example, what is the row count for MyTable when the DateTime field is between 2013-04-05 and 2013-04-06 and status='warning'?
Thanks.
UPDATE-1
In case I wasn't clear, I require the total number of rows matching my query so that I can determine the number of pages required when using the 'paging' feature. For example, if a page returns 20 records and the total number of records matching my query is 235, I know I'll need to display 12 buttons below my grid.
01 - (rows 1 to 20) - 20 rows displayed in grid.
02 - (rows 21 to 40) - 20 rows displayed in grid.
...
11 - (rows 201 to 220) - 20 rows displayed in grid.
12 - (rows 221 to 235) - 15 rows displayed in grid.
There will be additional logic added to handle a large number of pages, but that's a UI issue, so it is out of scope for this topic.
My problem with using "SELECT COUNT(*) FROM MyTable" is that it was taking 40+ seconds on 9 million records (though it isn't anymore, and I need to find out why!), but with that method I was able to add the same filters as my main query to determine the count. For example:
SELECT COUNT(*) FROM [MyTable]
WHERE [DateTime] BETWEEN '2018-04-05' AND '2018-04-06' AND
[Status] = 'Warning'
Once I determine the page count, I would then run the same query but with the fields instead of COUNT(*), plus CurrentPageNo and PageSize, in order to filter my results by page number using the row ids and navigate to a specific page if needed.
SELECT RowId, DateTime, Status, Message FROM [MyTable]
WHERE [DateTime] BETWEEN '2018-04-05' AND '2018-04-06' AND
[Status] = 'Warning' AND
RowId BETWEEN (CurrentPageNo * PageSize) AND ((CurrentPageNo + 1) * PageSize)
Now, if I use the other method mentioned to get the row count, i.e.
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTable')
AND (index_id=0 or index_id=1)
It returns the count instantly, but how do I filter it so that I can include the same filters as if I were using the SELECT COUNT(*) method, so I could end up with something like:
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTable') AND
(index_id=0 or index_id=1) AND
([DateTime] BETWEEN '2018-04-05' AND '2018-04-06') AND
([Status] = 'Warning')
The above clearly won't work, as I'm querying dm_db_partition_stats, but I would like to know if I can somehow perform a join or something similar that would give me the total number of rows instantly, filtered rather than for the entire table.
Thanks.
Have you ever asked for directions to alpha centauri? No? Well the answer is, you can't get there from here.
Adding indexes, re-orgs/re-builds, updating stats will only get you so far. You should consider changing your approach.
sp_spaceused will typically return the record count instantly. You may be able to use this; however, depending on what you are using the count for (you've not quite given us enough information), it might not be adequate.
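A minimal sketch of that suggestion; MyTable is the table from the question, and the quick count comes back in the rows column of the result set:
EXEC sp_spaceused N'MyTable';
-- Returns: name, rows, reserved, data, index_size, unused.
-- 'rows' is the near-instant (approximate) record count for the whole table.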
I am not sure if you are trying to use this count as a means to short circuit a larger operation or how you are using the count in your application. When you start to highlight 1.4 billion records and you're looking for a window in said set, it sounds like you might be a candidate for partitioned tables.
This allows you to assign several smaller tables, typically separated by date (years/months), that act as a single table. When you give a date range over 1.4+ billion records, SQL Server can meet performance expectations. This does depend on the SQL Server edition, but there is also view partitioning as well.
Kimberly Tripp has a blog and some videos out there, and Kendra Little also has some good content on how partitioned tables are used and how to set them up. This would be a design change; it is a bit complex and not something you would want to implement on a whim.
Here is a link to Kimberly's Blog: https://www.sqlskills.com/blogs/kimberly/sqlskills-sql101-partitioning/
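Purely as a hedged sketch of what such a design change could look like (all object names and boundary values here are made up for illustration), monthly partitioning on the DateTime column might be set up along these lines:
-- A partition function splitting rows by month, and a scheme mapping partitions to filegroups.
CREATE PARTITION FUNCTION pfMonthly (datetime)
    AS RANGE RIGHT FOR VALUES ('2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01');
CREATE PARTITION SCHEME psMonthly
    AS PARTITION pfMonthly ALL TO ([PRIMARY]);
-- A table created on the scheme; queries filtered on [DateTime] touch only the relevant partitions.
CREATE TABLE dbo.MyTablePartitioned (
    RowId      bigint        NOT NULL,
    [DateTime] datetime      NOT NULL,
    [Status]   varchar(20)   NOT NULL,
    [Message]  nvarchar(max) NULL
) ON psMonthly ([DateTime]);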
Dev banter:
Also, I hear you blaming SQL; are you using Entity Framework, by chance?

ORMLite improve the select performance

I use ORMLite with an H2 database on a Java server. I have a dataset of 450k rows with the following distribution:
From 0 to 300k: every 300th row has status 1
From 300k to 420k: every 50th row has status 1
From 420k to 450k: every 4th row has status 1
All other rows have status 2. So overall, about 10,900 rows have status 1 and the rest have status 2.
Please note that the distribution is approximate, so the SQL query cannot rely on the numbers above.
Then I need to select the last 8k rows that do not have status 2. I have tried the following:
First, order the rows by creation time and then query 8k rows with the limit function:
SELECT * FROM `table_name` WHERE `Status` <> 2 LIMIT 0,100000
queryBuilder.orderBy("time_saved", false);
queryBuilder.limit(8000L);
The problem with this solution is that it takes quite long to retrieve the elements: about 6 seconds, and most of that time is taken by the ordering (about 2 seconds for the rest of the processing and about 4 seconds for the ordering).
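For reference, a rough sketch of the SQL that the described approach should boil down to, assuming the Status and time_saved columns from the post:
SELECT * FROM `table_name`
WHERE `Status` <> 2
ORDER BY `time_saved` DESC
LIMIT 8000;
Without an index covering Status and time_saved, the database has to sort every matching row before the limit is applied, which would be consistent with the ~4 seconds spent on ordering.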
Second: query all elements, but then trim the resulting array. The performance of this solution is quite good for me.
So I was wondering whether there is another solution that would be better for this kind of problem. I know that I can set the offset for the QueryBuilder to improve performance, but I don't know the starting number for this offset.
Thank you in advance.
p.s. Unfortunately, I don't have much experience with ORMLite so any supporting links will be appreciated. Thank you.

postgres: get random entries from table - too slow

In my postgres database, I have the following relationships (simplified for the sake of this question):
Objects (currently has about 250,000 records)
-------
n_id
n_store_object_id (references store.n_id, 1-to-1 relationship, some objects don't have store records)
n_media_id (references media.n_id, 1-to-1 relationship, some objects don't have media records)
Store (currently has about 100,000 records)
-----
n_id
t_name,
t_description,
n_status,
t_tag
Media
-----
n_id
t_media_path
So far, so good. When I need to query the data, I run this (note the limit 2 at the end, as part of the requirement):
select
o.n_id,
s.t_name,
s.t_description,
me.t_media_path
from
objects o
join store s on (o.n_store_object_id = s.n_id and s.n_status > 0 and s.t_tag is not null)
join media me on o.n_media_id = me.n_id
limit
2
This works fine and gives me two entries back, as expected. The execution time on this is about 20 ms - just fine.
Now I need to get 2 random entries every time the query runs. I thought I'd add order by random(), like so:
select
o.n_id,
s.t_name,
s.t_description,
me.t_media_path
from
objects o
join store s on (o.n_store_object_id = s.n_id and s.n_status > 0 and s.t_tag is not null)
join media me on o.n_media_id = me.n_id
order by
random()
limit
2
While this gives the right results, the execution time is now about 2,500 ms (over 2 seconds). This is clearly not acceptable, as it's one of a number of queries to be run to get data for a page in a web app.
So, the question is: how can I get random entries, as above, but still keep the execution time within some reasonable amount of time (i.e. under 100 ms is acceptable for my purpose)?
Of course it needs to sort the whole thing according to the random criterion before getting the first rows. Maybe you can work around this by using random() in an OFFSET instead?
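A hedged sketch of that idea (the 100000 is a placeholder for the real, or an estimated, count of joinable rows). It returns two adjacent rows from an arbitrary ordering, so it is cheaper but noticeably less random than ORDER BY random():
SELECT o.n_id, s.t_name, s.t_description, me.t_media_path
FROM objects o
JOIN store s ON o.n_store_object_id = s.n_id AND s.n_status > 0 AND s.t_tag IS NOT NULL
JOIN media me ON o.n_media_id = me.n_id
OFFSET floor(random() * 100000)::int   -- placeholder row count
LIMIT 2;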
Here's some previous work done on the topic which may prove helpful:
http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/
I'm thinking you'll be better off selecting random objects first, then performing the join to those objects after they're selected. I.e., query once to select random objects, then query again to join just those objects that were selected.
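A rough sketch of that two-step idea, written here as a single statement with a CTE and the column names from the question. Note that if a randomly picked object has no matching store or media row, you can get fewer than 2 results back:
WITH random_objects AS (
    SELECT n_id, n_store_object_id, n_media_id
    FROM objects
    ORDER BY random()   -- still randomizes the objects table, but without dragging the joins through the sort
    LIMIT 2
)
SELECT o.n_id, s.t_name, s.t_description, me.t_media_path
FROM random_objects o
JOIN store s ON o.n_store_object_id = s.n_id AND s.n_status > 0 AND s.t_tag IS NOT NULL
JOIN media me ON o.n_media_id = me.n_id;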
It seems like your problem is this: You have a table with 250,000 rows and need two random rows. Thus, you have to generate 250,000 random numbers and then sort the rows by their numbers. Two seconds to do this seems pretty fast to me.
The only real way to speed up the selection is not have to come up with 250,000 random numbers, but instead lookup rows through an index.
I think you'd have to change the table schema to optimize for this case. How about something like:
1) Create a new column with a sequence starting at 1.
2) Every row will then have a number.
3) Create an index on: number % 1000
4) Query for rows where number % 1000 is equal to a random number between 0 and 999 (this should hit the index and load a random portion of your database).
5) You can probably then add on RANDOM() to your ORDER BY clause and it will then just sort that chunk of your database and be 1,000x faster.
6) Then select the first two of those rows.
If this still isn't random enough (since rows will always be paired having the same "hash"), you could probably do a union of two random rows, or have an OR clause in the query and generate two random keys.
Hopefully something along these lines could be very fast and decently random.
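A hedged sketch of that scheme against the tables from the question (the column and index names are made up, and the random bucket number is assumed to be chosen in the application):
-- Steps 1-3: number every row and index the bucket expression.
ALTER TABLE objects ADD COLUMN seq_no bigserial;
CREATE INDEX objects_bucket_idx ON objects ((seq_no % 1000));
-- Steps 4-6: pick a random bucket (0-999) client-side, shuffle only that slice, keep two rows.
SELECT o.n_id, s.t_name, s.t_description, me.t_media_path
FROM objects o
JOIN store s ON o.n_store_object_id = s.n_id AND s.n_status > 0 AND s.t_tag IS NOT NULL
JOIN media me ON o.n_media_id = me.n_id
WHERE o.seq_no % 1000 = 417   -- 417 is just an example bucket picked at random by the app
ORDER BY random()
LIMIT 2;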

Biased random in SQL?

I have some entries in my database, in my case videos with a rating, a popularity, and other factors. From all these factors I calculate a likelihood factor, or rather a boost factor.
So I essentially have the fields ID and BOOST. The boost is calculated so that it turns out as an integer that represents the percentage of how often this entry should be hit in comparison to the others.
ID Boost
1 1
2 2
3 7
So if I run my random function indefinitely, I should end up with X hits on ID 1, twice as many on ID 2, and 7 times as many on ID 3.
So every hit should be random, but with a probability of (boost / sum of boosts). The probability for ID 3 in this example should therefore be 0.7 (because the sum is 10; I chose those values for simplicity).
I thought about something like the following query:
SELECT id FROM table WHERE CEIL(RAND() * MAX(boost)) >= boost ORDER BY rand();
Unfortunately that doesn't work. Consider the following entries in the table:
ID Boost
1 1
2 2
With a 50/50 chance, it will have either only the 2nd element or both elements to choose from randomly.
So 0.5 of the hits go to the second element,
and 0.5 of the hits go to the (second and first) elements, chosen randomly between them, so 0.25 each.
So we end up with a 0.25/0.75 ratio, but it should be 0.33/0.67.
I need some modification or a new method to do this with good performance.
I also thought about storing the boost field cumulatively so I could just do a range query over (0, sum()), but then I would have to re-index everything that comes after an item whenever I change it, or develop some swapping algorithm or something... that's really not elegant.
Both inserting/updating and selecting should be fast!
Do you have any solutions to this problem?
The best use case to think of is probably advertisement delivery: "Please choose a random ad with a given probability." However, I need it for another purpose; this is just to give you a final picture of what it should do.
edit:
Thanks to Ken's answer, I thought about the following approach:
calculate a random value from 0-sum(distinct boost)
SET @randval = (select ceil(rand() * sum(DISTINCT boost)) from test);
select the boost factor from all distinct boost factors whose running total surpasses the random value
then we have in our 1st example 1 with a 0.1, 2 with a 0.2 and 7 with a 0.7 probability.
now select one random entry from all entries having this boost factor
PROBLEM: the count of entries having a given boost is always different. For example, if there is only one 1-boosted entry I get it in 1 of 10 calls, but if there are 1 million entries with boost 7, each of them is hardly ever returned...
So this doesn't work out :( I'm trying to refine it.
I have to somehow include the count of entries with each boost factor... but I am somewhat stuck on that...
You need to generate a random number per row and weight it.
In this case, RAND(CHECKSUM(NEWID())) gets around the "per query" evaluation of RAND. Then simply multiply it by boost and ORDER BY the result DESC. The SUM..OVER gives you the total boost
DECLARE @sample TABLE (id int, boost int)

INSERT @sample VALUES (1, 1), (2, 2), (3, 7)

SELECT
    RAND(CHECKSUM(NEWID())) * boost AS weighted,
    SUM(boost) OVER () AS boostcount,
    id
FROM
    @sample
GROUP BY
    id, boost
ORDER BY
    weighted DESC
If you have wildly different boost values (which I think you mentioned), I'd also consider using LOG (which is base e) to smooth the distribution.
Finally, ORDER BY NEWID() is a randomness that would take no account of boost. It's useful to seed RAND but not by itself.
This sample was put together on SQL Server 2008, BTW
I dare to suggest a straightforward solution with two queries, using a cumulative boost calculation.
First, select the sum of boosts and generate a random number between 0 and that sum:
select ceil(rand() * sum(boost)) from table;
This value should be stored as a variable, let's call it {random_number}
Then select the table rows, calculating the cumulative sum of boosts, and find the first row whose cumulative boost is greater than {random_number}:
SET @cumulative_boost = 0;

SELECT
    id,
    @cumulative_boost := (@cumulative_boost + boost) AS cumulative_boost
FROM
    table
HAVING   -- MySQL cannot reference the alias in WHERE, so HAVING is used here
    cumulative_boost >= {random_number}
ORDER BY id
LIMIT 1;
My problem was similar: every person had a calculated number of tickets in the final draw. If you had more tickets, you had a higher chance to win "the lottery".
Since I didn't trust any of the solutions I found on the web (rand() * multiplier, or the one with -log(rand())), I wanted to implement my own straightforward solution.
What I did, which in your case would look a little bit like this:
SELECT `values`.id, `values`.boost
FROM (SELECT id, boost FROM foo) AS `values`
INNER JOIN (
    SELECT id % 100 + 1 AS counter
    FROM user
    GROUP BY counter
) AS numbers ON numbers.counter <= `values`.boost
ORDER BY RAND()
Since I don't have to run it often I don't really care about future performance and at the moment it was fast for me.
Before I used this query I checked two things:
That the maximum boost value does not exceed the maximum number returned by the number query.
That the inner query returns ALL numbers between 1..100. It might not, depending on your table!
Since I have all distinct numbers between 1..100, joining on numbers.counter <= values.boost means that if a row has a boost of 2 it ends up duplicated in the final result, and if a row has a boost of 100 it ends up in the final set 100 times. In other words, if the sum of boosts is 4212, which it was in my case, you end up with 4212 rows in the final set.
Finally I let MySql sort it randomly.
Edit: For the inner query to work properly, make sure to use a large table, or make sure that the ids don't skip any numbers. Better yet, and probably a bit faster, you might even create a temporary table that simply has all numbers between 1..n. Then you could simply use INNER JOIN numbers ON numbers.id <= values.boost.