Vague count in sql select statements

Vague count in sql select statements - sql

I guess this has been asked in the site before but I can't find it.
I've seen in some sites that there is a vague count over the results of a search. For example, here in stackoverflow, when you search a question, it says +5000 results (sometimes), in gmail, when you search by keywords, it says "hundreds" and in google it says aprox X results. Is this just a way to show the user an easy-to-understand-a-huge-number? or this is actually a fast way to count results that can be used in a database [I'm learning Oracle at the moment 10g version]? something like "hey, if you get more than 1k results, just stop and tell me there are more than 1k".
Thanks
PS. I'm new to databases.

Usually this is just a nice way to display a number.
I don't believe there is a way to do what you are asking for in SQL - count does not have an option for counting up until some number.
I also would not assume this is coming from SQL in either gmail, or stackoverflow.
Most search engines will return a total number of matches to a search, and then let you page through results.
As for making an exact number more human readable, here is an example from Rails:
http://api.rubyonrails.org/classes/ActionView/Helpers/NumberHelper.html#method-i-number_to_human

With Oracle, you can always resort to analytical functions in order to calculate the exact number of rows about to be returned. This is an example of such a query:
SELECT inner.*, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... your own, sorted search query ...]
) inner
This will give you the total number of rows for your specific subquery. When you want to apply paging as well, you can further wrap these SQL parts as such:
SELECT outer.* FROM (
SELECT * FROM (
SELECT inner.*,ROWNUM as RNUM, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... your own, sorted search query ...]
) inner
)
WHERE ROWNUM < :max_row
) outer
WHERE outer.RNUM > :min_row
Replace min_row and max_row by meaningful values. But beware that calculating the exact number of rows can be expensive when you're not filtering using UNIQUE SCAN or relatively narrow RANGE SCAN operations on indexes. Read more about this here: Speed of paged queries in Oracle
As others have said, you can always have an absolute upper limit, such as 5000 to your query using a ROWNUM <= 5000 filter and then just indicate that there are more than 5000+ results. Note that Oracle can be very good at optimising queries when you apply ROWNUM filtering. Find some info on that subject here:
http://www.dba-oracle.com/t_sql_tuning_rownum_equals_one.htm

Vague count is a buffer which will be displayed promptly. If user wants to see more results then he can request more.
It's a performance facility, after displaying the results the sites like google keep searching for more results.

I don't know how fast this will run, but you can try:
SELECT NULL FROM your_tables WHERE your_condition AND ROWNUM <= 1001
If count of rows in result will equals to 1001 then total count of records will > 1000.

this question gives some pretty good information

When you do an SQL query you can set a
LIMIT 0, 100
for example and you will only get the first hundred answers. so you can then print to your viewer that there are 100+ answers to their request.
For google I couldn't say if they really know there is more than 27'000'000'000 answer to a request but I believe they really do know. There are some standard request that have results stored and where the update is done in the background.

Related

SQL: Reduce resultset to X rows?

I have the following MYSQL table:
measuredata:
- ID (bigint)
- timestamp
- entityid
- value (double)
The table contains >1 billion entries. I want to be able to visualize any time-window. The time window can be size of "one day" to "many years". There are measurement values round about every minute in DB.
So the number of entries for a time-window can be quite different. Say from few hundrets to several thousands or millions.
Those values are ment to be visualiuzed in a graphical chart-diagram on a webpage.
If the chart is - lets say - 800px wide, it does not make sense to get thousands of rows from database if time-window is quite big. I cannot show more than 800 values on this chart anyhow.
So, is there a way to reduce the resultset directly on DB-side?
I know "average" and "sum" etc. as aggregate function. But how can I i.e. aggregate 100k rows from a big time-window to lets say 800 final rows?
Just getting those 100k rows and let the chart do the magic is not the preferred option. Transfer-size is one reason why this is not an option.
Isn't there something on DB side I can use?
Something like avg() to shrink X rows to Y averaged rows?
Or a simple magic to just skip every #th row to shrink X to Y?
update:
Although I'm using MySQL right now, I'm not tied to this. If PostgreSQL f.i. provides a feature that could solve the issue, I'm willing to switch DB.
update2:
I maybe found a possible solution: https://mike.depalatis.net/blog/postgres-time-series-database.html
See section "Data aggregation".
The key is not to use a unixtimestamp but a date and "trunc" it, avergage the values and group by the trunc'ed date. Could work for me, but would require a rework of my table structure. Hmm... maybe there's more ... still researching ...
update3:
Inspired by update 2, I came up with this query:
SELECT (`timestamp` - (`timestamp` % 86400)) as aggtimestamp, `entity`, `value` FROM `measuredata` WHERE `entity` = 38 AND timestamp > UNIX_TIMESTAMP('2019-01-25') group by aggtimestamp
Works, but my DB/index/structue seems not really optimized for this: Query for last year took ~75sec (slow test machine) but finally got only a one value per day. This can be combined with avg(value), but this further increases query time... (~82sec). I will see if it's possible to further optimize this. But I now have an idea how "downsampling" data works, especially with aggregation in combination with "group by".

There is probably no efficient way to do this. But, if you want, you can break the rows into equal sized groups and then fetch, say, the first row from each group. Here is one method:
select md.*
from (select md.*,
row_number() over (partition by tile order by timestamp) as seqnum
from (select md.*, ntile(800) over (order by timestamp) as tile
from measuredata md
where . . . -- your filtering conditions here
) md
) md
where seqnum = 1;

Get filtered row count using dm_db_partition_stats

I'm using paging in my app but I've noticed that paging has gone very slow and the line below is the culprit:
SELECT COUNT (*) FROM MyTable
On my table, which only has 9 million rows, it takes 43 seconds to return the row count. I read in another article which states that to return the row count for 1.4 billion rows, it takes over 5 minutes. This obviously cannot be used with paging as it is far too slow and the only reason I need the row count is to calculate the number of available pages.
After a bit of research I found out that I get the row count pretty much instantly (and accurately) using the following:
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTable')
AND (index_id=0 or index_id=1)
But the above returns me the count for the entire table which is fine if no filters are applied but how do I handle this if I need to apply filters such as a date range and/or a status?
For example, what is the row count for MyTable when the DateTime field is between 2013-04-05 and 2013-04-06 and status='warning'?
Thanks.
UPDATE-1
In case I wasn't clear, I require the total number of rows available so that I can determine the number of pages required that will match my query when using 'paging' feature. For example, if a page returns 20 records and my total number of records matching my query is 235, I know I'll need to display 12 buttons below my grid.
01 - (row 1 to 20) - 20 rows displayed in grid.
02 - (row 21 to 40) - 20 rows displayed in grid.
...
11 - (row 200 to 220) - 20 rows displayed in grid.
12 - (row 221 to 235) - 15 rows displayed in grid.
There will be additional logic added to handle a large amount of pages but that's a UI issue, so this is out of scope for this topic.
My problem with using "Select count(*) from MyTable" is that it is taking 40+ seconds on 9 million records (thought it isn't anymore and I need to find out why!) but using this method I was able to add the same filter as my query to determine the query. For example,
SELECT COUNT(*) FROM [MyTable]
WHERE [DateTime] BETWEEN '2018-04-05' AND '2018-04-06' AND
[Status] = 'Warning'
Once I determine the page count, I would then run the same query but include the fields instead of count(*), the CurrentPageNo and PageSize in order to filter my results by page number using the row ids and navigate to a specific pages if needed.
SELECT RowId, DateTime, Status, Message FROM [MyTable]
WHERE [DateTime] BETWEEN '2018-04-05' AND '2018-04-06' AND
[Status] = 'Warning' AND
RowId BETWEEN (CurrentPageNo * PageSize) AND ((CurrentPageNo + 1) * PageSize)
Now, if I use the other mentioned method to get the row count i.e.
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTable')
AND (index_id=0 or index_id=1)
It returns the count instantly but how do I filter this so that I can include the same filters as if I was using the SELECT COUNT(*) method, so I could end up with something like:
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTable') AND
(index_id=0 or index_id=1) AND
([DateTime] BETWEEN '2018-04-05' AND '2018-04-06') AND
([Status] = 'Warning')
The above clearing won't work as I'm querying the dm_db_partition_stats but I would like to know if I can somehow perform a join or something similar to provide me with the total number of rows instantly but it needs to be filtered rather than apply to the entire table.
Thanks.

Have you ever asked for directions to alpha centauri? No? Well the answer is, you can't get there from here.
Adding indexes, re-orgs/re-builds, updating stats will only get you so far. You should consider changing your approach.
sp_spaceused will return the record count typically instantly; You may be able to use this, however depending (which you've not quite given us enough information) on what you are using the count for might not be adequate.
I am not sure if you are trying to use this count as a means to short circuit a larger operation or how you are using the count in your application. When you start to highlight 1.4 billion records and you're looking for a window in said set, it sounds like you might be a candidate for partitioned tables.
This allows you assign several smaller tables, typically separated by date, years / months, that act as a single table. When you give the date range on 1.4+ Billion records, SQL can meet performance expectations. This does depend on SQL Edition, but there is also view partitioning as well.
Kimberly Tripp has a blog and some videos out there, and Kendra Little also has some good content on how they are used and how to set them up. This would be a design change. It is a bit complex and not something you would want implement on a whim.
Here is a link to Kimberly's Blog: https://www.sqlskills.com/blogs/kimberly/sqlskills-sql101-partitioning/
Dev banter:
Also, I hear you blaming SQL, are you using entity framework by chance?

Order by in subquery behaving differently than native sql query?

So I am honestly a little puzzled by this!
I have a query that returns a set of transactions that contain both repair costs and an odometer reading at the time of repair on the master level. To get an accurate Cost per mile reading I need to do a subquery to get both the first meter reading between a starting date and an end date, and an ending meter.
(select top 1 wf2.ro_num
from wotrans wotr2
left join wofile wf2
on wotr2.rop_ro_num = wf2.ro_num
and wotr2.rop_fac = wf2.ro_fac
where wotr.rop_veh_num = wotr2.rop_veh_num
and wotr.rop_veh_facility = wotr2.rop_veh_facility
AND ((#sdate = '01/01/1900 00:00:00' and wotr2.rop_tran_date = 0)
OR ([dbo].[udf_RTA_ConvertDateInt](#sdate) <= wotr2.rop_tran_date
AND [dbo].[udf_RTA_ConvertDateInt](#edate) >= wotr2.rop_tran_date))
order by wotr2.rop_tran_date asc) as highMeter
The reason I have the tables aliased as xx2 is because those tables are also used in the main query, and I don't want these to interact with each other except to pull the correct vehicle number and facility.
Basically when I run the main query it returns a value that is not correct; it returns the one that is second(keep in mind that the first and second have the same date.) But when I take the subquery and just copy and paste it into it's own query and run it, it returns the correct value.
I do have a work around for this, but I am just curious as to why this happening. I have searched quite a bit and found not much(other than the fact that people don't like order bys in subqueries). Talking to one of my friends that also does quite a bit of SQL scripting, it looks to us as if the subquery is ordering differently than the subquery by itsself when you have multiple values that are the same for the order by(i.e. 10 dates of 08/05/2016).
Any ideas would be helpful!
Like I said I have a work around that works in this one case, but don't know yet if it will work on a larger dataset.
Let me know if you want more code.

BigQuery COUNT(DISTINCT value) vs COUNT(value)

I found a glitch/bug in bigquery.
We got a table based on Bank Statistic data under the
starschema.net:clouddb:bank.Banks_token
If i run the following query:
SELECT count(*) as totalrow,
count(DISTINCT BankId ) as bankidcnt
FROM bank.Banks_token;
And i get the following result:
Row totalrow bankidcnt
1 9513 9903
My problem is that if i have 9513row how could i get 9903row, which is 390row more than the rowcount in the table.

In BigQuery, COUNT DISTINCT is a statistical approximation for all results greater than 1000.
You can provide an optional second argument to give the threshold at which approximations are used. So if you use COUNT(DISTINCT BankId, 10000) in your example, you should see the exact result (since the actual amount of rows is less than 10000). Note, however, that using a larger threshold can be costly in terms of performance.
See the complete documentation here:
https://developers.google.com/bigquery/docs/query-reference#aggfunctions
UPDATE 2017:
With BigQuery #standardSQL COUNT(DISTINCT) is always exact. For approximate results use APPROX_COUNT_DISTINCT(). Why would anyone use approx results? See this article.

I've used EXACT_COUNT_DISTINCT() as a way to get the exact unique count. It's cleaner and more general than COUNT(DISTINCT value, n > numRows)
Found here: https://cloud.google.com/bigquery/query-reference#aggfunctions

changing sorting criteria after the first result

I am selecting from a database of news articles, and I'd prefer to do it all in one query if possible. In the results, I need a sorting criteria that applies ONLY to the first result.
In my case, the first result must have an image, but the others should be sorted without caring about their image status.
Is this something I can do with some sort of conditionals or user variables in a MySQL query?

Even if you manage to find a query that looks like one query, it is going to be logicaly two queries. Have a look at MySQL UNION if you really must make it one query (but it will still be 2 logical queries). You can union the image in the first with a limit of 1 and the rest in the second.

Something like this ensures an article with an image on the top.
SELECT
id,
title,
newsdate,
article
FROM
news
ORDER BY
CASE WHEN HasImage = 'Y' THEN 0 ELSE 1 END,
newsdate DESC
Unless you define "the first result" closer, of course. This query prefers articles with images, articles without will appear at the end.
Another variant (thanks to le dorfier, who deleted his answer for some reason) would be this:
SELECT
id,
title,
newsdate,
article
FROM
news
ORDER BY
CASE WHEN id = (
SELECT MIN(id) FROM news WHERE HasImage = 'Y'
) THEN 0 ELSE 1 END,
newsdate DESC
This sorts the earliest (assuming MIN(id) means "earliest") article with an image to the top.

I don't think it's possible, as it's effectively 2 queries (the first query the table has to get sorted for, and the second unordered), so you might as well use 2 queries with a LIMIT 1 in the first.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas