Access SQL Query Top 1 percent group by winsorizing

I need to find the 99th and 1st percentiles for a variable at each date. So far I have managed to do so, but only for the overall period. I would like to "loop" the following query (which does work) for each date, which is basic winsorizing, like a simple GROUP BY (but GROUP BY does not work with TOP PERCENT).
SELECT Date, ID, Value,
       IIf(Value > [upper_threshold], [upper_threshold],
           IIf(Value < [lower_threshold], [lower_threshold], Value)) AS winsor_Value
FROM MyTable,
     (SELECT [lower_threshold], [upper_threshold]
      FROM (SELECT MAX(Value) AS lower_threshold
            FROM (SELECT TOP 1 PERCENT Value FROM MyTable ORDER BY Value)) AS t1,
           (SELECT MIN(Value) AS upper_threshold
            FROM (SELECT TOP 1 PERCENT Value FROM MyTable ORDER BY Value DESC)));
My data has Date, ID, and Value columns, and I have 700,000 rows.
Thanks a lot.

I am not sure if the following works in MS Access, but it is worth a try. To get the value at the 99th percentile for each date:
select t.date,
       (select min(t2.value)
        from (select top 1 percent t2.*
              from t as t2
              where t2.date = t.date
              order by t2.value desc
             ) as t2
       ) as percentile_99
from (select distinct date
      from t
     ) as t;
I do not know if MS Access scoping rules allow you to correlate a subquery more than one level deep. If so the above approach should work for all the percentiles.
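If Access does allow it, a per-date version of the winsorizing query from the question might look like the sketch below. This is untested and reuses MyTable, Date, ID, and Value from the question; it computes both thresholds per date in a derived table and joins that back to the raw rows. If Access rejects the correlated reference inside the nested derived tables, this will not run.
SELECT m.Date, m.ID, m.Value,
       IIf(m.Value > d.upper_threshold, d.upper_threshold,
           IIf(m.Value < d.lower_threshold, d.lower_threshold, m.Value)) AS winsor_Value
FROM MyTable AS m
INNER JOIN
     (SELECT t.Date,
             (SELECT MIN(t2.Value)
              FROM (SELECT TOP 1 PERCENT t2.Value
                    FROM MyTable AS t2
                    WHERE t2.Date = t.Date
                    ORDER BY t2.Value DESC) AS t2) AS upper_threshold,
             (SELECT MAX(t3.Value)
              FROM (SELECT TOP 1 PERCENT t3.Value
                    FROM MyTable AS t3
                    WHERE t3.Date = t.Date
                    ORDER BY t3.Value) AS t3) AS lower_threshold
      FROM (SELECT DISTINCT Date FROM MyTable) AS t) AS d
     ON m.Date = d.Date;
The IIf() wrapper is the same winsorizing expression as in the original query, just fed per-date thresholds.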

Related

Adding an extra condition to a group-by query makes it slow

I have a query that selects a maximum value for a certain group, e.g.
select *
from mytable
where value in (select max(value) from mytable group by mygroupfield)
This is fast enough (~ 0.1 sec), but when I add an extra condition, e.g.
select *
from mytable
where value in (select max(value) from mytable group by mygroupfield)
AND mygroupfield='something'
the query becomes very slow (~ 4-5 seconds)
Both value and mygroupfield are indexed (non unique, non clustered, if it matters). This table contains ~ 40,000 records.
I know I can do this:
where value in (select max(value) from mytable where mygroupfield='something') (which is also fast), but due to our architecture, that is not an option right now.
Is there a way to speed up this query?
Try a correlated subquery like the example below.
SELECT a.*
FROM mytable AS a
WHERE value IN (
SELECT MAX(value)
FROM mytable AS b
WHERE b.mygroupfield=a.mygroupfield
GROUP by mygroupfield
)
AND mygroupfield='something';
This index may help as well:
CREATE INDEX idx ON dbo.mytable(mygroupfield) INCLUDE(value);
You can try using row_number(), though I'm not sure whether it will serve your purpose or not:
with cte as
(
select *,row_number() over(partition by mygroupfield order by value desc) as rn
from mytable
where mygroupfield='something'
)
select * from cte where rn=1
I would suggest a correlated subquery:
select t.*
from mytable t
where t.value = (select max(t2.value)
from mytable t2
where t2.mygroupfield = t.mygroupfield
) and
t.mygroupfield = 'something';
In particular, this can take advantage of an index on mytable(mygroupfield, value).
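For reference, that index could be created along these lines (the index name is just a placeholder; syntax as in SQL Server, matching the earlier CREATE INDEX suggestion):
-- composite index: seek on mygroupfield, then the max value can be read from the index
CREATE INDEX idx_mygroupfield_value ON mytable (mygroupfield, value);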

How to limit duplicated rows

I would like help regarding an SQL query.
Looking around the site, I found several code snippets to return duplicate rows.
Here is the one I went with:
select unumber, name, localid
from table1
where unumber
in (select unumber from table1 group by unumber having count (*) > 1 )
order by unumber
which works fine, however, in the table I have other columns as well, like timestamp etc.
As such, when I run the query I indeed get the duplicate rows, however, I get the duplicates several times due to different timestamps for example.
Is there any way to limit the results to 'unique' duplicate rows only?
Hope this makes sense!
Thank you in advance!
For what you describe, you can just use select distinct:
select distinct unumber, name, localid
from table1
where unumber in (select unumber from table1 group by unumber having count (*) > 1 )
order by unumber;
However, I would be more likely to write this using window functions:
select unumber, name, localid
from (select t1.*,
count(*) over (partition by unumber) as cnt,
row_number() over (partition by unumber, name, localid order by unumber) as seqnum
from table1 t1
) t1
where cnt > 1 and seqnum = 1;

Get total row count while paging

I have a search screen where the user has 5 filters to search on.
I constructed a dynamic query, based on these filter values, and page 10 results at a time.
This is working fine in SQL2012 using OFFSET and FETCH, but I'm using two queries to do this.
I want to show the 10 results and display the total number of rows found by the query (let's say 1000).
Currently I do this by running the query twice - once for the Total count, then again to page the 10 rows.
Is there a more efficient way to do this?
You don't have to run the query twice.
SELECT ..., total_count = COUNT(*) OVER()
FROM ...
ORDER BY ...
OFFSET 120 ROWS
FETCH NEXT 10 ROWS ONLY;
Based on the chat, it seems your problem is a little more complex - you are applying DISTINCT to the result in addition to paging. This can make it complex to determine exactly what the COUNT() should look like and where it should go. Here is one way (I just want to demonstrate this rather than try to incorporate the technique into your much more complex query from chat):
USE tempdb;
GO
CREATE TABLE dbo.PagingSample(id INT,name SYSNAME);
-- insert 20 rows, 10 x 2 duplicates
INSERT dbo.PagingSample SELECT TOP (10) [object_id], name FROM sys.all_columns;
INSERT dbo.PagingSample SELECT TOP (10) [object_id], name FROM sys.all_columns;
SELECT COUNT(*) FROM dbo.PagingSample; -- 20
SELECT COUNT(*) FROM (SELECT DISTINCT id, name FROM dbo.PagingSample) AS x; -- 10
SELECT DISTINCT id, name FROM dbo.PagingSample; -- 10 rows
SELECT DISTINCT id, name, COUNT(*) OVER() -- 20 (DISTINCT is not computed yet)
FROM dbo.PagingSample
ORDER BY id, name
OFFSET (0) ROWS FETCH NEXT (5) ROWS ONLY; -- 5 rows
-- this returns 5 rows but shows the pre- and post-distinct counts:
SELECT PostDistinctCount = COUNT(*) OVER(), -- 10
       PreDistinctCount,                    -- 20
       id, name
FROM
(
SELECT DISTINCT id, name, PreDistinctCount = COUNT(*) OVER()
FROM dbo.PagingSample
-- INNER JOIN ...
) AS x
ORDER BY id, name
OFFSET (0) ROWS FETCH NEXT (5) ROWS ONLY;
Clean up:
DROP TABLE dbo.PagingSample;
GO
My solution is similar to the answer from "rs.":
DECLARE @PageNumber AS INT, @RowspPage AS INT
SET @PageNumber = 2
SET @RowspPage = 5
SELECT COUNT(*) OVER() totalrow_count, *
FROM databasename
where columnname like '%abc%'
ORDER BY columnname
OFFSET ((@PageNumber - 1) * @RowspPage) ROWS
FETCH NEXT @RowspPage ROWS ONLY;
The return result will include totalrow_count as the first column name
Can you try something like this
SELECT TOP 10 * FROM
(
SELECT COUNT(*) OVER() TOTALCNT, T.*
FROM TABLE1 T
WHERE col1 = 'somefilter'
) v
or
SELECT * FROM
(
SELECT COUNT(*) OVER() TOTALCNT, T.*
FROM TABLE1 T
WHERE col1 = 'somefilter'
) v
ORDER BY COL1
OFFSET 0 ROWS FETCH FIRST 10 ROWS ONLY
Now you have total count in your totalcnt column and you can use this column to show total number of rows
In my testing with a complex join and ~6,000 records returned, it's much faster to do two separate queries. Faster, as in milliseconds total to get the total and separately bring back a subset of 100 records, vs 17 seconds to do the combined query. Anyone else see this kind of performance hit? Obviously, it could have something to do with the data structure but this is still a huge difference.
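For what it's worth, the two-query pattern being compared is roughly this (TABLE1, col1, and the filter are placeholders reused from the answers above):
-- query 1: total row count for the filter
SELECT COUNT(*) AS total_count
FROM TABLE1
WHERE col1 = 'somefilter';
-- query 2: one page of rows for the same filter
SELECT *
FROM TABLE1
WHERE col1 = 'somefilter'
ORDER BY col1
OFFSET 0 ROWS FETCH NEXT 100 ROWS ONLY;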
I hope I'm not too late to jump in on this question, but I ran across a very similar problem tonight. I had a paging class that was over-inflating the number of results returned, because the previous developer was dropping the DISTINCT and just doing a SELECT count(*) of the table joins. While this doesn't solve the two-query problem, I ended up using a nested query so that it looked like this:
Original Query
SELECT DISTINCT
field1, field2
FROM
table1 t1
left join table2 t2 on t2.id = t1.id
Over Inflated Results Query
SELECT
count(*)
FROM
table1 t1
left join table2 t2 on t2.id = t1.id
My Results Query Solution
SELECT
count(*)
FROM
(SELECT DISTINCT
field1, field2
FROM
table1 t1
left join table2 t2 on t2.id = t1.id) as tbl;

Compare SQL groups against each other

How can one filter a grouped resultset for only those groups that meet some criterion compared against the other groups? For example, only those groups that have the maximum number of constituent records?
I had thought that a subquery as follows should do the trick:
SELECT * FROM (
SELECT *, COUNT(*) AS Records
FROM T
GROUP BY X
) t HAVING Records = MAX(Records);
However the addition of the final HAVING clause results in an empty recordset... what's going on?
In MySQL (which I assume you are using, since you have posted SELECT *, COUNT(*) FROM T GROUP BY X, which would fail in every other RDBMS that I know of), you can use:
SELECT T.*
FROM T
INNER JOIN
( SELECT X, COUNT(*) AS Records
FROM T
GROUP BY X
ORDER BY Records DESC
LIMIT 1
) T2
ON T2.X = T.X
This has been tested in MySQL and removes the implicit grouping/aggregation.
If you can use windowed functions and either TOP/LIMIT WITH TIES or common table expressions, it becomes even shorter:
Windowed function + CTE: (MS SQL-Server & PostgreSQL Tested)
WITH CTE AS
( SELECT *, COUNT(*) OVER(PARTITION BY X) AS Records
FROM T
)
SELECT *
FROM CTE
WHERE Records = (SELECT MAX(Records) FROM CTE)
Windowed Function with TOP (MS SQL-Server Tested)
SELECT TOP 1 WITH TIES *
FROM ( SELECT *, COUNT(*) OVER(PARTITION BY X) [Records]
       FROM T
     ) AS t
ORDER BY Records DESC
Lastly, I have never used Oracle, so apologies for not adding a solution that works on Oracle...
EDIT
My solution for MySQL did not take ties into account, and my suggested fix for that steps on the toes of what you have said you want to avoid (duplicate subqueries), so I am not sure I can help after all. However, just in case it is preferable, here is a version that will work as required on your fiddle:
SELECT T.*
FROM T
INNER JOIN
( SELECT X
FROM T
GROUP BY X
HAVING COUNT(*) =
( SELECT COUNT(*) AS Records
FROM T
GROUP BY X
ORDER BY Records DESC
LIMIT 1
)
) T2
ON T2.X = T.X
For the exact question you give, one way to look at it is that you want the group of records where there is no other group that has more records. So if you say
SELECT taxid, COUNT(*) as howMany
FROM flats
GROUP BY taxid
you get all the counties and their counts.
Then you can treat that expression as a table by making it a subquery and giving it an alias. Below I assign two "copies" of the query the names X and Y, and ask for the taxids in X for which no row in Y has a higher count. If there are two with the same number, I'd get two or more rows. Different databases have proprietary syntax, notably TOP and LIMIT, that makes this kind of query simpler and easier to understand.
SELECT taxid FROM
(select taxid, count(*) as HowMany from flats
GROUP by taxid) as X
WHERE NOT EXISTS
(
SELECT * from
(
SELECT taxid, count(*) as HowMany FROM
flats
GROUP by taxid
) AS Y
WHERE Y.howmany > X.howmany
)
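Using the LIMIT shorthand mentioned above (MySQL syntax, untested), the second copy of the query collapses into a scalar subquery:
SELECT taxid
FROM (SELECT taxid, COUNT(*) AS HowMany
      FROM flats
      GROUP BY taxid) AS X
WHERE X.HowMany = (SELECT COUNT(*) AS HowMany
                   FROM flats
                   GROUP BY taxid
                   ORDER BY HowMany DESC
                   LIMIT 1);
Ties still come back as multiple rows, since the comparison is on the count rather than the taxid.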
Try this:
SELECT * FROM (
    SELECT t.*, MAX(Records) OVER () AS max_records FROM (
        SELECT *, COUNT(*) AS Records
        FROM T
        GROUP BY X
    ) t
) s
WHERE Records = max_records
I'm sorry that I can't test the validity of this query right now.

Oracle: getting maximum value of a group?

Given a table like this, what query will return the most recent calibration information for each monitor? In other words, I want to find the maximum date value for each of the monitors. Oracle-specific functionality is fine for my application.
monitor_id calibration_date value
---------- ---------------- -----
1          2011/10/22       15
1          2012/01/01       16
1          2012/01/20       17
2          2011/10/22       18
2          2012/01/02       19
The results for this example would look like this:
1 2012/01/20 17
2 2012/01/02 19
I'd tend to use analytic functions
SELECT monitor_id,
host_name,
calibration_date,
value
FROM (SELECT b.monitor_id,
b.host_name,
a.calibration_date,
a.value,
rank() over (partition by b.monitor_id order by a.calibration_date desc) rnk
FROM table_name a,
table_name2 b
WHERE a.some_key = b.some_key)
WHERE rnk = 1
You could also use correlated subqueries, though that will be less efficient:
SELECT monitor_id,
calibration_date,
value
FROM table_name a
WHERE a.calibration_date = (SELECT MAX(b.calibration_date)
FROM table_name b
WHERE a.monitor_id = b.monitor_id)
My personal preference is this:
SELECT DISTINCT
monitor_id
,MAX(calibration_date)
OVER (PARTITION BY monitor_id)
AS latest_calibration_date
,FIRST_VALUE(value)
OVER (PARTITION BY monitor_id
ORDER BY calibration_date DESC)
AS latest_value
FROM mytable;
A variation would be to use the FIRST_VALUE syntax for latest_calibration_date as well. Either way works.
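That variation would look something like this (same table and columns as above, untested):
SELECT DISTINCT
       monitor_id
      ,FIRST_VALUE(calibration_date)
           OVER (PARTITION BY monitor_id
                 ORDER BY calibration_date DESC)
           AS latest_calibration_date
      ,FIRST_VALUE(value)
           OVER (PARTITION BY monitor_id
                 ORDER BY calibration_date DESC)
           AS latest_value
FROM mytable;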
The window functions solution should be the most efficient and result in only one table or index scan. The one I am posting here, I think, wins some points for being intuitive and easy to understand. I tested on SQL Server and it performed second to window functions, resulting in two index scans.
SELECT T1.monitor_id, T1.calibration_date, T1.value
FROM someTable AS T1
WHERE NOT EXISTS
(
SELECT *
FROM someTable AS T2
WHERE T2.monitor_id = T1.monitor_id AND T2.value > T1.value
)
GROUP BY T1.monitor_id, T1.calibration_date, T1.value
And just for the heck of it, here's another one along the same lines, but less performant (63% cost vs 37%) than the other (again in SQL Server). This one uses a Left Outer Join in the execution plan, whereas the first one uses an Anti-Semi Merge Join:
SELECT T1.monitor_id, T1.calibration_date, T1.value
FROM someTable AS T1
LEFT JOIN someTable AS T2 ON T2.monitor_id = T1.monitor_id AND T2.value > T1.value
WHERE T2.monitor_id IS NULL
GROUP BY T1.monitor_id, T1.calibration_date, T1.value
select monitor_id, calibration_date, value
from table
where calibration_date in(
select max(calibration_date) as calibration_date
from table
group by monitor_id
)