Table with two columns (transaction_id, user_id), both indexed. Approx. 10M records in the table.
transaction_id is unique.
The number of transaction_ids per user_id varies from very few to thousands.
What I need is to find the max(transaction_id), subject to the rule that the top 25 transaction_ids (ordered descending) of a given user must be ignored.
E.g. a user_id with 21 transaction_ids will not be selected; a user_id with 47 transactions will return its 26th-highest transaction_id.
I have tried several approaches using offset, limit, etc., but they all seem to be too slow (very high cost).
You can use a window function, e.g.:
select distinct user_id,
       nth_value(transaction_id, 26) over (
           partition by user_id
           order by transaction_id desc
           rows between unbounded preceding and unbounded following
       ) as transaction_26
from your_table;
That should be plenty fast.
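To get the single overall maximum the question asks for, one option is to wrap that query and take the max; MAX skips the NULLs produced for users with fewer than 26 transactions, so those users are ignored automatically. A minimal sketch (your_table is a placeholder name):
select max(transaction_26) as max_transaction_id
from (
    select distinct user_id,
           nth_value(transaction_id, 26) over (
               partition by user_id
               order by transaction_id desc
               rows between unbounded preceding and unbounded following
           ) as transaction_26
    from your_table
) t;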
I have a user_certifications table, with these props:
int user_id
int cert_id
timestamp last_updated
So, a user can be listed multiple times with a variety of certs. I have a list of user_ids for which I need to get ONLY the most recently updated record. So for one user it would be:
SELECT user_id, last_updated
FROM user_certifications
WHERE user_id = x
ORDER BY last_updated DESC
LIMIT 1
How do I do this efficiently if I need ONLY the last dated entry for each of a number of users? E.g. a similar query, but with WHERE user_id IN (x,y,z), returning one entry per user with the latest date?
P.S. - I apologize for the title, I don't know how to word this.
Use distinct on:
SELECT DISTINCT ON (user_id) uc.*
FROM user_certifications uc
ORDER BY user_id, last_updated DESC
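To restrict this to a given list of users, as the question asks, the filter goes in as usual; a sketch with placeholder ids:
SELECT DISTINCT ON (user_id) uc.*
FROM user_certifications uc
WHERE user_id IN (1, 2, 3)  -- placeholder ids
ORDER BY user_id, last_updated DESC;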
I have a table with 4 columns
USER_ID: numeric
EVENT_DATE: date
VERSION: date
SCORE: decimal
I have a clustered index on (USER_ID, EVENT_DATE, VERSION). These three values together are unique.
I need to get the maximum EVENT_DATE for a set of USER_IDs (~1000 different ids) where the SCORE is larger than a specific value, considering only the entries with a specific VERSION.
SELECT M.*
FROM (VALUES
( 5237 ),
-- ... 1000 more values ...
( 27054 ) ) C (USER_ID)
CROSS APPLY
(SELECT TOP 1 C.USER_ID, M.EVENT_DATE, M.SCORE
FROM MY_HUGE_TABLE M
WHERE C.USER_ID = M.USER_ID
AND M.VERSION = 'xxxx-xx-xx'
AND M.SCORE > 2 -- commenting this filter out makes the query ~10x faster
ORDER BY M.EVENT_DATE DESC) M
Once I execute the query, the runtime is poor, due (I suppose) to a missing index on the SCORE column.
If I remove the filter "M.SCORE > 2", I get my results about ten times faster; nevertheless, the latest scores may then be less than 2.
Could anyone give me a hint on how to set up an index that would improve my query's performance?
Thank you very much in advance.
For your query, the optimal index would be on (USER_ID, VERSION, EVENT_DATE DESC, SCORE).
Unfortunately, your clustered index doesn't match: only its first and third columns (USER_ID and VERSION) appear in that list, and index columns have to match in order. So only USER_ID can help, and that probably doesn't do much to filter the data.
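A minimal sketch of such an index (the index name is a placeholder):
CREATE NONCLUSTERED INDEX IX_MyHugeTable_User_Version_EventDate
    ON MY_HUGE_TABLE (USER_ID, VERSION, EVENT_DATE DESC, SCORE);
With USER_ID and VERSION as leading equality columns and EVENT_DATE sorted descending, each TOP 1 lookup can walk the newest rows first and stop at the first one whose SCORE passes the filter.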
I have 32 years of data that I want to put into a partitioned table. However, BigQuery says that I'm going over the limit (4000 partitions).
For a query like:
CREATE TABLE `deleting.day_partition`
PARTITION BY FlightDate
AS
SELECT *
FROM `flights.original`
I'm getting an error like:
Too many partitions produced by query, allowed 2000, query produces at least 11384 partitions
How can I get over this limit?
Instead of partitioning by day, you could partition by week/month/year.
In my case each year contains around 3 GB of data, so I'll get the most benefit from clustering if I partition by year.
For this, I'll create a year date column, and partition by it:
CREATE TABLE `fh-bigquery.flights.ontime_201903`
PARTITION BY FlightDate_year
CLUSTER BY Origin, Dest
AS
SELECT *, DATE_TRUNC(FlightDate, YEAR) FlightDate_year
FROM `fh-bigquery.flights.raw_load_fixed`
Note that I created the extra column DATE_TRUNC(FlightDate, YEAR) AS FlightDate_year in the process.
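If yearly partitions are too coarse, the same pattern works per month and still stays well under the limit (32 years is roughly 384 monthly partitions); a sketch with placeholder table names:
CREATE TABLE `project.dataset.flights_by_month`
PARTITION BY FlightDate_month
CLUSTER BY Origin, Dest
AS
SELECT *, DATE_TRUNC(FlightDate, MONTH) AS FlightDate_month
FROM `project.dataset.flights_raw`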
Since the table is clustered, I'll get the benefits of partitioning even if I don't use the partitioning column (year) as a filter:
SELECT *
FROM `fh-bigquery.flights.ontime_201903`
WHERE FlightDate BETWEEN '2008-01-01' AND '2008-01-10'
Predicted cost: 83.4 GB
Actual cost: 3.2 GB
As an alternative example, I created a NOAA GSOD summary table clustered by station name, and instead of partitioning by day, I didn't partition it at all.
Let's say I want to find the hottest days since 1980 for all stations with a name like 'SAN FRANC%':
SELECT name, state, ARRAY_AGG(STRUCT(date,temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(date) active_until
FROM `fh-bigquery.weather_gsod.all`
WHERE name LIKE 'SAN FRANC%'
AND date > '1980-01-01'
GROUP BY 1,2
ORDER BY active_until DESC
Note that I got the results after processing only 55.2MB of data.
The equivalent query on the source tables (without clustering) processes 4GB instead:
# query on non-clustered tables - too much data compared to the other one
SELECT name, state, ARRAY_AGG(STRUCT(CONCAT(a.year,a.mo,a.da),temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(CONCAT(a.year,a.mo,a.da)) active_until
FROM `bigquery-public-data.noaa_gsod.gsod*` a
JOIN `bigquery-public-data.noaa_gsod.stations` b
ON a.wban=b.wban AND a.stn=b.usaf
WHERE name LIKE 'SAN FRANC%'
AND _table_suffix >= '1980'
GROUP BY 1,2
ORDER BY active_until DESC
I also added a geo clustered table, to search by location instead of station name. See details here: https://stackoverflow.com/a/34804655/132438
I'm trying to extract statistics from a SQLite db. I am basically collecting all the distinct values, and the counts of distinct values, for each column, with the rows grouped by a group_id column (many skus belong to each group_id).
For clarity assume a table like this:
sku, group_id, column_a, column_b
The query (abridged) looks like this:
SELECT
group_id,
count(sku),
group_concat(distinct column_a),
count(distinct column_a),
group_concat(distinct column_b),
count(distinct column_b)
-- ... the unabridged query repeats this pair of aggregates for the remaining columns ...
FROM
my_table
GROUP BY
group_id
ORDER BY
count(sku) DESC;
My table has 8 million rows and 20 columns.
Currently the unabridged version of this query never completes (it has been running for at least 20 minutes).
I've added indexes on group_id and sku columns.
EDIT
my indexes are like this:
CREATE INDEX index_group_id ON products(group_id);
CREATE INDEX index_sku ON products(sku);
CREATE INDEX index_group_id_sku ON products(group_id,sku);
Query plan
0|0|0|SCAN TABLE products USING INDEX index_group_id_sku
Number of group_ids
sqlite> select count(distinct group_id) from products;
6426446
Number of skus (primary keys - row count)
sqlite> select count(sku) from products;
8395475
I have a table called Vehicle_Location containing the columns (and more):
ID NUMBER(10)
SEQUENCE_NUMBER NUMBER(10)
TIME DATE
and I'm trying to get the min/max/avg number of records per day per id.
So far, I have
select id, to_char(time), count(*) as c
from vehicle_location
group by id, to_char(time) having id = 16
which gives me:
ID TO_CHAR(TIME) COUNT(*)
---------------------- ------------- ----------------------
16 11-05-31 159
16 11-05-23 127
16 11-06-03 56
So I'd like to get the min/max/avg of the count(*) column. I am using Oracle as my RDBMS.
I don't have an Oracle instance to test on, but you should be able to just wrap the aggregates around your SELECT as a subquery/derived table/inline view.
So it would be (UNTESTED!!):
SELECT
AVG(s.c)
, MIN(s.c)
, MAX(s.c)
, s.ID
FROM
--Note this is just your query
(select id, to_char(time), count(*) as c from vehicle_location group by id, to_char(time) having id = 16) s
GROUP BY s.ID
Here's some reading on it:
http://www.devshed.com/c/a/Oracle/Inserting-SubQueries-in-SELECT-Statements-in-Oracle/3/
EDIT: Though normally it is a bad idea to select both the MIN and MAX in a single query.
EDIT2: The min/max issue is related to how some RDBMSs (including Oracle) handle aggregations on indexed columns. It may not affect this particular query, but the premise is that it's easy to use an index to find either the MIN or the MAX, yet hard to find both at the same time, so the index may not be used effectively.
Here's some reading on it:
http://momendba.blogspot.com/2008/07/min-and-max-functions-in-single-query.html
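If that MIN/MAX concern does turn out to matter here, one workaround (untested, just a sketch of the idea) is to compute each aggregate in its own scalar subquery over a factored subquery:
WITH daily_counts AS (
  SELECT COUNT(*) AS c
  FROM vehicle_location
  WHERE id = 16
  GROUP BY id, to_char(time)
)
SELECT (SELECT MIN(c) FROM daily_counts) AS min_c,
       (SELECT MAX(c) FROM daily_counts) AS max_c,
       (SELECT AVG(c) FROM daily_counts) AS avg_c
FROM dual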