sqlite: get the average of the top X% for every item - sql

Is it possible to get the average of the top X% items in a group?
For example:
I have a table which has a item_id, timestamp and price column. The output should be grouped by item_id and timestamp and the 'price-column' should get averaged. For the averaging only the lowest X% prices within that group should be used.
I've found similar questions (How to select top x records for every group) but this won't work with sqlite.

Getting the top n records within each group requires counting. Assuming that there are no duplicates, the following query returns the number of records for an item:
select t.*,
(select count(*) from t t2 where t2.item_id = t.item_id
) as NumPrices
from t
This is called a correlated subquery. Now, let's extend the idea to include a rank and then calculate the average for the right group:
select item_id, avg(price)
from (select t.*,
(select count(*) from t t2 where t2.item_id = t.item_id
) as NumPrices,
(select count(*) from t t2 where t2.item_id = t.item_id and t2.price <= t.price
) as PriceRank
from t
) t
where (100.0*PriceRank / NumPrices) <= X
group by item_id
To improve performance, you will want an index on (item_id, price).

To get the count of records in the group with ID I and timestamp T, use this query:
SELECT COUNT(*)
FROM MyTable
WHERE item_id = I
AND timestamp = T
To get the limit, multiply with X, and use ROUND/CAST to convert to an integer:
SELECT CAST(ROUND(COUNT(*) * X / 100) AS INTEGER)
FROM MyTable
WHERE item_id = I
AND timestamp = T
To get all records in a specific group that are inside that limit, order the records in the group by price, and limit the returned count:
SELECT *
FROM MyTable
WHERE item_id = I
AND timestamp = T
ORDER BY price
LIMIT (SELECT CAST(ROUND(COUNT(*) * X / 100) AS INTEGER)
FROM MyTable
WHERE item_id = I
AND timestamp = T)
In theory, to get the group averages, add GROUP BY around that:
SELECT item_id,
timestamp,
(SELECT AVG(price)
FROM (SELECT price
FROM MyTable T2
WHERE T2.item_id = T1.item_id
AND T2.timestamp = T1.timestamp
ORDER BY price
LIMIT (SELECT CAST(ROUND(COUNT(*) * X / 100) AS INTEGER)
FROM MyTable T3
WHERE T3.item_id = T1.item_id
AND T3.timestamp = T1.timestamp)
)
) AS AvgPriceLowestX
FROM MyTable T1
GROUP BY item_id,
timestamp
However, it appears that SQLite does not allow access to correlation variables from inside the LIMIT clause, so this does not work in practice.
You would have to get the IDs of all groups (SELECT DISTINCT item_id, timestamp FROM MyTable) and execute the third query above for each group.
In any case, ensure that you have one index on the three columns item_id, timestamp, and price to get good performance.

Related

How to select the max of revenue for each user_id with row number in SQL?

In my dataset there are some user_id that each of them has several row number (from 1 to n) that each row has a specific revenue. I want to select the maximum of the revenue for each user_id with the row number belongs to this revenue. I want to have a query with result of the highlighted rows.
One method is a correlated subquery:
select t.*
from t
where t.revenue = (select max(t2.revenue) from t t2 where t2.user_id = t.user_id);
If there are ties for the maximum, this returns all the highest value rows.
select *,
case when revenue = max(revenue) over (partition by user_id) then 1 else 0 end as highlight
from T
select tt.*
from #tbl tt
join (select user_Id, max(revenue) as revenue from #tbl group by user_Id) tm on tt.user_Id = tm.user_Id and tt.revenue = tm.revenue

SQL - delete record where sum = 0

I have a table which has below values:
If Sum of values = 0 with same ID I want to delete them from the table. So result should look like this:
The code I have:
DELETE FROM tmp_table
WHERE ID in
(SELECT ID
FROM tmp_table WITH(NOLOCK)
GROUP BY ID
HAVING SUM(value) = 0)
Only deletes rows with ID = 2.
UPD: Including additional example:
Rows in yellow needs to be deleted
Your query is working correctly because the only group to total zero is id 2, the others have sub-groups which total zero (such as the first two with id 1) but the total for all those records is -3.
What you're wanting is a much more complex algorithm to do "bin packing" in order to remove the sub groups which sum to zero.
You can do what you want using window functions -- by enumerating the values for each id. Taking your approach using a subquery:
with t as (
select t.*,
row_number() over (partition by id, value order by id) as seqnum
from tmp_table t
)
delete from t
where exists (select 1
from t t2
where t2.id = t.id and t2.value = - t.value and t2.seqnum = t.seqnum
);
You can also do this with a second layer of window functions:
with t as (
select t.*,
row_number() over (partition by id, value order by id) as seqnum
from tmp_table t
),
tt as (
select t.*, count(*) over (partition by id, abs(value), seqnum) as cnt
from t
)
delete from tt
where cnt = 2;

sql: Select count(*) - nth record from each group

I'm grouping by tenant_id. I want to select the count() - 1000th record (ordered by _updated time) from each GROUPBY group, for the groups where count() is greater than 1000. As follows:
select t1.tenant_id,
(select temp._updated
from trace temp
where temp.tenant_id = t1.tenant_id
order by _updated limit 1 offset
count(*) - 1000
) as timekey
from fgc.trace as t1
group by tenant_id
having count(*) > 1000;
But this is not allowed as count(*) cannot be used inside the subquery.
So I tried the following, which still doesn't work as I don't have access to t1 since this is not a join.
select t1.tenant_id,
(select temp._updated
from trace temp
where temp.tenant_id = t1.tenant_id
order by _updated limit 1 offset
(select count(*)-1000
from trace t2
group by tenant_id
having t2.tenant_id = t1.tenant_id)
) as timekey
from fgc.trace as t1
group by tenant_id
having count(*) > 1000;
So how can I get the following?
tenant_id | timekey
+-----------+----------------------------------+
n7ia6ryc | 2019-07-23 23:09:49.951406+00:00
You seem to want ROW_NUMBER(). Cockroach supports windows functions, so:
SELECT updated
FROM (
SELECT
tenant_id,
updated,
ROW_NUMBER() OVER(PARTITION BY tenant_id ORDER BY updated DESC) rn
FROM trace
) x WHERE rn = 1001
For each tenant_id, this will return the timestamp of the 1001th less recent record. If a given tenant has less than 1000 records, it will not appear in the results.
select x.tenant_id
from (
select t.tenant_id,
row_number() over (partition by t.tenant_id order by t.timekey) as tenant_number
from fgc.trace as t
) x
where x.tenant_number > 1000
group by x.tenant_id
just the one timestamp would look like this:
select min(x.timekey) as min_timestamp
from (
select t.tenant_id, t.timekey,
row_number() over (partition by t.tenant_id order by t.timekey) as tenant_number
from fgc.trace as t
) x
where x.tenant_number > 1000
note that grouping does not matter here because each row can only be in one group and you are only looking at one row.

How to get max value and using group by clause

I have a query like this:
select transactions_id,
time_stamp,
clock
from times
group by transactions_id
having sum(distinct type) = 1
now, I would like to get max value depending on id.
I used below queries but not worked:
select max(id),
transactions_id,
time_stamp,
clock
from times
group by transactions_id
having sum(distinct type) = 1
or
select transactions_id,
time_stamp,
clock
from times
group by transactions_id
having sum(distinct type) = 1
and max(id)
for example:
I have three conditions:
type must be 1
group by transactions_id
max id
You can find aggregates in one query and join its result with the table to get the relevant rows.
select *
from times t1
join (
select transactions_id,
max(id) as id
from times
where type = 1
group by transactions_id
) t2 using (transactions_id, id);
If I understand correctly, you can use the ANSI standard row_number() function:
select t.*
from (select t.*,
row_number() over (partition by transactions_id order by id desc) as seqnum
from times t
) t
where seqnum = 1;
I am not sure what having sum(distinct type) = 1. That condition is not explained in the question.

PostgreSQL MAX and GROUP BY

I have a table with id, year and count.
I want to get the MAX(count) for each id and keep the year when it happens, so I make this query:
SELECT id, year, MAX(count)
FROM table
GROUP BY id;
Unfortunately, it gives me an error:
ERROR: column "table.year" must appear in the GROUP BY clause or be
used in an aggregate function
So I try:
SELECT id, year, MAX(count)
FROM table
GROUP BY id, year;
But then, it doesn't do MAX(count), it just shows the table as it is. I suppose because when grouping by year and id, it gets the max for the id of that specific year.
So, how can I write that query? I want to get the id´s MAX(count) and the year when that happens.
The shortest (and possibly fastest) query would be with DISTINCT ON, a PostgreSQL extension of the SQL standard DISTINCT clause:
SELECT DISTINCT ON (1)
id, count, year
FROM tbl
ORDER BY 1, 2 DESC, 3;
The numbers refer to ordinal positions in the SELECT list. You can spell out column names for clarity:
SELECT DISTINCT ON (id)
id, count, year
FROM tbl
ORDER BY id, count DESC, year;
The result is ordered by id etc. which may or may not be welcome. It's better than "undefined" in any case.
It also breaks ties (when multiple years share the same maximum count) in a well defined way: pick the earliest year. If you don't care, drop year from the ORDER BY. Or pick the latest year with year DESC.
For many rows per id, other query techniques are (much) faster. See:
Select first row in each GROUP BY group?
Optimize GROUP BY query to retrieve latest row per user
select *
from (
select id,
year,
thing,
max(thing) over (partition by id) as max_thing
from the_table
) t
where thing = max_thing
or:
select t1.id,
t1.year,
t1.thing
from the_table t1
where t1.thing = (select max(t2.thing)
from the_table t2
where t2.id = t1.id);
or
select t1.id,
t1.year,
t1.thing
from the_table t1
join (
select id, max(t2.thing) as max_thing
from the_table t2
group by id
) t on t.id = t1.id and t.max_thing = t1.thing
or (same as the previous with a different notation)
with max_stuff as (
select id, max(t2.thing) as max_thing
from the_table t2
group by id
)
select t1.id,
t1.year,
t1.thing
from the_table t1
join max_stuff t2
on t1.id = t2.id
and t1.thing = t2.max_thing