Get n grouped categories and sum others into one - sql

I have a table with the following structure:
Contents (
id
name
desc
tdate
categoryid
...
)
I need to compute some statistics on the data in this table. For example, I want the number of rows per category, grouped by category, together with the id of that category. I also want to limit the result to n rows in descending order, and if more categories exist, lump them together as "Others". So far I have come up with 2 queries to the database:
Select n rows in descending order:
SELECT COALESCE(ca.NAME, 'Unknown') AS label
,ca.id AS catid
,COUNT(c.id) AS data
FROM contents c
LEFT OUTER JOIN category ca ON ca.id = c.categoryid
GROUP BY label
,catid
ORDER BY data DESC LIMIT 7
Select other rows as one:
SELECT 'Others' AS label
,COUNT(c.id) AS data
FROM contents c
LEFT OUTER JOIN category ca ON ca.id = c.categoryid
WHERE c.categoryid NOT IN ($INCONDITION)
But when I have no category groups left in db table I still get an "Others" record. Is it possible to make it in one query and make the "Others" record optional?

The specific difficulty here: Queries with one or more aggregate functions in the SELECT list and no GROUP BY clause produce exactly one row, even if no row is found in the underlying table.
There is nothing you can do in the WHERE clause to suppress that row. You have to exclude such a row after the fact, i.e. in the HAVING clause, or in an outer query.
Per documentation:
If a query contains aggregate function calls, but no GROUP BY clause,
grouping still occurs: the result is a single group row (or perhaps no
rows at all, if the single row is then eliminated by HAVING). The same
is true if it contains a HAVING clause, even without any aggregate
function calls or GROUP BY clause.
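The behavior is easy to reproduce. The following is a minimal sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL; the table contents are made up for illustration. It shows the single row produced by an aggregate query without GROUP BY, and the row being removed by a filter in an outer query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contents (id INTEGER PRIMARY KEY, categoryid INTEGER)")
# Hypothetical sample data: every row belongs to category 1.
conn.executemany("INSERT INTO contents (categoryid) VALUES (?)", [(1,), (1,)])

# No row passes the WHERE clause, yet the aggregate query returns one row.
plain = conn.execute(
    "SELECT 'Others' AS label, count(*) AS data "
    "FROM contents WHERE categoryid NOT IN (1)"
).fetchall()

# Filtering in an outer query removes the empty 'Others' row after the fact.
filtered = conn.execute(
    "SELECT label, data FROM ("
    "  SELECT 'Others' AS label, count(*) AS data"
    "  FROM contents WHERE categoryid NOT IN (1)"
    ") AS sub WHERE data > 0"
).fetchall()

print(plain)     # [('Others', 0)]
print(filtered)  # []
```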
It should be noted that adding a GROUP BY clause with only a constant expression (which is otherwise completely pointless!) works, too. See example below. But I'd rather not use that trick, even if it's short, cheap and simple, because it's hardly obvious what it does.
The following query only needs a single table scan and returns the top 7 categories ordered by count. If (and only if) there are more categories, the rest is summarized into 'Others':
WITH cte AS (
SELECT categoryid, count(*) AS data
, row_number() OVER (ORDER BY count(*) DESC, categoryid) AS rn
FROM contents
GROUP BY 1
)
( -- parentheses required again
SELECT categoryid, COALESCE(ca.name, 'Unknown') AS label, data
FROM cte
LEFT JOIN category ca ON ca.id = cte.categoryid
WHERE rn <= 7
ORDER BY rn
)
UNION ALL
SELECT NULL, 'Others', sum(data)
FROM cte
WHERE rn > 7 -- only take the rest
HAVING count(*) > 0; -- only if there actually is a rest
-- or: HAVING sum(data) > 0
You need to break ties if multiple categories can have the same count across the 7th / 8th rank. In my example, categories with the smaller categoryid win such a race.
Parentheses are required to attach a LIMIT or ORDER BY clause to an individual leg of a UNION query.
You only need to join to table category for the top 7 categories. And it's generally cheaper to aggregate first and join later in this scenario. So don't join in the base query in the CTE (common table expression) named cte, only join in the first SELECT of the UNION query.
Not sure why you need the COALESCE. If you have a foreign key in place from contents.categoryid to category.id and both contents.categoryid and category.name are defined NOT NULL (like they probably should be), then you don't need it.
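To try the idea without a PostgreSQL server, here is a hedged sketch on SQLite through Python's sqlite3 module. It is not the query above verbatim: it is restructured into syntax SQLite accepts (no parenthesized UNION leg, an outer WHERE instead of HAVING, and an added position column `pos` to sort the combined result). The sample data and the function name `top_n_with_others` are invented for illustration, and the window function needs SQLite 3.25 or later.

```python
import sqlite3

TOP_N_SQL = """
WITH cte AS (
    SELECT categoryid, data,
           row_number() OVER (ORDER BY data DESC, categoryid) AS rn
    FROM  (SELECT categoryid, count(*) AS data
           FROM contents GROUP BY categoryid) AS agg
)
SELECT rn AS pos, cte.categoryid, COALESCE(ca.name, 'Unknown') AS label, data
FROM   cte
LEFT   JOIN category ca ON ca.id = cte.categoryid  -- join only the top rows
WHERE  rn <= :n
UNION ALL
SELECT :n + 1, NULL, 'Others', rest.data
FROM  (SELECT sum(data) AS data, count(*) AS k
       FROM cte WHERE rn > :n) AS rest
WHERE  rest.k > 0                                   -- only if there is a rest
ORDER  BY 1
"""

def top_n_with_others(conn, n):
    """Return the top n categories by count, plus one 'Others' row if needed."""
    return conn.execute(TOP_N_SQL, {"n": n}).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE contents (id INTEGER PRIMARY KEY, categoryid INTEGER);
    INSERT INTO category VALUES (1, 'a'), (2, 'b'), (3, 'c');
    -- category 4 has no row in table category, so it is labeled 'Unknown'
    INSERT INTO contents (categoryid) VALUES (1), (1), (1), (2), (2), (3), (4);
""")

print(top_n_with_others(conn, 2))
print(top_n_with_others(conn, 10))
```

With n = 2 the two biggest categories are followed by a single 'Others' row holding the remaining count; with n = 10 all four categories are listed and no 'Others' row appears.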
The odd GROUP BY true
This would work, too:
...
UNION ALL
SELECT NULL , 'Others', sum(data)
FROM cte
WHERE rn > 7
GROUP BY true;
And I even get slightly faster query plans. But it's a rather odd hack ...
SQL Fiddle demonstrating all.
Related answer with more explanation for the UNION ALL / LIMIT technique:
Sum results of a few queries and then find top 5 in SQL

The quick fix to make the 'Others' row conditional would be to add a simple HAVING clause to that query:
HAVING COUNT(c.id) > 0
(If there are no other rows in the contents table, then COUNT(c.id) is going to be zero.)
That only answers half the question, how to make the return of that row conditional.
The second half of the question is a little more involved.
To get the whole resultset in one query, you could do something like this.
(This is not tested yet, only desk checked. I'm not sure if PostgreSQL accepts a LIMIT clause in an inline view. If it doesn't, we'd need to implement a different mechanism to limit the number of rows returned.)
SELECT COALESCE(t.name,'Others') AS name
, t.catid AS catid
, COUNT(o.id) AS data
FROM contents o
LEFT
JOIN category oa
ON oa.id = o.categoryid
LEFT
JOIN ( SELECT COALESCE(ca.name,'Unknown') AS name
, ca.id AS catid
, COUNT(c.id) AS data
FROM contents c
LEFT
JOIN category ca
ON ca.id = c.categoryid
GROUP
BY COALESCE(ca.name,'Unknown')
, ca.id
ORDER
BY COUNT(c.id) DESC
, ca.id DESC
LIMIT 7
) t
ON ( t.catid = oa.id OR (t.catid IS NULL AND oa.id IS NULL))
GROUP
BY ( t.catid = oa.id OR (t.catid IS NULL AND oa.id IS NULL))
, t.catid
ORDER
BY COUNT(o.id) DESC
, ( t.catid = oa.id OR (t.catid IS NULL AND oa.id IS NULL)) DESC
, t.catid DESC
LIMIT 7
The inline view t basically gets the same result as the first query, a list of (up to) 7 id values from category table, or 6 id values from category table and a NULL.
The outer query basically does the same thing, joining contents with category, but also checks whether there's a matching row from t. Because t might return a NULL, we need a slightly more complicated comparison, where we want a NULL value to match a NULL value. (MySQL conveniently gives us a shorthand operator for this, the null-safe comparison operator <=>, but I don't think that's available in PostgreSQL, so we have to express it differently.)
a = b OR (a IS NULL AND b IS NULL)
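How that expression behaves compared to a plain `=` can be checked with a small sketch. It assumes SQLite (via Python's sqlite3 module) as a stand-in, and the table `pairs` is hypothetical. In PostgreSQL the same null-safe test can also be written as `a IS NOT DISTINCT FROM b`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pairs (a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO pairs VALUES (?, ?)",
    [(1, 1), (1, 2), (1, None), (None, None)],
)

# Column 1: plain equality. Column 2: the null-safe expression from the text.
rows = conn.execute(
    "SELECT a = b, (a = b OR (a IS NULL AND b IS NULL)) "
    "FROM pairs ORDER BY rowid"
).fetchall()

for pair, result in zip(["1,1", "1,2", "1,NULL", "NULL,NULL"], rows):
    print(pair, result)
```

Only the last pair differs: plain equality yields NULL for two NULLs, while the longer expression yields true. For 1 versus NULL both forms yield NULL, which a join condition treats as "no match".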
The next bit is getting a GROUP BY to work, we want to group by the 7 values returned by the inline view t, or, if there's not matching value from t, group the "other" rows together. We can get that to happen by using a boolean expression in the GROUP BY clause.
We're basically saying "group by 'if there was a matching row from t'" (true or false) and then group by the row from 't'. Get a count, and then order by the count descending.
This isn't tested, only desk checked.

You can approach this with nested aggregation. The inner aggregation calculates the counts along with a sequential number. You want to take everything whose number is 7 or less and then combine everything else into the others category:
SELECT (case when seqnum <= 7 then label else 'others' end) as label,
(case when seqnum <= 7 then catid end) as catid, sum(cnt) as cnt
FROM (SELECT ca.name AS label, ca.id AS catid, COUNT(c.id) AS cnt,
-- number the groups by descending count; no PARTITION BY here,
-- otherwise every group would get seqnum = 1
row_number() over (order by count(c.id) desc, ca.id) as seqnum
FROM contents c LEFT OUTER JOIN
category ca
ON ca.id = c.categoryid
GROUP BY ca.name, ca.id
) t
GROUP BY (case when seqnum <= 7 then label else 'others' end),
(case when seqnum <= 7 then catid end)
ORDER BY cnt DESC ;

Related

SQL Optimization - using results from subqueries [Clickhouse]

Query aims: I would like to group by several columns, keeping for each column only the top 5 representatives within the pairs selected so far. For example, I take the 5 most common items in the whole table; for each item, the 5 most common users; for each distinct item-user pair, the 5 most common values; and so on. The maximum number of distinct values per column is therefore 5, 5^2, 5^3, 5^4, ... (see LIMIT BY: https://clickhouse.com/docs/en/sql-reference/statements/select/limit-by/). Without limiting the groups, it is basically this simple query:
SELECT toStartOfDay("timestamp") AS "dt_timestamp",
item,
user,
value,
Count() as cnt
from base_table
GROUP BY dt_timestamp,
item,
user,
value
ORDER BY dt_timestamp asc,
cnt desc
I have a working query, but it is not as fast as I would like. The idea is to get 5 top items from base_table and then with inner joins over all columns get the result...
SELECT toStartOfDay("timestamp") AS "dt_timestamp",
item,
user,
value,
Count() as cnt
FROM (
SELECT item,
user,
value,
Count() as cnt
FROM(
SELECT item,
user,
Count() as cnt
FROM (
SELECT item,
Count() as cnt
FROM base_table
GROUP BY item
ORDER BY cnt desc
limit 5
) as q
INNER JOIN base_table on base_table.item = q.item
GROUP BY item,
user
ORDER BY cnt desc
limit 5 by item
) as qq
INNER JOIN base_table on base_table.item = qq.item
AND base_table.user = qq.user
GROUP BY item,
user,
value
ORDER BY cnt desc,
item desc
limit 5 by item, user
) as qqq
INNER JOIN base_table on base_table.item = qqq.item
AND base_table.user = qqq.user
AND base_table.value = qqq.value
GROUP BY dt_timestamp,
item,
user,
value
ORDER BY dt_timestamp asc,
cnt desc
NOTE: limit 5 by column has different functionality than limit, but it's not really relevant to this question.
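For readers unfamiliar with the difference: LIMIT n BY key keeps the first n rows for each distinct key value in the current sort order, whereas a plain LIMIT n keeps the first n rows overall. A rough Python model of that semantics follows; the function `limit_by` and the sample rows are hypothetical and only meant to illustrate the idea.

```python
from collections import defaultdict

def limit_by(rows, n, key):
    """Keep at most n rows per distinct key, preserving the input order."""
    seen = defaultdict(int)
    kept = []
    for row in rows:
        k = key(row)
        if seen[k] < n:
            seen[k] += 1
            kept.append(row)
    return kept

# (item, user, cnt), already sorted by item and descending cnt
rows = [
    ("apple", "u1", 9),
    ("apple", "u2", 7),
    ("apple", "u3", 4),
    ("pear", "u1", 5),
    ("pear", "u4", 2),
]

top2_per_item = limit_by(rows, 2, key=lambda r: r[0])  # like LIMIT 2 BY item
top2_overall = rows[:2]                                # like LIMIT 2
```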
ISSUES and SPACE for Optimizations
1) I would like to reuse the results of the subqueries (q, qq), as they contain the extracted items and item-user pairs. So in qq I would use the result of q, and in qqq I would use the result of q or qq.
2) Is it possible to do the inner join not on the whole base_table but on a progressively reduced one? That is, q would see the whole base_table, qq would see base_table minus the unwanted items, and qqq would see base_table minus the unwanted items and users.
I tried to do 1) and 2) with WITH AS, but it's not very efficient because the query is run again every time it is referenced.
If you have any idea how to optimize this query, it would be much appreciated.

SQL How to select customers with highest transaction amount by state

I am trying to write a SQL query that returns the name and purchase amount of the five customers in each state who have spent the most money.
Table schemas
customers
|_state
|_customer_id
|_customer_name
transactions
|_customer_id
|_transact_amt
Attempts look something like this
SELECT state, Sum(transact_amt) AS HighestSum
FROM (
SELECT name, transactions.transact_amt, SUM(transactions.transact_amt) AS HighestSum
FROM customers
INNER JOIN customers ON transactions.customer_id = customers.customer_id
GROUP BY state
) Q
GROUP BY transact_amt
ORDER BY HighestSum
I'm lost. Thank you.
Expected results are the names of customers with the top 5 highest transactions in each state.
ERROR: table name "customers" specified more than once
SQL state: 42712
First, your JOIN needs to be correct: join customers to transactions, not to itself. Second, you want to use window functions:
SELECT ct.*
FROM (SELECT c.customer_id, c.customer_name, c.state, SUM(t.transact_amt) AS total,
ROW_NUMBER() OVER (PARTITION BY c.state ORDER BY SUM(t.transact_amt) DESC) as seqnum
FROM customers c JOIN
transactions t
ON t.customer_id = c.customer_id
GROUP BY c.customer_id, c.customer_name, c.state
) ct
WHERE seqnum <= 5;
You seem to have several issues with SQL. I would start with understanding aggregation functions. You have a SUM() with the alias HighestSum. It is simply the total per customer.
You can get them using aggregation and then by using the RANK() window function. For example:
select
state,
rk,
customer_name
from (
select
*,
rank() over(partition by state order by total desc) as rk
from (
select
c.customer_id,
c.customer_name,
c.state,
sum(t.transact_amt) as total
from customers c
join transactions t on t.customer_id = c.customer_id
group by c.customer_id
) x
) y
where rk <= 5
order by state, rk
There are two valid answers already. Here's a third:
SELECT *
FROM (
SELECT c.state, c.customer_name, t.*
, row_number() OVER (PARTITION BY c.state ORDER BY t.transact_sum DESC NULLS LAST, customer_id) AS rn
FROM (
SELECT customer_id, sum(transact_amt) AS transact_sum
FROM transactions
GROUP BY customer_id
) t
JOIN customers c USING (customer_id)
) sub
WHERE rn < 6
ORDER BY state, rn;
Major points
When aggregating all or most rows of a big table, it's typically substantially faster to aggregate before the join. Assuming referential integrity (FK constraints), we won't be aggregating rows that would be filtered otherwise. This might change from nice-to-have to a pure necessity when joining to more aggregated tables. Related:
Why does the following join increase the query time significantly?
Two SQL LEFT JOINS produce incorrect result
Add additional ORDER BY item(s) in the window function to define which rows to pick from ties. In my example, it's simply customer_id. If you have no tiebreaker, results are arbitrary in case of a tie, which may be OK. But every other execution might return different results, which typically is a problem. Or you include all ties in the result. Then we are back to rank() instead of row_number(). See:
PostgreSQL equivalent for TOP n WITH TIES: LIMIT "with ties"?
As long as transact_amt can be NULL (which has not been ruled out), any sum may end up being NULL as well. With an unsuspecting ORDER BY t.transact_sum DESC, those customers come out on top, as NULL sorts first in descending order. Use DESC NULLS LAST to avoid this pitfall. (Or define the column transact_amt as NOT NULL.)
PostgreSQL sort by datetime asc, null first?
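The first two points (aggregate before the join, break ties explicitly) can be tried out with a small sketch. It assumes SQLite 3.25 or later via Python's sqlite3 module as a stand-in for PostgreSQL; the sample rows and the function name `top_customers_per_state` are invented. NULLS LAST is left out because the sample data contains no NULL amounts.

```python
import sqlite3

TOP_PER_STATE_SQL = """
SELECT state, customer_name, transact_sum, rn
FROM (
    SELECT c.state, c.customer_name, t.transact_sum,
           row_number() OVER (PARTITION BY c.state
                              ORDER BY t.transact_sum DESC, c.customer_id) AS rn
    FROM  (SELECT customer_id, sum(transact_amt) AS transact_sum
           FROM transactions
           GROUP BY customer_id) AS t            -- aggregate first ...
    JOIN   customers c ON c.customer_id = t.customer_id  -- ... join later
) AS sub
WHERE  rn <= ?
ORDER  BY state, rn
"""

def top_customers_per_state(conn, n):
    """Top n customers per state by total amount; ties go to the smaller id."""
    return conn.execute(TOP_PER_STATE_SQL, (n,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY,
                            customer_name TEXT, state TEXT);
    CREATE TABLE transactions (customer_id INTEGER, transact_amt INTEGER);
    INSERT INTO customers VALUES (1, 'Ann', 'NY'), (2, 'Bob', 'NY'),
                                 (3, 'Cy', 'NY'), (4, 'Di', 'CA');
    -- Ann and Bob tie at a total of 15
    INSERT INTO transactions VALUES (1, 10), (1, 5), (2, 15), (3, 7), (4, 1);
""")

print(top_customers_per_state(conn, 2))
```

Ann and Bob tie on the total, and the customer_id tiebreaker makes the outcome repeatable: Ann gets rn = 1 and Bob rn = 2 on every run.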

How to Rank Based on Multiple Columns

I'm trying to score people in Microsoft Access based on the count they have for a particular category.
There are 7 possible categories a person can have against them, and I want to assign each person a score from 1-7, with 1 being assigned to the highest scoring category and 7 to the lowest. They might not have an answer for every category, in which case that category can be ignored.
The aim would be to have an output result as shown in this image:
I've tried a few different things, including partition over and joins, but none have worked. To be honest I think I'm way off the mark with the queries I've been trying. I've tried to write the code in SQL from scratch, and used query builder.
Any help is really appreciated!
As an email can have duplicate counts, you will need two subqueries for this:
SELECT
Score.email,
Score.category,
Score.[Count],
(Select Count(*) From Score As T Where
T.email = Score.email And
T.[Count] >= Score.[Count])-
(Select Count(*) From Score As S Where
S.email = Score.email And
S.[Count] = Score.[Count] And
S.category > Score.category) AS Rank
FROM
Score
ORDER BY
Score.email,
Score.[Count] DESC,
Score.category;
For categories with equal Count values for the same email, the following will rank the records alphabetically descending by Category name (since this is what is shown in your example):
select t.email, t.category, t.count,
(
select count(*) from YourTable u
where t.email = u.email and
((t.count = u.count and t.category <= u.category) or t.count < u.count)
) as rank
from YourTable t
order by t.email, t.count desc, t.category desc
Change both references of YourTable to the name of your table.
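The correlated-subquery ranking can be checked outside Access. The sketch below runs an equivalent query on SQLite through Python's sqlite3 module; the table name `score`, the column names `cnt` and `rnk`, and the sample rows are assumptions chosen for illustration. For each row it counts the rows of the same email that sort at or before it.

```python
import sqlite3

RANK_SQL = """
SELECT t.email, t.category, t.cnt,
       (SELECT count(*)
        FROM   score u
        WHERE  u.email = t.email
        AND   (u.cnt > t.cnt
               OR (u.cnt = t.cnt AND u.category >= t.category))) AS rnk
FROM   score t
ORDER  BY t.email, t.cnt DESC, t.category DESC
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE score (email TEXT, category TEXT, cnt INTEGER)")
conn.executemany(
    "INSERT INTO score VALUES (?, ?, ?)",
    [
        ("a@x.org", "A", 5),
        ("a@x.org", "B", 5),  # same count as A: the later name ranks first
        ("a@x.org", "C", 2),
        ("b@x.org", "A", 1),
    ],
)

ranked = conn.execute(RANK_SQL).fetchall()
for row in ranked:
    print(row)
```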

Is order in a subquery guaranteed to be preserved?

I am wondering in particular about PostgreSQL. Given the following contrived example:
SELECT name FROM
(SELECT name FROM people WHERE age >= 18 ORDER BY age DESC) p
LIMIT 10
Are the names returned from the outer query guaranteed to be in the order they were for the inner query?
No, put the order by in the outer query:
SELECT name FROM
(SELECT name, age FROM people WHERE age >= 18) p
ORDER BY p.age DESC
LIMIT 10
The inner (sub) query returns a result-set. If you put the order by there, then the intermediate result-set passed from the inner (sub) query, to the outer query, is guaranteed to be ordered the way you designate, but without an order by in the outer query, the result-set generated by processing that inner query result-set, is not guaranteed to be sorted in any way.
For simple cases, @Charles' query is most efficient.
More generally, you can use the window function row_number() to carry any order you like to the main query, including:
order by columns not in the SELECT list of the subquery and thus not reproducible
arbitrary ordering of peers according to ORDER BY criteria. Postgres will reuse the same arbitrary order in the window function within the subquery. (But not truly random order from random() for instance!)
If you don't want to preserve arbitrary sort order of peers from the subquery, use rank() instead.
This may also be generally superior with complex queries or multiple query layers:
SELECT p.name
FROM (
SELECT name, row_number() OVER (ORDER BY <same order by criteria>) AS rn
FROM people
WHERE age >= 18
ORDER BY <any order by criteria>
) p
ORDER BY p.rn
LIMIT 10;
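A concrete instance of that pattern, as a sketch only: it uses SQLite 3.25 or later through Python's sqlite3 module instead of Postgres, with made-up rows in `people`. The subquery numbers the rows by age (a column that is not passed on to the outer query), and the outer query sorts by that number.

```python
import sqlite3

CARRY_ORDER_SQL = """
SELECT p.name
FROM  (
    SELECT name,
           row_number() OVER (ORDER BY age DESC, name) AS rn
    FROM   people
    WHERE  age >= 18
) AS p
ORDER  BY p.rn      -- the outer ORDER BY is what guarantees the order
LIMIT  2
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ann", 30), ("Bob", 17), ("Cy", 45), ("Di", 30)],
)

names = [row[0] for row in conn.execute(CARRY_ORDER_SQL)]
print(names)  # ['Cy', 'Ann']
```

Adding `name` as a second sort key in the window makes the order among the two 30-year-olds deterministic instead of arbitrary.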
They are not guaranteed to be in the same order, though when you run it you might see that the result generally follows that order.
You should place the ORDER BY in the main query (and select age in the subquery so the outer query can sort by it):
SELECT name FROM
(SELECT name, age FROM people WHERE age >= 18) p
ORDER BY p.age DESC LIMIT 10

SQL Server query. JOIN by latest date

I have 3 tables:
UnitInfo(UnitID, ...),
UnitList(UnitID, ...)
UnitMonitoring(RecordID,UnitID, EventDate, ...)
UnitList is a subset of UnitInfo (in terms of data and in terms of columns). UnitMonitoring receives records from time to time pertaining to UnitList (for every UnitID in UnitMonitoring we will have many records), filling EventDate. (UnitInfo has extended info.)
I can't figure how to build a query so that for every UnitID in UnitList I get UnitMonitoring record such that EventDate is the latest one.
So far I have
SELECT a.UnitID, a.Name, b.EventDate
FROM UnitInfo a INNER JOIN UnitMonitoring b
WHERE a.UnitID IN (SELECT UnitID FROM UnitList)
which yields all records from UnitMonitoring
SELECT ul.unitId, um.*
FROM UnitList ul
OUTER APPLY
(
SELECT TOP 1 *
FROM UnitMonitoring umi
WHERE umi.UnitID = ul.unitID
ORDER BY
EventDate DESC
) um
This will handle the duplicates correctly and will return all units (those with no records in UnitMonitoring will have NULL values in the corresponding fields).
I chose to go with a Common Table Expression (CTE) to apply a ranking function (ROW_NUMBER):
;WITH NumberedMonitoring as (
SELECT RecordID,UnitID, EventDate, ...
ROW_NUMBER() over (PARTITION BY UnitID ORDER BY EventDate desc) rn
FROM UnitMonitoring
)
SELECT * FROM
UnitList ul
inner join
NumberedMonitoring nm
on
ul.UnitID = nm.UnitID and nm.rn = 1
But there are many different solutions (the above could also be done using a subselect).
Common Table Expressions (quoting from above link):
A common table expression (CTE) can be thought of as a temporary result set
That is, it lets you write a bit of the query first. In this case, I'm using it because I want to number the rows (using the ROW_NUMBER function). I'm telling it to restart the numbering for each UnitID (PARTITION BY UnitID), and within each unit ID, I want the rows numbered based on the EventDate descending (ORDER BY EventDate desc). This means that the row that receives row number 1 (within each UnitID partition) is the most recent row.
In the following select, I'm able to treat my CTE (NumberedMonitoring) as if it's any other table. So I'm just joining it to the UnitList table, and ensuring as part of the join conditions that I'm only selecting row number 1 (rn = 1)
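The numbering step can be watched in action with a small sketch. It assumes SQLite 3.25 or later via Python's sqlite3 module rather than SQL Server, and the sample rows are invented; the table and column names follow the question.

```python
import sqlite3

LATEST_SQL = """
WITH NumberedMonitoring AS (
    SELECT RecordID, UnitID, EventDate,
           row_number() OVER (PARTITION BY UnitID
                              ORDER BY EventDate DESC) AS rn
    FROM   UnitMonitoring
)
SELECT ul.UnitID, nm.RecordID, nm.EventDate
FROM   UnitList ul
JOIN   NumberedMonitoring nm
       ON nm.UnitID = ul.UnitID AND nm.rn = 1   -- rn = 1 is the latest record
ORDER  BY ul.UnitID
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE UnitList (UnitID INTEGER PRIMARY KEY);
    CREATE TABLE UnitMonitoring (RecordID INTEGER PRIMARY KEY,
                                 UnitID INTEGER, EventDate TEXT);
    INSERT INTO UnitList VALUES (1), (2), (3);
    -- unit 3 has no monitoring records at all
    INSERT INTO UnitMonitoring VALUES
        (1, 1, '2020-01-01'),
        (2, 1, '2020-02-01'),
        (3, 2, '2020-01-15');
""")

latest = conn.execute(LATEST_SQL).fetchall()
print(latest)
```

Because this is an inner join, unit 3 (which has no monitoring records) drops out of the result; a LEFT JOIN would keep it with NULL values instead.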
Try:
Select M.*
From UnitList L
Join UnitMonitoring M
On M.UnitId = L.UnitId
Where M.EventDate =
(Select Max(EventDate) From UnitMonitoring
Where UnitId = M.UnitId)
If there are multiple records with the same UnitId and EventDate, you can still use this technique, but you need to filter on a unique field. Say the PK field in UnitMonitoring is named PkId:
Select M.*
From UnitList L
Join UnitMonitoring M
On M.UnitId = L.UnitId
Where M.PkId =
(Select Max(PkId) From UnitMonitoring iM
Where UnitId = M.UnitId
And EventDate =
(Select Max(EventDate) From UnitMonitoring
Where UnitId = M.UnitId))
SELECT a.UnitID, a.Name, MAX(b.EventDate)
FROM UnitInfo a
INNER JOIN UnitMonitoring b ON b.UnitID = a.UnitID
WHERE a.UnitID IN (SELECT UnitID FROM UnitList)
GROUP BY a.UnitID, a.Name