SQL Optimization - using results from subqueries [Clickhouse] - sql

Query aims: I would like to group columns where for each column is top 5 representatives for given pairs. For example I get 5 most common items in a whole table and for each item I get 5 most common users and for each distinct item-user pair I get 5 most common values and etc... This results in maximum distinct values of each column -> 5, 5^2, 5^3,5^4... (https://clickhouse.com/docs/en/sql-reference/statements/select/limit-by/). Without limiting the groups its basicaly this simple query
SELECT toStartOfDay("timestamp") AS "dt_timestamp",
item,
user,
value,
Count() as cnt
from base_table
GROUP BY dt_timestamp,
item,
user,
value
ORDER BY dt_timestamp asc,
cnt desc
I have a working query, but it is not as fast as I would like. The idea is to get 5 top items from base_table and then with inner joins over all columns get the result...
SELECT toStartOfDay("timestamp") AS "dt_timestamp",
item,
user,
value,
Count() as cnt
FROM (
SELECT item,
user,
value,
Count() as cnt
FROM(
SELECT item,
user,
Count() as cnt
FROM (
SELECT item,
Count() as cnt
FROM base_table
GROUP BY item
ORDER BY cnt desc
limit 5
) as q
INNER JOIN base_table on base_table.item = q.item
GROUP BY item,
user
ORDER BY cnt desc
limit 5 by item
) as qq
INNER JOIN base_table on base_table.item = qq.item
AND base_table.user = qq.user
GROUP BY item,
user,
value
ORDER BY cnt desc,
item desc
limit 5 by item, user
) as qqq
INNER JOIN base_table on base_table.item = qqq.item
AND base_table.user = qqq.user
AND base_table.value = qqq.value
GROUP BY dt_timestamp,
item,
user,
value
ORDER BY dt_timestamp asc,
cnt desc
NOTE: limit 5 by column has different functionality than limit, but its not really relevant in this question.
ISSUES and SPACE for Optimizations
I would like to reuse results of subqueries (q,qq) as they contain extracted items and items,users. So basically in qq I would use result of q and in qqq I would use result of q or qq
Is it possible to do inner join not on whole base_table but to somehow always pass reduced base_table? That means in q will be whole base_table but in qq will be base_table - unwanted items and in qqq will be base_table - unwanted items and users.
I tried to do 1) and 2) with WITH AS but its not very efficient because the query is run again everytime it is called.
If you have any idea how to optimize this query, it would be much appreciated

Related

COUNT of GROUP of two fields in SQL Query -- Postgres

I have a table in postgres with 2 fields: they are columns of ids of users who have looked at some data, under two conditions:
viewee viewer
------ ------
93024 66994
93156 93151
93163 113671
137340 93161
92992 93161
93161 93135
93156 93024
And I want to group them by both viewee and viewer field, and count the number of occurrences, and return that count
from high to low:
id count
------ -----
93161 3
93156 2
93024 2
137340 1
66994 1
92992 1
93135 1
93151 1
93163 1
I have been running two queries, one for each column, and then combining the results in my JavaScript application code. My query for one field is...
SELECT "viewer",
COUNT("viewer")
FROM "public"."friend_currentfriend"
GROUP BY "viewer"
ORDER BY count DESC;
How would I rewrite this query to handle both fields at once?
You can combine to columns from the table into a single one by using union all then use group by as below:
select id ,count(*) Count from (
select viewee id from vv
union all
select viewer id from vv) t
group by id
order by count(*) desc
Results:
This is a good place to use a lateral join:
select v.viewx, count(*)
from t cross join lateral
(values (t.viewee), (t.viewer)) v(viewx)
group by v.viewx
order by count(*) desc;
You can try this :
SELECT a.ID,
SUM(a.Total) as Total
FROM (SELECT t.Viewee AS ID,
COUNT(t.Viewee) AS Total
FROM #Temp t
GROUP BY t.Viewee
UNION
SELECT t.Viewer AS ID,
COUNT(t.Viewer) AS Total
FROM #Temp t
GROUP BY t.Viewer
) a
GROUP BY a.ID
ORDER BY SUM(a.Total) DESC

Selecting All Rows Matching Matching First 10 IDs in Oracle

This is a pagination related issue in Oracle. I have two tables, LIST and LIST_ITEM with a one-to-many relationship. I'm trying to implement pagination on the number of lists, where each LIST can contain a variable number of LIST_ITEM. Essentially, I need to grab all rows from LIST_ITEM matching the first N LIST ids. Any thoughts on implementing this in Oracle DB? Ideally without adding a separate query.
Previously I was using JPA EntityManager to implement pagination using setFirstResult() and setMaxResults(), but because the number of rows this query should return is variable, that will no longer work for me.
You can use an analytic function like dense_rank() in a subquery to rank the IDs, and then filter on the ranks you want, e.g.:
select id, col1, col2
from (
select id, col1, col2, dense_rank() over (order by id) as rnk
from list_items
)
where rnk <= 10
or for later pages
select id, col1, col2
from (
select id, col1, col2, dense_rank() over (order by id) as rnk
from list_items
)
where rnk > 10 and <= 20
If you have ID in the list table with no IDs, and you want to take those into account, then you can use a subquery against that table and join (which lets you include other list columns too):
select l.id, li.col1, li.col2
from (
select id, dense_rank() over (order by id) as rnk
from list
) l
left join list_items li on li.id = l.id
where l.rnk <= 10;
If you're on Oracle 12c or high you can use the [row limit clause] enhancements to simplify that:
select l.id, li.col1, li.col2
from (
select id
from list
order by id
fetch next 10 rows only
) l
left join list_items li on li.id = l.id;
or for the second page:
offset 10 rows fetch next 10 rows only

Get n grouped categories and sum others into one

I have a table with the following structure:
Contents (
id
name
desc
tdate
categoryid
...
)
I need to do some statistics with the data in this table. For example I want to get number of rows with the same category by grouping and id of that category. Also I want to limit them for n rows in descending order and if there are more categories available I want to mark them as "Others". So far I have come out with 2 queries to database:
Select n rows in descending order:
SELECT COALESCE(ca.NAME, 'Unknown') AS label
,ca.id AS catid
,COUNT(c.id) AS data
FROM contents c
LEFT OUTER JOIN category ca ON ca.id = c.categoryid
GROUP BY label
,catid
ORDER BY data DESC LIMIT 7
Select other rows as one:
SELECT 'Others' AS label
,COUNT(c.id) AS data
FROM contents c
LEFT OUTER JOIN category ca ON ca.id = c.categoryid
WHERE c.categoryid NOT IN ($INCONDITION)
But when I have no category groups left in db table I still get an "Others" record. Is it possible to make it in one query and make the "Others" record optional?
The specific difficulty here: Queries with one or more aggregate functions in the SELECT list and no GROUP BY clause produce exactly one row, even if no row is found in the underlying table.
There is nothing you can do in the WHERE clause to suppress that row. You have to exclude such a row after the fact, i.e. in the HAVING clause, or in an outer query.
Per documentation:
If a query contains aggregate function calls, but no GROUP BY clause,
grouping still occurs: the result is a single group row (or perhaps no
rows at all, if the single row is then eliminated by HAVING). The same
is true if it contains a HAVING clause, even without any aggregate
function calls or GROUP BY clause.
It should be noted that adding a GROUP BY clause with only a constant expression (which is otherwise completely pointless!) works, too. See example below. But I'd rather not use that trick, even if it's short, cheap and simple, because it's hardly obvious what it does.
The following query only needs a single table scan and returns the top 7 categories ordered by count. If (and only if) there are more categories, the rest is summarized into 'Others':
WITH cte AS (
SELECT categoryid, count(*) AS data
, row_number() OVER (ORDER BY count(*) DESC, categoryid) AS rn
FROM contents
GROUP BY 1
)
( -- parentheses required again
SELECT categoryid, COALESCE(ca.name, 'Unknown') AS label, data
FROM cte
LEFT JOIN category ca ON ca.id = cte.categoryid
WHERE rn <= 7
ORDER BY rn
)
UNION ALL
SELECT NULL, 'Others', sum(data)
FROM cte
WHERE rn > 7 -- only take the rest
HAVING count(*) > 0; -- only if there actually is a rest
-- or: HAVING sum(data) > 0
You need to break ties if multiple categories can have the same count across the 7th / 8th rank. In my example, categories with the smaller categoryid win such a race.
Parentheses are required to include a LIMIT or ORDER BY clause to an individual leg of a UNION query.
You only need to join to table category for the top 7 categories. And it's generally cheaper to aggregate first and join later in this scenario. So don't join in the the base query in the CTE (common table expression) named cte, only join in the first SELECT of the UNION query, that's cheaper.
Not sure why you need the COALESCE. If you have a foreign key in place from contents.categoryid to category.id and both contents.categoryid and category.name are defined NOT NULL (like they probably should be), then you don't need it.
The odd GROUP BY true
This would work, too:
...
UNION ALL
SELECT NULL , 'Others', sum(data)
FROM cte
WHERE rn > 7
GROUP BY true;
And I even get slightly faster query plans. But it's a rather odd hack ...
SQL Fiddle demonstrating all.
Related answer with more explanation for the UNION ALL / LIMIT technique:
Sum results of a few queries and then find top 5 in SQL
The quick fix, to make the 'Others' row conditional would be to add a simple HAVING clause to that query.
HAVING COUNT(c.id) > 0
(If there are no other rows in the contents table, then COUNT(c.id) is going to be zero.)
That only answers half the question, how to make the return of that row conditional.
The second half of the question is a little more involved.
To get the whole resultset in one query, you could do something like this
(this is not tested yet; desk checked only.. I'm not sure if postgresql accepts a LIMIT clause in an inline view... if it doesn't we'd need to implement a different mechanism to limit the number of rows returned.
SELECT IFNULL(t.name,'Others') AS name
, t.catid AS catid
, COUNT(o.id) AS data
FROM contents o
LEFT
JOIN category oa
ON oa.id = o.category_id
LEFT
JOIN ( SELECT COALESCE(ca.name,'Unknown') AS name
, ca.id AS catid
, COUNT(c.id) AS data
FROM contents c
LEFT
JOIN category ca
ON ca.id = c.categoryid
GROUP
BY COALESCE(ca.name,'Unknown')
, ca.id
ORDER
BY COUNT(c.id) DESC
, ca.id DESC
LIMIT 7
) t
ON ( t.catid = oa.id OR (t.catid IS NULL AND oa.id IS NULL))
GROUP
BY ( t.catid = oa.id OR (t.catid IS NULL AND oa.id IS NULL))
, t.catid
ORDER
BY COUNT(o.id) DESC
, ( t.catid = oa.id OR (t.catid IS NULL AND oa.id IS NULL)) DESC
, t.catid DESC
LIMIT 7
The inline view t basically gets the same result as the first query, a list of (up to) 7 id values from category table, or 6 id values from category table and a NULL.
The outer query basically does the same thing, joining content with category, but also doing a check if there's a matching row from t. Because t might be returning a NULL, we have a slightly more complicated comparison, where we want a NULL value to match a NULL value. (MySQL conveniently gives us shorthand operator for this, the null-safe comparison operator <=>, but I don't think that's available in postgresql, so we have to express differently.
a = b OR (a IS NULL AND b IS NULL)
The next bit is getting a GROUP BY to work, we want to group by the 7 values returned by the inline view t, or, if there's not matching value from t, group the "other" rows together. We can get that to happen by using a boolean expression in the GROUP BY clause.
We're basically saying "group by 'if there was a matching row from t'" (true or false) and then group by the row from 't'. Get a count, and then order by the count descending.
This isn't tested, only desk checked.
You can approach this with nested aggregation. The inner aggregation calculates the counts along with a sequential number. You want to take everything whose number is 7 or less and then combine everything else into the others category:
SELECT (case when seqnum <= 7 then label else 'others' end) as label,
(case when seqnum <= 7 then catid end) as catid, sum(cnt)
FROM (SELECT ca.name AS label, ca.id AS catid, COUNT(c.id) AS cnt,
row_number() over (partition by ca.name, catid order by count(c.id) desc) as seqnum
FROM contents c LEFT OUTER JOIN
category ca
ON ca.id = c.categoryid
GROUP BY label, catid
) t
GROUP BY (case when seqnum <= 7 then label else 'others' end),
(case when seqnum <= 7 then catid end)
ORDER BY cnt DESC ;

Selecting top results from SQL Count query, including table join - Oracle

I have this query currently, which selects the top "number of pickups" in descending order. I need to filter only the top 10 rows/highest numbers though. How can I do this?
I have tried adding 'WHERE ROWNUM <= 10' at the bottom, to no avail.
SELECT customer.company_name, COUNT (item.pickup_reference) as "Number of Pickups"
FROM customer
JOIN item ON (customer.reference_no=item.pickup_reference)
GROUP BY customer.company_name, item.pickup_reference
ORDER BY COUNT (customer.company_name) DESC;
Thanks for any help!
You need to subquery it for the rownum to work.
SELECT *
FROM
(
SELECT customer.company_name, COUNT (item.pickup_reference) as "Number of Pickups"
FROM customer
JOIN item ON (customer.reference_no=item.pickup_reference)
GROUP BY customer.company_name, item.pickup_reference
ORDER BY COUNT (customer.company_name) DESC
)
WHERE rownum <= 10
You could alternatively use ranking functions, but given the relative simplicity of this, I'm not sure whether I would.
The solution by using the rank is something like this :
select customer.company_name, COUNT (item.pickup_reference) from (
select distinct customer.company_name, COUNT (item.pickup_reference) ,
rank() over ( order by count(item.pickup_reference) desc) rnk
from customer
JOIN item ON (customer.reference_no=item.pickup_reference)
group by customer.company_name, item.pickup_reference
order by COUNT (customer.company_name) )
where rnk < 10
Using the 'rownum' to get the top result doesn't give the expected result, because it get the 10 first rows which are not ordred, and then order them (Please notify this on a comment on Andrew's response, I don't have the right to add the comment) .

ORDER BY in GROUP BY clause

I have a query
Select
(SELECT id FROM xyz M WHEREM.ID=G.ID AND ROWNUM=1 ) TOTAL_X,
count(*) from mno G where col1='M' group by col2
Now from subquery i have to fetch ramdom id for this I am doing
Select
(SELECT id FROM xyz M WHEREM.ID=G.ID AND ROWNUM=1 order by dbms_random.value ) TOTAL_X,
count(*) from mno G where col1='M' group by col2
But , oracle is showing an error
"Missing right parenthesis".
what is wrong with the query and how can i wrtie this query to get random Id.
Please help.
Even if what you did was legal, it would not give you the result you want. The ROWNUM filter would be applied before the ORDER BY, so you would just be sorting one row.
You need something like this. I am not sure if this exact code will work given the correlated subquery, but the basic point is that you need to have a subquery that contains the ORDER BY without the ROWNUM filter, then apply the ROWNUM filter one level up.
WITH subq AS (
SELECT id FROM xyz M WHERE M.ID=G.ID order by dbms_random.value
)
SELECT (SELECT id FROM subq WHERE rownum = 1) total_x,
count(*)
from mno g where col1='M' group by col2
You can't use order by in a subselect. It wouldn't matter too, because the row numbering is applied first, so you cannot influence it by using order by,
[edit]
Tried a solution. Don't got Oracle here, so you'll have to read between the typos.
In this case, I generate a single random value, get the count of records in xyz per mno.id, and generate a sequence for those records per mno.id.
Then, a level higher, I filter only those records whose index match with the random value.
This should give you a random id from xyz that matches the id in mno.
select
x.mnoId,
x.TOTAL_X
from
(SELECT
g.id as mnoId,
m.id as TOTAL_X,
count(*) over (partition by g.id) as MCOUNT,
dense_rank() over (partition by g.id) as MINDEX,
r.RandomValue
from
mno g
inner join xyz m on m.id = g.id
cross join (select dbms_random.value as RandomValue from dual) r
where
g.col1 = 'M'
) x
where
x.MINDEX = 1 + trunc(x.MCOUNT * x.RandomValue)
The only difference between your two lines are that you order_by in the one that fails, right?
It so happens that order_by doesn't fit inside a nested select.
You could do an order_by inside a where clause that contains a select, though.
Edit: #StevenV is right.
If you're trying to do what I suspect, this should work
Select A.Id, Count(*)
From MNO A
Join (Select ID From XYZ M Where M.ID=G.ID And Rownum=1 Order By Dbms_Random.Value ) B On (B.ID = A.ID)
GROUP BY A.ID