Counts of unique values in Postgres GROUP BY

I have a table with schema:
uid | day | type
In pandas, it looks like this:
import pandas as pd
d=pd.DataFrame(columns=['uid','day','type'])
d.loc[0]=[1,1,'C']
d.loc[1]=[1,1,'T']
d.loc[2]=[1,1,'C']
d.loc[3]=[2,1,'T']
d.loc[4]=[1,2,'T']
I want to:
GROUP BY uid and day.
Get the count of unique type values per group.
Return the top 3 type values per group.
In pandas, it's possible to get counts of unique values per group:
d.groupby(['uid','day']).type.value_counts()
The output looks like this (I would then filter it to get the top 3 per group):
uid  day
1    1    C    2
          T    1
     2    T    1
2    1    T    1
How would this query be done in Postgres?

I'm not sure I completely understand your question, but since I can't leave a comment I'll give it a try.
Let's say we have a table t containing this data:
uid | day | type
-----+-----+------
1 | 1 | C
1 | 1 | T
1 | 1 | C
2 | 1 | T
1 | 2 | T
Then this query will return what you want:
SELECT uid, day, type, count(type) AS type_count
FROM t
GROUP BY uid, day, type;
uid | day | type | type_count
-----+-----+------+------------
1 | 1 | C | 2
1 | 2 | T | 1
1 | 1 | T | 1
2 | 1 | T | 1
You can then add an ORDER BY type_count DESC with a LIMIT 3 to get a top 3. Note that this returns the top 3 rows overall; a top 3 per (uid, day) group needs a window function, as sketched below.
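A minimal sketch of that per-group variant, assuming the same table t:
SELECT uid, day, type, type_count
FROM (
  SELECT uid, day, type,
         count(type) AS type_count,
         -- rank each type within its (uid, day) group by frequency
         row_number() OVER (PARTITION BY uid, day
                            ORDER BY count(type) DESC) AS rn
  FROM t
  GROUP BY uid, day, type
) ranked
WHERE rn <= 3;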
I hope it's what you're looking for.

Related

Can I generate a map that shows a particular row was in a particular group in SQLite?

Say I have the following data:
+--------+-------+
| Group | Data |
+--------+-------+
| 1 | row 1 |
| 1 | row 2 |
| 1 | row 3 |
| 20 | row 1 |
| 20 | row 3 |
| 10 | row 1 |
| 10 | row A |
| 10 | row 2 |
| 10 | row 3 |
+--------+-------+
Is it possible to draw a map that shows which groups have which rows? Groups may not be contiguous, so they can be listed in a separate table, with the row index used as the string index instead. Something like this:
+-------+
| Group |
+-------+
| 1 |
| 20 |
| 10 |
+-------+
+-------+----------------+
| Data | Found in group |
+-------+----------------+
| row 1 | 111 |
| row A | 1 |
| row 2 | 1 1 |
| row 3 | 111 |
+-------+----------------+
Where the first character represents Group 1, the 2nd is Group 20 and the 3rd is Group 10.
Ordering of the Group rows isn't critical so long as I can reference which row goes with which character.
I only ask this because I saw this crazy example in the documentation generating a fractal, but I can't quite get my head around it.
Is this doable?
To find the missing values, the first step is to prepare a dataset that has every possible combination. You can achieve that using CROSS JOIN.
Once you have that dataset, compare it with the actual dataset.
Assuming the ordering is done on the Grp column, you can achieve it as below.
SELECT a.Data,
       group_concat(CASE WHEN base.Grp IS NULL THEN '.' ELSE '1' END, '') AS Found_In_Group,
       group_concat(b.Grp) AS Group_Order
FROM (SELECT Data FROM yourtable GROUP BY Data) a
CROSS JOIN (SELECT Grp FROM yourtable GROUP BY Grp ORDER BY Grp) b
LEFT JOIN yourtable base
       ON b.Grp = base.Grp
      AND a.Data = base.Data
GROUP BY a.Data
Note: a . is used instead of a blank so the missing groups are easier to see.
+-------+----------------+-------------+
| Data  | Found_In_Group | Group_Order |
+-------+----------------+-------------+
| row 1 | 111            | 1,10,20     |
| row 2 | 11.            | 1,10,20     |
| row 3 | 111            | 1,10,20     |
| row A | .1.            | 1,10,20     |
+-------+----------------+-------------+
SELECT Data, group_concat("Group") AS "Found in group"
FROM yourtable
GROUP BY Data
will give you a CSV list of groups.
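With the sample data, that yields something like this (group_concat does not guarantee the order of groups within each list):
Data  | Found in group
------+----------------
row 1 | 1,20,10
row 2 | 1,10
row 3 | 1,20,10
row A | 10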

Postgres create view with column values based on another table?

I'm implementing a view to store leaderboard data for the top 10 users, computed using an expensive COUNT(*). I plan for the view to look something like this:
id SERIAL PRIMARY KEY
user_id TEXT
type TEXT
rank INTEGER
count INTEGER
-- adding an index to user_id
-- adding a two-column unique index to user_id and type
I'm having trouble seeing how this view should be created to properly account for the rank and type. Essentially, I have a big table (~30 million rows) like this:
+----+---------+---------+----------------------------+
| id | user_id | type | created_at |
+----+---------+---------+----------------------------+
| 1 | 1 | Diamond | 2021-05-11 17:35:18.399517 |
| 2 | 1 | Diamond | 2021-05-12 17:35:17.399517 |
| 3 | 1 | Diamond | 2021-05-12 17:35:18.399517 |
| 4 | 2 | Diamond | 2021-05-13 17:35:18.399517 |
| 5 | 1 | Clay | 2021-05-14 17:35:18.399517 |
| 6 | 1 | Clay | 2021-05-15 17:35:18.399517 |
+----+---------+---------+----------------------------+
With the table above, I'm trying to achieve something like this:
+----+---------+---------+------+-------+
| id | user_id | type | rank | count |
+----+---------+---------+------+-------+
| 1 | 1 | Diamond | 1 | 3 |
| 2 | 2 | Diamond | 2 | 1 |
| 3 | 1 | Clay | 1 | 2 |
| 4 | 1 | Weekly | 1 | 5 | -- 3 diamonds + 2 clay obtained between Mon-Sun
| 5 | 2 | Weekly | 2 | 1 |
+----+---------+---------+------+-------+
By Weekly I am counting the time from the last Sunday to the upcoming Sunday.
Is this doable using only SQL, or is some kind of script needed? If doable, how would this be done? It's worth mentioning that there are thousands of different types, so not having to manually specify type would be preferred.
If there's anything unclear, please let me know and I'll do my best to clarify. Thanks!
The "weekly" rows are produced in a different way compared to the "user" rows (I called them two different "categories"). To get the result you want you can combine two queries using UNION ALL.
For example:
select 'u' as category, user_id, type,
       rank() over (partition by type order by count(*) desc) as rk,
       count(*) as cnt
from scores
group by user_id, type
union all
select 'w', user_id, 'Weekly',
       rank() over (order by count(*) desc),
       count(*) as cnt
from scores
group by user_id
order by category, type desc, rk
Result:
category user_id type rk cnt
--------- -------- -------- --- ---
u 1 Diamond 1 3
u 2 Diamond 2 1
u 1 Clay 1 2
w 1 Weekly 1 5
w 2 Weekly 2 1
Note: For the sake of simplicity I left the filtering by timestamp out of the query. If you really needed to include only the rows of the last 7 days (or other period of time), it would be a matter of adding a WHERE clause in both subqueries.
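A sketch of such a clause for the Sunday-to-Sunday week described in the question (date_trunc('week', ...) truncates to Monday under Postgres's ISO weeks, hence the one-day shifts; the exact boundary expressions are illustrative):
-- keep only rows from the current Sunday-to-Sunday week
WHERE created_at >= date_trunc('week', now() + interval '1 day') - interval '1 day'
  AND created_at <  date_trunc('week', now() + interval '1 day') + interval '6 days'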
I think this is what you were talking about, right?
WITH scores_plus_weekly AS ((
SELECT id, user_id, 'Weekly' AS type, created_at
FROM scores
WHERE created_at BETWEEN '2021-05-10' AND '2021-05-17'
)
UNION (
SELECT * FROM scores
))
SELECT
row_number() OVER (ORDER BY CASE "type" WHEN 'Diamond' THEN 0 WHEN 'Clay' THEN 1 ELSE 2 END, count(*) DESC) as "id",
user_id,
"type",
row_number() OVER (PARTITION BY "type" ORDER BY count(*) DESC) as "rank",
count(*)
FROM scores_plus_weekly
GROUP BY user_id, "type"
ORDER BY "id";
I'm sure this is not the only way, but I thought the result wasn't too complex. This query first combines the original database with all scores from this week. For the sake of consistency I picked a date range that matches your entire example set. It then groups by user_id and type to get the counts for each combination. The row_numbers will give you the overall rank and the rank per type. A big part of this query consists of sorting by type, so if you're joining another table that contains the order or priority of the types, the CASE can probably be simplified.
Then, lastly, this entire query can be wrapped in a view with CREATE VIEW score_ranks AS, followed by the query.
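Since the underlying COUNT(*) is expensive over ~30 million rows, a materialized view may fit better than a plain view: the ranking is computed at refresh time instead of on every read. A minimal sketch, using the per-type part of the first answer's query (the weekly rows would join in via the same UNION ALL):
-- materialize the ranking so reads don't re-run the aggregation
CREATE MATERIALIZED VIEW score_ranks AS
SELECT user_id, type,
       rank() OVER (PARTITION BY type ORDER BY count(*) DESC) AS rank,
       count(*) AS count
FROM scores
GROUP BY user_id, type;

-- a unique index lets the view refresh without blocking readers
CREATE UNIQUE INDEX ON score_ranks (user_id, type);

-- re-run the aggregation on a schedule (e.g. from cron)
REFRESH MATERIALIZED VIEW CONCURRENTLY score_ranks;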

Oracle SQL: Counting how often an attribute occurs for a given entry and choosing the attribute with the maximum number of occurrences

I have a table that has a number column and an attribute column like this:
1.
+-----+-----+
| num | att |
+-----+-----+
| 1   | a   |
| 1   | b   |
| 1   | a   |
| 2   | a   |
| 2   | b   |
| 2   | b   |
+-----+-----+
I want to make the number unique, with the attribute being whichever one occurred most often for that number, like this (this is the end product I'm interested in):
2.
+-----+-----+
| num | att |
+-----+-----+
| 1   | a   |
| 2   | b   |
+-----+-----+
I have been working on this for a while and managed to write myself a query that looks up how many times an attribute occurs for a given number like this:
3.
+-----+-----+-------+
| num | att | count |
+-----+-----+-------+
| 1   | a   | 2     |
| 1   | b   | 1     |
| 2   | a   | 1     |
| 2   | b   | 2     |
+-----+-----+-------+
But I can't think of a way to select only those rows from the above table where the count is the highest (for each number, of course).
So basically what I am asking is: given table 3, how do I select only the rows with the highest count for each number? (Of course, an answer providing a way to get from table 1 to table 2 directly also works :) )
You can use aggregation and window functions:
select num, att
from (
  select num, att,
         row_number() over (partition by num order by count(*) desc, att) as rn
  from mytable
  group by num, att
) t
where rn = 1
For each num, this brings the most frequent att; if there are ties, the smaller att is retained.
Oracle has an aggregate function that does this directly, stats_mode():
select num, stats_mode(att)
from t
group by num;
In statistics, the most common value is called the mode -- hence the name of the function.
You can use GROUP BY and COUNT as below to get the per-attribute counts shown in table 3:
select num, att, count(att) as count
from mytable
group by num, att

Rows that have the same value in a column: sum all values in another column and display one row

Example Table user:
ID | USER_ID | SCORE |
1  | 555     | 50    |
2  | 555     | 10    |
3  | 555     | 20    |
4  | 123     | 5     |
5  | 123     | 5     |
6  | 999     | 30    |
The result set should be like
ID | USER_ID | SCORE | COUNT |
1  | 555     | 80    | 3     |
2  | 123     | 10    | 2     |
3  | 999     | 30    | 1     |
Is it possible to write SQL that returns the table above? So far I can only count the rows where a certain user_id appears, but I don't know how to sum the scores and show one row for every user.
You've included a column called "ID" in both the source data and desired results, but I'm going to assume that these ID values are not related and simply represent the row or line number - otherwise the question doesn't make sense.
In which case, you can simply use:
SELECT
    USER_ID,
    SUM(SCORE) AS SCORE,
    COUNT(USER_ID) AS COUNT
FROM
    <Table>
GROUP BY
    USER_ID
If you really want to generate the ID column as well, then how you do this depends on the database platform being used. For example on Oracle you could use the ROWNUM pseudocolumn, on SQL Server you will need to use ROW_NUMBER() function (which also works for Oracle).
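For example, a sketch using ROW_NUMBER() (works on SQL Server, Oracle, and Postgres), numbering users by their first row so the generated IDs follow the original order:
SELECT
    -- number the groups by each user's first row to regenerate a sequential ID
    ROW_NUMBER() OVER (ORDER BY MIN(ID)) AS ID,
    USER_ID,
    SUM(SCORE) AS SCORE,
    COUNT(USER_ID) AS COUNT
FROM
    <Table>
GROUP BY
    USER_ID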
SELECT USER_ID
      ,SUM(SCORE) AS SCORE
      ,COUNT(USER_ID) AS COUNT
FROM <Table>
GROUP BY USER_ID
I think COUNT is the number of scores per user_id; if so, then your SQL query should be:
SELECT
    MIN(ID) AS ID,
    USER_ID,
    SUM(SCORE) AS SCORE,
    COUNT(SCORE) AS COUNT
FROM
    <Table>
GROUP BY
    USER_ID

Microsoft Access query to duplicate ROW_NUMBER

Obviously there are a bunch of questions about ROW_NUMBER in MS Access, and the usual response is that it does not exist, but that you can use COUNT(*) to create something similar. Unfortunately, doing so does not give me the results that I need.
My data looks like:
RID | QID
---------
1 | 1
1 | 2
1 | 3
1 | 3
2 | 1
2 | 2
2 | 2
What I am trying to get at is a unique count over RID and QID, so that my query output looks like:
RID | QID | SeqID
------------------
1 | 1 | 1
1 | 2 | 1
1 | 3 | 1
1 | 3 | 2
2 | 1 | 1
2 | 2 | 1
2 | 2 | 2
Using the COUNT(*) approach, I get:
RID | QID | SeqID
------------------
1 | 1 | 1
1 | 2 | 2
1 | 3 | 3
1 | 3 | 3
2 | 1 | 1
2 | 2 | 2
2 | 2 | 2
My current query is:
SELECT
d.RID
,d.QID
,(SELECT
COUNT(*)
FROM
Data as d2
WHERE
d2.RID = d.RID
AND d2.QID < d.QID) + 1 AS SeqID
FROM
Data as d
ORDER BY
d.RID
,d.QID
Any help would be greatly appreciated.
As Matt's comment implied, the only way to make this work is if you have some column in your table that can uniquely identify each row.
Based on what you have posted, you don't seem to have that. If that's the case, consider adding a new auto increment numeric column that can serve that purpose. Let's pretend that you call that new column id.
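In Access SQL, such a column can be added with the COUNTER (AutoNumber) type; a sketch, assuming the table is named Data as in the query below:
ALTER TABLE Data ADD COLUMN id COUNTER;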
With that in place, the following query will work:
select t.rid, t.qid,
       (select count(*)
        from data t2
        where t2.rid = t.rid
          and t2.qid = t.qid
          and t2.id <= t.id) as SeqID
from data t
order by t.rid, t.qid