Order by date, while grouping matches by another column - sql

I have this query
SELECT *, COUNT(app.id) AS totalApps FROM users JOIN app ON app.id = users.id
GROUP BY app.id ORDER BY app.time DESC LIMIT ?
which is supposed to get all results from "users" ordered by another column (time) in a related table (the id from the app tables references the id from the users table).
The issue I have is that the grouping is done before the ordering by date, so I get very old results. But I need the grouping in order to get distinct users, because each user can have multiple 'apps'... Is there a different way to achieve this?
Table users:
id TEXT PRIMARY KEY
Table app:
id TEXT
time DATETIME
FOREIGN KEY(id) REFERENCES users(id)
In my SELECT query I want to get a list of users, ordered by the app.time column. But because one user can have multiple app records associated, I could get duplicate users; that's why I used GROUP BY. But then the order is messed up.

The underlying issue is that the SELECT is an aggregate query as it contains a GROUP BY clause :-
There are two types of simple SELECT statement - aggregate and
non-aggregate queries. A simple SELECT statement is an aggregate query
if it contains either a GROUP BY clause or one or more aggregate
functions in the result-set.
SQL As Understood By SQLite - SELECT
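And thus the value of a non-aggregated column for each group will be an arbitrary value of that column from within the group (first according to scan/search, I suspect, hence the lower values). The documentation describes the same arbitrary-row behaviour for aggregate queries without a GROUP BY clause :-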
If the SELECT statement is an aggregate query without a GROUP BY
clause, then each aggregate expression in the result-set is evaluated
once across the entire dataset. Each non-aggregate expression in the
result-set is evaluated once for an arbitrarily selected row of the
dataset. The same arbitrarily selected row is used for each
non-aggregate expression. Or, if the dataset contains zero rows, then
each non-aggregate expression is evaluated against a row consisting
entirely of NULL values.
So in short you cannot rely upon the column values that aren't part of the group/aggregation, when it's an aggregate query.
Therefore you have to retrieve the required values using an aggregate expression, such as max(app.time). Ordering by that aggregate is allowed, but the result can still look wrong if the column is compared as text rather than as a number or date, which is what happens in the demo below, where time is declared TEXT.
One way around that is to use the query to build a CTE and then sort its output, casting the value as needed.
Consider the following, which I think mimics your problem:-
DROP TABLE IF EXISTS users;
DROP TABLE IF EXISTS app;
CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, username TEXT);
INSERT INTO users (username) VALUES ('a'),('b'),('c'),('d');
CREATE TABLE app (the_id INTEGER PRIMARY KEY, id INTEGER, appname TEXT, time TEXT);
INSERT INTO app (id,appname,time) VALUES
(4,'app9',721),(4,'app10',7654),(4,'app11',11),
(3,'app1',1000),(3,'app2',7),
(2,'app3',10),(2,'app4',101),(2,'app5',1),
(1,'app6',15),(1,'app7',7),(1,'app8',212),
(4,'app9',721),(4,'app10',7654),(4,'app11',11),
(3,'app1',1000),(3,'app2',7),
(2,'app3',10),(2,'app4',101),(2,'app5',1),
(1,'app6',15),(1,'app7',7),(1,'app8',212)
;
SELECT * FROM users;
SELECT * FROM app;
SELECT username
,count(app.id)
, max(app.time) AS latest_time
, min(app.time) AS earliest_time
FROM users JOIN app ON users.id = app.id
GROUP BY users.id
ORDER BY max(app.time)
;
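In the result, a maximum has been extracted for each group, but the rows aren't sorted as you would expect: time is declared TEXT, so max() and ORDER BY compare the values as strings (for example '7' sorts after '1000').
Wrapping the query in a CTE and casting in the outer ORDER BY sorts the rows numerically, e.g. :-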
WITH cte1 AS
(
SELECT username
,count(app.id)
, max(app.time) AS latest_time
, min(app.time) AS earliest_time
FROM users JOIN app ON users.id = app.id
GROUP BY users.id
)
SELECT * FROM cte1 ORDER BY cast(latest_time AS INTEGER) DESC;
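The rows now come back in descending numeric order of latest_time. Because time is TEXT, max(app.time) itself is still the string maximum; use max(cast(app.time AS INTEGER)) if the numeric maximum is needed.
Note simple integers have been used instead of real times for my convenience.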
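To see how the TEXT declaration affects the demo above, here is a minimal sketch that runs both variants through Python's built-in sqlite3 module. The in-memory database and the variable names as_text and as_number are just for illustration; the sample rows are the ones from the demo (inserted once rather than twice).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE app (the_id INTEGER PRIMARY KEY, id INTEGER, appname TEXT, time TEXT);
INSERT INTO users (username) VALUES ('a'),('b'),('c'),('d');
INSERT INTO app (id, appname, time) VALUES
    (4,'app9',721),(4,'app10',7654),(4,'app11',11),
    (3,'app1',1000),(3,'app2',7),
    (2,'app3',10),(2,'app4',101),(2,'app5',1),
    (1,'app6',15),(1,'app7',7),(1,'app8',212);
""")

# time has TEXT affinity, so max() and ORDER BY compare strings.
as_text = conn.execute("""
    SELECT username, max(app.time) AS latest_time
    FROM users JOIN app ON users.id = app.id
    GROUP BY users.id
    ORDER BY latest_time DESC, username
""").fetchall()

# Casting inside the aggregate makes both max() and ORDER BY numeric.
as_number = conn.execute("""
    SELECT username, max(CAST(app.time AS INTEGER)) AS latest_time
    FROM users JOIN app ON users.id = app.id
    GROUP BY users.id
    ORDER BY latest_time DESC, username
""").fetchall()

print(as_text)
print(as_number)
```

With real dates stored as ISO-8601 strings (YYYY-MM-DD HH:MM:SS) no cast is needed, since those strings sort chronologically.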

Since you need the newest date in every group, you could just MAX them:
SELECT
*,
COUNT(app.id) AS totalApps,
MAX(app.time) AS latestDate
FROM users
JOIN app ON app.id = users.id
GROUP BY app.id
ORDER BY latestDate DESC
LIMIT ?
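As a rough check of the MAX approach against the schema from the question, the sketch below runs it through Python's sqlite3 module. The user ids and timestamps are made-up sample data, and users.id is selected explicitly instead of * to keep the output readable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id TEXT PRIMARY KEY);
CREATE TABLE app (id TEXT, time DATETIME, FOREIGN KEY(id) REFERENCES users(id));
INSERT INTO users (id) VALUES ('alice'), ('bob'), ('carol');
INSERT INTO app (id, time) VALUES
    ('alice', '2020-01-05 10:00:00'),
    ('alice', '2021-03-01 09:30:00'),
    ('bob',   '2022-07-15 18:45:00'),
    ('carol', '2019-11-20 08:00:00'),
    ('carol', '2020-06-30 12:00:00');
""")

# One row per user, newest app first; the LIMIT value is bound as a parameter.
rows = conn.execute("""
    SELECT users.id,
           COUNT(app.id) AS totalApps,
           MAX(app.time) AS latestDate
    FROM users
    JOIN app ON app.id = users.id
    GROUP BY users.id
    ORDER BY latestDate DESC
    LIMIT ?
""", (2,)).fetchall()

print(rows)
```

Each user appears once, and the ordering uses the newest app.time of that user rather than an arbitrary row from the group.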

You could use windowed COUNT:
SELECT *, COUNT(app.id) OVER(PARTITION BY app.id) AS totalApps
FROM users
JOIN app
ON app.id = users.id
ORDER BY app.time DESC
LIMIT ?

Maybe you could use?
SELECT DISTINCT
Read more here: https://www.w3schools.com/sql/sql_distinct.asp

Try grouping by id and time, and then ordering by time.
select ...
group by app.id desc, app.time
I assume that id is unique in the app table.
Also, how do you assign the IDs? Maybe ordering by id desc is enough.

Related

Why sometimes a subquery can work like using 'group by'

I'm new to sql and can't understand why sometimes a subquery can work like using 'group by'.
Say, there are two tables in a data base.
'foods' is a table created by:
CREATE TABLE foods (
id integer PRIMARY KEY,
type_id integer,
name text
);
'foods_episodes' is a table created by:
CREATE TABLE foods_episodes (
food_id integer,
episode_id integer
);
Now I'm using the following two SQL statements, and they generate the same result.
SELECT name, (SELECT count(*) FROM foods_episodes WHERE food_id=f.id) AS frequency
FROM foods AS f
ORDER BY name;
SELECT name, count(*) AS frequency
FROM foods_episodes,
foods AS f
WHERE food_id=f.id
GROUP BY name;
So why does the subquery in the first statement work as if it groups the result by name?
When I run the subquery alone:
SELECT count(*)
FROM foods_episodes,
foods f
WHERE food_id=f.id
the result is just one row. Why does using it as a subquery generate a multi-row result?
The first query isn't actually grouping by name. If you have more than one record with the same name (different ID), you will see the name displayed more than once (hence, not grouped).
The first query uses what is called a correlated subquery, it calculates the subquery (the inner SELECT) once for each row of the outmost select. Because the FROM in this outmost SELECT is just from the table foods, you will get one record for each food + the results of the subquery, thus no need to group.
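The difference shows up as soon as two foods share a name. The following sketch uses Python's sqlite3 module with made-up rows (two foods both named 'Bagel') to run both queries from the question; an ORDER BY is added to each so the output is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE foods (id integer PRIMARY KEY, type_id integer, name text);
CREATE TABLE foods_episodes (food_id integer, episode_id integer);
INSERT INTO foods (id, type_id, name) VALUES
    (1, 1, 'Bagel'), (2, 1, 'Bagel'), (3, 2, 'Soup');
INSERT INTO foods_episodes (food_id, episode_id) VALUES
    (1, 10), (1, 11), (2, 12), (3, 10);
""")

# Correlated subquery: evaluated once per row of foods, so one output row per food.
correlated = conn.execute("""
    SELECT name,
           (SELECT count(*) FROM foods_episodes WHERE food_id = f.id) AS frequency
    FROM foods AS f
    ORDER BY name, f.id
""").fetchall()

# GROUP BY name: foods with the same name collapse into a single row.
grouped = conn.execute("""
    SELECT name, count(*) AS frequency
    FROM foods_episodes, foods AS f
    WHERE food_id = f.id
    GROUP BY name
    ORDER BY name
""").fetchall()

print(correlated)
print(grouped)
```

The two queries only agree when every name in foods is unique.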

SELECT list expression references column user_id which is neither grouped nor aggregated at [8:5]

I have 2 data sets. One of all patients who got ill (endo-2) and one of a special group of patients that also exists in endo-2 called "xp-56"
I've been trying to run this query and I'm not sure why it isn't working. I want to do counts of 3 columns in endo-2 of those patients that belong in the xp-56 table.
This is the code I've been using, and it fails with the following error:
SELECT list expression references column user_id which is neither grouped nor aggregated at [8:5]
How do I fix this so I never make the same mistake again?
SELECT
Virus_Exposure,
Medical_Delivery,
Number_of_Site
FROM
(
SELECT
medical_id,
COUNT(DISTINCT Virus_id) AS Virus_Exposure,
COUNT(EndoCrin_id) AS Medical_Delivery,
COUNT (site_id_clinic) AS Number_of_Site
FROM
`endo-2`
WHERE
_PARTITIONTIME BETWEEN TIMESTAMP("2017-12-15")
AND TIMESTAMP("2018-01-10")) AS a
RIGHT JOIN
(
SELECT
medical_id
FROM
`xp-56`
ORDER BY
medical_id DESC) AS b
ON
a.medical_id=b.medical_id
GROUP BY
medical_id
Why doesn't the medical_id in table a work?
Why not just do this?
SELECT e.medical_id,
COUNT(DISTINCT e.Virus_id) AS Virus_Exposure,
COUNT(e.EndoCrin_id) AS Medical_Delivery,
COUNT(e.site_id_clinic) AS Number_of_Site
FROM `endo-2` e JOIN
`xp-56` x
ON x.medical_id = e.medical_id
WHERE e._PARTITIONTIME BETWEEN TIMESTAMP("2017-12-15") AND TIMESTAMP("2018-01-10")
GROUP BY e.medical_id;

Having an issue with selecting max rows by date

I am trying to select the max timestamped records from table 1 based on some data from table 2. I am getting the correct records based on the where limits I have put on the query, but I am still getting duplicate entries not the max time stamped entries. Any ideas on what is wrong with the query?
Basically the ID 901413368 has access to certain leveltypes and I'm trying to find out what the max dated requests were that were put in for that same person for the leveltypes that person manages.
SELECT
MAX(timestamp) AS maxtime, Leveltype, assign_ID
FROM
WHERE
(leveltype IN
(SELECT leveltype FROM dbo.idleveltypes WHERE (id = 901413368)))
GROUP BY timestamp, assign_ID, leveltype
HAVING (assign_ID = '901413368')
UPDATE: The issue has been resolved by WEI_DBA's response below:
Remove the timestamp column from your Group By. Also put the assign_ID in the Where Clause and remove the Having clause
The following may be what you want. It should also be a simpler way to write the query:
SELECT MAX(a.timestamp) AS maxtime, a.Leveltype, a.assign_ID
FROM dbo.q_Archive a JOIN
dbo.idleveltypes lt
ON a.leveltype = lt.leveltype AND
a.assign_ID = lt.id
WHERE a.assign_ID = 901413368
GROUP BY a.assign_ID, a.leveltype;
Notes:
Filter on assign_ID before doing the group by. That is much more efficient.
A JOIN is the more typical way to represent the relationship between two tables.
The JOIN condition should be on all the columns needed for matching; there appear to be two.
I don't understand why the leveltype table would have a column called id, but this is your data structure.
The GROUP BY does not need timestamp.
Decide on the type for the id column that should be 901413368. Is it a number or a string? Only use single quotes for string and date constants.
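The effect of dropping timestamp from the GROUP BY can be sketched with Python's sqlite3 module. The rows below are invented, the dbo. schema prefix is omitted because SQLite has no such schema, and the result names grouped_with_timestamp and latest_per_level are only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE idleveltypes (id INTEGER, leveltype TEXT);
CREATE TABLE q_Archive (timestamp TEXT, leveltype TEXT, assign_ID INTEGER);
INSERT INTO idleveltypes VALUES (901413368, 'A'), (901413368, 'B'), (555, 'C');
INSERT INTO q_Archive VALUES
    ('2018-01-01 08:00:00', 'A', 901413368),
    ('2018-02-01 08:00:00', 'A', 901413368),
    ('2018-01-15 08:00:00', 'B', 901413368),
    ('2018-03-01 08:00:00', 'C', 901413368),
    ('2018-04-01 08:00:00', 'A', 555);
""")

join_sql = """
    SELECT MAX(a.timestamp) AS maxtime, a.leveltype, a.assign_ID
    FROM q_Archive a
    JOIN idleveltypes lt
      ON a.leveltype = lt.leveltype AND a.assign_ID = lt.id
    WHERE a.assign_ID = ?
"""

# Grouping by timestamp puts every distinct timestamp in its own group,
# so MAX() has nothing to collapse and the "duplicates" remain.
grouped_with_timestamp = conn.execute(
    join_sql + " GROUP BY a.timestamp, a.assign_ID, a.leveltype"
               " ORDER BY a.leveltype, a.timestamp",
    (901413368,),
).fetchall()

# Grouping only by assign_ID and leveltype yields one row per level type.
latest_per_level = conn.execute(
    join_sql + " GROUP BY a.assign_ID, a.leveltype ORDER BY a.leveltype",
    (901413368,),
).fetchall()

print(grouped_with_timestamp)
print(latest_per_level)
```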
Remove timestamp from the GROUP BY clause, since you're selecting MAX(timestamp).
You should not add aggregated fields to the GROUP BY clause.
SELECT
MAX(timestamp) AS maxtime,
Leveltype,
assign_ID
FROM
dbo.q_Archive
WHERE
(leveltype IN (SELECT leveltype
FROM dbo.idleveltypes
WHERE (id = 901413368)))
GROUP BY assign_ID, leveltype
HAVING (assign_ID = '901413368')

How can I order by a specific order?

It would be something like:
SELECT * FROM users ORDER BY id ORDER("abc","ghk","pqr"...);
In my order clause there might be 1000 records and all are dynamic.
A quick google search gave me below result:
SELECT * FROM users ORDER BY case id
when "abc" then 1
when "ghk" then 2
when "pqr" then 3 end;
As I said all my order clause values are dynamic. So is there any suggestion for me?
Your example isn't entirely clear, as it appears that a simple ORDER BY would suffice to order your ids alphabetically. However, it appears you are trying to create a dynamic ordering scheme that may not be alphabetical. In that case, my recommendation would be to use a lookup table for the values that you will be ordering by. This serves two purposes: first, it allows you to easily reorder the items without altering each entry in the users table, and second, it avoids (or at least reduces) problems with typos and other issues that can occur with "magic strings."
This would look something like:
Lookup Table:
-- ORDER is a reserved word, so the sort column is named SortOrder
CREATE TABLE LookupValues (
Id CHAR(3) PRIMARY KEY,
SortOrder INT
);
Query:
SELECT
u.*
FROM
users u
INNER JOIN
LookupValues l
ON
u.Id = l.Id
ORDER BY
l.SortOrder
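For a list of ids that changes on every request, the lookup table can be filled at query time. The sketch below is one way to do that with Python's sqlite3 module; the helper fetch_users_in_order, the temporary table, the column name SortOrder (chosen because ORDER is a reserved word), and the sample users are all assumptions made for illustration.

```python
import sqlite3

def fetch_users_in_order(conn, ordered_ids):
    """Return user rows sorted by their position in ordered_ids."""
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS LookupValues "
        "(Id TEXT PRIMARY KEY, SortOrder INT)"
    )
    conn.execute("DELETE FROM LookupValues")
    # The position in the list becomes the sort key.
    conn.executemany(
        "INSERT INTO LookupValues (Id, SortOrder) VALUES (?, ?)",
        [(id_, position) for position, id_ in enumerate(ordered_ids)],
    )
    return conn.execute("""
        SELECT u.*
        FROM users u
        INNER JOIN LookupValues l ON u.Id = l.Id
        ORDER BY l.SortOrder
    """).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (Id TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("abc", "Ann"), ("ghk", "Gus"), ("pqr", "Pat")],
)

ordered = fetch_users_in_order(conn, ["pqr", "abc", "ghk"])
print(ordered)
```

Note that the INNER JOIN drops users whose id is not in the list; a LEFT JOIN would keep them, with a NULL sort key.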

Using a DISTINCT clause to filter data but still pull other fields that are not DISTINCT

I am trying to write a query in Postgresql that pulls a set of ordered data and filters it by a distinct field. I also need to pull several other fields from the same table row, but they need to be left out of the distinct evaluation. example:
SELECT DISTINCT(user_id) user_id,
created_at
FROM creations
ORDER BY created_at
LIMIT 20
I need the user_id to be DISTINCT, but don't care whether the created_at date is unique or not. Because the created_at date is being included in the evaluation, I am getting duplicate user_id in my result set.
Also, the data must be ordered by the date, so using DISTINCT ON is not an option here. It required that the DISTINCT ON field be the first field in the ORDER BY clause and that does not deliver the results that I seek.
How do I properly use the DISTINCT clause but limit its scope to only one field while still selecting other fields?
As you've discovered, standard SQL treats DISTINCT as applying to the whole select-list, not just one column or a few columns. The reason for this is that it's ambiguous what value to put in the columns you exclude from the DISTINCT. For the same reason, standard SQL doesn't allow you to have ambiguous columns in a query with GROUP BY.
But PostgreSQL has a nonstandard extension to SQL to allow for what you're asking: DISTINCT ON (expr).
SELECT DISTINCT ON (user_id) user_id, created_at
FROM creations
ORDER BY user_id, created_at
LIMIT 20
You have to include the distinct expression(s) as the leftmost part of your ORDER BY clause.
See the manual on DISTINCT Clause for more information.
If you want the most recent created_at for each user then I suggest you aggregate like this:
SELECT user_id, MAX(created_at) AS created_at
FROM creations
WHERE ....
GROUP BY user_id
ORDER BY MAX(created_at) DESC
This will return the most recent created_at for each user_id
If you only want the top 20, then append
LIMIT 20
EDIT: This is basically the same thing Unreason said above... define from which row you want the data by aggregation.
The GROUP BY should ensure distinct values of the grouped columns, this might give you what you are after.
(Note I'm putting in my 2 cents even though I am not familiar with PostgreSQL, but rather MySQL and Oracle)
In MySQL (this only runs when ONLY_FULL_GROUP_BY is disabled, and the created_at returned for each user_id is an arbitrary one from its group)
SELECT user_id, created_at
FROM creations
GROUP BY user_id
ORDER BY user_id
In Oracle sqlplus (Oracle has no plain FIRST() aggregate, so MIN is used here to get the earliest created_at)
SELECT user_id, MIN(created_at)
FROM creations
GROUP BY user_id
ORDER BY user_id
These will give you the user_id followed by one created_at associated with that user_id. If you want a different created_at, you can substitute MIN with other aggregate functions such as AVG or MAX.
Your question is not well defined - when you say you need also other data from the same row you are not defining which row.
You do say you need to order the results by created_at, so I will assume that you want values from the row with min created_at (earliest).
This now becomes one of the most common SQL questions on SO: retrieving rows containing some aggregate value (MIN, MAX).
For example
SELECT user_id, MIN(created_at) AS created_at
FROM creations
GROUP BY user_id
ORDER BY MIN(created_at)
LIMIT 20
This approach will not let you (easily) pick other values from the same row.
One approach that will let you pick other values is
SELECT c.user_id, c.created_at, c.other_columns
FROM creations c LEFT JOIN creations c_help
ON c.user_id = c_help.user_id AND c.created_at > c_help.created_at
WHERE c_help.user_id IS NULL
ORDER BY c.created_at
LIMIT 20
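The self-join works because a row only survives the WHERE filter when no other row for the same user has an earlier created_at. Here is a small sketch of that idea using Python's sqlite3 module; the title column stands in for other_columns, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE creations (user_id INTEGER, created_at TEXT, title TEXT);
INSERT INTO creations VALUES
    (1, '2010-05-01', 'first by 1'),
    (1, '2010-06-01', 'second by 1'),
    (2, '2010-04-15', 'first by 2'),
    (2, '2010-07-01', 'second by 2'),
    (3, '2010-05-20', 'only by 3');
""")

# Keep each user's earliest row: the LEFT JOIN finds no earlier row for it,
# so every c_help column is NULL on that row.
rows = conn.execute("""
    SELECT c.user_id, c.created_at, c.title
    FROM creations c
    LEFT JOIN creations c_help
      ON c.user_id = c_help.user_id AND c.created_at > c_help.created_at
    WHERE c_help.user_id IS NULL
    ORDER BY c.created_at
    LIMIT 20
""").fetchall()

print(rows)
```

If a user has two rows sharing the same earliest created_at, both are returned, so a tie-breaker column is needed when strict uniqueness matters.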
Using a sub-query was suggested by someone on the irc #postgresql channel. It worked:
SELECT user_id
FROM (SELECT DISTINCT ON (user_id) * FROM creations) ss
ORDER BY created_at DESC
LIMIT 20;