What will be faster for GROUP BY statement - sql

Imagine that I have the next two SQL Server tables:
CREATE TABLE Users (
id INT IDENTITY(1, 1) PRIMARY KEY,
name VARCHAR(100) NOT NULL
)
CREATE TABLE UserLogins (
id INT IDENTITY(1, 1) PRIMARY KEY,
user_id INT REFERENCES Users(id) NOT NULL,
login VARCHAR(100) NOT NULL
)
I need to get a count of logins for each user, and the query result should also contain the user name.
Which query will work faster:
SELECT MAX(name), count(*)
FROM Users u
INNER JOIN UserLogins ul ON ul.user_id = u.id
GROUP BY u.id
or the next one:
SELECT name, count(*)
FROM Users u
INNER JOIN UserLogins ul ON ul.user_id = u.id
GROUP BY u.name
So I'm not sure whether it is better to group by the indexed column and then use a MAX or MIN aggregate function, or to just group by Users.name, which doesn't have any index.
Thank you in advance!

The answer is: neither is ideal.
The second version is wrong, because name is not unique. The first version returns correct results, although it may not be efficient.
Since name has a functional dependency on id, every unique value of id also defines a value of name. Grouping by name is wrong, because name is not necessarily unique. Grouping only by id means you need to aggregate name, which makes no sense if there is a functional dependency. So you actually want to group by both columns:
SELECT
u.name,
count(*)
FROM Users u
INNER JOIN UserLogins ul ON ul.user_id = u.id
GROUP BY
u.id,
u.name;
Note that id does not actually need to be selected.
This query is almost certainly going to be faster than grouping by name alone, because the server cannot deduce that name is unique and needs to sort and aggregate it.
It may also be faster than grouping by id, although that may depend on whether the optimizer is clever enough to deduce the functional dependency (and therefore no aggregation would be necessary). Even if it isn't clever, this probably won't be slow, as id is already unique, so a scan of an index over id would not require a sort, only aggregation.
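To make the correctness point concrete, here is a minimal sketch that runs both groupings against made-up data. It uses SQLite through Python's sqlite3 module rather than SQL Server (an assumption made purely so the example is self-contained), so it illustrates the semantics only and says nothing about performance. The sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite stand-in for the two tables in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Users (id INTEGER PRIMARY KEY, name VARCHAR(100) NOT NULL);
CREATE TABLE UserLogins (
    id INTEGER PRIMARY KEY,
    user_id INT NOT NULL REFERENCES Users(id),
    login VARCHAR(100) NOT NULL
);
-- Hypothetical sample data: two different users share the name 'Alice'.
INSERT INTO Users (id, name) VALUES (1, 'Alice'), (2, 'Alice'), (3, 'Bob');
INSERT INTO UserLogins (user_id, login) VALUES
    (1, 'a1'), (1, 'a2'), (2, 'a3'), (3, 'b1'), (3, 'b2'), (3, 'b3');
""")

# Grouping by name alone merges the two Alices into a single row.
by_name = conn.execute("""
    SELECT u.name, count(*)
    FROM Users u
    INNER JOIN UserLogins ul ON ul.user_id = u.id
    GROUP BY u.name
    ORDER BY u.name
""").fetchall()

# Grouping by id and name keeps one row per user.
by_id_and_name = conn.execute("""
    SELECT u.name, count(*)
    FROM Users u
    INNER JOIN UserLogins ul ON ul.user_id = u.id
    GROUP BY u.id, u.name
    ORDER BY u.id
""").fetchall()

print(by_name)
print(by_id_and_name)
```

With this data, the name-only grouping reports a single Alice with the combined login count, while grouping by id and name reports each Alice separately.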

Related

Preventing SQLite query from doing USE TEMP B-TREE FOR GROUP BY

I have a table
CREATE TABLE user_records (
pos smallint PRIMARY KEY,
username MEDIUMINT unsigned not null,
anime_id smallint UNSIGNED NOT NULL,
score tinyint not null,
scaled_score DECIMAL(1,5) not null
)
with indexes
(anime_id,username,scaled_score)
(username,anime_id,scaled_score)
(username)
(anime_id)
I know the last two are redundant; I was just testing.
And lastly here is my query:
select aggregate_func(score2) scores,anime_id from
(select anime_id as anime_id2,username as username2,scaled_score as score2 from user_records where anime_id in(666))
inner join
(select anime_id,username,scaled_score from user_records)
where username = username2 group by anime_id order by scores desc limit 1000;
The goal of this query is to run an aggregation function on every combination of scores a user has given to a specified show (666 in this case) and all other shows in the table. I have tried every type of join SQLite supports, which isn't many, and reordering the select statements, but the outcome is always the same except with a cross join with the unconstrained select first, which takes a very long time for obvious reasons. I am confident after executing each part separately that the part taking the most time is the USE TEMP B-TREE FOR GROUP BY. My goal is for the query planner to somehow use an index for the GROUP BY, but no matter what I try it chooses B-TREE, and the grouping process takes exponentially longer depending on the size of the result set from the join.
For reference, this table has 70,000,000 rows of user show ratings and the GROUP by often has to work on millions of joined rows. Thanks in advance.

Postgresql "column must appear in the GROUP BY clause or be used in an aggregate function" and unique field

CREATE TABLE posts (
id bigint NOT NULL,
user_id bigint NOT NULL,
content text
);
CREATE TABLE users (
id bigint NOT NULL,
email character varying DEFAULT ''::character varying NOT NULL
)
CREATE UNIQUE INDEX index_users_on_email ON users USING btree (email);
The following SQL request:
SELECT posts.content, users.email /*, other aggregate fields not relevant for the question */
FROM posts
INNER JOIN users ON posts.user_id = users.id
GROUP BY posts.id;
gives the error column "users.email" must appear in the GROUP BY clause or be used in an aggregate function.
But the email field is unique (if it changes anything) and a post can only have one user (so one email).
Why is this request not valid, since it's not possible to have multiple values of email per post?
You need to add the primary key of the user table to the group by clause to make the query a valid aggregation query:
SELECT p.content, u.email /*, other aggregate fields not relevant for the question */
FROM posts p
INNER JOIN users u ON p.user_id = u.id
/* Other `inner join`s but not relevant for the question */
GROUP BY p.id, u.id;
Postgres is quite smart about functional dependencies, but not that smart. It understands the concept of functionally dependent columns, but not across tables, so it cannot foresee that a post uniquely refers to a user, even if you have a proper foreign key set up. I don't think that such a thing is defined in standard ANSI SQL either.

Order by date, while grouping matches by another column

I have this query
SELECT *, COUNT(app.id) AS totalApps FROM users JOIN app ON app.id = users.id
GROUP BY app.id ORDER BY app.time DESC LIMIT ?
which is supposed to get all results from "users" ordered by another column (time) in a related table (the id from the app tables references the id from the users table).
The issue I have is that the grouping is done before the ordering by date, so I get very old results. But I need the grouping in order to get distinct users, because each user can have multiple 'apps'... Is there a different way to achieve this?
Table users:
id TEXT PRIMARY KEY
Table app:
id TEXT
time DATETIME
FOREIGN KEY(id) REFERENCES users(id)
In my SELECT query I want to get a list of users, ordered by the app.time column. But because one user can have multiple app records associated, I could get duplicate users; that's why I used GROUP BY. But then the order is messed up.
The underlying issue is that the SELECT is an aggregate query as it contains a GROUP BY clause :-
There are two types of simple SELECT statement - aggregate and
non-aggregate queries. A simple SELECT statement is an aggregate query
if it contains either a GROUP BY clause or one or more aggregate
functions in the result-set.
SQL As Understood By SQLite - SELECT
And thus the value of a column that is not aggregated will be an arbitrary value of that column from within the group (the first according to the scan/search, I suspect, hence the lower values) :-
If the SELECT statement is an aggregate query without a GROUP BY
clause, then each aggregate expression in the result-set is evaluated
once across the entire dataset. Each non-aggregate expression in the
result-set is evaluated once for an arbitrarily selected row of the
dataset. The same arbitrarily selected row is used for each
non-aggregate expression. Or, if the dataset contains zero rows, then
each non-aggregate expression is evaluated against a row consisting
entirely of NULL values.
So in short you cannot rely upon the column values that aren't part of the group/aggregation, when it's an aggregate query.
Therefore you have to retrieve the required values using an aggregate expression, such as max(app.time). However, in my test, ordering by this value did not give the order I expected (not sure exactly why, but it's probably inherent in the efficiency aspect).
HOWEVER
What you can do is use the query to build a CTE and then sort without aggregates involved.
Consider the following, which I think mimics your problem:-
DROP TABLE IF EXISTS users;
DROP TABLE If EXISTS app;
CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, username TEXT);
INSERT INTO users (username) VALUES ('a'),('b'),('c'),('d');
CREATE TABLE app (the_id INTEGER PRIMARY KEY, id INTEGER, appname TEXT, time TEXT);
INSERT INTO app (id,appname,time) VALUES
(4,'app9',721),(4,'app10',7654),(4,'app11',11),
(3,'app1',1000),(3,'app2',7),
(2,'app3',10),(2,'app4',101),(2,'app5',1),
(1,'app6',15),(1,'app7',7),(1,'app8',212),
(4,'app9',721),(4,'app10',7654),(4,'app11',11),
(3,'app1',1000),(3,'app2',7),
(2,'app3',10),(2,'app4',101),(2,'app5',1),
(1,'app6',15),(1,'app7',7),(1,'app8',212)
;
SELECT * FROM users;
SELECT * FROM app;
SELECT username
,count(app.id)
, max(app.time) AS latest_time
, min(app.time) AS earliest_time
FROM users JOIN app ON users.id = app.id
GROUP BY users.id
ORDER BY max(app.time)
;
This results in :-
Where although the latest time for each group has been extracted the final result hasn't been sorted as you would think.
Wrapping it into a CTE can fix that e.g. :-
WITH cte1 AS
(
SELECT username
,count(app.id)
, max(app.time) AS latest_time
, min(app.time) AS earliest_time
FROM users JOIN app ON users.id = app.id
GROUP BY users.id
)
SELECT * FROM cte1 ORDER BY cast(latest_time AS INTEGER) DESC;
and now :-
Note simple integers have been used instead of real times for my convenience.
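One thing worth checking in the example above: the time column is declared TEXT, and SQLite stores the integers inserted into it as strings, so both max() and ORDER BY compare them as text ('7' sorts after '1000'). The sketch below, run through Python's sqlite3 module against a single copy of the sample rows, contrasts the text comparison with a numeric one. Putting CAST inside max() is my own variation, not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT);
INSERT INTO users (username) VALUES ('a'),('b'),('c'),('d');
CREATE TABLE app (the_id INTEGER PRIMARY KEY, id INTEGER, appname TEXT, time TEXT);
INSERT INTO app (id, appname, time) VALUES
    (4,'app9',721),(4,'app10',7654),(4,'app11',11),
    (3,'app1',1000),(3,'app2',7),
    (2,'app3',10),(2,'app4',101),(2,'app5',1),
    (1,'app6',15),(1,'app7',7),(1,'app8',212);
""")

# max() on the TEXT column compares strings, so '7' beats '1000' and '212'.
text_max = dict(conn.execute("""
    SELECT username, max(app.time)
    FROM users JOIN app ON users.id = app.id
    GROUP BY users.id
""").fetchall())

# Casting inside the aggregate compares numbers; the alias can then be
# used directly in ORDER BY.
numeric_rows = conn.execute("""
    SELECT username, max(CAST(app.time AS INTEGER)) AS latest_time
    FROM users JOIN app ON users.id = app.id
    GROUP BY users.id
    ORDER BY latest_time DESC
""").fetchall()

print(text_max)
print(numeric_rows)
```

If the real column holds ISO-8601 timestamps such as '2020-01-31 12:00:00', text comparison already matches chronological order and no cast should be needed.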
Since you need the newest date in every group, you could just MAX them:
SELECT
*,
COUNT(app.id) AS totalApps,
MAX(app.time) AS latestDate
FROM users
JOIN app ON app.id = users.id
GROUP BY app.id
ORDER BY latestDate DESC
LIMIT ?
You could use windowed COUNT:
SELECT *, COUNT(app.id) OVER(PARTITION BY app.id) AS totalApps
FROM users
JOIN app
ON app.id = users.id
ORDER BY app.time DESC
LIMIT ?
Maybe you could use?
SELECT DISTINCT
Read more here: https://www.w3schools.com/sql/sql_distinct.asp
Try grouping by id and time, and then order by time.
select ...
group by app.id desc, app.time
I assume that id is unique in the app table.
Also, how do you assign the id? Maybe ordering by id desc is enough.

How can I order by a specific order?

It would be something like:
SELECT * FROM users ORDER BY id ORDER("abc","ghk","pqr"...);
In my order clause there might be 1000 records and all are dynamic.
A quick google search gave me below result:
SELECT * FROM users ORDER BY case id
when "abc" then 1
when "ghk" then 2
when "pqr" then 3 end;
As I said all my order clause values are dynamic. So is there any suggestion for me?
Your example isn't entirely clear, as it appears that a simple ORDER BY would suffice to order your ids alphabetically. However, it appears you are trying to create a dynamic ordering scheme that may not be alphabetical. In that case, my recommendation would be to use a lookup table for the values that you will be ordering by. This serves two purposes: first, it allows you to easily reorder the items without altering each entry in the users table, and second, it avoids (or at least reduces) problems with typos and other issues that can occur with "magic strings."
This would look something like:
Lookup Table:
CREATE TABLE LookupValues (
Id CHAR(3) PRIMARY KEY,
SortOrder INT
);
Query:
SELECT
u.*
FROM
users u
INNER JOIN
LookupValues l
ON
u.Id = l.Id
ORDER BY
l.SortOrder
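Since the ordering values in the question are dynamic, the lookup table can be filled at run time. The following is a minimal sketch of that idea using SQLite through Python's sqlite3 module. The table name wanted_order, the column sort_order, and the sample ids are all hypothetical, and the same pattern should carry over to other databases with a temporary table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id TEXT PRIMARY KEY);
INSERT INTO users (id) VALUES ('abc'), ('ghk'), ('pqr'), ('xyz');
-- Hypothetical lookup table holding the desired position of each id.
CREATE TEMP TABLE wanted_order (id TEXT PRIMARY KEY, sort_order INTEGER NOT NULL);
""")

# The dynamic order, e.g. received from the application at run time.
desired = ["pqr", "abc", "ghk"]

# Store each id together with its position in the desired list.
conn.executemany(
    "INSERT INTO wanted_order (id, sort_order) VALUES (?, ?)",
    [(user_id, position) for position, user_id in enumerate(desired)],
)

# Join against the lookup table and sort by the stored position.
ordered = [row[0] for row in conn.execute("""
    SELECT u.id
    FROM users u
    INNER JOIN wanted_order w ON w.id = u.id
    ORDER BY w.sort_order
""")]

print(ordered)
```

Note that the inner join drops users that are not in the list ('xyz' here). A LEFT JOIN would keep them, at the cost of having to decide where rows without a position should sort.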

complex sql query help needed

I'm not sure how to write this query in SQL. There are two tables:
**GroupRecords**
Id (int, primary key)
Name (nvarchar)
SchoolYear (datetime)
RecordDate (datetime)
IsUpdate (bit)
**People**
Id (int, primary key)
GroupRecordsId (int, foreign key to GroupRecords.Id)
Name (nvarchar)
Bio (nvarchar)
Location (nvarchar)
I need to return a distinct list of people who belong to GroupRecords that have a SchoolYear of '2000'. In the returned list, People.Name should be unique (no duplicate People.Name); in case of duplication, only the person who belongs to the GroupRecords with the later RecordDate should be returned.
It would probably be better to write a stored procedure for this, right?
This is untested, but it should do what is required in the question.
It selects all details about the person.
The correlated subquery keeps a person only if their group has the latest RecordDate among all groups of the same school year that contain a person with the same name.
SELECT
People.Id,
People.GroupRecordsId,
People.Name,
People.Bio,
People.Location
FROM
People
INNER JOIN GroupRecords ON GroupRecords.Id = People.GroupRecordsId
WHERE
GroupRecords.SchoolYear = '2000/1/1' AND
GroupRecords.RecordDate = (
SELECT
MAX(GR2.RecordDate)
FROM
People AS P2
INNER JOIN GroupRecords AS GR2 ON P2.GroupRecordsId = GR2.Id
WHERE
P2.Name = People.Name AND
GR2.SchoolYear = GroupRecords.SchoolYear
)
Select Distinct ID
From People
Where GroupRecordsID In
(Select Id From GroupRecords
Where SchoolYear = '2000/1/1')
This will produce a distinct list of those individuals in the 2000 class...
but I don't understand what you're getting at with the comment about duplicates... please elaborate...
It reads as though you're talking about when two different people happen to have the same name you don't want them both listed... Is that really what you want?
MySQL specific:
SELECT *
FROM `People`
LEFT JOIN `GroupRecords` ON `GroupRecordsId` = `GroupRecords`.`Id`
WHERE `GroupRecords`.`SchoolYear` = '2000/1/1'
GROUP BY `People`.`Name`
ORDER BY `GroupRecords`.`RecordDate` DESC
"people.name should be unique (no duplicate People.Name)"
? Surely you mean no duplicate People.ID?
"in case of a duplication only the person who belong to the GroupRecords with the later RecordDate should be returned."
There's the rub — that's the bit that it's not obvious how to do in plain SQL. There are a number of approaches to the “For each X, select the row Y with maximum/minimum Z” question; which work and which perform better depend on which database software you're using.
http://kristiannielsen.livejournal.com/6745.html has some good discussion of some of the usual techniques for attacking this (in the context of MySQL, but widely applicable).
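As one concrete instance of the "for each X, select the row Y with maximum Z" pattern, here is a sketch using a correlated subquery, run on SQLite through Python's sqlite3 module. The sample rows and the ISO date strings are made up for illustration, and this is only one of several possible approaches, not necessarily the fastest on any given database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE GroupRecords (
    Id INTEGER PRIMARY KEY, Name TEXT, SchoolYear TEXT,
    RecordDate TEXT, IsUpdate INTEGER
);
CREATE TABLE People (
    Id INTEGER PRIMARY KEY,
    GroupRecordsId INTEGER REFERENCES GroupRecords(Id),
    Name TEXT, Bio TEXT, Location TEXT
);
-- Hypothetical data: 'Ann' appears in two groups of school year 2000.
INSERT INTO GroupRecords VALUES
    (1, 'G1', '2000-01-01', '2000-03-01', 0),
    (2, 'G2', '2000-01-01', '2000-06-01', 0),
    (3, 'G3', '2001-01-01', '2001-09-01', 0);
INSERT INTO People VALUES
    (1, 1, 'Ann', 'old', 'X'),
    (2, 2, 'Ann', 'new', 'Y'),
    (3, 1, 'Bob', 'b', 'Z'),
    (4, 3, 'Ann', 'other year', 'W'),
    (5, 3, 'Cid', 'c', 'V');
""")

# For each name, keep the person whose group has the latest RecordDate
# among the groups of the requested school year.
latest_per_name = conn.execute("""
    SELECT p.Id, p.Name, g.RecordDate
    FROM People p
    INNER JOIN GroupRecords g ON g.Id = p.GroupRecordsId
    WHERE g.SchoolYear = '2000-01-01'
      AND g.RecordDate = (
          SELECT MAX(g2.RecordDate)
          FROM People p2
          INNER JOIN GroupRecords g2 ON g2.Id = p2.GroupRecordsId
          WHERE p2.Name = p.Name
            AND g2.SchoolYear = g.SchoolYear
      )
    ORDER BY p.Name
""").fetchall()

print(latest_per_name)
```

If two groups tie on the latest RecordDate for the same name, this form returns both rows, so an extra tie-breaker would be needed to guarantee uniqueness.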