Preventing a SQLite query from doing USE TEMP B-TREE FOR GROUP BY

I have a table
CREATE TABLE user_records (
pos smallint PRIMARY KEY,
username MEDIUMINT unsigned not null,
anime_id smallint UNSIGNED NOT NULL,
score tinyint not null,
scaled_score DECIMAL(1,5) not null
)
with indexes
(anime_id,username,scaled_score)
(username,anime_id,scaled_score)
(username)
(anime_id)
I know those last two were redundant; I was just testing.
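Written out as DDL (the index names here are just placeholders for this post), those are:
CREATE INDEX idx_anime_user_score ON user_records (anime_id, username, scaled_score);
CREATE INDEX idx_user_anime_score ON user_records (username, anime_id, scaled_score);
CREATE INDEX idx_user ON user_records (username);
CREATE INDEX idx_anime ON user_records (anime_id);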
And lastly here is my query:
select aggregate_func(score2) scores,anime_id from
(select anime_id as anime_id2,username as username2,scaled_score as score2 from user_records where anime_id in(666))
inner join
(select anime_id,username,scaled_score from user_records)
where username = username2
group by anime_id
order by scores desc
limit 1000;
The goal of this query is to run an aggregation function on every combination of scores a user has given to a specified show (666 in this case) and all other shows in the table. I have tried every type of join SQLite supports, which isn't many, and reordering the select statements, but the plan is always the same, except when a cross join puts the unconstrained select first, which takes a very long time for obvious reasons. After executing each part separately, I am confident that the step taking the most time is the USE TEMP B-TREE FOR GROUP BY. My goal is for the query planner to somehow use an index for the GROUP BY, but no matter what I try it chooses the temp B-tree, and the grouping takes dramatically longer as the result set from the join grows.
For reference, this table has 70,000,000 rows of user show ratings and the GROUP BY often has to work on millions of joined rows. Thanks in advance.
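For illustration, one explicit-join rewrite I tried (with avg() standing in for my aggregate function) looks like this; the plan comes out the same, including the temp B-tree for the GROUP BY:
-- Same logic as the nested-select version, written as an explicit self-join.
select avg(a.scaled_score) scores, b.anime_id
from user_records as a
join user_records as b on b.username = a.username
where a.anime_id in (666)
group by b.anime_id
order by scores desc
limit 1000;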

Related

What will be faster for GROUP BY statement

Imagine that I have the following two SQL Server tables:
CREATE TABLE Users (
id INT IDENTITY(1, 1) PRIMARY KEY,
name VARCHAR(100) NOT NULL
)
CREATE TABLE UserLogins (
id INT IDENTITY(1, 1) PRIMARY KEY,
user_id INT REFERENCES Users(id) NOT NULL,
login VARCHAR(100) NOT NULL
)
And I need to get a count of logins for each user; the query result should also contain the user name.
Which query will work faster:
SELECT MAX(name), count(*)
FROM Users u
INNER JOIN UserLogins ul ON ul.user_id = u.id
GROUP BY u.id
or this one:
SELECT name, count(*)
FROM Users u
INNER JOIN UserLogins ul ON ul.user_id = u.id
GROUP BY u.name
So I'm not sure whether it would be better to group by the indexed column and then use a MAX or MIN aggregate function, or to just group by Users.name, which doesn't have any index.
Thank you in advance!
The short answer: the second version is simply wrong, as name is not unique; the first version is correct, although it may not be efficient.
Since name has a functional dependency on id, every unique value of id also defines a value of name. Grouping by name is wrong, because name is not necessarily unique. Grouping only by id means you need to aggregate name, which makes no sense if there is a functional dependency. So you actually want to group by both columns:
SELECT
u.name,
count(*)
FROM Users u
INNER JOIN UserLogins ul ON ul.user_id = u.id
GROUP BY
u.id,
u.name;
Note that id does not actually need to be selected.
This query is almost certainly going to be faster than grouping by name alone, because the server cannot deduce that name is unique and needs to sort and aggregate it.
It may also be faster than grouping by id, although that may depend on whether the optimizer is clever enough to deduce the functional dependency (and therefore no aggregation would be necessary). Even if it isn't clever, this probably won't be slow, as id is already unique, so a scan of an index over id would not require a sort, only aggregation.
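If it is not already there, an index on the foreign key column can also help the join and the per-user count; a minimal sketch (the index name is made up):
-- Hypothetical supporting index on the UserLogins foreign key.
CREATE INDEX IX_UserLogins_user_id ON UserLogins (user_id);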

Order by date, while grouping matches by another column

I have this query
SELECT *, COUNT(app.id) AS totalApps FROM users JOIN app ON app.id = users.id
GROUP BY app.id ORDER BY app.time DESC LIMIT ?
which is supposed to get all results from "users" ordered by another column (time) in a related table (the id from the app table references the id from the users table).
The issue I have is that the grouping is done before the ordering by date, so I get very old results. But I need the grouping in order to get distinct users, because each user can have multiple 'apps'... Is there a different way to achieve this?
Table users:
id TEXT PRIMARY KEY
Table app:
id TEXT
time DATETIME
FOREIGN KEY(id) REFERENCES users(id)
In my SELECT query I want to get a list of users, ordered by the app.time column. But because one user can have multiple app records associated, I could get duplicate users; that's why I used GROUP BY. But then the order is messed up.
The underlying issue is that the SELECT is an aggregate query as it contains a GROUP BY clause :-
There are two types of simple SELECT statement - aggregate and
non-aggregate queries. A simple SELECT statement is an aggregate query
if it contains either a GROUP BY clause or one or more aggregate
functions in the result-set.
SQL As Understood By SQLite - SELECT
And thus the column's value for that group will be an arbitrary value taken from that column within the group (the first one found by the scan/search, I suspect, hence the lower values) :-
If the SELECT statement is an aggregate query without a GROUP BY
clause, then each aggregate expression in the result-set is evaluated
once across the entire dataset. Each non-aggregate expression in the
result-set is evaluated once for an arbitrarily selected row of the
dataset. The same arbitrarily selected row is used for each
non-aggregate expression. Or, if the dataset contains zero rows, then
each non-aggregate expression is evaluated against a row consisting
entirely of NULL values.
So in short you cannot rely upon the column values that aren't part of the group/aggregation, when it's an aggregate query.
Therefore you have to retrieve the required values using an aggregate expression, such as max(app.time). However, you can't ORDER BY this value (not sure exactly why, but it's probably inherent in the efficiency aspect).
HOWEVER
What you can do is use the query to build a CTE and then sort without aggregates involved.
Consider the following, which I think mimics your problem:-
DROP TABLE IF EXISTS users;
DROP TABLE If EXISTS app;
CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, username TEXT);
INSERT INTO users (username) VALUES ('a'),('b'),('c'),('d');
CREATE TABLE app (the_id INTEGER PRIMARY KEY, id INTEGER, appname TEXT, time TEXT);
INSERT INTO app (id,appname,time) VALUES
(4,'app9',721),(4,'app10',7654),(4,'app11',11),
(3,'app1',1000),(3,'app2',7),
(2,'app3',10),(2,'app4',101),(2,'app5',1),
(1,'app6',15),(1,'app7',7),(1,'app8',212),
(4,'app9',721),(4,'app10',7654),(4,'app11',11),
(3,'app1',1000),(3,'app2',7),
(2,'app3',10),(2,'app4',101),(2,'app5',1),
(1,'app6',15),(1,'app7',7),(1,'app8',212)
;
SELECT * FROM users;
SELECT * FROM app;
SELECT username
,count(app.id)
, max(app.time) AS latest_time
, min(app.time) AS earliest_time
FROM users JOIN app ON users.id = app.id
GROUP BY users.id
ORDER BY max(app.time)
;
This results in a result set where, although the latest time for each group has been extracted, the final rows haven't been sorted as you would think.
Wrapping it into a CTE can fix that e.g. :-
WITH cte1 AS
(
SELECT username
,count(app.id)
, max(app.time) AS latest_time
, min(app.time) AS earliest_time
FROM users JOIN app ON users.id = app.id
GROUP BY users.id
)
SELECT * FROM cte1 ORDER BY cast(latest_time AS INTEGER) DESC;
and now the rows come out ordered by latest_time, descending.
Note simple integers have been used instead of real times for my convenience.
Since you need the newest date in every group, you could just MAX them:
SELECT
*,
COUNT(app.id) AS totalApps,
MAX(app.time) AS latestDate
FROM users
JOIN app ON app.id = users.id
GROUP BY app.id
ORDER BY latestDate DESC
LIMIT ?
You could use windowed COUNT:
SELECT *, COUNT(app.id) OVER(PARTITION BY app.id) AS totalApps
FROM users
JOIN app
ON app.id = users.id
ORDER BY app.time DESC
LIMIT ?
Maybe you could use SELECT DISTINCT?
Read more here: https://www.w3schools.com/sql/sql_distinct.asp
Try grouping by id and time and then ordering by time.
select ...
group by app.id, app.time
order by app.time desc
I assume that id is unique in the app table.
And how do you assign the id? Maybe it's enough to order by id desc.

SQL*Plus query help to sum and count from different tables and get the correct answer

This is my query, and it is giving me the wrong amount. How can I fix it? It should give a total of 167700, but it is giving me 2515500 for the total loan. The count is working fine (= 15). I can't change anything on the tables.
create table loan
(loan_number varchar(15) not null,
branch_name varchar(15) not null,
amount number not null,
primary key(loan_number));
create table customer
(customer_name varchar(15) not null,
customer_street varchar(12) not null,
customer_city varchar(15) not null,
primary key(customer_name));
select SUM(amount),
COUNT( distinct customer_name)
from loan,customer;
Simple rule: Never use commas in the FROM clause. Always use explicit, proper JOIN syntax with the conditions in the ON clause. Then you won't forget them!
So:
select SUM(amount),
COUNT( distinct customer_name)
from loan l join
customer c
on l.customerid = c.customerid;
Of course, I made up the names of the columns used for the join, because your question has no information describing the tables.
Duh! No common key in the two tables. How is that? How do you keep track of which customer took which loan?
Why are you running ONE query for data from two UNRELATED tables? The CROSS JOIN you created (by having no join condition whatsoever between the two tables) simply joins every row from the first table to every row from the second table.
It appears the customer table has 15 rows, and all 15 names are distinct. When you COUNT DISTINCT, you get the correct number, even though in the cross join each customer_name appears many times.
On the other hand, each loan amount is repeated 15 times. 167,700 x 15 = 2,515,500.
If you need to show both the total loan amount and the number of (distinct) customers in a single row, you want something like
select (select sum(amount) from loan) as total_amount,
(select count (distinct customer_name) from customer) as distinct_customers
from dual
;

"Data warehouse"-like SQLite store design

I am interested in designing an SQL-based (SQLite, actually) store for an application processing a large number of similar data entries. For this example, let it be a chat message store.
The application has to provide the capabilities of filtering and analyzing the data by message participants, tags, etc., all of which imply N-to-N relationships.
So, the schema (kind of star) will look something like:
create table messages (
message_id INTEGER PRIMARY KEY,
time_stamp INTEGER NOT NULL
-- other fact fields
);
create table users (
user_id INTEGER PRIMARY KEY
-- user dimension data
);
create table message_participants (
user_id INTEGER references users(user_id),
message_id INTEGER references messages(message_id)
);
create table tags (
tag_id INTEGER PRIMARY KEY,
tag_name TEXT NOT NULL
-- tag dimension data
);
create table message_tags (
tag_id INTEGER references tags(tag_id),
message_id INTEGER references messages(message_id)
);
-- etc.
So, all well and good, until I have to perform analytic operations and filtering based on the N-to-N dimensions. Given millions of rows in the messages table and thousands in the dimensions (there are more than shown in the example), all the joins are simply too much of a performance hit.
For example, I would like to analyze the number of messages each user participated in, given the data is filtered based on selected tags, selected users and other aspects:
select U.user_id, U.user_name, count(1)
from messages as M
join message_participants as MP on M.message_id=MP.message_id
join users as U on MP.user_id=U.user_id
where
MP.user_id not in ( /* some user ID's set */ )
and M.time_stamp between #StartTime and #EndTime
-- and ... more fact table field filters
and M.message_id in
(select message_id
from message_tags
where tag_id in ( /* some tag ID's set */ ))
-- and ... more N-to-N filters
group by U.user_id
I am constrained to SQL and, specifically, SQLite. And I do use indices on the tables.
Is there some way I don't see to improve the schema, maybe a clever way to de-normalize it?
Or maybe there is a way to somehow index the dimension keys inside the message row (I thought about using FTS capabilities but not sure if searching the textual index and joining on the results will provide any performance leverage)?
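For concreteness, the FTS idea I was toying with would look roughly like this (FTS5 assumed; each message's tag ids become a space-separated token list, and I haven't measured whether it actually beats the junction-table join):
-- Sketch only: tokens like 't17' are a made-up encoding of tag ids.
create virtual table message_tags_fts using fts5(message_id UNINDEXED, tag_tokens);
-- e.g. message 42 carries tags 3, 17 and 250
insert into message_tags_fts (message_id, tag_tokens) values (42, 't3 t17 t250');
-- messages having tag 17 or tag 250, to be joined back to messages
select message_id from message_tags_fts where message_tags_fts match 't17 OR t250';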
Too long to put in a comment, and might help with performance but isn't exactly a direct answer to your question (your schema seems fine): have you tried messing with your query itself?
I often see that kind of subselect filter for many-to-many, and I have found that on large queries like this I frequently see improvements in performance from running a CTE/join rather than a where blah in (subselect):
;with tagMesages as (
select distinct message_id
from message_tags
where tag_id in ( /* some tag ID's set */ )
) -- more N-to-N filtering
select U.user_id, U.user_name, count(1)
from messages as M
join message_participants as MP on M.message_id=MP.message_id
join users as U on MP.user_id=U.user_id
join tagMesages on M.message_id = tagMesages.message_id
where
MP.user_id not in ( /* some user ID's set */ )
and M.time_stamp between #StartTime and #EndTime
-- and ... more fact table field filters
group by U.user_id
We can tell they're the same, but the query planner can sometimes find this form more helpful.
Disclaimer: I don't do SQLite, I do SQL Server, so sorry if I've made some obvious (or otherwise) error.
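One more thing worth double-checking (the question says indexes are already in place, so these exact ones may well exist): composite indexes on the junction tables, with the filtered column first, so the tag lookup and the participant join can be answered from the index alone. A sketch, with made-up names:
CREATE INDEX IF NOT EXISTS idx_message_tags_tag_msg ON message_tags (tag_id, message_id);
CREATE INDEX IF NOT EXISTS idx_message_participants_msg_user ON message_participants (message_id, user_id);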

Creating a denormalized table from a normalized key-value table using 100s of joins

I have an ETL process that takes values from an input table, a key-value table where each row has a field ID, and turns it into a more denormalized table where each row has all the values. Specifically, this is the input table:
StudentFieldValues (
FieldId INT NOT NULL,
StudentId INT NOT NULL,
Day DATE NOT NULL,
Value FLOAT NULL
)
FieldId is a foreign key from table Field, Day is a foreign key from table Days. The PK is the first 3 fields. There are currently 188 distinct fields. The output table is along the lines of:
StudentDays (
StudentId INT NOT NULL,
Day DATE NOT NULL,
NumberOfClasses FLOAT NULL,
MinutesLateToSchool FLOAT NULL,
... -- the rest of the 188 fields
)
The PK is the first 2 fields.
Currently the query that populates the output table does a self join with StudentFieldValues 188 times, one for each field. Each join equates StudentId and Day and takes a different FieldId. Specifically:
SELECT Students.StudentId, Days.Day,
StudentFieldValues1.Value NumberOfClasses,
StudentFieldValues2.Value MinutesLateToSchool,
...
INTO StudentDays
FROM Students
CROSS JOIN Days
LEFT OUTER JOIN StudentFieldValues StudentFieldValues1
ON Students.StudentId=StudentFieldValues1.StudentId AND
Days.Day=StudentFieldValues1.Day AND
StudentFieldValues1.FieldId=1
LEFT OUTER JOIN StudentFieldValues StudentFieldValues2
ON Students.StudentId=StudentFieldValues2.StudentId AND
Days.Day=StudentFieldValues2.Day AND
StudentFieldValues2.FieldId=2
... -- 188 joins with StudentFieldValues table, one for each FieldId
I'm worried that this system isn't going to scale as more days, students and fields (especially fields) are added to the system. Already there are 188 joins and I keep reading that if you have a query with that number of joins you're doing something wrong. So I'm basically asking: Is this something that's gonna blow up in my face soon? Is there a better way to achieve what I'm trying to do? It's important to note that this query is minimally logged and that's something that wouldn't have been possible if I was adding the fields one after the other.
More details:
MS SQL Server 2014, 2x XEON E5 2690v2 (20 cores, 40 threads total), 128GB RAM. Windows 2008R2.
352 million rows in the input table, 18 million rows in the output table - both expected to increase over time.
Query takes 20 minutes and I'm very happy with that, but performance degrades as I add more fields.
Think about doing this using conditional aggregation:
SELECT s.StudentId, d.Day,
max(case when sfv.FieldId = 1 then sfv.Value end) as NumberOfClasses,
max(case when sfv.FieldId = 2 then sfv.Value end) as MinutesLateToSchool,
...
INTO StudentDays
FROM Students s CROSS JOIN
Days d LEFT OUTER JOIN
StudentFieldValues sfv
ON s.StudentId = sfv.StudentId AND
d.Day = sfv.Day
GROUP BY s.StudentId, d.Day;
This has the advantage of easy scalability. You can add hundreds of fields and the processing time should be comparable (longer, but comparable) to fewer fields. It is also easier to add new fields.
EDIT:
A faster version of this query would use subqueries instead of aggregation:
SELECT s.StudentId, d.Day,
(SELECT TOP 1 sfv.Value FROM StudentFieldValues sfv WHERE sfv.FieldId = 1 and sfv.StudentId = s.StudentId and sfv.Day = d.Day) as NumberOfClasses,
(SELECT TOP 1 sfv.Value FROM StudentFieldValues sfv WHERE sfv.FieldId = 2 and sfv.StudentId = s.StudentId and sfv.Day = d.Day) as MinutesLateToSchool,
...
INTO StudentDays
FROM Students s CROSS JOIN
Days d;
For performance, you want a composite index on StudentFieldValues(StudentId, Day, FieldId, Value).
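Written out as DDL (the index name here is made up; Value could equally go in an INCLUDE clause instead of the key):
-- Hypothetical name; column order as suggested above.
CREATE INDEX IX_StudentFieldValues_Student_Day_Field
ON StudentFieldValues (StudentId, Day, FieldId, Value);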
Yes, this is going to blow up. You have your definitions of "normalized" and "denormalized" backwards. The Field/Value table design is not a relational design. It's a variation of the entity-attribute-value design, which has all sorts of problems.
I recommend you do not try to pivot the data in an SQL query. It doesn't scale well that way. Instead, query it as a set of rows, as it is stored in the database, and fetch the result set back into your application. There you write code to read the data row by row, and apply the "fields" to fields of an object or a hashmap or something.
I think there may be some trial and error here to see what works but here are some things you can try:
Disable indexes and re-enable after data load is complete
Disable any triggers that don't need to be ran upon data load scenarios.
The above was taken from an MSDN post where someone was doing something similar to what you are doing.
Think about updating the de-normalized table based only on changed records, if that is possible. Limiting the result set would be much more efficient.
You could try a more threaded, iterative approach in code (C#, VB, etc.) to build this table student by student, so you aren't doing all 188 joins at one time.