Group By clause in SQLite - sql

Aim:
I would like to query the table to pick only the latest version of each item.
Question:
Why does Query1 work in SQLite? (I was expecting the GROUP BY clause to throw an error, because the SELECT list contains the column content and it is not part of the GROUP BY clause.)
Would Query1 throw an error in Oracle?
Is Query1 better than Query2?
Is there a better way to write the query ?
Query1:
select item_id,
max(version_number),
content
from item_version
group by item_id;
Query2:
select iv.*
from item_version iv,
(select item_id,
max(version_number) latest_version_number
from item_version
group by item_id) liv
where iv.item_id = liv.item_id
and iv.version_number = liv.latest_version_number;
Setting up the table:
create table item_version(
item_id varchar,
version_number integer,
content varchar,
primary key (item_id, version_number)
);
insert into item_version values (1, 1, null);
insert into item_version values (2, 1, 'Content A');
insert into item_version values (2, 2, 'Content B');
insert into item_version values (3, 1, 'Content C');
insert into item_version values (3, 2, null);
insert into item_version values (4, 1, 'Content D');
insert into item_version values (4, 2, null);

From the documentation:
In most SQL implementations, output columns of an aggregate query may only reference aggregate functions or columns named in the GROUP BY clause. It does not make good sense to reference an ordinary column in an aggregate query because each output row might be composed from two or more rows in the input table(s).
SQLite does not impose this restriction. The output columns from an aggregate query can be arbitrary expressions that include columns not found in GROUP BY clause.
With SQLite (but not any other SQL implementation that we know of) if an aggregate query contains a single min() or max() function, then the values of columns used in the output are taken from the row where the min() or max() value was achieved. If two or more rows have the same min() or max() value, then the columns values will be chosen arbitrarily from one of those rows.
For example to find the highest paid employee:
SELECT max(salary), first_name, last_name FROM employee;
In the query above, the values for the first_name and last_name columns will correspond to the row that satisfied the max(salary) condition.
If a query contains no aggregate functions at all, then a GROUP BY clause can be added as a substitute for a DISTINCT ON clause. In other words, output rows are filtered so that only one row is shown for each distinct set of values in the GROUP BY clause. If two or more output rows would have otherwise had the same set of values for the GROUP BY columns, then one of the rows is chosen arbitrarily.
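As a small illustration of that last point, using the item_version table from the question, grouping with no aggregate behaves like DISTINCT (a sketch; row order is not guaranteed in either case):
-- Both return one row per distinct item_id in SQLite
select item_id from item_version group by item_id;
select distinct item_id from item_version;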
Your Query1 would cause an error in most databases, yes; in Oracle, for example, it is typically rejected with ORA-00979 ("not a GROUP BY expression") because content is neither aggregated nor named in the GROUP BY clause. But as long as you're only going to use it with SQLite, it's perfectly fine.
An alternative way of finding the highest version of each item uses the window functions added in SQLite 3.25:
SELECT item_id, version_number, content
FROM (SELECT item_id, version_number, content
, row_number() OVER (PARTITION BY item_id ORDER BY version_number DESC) AS rnk
FROM item_version) AS sq
WHERE rnk = 1
ORDER BY item_id;
giving
item_id     version_number  content
----------  --------------  ----------
1           1
2           2               Content B
3           2
4           2
This one should work on other databases like Oracle, as long as they support window functions too.

Shawn does a really good job of explaining the issue. A typical way to solve this uses a correlated subquery:
select iv.*
from item_version iv
where iv.version_number = (select max(iv2.version_number)
from item_version iv2
where iv2.item_id = iv.item_id
);
With an index on item_version(item_id, version_number) this may be the fastest way to get the results that you want. You already have this index with your primary key definition.
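If you want to confirm that the primary-key index is actually used, SQLite's EXPLAIN QUERY PLAN can show it; a quick sketch (the exact plan text varies between SQLite versions):
-- Typically reports a SEARCH of item_version using the primary-key index
-- for the correlated subquery.
explain query plan
select iv.*
from item_version iv
where iv.version_number = (select max(iv2.version_number)
                           from item_version iv2
                           where iv2.item_id = iv.item_id);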

Related

Is the ordering of a GROUP BY with a MAX aggregate well defined?

Let's assume I run the following in SQLite:
CREATE TABLE my_table
(
id INTEGER PRIMARY KEY,
NAME VARCHAR(20),
date DATE,
num INTEGER,
important VARCHAR(20)
);
INSERT INTO my_table (NAME, date, num, important)
VALUES ('A', '2000-01-01', 10, 'Important 1');
INSERT INTO my_table (NAME, date, num, important)
VALUES ('A', '2000-02-01', 20, 'Important 2');
INSERT INTO my_table (NAME, date, num, important)
VALUES ('A', '1999-12-01', 30, 'Important 3');
The table looks like this:
+----+------+------------+-----+-------------+
| id | NAME | date       | num | important   |
+----+------+------------+-----+-------------+
|  1 | A    | 2000-01-01 |  10 | Important 1 |
|  2 | A    | 2000-02-01 |  20 | Important 2 |
|  3 | A    | 1999-12-01 |  30 | Important 3 |
+----+------+------------+-----+-------------+
If I execute:
SELECT id
FROM my_table
GROUP BY NAME;
the results are:
+----+
| id |
+----+
| 1 |
+----+
If I execute:
SELECT id, MAX(date)
FROM my_table
GROUP BY NAME;
The results are:
+----+------------+
| id | max(date) |
+----+------------+
| 2 | 2000-02-01 |
+----+------------+
And if I execute:
SELECT id,
MAX(date),
MAX(num)
FROM my_table
GROUP BY NAME;
The results are:
+----+------------+----------+
| id | max(date) | max(num) |
+----+------------+----------+
| 3 | 2000-02-01 | 30 |
+----+------------+----------+
My question is, is this well defined? Specifically, am I guaranteed to always get id = 2 when doing the second query (with the single Max(date) aggregate), or is this just a side effect of how SQLite is likely ordering the table to grab the Max before grouping?
I ask this because I specifically do want id = 2. I will then execute another query that selects the important field for that row (for my actual problem the first query would return multiple ids and I'd select all important fields for all those rows at once).
Additionally, this is all happening in an iOS Core Data query, so I'm not able to do more complicated subqueries. If I knew that the ordering of a GROUP BY is defined by an aggregate then I'd feel pretty confident my queries wouldn't break (until Apple moves away from SQLite for Core Data).
Thanks!
From the Sqlite manual
2.5. Bare columns in an aggregate query
The usual case is that all column names in an aggregate query are either arguments to aggregate functions or else appear in the GROUP BY clause. A result column which contains a column name that is not within an aggregate function and that does not appear in the GROUP BY clause (if one exists) is called a "bare" column. Example:
SELECT a, b, sum(c) FROM tab1 GROUP BY a;
In the query above, the "a" column is part of the GROUP BY clause and so each row of the output contains one of the distinct values for "a". The "c" column is contained within the sum() aggregate function and so that output column is the sum of all "c" values in rows that have the same value for "a". But what is the result of the bare column "b"? The answer is that the "b" result will be the value for "b" in one of the input rows that form the aggregate. The problem is that you usually do not know which input row is used to compute "b", and so in many cases the value for "b" is undefined.
Special processing occurs when the aggregate function is either min() or max(). Example:
SELECT a, b, max(c) FROM tab1 GROUP BY a;
When the min() or max() aggregate functions are used in an aggregate query, all bare columns in the result set take values from the input row which also contains the minimum or maximum. So in the query above, the value of the "b" column in the output will be the value of the "b" column in the input row that has the largest "c" value. There is still an ambiguity if two or more of the input rows have the same minimum or maximum value or if the query contains more than one min() and/or max() aggregate function. Only the built-in min() and max() functions work this way.
If bare columns appear in an aggregate query that lacks a GROUP BY clause, and the number of input rows is zero, then the values of the bare columns are arbitrary. For example, in this query:
SELECT count(*), b FROM tab1;
If the tab1 table contains no rows (so count(*) evaluates to 0) then the bare column "b" will have an arbitrary and meaningless value.
Most other SQL database engines disallow bare columns. If you include a bare column in a query, other database engines will usually raise an error. The ability to include bare columns in a query is an SQLite-specific extension.
https://www.sqlite.org/lang_select.html
am I guaranteed to always get id = 2 when doing the second query (with the single Max(date) aggregate), or is this just a side effect of how SQLite is likely ordering the table to grab the Max before grouping?
Yes, the result that you get is guaranteed because it is documented in Bare columns in an aggregate query.
The value for the column id that you get is from the row that contains the max date.
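If it helps, the same bare-column rule means the follow-up lookup can be folded into one query; a small sketch against the my_table example above (important is taken from the row that holds the max date):
SELECT id, important, MAX(date)
FROM my_table
GROUP BY NAME;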

Deduping data in BigQuery

I have a query that shows only the non-duplicate values; I am looking for a solution on how to use this deduped data in other queries.
I do not have permissions to create anything, so I need to find a solution for that.
IDAN
EDIT (from "answer"):
These are the fields in my table "Purchases": user_id, purchase_amount, purchase_sku, source, device_type, and uuid (a unique identifier for each row).
A row is considered a duplicate when all fields except the uuid are identical. I need to return the deduplicated data and prepare it for use in other queries.
This is the basic data, with duplicated values in rows 5-6 and 7-8 (screenshot not included here).
I want to show the non-duplicate rows, and for each set of duplicates show only one row, like this:
[deduped data screenshot]
Consider the generic solution below - you do not need to list all the column names at all; only uuid is used in the query:
select any_value(t).*
from `project.dataset.table` t
group by to_json_string((select as struct * except(uuid) from unnest([t])))
You can use qualify with row_number():
select p.*
from purchases p
where 1=1
qualify row_number() over (partition by user_id, purchase_amount, purchase_sku, source, device_type order by uuid) = 1;
You can also use aggregation:
select user_id, purchase_amount, purchase_sku, source, device_type,
       min(uuid) as uuid
from purchases
group by 1, 2, 3, 4, 5;
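Since you cannot create tables or views, one way to reuse the deduplicated rows in other queries is to wrap either variant in a WITH clause. A rough sketch (the downstream aggregation is only an illustrative example):
with deduped_purchases as (
  select p.*
  from purchases p
  where 1=1
  qualify row_number() over (partition by user_id, purchase_amount, purchase_sku,
                             source, device_type
                             order by uuid) = 1
)
select source, sum(purchase_amount) as total_amount
from deduped_purchases
group by source;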

Get unique records from table avoiding all duplicates based on two key columns

I have a table Trial_tb with columns p_id, t_number and rundate.
Sample values:
p_id | t_number | rundate
=========================
111  | 333      | 1/7/2016
111  | 333      | 1/1/2016
222  | 888      | 1/8/2016
222  | 444      | 1/2/2016
666  | 888      | 1/6/2016
555  | 777      | 1/5/2016
p_id and t_number are key columns. I need to fetch values such that the result does not contain any record in which the p_id-t_number combination is duplicated. For example, there is duplication for 111|333, so those rows are not valid. The query should fetch all records other than the first two.
I wrote the script below, but it fetches only the last record. :(
select rundate,p_id,t_number from
(
select rundate,p_id,t_number,
count(p_id) over (partition by p_id) PCnt,
count(t_number) over (partition by t_number) TCnt
from trialtb
)a
where a.PCnt=1 and a.TCnt=1
The HAVING clause is ideal for this job. HAVING allows you to filter on aggregated records.
-- Finding unique combinations.
SELECT
p_id,
t_number
FROM
trialtb
GROUP BY
p_id,
t_number
HAVING
COUNT(*) = 1
;
This query returns combinations of p_id and t_number that occur only once.
If you want to include rundate you could add MAX(rundate) AS rundate to the select clause. Because you are only looking at unique occurrences the max or min would always be the same.
Do you mean:
select
p_id,t_number
from
trialtb
group by
p_id,t_number
having
count(*) = 1
or do you need the run date too?
select
p_id,t_number,max(rundate)
from
trialtb
group by
p_id,t_number
having
count(*) = 1
Seeing as you are only looking at items with one result, using max or min should work fine.
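For what it's worth, your original window-function attempt can also be fixed by counting over both key columns together rather than each one separately; a sketch:
select rundate,p_id,t_number from
(
select rundate,p_id,t_number,
count(*) over (partition by p_id, t_number) PairCnt
from trialtb
)a
where a.PairCnt = 1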

SQL unpivot & insert

Sorry for the lack of info -- SQL Server 2008.
I'm struggling to get a couple of column values from table A into a new row in table B for each row in A where a column isn't null.
Table A's structure is as:
UserID | ClientUserID | ClientSessionID | [and a load of other irrelevant columns]
Table B:
UserID | Name | Value
I want to create rows in table B for each non-null ClientUserID or ClientSessionID in A - using the column name as B's "Name", and the column value as B's "Value".
I'm struggling to write my "unpivot" statement - just getting the syntax correct! I'm trying to follow along with some samples but can't get it right.
Here's my SQL query so far - any further help would be appreciated (just getting this SELECT is frustrating me, let alone doing the insert!)
SELECT UserID, ClientUserID, ClientSessionID FROM websiteuser WHERE ClientSessionID IS NOT null
This gives me the rows that I need to perform actions upon -- but I just can't get the syntax correct for UNPIVOTing this data and turning it into my insert.
You can unpivot records in this fashion by using UNION to get each new row:
INSERT INTO TableB (UserID, Name, Value)
SELECT UserID, 'ClientUserID' AS Name, ClientUserID AS Value
FROM TableA
WHERE ClientUserID IS NOT NULL
UNION ALL
SELECT UserID, 'ClientSessionID' AS Name, ClientSessionID AS Value
FROM TableA
WHERE ClientSessionID IS NOT NULL
I am using UNION ALL in this case as UNION implies a DISTINCT operation across the entire set, which should normally be unnecessary when unpivoting unique records.
If your ClientUserID and ClientSessionID columns are not the same datatype, you may have to cast one or both to the same.
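If you prefer the dedicated operator, SQL Server 2008 also supports UNPIVOT, which drops NULL values automatically. A sketch using the column names from the question (UNPIVOT requires the unpivoted columns to share a data type, so add CASTs in the inner SELECT if ClientUserID and ClientSessionID differ):
INSERT INTO TableB (UserID, Name, Value)
SELECT up.UserID, up.Name, up.Value
FROM (SELECT UserID, ClientUserID, ClientSessionID
      FROM TableA) AS src
UNPIVOT (Value FOR Name IN (ClientUserID, ClientSessionID)) AS up;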

Normalizing a table, from one to the other

I'm trying to normalize a MySQL database....
I currently have a table that contains 11 columns for "categories". The first column is a user_id and the other 10 are category_id_1 - category_id_10. Some rows may only have category IDs filled in up to category_id_1, and the rest might be NULL.
I then have a table that has 2 columns, user_id and category_id...
What is the best way to transfer all of the data into separate rows in table 2 without adding a row for columns that are NULL in table 1?
thanks!
You can create a single query to do all the work, it just takes a bit of copy and pasting, and adjusting the column name:
INSERT INTO table2
SELECT * FROM (
SELECT user_id, category_id_1 AS category_id FROM table1
UNION ALL
SELECT user_id, category_id_2 FROM table1
UNION ALL
SELECT user_id, category_id_3 FROM table1
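    -- ...repeat the UNION ALL block for category_id_4 through category_id_10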
) AS T
WHERE category_id IS NOT NULL;
Since you only have to do this 10 times, and you can throw the code away when you are finished, I would think that this is the easiest way.
One table for users:
users(id, name, username, etc)
One for categories:
categories(id, category_name)
One to link the two, including any extra information you might want on that join.
categories_users(user_id, category_id)
-- or with extra information --
categories_users(user_id, category_id, date_created, notes)
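A minimal MySQL sketch of the basic link table (column types and constraints are assumptions; adjust them to match your existing users and categories tables):
CREATE TABLE categories_users (
    user_id INT NOT NULL,
    category_id INT NOT NULL,
    PRIMARY KEY (user_id, category_id),
    FOREIGN KEY (user_id) REFERENCES users (id),
    FOREIGN KEY (category_id) REFERENCES categories (id)
);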
To transfer the data across to the link table would be a case of writing a series of SQL INSERT statements. There's probably some awesome way to do it in one go, but since there are only 10 category columns, just copy-and-paste IMO:
INSERT INTO categories_users
SELECT user_id, category_id_1
FROM old_categories
WHERE category_id_1 IS NOT NULL