Create DISTINCT list of users from table with NULL values - sql

DB-Fiddle
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
customer VARCHAR(255),
confirmed_date DATE,
first_order DATE
);
INSERT INTO customers
(customer, confirmed_date, first_order)
VALUES
('user_01', '2020-03-12', '2020-04-10'),
('user_01', NULL , '2020-04-10'),
('user_02', '2020-04-07', '2020-05-28'),
('user_03', '2020-05-19', '2020-05-22'),
('user_04', NULL, '2020-07-09'),
('user_05', '2020-06-03', '2020-06-04'),
('user_05', NULL , '2020-06-04'),
('user_06', '2020-07-18', '2020-10-23');
Expected Result:
customer | confirmed_date | first_order |
----------|-----------------|----------------|------
user_01 | 2020-03-12 | 2020-04-10 |
user_02 | 2020-04-07 | 2020-05-28 |
user_03 | 2020-05-19 | 2020-05-22 |
user_04 | NULL | 2020-07-09 |
user_05 | 2020-06-03 | 2020-06-04 |
user_06 | 2020-07-18 | 2020-10-23 |
I want to list all DISTINCT users from the results in the table.
However, the data includes:
a) users that do not have any confirmed_date (e.g. user_04)
b) users that appear with one row that includes a confirmed_date and another row without a confirmed_date (e.g. user_01, user_05)
In case of a) I want to include the user with confirmed_date NULL.
In case of b) I want to use the row which includes a confirmed_date
So far I came up with this query:
SELECT
DISTINCT c.customer AS customer,
c.confirmed_date AS confirmed_date,
c.first_order AS first_order
FROM customers c
WHERE c.confirmed_date IS NOT NULL
ORDER BY 1;
It almost provides the expected results but excludes user_04.
How do I need to modify it to get the correct results?

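The first "confirmed" row, or the only existing row, per customer. In PostgreSQL, NULLs sort last in ascending order by default, so a row with a confirmed_date is numbered first: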
SELECT *
FROM (
SELECT *, row_number() over(partition by customer order by confirmed_date) rn
FROM customers c
) t
WHERE rn = 1;

Use DISTINCT ON. The second ORDER BY expression sorts rows with a non-null confirmed_date first within each customer, so DISTINCT ON picks such a row whenever one exists.
SELECT distinct on (customer) *
from customers
order by customer,
case when confirmed_date is null then 1 else 0 end,
confirmed_date;
The result is what you expect.
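A minimal sketch for checking the NULL-last sort key without a PostgreSQL server: it loads the question's rows into an in-memory SQLite database through Python's `sqlite3` module. SQLite has no `DISTINCT ON`, so the sketch swaps in `row_number()` with the same `CASE` expression; the function name `pick_rows` is made up for this illustration, and window functions need SQLite 3.25 or later.

```python
import sqlite3

# Rows copied from the question; dates stored as ISO text (an assumption of this sketch).
ROWS = [
    ("user_01", "2020-03-12", "2020-04-10"),
    ("user_01", None, "2020-04-10"),
    ("user_02", "2020-04-07", "2020-05-28"),
    ("user_03", "2020-05-19", "2020-05-22"),
    ("user_04", None, "2020-07-09"),
    ("user_05", "2020-06-03", "2020-06-04"),
    ("user_05", None, "2020-06-04"),
    ("user_06", "2020-07-18", "2020-10-23"),
]

QUERY = """
SELECT customer, confirmed_date, first_order
FROM (
    SELECT *,
           row_number() OVER (
               PARTITION BY customer
               -- 0 for rows with a date, 1 for NULL, so dated rows come first
               ORDER BY CASE WHEN confirmed_date IS NULL THEN 1 ELSE 0 END,
                        confirmed_date
           ) AS rn
    FROM customers
) t
WHERE rn = 1
ORDER BY customer;
"""


def pick_rows():
    """Return one row per customer, preferring rows that have a confirmed_date."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, customer TEXT, "
        "confirmed_date TEXT, first_order TEXT)"
    )
    con.executemany(
        "INSERT INTO customers (customer, confirmed_date, first_order) VALUES (?, ?, ?)",
        ROWS,
    )
    rows = con.execute(QUERY).fetchall()
    con.close()
    return rows
```

Calling `pick_rows()` returns six rows, with user_04 kept even though its confirmed_date is NULL.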

If I understand your requirement correctly, you should be able to just aggregate by customer and then select the max value of the date and order columns:
SELECT customer, MAX(confirmed_date) AS confirmed_date, MAX(first_order) AS first_order
FROM customers
GROUP BY customer;
Just for reference, the MAX function (along with most other aggregate functions) ignores NULL values by default. So taking the MAX here works well, because it disregards the missing dates and instead selects the non-NULL values.
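To see the NULL handling in isolation, here is a small sketch that runs the same aggregate against an in-memory SQLite database via Python's `sqlite3` module (SQLite stands in for PostgreSQL here, and the function name `max_per_customer` is hypothetical). It uses three rows from the question: a customer with one dated and one undated row, and a customer with no date at all.

```python
import sqlite3


def max_per_customer():
    """Aggregate per customer; MAX skips NULLs and yields NULL only if every value is NULL."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE customers (customer TEXT, confirmed_date TEXT, first_order TEXT)"
    )
    con.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [
            ("user_01", "2020-03-12", "2020-04-10"),
            ("user_01", None, "2020-04-10"),
            ("user_04", None, "2020-07-09"),
        ],
    )
    rows = con.execute(
        """
        SELECT customer, MAX(confirmed_date), MAX(first_order)
        FROM customers
        GROUP BY customer
        ORDER BY customer
        """
    ).fetchall()
    con.close()
    return rows
```

One thing to keep in mind: each MAX is computed independently, so if a customer's rows carried different first_order values, the output could combine a confirmed_date from one row with a first_order from another.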

Related

SQL to find max of sum of data in one table, with extra columns

Apologies if this has been asked elsewhere. I have been looking on Stack Overflow all day and haven't found an answer yet. I am struggling to write a query to find the highest month's sales for each state from this example data.
The data looks like this:
| order_id | month | cust_id | state | prod_id | order_total |
+-----------+--------+----------+--------+----------+--------------+
| 67212 | June | 10001 | ca | 909 | 13 |
| 69090 | June | 10011 | fl | 44 | 76 |
... etc ...
My query
SELECT `month`, `state`, SUM(order_total) AS sales
FROM orders GROUP BY `month`, `state`
ORDER BY sales;
| month | state | sales |
+------------+--------+--------+
| September | wy | 435 |
| January | wy | 631 |
... etc ...
returns a few hundred rows: the sum of sales for each month for each state. I want it to only return the month with the highest sum of sales, but for each state. It might be a different month for different states.
This query
SELECT `state`, MAX(order_sum) as topmonth
FROM (SELECT `state`, SUM(order_total) order_sum FROM orders GROUP BY `month`,`state`)
GROUP BY `state`;
| state | topmonth |
+--------+-----------+
| ca | 119586 |
| ga | 30140 |
returns the correct number of rows with the correct data. BUT I would also like the query to give me the month column. Whatever I try with GROUP BY, I cannot find a way to limit the results to one record per state. I have tried PartitionBy without success, and have also tried unsuccessfully to do a join.
TL;DR: one query gives me the correct columns but too many rows; the other query gives me the correct number of rows (and the correct data) but insufficient columns.
Any suggestions to make this work would be most gratefully received.
I am using Apache Drill, which is apparently ANSI-SQL compliant. Hopefully that doesn't make much difference - I am assuming that the solution would be similar across all SQL engines.
This one should do the trick
SELECT t1.`month`, t1.`state`, t1.`sales`
FROM (
/* this one selects month, state and sales*/
SELECT `month`, `state`, SUM(order_total) AS sales
FROM orders
GROUP BY `month`, `state`
) AS t1
JOIN (
/* this one selects the best value for each state */
SELECT `state`, MAX(sales) AS best_month
FROM (
SELECT `month`, `state`, SUM(order_total) AS sales
FROM orders
GROUP BY `month`, `state`
)
GROUP BY `state`
) AS t2
ON t1.`state` = t2.`state` AND
t1.`sales` = t2.`best_month`
It's basically the combination of the two queries you wrote.
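The join can be tried out on a handful of rows. The sketch below is an assumption-laden stand-in: it uses SQLite through Python's `sqlite3` module rather than Apache Drill, invents six sample orders, and writes the monthly totals once as a common table expression named `monthly` instead of repeating the subquery. The function name `best_month_per_state` is made up for the example.

```python
import sqlite3

# Hypothetical sample data: (month, state, order_total).
ORDERS = [
    ("June", "ca", 13),
    ("June", "ca", 20),
    ("July", "ca", 10),
    ("June", "fl", 76),
    ("July", "fl", 80),
    ("July", "fl", 5),
]

QUERY = """
WITH monthly AS (
    -- total sales per month and state
    SELECT month, state, SUM(order_total) AS sales
    FROM orders
    GROUP BY month, state
)
SELECT m.month, m.state, m.sales
FROM monthly AS m
JOIN (
    -- best monthly total per state
    SELECT state, MAX(sales) AS best_month
    FROM monthly
    GROUP BY state
) AS b
  ON m.state = b.state AND m.sales = b.best_month
ORDER BY m.state;
"""


def best_month_per_state():
    """Return (month, state, sales) for the top-selling month of each state."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (month TEXT, state TEXT, order_total INTEGER)")
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)", ORDERS)
    rows = con.execute(QUERY).fetchall()
    con.close()
    return rows
```

If two months tie for the best total in a state, this join returns both of them, so "one row per state" only holds when the maximum is unique.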
Try this:
SELECT `month`, `state`, SUM(order_total) FROM orders WHERE `month` IN
( SELECT TOP 1 t.month FROM ( SELECT `month` AS month, SUM(order_total) order_sum FROM orders GROUP BY `month`
ORDER BY order_sum DESC) t)
GROUP BY `month`, state ;

Query to count the frequency of many-to-many associations

I have two tables with a many-to-many association in postgresql. The first table contains activities, each of which may have zero or more reasons:
CREATE TABLE activity (
id integer NOT NULL,
-- other fields removed for readability
);
CREATE TABLE reason (
id varchar(1) NOT NULL,
-- other fields here
);
For performing the association, a join table exists between those two tables:
CREATE TABLE activity_reason (
activity_id integer NOT NULL, -- refers to activity.id
reason_id varchar(1) NOT NULL, -- refers to reason.id
CONSTRAINT activity_reason_activity FOREIGN KEY (activity_id) REFERENCES activity (id),
CONSTRAINT activity_reason_reason FOREIGN KEY (reason_id) REFERENCES reason (id)
);
I would like to count the possible association between activities and reasons. Supposing I have those records in the table activity_reason:
+--------------+------------+
| activity_id | reason_id |
+--------------+------------+
| 1 | A |
| 1 | B |
| 2 | A |
| 2 | B |
| 3 | A |
| 4 | C |
| 4 | D |
| 4 | E |
+--------------+------------+
I should have something like:
+-------+---+------+-------+
| count | | | |
+-------+---+------+-------+
| 2 | A | B | NULL |
| 1 | A | NULL | NULL |
| 1 | C | D | E |
+-------+---+------+-------+
Or, eventually, something like :
+-------+-------+
| count | |
+-------+-------+
| 2 | A,B |
| 1 | A |
| 1 | C,D,E |
+-------+-------+
I can't find the SQL query to do this.
I think you can get what you want using this query:
SELECT count(*) as count, reasons
FROM (
SELECT activity_id, array_agg(reason_id) AS reasons
FROM (
SELECT A.id AS activity_id, AR.reason_id
FROM activity A
LEFT JOIN activity_reason AR ON AR.activity_id = A.id
ORDER BY A.id, AR.reason_id
) AS ordered_reasons
GROUP BY activity_id
) reason_arrays
GROUP BY reasons
First you aggregate all the reasons for an activity into an array for each activity. You have to order the associations first, otherwise ['a','b'] and ['b','a'] will be considered different sets and will have individual counts. You also need to include the join or any activity that doesn't have any reasons won't show up in the result set. I'm not sure if that is desirable or not, I can take it back out if you want activities that don't have a reason to not be included. Then you count the number of activities that have the same sets of reasons.
Here is a sqlfiddle to demonstrate
As mentioned by Gordon Linoff you could also use a string instead of an array. I'm not sure which would be better for performance.
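The two steps (collect each activity's reasons in a fixed order, then count identical collections) can be sketched outside the database as well. This is plain Python over the question's sample rows, not the SQL itself; the function name `count_reason_sets` is hypothetical, and activities without any reason are not represented because the input only contains join-table rows.

```python
from collections import Counter, defaultdict

# (activity_id, reason_id) pairs from the question's activity_reason sample.
ACTIVITY_REASON = [
    (1, "A"), (1, "B"),
    (2, "A"), (2, "B"),
    (3, "A"),
    (4, "C"), (4, "D"), (4, "E"),
]


def count_reason_sets(pairs):
    """Count how many activities share each set of reasons."""
    by_activity = defaultdict(list)
    for activity_id, reason_id in pairs:
        by_activity[activity_id].append(reason_id)
    # Sorting before comparing makes ('A', 'B') and ('B', 'A') the same key.
    return Counter(tuple(sorted(reasons)) for reasons in by_activity.values())
```

Without the `sorted()` call, two activities whose rows arrive in a different order would be counted as different sets, which is the same issue the inner ORDER BY addresses in the query.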
We need to compare sorted lists of reasons to identify equal sets.
SELECT count(*) AS ct, reason_list
FROM (
SELECT array_agg(reason_id) AS reason_list
FROM (SELECT * FROM activity_reason ORDER BY activity_id, reason_id) ar1
GROUP BY activity_id
) ar2
GROUP BY reason_list
ORDER BY ct DESC, reason_list;
ORDER BY reason_id in the innermost subquery would work, too, but adding activity_id is typically faster.
And we don't strictly need the innermost subquery at all. This works as well:
SELECT count(*) AS ct, reason_list
FROM (
SELECT array_agg(reason_id ORDER BY reason_id) AS reason_list
FROM activity_reason
GROUP BY activity_id
) ar2
GROUP BY reason_list
ORDER BY ct DESC, reason_list;
But it's typically slower for processing all or most of the table. Quoting the manual:
Alternatively, supplying the input values from a sorted subquery will usually work.
We could use string_agg() instead of array_agg(), and that would work for your example with varchar(1) (which might be more efficient with data type "char", btw). It can fail for longer strings, though. The aggregated value can be ambiguous.
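A tiny illustration of that ambiguity, written in Python rather than SQL: when a reason value may itself contain the separator, two different sets can collapse into the same aggregated string, whereas comparing them as sequences keeps them apart. The values and the helper name `joined` are invented for the example.

```python
def joined(reasons, sep=","):
    """Mimic string_agg(reason_id, ',' ORDER BY reason_id) on a list of strings."""
    return sep.join(sorted(reasons))


# One reason whose id contains a comma, versus two separate reasons.
set_one = ["a,b"]
set_two = ["a", "b"]

same_string = joined(set_one) == joined(set_two)                 # both are "a,b"
same_array = tuple(sorted(set_one)) == tuple(sorted(set_two))    # arrays still differ
```

With single-character ids as in the question, the separator can never appear inside a value, so the string form is safe there.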
If reason_id were an integer (as it typically is), there is another, faster solution with sort() from the additional module intarray:
SELECT count(*) AS ct, reason_list
FROM (
SELECT sort(array_agg(reason_id)) AS reason_list
FROM activity_reason2
GROUP BY activity_id
) ar2
GROUP BY reason_list
ORDER BY ct DESC, reason_list;
Related, with more explanation:
Compare arrays for equality, ignoring order of elements
Storing and comparing unique combinations
You can do this using string_agg():
select reasons, count(*)
from (select activity_id, string_agg(reason_id, ',' order by reason_id) as reasons
from activity_reason
group by activity_id
) a
group by reasons
order by count(*) desc;

PostgreSQL return multiple rows with DISTINCT though only latest date per second column

Let's say I have the following database table (date truncated for example only, the two 'id_' prefix columns join with other tables)...
+-----------+---------+------+--------------------+-------+
| id_table1 | id_tab2 | date | description | price |
+-----------+---------+------+--------------------+-------+
| 1 | 11 | 2014 | man-eating-waffles | 1.46 |
+-----------+---------+------+--------------------+-------+
| 2 | 22 | 2014 | Flying Shoes | 8.99 |
+-----------+---------+------+--------------------+-------+
| 3 | 44 | 2015 | Flying Shoes | 12.99 |
+-----------+---------+------+--------------------+-------+
...and I have a query like the following...
SELECT id, date, description FROM inventory ORDER BY date ASC;
How do I SELECT all the descriptions, but only once each, while also returning only the latest year for each description? So I need the query to return the first and last rows from the sample data above; the second is not returned because the last row has a later date.
Postgres has something called distinct on. This is usually more efficient than using window functions. So, an alternative method would be:
SELECT distinct on (description) id, date, description
FROM inventory
ORDER BY description, date desc;
The row_number window function should do the trick:
SELECT id, date, description
FROM (SELECT id, date, description,
ROW_NUMBER() OVER (PARTITION BY description
ORDER BY date DESC) AS rn
FROM inventory) t
WHERE rn = 1
ORDER BY date ASC;

Trending sum over time

I have a table (in Postgres 9.1) that looks something like this:
CREATE TABLE actions (
user_id INTEGER,
date DATE,
action VARCHAR(255),
count INTEGER
);
For example:
user_id | date | action | count
---------------+------------+--------------+-------
1 | 2013-01-01 | Email | 1
1 | 2013-01-02 | Call | 3
1 | 2013-01-03 | Email | 3
1 | 2013-01-04 | Call | 2
1 | 2013-01-04 | Voicemail | 2
1 | 2013-01-04 | Email | 2
2 | 2013-01-04 | Email | 2
I would like to be able to view a user's total actions over time for a specific set of actions; for example, Calls + Emails:
user_id | date | count
-----------+-------------+---------
1 | 2013-01-01 | 1
1 | 2013-01-02 | 4
1 | 2013-01-03 | 7
1 | 2013-01-04 | 11
2 | 2013-01-04 | 2
The monstrosity that I've created so far looks like this:
SELECT
date, user_id, SUM(count) OVER (PARTITION BY user_id ORDER BY date) AS count
FROM
actions
WHERE
action IN ('Call', 'Email')
GROUP BY
user_id, date, count;
Which works for single actions, but seems to break for multiple actions when they happen on the same day, for example instead of the expected 11 on 2013-01-04, we get 9:
date | user_id | count
------------+--------------+-------
2013-01-01 | 1 | 1
2013-01-02 | 1 | 4
2013-01-03 | 1 | 7
2013-01-04 | 1 | 9 <-- should be 11?
2013-01-04 | 2 | 2
Is it possible to tweak my query to resolve this issue? I tried removing the grouping on count, but Postgres doesn't seem to like that:
column "actions.count" must appear in the GROUP BY clause
or be used in an aggregate function
LINE 2: date, user_id, SUM(count) OVER (PARTITION BY user...
^
This query produces the result you are looking for:
SELECT DISTINCT
date, user_id, SUM(count) OVER (PARTITION BY user_id ORDER BY date) AS count
FROM actions
WHERE
action IN ('Call', 'Email');
The default window is already what you want, according to the official docs and the "DISTINCT" eliminates duplicate rows when both Emails and Calls happen on the same day.
See SQL Fiddle.
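For readers without a fiddle at hand, here is a sketch that runs the same idea against an in-memory SQLite database through Python's `sqlite3` module. It is an approximation of the PostgreSQL setup: the columns are renamed to `thedate` and `ct` for this example, the function name `running_counts` is hypothetical, and SQLite 3.25 or later is assumed for window function support. SQLite's default frame also includes all peer rows of the current row, so both rows for 2013-01-04 receive the same sum and DISTINCT collapses them.

```python
import sqlite3

# Sample rows from the question: (user_id, thedate, action, ct).
ACTIONS = [
    (1, "2013-01-01", "Email", 1),
    (1, "2013-01-02", "Call", 3),
    (1, "2013-01-03", "Email", 3),
    (1, "2013-01-04", "Call", 2),
    (1, "2013-01-04", "Voicemail", 2),
    (1, "2013-01-04", "Email", 2),
    (2, "2013-01-04", "Email", 2),
]

QUERY = """
SELECT DISTINCT
       thedate, user_id,
       SUM(ct) OVER (PARTITION BY user_id ORDER BY thedate) AS running_ct
FROM actions
WHERE action IN ('Call', 'Email')
ORDER BY user_id, thedate;
"""


def running_counts():
    """Return (thedate, user_id, running_ct) with one row per user and day."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE actions (user_id INTEGER, thedate TEXT, action TEXT, ct INTEGER)"
    )
    con.executemany("INSERT INTO actions VALUES (?, ?, ?, ?)", ACTIONS)
    rows = con.execute(QUERY).fetchall()
    con.close()
    return rows
```

The row for user 1 on 2013-01-04 comes out as 11, the value the question expected.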
The table has a column named "count", and the expression in the SELECT clause is also aliased as "count", so the name is ambiguous.
Read documentation: http://www.postgresql.org/docs/9.0/static/sql-select.html#SQL-GROUPBY
In case of ambiguity, a GROUP BY name will be interpreted as an
input-column name rather than an output column name.
That means, that your query does not group by "count" evaluated in the SELECT clause, but rather it groups by "count" values taken from the table.
This query gives expected results, see SQL Fiddle
SELECT date, user_id, count
from (
Select date, user_id,
SUM(count) OVER (PARTITION BY user_id ORDER BY date) AS count
FROM actions
WHERE
action IN ('Call', 'Email')
) alias
GROUP BY
user_id, date, count;
Asserts
It is unclear whether you want to sort by user_id or date
It is also unclear whether you want to include dates in the result list, for which there is no row in the base table. In this case, refer to this closely related answer:
PostgreSQL: running count of rows for a query 'by minute'
Repair names
First off, I am using this test table instead of your problematic table:
CREATE TEMP TABLE actions (
user_id integer,
thedate date,
action text,
ct integer
);
Your use of reserved words and function names as identifiers (column names) is part of the problem.
Repair query
Combine aggregate and window functions
Since aggregate functions are applied first, your original query lumps the two rows found for user_id = 1 and thedate = '2013-01-04' into one. You have to multiply by count(*) to get the actual running count.
You can do this without a subquery, since you can combine aggregate functions and window functions. Aggregate functions are applied first. You can even have a window function over the result of aggregate functions.
SELECT thedate
, user_id
, sum(ct * count(*)) OVER (PARTITION BY user_id
ORDER BY thedate) AS running_ct
FROM actions
WHERE action IN ('Call', 'Email')
GROUP BY user_id, thedate, ct
ORDER BY user_id, thedate;
Or simplify to:
...
, sum(sum(ct)) OVER (PARTITION BY user_id
ORDER BY thedate) AS running_ct
...
This should also be the fastest of the solutions presented.
Here, the inner sum() is an aggregate function, while the outer sum() is a window function - over the result of the aggregate function.
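The evaluation order (aggregate first, then a window over the aggregated rows) can be mimicked in plain Python to make the two stages visible. This is an illustration of the order of operations only, not of how PostgreSQL executes the query; the function name `running_totals` and the tuple layout are assumptions of the sketch.

```python
from collections import defaultdict
from itertools import accumulate

# Sample rows from the question: (user_id, thedate, action, ct).
ACTIONS = [
    (1, "2013-01-01", "Email", 1),
    (1, "2013-01-02", "Call", 3),
    (1, "2013-01-03", "Email", 3),
    (1, "2013-01-04", "Call", 2),
    (1, "2013-01-04", "Voicemail", 2),
    (1, "2013-01-04", "Email", 2),
    (2, "2013-01-04", "Email", 2),
]


def running_totals(actions, wanted=("Call", "Email")):
    """Return (thedate, user_id, running_ct) rows, one per user and day."""
    # Stage 1, the aggregate: like GROUP BY user_id, thedate with sum(ct).
    daily = defaultdict(int)
    for user_id, thedate, action, ct in actions:
        if action in wanted:
            daily[(user_id, thedate)] += ct

    # Stage 2, the window: a running sum per user, ordered by date.
    per_user = defaultdict(list)
    for (user_id, thedate), total in sorted(daily.items()):
        per_user[user_id].append((thedate, total))

    result = []
    for user_id, days in per_user.items():
        running = accumulate(total for _, total in days)
        for (thedate, _), value in zip(days, running):
            result.append((thedate, user_id, value))
    return result
```

Because the daily totals are collapsed before the running sum starts, the two rows for user 1 on 2013-01-04 contribute 4 together and the running total reaches 11.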
Or use DISTINCT
Another way would be to use DISTINCT or DISTINCT ON, since that is applied after window functions:
DISTINCT - this is possible, since running_ct is guaranteed to be the same in this case anyway, since all peers are summed at once for the default frame definition of window functions.
SELECT DISTINCT
thedate
, user_id
, sum(ct) OVER (PARTITION BY user_id ORDER BY thedate) AS running_ct
FROM actions
WHERE action IN ('Call', 'Email')
ORDER BY thedate, user_id;
Or simplify with DISTINCT ON:
SELECT DISTINCT ON (thedate, user_id)
...
->SQLfiddle demonstrating all variants.

SQL - Select unique rows from a group of results

I have racked my brain on this problem for quite some time. I've also reviewed other questions but was unsuccessful.
The problem I have is, I have a list of results/table that has multiple rows with columns
| REGISTRATION | ID | DATE | UNITTYPE
| 005DTHGP | 172 | 2007-09-11 | MBio
| 005DTHGP | 1966 | 2006-09-12 | Tracker
| 013DTHGP | 2281 | 2006-11-01 | Tracker
| 013DTHGP | 2712 | 2008-05-30 | MBio
| 017DTNGP | 2404 | 2006-10-20 | Tracker
| 017DTNGP | 508 | 2007-11-10 | MBio
I am trying to select rows with unique REGISTRATIONs where the DATE is the max (the latest). The IDs are not proportional to the DATE, meaning the ID could be a low value yet the DATE is later than that of the other matching row, and vice versa. Therefore I can't use MAX() on both the DATE and ID, and grouping just doesn't seem to work.
The results I want are as follows;
| REGISTRATION | ID | DATE | UNITTYPE
| 005DTHGP | 172 | 2007-09-11 | MBio
| 013DTHGP | 2712 | 2008-05-30 | MBio
| 017DTNGP | 508 | 2007-11-10 | MBio
PLEASE HELP!!!?!?!?!?!?!?
You want embedded queries, which not all SQLs support. In t-sql you'd have something like
select r.registration, r.recent, t.id, t.unittype
from (
select registration, max([date]) recent
from #tmp
group by
registration
) r
left outer join
#tmp t
on r.recent = t.[date]
and r.registration = t.registration
TSQL:
declare @R table
(
Registration varchar(16),
ID int,
Date datetime,
UnitType varchar(16)
)
insert into @R values ('A','1','20090824','A')
insert into @R values ('A','2','20090825','B')
select R.Registration,R.ID,R.UnitType,R.Date from @R R
inner join
(select Registration,Max(Date) as Date from @R group by Registration) M
on R.Registration = M.Registration and R.Date = M.Date
This can be inefficient if you have thousands of rows in your table depending upon how the query is executed (i.e. if it is a rowscan and then a select per row).
In PostgreSQL, and assuming your data is indexed so that a sort isn't needed (or there are so few rows you don't mind a sort):
select distinct on (registration) * from whatever order by registration, "date" desc;
Taking the rows in registration and descending date order, you get the latest date for each registration first. DISTINCT ON keeps that first row and throws away the other rows for the same registration that follow.
select registration,ID,date,unittype
from your_table
where (registration, date) IN (select registration,max(date)
from your_table
group by registration)
This should work in MySQL:
SELECT registration, id, date, unittype FROM
table_name,
(SELECT registration AS temp_reg, MAX(date) as temp_date
FROM table_name GROUP BY registration) AS temp_table
WHERE registration=temp_reg and date=temp_date
The idea is to use a subquery in a FROM clause which throws up a single row containing the correct date and registration (the fields subjected to a group); then use the correct date and registration in a WHERE clause to fetch the other fields of the same row.
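That idea can be checked against the question's sample rows. The sketch below is a stand-in, not the MySQL setup itself: it uses an in-memory SQLite database through Python's `sqlite3` module, a made-up table name `units`, a column `thedate` in place of `date`, and a hypothetical function name `latest_per_registration`.

```python
import sqlite3

# Sample rows from the question: (registration, id, thedate, unittype).
UNITS = [
    ("005DTHGP", 172, "2007-09-11", "MBio"),
    ("005DTHGP", 1966, "2006-09-12", "Tracker"),
    ("013DTHGP", 2281, "2006-11-01", "Tracker"),
    ("013DTHGP", 2712, "2008-05-30", "MBio"),
    ("017DTNGP", 2404, "2006-10-20", "Tracker"),
    ("017DTNGP", 508, "2007-11-10", "MBio"),
]

QUERY = """
SELECT u.registration, u.id, u.thedate, u.unittype
FROM units AS u
JOIN (
    -- one row per registration carrying its latest date
    SELECT registration AS temp_reg, MAX(thedate) AS temp_date
    FROM units
    GROUP BY registration
) AS temp_table
  ON u.registration = temp_table.temp_reg
 AND u.thedate = temp_table.temp_date
ORDER BY u.registration;
"""


def latest_per_registration():
    """Return the full row with the latest date for each registration."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE units (registration TEXT, id INTEGER, thedate TEXT, unittype TEXT)"
    )
    con.executemany("INSERT INTO units VALUES (?, ?, ?, ?)", UNITS)
    rows = con.execute(QUERY).fetchall()
    con.close()
    return rows
```

As with any join on the maximum value, a registration with two rows on the same latest date would appear twice in the output.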