Is there a way in SQL to aggregate a column across rows and potentially duplicate rows based on another field value in Redshift? - sql

So I have a table, let's call it shipment_items that lists by a shipment_id the individual items contained within a shipment and their quantities.
+-------------+-------------+----------+
| shipment_id | item_name | quantity |
+-------------+-------------+----------+
| 1 | cleanser | 1 |
| 1 | moisturizer | 2 |
| 2 | cleanser | 2 |
| 2 | body wash | 1 |
| 3 | cleanser | 1 |
| 3 | moisturizer | 2 |
| 4 | cleanser | 1 |
| 4 | moisturizer | 1 |
+-------------+-------------+----------+
What I want is to return a table that looks like this
+------------------------------------+----------+
| items | num_ship |
+------------------------------------+----------+
| cleanser, moisturizer, moisturizer | 2 |
| body wash, cleanser, cleanser | 1 |
| cleanser, moisturizer | 1 |
+------------------------------------+----------+
Is there any way in SQL to do that? I'm thinking of something with listagg, but the tricky part is duplicating the item_names based on the quantity field. What I'm trying to show in the new table is that there were 2 shipments that contained 2 moisturizers and 1 cleanser, 1 shipment containing 2 cleansers and 1 body wash, and 1 shipment containing 1 cleanser and 1 moisturizer.
** EDIT **
Resolved thanks to @Gordon Linoff.
The new resulting table looks like this:
+------------------------------------+----------+
| items | num_ship |
+------------------------------------+----------+
| cleanser: 1, moisturizer: 2 | 2 |
| body wash: 1, cleanser: 2 | 1 |
| cleanser: 1, moisturizer: 1 | 1 |
+------------------------------------+----------+

You can use listagg():
select listagg(item_name, ', ') within group (order by item_name) as items,
quantity
from t
group by quantity
order by quantity desc;
EDIT:
I think you want two levels of aggregation:
select items, count(*)
from (select shipment_id,
listagg(distinct item_name, ', ') within group (order by item_name) as items
from t
group by shipment_id
) s
group by items
order by count(*) desc;
This does not include duplicates in the item list.
EDIT II:
For exact matches, include the quantity:
select items, count(*)
from (select shipment_id,
listagg(distinct item_name || ':' || quantity, ', ') within group (order by item_name) as items
from t
group by shipment_id
) s
group by items
order by count(*) desc;
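If you really do want the item names repeated once per unit of quantity, as in the first desired table in the question, here is a minimal sketch of one way to do it. It is not Redshift: it runs the SQL in SQLite through Python's sqlite3 module so it can be tried locally. The recursive CTE named expanded and the final grouping step in Python are assumptions of this sketch, not part of the answer above.

```python
import sqlite3
from collections import Counter
from itertools import groupby

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shipment_items (shipment_id INTEGER, item_name TEXT, quantity INTEGER);
    INSERT INTO shipment_items VALUES
        (1, 'cleanser', 1), (1, 'moisturizer', 2),
        (2, 'cleanser', 2), (2, 'body wash', 1),
        (3, 'cleanser', 1), (3, 'moisturizer', 2),
        (4, 'cleanser', 1), (4, 'moisturizer', 1);
""")

# Level 0: emit one row per unit of quantity (hypothetical CTE name: expanded).
rows = conn.execute("""
    WITH RECURSIVE expanded(shipment_id, item_name, remaining) AS (
        SELECT shipment_id, item_name, quantity
        FROM shipment_items
        WHERE quantity > 0
        UNION ALL
        SELECT shipment_id, item_name, remaining - 1
        FROM expanded
        WHERE remaining > 1
    )
    SELECT shipment_id, item_name
    FROM expanded
    ORDER BY shipment_id, item_name
""").fetchall()

# Level 1: build one sorted item list per shipment.
# Level 2: count how many shipments share each list.
num_ship = Counter(
    ", ".join(name for _, name in group)
    for _, group in groupby(rows, key=lambda row: row[0])
)

for items, count in num_ship.most_common():
    print(f"{items} | {count}")
```

The item list is assembled in Python rather than with an aggregate inside SQLite because the ordering inside SQLite's group_concat is not something this sketch wants to rely on; in Redshift, listagg ... within group (order by item_name) would play that role.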

Related

Get some values from the table by selecting

I have a table:
| id | Number |Address
| -----| ------------|-----------
| 1 | 0 | NULL
| 1 | 1 | NULL
| 1 | 2 | 50
| 1 | 3 | NULL
| 2 | 0 | 10
| 3 | 1 | 30
| 3 | 2 | 20
| 3 | 3 | 20
| 4 | 0 | 75
| 4 | 1 | 22
| 4 | 2 | 30
| 5 | 0 | NULL
I need to get: the NUMBER of the last ADDRESS change for each ID.
I wrote this select:
select dh.id, dh.number from table dh where dh =
(select max(min(t.history)) from table t where t.id = dh.id group by t.address)
But this select does not correctly handle the case where the address first changes and then changes back to a previous value. For example, for id=1 the group by returns:
| Number |
| -------- |
| NULL |
| 50 |
I have been thinking about this select for several days, and I will be happy to receive any help.
You can do this using row_number() -- twice:
select t.id, min(number)
from (select t.*,
row_number() over (partition by id order by number desc) as seqnum1,
row_number() over (partition by id, address order by number desc) as seqnum2
from t
) t
where seqnum1 = seqnum2
group by id;
What this does is enumerate the rows by number in descending order:
Once per id.
Once per id and address.
These two values are equal only for the trailing run of rows that share the most recent address, so the filter keeps exactly those rows. The aggregation then pulls back the earliest row in this group, which is where the last address change happened.
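To see the double row_number() trick at work on the sample data from the question, here is a small sketch that runs the same query in SQLite through Python's sqlite3 module. It assumes SQLite 3.25 or newer for window functions, and the subquery alias ranked is made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER, number INTEGER, address INTEGER);
    INSERT INTO t VALUES
        (1, 0, NULL), (1, 1, NULL), (1, 2, 50), (1, 3, NULL),
        (2, 0, 10),
        (3, 1, 30), (3, 2, 20), (3, 3, 20),
        (4, 0, 75), (4, 1, 22), (4, 2, 30),
        (5, 0, NULL);
""")

# seqnum1 counts back from the newest row per id;
# seqnum2 counts back from the newest row per (id, address).
# They agree only while every later row has the same address.
last_change = dict(conn.execute("""
    SELECT id, MIN(number)
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY number DESC) AS seqnum1,
               ROW_NUMBER() OVER (PARTITION BY id, address ORDER BY number DESC) AS seqnum2
        FROM t
    ) AS ranked
    WHERE seqnum1 = seqnum2
    GROUP BY id
""").fetchall())

print(last_change)
```

For id=1 this picks number 3, because the address went back to NULL after being 50. Note that ids with a single row (2 and 5) come back with number 0, since that row is trivially the start of the last run.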
I answered my own question. In case anyone needs it, here is my solution:
select * from table dh1 where dh1.number = (
select max(x.number)
from (
select
dh2.id, dh2.number, dh2.address, lag(dh2.address) over(order by dh2.number asc) as prev
from table dh2 where dh1.id=dh2.id
) x
where NVL(x.address, 0) <> NVL(x.prev, 0)
);

postgresql aggregate by max string length

I have a one to many relationship. In this case, it's a pipelines entity that can have many segments. The segments entity has a column to list the wells associated with this pipeline. This column is purely informational, and is only updated from a regulatory source as a comma separated list, so the data type is text.
What I want to do is to list all the pipelines and show the segment column that has the most associated wells. Each well is identified with a standardized land location (text is the same length for each well). I am also doing other aggregate functions on the segments, so my query looks something like this (I have to simplify it because it's pretty large):
SELECT pipelines.*, max(segments.associated_wells), min(segments.days_without_production), max(segments.production_water_m3)
FROM pipelines
JOIN segments ON segments.pipeline_id = pipelines.id
GROUP BY pipelines.id
This selects the associated_wells that has the highest alphabetical value, which makes sense, but is not what I want.
max(length(segments.associated_wells)) will select the record I want, but only show the length. I need to show the column value.
How can I aggregate based on the string length but show the value?
Here's an example of what I am expecting:
Segment entity:
| id | pipeline_id | associated_wells | days_without_production | production_water_m3 |
|----|-------------|--------------------------|-------------------------|---------------------|
| 1 | 1 | 'location1', 'location2' | 30 | 2.3 |
| 2 | 1 | 'location1' | 15 | 1.4 |
| 3 | 2 | 'location1' | 20 | 1.8 |
Pipeline entity:
| id | name |
|----|-------------|
| 1 | 'Pipeline1' |
| 2 | 'Pipeline2' |
| | |
Desired Query Result:
| id | name | associated_wells | days_without_production | production_water_m3 |
|----|-------------|--------------------------|-------------------------|---------------------|
| 1 | 'Pipeline1' | 'location1', 'location2' | 15 | 2.3 |
| 2 | 'Pipeline2' | 'location1' | 20 | 1.8 |
| | | | | |
If I understand correctly, you want DISTINCT ON:
SELECT DISTINCT ON (p.id) p.*, s.*
FROM pipelines p JOIN
segments s
ON s.pipeline_id = p.id
ORDER BY p.id, LENGTH(s.associated_wells) DESC;
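DISTINCT ON is PostgreSQL-specific and returns one whole segment row per pipeline, so the other columns come from that same row rather than from MIN/MAX across all segments. If you need both, one possible approach (a sketch, not the answer's method) is to pick the longest associated_wells with row_number() and join that to a separate aggregate. The example below runs in SQLite 3.25+ through Python's sqlite3 module; the aliases agg, longest, min_days and max_water are invented for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pipelines (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE segments (
        id INTEGER PRIMARY KEY,
        pipeline_id INTEGER,
        associated_wells TEXT,
        days_without_production INTEGER,
        production_water_m3 REAL
    );
    INSERT INTO pipelines VALUES (1, 'Pipeline1'), (2, 'Pipeline2');
    INSERT INTO segments VALUES
        (1, 1, 'location1, location2', 30, 2.3),
        (2, 1, 'location1', 15, 1.4),
        (3, 2, 'location1', 20, 1.8);
""")

result = conn.execute("""
    SELECT p.id, p.name, longest.associated_wells, agg.min_days, agg.max_water
    FROM pipelines p
    JOIN (
        -- ordinary aggregates over all segments of a pipeline
        SELECT pipeline_id,
               MIN(days_without_production) AS min_days,
               MAX(production_water_m3) AS max_water
        FROM segments
        GROUP BY pipeline_id
    ) AS agg ON agg.pipeline_id = p.id
    JOIN (
        -- rank segments so the longest wells string gets rn = 1
        SELECT pipeline_id, associated_wells,
               ROW_NUMBER() OVER (
                   PARTITION BY pipeline_id
                   ORDER BY LENGTH(associated_wells) DESC
               ) AS rn
        FROM segments
    ) AS longest ON longest.pipeline_id = p.id AND longest.rn = 1
    ORDER BY p.id
""").fetchall()

for row in result:
    print(row)
```

If two segments tie on length, row_number() picks one of them arbitrarily; add a tiebreaker to the ORDER BY if that matters.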
Take the normalisation one step further: split the associated wells into one row per location by cross joining with a series of integers, and then group twice:
WITH
segment(seg_id,pipeline_id,associated_wells,days_without_production,production_water_m3) AS (
SELECT 1,1,'location1, location2',30,2.3
UNION ALL SELECT 2,1,'location1',15,1.4
UNION ALL SELECT 3,2,'location1',20,1.8
)
,
pipeline(pipeline_id,name) AS (
SELECT 1,'Pipeline1'
UNION ALL SELECT 2,'Pipeline2'
)
,
i(i) AS (
SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
)
,
location AS (
SELECT
seg_id
, i AS loc_id
, SPLIT_PART(associated_wells,', ',i) AS location
FROM segment CROSS JOIN i
WHERE SPLIT_PART(associated_wells,', ',i) <>''
)
,
pregroup AS (
SELECT
segment.pipeline_id
, location.location
, MIN(days_without_production) AS days_without_production
, MAX(production_water_m3) AS production_water_m3
FROM segment
JOIN pipeline USING(pipeline_id)
JOIN location USING(seg_id)
GROUP BY 1,2
)
SELECT
pipeline_id
, STRING_AGG(location,',') AS locations
, MIN(days_without_production) AS days_without_production
, MAX(production_water_m3) AS production_water_m3
FROM pregroup
GROUP BY 1;
pipeline_id | locations | days_without_production | production_water_m3
-------------+---------------------+-------------------------+---------------------
1 | location1,location2 | 15 | 2.3
2 | location1 | 20 | 1.8

how to bake in a record count in a sql query

I have a query that looks like this:
select id, extension, count(distinct(id)) from publicids group by id,extension;
This is what the results look like:
id | extension | count
-------------+-------------------------+-------
18459154909 | 12333 | 1
18459154909 | 9891114 | 1
18459154919 | 43244 | 1
18459154919 | 8776232 | 1
18766145025 | 12311 | 1
18766145025 | 1122111 | 1
18766145201 | 12422 | 1
18766145201 | 14141 | 1
But what I really want is for the results to look like this:
id | extension | count
-------------+-------------------------+-------
18459154909 | 12333 | 2
18459154909 | 9891114 | 2
18459154919 | 43244 | 2
18459154919 | 8776232 | 2
18766145025 | 12311 | 2
18766145025 | 1122111 | 2
18766145201 | 12422 | 2
18766145201 | 14141 | 2
I'm trying to get the count field to show the total number of records that have the same id.
Any suggestions would be appreciated
I think you want to count distinct extensions, not ids.
Run this query:
select id
, extension
, (select count(*) from publicids p1 where p.id = p1.id) distinct_id_count
from publicids p
group by id,extension;
This is more or less the same as Pastor's answer. Depending on what the optimizer does, it might be faster on source tables with a large number of records.
select p.id, p.extension, p2.id_count
from publicids p
inner join (
select id, count(*) as id_count
from publicids group by id
) as p2 on p.id = p2.id
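Another option, not mentioned in the answers above, is a window function: COUNT(*) OVER (PARTITION BY id) attaches the per-id total to every row without a join or a GROUP BY. Here is a sketch that runs both the window version and the join version in SQLite 3.25+ through Python's sqlite3 module and checks that they agree. The column types are assumptions, since the question doesn't show the table definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE publicids (id INTEGER, extension INTEGER);
    INSERT INTO publicids VALUES
        (18459154909, 12333), (18459154909, 9891114),
        (18459154919, 43244), (18459154919, 8776232),
        (18766145025, 12311), (18766145025, 1122111),
        (18766145201, 12422), (18766145201, 14141);
""")

# Window version: the count is computed per id and repeated on each row.
window_rows = conn.execute("""
    SELECT id, extension, COUNT(*) OVER (PARTITION BY id) AS id_count
    FROM publicids
    ORDER BY id, extension
""").fetchall()

# Join version, following the derived-table answer above.
join_rows = conn.execute("""
    SELECT p.id, p.extension, p2.id_count
    FROM publicids p
    INNER JOIN (
        SELECT id, COUNT(*) AS id_count
        FROM publicids
        GROUP BY id
    ) AS p2 ON p.id = p2.id
    ORDER BY p.id, p.extension
""").fetchall()

for row in window_rows:
    print(row)
print("same result:", window_rows == join_rows)
```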

Doing a market basket analysis on the order details

I have a table that looks (abbreviated) like:
| order_id | item_id | amount | qty | date |
|---------- |--------- |-------- |----- |------------ |
| 1 | 1 | 10 | 1 | 10-10-2014 |
| 1 | 2 | 20 | 2 | 10-10-2014 |
| 2 | 1 | 10 | 1 | 10-12-2014 |
| 2 | 2 | 20 | 1 | 10-12-2014 |
| 2 | 3 | 45 | 1 | 10-12-2014 |
| 3 | 1 | 10 | 1 | 9-9-2014 |
| 3 | 3 | 45 | 1 | 9-9-2014 |
| 4 | 2 | 20 | 1 | 11-11-2014 |
I would like to run a query that would calculate the list of items
that most frequently occur together.
In this case the result would be:
|items|frequency|
|-----|---------|
|1,2 |2 |
|1,3 |1 |
|2,3 |1 |
|2 |1 |
Ideally, first presenting orders with more than one item, then presenting
the most frequently ordered single items.
Could anyone please provide an example for how to structure this SQL?
This query generates all of the requested output for the cases where 2 items occur together. It doesn't include the last row of the requested output, since a single value (2) technically doesn't occur together with anything, although you could easily add a UNION query to include items that are ordered alone.
This is written for PostgreSQL 9.3
create table orders(
order_id int,
item_id int,
amount int,
qty int,
date timestamp
);
INSERT INTO ORDERS VALUES(1,1,10,1,'10-10-2014');
INSERT INTO ORDERS VALUES(1,2,20,2,'10-10-2014');
INSERT INTO ORDERS VALUES(2,1,10,1,'10-12-2014');
INSERT INTO ORDERS VALUES(2,2,20,1,'10-12-2014');
INSERT INTO ORDERS VALUES(2,3,45,1,'10-12-2014');
INSERT INTO ORDERS VALUES(3,1,10,1,'9-9-2014');
INSERT INTO ORDERS VALUES(3,3,45,1,'9-9-2014');
INSERT INTO ORDERS VALUES(4,2,20,1,'11-11-2014');
with order_pairs as (
select (pg1.item_id, pg2.item_id) as items, pg1.date
from
(select distinct item_id, date
from orders) as pg1
join
(select distinct item_id, date
from orders) as pg2
ON
(
pg1.date = pg2.date AND
pg1.item_id != pg2.item_id AND
pg1.item_id < pg2.item_id
)
)
SELECT items, count(*) as frequency
FROM order_pairs
GROUP by items
ORDER by items;
output
items | frequency
-------+-----------
(1,2) | 2
(1,3) | 2
(2,3) | 1
(3 rows)
Market Basket Analysis with Join.
Self-join the table on order_id and keep only the rows where the first item_id is less than the second item_id, so each pair of items sold together appears exactly once per order. Then group by the pair and count the number of rows for each combination.
select items,count(*) as 'Freq' from
(select concat(x.item_id,',',y.item_id) as items from orders x
JOIN orders y ON x.order_id = y.order_id and
x.item_id != y.item_id and x.item_id < y.item_id) A
group by A.items order by A.items;
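Here is a small runnable sketch of that self-join on order_id, using the sample data from the question. It runs in SQLite through Python's sqlite3 module rather than in the database the answers target, keeps the two item ids as separate columns instead of concatenating them, and the column aliases item_a, item_b and freq are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, item_id INTEGER, amount INTEGER, qty INTEGER, date TEXT);
    INSERT INTO orders VALUES
        (1, 1, 10, 1, '10-10-2014'),
        (1, 2, 20, 2, '10-10-2014'),
        (2, 1, 10, 1, '10-12-2014'),
        (2, 2, 20, 1, '10-12-2014'),
        (2, 3, 45, 1, '10-12-2014'),
        (3, 1, 10, 1, '9-9-2014'),
        (3, 3, 45, 1, '9-9-2014'),
        (4, 2, 20, 1, '11-11-2014');
""")

# x.item_id < y.item_id drops self-pairs and mirrored duplicates,
# so (1,2) is counted once per order and (2,1) never appears.
pairs = conn.execute("""
    SELECT x.item_id AS item_a, y.item_id AS item_b, COUNT(*) AS freq
    FROM orders x
    JOIN orders y
      ON x.order_id = y.order_id
     AND x.item_id < y.item_id
    GROUP BY x.item_id, y.item_id
    ORDER BY freq DESC, item_a, item_b
""").fetchall()

for item_a, item_b, freq in pairs:
    print(f"{item_a},{item_b} | {freq}")
```

On this data the pair 1,3 shows up twice, because orders 2 and 3 both contain items 1 and 3.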

Count rows grouped by condition in SQL

We have a table like this:
+----+--------+
| Id | ItemId |
+----+--------+
| 1 | 1100 |
| 1 | 1101 |
| 1 | 1102 |
| 2 | 2001 |
| 2 | 2002 |
| 3 | 1101 |
+----+--------+
We want to count how many items each guy has, and show the guys with 2 items or more. Like this:
+----+-----------+
| Id | ItemCount |
+----+-----------+
| 1 | 3 |
| 2 | 2 |
+----+-----------+
We didn't count the guy with Id = 3 because he's got only 1 item.
How can we do this in SQL?
SELECT id, COUNT(itemId) AS ItemCount
FROM YourTable
GROUP BY id
HAVING COUNT(itemId) > 1
Use this query
SELECT *
FROM (
SELECT COUNT(ItemId ) AS COUNT, Id FROM ITEM
GROUP BY Id
)
my_select
WHERE COUNT>1
SELECT id,
count(1)
FROM YOUR_TABLE
GROUP BY id
HAVING count(1) > 1;
select Id, count(ItemId) as ItemCount
from table_name
group by Id
-- repeat the aggregate here: not every database accepts a select alias in HAVING
having count(ItemId) > 1
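For anyone who wants to try the HAVING approach from the answers above on the sample table, here is a minimal sketch that runs it in SQLite through Python's sqlite3 module. The table name items is an assumption, since the question doesn't name the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (Id INTEGER, ItemId INTEGER);
    INSERT INTO items VALUES
        (1, 1100), (1, 1101), (1, 1102),
        (2, 2001), (2, 2002),
        (3, 1101);
""")

# WHERE filters rows before grouping; HAVING filters the groups afterwards,
# which is why the condition on the count has to go in HAVING.
result = conn.execute("""
    SELECT Id, COUNT(ItemId) AS ItemCount
    FROM items
    GROUP BY Id
    HAVING COUNT(ItemId) > 1
    ORDER BY Id
""").fetchall()

for row in result:
    print(row)
```

Id 3 is left out because its group contains only one row.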