Cross join remaining combinations - hive

I am trying to build a table that gives me every product I could sell to each customer, based on the products they currently have.
Product Status Table
+-------------+--------------+----------------+
| customer_id | product_name | product_status |
+-------------+--------------+----------------+
| 1 | A | Active |
| 2 | B | Active |
| 2 | C | Active |
| 3 | A | Cancelled |
+-------------+--------------+----------------+
Now I am trying to cross join with a hard-coded table that would give me 4 rows per customer_id, one for each of the 4 products in our portfolio, with the statuses I would like to apply.
Portfolio Table
+--------------+------------+----------+
| product_name | status_1 | status_2 |
+--------------+------------+----------+
| A | Inelegible | Inactive |
| B | Inelegible | Inactive |
| C | Ineligible | Inactive |
| D | Inelegible | Inactive |
+--------------+------------+----------+
In my code I tried to use a CROSS JOIN to get 4 rows per customer_id. Unfortunately, customers that have more than one product end up with duplicated (doubled or tripled) rows.
This is my code:
SELECT
p.customer_id,
CASE WHEN p.product_name = pt.product_name THEN p.product_name ELSE pt.product_name END AS product_name,
CASE
WHEN p.product_name = pt.product_name THEN p.product_status
ELSE pt.status_1
END AS product_status
FROM
products AS p
CROSS JOIN
portfolio as pt
This is my current output:
+----+-------------+--------------+----------------+
| # | customer_id | product_name | product_status |
+----+-------------+--------------+----------------+
| 1 | 1 | A | Active |
| 2 | 1 | B | Inelegible |
| 3 | 1 | C | Inelegible |
| 4 | 1 | D | Inelegible |
| 5 | 2 | A | Ineligible |
| 6 | 2 | A | Ineligible |
| 7 | 2 | B | Active |
| 8 | 2 | B | Ineligible |
| 9 | 2 | C | Active |
| 10 | 2 | C | Ineligible |
| 11 | 2 | D | Ineligible |
| 12 | 2 | D | Ineligible |
| 13 | 3 | A | Cancelled |
| 14 | 3 | B | Ineligible |
| 15 | 3 | C | Ineligible |
| 16 | 3 | D | Ineligible |
+----+-------------+--------------+----------------+
As you can see, customer_id 2 has two rows for each product, so products B and C also appear with statuses different from those in the product status table.
What I would like to achieve, in this case, is a table with 12 rows, in which the current product/status from the product_status table is shown, and the remaining product/statuses from the portfolio table are added.
Expected output
+----+-------------+--------------+----------------+
| # | customer_id | product_name | product_status |
+----+-------------+--------------+----------------+
| 1 | 1 | A | Active |
| 2 | 1 | B | Inelegible |
| 3 | 1 | C | Inelegible |
| 4 | 1 | D | Inelegible |
| 5 | 2 | A | Ineligible |
| 6 | 2 | B | Active |
| 7 | 2 | C | Active |
| 8 | 2 | D | Ineligible |
| 9 | 3 | A | Cancelled |
| 10 | 3 | B | Ineligible |
| 11 | 3 | C | Ineligible |
| 12 | 3 | D | Ineligible |
+----+-------------+--------------+----------------+
Not sure if the CROSS JOIN is the best alternative, but now I am running out of ideas.

EDIT:
I thought of another cleaner solution. Do a cross join first, then a right join on the customer_id and product_name, and coalesce the product statuses.
SELECT customer_id, product_name, coalesce(product_status, status_1) AS product_status
FROM products p
RIGHT JOIN (
SELECT *
FROM (SELECT DISTINCT customer_id FROM products) pro
CROSS JOIN portfolio
) pt
USING (customer_id, product_name)
ORDER BY customer_id, product_name
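To see why this avoids the duplicate rows, here is a minimal plain-Python sketch of the same logic: take the distinct customers, pair each with every portfolio product, and fall back to the default status only when the customer has no row for that product. The function name `fill_portfolio` is made up for illustration, and the default status spelling is normalized to "Ineligible" here, which is an assumption rather than the question's exact data.

```python
from itertools import product

# Sample rows copied from the question's product status table.
products = [
    (1, "A", "Active"),
    (2, "B", "Active"),
    (2, "C", "Active"),
    (3, "A", "Cancelled"),
]
# product_name -> status_1 (spelling normalized for this sketch; an assumption).
portfolio = {"A": "Ineligible", "B": "Ineligible", "C": "Ineligible", "D": "Ineligible"}


def fill_portfolio(products, portfolio):
    """Return one row per (customer, portfolio product)."""
    # Statuses the customer already has, keyed by (customer_id, product_name).
    known = {(cid, name): status for cid, name, status in products}
    # Equivalent of SELECT DISTINCT customer_id: each customer appears once.
    customers = sorted({cid for cid, _, _ in products})
    rows = []
    for cid, name in product(customers, sorted(portfolio)):
        # Equivalent of COALESCE(product_status, status_1).
        rows.append((cid, name, known.get((cid, name), portfolio[name])))
    return rows


result = fill_portfolio(products, portfolio)
for row in result:
    print(row)
```

The key point is that the cross join runs over distinct customers, not over the product rows themselves, so a customer with two products still gets exactly four rows.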
Old answer:
The idea is to collect all product names for a customer_id into a list, and check whether each product in the portfolio is in that list.
(SELECT customer_id, pt_product_name as product_name, first(status_1) as product_status
FROM (
SELECT
customer_id,
p.product_name as p_product_name,
pt.product_name as pt_product_name,
product_status,
status_1,
status_2,
collect_list(p.product_name) over (partition by customer_id) AS product_list
FROM products p
CROSS JOIN portfolio pt
)
WHERE NOT array_contains(product_list, pt_product_name)
GROUP BY customer_id, product_name)
UNION ALL
(SELECT customer_id, p_product_name as product_name, first(product_status) as product_status
FROM (
SELECT
customer_id,
p.product_name as p_product_name,
pt.product_name as pt_product_name,
product_status,
status_1,
status_2,
collect_list(p.product_name) over (partition by customer_id) AS product_list
FROM products p
CROSS JOIN portfolio pt)
WHERE array_contains(product_list, pt_product_name)
GROUP BY customer_id, product_name)
ORDER BY customer_id, product_name;
which gives
+-----------+------------+--------------+
|customer_id|product_name|product_status|
+-----------+------------+--------------+
| 1| A| Active|
| 1| B| Inelegible|
| 1| C| Ineligible|
| 1| D| Inelegible|
| 2| A| Inelegible|
| 2| B| Active|
| 2| C| Active|
| 2| D| Inelegible|
| 3| A| Cancelled|
| 3| B| Inelegible|
| 3| C| Ineligible|
| 3| D| Inelegible|
+-----------+------------+--------------+
FYI the chunk before UNION ALL gives:
+-----------+------------+--------------+
|customer_id|product_name|product_status|
+-----------+------------+--------------+
| 1| B| Inelegible|
| 1| C| Ineligible|
| 1| D| Inelegible|
| 2| A| Inelegible|
| 2| D| Inelegible|
| 3| B| Inelegible|
| 3| C| Ineligible|
| 3| D| Inelegible|
+-----------+------------+--------------+
And the chunk after UNION ALL gives:
+-----------+------------+--------------+
|customer_id|product_name|product_status|
+-----------+------------+--------------+
| 1| A| Active|
| 2| B| Active|
| 2| C| Active|
| 3| A| Cancelled|
+-----------+------------+--------------+
Hope that helps!


Find the number of rows before a value changes with a group by

I have a table like this one
CREATE TABLE Levels
([userid] int, [counter1] int, [counter2] int, [date] datetime)
;
counter2 is an incrementing value. date is the datetime at which the row was created. counter1 is a field that can take different integer values, and userid is the id of the user.
This is an example of the data. You can find a bigger example with two users in sqlfiddle
| userid | counter1 | counter2 | date |
|--------|----------|----------|----------------------|
| 123 | 6 | 42 | 2010-07-31T00:12:28Z |
| 123 | 6 | 43 | 2010-11-20T00:11:15Z |
| 123 | 6 | 44 | 2011-03-12T00:15:07Z |
| 123 | 5 | 45 | 2011-07-02T01:11:09Z |
| 123 | 5 | 46 | 2011-10-22T00:24:18Z |
| 123 | 5 | 47 | 2012-02-10T23:51:54Z |
| 123 | 5 | 48 | 2012-06-01T23:43:26Z |
| 123 | 5 | 49 | 2012-09-21T23:43:59Z |
| 123 | 4 | 50 | 2013-01-11T23:52:43Z |
| 123 | 4 | 51 | 2013-05-03T23:49:25Z |
| 123 | 4 | 52 | 2013-08-23T23:48:24Z |
| 123 | 3 | 53 | 2013-12-14T00:01:20Z |
| 123 | 3 | 54 | 2014-04-04T23:45:45Z |
| 123 | 4 | 55 | 2014-07-25T23:44:34Z |
| 123 | 5 | 56 | 2014-11-14T23:46:11Z |
What I am trying to do is count how many consecutive rows counter1 keeps the same value before it changes. The other questions I found on Stack Overflow did not work for me because:
The counter1 field can take the same value again later on, and I don't want to count that as the same run.
I am working in SQL Server 2008, where the LAG function is not available.
The desired result for the full example in sqlfiddle is
| userid | counter1 | count |
|--------|----------|-------|
| 123| 6| 3|
| 123| 5| 5|
| 123| 4| 3|
| 123| 3| 2|
| 123| 4| 1|
| 123| 5| 1|
| 123| 6| 2|
| 123| 5| 5|
| 123| 4| 2|
| 123| 5| 1|
| 123| 4| 5|
| 123| 5| 5|
| 345| 6| 2|
| 345| 6| 9|
This is a type of gaps-and-islands problem. Fortunately, you can use the difference of row numbers:
select userid, counter1, count(*)
from (select t.*,
row_number() over (partition by userid order by counter2) as seqnum,
row_number() over (partition by userid, counter1 order by counter2) as seqnum_2
from t
) t
group by userid, counter1, (seqnum - seqnum_2)
order by userid, min(counter2);
Note: This assumes that the ordering is based on counter2. If it is really based on date then you can use that column instead.
Why this works is a little tricky to explain. But if you look at the results from the subquery, you will see how the difference between the two row_number() values is constant when counter1 has the same value on adjacent rows.
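A small plain-Python sketch may make the trick easier to see. It computes the same two row numbers by hand for the sample rows of user 123 and groups on their difference; the function name `count_runs` and the tuple layout are assumptions made for this illustration only.

```python
from collections import defaultdict

# (userid, counter1, counter2) taken from the sample data in the question.
rows = [
    (123, 6, 42), (123, 6, 43), (123, 6, 44),
    (123, 5, 45), (123, 5, 46), (123, 5, 47), (123, 5, 48), (123, 5, 49),
    (123, 4, 50), (123, 4, 51), (123, 4, 52),
    (123, 3, 53), (123, 3, 54),
    (123, 4, 55),
    (123, 5, 56),
]


def count_runs(rows):
    """Count consecutive runs of counter1 per user via the row-number difference."""
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))  # order by userid, counter2
    seqnum = defaultdict(int)    # row_number() partitioned by userid
    seqnum_2 = defaultdict(int)  # row_number() partitioned by userid, counter1
    groups = {}                  # (userid, counter1, diff) -> [count, first counter2]
    for userid, counter1, counter2 in ordered:
        seqnum[userid] += 1
        seqnum_2[(userid, counter1)] += 1
        # The difference stays constant while counter1 does not change.
        diff = seqnum[userid] - seqnum_2[(userid, counter1)]
        entry = groups.setdefault((userid, counter1, diff), [0, counter2])
        entry[0] += 1
    # Same as ORDER BY userid, MIN(counter2).
    ordered_groups = sorted(groups.items(), key=lambda kv: (kv[0][0], kv[1][1]))
    return [(key[0], key[1], value[0]) for key, value in ordered_groups]


runs = count_runs(rows)
for run in runs:
    print(run)
```

Within one run both row numbers advance by one per row, so their difference does not move; when counter1 takes an earlier value again, rows of other values have been skipped in between, so the difference is larger and the run lands in a new group.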
You don't actually need LEAD and LAG here; however, getting to a supported version of SQL Server, where LAG (and LEAD) are available, should be a priority.
WITH YourTable AS(
SELECT *
FROM (VALUES(123,6,42,CONVERT(datetime2(0),'2010-07-31T00:12:28Z')),
(123,6,43,CONVERT(datetime2(0),'2010-11-20T00:11:15Z')),
(123,6,44,CONVERT(datetime2(0),'2011-03-12T00:15:07Z')),
(123,5,45,CONVERT(datetime2(0),'2011-07-02T01:11:09Z')),
(123,5,46,CONVERT(datetime2(0),'2011-10-22T00:24:18Z')),
(123,5,47,CONVERT(datetime2(0),'2012-02-10T23:51:54Z')),
(123,5,48,CONVERT(datetime2(0),'2012-06-01T23:43:26Z')),
(123,5,49,CONVERT(datetime2(0),'2012-09-21T23:43:59Z')),
(123,4,50,CONVERT(datetime2(0),'2013-01-11T23:52:43Z')),
(123,4,51,CONVERT(datetime2(0),'2013-05-03T23:49:25Z')),
(123,4,52,CONVERT(datetime2(0),'2013-08-23T23:48:24Z')),
(123,3,53,CONVERT(datetime2(0),'2013-12-14T00:01:20Z')),
(123,3,54,CONVERT(datetime2(0),'2014-04-04T23:45:45Z')),
(123,4,55,CONVERT(datetime2(0),'2014-07-25T23:44:34Z')),
(123,5,56,CONVERT(datetime2(0),'2014-11-14T23:46:11Z')))V(userid,counter1,counter2,date)),
Grps AS (
SELECT userid,
counter1,
counter2,
date,
ROW_NUMBER() OVER (PARTITION BY userid ORDER BY [date]) -
ROW_NUMBER() OVER (PARTITION BY userid,counter1 ORDER BY [date]) AS Grp
FROM YourTable)
SELECT userid,
counter1,
COUNT(*)
FROM Grps
GROUP BY userid,
counter1,
Grp
ORDER BY userid,
MIN([date]);

Group query in subquery to get column value as column name

The data I have in my database:
| id| some_id| status|
| 1| 1 | SUCCESS|
| 2| 2 | SUCCESS|
| 3| 1 | SUCCESS|
| 4| 3 | SUCCESS|
| 5| 1 | SUCCESS|
| 6| 4 | FAILED |
| 7| 1 | SUCCESS|
| 8| 1 | FAILED |
| 9| 4 | FAILED |
| 10| 1 | FAILED |
.......
I ran a query that groups by some_id and status to get the result below:
| some_id| count| status|
| 1 | 20| SUCCESS|
| 2 | 5 | SUCCESS|
| 3 | 10| SUCCESS|
| 2 | 15| FAILED |
| 3 | 12| FAILED |
| 4 | 25 | FAILED |
I want to use the above query as a subquery to get the result below, where the distinct status values become column names.
| some_id| SUCCESS| FAILED|
| 1 | 20 | null/0|
| 2 | 5 | 15 |
| 3 | 10 | 12 |
| 4 | null/0| 25 |
Any other approach to get the final data is also appreciated. Let me know if you need more info.
Thanks
You may use a pivot query here with the help of FILTER:
SELECT
some_id,
COUNT(*) FILTER (WHERE status = 'SUCCESS') AS SUCCESS,
COUNT(*) FILTER (WHERE status = 'FAILED') AS FAILED
FROM yourTable
GROUP BY
some_id;
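FILTER is not available in every database engine. Where it is missing, the usual equivalent is SUM over a CASE expression. The sketch below runs that variant against an in-memory SQLite database through Python's sqlite3 module, using only the ten sample rows shown in the question; the table name yourTable is carried over from the query above, and the choice of SQLite is purely an assumption for demonstration.

```python
import sqlite3

# The ten sample rows from the question: (id, some_id, status).
sample_rows = [
    (1, 1, "SUCCESS"), (2, 2, "SUCCESS"), (3, 1, "SUCCESS"), (4, 3, "SUCCESS"),
    (5, 1, "SUCCESS"), (6, 4, "FAILED"), (7, 1, "SUCCESS"), (8, 1, "FAILED"),
    (9, 4, "FAILED"), (10, 1, "FAILED"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (id INTEGER, some_id INTEGER, status TEXT)")
conn.executemany("INSERT INTO yourTable VALUES (?, ?, ?)", sample_rows)

# One pass over the raw table: each SUM counts only the rows with that status.
pivot = conn.execute(
    """
    SELECT some_id,
           SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) AS success,
           SUM(CASE WHEN status = 'FAILED' THEN 1 ELSE 0 END) AS failed
    FROM yourTable
    GROUP BY some_id
    ORDER BY some_id
    """
).fetchall()
conn.close()

for row in pivot:
    print(row)
```

Either form aggregates the raw table directly, so the intermediate grouped query is not needed as a subquery.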

SQL to add position depending on multiple columns

I have a table that I am adding a position column in. I will need to add a numbered position to all rows already in the table. The numbering depends on 4 columns that would match each other between rows. For example
id| name| fax | cart| area |
1| jim | 1 | 4 | 1 |
2| jim | 1 | 4 | 1 |
3| jim | 2 | 4 | 1 |
4| jim | 2 | 4 | 1 |
5| bob | 1 | 4 | 1 |
6| bob | 1 | 4 | 1 |
7| bob | 2 | 5 | 1 |
8| bob | 2 | 5 | 2 |
9| bob | 2 | 5 | 2 |
10| bob | 2 | 5 | 2 |
would result with
id| name| fax | cart| area | position
1| jim | 1 | 4 | 1 | 1
2| jim | 1 | 4 | 1 | 2
3| jim | 2 | 4 | 1 | 1
4| jim | 2 | 4 | 1 | 2
5| bob | 1 | 4 | 1 | 1
6| bob | 1 | 4 | 1 | 2
7| bob | 2 | 5 | 1 | 1
8| bob | 2 | 5 | 2 | 1
9| bob | 2 | 5 | 2 | 2
10| bob | 2 | 5 | 2 | 3
I need an sql query that will iterate over the table and add the position.
Use row_number():
select
t.*,
row_number() over(partition by name, fax, cart, area order by id) position
from mytable t
If you wanted an update query:
update mytable as t
set position = rn
from (
select id, row_number() over(partition by name, fax, cart, area order by id) rn
from mytable
) x
where x.id = t.id
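If it helps to see what the window function is doing, the following plain-Python sketch reproduces the numbering for the sample rows: it walks the rows in id order and keeps one running counter per (name, fax, cart, area) combination. The function name `add_positions` is hypothetical.

```python
from collections import defaultdict

# (id, name, fax, cart, area) from the example table in the question.
rows = [
    (1, "jim", 1, 4, 1), (2, "jim", 1, 4, 1),
    (3, "jim", 2, 4, 1), (4, "jim", 2, 4, 1),
    (5, "bob", 1, 4, 1), (6, "bob", 1, 4, 1),
    (7, "bob", 2, 5, 1),
    (8, "bob", 2, 5, 2), (9, "bob", 2, 5, 2), (10, "bob", 2, 5, 2),
]


def add_positions(rows):
    """Append a 1-based position that restarts for each (name, fax, cart, area)."""
    counters = defaultdict(int)
    numbered = []
    for row in sorted(rows, key=lambda r: r[0]):  # order by id
        key = row[1:]                             # partition by name, fax, cart, area
        counters[key] += 1
        numbered.append(row + (counters[key],))
    return numbered


positioned = add_positions(rows)
for row in positioned:
    print(row)
```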

How to group by with a condition in PySpark

How to group by with a condition in PySpark?
This is an example data:
+-----+-------+-------------+------------+
| zip | state | Agegrouping | patient_id |
+-----+-------+-------------+------------+
| 123 | x | Adult | 123 |
| 124 | x | Children | 231 |
| 123 | x | Children | 456 |
| 156 | x | Adult | 453 |
| 124 | y | Adult | 34 |
| 432 | y | Adult | 23 |
| 234 | y | Children | 13 |
| 432 | z | Children | 22 |
| 234 | z | Adult | 44 |
+-----+-------+-------------+------------+
I then want to see the data as:
+-----+-------+-------+----------+------------+
| zip | state | Adult | Children | patient_id |
+-----+-------+-------+----------+------------+
| 123 | x | 1 | 1 | 2 |
| 124 | x | 1 | 1 | 2 |
| 156 | x | 1 | 0 | 1 |
| 432 | y | 1 | 1 | 2 |
| 234 | z | 1 | 1 | 2 |
+-----+-------+-------+----------+------------+
How can I do this?
Here is the spark sql version.
df.createOrReplaceTempView('table')
spark.sql('''
select zip, state,
count(if(Agegrouping = 'Adult', 1, null)) as adult,
count(if(Agegrouping = 'Children', 1, null)) as children,
count(1) as patient_id
from table
group by zip, state;
''').show()
+---+-----+-----+--------+----------+
|zip|state|adult|children|patient_id|
+---+-----+-----+--------+----------+
|123| x| 1| 1| 2|
|156| x| 1| 0| 1|
|234| z| 1| 0| 1|
|432| z| 0| 1| 1|
|234| y| 0| 1| 1|
|124| y| 0| 0| 1|
|124| x| 0| 1| 1|
|432| y| 1| 0| 1|
+---+-----+-----+--------+----------+
You can use conditional aggregation:
select zip, state,
sum(case when agegrouping = 'Adult' then 1 else 0 end) as adult,
sum(case when agegrouping = 'Children' then 1 else 0 end) as children,
count(*) as num_patients
from t
group by zip, state;
Use conditional aggregation:
select
zip,
state,
sum(case when agegrouping = 'Adult' then 1 else 0 end) as adult,
sum(case when agegrouping = 'Children' then 1 else 0 end) as children,
count(*) patient_id
from mytable
group by zip, state
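As a cross-check on the sample data, here is a rough plain-Python equivalent of the conditional aggregation above. It does not use Spark at all; the function name `summarize` and the dictionary layout are assumptions for illustration.

```python
from collections import defaultdict

# (zip, state, Agegrouping, patient_id) from the example data in the question.
patients = [
    (123, "x", "Adult", 123), (124, "x", "Children", 231), (123, "x", "Children", 456),
    (156, "x", "Adult", 453), (124, "y", "Adult", 34), (432, "y", "Adult", 23),
    (234, "y", "Children", 13), (432, "z", "Children", 22), (234, "z", "Adult", 44),
]


def summarize(patients):
    """Count adults, children and all patients per (zip, state) group."""
    totals = defaultdict(lambda: {"Adult": 0, "Children": 0, "patients": 0})
    for zip_code, state, age_group, _patient_id in patients:
        group = totals[(zip_code, state)]
        if age_group in ("Adult", "Children"):
            group[age_group] += 1  # sum(case when agegrouping = ... then 1 else 0 end)
        group["patients"] += 1     # count(*)
    return {key: dict(value) for key, value in totals.items()}


summary = summarize(patients)
for key in sorted(summary):
    print(key, summary[key])
```

Grouping by both zip and state yields eight groups for this sample, which is why the query output has more rows than the table sketched in the question.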

sql query for getting items not rated by both users

Say I have the following table:
+------------+------+--------+
| reviewerID | item | rating |
+------------+------+--------+
| 1 | 1 | 5|
| 1 | 2 | 5|
| 1 | 3 | 5|
| 2 | 4 | 5|
| 2 | 1 | 5|
| 2 | 2 | 5|
+------------+------+--------+
And I want to get the items not rated by reviewer 1 but rated by reviewer 2 and vice versa into one table. The output should be something like this:
+------------+------+--------+
| reviewerID | item | rating |
+------------+------+--------+
| 1 | 3 | 5|
| 2 | 4 | 5|
+------------+------+--------+
You could count the number of reviewers the items had (between those two reviewers) and only select those with one reviewer:
SELECT *
FROM mytable
WHERE item IN (SELECT item
FROM mytable
WHERE reviewerID IN (1, 2)
GROUP BY item
HAVING COUNT(*) = 1)
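The same idea can be sketched in plain Python as a symmetric difference of the two reviewers' item sets. The function name `rated_by_only_one` is made up for this example.

```python
# (reviewerID, item, rating) from the table in the question.
reviews = [
    (1, 1, 5), (1, 2, 5), (1, 3, 5),
    (2, 4, 5), (2, 1, 5), (2, 2, 5),
]


def rated_by_only_one(reviews, reviewer_a, reviewer_b):
    """Return the rows for items rated by exactly one of the two reviewers."""
    items_a = {item for reviewer, item, _ in reviews if reviewer == reviewer_a}
    items_b = {item for reviewer, item, _ in reviews if reviewer == reviewer_b}
    only_one = items_a ^ items_b  # symmetric difference: in one set but not both
    return sorted(
        row for row in reviews
        if row[0] in (reviewer_a, reviewer_b) and row[1] in only_one
    )


unshared = rated_by_only_one(reviews, 1, 2)
for row in unshared:
    print(row)
```

Unlike the SQL above, this sketch also restricts the returned rows to the two reviewers, which matters only if the table holds ratings from other reviewers as well.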
Here is what you need to get the desired results:
SELECT a.* FROM Reviewer a
JOIN ( SELECT DISTINCT item FROM Reviewer
GROUP BY item
HAVING count(item) < 2) b
ON a.item = b.item
Hope it helps!!
Good luck!!