Cross join remaining combinations - hive
I am trying to build a table that gives me a combination of all the products I could sell to each customer, based on the ones they currently have.
Product Status Table
+-------------+--------------+----------------+
| customer_id | product_name | product_status |
+-------------+--------------+----------------+
| 1 | A | Active |
| 2 | B | Active |
| 2 | C | Active |
| 3 | A | Cancelled |
+-------------+--------------+----------------+
Now I am trying to cross join with a hard-coded table that would give me 4 rows per customer_id, one for each of the 4 products in our portfolio, along with the statuses that I would like to apply.
Portfolio Table
+--------------+------------+----------+
| product_name | status_1 | status_2 |
+--------------+------------+----------+
| A | Inelegible | Inactive |
| B | Inelegible | Inactive |
| C | Ineligible | Inactive |
| D | Inelegible | Inactive |
+--------------+------------+----------+
In my code I tried to use a CROSS JOIN to get 4 rows per customer_id. Unfortunately, customers that have more than one product end up with duplicated (double or triple) rows.
This is my code:
SELECT
    p.customer_id,
    CASE WHEN p.product_name = pt.product_name THEN p.product_name ELSE pt.product_name END AS product_name,
    CASE
        WHEN p.product_name = pt.product_name THEN p.product_status
        ELSE pt.status_1
    END AS product_status
FROM
    products AS p
CROSS JOIN
    portfolio AS pt
This is my current output:
+----+-------------+--------------+----------------+
| # | customer_id | product_name | product_status |
+----+-------------+--------------+----------------+
| 1 | 1 | A | Active |
| 2 | 1 | B | Inelegible |
| 3 | 1 | C | Inelegible |
| 4 | 1 | D | Inelegible |
| 5 | 2 | A | Ineligible |
| 6 | 2 | A | Ineligible |
| 7 | 2 | B | Active |
| 8 | 2 | B | Ineligible |
| 9 | 2 | C | Active |
| 10 | 2 | C | Ineligible |
| 11 | 2 | D | Ineligible |
| 12 | 2 | D | Ineligible |
| 13 | 3 | A | Cancelled |
| 14 | 3 | B | Ineligible |
| 15 | 3 | C | Ineligible |
| 16 | 3 | D | Ineligible |
+----+-------------+--------------+----------------+
As you can see, customer_id 2 has two rows for each product, so products B and C also appear with a status different from the one in the product status table.
What I would like to achieve, in this case, is a table with 12 rows, in which the current product/status from the product_status table is shown, and the remaining product/statuses from the portfolio table are added.
Expected output
+----+-------------+--------------+----------------+
| # | customer_id | product_name | product_status |
+----+-------------+--------------+----------------+
| 1 | 1 | A | Active |
| 2 | 1 | B | Inelegible |
| 3 | 1 | C | Inelegible |
| 4 | 1 | D | Inelegible |
| 5 | 2 | A | Ineligible |
| 6 | 2 | B | Active |
| 7 | 2 | C | Active |
| 8 | 2 | D | Ineligible |
| 9 | 3 | A | Cancelled |
| 10 | 3 | B | Ineligible |
| 11 | 3 | C | Ineligible |
| 12 | 3 | D | Ineligible |
+----+-------------+--------------+----------------+
Not sure if the CROSS JOIN is the best alternative, but now I am running out of ideas.
EDIT:
I thought of another, cleaner solution: cross join the distinct customer IDs with the portfolio first, then right join the products table to that result on customer_id and product_name, and coalesce the product statuses.
SELECT customer_id, product_name, coalesce(product_status, status_1) AS product_status
FROM products p
RIGHT JOIN (
    SELECT *
    FROM (SELECT DISTINCT customer_id FROM products) pro
    CROSS JOIN portfolio
) pt
USING (customer_id, product_name)
ORDER BY customer_id, product_name
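For readers who want to try the idea without a Hive or Spark cluster, here is a minimal sketch using Python's built-in sqlite3 module. It is a stand-in under stated assumptions, not the query above verbatim: the `build_grid` helper is hypothetical, the portfolio status is spelled 'Ineligible' throughout, and the RIGHT JOIN is rewritten as an equivalent LEFT JOIN that starts from the customer-by-portfolio grid, since older SQLite versions have no RIGHT JOIN.

```python
import sqlite3


def build_grid():
    """Return one (customer_id, product_name, product_status) row
    per customer and portfolio product."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE products (customer_id INTEGER, product_name TEXT, product_status TEXT);
        INSERT INTO products VALUES
            (1, 'A', 'Active'), (2, 'B', 'Active'),
            (2, 'C', 'Active'), (3, 'A', 'Cancelled');
        CREATE TABLE portfolio (product_name TEXT, status_1 TEXT, status_2 TEXT);
        INSERT INTO portfolio VALUES
            ('A', 'Ineligible', 'Inactive'), ('B', 'Ineligible', 'Inactive'),
            ('C', 'Ineligible', 'Inactive'), ('D', 'Ineligible', 'Inactive');
    """)
    rows = con.execute("""
        SELECT c.customer_id,
               pt.product_name,
               -- keep the real status when the customer owns the product,
               -- otherwise fall back to the portfolio default
               COALESCE(p.product_status, pt.status_1) AS product_status
        FROM (SELECT DISTINCT customer_id FROM products) AS c
        CROSS JOIN portfolio AS pt
        LEFT JOIN products AS p
               ON p.customer_id = c.customer_id
              AND p.product_name = pt.product_name
        ORDER BY c.customer_id, pt.product_name
    """).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in build_grid():
        print(row)
```

Because the grid is built from distinct customer IDs, each customer contributes exactly one row per portfolio product (12 rows here), which is what removes the duplicates seen with the plain CROSS JOIN.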
Old answer:
The idea is to include information of all product names for a customer_id into a list, and check whether the product in portfolio is in that list.
(SELECT customer_id, pt_product_name as product_name, first(status_1) as product_status
 FROM (
     SELECT
         customer_id,
         p.product_name as p_product_name,
         pt.product_name as pt_product_name,
         product_status,
         status_1,
         status_2,
         collect_list(p.product_name) over (partition by customer_id) AS product_list
     FROM products p
     CROSS JOIN portfolio pt
 )
 WHERE NOT array_contains(product_list, pt_product_name)
 GROUP BY customer_id, product_name)
UNION ALL
(SELECT customer_id, p_product_name as product_name, first(product_status) as product_status
 FROM (
     SELECT
         customer_id,
         p.product_name as p_product_name,
         pt.product_name as pt_product_name,
         product_status,
         status_1,
         status_2,
         collect_list(p.product_name) over (partition by customer_id) AS product_list
     FROM products p
     CROSS JOIN portfolio pt
 )
 WHERE array_contains(product_list, pt_product_name)
 GROUP BY customer_id, product_name)
ORDER BY customer_id, product_name;
which gives
+-----------+------------+--------------+
|customer_id|product_name|product_status|
+-----------+------------+--------------+
| 1| A| Active|
| 1| B| Inelegible|
| 1| C| Ineligible|
| 1| D| Inelegible|
| 2| A| Inelegible|
| 2| B| Active|
| 2| C| Active|
| 2| D| Inelegible|
| 3| A| Cancelled|
| 3| B| Inelegible|
| 3| C| Ineligible|
| 3| D| Inelegible|
+-----------+------------+--------------+
FYI the chunk before UNION ALL gives:
+-----------+------------+--------------+
|customer_id|product_name|product_status|
+-----------+------------+--------------+
| 1| B| Inelegible|
| 1| C| Ineligible|
| 1| D| Inelegible|
| 2| A| Inelegible|
| 2| D| Inelegible|
| 3| B| Inelegible|
| 3| C| Ineligible|
| 3| D| Inelegible|
+-----------+------------+--------------+
And the chunk after UNION ALL gives:
+-----------+------------+--------------+
|customer_id|product_name|product_status|
+-----------+------------+--------------+
| 1| A| Active|
| 2| B| Active|
| 2| C| Active|
| 3| A| Cancelled|
+-----------+------------+--------------+
Hope that helps!
Related
Found number of rows before a value changes with a group by
I have a table like this one:
CREATE TABLE Levels ([userid] int, [counter1] int, [counter2] int, [date] datetime);
counter2 is an incremental value. date is just the datetime the row was created. counter1 is a field that can take different integer values, and userid is the id of the user.
This is an example of the data. You can find a bigger example with two users in sqlfiddle.
| userid | counter1 | counter2 | date                 |
|--------|----------|----------|----------------------|
| 123    | 6        | 42       | 2010-07-31T00:12:28Z |
| 123    | 6        | 43       | 2010-11-20T00:11:15Z |
| 123    | 6        | 44       | 2011-03-12T00:15:07Z |
| 123    | 5        | 45       | 2011-07-02T01:11:09Z |
| 123    | 5        | 46       | 2011-10-22T00:24:18Z |
| 123    | 5        | 47       | 2012-02-10T23:51:54Z |
| 123    | 5        | 48       | 2012-06-01T23:43:26Z |
| 123    | 5        | 49       | 2012-09-21T23:43:59Z |
| 123    | 4        | 50       | 2013-01-11T23:52:43Z |
| 123    | 4        | 51       | 2013-05-03T23:49:25Z |
| 123    | 4        | 52       | 2013-08-23T23:48:24Z |
| 123    | 3        | 53       | 2013-12-14T00:01:20Z |
| 123    | 3        | 54       | 2014-04-04T23:45:45Z |
| 123    | 4        | 55       | 2014-07-25T23:44:34Z |
| 123    | 5        | 56       | 2014-11-14T23:46:11Z |
What I am trying to do is count how many consecutive rows have the same counter1 value before it changes.
Why didn't the other questions I found on Stack Overflow work? The counter1 field can take the same value again later on, and I don't want to count that as the same case.
I am working in SQL Server 2008, and the LAG function is not available.
The desired result for the full example in sqlfiddle is:
| userid | counter1 | count |
|--------|----------|-------|
| 123    | 6        | 3     |
| 123    | 5        | 5     |
| 123    | 4        | 3     |
| 123    | 3        | 2     |
| 123    | 4        | 1     |
| 123    | 5        | 1     |
| 123    | 6        | 2     |
| 123    | 5        | 5     |
| 123    | 4        | 2     |
| 123    | 5        | 1     |
| 123    | 4        | 5     |
| 123    | 5        | 5     |
| 345    | 6        | 2     |
| 345    | 6        | 9     |
This is a type of gaps-and-islands problem. Fortunately, you can use the difference of row numbers:
select userid, counter1, count(*)
from (select t.*,
             row_number() over (partition by userid order by counter2) as seqnum,
             row_number() over (partition by userid, counter1 order by counter2) as seqnum_2
      from t
     ) t
group by userid, counter1, (seqnum - seqnum_2)
order by userid, min(counter2);
Note: This assumes that the ordering is based on counter2. If it is really based on date, then you can use that column instead.
Why this works is a little tricky to explain. But if you look at the results from the subquery, you will see that the difference between the two row_number() values is constant when counter1 has the same value on adjacent rows.
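To make the row-number difference concrete, here is a small sketch that runs the same query shape through Python's sqlite3 module. It assumes the bundled SQLite is version 3.25 or newer (the release that added window functions); the table name `levels`, the helper `count_runs`, and the `SAMPLE` constant are placeholders, and the data is the 15-row single-user sample from the question without the date column.

```python
import sqlite3

# (userid, counter1, counter2) for the 15 sample rows; counter2 runs 42..56.
SAMPLE = [(123, c1, c2) for c1, c2 in zip(
    [6, 6, 6, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 4, 5], range(42, 57))]


def count_runs(rows):
    """Count consecutive rows sharing a counter1 value, per user."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE levels (userid INTEGER, counter1 INTEGER, counter2 INTEGER)")
    con.executemany("INSERT INTO levels VALUES (?, ?, ?)", rows)
    result = con.execute("""
        SELECT userid, counter1, COUNT(*) AS cnt
        FROM (
            SELECT t.*,
                   -- position within the user's whole history
                   ROW_NUMBER() OVER (PARTITION BY userid
                                      ORDER BY counter2) AS seqnum,
                   -- position among rows with the same counter1
                   ROW_NUMBER() OVER (PARTITION BY userid, counter1
                                      ORDER BY counter2) AS seqnum_2
            FROM levels AS t
        ) AS t
        -- the difference stays constant inside one uninterrupted run
        GROUP BY userid, counter1, (seqnum - seqnum_2)
        ORDER BY userid, MIN(counter2)
    """).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in count_runs(SAMPLE):
        print(row)
```

For the first run of counter1 = 5 (counter2 45 to 49), seqnum is 4..8 and seqnum_2 is 1..5, so the difference is 3 on every row. When 5 reappears at counter2 = 56, seqnum is 15 and seqnum_2 is 6, giving 9, so that row lands in its own group.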
You don't actually need LEAD and LAG here; however, getting to a supported version of SQL Server, where LAG (and LEAD) are available, should be a priority.
WITH YourTable AS(
    SELECT *
    FROM (VALUES(123,6,42,CONVERT(datetime2(0),'2010-07-31T00:12:28Z')),
                (123,6,43,CONVERT(datetime2(0),'2010-11-20T00:11:15Z')),
                (123,6,44,CONVERT(datetime2(0),'2011-03-12T00:15:07Z')),
                (123,5,45,CONVERT(datetime2(0),'2011-07-02T01:11:09Z')),
                (123,5,46,CONVERT(datetime2(0),'2011-10-22T00:24:18Z')),
                (123,5,47,CONVERT(datetime2(0),'2012-02-10T23:51:54Z')),
                (123,5,48,CONVERT(datetime2(0),'2012-06-01T23:43:26Z')),
                (123,5,49,CONVERT(datetime2(0),'2012-09-21T23:43:59Z')),
                (123,4,50,CONVERT(datetime2(0),'2013-01-11T23:52:43Z')),
                (123,4,51,CONVERT(datetime2(0),'2013-05-03T23:49:25Z')),
                (123,4,52,CONVERT(datetime2(0),'2013-08-23T23:48:24Z')),
                (123,3,53,CONVERT(datetime2(0),'2013-12-14T00:01:20Z')),
                (123,3,54,CONVERT(datetime2(0),'2014-04-04T23:45:45Z')),
                (123,4,55,CONVERT(datetime2(0),'2014-07-25T23:44:34Z')),
                (123,5,56,CONVERT(datetime2(0),'2014-11-14T23:46:11Z')))V(userid,counter1,counter2,date)),
Grps AS (
    SELECT userid,
           counter1,
           counter2,
           date,
           ROW_NUMBER() OVER (PARTITION BY userid ORDER BY [date]) -
           ROW_NUMBER() OVER (PARTITION BY userid,counter1 ORDER BY [date]) AS Grp
    FROM YourTable)
SELECT userid,
       counter1,
       COUNT(*)
FROM Grps
GROUP BY userid,
         counter1,
         Grp;
Group query in subquery to get column value as column name
The data I have in my database:
| id | some_id | status  |
| 1  | 1       | SUCCESS |
| 2  | 2       | SUCCESS |
| 3  | 1       | SUCCESS |
| 4  | 3       | SUCCESS |
| 5  | 1       | SUCCESS |
| 6  | 4       | FAILED  |
| 7  | 1       | SUCCESS |
| 8  | 1       | FAILED  |
| 9  | 4       | FAILED  |
| 10 | 1       | FAILED  |
.......
I ran a query that groups by some_id and status to get the result below:
| some_id | count | status  |
| 1       | 20    | SUCCESS |
| 2       | 5     | SUCCESS |
| 3       | 10    | SUCCESS |
| 2       | 15    | FAILED  |
| 3       | 12    | FAILED  |
| 4       | 25    | FAILED  |
I want to use the above query as a subquery to get the result below, where the distinct status values become column names.
| some_id | SUCCESS | FAILED |
| 1       | 20      | null/0 |
| 2       | 5       | 15     |
| 3       | 10      | 12     |
| 4       | null/0  | 25     |
Any other approach to get the final data is also appreciated. Let me know if you need more info. Thanks.
You may use a pivot query here with the help of FILTER:
SELECT
    some_id,
    COUNT(*) FILTER (WHERE status = 'SUCCESS') AS SUCCESS,
    COUNT(*) FILTER (WHERE status = 'FAILED') AS FAILED
FROM yourTable
GROUP BY
    some_id;
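On engines that lack the FILTER clause, the same pivot can be written with SUM over a CASE expression. The sketch below is a swapped-in variant, not the answer's exact query: it runs in Python's sqlite3 module against only the ten sample rows shown in the question, and `pivot_status` and `SAMPLE_ROWS` are hypothetical names.

```python
import sqlite3

# (id, some_id, status): the ten rows visible in the question.
SAMPLE_ROWS = [
    (1, 1, "SUCCESS"), (2, 2, "SUCCESS"), (3, 1, "SUCCESS"), (4, 3, "SUCCESS"),
    (5, 1, "SUCCESS"), (6, 4, "FAILED"), (7, 1, "SUCCESS"), (8, 1, "FAILED"),
    (9, 4, "FAILED"), (10, 1, "FAILED"),
]


def pivot_status(rows):
    """Return (some_id, success_count, failed_count) per some_id."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE yourTable (id INTEGER, some_id INTEGER, status TEXT)")
    con.executemany("INSERT INTO yourTable VALUES (?, ?, ?)", rows)
    result = con.execute("""
        SELECT some_id,
               -- each CASE contributes 1 only for rows with the wanted status
               SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) AS success,
               SUM(CASE WHEN status = 'FAILED'  THEN 1 ELSE 0 END) AS failed
        FROM yourTable
        GROUP BY some_id
        ORDER BY some_id
    """).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    print(pivot_status(SAMPLE_ROWS))
    # [(1, 4, 2), (2, 1, 0), (3, 1, 0), (4, 0, 2)]
```

This form yields 0 rather than NULL for a status that never occurs, which matches the "null/0" cells in the desired output.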
SQL to add position depending on multiple columns
I have a table that I am adding a position column to. I need to add a numbered position to all rows already in the table. The numbering depends on 4 columns whose values match between rows. For example:
| id | name | fax | cart | area |
| 1  | jim  | 1   | 4    | 1    |
| 2  | jim  | 1   | 4    | 1    |
| 3  | jim  | 2   | 4    | 1    |
| 4  | jim  | 2   | 4    | 1    |
| 5  | bob  | 1   | 4    | 1    |
| 6  | bob  | 1   | 4    | 1    |
| 7  | bob  | 2   | 5    | 1    |
| 8  | bob  | 2   | 5    | 2    |
| 9  | bob  | 2   | 5    | 2    |
| 10 | bob  | 2   | 5    | 2    |
would result in:
| id | name | fax | cart | area | position |
| 1  | jim  | 1   | 4    | 1    | 1        |
| 2  | jim  | 1   | 4    | 1    | 2        |
| 3  | jim  | 2   | 4    | 1    | 1        |
| 4  | jim  | 2   | 4    | 1    | 2        |
| 5  | bob  | 1   | 4    | 1    | 1        |
| 6  | bob  | 1   | 4    | 1    | 2        |
| 7  | bob  | 2   | 5    | 1    | 1        |
| 8  | bob  | 2   | 5    | 2    | 1        |
| 9  | bob  | 2   | 5    | 2    | 2        |
| 10 | bob  | 2   | 5    | 2    | 3        |
I need an SQL query that will iterate over the table and add the position.
Use row_number():
select
    t.*,
    row_number() over(partition by name, fax, cart, area order by id) position
from mytable t
If you wanted an update query:
update mytable as t
set position = rn
from (
    select id, row_number() over(partition by name, fax, cart, area order by id) rn
    from mytable
) x
where x.id = t.id
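As a quick check of the SELECT form, the sketch below loads the ten sample rows into an in-memory SQLite database through Python's sqlite3 module and prints the computed positions. It assumes SQLite 3.25 or newer for window function support; `positions` and `ROWS` are hypothetical names, and only the read-only query is exercised, not the UPDATE.

```python
import sqlite3

# (id, name, fax, cart, area) copied from the question's example.
ROWS = [
    (1, "jim", 1, 4, 1), (2, "jim", 1, 4, 1), (3, "jim", 2, 4, 1),
    (4, "jim", 2, 4, 1), (5, "bob", 1, 4, 1), (6, "bob", 1, 4, 1),
    (7, "bob", 2, 5, 1), (8, "bob", 2, 5, 2), (9, "bob", 2, 5, 2),
    (10, "bob", 2, 5, 2),
]


def positions(rows):
    """Return (id, position), numbering rows within each
    (name, fax, cart, area) group in id order."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE mytable "
        "(id INTEGER, name TEXT, fax INTEGER, cart INTEGER, area INTEGER)")
    con.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?, ?)", rows)
    result = con.execute("""
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY name, fax, cart, area
                                  ORDER BY id) AS pos
        FROM mytable
        ORDER BY id
    """).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    print([pos for _, pos in positions(ROWS)])
    # [1, 2, 1, 2, 1, 2, 1, 1, 2, 3]
```

The numbering restarts whenever any of the four partition columns changes, which is why id 7 and id 8 both get position 1.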
How to group by with a condition in PySpark
How can I group by with a condition in PySpark? This is some example data:
+-----+-------+-------------+------------+
| zip | state | Agegrouping | patient_id |
+-----+-------+-------------+------------+
| 123 | x     | Adult       | 123        |
| 124 | x     | Children    | 231        |
| 123 | x     | Children    | 456        |
| 156 | x     | Adult       | 453        |
| 124 | y     | Adult       | 34         |
| 432 | y     | Adult       | 23         |
| 234 | y     | Children    | 13         |
| 432 | z     | Children    | 22         |
| 234 | z     | Adult       | 44         |
+-----+-------+-------------+------------+
I would like to see the data as:
+-----+-------+-------+----------+------------+
| zip | state | Adult | Children | patient_id |
+-----+-------+-------+----------+------------+
| 123 | x     | 1     | 1        | 2          |
| 124 | x     | 1     | 1        | 2          |
| 156 | x     | 1     | 0        | 1          |
| 432 | y     | 1     | 1        | 2          |
| 234 | z     | 1     | 1        | 2          |
+-----+-------+-------+----------+------------+
How can I do this?
Here is the Spark SQL version.
df.createOrReplaceTempView('table')
spark.sql('''
    select zip, state,
           count(if(Agegrouping = 'Adult', 1, null)) as adult,
           count(if(Agegrouping = 'Children', 1, null)) as children,
           count(1) as patient_id
    from table
    group by zip, state;
''').show()
+---+-----+-----+--------+----------+
|zip|state|adult|children|patient_id|
+---+-----+-----+--------+----------+
|123|    x|    1|       1|         2|
|156|    x|    1|       0|         1|
|234|    z|    1|       0|         1|
|432|    z|    0|       1|         1|
|234|    y|    0|       1|         1|
|124|    y|    0|       0|         1|
|124|    x|    0|       1|         1|
|432|    y|    1|       0|         1|
+---+-----+-----+--------+----------+
You can use conditional aggregation:
select zip, state,
       sum(case when agegrouping = 'Adult' then 1 else 0 end) as adult,
       sum(case when agegrouping = 'Children' then 1 else 0 end) as children,
       count(*) as num_patients
from t
group by zip, state;
Use conditional aggregation:
select
    zip,
    state,
    sum(case when agegrouping = 'Adult' then 1 else 0 end) as adult,
    sum(case when agegrouping = 'Children' then 1 else 0 end) as children,
    count(*) patient_id
from mytable
group by zip, state
sql query for getting items not rated by both users
Say I have the following table:
+------------+------+--------+
| reviewerID | item | rating |
+------------+------+--------+
| 1          | 1    | 5      |
| 1          | 2    | 5      |
| 1          | 3    | 5      |
| 2          | 4    | 5      |
| 2          | 1    | 5      |
| 2          | 2    | 5      |
+------------+------+--------+
And I want to get the items not rated by reviewer 1 but rated by reviewer 2, and vice versa, into one table. The output should be something like this:
+------------+------+--------+
| reviewerID | item | rating |
+------------+------+--------+
| 1          | 3    | 5      |
| 2          | 4    | 5      |
+------------+------+--------+
You could count the number of reviewers the items had (between those two reviewers) and only select those with one reviewer:
SELECT *
FROM mytable
WHERE item IN (SELECT item
               FROM mytable
               WHERE reviewerID IN (1, 2)
               GROUP BY item
               HAVING COUNT(*) = 1)
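The counting approach can be tried out with Python's sqlite3 module, as in the sketch below. One deliberate difference from the query above, added here as an assumption: the outer query also filters on `reviewerID IN (1, 2)`, so that rows from any other reviewer would not leak into the result if the table held more than two reviewers. `rated_by_one` and `REVIEWS` are hypothetical names.

```python
import sqlite3

# (reviewerID, item, rating) copied from the question's table.
REVIEWS = [(1, 1, 5), (1, 2, 5), (1, 3, 5), (2, 4, 5), (2, 1, 5), (2, 2, 5)]


def rated_by_one(rows, first=1, second=2):
    """Return the rows for items rated by exactly one of the two reviewers."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE mytable (reviewerID INTEGER, item INTEGER, rating INTEGER)")
    con.executemany("INSERT INTO mytable VALUES (?, ?, ?)", rows)
    result = con.execute("""
        SELECT reviewerID, item, rating
        FROM mytable
        WHERE reviewerID IN (?, ?)
          AND item IN (SELECT item
                       FROM mytable
                       WHERE reviewerID IN (?, ?)
                       GROUP BY item
                       -- exactly one of the two reviewers rated it
                       HAVING COUNT(*) = 1)
        ORDER BY reviewerID, item
    """, (first, second, first, second)).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    print(rated_by_one(REVIEWS))
    # [(1, 3, 5), (2, 4, 5)]
```

The HAVING COUNT(*) = 1 test relies on each reviewer rating an item at most once; if duplicate ratings are possible, COUNT(DISTINCT reviewerID) would be the safer condition.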
Here's what you need to get the desired results:
SELECT a.*
FROM Reviewer a
JOIN (
    SELECT DISTINCT item
    FROM Reviewer
    GROUP BY item
    HAVING count(item) < 2) b
ON a.item = b.item
Hope it helps! Good luck!