Creating a rank that resets on a specific value of a column - sql

My current data looks like this (note that it is sorted on datetime):
+----------------+---------------------+---------+
| CustomerNumber | Date | Channel |
+----------------+---------------------+---------+
| 120584446 | 2015-05-22 21:16:05 | A |
| 120584446 | 2015-05-25 18:04:16 | A |
| 120584446 | 2015-05-25 18:05:25 | B |
| 120584446 | 2015-05-28 20:35:09 | A |
| 120584446 | 2015-05-28 20:36:01 | A |
| 120584446 | 2015-05-28 20:37:02 | B |
| 120584446 | 2015-05-29 13:39:00 | B |
+----------------+---------------------+---------+
I want to create a rank in Hive that is partitioned by customer number and restarts whenever the channel is A. It should look like this:
+----------------+---------------------+----------------+------+
| CustomerNumber | Date | Channel | Rank |
+----------------+---------------------+----------------+------+
| 120584446 | 2015-05-22 21:16:05 | A | 1 |
| 120584446 | 2015-05-25 18:04:16 | A | 1 |
| 120584446 | 2015-05-25 18:05:25 | B | 2 |
| 120584446 | 2015-05-28 20:35:09 | A | 1 |
| 120584446 | 2015-05-28 20:36:01 | A | 1 |
| 120584446 | 2015-05-28 20:37:02 | B | 2 |
| 120584446 | 2015-05-29 13:39:00 | B | 3 |
+----------------+---------------------+----------------+------+

One approach is to use a cumulative conditional sum to identify the groups and then use row_number() for the ranking:
select t.*,
       row_number() over (partition by CustomerNumber, grp
                          order by date
                         ) as rank
from (select t.*,
             sum(case when channel = 'A' then 1 else 0 end) over
                 (partition by CustomerNumber order by date) as grp
      from t
     ) t;
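To see why the running sum works, here is a minimal sketch that replays the question's rows in an in-memory SQLite database through Python's sqlite3 module. This is not Hive: it assumes a SQLite build with window functions (3.25 or later), and the column names `dt` and `rnk` are renamed for the sketch. Each 'A' row bumps `grp` by one, so every 'A' starts a new partition and row_number() restarts there.

```python
import sqlite3

# Hypothetical in-memory copy of the question's rows. The table name `t`
# comes from the answer; `dt` replaces the `Date` column for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (CustomerNumber INTEGER, dt TEXT, Channel TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (120584446, "2015-05-22 21:16:05", "A"),
        (120584446, "2015-05-25 18:04:16", "A"),
        (120584446, "2015-05-25 18:05:25", "B"),
        (120584446, "2015-05-28 20:35:09", "A"),
        (120584446, "2015-05-28 20:36:01", "A"),
        (120584446, "2015-05-28 20:37:02", "B"),
        (120584446, "2015-05-29 13:39:00", "B"),
    ],
)

query = """
SELECT s.*,
       ROW_NUMBER() OVER (PARTITION BY CustomerNumber, grp ORDER BY dt) AS rnk
FROM (
    SELECT t.*,
           -- running count of 'A' rows: goes up by one at every 'A'
           SUM(CASE WHEN Channel = 'A' THEN 1 ELSE 0 END)
               OVER (PARTITION BY CustomerNumber ORDER BY dt) AS grp
    FROM t
) AS s
ORDER BY CustomerNumber, dt
"""
rows = conn.execute(query).fetchall()
groups = [r[3] for r in rows]  # the intermediate grp column
ranks = [r[4] for r in rows]   # the final rank
print(groups)
print(ranks)
```

The printed ranks should line up with the Rank column in the question's expected output.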

Related

How to get Max date and sum of its rows SQL

I have the following table:
+------+-------------+----------+---------+
| id   | date        | amount   | amount2 |
+------+-------------+----------+---------+
| 1    | 1/1/2020    | 1000     | 500     |
| 1    | 1/3/2020    | 1558     | 100     |
| 1    | 1/3/2020    | 126      | 200     |
| 2    | 2/5/2020    | 4921     | 500     |
| 2    | 2/5/2020    | 15       | 100     |
| 2    | 1/1/2020    | 5951     | 140     |
| 2    | 1/2/2020    | 1588     | 10      |
| 2    | 1/3/2020    | 1568     | 56      |
| 2    | 1/4/2020    | 12558    | 45      |
+------+-------------+----------+---------+
For each id, I need its max date and the sums of amount and amount2 on that date. How can I do this? Given the data above, I need the following output:
+------+-------------+----------+---------+
| id   | date        | amount   | amount2 |
+------+-------------+----------+---------+
| 1    | 1/3/2020    | 1684     | 300     |
| 2    | 2/5/2020    | 4936     | 600     |
+------+-------------+----------+---------+
Aggregate and use MAX OVER to get the IDs' maximum dates:
select id, [date], sum_amount, sum_amount2
from
(
  select
    id, [date], sum(amount) as sum_amount, sum(amount2) as sum_amount2,
    max([date]) over (partition by id) as max_date_for_id
  from mytable
  group by id, [date]
) aggregated
where [date] = max_date_for_id
order by id;
First, use dense_rank() to find the rows with the latest date for each id:
dense_rank () over (partition by id order by [date] desc)
After that, simply group by id and date and apply sum() to the amounts:
select id, [date], sum(amount), sum(amount2)
from
(
  select *,
         dr = dense_rank () over (partition by id order by [date] desc)
  from your_table
) t
where dr = 1
group by id, [date]
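Both answers can be checked against the question's rows. The sketch below uses SQLite through Python's sqlite3 rather than SQL Server, so the bracketed `[date]` column becomes `dt`, the `dr = ...` alias form becomes `AS dr`, and the dates are written as ISO strings on the assumption that the originals are month/day/year. The MAX OVER variant aggregates in a CTE first, which is only a portability choice for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER, dt TEXT, amount INTEGER, amount2 INTEGER)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [
        (1, "2020-01-01", 1000, 500),
        (1, "2020-01-03", 1558, 100),
        (1, "2020-01-03", 126, 200),
        (2, "2020-02-05", 4921, 500),
        (2, "2020-02-05", 15, 100),
        (2, "2020-01-01", 5951, 140),
        (2, "2020-01-02", 1588, 10),
        (2, "2020-01-03", 1568, 56),
        (2, "2020-01-04", 12558, 45),
    ],
)

# Variant 1: sum per (id, date), then keep the date equal to the id's maximum.
max_over_query = """
WITH per_date AS (
    SELECT id, dt, SUM(amount) AS sum_amount, SUM(amount2) AS sum_amount2
    FROM mytable
    GROUP BY id, dt
),
flagged AS (
    SELECT per_date.*, MAX(dt) OVER (PARTITION BY id) AS max_dt_for_id
    FROM per_date
)
SELECT id, dt, sum_amount, sum_amount2
FROM flagged
WHERE dt = max_dt_for_id
ORDER BY id
"""

# Variant 2: rank dates per id (latest first), keep rank 1, then sum.
dense_rank_query = """
SELECT id, dt, SUM(amount), SUM(amount2)
FROM (
    SELECT *, DENSE_RANK() OVER (PARTITION BY id ORDER BY dt DESC) AS dr
    FROM mytable
) AS ranked
WHERE dr = 1
GROUP BY id, dt
ORDER BY id
"""

via_max_over = conn.execute(max_over_query).fetchall()
via_dense_rank = conn.execute(dense_rank_query).fetchall()
print(via_max_over)
print(via_dense_rank)
```

dense_rank() rather than row_number() matters here: all rows that share the latest date get rank 1, so they all survive the filter and are summed together.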

Selecting the first instance of a vendor, part combination

I am trying to create an indicator of whether a particular transaction was the first time a part was purchased from a particular vendor.
I have a dataset that looks like this:
| transaction_id | vendor_id | part_id | trans_date |
|:--------------:|:---------:|:-------:|:-----------------:|
| 9Bx*2Pc' | a | 873 | 10/12/2018 |
| 1Po.4Ot, | a | 473 | 4/22/2016 |
| 9Sk"7Kv/ | b | 123 | 7/23/2016 |
| 2Lz&7Hu& | a | 873 | 12/20/2017 |
| 8Lz)5Is# | b | 743 | 10/22/2016 |
| 5Sc'6Jl/ | a | 113 | 10/6/2016 |
| 0Ra&8Hb& | a | 653 | 10/4/2017 |
| 4Wc-8Of* | c | 333 | 8/3/2017 |
| 8Vv+9Yo/ | c | 333 | 12/7/2016 |
| 6Qh!1Ha- | c | 333 | 3/28/2017 |
| 2Ol%4Rs# | c | 333 | 5/2/2017 |
| 1Gg#8Cm% | c | 333 | 11/15/2016 |
| 0Lw(6Pv/ | d | 873 | 8/13/2017 |
| 1Gy/7Zw, | a | 443 | 10/12/2018 |
| 2Gz,4Gp. | b | 103 | 1/5/2018 |
| 5Dj)6Wc+ | a | 893 | 12/17/2016 |
| 5Hl-8Ds! | a | 903 | 12/8/2017 |
| 8Ws$3Vy* | b | 873 | 1/13/2018 |
What I want to determine is whether each transaction was the first time (ordered by trans_date) that the part_id was purchased from that vendor_id. I imagine the ideal output would look like this:
| transaction_id | vendor_id | part_id | trans_date | first_time |
|:--------------:|:---------:|:-------:|:-----------------:|:----------:|
| 9Bx*2Pc' | a | 873 | 10/12/2018 | N |
| 1Po.4Ot, | a | 473 | 4/22/2016 | Y |
| 9Sk"7Kv/ | b | 123 | 7/23/2016 | Y |
| 2Lz&7Hu& | a | 873 | 12/20/2017 | Y |
| 8Lz)5Is# | b | 743 | 10/22/2016 | Y |
| 5Sc'6Jl/ | a | 113 | 10/6/2016 | Y |
| 0Ra&8Hb& | a | 653 | 10/4/2017 | Y |
| 4Wc-8Of* | c | 333 | 8/3/2017 | N |
| 8Vv+9Yo/ | c | 333 | 12/7/2016 | N |
| 6Qh!1Ha- | c | 333 | 3/28/2017 | N |
| 2Ol%4Rs# | c | 333 | 5/2/2017 | N |
| 1Gg#8Cm% | c | 333 | 11/15/2016 | Y |
| 0Lw(6Pv/ | d | 873 | 8/13/2017 | Y |
| 1Gy/7Zw, | a | 443 | 10/12/2018 | Y |
| 2Gz,4Gp. | b | 103 | 1/5/2018 | Y |
| 5Dj)6Wc+ | a | 893 | 12/17/2016 | Y |
| 5Hl-8Ds! | a | 903 | 12/8/2017 | Y |
| 8Ws$3Vy* | b | 873 | 1/13/2018 | Y |
So far, I have tried (which was influenced by this post):
WITH
first_instance AS (
SELECT
tbl_trans.*,
ROW_NUMBER() OVER (PARTITION BY vendor_id||part_id ORDER BY trans_date) AS row_nums
FROM
tbl_trans
)
SELECT
x.*,
CASE WHEN y.row_nums = 1 THEN 'Y' ELSE 'N' END AS first_time_indicator
FROM
tbl_trans x
LEFT JOIN first_instance y
But I am met with:
ORA-00905: missing keyword
I have created a SQL Fiddle with this data and the query so far for testing. How can I determine whether a transaction was the first purchase for a part/vendor combination?
Use window functions:
select t.*,
       (case when row_number() over (partition by vendor_id, part_id order by trans_date) = 1
             then 'Y' else 'N'
        end) as first_time
from tbl_trans t;
You don't need a join. (The ORA-00905 error in your attempt comes from the LEFT JOIN, which has no ON clause.)
Apart from row_number, several other analytic functions can produce the desired result.
You can use the first_value analytic function as follows:
Select t.*,
Case
when first_value(trans_date)
over (partition by vendor_id, part_id order by trans_date) = trans_date
then 'Y'
else 'N'
end as first_time
From your_table t;
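In the same way, you can also use min. Note that, unlike row_number, both first_value and min flag every transaction that falls on the earliest date, so tied rows all get 'Y':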
Select t.*,
Case
when min(trans_date)
over (partition by vendor_id, part_id) = trans_date
then 'Y'
else 'N'
end as first_time
From your_table t;
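The variants above agree as long as each vendor/part pair has a unique earliest date. The sketch below, run in SQLite through Python's sqlite3 rather than Oracle, illustrates what happens on a tie. The transaction ids `t1` to `t5` and the same-day duplicate are hypothetical rows made up for this illustration, and the dates are written as ISO strings so they sort correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl_trans "
    "(transaction_id TEXT, vendor_id TEXT, part_id INTEGER, trans_date TEXT)"
)
conn.executemany(
    "INSERT INTO tbl_trans VALUES (?, ?, ?, ?)",
    [
        ("t1", "a", 873, "2017-12-20"),
        ("t2", "a", 873, "2018-10-12"),
        ("t3", "c", 333, "2016-11-15"),
        ("t4", "c", 333, "2016-11-15"),  # hypothetical same-day duplicate
        ("t5", "c", 333, "2016-12-07"),
    ],
)

query = """
SELECT transaction_id,
       CASE WHEN ROW_NUMBER() OVER (PARTITION BY vendor_id, part_id
                                    ORDER BY trans_date) = 1
            THEN 'Y' ELSE 'N' END AS by_row_number,
       CASE WHEN MIN(trans_date) OVER (PARTITION BY vendor_id, part_id)
                 = trans_date
            THEN 'Y' ELSE 'N' END AS by_min
FROM tbl_trans
ORDER BY transaction_id
"""
# Map each transaction id to its (row_number flag, min flag) pair.
flags = {tid: (rn, mn) for tid, rn, mn in conn.execute(query)}
for tid, pair in sorted(flags.items()):
    print(tid, pair)
```

With row_number, exactly one of the two tied rows is flagged, and which one is arbitrary unless the ORDER BY gets a tie-breaker such as transaction_id. With min, both are flagged. Which behavior is right depends on what "first time" should mean for same-day purchases.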

hive/sql:count each user_id gets how many uid

There is a table like:
+-----------+---------+------------+
| uid | user_id | month |
+-----------+---------+------------+
| d23fsdfsa | 101 | 2017-01-02 |
| 43gdasc | 102 | 2017-05-06 |
| b65hrfd | 101 | 2017-08-11 |
| 1wseda | 103 | 2017-09-13 |
| vdfhryd | 101 | 2017-08-06 |
| b6thd3d | 105 | 2017-05-03 |
| ve32h65 | 102 | 2017-01-02 |
| 43gdasc | 102 | 2017-09-06 |
+-----------+---------+------------+
How can I count the rows for each user_id so that, if a user_id appears more than once in the same month, that month is counted only once?
The final table should look like the one below (101 has two uid values in the same month, so that month counts only once):
+---------+-----------+
| user_id | count_num |
+---------+-----------+
| 101 | 2 |
| 102 | 3 |
| 103 | 1 |
| 105 | 1 |
+---------+-----------+
If I understand correctly, you want the number of distinct months for each user. If so:
select user_id, count(distinct trunc(month, 'MONTH')) as count_num
from t
group by user_id;
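As a quick check of the distinct-month idea, here is a sketch on the question's rows using SQLite through Python's sqlite3. SQLite has no Hive-style trunc(), so this sketch swaps in strftime('%Y-%m', ...) to reduce each date to its year and month; that substitution is an assumption of the sketch, not part of the Hive answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (uid TEXT, user_id INTEGER, month TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("d23fsdfsa", 101, "2017-01-02"),
        ("43gdasc", 102, "2017-05-06"),
        ("b65hrfd", 101, "2017-08-11"),
        ("1wseda", 103, "2017-09-13"),
        ("vdfhryd", 101, "2017-08-06"),
        ("b6thd3d", 105, "2017-05-03"),
        ("ve32h65", 102, "2017-01-02"),
        ("43gdasc", 102, "2017-09-06"),
    ],
)

query = """
SELECT user_id,
       -- strftime('%Y-%m', ...) keeps only year and month, e.g. '2017-08'
       COUNT(DISTINCT strftime('%Y-%m', month)) AS count_num
FROM t
GROUP BY user_id
ORDER BY user_id
"""
counts = dict(conn.execute(query).fetchall())
print(counts)
```

User 101 has two rows in 2017-08, which collapse into a single month, giving the count of 2 shown in the expected table.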

SQL window excluding current group?

I'm trying to produce rolled-up summaries of the following data, both for the group in question alone and for everything except that group. I think this can be done with a window function, but I'm having trouble getting the syntax right (in my case Hive SQL).
I want the following data to be aggregated
+------------+---------+--------+
| date | product | rating |
+------------+---------+--------+
| 2018-01-01 | A | 1 |
| 2018-01-02 | A | 3 |
| 2018-01-20 | A | 4 |
| 2018-01-27 | A | 5 |
| 2018-01-29 | A | 4 |
| 2018-02-01 | A | 5 |
| 2017-01-09 | B | NULL |
| 2017-01-12 | B | 3 |
| 2017-01-15 | B | 4 |
| 2017-01-28 | B | 4 |
| 2017-07-21 | B | 2 |
| 2017-09-21 | B | 5 |
| 2017-09-13 | C | 3 |
| 2017-09-14 | C | 4 |
| 2017-09-15 | C | 5 |
| 2017-09-16 | C | 5 |
| 2018-04-01 | C | 2 |
| 2018-01-13 | D | 1 |
| 2018-01-14 | D | 2 |
| 2018-01-24 | D | 3 |
| 2018-01-31 | D | 4 |
+------------+---------+--------+
Aggregated results:
+------+-------+---------+----+------------+------------------+----------+
| year | month | product | ct | avg_rating | avg_rating_other | other_ct |
+------+-------+---------+----+------------+------------------+----------+
| 2018 | 1 | A | 5 | 3.4 | 2.5 | 4 |
| 2018 | 2 | A | 1 | 5 | NULL | 0 |
| 2017 | 1 | B | 4 | 3.6666667 | NULL | 0 |
| 2017 | 7 | B | 1 | 2 | NULL | 0 |
| 2017 | 9 | B | 1 | 5 | 4.25 | 4 |
| 2017 | 9 | C | 4 | 4.25 | 5 | 1 |
| 2018 | 4 | C | 1 | 2 | NULL | 0 |
| 2018 | 1 | D | 4 | 2.5 | 3.4 | 5 |
+------+-------+---------+----+------------+------------------+----------+
I've also considered producing two aggregates, one with the product in question and one without, but I'm having trouble creating an appropriate join key.
You can do:
select year(date), month(date), product,
       count(*) as ct, avg(rating) as avg_rating,
       sum(count(*)) over (partition by year(date), month(date)) - count(*) as other_ct,
       -- count(rating) skips NULL ratings, just as avg() does
       ((sum(sum(rating)) over (partition by year(date), month(date)) - sum(rating)) /
        (sum(count(rating)) over (partition by year(date), month(date)) - count(rating))
       ) as avg_rating_other
from t
group by year(date), month(date), product;
The average for the "other" products is a bit tricky. You need to add everything up for the month, subtract the current product's share, and then compute the average as that sum divided by the matching count. The divisor uses count(rating) rather than count(*) so that NULL ratings are left out, which is what avg() does as well.
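The "month total minus own share" idea can be tried on the question's rows. The sketch below uses SQLite through Python's sqlite3 instead of Hive, so a few things are assumptions of the sketch: the column is named `dt`, year and month come from strftime(), the per-product aggregates are computed in a CTE before the window sums are applied, and NULLIF makes the "no other products" case return NULL explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (dt TEXT, product TEXT, rating INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("2018-01-01", "A", 1), ("2018-01-02", "A", 3), ("2018-01-20", "A", 4),
        ("2018-01-27", "A", 5), ("2018-01-29", "A", 4), ("2018-02-01", "A", 5),
        ("2017-01-09", "B", None), ("2017-01-12", "B", 3), ("2017-01-15", "B", 4),
        ("2017-01-28", "B", 4), ("2017-07-21", "B", 2), ("2017-09-21", "B", 5),
        ("2017-09-13", "C", 3), ("2017-09-14", "C", 4), ("2017-09-15", "C", 5),
        ("2017-09-16", "C", 5), ("2018-04-01", "C", 2),
        ("2018-01-13", "D", 1), ("2018-01-14", "D", 2), ("2018-01-24", "D", 3),
        ("2018-01-31", "D", 4),
    ],
)

query = """
WITH g AS (
    SELECT CAST(strftime('%Y', dt) AS INTEGER) AS yr,
           CAST(strftime('%m', dt) AS INTEGER) AS mon,
           product,
           COUNT(*)      AS ct,         -- all rows, NULL ratings included
           COUNT(rating) AS n_rated,    -- rows with a non-NULL rating
           SUM(rating)   AS rating_sum
    FROM t
    GROUP BY yr, mon, product
)
SELECT yr, mon, product, ct,
       1.0 * rating_sum / n_rated AS avg_rating,
       SUM(ct) OVER w - ct AS other_ct,
       -- month total minus own share, divided by the matching count
       1.0 * (SUM(rating_sum) OVER w - rating_sum)
           / NULLIF(SUM(n_rated) OVER w - n_rated, 0) AS avg_rating_other
FROM g
WINDOW w AS (PARTITION BY yr, mon)
ORDER BY product, yr, mon
"""
# Key each result row by (year, month, product) for easy lookup.
summary = {(r[0], r[1], r[2]): r[3:] for r in conn.execute(query)}
for key, vals in summary.items():
    print(key, vals)
```

For January 2018, product A sees only D's four ratings as "other", and D sees A's five, which is why their avg_rating and avg_rating_other values mirror each other in the expected table.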

equating an entry to an aggregated version of itself

I am trying to check whether a row's value is the maximum within its group. The result is meant to feed into a larger IF expression.
I'd expect it to look something like this:
SELECT
t.id as t_id,
sum(if(t.value = max(t.value), 1, 0)) AS is_max_value
FROM dataset.table AS t
GROUP BY t_id
The response is:
Error: Expression 't.value' is not present in the GROUP BY list
How should my code look to do this?
First compute the max value per group in a subquery, then join that result back to the table.
Here is an example using the public Shakespeare sample data set:
SELECT
t.word,
t.word_count,
t.corpus_date
FROM
[publicdata:samples.shakespeare] t
JOIN (
SELECT
corpus_date,
MAX(word_count) word_count,
FROM
[publicdata:samples.shakespeare]
GROUP BY
1 ) d
ON
d.corpus_date=t.corpus_date
AND t.word_count=d.word_count
LIMIT
25
Results:
+-----+--------+--------------+---------------+
| Row | t_word | t_word_count | t_corpus_date |
+-----+--------+--------------+---------------+
| 1   | the    | 762          | 1597          |
| 2   | the    | 894          | 1598          |
| 3   | the    | 841          | 1590          |
| 4   | the    | 680          | 1606          |
| 5   | the    | 942          | 1607          |
| 6   | the    | 779          | 1609          |
| 7   | the    | 995          | 1600          |
| 8   | the    | 937          | 1599          |
| 9   | the    | 738          | 1612          |
| 10  | the    | 612          | 1595          |
| 11  | the    | 848          | 1592          |
| 12  | the    | 753          | 1594          |
| 13  | the    | 740          | 1596          |
| 14  | I      | 828          | 1603          |
| 15  | the    | 525          | 1608          |
| 16  | the    | 363          | 0             |
| 17  | I      | 629          | 1593          |
| 18  | I      | 447          | 1611          |
| 19  | the    | 715          | 1602          |
| 20  | the    | 717          | 1610          |
+-----+--------+--------------+---------------+
You can see that it retains the words that have the maximum word_count within each partition defined by corpus_date.
Use a window function to "spread" the max value over all relevant records.
This way you can avoid the join.
SELECT
*
FROM (
SELECT
corpus,
corpus_date,
word,
word_count,
MAX(word_count) OVER (PARTITION BY corpus) AS Max_Word_Count
FROM
[publicdata:samples.shakespeare] )
WHERE
word_count=Max_Word_Count
select
id,
value,
integer(value = max_value) as is_max_value
from (
select id, value, max(value) over(partition by id) as max_value
from dataset.table
)
Explanation:
Inner select: for each row, calculates the max of value among all rows with the same id.
Outer select: for each row, compares the row's value with the max value for its group, then converts true or false into 1 or 0 respectively (as expected in the question).
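The inner/outer pattern can be tried outside BigQuery as well. The sketch below uses SQLite through Python's sqlite3, where a comparison already evaluates to 1 or 0, so no integer() conversion is needed. The table name `dataset_table` and its five rows are hypothetical values made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset_table (id INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO dataset_table VALUES (?, ?)",
    [(1, 10), (1, 30), (1, 30), (2, 5), (2, 7)],  # hypothetical rows
)

# Inner select attaches each id's max to every row; outer select compares.
flag_query = """
SELECT id, value, value = max_value AS is_max_value
FROM (
    SELECT id, value, MAX(value) OVER (PARTITION BY id) AS max_value
    FROM dataset_table
)
ORDER BY id, value
"""
flags = conn.execute(flag_query).fetchall()
print(flags)

# The per-row flag can then be aggregated, which is what the question's
# sum(if(...)) was aiming for: how many rows per id hold the maximum.
count_query = """
SELECT id, SUM(value = max_value) AS rows_at_max
FROM (
    SELECT id, value, MAX(value) OVER (PARTITION BY id) AS max_value
    FROM dataset_table
)
GROUP BY id
ORDER BY id
"""
rows_at_max = dict(conn.execute(count_query).fetchall())
print(rows_at_max)
```

Computing the max in the inner query is what sidesteps the original error: the outer query no longer mixes a raw column with an aggregate over the same grouping level.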