Hive: how to get the last 3 months' total spend when using a join

How can I get each customer's total spend for the last 3 months by joining source1 and source2 to produce the target table?
source1:
+--------+----------+
| cst_id | date |
+--------+----------+
| a | 20180125 |
| b | 20180627 |
| c | 20181122 |
| d | 20180304 |
+--------+----------+
source2 (to be joined with source1):
+--------+--------+-------+
| cst_id | month | spend |
+--------+--------+-------+
| a | 201710 | 6.2 |
| a | 201711 | 0.5 |
| a | 201712 | 4.3 |
| a | 201801 | 6.5 |
| a | 201802 | 7 |
| a | 201803 | 11 |
| a | 201804 | 23 |
| a | 201805 | 67 |
| a | 201806 | 8.1 |
| a | 201807 | 0.2 |
| a | 201808 | 9.1 |
| a | 201809 | 1 |
| a | 201810 | 5 |
| a | 201811 | 6 |
| a | 201812 | 9 |
| b | 201710 | 6.2 |
| b | 201711 | 0.5 |
| b | 201712 | 4.3 |
| b | 201801 | 6.5 |
| b | 201802 | 7 |
| b | 201803 | 11 |
| b | 201804 | 23 |
| b | 201805 | 67 |
| b | 201806 | 8.1 |
| b | 201807 | 0.2 |
| b | 201808 | 9.1 |
| b | 201809 | 1 |
| b | 201810 | 5 |
| b | 201811 | 6 |
| b | 201812 | 9 |
+--------+--------+-------+
target table (in the end, each cst_id gets only one row):
+--------+----------+-----------------+
| cst_id | date | last3monthSpend |
+--------+----------+-----------------+
| a | 20180125 | 11 |
| b | 20180627 | 101 |
+--------+----------+-----------------+

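You can do what you want using join, group by, and window functions. The following shows the logic:
select s.cst_id, s.date, sum(s.spend) as last3monthSpend
from (select s1.cst_id, s1.date, s2.spend,
             row_number() over (partition by s1.cst_id order by s2.month desc) as seqnum
      from source1 s1 join
           source2 s2
           on s2.cst_id = s1.cst_id and
              s2.month < substr(s1.date, 1, 6)
     ) s
where seqnum <= 3
group by s.cst_id, s.date;
The only subtlety is how to compare the date and month columns. This version assumes both are strings: substr(s1.date, 1, 6) reduces the date to yyyymm, so the date's own month is excluded, which matches the target table (for a, 6.2 + 0.5 + 4.3 = 11). Note that row_number() keeps the three most recent rows, so the result equals the last three calendar months only when no month is missing from source2.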
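As a check against the sample data, here is a minimal sketch that runs a version of this query in an in-memory SQLite database through Python's sqlite3 module. SQLite stands in for Hive here, which is an assumption: the window-function syntax is the same, but the string functions and join rules of a given Hive version may differ. The query truncates date to yyyymm before comparing, on the assumption (taken from the target table) that the date's own month should not count.

```python
import sqlite3

# In-memory SQLite database loaded with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("create table source1 (cst_id text, date text)")
conn.execute("create table source2 (cst_id text, month text, spend real)")
conn.executemany(
    "insert into source1 values (?, ?)",
    [("a", "20180125"), ("b", "20180627"), ("c", "20181122"), ("d", "20180304")],
)
months = ["201710", "201711", "201712", "201801", "201802", "201803", "201804",
          "201805", "201806", "201807", "201808", "201809", "201810", "201811",
          "201812"]
spend = [6.2, 0.5, 4.3, 6.5, 7, 11, 23, 67, 8.1, 0.2, 9.1, 1, 5, 6, 9]
conn.executemany(
    "insert into source2 values (?, ?, ?)",
    [(cst, m, s) for cst in ("a", "b") for m, s in zip(months, spend)],
)

query = """
select s.cst_id, s.date, round(sum(s.spend), 2) as last3monthSpend
from (select s1.cst_id, s1.date, s2.spend,
             row_number() over (partition by s1.cst_id
                                order by s2.month desc) as seqnum
      from source1 s1
      join source2 s2
        on s2.cst_id = s1.cst_id
       and s2.month < substr(s1.date, 1, 6)  -- months strictly before the date's month
     ) s
where s.seqnum <= 3
group by s.cst_id, s.date
order by s.cst_id
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Customers c and d drop out because the inner join finds no spend rows for them, which is consistent with the target table listing only a and b.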

Related

Selecting the first instance of a vendor, part combination

I am trying to create an indicator for if a particular transaction was the first time a part was purchased from a particular vendor.
I have a dataset that looks like this:
| transaction_id | vendor_id | part_id | trans_date |
|:--------------:|:---------:|:-------:|:-----------------:|
| 9Bx*2Pc' | a | 873 | 10/12/2018 |
| 1Po.4Ot, | a | 473 | 4/22/2016 |
| 9Sk"7Kv/ | b | 123 | 7/23/2016 |
| 2Lz&7Hu& | a | 873 | 12/20/2017 |
| 8Lz)5Is# | b | 743 | 10/22/2016 |
| 5Sc'6Jl/ | a | 113 | 10/6/2016 |
| 0Ra&8Hb& | a | 653 | 10/4/2017 |
| 4Wc-8Of* | c | 333 | 8/3/2017 |
| 8Vv+9Yo/ | c | 333 | 12/7/2016 |
| 6Qh!1Ha- | c | 333 | 3/28/2017 |
| 2Ol%4Rs# | c | 333 | 5/2/2017 |
| 1Gg#8Cm% | c | 333 | 11/15/2016 |
| 0Lw(6Pv/ | d | 873 | 8/13/2017 |
| 1Gy/7Zw, | a | 443 | 10/12/2018 |
| 2Gz,4Gp. | b | 103 | 1/5/2018 |
| 5Dj)6Wc+ | a | 893 | 12/17/2016 |
| 5Hl-8Ds! | a | 903 | 12/8/2017 |
| 8Ws$3Vy* | b | 873 | 1/13/2018 |
What I am looking to do is determine if the transaction_id was the first time (sorted by trans_date), that the part_id was purchased from a vendor_id. I would imagine the ideal output to look like this:
| transaction_id | vendor_id | part_id | trans_date | first_time |
|:--------------:|:---------:|:-------:|:-----------------:|:----------:|
| 9Bx*2Pc' | a | 873 | 10/12/2018 | N |
| 1Po.4Ot, | a | 473 | 4/22/2016 | Y |
| 9Sk"7Kv/ | b | 123 | 7/23/2016 | Y |
| 2Lz&7Hu& | a | 873 | 12/20/2017 | Y |
| 8Lz)5Is# | b | 743 | 10/22/2016 | Y |
| 5Sc'6Jl/ | a | 113 | 10/6/2016 | Y |
| 0Ra&8Hb& | a | 653 | 10/4/2017 | Y |
| 4Wc-8Of* | c | 333 | 8/3/2017 | N |
| 8Vv+9Yo/ | c | 333 | 12/7/2016 | N |
| 6Qh!1Ha- | c | 333 | 3/28/2017 | N |
| 2Ol%4Rs# | c | 333 | 5/2/2017 | N |
| 1Gg#8Cm% | c | 333 | 11/15/2016 | Y |
| 0Lw(6Pv/ | d | 873 | 8/13/2017 | Y |
| 1Gy/7Zw, | a | 443 | 10/12/2018 | Y |
| 2Gz,4Gp. | b | 103 | 1/5/2018 | Y |
| 5Dj)6Wc+ | a | 893 | 12/17/2016 | Y |
| 5Hl-8Ds! | a | 903 | 12/8/2017 | Y |
| 8Ws$3Vy* | b | 873 | 1/13/2018 | Y |
So far, I have tried (which was influenced by this post):
WITH
first_instance AS (
SELECT
tbl_trans.*,
ROW_NUMBER() OVER (PARTITION BY vendor_id||part_id ORDER BY trans_date) AS row_nums
FROM
tbl_trans
)
SELECT
x.*,
CASE WHEN y.row_nums = 1 THEN 'Y' ELSE 'N' END AS first_time_indicator
FROM
tbl_trans x
LEFT JOIN first_instance y
But I am met with:
ORA-00905: missing keyword
I have created a SQL FIDDLE with this data and the query thus far for testing. How can I determine if a transaction was a first-time purchase for a part/vendor combination?

Use window functions:
select t.*,
(case when row_number() over (partition by vendor_id, part_id order by trans_date) = 1
then 'Y' else 'N'
end) as first_time
from tbl_trans t;
You don't need a join.
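One thing worth checking is the type of trans_date. The sketch below assumes, purely for illustration, that the values are stored as M/D/YYYY text as displayed in the question; in that case ordering by the raw string picks the wrong "first" row for vendor a, part 873. The helper first_time_flags is a hypothetical name, and the rows are a subset of the sample data.

```python
from datetime import datetime

# Subset of the sample rows: (transaction_id, vendor_id, part_id, trans_date).
rows = [
    ("9Bx*2Pc'", "a", 873, "10/12/2018"),
    ("2Lz&7Hu&", "a", 873, "12/20/2017"),
    ("8Vv+9Yo/", "c", 333, "12/7/2016"),
    ("1Gg#8Cm%", "c", 333, "11/15/2016"),
    ("4Wc-8Of*", "c", 333, "8/3/2017"),
]

def first_time_flags(rows, key):
    """Mark the earliest row per (vendor_id, part_id) with 'Y', all others 'N'.

    `key` converts trans_date into the value used for ordering.
    """
    earliest = {}
    for tid, vendor, part, trans_date in rows:
        group = (vendor, part)
        if group not in earliest or key(trans_date) < key(earliest[group][1]):
            earliest[group] = (tid, trans_date)
    firsts = {tid for tid, _ in earliest.values()}
    return {tid: ("Y" if tid in firsts else "N") for tid, *_ in rows}

def parse(text):
    # Interpret the text as a real calendar date (month/day/year).
    return datetime.strptime(text, "%m/%d/%Y")

by_date = first_time_flags(rows, parse)  # ordering by actual dates
by_text = first_time_flags(rows, str)    # ordering by the raw strings

print(by_date["2Lz&7Hu&"], by_date["9Bx*2Pc'"])
print(by_text["2Lz&7Hu&"], by_text["9Bx*2Pc'"])
```

If the column is a true DATE, the window-function query orders correctly as written; if it is text, converting it to a date inside the order by clause avoids the mismatch shown by by_text.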

Apart from row_number, there are other ways to achieve the desired result with analytic functions.
You can use the first_value analytic function as follows:
Select t.*,
Case
when first_value(trans_date)
over (partition by vendor_id, part_id order by trans_date) = trans_date
then 'Y'
else 'N'
end as first_time
From your_table t;
In the same way, you can also use min:
Select t.*,
Case
when min(trans_date)
over (partition by vendor_id, part_id) = trans_date
then 'Y'
else 'N'
end as first_time
From your_table t;

Postgres multi row to string and calculate

I have these tables:
tblproduk :
| skuid | namabarang |
|--------|-----------------|
| 123456 | INDOMIE GORENG |
| 234567 | COKLAT BENGBENG |
| 345678 | BISKUIT |
tblproduk_satuan:
| id | skuid | kodebarang | satuan | konversi | price |
|----|--------|------------|--------|----------|--------|
| 1 | 123456 | ABC1 | PCS | 1 | 6000 |
| 2 | 123456 | ABC2 | DUS | 20 | 100000 |
| 3 | 234567 | BCD | PCS | 1 | 3000 |
| 4 | 345678 | CDE1 | BKS | 1 | 4500 |
| 5 | 345678 | CDE2 | LSN | 12 | 50000 |
| 6 | 345678 | CDE3 | DUS | 48 | 190000 |
tblproduk_stock:
| id | skuid | awal | masuk | keluar |
|----|--------|------|-------|--------|
| 1 | 123456 | 10 | 50 | 30 |
| 2 | 234567 | 0 | 100 | 20 |
| 3 | 345678 | 20 | 400 | 21 |
Here is the sqlfiddle of my table.
What is the most efficient way to convert the multiple rows in tblproduct_satuan to a string, do the calculation, and display it like this:
| skuid  | namabarang      | stock | satuan | Remarks           | Amount    |
|--------|-----------------|-------|--------|-------------------|-----------|
| 123456 | INDOMIE GORENG  | 30    | PCS    | 1 DUS 10 PCS      | 160.000   |
| 234567 | COKLAT BENGBENG | 80    | PCS    | 80 PCS            | 240.000   |
| 345678 | BISKUIT         | 399   | BKS    | 8 DUS 1 LSN 3 BKS | 1.583.500 |
Hope to get help from the expert.
Thank you

If I understood correctly, here is a query for your requirement:
WITH CTE AS (
select
t1.skuid,
t1.namabarang,
t3.masuk+t3.awal-t3.keluar "stock",
t2.satuan,
t2.konversi,
floor(mod((t3.masuk+t3.awal-t3.keluar),coalesce(lag(t2.konversi) over (partition by t1.skuid order by t2.konversi desc ),(t3.masuk+t3.awal-t3.keluar)+1))/t2.konversi) "count_",
t2.price,
row_number() over (partition by t1.skuid order by t2.konversi) "rn"
from
tblproduct t1
inner join tblproduct_satuan t2 on t1.skuid=t2.skuid
inner join tblproduct_onhand t3 on t3.skuid=t1.skuid
)
select
skuid,
namabarang,
stock,
min(satuan) filter (where rn=1) "satuan",
string_agg(concat(count_,' ',satuan), ' ' order by konversi desc) "Remarks",
sum(price*count_) "Amount"
from cte
group by 1,2,3
In the WITH block I calculate all the required values, and the outer query then aggregates them for the final output.
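The arithmetic behind the Remarks and Amount columns can be cross-checked outside the database. The following is a minimal sketch, not the query's own method: it walks the units from largest to smallest and carries the remainder with divmod, whereas the SQL takes the stock modulo the next larger unit. The two agree when every konversi divides the next larger one, as in the sample data. The function name break_down is hypothetical.

```python
def break_down(stock, units):
    """Split `stock` into units, largest konversi first.

    `units` is a list of (satuan, konversi, price) tuples.
    Returns (remarks, amount).
    """
    parts = []
    amount = 0
    remaining = stock
    for satuan, konversi, price in sorted(units, key=lambda u: u[1], reverse=True):
        # How many whole units fit, and what is left for the smaller units.
        count, remaining = divmod(remaining, konversi)
        parts.append(f"{count} {satuan}")
        amount += count * price
    return " ".join(parts), amount

# BISKUIT: stock = awal + masuk - keluar = 20 + 400 - 21 = 399
biskuit = break_down(399, [("BKS", 1, 4500), ("LSN", 12, 50000), ("DUS", 48, 190000)])
# INDOMIE GORENG: 10 + 50 - 30 = 30
indomie = break_down(30, [("PCS", 1, 6000), ("DUS", 20, 100000)])
print(biskuit)
print(indomie)
```

Like the query, this sketch keeps units whose count is zero in the remarks string; filtering them out would be a small change if that is not wanted.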

How can I add a count to rank null values in SQL Hive?

This is what I have right now:
| time | car_id | order | in_order |
|-------|--------|-------|----------|
| 12:31 | 32 | null | 0 |
| 12:33 | 32 | null | 0 |
| 12:35 | 32 | null | 0 |
| 12:37 | 32 | 123 | 1 |
| 12:38 | 32 | 123 | 1 |
| 12:39 | 32 | 123 | 1 |
| 12:41 | 32 | 123 | 1 |
| 12:43 | 32 | 123 | 1 |
| 12:45 | 32 | null | 0 |
| 12:47 | 32 | null | 0 |
| 12:49 | 32 | 321 | 1 |
| 12:51 | 32 | 321 | 1 |
I'm trying to rank orders, including those that have null values, in this case by car_id.
This is the result I'm looking for:
| time | car_id | order | in_order | row |
|-------|--------|-------|----------|-----|
| 12:31 | 32 | null | 0 | 1 |
| 12:33 | 32 | null | 0 | 1 |
| 12:35 | 32 | null | 0 | 1 |
| 12:37 | 32 | 123 | 1 | 2 |
| 12:38 | 32 | 123 | 1 | 2 |
| 12:39 | 32 | 123 | 1 | 2 |
| 12:41 | 32 | 123 | 1 | 2 |
| 12:43 | 32 | 123 | 1 | 2 |
| 12:45 | 32 | null | 0 | 3 |
| 12:47 | 32 | null | 0 | 3 |
| 12:49 | 32 | 321 | 1 | 4 |
| 12:51 | 32 | 321 | 1 | 4 |
I just don't know how to manage a count for the null values.
Thanks!

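You can flag each row where order changes from the previous row (treating NULL as a value of its own) and then take a running sum of the flags:
select t.*,
       sum(is_start) over (partition by car_id order by `time`) as `row`
from (select t.*,
             (case when lag(coalesce(`order`, -1)) over (partition by car_id order by `time`) = coalesce(`order`, -1)
                   then 0 else 1
              end) as is_start
      from t
     ) t;
This assumes that -1 never occurs as a real order value. The backticks guard against order, time, and row being treated as keywords.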
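The expected numbering can be cross-checked with a small sketch that starts a new group whenever the order value changes between consecutive rows of the same car, with NULL (None here) treated as a value of its own. This is a plain-Python illustration of the grouping rule, not Hive code, and number_islands is a hypothetical helper name.

```python
def number_islands(rows):
    """Assign a group number to each row.

    `rows` is a list of (time, car_id, order) tuples, already sorted by
    car_id and time. The number restarts at 1 for each car and increases
    by 1 every time `order` differs from the previous row's value.
    """
    numbers = []
    prev_car = object()  # sentinel that never equals a real car_id
    prev_order = None
    group = 0
    for _time, car_id, order in rows:
        if car_id != prev_car:
            group = 1
        elif order != prev_order:
            group += 1
        numbers.append(group)
        prev_car, prev_order = car_id, order
    return numbers

times = ["12:31", "12:33", "12:35", "12:37", "12:38", "12:39",
         "12:41", "12:43", "12:45", "12:47", "12:49", "12:51"]
orders = [None] * 3 + [123] * 5 + [None] * 2 + [321] * 2
sample = [(t, 32, o) for t, o in zip(times, orders)]
numbers = number_islands(sample)
print(numbers)
```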

Join with other table timestamp and sum of the columns

I'm trying to generate an aggregate table.
Let's say this is my tblA.
| type | name | timestap |
|------|------|---------------------|
| prod | t1 | 2020-06-01 01:00:00 |
| prod | t2 | 2020-06-01 01:00:02 |
| prod | t3 | 2020-06-01 01:00:03 |
| test | t4 | 2020-06-01 02:20:02 |
| test | t5 | 2020-06-01 02:20:03 |
and tblB
| tid | starttime | name | subtask | maintask |
|-----|---------------------|------|---------|----------|
| 1 | 2020-06-01 01:10:00 | t1 | 5 | 10 |
| 1 | 2020-06-01 01:10:00 | t1 | 6 | 10 |
| 1 | 2020-06-01 01:10:00 | t1 | 7 | 10 |
| 1 | 2020-06-01 01:10:00 | t1 | 8 | 10 |
| 2 | 2020-06-01 00:01:00 | t1 | 3 | 10 |
| 2 | 2020-05-01 00:02:00 | t1 | 5 | 15 |
| 4 | 2020-06-01 01:00:00 | t2 | 10 | 10 |
| 5 | 2020-06-01 11:00:10 | t2 | 10 | 20 |
| 5 | 2020-06-01 11:00:10 | t2 | 11 | 20 |
| 5 | 2020-06-01 11:00:10 | t2 | 12 | 20 |
Now I need to create a report table with the sums of subtask and maintask. There is a condition, though: for each name, we need to pick only the tid, subtask, and maintask rows whose starttime is greater than tblA's timestamp, and then compute the sums.
Expected output:
| type | name | sum_of_subtask | sum_of_maintask | diff |
|------|------|----------------|-----------------|------|
| prod | t1 | 26 | 40 | 14 |
| prod | t2 | 33 | 60 | 27 |
For t1, the tid would be 1, because its starttime is > tblA.timestamp.
For t2, the tid is 5, because tblB.starttime > tblA.timestamp.
The other condition is that the rows I pick must belong to the MAX(tid) among those where starttime > tblA.timestamp.
Then I take those rows, compute the sums, and put the difference between sum_of_subtask and sum_of_maintask in the diff column.
I'm not sure how to write the logic for this.

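You need a simple join with sum() for the aggregation. Note that this query sums every row that passes the starttime filter and does not apply the MAX(tid) condition; on the sample data the result is the same, because only one tid per name qualifies.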
select
type,
ta.name,
sum(subtask) as sum_of_subtask,
sum(maintask) as sum_of_maintask,
sum(maintask - subtask) as diff
from tblA ta
join tblB tb
on ta.name = tb.name
where starttime > timestap
group by
type,
ta.name;
output:
| type | name | sum_of_subtask | sum_of_maintask | diff |
| ---- | ---- | -------------- | --------------- | ---- |
| prod | t1 | 26 | 40 | 14 |
| prod | t2 | 33 | 60 | 27 |
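If the MAX(tid) condition from the question has to be enforced, one option is an extra filter on tid. The sketch below is an assumption-laden illustration: it runs in SQLite through Python's sqlite3 module rather than in the asker's database, and it uses a correlated subquery, which some engines restrict. On the sample data it returns the same rows as the expected output; the filter only changes the result when more than one tid per name passes the starttime condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tblA (type text, name text, timestap text)")
conn.execute(
    "create table tblB (tid integer, starttime text, name text,"
    " subtask integer, maintask integer)"
)
conn.executemany("insert into tblA values (?, ?, ?)", [
    ("prod", "t1", "2020-06-01 01:00:00"),
    ("prod", "t2", "2020-06-01 01:00:02"),
    ("prod", "t3", "2020-06-01 01:00:03"),
    ("test", "t4", "2020-06-01 02:20:02"),
    ("test", "t5", "2020-06-01 02:20:03"),
])
conn.executemany("insert into tblB values (?, ?, ?, ?, ?)", [
    (1, "2020-06-01 01:10:00", "t1", 5, 10),
    (1, "2020-06-01 01:10:00", "t1", 6, 10),
    (1, "2020-06-01 01:10:00", "t1", 7, 10),
    (1, "2020-06-01 01:10:00", "t1", 8, 10),
    (2, "2020-06-01 00:01:00", "t1", 3, 10),
    (2, "2020-05-01 00:02:00", "t1", 5, 15),
    (4, "2020-06-01 01:00:00", "t2", 10, 10),
    (5, "2020-06-01 11:00:10", "t2", 10, 20),
    (5, "2020-06-01 11:00:10", "t2", 11, 20),
    (5, "2020-06-01 11:00:10", "t2", 12, 20),
])

query = """
select ta.type, ta.name,
       sum(tb.subtask) as sum_of_subtask,
       sum(tb.maintask) as sum_of_maintask,
       sum(tb.maintask - tb.subtask) as diff
from tblA ta
join tblB tb
  on ta.name = tb.name
where tb.starttime > ta.timestap
  -- keep only the highest tid among the rows that pass the time filter
  and tb.tid = (select max(b2.tid)
                from tblB b2
                where b2.name = ta.name
                  and b2.starttime > ta.timestap)
group by ta.type, ta.name
order by ta.name
"""
report = conn.execute(query).fetchall()
for row in report:
    print(row)
```

The timestamps are stored as ISO-formatted text here, so plain string comparison orders them correctly; a real timestamp column would compare the same way.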

SQL window excluding current group?

I'm trying to provide rolled up summaries of the following data including only the group in question as well as excluding the group. I think this can be done with a window function, but I'm having problems with getting the syntax down (in my case Hive SQL).
I want the following data to be aggregated
+------------+---------+--------+
| date | product | rating |
+------------+---------+--------+
| 2018-01-01 | A | 1 |
| 2018-01-02 | A | 3 |
| 2018-01-20 | A | 4 |
| 2018-01-27 | A | 5 |
| 2018-01-29 | A | 4 |
| 2018-02-01 | A | 5 |
| 2017-01-09 | B | NULL |
| 2017-01-12 | B | 3 |
| 2017-01-15 | B | 4 |
| 2017-01-28 | B | 4 |
| 2017-07-21 | B | 2 |
| 2017-09-21 | B | 5 |
| 2017-09-13 | C | 3 |
| 2017-09-14 | C | 4 |
| 2017-09-15 | C | 5 |
| 2017-09-16 | C | 5 |
| 2018-04-01 | C | 2 |
| 2018-01-13 | D | 1 |
| 2018-01-14 | D | 2 |
| 2018-01-24 | D | 3 |
| 2018-01-31 | D | 4 |
+------------+---------+--------+
Aggregated results:
+------+-------+---------+----+------------+------------------+----------+
| year | month | product | ct | avg_rating | avg_rating_other | other_ct |
+------+-------+---------+----+------------+------------------+----------+
| 2018 | 1 | A | 5 | 3.4 | 2.5 | 4 |
| 2018 | 2 | A | 1 | 5 | NULL | 0 |
| 2017 | 1 | B | 4 | 3.6666667 | NULL | 0 |
| 2017 | 7 | B | 1 | 2 | NULL | 0 |
| 2017 | 9 | B | 1 | 5 | 4.25 | 4 |
| 2017 | 9 | C | 4 | 4.25 | 5 | 1 |
| 2018 | 4 | C | 1 | 2 | NULL | 0 |
| 2018 | 1 | D | 4 | 2.5 | 3.4 | 5 |
+------+-------+---------+----+------------+------------------+----------+
I've also considered producing two aggregates, one with the product in question and one without, but having trouble with creating the appropriate joining key.

You can do:
select year(date), month(date), product,
       count(*) as ct, avg(rating) as avg_rating,
       sum(count(*)) over (partition by year(date), month(date)) - count(*) as ct_other,
       ((sum(sum(rating)) over (partition by year(date), month(date)) - sum(rating)) /
        (sum(count(rating)) over (partition by year(date), month(date)) - count(rating))
       ) as avg_other
from t
group by year(date), month(date), product;
The rating for the "other" products is a bit tricky. You need to add everything up for the month, subtract out the current product's group, and calculate the average as that sum divided by the matching count. Using count(rating) in the denominator keeps rows with a NULL rating out of the average, just as avg() does.
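The "total minus own group" idea can be verified against the sample data with a short sketch. This is plain Python rather than Hive SQL, the function name summarize is hypothetical, and it assumes that NULL ratings count as rows but are left out of both averages, which is what the expected results imply.

```python
from collections import defaultdict

# Sample rows from the question: (date, product, rating); None stands for NULL.
ratings = [
    ("2018-01-01", "A", 1), ("2018-01-02", "A", 3), ("2018-01-20", "A", 4),
    ("2018-01-27", "A", 5), ("2018-01-29", "A", 4), ("2018-02-01", "A", 5),
    ("2017-01-09", "B", None), ("2017-01-12", "B", 3), ("2017-01-15", "B", 4),
    ("2017-01-28", "B", 4), ("2017-07-21", "B", 2), ("2017-09-21", "B", 5),
    ("2017-09-13", "C", 3), ("2017-09-14", "C", 4), ("2017-09-15", "C", 5),
    ("2017-09-16", "C", 5), ("2018-04-01", "C", 2),
    ("2018-01-13", "D", 1), ("2018-01-14", "D", 2), ("2018-01-24", "D", 3),
    ("2018-01-31", "D", 4),
]

def summarize(rows):
    # Per (year, month, product): [row count, non-NULL rating count, rating sum].
    groups = defaultdict(lambda: [0, 0, 0])
    for date, product, rating in rows:
        key = (int(date[:4]), int(date[5:7]), product)
        groups[key][0] += 1
        if rating is not None:
            groups[key][1] += 1
            groups[key][2] += rating

    # Per (year, month): the same three totals across all products.
    totals = defaultdict(lambda: [0, 0, 0])
    for (year, month, _product), stats in groups.items():
        for i, value in enumerate(stats):
            totals[(year, month)][i] += value

    summary = {}
    for (year, month, product), (ct, n, total) in groups.items():
        all_ct, all_n, all_total = totals[(year, month)]
        other_n = all_n - n  # non-NULL ratings belonging to other products
        summary[(year, month, product)] = {
            "ct": ct,
            "avg_rating": total / n if n else None,
            "other_ct": all_ct - ct,
            "avg_rating_other": (all_total - total) / other_n if other_n else None,
        }
    return summary

summary = summarize(ratings)
print(summary[(2018, 1, "A")])
print(summary[(2017, 9, "C")])
```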
