Hive self join based on one column

I have a table in Hive whose data comes from an SAP system. The table has the following columns and data:
+======================================================================+
|document_number | year | cost_centre | vendor_account_number | amount |
+----------------------------------------------------------------------+
| 1 | 2016 | XZ10 | | 123.5 |
+----------------------------------------------------------------------+
| 1 | 2016 | XZ10 | 1234567890 | 25.96 |
+----------------------------------------------------------------------+
| 1 | 2016 | XZ10 | | 586 |
+----------------------------------------------------------------------+
As shown above, the value of the vendor_account_number column is present in only one row, and I want to copy it to all the other rows.
The expected output is as follows:
+======================================================================+
|document_number | year | cost_centre | vendor_account_number | amount |
+----------------------------------------------------------------------+
| 1 | 2016 | XZ10 | 1234567890 | 123.5 |
+----------------------------------------------------------------------+
| 1 | 2016 | XZ10 | 1234567890 | 25.96 |
+----------------------------------------------------------------------+
| 1 | 2016 | XZ10 | 1234567890 | 586 |
+----------------------------------------------------------------------+
To achieve this, I wrote the following CTE in Hive:
with non_blank_account_no as(
select document_number, vendor_account_number
from my_table
where vendor_account_number != ''
)
and then did a self left outer join as follows:
select
a.document_number, a.year,
a.cost_centre, a.amount,
b.vendor_account_number
from my_table a
left outer join non_blank_account_no b on a.document_number = b.document_number
where a.document_number = ' '
but I am getting duplicated output, as shown below:
+======================================================================+
|document_number | year | cost_centre | vendor_account_number | amount |
+----------------------------------------------------------------------+
| 1 | 2016 | XZ10 | 1234567890 | 123.5 |
+----------------------------------------------------------------------+
| 1 | 2016 | XZ10 | 1234567890 | 25.96 |
+----------------------------------------------------------------------+
| 1 | 2016 | XZ10 | 1234567890 | 586 |
+----------------------------------------------------------------------+
| 1 | 2016 | XZ10 | 1234567890 | 123.5 |
+----------------------------------------------------------------------+
| 1 | 2016 | XZ10 | 1234567890 | 25.96 |
+----------------------------------------------------------------------+
| 1 | 2016 | XZ10 | 1234567890 | 586 |
+----------------------------------------------------------------------+
Can anyone please help me understand what is wrong with my Hive query?

In many use cases a self-join can be replaced by a window function:
select document_number
,year
,cost_centre
,max (case when vendor_account_number <> '' then vendor_account_number end) over
(
partition by document_number
) as vendor_account_number
,amount
from my_table
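A join returns one output row per matching pair, so if the CTE yields more than one row for a document_number, every row of my_table for that document is repeated. The window version cannot multiply rows. Here is a minimal sketch of it on the three sample rows, assuming an in-memory SQLite database (3.25 or later, for window functions) as a stand-in for Hive; only the harness is Python, the query is the one above:

```python
import sqlite3

# In-memory SQLite stands in for Hive here (an assumption made only so the
# example is runnable; window functions need SQLite 3.25+).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table my_table ("
    " document_number int, year int, cost_centre text,"
    " vendor_account_number text, amount real)"
)
conn.executemany(
    "insert into my_table values (?, ?, ?, ?, ?)",
    [
        (1, 2016, "XZ10", "", 123.5),
        (1, 2016, "XZ10", "1234567890", 25.96),
        (1, 2016, "XZ10", "", 586.0),
    ],
)

# max() ignores the NULLs produced by the CASE for blank values, so each row
# receives the non-blank account number found in its document_number partition.
filled = conn.execute(
    """
    select document_number, year, cost_centre,
           max(case when vendor_account_number <> '' then vendor_account_number end)
               over (partition by document_number) as vendor_account_number,
           amount
    from my_table
    order by amount
    """
).fetchall()

for row in filled:
    print(row)
```

The row count stays at three because no join is involved.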

Related

Selecting the first instance of a vendor, part combination

I am trying to create an indicator for if a particular transaction was the first time a part was purchased from a particular vendor.
I have a dataset that looks like this:
| transaction_id | vendor_id | part_id | trans_date |
|:--------------:|:---------:|:-------:|:-----------------:|
| 9Bx*2Pc' | a | 873 | 10/12/2018 |
| 1Po.4Ot, | a | 473 | 4/22/2016 |
| 9Sk"7Kv/ | b | 123 | 7/23/2016 |
| 2Lz&7Hu& | a | 873 | 12/20/2017 |
| 8Lz)5Is# | b | 743 | 10/22/2016 |
| 5Sc'6Jl/ | a | 113 | 10/6/2016 |
| 0Ra&8Hb& | a | 653 | 10/4/2017 |
| 4Wc-8Of* | c | 333 | 8/3/2017 |
| 8Vv+9Yo/ | c | 333 | 12/7/2016 |
| 6Qh!1Ha- | c | 333 | 3/28/2017 |
| 2Ol%4Rs# | c | 333 | 5/2/2017 |
| 1Gg#8Cm% | c | 333 | 11/15/2016 |
| 0Lw(6Pv/ | d | 873 | 8/13/2017 |
| 1Gy/7Zw, | a | 443 | 10/12/2018 |
| 2Gz,4Gp. | b | 103 | 1/5/2018 |
| 5Dj)6Wc+ | a | 893 | 12/17/2016 |
| 5Hl-8Ds! | a | 903 | 12/8/2017 |
| 8Ws$3Vy* | b | 873 | 1/13/2018 |
What I am looking to do is determine whether each transaction_id was the first time (sorted by trans_date) that the part_id was purchased from that vendor_id. I would imagine the ideal output looks like this:
| transaction_id | vendor_id | part_id | trans_date | first_time |
|:--------------:|:---------:|:-------:|:-----------------:|:----------:|
| 9Bx*2Pc' | a | 873 | 10/12/2018 | N |
| 1Po.4Ot, | a | 473 | 4/22/2016 | Y |
| 9Sk"7Kv/ | b | 123 | 7/23/2016 | Y |
| 2Lz&7Hu& | a | 873 | 12/20/2017 | Y |
| 8Lz)5Is# | b | 743 | 10/22/2016 | Y |
| 5Sc'6Jl/ | a | 113 | 10/6/2016 | Y |
| 0Ra&8Hb& | a | 653 | 10/4/2017 | Y |
| 4Wc-8Of* | c | 333 | 8/3/2017 | N |
| 8Vv+9Yo/ | c | 333 | 12/7/2016 | N |
| 6Qh!1Ha- | c | 333 | 3/28/2017 | N |
| 2Ol%4Rs# | c | 333 | 5/2/2017 | N |
| 1Gg#8Cm% | c | 333 | 11/15/2016 | Y |
| 0Lw(6Pv/ | d | 873 | 8/13/2017 | Y |
| 1Gy/7Zw, | a | 443 | 10/12/2018 | Y |
| 2Gz,4Gp. | b | 103 | 1/5/2018 | Y |
| 5Dj)6Wc+ | a | 893 | 12/17/2016 | Y |
| 5Hl-8Ds! | a | 903 | 12/8/2017 | Y |
| 8Ws$3Vy* | b | 873 | 1/13/2018 | Y |
So far, I have tried (which was influenced by this post):
WITH
first_instance AS (
SELECT
tbl_trans.*,
ROW_NUMBER() OVER (PARTITION BY vendor_id||part_id ORDER BY trans_date) AS row_nums
FROM
tbl_trans
)
SELECT
x.*,
CASE WHEN y.row_nums = 1 THEN 'Y' ELSE 'N' END AS first_time_indicator
FROM
tbl_trans x
LEFT JOIN first_instance y
But I am met with:
ORA-00905: missing keyword
I have created a SQL FIDDLE with this data and the query thus far for testing. How can I determine whether a transaction was a first-time purchase for a part/vendor combination?
Use window functions:
select t.*,
(case when row_number() over (partition by vendor_id, part_id order by trans_date) = 1
then 'Y' else 'N'
end) as first_time
from tbl_trans t;
You don't need a join.
Apart from row_number, there are other ways of achieving the desired result with analytic functions.
You can use the first_value analytic function as follows:
Select t.*,
Case
when first_value(trans_date)
over (partition by vendor_id, part_id order by trans_date) = trans_date
then 'Y'
else 'N'
end as first_time
From your_table t;
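In the same way, you can also use min. Note that both first_value and min flag every row that shares the earliest trans_date, whereas row_number flags exactly one row per vendor/part: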
Select t.*,
Case
when min(trans_date)
over (partition by vendor_id, part_id) = trans_date
then 'Y'
else 'N'
end as first_time
From your_table t;
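To try the row_number variant quickly, here is a small sketch on a few rows from the question. It assumes an in-memory SQLite database (3.25+) rather than Oracle, and it stores the dates as ISO strings so that ordering the text column is chronological; with a real DATE column no such rewrite is needed:

```python
import sqlite3

# SQLite is used only as a convenient stand-in for Oracle (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table tbl_trans ("
    " transaction_id text, vendor_id text, part_id int, trans_date text)"
)
# A subset of the question's rows; dates rewritten as YYYY-MM-DD (assumption)
# because M/D/YYYY strings would not sort chronologically.
conn.executemany(
    "insert into tbl_trans values (?, ?, ?, ?)",
    [
        ("9Bx*2Pc'", "a", 873, "2018-10-12"),
        ("2Lz&7Hu&", "a", 873, "2017-12-20"),
        ("4Wc-8Of*", "c", 333, "2017-08-03"),
        ("8Vv+9Yo/", "c", 333, "2016-12-07"),
        ("1Gg#8Cm%", "c", 333, "2016-11-15"),
        ("0Lw(6Pv/", "d", 873, "2017-08-13"),
    ],
)

# Number the rows of each vendor/part pair by date; row 1 is the first purchase.
first_time = dict(
    conn.execute(
        """
        select t.transaction_id,
               case when row_number() over (
                        partition by vendor_id, part_id order by trans_date) = 1
                    then 'Y' else 'N' end as first_time
        from tbl_trans t
        """
    ).fetchall()
)

for transaction_id, flag in first_time.items():
    print(transaction_id, flag)
```

The flags for these six rows agree with the expected output table in the question.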

Add a field in table/view

I want to calculate average values in a SQL table/view by averaging multiple values, grouped by another column.
For example, in the attached Excel sheet, I want to calculate the average density from the SG (Calc) field for rows with the same BlockID.
+-----------------------------------+-----------+--------------+
| BlockID | SG (Calc) | Ave. Density |
+-----------------------------------+-----------+--------------+
| SESS_5835_01_OXD_SAP_AL01 | 1.86 | |
| SESS_5835_01_OXD_SAP_AL01 | 1.71 | |
| SESS_5835_01_OXD_SAP_MG04 | 2.08 | |
| SESS_5835_01_OXD_SAP_MG04 | 2.14 | |
| KCD_5897.5_01_OXD_TRA_VG02 | 2.74 | |
| KCD_5897.5_01_OXD_TRA_VG02 | 2.74 | |
| KCD_5897.5_01_OXD_TRA_VG02 | 2.51 | |
| KCD_5895_01_OXD_TRA_MG06 | 3.19 | |
| KCD_5895_01_OXD_TRA_MG06 | 3.02 | |
| SESS_58932.5_01_OXD_TRA_MG05 | 2.24 | |
| SESS_58932.5_01_OXD_TRA_MG05 | 2.27 | |
+-----------------------------------+-----------+--------------+
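This returns one average per block. The column name contains a space and parentheses, so it must be quoted (double quotes shown here; the quoting style depends on your DBMS):
select a.BlockID, avg(a."SG (Calc)") as avg_density from table_name a group by a.BlockID;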
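Are you looking for window functions? They keep every row and add the per-block average alongside it (SG here stands for the SG (Calc) column):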
select a.*, avg(SG) over (partition by blockid) as avg_block_density
from t a ;
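A runnable sketch of the window version on a few of the sample rows, assuming an in-memory SQLite database (3.25+) and a table named t with the column literally called "SG (Calc)"; both names are illustrative, not taken from the asker's schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Table name t and the quoted column name are assumptions for illustration.
conn.execute('create table t (BlockID text, "SG (Calc)" real)')
conn.executemany(
    "insert into t values (?, ?)",
    [
        ("SESS_5835_01_OXD_SAP_AL01", 1.86),
        ("SESS_5835_01_OXD_SAP_AL01", 1.71),
        ("KCD_5897.5_01_OXD_TRA_VG02", 2.74),
        ("KCD_5897.5_01_OXD_TRA_VG02", 2.74),
        ("KCD_5897.5_01_OXD_TRA_VG02", 2.51),
    ],
)

# Every input row is kept; the average of its block is repeated on each row.
with_avg = conn.execute(
    """
    select BlockID, "SG (Calc)",
           avg("SG (Calc)") over (partition by BlockID) as avg_density
    from t
    order by BlockID, "SG (Calc)"
    """
).fetchall()

for row in with_avg:
    print(row)
```

Unlike GROUP BY, this keeps all five rows, which is what filling an "Ave. Density" column next to each measurement requires.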

Grouping by a column to compare values between similar rows

I'm trying to turn this
+----+---------+-------------------+-----------+
| id | year | desc | amount |
+----+---------+-------------------+-----------+
| 1 | 2017 | car | 500 |
| 2 | 2017 | car | 550 |
| 1 | 2018 | car | 490 |
| 2 | 2018 | car | 550 |
| 1 | 2017 | house | 200 |
| 2 | 2017 | house | 300 |
| 1 | 2018 | house | 210 |
| 2 | 2018 | house | 320 |
| 1 | 2019 | house | 290 |
| 2 | 2019 | house | 325 |
+----+---------+-------------------+-----------+
Into something like this
+----+---------+---------+-------------------+-----------+-----------+
| id | year_0 | year_1 | desc | amount_0 | amount_1 |
+----+---------+---------+-------------------+-----------+-----------+
| 1 | 2017 | 2018 | car | 500 | 490 |
| 2 | 2017 | 2018 | car | 550 | 550 |
| 1 | 2017 | 2018 | house | 200 | 210 |
| 2 | 2017 | 2018 | house | 300 | 320 |
+----+---------+---------+-------------------+-----------+-----------+
But I'm having difficulty getting the two years and two amounts to group by description.
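You can achieve the result with a self-join that pairs last year's rows with the current year's rows (so it matches the expected output above only while the current year is 2018):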
SELECT A.id,a.year year_0,b.year year_1, A.[desc], A.amount amount_0,B.amount amount_1
FROM
(SELECT * FROM YourTable WHERE Year= Datepart(year,GETDATE())-1) AS A
INNER JOIN
(SELECT * FROM YourTable WHERE Year= Datepart(year,GETDATE())) AS B
ON A.id=B.id AND A.[desc]=B.[desc]
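If the result should not depend on the current date, one possible variant joins each row to the row for the same id and desc one year later and takes the base year as a parameter. This is a sketch of that variant, not the query above: it assumes an in-memory SQLite database, and base_year is a hypothetical parameter name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "desc" is a reserved word, so it is quoted (SQLite/ANSI style here,
# where SQL Server would use [desc]).
conn.execute('create table YourTable (id int, year int, "desc" text, amount int)')
conn.executemany(
    "insert into YourTable values (?, ?, ?, ?)",
    [
        (1, 2017, "car", 500), (2, 2017, "car", 550),
        (1, 2018, "car", 490), (2, 2018, "car", 550),
        (1, 2017, "house", 200), (2, 2017, "house", 300),
        (1, 2018, "house", 210), (2, 2018, "house", 320),
        (1, 2019, "house", 290), (2, 2019, "house", 325),
    ],
)

base_year = 2017  # hypothetical parameter: the year shown as year_0

# Pair every base-year row with the row for the same id/desc one year later.
pairs = conn.execute(
    """
    select a.id, a.year as year_0, b.year as year_1, a."desc",
           a.amount as amount_0, b.amount as amount_1
    from YourTable a
    join YourTable b
      on a.id = b.id and a."desc" = b."desc" and b.year = a.year + 1
    where a.year = ?
    order by a."desc", a.id
    """,
    (base_year,),
).fetchall()

for row in pairs:
    print(row)
```

With base_year set to 2017 this yields the four rows of the desired table; the 2019 rows are ignored because nothing pairs them with 2017.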

How to check dates condition from one table to another in SQL

How can we check and compare dates from one table against another?
Table: inc
+--------+---------+-----------+-----------+-------------+
| inc_id | cust_id | item_id | serv_time | inc_date |
+--------+---------+-----------+-----------+-------------+
| 1 | john | HP | 40 | 17-Apr-2015 |
| 2 | John | HP | 60 | 10-Jan-2016 |
| 3 | Nick | Cisco | 120 | 11-Jan-2016 |
| 4 | samanta | EMC | 180 | 12-Jan-2016 |
| 5 | Kerlee | Oracle | 40 | 13-Jan-2016 |
| 6 | Amir | Microsoft | 300 | 14-Jan-2016 |
| 7 | John | HP | 120 | 15-Jan-2016 |
| 8 | samanta | EMC | 20 | 16-Jan-2016 |
| 9 | Kerlee | Oracle | 10 | 2-Feb-2017 |
+--------+---------+-----------+-----------+-------------+
Table: contract
+-----------+---------+----------+------------+
| item_id | con_id | Start | End |
+-----------+---------+----------+------------+
| Dell | DE2015 | 1/1/2015 | 12/31/2015 |
| HP | HP2015 | 1/1/2015 | 12/31/2015 |
| Cisco | CIS2016 | 1/1/2016 | 12/31/2016 |
| EMC | EMC2016 | 1/1/2016 | 12/31/2016 |
| HP | HP2016 | 1/1/2016 | 12/31/2016 |
| Oracle | OR2016 | 1/1/2016 | 12/31/2016 |
| Microsoft | MS2016 | 1/1/2016 | 12/31/2016 |
| Microsoft | MS2017 | 1/1/2017 | 12/31/2017 |
+-----------+---------+----------+------------+
Result:
+-------+---------+---------+--------------+
| Calls | Cust_id | Con_id | Tot_Ser_Time |
+-------+---------+---------+--------------+
| 2 | John | HP2016 | 180 |
| 2 | samanta | EMC2016 | 200 |
| 1 | Nick | CIS2016 | 120 |
| 1 | Amir | MS2016 | 300 |
| 1 | Kerlee | OR2016 | 40 |
+-------+---------+---------+--------------+
My query:
select count(inc_id) as Calls, inc.cust_id, contract.con_id,
sum(inc.serv_time) as tot_serv_time
from inc inner join contract on inc.item_id = contract.item_id
where inc.inc_date between '2016-01-01' and '2016-12-31'
group by inc.cust_id, contract.con_id
The result should take the rows of the inc table dated between 1-Jan-2016 and 31-Dec-2016 and count inc_id per item, based on each item's contract start and end dates.
If I understand your problem correctly, this query will return the desired result:
select
count(*) as Calls,
inc.cust_id,
contract.con_id,
sum(inc.serv_time) as tot_serv_time
from
inc inner join contract
on inc.item_id = contract.item_id
and inc.inc_date between contract.start and contract.end
where
inc.inc_date between '2016-01-01' and '2016-12-31'
group by
inc.cust_id,
contract.con_id
The question is a little vague, so you might need to make some adjustments to this query.
select
Calls = count(*)
, Cust = i.Cust_id
, Contract = c.con_id
, Serv_Time = sum(Serv_Time)
from inc as i
inner join contract as c
on i.item_id = c.item_id
and i.inc_date >= c.[start]
and i.inc_date <= c.[end]
where c.[start]>='20160101'
group by i.Cust_id, c.con_id
order by i.Cust_Id, c.con_id
returns:
+-------+---------+----------+-----------+
| Calls | Cust | Contract | Serv_Time |
+-------+---------+----------+-----------+
| 1 | Amir | MS2016 | 300 |
| 2 | John | HP2016 | 180 |
| 1 | Kerlee | OR2016 | 40 |
| 1 | Nick | CIS2016 | 120 |
| 2 | samanta | EMC2016 | 200 |
+-------+---------+----------+-----------+
test setup: http://rextester.com/WSYDL43321
create table inc(
inc_id int
, cust_id varchar(16)
, item_id varchar(16)
, serv_time int
, inc_date date
);
insert into inc values
(1,'john','HP', 40 ,'17-Apr-2015')
,(2,'John','HP', 60 ,'10-Jan-2016')
,(3,'Nick','Cisco', 120 ,'11-Jan-2016')
,(4,'samanta','EMC', 180 ,'12-Jan-2016')
,(5,'Kerlee','Oracle', 40 ,'13-Jan-2016')
,(6,'Amir','Microsoft', 300 ,'14-Jan-2016')
,(7,'John','HP', 120 ,'15-Jan-2016')
,(8,'samanta','EMC', 20 ,'16-Jan-2016')
,(9,'Kerlee','Oracle', 10 ,'02-Feb-2017');
create table contract (
item_id varchar(16)
, con_id varchar(16)
, [Start] date
, [End] date
);
insert into contract values
('Dell','DE2015','20150101','20151231')
,('HP','HP2015','20150101','20151231')
,('Cisco','CIS2016','20160101','20161231')
,('EMC','EMC2016','20160101','20161231')
,('HP','HP2016','20160101','20161231')
,('Oracle','OR2016','20160101','20161231')
,('Microsoft','MS2016','20160101','20161231')
,('Microsoft','MS2017','20170101','20171231');
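For readers without SQL Server at hand, the same date-range join can be tried with the sketch below. It assumes an in-memory SQLite database, stores the dates as ISO strings so that BETWEEN compares them chronologically, and quotes the reserved word "end" in SQLite/ANSI style:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table inc (inc_id int, cust_id text, item_id text,"
    " serv_time int, inc_date text)"
)
conn.execute(
    'create table contract (item_id text, con_id text, "start" text, "end" text)'
)
# Same rows as the test setup above, with dates rewritten as YYYY-MM-DD (assumption).
conn.executemany(
    "insert into inc values (?, ?, ?, ?, ?)",
    [
        (1, "john", "HP", 40, "2015-04-17"),
        (2, "John", "HP", 60, "2016-01-10"),
        (3, "Nick", "Cisco", 120, "2016-01-11"),
        (4, "samanta", "EMC", 180, "2016-01-12"),
        (5, "Kerlee", "Oracle", 40, "2016-01-13"),
        (6, "Amir", "Microsoft", 300, "2016-01-14"),
        (7, "John", "HP", 120, "2016-01-15"),
        (8, "samanta", "EMC", 20, "2016-01-16"),
        (9, "Kerlee", "Oracle", 10, "2017-02-02"),
    ],
)
conn.executemany(
    "insert into contract values (?, ?, ?, ?)",
    [
        ("Dell", "DE2015", "2015-01-01", "2015-12-31"),
        ("HP", "HP2015", "2015-01-01", "2015-12-31"),
        ("Cisco", "CIS2016", "2016-01-01", "2016-12-31"),
        ("EMC", "EMC2016", "2016-01-01", "2016-12-31"),
        ("HP", "HP2016", "2016-01-01", "2016-12-31"),
        ("Oracle", "OR2016", "2016-01-01", "2016-12-31"),
        ("Microsoft", "MS2016", "2016-01-01", "2016-12-31"),
        ("Microsoft", "MS2017", "2017-01-01", "2017-12-31"),
    ],
)

# Matching on item_id alone would pair each HP incident with both HP contracts;
# the date condition keeps only the contract that covers the incident date.
result = conn.execute(
    """
    select count(*) as calls, i.cust_id, c.con_id, sum(i.serv_time) as tot_serv_time
    from inc i
    join contract c
      on i.item_id = c.item_id
     and i.inc_date between c."start" and c."end"
    where i.inc_date between '2016-01-01' and '2016-12-31'
    group by i.cust_id, c.con_id
    order by i.cust_id, c.con_id
    """
).fetchall()

for row in result:
    print(row)
```

The date condition in the ON clause is the part missing from the query in the question: without it, John's 2016 incidents are also counted against HP2015 and Amir's against MS2017.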

PDI Kettle - How to Normalize Advanced Structure?

I have 7 columns of data in a MySQL Database. The Year1 column belongs to the Revenue1 column. The following columns have the same structure. I know how to handle this in SQL, but not in PDI. Can anyone describe how to do it?
MySQL table structure:
+--------+-------+-------+-------+----------+----------+----------+
| Ticker | Year1 | Year2 | Year3 | Revenue1 | Revenue2 | Revenue3 |
+--------+-------+-------+-------+----------+----------+----------+
| | | | | | | |
| ABC | 2010 | 2011 | 2012 | 250000 | 500000 | 1000000 |
+--------+-------+-------+-------+----------+----------+----------+
Desired normalized output from PDI:
+------------+------+-----------+---------+
| Ticker | Year | Keyfigure | Value |
+------------+------+-----------+---------+
| | | | |
| ABC | 2010 | Revenue | 250000 |
| | | | |
| ABC | 2011 | Revenue | 500000 |
| | | | |
| ABC | 2012 | Revenue | 1000000 |
+------------+------+-----------+---------+
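Have you tried the Row Normaliser step? It turns columns into rows, which is the direction needed here; the Row Denormaliser does the opposite.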