Fill nonexistent data with Hive/Pig - hive

I have a hive table with the following structure:
id1, id2, year, value
1, 1, 2000, 20
1, 1, 2002, 23
1, 1, 2003, 24
1, 2, 1999, 34
1, 2, 2000, 35
1, 2, 2001, 37
2, 3, 2005, 50
2, 3, 2006, 56
2, 3, 2008, 60
I have 2 ids which identify the 'user', and for each user and year I have a value, but years with no value do not appear in the table. For each id pair [id1,id2] I would like to add a row for every year between that pair's minimum and maximum year, using the previous year's value when a year does not exist. So the table should become:
id1, id2, year, value
1, 1, 2000, 20
1, 1, 2001, 20
1, 1, 2002, 23
1, 1, 2003, 24
1, 2, 1999, 34
1, 2, 2000, 35
1, 2, 2001, 37
2, 3, 2005, 50
2, 3, 2006, 56
2, 3, 2007, 56
2, 3, 2008, 60
I need to do this in Hive or Pig; in the worst case I could go with Spark.
Thanks,

This is best achieved if years can be stored as a table.
create table dbname.years
location 'hdfs_location' as
select 2000 as yr union all select 2001 as yr --include as many years as possible
1) With this table in place, the ids can be cross joined with it to generate every id/year combination, and the result left joined to the original table. Note that this yields a row for every year in the years table, including years outside an id pair's own min/max range.
2) Then classify rows into groups, so that a null value from the previous step (a year missing from the original table for those ids) is assigned the same group as the previous non-null value. This is accomplished with a running sum. Run the sub-query on its own to see how groups are assigned.
3) Thereafter, select the max value for each id1, id2, group combination.
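-- tbl columns follow the question: id1, id2, year, value
select id1, id2, yr, max(val) over (partition by id1, id2, grp) as val
from (select i.id1, i.id2, y.yr, t.value as val
           -- running count of non-null values: a missing year keeps the group of the last known year
           , sum(case when t.value is null then 0 else 1 end)
               over (partition by i.id1, i.id2 order by y.yr) as grp
      from (select distinct id1, id2 from tbl) i
      cross join (select yr from years) y
      left join tbl t on i.id1 = t.id1 and i.id2 = t.id2 and y.yr = t.year
     ) t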
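The grouping in steps 2 and 3 can be sketched outside Hive. The Python below is only an illustration, assuming the rows of one (id1, id2) pair are already sorted by year and that years missing from the original table carry None; the function name fill_by_groups is made up for this example.

```python
def fill_by_groups(rows):
    """rows: list of (year, value_or_None) for ONE (id1, id2) pair, sorted by year.

    Mirrors the query: a running count of non-null values is the group id,
    and every row takes the max (the only non-null) value of its group.
    """
    grp = 0
    grouped = []
    for year, value in rows:
        if value is not None:
            grp += 1  # same as sum(case when val is null then 0 else 1 end)
        grouped.append((year, value, grp))

    # max(val) over (partition by grp): nulls are ignored, like SQL max()
    group_max = {}
    for _, value, g in grouped:
        if value is not None:
            group_max[g] = max(value, group_max.get(g, value))

    return [(year, group_max.get(g), g) for year, _, g in grouped]


# Years 1999..2003 cross joined to id pair (1, 1), which only has 2000, 2002, 2003
sample = [(1999, None), (2000, 20), (2001, None), (2002, 23), (2003, 24)]
filled = fill_by_groups(sample)
for row in filled:
    print(row)
```

The 1999 row lands in group 0 and stays empty, which shows why years before a pair's first year may need to be filtered out when the years table is wider than that pair's range.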

I would do this using a temporary table. The year range varies per id1, id2, so I generate a series of years per id1, id2 instead of one series of years for all ids.
1) Get the min year and max year per id1, id2 and expand them into a series of years. Call this the series_dtes table.
2) Left join it to the table at hand (I call it cal_date).
3) Create a temp table from the joined series_dtes and cal_date tables. This adds rows for the missing years per id1, id2 (here 2001 and 2007), with a null value.
4) Fill in the missing values for 2001 and 2007 using the lag function. lag looks only one row back, so this covers single-year gaps like the ones in the sample; a longer gap would need the last non-null value instead.
create table tmp as
with series_dtes as (
select id1, id2, (t.min_dt+pe.idx) as series_year
from (select id1, id2, min(year) as min_dt, max(year) as max_dt from cal_date group by id1, id2) t
lateral view posexplode(split(space(t.max_dt-t.min_dt),' ')) pe as idx, dte)
select dte.id1, dte.id2, dte.series_year, t.value
from series_dtes dte
left join cal_date t
on dte.series_year=t.year and t.id1=dte.id1 and t.id2=dte.id2
order by dte.id1, dte.id2, dte.series_year;
select id1, id2, series_year as year,
(case when value is null then (lag(value) over (partition by id1,id2 order by series_year)) else value end) as value
from tmp;
Result:
id1 id2 year value
1 1 2000 20
1 1 2001 20
1 1 2002 23
1 1 2003 24
1 2 1999 34
1 2 2000 35
1 2 2001 37
2 3 2005 50
2 3 2006 56
2 3 2007 56
2 3 2008 60
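As a cross-check of the expected output, here is a small Python sketch of the same idea: build the year series per (id1, id2) from that pair's own min and max year, then carry the last seen value forward. It is an illustration only; fill_years is a hypothetical helper, and unlike a one-row lag it also fills gaps longer than one year.

```python
from collections import defaultdict


def fill_years(rows):
    """rows: iterable of (id1, id2, year, value). Returns filled, sorted rows."""
    by_key = defaultdict(dict)
    for id1, id2, year, value in rows:
        by_key[(id1, id2)][year] = value

    out = []
    for (id1, id2), values in sorted(by_key.items()):
        last = None
        # every year between this key's own min and max year
        for year in range(min(values), max(values) + 1):
            if year in values:
                last = values[year]  # real row: remember its value
            # appended for real and missing years alike; gaps reuse `last`
            out.append((id1, id2, year, last))
    return out


rows = [
    (1, 1, 2000, 20), (1, 1, 2002, 23), (1, 1, 2003, 24),
    (1, 2, 1999, 34), (1, 2, 2000, 35), (1, 2, 2001, 37),
    (2, 3, 2005, 50), (2, 3, 2006, 56), (2, 3, 2008, 60),
]
result = fill_years(rows)

# A gap of two years (2001 and 2002 both missing) is still filled
wide_gap = fill_years([(9, 9, 2000, 5), (9, 9, 2003, 7)])
```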

Related

Oracle SQL How to select records that have the latest date [duplicate]

For each id, I need to extract the one record (with its Y value) whose DATE column holds the latest date.
Example:
id Y DATE
a 1 2020
a 2 2021
a 2 2022
b 1 1999
b 1 2015
c 3 2001
c 3 2004
c 7 2010
One option is to rank rows per each id sorted by years in descending order, and then fetch the ones that ranked as the highest.
Sample data:
with
test (id, y, datum) as
  (select 'a', 1, 2020 from dual union all
   select 'a', 2, 2021 from dual union all
   select 'a', 2, 2022 from dual union all
   select 'b', 1, 1999 from dual union all
   select 'b', 1, 2015 from dual union all
   select 'c', 3, 2001 from dual union all
   select 'c', 3, 2004 from dual union all
   select 'c', 7, 2010 from dual
  ),
Query:
temp as
  (select id, y, datum,
          rank() over (partition by id order by datum desc) rnk
   from test
  )
select id, y, datum
from temp
where rnk = 1;
Result:
ID Y DATUM
-- ---------- ----------
a 2 2022
b 1 2015
c 7 2010
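For readers who want to check the logic outside the database, the ranking step can be sketched in Python. This is only an illustration with a made-up helper name, latest_per_id. Like rank(), it keeps every row tied for the latest date within an id; row_number() would keep exactly one.

```python
def latest_per_id(rows):
    """rows: list of (id, y, datum). Keeps the row(s) with the max datum per id."""
    max_datum = {}
    for id_, _, datum in rows:
        if id_ not in max_datum or datum > max_datum[id_]:
            max_datum[id_] = datum
    # equivalent of "where rnk = 1": rows whose datum equals their id's maximum
    return [r for r in rows if r[2] == max_datum[r[0]]]


test = [
    ("a", 1, 2020), ("a", 2, 2021), ("a", 2, 2022),
    ("b", 1, 1999), ("b", 1, 2015),
    ("c", 3, 2001), ("c", 3, 2004), ("c", 7, 2010),
]
latest = latest_per_id(test)
```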

Need to list string from tbl 1 in tbl 2 randomly - Big query sql

I have two tables. I need to fill the User column of table 2 with randomly chosen Name values from table 1, using BigQuery SQL. Can someone help me please?
Table 1
Id Name
1 Tom
2 Jack
3 Harry
Table 2 (the User column is empty)
Month Year User
Jan 2023
Feb 2023
Mar 2023
Apr 2023
May 2033
First generate both tables tbl1 and tbl2. Then add a row_number as id_ok to tbl1 (giving tbl1_). In the helper table we extract the maximum row number of tbl1_.
The expression floor((max_.A)*rand())+1 as id_ok picks the row: rand() generates a number between 0 and 1 (1 excluded). Multiplying it by the row count of tbl1 (max_.A) and rounding down with floor gives an integer between 0 and row count - 1, so we add 1 to get a valid id_ok. In the last step we left join the helper table to tbl1_ on id_ok.
with tbl1 as (select * from unnest(split("Tom Jack Harry"," ")) name),
tbl2 as (select * from unnest(split("Jan Feb Mar Apr May"," ")) month),
tbl1_ as (select row_number() over () as id_ok, * from tbl1),
helper as (
Select tbl2.*, floor((max_.A)*rand())+1 as id_ok
from tbl2,((Select max(id_ok) A from tbl1_)) max_
)
Select *
from helper
left join tbl1_
using(id_ok)
Try this approach below:
with table_1 as (
select 1 as id, 'Tom' as name
union all select 2 as id, 'Jack' as name
union all select 3 as id, 'Harry' as name
),
table_2 as (
select 'Jan' as month, 2023 as year
union all select 'Feb' as month, 2023 as year
union all select 'Mar' as month, 2023 as year
union all select 'Apr' as month, 2023 as year
union all select 'May' as month, 2023 as year
),
add_t2_id as (
select
month,
year,
cast(ROUND(1 + RAND() * (3 - 1)) as int64) as rand_val
from table_2
)
select
t2.month,
t2.year,
t1.name as user
from add_t2_id t2
inner join table_1 t1
on t1.id=t2.rand_val
I created a CTE named add_t2_id that pairs each table_2 row with a random number, cast(ROUND(1 + RAND() * (3 - 1)) as int64) as rand_val, which yields an integer from 1 to 3. The final query joins add_t2_id to table_1, matching the generated random number to the id in table_1. Note that because of the rounding, 2 comes up about twice as often as 1 or 3; use FLOOR(RAND() * 3) + 1 if the three names should be equally likely.
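The difference between the floor-based and round-based random id can be checked with a short Python sketch. It is an illustration only: pick_floor and pick_round are made-up names, an evenly spaced grid stands in for RAND(), and pick_round assumes halves are rounded away from zero, as BigQuery's ROUND does.

```python
import math
from collections import Counter


def pick_floor(u, n):
    """FLOOR(n * u) + 1 for u in [0, 1): each id 1..n covers an equal slice."""
    return math.floor(n * u) + 1


def pick_round(u, n):
    """ROUND(1 + u * (n - 1)) for positive inputs, halves rounded up."""
    return math.floor(1 + u * (n - 1) + 0.5)


# Evenly spaced stand-ins for RAND(), so the counts are exact rather than sampled
grid = [(k + 0.5) / 1200 for k in range(1200)]
floor_counts = Counter(pick_floor(u, 3) for u in grid)
round_counts = Counter(pick_round(u, 3) for u in grid)

print(sorted(floor_counts.items()))  # equal share for ids 1, 2, 3
print(sorted(round_counts.items()))  # id 2 gets twice the share of 1 or 3
```

With ROUND, the end values 1 and 3 each cover only half a unit of the range, while 2 covers a full unit, hence the skew.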

How to fetch data from another row in sql without joining the table with itself

I have a table :
Date ID1 ID2 ID3 Data1 Data2
JAN-17 1 7 1 2 3
JAN-17 1 7 2 3 4
Feb-17 1 7 1 3 4
MAR-17 1 7 1 2 3
JAN-17 2 8 1 4 1
FEB-17 2 7 1 1 2
MAR-17 2 7 2 1 2
The composite key for the table is Date+ID1+ID2+ID3.
The output should be :
Month ID1 ID2 ID3 Data
Jan-17 1 7 1 Data2(Jan)+Data2(Feb)+Data2(Mar) (3+4+3)
Feb-17 1 7 1 Data1(Jan)+Data2(Feb)+Data2(Mar) (2+4+3)
Mar-17 1 7 1 Data1(Jan)+Data1(Feb)+Data2(Mar) (2+3+3)
The quarter starts in Jan. If the month is the first month of the quarter, the output should be: Data2 for the first month + Data2 for the next 2 months (as in the Jan-17 row above).
If the month is the 2nd month of the quarter, the output should be: Data1 from the previous (1st) month + Data2 for the 2nd month + Data2 for the 3rd month.
If the month is the 3rd month of the quarter, the output should be: Data1 from the 1st month + Data1 from the 2nd month + Data2 for the 3rd month.
I am using an Oracle database.
Can someone help?
As is very often the case, analytic functions can help. In this case, you need to take advantage of the windowing capabilities of analytic functions. That is: the ability to specify exactly which range of rows you want to aggregate using the ROWS BETWEEN syntax in the windowing clause.
In your case, you want to PARTITION BY the quarter, SUM(data1) for the rows in the quarter that are before the current row, and SUM(data2) for the rows in the quarter that are the current month or greater. Then, add those two sums together.
(I am assuming the "+" in your question means a sum and not a string concatenation like in your sample results).
Like this:
with d ( month, id1, id2, id3, data1, data2 ) AS (
SELECT to_date('JAN-17','MON-YY'),1, 7, 1, 2, 3 FROM DUAL UNION ALL
SELECT to_date('JAN-17','MON-YY'),1, 7, 2, 3, 4 FROM DUAL UNION ALL
SELECT to_date('Feb-17','MON-YY'),1, 7, 1, 3, 4 FROM DUAL UNION ALL
SELECT to_date('MAR-17','MON-YY'),1, 7, 1, 2, 3 FROM DUAL UNION ALL
SELECT to_date('JAN-17','MON-YY'),2, 8, 1, 4, 1 FROM DUAL UNION ALL
SELECT to_date('FEB-17','MON-YY'),2, 7, 1, 1, 2 FROM DUAL UNION ALL
SELECT to_date('MAR-17','MON-YY'),2, 7, 2, 1, 2 FROM DUAL UNION ALL
SELECT to_date('APR-17','MON-YY'),2, 8, 1, 4, 1 FROM DUAL UNION ALL
SELECT to_date('MAY-17','MON-YY'),2, 7, 1, 1, 2 FROM DUAL UNION ALL
SELECT to_date('JUN-17','MON-YY'),2, 7, 2, 1, 2 FROM DUAL)
SELECT month, id1, id2, id3,
nvl(sum(data1) over (
partition by to_char(month,'Q-YYYY'), id1, id2, id3
order by month
rows between unbounded preceding and 1 preceding),0)
+
nvl(sum(data2) over (
partition by to_char(month,'Q-YYYY'), id1, id2, id3
order by month
rows between current row and unbounded following),0) result
FROM d
order by id1, id2, id3, month
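To see what the two window frames compute, here is a small Python sketch for a single partition (one quarter and one id1, id2, id3 key). It is an illustration only; quarter_results is a made-up name, and the rows are assumed to be in calendar order already.

```python
def quarter_results(months):
    """months: list of (label, data1, data2) for ONE quarter and one
    (id1, id2, id3) key, already in calendar order."""
    results = []
    for i, (label, _, _) in enumerate(months):
        # rows between unbounded preceding and 1 preceding -> data1 of earlier months
        before = sum(d1 for _, d1, _ in months[:i])
        # rows between current row and unbounded following -> data2 from this month on
        current_on = sum(d2 for _, _, d2 in months[i:])
        results.append((label, before + current_on))
    return results


# The (1, 7, 1) rows from the question
q1 = [("JAN-17", 2, 3), ("FEB-17", 3, 4), ("MAR-17", 2, 3)]
totals = quarter_results(q1)
```

The totals correspond to the sums spelled out in the question: 3+4+3, 2+4+3 and 2+3+3.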

SQL : Subquery without FROM clause

This is the table structure
CREATE TABLE Book_Tag (id INT, Book_Id INT, tag varchar(20))
CREATE TABLE Book_Master (Book_Id INT, Book_title VARCHAR(50), price INT)
And the data looks like this :
INSERT INTO Book_Master
SELECT 1, 'Good Profit', 28 UNION ALL
SELECT 2, 'The Secret', 20 UNION ALL
SELECT 3, 'The One Minute Manager', 9 UNION ALL
SELECT 4, 'The 7 Habits of Highly Effective People', 35 UNION ALL
SELECT null, 'Who Moved My Cheese?', 15 UNION ALL
SELECT null, 'Blink: The Power of Thinking Without Thinking', 40
INSERT INTO Book_Tag
SELECT 1, 1, 'Management' UNION ALL
SELECT 2, 1, 'Profit' UNION ALL
SELECT 3, 2, 'Mind' UNION ALL
SELECT 4, 3, 'Management' UNION ALL
SELECT 5, 3, 'Efficiency' UNION ALL
SELECT 6, 3, 'Success' UNION ALL
SELECT 7, 4, 'Success' UNION ALL
SELECT 8, null, 'Time' UNION ALL
SELECT 9, 6, 'SelfHelp' UNION ALL
SELECT 10, 6, 'Motivation' UNION ALL
SELECT 11, 8, 'Mind'
select * from Book_Master
Book_Id Book_title price
1 Good Profit 28
2 The Secret 20
3 The One Minute Manager 9
4 The 7 Habits of Highly Effective People 35
NULL Who Moved My Cheese? 15
NULL Blink: The Power of Thinking Without Thinking 40
select * from Book_Tag
id Book_Id tag
1 1 Management
2 1 Profit
3 2 Mind
4 3 Management
5 3 Efficiency
6 3 Success
7 4 Success
8 NULL Time
9 6 SelfHelp
10 6 Motivation
11 8 Mind
I don't know why the following works, nor why it returns this result.
select BT.* from Book_Tag BT
where BT.Book_Id in (select id)
id Book_Id tag
1 1 Management
or this
select BT.* from Book_Tag BT
where BT.Book_Id not in (select id)
id Book_Id tag
2 1 Profit
3 2 Mind
4 3 Management
5 3 Efficiency
6 3 Success
7 4 Success
9 6 SelfHelp
10 6 Motivation
11 8 Mind
Your first query:
select BT.*
from Book_Tag BT
where BT.Book_Id in (select id);
It is the same as:
select BT.*
from Book_Tag BT
where BT.Book_Id = BT.id;
That is why you get
1 1 Management
Keep in mind that NULL is not equal to NULL or to anything else.
In the second example you have:
select BT.*
from Book_Tag BT
where BT.Book_Id not in (select id);
Which is the same as:
select BT.*
from Book_Tag BT
where BT.Book_Id <> BT.id;
Note that the row
8 NULL Time
is missing from both results, because a comparison with NULL is never true.
EDIT:
"But in the subquery, shouldn't we be specifying the table that id comes from?"
From MSDN:
The general rule is that column names in a statement are implicitly qualified by the table referenced in the FROM clause at the same level. If a column does not exist in the table referenced in the FROM clause of a subquery, it is implicitly qualified by the table referenced in the FROM clause of the outer query.
and
If a column is referenced in a subquery that does not exist in the table referenced by the subquery's FROM clause, but exists in a table referenced by the outer query's FROM clause, the query executes without error. SQL Server implicitly qualifies the column in the subquery with the table name in the outer query.
Your condition "where BT.Book_Id in (select id)" effectively says where Book_Id = id, so it returns the one row where Book_Id and id are both 1. The NOT IN version returns the rows where they are not equal, which is every other row of the table except the one whose Book_Id is NULL.
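The NULL behaviour is easier to see with the three-valued logic written out. The Python sketch below is only a model of it (sql_eq and sql_not are made-up helpers, with None standing in for NULL and for "unknown"); it evaluates the per-row comparison that the correlated subquery reduces to.

```python
def sql_eq(a, b):
    """SQL '=': unknown (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b


def sql_not(x):
    """SQL NOT: unknown stays unknown."""
    return None if x is None else not x


# (id, Book_Id, tag) rows of Book_Tag
book_tag = [
    (1, 1, "Management"), (2, 1, "Profit"), (3, 2, "Mind"),
    (4, 3, "Management"), (5, 3, "Efficiency"), (6, 3, "Success"),
    (7, 4, "Success"), (8, None, "Time"), (9, 6, "SelfHelp"),
    (10, 6, "Motivation"), (11, 8, "Mind"),
]

# WHERE keeps a row only when the predicate is TRUE, not when it is unknown.
# "Book_Id in (select id)" is evaluated per row against that row's own id.
in_rows = [r for r in book_tag if sql_eq(r[1], r[0]) is True]
not_in_rows = [r for r in book_tag if sql_not(sql_eq(r[1], r[0])) is True]
```

Row 8 appears in neither list: its comparison is unknown, and negating unknown is still unknown.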

Oracle SQL (Toad): Expand table

Suppose I have an SQL (Oracle Toad) table named "test", which has the following fields and entries (dates are in dd/mm/yyyy format):
id ref_date value
---------------------
1 01/01/2014 20
1 01/02/2014 25
1 01/06/2014 3
1 01/09/2014 6
2 01/04/2015 7
2 01/08/2015 43
2 01/09/2015 85
2 01/12/2015 4
I know from how the table has been created that, since there are value entries for id = 1 for February 2014 and June 2014, the values for March through May 2014 must be 0. The same applies to July and August 2014 for id = 1, and for May through July 2015 and October through November 2015 for id = 2.
Now, if I want to calculate, say, the median of the value column for a given id, I will not arrive at the correct result using the table as it stands - as I'm missing 5 zero entries for each id.
I would therefore like to create/use the following (potentially just temporary) table...
id ref_date value
---------------------
1 01/01/2014 20
1 01/02/2014 25
1 01/03/2014 0
1 01/04/2014 0
1 01/05/2014 0
1 01/06/2014 3
1 01/07/2014 0
1 01/08/2014 0
1 01/09/2014 6
2 01/04/2015 7
2 01/05/2015 0
2 01/06/2015 0
2 01/07/2015 0
2 01/08/2015 43
2 01/09/2015 85
2 01/10/2015 0
2 01/11/2015 0
2 01/12/2015 4
...on which I could then compute the median by id:
select id, median(value) as med_value from test group by id
How do I do this? Or would there be an alternative way?
Many thanks,
Mr Clueless
In this solution, I build a table with all the "needed dates" and value of 0 for all of them. Then, instead of a join, I do a union all, group by id and ref_date and ADD the values in each group. If the date had a row with a value in the original table, then that's the resulting value; and if it didn't, the value will be 0. This avoids a join. In almost all cases a union all + aggregate will be faster (sometimes much faster) than a join.
I added more input data for more thorough testing. In your original question, you have two id's, and for both of them you have four positive values. You are missing five values in each case, so there will be five zeros (0) which means the median is 0 in both cases. For id=3 (which I added) I have three positive values and three zeros; the median is half of the smallest positive number. For id=4 I have just one value, which then should be the median as well.
The solution includes, in particular, an answer to your specific question: how to create the temporary table (which most likely doesn't need to be a temporary table at all, but an inline view). With factored subqueries (in the WITH clause), the optimizer decides whether to treat them as temporary tables or inline views; you can see what the optimizer decided if you look at the Explain Plan.
with
inputs ( id, ref_date, value ) as (
select 1, to_date('01/01/2014', 'dd/mm/yyyy'), 20 from dual union all
select 1, to_date('01/02/2014', 'dd/mm/yyyy'), 25 from dual union all
select 1, to_date('01/06/2014', 'dd/mm/yyyy'), 3 from dual union all
select 1, to_date('01/09/2014', 'dd/mm/yyyy'), 6 from dual union all
select 2, to_date('01/04/2015', 'dd/mm/yyyy'), 7 from dual union all
select 2, to_date('01/08/2015', 'dd/mm/yyyy'), 43 from dual union all
select 2, to_date('01/09/2015', 'dd/mm/yyyy'), 85 from dual union all
select 2, to_date('01/12/2015', 'dd/mm/yyyy'), 4 from dual union all
select 3, to_date('01/01/2016', 'dd/mm/yyyy'), 12 from dual union all
select 3, to_date('01/03/2016', 'dd/mm/yyyy'), 23 from dual union all
select 3, to_date('01/06/2016', 'dd/mm/yyyy'), 2 from dual union all
select 4, to_date('01/11/2014', 'dd/mm/yyyy'), 9 from dual
),
-- the "inputs" table constructed above is for testing only,
-- it is not part of the solution.
ranges ( id, min_date, max_date ) as (
select id, min(ref_date), max(ref_date)
from inputs
group by id
),
prep ( id, ref_date, value ) as (
select id, add_months(min_date, level - 1), 0
from ranges
connect by level <= 1 + months_between( max_date, min_date )
and prior id = id
and prior sys_guid() is not null
),
v ( id, ref_date, value ) as (
select id, ref_date, sum(value)
from ( select id, ref_date, value from prep union all
select id, ref_date, value from inputs
)
group by id, ref_date
)
select id, median(value) as median_value
from v
group by id
order by id -- ORDER BY is optional
;
ID MEDIAN_VALUE
-- ------------
1 0
2 0
3 1
4 9
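The medians above can be double-checked with a few lines of Python. This is a sketch for verification only, not part of the SQL solution; median_with_gaps is a made-up name, and dates are simplified to (year, month) pairs since every ref_date is the first of a month.

```python
from statistics import median


def median_with_gaps(rows):
    """rows: iterable of (id, (year, month), value). Months missing between an
    id's first and last month count as 0."""
    by_id = {}
    for id_, (year, month), value in rows:
        # month index: consecutive months differ by exactly 1
        by_id.setdefault(id_, {})[year * 12 + (month - 1)] = value

    medians = {}
    for id_, values in by_id.items():
        first, last = min(values), max(values)
        filled = [values.get(m, 0) for m in range(first, last + 1)]
        medians[id_] = median(filled)
    return medians


inputs = [
    (1, (2014, 1), 20), (1, (2014, 2), 25), (1, (2014, 6), 3), (1, (2014, 9), 6),
    (2, (2015, 4), 7), (2, (2015, 8), 43), (2, (2015, 9), 85), (2, (2015, 12), 4),
    (3, (2016, 1), 12), (3, (2016, 3), 23), (3, (2016, 6), 2),
    (4, (2014, 11), 9),
]
medians = median_with_gaps(inputs)
```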
A second approach, assuming ref_date is of type DATE:
with int1 as (select id
, max(ref_date) as max_date
, min(ref_date) as min_date from test group by id )
-- + 1 so that n runs from 0 through the longest month span, last month included
, s(n) as (select level -1 from dual connect by level <= (select max(months_between(max_date, min_date)) + 1 from int1 ) )
select i.id
, add_months(i.min_date,s.n) as ref_date
, nvl(value,0) as value
from int1 i
join s on add_months(i.min_date,s.n) <= i.max_date
LEFT join test t on t.id = i.id and add_months(i.min_date,s.n) = t.ref_date
And with median
with int1 as (select id
, max(ref_date) as max_date
, min(ref_date) as min_date from test group by id )
-- + 1 so that n runs from 0 through the longest month span, last month included
, s(n) as (select level -1 from dual connect by level <= (select max(months_between(max_date, min_date)) + 1 from int1 ) )
select i.id
, MEDIAN(nvl(value,0)) as value
from int1 i
join s on add_months(i.min_date,s.n) <= i.max_date
LEFT join test t on t.id = i.id and add_months(i.min_date,s.n) = t.ref_date
group by i.id