I have a table in Hive with the below values.

First table:

ID  value
1   NULL
1   NULL

Second table:

ID  value
1   NULL
1   2
While doing the sum with

select id, sum(value) from table group by id;

I need the output as follows.

First table required output:

id  sum
1   NULL

Second table required output:

id  sum
1   2
SUM() ignores NULL values by default, so 2 + NULL still gives 2 and the sum works out anyway. Don't worry about this; Hive will take care of it by default.
hive> create table first (Id int,value int);
OK
Time taken: 3.895 seconds
hive> select * from first;
OK
1 2
1 NULL
hive> select id, sum(value) as sum from first group by id;
Total MapReduce CPU Time Spent: 4 seconds 610 msec
OK
1 2
Time taken: 83.483 seconds, Fetched: 1 row(s)
If you need to filter out rows with null sum, use having:
select id, sum(value) from table group by id having sum(value) is not null;
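The NULL-skipping behavior of SUM() can be sketched with SQLite from Python, which aggregates NULLs the same way Hive does (table and column names here just mirror the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table first (id integer, value integer)")
conn.executemany("insert into first values (?, ?)", [(1, 2), (1, None)])

# SUM skips NULLs, so the NULL row contributes nothing to the total.
rows = conn.execute("select id, sum(value) from first group by id").fetchall()
print(rows)  # [(1, 2)]
```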
I have a table with this structure:
ID Version Content
-------------------------------------------------------
1 1 sometext
1 2 anothertext
1 3 someverydifferenttext
So all rows have the same ID but a different Version. I want to copy these rows (Insert Statement) and increase the Version-Column to the next free number.
Expected Result:
ID Version Content
-------------------------------------------------------
1 1 sometext
1 2 anothertext
1 3 someverydifferenttext
1 4 sometext
1 5 anothertext
1 6 someverydifferenttext
Is there a way to do this in a single INSERT ... SELECT statement?
I tried with...
Insert Into MyTable
SELECT ID
, MAX([Version])OVER(PARTITION BY ID ORDER BY [Version] DESC) + 1
,Content
FROM MyTable
But this does not work, because MAX() would have to be re-evaluated after each individual row is inserted. The only option I currently see is a loop.
I use T-SQL.
Seems you could achieve this with ROW_NUMBER and a windowed MAX:
INSERT INTO dbo.YourTable
SELECT ID,
ROW_NUMBER() OVER (ORDER BY Version) + MAX(Version) OVER () AS Version,
Content
FROM dbo.YourTable WITH (UPDLOCK, HOLDLOCK);
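As a rough check of the window arithmetic, the same INSERT ... SELECT runs against SQLite (3.25+, which has window functions) from Python; the locking hints are SQL Server specific and dropped here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table MyTable (ID integer, Version integer, Content text)")
conn.executemany("insert into MyTable values (?, ?, ?)",
                 [(1, 1, "sometext"), (1, 2, "anothertext"),
                  (1, 3, "someverydifferenttext")])

# Each copied row gets its position among the copies (ROW_NUMBER) added to
# the current highest version (windowed MAX), yielding versions 4, 5, 6.
conn.execute("""
    insert into MyTable
    select ID,
           row_number() over (order by Version) + max(Version) over () as Version,
           Content
    from MyTable
""")
rows = conn.execute(
    "select ID, Version, Content from MyTable order by Version").fetchall()
print(rows[-3:])
```

Note that ROW_NUMBER() without a PARTITION BY only works because all rows share one ID, as in the question; with multiple IDs you would partition by ID and take the MAX per partition.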
How do I group the IDs with the NULL rows located directly below each one, and then get the sum of the time?
ID time
1 time1
null time1
null time1
null time1
2 time1
null time1
null time1
3 time1
null time1
null time1
Result wanted
ID time
1 sumTime
2 sumTime
3 sumTime
SQL tables represent unordered sets. In order for you to do what you want, you need a column that specifies the ordering. Once you have that, you can identify the groups by counting up the cumulative number of non-null values in id and aggregating:
select max(id) as id, sum(time)
from (select t.*,
             count(id) over (order by <ordering col>) as grp
      from t
     ) t
group by grp;
If you do not have an ordering column, your question does not make sense, because the table is unordered.
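The cumulative-count trick can be sketched with SQLite from Python; the `ord` column is an assumed ordering column, and the times are made-up numbers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (ord integer, id integer, time integer)")
# Each NULL id belongs to the non-NULL id on the row above it.
conn.executemany("insert into t values (?, ?, ?)",
                 [(1, 1, 10), (2, None, 20), (3, None, 5),
                  (4, 2, 7), (5, None, 3),
                  (6, 3, 1), (7, None, 4)])

# count(id) ignores NULLs, so the running count only ticks up on non-NULL
# ids, stamping every row with the group of the last non-NULL id seen.
result = conn.execute("""
    select max(id) as id, sum(time)
    from (select t.*, count(id) over (order by ord) as grp from t) t
    group by grp
    order by grp
""").fetchall()
print(result)  # [(1, 35), (2, 10), (3, 5)]
```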
I agree with Gordon Linoff that what you are asking falls outside the rules of SQL Server because tables in SQL Server are unordered sets.
However, assuming that running the command
SELECT * FROM YourTimeTable
returns the data in the order and structure you showed:
ID time
1 time1
null time1
null time1
null time1
2 time1
null time1
null time1
3 time1
null time1
null time1
You can make it work with the following strategy:
Add a new column with row numbers that we can use to add ordering
Then run an update statement setting each NULL ID to the highest ID found on lower-numbered rows.
if OBJECT_ID('tempdb.dbo.#tempTimeTable') IS NOT NULL
begin
drop table #tempTimeTable
end
SELECT ROW_NUMBER() OVER(ORDER BY TIME) AS RowN, * into #tempTimeTable FROM YourTimeTable
update t1
set ID = (select max(ID) from #tempTimeTable t2 where t2.RowN < t1.RowN)
from #tempTimeTable t1
where id is null
select ID, SUM([time]) from #tempTimeTable group by [ID]
What we are doing is:
Insert the data from the original table into a temp table with a new column added that indicates the row number.
Update the ID fields on the rows that are NULL, setting each to the highest ID from lower-numbered rows only. It will look like this:
1 time1
1 time1
1 time1
1 time1
2 time1
2 time1
2 time1
3 time1
3 time1
3 time1
Retrieve the data after summing all the times for each ID together.
Let me know if this works for you.
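The fill-down update translates to SQLite almost verbatim (only the T-SQL UPDATE ... FROM alias syntax changes); the times below are made-up numbers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tmp (RowN integer, ID integer, time integer)")
conn.executemany("insert into tmp values (?, ?, ?)",
                 [(1, 1, 10), (2, None, 20), (3, None, 5),
                  (4, 2, 7), (5, None, 3),
                  (6, 3, 1), (7, None, 4)])

# Fill each NULL ID with the highest ID found on lower-numbered rows.
conn.execute("""
    update tmp
    set ID = (select max(ID) from tmp t2 where t2.RowN < tmp.RowN)
    where ID is null
""")
result = conn.execute(
    "select ID, sum(time) from tmp group by ID order by ID").fetchall()
print(result)  # [(1, 35), (2, 10), (3, 5)]
```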
I am using an Oracle 12c database and I have a table with the following structure:
Id NUMBER
SeqNo NUMBER
Val NUMBER
Valid VARCHAR2
A composite primary key is created with the field Id and SeqNo.
I would like to fetch the data with Valid = 'Y' and apply keyset pagination with a page size of 3. Assume I have the following data:
Id SeqNo Val Valid
1 1 10 Y
1 2 20 N
1 3 30 Y
1 4 40 Y
1 5 50 Y
2 1 100 Y
2 2 200 Y
Expected result:
----------------------------
Page 1
----------------------------
Id SeqNo Val Valid
1 1 10 Y
1 3 30 Y
1 4 40 Y
----------------------------
Page 2
----------------------------
Id SeqNo Val Valid
1 5 50 Y
2 1 100 Y
2 2 200 Y
Offset pagination can be done like this:
SELECT * FROM table ORDER BY Id, SeqNo OFFSET 3 ROWS FETCH NEXT 3 ROWS ONLY;
However, the actual db has more than 5 million records, and using OFFSET is going to slow down the query a lot. Therefore, I am looking for a keyset pagination approach (skipping records using some unique fields instead of OFFSET).
Since a composite primary key is used, I need to offset the page with information from more than 1 field.
This is a sample SQL that should work in PostgreSQL (fetch 2nd page):
SELECT * FROM table WHERE (Id, SeqNo) > (1, 4) AND Valid = 'Y' ORDER BY Id, SeqNo LIMIT 3;
How do I achieve the same in Oracle?
Use the row_number() analytic function together with the ceil() arithmetic function. Arithmetic functions have no real performance cost, and the row_number() over (order by ...) expression orders the data itself, regardless of insertion order and without an extra order by clause in the main query. So, consider:
select Id,SeqNo,
ceil(row_number() over (order by Id,SeqNo)/3) as page
from tab
where Valid = 'Y';
P.S. It also works for Oracle 11g, while OFFSET 3 ROWS FETCH NEXT 3 ROWS ONLY works only for Oracle 12c.
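The page-numbering arithmetic can be sketched with SQLite from Python; SQLite lacks ceil(), so integer division (rn + 2) / 3 stands in for ceil(rn / 3) on positive row numbers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tab (Id integer, SeqNo integer, Val integer, Valid text)")
conn.executemany("insert into tab values (?, ?, ?, ?)",
                 [(1, 1, 10, 'Y'), (1, 2, 20, 'N'), (1, 3, 30, 'Y'),
                  (1, 4, 40, 'Y'), (1, 5, 50, 'Y'),
                  (2, 1, 100, 'Y'), (2, 2, 200, 'Y')])

# Number the valid rows and assign each a page of 3 rows.
rows = conn.execute("""
    select Id, SeqNo, Val,
           (row_number() over (order by Id, SeqNo) + 2) / 3 as page
    from tab
    where Valid = 'Y'
""").fetchall()
page2 = [r[:3] for r in rows if r[3] == 2]
print(page2)  # [(1, 5, 50), (2, 1, 100), (2, 2, 200)]
```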
You can use order by and then fetch rows using FETCH and OFFSET like the following:
Select ID, SEQ, VAL, VALID FROM TABLE
WHERE VALID = 'Y'
ORDER BY ID, SEQ
--FETCH FIRST 3 ROWS ONLY -- first page
--OFFSET 3 ROWS FETCH NEXT 3 ROWS ONLY -- second pages
--OFFSET 6 ROWS FETCH NEXT 3 ROWS ONLY -- third page
--Update--
You can use the row_number analytic function as follows:
select id, seqNo, Val, valid from
  (select t.*,
          row_number() over (order by id, seqNo) as rn
   from table t
   where valid = 'Y')
where ceil(rn/3) = 2 -- for page no. 2
Cheers!!
Is it possible to ungroup a dataset in hive? I don't believe you can lateral view explode an integer.
Current table:
event count
A 3
B 2
Result table:
event count
A 1
A 1
A 1
B 1
B 1
Count column obviously not super important in the result.
Using the space() function you can convert count to a string of spaces with length = count-1, then use split() to convert it to an array and explode() with lateral view to generate the rows.
Just replace the subquery aliased a in my demo with your table.
Demo:
select a.event,
1 as count --calculate count somehow if necessary
from
(select stack(2,'A',3,'B',2) as (event, count)) a --Replace this subquery with your table name
lateral view explode(split(space(a.count-1),' ')) s
;
Result:
OK
A 1
A 1
A 1
B 1
B 1
Time taken: 0.814 seconds, Fetched: 5 row(s)
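The space()/split() arithmetic is easy to check in plain Python: a string of count-1 spaces split on a single space yields exactly count empty strings, one per row to generate.

```python
# One (event, 1) row per empty string produced by splitting count-1 spaces.
def ungroup(event, count):
    return [(event, 1) for _ in (" " * (count - 1)).split(" ")]

rows = ungroup("A", 3) + ungroup("B", 2)
print(rows)  # [('A', 1), ('A', 1), ('A', 1), ('B', 1), ('B', 1)]
```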
One option is to create a numbers table and use it for disaggregation.
--create numbers table
create table if not exists dbname.numbers
location 'some_hdfs_location' as
select stack(5,1,2,3,4,5) as (num) --increase the number of values as needed
--Disaggregation
select a.event, n.num --or a.cnt
from dbname.agg_table a
cross join dbname.numbers n
where n.num <= a.cnt
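The join condition is the whole trick: each aggregated row matches as many numbers rows as its count. A sketch with SQLite from Python, using hypothetical table names mirroring the snippet:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table agg_table (event text, cnt integer)")
conn.executemany("insert into agg_table values (?, ?)", [("A", 3), ("B", 2)])
conn.execute("create table numbers (num integer)")
conn.executemany("insert into numbers values (?)", [(i,) for i in range(1, 6)])

# Each event row joins to cnt rows of the numbers table, un-aggregating it.
rows = conn.execute("""
    select a.event, 1 as cnt
    from agg_table a cross join numbers n
    where n.num <= a.cnt
    order by a.event
""").fetchall()
print(rows)  # [('A', 1), ('A', 1), ('A', 1), ('B', 1), ('B', 1)]
```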
If the number of rows to disaggregate is high and you don't want to hard-code it, create a UDF that returns a sequence of numbers.
[prjai#lnx0689 py_ws]$ cat prime_num.py
import sys

# For each input value n, emit the numbers 1..n, one per line.
for line in sys.stdin:
    num = int(line.strip())
    for i in range(1, num + 1):
        print(i)
Add python script to hive env
hive> add FILE /home/prjai/prvys/py_ws/prime_num.py
Create temporary table for above script
hive> create temporary table t1 as with t1 as (select transform(10) using 'python prime_num.py' as num1) select * from t1;
Your query would be -
hive> with t11 as (select 'A' as event, 3 as count) select t11.event, t11.count from t11, t1 where t11.count>=t1.num1;
Hope this helps.
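Outside Hive, the script's stdin/stdout contract can be exercised directly; feeding it 10 should produce the numbers 1 through 10, one per line:

```python
import subprocess
import sys
import textwrap

# The same per-line logic as the transform script above.
script = textwrap.dedent("""\
    import sys
    for line in sys.stdin:
        num = int(line.strip())
        for i in range(1, num + 1):
            print(i)
""")

out = subprocess.run([sys.executable, "-c", script],
                     input="10\n", capture_output=True, text=True).stdout
nums = [int(x) for x in out.split()]
print(nums)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```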
Suppose I have a table like this:
thedate ID
2014-10-20 14:13:42.063 1
2014-10-20 14:13:43.063 1
2014-10-20 14:13:47.063 1
2014-10-20 14:12:50.063 2
2014-10-20 14:13:49.063 2
2014-10-20 14:13:54.063 2
2014-10-20 14:20:24.063 2
2014-10-20 14:13:02.063 3
To replicate a similar toybox table as in this example you can use the following code:
declare #tmp as table(thedate datetime,ID int)
insert into #tmp (thedate, ID) values
(dateadd(s,0,getdate()),1), (dateadd(s,1,getdate()),1), (dateadd(s,5,getdate()),1),
(dateadd(s,-52,getdate()),2), (dateadd(s,7,getdate()),2), (dateadd(s,12,getdate()),2),(dateadd(s,402,getdate()),2),
(dateadd(s,-40,getdate()),3)
For each ID I want the average time between the dates. Now the database is huge (lots of IDs and dates for each ID), so it has to be very efficient. I want a result like this:
ID AvgTime (seconds)
1 2,5
2 151,333333333333
3 NULL
The following code does what I want, but it is way too slow:
select
a.ID,
(select top 1 avg(cast(datediff(s,(select max(thedate)
from #tmp c where ID = b.ID
and thedate < b.thedate)
,thedate) as float)) over (partition by b.ID)
from #tmp b where ID = a.ID)
from #tmp a group by ID
Does anyone know how to do this efficiently?
The average gap is the maximum minus the minimum divided by one less than the count: the consecutive differences telescope, so they sum to max minus min. You can use this to write a relatively simple query:
select id,
cast(datediff(second, min(thedate), max(thedate)) as float) / (count(*) - 1)
from #tmp
group by id;
If some of the ids only have one row, then you'll want to check for potential divide by 0:
select id,
(case when count(*) > 1
then cast(datediff(second, min(thedate), max(thedate)) as float) / (count(*) - 1)
end) as AvgDiff
from #tmp
group by id;
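The telescoping identity is easy to sanity-check in Python with the ID 1 dates from the example (gaps of 1 and 4 seconds averaging 2.5):

```python
from datetime import datetime, timedelta

def avg_gap_seconds(dates):
    # (max - min) / (count - 1) equals the mean of consecutive gaps,
    # because the gaps telescope; None when a single date gives no gap.
    if len(dates) < 2:
        return None
    return (max(dates) - min(dates)).total_seconds() / (len(dates) - 1)

base = datetime(2014, 10, 20, 14, 13, 42)
dates = [base, base + timedelta(seconds=1), base + timedelta(seconds=5)]
print(avg_gap_seconds(dates))   # 2.5
print(avg_gap_seconds([base]))  # None
```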