Ungroup/disaggregate in HIVE

Is it possible to ungroup a dataset in hive? I don't believe you can lateral view explode an integer.
Current table:
event count
A 3
B 2
Result table:
event count
A 1
A 1
A 1
B 1
B 1
Count column obviously not super important in the result.

Using the space() function you can convert count into a string of spaces of length count-1, then use split() to convert it to an array and explode() it with a lateral view to generate the rows.
Just replace the subquery aliased a in my demo with your table.
Demo:
select a.event,
1 as count --calculate count somehow if necessary
from
(select stack(2,'A',3,'B',2) as (event, count)) a --Replace this subquery with your table name
lateral view explode(split(space(a.count-1),' ')) s
;
Result:
OK
A 1
A 1
A 1
B 1
B 1
Time taken: 0.814 seconds, Fetched: 5 row(s)
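For reference, applied to a real table instead of the demo subquery, it would look roughly like this (a sketch only; your_table and its columns event and cnt are assumed names, so adjust them to your schema):
select t.event,
       1 as count
  from your_table t
       lateral view explode(split(space(t.cnt-1),' ')) s
;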

One option is to create a numbers table and use it for disaggregation.
--create numbers table
create table if not exists dbname.numbers
location 'some_hdfs_location' as
select stack(5,1,2,3,4,5) as (num) --increase the number of values as needed
--Disaggregation
select a.event, n.num --or a.cnt
from dbname.agg_table a
cross join dbname.numbers n
where n.num >= 1 and n.num <= a.cnt
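To sanity-check this on the question's sample data before pointing it at a real table, you can keep the stack() subquery from the first demo in place of agg_table (assuming the numbers table above has been created):
select a.event, 1 as count
  from (select stack(2,'A',3,'B',2) as (event, cnt)) a
       cross join dbname.numbers n
 where n.num >= 1 and n.num <= a.cnt
;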

If the number of records to disaggregate is high and you don't want to hard-code it, create a UDF which will return a sequence of numbers.
[prjai#lnx0689 py_ws]$ cat prime_num.py
import sys
try:
    for line in sys.stdin:
        num = int(line)
        for i in range(1, num + 1):
            #print u"i".encode('utf-8')
            print u"%i".encode('utf-8') % (i)
except:
    print sys.exc_info()
Add the Python script to the Hive environment:
hive> add FILE /home/prjai/prvys/py_ws/prime_num.py;
Create a temporary table for the above script:
hive> create temporary table t1 as with t1 as (select transform(10) using 'python prime_num.py' as num1) select * from t1;
Your query would be -
hive> with t11 as (select 'A' as event, 3 as count) select t11.event, t11.count from t11, t1 where t11.count>=t1.num1;
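If you'd rather not hard-code the 10, one option (an untested sketch; dbname.agg_table and its cnt column are assumed names) is to feed the maximum count into the transform instead:
hive> create temporary table t1 as
      select transform(m.max_cnt) using 'python prime_num.py' as num1
      from (select max(cnt) as max_cnt from dbname.agg_table) m;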
Hope this helps.

Related

Compare every field in table to every other field in same table

Imagine a table with only one column.
+------+
| v |
+------+
|0.1234|
|0.8923|
|0.5221|
+------+
I want to do the following for row K:
Take row K=1 value: 0.1234
Count how many values in the rest of the table are less than or equal to value in row 1.
Iterate through all rows
Output should be:
+------+-------+
| v |output |
+------+-------+
|0.1234| 0 |
|0.8923| 2 |
|0.5221| 1 |
+------+-------+
Quick update: I was using this approach to compute a statistic at every value of v in the above table. The cross join approach was way too slow for the size of data I was dealing with, so instead I computed my stat for a grid of v values and then matched them to the vs in the original data. v_table is the data table from before and stat_comp is the statistics table.
AS SELECT t1.*
,CASE WHEN v<=1.000000 THEN pr_1
WHEN v<=2.000000 AND v>1.000000 THEN pr_2
FROM v_table AS t1
LEFT OUTER JOIN stat_comp AS t2
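The statement above is only an excerpt. Filled out, it would look roughly like the sketch below; the output table name, the result column alias, the remaining CASE branches, and the join condition were all trimmed from the original, so they are assumptions here:
CREATE TABLE v_matched AS
SELECT t1.*
     , CASE WHEN v <= 1.000000 THEN pr_1
            WHEN v <= 2.000000 AND v > 1.000000 THEN pr_2
            -- ...one branch per grid interval, the pr_N columns coming from stat_comp
       END AS stat
  FROM v_table AS t1
  LEFT OUTER JOIN stat_comp AS t2
    ON (1 = 1) -- assumed: stat_comp holds one row of precomputed stats pr_1, pr_2, ...
;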
Window functions were added to ANSI/ISO SQL in 1999 and to Hive in version 0.11, which was released on 15 May 2013.
What you are looking for is a variation on rank with ties high, which in ANSI/ISO SQL:2011 would look like this:
rank () over (order by v with ties high) - 1
Hive currently does not support with ties, but the logic can be implemented using count(*) over (...):
select v
,count(*) over (order by v) - 1 as rank_with_ties_high_implicit
from mytable
;
or
select v
,count(*) over
(
order by v
range between unbounded preceding and current row
) - 1 as rank_with_ties_high_explicit
from mytable
;
Generate sample data
select 0.1234 as v into #t
union all
select 0.8923
union all
select 0.5221
This is the query
;with ct as (
select ROW_NUMBER() over (order by v) rn
, v
from #t ot
)
select distinct v, a.cnt
from ct ot
outer apply (select count(*) cnt from ct where ct.rn <> ot.rn and v <= ot.v) a
After seeing your edits, it really does look like you could use a Cartesian product, i.e. CROSS JOIN, here. I called your table foo, and cross joined it to itself as bar:
SELECT foo.v, COUNT(foo.v) - 1 AS output
FROM foo
CROSS JOIN foo bar
WHERE foo.v >= bar.v
GROUP BY foo.v;
This query cross joins the table with itself such that every pairing of the column's values is returned (you can see this yourself by removing the COUNT and GROUP BY clauses and adding bar.v to the SELECT). It then counts one for each pair where foo.v >= bar.v, and the - 1 discounts the row's pairing with itself, yielding the final result.
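To see the intermediate result the previous sentence describes, you can run the ungrouped version of the same query (purely a debugging aid):
SELECT foo.v AS foo_v, bar.v AS bar_v
FROM foo
CROSS JOIN foo bar
WHERE foo.v >= bar.v;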
You can take the full Cartesian product of the table with itself and sum a case statement:
select a.x
, sum(case when b.x < a.x then 1 else 0 end) as count_less_than_x
from (select distinct x from T) a
, T b
group by a.x
This will give you one row per unique value in the table, with the count of rows (duplicates included) whose value is less than that value.
Notice that there is neither a join condition nor a where clause. In this case, we actually want that: for each row of a we get a full copy of the table aliased as b, and we check each one to see whether or not its value is less than a.x. If it is, we add 1 to the count; if not, we just add 0.

Adding first two rows result as a second row then addition of first three rows result as a third row and so on

I have a table in SQL Server which has one column storing integer values.
Ex : ColumnData
100
150
20
25
300
Now, using this data, I want the result shown below.
columndata NewColumn
100 100
150 250
20 270
25 295
300 595
So in the output, NewColumn is a running total: the first row keeps its own value, the second row is the sum of the first two rows, the third row is the sum of the first three rows, and so on.
Could anyone please provide the query to get this result?
Thanks in advance,
Phani Kumar.
Assuming that you have a column that you can order the data by then you can compute a running total by either using a windowed aggregate function (this works in SQL Server 2012+) or a self join (which works in any version). If you don't have any column to order by then it can't be done in a deterministic way at all.
-- sample table:
create table t (id int identity(1,1), ColumnData int)
insert t values (100),(150),(20),(25),(300)
-- query 1 using windowed aggregate
select ColumnData, sum(ColumnData) over (order by id) as NewColumn
from t order by id
-- query 2 using self-join
select t1.ColumnData, sum(t2.ColumnData) as NewColumn
from t t1
join t t2 on t2.id <= t1.id
group by t1.id, t1.ColumnData
order by t1.id
You can also do this procedurally with PL/SQL (Oracle).
Alter the table to add a new field id to sort by and a field value2 to hold the final result.
DECLARE
   l_last_sum INTEGER := 0;
   CURSOR test_cur
   IS
      SELECT id, value
      FROM test
      ORDER BY id ASC;
   l_test test_cur%ROWTYPE;
BEGIN
   OPEN test_cur;
   LOOP
      FETCH test_cur INTO l_test;
      EXIT WHEN test_cur%NOTFOUND;
      l_last_sum := l_last_sum + l_test.value;
      update test set value2 = l_last_sum where id = l_test.id;
   END LOOP;
   CLOSE test_cur;
END;
SQL> select * from test;
ID VALUE VALUE2
---------- ---------- ----------
1 100 100
2 25 125
3 40 165
with sal as (
    select a.empid, salry, row_number() over (order by empid) rn
    from empmaster a
)
select a.empid, a.salry, b.salry, a.salry + b.salry
from sal a
left outer join sal b on a.rn = b.rn - 1

Generate SQL rows

Given a number of types and a number of occurrences per type, I would like to generate something like this in T-SQL:
Occurrence | Type
-----------------
0 | A
1 | A
0 | B
1 | B
2 | B
Both the number of types and the number of occurrences per type are presented as values in different tables.
While I can do this with WHILE loops, I'm looking for a better solution.
Thanks!
This works with a numbers table, which I would use.
SELECT Occurrence = ROW_NUMBER() OVER (PARTITION BY Type ORDER BY Type) - 1
, Type
FROM Numbers num
INNER JOIN #temp1 t
ON num.n BETWEEN 1 AND t.Occurrence
Tested with this sample data:
create table #temp1(Type varchar(10),Occurrence int)
insert into #temp1 VALUES('A',2)
insert into #temp1 VALUES('B',3)
How to create a numbers table: http://sqlperformance.com/2013/01/t-sql-queries/generate-a-set-1
If you have a table with the columns type and num, you have two approaches. One way is to use recursive CTEs:
with CTE as (
select type, 0 as occurrence, num
from table t
union all
select type, 1 + occurrence, num
from cte
where occurrence + 1 < num
)
select cte.*
from cte;
You may have to set the MAXRECURSION option, if the number exceeds 100.
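The hint goes at the very end of the whole statement, so the final select above would become, for example:
select cte.*
from cte
option (maxrecursion 0); -- 0 removes the limit; a fixed cap such as 1000 also works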
The other way is to join in a numbers table. SQL Server ships with spt_values in the master database, which is often used for this purpose:
select s.number - 1 as occurrence, t.type
from table t join
spt_values s
on s.number <= t.num ;
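Note that spt_values holds rows of several types, so in practice it is usually filtered to type 'P' (a contiguous run of integers from 0 to 2047 on recent versions), and the lower bound matters too. A hedged variant of the query above (mytable standing in for your table):
select s.number - 1 as occurrence, t.type
from mytable t
join master..spt_values s
  on s.type = 'P'
 and s.number between 1 and t.num;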

How to SELECT top N rows that sum to a certain amount?

Suppose:
MyTable
--
Amount
1
2
3
4
5
MyTable only has one column, Amount, with 5 rows. They are not necessarily in increasing order.
How can I create a function which takes a @SUM INT parameter and returns the TOP N rows that sum to this amount?
So for input 6, I want
Amount
1
2
3
Since 1 + 2 + 3 = 6. 2 + 4 or 1 + 5 won't work, since I want the TOP N rows.
For 7/8/9/10, I want
Amount
1
2
3
4
I'm using MS SQL Server 2008 R2, if this matters.
Saying "top N rows" is indeed ambiguous when it comes to relational databases.
I assume that you want to order by "amount" ascending.
I would add a second column (to a table or view) like "sum_up_to_here", and create something like this:
create view mytable_view as
select
    mt1.amount,
    coalesce(sum(mt2.amount), 0) as sum_up_to_here -- coalesce so the smallest amount gets 0 instead of NULL
from
    mytable mt1
    left join mytable mt2 on (mt2.amount < mt1.amount)
group by mt1.amount
or:
create view mytable_view as
select
    mt1.amount,
    coalesce((select sum(amount) from mytable where amount < mt1.amount), 0) as sum_up_to_here
from mytable mt1
and then I would select the final rows:
select amount from mytable_view where sum_up_to_here < (some value)
If you aren't bothered about performance, you may of course run it in one query:
select amount from
(
    select
        mt1.amount,
        coalesce(sum(mt2.amount), 0) as sum_up_to_here
    from
        mytable mt1
        left join mytable mt2 on (mt2.amount < mt1.amount)
    group by mt1.amount
) t where sum_up_to_here < 20
One approach:
select t1.amount
from MyTable t1
left join MyTable t2 on t1.amount > t2.amount
group by t1.amount
having coalesce(sum(t2.amount),0) < 7
In SQL Server you can use CTEs (common table expressions) to make this pretty simple to read.
Here is a CTE I wrote to sum up totals used in sequence. The CTE is similar to the joins above and holds the running total up to any given index. Outside of the CTE I join it back to the original table so I can select it along with other fields.
;with summrp as (
select m1.idx, sum(m2.QtyReq) as sumUsed
from #mrpe m1
join #mrpe m2 on m2.idx <= m1.idx
group by m1.idx
)
select RefNum, RefLineSuf, QtyReq, ProjectedDate, sumUsed from #mrpe m
join summrp on summrp.idx=m.idx
In SQL Server 2012 you can use this shortcut to get a result like Grzegorz's.
SELECT amount
FROM (
SELECT * ,
SUM(amount) OVER (ORDER BY amount ASC) AS total
from demo
) T
WHERE total <= 6
A fiddle in the hand... http://sqlfiddle.com/#!6/b8506/6

MySQL - How to simplify this query?

I have a query which I want to simplify:
select
sequence,
1 added
from scoredtable
where score_timestamp=1292239056000
and sequence
not in (select sequence from scoredtable where score_timestamp=1292238452000)
union
select
sequence,
0 added
from scoredtable
where score_timestamp=1292238452000
and sequence
not in (select sequence from scoredtable where score_timestamp=1292239056000);
Any ideas? Basically I want to extract from the same table all the sequences that differ between two timestamp values, with a column "added" which indicates whether a row is new or has been deleted.
Source table:
score_timestamp sequence
1292239056000 0
1292239056000 1
1292239056000 2
1292238452000 1
1292238452000 2
1292238452000 3
Example between (1292239056000, 1292238452000)
Query result (2 rows):
sequence added
3 1
0 0
Example between (1292238452000, 1292239056000)
Query result (2 rows):
sequence added
0 1
3 0
Example between (1292239056000, 1292239056000)
Query result (0 rows):
sequence added
This query gets all sequences that appear only once within both timestamps, and checks if it occurs for the first or for the second timestamp.
SELECT
sequence,
CASE WHEN MIN(score_timestamp) = 1292239056000 THEN 0 ELSE 1 END AS added
FROM scoredtable
WHERE score_timestamp IN ( 1292239056000, 1292238452000 )
AND ( 1292239056000 <> 1292238452000 ) -- No rows, when timestamp is the same
GROUP BY sequence
HAVING COUNT(*) = 1
It returns your desired result:
sequence added
3 1
0 0
Given two timestamps
SET @ts1 := 1292239056000;
SET @ts2 := 1292238452000;
you can get your additions and deletions with:
SELECT s1.sequence AS sequence, 0 as added
FROM scoredtable s1 LEFT JOIN
     scoredtable s2 ON
     s2.score_timestamp = @ts2 AND
     s1.sequence = s2.sequence
WHERE
     s1.score_timestamp = @ts1 AND
     s2.score_timestamp IS NULL
UNION ALL
SELECT s2.sequence, 1
FROM scoredtable s1 RIGHT JOIN
     scoredtable s2 ON s1.score_timestamp = @ts1 AND
     s1.sequence = s2.sequence
WHERE
     s2.score_timestamp = @ts2 AND
     s1.score_timestamp IS NULL
Depending on the number of rows and the statistics, the above query might perform better than the group by and having count(*) = 1 version (I think that one will always need a full table scan, while the above union should be able to do two anti-joins, which might fare better).
If you have a substantial data set, do let us know which is faster (test with SQL_NO_CACHE for comparable results).
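For completeness, SQL_NO_CACHE is just a hint placed immediately after SELECT, so a comparison run on the first branch of the union would look something like this:
SELECT SQL_NO_CACHE s1.sequence AS sequence, 0 AS added
FROM scoredtable s1
LEFT JOIN scoredtable s2
       ON s2.score_timestamp = @ts2 AND s1.sequence = s2.sequence
WHERE s1.score_timestamp = @ts1
  AND s2.score_timestamp IS NULL;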