How to use the recursive query in Hive

My table has blank values in the Sequence column:
ID Page Timestamp Sequence
Orestes Login 152356 1
Orestes Account view 152368
Orestes Transfer 152380
Orestes Account view 162382 2
Orestes Loan 162393
Antigone Login 152382 1
Antigone Transfer 152390
I want to change it as shown below:
ID Page Timestamp Sequence
Orestes Login 152356 1
Orestes Account view 152368 1
Orestes Transfer 152380 1
Orestes Account view 162382 2
Orestes Loan 162393 2
Antigone Login 152382 1
Antigone Transfer 152390 1
I have tried...
with r1
as
(select id, page, timestamp, lag(sequence) over (partition by id order by timestamp) as sequence from log),
r2
as
(select id, page, timestamp, sequence from log)
insert into test1
select a.id, a.page, a.timestamp, case when a.sequence is not null then a.sequence
when b.sequence is not null then b.sequence
else a.sequence
end
from r1 a join r2 b on a.id=b.id and a.timestamp=b.timestamp
;
create table test2 like test1
;
with r1
as
(select id, page, timestamp, lag(sequence) over (partition by id order by timestamp) as sequence from test1),
r2
as
(select id, page, timestamp, sequence from test1)
insert into test2
select a.id, a.page, a.timestamp, case when a.sequence is not null then a.sequence
when b.sequence is not null then b.sequence
else a.sequence
end
from r1 a join r2 b on a.id=b.id and a.timestamp=b.timestamp
;
create table test3 like test2
;
and I repeat this to fill the next blank, over and over, until my fingers are numb...
How do I fill each blank with the immediately preceding value, as shown above? I think I should use a recursive query, but I cannot find a way.

You don't need a recursive query at all.
There are two functions in Hive that can help you:
LAST_VALUE - returns the last value in the window frame; with TRUE as the second argument it skips NULLs
COALESCE - returns the first non-NULL value among its arguments
So your query should look like:
create table tmp_table like original_table;
insert into tmp_table
SELECT
id,
page,
ts,
COALESCE(sequence,
LAST_VALUE(sequence, TRUE) OVER(PARTITION BY id ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW))
FROM original_table;
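For anyone who wants to try the fill-forward idea without a Hive cluster, here is a rough sketch in Python using the built-in sqlite3 module. SQLite has no LAST_VALUE(..., TRUE), so this swaps in a different technique: a running COUNT(sequence) puts each blank row in the same group as the last filled row, and MAX per group then copies the value down. The table name log comes from the question; the column name ts and the grouped/grp names are my own assumptions. It needs an SQLite build with window functions (3.25 or later).

```python
import sqlite3

# Sample rows from the question; the timestamp column is called ts here
# (an assumption) to stay clear of the TIMESTAMP keyword.
rows = [
    ("Orestes", "Login", 152356, 1),
    ("Orestes", "Account view", 152368, None),
    ("Orestes", "Transfer", 152380, None),
    ("Orestes", "Account view", 162382, 2),
    ("Orestes", "Loan", 162393, None),
    ("Antigone", "Login", 152382, 1),
    ("Antigone", "Transfer", 152390, None),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (id TEXT, page TEXT, ts INTEGER, sequence INTEGER)")
conn.executemany("INSERT INTO log VALUES (?, ?, ?, ?)", rows)

# COUNT(sequence) ignores NULLs, so the running count only grows on rows that
# carry a value; each blank row shares its grp with the last filled row.
query = """
WITH grouped AS (
    SELECT id, page, ts, sequence,
           COUNT(sequence) OVER (PARTITION BY id ORDER BY ts) AS grp
    FROM log
)
SELECT id, page, ts,
       MAX(sequence) OVER (PARTITION BY id, grp) AS sequence
FROM grouped
ORDER BY id DESC, ts
"""
filled = conn.execute(query).fetchall()
for row in filled:
    print(row)
```

Partitioning by id keeps one user's values from leaking into another user's blanks, which matters as soon as the timestamps of different users interleave.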

Related

Oracle SQL: using LAG function with user-defined-type returns "inconsistent datatypes"

I have a type MyType defined as follows:
create or replace type MyType as varray(20000) of number(18);
And a table MyTable defined as follows:
create table MyTable (
id number(18) primary key
,widgets MyType
)
I am trying to select the widgets for each row and its logically previous row in MyTable using the following SQL:
select t.id
,lag(t.widgets,1) over (order by t.id) as widgets_previous
from MyTable t
order by t.id;
and I get the response:
ORA-00932: inconsistent datatypes: expected - got MYSCHEMA.MYTYPE
If I run the exact same query using a column of type varchar or number instead of MyType it works fine.
The type of the column in the current row and its previous row must be the same so I can only assume it is something related to the user defined type.
Do I need to do something special to use LAG with a user defined type, or does LAG not support user defined types? If the latter, are there any other utility functions that would provide the same functionality or do I need to do a traditional self join in order to achieve the same?
After reading all the above I've opted for the following as the most effective method for achieving what I need:
select curr.id
,curr.widgets as widgets
,prev.widgets as previous_widgets
from (select a.id
,a.widgets
,lag(a.id,1) over (order by a.id) as previous_id
from mytable a
) curr
left join mytable prev on (prev.id = curr.previous_id)
order by curr.id
i.e., a lag / self join hybrid that uses lag on a number field, which it doesn't complain about, to identify the join condition. It's fairly tidy I think and I get my collections as desired. Thanks to everyone for the extremely useful input.
You can use LAG with a user-defined type. The problem is the varray.
Does this give you a result?
select t.id
,lag(
(select listagg(column_value, ',') within group (order by column_value)
from table(t.widgets))
,1) over (order by t.id) as widgets_previous
from MyTable t
order by t.id;
You could try something like:
SQL> create or replace type TestType as varray(20000) of number(18);
Type created.
SQL> create table TestTable (
id number(18) primary key
,widgets TestType
)
Table created.
SQL> delete from testtable
0 rows deleted.
SQL> insert into TestTable values (1, TestType(1,2,3,4))
1 row created.
SQL> insert into TestTable values (2, TestType(5,6,7))
1 row created.
SQL> insert into TestTable values (3, TestType())
1 row created.
SQL> insert into TestTable values (4,null)
1 row created.
SQL> commit
Commit complete.
SQL> -- show all data with widgets
SQL> select t.id, w.column_value as widget_ids
from testtable t, table(t.widgets) w
ID WIDGET_IDS
---------- ----------
1 1
1 2
1 3
1 4
2 5
2 6
2 7
7 rows selected.
SQL> -- show with lag function
SQL> select t.id, lag(w.column_value, 1) over (order by t.id) as widgets_previous
from testtable t, table(t.widgets) w
ID WIDGETS_PREVIOUS
---------- ----------------
1
1 1
1 2
1 3
2 4
2 5
2 6
7 rows selected.

Stuck finding combination differences

I am really stuck on a problem: finding out whether the combination of two columns changes from one row to the next. The row values are as follows:
Serial code
D03L30225 A1
D03L30225 A1
D03L30225 A1
D03L30225 A1
D03L30225 A1
D03L30225 A1
D03L30225 A1
D03L30225 A1
D03L30225 A2
So if there is another entry like A2 at the end, is there a way of detecting that the serial/code combination has changed?
I have tried window functions like PARTITION BY and RANK without success.
This should work for you. One thing to note is that you have to order by something. Perhaps what I have ordered by is not correct for your situation, but you need something there.
IF OBJECT_ID('tempdb..#Test', 'U') IS NOT NULL DROP TABLE #Test;
create table #Test
(
Serial varchar(10),
code char(2)
)
insert into #Test values ('D03L30225', 'A1')
insert into #Test values ('D03L30225', 'A1')
insert into #Test values ('D03L30225', 'A1')
insert into #Test values ('D03L30225', 'A2')
;
with cte as
(
select rownum = row_number() over (order by Serial, code), Serial, code
from #Test
)
select curr.Serial, curr.code,
case
when curr.code <> prev.code then
1
else
0
end as 'DifferenceFlag'
from cte curr
left join cte prev on prev.rownum = curr.rownum - 1
If you are using SQL Server 2012 or higher you could use the LAG function instead. We are still on SQL Server 2008 R2, so when I needed to do something similar recently I used the method shown above.
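As a rough idea of what the LAG variant could look like, here is a sketch run through Python's sqlite3 module rather than SQL Server (SQLite 3.25+ has LAG as well). The table name test and the column alias difference_flag are made up for the illustration, and the ordering is the same guess as in the CTE version above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (serial TEXT, code TEXT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?)",
    [("D03L30225", "A1")] * 3 + [("D03L30225", "A2")],
)

# LAG(code) reads the code of the previous row in the chosen order, which
# replaces the self join on rownum - 1. For the first row LAG is NULL, the
# comparison is NULL, and the CASE falls through to 0.
query = """
SELECT serial, code,
       CASE WHEN code <> LAG(code) OVER (ORDER BY serial, code)
            THEN 1 ELSE 0 END AS difference_flag
FROM test
ORDER BY serial, code
"""
flags = conn.execute(query).fetchall()
for row in flags:
    print(row)
```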
Going by your comment, I assume that you want to add another column, third_column, to your table and set its value according to each change in the (Serial, Code) pair.
If that's true, you could use this:
ALTER TABLE
table_name
ADD
third_column numeric(18,0);
UPDATE t
SET t.third_column = t1.rwn
FROM table_name AS t
INNER JOIN
(select
serial, code
,row_number() over (order by serial, code) - 1 as rwn
from
table_name
group by
serial, code
) AS t1
ON
t.serial = t1.serial and t.code = t1.code;
I might write the code slightly differently. The first query below just lists the codes that have more than one serial number. The second flags every row of a code that contains multiple serial numbers.
The other solutions provided will give you a proper row number. In any case, I don't know if this will help, but good luck!
select
code,
count(distinct serial) cnt_serial
from table
group by code
having
count(distinct serial) > 1
OR
select
code,
serial,
case when count(distinct serial) over (partition by code) > 1 then 'Y' end fl_code_has_dup
from table

How to make a range series in SQL

I have to improve a stored procedure. It uses a select query on a table as follows:
SELECT DISTINCT ProjectId FROM Project where Status ='P' Order by ProjectId
It gives the following output:
1
2
3
7
8
11
12
13
I need to use these values in insert statements for another table as follows:
insert into Table values (othervalue, 1|1);
insert into Table values (othervalue, 2|2);
....
To decrease the number of inserts, we want to store as follows:
insert into Table values (othervalue, 1|3);
insert into Table values (othervalue, 7|8);
insert into Table values (othervalue, 11|13);
That is, consecutive values are collapsed into one range until there is a gap. I tried using a CURSOR to loop through the result set with some logic to build the ranges and insert them as it goes, but I seem to have an error somewhere.
Can we do something in the SELECT query itself?
-- assumption: a stands for the ProjectId column of Project
with t(a, bg, en) as
(
select a,
case when [begin] is NULL then NULL else row_number() over(partition by [begin] order by a) end,
case when [end] is NULL then NULL else row_number() over(partition by [end] order by a) end
from (
select t.a,
case when t1.a is NULL then 'end' else NULL end [end],
case when t2.a is NULL then 'begin' else NULL end [begin]
from Project as t
left join Project as t1 on (t1.a=t.a+1 AND t1.Status='P')
left join Project as t2 on (t2.a=t.a-1 AND t2.Status='P')
where t.Status='P'
) as o )
-- pair the n-th range start with the n-th range end
select cast(t.a as varchar)+'|'+cast(t1.a as varchar) from t inner join t as t1 on t.bg=t1.en
This query returns the values from Project in the '1|3' format.
It's not clear from your question whether you use PL/SQL or SQL Server. My solution works for MS SQL Server.
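If window functions are available, a shorter route to the same ranges is the usual "gaps and islands" trick, which is a different technique from the self joins above: within a run of consecutive ids, ProjectId minus ROW_NUMBER() stays constant, so grouping by that difference yields one row per range. A sketch using Python's sqlite3 module (SQLite 3.25+); the names numbered, island and id_range are made up for the example, and the row with Status 'X' is only there to show the filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Project (ProjectId INTEGER, Status TEXT)")
conn.executemany(
    "INSERT INTO Project VALUES (?, 'P')",
    [(n,) for n in (1, 2, 3, 7, 8, 11, 12, 13)],
)
# A row with another status; it must not bridge the gap between 3 and 7.
conn.execute("INSERT INTO Project VALUES (4, 'X')")

# Inside a run of consecutive ids, ProjectId - ROW_NUMBER() is constant,
# so it can serve as a group key for the run.
query = """
WITH numbered AS (
    SELECT ProjectId,
           ProjectId - ROW_NUMBER() OVER (ORDER BY ProjectId) AS island
    FROM (SELECT DISTINCT ProjectId FROM Project WHERE Status = 'P')
)
SELECT MIN(ProjectId) || '|' || MAX(ProjectId) AS id_range
FROM numbered
GROUP BY island
ORDER BY MIN(ProjectId)
"""
ranges = [r[0] for r in conn.execute(query)]
print(ranges)
```

In SQL Server the string concatenation would use + with CAST, as in the query above, instead of ||.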

Simple SQL: How to calculate unique, contiguous numbers for duplicates in a set?

Let's say I create a table with an int Page, int Section, and an int ID identity field, where the page field ranges from 1 to 8 and the section field ranges from 1 to 30 for each page. Now let's say that two records have duplicate page and section. How could I renumber those two records so that the sequence of page and section numbering is contiguous?
select page, section
from #fun
group by page, section having count(*) > 1
shows the duplicates:
page 1 section 3
page 2 section 3
page 1 section 4 and page 2 section 4 are missing. Is there a way without using a cursor to find and renumber the positions in SQL 2000 that doesn't support Row_Number()?
This rownum below of course produces exactly the same number as in section:
select page, section,
(select count(*) + 1
from #fun b
where b.page = a.page and b.section < a.section) as rownum
from #fun a
I could create a pivot table having values 1 through 100, but what would I join against?
What I want to do is something like this:
update p set section = (expression that gets 4)
from #fun p
where (expression that identifies duplicate sections by page)
I don't have a 2000 server to test this on, but I think it should work.
Create test tables/data:
CREATE TABLE #fun
(Id INT IDENTITY(100,1)
,page INT NOT NULL
,section INT NOT NULL
)
INSERT #fun (page, section)
SELECT 1,1
UNION ALL SELECT 1,3 UNION ALL SELECT 1,2
UNION ALL SELECT 1,3 UNION ALL SELECT 1,5
UNION ALL SELECT 2,1 UNION ALL SELECT 2,2
UNION ALL SELECT 2,3 UNION ALL SELECT 2,5
UNION ALL SELECT 2,3
Now the processing:
-- create a worktable
CREATE TABLE #fun2
(Id INT IDENTITY(1,1)
,funId INT
,page INT NOT NULL
,section INT NOT NULL
)
-- insert data into the second temp table ordered by the relevant columns
-- the identity column will form the basis of the revised section number
INSERT #fun2 (funId, page, section)
SELECT Id,page,section
FROM #fun
ORDER BY page,section,Id
-- write the calculated section value back where it is different
UPDATE p
SET section = y.calc_section
FROM #fun AS p
JOIN
(
SELECT f2.funId, f2.id - x.adjust calc_section
FROM #fun2 AS f2
JOIN (
-- this subquery calculates a per-page offset, emulating
-- PARTITION BY in the ROW_NUMBER() function of SQL Server 2005+
SELECT MIN(Id) - 1 adjust, page
FROM #fun2
GROUP BY page
) AS x
ON f2.page = x.page
) AS y
ON p.Id = y.funId
WHERE p.section <> y.calc_section
SELECT * FROM #fun order by page, section
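For comparison, on an engine that does have ROW_NUMBER(), the identity-minus-offset trick above boils down to one window function. This sketch uses Python's sqlite3 module (SQLite 3.25+) with the same test data; the table is called fun because SQLite has no # temp tables, and calc_section is borrowed from the code above. It only shows the calculated numbers, it does not write them back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fun (id INTEGER PRIMARY KEY, page INTEGER, section INTEGER)"
)
conn.executemany(
    "INSERT INTO fun (page, section) VALUES (?, ?)",
    [(1, 1), (1, 3), (1, 2), (1, 3), (1, 5),
     (2, 1), (2, 2), (2, 3), (2, 5), (2, 3)],
)

# Number the rows of each page in (section, id) order; duplicates get
# consecutive numbers and later sections shift to keep the sequence contiguous.
query = """
SELECT id, page, section,
       ROW_NUMBER() OVER (PARTITION BY page ORDER BY section, id) AS calc_section
FROM fun
ORDER BY page, calc_section
"""
renumbered = conn.execute(query).fetchall()
for row in renumbered:
    print(row)
```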
Disclaimer: I don't have SQL Server to test.
If I understand you correctly, if you knew the ROW_NUMBER of your #fun records partitioned over (page, section) duplicates, you could use this relative ranking to increment the "section":
UPDATE p
SET section = section + (rownumber - 1)
FROM #fun AS p
INNER JOIN ( -- SELECT id, ROW_NUMBER() OVER (PARTITION BY page, section) ...
SELECT a.id, COUNT(1) AS rownumber
FROM #fun a
LEFT JOIN #fun b
ON a.page = b.page AND a.section = b.section AND a.id <= b.id
GROUP BY a.id, a.page, a.section) d
ON p.id = d.id
WHERE rownumber > 1
That won't handle the case where the number of duplicates pushes you past your upper limit of 30. It may also create new duplicates if higher-numbered sections already exist on the page (that is, one instance of (pg 1, sec 3) becomes (pg 1, sec 4), which already existed), but you can run the UPDATE repeatedly until no duplicates remain.
And then add a unique index on (page, section).

MySQL count() problem

Setup:
create table main(id integer unsigned);
create table test1(id integer unsigned);
create table test2(id integer unsigned);
insert into main(id) value(1);
insert into test1(id) value(1);
insert into test1(id) value(1);
insert into test2(id) value(1);
insert into test2(id) value(1);
insert into test2(id) value(1);
Using:
select main.id,
count(test1.id),
count(test2.id)
from main
left join test1 on main.id=test1.id
left join test2 on main.id=test2.id
group by main.id;
...returns:
+------+-----------------+-----------------+
| id | count(test1.id) | count(test2.id) |
+------+-----------------+-----------------+
| 1 | 6 | 6 |
+------+-----------------+-----------------+
How to get the desired result of 1 2 3?
EDIT
The solution should be extensible; I'm going to query multiple count() values per main.id in the future.
Not optimal, but works:
select
main.id,
(select count(*) from test1 where test1.id = main.id) as test1_count,
(select count(*) from test2 where test2.id = main.id) as test2_count
from main
You created tables that contain the following:
Table main
id
----
1
Table test1
id
----
1
1
Table test2
id
----
1
1
1
When you join this like you do you will get the following
id id id
-----------
1 1 1
1 1 1
1 1 1
1 1 1
1 1 1
1 1 1
So how should SQL answer differently?
You can call:
SELECT id,COUNT(id) FROM main GROUP BY id
for every table, then join them by id.
Not sure if this works in MySQL exactly as written (I'm using Oracle):
select main.id, t1.rowcount, t2.rowcount
from main
left join (select id,count(*) rowcount from test1 group by id) t1
on t1.id = main.id
left join (select id,count(*) rowcount from test2 group by id) t2
on t2.id = main.id
SQL> /
ID ROWCOUNT ROWCOUNT
1 2 3
You're inadvertently creating a Cartesian product between test1 and test2, so every matching row in test1 is combined with every matching row in test2. The result of both counts, therefore, is the count of matching rows in test1 multiplied by the count of matching rows in test2.
This is a common SQL antipattern. A lot of people have this problem, because they think they have to get both counts in a single query.
Some other folks on this thread have suggested ways of compensating for the Cartesian product through creative use of subqueries, but the solution is simply to run two separate queries:
select main.id, count(test1.id)
from main
left join test1 on main.id=test1.id
group by main.id;
select main.id, count(test2.id)
from main
left join test2 on main.id=test2.id
group by main.id;
You don't have to do every task in a single SQL query! Frequently it's easier to code -- and easier for the RDBMS to execute -- multiple simpler queries.
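To see the multiplication effect for yourself, here is a small sketch using Python's sqlite3 module instead of MySQL, with the tables and data from the question. The variable names joined and separate are just for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE main (id INTEGER);
CREATE TABLE test1 (id INTEGER);
CREATE TABLE test2 (id INTEGER);
INSERT INTO main VALUES (1);
INSERT INTO test1 VALUES (1), (1);
INSERT INTO test2 VALUES (1), (1), (1);
""")

# Both joins in one query: each of the 2 test1 rows pairs with each of the
# 3 test2 rows, so both counts see 2 * 3 joined rows.
joined = conn.execute("""
    SELECT main.id, COUNT(test1.id), COUNT(test2.id)
    FROM main
    LEFT JOIN test1 ON main.id = test1.id
    LEFT JOIN test2 ON main.id = test2.id
    GROUP BY main.id
""").fetchall()

# One query per detail table: no cross product, so the counts are correct.
separate = [
    conn.execute(
        f"SELECT main.id, COUNT({t}.id) FROM main "
        f"LEFT JOIN {t} ON main.id = {t}.id GROUP BY main.id"
    ).fetchone()
    for t in ("test1", "test2")
]

print(joined)
print(separate)
```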
You can get the desired result by using:
SELECT COUNT(*) as main_count,
(SELECT COUNT(*) FROM test1) as test1Count,
(SELECT COUNT(*) from test2) as test2Count FROM main