Select random row per column value from Hive table - hive

I am trying to retrieve a random row for each distinct value for column hash. I also need the dt column.
So far I arrived at this not-working query:
INSERT OVERWRITE TABLE t PARTITION(dt)
SELECT hash, dt FROM (
SELECT hash, RAND() as r, dt FROM t1
UNION
SELECT hash, RAND() as r, dt FROM t2
) result
WHERE r IN (SELECT MAX(r) FROM result WHERE hash=result.hash);
The query fails with the error Table not found 'result', because result is only a subquery alias and cannot be referenced again in a FROM clause.
How can I fix this query, or what other approach should I use here?

You can use row_number() to get the row with the maximum value of r per hash.
INSERT OVERWRITE TABLE t PARTITION(dt)
SELECT hash,dt
FROM (SELECT hash, dt, row_number() over(partition by hash order by r desc) as rnum
FROM (SELECT hash, RAND() as r, dt FROM t1
UNION ALL
SELECT hash, RAND() as r, dt FROM t2
) result
) t
WHERE rnum=1
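The same pattern can be sketched outside Hive. Below is a minimal SQLite version (run from Python, requires SQLite 3.25+ for window functions): RAND() becomes random(), and the table/column contents are made-up sample data, but the shape of the query — tag each row with a random value, number rows per hash by it, keep rnum = 1 — is identical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE t1 (hash TEXT, dt TEXT);
CREATE TABLE t2 (hash TEXT, dt TEXT);
INSERT INTO t1 VALUES ('a','2020-01-01'),('a','2020-01-02'),('b','2020-01-03');
INSERT INTO t2 VALUES ('a','2020-01-04'),('c','2020-01-05');
""")

# Same shape as the answer above: attach a random value r to every row of the
# union, number rows per hash ordered by r, and keep only rnum = 1.
rows = cur.execute("""
SELECT hash, dt
FROM (SELECT hash, dt,
             row_number() OVER (PARTITION BY hash ORDER BY r DESC) AS rnum
      FROM (SELECT hash, random() AS r, dt FROM t1
            UNION ALL
            SELECT hash, random() AS r, dt FROM t2) result
     ) t
WHERE rnum = 1
ORDER BY hash
""").fetchall()
print(rows)  # one randomly chosen (hash, dt) per distinct hash
```

Each run may pick a different dt for hash 'a', but always exactly one row per hash.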

Related

select N-1 records for update

I have a query where I want to update n-1 records from result set. Can this be done without loops?
If my query is like this:
with cte(id, count)
as
(
select id, count(*) as count
from data
where id in (multiple values)
group by id
having count(*) >1
)
Now I want to update the rows in another table with the resulting id's but only any n-1 rows for each id value from the above query. Something like this:
update top( count-1 or n-1) from data2
inner join cte on data2.id = cte.id
set somecolumn = 'some value'
where id in (select id from cte)
The id column is not unique. There are multiple rows with the same id values in table data2.
This query will do what you want. It uses two CTEs; the first generates the list of eligible id values to update, and the second generates row numbers for id values in data2 which match those in the first CTE. The second CTE is then updated if the row number is greater than 1 (so only n-1 rows get updated):
with cte(id, count) as (
select id, count(*) as count
from data
where id in (2, 3, 4, 6, 7)
group by id
having count(*) >1
),
cte2 as (
select d.id, d.somecolumn,
row_number() over (partition by d.id order by newid()) as rn -- newid() gives a fresh random value per row; rand() is evaluated only once per query
from data2 d
join cte on cte.id = d.id
)
update cte2
set somecolumn = 'some value'
where rn > 1
Note I've chosen to order the row numbers randomly; you might have some other scheme for deciding which n-1 values you want to update (e.g. ordered by id, or ...).
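The n-1 idea above can be demonstrated portably. Here is a minimal SQLite sketch (via Python, SQLite 3.25+ for window functions): SQLite has no updatable CTE, so the rows to update are selected by rowid instead, and the eligible-id filter from the question is omitted for brevity. Table and column names mirror the question; the data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE data2 (id INTEGER, somecolumn TEXT);
INSERT INTO data2 VALUES (2,'x'),(2,'x'),(2,'x'),(3,'x'),(3,'x'),(4,'x');
""")

# Number the rows of each id randomly and update everything past the first,
# i.e. n-1 rows per id (ids with a single row are left untouched).
cur.execute("""
UPDATE data2
SET somecolumn = 'some value'
WHERE rowid IN (
    SELECT rid
    FROM (SELECT rowid AS rid,
                 row_number() OVER (PARTITION BY id ORDER BY random()) AS rn
          FROM data2)
    WHERE rn > 1)
""")
conn.commit()

updated = cur.execute(
    "SELECT id, count(*) FROM data2 WHERE somecolumn = 'some value' GROUP BY id ORDER BY id"
).fetchall()
print(updated)  # [(2, 2), (3, 1)] -- n-1 rows updated per id
```

Which of the duplicate rows survives varies per run, but the counts are deterministic.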
Is this what you're looking for? The CTE identifies ALL of the source rows, but the WHEREclause in the UPDATE statement limits the updates to n-1.
WITH cte AS
(
SELECT
id,
ROW_NUMBER() OVER (ORDER BY (SELECT 0)) AS RowNum
FROM data
)
UPDATE t
SET t.<whatever> = <whateverElse>
FROM
otherTable AS t
JOIN
cte AS c
ON t.id = c.id
WHERE
c.RowNum > 1;
I believe this would work just fine
;with cte(id, count)
as
(
select id, count(*) as count
from data
where id in (multiple values)
group by id
having count(*) >1
)
update data
set somecolumn = 'some value'
from data join cte on cte.id = data.id
;

sql: first row after the last row with a property

I would like to write a query that returns the first row immediately after the last row with a given property (ordered by id). Id's may not be consecutive.
Ideally it would look something like this:
...
JOIN (select max(id) id from my_table where CONDITION) m
JOIN (select min(id) from my_table where id > m.id) n
However, I can not use identifier m in the second subselect.
It is possible to use nested queries in nested queries, but is there an easier way?
Thank you.
You could use lead() to get the next id before applying the condition:
select t.*
from my_table t join
(select max(next_id) as max_next_id
from (select t.*, lead(id) over (order by id) as next_id
from my_table t
) t
where <condition>
) tt
on t.id = tt.max_next_id;
You could also do:
select t.*
from my_table t
where t.id > (select max(t2.id) from my_table t2 where <condition>)
order by t.id asc
fetch first 1 row only;
I am not sure how this is getting woven into the rest of your query, so I have used a CTE.
WITH max_next AS (
SELECT r.id as max_id
,r.next_id
FROM (
SELECT m.id
,m.next_id
,ROW_NUMBER() OVER (ORDER BY m.id DESC) AS rn
FROM (
SELECT n.* -- to provide data to satisfy CONDITIONS
,LEAD(n.id) OVER(ORDER BY n.id) as next_id
FROM my_table AS n
) AS m
WHERE CONDITIONS
) AS r
WHERE r.rn = 1
)
I would also shrink n.* to just the columns CONDITIONS needs, rather than leaving it implicit: SELECT * slows compilation down (or historically has), because all the metadata must be read to work out which columns exist, and while the compiler can prune unused columns, it is faster to ask for only what you want. In the best case that is just a compile-time saving; in the worst case, all the data is read when you only need a few columns.
And borrowing from Gordon's solution, the ROW_NUMBER part could be simpler:
WITH max_next AS (
SELECT m.id
,m.next_id
--, plus what ever other things you want from m
FROM (
SELECT n.* -- to satisfy CONDITIONS needs
,LEAD(n.id) OVER(ORDER BY n.id) as next_id
FROM my_table AS n
) AS m
WHERE CONDITIONS
ORDER BY m.id DESC LIMIT 1
)
So, as an example for #PIG:
WITH my_table AS (
SELECT column1 AS id
,column2 AS con1
,column3 AS other
FROM VALUES (1,'a',123),(2,'b',234),(3,'a',345),(5,'b',456),(7,'a',567),(10,'c',678)
)
SELECT m.id
,m.next_id
,m.other
FROM (
SELECT n.* -- to satisfy CONDITIONS needs
,LEAD(n.id) OVER(ORDER BY n.id) as next_id
FROM my_table AS n
) AS m
WHERE m.con1 = 'b'
ORDER BY m.id DESC LIMIT 1;
gives 5, 7, 456, which is the last 'b' row and the id of the row after it, with an extra column on my_table for entertainment purposes (this was also run on Snowflake, which means I fixed the prior SQL as well).
This should work; it's pretty straightforward, and it's good that you are aware that records may not be stored in an ordered/consecutive fashion.
SELECT *
FROM my_table
WHERE id = (
SELECT min(id)
FROM my_table
WHERE id > (
SELECT max(id)
FROM my_table
WHERE CONDITION));
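The nested-subquery answer above can be checked against the sample data used earlier in this thread. A minimal SQLite run (via Python), with CONDITION instantiated as con1 = 'b':

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE my_table (id INTEGER, con1 TEXT, other INTEGER);
INSERT INTO my_table VALUES (1,'a',123),(2,'b',234),(3,'a',345),
                            (5,'b',456),(7,'a',567),(10,'c',678);
""")

# Innermost query: last id matching the condition (max id where con1='b' -> 5).
# Middle query: smallest id after it (-> 7). Outer query: that row.
row = cur.execute("""
SELECT *
FROM my_table
WHERE id = (SELECT min(id)
            FROM my_table
            WHERE id > (SELECT max(id) FROM my_table WHERE con1 = 'b'))
""").fetchone()
print(row)  # (7, 'a', 567) -- the first row after the last 'b'
```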

row_number() over() continuous over union select

I am creating a table and inserting data into that table with a query1 union query2. The issue is that I want to add row_number() to the table however when I add row_number() over() to either of the queries, the numbering only applies to query1 or query2 but not to the entire table as a whole.
I did a hack to get my result where I insert the data into the table (table_no_serial) using insert query1 union query2, then I create a second table like so
insert into table_w_serial select row_number() over(), * from table_no_serial;
Is it possible to get this right the first time around?
insert into table purchase_table
select row_number() over(), w.ts, w.tail, w.event, w.action, w.msg, w.tags
from table1 w
where
w.action = 'stop'
union
select row_number() over(), t.ts, t.tail, t.event, t.action, t.msg, t.tags
from table2 t
where
t.action = 'stop';
I want something like this to work.
I want to write a query where the resulting table (endtable) is a union of the first query and the second query, with a row number that is continuous across both, so that if query1 returns 50 results and query2 returns 40 results, the end table will have row numbers from 1 to 90.
Use a subquery:
insert into table purchase_table ( . . . ) -- include column names here
select row_number() over (), ts, tail, event, action, msg, tags
from ((select w.ts, w.tail, w.event, w.action, w.msg, w.tags
from table1 w
where w.action = 'stop'
) union all
(select w.ts, w.tail, w.event, w.action, w.msg, w.tags
from table2 w
where w.action = 'stop'
)
) w;
Note that this also changes union to union all. union all is more efficient; only use union if you want to incur the overhead of removing duplicates.
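The continuous numbering can be seen in a small SQLite run (via Python, SQLite 3.25+): because row_number() is applied once, outside the union, it runs 1..N across both inputs. The tables are trimmed to two columns for brevity, and an explicit ORDER BY ts is used in the window since the original over() left the order unspecified.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE table1 (ts INTEGER, action TEXT);
CREATE TABLE table2 (ts INTEGER, action TEXT);
INSERT INTO table1 VALUES (1,'stop'),(2,'go'),(3,'stop');
INSERT INTO table2 VALUES (4,'stop'),(5,'stop');
""")

# Numbering happens once over the whole union, not per input query.
rows = cur.execute("""
SELECT row_number() OVER (ORDER BY ts) AS rn, ts
FROM (SELECT ts FROM table1 WHERE action = 'stop'
      UNION ALL
      SELECT ts FROM table2 WHERE action = 'stop') w
ORDER BY rn
""").fetchall()
print(rows)  # [(1, 1), (2, 3), (3, 4), (4, 5)] -- rn is continuous across both tables
```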

How do I sample the number of records in another table?

I have code where I'm sampling 50,000 random records. I.e.,
SELECT * FROM Table1
SAMPLE 50000;
That works. However, what I really want to do is sample the number of records that are in a different table. I.e.,
SELECT * FROM Table1
SAMPLE count(*) FROM Table2;
I get an error. What am I doing wrong?
This is not randomized like SAMPLE, so bear that in mind. But there also won't be an obvious pattern; I believe it's determined by disk location (don't quote me on that).
SELECT *
FROM Table1
QUALIFY ROW_NUMBER() OVER
( PARTITION BY 1
ORDER BY 1
) <=
( SELECT COUNT(*)
FROM Table2
);
A better way:
SELECT TMP.* -- Or list the columns you want with "rnd"
FROM ( SELECT RANDOM(-10000000,10000000) rnd,
T1.*
FROM Table1 T1
) TMP
QUALIFY ROW_NUMBER() OVER
( ORDER BY rnd
) <=
( SELECT COUNT(*)
FROM Table2
);
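The QUALIFY queries above are Teradata-specific, but the idea — take as many randomly ordered rows from Table1 as Table2 contains — ports to other engines. A minimal SQLite sketch (via Python; the count is fetched first and bound as a LIMIT parameter, and the data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Table1 (x INTEGER);
CREATE TABLE Table2 (y INTEGER);
INSERT INTO Table1 VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
INSERT INTO Table2 VALUES (1),(2),(3);
""")

# How many rows to sample: the row count of the other table.
n = cur.execute("SELECT count(*) FROM Table2").fetchone()[0]

# Shuffle Table1 randomly and keep that many rows.
rows = cur.execute(
    "SELECT x FROM Table1 ORDER BY random() LIMIT ?", (n,)
).fetchall()
print(len(rows))  # 3 -- as many rows as Table2 has
```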
SELECT TOP 50000 * FROM Table1 ORDER BY NEWID()

Create a table with duplicate values, and use a CTE (Common Table Expression) to delete those duplicate values

Create a table with duplicate values, and use a CTE (Common Table Expression) to delete those duplicate values.
=>
Would someone please help me with how to start it, because I really don't understand the question.
Assume the duplicate values can be chosen to be anything.
For MS SQL Server, this would work:
;with cte as
(
select *
, row_number() over (
partition by [columns], [which], [should], [be], [unique]
order by [columns], [to], [select], [what's], [kept]
) NoOfThisDuplicate
from [yourTable] -- the table to de-duplicate (this FROM clause was missing; the name is a placeholder)
)
delete
from cte
where NoOfThisDuplicate > 1
SQL Fiddle Demo (based on this question: Deleting duplicate row that has earliest date).
Explanation
Create a CTE
Populate it with all rows from the table we want to delete
Add a NoOfThisDuplicate column to that output
Populate this value with the sequential number of this record within the group/partition of all records with the same values for columns [columns], [which], [should], [be], [unique].
The order of the numbering depends on the sort order of those records when sorted by columns [columns], [to], [select], [what's], [kept]
We delete all records returned by the CTE except the first of each group (i.e. all except those with NoOfThisDuplicate=1).
Oracle Setup:
CREATE TABLE test_data ( value ) AS
SELECT LEVEL FROM DUAL CONNECT BY LEVEL <= 10
UNION ALL
SELECT 2*LEVEL FROM DUAL CONNECT BY LEVEL <= 5;
Query 1:
This will select the values removing duplicates:
SELECT DISTINCT *
FROM test_data
But it does not use a CTE.
Query 2:
So, we can put it in a sub-query factoring clause (the name used in the Oracle documentation which corresponds to the SQL Server Common Table Expression)
WITH unique_values ( value ) AS (
SELECT DISTINCT *
FROM test_data
)
SELECT * FROM unique_values;
Query 3:
The sub-query factoring clause was pointless in the previous example ... so doing it a different way:
WITH row_numbers ( value, rn ) AS (
SELECT value, ROW_NUMBER() OVER ( PARTITION BY value ORDER BY ROWNUM ) AS rn
FROM test_data
)
SELECT value
FROM row_numbers
WHERE rn = 1;
This will select only the first instance of each value found.
Delete Query:
But that didn't delete the rows ...
DELETE FROM test_data
WHERE ROWID IN (
WITH row_numbers ( rid, rn ) AS (
SELECT ROWID, ROW_NUMBER() OVER ( PARTITION BY value ORDER BY ROWNUM ) AS rn
FROM test_data
)
SELECT rid
FROM row_numbers
WHERE rn > 1
);
Which uses the ROWID pseudocolumn to match rows for deletion.
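SQLite happens to support the same rowid trick, so the Oracle delete above can be reproduced almost verbatim. A minimal run via Python (SQLite 3.25+ for window functions; the setup data mirrors the Oracle example, values 1..10 plus the even values 2..10, giving duplicate 2s and 4s... actually duplicates of every shared even value):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE test_data (value INTEGER);
-- 1..10 plus 2,4,6,8,10: the even values up to 10 appear twice
INSERT INTO test_data VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
INSERT INTO test_data VALUES (2),(4),(6),(8),(10);
""")

# Same shape as the Oracle query: number each value's rows in a CTE,
# then delete every rowid past the first per value.
cur.execute("""
DELETE FROM test_data
WHERE rowid IN (
    WITH row_numbers(rid, rn) AS (
        SELECT rowid, row_number() OVER (PARTITION BY value ORDER BY rowid)
        FROM test_data)
    SELECT rid FROM row_numbers WHERE rn > 1)
""")
conn.commit()

remaining = sorted(r[0] for r in cur.execute("SELECT value FROM test_data"))
print(remaining)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] -- one copy of each value
```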