Query Hive table using ROWNUM - hive

How can I query a Hive table by row number?
For example:
Let's say I want to print all records of a Hive table from row number 2 to 5.

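I recently updated the documentation for the offset option of the LIMIT clause. With an offset of 1 and a row count of 4, the query skips the first row and returns the next four, that is, rows 2 to 5: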
... order by ... limit 1,4
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-LIMITClause
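
To make the offset arithmetic concrete, here is a minimal sketch that runs the same `LIMIT offset, count` form against an in-memory SQLite database. SQLite is only a stand-in for Hive here, and the table `demo` and its columns are hypothetical names invented for the illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive (assumption:
# both accept the "LIMIT offset, count" form for this simple case).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER, label TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [(i, f"row{i}") for i in range(1, 8)],
)

# Skip 1 row, then return 4 rows: rows 2 through 5 in id order.
rows_2_to_5 = conn.execute(
    "SELECT id, label FROM demo ORDER BY id LIMIT 1, 4"
).fetchall()
print(rows_2_to_5)
```

The ORDER BY matters: without it, "row 2" has no stable meaning, because the engine is free to return rows in any order.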

This answer seems like what you're asking:
SQL most recent using row_number() over partition
In other words:
SELECT user_id, page_name, recent_click
FROM (
SELECT user_id,
page_name,
row_number() over (partition by session_id order by ts desc) as recent_click
from clicks_data
) T
WHERE recent_click between 2 and 5
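
As a rough check of how this behaves, the sketch below runs the same query shape against an in-memory SQLite database (a stand-in for Hive, assuming a SQLite build with window functions, 3.25 or later). The sample rows are made up; only the table and column names come from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clicks_data (user_id TEXT, session_id TEXT, page_name TEXT, ts INTEGER)"
)
# Made-up sample data: session s1 has six clicks, session s2 has two.
sample = [("u1", "s1", f"page{i}", i) for i in range(1, 7)]
sample += [("u2", "s2", "home", 1), ("u2", "s2", "cart", 2)]
conn.executemany("INSERT INTO clicks_data VALUES (?, ?, ?, ?)", sample)

# Number the clicks of each session from newest to oldest, keep numbers 2..5.
ranked_clicks = conn.execute(
    """
    SELECT user_id, page_name, recent_click
    FROM (
        SELECT user_id,
               page_name,
               row_number() OVER (PARTITION BY session_id ORDER BY ts DESC) AS recent_click
        FROM clicks_data
    ) t
    WHERE recent_click BETWEEN 2 AND 5
    ORDER BY user_id, recent_click
    """
).fetchall()
print(ranked_clicks)
```

Note that the numbering restarts in every partition, so this returns rows 2 to 5 of each session rather than of the whole table. Dropping the PARTITION BY clause numbers the whole table instead.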

Related

Select rows based on distinct values of nested field in BigQuery

I have a table in BigQuery which looks like this:
The sequence field is a repeated RECORD. I want to select one row per stepName but if there are multiple rows per step name, I want to choose the one where sequence.step.elapsedSeconds and sequence.step.elapsedMinutes are not null, otherwise select the rows where these columns are null.
As shown in the image above, I want to select row no. 2, 4 and 5. I have calculated ROW_NUMBER like this: ROW_NUMBER() OVER(PARTITION BY step.stepName) AS RowNum.
Here's my query so far in trying to filter out the unwanted rows:
WITH DistinctRows AS
(
select timestamp,
ARRAY (
SELECT
STRUCT(
STRUCT(
step.elapsedSeconds,
step.elapsedMinutes,
) as step
)
FROM
UNNEST(source_table.sequence) AS sequence
) AS sequence,
ROW_NUMBER() OVER(PARTITION BY step.stepName) AS RowNum
from source_table,
unnest(sequence) as previousCalls
order by timestamp asc
)
SELECT *
FROM DistinctRows,
unnest(sequence) as sequence
where (rowNum = 1 and (step.elapsedSeconds is null and step.elapsedMinutes is null)
or (RowNum > 1 and step.elapsedSeconds is not null and step.elapsedSeconds is not null)
order by timestamp asc
I need help figuring out how to filter out rows such as no. 1 and 3.
Thanks in advance.
Hmmm . . . Assuming that stepname is not part of the repeated column:
SELECT dr.* EXCEPT (sequence),
(SELECT seq
FROM unnest(dr.sequence) seq
ORDER BY seq.step.elapsedSeconds DESC NULLS LAST,
seq.step.elapsedMinutes DESC NULLS LAST
LIMIT 1
) as sequence
FROM DistinctRows dr
ORDER BY timestamp asc;
If stepname is part of sequence, then the subquery would reaggregate:
SELECT dr.* EXCEPT (sequence),
(SELECT ARRAY_AGG(s.seq ORDER BY s.seq.stepName)
FROM (SELECT seq,
ROW_NUMBER() OVER (PARTITION BY seq.stepName
ORDER BY seq.step.elapsedSeconds DESC NULLS LAST, seq.step.elapsedMinutes DESC NULLS LAST
) as seqnum
FROM unnest(dr.sequence) seq
) s
WHERE seqnum = 1
) as sequence
FROM DistinctRows dr
ORDER BY timestamp asc

Select only 3 rows per user - SQL Query

These are the columns in my table
id (autogenerated)
created_user
created_date
post_text
This table has a lot of rows. I want to get the latest 3 posts of every created_user.
I am new to SQL and need help. I ran the query below in my Postgres database, but it does not return what I need:
SELECT * FROM posts WHERE created_date IN
(SELECT MAX(created_date) FROM posts GROUP BY created_date)
You could use the row_number() window function to create an ordered row count per user. After that, you can easily filter by this value.
SELECT
*
FROM (
SELECT
*,
row_number() OVER (PARTITION BY created_user ORDER BY created_date DESC)
FROM
posts
) s
WHERE row_number <= 3
PARTITION BY created_user groups the rows by user.
ORDER BY created_date DESC sorts the posts of each user in descending order, so the most recent post gets row_number = 1, the next one 2, and so on.
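
A small runnable sketch of the same idea, using an in-memory SQLite database as a stand-in for Postgres (assuming SQLite 3.25 or later for window functions). The sample posts are invented, and the alias `rn` is an addition for readability; the answer above relies on the default column name instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, created_user TEXT, "
    "created_date TEXT, post_text TEXT)"
)
# Invented sample data: alice has five posts, bob has two.
sample_posts = [("alice", f"2024-01-0{d}", f"post {d}") for d in range(1, 6)]
sample_posts += [("bob", "2024-02-01", "hello"), ("bob", "2024-02-02", "again")]
conn.executemany(
    "INSERT INTO posts (created_user, created_date, post_text) VALUES (?, ?, ?)",
    sample_posts,
)

# Number each user's posts from newest to oldest and keep the first three.
latest_three = conn.execute(
    """
    SELECT created_user, created_date, rn
    FROM (
        SELECT *,
               row_number() OVER (PARTITION BY created_user ORDER BY created_date DESC) AS rn
        FROM posts
    ) s
    WHERE rn <= 3
    ORDER BY created_user, rn
    """
).fetchall()
print(latest_three)
```

Users with fewer than three posts simply return all the posts they have, as bob does here.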

convert timestamp to date in hive and get data for that date

I have a query like this
select distinct emp,phno,addrs,email from cdv.emp;
Now I want to get only the data created on the latest date, not older data.
I have an audit column created_on; it is the unique key and has type Timestamp.
I expect the latest data based on the created_on (timestamp) column, that is, rows generated within the last 24 hours or on the max date.
Use the rank analytic function. It will work much faster than an IN subquery:
select distinct emp,phno,addrs,email
from
(
select emp,phno,addrs,email,
rank() over(order by to_date(c.created_on) desc) rn
from cdv.emp c
)s
where rn=1;
If you want the latest record per emp, phno, addrs, email, then you can use row_number() without distinct. If this method is applicable, it will be even faster:
select emp,phno,addrs,email
from
(
select emp,phno,addrs,email,
row_number() over(partition by emp,phno,addrs,email order by to_date(c.created_on) desc) rn
from cdv.emp c
)s
where rn=1;
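
The two queries above answer slightly different questions, which a small experiment makes visible. The sketch below uses an in-memory SQLite database as a stand-in for Hive, with SQLite's `date()` playing the role of `to_date()`. The table is named `emp` without the `cdv` schema, and all sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emp (emp TEXT, phno TEXT, addrs TEXT, email TEXT, created_on TEXT)"
)
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?, ?, ?)",
    [
        ("ann", "111", "a st", "ann@example.com", "2024-03-02 09:00:00"),
        ("ann", "111", "a st", "ann@example.com", "2024-03-02 17:30:00"),
        ("bob", "222", "b st", "bob@example.com", "2024-03-01 12:00:00"),
        ("cat", "333", "c st", "cat@example.com", "2024-03-02 08:00:00"),
    ],
)

# rank(): every row created on the latest date gets rank 1; DISTINCT removes repeats.
latest_day = conn.execute(
    """
    SELECT DISTINCT emp, phno, addrs, email
    FROM (
        SELECT emp, phno, addrs, email,
               rank() OVER (ORDER BY date(created_on) DESC) AS rn
        FROM emp
    ) s
    WHERE rn = 1
    ORDER BY emp
    """
).fetchall()

# row_number(): one row per distinct (emp, phno, addrs, email), whatever its date.
latest_per_key = conn.execute(
    """
    SELECT emp, phno, addrs, email
    FROM (
        SELECT emp, phno, addrs, email,
               row_number() OVER (PARTITION BY emp, phno, addrs, email
                                  ORDER BY date(created_on) DESC) AS rn
        FROM emp
    ) s
    WHERE rn = 1
    ORDER BY emp
    """
).fetchall()

print([r[0] for r in latest_day])
print([r[0] for r in latest_per_key])
```

With this sample data, the rank() version drops bob because his only row is from an older date, while the row_number() version keeps him because it selects one row per key regardless of date.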

BigQuery Standard SQL: Delete Duplicates from Table

I am using the query below to delete duplicate records from BigQuery using standard SQL, but it throws an error:
with cte as (
select * ,row_number()over (partition by CallRailCallId order by CallRailCallId) as rn
from `encoremarketingtest.EncoreMarketingTest.CallRailCall2` )
delete
from cte
where rn>1
Query Failed
Error: Syntax error: Expected "(" or keyword SELECT but got keyword DELETE at [5:5]
Could anyone help me on the correct approach in BigQuery?
Option #1
CREATE OR REPLACE TABLE `project.dataset.your_table` AS
SELECT * EXCEPT(rn)
FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY CallRailCallId ORDER BY CallRailCallId) rn
FROM `project.dataset.your_table`
)
WHERE rn = 1
Option #2
CREATE OR REPLACE TABLE `project.dataset.your_table` AS
SELECT row.*
FROM (
SELECT ARRAY_AGG(t ORDER BY CallRailCallId LIMIT 1)[OFFSET(0)] row
FROM `project.dataset.your_table` t
GROUP BY CallRailCallId
)
As you might have noticed, the options above use a DDL (CREATE TABLE) approach, which makes it possible to work with just the one column known from your question: CallRailCallId.
Also note that ORDER BY CallRailCallId plays no real role there, because GROUP BY and PARTITION BY use exactly the same field. If you change the ORDER BY field, it controls exactly which row (out of several duplicates) "survives" (for example ORDER BY ts DESC; see the option below for what ts might be).
Option #3
This option uses DML (DELETE FROM) but requires an extra column to serve as a tie-breaker.
For example, say you have a ts TIMESTAMP field and you want the most recent row (based on ts) to survive:
DELETE FROM `project.dataset.your_table`
WHERE STRUCT(CallRailCallId, ts) NOT IN (
SELECT AS STRUCT CallRailCallId, MAX(ts) ts
FROM `project.dataset.your_table`
GROUP BY CallRailCallId
)
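
The logic of Option #3 can be tried out locally with the sketch below. It uses an in-memory SQLite database as a stand-in for BigQuery and writes the STRUCT comparison as a SQLite row value `(CallRailCallId, ts)`, which is an assumption of this sketch and not BigQuery syntax. The table name `calls` and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (CallRailCallId TEXT, ts TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [
        ("c1", "2024-01-01"),
        ("c1", "2024-01-03"),
        ("c1", "2024-01-02"),
        ("c2", "2024-01-05"),
    ],
)

# Delete every row that is not the most recent one for its CallRailCallId.
conn.execute(
    """
    DELETE FROM calls
    WHERE (CallRailCallId, ts) NOT IN (
        SELECT CallRailCallId, MAX(ts)
        FROM calls
        GROUP BY CallRailCallId
    )
    """
)

survivors = conn.execute(
    "SELECT CallRailCallId, ts FROM calls ORDER BY CallRailCallId"
).fetchall()
print(survivors)
```

One limitation follows directly from the comparison: if two rows share both CallRailCallId and the maximum ts, both match the subquery and both survive, so the tie-breaker column needs to be unique within each group.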

Using prepared statement without ROW_NUMBER() and OVER() functions in Db2

Let's say I have a table T_SWA. This is my prepared statement:
Select version
From (Select id, version, creator, created_date,
ROW_NUMBER() OVER(order by created_date) cnt
From T_SWA
Where id=35)
Where cnt=3;
I need to select the third most recent version from the T_SWA table. Can anyone suggest a replacement for this query that does not use the ROW_NUMBER() and OVER() functions?
First take the three most recent rows, then from those three take the oldest one.
select id, version, creator, created_date
from (
select id, version, creator, created_date
from T_SWA
where id = 35
order by created_date desc
fetch first 3 rows only
)
order by created_date
fetch first 1 row only;
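
The nested "top 3, then oldest of those" pattern can be checked with the sketch below. It runs against an in-memory SQLite database as a stand-in for Db2, so `LIMIT n` replaces `FETCH FIRST n ROWS ONLY`. The sample rows are invented, and the id is passed as a bound parameter, as a prepared statement would do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE T_SWA (id INTEGER, version TEXT, creator TEXT, created_date TEXT)"
)
conn.executemany(
    "INSERT INTO T_SWA VALUES (?, ?, ?, ?)",
    [
        (35, "v1", "amy", "2024-01-01"),
        (35, "v2", "amy", "2024-02-01"),
        (35, "v3", "bo", "2024-03-01"),
        (35, "v4", "bo", "2024-04-01"),
        (35, "v5", "cy", "2024-05-01"),
        (36, "x1", "cy", "2024-06-01"),
    ],
)

# Inner query: the three most recent rows for the id.
# Outer query: the oldest of those three, i.e. the third most recent overall.
third_most_recent = conn.execute(
    """
    SELECT version
    FROM (
        SELECT version, created_date
        FROM T_SWA
        WHERE id = ?
        ORDER BY created_date DESC
        LIMIT 3
    )
    ORDER BY created_date
    LIMIT 1
    """,
    (35,),
).fetchone()
print(third_most_recent)
```

One behavior to be aware of: if the id has fewer than three rows, this pattern returns the oldest row available instead of returning nothing, which differs from filtering on a row number equal to 3.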