Select first row in each partition of Hive table

Select first row in each partition of Hive table - hive

I'd like to get the first N records from each partition in a hive partitioned table without having to deserialize every record in the table. I've seen a number of solutions for other databases, but not one that works in Hive without running over every single record.
A minimal example:
create table demo ( val int )
partitioned by (partid int);
insert into demo partition (partid=1)
select stack(5, 1, 2, 3, 4, 5);
insert into demo partition (partid=2)
select stack(5, 100, 200, 300, 400, 500)
insert into demo partition (partid=3)
select stack(5, -1, -2, -3, -4, -5)
insert into demo ...
...
I'd like to get the result of
select * from partition_demo where partid = 1 limit 1
union all
select * from partition_demo where partid = 2 limit 1
union all
select * from partition_demo where partid = 3 limit 1
union all
...
without writing every single clause, and without deserializing all of the data in each partition (which seems to happen using RANK OVER).

Related

Count length of consecutive duplicate values for each id

I have a table as shown in the screenshot (first two columns) and I need to create a column like the last one. I'm trying to calculate the length of each sequence of consecutive values for each id.
For this, the last column is required. I played around with
row_number() over (partition by id, value)
but did not have much success, since the circled number was (quite predictably) computed as 2 instead of 1.
Please help!

First of all, we need to have a way to defined how the rows are ordered. For example, in your sample data there is not way to be sure that 'first' row (1, 1) will be always displayed before the 'second' row (1,0).
That's why in my sample data I have added an identity column. In your real case, the details can be order by row ID, date column or something else, but you need to ensure the rows can be sorted via unique criteria.
So, the task is pretty simple:
calculate trigger switch - when value is changed
calculate groups
calculate rows
That's it. I have used common table expression and leave all columns in order to be easy for you to understand the logic. You are free to break this in separate statements and remove some of the columns.
DECLARE #DataSource TABLE
(
[RowID] INT IDENTITY(1, 1)
,[ID]INT
,[value] INT
);
INSERT INTO #DataSource ([ID], [value])
VALUES (1, 1)
,(1, 0)
,(1, 0)
,(1, 1)
,(1, 1)
,(1, 1)
--
,(2, 0)
,(2, 1)
,(2, 0)
,(2, 0);
WITH DataSourceWithSwitch AS
(
SELECT *
,IIF(LAG([value]) OVER (PARTITION BY [ID] ORDER BY [RowID]) = [value], 0, 1) AS [Switch]
FROM #DataSource
), DataSourceWithGroup AS
(
SELECT *
,SUM([Switch]) OVER (PARTITION BY [ID] ORDER BY [RowID]) AS [Group]
FROM DataSourceWithSwitch
)
SELECT *
,ROW_NUMBER() OVER (PARTITION BY [ID], [Group] ORDER BY [RowID]) AS [GroupRowID]
FROM DataSourceWithGroup
ORDER BY [RowID];

You want results that are dependent on actual data ordering in the data source. In SQL you operate on relations, sometimes on ordered set of relations rows. Your desired end result is not well-defined in terms of SQL, unless you introduce an additional column in your source table, over which your data is ordered (e.g. auto-increment or some timestamp column).
Note: this answers the original question and doesn't take into account additional timestamp column mentioned in the comment. I'm not updating my answer since there is already an accepted answer.

One way to solve it could be through a recursive CTE:
create table #tmp (i int identity,id int, value int, rn int);
insert into #tmp (id,value) VALUES
(1,1),(1,0),(1,0),(1,1),(1,1),(1,1),
(2,0),(2,1),(2,0),(2,0);
WITH numbered AS (
SELECT i,id,value, 1 seq FROM #tmp WHERE i=1 UNION ALL
SELECT a.i,a.id,a.value, CASE WHEN a.id=b.id AND a.value=b.value THEN b.seq+1 ELSE 1 END
FROM #tmp a INNER JOIN numbered b ON a.i=b.i+1
)
SELECT * FROM numbered -- OPTION (MAXRECURSION 1000)
This will return the following:
i id value seq
1 1 1 1
2 1 0 1
3 1 0 2
4 1 1 1
5 1 1 2
6 1 1 3
7 2 0 1
8 2 1 1
9 2 0 1
10 2 0 2
See my little demo here: https://rextester.com/ZZEIU93657
A prerequisite for the CTE to work is a sequenced table (e. g. a table with an identitycolumn in it) as a source. In my example I introduced the column i for this. As a starting point I need to find the first entry of the source table. In my case this was the entry with i=1.
For a longer source table you might run into a recursion-limit error as the default for MAXRECURSION is 100. In this case you should uncomment the OPTION setting behind my SELECT clause above. You can either set it to a higher value (like shown) or switch it off completely by setting it to 0.

IMHO, this is easier to do with cursor and loop.
may be there is a way to do the job with selfjoin
declare #t table (id int, val int)
insert into #t (id, val)
select 1 as id, 1 as val
union all select 1, 0
union all select 1, 0
union all select 1, 1
union all select 1, 1
union all select 1, 1
;with cte1 (id , val , num ) as
(
select id, val, row_number() over (ORDER BY (SELECT 1)) as num from #t
)
, cte2 (id, val, num, N) as
(
select id, val, num, 1 from cte1 where num = 1
union all
select t1.id, t1.val, t1.num,
case when t1.id=t2.id and t1.val=t2.val then t2.N + 1 else 1 end
from cte1 t1 inner join cte2 t2 on t1.num = t2.num + 1 where t1.num > 1
)
select * from cte2

Returning the row with the most recent timestamp from each group

I have a table (Postgres 9.3) defined as follows:
CREATE TABLE tsrs (
id SERIAL PRIMARY KEY,
customer_id INTEGER NOT NULL REFERENCES customers,
timestamp TIMESTAMP WITHOUT TIME ZONE,
licensekeys_checksum VARCHAR(32));
The pertinent details here are the customer_id, the timestamp, and the licensekeys_checksum. There can be multiple entries with the same customer_id, some of those may have matching licensekey_checksum entries, and some may be different. There will never be rows with equal checksum and equal timestamps.
I want to return a table containing 1 row for each group of rows with matching licensekeys_checksum entries. The row returned for each group should be the one with the newest / most recent timestamp.
Sample Input:
1, 2, 2014-08-21 16:03:35, 3FF2561A
2, 2, 2014-08-22 10:00:41, 3FF2561A
2, 2, 2014-06-10 10:00:41, 081AB3CA
3, 5, 2014-02-01 12:03:23, 299AFF90
4, 5, 2013-12-13 08:14:26, 299AFF90
5, 6, 2013-09-09 18:21:53, 49FFA891
Desired Output:
2, 2, 2014-08-22 10:00:41, 3FF2561A
2, 2, 2014-06-10 10:00:41, 081AB3CA
3, 5, 2014-02-01 12:03:23, 299AFF90
5, 6, 2013-09-09 18:21:53, 49FFA891
I have managed to piece together a query based on the comments below, and hours of searching on the internet. :)
select * from tsrs
inner join (
select licensekeys_checksum, max(timestamp) as mts
from tsrs
group by licensekeys_checksum
) x on x.licensekeys_checksum = tsrs.licensekeys_checksum
and x.mts = tsrs.timestamp;
It seems to work, but I am unsure. Am I on the right track?

Your query in the question should perform better than the queries in the (previously) accepted answer. Test with EXPLAIN ANALYZE.
DISTINCT ON is typically simpler and faster:
SELECT DISTINCT ON (licensekeys_checksum) *
FROM tsrs
ORDER BY licensekeys_checksum, timestamp DESC NULLS LAST;
db<>fiddle here
Old sqlfiddle
Detailed explanation:
Select first row in each GROUP BY group?

Alternative deduplication, using NOT EXISTS(...)
SELECT *
FROM tsrs t
WHERE NOT EXISTS (
SELECT *
FROM tsrs x
WHERE x.customer_id = t.customer_id -- same customer
AND x.licensekeys_checksum = t.licensekeys_checksum -- same checksum
AND x.ztimestamp > t.ztimestamp -- but more recent
);

Try this
select *
from tsrs
where (timestamp,licensekeys_checksum) in (
select max(timestamp)
,licensekeys_checksum
from tsrs
group by licensekeys_checksum)
>SqlFiddle Demo
or
with cte as (
select id
,customer_id
,timestamp
,licensekeys_checksum
,row_number () over (partition by licensekeys_checksum ORDER BY timestamp DESC) as rk
from tsrs)
select id
,customer_id
,timestamp
,licensekeys_checksum
from cte where rk=1 order by id
>SqlFiddle Demo
Reference : Window Functions, row_number(), and CTE

MSSQL ORDER BY Passed List

I am using Lucene to perform queries on a subset of SQL data which returns me a scored list of RecordIDs, e.g. 11,4,5,25,30 .
I want to use this list to retrieve a set of results from the full SQL Table by RecordIDs.
So SELECT * FROM MyFullRecord
where RecordID in (11,5,3,25,30)
I would like the retrieved list to maintain the scored order.
I can do it by using an Order by like so;
ORDER BY (CASE WHEN RecordID = 11 THEN 0
WHEN RecordID = 5 THEN 1
WHEN RecordID = 3 THEN 2
WHEN RecordID = 25 THEN 3
WHEN RecordID = 30 THEN 4
END)
I am concerned with the loading of the server loading especially if I am passing long lists of RecordIDs. Does anyone have experience of this or how can I determine an optimum list length.
Are there any other ways to achieve this functionality in MSSQL?
Roger

You can record your list into a table or table variable with sorting priorities.
And then join your table with this sorting one.
DECLARE TABLE #tSortOrder (RecordID INT, SortOrder INT)
INSERT INTO #tSortOrder (RecordID, SortOrder)
SELECT 11, 1 UNION ALL
SELECT 5, 2 UNION ALL
SELECT 3, 3 UNION ALL
SELECT 25, 4 UNION ALL
SELECT 30, 5
SELECT *
FROM yourTable T
LEFT JOIN #tSortOrder S ON T.RecordID = S.RecordID
ORDER BY S.SortOrder

Instead of creating a searched order by statement, you could create an in memory table to join. It's easier on the eyes and definitely scales better.
SQL Statement
SELECT mfr.*
FROM MyFullRecord mfr
INNER JOIN (
SELECT *
FROM (VALUES (1, 11),
(2, 5),
(3, 3),
(4, 25),
(5, 30)
) q(ID, RecordID)
) q ON q.RecordID = mfr.RecordID
ORDER BY
q.ID
Look here for a fiddle

Something like:
SELECT * FROM MyFullRecord where RecordID in (11,5,3,25,30)
ORDER BY
CHARINDEX(','+CAST(RecordID AS varchar)+',',
','+'11,5,3,25,30'+',')
SQLFiddle demo

SQL: how to select common lines of one table on several column

I have a table :
create table a (page int, pro int)
go
insert into a select 1, 2
insert into a select 4, 2
insert into a select 5, 2
insert into a select 9, 2
insert into a select 1, 3
insert into a select 2, 3
insert into a select 3, 3
insert into a select 4, 3
insert into a select 9, 3
insert into a select 1, 4
insert into a select 9, 4
insert into a select 12, 4
insert into a select 1, 5
insert into a select 9, 5
insert into a select 12, 5
insert into a select 13, 5
insert into a select 14, 5
insert into a select 15, 5
go
(here is the SQLfiddle of this table and queries I began to write )
Common value of page on ALL lines
I'm looking to extract the common column "page" for each column "pro" from this table.
here is what we expect :
1
9
I tried to use:
SELECT DISTINCT a.page
FROM a
WHERE a.page IN (
SELECT b.page FROM a as b
WHERE b.pro <> a.pro
)
but this query returns every "page" that have at least one common values which is not what we need to have. see below :
1
4
9
12
The opposite query aka different value at least one but not all time
I'm looking to extract the "page" linked to one or more "pro" but without being common to all of them (it's the exact opposite of the previous query)
Here is what we expect :
2
3
4
5
12
13
14
15
I can't manage to find a solution to those 2 queries :'(
Could anyone help me on those ones?
Best regards
edit: the SQLfiddle url

Just a bit of reversed thinking - group by page and count distinct pro values for each. Return rows that matches the total of distinct pro values
SELECT [page]
FROM a
GROUP BY [page]
HAVING COUNT(DISTINCT pro) = (SELECT COUNT(DISTINCT pro) FROM a)
SQLFiddle
EDIT: for the second problem, just replace = with '<' in the final line -> SQLFiddle

Fot the first part of the question, try this query:
SELECT DISTINCT t1.page FROM a t1
WHERE (SELECT COUNT(DISTINCT t2.pro) FROM a t2 WHERE
t2.page = t1.page) =
(SELECT COUNT(DISTINCT t3.pro) FROM a t3)
And the second query is the simple substraction from all page values:
SELECT DISTINCT t4.page FROM a t4
EXCEPT
SELECT DISTINCT t1.page FROM a t1
WHERE (SELECT COUNT(DISTINCT t2.pro) FROM a t2 WHERE
t2.page = t1.page) =
(SELECT COUNT(DISTINCT t3.pro) FROM a t3)

In MYSQL, how can I select multiple rows and have them returned in the order I specified?

I know I can select multiple rows like this:
select * FROM table WHERE id in (1, 2, 3, 10, 100);
And I get the results returned in order: 1, 2, 3, 10, 100
But, what if I need to have the results returned in a specific order? When I try this:
select * FROM table WHERE id in (2, 100, 3, 1, 10);
I still get the results returned in the same order: 1, 2, 3, 10, 100
Is there a way to get the results returned in the exact order that I ask for?
(There are limitations due to the way the site is set up that won't allow me to ORDER BY using another field value)

the way you worded that I'm not sure if using ORDER BY is completely impossible or just ordering by some other field... so at the risk of submitting a useless answer, this is how you'd typically order your results in such a situation.
SELECT *
FROM table
WHERE id in (2, 100, 3, 1, 10)
ORDER BY FIELD (id, 2, 100, 3, 1, 10)

Unless you are able to do ORDER BY, there is no guaranteed way.
The sort you are getting is due to the way MySQL executes the query: it combines all range scans over the ranges defined by the IN list into a single range scan.
Usually, you force the order using one of these ways:
Create a temporary table with the value and the sorter, fill it with your values and order by the sorter:
CREATE TABLE t_values (value INT NOT NULL PRIMARY KEY, sorter INT NOT NULL)
INSERT
INTO t_values
VALUES
(2, 1),
(100, 1),
(3, 1),
(1, 1),
(10, 1);
SELECT m.*
FROM t_values v
JOIN mytable m
ON m.id = v.value
ORDER BY
sorter
Do the same with an in-place rowset:
SELECT m.*
FROM (
SELECT 2 AS value, 1 AS sorter
UNION ALL
SELECT 100 AS value, 2 AS sorter
UNION ALL
SELECT 3 AS value, 3 AS sorter
UNION ALL
SELECT 1 AS value, 4 AS sorter
UNION ALL
SELECT 10 AS value, 5 AS sorter
)
JOIN mytable m
ON m.id = v.value
ORDER BY
sorter
Use CASE clause:
SELECT *
FROM mytable m
WHERE id IN (1, 2, 3, 10, 100)
ORDER BY
CASE id
WHEN 2 THEN 1
WHEN 100 THEN 2
WHEN 3 THEN 3
WHEN 1 THEN 4
WHEN 10 THEN 5
END

You can impose an order, but only based on the value(s) of one or more columns.
To get the rows back in the order you specify in the example you would need to add a second column, called a "sortkey" whose values can be used to sort the rows in the desired sequence,
using the ORDER BY clause. In your example:
Value Sortkey
----- -------
1 4
2 1
3 3
10 5
100 2
select value FROM table where ... order by sortkey;

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Select first row in each partition of Hive table - hive

Related

Count length of consecutive duplicate values for each id

Returning the row with the most recent timestamp from each group

MSSQL ORDER BY Passed List

SQL: how to select common lines of one table on several column

In MYSQL, how can I select multiple rows and have them returned in the order I specified?

Categories

Resources