Fetching a minimum of N rows, plus all peers of the last row

I have a sample table named assets which looks like this:
id | name   | block_no
---+--------+---------
 1 | asset1 |        2
 2 | asset2 |        2
 3 | asset3 |        3
There can be any number of assets in a specific block. I need a minimum of 100 rows from the table, but without splitting a block_no across fetches. For example, if there are 95 rows for block_no 2 and around 20 for block_no 3, I need all 20 rows of block_no 3 as well, as if I were fetching data in packets based on block_no.
Is this possible and feasible?

Postgres 13 or later
There is a dead simple solution using WITH TIES in Postgres 13 or later:
SELECT *
FROM assets
WHERE block_no >= 2 -- your starting block
ORDER BY block_no
FETCH FIRST 100 ROWS WITH TIES;
This will return at least 100 rows (if enough qualify), plus all peers of the 100th row.
If your table isn't trivially small, an index on (block_no) is essential for performance.
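For example, a minimal sketch of that index, plus keyset-style paging into the next packet (the index name and the starting value 3 are illustrative assumptions):

CREATE INDEX assets_block_no_idx ON assets (block_no);

-- Next packet: start strictly after the last block_no of the previous one.
SELECT *
FROM   assets
WHERE  block_no > 3  -- last block_no seen in the previous packet
ORDER  BY block_no
FETCH  FIRST 100 ROWS WITH TIES;

Since WITH TIES returns all peers of the last row, each packet ends on a complete block, so the next packet can safely start after it.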
See:
Get top row(s) with highest value, with ties
Older versions
Use the window function rank() in a subquery:
SELECT (a).*
FROM  (
   SELECT a, rank() OVER (ORDER BY block_no) AS rnk
   FROM   assets a
   WHERE  block_no >= 2  -- your starting block
   ) sub
WHERE  rnk <= 100;
Same result.
I use a little trick with the row type to strip the added rnk from the result. That's an optional addition.
See:
PostgreSQL equivalent for TOP n WITH TIES: LIMIT "with ties"?

Related

Separating an Oracle Query with 1.8 million rows into 40,000 row blocks

I have a project where I am taking Documents from one system and importing them into another.
The first system has the documents and associated keywords stored. I have a query that will return the results which will then be used as the index file to import them into the new system. There are about 1.8 million documents involved so this means 1.8 million rows (One per document).
I need to divide the returned results into blocks of 40,000 to make importing them in batches of 40,000 at a time, rather than one long import.
I have the query to return the results I need. I just need to know how to take that and break it up for easier import. My apologies if I have included too little information. This is my first time here asking for help.
Use the built-in function ORA_HASH to divide the rows into 45 buckets of roughly the same number of rows. For example:
select * from some_table where ora_hash(id, 44) = 0;
select * from some_table where ora_hash(id, 44) = 1;
...
select * from some_table where ora_hash(id, 44) = 44;
The function is deterministic and will always return the same result for the same input. The resulting number starts with 0 - which is normal for a hash, but unusual for Oracle, so the query may look off-by-one at first. The hash works better with more distinct values, so pass in the primary key or another unique value if possible. Don't use a low-cardinality column, like a status column, or the buckets will be lopsided.
This process is in some ways inefficient, since you're re-reading the same table 45 times. But since you're dealing with documents, I assume the table scanning won't be the bottleneck here.
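If you drive the batches from PL/SQL, the buckets could be walked in a simple loop. A minimal sketch, assuming a staging table export_batch and one commit per batch (both are illustrative, not part of the original process):

BEGIN
  FOR i IN 0 .. 44 LOOP
    -- Copy one bucket of roughly 40,000 rows per iteration.
    INSERT INTO export_batch
    SELECT * FROM some_table WHERE ora_hash(id, 44) = i;
    COMMIT;  -- assumption: commit after each batch
  END LOOP;
END;
/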
A preferred way of bucketing the IDs is to use the NTILE analytic function.
I'll demonstrate this on a simplified example with a table with 18 rows that should be divided in four chunks.
select listagg(id,',') within group (order by id) from tab;
1,2,3,7,8,9,10,15,16,17,18,19,20,21,23,24,25,26
Note that the IDs are not consecutive, so no simple arithmetic can be used - NTILE takes the requested number of buckets (4) as a parameter and calculates the chunk_id:
select id,
       ntile(4) over (order by ID) as chunk_id
from tab
order by id;
        ID   CHUNK_ID
---------- ----------
         1          1
         2          1
         3          1
         7          1
         8          1
         9          2
        10          2
        15          2
        16          2
        17          2
        18          3
        19          3
        20          3
        21          3
        23          4
        24          4
        25          4
        26          4

18 rows selected.
Bucket sizes differ by at most one row, and the larger buckets come first - here the first two buckets get 5 rows and the last two get 4.
If you want to calculate the ranges, use simple aggregation:
with chunk as (
    select id,
           ntile(4) over (order by ID) as chunk_id
    from tab)
select chunk_id, min(id) ID_from, max(id) id_to
from chunk
group by chunk_id
order by 1;
  CHUNK_ID    ID_FROM      ID_TO
---------- ---------- ----------
         1          1          8
         2          9         17
         3         18         21
         4         23         26
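Each computed range can then drive one import batch, e.g. with bind variables (a sketch; the batch query itself is an assumption, not part of the original answer):

-- Fetch one chunk at a time using the ranges from the query above.
SELECT * FROM tab WHERE id BETWEEN :id_from AND :id_to;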

Windowing function in Hive

I am exploring windowing functions in Hive and I am able to understand the functionality of all the UDFs. However, I am not able to understand the PARTITION BY and ORDER BY that we use with the other functions. Following is the structure that is very similar to the query which I am planning to build.
SELECT a, RANK() OVER(partition by b order by c) as d from xyz;
Just trying to understand the background process involved for both keywords.
Appreciate the help :)
The RANK() analytic function assigns a rank to each row in each partition of the dataset.
The PARTITION BY clause determines how the rows are distributed (between reducers, in the case of Hive).
ORDER BY determines how the rows are sorted within each partition.
In the first phase, all rows in the dataset are distributed into partitions. In map-reduce, each mapper groups rows according to the PARTITION BY and produces files for each partition. The mapper also does an initial sort of each partition part according to the ORDER BY.
In the second phase, all rows are sorted inside each partition: each reducer gets the partition files (parts of partitions) produced by the mappers and sorts the rows of the whole partition (merging the partial results) according to the ORDER BY.
Third, the rank function assigns a rank to each row in a partition. The rank function is initialized afresh for each partition.
For the first row in the partition the rank starts with 1. For each subsequent row, rank = previous row's rank + 1. Rows with equal values (as specified in the ORDER BY) are given the same rank; when two rows share the same rank, the next rank is not consecutive.
Different partitions can be processed in parallel on different reducers. Small partitions can be processed on the same reducer. The rank function re-initializes when it crosses a partition boundary and starts with rank = 1 for each partition.
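A sketch of that distribution and sorting expressed explicitly in Hive syntax (purely illustrative; the window function performs these steps for you):

SELECT a, b, c
FROM xyz
DISTRIBUTE BY b  -- rows with the same b go to the same reducer
SORT BY b, c;    -- rows are sorted within each reducer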
Example (rows are already partitioned and sorted inside partitions):
SELECT a, RANK() OVER(partition by b order by c) as d from xyz;
a, b, c, d(rank)
----------------
1 1 1 1 --starts with 1
2 1 1 1 --the same c value, the same rank=1
3 1 2 3 --rank 2 is skipped because second row shares the same rank as first
4 2 3 1 --New partition starts with 1
5 2 4 2
6 2 5 3
If you need consecutive ranks, use the dense_rank function. dense_rank will produce rank=2 for the third row in the above dataset.
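Applied to the same dataset (the output below is inferred from the rows above):

SELECT a, DENSE_RANK() OVER(partition by b order by c) as d from xyz;
a, b, c, d(dense_rank)
----------------
1 1 1 1 --starts with 1
2 1 1 1 --the same c value, the same rank=1
3 1 2 2 --consecutive rank, nothing is skipped
4 2 3 1 --New partition starts with 1
5 2 4 2
6 2 5 3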
The row_number function assigns a position number to each row in the partition, starting with 1. Rows with equal values receive different, consecutive numbers.
SELECT a, ROW_NUMBER() OVER(partition by b order by c) as d from xyz;
a, b, c, d(row_number)
----------------
1 1 1 1 --starts with 1
2 1 1 2 --the same c value, row number=2
3 1 2 3 --row position=3
4 2 3 1 --New partition starts with 1
5 2 4 2
6 2 5 3
Important note: for rows with the same values, row_number (or another such analytic function) may behave non-deterministically and produce different numbers from run to run. The first row in the above dataset may receive number 2 and the second row number 1, and vice versa, because their order is not determined unless you add one more column, such as a, to the ORDER BY clause. In that case every row will receive the same row_number from run to run, because the ordering values are unique.
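A sketch of that deterministic variant, with the unique column a added as a tie-breaker:

SELECT a, ROW_NUMBER() OVER(partition by b order by c, a) as d from xyz;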

Group rows into sets of 5

TableA
Col1
----------
1
2
3
4....all the way to 27
I want to add a second column that assigns a number to groups of 5.
Results
Col1 Col2
----- ------
1 1
2 1
3 1
4 1
5 1
6 2
7 2
8 2...and so on
The 6th group should have 2 rows in it.
NTILE doesn't accomplish what I want because of the way NTILE handles the groups if they aren't divisible by the integer.
If the number of rows in a partition is not divisible by integer_expression, this will cause groups of two sizes that differ by one member. Larger groups come before smaller groups in the order specified by the OVER clause. For example if the total number of rows is 53 and the number of groups is five, the first three groups will have 11 rows and the two remaining groups will have 10 rows each. If on the other hand the total number of rows is divisible by the number of groups, the rows will be evenly distributed among the groups. For example, if the total number of rows is 50, and there are five groups, each bucket will contain 10 rows.
This is clearly demonstrated in this SQL Fiddle. Groups 4, 5, 6 each have 4 rows while the rest have 5. I have started some solutions but they were getting lengthy and I feel like I'm missing something and that this could be done in a single line.
You can use this:
;WITH CTE AS
(
    SELECT col1,
           RN = ROW_NUMBER() OVER(ORDER BY col1)
    FROM TableA
)
SELECT col1, (RN-1)/5+1 col2
FROM CTE;
In your sample data, col1 is sequential without gaps, so you could use it directly (if it's an INT) without using ROW_NUMBER(). But in the case that it isn't, this answer works too. Here is the modified sqlfiddle.
A bit of math can go a long way. Subtracting 1 from all values puts the 5s (edge cases) into the previous group here, and the 6s into the next. Flooring the division by your group size and adding one gives the result you're looking for: for col1 = 5 that's floor(4/5)+1 = 1, and for col1 = 6 it's floor(5/5)+1 = 2. Also, the SQLFiddle example here fixes your iterative insert - the table only went up to 27.
SELECT col1,
floor((col1-1)/5)+1 as grpNum
FROM tableA

Implementing a total order ranking in PostgreSQL 8.3

The issue with 8.3 is that rank() was only introduced in 8.4.
Consider the numbers [10,6,6,2].
I wish to achieve a rank of those numbers where the rank is equal to the row number:
rank | score
-----+------
1 | 10
2 | 6
3 | 6
4 | 2
A partial solution is to self-join and count items with a higher or equal score. This produces:
1 | 10
3 | 6
3 | 6
4 | 2
But that's not what I want.
Is there a way to rank, or even just order by score somehow and then extract that row number?
If you want a row number equivalent to the window function row_number(), you can improvise in version 8.3 (or any version) with a (temporary) SEQUENCE:
CREATE TEMP SEQUENCE foo;
SELECT nextval('foo') AS rn, *
FROM (SELECT score FROM tbl ORDER BY score DESC) s;
db<>fiddle here
Old sqlfiddle
The subquery is necessary to order rows before calling nextval().
Note that the sequence (like any temporary object) ...
is only visible in the same session it was created.
hides any other table object of the same name.
is dropped automatically at the end of the session.
To reuse the sequence in the same session, run this before each query:
SELECT setval('foo', 1, FALSE);
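Putting both statements together for a second run in the same session (a sketch):

SELECT setval('foo', 1, FALSE);  -- reset the sequence

SELECT nextval('foo') AS rn, *
FROM  (SELECT score FROM tbl ORDER BY score DESC) s;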
There's a method using an array that works with PG 8.3. It's probably not very efficient performance-wise, but it will do OK if there aren't a lot of values.
The idea is to sort the values in a temporary array, then extract the bounds of the array, then join that with generate_series to extract the values one by one, the index into the array being the row number.
Sample query assuming the table is scores(value int):
SELECT i AS row_number, arr[i] AS score
FROM (SELECT arr, generate_series(1, nb) AS i
      FROM (SELECT arr, array_upper(arr, 1) AS nb
            FROM (SELECT array(SELECT value FROM scores ORDER BY value DESC) AS arr
                 ) AS s2
           ) AS s1
     ) AS s0
Do you have a PK for this table?
Just self-join and count the items with a higher score, or with an equal score and a lower-or-equal PK.
The PK comparison breaks the ties and gives you the desired result.
And after you upgrade to 9.1 - use row_number().
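A sketch of that self-join, assuming the PK column is named id:

SELECT count(*) AS rank, a.score
FROM   tbl a
JOIN   tbl b ON b.score > a.score
            OR (b.score = a.score AND b.id <= a.id)
GROUP  BY a.id, a.score
ORDER  BY rank;

Counting equal-score rows with a lower-or-equal PK (including the row itself) turns the tied rank into a unique row number.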

Selecting SUM of TOP 2 values within a table with multiple GROUP in SQL

I've been playing with sets in SQL Server 2000 and have the following table structure for one of my temp tables (#Periods):
RestCTR  HoursCTR  Duration  Rest
----------------------------------------
      1       337         2     0
      2       337        46     1
      3       337         2     0
      4       337        46     1
      5       338         1     0
      6       338        46     1
      7       338         2     0
      8       338        46     1
      9       338         1     0
     10       339        46     1
    ...
What I'd like to do is to calculate the Sum of the 2 longest Rest periods for each HoursCTR, preferably using sets and temp tables (rather than cursors, or nested subqueries).
Here's the dream query that just won't work in SQL (no matter how many times I run it):
Select HoursCTR, SUM ( TOP 2 Duration ) as LongestBreaks
FROM #Periods
WHERE Rest = 1
Group By HoursCTR
The HoursCTR can have any number of Rest periods (including none).
My current solution is not very elegant and basically involves the following steps:
Get the max duration of rest, group by HoursCTR
Select the first (min) RestCTR row that returns this max duration for each HoursCTR
Repeat step 1 (excluding the rows already collected in step 2)
Repeat step 2 (again, excluding rows collected in step 2)
Combine the RestCTR rows (from step 2 and 4) into single table
Get SUM of the Duration pointed to by the rows in step 5, grouped by HoursCTR
If there are any set functions that cut this process down, they would be very welcome.
The best way to do this in SQL Server is with a common table expression, numbering the rows in each group with the windowing function ROW_NUMBER():
WITH NumberedPeriods AS (
SELECT HoursCTR, Duration, ROW_NUMBER()
OVER (PARTITION BY HoursCTR ORDER BY Duration DESC) AS RN
FROM #Periods
WHERE Rest = 1
)
SELECT HoursCTR, SUM(Duration) AS LongestBreaks
FROM NumberedPeriods
WHERE RN <= 2
GROUP BY HoursCTR
edit: I've added an ORDER BY clause in the partitioning, to get the two longest rests.
Mea culpa, I did not notice that you need this to work in Microsoft SQL Server 2000. That version doesn't support CTE's or windowing functions. I'll leave the answer above in case it helps someone else.
In SQL Server 2000, the common advice is to use a correlated subquery:
SELECT p1.HoursCTR,
       (SELECT SUM(t.Duration)
        FROM (SELECT TOP 2 p2.Duration
              FROM #Periods AS p2
              WHERE p2.HoursCTR = p1.HoursCTR
              ORDER BY p2.Duration DESC) AS t) AS LongestBreaks
FROM #Periods AS p1
SQL 2000 does not have CTE's, nor ROW_NUMBER().
Correlated subqueries can need an extra step when using group by.
This should work for you:
SELECT
    F.HoursCTR,
    MAX(F.LongestBreaks) AS LongestBreaks -- Dummy MAX() so that GROUP BY can be used.
FROM
(
    SELECT
        Pm.HoursCTR,
        (
            SELECT COALESCE(SUM(S.Duration), 0)
            FROM
            (
                SELECT TOP 2 T.Duration
                FROM #Periods AS T
                WHERE T.HoursCTR = Pm.HoursCTR
                  AND T.Rest = 1
                ORDER BY T.Duration DESC
            ) AS S
        ) AS LongestBreaks
    FROM #Periods AS Pm
) AS F
GROUP BY
    F.HoursCTR
Unfortunately for you, Alex, you've got the right solution: correlated subqueries, depending upon how they're structured, will end up firing multiple times, potentially giving you hundreds of individual query executions.
Put your current solution into the Query Analyzer, enable "Show Execution Plan" (Ctrl+K), and run it. You'll have an extra tab at the bottom which will show you how the engine went about the process of gathering your results. If you do the same with the correlated subquery, you'll see what that option does.
I believe that it's likely to hammer the #Periods table about as many times as you have individual rows in that table.
Also - something's off about the correlated subquery, seems to me. Since I avoid them like the plague, knowing that they're evil, I'm not sure how to go about fixing it up.