Manually specify starting value for ROW_NUMBER() - SQL

I want to define the start of ROW_NUMBER() as 3258170 instead of 1.
I am using the following SQL query:
SELECT ROW_NUMBER() over(order by (select 3258170)) as 'idd'
However, the above query is not working. When I say not working, I mean it executes but the numbering does not start from 3258170. Can somebody help me?
The reason I want to specify the row number is that I am inserting rows from one table into another. In the first table the last record's row number is 3258169, and when I insert new records I want them to be numbered from 3258170 onward.

Just add the value to the result of row_number():
select 3258170 - 1 + row_number() over (order by (select NULL)) as idd
The ORDER BY clause of row_number() specifies which column the numbering is ordered by. By putting a constant there, you are simply saying "everything has the same value for ordering purposes". It has nothing, nothing at all to do with the first value assigned.
To avoid confusion, I replaced the constant value with NULL. In SQL Server, I have observed that this assigns a sequential number without actually sorting the rows -- an observed performance advantage, but not one that I've seen documented, so we can't depend on it.
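Applied to the insert scenario from the question, a hedged sketch might look like this (SourceTable, TargetTable, and the column names are hypothetical stand-ins):
-- continue numbering at 3258170 while copying rows across tables
INSERT INTO TargetTable (idd, payload)
SELECT 3258170 - 1 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS idd,
       payload
FROM SourceTable;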

I feel this is easier:
ROW_NUMBER() OVER(ORDER BY Field) - 1 AS FieldAlias (To start from 0)
ROW_NUMBER() OVER(ORDER BY Field) + 3258169 AS FieldAlias (To start from 3258170)
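To see the arithmetic in action, here is a quick sanity check against a throwaway derived table (the values are arbitrary):
-- row numbers 1..3 become 0..2 and 3258170..3258172 respectively
SELECT val,
       ROW_NUMBER() OVER (ORDER BY val) - 1       AS FromZero,
       ROW_NUMBER() OVER (ORDER BY val) + 3258169 AS From3258170
FROM (VALUES ('a'), ('b'), ('c')) AS t(val);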

Sometimes....
ROW_NUMBER() may not be the best solution, especially when there could be duplicate records in the underlying data set (for JOIN queries etc.), which may result in more rows returned than expected. You may consider creating a SEQUENCE instead, which can in some cases be the cleaner solution.
i.e.:
CREATE SEQUENCE myRowNumberId
START WITH 1
INCREMENT BY 1
GO
SELECT NEXT VALUE FOR myRowNumberId AS 'idd' -- your query
GO
DROP SEQUENCE myRowNumberId; -- just to clean-up after ourselves
GO
The downside is that sequences (available in SQL Server 2012 and later) may be difficult to use in complex queries with DISTINCT, window functions etc. See the CREATE SEQUENCE documentation for the complete list of restrictions.
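For the question's numbers specifically, a sketch might look like the following (SourceTable, TargetTable, and the column names are again hypothetical):
-- start the sequence where the existing data left off
CREATE SEQUENCE myRowNumberId
START WITH 3258170
INCREMENT BY 1;
GO
INSERT INTO TargetTable (idd, payload)
SELECT NEXT VALUE FOR myRowNumberId, payload
FROM SourceTable;
GO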

I had a situation where I was importing a hierarchical structure into an application where a seq number had to be unique within each hierarchical level and start at 110 (for ease of subsequent manual insertion). The data beforehand looked like this...
Level  Prod        Type  Component      Quantity     Seq
1      P00210005   R     NZ1500         57.90000000  120
1      P00210005   C     P00210005M     1.00000000   120
2      P00210005M  R     M/C Operation  20.00000000  110
2      P00210005M  C     P00210006      1.00000000   110
2      P00210005M  C     P00210007      1.00000000   110
I wanted the row_number() function to generate the new sequence numbers, but adding 10 and then multiplying by 10 didn't initially behave as expected. To force the order of the arithmetic operations you have to enclose the entire row_number() call, including its partition clause, in brackets; without the brackets you can only perform simple addition and subtraction on the row_number() result.
So, my solution for this problem was
,10*(10+row_number() over (partition by Level order by Type desc, [Seq] asc)) [NewSeq]
Note the position of the brackets, which makes the multiplication occur after the addition: the first row in each partition yields 10*(10+1) = 110, the second 10*(10+2) = 120, and so on.
Level  Prod        Type  Component      Quantity     [Seq]  [NewSeq]
1      P00210005   R     NZ1500         57.90000000  120    110
1      P00210005   C     P00210005M     1.00000000   120    120
2      P00210005M  R     M/C Operation  20.00000000  110    110
2      P00210005M  C     P00210006      1.00000000   110    120
2      P00210005M  C     P00210007      1.00000000   110    130

ROW_NUMBER() OVER(ORDER BY Field) - 1 AS FieldAlias (To start from 0)
ROW_NUMBER() OVER(ORDER BY Field) + 2862717 AS FieldAlias (To start from 2862718)

Related

SQL- Sample the 3rd from the beginning and backwards

I have a table with id and score. I want to create a new data set using a sampling method. The method is to order the ids in decreasing order of score and sample every 3rd id, starting from the beginning, until we get 10k positive samples. Then we would do the same in the other direction, starting from the end, to get 10k negative samples.
id   score
24   0.55
58   0.43
987  0.93
How can I write a SQL query to execute this sampling and get the expected output?
To start with, it would be easier to answer this if you included which database you use (SQL Server, MySQL, etc.), since different databases have different syntax.
BACKGROUND
To answer this question, the main tools you need are the ability to sort, and an ability to take every 3rd row.
I'm using SQL Server here, so the relevant tools include
the TOP modifier in SELECT statements - in other databases it's often LIMIT (e.g., SELECT * FROM Test LIMIT 1000)
ROW_NUMBER(), which I believe is relatively common
To get every third row, I use the 'modulus' mathematical function - in SQL Server signified by a % symbol - so, for example
1 % 3 = 1
2 % 3 = 2
3 % 3 = 0
4 % 3 = 1
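If you want to check those remainders for yourself, a throwaway query like this works in SQL Server:
-- n % 3 cycles through 1, 2, 0, 1, ...
SELECT n, n % 3 AS remainder
FROM (VALUES (1), (2), (3), (4)) AS t(n);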
APPROACH
There is an example of this in this db<>fiddle - but note that it is only dealing with test data (1000 rows, selecting top and bottom 10).
Running through the steps - and assuming your data is stored in #DataTable:
The following command assigns a row number rn to the data, sorted by the score.
SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) as rn
FROM #DataTable;
To get every third value, start with that data and keep only the rows where the row number is a multiple of 3 (note that SQL Server requires an alias on the derived table):
SELECT *
FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) as rn
      FROM #DataTable) AS RankedScores
WHERE rn % 3 = 0;
To get the first 10,000 of them, use TOP (or LIMIT, etc)
SELECT TOP 10000 *
FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) as rn
      FROM #DataTable) AS RankedScores
WHERE rn % 3 = 0
ORDER BY rn;
Note - to get it the other way/get the highest scores, take the ROW_NUMBER() in reverse order (e.g., ORDER BY score DESC, id DESC).
FINAL ANSWER
Take the above 10,000 rows, do the same in the other direction (to get the highest scores), then UNION them together. Below it is done with CTEs.
WITH TopScores AS
    (SELECT TOP 10000 id, score
     FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score DESC, id DESC) as rn
           FROM #DataTable
          ) AS RankedScores_down
     WHERE RankedScores_down.rn % 3 = 0
     ORDER BY RankedScores_down.rn
    ),
LowScores AS
    (SELECT TOP 10000 id, score
     FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) as rn
           FROM #DataTable
          ) AS RankedScores_up
     WHERE RankedScores_up.rn % 3 = 0
     ORDER BY RankedScores_up.rn
    )
SELECT * FROM TopScores
UNION
SELECT * FROM LowScores
ORDER BY score, id;
Notes
I used UNION rather than UNION ALL because, on the chance that there is overlap (e.g., if you have fewer than 60,000 data points), we only want to include each sample once.
If you use a different database, you'll need to translate this - hence the benefit of specifying which database you use.
Note that taking every third value (when sorted by score) is not really 'independent' sampling - one might ask why you don't just use all of the top/bottom 30,000 scores. If you want to sample 1 in 3 of them, you could use id % 3 instead of rn % 3. But once again, why would you do this? Why not just collect fewer results and use them all?
Of course, one good reason is to use half the data to check the validity of your statistics: e.g., take half the data, build your model - then check against the other half how good the model is.

Fetching a minimum of N rows, plus all peers of the last row

I have a sample table named assets which looks like this:
id  name    block_no
1   asset1  2
2   asset2  2
3   asset3  3
There can be any number of assets in a specific block. I need a minimum of 100 rows from the table, but extended to include all remaining rows of the last block_no reached. For example, if there are 95 rows with block_no 2 and around 20 with block_no 3, I need all 20 rows of block_no 3, as if I am fetching data in packets based on block_no.
Is this possible and feasible?
Postgres 13 or later
There is a dead simple solution using WITH TIES in Postgres 13 or later:
SELECT *
FROM assets
WHERE block_no >= 2 -- your starting block
ORDER BY block_no
FETCH FIRST 100 ROWS WITH TIES;
This will return at least 100 rows (if enough qualify), plus all peers of the 100th row.
If your table isn't trivially small, an index on (block_no) is essential for performance.
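For example, a minimal sketch (the index name is arbitrary):
-- a plain btree index on the sort/filter column
CREATE INDEX assets_block_no_idx ON assets (block_no);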
See:
Get top row(s) with highest value, with ties
Older versions
Use the window function rank() in a subquery:
SELECT (a).*
FROM (
   SELECT a, rank() OVER (ORDER BY block_no) AS rnk
   FROM   assets a
   ) sub
WHERE  rnk <= 100;
Same result.
I use a little trick with the row type to strip the added rnk from the result. That's an optional addition.
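For comparison, here is the same query written without the row-type trick, listing the question's columns explicitly so the added rnk stays out of the result:
-- rnk is computed in the subquery but simply not selected in the outer query
SELECT id, name, block_no
FROM (
   SELECT a.*, rank() OVER (ORDER BY block_no) AS rnk
   FROM   assets a
   ) sub
WHERE  rnk <= 100;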
See:
PostgreSQL equivalent for TOP n WITH TIES: LIMIT "with ties"?

select top N for each category w/o sorting if there are less than N rows

Given the following table, the question is how to find, for example, the top N C2 values for each C1.
C1 C2
1 1
1 2
1 3
1 4
1 ...
2 1
2 2
2 3
2 4
2 ...
....
So if N = 3, the results are
C1 C2
1 1
1 2
1 3
2 1
2 2
2 3
....
The proposed solutions use window functions and PARTITION BY:
Select top 10 records for each category
https://www.the-art-of-web.com/sql/partition-over/
For example,
SELECT rs.Field1, rs.Field2
FROM (
    SELECT Field1, Field2,
           Rank() OVER (PARTITION BY Section
                        ORDER BY RankCriteria DESC) AS Rank
    FROM table
) rs
WHERE Rank <= 3
I guess what it does is sort and then pick the top N.
However, if some categories have fewer than N elements, we could get the top N without sorting, because the top N must include all elements in the category.
The above query uses Rank(). My question also applies to other window functions like row_number() or dense_rank().
Is there a way to skip the sorting in that case?
Also, I am not sure whether the underlying engine can optimize for this case: whether the inner partition/order takes the outer WHERE constraints into account before sorting.
Using partition+order+where is a way to get the top N elements from each category. It works perfectly if each category has more than N elements, but incurs an additional sorting cost otherwise. My question is whether there is another approach that works well in both cases. Ideally it would do the following:
for each category {
    if # of elements <= N:
        take all elements        # no sorting needed
    else:
        sort and take the top N
}
For example - but is there better SQL than this?
WITH table_with_count AS (
    SELECT Field1, Field2, RankCriteria,
           count(*) OVER (PARTITION BY Section) AS c
    FROM table
),
rs AS (
    SELECT Field1, Field2,
           Rank() OVER (PARTITION BY Section
                        ORDER BY RankCriteria DESC) AS Rank
    FROM table_with_count
    WHERE c > 10
)
(SELECT Field1, Field2 FROM rs WHERE Rank <= 10)
UNION
(SELECT Field1, Field2 FROM table_with_count WHERE c <= 10)
No, and there really shouldn't be. Overall, what you describe here is the XY problem.
You seem to:
Worry about sorting, while in fact sorting (with an optional secondary sort) is the most efficient way of shuffling / repartitioning data, as it doesn't lead to a proliferation of file descriptors. In practice Spark strictly prefers sort over alternatives (hashing) for exactly that reason.
Worry about "unnecessary" sorting of small groups, when in fact the problem is the intrinsic inefficiency of window functions, which require a full shuffle of all the data and therefore exhibit the same behavior pattern as the infamous groupByKey.
There are more efficient patterns (MLPairRDDFunctions.topByKey being the most prominent example), but these haven't been ported to the Dataset API and would require a custom Aggregator. It is also possible to approximate selection (for example through quantile approximation), but this increases the number of passes over the data and in many cases won't provide any performance gain.
This is too long for a comment.
There is no such optimization. Basically, all the data is sorted when using windowing clauses. I suppose that a database engine could actually use a hash algorithm for the partition by and a sort algorithm for the order by, but I don't think that is a common approach.
In any case, the operation is over the entire set, and it should be optimized for this purpose. Trying not to order a subset would add lots of overhead -- for instance, running the sort multiple times for each subset and counting the number of rows in each subset.
Also note that the comparison to "3" occurs (logically) after the window function. I don't think window functions are typically optimized for such post-filtering (although once again, it is a possible optimization).

Fastest way to add a grouping column which divides the result per 4 rows

If I have a result set like this, for example (just a list of numbers):
1,2,3,4,5,6,7,8,9,10,11
and I would like to add a grouping column so I can group them per 4, like this:
1,1,1,1,2,2,2,2,3,3,3
(The last group in this example does not have a fourth element, which is why I cannot use NTILE(3) here.)
But I would still like to be able to group by 4 elements.
Is this possible in an easy way (just like NTILE(n)) without writing a bunch of logic?
Thank you in advance,
Greets Jacob
Try this:
SELECT col,
(ROW_NUMBER() OVER (ORDER BY col) - 1) / 4 + 1 AS grp
FROM mytable
grp is equal to 1 for the first four rows, equal to 2 for the next four, equal to 3 for the next four, etc.
Demo here
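In case the demo link goes stale, here is a self-contained check of the formula against the numbers from the question:
-- grp comes out as 1,1,1,1,2,2,2,2,3,3,3
SELECT col,
       (ROW_NUMBER() OVER (ORDER BY col) - 1) / 4 + 1 AS grp
FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11)) AS t(col);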
Alternatively, the following can also be used (as suggested by @Jacob Siemaszko):
SELECT col,
CEILING(ROW_NUMBER() OVER (ORDER BY col) / 4.0) AS grp
FROM mytable
The second query uses decimal arithmetic and is likely to be slightly less efficient than the first. Note the 4.0: dividing by the integer 4 would perform integer division and produce the wrong groups.

SQL Server 2005 - SUM'ing one field, but only for the first occurrence of a second field

Platform: SQL Server 2005 Express
Disclaimer: I’m quite a novice to SQL and so if you are happy to help with what may be a very simple question, then I won’t be offended if you talk slowly and use small words :-)
I have a table where I want to SUM the contents of multiple rows. However, I want to SUM one column only for the first occurrence of text in a different column.
Table schema for table 'tblMain'
fldOne {varchar(100)} Example contents: “Dandelion”
fldTwo {varchar(8)} Example contents: “01:00:00” (represents hh:mm:ss)
fldThree {numeric(10,0)} Example contents: “65”
Contents of table:
Row number  fldOne     fldTwo    fldThree
------------------------------------------
1           Dandelion  01:00:00  99
2           Daisy      02:15:00  88
3           Dandelion  00:45:00  77
4           Dandelion  00:30:00  10
5           Dandelion  00:15:00  200
6           Rose       01:30:00  55
7           Daisy      01:00:00  22
etc. ad nauseam
If I use:
Select * from tblMain where fldTwo < '05:00:00' order by fldOne, fldTwo desc
Then all rows are correctly returned, ordered by fldOne and then fldTwo in descending order (although in the example data I've shown, all the data is already in the correct order!)
What I’d like to do is get the SUM of each fldThree, but only from the first occurrence of each fldOne.
So, SUM the first Dandelion, Daisy and Rose that I come across. E.g.
99+88+55
At the moment, I’m doing this programmatically; return a RecordSet from the Select statement above, and MoveNext through each returned row, only adding fldThree to my ‘total’ if I’ve never seen the text from fldOne before. It works, but most of the Select queries return over 100k rows and so it’s quite slow (slow being a relative term – it takes about 50 seconds on my setup).
The actual select statement (selecting about 100k rows from 1.5m total rows) completes in under a second which is fine. The current programatic loop is quite small and tight, it's just the number of loops through the RecordSet that takes time. I'm using adOpenForwardOnly and adLockReadOnly when I open the record set.
This is a routine that basically runs continuously as more data is added, and also the fldTwo 'times' vary, so I can't be more specific with the Select statement.
Everything that I’ve so far managed to do natively with SQL seems to run quickly and I’m hoping I can take the logic (and work) away from my program and get SQL to take the strain.
Thanks in advance
The best way to approach this is with window functions. These let you enumerate the rows within a group. However, you need some way to identify the first row. SQL tables are inherently unordered, so you need a column to specify the ordering. Here are some ideas.
If you have an id column, which is defined as an identity so it is autoincremented:
select sum(fldThree)
from (select m.*,
             row_number() over (partition by fldOne order by id) as seqnum
      from tblMain m
     ) m
where seqnum = 1
To get an arbitrary row, you could use:
select sum(fldThree)
from (select m.*,
             row_number() over (partition by fldOne order by (select NULL as noorder)) as seqnum
      from tblMain m
     ) m
where seqnum = 1
Or, if FldTwo has the values in reverse order:
select sum(fldThree)
from (select m.*,
             row_number() over (partition by fldOne order by FldTwo desc) as seqnum
      from tblMain m
     ) m
where seqnum = 1
Maybe this?
SELECT SUM(fldThree) as ExpectedSum
FROM
    (SELECT *, ROW_NUMBER() OVER (PARTITION BY fldOne ORDER BY fldTwo DESC) Rn
     FROM tblMain) as A
WHERE Rn = 1