Select statement, table sample, equal distribution - sql

Assume there is a SQL Server 2008 table like the one below that holds 10 million rows.
One of the fields is id; since it is an identity column, its values run from 1 to 10 million.
CREATE TABLE dbo.Stats
(
id INT IDENTITY(1,1) PRIMARY KEY,
field1 INT,
field2 INT,
...
)
Is there an efficient way to get, with a single SELECT statement, a subset of this data that satisfies the following requirements?
It contains a limited number of rows, e.g. 100 or 200.
It is evenly distributed over a certain column (here id), not random.
So, in our example, if we return 100 rows, the result set would look like this:
Row 1 - 100 000
Row 2 - 200 000
Row 3 - 300 000
...
Row 100 - 10 000 000
I want to avoid using a cursor or storing this in a separate table.

Not sure how efficient it's going to be, but the following query will return every 100000th row (relative to the ordering established by id):
SELECT *
FROM (
SELECT *, ROW_NUMBER() OVER (ORDER BY id) RN
FROM Stats
) T
WHERE RN % 100000 = 0
ORDER BY id
Since it does not rely on actual id values, this will work even if you have "holes" in the sequence of id values.
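If the step should follow from a target row count instead of being hard-coded, one option is NTILE, which splits the ordered rows into a given number of nearly equal buckets. The following is a minimal sketch, assuming the dbo.Stats table from the question and a hypothetical @target variable. It returns the highest id of each bucket and has not been benchmarked on 10 million rows:

```sql
-- @target is a hypothetical parameter: the number of rows wanted in the sample.
DECLARE @target INT = 100;

SELECT MAX(id) AS sampled_id
FROM (
    -- Assign every row to one of @target buckets, ordered by id.
    SELECT id, NTILE(@target) OVER (ORDER BY id) AS bucket
    FROM dbo.Stats
) AS t
GROUP BY bucket
ORDER BY sampled_id;
```

With 10 million gap-free ids and @target = 100, each bucket holds 100 000 rows, so the result is 100000, 200000, and so on up to 10000000. To return the other columns as well, join the result back to dbo.Stats on id.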

Something like this?
SELECT id FROM dbo.Stats WHERE id % 100000 = 0
This should work, since you say that id runs from 1 to 10 000 000. If the number of rows is not known in advance but the number of result rows is, calculate the step from the row count (here for 100 result rows):
SELECT id FROM Stats WHERE id % ((SELECT COUNT(id) FROM Stats) / 100) = 0

How to create a new table that only keeps rows with more than 5 data records under the same id in Bigquery

I have a table like this:
Id  Date        Steps  Distance
1   2016-06-01  1000   1
There are over 1000 records and 50 Ids in this table. Most Ids have about 20 records, but some have only 1 or 2 records, which I think are useless.
I want to create a table that excludes the Ids with fewer than 5 records.
I wrote this query to count the records per Id and find the Ids that I want to exclude:
SELECT
Id,
COUNT(Id) AS num_id
FROM `table`
GROUP BY
Id
ORDER BY
num_id
Since there are only two Ids I need to exclude, I used a WHERE clause:
CREATE TABLE `` AS
SELECT
*
FROM ``
WHERE
Id <> 2320127002
AND Id <> 7007744171
Although I can get the result I want, I think there are better ways to solve this kind of problem. For example, if there are over 20 Ids with fewer than 5 records in this table, what should I do? Thank you.
Consider this:
CREATE TABLE `filtered_table` AS
SELECT *
FROM `table`
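WHERE TRUE
QUALIFY COUNT(*) OVER (PARTITION BY Id) >= 5
Note: QUALIFY filters on the result of the window function, so only rows whose Id has at least 5 records are kept. You can remove WHERE TRUE if the query runs successfully without it.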
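If QUALIFY is not an option, the same filter can be written with a grouped subquery. This is a sketch that reuses the placeholder names `table` and `filtered_table` from above; replace them with your actual table names:

```sql
CREATE TABLE `filtered_table` AS
SELECT *
FROM `table`
WHERE Id IN (
  -- Ids that have at least 5 records
  SELECT Id
  FROM `table`
  GROUP BY Id
  HAVING COUNT(*) >= 5
);
```

Changing the threshold in HAVING is the only edit needed if the cutoff changes, no matter how many Ids fall below it.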

Select records from a specific key onwards

I have a table that has more than three trillion records.
The primary key of this table is guid, as shown below:
GUID Value mid id
0B821574-8E85-4FB7-8047-553393E385CB 4 51 15
716F74B0-80D8-4869-86B4-99FF9EB10561 0 510 153
7EBA2C31-FFC8-4071-B11A-9E2B7ED16B2B 2 5 3
85491F90-E4C6-4030-B1E5-B9CA36238AE2 1 58 7
F04FA30C-0C35-4B9F-A01C-708C0189815D 20 50 13
guid is the primary key.
I want to select 10 records starting from the row where the key is equal to, for example, 85491F90-E4C6-4030-B1E5-B9CA36238AE2
You can use order by and top. Assuming that guid defines the ordering of the rows:
select top (10) t.*
from mytable t
where guid >= '85491F90-E4C6-4030-B1E5-B9CA36238AE2'
order by guid
If the ordering is defined by another column, say id (which should be unique as well), then you would use a scalar subquery for filtering:
select top (10) t.*
from mytable t
where id >= (select id from mytable t1 where guid = '85491F90-E4C6-4030-B1E5-B9CA36238AE2')
order by id
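To continue with the following 10 rows, one common pattern is to remember the last key returned and filter strictly after it. This is a sketch only: @last_guid is a hypothetical variable holding the last guid of the previous batch, and it assumes guid is a uniqueidentifier column. Note that SQL Server does not compare uniqueidentifier values in the same order as their string representation, so "onwards" here means onwards in index order, not alphabetical order.

```sql
-- @last_guid is hypothetical: the guid of the last row from the previous batch.
DECLARE @last_guid UNIQUEIDENTIFIER = '85491F90-E4C6-4030-B1E5-B9CA36238AE2';

SELECT TOP (10) t.*
FROM mytable AS t
WHERE t.guid > @last_guid   -- strictly after the last row already read
ORDER BY t.guid;
```

Because the filter and the ordering both use the primary key, each batch can be served by a seek on the key instead of scanning the rows that were already read.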
To read data onward, you can use OFFSET ... FETCH in the ORDER BY clause, available since MS SQL Server 2012. According to learn.microsoft.com, something like this:
-- Declare and set the variables for the OFFSET and FETCH values.
DECLARE @StartingRowNumber INT = 1
, @RowCountPerPage INT = 10;
-- Create the condition to stop the loop after all rows have been returned:
WHILE (SELECT COUNT(*) FROM mytable) >= @StartingRowNumber
BEGIN
-- Run the query until the stop condition is met.
-- Note: as written, this filter matches a single row; adjust it to the range you want to page through.
SELECT *
FROM mytable WHERE guid = '85491F90-E4C6-4030-B1E5-B9CA36238AE2'
ORDER BY id
OFFSET @StartingRowNumber - 1 ROWS
FETCH NEXT @RowCountPerPage ROWS ONLY;
-- Increment @StartingRowNumber value:
SET @StartingRowNumber = @StartingRowNumber + @RowCountPerPage;
CONTINUE
END;
In the real world this will not be enough, because other processes could (try to) read or write data in your table at the same time.
Please read the documentation; for example, search for "Running multiple queries in a single transaction" in https://learn.microsoft.com/en-us/sql/t-sql/queries/select-order-by-clause-transact-sql
Proper indexes on the id and guid columns must be created for good performance.

Update SQL column with sequenced number

I have a table SL_PROD which has the following columns: NUMBER, DEPTCODE, DISP_SEQ and SL_PROD_ID.
SL_PROD_ID is an identity column which increases with each row.
I need to write a query which updates the DISP_SEQ column with sequential numbers (1-X) for the rows which have a DEPTCODE of '725'. I've tried several things with no luck. Any ideas?
Try this:
A common table expression can be used in updates. This is extremely useful if you want to use the values of window functions (with OVER) as update values.
Attention: Look carefully at what you are ordering by. I used NUMBER, but you might need some other sort column (maybe your IDENTITY column).
CREATE TABLE #SL_PROD(NUMBER INT,DEPT_CODE INT,DISP_SEQ INT,SL_PROD_ID INT IDENTITY);
INSERT INTO #SL_PROD(NUMBER,DEPT_CODE,DISP_SEQ) VALUES
(1,123,0)
,(2,725,0)
,(3,725,0)
,(4,123,0)
,(5,725,0);
WITH UpdateableCTE AS
(
SELECT ROW_NUMBER() OVER(ORDER BY NUMBER) AS NewDispSeq
,DISP_SEQ
FROM #SL_PROD
WHERE DEPT_CODE=725
)
UPDATE UpdateableCTE SET DISP_SEQ=NewDispSeq;
SELECT * FROM #SL_PROD;
GO
--Clean up
--DROP TABLE #SL_PROD;
The result (look at the lines with 725):
NUMBER  DEPT_CODE  DISP_SEQ  SL_PROD_ID
1       123        0         1
2       725        1         2
3       725        2         3
4       123        0         4
5       725        3         5
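Applied to the table from the question, the same pattern might look like the sketch below. It assumes the column is named DEPTCODE and holds the string '725', as the question states, and it orders by the identity column SL_PROD_ID, which is only one possible choice:

```sql
WITH numbered AS
(
    -- Number only the rows of department 725, in identity order.
    SELECT DISP_SEQ,
           ROW_NUMBER() OVER (ORDER BY SL_PROD_ID) AS new_seq
    FROM SL_PROD
    WHERE DEPTCODE = '725'
)
UPDATE numbered
SET DISP_SEQ = new_seq;
```

Rows with other DEPTCODE values are not part of the CTE and are left unchanged.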

Get a random row from a table

I need your help with a little problem.
I have a Java service that should access a table and get a random row from it.
My table is simple; it contains only two columns:
"Id" INT IDENTITY(1,1) NOT NULL Primary Key
"Datas" Varchar(64) NOT NULL
Id is an incrementing number, so you might think it would be enough to generate a random number and get the row where id = random_number.
But I have lots of gaps in the table. For example, a sample of the table could look like this:
ID Datas
1 Row1
2 Row2
3 Row3
8 Row4
10 Row5
25 Row6
639 Row7
Is there an elegant way to get one row at random? There must be no other condition, only randomness!
I use SQL Server 2000.
I would like to avoid doing...
select *
and then iterating over the entire result set using a random number, because it can contain a very large number of rows.
You should be able to do something along the lines of:
SELECT TOP 1 * FROM mytable ORDER BY newid()
Note: this is a duplicate of #52964 which is in turn a duplicate of #19412
I would suggest getting the last id in the table like so:
SELECT TOP 1 Id FROM table_name ORDER BY Id DESC
Then, assuming this is stored in the maxId variable, you could generate a random number index between 1 and maxId and do:
SELECT TOP 1 * FROM table_name WHERE Id >= index ORDER BY Id
And that's it.
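Done entirely in T-SQL, this idea can be sketched as follows. The variable names @max_id and @pick are hypothetical, and table_name is the placeholder used above. Be aware that with gaps the pick is not uniform: a row that follows a large gap is selected more often, because every random value falling inside the gap resolves to it.

```sql
-- Hypothetical variables; declared without initializers for older SQL Server versions.
DECLARE @max_id INT, @pick INT;

SELECT @max_id = MAX(Id) FROM table_name;

-- RAND() returns a float in [0, 1), so @pick lies between 1 and @max_id.
SET @pick = 1 + FLOOR(RAND() * @max_id);

-- First existing row at or after the random value.
SELECT TOP 1 *
FROM table_name
WHERE Id >= @pick
ORDER BY Id;
```

In the sample data above, any value from 26 to 639 returns the row with Id 639, so that row would be chosen far more often than the others. If a uniform pick matters more than speed, ORDER BY NEWID() avoids this bias.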

SQL Server query grouped by max value in column

Update:
While using the RANK() OVER (PARTITION BY ...) syntax available in MS SQL Server 2005 does indeed point me in the right direction, it (or maybe I should write "I") is unable to give me the results I need without resorting to enumerating rows in code.
For example, if I select TOP (1) of rank, I get only one value, i.e., slot 1. If I use MAX(), then I get the top-ranked value for each slot, which, in my case, doesn't work, because if slot 2's top value is NULL but its next-highest value is non-empty, that is the one I want.
So, unable to find a pure T-SQL solution, I've resorted to filtering as much as possible in SQL and then enumerating the results in code on the client side.
Original:
I've been hitting advanced T-SQL books, StackOverflow and google trying to figure out how to handle this query either by using pivots or by using analytic functions. So far, I haven't hit on the right combination.
I have schedules that are ranked (higher value, greater precedence). Each schedule has a playlist of a certain number of numbered slots with files.
What I need to do is line up all the schedules and their associated playlists, and for each slot, grab the file from the schedule having the highest ranking value.
So, if I had a query for a specific customer with a join between the playlists and the schedules, ordered by Schedule.Rank DESC, like so:
PlaylistId Schedule.Rank SlotNumber FileId
100 100 1 1001
100 100 2 NULL
100 100 3 NULL
200 80 1 1101
200 80 2 NULL
200 80 3 NULL
300 60 1 1201
300 60 2 NULL
300 60 3 2202
400 20 1 1301
400 20 2 2301
400 20 3 NULL
From this, I need to find the FileId of the highest-ranked row with a non-NULL FileId for each slot number:
SlotNumber FileId Schedule.Rank
1 1001 100
2 2301 20
3 2202 60
Any ideas on how to do this?
Table Definitions below:
CREATE TABLE dbo.Playlists(
id int NOT NULL)
CREATE TABLE dbo.Customers(
id int NOT NULL,
name nchar(10) NULL)
CREATE TABLE dbo.Schedules(
id int NOT NULL,
rank int NOT NULL,
playlistid int NULL,
customerid int NULL)
CREATE TABLE dbo.PlaylistSlots(
id int NOT NULL,
slotnumber int NOT NULL,
playlistid int NULL,
fileid int NULL)
SELECT slotnumber, fileid, rank
FROM
(
SELECT slotnumber, fileid, Schedules.rank, RANK() OVER (PARTITION BY slotnumber ORDER BY Schedules.rank DESC) as rankfunc
FROM Schedules
INNER JOIN PlaylistSlots ON Schedules.playlistid = PlaylistSlots.playlistid
) tmp
WHERE rankfunc = 1
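As written, the query above ranks rows whose fileid is NULL as well, so a slot whose top-ranked schedule has no file comes back with NULL. One possible adjustment, sketched against the table definitions from the question, is to drop the NULL rows before ranking. It omits the per-customer filter, and ROW_NUMBER is swapped in for RANK so that ties still yield a single row per slot:

```sql
SELECT slotnumber, fileid, [rank]
FROM (
    SELECT ps.slotnumber,
           ps.fileid,
           s.[rank],
           ROW_NUMBER() OVER (PARTITION BY ps.slotnumber
                              ORDER BY s.[rank] DESC) AS rn
    FROM dbo.Schedules AS s
    INNER JOIN dbo.PlaylistSlots AS ps
        ON s.playlistid = ps.playlistid
    WHERE ps.fileid IS NOT NULL   -- ignore empty slots before ranking
) AS ranked
WHERE rn = 1
ORDER BY slotnumber;
```

For the sample rows in the question this gives file 1001 for slot 1, 2301 for slot 2 and 2202 for slot 3, which is the expected output listed there.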
Have you looked at SQL Server's (2005 onwards) PARTITION and RANK features?
SELECT a.SlotNumber, a.FileId, a.ScheduleRank
FROM intermediateTable a,
(
SELECT SlotNumber, MAX(ScheduleRank) AS MaxRank
FROM intermediateTable
WHERE FileId IS NOT NULL GROUP BY SlotNumber) b
WHERE b.SlotNumber = a.SlotNumber AND b.MaxRank = a.ScheduleRank
This query uses the intermediate output (the joined result shown in the question, with the Schedule.Rank column named ScheduleRank) to build the final output.
Does this help?