SQL select segment

I'm using SQL Server 2008.
I have a table with x rows. I would like to always divide x by 5 and select the 3rd group of records.
Let's say there are 100 records in the table:
100 / 5 = 20
so the 3rd segment will be records 41 to 60.
How can I calculate and select only this 3rd segment in SQL?
Thanks.

You can use NTILE.
Distributes the rows in an ordered partition into a specified number of groups.
Example:
SELECT col1, col2, ..., coln
FROM
(
    SELECT
        col1, col2, ..., coln,
        NTILE(5) OVER (ORDER BY id) AS groupno
    FROM yourtable
) AS t  -- SQL Server requires an alias on the derived table
WHERE groupno = 3

That's a perfect use for the NTILE ranking function.
Basically, you define your query inside a CTE and add an NTILE number to each row - a number going from 1 to n (the argument to NTILE). You order your rows by some column, and you get the n groups of rows you're looking for; you can then operate on any one of those "groups" of data.
So try something like this:
;WITH SegmentedData AS
(
    SELECT
        (list of your columns),
        GroupNo = NTILE(5) OVER (ORDER BY SomeColumnOfYours)
    FROM dbo.YourTable
)
SELECT *
FROM SegmentedData
WHERE GroupNo = 3
Of course, you can also use an UPDATE statement after the CTE to update those rows.
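A minimal sketch of that UPDATE-through-CTE pattern (SomeFlag is a hypothetical column standing in for whatever you actually want to change):
;WITH SegmentedData AS
(
    SELECT
        SomeFlag,
        GroupNo = NTILE(5) OVER (ORDER BY SomeColumnOfYours)
    FROM dbo.YourTable
)
UPDATE SegmentedData
SET SomeFlag = 1    -- any assignment except to GroupNo itself
WHERE GroupNo = 3
SQL Server allows DML through a CTE as long as it references a single base table; the NTILE column can only be read in the WHERE, not assigned to.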

Related

SQL: Select the largest value for a group

I have a table like this:
someOtherCols  shelf  type  count
-------------  -----  ----  -----
...            row1   A     2
...            row1   B     3
...            row2   C     2
...            row2   D     2
I would like to group by shelf, and only keep the type that has the highest count. If there's a tie, choose any row.
So the result would be like:
someOtherCols  shelf  type  count
-------------  -----  ----  -----
...            row1   B     3
...            row2   C     2
I'm using AWS Athena, and I'm trying the following query that I saw in another answer:
SELECT * FROM table
WHERE count IN (MAX(count) FROM shoppingAggregate GROUP BY someOtherCols, shelf)
Seems like Athena does not like it. How can I achieve this? Thanks a lot!
Use window functions:
You will have to do an OVER for each attribute in "some other cols". I used MIN below (you have to make the query deterministic); you can choose whatever you prefer.
select distinct -- you need to remove duplicates
min(othercol) over (partition by row),
row, max(type) over (partition by row),
count(*) over (partition by row)
from relation;
SQLite will allow you to write a non-deterministic query that is significantly simpler, using a plain GROUP BY (but it will not work in other DBMSs).
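A minimal sketch of that SQLite shortcut, reusing the names from the code above (in SQLite, bare columns in a MAX() aggregate query are taken from the row that supplied the max):
-- SQLite only: othercol and type come from the row holding MAX(count)
SELECT othercol, row, type, MAX(count) AS count
FROM relation
GROUP BY row;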
I tried to use ROW_NUMBER() and was able to solve this problem:
WITH ranked AS (
    SELECT table.*, ROW_NUMBER() OVER (PARTITION BY othercols, shelf ORDER BY counts DESC) AS rn
    FROM table
)
SELECT * FROM ranked WHERE rn = 1;
Presto offers the function max_by(), which I think is available in Athena as well:
select shelf, max(count), max_by(type, count)
from t
group by shelf

How to randomly create groups with different numbers of rows in HP Vertica

I'd like to randomly select 4 groups of data with different numbers of rows from a table, and generate a new column of group_name.
For example, if the original table (containing 10000 rows) was like this:
ID
---
ID1
ID2
...
The resulting table (containing 2750 rows) I want is like the following:
ID GROUP
--- -----
ID1 1
ID2 3
... ...
The number of rows for each group is as follows:
group1 1000 rows
group2 1000 rows
group3 500 rows
group4 250 rows
These randomly generated groups should not have any overlapping in rows.
Is there any way to do this in Vertica in one go, rather than doing the random selection step by step?
Thanks!
You could do something like this:
SELECT ID, randomint(4)+1 AS "GROUP"  -- GROUP is reserved, so quote the alias
FROM mytable
ORDER BY random()
LIMIT 2750
Although you'd probably want to stuff it into a local temp table to summarize it, since the groupings and selections would change at each execution.
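A sketch of that temp-table idea (grouped_sample is a hypothetical name; RANDOMINT() and RANDOM() are Vertica built-ins):
-- persist one random draw so later queries see a stable grouping
CREATE LOCAL TEMPORARY TABLE grouped_sample
ON COMMIT PRESERVE ROWS AS
SELECT ID, randomint(4)+1 AS "GROUP"
FROM mytable
ORDER BY random()
LIMIT 2750;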
Another idea if you want to keep consistent groupings might be to use HASH() with a mod instead of purely random. This will create the same GROUP value in each query.
SELECT ID, (HASH(ID) % 4)+1 AS "GROUP"
FROM mytable
ORDER BY random()
LIMIT 2750
You should use row_number() with a CTE:
WITH cte AS (
    SELECT ID, row_number() over () as RN
    FROM YourTable
)
SELECT ID,
       CASE
           WHEN rn <= 1000 then 1
           WHEN rn <= 2000 then 2
           WHEN rn <= 2500 then 3
           WHEN rn <= 2750 then 4
       END as "GROUP"
FROM cte
WHERE rn <= 2750
If you want the assignment to be random, order the row_number() by a random value instead: row_number() over (order by random()).
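Putting that together (assuming Vertica's RANDOM(); the CASE mapping is the same as above):
WITH cte AS (
    SELECT ID, row_number() over (order by random()) as RN  -- random order instead of table order
    FROM YourTable
)
SELECT ID,
       CASE
           WHEN rn <= 1000 then 1
           WHEN rn <= 2000 then 2
           WHEN rn <= 2500 then 3
           WHEN rn <= 2750 then 4
       END as "GROUP"
FROM cte
WHERE rn <= 2750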

SQL - Find rows starting with the same characters

I have an Oracle database with a table containing data that may start with the same prefix, and I would like to find the rows whose 5-digit prefix is duplicated somewhere in the table.
For Example:
Table1
---------------
12345-brsd
12345-wbgb
12345-ydad
34573-diwe
75234-daie
72456-woei
72456-wdgq
I want to return only the ones whose first 5 digits are duplicated, so out of this sample:
12345-brsd
12345-wbgb
12345-ydad
72456-woei
72456-wdgq
You can do this using analytic functions:
select t.*
from (select t.*, count(*) over (partition by substr(column1, 1, 5)) as cnt
      from table1 t
     ) t
where cnt > 1
order by column1;

Group by every N records in T-SQL

I have some performance test results on the database, and what I want to do is to group every 1000 records (previously sorted in ascending order by date) and then aggregate results with AVG.
I'm actually looking for a standard SQL solution, however any T-SQL specific results are also appreciated.
The query looks like this:
SELECT TestId,Throughput FROM dbo.Results ORDER BY id
WITH T AS (
    SELECT RANK() OVER (ORDER BY ID) Rank,
           P.Field1, P.Field2, P.Value1, ...
    FROM P
)
SELECT (Rank - 1) / 1000 GroupID, AVG(...)
FROM T
GROUP BY ((Rank - 1) / 1000);
Something like that should get you started. If you can provide your actual schema I can update as appropriate.
Give the answer to Yuck. I only post as an answer so I could include a code block. I did a count test to see if it was grouping by 1000; without the -1 the first set had only 999 rows, while (Rank-1) produced set sizes of exactly 1,000. Great query, Yuck.
WITH T AS (
    SELECT RANK() OVER (ORDER BY sID) Rank, sID
    FROM docSVsys
)
SELECT (Rank-1) / 1000 GroupID, count(sID)
FROM T
GROUP BY ((Rank-1) / 1000)
ORDER BY GroupID
I +1'd @Yuck, because I think that is a good answer. But it's worth mentioning NTILE().
Reason being, if you have 10,010 records (for example), then you'll have 11 groupings -- the first 10 with 1000 in them, and the last with just 10.
If you're comparing averages between each group of 1000, then you should either discard the last group as it's not a representative group, or...you could make all the groups the same size.
NTILE() would make all groups the same size; the only caveat is that you'd need to know how many groups you wanted.
So if your table had 25,250 records, you'd use NTILE(25), and your groupings would be approximately 1000 in size -- they'd actually be 1010 in size; the benefit being, they'd all be the same size, which might make them more relevant to each other in terms of whatever comparison analysis you're doing.
You could get your group-size simply by
DECLARE @ntile int
SET @ntile = (SELECT count(1) FROM myTable) / 1000
And then modifying @Yuck's approach with the NTILE() substitution:
;WITH myCTE AS (
    SELECT NTILE(@ntile) OVER (ORDER BY id) myGroup,
           col1, col2, ...
    FROM dbo.myTable
)
SELECT myGroup, col1, col2...
FROM myCTE
GROUP BY (myGroup), col1, col2...
;
The answer above does not actually assign a unique group id to each 1000 records; adding Floor() is needed. The following will return all records from your table, with a unique GroupID for each 1000 rows:
WITH T AS (
    SELECT RANK() OVER (ORDER BY your_field) Rank,
           your_field
    FROM your_table
    WHERE your_field = 'your_criteria'
)
SELECT Floor((Rank-1) / 1000) GroupID, your_field
FROM T
And for my needs, I wanted my GroupID to be a random set of characters, so I changed the Floor(...) GroupID to:
TO_HEX(SHA256(CONCAT(CAST(Floor((Rank-1) / 1000) AS STRING),'seed1'))) GroupID
Without the seed value, you and I would get exactly the same output, because we're just doing a SHA256 on the numbers 1, 2, 3, etc. Adding the seed makes the output unique but still repeatable.
This is BigQuery syntax. T-SQL might be slightly different.
Lastly, if you want to leave off the last chunk that is not a full 1000, you can find it by doing:
WITH T AS (
    SELECT RANK() OVER (ORDER BY your_field) Rank,
           your_field
    FROM your_table
    WHERE your_field = 'your_criteria'
)
SELECT Floor((Rank-1) / 1000) GroupID, your_field,
       COUNT(*) OVER (PARTITION BY TO_HEX(SHA256(CONCAT(CAST(Floor((Rank-1) / 1000) AS STRING),'seed1')))) AS CountInGroup
FROM T
ORDER BY CountInGroup
You can also use Row_Number() instead of Rank(). No Floor() required.
declare @groupsize int = 50
;with ct1 as (
    select YourColumn, RowID = Row_Number() over(order by YourColumn)
    from YourTable
)
select YourColumn, RowID, GroupID = (RowID-1)/@groupsize + 1
from ct1
I read more about NTILE after reading @user15481328's answer
(resource: https://www.sqlservertutorial.net/sql-server-window-functions/sql-server-ntile-function/ )
and this solution allowed me to find the max date within each of the 25 groups of my data set:
with cte as (
    select date,
           NTILE(25) OVER (order by date) bucket_num
    from mybigdataset
)
select max(date), bucket_num
from cte
group by bucket_num
order by bucket_num

Find the top 5 MAX() values from an SQL table and then perform an AVG() on that table without them

I want to be able to perform an avg() on a column after removing the 5 highest values in it and see that the stddev is not above a certain number. This has to be done entirely as a PL/SQL query.
EDIT:
To clarify, I have a data set that contains values in a certain range and tracks latency. I want to know whether the AVG() of those values is due to a general rise in latency, or due to a few outliers that inflate the stddev, i.e. (1, 2, 1, 3, 12311) as opposed to (122, 124, 111, 212). I also need to achieve this via an SQL query due to our monitoring software's limitations.
You can use row_number to find the top 5 values, and filter them out in a where clause:
select avg(col1)
from (
    select row_number() over (order by col1 desc) as rn, *
    from YourTable
) as SubQueryAlias
where rn > 5
select column_name1 from
(
    select column_name1 from table_name order by nvl(column_name1, 0) desc
) a
where rownum < 6
(the nvl is done to omit null values, if there are any in column_name1)
Well, the most efficient way to do it would be to calculate (sum(all values) - sum(top 5 values)) / (row_count - 5):
-- the subquery makes the LIMIT apply before the SUM
SELECT SUM(val) AS top5sum FROM (SELECT val FROM table ORDER BY val DESC LIMIT 5) t
SELECT SUM(val) AS allsum FROM table
SELECT (COUNT(*) - 5) AS bottomCount FROM table
The average is then (allsum - top5sum) / bottomCount
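The three steps can also be combined into a single statement; a sketch in the same LIMIT dialect as above:
SELECT (SUM(val) - (SELECT SUM(val)
                    FROM (SELECT val FROM table ORDER BY val DESC LIMIT 5) t))
       / (COUNT(*) - 5) AS avg_without_top5
FROM table;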
First, get the 5 MAX values (note the DESC; without it TOP 5 returns the smallest values):
SELECT TOP 5 RowId FROM Table ORDER BY Column DESC
Now use this in your main statement:
SELECT AVG(Column) FROM Table WHERE RowId NOT IN (SELECT TOP 5 RowId FROM Table ORDER BY Column DESC)