Fastest way to add a grouping column which divides the result per 4 rows - sql

If I have a result set like this, for example (just a list of numbers):
1,2,3,4,5,6,7,8,9,10,11
and I would like to add a grouping column so I can group them in fours, like this:
1,1,1,1,2,2,2,2,3,3,3
(The last group in this example does not have a fourth element, which is why I cannot use NTILE(3) here.)
But I would still like to be able to group by 4 elements.
Is this possible in an easy way (just like NTILE(n)), without writing a bunch of logic?
Thank you in advance,
Greets Jacob

Try this:
SELECT col,
(ROW_NUMBER() OVER (ORDER BY col) - 1) / 4 + 1 AS grp
FROM mytable
grp is equal to 1 for the first four rows, equal to 2 for the next four, equal to 3 for the next four, etc.
Alternatively, the following can also be used (as suggested by @Jacob Siemaszko):
SELECT col,
CEILING(ROW_NUMBER() OVER (ORDER BY col) / 4.0) AS grp
FROM mytable
The second query uses floating point arithmetic and is likely less efficient compared to the first one.
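To check the arithmetic, here is a minimal sketch that runs the first query against the 11 numbers from the question. It uses Python's built-in sqlite3 module purely as a test harness, which is an assumption on my part (the question does not name a database), and it needs an SQLite build with window function support (3.25 or later). The CEILING variant is reproduced with math.ceil in Python, because not every SQLite build ships a ceiling function.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (col INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?)", [(n,) for n in range(1, 12)])

# Integer division: row numbers 1-4 give 0, 5-8 give 1, and so on; +1 makes groups 1-based.
rows = conn.execute(
    """
    SELECT col,
           (ROW_NUMBER() OVER (ORDER BY col) - 1) / 4 + 1 AS grp
    FROM mytable
    ORDER BY col
    """
).fetchall()

groups = [grp for _, grp in rows]

# The CEILING(rn / 4.0) formula from the second query, evaluated in Python.
ceiling_groups = [math.ceil(rn / 4.0) for rn in range(1, len(rows) + 1)]

print(groups)
print(ceiling_groups == groups)
```

Both formulas assign the same group to every row; the last group simply ends up with three members.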

Related

SQL- Sample the 3rd from the beginning and backwards

I have a table with id and score. I want to create a new data set using a sampling method. The method is to order the ids by decreasing score and take every 3rd id, starting from the beginning, until we have 10k positive samples. We would like to do the same in the other direction, starting from the end, to get 10k negative samples.
id    score
24    0.55
58    0.43
987   0.93
How can I write a SQL query to execute this sampling and get the expected output?
To start with, this would be more straightforward to answer if you had included the database you use (SQL Server, MySQL, etc.). Different SQL dialects have different syntax.
BACKGROUND
To answer this question, the main tools you need are the ability to sort, and an ability to take every 3rd row.
I'm using SQL Server here, so sorting includes
TOP modifier in SELECT statements - in other databases it's often LIMIT (e.g., SELECT * FROM Test LIMIT 1000)
ROW_NUMBER() which I believe is relatively common
To get every third row, I use the 'modulus' mathematical function - in SQL Server signified by a % symbol - so, for example
1 % 3 = 1
2 % 3 = 2
3 % 3 = 0
4 % 3 = 1
APPROACH
There is an example of this in this db<>fiddle - but note that it is only dealing with test data (1000 rows, selecting top and bottom 10).
Running through the steps - and assuming your data is stored in #DataTable:
The following command assigns a row number rn to the data, sorted by the score.
SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) as rn
FROM #DataTable;
To get every third value, start with that data and take every third value (e.g., where the row number is a multiple of 3)
SELECT *
FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) as rn
FROM #DataTable) AS Ranked -- SQL Server requires an alias on a derived table
WHERE rn % 3 = 0;
To get the first 10,000 of them, use TOP (or LIMIT, etc)
SELECT TOP 10000 *
FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) as rn
FROM #DataTable) AS Ranked -- SQL Server requires an alias on a derived table
WHERE rn % 3 = 0
ORDER BY rn;
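As a quick sanity check of the "every third row, first N" step, here is a scaled-down sketch. It is not the SQL Server query itself: it runs through Python's built-in sqlite3 module (an assumption, and it needs SQLite 3.25 or later for window functions), so TOP becomes LIMIT and the temp table #DataTable becomes a plain table called DataTable. The 30 test rows and the limit of 4 are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DataTable (id INTEGER PRIMARY KEY, score REAL)")
# Hypothetical test data: ids 1..30, score rising with id.
conn.executemany(
    "INSERT INTO DataTable VALUES (?, ?)",
    [(i, i / 100.0) for i in range(1, 31)],
)

lowest = conn.execute(
    """
    SELECT id, score, rn
    FROM (SELECT id, score,
                 ROW_NUMBER() OVER (ORDER BY score, id) AS rn
          FROM DataTable) AS Ranked
    WHERE rn % 3 = 0      -- keep rows 3, 6, 9, ...
    ORDER BY rn
    LIMIT 4               -- stands in for TOP 10000
    """
).fetchall()

lowest_ids = [row[0] for row in lowest]
print(lowest_ids)
```

Because the scores rise with the id in this test data, the row number equals the id, so the ids returned are the first four multiples of 3.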
Note - to get it the other way/get the highest scores, take the ROW_NUMBER() in reverse order (e.g., ORDER BY score DESC, id DESC).
FINAL ANSWER
Take the above 10,000 rows, do the same in the other direction (e.g., to get the highest scores), then UNION them together. Below it is done with CTEs.
WITH TopScores AS
(SELECT TOP 10000 id, score
FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score DESC, id DESC) as rn
FROM #DataTable
) AS RankedScores_down
WHERE RankedScores_down.rn % 3 = 0
ORDER BY RankedScores_down.rn
),
LowScores AS
(SELECT TOP 10000 id, score
FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) as rn
FROM #DataTable
) AS RankedScores_up
WHERE RankedScores_up.rn % 3 = 0
ORDER BY RankedScores_up.rn
)
SELECT * FROM TopScores
UNION
SELECT * FROM LowScores
ORDER BY score, id;
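Here is a scaled-down sketch of the combined query, mainly to show what happens when the two samples overlap. Again this goes through Python's sqlite3 module rather than SQL Server (my assumption; SQLite 3.25 or later), so TOP is written as LIMIT and #DataTable is a plain table. With only 11 hypothetical rows and 3 samples per side, both directions pick the same ids, so UNION and UNION ALL give visibly different results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DataTable (id INTEGER PRIMARY KEY, score REAL)")
# Hypothetical test data: 11 rows, score rising with id.
conn.executemany(
    "INSERT INTO DataTable VALUES (?, ?)",
    [(i, i / 100.0) for i in range(1, 12)],
)

QUERY = """
WITH TopScores AS (
    SELECT id, score
    FROM (SELECT id, score,
                 ROW_NUMBER() OVER (ORDER BY score DESC, id DESC) AS rn
          FROM DataTable) AS RankedScores_down
    WHERE rn % 3 = 0
    ORDER BY rn
    LIMIT 3
),
LowScores AS (
    SELECT id, score
    FROM (SELECT id, score,
                 ROW_NUMBER() OVER (ORDER BY score, id) AS rn
          FROM DataTable) AS RankedScores_up
    WHERE rn % 3 = 0
    ORDER BY rn
    LIMIT 3
)
SELECT id, score FROM TopScores
{set_operator}
SELECT id, score FROM LowScores
ORDER BY score, id
"""

union_ids = [row[0] for row in conn.execute(QUERY.format(set_operator="UNION"))]
union_all_ids = [row[0] for row in conn.execute(QUERY.format(set_operator="UNION ALL"))]

print(union_ids)      # each sampled id once
print(union_all_ids)  # overlapping ids appear twice
```

With enough rows the two samples never meet and both operators return the same rows; the difference only matters for small tables.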
Notes
I used 'UNION' rather than 'UNION ALL' because, in the chance that there is overlap (e.g., you have less than 60,000 datapoints) we only want to include each sample once
If you use a different database, you'll need to translate this! This is where specifying the database you use would have helped.
Note that taking every third value (when sorted by score) is not really 'independent' sampling - one would ask why you don't just use all of the top/bottom 30,000 scores? If you want to sample 1 in 3 of them, you could use id % 3 instead of rn % 3. But once again, why would you do this? Why not just collect fewer results and use them all?
Of course, one good reason is to use half the data to check the validity of stats e.g., take half the data, do your model - then check against the other half how good your model is.

fetch aggregate value along with data

I have a table with the following fields
ID,Content,QuestionMarks,TypeofQuestion
350, What is the symbol used to represent Bromine?,2,MCQ
758,What is the symbol used to represent Bromine? ,2,MCQ
2425,What is the symbol used to represent Bromine?,3,Essay
2080,A quadrilateral has four sides, four angles ,1,MCQ
2614,A circular cone has a curved surface area of ,2,MCQ
2520,Two triangles have sides 5 cm, 11 cm, 2 cm . ,2,MCQ
2196,Life supporting process mediated by water? ,2,Essay
I would like to get random questions where total marks is an input number.
For example if I say 25, the result should be all the random questions whose Sum(QuestionMarks) is 25(+/-1)
Is this really possible using a SQL
select content,id,questionmarks,sum(questionmarks) from quiz_question
group by content,id,questionmarks;
Expected Input 25
Expected Result (Sum of Question Marks =25)
Update:
How do I ensure I get at least 2 Essay type questions? (This is just an example; I would extend this to other conditions.) Thank you for all the help.
S-Man's cumulative sum is the right approach. For your logic, though, I think you want to get up to the first row that is 24 or more. That logic is:
where total - questionmark < 24
If you have enough questions, then you could get exactly 25 using:
with q25 as (
select *
from (select t.*,
sum(questionmark) over (order by random()) as running_questionmark
from t
) t
where running_questionmark < 25
)
select q.ID, q.Content, q.QuestionMarks, q.TypeofQuestion
from q25 q
union all
(select t.ID, t.Content, t.QuestionMarks, t.TypeofQuestion
from t cross join
(select sum(questionmark) as questionmark_25 from q25) x
where not exists (select 1 from q25 where q25.id = t.id)
order by abs(questionmark - (25 - questionmark_25))
limit 1
)
This selects questions up to 25 but not at 25. It then tries to find one more to make the total 25.
Supposing questionmark is of type integer, you want to get some records in random order whose questionmark sum is not more than 25.
You can use the cumulative SUM() window function with a random order. The cumulative SUM() adds each current value to the previous sum, so you can filter where SUM() <= <your value>:
SELECT
*
FROM (
SELECT
*,
SUM(questionmark) OVER (ORDER BY random()) as total
FROM
t
)s
WHERE total <= 25
Note:
This returns a list of records that sums to no more than 25, but gets as close to it as the random order allows.
Finding an exact match for your value is a combinatorial problem which shouldn't be solved in a database, especially when there's a random factor. What if your current SUM is 22 and the next randomly chosen value is 4? Would you retry, maybe forever, to randomly find a value = 3? Or would you try to remove an already counted record with value = 1?
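To compare the two filters from the answers above (total <= 25 versus total - questionmark < 24), here is a small sketch run through Python's sqlite3 module (an assumption; SQLite 3.25 or later). To keep the result reproducible, a stored column named shuffle_key stands in for random(); that column and the mark values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, questionmark INTEGER, shuffle_key INTEGER)"
)
# Marks listed in their (pretend) random order; running totals are
# 3, 5, 10, 14, 16, 17, 22, 26, 28.
marks_in_shuffled_order = [3, 2, 5, 4, 2, 1, 5, 4, 2]
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(100 + pos, mark, pos) for pos, mark in enumerate(marks_in_shuffled_order)],
)

RUNNING = """
    SELECT id, questionmark,
           SUM(questionmark) OVER (ORDER BY shuffle_key) AS total
    FROM t
"""

# Stop before the running total would exceed 25.
at_most_25 = conn.execute(
    f"SELECT questionmark FROM ({RUNNING}) AS s WHERE total <= 25"
).fetchall()

# Keep going until the running total first reaches 24 or more.
reach_24 = conn.execute(
    f"SELECT questionmark FROM ({RUNNING}) AS s WHERE total - questionmark < 24"
).fetchall()

sum_at_most_25 = sum(mark for (mark,) in at_most_25)
sum_reach_24 = sum(mark for (mark,) in reach_24)
print(sum_at_most_25, sum_reach_24)
```

With this ordering the first filter stops at 22 and the second at 26, which shows the trade-off: one never overshoots, the other never stops short of 24 but can land above the target.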

Find preceding and following rows for a matching row in BigQuery?

Is it possible to find rows preceding and following a matching rows in a BigQuery query? For example if I do:
select textPayload from logs.logs_20160709 where textPayload like "%something%"
and say that I get these results back:
something A
something B
How can I also show the 3 rows preceding and following the matching rows? Something like this:
some text 1
some text 2
some text 3
something A
some text 4
some text 5
some text 6
some text 90
some text 91
some text 92
something B
some text 93
some text 94
some text 95
Is this possible and if so how?
While on Zuma Beach - I was thinking of avoiding CROSS JOIN in my original answer.
Check below - should be much cheaper especially for big set
SELECT textPayload
FROM (
SELECT textPayload,
SUM(match) OVER(ORDER BY ts ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) AS flag
FROM (
SELECT textPayload, ts, IF(textPayload CONTAINS 'something', 1, 0) AS match
FROM YourTable
)
)
WHERE flag > 0
Of course, another way to avoid the cross join is to use BigQuery Standard SQL. Still, the above solution with no joins at all is better than my original answer.
I think one piece is missing in your example: an extra field that defines the order, so I added a ts field for this in my answer. This means I assume your table has two fields involved: textPayload and ts.
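The windowed-flag idea is not specific to BigQuery, so here is a small sketch of it run through Python's sqlite3 module (an assumption on my part; SQLite 3.25 or later). The table contents are made up: 12 rows with a single match at ts = 6. CONTAINS is replaced by LIKE, and the flag column is called is_match because MATCH is a keyword in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (ts INTEGER PRIMARY KEY, textPayload TEXT)")
# Hypothetical log: 12 rows, only ts = 6 contains 'something'.
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?)",
    [(ts, "something A" if ts == 6 else f"some text {ts}") for ts in range(1, 13)],
)

rows = conn.execute(
    """
    SELECT ts, textPayload
    FROM (
        SELECT ts, textPayload,
               -- counts matches among the 3 rows before, the row itself, and the 3 rows after
               SUM(is_match) OVER (ORDER BY ts
                                   ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) AS flag
        FROM (
            SELECT ts, textPayload,
                   CASE WHEN textPayload LIKE '%something%' THEN 1 ELSE 0 END AS is_match
            FROM YourTable
        ) AS marked
    ) AS flagged
    WHERE flag > 0
    ORDER BY ts
    """
).fetchall()

context_ts = [ts for ts, _ in rows]
print(context_ts)
```

A row is kept whenever at least one match falls inside its 7-row window, which is the same as saying the row lies within 3 rows of a match.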
Try below. Should give you exactly what you need
SELECT
all.textPayload
FROM (
SELECT start, finish
FROM (
SELECT textPayload,
LAG(ts, 3) OVER(ORDER BY ts ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS start,
LEAD(ts, 3) OVER(ORDER BY ts ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) AS finish
FROM YourTable
)
WHERE textPayload CONTAINS 'something'
) AS matches
CROSS JOIN YourTable AS all
WHERE all.ts BETWEEN matches.start AND matches.finish
Please note: depending on the type of your ts field, you might need to cast it in the query. Hopefully not.

Find a record with a key closest to a given value

I have a two column table currently, with the columns 'probability' and 'age'. I have a given probability, and I need to search the table and return the age related to the closest probability. It's already in ascending order next to age, for example:
20 0.01050
21 0.02199
22 0.03155
23 0.04710
The only thing I can think of doing right now is returning all ages with probabilities greater than the given probability, and taking the first one.
select age from mydb.mytest
where probability > givenProbability;
I'm sure there is a better approach to this than doing that, so I'm wondering what that would be.
What about something like this:
SELECT * FROM mytest
ORDER BY ABS( .0750 - probability )
LIMIT 1
Should return the top 1 closest value, based on a sorted list of the Absolute value of the Difference between Probability and givenProbability.
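Here is a minimal sketch of the ORDER BY ABS(...) LIMIT 1 approach, using the four rows from the question and Python's sqlite3 module as a stand-in database (an assumption; the question does not say which DBMS is used). The helper name closest_age is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytest (age INTEGER, probability REAL)")
conn.executemany(
    "INSERT INTO mytest VALUES (?, ?)",
    [(20, 0.01050), (21, 0.02199), (22, 0.03155), (23, 0.04710)],
)


def closest_age(given_probability):
    """Return the age whose probability is nearest to given_probability."""
    row = conn.execute(
        """
        SELECT age
        FROM mytest
        ORDER BY ABS(? - probability)
        LIMIT 1
        """,
        (given_probability,),
    ).fetchone()
    return row[0]


print(closest_age(0.03))    # nearest stored value is 0.03155
print(closest_age(0.0750))  # above every stored value, so the largest one wins
```

Unlike the "first probability greater than the given one" idea from the question, this also returns a row when the given value is larger than everything in the table.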
Different solutions will work for different DBMS. This one works in DB2 and is standard sql:
select age
from (
select age
, row_number() over (order by abs(probability - givenProbability)) as rn
from mydb.mytest
)
where rn = 1

Manually specify starting value for Row_Number()

I want to define the start of ROW_NUMBER() as 3258170 instead of 1.
I am using the following SQL query
SELECT ROW_NUMBER() over(order by (select 3258170)) as 'idd'.
However, the above query is not working. When I say not working, I mean it's executing but it's not starting from 3258170. Can somebody help me?
The reason I want to specify the row number is I am inserting Rows from one table to another. In the first Table the last record's row number is 3258169 and when I insert new records I want them to have the row number from 3258170.
Just add the value to the result of row_number():
select 3258170 - 1 + row_number() over (order by (select NULL)) as idd
The order by clause of row_number() is specifying what column is used for the order by. By specifying a constant there, you are simply saying "everything has the same value for ordering purposes". It has nothing, nothing at all to do with the first value chosen.
To avoid confusion, I replaced the constant value with NULL. In SQL Server, I have observed that this assigns a sequential number without actually sorting the rows -- an observed performance advantage, but not one that I've seen documented, so we can't depend on it.
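A minimal sketch of the offset arithmetic, run through Python's sqlite3 module (an assumption; the question is about SQL Server, and SQLite 3.25 or later is needed for ROW_NUMBER()). The table source_rows and its three rows are hypothetical, and the sketch orders by a real column rather than (select NULL) so the output is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_rows (name TEXT)")
conn.executemany("INSERT INTO source_rows VALUES (?)", [("a",), ("b",), ("c",)])

START = 3258170  # first number we want ROW_NUMBER() to produce

rows = conn.execute(
    """
    SELECT ? - 1 + ROW_NUMBER() OVER (ORDER BY name) AS idd, name
    FROM source_rows
    ORDER BY idd
    """,
    (START,),
).fetchall()

idd_values = [idd for idd, _ in rows]
print(idd_values)
```

ROW_NUMBER() itself still counts 1, 2, 3; the starting value comes only from the constant added to it.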
I feel this is easier
ROW_NUMBER() OVER(ORDER BY Field) - 1 AS FieldAlias (To start from 0)
ROW_NUMBER() OVER(ORDER BY Field) + 3258169 AS FieldAlias (To start from 3258170)
Sometimes....
The ROW_NUMBER() may not be the best solution especially when there could be duplicate records in the underlying data set (for JOIN queries etc.). This may result in more rows returned than expected. You may consider creating a SEQUENCE which can be in some cases considered a cleaner solution.
i.e.:
CREATE SEQUENCE myRowNumberId
START WITH 1
INCREMENT BY 1
GO
SELECT NEXT VALUE FOR myRowNumberId AS 'idd' -- your query
GO
DROP SEQUENCE myRowNumberId; -- just to clean-up after ourselves
GO
The downside is that sequences may be difficult to use in complex queries with DISTINCT, window functions, etc. See the SQL Server sequence documentation for details.
I had a situation where I was importing a hierarchical structure into an application where a seq number had to be unique within each hierarchical level and start at 110 (for ease of subsequent manual insertion). The data beforehand looked like this...
Level Prod Type Component Quantity Seq
1 P00210005 R NZ1500 57.90000000 120
1 P00210005 C P00210005M 1.00000000 120
2 P00210005M R M/C Operation 20.00000000 110
2 P00210005M C P00210006 1.00000000 110
2 P00210005M C P00210007 1.00000000 110
I wanted the row_number() function to generate the new sequence numbers, but adding 10 and then multiplying by 10 didn't work as I expected. To force the order of the arithmetic operations you have to enclose the entire row_number() expression, including the partition clause, in brackets. You can only perform simple addition and subtraction on the row_number() as such.
So, my solution for this problem was
,10*(10+row_number() over (partition by Level order by Type desc, [Seq] asc)) [NewSeq]
Note the position of the brackets to allow the multiplication to occur after the addition.
Level Prod Type Component Quantity [Seq] [NewSeq]
1 P00210005 R NZ1500 57.90000000 120 110
1 P00210005 C P00210005M 1.00000000 120 120
2 P00210005M R M/C Operation 20.00000000 110 110
2 P00210005M C P00210006 1.00000000 110 120
2 P00210005M C P00210007 1.00000000 110 130
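To make the bracket placement concrete, here is a sketch that reproduces the NewSeq column from the table above and contrasts it with the unbracketed expression. It runs through Python's sqlite3 module (an assumption; the answer appears to use SQL Server, and SQLite 3.25 or later is needed). The column names lvl and typ, the table name bom, and the extra tie-breaker on component are my own additions so the row order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bom (lvl INTEGER, prod TEXT, typ TEXT, component TEXT, seq INTEGER)")
conn.executemany(
    "INSERT INTO bom VALUES (?, ?, ?, ?, ?)",
    [
        (1, "P00210005", "R", "NZ1500", 120),
        (1, "P00210005", "C", "P00210005M", 120),
        (2, "P00210005M", "R", "M/C Operation", 110),
        (2, "P00210005M", "C", "P00210006", 110),
        (2, "P00210005M", "C", "P00210007", 110),
    ],
)

rows = conn.execute(
    """
    SELECT lvl, component,
           -- addition first (inside the brackets), then multiplication
           10 * (10 + ROW_NUMBER() OVER (PARTITION BY lvl
                                         ORDER BY typ DESC, seq ASC, component ASC)) AS new_seq,
           -- without brackets, the multiplication 10 * 10 happens first
           10 * 10 + ROW_NUMBER() OVER (PARTITION BY lvl
                                        ORDER BY typ DESC, seq ASC, component ASC) AS no_brackets
    FROM bom
    ORDER BY lvl, new_seq
    """
).fetchall()

new_seq = [row[2] for row in rows]
no_brackets = [row[3] for row in rows]
print(new_seq)
print(no_brackets)
```

The numbering restarts at 110 within each level because of the PARTITION BY clause.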
ROW_NUMBER() OVER(ORDER BY Field) - 1 AS FieldAlias (To start from 0)
ROW_NUMBER() OVER(ORDER BY Field) + 2862717 AS FieldAlias (To start from 2862718)