Assuming a table with the following format
Team   Score
----   -----
A      10
B      20
A      30
B      40
A      50
C      60
I would like to compute statistics, e.g. the mean over the kth game a given team played. For example, if k = 1, the mean is (10 + 20 + 60) / 3. How can one accomplish this using BigQuery? Is there a much simpler way for the special case k = 1 vs. the general case?
Consider the approach below - it assumes you have a column that represents game number or game date - something that defines game order. In this example I use a column named game, but you should replace it with your column.
select avg(score) avg_score from (
  select * from your_table
  qualify row_number() over(partition by team order by game) <= 1
)
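For the general case, the same pattern works with only the filter changed. Below is a minimal sketch, assuming BigQuery scripting and mirroring the query above; k, your_table, and game are placeholders:

-- Sketch for the general case: k is the game index you want (here 2).
-- your_table and game are placeholders, as in the query above.
declare k int64 default 2;

select avg(score) avg_score from (
  select * from your_table
  qualify row_number() over(partition by team order by game) = k
);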
I have a table with id and score. I want to create a new dataset with the following sampling method: order the ids by decreasing score and take every 3rd id, starting from the beginning, until we get 10k positive samples. Then we would like to do the same in the other direction, starting from the end, to get 10k negative samples.
id    score
---   -----
24    0.55
58    0.43
987   0.93
How can I write a SQL query to execute this sampling and get the expected output?
To start with, it would be more straightforward to write an answer if you included the database you use (SQL Server, MySQL, etc.). Different SQL dialects have different syntax.
BACKGROUND
To answer this question, the main tools you need are the ability to sort, and an ability to take every 3rd row.
I'm using SQL Server here, so the relevant tools include
TOP modifier in SELECT statements - in other databases it's often LIMIT (e.g., SELECT * FROM Test LIMIT 1000)
ROW_NUMBER(), which I believe is relatively common
To get every third row, I use the 'modulus' mathematical function - in SQL Server signified by a % symbol - so, for example
1 % 3 = 1
2 % 3 = 2
3 % 3 = 0
4 % 3 = 1
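As a quick illustration (assuming SQL Server, as above), you can verify these results directly:

-- The % operator returns the remainder after integer division.
SELECT 1 % 3 AS r1,  -- 1
       2 % 3 AS r2,  -- 2
       3 % 3 AS r3,  -- 0
       4 % 3 AS r4;  -- 1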
APPROACH
There is an example of this in this db<>fiddle - but note that it is only dealing with test data (1000 rows, selecting top and bottom 10).
Running through the steps - and assuming your data is stored in #DataTable:
The following command assigns a row number rn to the data, sorted by the score.
SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) as rn
FROM #DataTable;
To get every third value, start with that data and keep only the rows whose row number is a multiple of 3.
SELECT *
FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) AS rn
      FROM #DataTable) AS RankedScores
WHERE rn % 3 = 0;
To get the first 10,000 of them, use TOP (or LIMIT, etc)
SELECT TOP 10000 *
FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) AS rn
      FROM #DataTable) AS RankedScores
WHERE rn % 3 = 0
ORDER BY rn;
Note - to get it the other way/get the highest scores, take the ROW_NUMBER() in reverse order (e.g., ORDER BY score DESC, id DESC).
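For instance, a minimal sketch of the reversed version, under the same assumptions as the queries above:

-- Same sampling, but rn = 1 is now the highest score.
SELECT TOP 10000 *
FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score DESC, id DESC) AS rn
      FROM #DataTable) AS RankedScores
WHERE rn % 3 = 0
ORDER BY rn;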
FINAL ANSWER
Take the above 10,000 rows, and do the same in the other direction (i.e., to get the highest scores), then UNION them together. Below it is done with CTEs.
WITH TopScores AS
    (SELECT TOP 10000 id, score
     FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score DESC, id DESC) AS rn
           FROM #DataTable
          ) AS RankedScores_down
     WHERE RankedScores_down.rn % 3 = 0
     ORDER BY RankedScores_down.rn
    ),
LowScores AS
    (SELECT TOP 10000 id, score
     FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) AS rn
           FROM #DataTable
          ) AS RankedScores_up
     WHERE RankedScores_up.rn % 3 = 0
     ORDER BY RankedScores_up.rn
    )
SELECT * FROM TopScores
UNION
SELECT * FROM LowScores
ORDER BY score, id;
Notes
I used 'UNION' rather than 'UNION ALL' because, on the chance that there is overlap (e.g., you have fewer than 60,000 data points), we only want to include each sample once.
If you use a different database, you'll need to translate this! Hence the benefit of specifying the database you use.
Note that taking every third value (when sorted by score) is not really 'independent' sampling - one would ask why you don't just use all of the top/bottom 30,000 scores. If you want to sample 1 in 3 of them, you could use id % 3 instead of rn % 3. But once again, why would you do this? Why not just collect fewer results and use them all?
Of course, one good reason is to use half the data to check the validity of your stats - e.g., fit your model on half the data, then check against the other half how good the model is.
I have a table with the following fields
ID,Content,QuestionMarks,TypeofQuestion
350, What is the symbol used to represent Bromine?, 2, MCQ
758, What is the symbol used to represent Bromine?, 2, MCQ
2425, What is the symbol used to represent Bromine?, 3, Essay
2080, A quadrilateral has four sides, four angles, 1, MCQ
2614, A circular cone has a curved surface area of, 2, MCQ
2520, Two triangles have sides 5 cm, 11 cm, 2 cm., 2, MCQ
2196, Life supporting process mediated by water?, 2, Essay
I would like to get random questions where the total marks equal an input number.
For example, if I say 25, the result should be a set of random questions whose Sum(QuestionMarks) is 25 (+/- 1).
Is this really possible using SQL?
select content,id,questionmarks,sum(questionmarks) from quiz_question
group by content,id,questionmarks;
Expected input: 25
Expected result: Sum(QuestionMarks) = 25
Update:
How do I ensure I get at least 2 Essay-type questions? (This is just an example; I would extend this to other conditions.) Thank you for all the help.
S-Man's cumulative sum is the right approach. For your logic, though, I think you want to get up to the first row that is 24 or more. That logic is:
where total - questionmark < 24
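In context, that filter sits on the same running total used in the queries below - a minimal sketch, reusing the placeholder table t and column questionmark from the answers:

-- Keep rows while the running total *before* the current row is under 24,
-- i.e., up to and including the first row that reaches 24 or more.
select *
from (select t.*,
             sum(questionmark) over (order by random()) as total
      from t
     ) t
where total - questionmark < 24;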
If you have enough questions, then you could get exactly 25 using:
with q25 as (
      select *
      from (select t.*,
                   sum(questionmark) over (order by random()) as running_questionmark
            from t
           ) t
      where running_questionmark < 25
     )
select q.ID, q.Content, q.QuestionMarks, q.TypeofQuestion
from q25 q
union all
(select t.ID, t.Content, t.QuestionMarks, t.TypeofQuestion
 from t cross join
      (select sum(questionmark) as questionmark_25 from q25) x
 where not exists (select 1 from q25 where q25.id = t.id)
 order by abs(questionmark - (25 - questionmark_25))
 limit 1
)
This selects questions up to 25 but not at 25. It then tries to find one more to make the total 25.
Supposing questionmark is of type integer, you want to get some records in random order whose questionmark sum is not more than 25.
You can use the cumulative SUM() window function. The order is random. The cumulative SUM() adds each current value to the previous sum. So, you can filter on SUM() <= <your value>:
demo:db<>fiddle
SELECT *
FROM (
    SELECT
        *,
        SUM(questionmark) OVER (ORDER BY random()) AS total
    FROM t
) s
WHERE total <= 25
Note:
This returns a list of records whose total is no more than 25, but as close as possible to it, in a random order.
Finding an exact match for your value is a sort of combinatorial problem which shouldn't be solved in a database, especially when there's a random factor. What if your current sum is 22 and the next randomly chosen value is 4? Would you retry, maybe until infinity, to randomly find a value = 3? Or would you try to remove an already counted record with value = 1?
Let's say I have a music video play stats table mydataset.stats for a given day (3B rows, 1M users, 6K artists).
Simplified schema is:
UserGUID String, ArtistGUID String
I need to pivot/transpose artists from rows to columns, so the schema will be:
UserGUID String, Artist1 Int, Artist2 Int, … Artist8000 Int
with each artist's play count for the respective user.
An approach was suggested in How to transpose rows to columns with large amount of the data in BigQuery/SQL? and How to create dummy variable columns for thousands of categories in Google BigQuery?, but it looks like it doesn't scale for the numbers in my example.
Can this approach be scaled for my example?
I tried the approach below with up to 6,000 features and it worked as expected. I believe it will work up to 10K features, which is the hard limit for the number of columns in a table.
STEP 1 - Aggregate plays by user / artist
SELECT userGUID as uid, artistGUID as aid, COUNT(1) as plays
FROM [mydataset.stats] GROUP BY 1, 2
STEP 2 – Normalize uid and aid – so they are consecutive numbers 1, 2, 3, … .
We need this for at least two reasons: a) to make the dynamically created SQL later as compact as possible and b) to have more usable/friendly column names.
Combined with the first step, it will be:
SELECT u.uid AS uid, a.aid AS aid, plays
FROM (
SELECT userGUID, artistGUID, COUNT(1) AS plays
FROM [mydataset.stats]
GROUP BY 1, 2
) AS s
JOIN (
SELECT userGUID, ROW_NUMBER() OVER() AS uid FROM [mydataset.stats] GROUP BY 1
) AS u ON u.userGUID = s.userGUID
JOIN (
SELECT artistGUID, ROW_NUMBER() OVER() AS aid FROM [mydataset.stats] GROUP BY 1
) AS a ON a.artistGUID = s.artistGUID
Let’s write the output to a table - mydataset.aggs.
STEP 3 – Use the approach already suggested (in the above-mentioned questions) for N features (artists) at a time.
In my particular example, by experimenting, I found that the basic approach works well for between 2,000 and 3,000 features.
To be on the safe side, I decided to use 2,000 features at a time.
The script below is used to dynamically generate a query that is then run to create the partitioned tables:
SELECT 'SELECT uid,' +
GROUP_CONCAT_UNQUOTED(
'SUM(IF(aid=' + STRING(aid) + ',plays,NULL)) as a' + STRING(aid)
)
+ ' FROM [mydataset.aggs] GROUP EACH BY uid'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid > 0 and aid < 2001)
The above query produces yet another query, like the one below:
SELECT uid,SUM(IF(aid=1,plays,NULL)) a1,SUM(IF(aid=3,plays,NULL)) a3,
SUM(IF(aid=2,plays,NULL)) a2,SUM(IF(aid=4,plays,NULL)) a4 . . .
FROM [mydataset.aggs] GROUP EACH BY uid
This should be run and the result written to mydataset.pivot_1_2000.
Executing STEP 3 two more times (adjusting HAVING aid > NNNN and aid < NNNN), we get two more tables: mydataset.pivot_2001_4000 and mydataset.pivot_4001_6000.
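For instance, the generator for the second partition differs from STEP 3 only in the HAVING range - a sketch mirroring the query above:

SELECT 'SELECT uid,' +
  GROUP_CONCAT_UNQUOTED(
    'SUM(IF(aid=' + STRING(aid) + ',plays,NULL)) as a' + STRING(aid)
  )
  + ' FROM [mydataset.aggs] GROUP EACH BY uid'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid > 2000 and aid < 4001)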
As you can see - mydataset.pivot_1_2000 has the expected schema, but only for features with aid from 1 to 2000; mydataset.pivot_2001_4000 has only features with aid from 2001 to 4000; and so on.
STEP 4 – Merge all the partitioned pivot tables into a final pivot table with all features represented as columns in one table.
Same as in the above steps: first we generate the query and then run it.
So, initially we will “stitch” mydataset.pivot_1_2000 and mydataset.pivot_2001_4000, then that result with mydataset.pivot_4001_6000:
SELECT 'SELECT x.uid uid,' +
GROUP_CONCAT_UNQUOTED(
'a' + STRING(aid)
)
+ ' FROM [mydataset.pivot_1_2000] AS x
JOIN EACH [mydataset.pivot_2001_4000] AS y ON y.uid = x.uid
'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 4001 ORDER BY aid)
The output string from above should be run and the result written to mydataset.pivot_1_4000.
Then we repeat STEP 4 as below:
SELECT 'SELECT x.uid uid,' +
GROUP_CONCAT_UNQUOTED(
'a' + STRING(aid)
)
+ ' FROM [mydataset.pivot_1_4000] AS x
JOIN EACH [mydataset.pivot_4001_6000] AS y ON y.uid = x.uid
'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 6001 ORDER BY aid)
The result is to be written to mydataset.pivot_1_6000.
The resulting table has the following schema:
uid int, a1 int, a2 int, a3 int, . . . , a5999 int, a6000 int
NOTE:
a. I tried this approach only up to 6,000 features and it worked as expected.
b. Run time for the second/main queries in steps 3 and 4 varied from 20 to 60 minutes.
c. IMPORTANT: the billing tier in steps 3 and 4 varied from 1 to 90. The good news is that the respective tables' sizes are relatively small (30-40MB), and so are the billed bytes. For “before 2016” projects everything is billed as tier 1, but after October 2016 this can be an issue.
For more information, see Timing in High-Compute queries.
d. The above example shows the power of large-scale data transformation with BigQuery! Still, I think (but I could be wrong) that storing a materialized feature matrix is not the best idea.
I currently have a two-column table, with the columns 'probability' and 'age'. I have a given probability, and I need to search the table and return the age related to the closest probability. The table is already in ascending order of probability next to age, for example:
age  probability
---  -----------
20   0.01050
21   0.02199
22   0.03155
23   0.04710
The only thing I can think of doing right now is returning all ages with probabilities greater than the given probability, and taking the first one.
select age from mydb.mytest
where probability > givenProbability
order by probability
limit 1;
I'm sure there is a better approach to this than doing that, so I'm wondering what that would be.
What about something like this:
SELECT * FROM mytest
ORDER BY ABS( .0750 - probability )
LIMIT 1
This should return the single closest value, based on sorting by the absolute value of the difference between probability and the given probability.
Different solutions will work for different DBMSs. This one works in DB2 and is standard SQL:
select age
from (
    select age
         , row_number() over (order by abs(probability - givenProbability)) as rn
    from mydb.mytest
) as t
where rn = 1
I have a table Patients which looks like this:
PatientName  DateOfTest  Eye  L1  L2  L3  L4  L5
------------------------------------------------
Mike         17-02-2009  L    23  25  40  32  30
Mike         17-02-2009  R    25  30  34  35  24
Bill         08-03-2006  L    20  24  30  24  25
Bill         08-03-2006  R    18  25  27  30  24
Now my query below finds the mean:
SELECT
PatientName, DateOfTest,
(MAX(L1) + MAX(L2) + MAX(L3) + MAX(L4) + MAX(L5))/4 as Mean,
SQRT(POW(L1 - Mean, 2) + POW(L2 - Mean, 2) + POW(L3 - Mean, 2) + POW(L4 - Mean, 2) + POW(L5 - Mean, 2)) AS Standard Deviation,
'Binocular' Eye
FROM
Patients
GROUP BY
PatientName, DateOfTest;
The above query is wrong because I have not stored the mean. Is there any way to store the mean to find the standard deviation in my code? I'm asking because I have a very lengthy query and many records.
To store the mean and reuse it in your query, one option would be to use a Common Table Expression. You can join the CTE to the table to use the calculated mean multiple times.
I'll admit that I didn't understand the following line...
SQRT(POW(L1-Mean,2)+POW(L2-Mean,2)+POW(L3-Mean,2)+POW(L4-Mean,2)+POW(L5-Mean,2))
as Standard Deviation, 'Binocular' Eye
...but the query below shows how you would integrate the calculated mean into that line, which I think might need some additional work as well.
--This is the CTE to calculate the mean
WITH Mean_CTE AS
(
SELECT PatientName, DateOfTest,
(MAX(L1) + MAX(L2) + MAX(L3) + MAX(L4) + MAX(L5))/4 AS [Mean]
FROM Patients
GROUP BY PatientName, DateOfTest
)
--This is the original query
SELECT Patients.PatientName, Patients.DateOfTest, Mean_CTE.Mean AS Mean,
SQRT(POW(L1-Mean_CTE.Mean,2)+POW(L2-Mean_CTE.Mean,2)+POW(L3-Mean_CTE.Mean,2)
+POW(L4-Mean_CTE.Mean,2)+POW(L5-Mean_CTE.Mean,2)) AS [Standard Deviation],
'Binocular' Eye
FROM Patients
INNER JOIN Mean_CTE --This is where you join the two
ON Patients.PatientName = Mean_CTE.PatientName
AND Patients.DateOfTest = Mean_CTE.DateOfTest
GROUP BY Patients.PatientName, Patients.DateOfTest, Mean_CTE.Mean;
What about the possibility of adding a CALCULATED column to the table that stores the result of the formula?
This is a rather simple concept, and it will store the value of the formula.
http://msdn.microsoft.com/en-us/library/ms191250(v=sql.105).aspx
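As a minimal sketch, assuming SQL Server (per the linked article) and an illustrative column name MeanScore - note it divides by 5 for a true mean of five values, whereas the query above divides by 4:

-- Hypothetical computed column holding the per-row mean of the five readings.
-- PERSISTED stores the computed value physically instead of evaluating it on read.
ALTER TABLE Patients
ADD MeanScore AS ((L1 + L2 + L3 + L4 + L5) / 5.0) PERSISTED;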