How to make a test-train split in the table using SQL?

I am fairly new to PostgreSQL and have a question.
Suppose I have a table with data like the one below. How do I select ten rows from each of the classes (or sentiments), i.e. 10 Positive tweets, 10 Extremely Positive tweets, 10 Negative tweets, etc. (there are around 11 such classes)? The data is also too large to do this manually: ~1k rows.
As the title says, my actual goal is to make a 10:90 split of the entire table and add another column with a test-train label, while keeping the class balance intact. So, if there is a better way to do this, please suggest it.
(I do realise that Python could make life easier, but I wish to know whether it can be done with SQL directly; I'm also somewhat constrained to SQL at the moment.)
Original Tweet | Sentiment
New Yorkers encounter empty supermarket shelves ... | Extremely Negative
When I couldn't find hand sanitizer at Fred Meyer, I turned ... | Positive
Find out how you can protect yourself and loved ones from ... | Extremely Positive
#Panic buying hits #NewYork City as anxious shoppers stock up on... | Negative
#toiletpaper #dunnypaper #coronavirus #coronavirusaustralia ... | Neutral
Do you remember the last time you paid $2.99 a gallon for ... | Neutral

You can use row_number():
select t.*
from (select t.*,
             row_number() over (partition by sentiment order by random()) as seqnum
      from t
     ) t
where seqnum <= 10;
If you wanted to split the data into 90-10 groups randomly, I would probably recommend ntile():
select t.*,
       (case when ntile(10) over (order by random()) <= 9 then 'train' else 'test'
        end) as grp
from t;
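Since you also want to keep the class balance intact, a small variant of the same idea (just a sketch, sticking with PostgreSQL's random()) tiles within each sentiment, so every class gets its own 90/10 split:
select t.*,
       (case when ntile(10) over (partition by sentiment order by random()) <= 9
             then 'train' else 'test'
        end) as grp
from t;
To persist the label, you can write this result out, for example with create table ... as select.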

Related

Optimizing spark sql query

I am using the query below to derive outliers from my data. Using DISTINCT creates too much shuffle and the final tasks take a huge amount of time to complete. Are there any optimizations that can be done to speed it up?
query = """SELECT
DISTINCT NAME,
PERIODICITY,
PERCENTILE(CAST(AMOUNT AS INT), 0.997) OVER(PARTITION BY NAME, PERIODICITY) as OUTLIER_UPPER_THRESHOLD,
CASE
WHEN PERIODICITY = "WEEKLY" THEN 100
WHEN PERIODICITY = "BI_WEEKLY" THEN 200
WHEN PERIODICITY = "MONTHLY" THEN 250
WHEN PERIODICITY = "BI_MONTHLY" THEN 400
WHEN PERIODICITY = "QUARTERLY" THEN 900
ELSE 0
END AS OUTLIER_LOWER_THRESHOLD
FROM base"""
I would suggest rephrasing this so you can filter before aggregating:
SELECT NAME, PERIODICITY, OUTLIER_LOWER_THRESHOLD,
       MIN(AMOUNT) AS OUTLIER_UPPER_THRESHOLD
FROM (SELECT NAME, PERIODICITY,
             RANK() OVER (PARTITION BY NAME, PERIODICITY ORDER BY AMOUNT) as seqnum,
             COUNT(*) OVER (PARTITION BY NAME, PERIODICITY) as cnt,
             (CASE . . . END) as OUTLIER_LOWER_THRESHOLD
      FROM base
     ) b
WHERE seqnum >= 0.997 * cnt
GROUP BY NAME, PERIODICITY, OUTLIER_LOWER_THRESHOLD;
Note: This ranks duplicate amounts based on the lowest rank. That means that some NAME/PERIODICITY pairs may not be in the results. They can easily be added back in using a LEFT JOIN.
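For illustration, that LEFT JOIN could look roughly like this (a sketch only; it assumes the filtered query above has been saved as a temp view called outliers, which is a name I'm making up):
SELECT p.NAME, p.PERIODICITY,
       o.OUTLIER_LOWER_THRESHOLD, o.OUTLIER_UPPER_THRESHOLD
FROM (SELECT DISTINCT NAME, PERIODICITY FROM base) p
LEFT JOIN outliers o
       ON o.NAME = p.NAME
      AND o.PERIODICITY = p.PERIODICITY;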
The easiest way to deal with a large shuffle, independent of what the shuffle is, is to use a larger cluster. It's the easiest way because you don't have to think much about it. Machine time is usually much cheaper than human time refactoring code.
The second easiest way to deal with a large shuffle that is the union of some independent and constant parts is to break it into smaller shuffles. In your case, you could run separate queries for each periodicity, filtering the data down before the shuffle and then union the results.
If the first two approaches are not applicable for some reason, it's time to refactor. In your case you are doing two shuffles: first to compute OUTLIER_UPPER_THRESHOLD which you associate with every row and then to distinct the rows. In other words, you are doing a manual, two-phase GROUP BY. Why don't you just group by NAME, PERIODICITY and compute the percentile?
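Concretely, that single-pass version would look something like this (a sketch; percentile is used here as Spark SQL's aggregate function rather than the window form in the question):
SELECT NAME,
       PERIODICITY,
       PERCENTILE(CAST(AMOUNT AS INT), 0.997) AS OUTLIER_UPPER_THRESHOLD,
       CASE
           WHEN PERIODICITY = 'WEEKLY' THEN 100
           WHEN PERIODICITY = 'BI_WEEKLY' THEN 200
           WHEN PERIODICITY = 'MONTHLY' THEN 250
           WHEN PERIODICITY = 'BI_MONTHLY' THEN 400
           WHEN PERIODICITY = 'QUARTERLY' THEN 900
           ELSE 0
       END AS OUTLIER_LOWER_THRESHOLD
FROM base
GROUP BY NAME, PERIODICITY;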

SQL: Reduce resultset to X rows?

I have the following MYSQL table:
measuredata:
- ID (bigint)
- timestamp
- entityid
- value (double)
The table contains >1 billion entries. I want to be able to visualize any time window. The time window can range from "one day" to "many years". There is a measurement value roughly every minute in the DB.
So the number of entries for a time window can vary a lot, say from a few hundred to several thousand or even millions.
Those values are meant to be visualized in a graphical chart on a webpage.
If the chart is, let's say, 800px wide, it does not make sense to fetch thousands of rows from the database if the time window is quite big. I cannot show more than 800 values on this chart anyhow.
So, is there a way to reduce the resultset directly on DB-side?
I know "average" and "sum" etc. as aggregate function. But how can I i.e. aggregate 100k rows from a big time-window to lets say 800 final rows?
Just getting those 100k rows and let the chart do the magic is not the preferred option. Transfer-size is one reason why this is not an option.
Isn't there something on DB side I can use?
Something like avg() to shrink X rows to Y averaged rows?
Or some simple magic to just skip every n-th row to shrink X to Y?
update:
Although I'm using MySQL right now, I'm not tied to it. If PostgreSQL, for instance, provides a feature that could solve the issue, I'm willing to switch DBs.
update2:
I maybe found a possible solution: https://mike.depalatis.net/blog/postgres-time-series-database.html
See section "Data aggregation".
The key is not to use a unix timestamp but a date, "trunc" it, average the values and group by the trunc'ed date. Could work for me, but it would require a rework of my table structure. Hmm... maybe there's more... still researching...
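If I read it right, the idea translates to something like this in PostgreSQL (just a sketch; it assumes the time were stored in a timestamptz column, here called ts, which is not my current schema):
SELECT date_trunc('day', ts) AS bucket,
       entityid,
       avg(value) AS avg_value
FROM measuredata
WHERE entityid = 38
  AND ts >= now() - interval '1 year'
GROUP BY bucket, entityid
ORDER BY bucket;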
update3:
Inspired by update 2, I came up with this query:
SELECT (`timestamp` - (`timestamp` % 86400)) as aggtimestamp, `entity`, `value` FROM `measuredata` WHERE `entity` = 38 AND timestamp > UNIX_TIMESTAMP('2019-01-25') group by aggtimestamp
Works, but my DB/index/structure doesn't seem really optimized for this: a query for the last year took ~75 sec (slow test machine) but finally returned only one value per day. This can be combined with avg(value), but that further increases query time (~82 sec). I will see if it's possible to optimize this further. But I now have an idea of how "downsampling" data works, especially aggregation in combination with "group by".
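One thing I still want to try (just an assumption on my side, not verified yet): a composite index that at least covers the filter and the timestamp range, with the column names as listed in the schema above:
ALTER TABLE measuredata ADD INDEX idx_entity_ts (entityid, `timestamp`);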
There is probably no efficient way to do this. But, if you want, you can break the rows into equal sized groups and then fetch, say, the first row from each group. Here is one method:
select md.*
from (select md.*,
             row_number() over (partition by tile order by timestamp) as seqnum
      from (select md.*, ntile(800) over (order by timestamp) as tile
            from measuredata md
            where . . .  -- your filtering conditions here
           ) md
     ) md
where seqnum = 1;
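If averaging inside each bucket fits better than picking its first row, the same tiling can feed a group by (a sketch along the same lines, not tested; the entityid filter is just an example):
select tile,
       min(timestamp) as bucket_start,
       avg(value) as avg_value
from (select md.timestamp, md.value,
             ntile(800) over (order by timestamp) as tile
      from measuredata md
      where entityid = 38   -- example filter; replace with your conditions
     ) md
group by tile
order by bucket_start;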

Query for grouping of successful attempts when order matters

Let's say, for example, I have a db table Jumper for tracking high jumpers. It has three columns of interest: attempt_id, athlete, and result (a boolean for whether the jumper cleared the bar or not).
I want to write a query that will compare all athletes' performance across different attempts, yielding a table with this information: attempt number, number of cleared attempts, total attempts. In other words, what is the chance that an athlete will clear the bar on attempt x?
What is the best way of writing this query? It is trickier than it seems at first because you need to determine the attempt number for each athlete to be able to compute the final totals.
I would prefer answers be written with Django ORM, but SQL will also be accepted.
Edit: To be clear, I need it to be grouped by attempt, not by athlete. So it would be all athletes' combined x attempt.
You could solve it using SQL:
SELECT t.attempt_id,
SUM(CASE t.result WHEN TRUE THEN 1 ELSE 0 END) AS cleared,
COUNT(*) AS total
FROM Jumper t
GROUP BY t.attempt_id
EDIT: If attempt_id is just a sequence, and you want to use it to calculate the attempt number for each jumper, you could use this query instead:
SELECT t.attempt_number,
SUM(CASE t.result WHEN TRUE THEN 1 ELSE 0 END) AS cleared,
COUNT(*) AS total
FROM (SELECT s.*,
ROW_NUMBER() OVER(PARTITION BY athlete
ORDER BY attempt_id) AS attempt_number
FROM Jumper s) t
GROUP BY t.attempt_number
This way, you group every first attempt from all athletes, every second attempt from all athletes, and so on...

Selecting percentage of group and population based on a field in a table

I have a table with user IDs and states. I need to assign 20% of users in each state to a control group by setting a flag in another table. I don't know how I would be able to ensure that the numbers are correct though. How would I go about even starting this?
As an example, take a look at this sqlfiddle:
http://sqlfiddle.com/#!4/8e49d/6/0
with counts as
     (select stateid, count(userid) as num_users
      from userstates
      group by stateid)
select *
from (select x.stateid,
             x.userid,
             sum(1) over (partition by x.stateid order by x.userid) as runner,
             y.num_users,
             sum(1) over (partition by x.stateid order by x.userid) / y.num_users as pct
      from userstates x
      join counts y
        on x.stateid = y.stateid)
where pct <= .2
There are a couple of assumptions I made:
-- I assumed that, if you could not pull exactly 20%, you would choose, for instance, 19%, rather than 21%. The query would need to be changed slightly if you want to pull 1 ID over 20% when exactly 20% is not possible (you can't pull a fraction of a username, so you have to choose one way or the other).
-- I assumed that you did not want a random 20%, and that 20% of the first user IDs, in order, would suffice. I would need to change the query slightly if you wanted the 20% from each group to be random.
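For reference, the random variant I mentioned would look roughly like this (a sketch, using Oracle's dbms_random to match the sqlfiddle):
with counts as
     (select stateid, count(userid) as num_users
      from userstates
      group by stateid)
select stateid, userid
from (select x.stateid,
             x.userid,
             row_number() over (partition by x.stateid
                                order by dbms_random.value) as rn,
             y.num_users
      from userstates x
      join counts y
        on x.stateid = y.stateid)
where rn <= 0.2 * num_users;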

Biased random in SQL?

I have some entries in my database, in my case Videos with a rating, popularity and other factors. From all these factors I calculate a likelihood factor, or rather a boost factor.
So I essentially have the fields ID and BOOST. The boost is calculated in a way that it turns out as an integer that represents the percentage of how often this entry should be hit in comparison.
ID Boost
1 1
2 2
3 7
So if I run my random function indefinitely I should end up with X hits on ID 1, twice as many on ID 2 and 7 times as many on ID 3.
So every hit should be random but with a probability of (boost / sum of boosts). The probability for ID 3 in this example should therefore be 0.7 (because the sum is 10; I chose those values for simplicity).
I thought about something like the following query:
SELECT id FROM table WHERE CEIL(RAND() * MAX(boost)) >= boost ORDER BY rand();
Unfortunately that doesn't work. Consider the following entries in the table:
ID Boost
1 1
2 2
With a 50/50 chance it will have either only the 2nd element or both elements to choose from randomly.
So 0.5 of the hits go to the second element,
and 0.5 of the hits go to the (second and first) elements, which are then chosen from randomly, so 0.25 each.
So we end up with a 0.25/0.75 ratio, but it should be 0.33/0.66.
I need some modification or a new method to do this with good performance.
I also thought about storing the boost field cumulatively so I just do a range query from (0-sum()), but then I would have to re-index everything coming after one item if I change it or develop some swapping algorithm or something... but that's really not elegant and stuff.
Both inserting/updating and selecting should be fast!
Do you have any solutions to this problem?
The best use case to think of is probably advertisement delivery: "Please choose a random ad with a given probability." However, I need it for another purpose; this is just to give you a final picture of what it should do.
edit:
Thanks to Ken's answer I thought about the following approach:
calculate a random value from 0-sum(distinct boost)
SET @randval = (select ceil(rand() * sum(DISTINCT boost)) from test);
select the first boost factor at which the running total of the distinct boost factors surpasses the random value
then, in our 1st example, we have 1 with a 0.1, 2 with a 0.2 and 7 with a 0.7 probability.
now select one random entry from all entries having this boost factor
PROBLEM: the count of entries having a given boost is always different. For example, if there is only one 1-boosted entry I get it in 1 of 10 calls, but if there are 1 million entries with boost 7, each of them is hardly ever returned...
so this doesn't work out :( trying to refine it.
I have to somehow include the count of entries with this boost factor... but I am somehow stuck on that...
You need to generate a random number per row and weight it.
In this case, RAND(CHECKSUM(NEWID())) gets around the "per query" evaluation of RAND. Then simply multiply it by boost and ORDER BY the result DESC. The SUM..OVER gives you the total boost
DECLARE @sample TABLE (id int, boost int)
INSERT @sample VALUES (1, 1), (2, 2), (3, 7)

SELECT
    RAND(CHECKSUM(NEWID())) * boost AS weighted,
    SUM(boost) OVER () AS boostcount,
    id
FROM
    @sample
GROUP BY
    id, boost
ORDER BY
    weighted DESC
If you have wildly different boost values (which I think you mentioned), I'd also consider using LOG (which is base e) to smooth the distribution.
Finally, ORDER BY NEWID() is a randomness that would take no account of boost. It's useful to seed RAND but not by itself.
This sample was put together on SQL Server 2008, BTW
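If you only need a single row per call, here is a usage sketch of the same ordering (it simply reuses the weighting above; I'm not claiming the odds come out exactly proportional to boost):
SELECT TOP (1) id
FROM @sample
ORDER BY RAND(CHECKSUM(NEWID())) * boost DESC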
I dare to suggest a straightforward solution with two queries, using a cumulative boost calculation.
First, select sum of boosts, and generate some number between 0 and boost sum:
select ceil(rand() * sum(boost)) from table;
This value should be stored as a variable, let's call it {random_number}
Then, select the table rows, calculating a cumulative sum of boosts, and find the first row whose cumulative boost is greater than or equal to {random_number}:
SET @cumulative_boost = 0;
SELECT
    id,
    @cumulative_boost := (@cumulative_boost + boost) AS cumulative_boost
FROM
    table
HAVING
    cumulative_boost >= {random_number}
ORDER BY id
LIMIT 1;
My problem was similar: every person had a calculated number of tickets in the final draw. If you had more tickets you would have a higher chance to win "the lottery".
Since I didn't trust any of the solutions I found on the web (rand() * multiplier, or the one with -log(rand())), I wanted to implement my own straightforward solution.
What I did, and what in your case would look a little bit like this:
-- joining on counter <= boost repeats each row "boost" times
SELECT `values`.id, `values`.boost
FROM (SELECT id, boost FROM foo) AS `values`
INNER JOIN (
    SELECT id % 100 + 1 AS counter
    FROM user
    GROUP BY counter) AS numbers ON numbers.counter <= `values`.boost
ORDER BY RAND()
Since I don't have to run it often I don't really care about future performance and at the moment it was fast for me.
Before I used this query I checked two things:
That the maximum boost value is less than the maximum number returned by the numbers query
That the inner query returns ALL numbers between 1..100. It might not, depending on your table!
Since I have all distinct numbers between 1..100, joining on numbers.counter <= values.boost means that a row with a boost of 2 ends up duplicated in the final result, and a row with a boost of 100 ends up in the final set 100 times. In other words: if the sum of boosts is 4212, which it was in my case, you would have 4212 rows in the final set.
Finally I let MySQL sort it randomly.
Edit: For the inner query to work properly, make sure to use a large table, or make sure that the IDs don't skip any numbers. Better yet, and probably a bit faster, you might even create a temporary table which simply has all numbers between 1..n. Then you could simply use INNER JOIN numbers ON numbers.id <= values.boost, as sketched below.
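A sketch of that temporary numbers table (this assumes MySQL 8+ for the recursive CTE; the names are just examples):
CREATE TEMPORARY TABLE numbers (id INT PRIMARY KEY);

-- fill it with 1..100
INSERT INTO numbers (id)
WITH RECURSIVE seq (id) AS (
    SELECT 1
    UNION ALL
    SELECT id + 1 FROM seq WHERE id < 100
)
SELECT id FROM seq;

-- then the join simplifies to:
SELECT f.id, f.boost
FROM foo f
INNER JOIN numbers ON numbers.id <= f.boost
ORDER BY RAND();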