SQL: perform undersampling to select a subset of majority class

I have a table that looks like the following:
user_id    target
1278       1
9809       0
3345       0
9800       0
1298       1
1223       0
My goal is to perform undersampling, which means that I want to randomly select a subset of the users that have a target of 0 while keeping all users that have a target of 1. I have tried the following code, but since the user_ids are all unique, it doesn't randomly remove rows with a target of 0. Any idea what I need to do?
select *
from (select user_id, target,
             row_number() over (partition by user_id, target order by rand()) as seq
      from dataset.mytable
     ) a
where target = 1 or seq = 1

One method uses window functions:
select t.* except (seqnum, cnt1)
from (select t.*,
             row_number() over (partition by target order by rand()) as seqnum,
             countif(target = 1) over () as cnt1
      from t
     ) t
where seqnum <= cnt1;
For target = 1 rows, seqnum never exceeds cnt1, so all of them are kept; for target = 0 rows, only a random cnt1 of them pass the filter. The above might have performance problems -- or even exceed resources because of the large volume of data being sorted. An approximate method might also work for your purposes:
select t.* except (cnt, cnt1)
from (select t.*,
             count(*) over (partition by target) as cnt,
             countif(target = 1) over () as cnt1
      from t
     ) t
where rand() < cnt1 * 1.0 / cnt;
This is not guaranteed to produce exactly the same numbers of 0 and 1, but the numbers will be quite close.
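If the approximation matters for your use case, it is easy to check how balanced the result actually came out. A minimal sketch, assuming you materialize the sampled rows into a hypothetical table `dataset.sampled`:
-- count rows per class; the two counts should be roughly equal
select target, count(*) as n
from `dataset.sampled`
group by target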

Consider the below approach - it keeps all target=1 rows and ~50% of target=0 rows:
select *
from `dataset.mytable`
where if(target = 1, true, rand() < 0.5)
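If you'd rather not hard-code the 0.5, the keep-probability for the majority class can be computed from the table itself. A sketch of the same approach using a scalar subquery (it assumes the table contains at least one row of each class):
select *
from `dataset.mytable`
where if(target = 1,
         true,
         rand() < (select countif(target = 1) * 1.0 / countif(target = 0)
                   from `dataset.mytable`))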


SQL - delete record where sum = 0

I have a table which has the below values:
If the sum of the values with the same ID is 0, I want to delete them from the table. So the result should look like this:
The code I have:
DELETE FROM tmp_table
WHERE ID IN
    (SELECT ID
     FROM tmp_table WITH(NOLOCK)
     GROUP BY ID
     HAVING SUM(value) = 0)
Only deletes rows with ID = 2.
UPD: Including an additional example. The rows in yellow need to be deleted:
Your query is working correctly: the only group whose total is zero is id 2. The other ids have sub-groups which total zero (such as the first two rows with id 1), but the total across all of their records is -3.
What you're wanting is a much more complex algorithm to do "bin packing" in order to remove the sub groups which sum to zero.
You can do what you want using window functions -- by enumerating the values for each id. Taking your approach using a subquery:
with t as (
      select t.*,
             row_number() over (partition by id, value order by id) as seqnum
      from tmp_table t
     )
delete from t
where exists (select 1
              from t t2
              where t2.id = t.id and t2.value = - t.value and t2.seqnum = t.seqnum
             );
You can also do this with a second layer of window functions:
with t as (
      select t.*,
             row_number() over (partition by id, value order by id) as seqnum
      from tmp_table t
     ),
     tt as (
      select t.*,
             count(*) over (partition by id, abs(value), seqnum) as cnt
      from t
     )
delete from tt
where cnt = 2;
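To see how the pairing works, here is a minimal, self-contained illustration of the second variant (SQL Server syntax; the table contents are invented): a +N and a -N with the same seqnum land in the same (id, abs(value), seqnum) partition, get cnt = 2, and are deleted.
-- hypothetical sample data, for illustration only
CREATE TABLE tmp_table (id int, value int);
INSERT INTO tmp_table (id, value)
VALUES (1, 3), (1, -3), (1, -3), -- (1, 3) pairs with the first (1, -3)
       (2, 5), (2, -5);          -- the id 2 rows cancel completely

WITH t AS (
      SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY id, value ORDER BY id) AS seqnum
      FROM tmp_table t
     ),
     tt AS (
      SELECT t.*,
             COUNT(*) OVER (PARTITION BY id, abs(value), seqnum) AS cnt
      FROM t
     )
DELETE FROM tt
WHERE cnt = 2;
-- only the unmatched (1, -3) survives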

How to get the top N percent (e.g., 50%) of a table in BigQuery (standard SQL)?

I have tried the following approaches, none of which worked:
Using SELECT TOP 50 PERCENT: BigQuery does not have a TOP function.
Using LIMIT (SELECT COUNT(*) FROM table)/2: BigQuery does not accept a non-literal value in LIMIT.
Using SET to set the median value and then using WHERE.
In BigQuery I would use the window function percent_rank().
select t.* except (prnk)
from (select t.*,
             percent_rank() over (order by id) as prnk
      from mytable t
     ) t
where prnk <= 0.5
Note: any answer to your question will require that you provide a column to order your data. I assumed that this column is called id.
One method uses window functions:
select t.* except (seqnum, cnt)
from (select t.*,
             row_number() over (order by ?) as seqnum,
             count(*) over () as cnt
      from t
     ) t
where seqnum <= cnt / 2;
Another possibility would be to limit the data with a WHERE clause instead of LIMIT. This is an example if you want to filter by an ID (it assumes the ids are consecutive, starting at 1):
SELECT * FROM table_name as t
WHERE t.id <= (SELECT COUNT(*) FROM table_name)/2;
And if you want to filter by the row number:
SELECT t.* except (rn)
FROM (SELECT t.*, ROW_NUMBER() OVER () AS rn
      FROM table_name AS t
     ) AS t
WHERE t.rn <= (SELECT COUNT(*) FROM table_name)/2;
To scale up, you can use an approx algorithm to find the 50% point:
DECLARE mid_date TIMESTAMP DEFAULT (
  SELECT APPROX_QUANTILES(creation_date, 2)[OFFSET(1)] mid_date
  FROM `fh-bigquery.stackoverflow_archive.201909_posts_answers`
);

SELECT mid_date
  , COUNTIF(creation_date < mid_date) first_half
  , COUNTIF(creation_date > mid_date) second_half
FROM `fh-bigquery.stackoverflow_archive.201909_posts_answers`
Looks like it works well: both halves come out almost equal in size.
Now let's get these records out:
CREATE TABLE `temp.fifty_percent`
AS
SELECT *
FROM `fh-bigquery.stackoverflow_archive.201909_posts_answers`
WHERE creation_date < (
  SELECT APPROX_QUANTILES(creation_date, 2)[OFFSET(1)] mid_date
  FROM `fh-bigquery.stackoverflow_archive.201909_posts_answers`
)
This method will happily scale, while solutions using OVER(ORDER BY) won't.
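The same idea works for any orderable column, not just timestamps. A minimal sketch against a hypothetical table `dataset.mytable` with a numeric column id (names assumed, not from the question):
SELECT *
FROM `dataset.mytable`
WHERE id <= (
  SELECT APPROX_QUANTILES(id, 2)[OFFSET(1)]
  FROM `dataset.mytable`
)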

Aggregate function to detect trend in PostgreSQL

I'm using a psql DB to store a data structure like so:
datapoint(userId, rank, timestamp)
where timestamp is the Unix Epoch milliseconds timestamp.
In this structure I store the rank of each user each day, so it's like:
UserId  Rank  Timestamp
1       1     1435366459
1       2     1435366458
1       3     1435366457
2       8     1435366456
2       6     1435366455
2       7     1435366454
So, in the sample data above, userId 1 is improving its rank with each measurement (a lower rank is better), which means it has a positive trend, while userId 2 is dropping in rank, which means it has a negative trend.
What I need to do is to detect all users that have a positive trend based on the last N measurements.
One approach would be to perform a linear regression on each user's rank and check whether the slope is positive or negative. Luckily, PostgreSQL has a built-in function to do that - regr_slope:
SELECT user_id, regr_slope (rank1, timestamp1) AS slope
FROM my_table
GROUP BY user_id
This query gives you the basic functionality. Now you can dress it up a bit with case expressions if you like (note that because a lower rank is better, a user who is improving actually gets a negative slope here; swap the labels if you want 'positive' to mean improving):
SELECT user_id,
       CASE WHEN slope > 0 THEN 'positive'
            WHEN slope < 0 THEN 'negative'
            ELSE 'steady'
       END AS trend
FROM (SELECT user_id, regr_slope (rank1, timestamp1) AS slope
      FROM my_table
      GROUP BY user_id) t
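To make the slope's sign concrete, here is a self-contained run against the sample data from the question (rows typed inline; column names as in the answer):
SELECT user_id, regr_slope(rank1, timestamp1) AS slope
FROM (VALUES (1, 1, 1435366459), (1, 2, 1435366458), (1, 3, 1435366457),
             (2, 8, 1435366456), (2, 6, 1435366455), (2, 7, 1435366454)
     ) AS t(user_id, rank1, timestamp1)
GROUP BY user_id
-- user 1 gets a negative slope (rank number falling over time = improving),
-- user 2 a positive one (rank number rising = dropping)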
Edit:
Unfortunately, regr_slope doesn't have a built-in way to handle "top N" type requirements, so this should be handled separately, e.g., by a subquery with row_number:
-- Decoration outer query
SELECT user_id,
       CASE WHEN slope > 0 THEN 'positive'
            WHEN slope < 0 THEN 'negative'
            ELSE 'steady'
       END AS trend
FROM (-- Inner query to calculate the slope
      SELECT user_id, regr_slope (rank1, timestamp1) AS slope
      FROM (-- Inner query to get top N
            SELECT user_id, rank1, timestamp1,
                   ROW_NUMBER() OVER (PARTITION BY user_id
                                      ORDER BY timestamp1 DESC) AS rn
            FROM my_table) t
      WHERE rn <= N -- Replace N with the number of rows you need
      GROUP BY user_id) t2
You can use analytic functions for this. Overall approach:
compute the previous rank using lag()
use case to decide whether the trend is positive or not (0 or 1)
use min() to get the minimum trend over the preceding N rows; if the trend was positive for N rows, this returns 1, otherwise 0. To limit it to N rows, use the between N preceding and 0 following clause of the windowing function
Code:
select v2.*,
       min(positive_trend) over (partition by userid order by timestamp1
                                 rows between 3 preceding and 0 following) as trend_overall
from (
      select v1.*,
             (case when prev_rank < rank1 then 0 else 1 end) as positive_trend
      from (
            select userid,
                   rank1,
                   timestamp1,
                   lag(rank1) over (partition by userid order by timestamp1) as prev_rank
            from t1
            order by userid, timestamp1
           ) v1
     ) v2
UPDATE
To only get each userid with the overall trend and the delta for the rank, you'll have to add another call to lag(.., N+1) to get the Nth previous rank, and row_number() to get a numbering within the same userid:
select v3.userid, v3.trend_overall, delta_rank
from (
      select v2.*,
             min(positive_trend) over (partition by userid order by timestamp1
                                       rows between 3 preceding and 0 following) as trend_overall,
             latest_rank - prev_N_rank as delta_rank
      from (
            select v1.*,
                   (case when prev_rank < rank1 then 0 else 1 end) as positive_trend,
                   max(case when v1.rn = 1 then rank1 else NULL end) over (partition by userid) as latest_rank
            from (
                  select userid,
                         rank1,
                         timestamp1,
                         lag(rank1) over (partition by userid order by timestamp1) as prev_rank,
                         lag(rank1, 4) over (partition by userid order by timestamp1) as prev_N_rank,
                         row_number() over (partition by userid order by timestamp1 desc) as rn
                  from t1
                  order by userid, timestamp1
                 ) v1
           ) v2
     ) v3
where rn = 1
group by userid, trend_overall, delta_rank
order by userid, trend_overall, delta_rank

How do I filter the top 1% and lower 1% of data in each group in SQL

I have a data set that includes PRICE, SUBTYPE, and other columns. I want to do some outlier removal before I use the dataset: within each SUBTYPE, I want to remove rows where the price is ridiculously high or low.
For each SUBTYPE, look at the range of the PRICEs and remove or filter out rows.
Keep rows that fall between PRICErange * .01 and PRICErange * .99.
This was provided to me by Martin Smith on Stack Overflow. I edited this question, so let's start from here.
;WITH CTE
AS (SELECT *,
           ROW_NUMBER() OVER (PARTITION BY SUBTYPE ORDER BY PRICE) AS RN,
           COUNT(*) OVER (PARTITION BY SUBTYPE) AS Cnt
    FROM all_resale)
SELECT *
FROM CTE
WHERE (CASE WHEN Cnt > 1 THEN 100.0 * (RN - 1) / (Cnt - 1) END) BETWEEN 1 AND 99
I'm not sure this is what I need to do. I don't know how many rows will be removed off the ends.
You don't specify exactly how you define the 1 percent and how ties should be handled.
One way is below:
;WITH CTE
AS (SELECT *,
           ROW_NUMBER() OVER (PARTITION BY SUBTYPE ORDER BY PRICE) AS RN,
           COUNT(*) OVER (PARTITION BY SUBTYPE) AS Cnt
    FROM all_resale)
SELECT *
FROM CTE
WHERE (CASE WHEN Cnt > 1 THEN 100.0 * (RN - 1) / (Cnt - 1) END) BETWEEN 1 AND 99
That assumes the highest-priced item is at 100%, the lowest-priced one at 0%, and all others scaled evenly in between, taking no account of ties. If you need to take account of ties, look into RANK rather than ROW_NUMBER.
NB: If all of the subtypes have a relatively large number of rows, you could use NTILE(100) instead, but it does not distribute rows between buckets well if the number of rows is small relative to the number of buckets; a sketch follows.
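A minimal sketch of that NTILE variant, under the same assumptions (SQL Server syntax, all_resale as in the question); bucket 1 holds the cheapest 1% and bucket 100 the most expensive 1%:
;WITH CTE
AS (SELECT *,
           NTILE(100) OVER (PARTITION BY SUBTYPE ORDER BY PRICE) AS Pct
    FROM all_resale)
SELECT *
FROM CTE
WHERE Pct BETWEEN 2 AND 99 -- drop the bottom and top buckets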

SQL: If no rows found, do another search?

I need to write a query which always returns something, even if nothing satisfies the conditions, something like this:
SELECT * FROM table WHERE date > NOW() ORDER BY date
IF none is returned THEN
SELECT FIRST FROM table ORDER BY date
So only numbers greater than 10 should be returned; if none are, return any number.
Any ideas how to do it?
Here is one way, using union all:
select *
from table
where number > 10
union all
(select *
 from table
 where number > 0 and
       not exists (select * from table where number > 10)
 limit 1
)
If you are using a reasonable version of SQL, you need window functions for this. You could do something like:
select t.*
from (select t.*,
             max(number) over () as maxnumber,
             row_number() over (order by number desc) as seqnum
      from table t
     ) t
where (maxnumber > 10 and number > 10) or seqnum = 1
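As a quick check of that approach, here is the same query against inline sample data where nothing exceeds 10 (PostgreSQL syntax; the rows are invented for illustration):
select number
from (select number,
             max(number) over () as maxnumber,
             row_number() over (order by number desc) as seqnum
      from (values (4), (7), (9)) as nums(number)
     ) t
where (maxnumber > 10 and number > 10) or seqnum = 1
-- no value exceeds 10, so only the single largest row (9) comes back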