How to sample rows from a table with a specific probability? - sql

I'm using BigQuery at my new position, and I'm totally new to SQL/BigQuery.
I'm testing a machine learning model and monitoring an A/B test with an uneven split, e.g., 3 vs. 10. To compare the A/B results (e.g., the number of page views), I first want to make the group sizes equal so that the comparison is straightforward. For example, say we have a table with 13 records (3 from A and 10 from B), and each row contains an id field. What I want to do is extract only 3 of the 10 B rows so that B's sample count matches A's.
I'm trying to use the FARM_FINGERPRINT function to map the field to integers, then taking ABS and MOD to bring those integers into a fixed range, e.g., [0, 10). That way I would keep roughly 3 out of every 10 rows using the following condition:
MOD(ABS(FARM_FINGERPRINT(field)), 10) < 3
However, I found that even when A and B run exactly the same ML model (only the split ratio differs), the results for A and B come out different, although they should be the same. This made me suspect that the implementation above introduces biased sampling. I also read this post, which confirmed that FARM_FINGERPRINT might not produce a uniformly distributed result.
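As a rough sanity check outside BigQuery (a minimal Python sketch; hashlib.md5 just stands in for FARM_FINGERPRINT and the ids are made up), one can count how many ids land in each bucket:
import hashlib
from collections import Counter
def bucket(value, num_buckets=10):
    # Stand-in for MOD(ABS(FARM_FINGERPRINT(field)), num_buckets); md5 is used only because it ships with the standard library.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16) % num_buckets
ids = ["user_%d" % i for i in range(100000)]  # made-up ids
counts = Counter(bucket(i) for i in ids)
print(counts)  # roughly 10,000 per bucket if the hash spreads values uniformly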
*There's a critical reason why I cannot simply scale B by 3/10; it is confidential, so I cannot disclose it here.
Is there a better way to accomplish equally distributed sampling?
Thank you in advance. (I'm sorry if the question is vague, as I'm hiding the confidential parts.)

Related

LeavePGroupsOut For multidimensional array

I am working on a research problem, and because the data set of subjects is small, I am trying to implement leave-N-out style analyses.
Currently I am doing this ad hoc, and I stumbled upon scikit-learn's LeavePGroupsOut function.
I read the docs, but I am unable to understand how to use it with a multidimensional array.
My data are the following: I have 50 subjects, around 20 entries per subject (not fixed), and 20 features per entry, with a ground-truth value (0 or 1) for every entry.
Well the documentation is actually pretty clear:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeavePGroupsOut.html#sklearn.model_selection.LeavePGroupsOut
In your case you need to stack your data so that you can provide a group index for every entry. Your feature array will then have shape 50*20 data points by 20 features, i.e. (1000, 20), so your group array also needs to have shape (1000,).
Then you need to define the cross validation via
lpgo = LeavePGroupsOut(n_groups=n_groups)
It is important to note that this will produce all possible combinations of left-out test groups.
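A minimal sketch of that setup (illustrative shapes, with a fixed 20 entries per subject for simplicity; the group array is just the subject index repeated once per entry):
import numpy as np
from sklearn.model_selection import LeavePGroupsOut
rng = np.random.default_rng(0)
n_subjects, entries_per_subject, n_features = 50, 20, 20   # illustrative sizes
X = rng.normal(size=(n_subjects * entries_per_subject, n_features))   # (1000, 20)
y = rng.integers(0, 2, size=n_subjects * entries_per_subject)         # ground truth per entry
groups = np.repeat(np.arange(n_subjects), entries_per_subject)        # subject index per entry, shape (1000,)
lpgo = LeavePGroupsOut(n_groups=2)   # leave 2 subjects out of each split
for train_idx, test_idx in lpgo.split(X, y, groups):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # fit and evaluate the model here
    break   # otherwise all C(50, 2) = 1225 combinations are generated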

Understanding Stratified sampling in numpy

I am currently working through an exercise book on machine learning to get my feet wet in the discipline. Right now I am working on a real-estate data set: each instance is a district of California and has several attributes, including the district's median income, which has been scaled and capped at 15. The median-income histogram reveals that most values are clustered around 2 to 5, but some go far beyond 6. The author wants to use stratified sampling, basing the strata on the median income. He offers the following piece of code to create an income-category attribute:
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)
He explains that he divides the median_income by 1.5 to limit the number of categories and that he then keeps only those categories lower than 5 and merges all other categories into category 5.
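For intuition, here is a tiny sketch with made-up incomes (not the actual housing data; the where call is written without inplace, which has the same effect here):
import numpy as np
import pandas as pd
housing = pd.DataFrame({"median_income": [0.8, 1.4, 2.9, 3.2, 4.8, 6.1, 9.5, 15.0]})  # made-up values
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)                        # bins of width 1.5
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)    # cap at category 5
print(housing["income_cat"].tolist())  # [1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 5.0]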
What I don't understand is
Why is it mathematically sound to divide the median_income of each instance to create the strata? What exactly does the result of this division mean? Are there other ways to calculate/limit the number of strata?
How does the division restrict the number of categories and why did he choose 1.5 as the divisor instead of a different value? How did he know which value to pick?
Why does he only want 5 categories and how did he know beforehand that there would be at least 5 categories?
Any help understanding these decisions would be greatly appreciated.
I'm also not sure whether this is the right Stack Overflow category to post this question in, so if I made a mistake by doing so, please let me know what the appropriate forum would be.
Thank you!
You are in the best position to analyze this further based on your data set, but I can help you understand stratified sampling so that you have an idea of what it does.
STRATIFIED SAMPLING: suppose you have a data set of consumers who eat different fruits. One feature is 'fruit type', and this feature has 10 different categories (apple, orange, grapes, etc.). Now if you just sample randomly from the data set, there is a chance that the sample will not cover all the categories, which is very bad when training a model. To avoid that scenario we use stratified sampling, in which we sample from each category (stratum) separately so that every category is represented and no useful data is missed.
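To make this concrete, here is a small sketch with made-up fruit data, using scikit-learn's train_test_split with the stratify option as one way to draw a stratified sample:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
rng = np.random.default_rng(42)
fruits = ["apple", "orange", "grapes", "mango", "kiwi"]
data = pd.Series(rng.choice(fruits, p=[0.4, 0.3, 0.2, 0.08, 0.02], size=1000))  # imbalanced: kiwi is rare
plain = data.sample(n=50, random_state=0)                                        # simple random sample
strat, _ = train_test_split(data, train_size=50, stratify=data, random_state=0)  # stratified sample
print(plain.value_counts())  # a rare category may be missing entirely
print(strat.value_counts())  # every category appears, roughly in its original proportion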
Please let me know if you still have any questions, I would be very happy to help you.

How many Axis can we use in MDX practically?

I heard that there are around 128 axes in MDX.
AXIS(0) or simply 0 – Columns
AXIS(1) or simply 1 – Rows
AXIS(2) or simply 2 – Pages
AXIS(3) or simply 3 – Sections
...
So far I have used only two of them: Columns (0) and Rows (1).
I am just curious about how, where, and when or why I can use the other MDX axes.
As far as I know, SQL Server Management Studio (SSMS) only supports two axes, if I am not wrong.
Thanks.
How :
select ... on 0, ... on 1, ... on 2 and so on .... from [cube]
Where :
Any client that will not crash with unexpected result format ;-)
When / Why :
A client could take advantage of several axes to render the result in 3D using three axes. Even if the client does not render the result in 3D, it might be interesting to ask the server to return the result split over three axes for ad-hoc (or easier) processing.
I do not know of any standard client that supports this.
But a typical application comes to mind: some years ago (before I was working with Analysis Services), we had a client requiring one and the same report for ten countries and five markets across fifty PowerPoint slides. If we had used Analysis Services at that time, we might have written a custom client application that uses a four-dimensional result and thus gets the data for all fifty PowerPoint slides with a single MDX query.
You need not think of OLAP dimensions as dimensions in space. You can also think of them (as the axis name aliases suggest) as, e.g., pages and chapters.

Determine whether there is a subset of size n which has a standard deviation <= s

Given a bunch of numbers, I am trying to determine whether there is a "clump" anywhere where numbers are very densely packed.
To make things more precise, I thought I'd ask a more specific problem: given a set of numbers, I would like to determine whether there is a subset of size n which has a standard deviation <= s. If there are many such subsets, I'd like to find the subset with the lowest standard deviation.
So question #1 : does this formal problem definition effectively capture the intuitive concept of a "clump" of densely packed numbers?
EDIT: I don't actually care about determining which numbers belong to this "clump", I'm much more interested in determining where the clump is centred, which is why I think that specifying n in advance is okay. But feel free to correct me!
And question #2: assuming it does, what is the best way to go about implementing something like this (in particular, I want a solution with the lowest time complexity)? So far I think I have a solution that runs in O(n log n):
First, note that the subset of a given size with the lowest standard deviation must consist of numbers that are consecutive in sorted order. So step 1 is to sort the numbers (this is O(n log n)).
Second, take the first n numbers and compute their standard deviation. If our array of numbers is 0-based, then the first n numbers are [0, n-1]. To get standard deviation, compute s1 and s2 as follows:
s1 = sum of numbers
s2 = sum of squares of numbers
Then, Wikipedia says that the standard deviation is sqrt(n*s2 - s1^2)/n. Record this value as the lowest standard deviation seen so far.
Find the standard deviation of [1, n], [2, n+1], [3, n+2], ... until you hit the last n numbers. Each computation takes only constant time if you keep running totals of s1 and s2: for example, to get the standard deviation of [1, n], subtract the 0th element from the s1 total (and its square from the s2 total), add the nth element (and its square), then recalculate the standard deviation. This means that the entire standard-deviation-calculating portion of the algorithm takes linear time.
So the total time complexity is O(n log n).
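In code, the sliding-window part would look roughly like this (a quick Python sketch of the steps above, assuming the whole list fits in memory):
import math
def densest_window(numbers, n):
    # Returns (lowest standard deviation, start index in sorted order) over all size-n windows.
    xs = sorted(numbers)                                 # step 1: O(n log n)
    s1 = sum(xs[:n])                                     # running sum
    s2 = sum(x * x for x in xs[:n])                      # running sum of squares
    best_sd, best_start = math.sqrt(n * s2 - s1 * s1) / n, 0
    for i in range(1, len(xs) - n + 1):                  # slide the window one element at a time
        out_v, in_v = xs[i - 1], xs[i + n - 1]
        s1 += in_v - out_v
        s2 += in_v * in_v - out_v * out_v
        sd = math.sqrt(max(n * s2 - s1 * s1, 0.0)) / n   # clamp tiny negatives from floating-point error
        if sd < best_sd:
            best_sd, best_start = sd, i
    return best_sd, best_start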
Is my assessment right? Is there a better way to do this? I really need this to run fast on fairly large sets, so the faster the better! Space is less of an issue (I think).
Having worked recently on a similar problem, I think both the definition of the clump and the proposed implementation are reasonable.
Another reasonable definition would be to find the minimum of all the ranges of n numbers. Thus, given that the list of numbers x is sorted, one would just find the minimum of x[n]-x[1], x[n+1]-x[2], etc. This would be slightly quicker than finding the standard deviation because it would avoid the multiplications and square roots. Indeed, you can avoid the square roots even when looking for the lowest standard deviation by finding the minimum variance (the square of the standard deviation), rather than the sd itself.
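A quick sketch of that range-based variant (Python, sorting first as above):
def tightest_window_by_range(numbers, n):
    # Start index (in sorted order) of the size-n window with the smallest spread max - min.
    xs = sorted(numbers)
    return min(range(len(xs) - n + 1), key=lambda i: xs[i + n - 1] - xs[i])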
A caution would be that the location of the biggest clump might be quite sensitive to the choice of n. If there is an a priori reason to select a particular n, that won't be a problem. If not, however, it might require some experimentation to select the value of n that fairly reliably finds the clumps you are looking for, whether you are selecting by range or by standard deviation. Some ideas on this can be found in Chapter 6 of the online book ABC of EDA.

comparing/intersecting compare criteria

If there's any open source code that does this already I'm interested in hearing about it. But I haven't seen it yet so I'm trying to roll my own.
Example:
variable x = compareCriteriaBetween 3 and 6
variable y = compareCriteriaLesserThanOrEqual 5
The difficult part for me is finding an elegant way to compare the compareCriteria and create an intersection. In the example the intersection between the two is 'between 3 and 5'.
How can I implement this in a 'tell don't ask' manner? Note that compareCriteria can be completely unrelated (e.g., startsWithLetter versus betweenNumber).
If you only have constants in your expressions, you should be safe from undecidability (I think!). Problems arise as soon as you can express, e.g., general statements about integers with +, -, *, / (see Peano arithmetic).
Even if you stay within the realm of decidability, there exists no algorithm that can take arbitrary statements P(x) and Q(x) and compute a statement R(x) equivalent to P(x) & Q(x) for all x, where x can range over any domain (integers, strings, matrices, real numbers, complex numbers, logical statements [whoops, back into undecidable territory!], ...). You need domain specific tricks to get there, and strictly delimited languages in which P, Q and R are formulated. There exist software products for certain domains -- one of them is called Mathematica...
Try to get back to basics: what problem are you trying to solve?
If you are only interested in simple criteria like less-equal or between on integers/floats, you can rewrite between 3 and 6 as (greater equal 3 and less equal 6). Combining this with less equal 5 under a logical AND, Boolean algebra gives (greater equal 3 and (less equal 6 and less equal 5)); simplifying the inner parenthesis to just less equal 5 lets you rewrite the result as between 3 and 5.
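For the simple interval case, here is a minimal Python sketch (the class name Between is hypothetical) written in a 'tell, don't ask' style, where a criterion computes the intersection itself instead of exposing its bounds for someone else to inspect:
from dataclasses import dataclass
from typing import Optional
@dataclass(frozen=True)
class Between:
    # Closed interval criterion; None means unbounded on that side.
    low: Optional[float] = None
    high: Optional[float] = None
    def intersect(self, other):
        # "Tell, don't ask": the criterion itself produces the combined criterion.
        low = max((b for b in (self.low, other.low) if b is not None), default=None)
        high = min((b for b in (self.high, other.high) if b is not None), default=None)
        if low is not None and high is not None and low > high:
            return None  # empty intersection: the criteria are incompatible
        return Between(low, high)
x = Between(3, 6)      # compareCriteriaBetween 3 and 6
y = Between(high=5)    # compareCriteriaLesserThanOrEqual 5
print(x.intersect(y))  # Between(low=3, high=5), i.e. "between 3 and 5"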