Visually/graphically compare intervals on a number line

I created a simple visualization to compare two intervals, A and B, using a number line:
Is there a better way to visualize this comparison? I thought about putting A and B both on top of the number line:
But now my concern is that it looks like B has a higher value along some hidden y-axis. Are there existing mechanisms to compare intervals? It seems like a common need.

Related

How to sample rows from a table with a specific probability?

I'm using BigQuery at my new position, and I'm totally new to SQL/BigQuery.
I'm testing a machine learning model and monitoring an A/B test with an uneven split between the arms, e.g., 3 vs. 10. To compare the A/B results, e.g., the number of page views, I want to equalize the ratios first so that the comparison is easy. For example, say we have a table with 13 records (3 from A and 10 from B), and each row also contains an id field. What I want to do is extract only 3 of the 10 B rows so that the sample size matches A.
I'm trying to use the FARM_FINGERPRINT function to map a field to an integer, then taking ABS and applying MOD to bring the integers into a fixed range, e.g., [0, 10). Eventually, I would like to keep 3 in every 10 rows using the following line:
MOD(ABS(FARM_FINGERPRINT(field)), 10) < 3
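For context, a minimal sketch of where that filter sits in a BigQuery Standard SQL query; the table and column names (my_dataset.events, variant, id) are placeholders:

-- Keep roughly 3 out of every 10 B rows, deterministically per id (placeholder names).
SELECT *
FROM `my_dataset.events`
WHERE variant = 'B'
  AND MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), 10) < 3;

If an exact 3-in-10 count matters more than a stable, reproducible sample, ranking rows with ROW_NUMBER() OVER (ORDER BY RAND()) and keeping the lowest ranks is another common approach.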
However, I found that even when A and B run exactly the same ML model, just with different ratios, the results differ between A and B (they should be the same, since only the ratio differs). This made me suspect that the implementation above introduces biased sampling. I also read this post, which suggests that FARM_FINGERPRINT might not produce a uniformly distributed result.
*There is a critical reason why I cannot simply multiply B by 3/10; it is confidential and I cannot disclose it here.
Is there a better way to accomplish the equally distributed sampling?
Thank you in advance. (I'm sorry if the question is vague, as I'm hiding the confidential parts.)

How to identify records which have clusters or lumps in data?

I have a Tableau table as follows:
This data can be visualized as follows:
I'd like to flag cases that have lumps/clusters. This would flag items B, C and D because there are spikes only in certain weeks of the 13 weeks. Items A and E would not be flagged as they mostly have a 'flat' profile.
How can I create such a flag in Tableau or SQL to isolate this kind of a case?
What I have tried so far:
I've tried a logic where for each item I calculate the MAX and MEDIAN. Items that need to be flagged will have a larger (MAX - MEDIAN) value than items that have a fairly 'flat' profile.
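A rough sketch of that MAX/MEDIAN logic in SQL, in case it helps to see it outside Tableau; the table and column names are assumed, and the PERCENTILE_CONT syntax shown is SQL Server style, which varies by database:

with item_stats as (
    select distinct
           item,
           max(val) over (partition by item) as max_val,
           percentile_cont(0.5) within group (order by val)
               over (partition by item)      as median_val
    from weekly_values   -- assumed: one row per item per week, weekly value in val
)
select item,
       case when max_val - median_val > 4    -- 4 = assumed spike threshold
            then 1 else 0 end as lump_flag
from item_stats;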
Please let me know if there's a better way to create this flag.
Thanks!
I agree with the other commenters that this question could be answered in many different ways and that you might need a PhD in Stats to come up with an ideal answer. However, given your basic requirements, this might be the easiest/simplest solution you can implement.
Here is what I did to get here:
Create a parameter to define your "spike". If it is always going to be a fixed number you can hardcode it in your formulas. I called mine "Min Spike Value".
Create a formula for the Median Values in each bucket: {FIXED [Buckets]: MEDIAN([Values])} (A, B, ..., E = "Buckets"). This gives you one value for each letter/bucket that you can compare against.
Create a formula to calculate the difference of each number against the median. abs(sum([Values])-sum([Median Values])). We use the absolute value here because a spike can either be negative or positive (again, if you want to define it that way...). I called this "Spike to Current Value abs difference"
Create a calculated field that evaluates to a boolean to see if the current value is above the threshold for a spike. [Spike to Current Value abs difference] > min([Min Spike Value])
Set up your viz to use this boolean to highlight the spikes. The beauty of the parameter is that you can change the value for what counts as a spike and the highlighting will update accordingly. Above, the value was 4, but if you change it to 8:
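If you would rather build the same flag in SQL (the question allows either), the median-difference steps above translate roughly as follows; the table and column names are assumed and the PERCENTILE_CONT syntax varies by database:

with with_median as (
    select item,
           week_num,
           val,
           percentile_cont(0.5) within group (order by val)
               over (partition by item) as median_val
    from weekly_values
)
select item,
       max(case when abs(val - median_val) > 4   -- 4 = the "Min Spike Value" parameter
                then 1 else 0 end) as has_spike
from with_median
group by item;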

How might one most efficiently calculate contingent values?

Suppose that I have 10 values n_1, n_2, ..., n_10 and that, given any one of these values, the other 9 can be calculated. Let f_i(n_j) be the function that calculates the value n_i from the value of n_j (where i != j). These functions are relatively simple (i.e., they contain no more than a few exponential functions or powers).
In terms of the functions used, what would be the most efficient way of creating a program to calculate the other 9 values of n_1, ..., n_10 given the one that is initially known?
Would the best option be to minimize the number of functions used (and thus minimize the number of lines of code), or to create a function defining every single mapping?
For example, would it be most efficient to use only the 18 functions
f_1(n_2), f_1(n_3), ..., f_1(n_10) [1]
f_2(n_1), f_3(n_1), ..., f_10(n_1) [2]
And then, for whatever input is provided by the user, the value of n_1 may be calculated using the relevant function from line [1], from which every other value of interest may be calculated using the functions from line [2]?
Or would it be better to define all 90 mappings, so that only a single function (rather than two) must be called to calculate each of the other 9 values?
Edit: The specific result that I am trying to achieve is as follows...
I am currently using VBA, with a user form of the following format:
The conversion frequency is a required field (so let's just say, for example, that it is always equal to 2 and forget about it). I want to use on-change events so that whenever the user changes any of the 6 fields below the conversion frequency field, the other 5 fields are auto-filled with the correct values. However, since the user need only update any one of the six fields, with the other 5 being calculated from it, we would require 6^2 - 6 = 30 different functions to do these calculations. We would thus end up with a lot of repetitive code.
My question regards the best practices to follow when working with a form where one of many inputs may be provided, and all other fields must be updated as a result of the input provided and its value.
Or, equivalently, is there a way to update all fields when the value of one field changes? Can this be done without the number of lines of code required increasing exponentially as the number of fields increases?
I think you are grossly overthinking this. Think of this in terms of the formulas you need, which I believe number 6: six functions that take 5 inputs each:
calculateEIR(nominalInterestRate, ForceOfInterest, DiscountFactor, EffectiveDiscountRate, NominalDiscountRate)
calculateNIR(EffectiveInterestRate, ForceOfInterest, DiscountFactor, EffectiveDiscountRate, NominalDiscountRate)
' and so on...
The event handlers and the code to calculate the values are their own thing. Your on-change event handlers simply need to call the correct methods; this is 6 event handlers calling 5 of the 6 calculation methods each, so 12 functions in total if you want to keep count. It's a lot of copypasta. For example:
Private Sub textEffectiveInterestRate_Change()
    Me.textNominalInterestRate.Value = calculateNIR(Me.textEffectiveInterestRate.Value, Me.textForceOfInterest.Value, etc...)
    Me.textForceOfInterest.Value = calculateForceOfInterest(Me.textEffectiveInterestRate.Value, Me.textNominalInterestRate.Value, etc...)
    ' And every other function aside from calculateEIR()
End Sub
I am unsure about the specifics of how you are changing all the values based on a change in the others (since I don't know the formulas), but in general, you should not in any way need 30 functions...

How can I perform aggregate function logic across multiple fields?

How can I get max() of three dimensions to come from the same record?
Description:
I have a large list of widgets, with multiple attributes, from multiple sources. Think manual data entry, where you have the same stuff being entered by different people, and then you need to consolidate differences. Though, instead of auditing each difference, I just want to perform some logic to choose a value over another under certain criteria.
An analogous example: if source a says widget xyz weighs 3 pounds, and source b says it weighs 4 pounds, I just blindly take the 4, as it is greater; say I need to be overcautious for packing/shipping purposes. That is easy: I choose MAX().
Now, I have a group of attributes that are in separate fields but related. Think dimensions of a box. There are width/length/height fields. If one source says the 'dimensions' are 2x3x4, and another says they are 3x3x4, I need to take the larger, for the same reason as above. Also sounds like MAX(), except...
My sources disagree on which is the width, height, or length. A 2x3x4 box could be entered 4x3x2, or 2x4x3, depending on how the source was looking at it. If I took the MAX of 3 such sources, I would end up with 4x4x4, even though all 3 sources measured it correctly. This is undesirable.
How do I take the greatest 'measurement' value, but make sure all three values come from the same record?
If 'greatest' is impossible, we could settle for unique... except there is a fourth source, which has 0x0x0 for about 40% of the widgets. I can't leave a 0x0x0 if any of the other sources did in fact measure that widget.
Some sample data:
ID,widget_name,height,width,leng
(a1,widget3,2,3,4)
(b1,widget3,2,4,3)
(c1,widget3,4,3,2)
(d1,widget3,0,0,0)
Output should be (widget3,4,3,2).
You could use ROW_NUMBER instead of GROUP BY, like:
select *
from (
    select ID, widget_name, height, width, leng,
           ROW_NUMBER() over (partition by widget_name
                              order by height + width + leng desc) as rowid
    from yourTable
) as t
where rowid = 1

How to replace the missing data from AMELIA results

I have run an Amelia imputation for a data set containing missing data. I need to replace the missing points with the results of amelia(), but it contains 5 groups of imputed values. How can I choose the best one to replace the missing values (so I can plot a graph of the data set after imputing)?
You use all 5.
You have to perform whatever you wanted to do with the data on all 5 sets of data and then combine the results of that.
I.e., you run a t-test on all 5 datasets and then combine the results... somehow. I have not yet looked into that, but from what I have heard you can use the Zelig R package to do it somewhat easily. I also noted references to papers that describe methods for combining those, but have not looked into them either: King et al. (2001) and Schafer (1997).
My guess is that you just average out the p-values gained from the analysis?
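For reference, and only as a sketch of the standard approach: Rubin's rules (covered in Schafer 1997) pool the per-dataset estimates and their variances rather than the p-values directly. With m = 5 imputations giving estimates Q_1, ..., Q_m and within-imputation variances U_1, ..., U_m, the pooled estimate is Q_bar = (1/m) * sum_i Q_i and the total variance is T = U_bar + (1 + 1/m) * B, where U_bar is the average of the U_i and B = (1/(m-1)) * sum_i (Q_i - Q_bar)^2 is the between-imputation variance; the final test statistic is then built from Q_bar and T.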