I am trying to recreate this R logic within a SQL query. Any ideas on how I should go about doing so? Appreciate any assistance at all

This is the R script that I am attempting to recreate using a CASE WHEN statement in SQL:
dat[ ,X_1_7_Spline := pmax(1,pmin(ifelse(is.na(X),1,X),7))]
It seems that this command is telling the parser to return the parallel maxima of a vector containing a conditional statement as long as the value of variable X lies between 1 and the parallel minima of some value and 7 (as long as the value is not null). It then seems to join the new column containing these values back to the original dataset (dat). I am having some troubles representing the "pmax(1,pmin(ifelse(is.na(X),1,X),7))" portion of the code in my SQL query and would appreciate any ideas on how I might be able to do this effectively.
I have something very rudimentary right now, which I know does not express the above statement properly:
CASE WHEN MAX(IF(ISNOTNULL(X) AND MIN(X)=1 AND MAX(X)=7) then 1 else X end as X_1_7_Spline
Any thoughts/feedback would be greatly appreciated as I am still trying to understand the R script. Thanks in advance for any insight on this issue.

ifelse(is.na(X),1,X) can be translated into SQL's COALESCE(X, 1); and
pmin and pmax logic can be placed in a CASE WHEN (as you've started)
Perhaps this?
CASE WHEN X < 1 THEN 1
WHEN X > 7 THEN 7
ELSE coalesce(X, 1) END as NewX
We don't need to coalesce X in the X < 1 or X > 7 tests: NULL < 1 does not evaluate to true, so a NULL X matches neither WHEN branch and falls through to the ELSE, where coalesce(X, 1) turns it into 1.
Demo in R using sqldf:
library(data.table)
dat <- data.table(X = c(-1,5,9,NA))
dat[, X_1_7_Spline := pmax(1,pmin(ifelse(is.na(X),1,X),7)) ]
sqldf::sqldf("select *, (CASE WHEN X < 1 THEN 1 WHEN X > 7 THEN 7 ELSE coalesce(X,1) END) as NewX from dat")
#    X X_1_7_Spline NewX
# 1 -1            1    1
# 2  5            5    5
# 3  9            7    7
# 4 NA            1    1
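The same check can be run outside R. The sketch below is only an illustration: it assumes SQLite (through Python's built-in sqlite3 module) as the SQL engine, and `clamp_1_7` is a hypothetical helper name that restates the R expression for a single value. It runs the CASE WHEN expression on the four sample values and compares each result with the helper.

```python
import sqlite3

def clamp_1_7(x):
    """Restates pmax(1, pmin(ifelse(is.na(X), 1, X), 7)) for one value."""
    if x is None:                 # is.na(X) -> 1
        return 1
    return max(1, min(x, 7))      # clamp into [1, 7]

# Same sample values as the R demo; None plays the role of NA / NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dat (X REAL)")
conn.executemany("INSERT INTO dat VALUES (?)", [(-1,), (5,), (9,), (None,)])
rows = conn.execute(
    "SELECT X, CASE WHEN X < 1 THEN 1 WHEN X > 7 THEN 7 "
    "ELSE COALESCE(X, 1) END AS NewX FROM dat ORDER BY rowid"
).fetchall()
conn.close()

# True if the SQL expression and the Python restatement agree on every row.
agree = all(new_x == clamp_1_7(x) for x, new_x in rows)
print(rows, agree)
```

Other engines should treat NULL comparisons the same way, but that is worth confirming on the target database.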


how to convert function output into list, dict or as data frame?

My issue is that I don't know how to use the output of a function properly. The output contains multiple lines (j = column, i = test result).
I want to use the output for rules in other functions (e.g. if test result i > 5, then do something).
I have a function with two loops. The function goes through every column and tests something. This works fine.
def test():
    scope = range(10)
    scope2 = range(len(df1.columns))
    for (j) in scope2:
        for (i) in scope:
            if df1.iloc[:,[j]].shift(i).loc[selected_week].item() > df1.iloc[:,[j]].shift(i+1).loc[selected_week].item():
                i + 1
            else:
                print(j,i)
                break
Output:
test()
1 0
2 3
3 3
4 1
5 0
6 6
7 0
8 1
9 0
10 1
11 1
12 0
13 0
14 0
15 0
I tried to convert it to a list, a DataFrame, etc., but I am missing something.
What is the best way to do that?
Thank you!
A fix of your code would be:
import pandas as pd

def test():
    out = []
    scope = range(10)
    scope2 = range(len(df1.columns))
    for j in scope2:
        for i in scope:
            if df1.iloc[:,[j]].shift(i).loc[selected_week].item() <= df1.iloc[:,[j]].shift(i+1).loc[selected_week].item():
                # collect what the original printed (j, i), then stop as the original did
                out.append([j, i])
                break
    return pd.DataFrame(out, columns=['j', 'i'])
out = test()
But you probably don't want to use loops, as they are slow. Please clarify what your input is with a minimal reproducible example, and what you are trying to achieve (expected output and logic); we can probably turn this into a vectorized solution.
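The core of the fix is to return the results instead of printing them, so other functions can apply rules to them. Here is a minimal pure-Python sketch of that pattern. It assumes each column is a plain list of numbers and that `pos` is the position of the selected week; `rising_streaks`, `columns` and `pos` are hypothetical stand-ins, not the asker's df1 or selected_week. Unlike the original loop, it also records a result when all `max_lag` comparisons pass.

```python
def rising_streaks(columns, pos, max_lag=10):
    """For each column, count how many consecutive steps back from `pos`
    the value is larger than the one before it.

    Returns a list of (column_index, streak_length) tuples instead of printing.
    """
    results = []
    for j, values in enumerate(columns):
        streak = 0
        for i in range(max_lag):
            cur, prev = pos - i, pos - i - 1
            if prev < 0 or values[cur] <= values[prev]:
                break                 # streak ended, or no more history
            streak += 1
        results.append((j, streak))
    return results

# Hypothetical data: two columns, evaluated at the last position.
columns = [[1, 2, 3, 4], [5, 3, 4, 2]]
streaks = rising_streaks(columns, pos=3)

# A rule applied elsewhere, e.g. "columns whose streak is longer than 2".
long_streaks = [j for j, i in streaks if i > 2]
print(streaks, long_streaks)
```

Because the function returns a list of tuples, the result can be filtered, turned into a dict with `dict(streaks)`, or passed to a DataFrame constructor if pandas is available.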

SQL: handling every bit without running the query repeatedly

I have a column that uses bits to record the status of every mission. The index of a bit is the mission number, and 1/0 indicates whether that mission was successful; the bits are logically independent even though they are stored together.
For instance, 1010 (stored as a decimal number) means a user finished the 2nd and 4th missions successfully. The table looks like:
uid status
a 1100
b 1111
c 1001
d 0100
e 0011
Now I need to calculate, for every mission, how many users passed it. E.g. for mission 1 it's 0+1+1+0+1 = 3, while for mission 2 it's 0+1+0+0+1 = 2.
I can use the formula FLOOR(status%POWER(10,n)/POWER(10,n-1)) to get the bit of every mission for every user, but this means I need to run my query n times, and the status is now 64 bits long...
Is there an elegant way to do this in one query? Any help is appreciated.
The obvious approach is to normalise your data:
uid mission status
a 1 0
a 2 0
a 3 1
a 4 1
b 1 1
b 2 1
b 3 1
b 4 1
c 1 1
c 2 0
c 3 0
c 4 1
d 1 0
d 2 0
d 3 1
d 4 0
e 1 1
e 2 1
e 3 0
e 4 0
Alternatively, you can store a bitwise integer (or just do what you're currently doing) and process the data in your application code (e.g. a bit of PHP)...
uid status
a 12
b 15
c 9
d 4
e 3
<?php
$input = 15; // value comes from a query
$missions = array(1,2,3,4); // not really necessary in this particular instance
for( $i=0; $i<4; $i++ ) {
    $intbit = pow(2,$i);
    if( $input & $intbit ) {
        echo $missions[$i] . ' ';
    }
}
?>
Outputs '1 2 3 4'
Just convert the value to a string, remove the '0's, and calculate the length. Assuming that the value really is a decimal:
select length(replace(cast(status as char), '0', '')) as num_missions
from t;
Here is a db<>fiddle using MySQL. Note that the conversion to a string might look a little different in Hive, but the idea is the same.
If it is stored as an integer, you can use the bin() function to convert the integer to a binary string. This is supported in both Hive and MySQL (the original tags on the question).
Bit fiddling in databases is usually a bad idea and suggests a poor data model. Your data should have one row per user and mission. Attempts at optimizing by stuffing things into bits may work sometimes in some programming languages, but rarely in SQL.
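For the per-mission counts the question asks for, one query can carry one SUM per bit position. The sketch below is an illustration only: it assumes the status is stored as a bitwise integer (bit 0 = mission 1) and uses SQLite through Python's sqlite3 module. `mission_counts` and the table name `t` are hypothetical, and the bit operators may be spelled differently in other engines.

```python
import sqlite3

def mission_counts(rows, n_missions):
    """Count, per mission, how many users have that bit set, in one SQL query.

    rows: iterable of (uid, status), status being an integer bitmask
          (bit 0 = mission 1, bit 1 = mission 2, ...).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (uid TEXT, status INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    # One SUM per bit position, all computed in a single pass over the table.
    select_list = ", ".join(
        f"SUM((status >> {i}) & 1) AS mission{i + 1}" for i in range(n_missions)
    )
    counts = conn.execute(f"SELECT {select_list} FROM t").fetchone()
    conn.close()
    return list(counts)

# Bitwise-integer version of the sample table (12 = 1100, 15 = 1111, ...).
sample = [("a", 12), ("b", 15), ("c", 9), ("d", 4), ("e", 3)]
counts = mission_counts(sample, 4)
print(counts)
```

The select list is generated rather than typed by hand, so a 64-bit status only changes the `n_missions` argument. The query text grows with the number of missions, but it still runs once.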

Can I use pandas to create a biased sample?

My code uses a column called booking status that is 1 for yes and 0 for no (multiple other columns will be pulled in depending on the booking status). There are many more no rows than yes rows, so I would like to take a sample with all the yes rows and the same number of no rows.
When I use
samp = rslt_df.sample(n=298, random_state=1, weights='bookingstatus')
I get the error:
ValueError: Fewer non-zero entries in p than size
Is there a way to take the sample this way?
If our entire dataset looks like this:
print(df)
   c1  c2
0   1   1
1   0   2
2   0   3
3   0   4
4   0   5
5   0   6
6   0   7
7   1   8
8   0   9
9   0  10
We may decide to sample from it using the DataFrame.sample function. By default, this function samples without replacement, so you'll get an error if you request more observations than the dataset contains:
df.sample(20)
ValueError: Cannot take a larger sample than population when 'replace=False'
In your situation, the ValueError comes from the weights parameter:
df.sample(3,weights='c1')
ValueError: Fewer non-zero entries in p than size
To paraphrase the DataFrame.sample docs, using the c1 column as our weights parameter means that rows with a larger value in c1 are more likely to be sampled. In particular, rows whose weight is zero are never sampled, so without replacement you cannot draw more rows than there are non-zero weights (here, two). We can fix this error using either one of the following methods.
Method 1: Set the replace parameter to be true:
m1 = df.sample(3,weights='c1', replace=True)
print(m1)
c1 c2
0 1 1
7 1 8
0 1 1
Method 2: Make sure the n parameter is equal to or less than the number of 1s in the c1 column:
m2 = df.sample(2,weights='c1')
print(m2)
c1 c2
7 1 8
0 1 1
If you decide to use this method, you won't really be sampling. You're really just filtering out any rows where the value of c1 is 0.
I was able to do this in the end; here is how I did it:
import pandas as pd

bookingstatus_count = df.bookingstatus.value_counts()
print('Class 0:', bookingstatus_count[0])
print('Class 1:', bookingstatus_count[1])
print('Proportion:', round(bookingstatus_count[0] / bookingstatus_count[1], 2), ': 1')
# Class count (value_counts sorts by frequency, so this order assumes class 0 is the majority)
count_class_0, count_class_1 = df.bookingstatus.value_counts()
# Divide by class
df_class_0 = df[df['bookingstatus'] == 0]
df_class_1 = df[df['bookingstatus'] == 1]
# Undersample the majority class down to the size of the minority class
df_class_0_under = df_class_0.sample(count_class_1)
df_test_under = pd.concat([df_class_0_under, df_class_1], axis=0)
based on this https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
Thanks everyone
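The undersampling idea does not depend on pandas. As a rough sketch using only Python's standard library, the function below keeps every row labelled 1 and draws an equal number of rows labelled 0 without replacement. It assumes the rows are dictionaries and that class 1 is the minority; `balanced_sample` and the sample data are hypothetical.

```python
import random

def balanced_sample(rows, label_key, seed=None):
    """Keep every row with label 1 and an equal-sized random sample of the
    rows with label 0 (assumes label 1 is the minority class)."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    k = min(len(positives), len(negatives))
    sampled_negatives = rng.sample(negatives, k)   # without replacement
    combined = positives + sampled_negatives
    rng.shuffle(combined)                          # avoid all 1s coming first
    return combined

# Hypothetical data: 2 bookings out of 10 rows, like the c1/c2 example above.
rows = [{"bookingstatus": 1 if i in (0, 7) else 0, "c2": i + 1} for i in range(10)]
sample = balanced_sample(rows, "bookingstatus", seed=1)
print(sample)
```

Passing a seed makes the draw repeatable, which plays the same role as random_state in DataFrame.sample.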

Calculate amount of combinations with conditions

I'd like to calculate how many different variations of a certain amount of numbers are possible. The number of elements is variable.
Example:
I have 5 elements and each element can vary between 0 and 8. Only the first element is a bit more defined and can only vary between 1 and 8. So far I'd say I have 8*9^4 possibilities. But I have some more conditions. As soon as one of the elements gets zero the next elements should be automatically zero as well.
E.G:
6 5 4 7 8 is ok
6 3 6 8 0 is ok
3 6 7 0 5 is not possible and would turn to 3 6 7 0 0
Would somebody show me how to calculate the number of combinations for this case and also in general? I'd like to be able to calculate it for 4, 8, 9, etc. elements as well. Later on I'd like to calculate this number in VBA to be able to give the user a forecast of how long my calculations will take.
Since once a 0 is present in the sequence, all remaining numbers in the sequence will also be 0, these are all of the possibilities: (where # below represents any digit from 1 to 8):
##### (accounts for 8^5 combinations)
####0 (accounts for 8^4 combinations)
...
#0000 (accounts for 8^1 combinations)
Therefore, the answer is (in pseudocode):
int sum = 0;
for (int x = 1; x <= 5; x++)
{
    sum = sum + 8^x;  // ^ denotes exponentiation here, not XOR
}
Or equivalently,
int prod = 0;
for (int x = 1; x <= 5; x++)
{
    prod = 8*(prod+1);  // Horner form of 8 + 8^2 + ... + 8^5
}
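The count can be cross-checked by brute force. The sketch below compares three ways of getting the number for n elements: the sum of powers from the answer, the closed form of that geometric series (8 * (8^n - 1) / 7, which is not stated in the answer but follows from it), and a direct enumeration that applies the zero-propagation rule. The function names are hypothetical.

```python
from itertools import product

def count_by_formula(n):
    """8 + 8^2 + ... + 8^n, the sum derived in the answer."""
    return sum(8 ** k for k in range(1, n + 1))

def count_closed_form(n):
    """The same geometric series in closed form: 8 * (8^n - 1) / 7."""
    return 8 * (8 ** n - 1) // 7

def count_by_enumeration(n):
    """Brute force: enumerate every raw sequence (first element 1..8, the
    rest 0..8), zero out everything after the first 0, count distinct results."""
    seen = set()
    for seq in product(range(1, 9), *([range(0, 9)] * (n - 1))):
        seq = list(seq)
        if 0 in seq:
            first_zero = seq.index(0)
            seq[first_zero:] = [0] * (n - first_zero)
        seen.add(tuple(seq))
    return len(seen)

n = 5
print(count_by_formula(n), count_closed_form(n), count_by_enumeration(n))
```

The enumeration visits 8 * 9^(n-1) raw sequences, so it is only practical as a check for small n. For a runtime forecast, either of the first two functions is enough.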
great thank you.
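Sub test()
    Dim sum As Double   ' 8 ^ x returns a Double; Single loses precision for larger counts
    Dim x As Integer
    For x = 1 To 5      ' upper bound = number of elements
        sum = sum + 8 ^ x
    Next
    Debug.Print sum
End Sub
With this code I get exactly 37448 for 5 elements. I also tried it with e.g. 6 elements (For x = 1 To 6) and it worked as well. Now I can try to estimate the calculation time.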

Comparing vectors

I am new to R and am trying to find a better solution for accomplishing this fairly simple task efficiently.
I have a data.frame M with 100,000 rows (and many columns, of which 2 are relevant to this problem; I'll call them M1 and M2). I have another data.frame V whose column V1, with about 10,000 elements, is essential to this task. My task is this:
For each element in V1, find where it occurs in M2 and pull out the corresponding M1. I am able to do this using a for loop, but it is terribly slow! I am used to Matlab and Perl, and this is taking forever in R! Surely there's a better way. I would appreciate any suggestions for accomplishing this task...
for (x in 1:length(V$V1)) {
  start[x] = M$M1[M$M2 == V$V1[x]]
}
There is only 1 element that will match, and so I can use the logical statement to directly get the element in start vector. How can I vectorize this?
Thank you!
Here is another solution using the same example by @aix.
M[match(V$V1, M$M2),]
To benchmark performance, we can use the R package rbenchmark.
library(rbenchmark)
f_ramnath = function() M[match(V$V1, M$M2),]
f_aix = function() merge(V, M, by.x='V1', by.y='M2', sort=F)
f_chase = function() M[M$M2 %in% V$V1,] # modified to return full data frame
benchmark(f_ramnath(), f_aix(), f_chase(), replications = 10000)
         test replications elapsed relative
2     f_aix()        10000  12.907 7.068456
3   f_chase()        10000   2.010 1.100767
1 f_ramnath()        10000   1.826 1.000000
Another option is to use the %in% operator:
> set.seed(1)
> M <- data.frame(M1 = sample(1:20, 15, FALSE), M2 = sample(1:20, 15, FALSE))
> V <- data.frame(V1 = sample(1:20, 10, FALSE))
> M$M1[M$M2 %in% V$V1]
[1] 6 8 11 9 19 1 3 5
Sounds like you're looking for merge:
> M <- data.frame(M1=c(1,2,3,4,10,3,15), M2=c(15,6,7,8,-1,12,5))
> V <- data.frame(V1=c(-1,12,5,7))
> merge(V, M, by.x='V1', by.y='M2', sort=F)
  V1 M1
1 -1 10
2 12  3
3  5 15
4  7  3
If V$V1 might contain values not present in M$M2, you may want to specify all.x=T. This will fill in the missing values with NAs instead of omitting them from the result.
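To see why a single lookup pass beats the loop, the sketch below builds an index from M2 values to row positions once and then looks up every V1 value in it. It is written in Python rather than R and only illustrates the idea; it is not a description of how R implements match(). `match_lookup` is a hypothetical name, and the data is the merge example above.

```python
def match_lookup(keys, m2, m1, missing=None):
    """For each key, return the M1 value from the row whose M2 equals the key.

    The index is built once, so the work is roughly len(m2) + len(keys) steps
    instead of len(m2) * len(keys) comparisons as in the loop version.
    """
    index = {}
    for pos, value in enumerate(m2):
        index.setdefault(value, pos)      # keep the first match only
    return [m1[index[k]] if k in index else missing for k in keys]

# Same example data as the merge answer.
M1 = [1, 2, 3, 4, 10, 3, 15]
M2 = [15, 6, 7, 8, -1, 12, 5]
V1 = [-1, 12, 5, 7]
start = match_lookup(V1, M2, M1)
print(start)
```

Keys that are absent from M2 come back as `missing` (None by default), which is the analogue of the NA rows produced by all.x=T.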