SQL: Insert rows and interpolate

I have a large SQL table and I want to add rows so that all issue ages 40-75 are present, with db_perk and accel_perk values for the new ages filled in via linear interpolation.
Here is a small portion of my data
class gender iss_age dur db_perk accel_perk ext_perk
111 F 40 1 0.1961 0.0025 0
111 F 45 1 0.2985 0.0033 0
111 F 50 1 0.472 0.0065 0
111 F 55 1 0.7075 0.01 0
111 F 60 1 1.0226 0.0238 0
111 F 65 1 1.5208 0.0551 0
111 F 70 1 2.3808 0.1296 0
111 F 75 1 4.0748 0.3242 0
I want my output to look something like this
class gender iss_age dur db_perk accel_perk ext_perk
111 F 40 1 0.1961 0.0025 0
111 F 41 1 0.21656 0.00266 0
111 F 42 1 0.23702 0.00282 0
111 F 43 1 0.25748 0.00298 0
111 F 44 1 0.27794 0.00314 0
111 F 45 1 0.2985 0.0033 0
I basically want every column except iss_age, db_perk, and accel_perk to be copied from the row above.
Is there any way to do this?
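One way to get there, if pulling the table into pandas is an option: reindex each (class, gender, dur) group on iss_age and linearly interpolate the two perk columns. A minimal sketch; the DataFrame below is a hypothetical stand-in for the SQL table:
import pandas as pd

# Hypothetical stand-in for the SQL table, built from the sample rows above.
df = pd.DataFrame({
    'class': [111] * 8,
    'gender': ['F'] * 8,
    'iss_age': [40, 45, 50, 55, 60, 65, 70, 75],
    'dur': [1] * 8,
    'db_perk': [0.1961, 0.2985, 0.472, 0.7075, 1.0226, 1.5208, 2.3808, 4.0748],
    'accel_perk': [0.0025, 0.0033, 0.0065, 0.01, 0.0238, 0.0551, 0.1296, 0.3242],
    'ext_perk': [0] * 8,
})

pieces = []
for _, group in df.groupby(['class', 'gender', 'dur']):
    # Make every issue age 40-75 present for this group.
    g = group.set_index('iss_age').reindex(range(40, 76))
    # Fill the two perk columns by linear interpolation between the known ages.
    g[['db_perk', 'accel_perk']] = g[['db_perk', 'accel_perk']].interpolate()
    # Every other column is copied down from the row above.
    # (Integer columns become floats where NaNs were introduced; cast back if needed.)
    g = g.ffill()
    pieces.append(g.reset_index())

out = pd.concat(pieces, ignore_index=True)
print(out.head())  # ages 40-44 show the interpolated db_perk/accel_perk values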

Reindex kmeans clustered dataframe in ascending order of values

I have created a set of 4 clusters using kmeans, but I'd like to reorder the clusters in ascending order so that the analysis output is predictable every time the script is executed.
The resulting df with the clusters is something like:
customer_id recency frequency monetary_value recency_cluster \
0 44792907512250289 21 1 43.76 0
1 4277896431638207047 443 1 73.13 1
2 1509512561185834874 559 1 37.50 1
3 -8259919882769629944 437 1 34.38 1
4 8269311313560571571 133 2 324.78 0
5 6521698907264712834 311 1 6.32 3
6 9102795320443090762 340 1 174.99 3
7 6203217338400763719 39 1 77.50 0
8 7633758030510673403 625 1 95.26 2
9 -2417721548925747504 644 1 76.84 2
frequency_cluster monetary_value_cluster
0 1 0
1 1 0
2 1 0
3 1 0
4 0 1
5 1 0
6 1 1
7 1 0
8 1 0
9 1 0
The recency clusters are not ordered by the data; I'd like, for example, recency cluster 0 to be the one containing the minimum value = 1.0 (currently recency cluster 1).
recency_cluster count mean std min 25% 50% 75% max
0 17609.0 700.900960 56.895995 609.0 651.0 697.0 749.0 807.0
1 16458.0 102.692672 62.952229 1.0 47.0 101.0 159.0 210.0
2 17166.0 515.971746 56.592490 418.0 466.0 517.0 567.0 608.0
3 18634.0 317.599227 58.852980 211.0 269.0 319.0 367.0 416.0
Using something like:
rfm_df.groupby('recency_cluster')['recency'].transform('min')
will return a column with the min value of each cluster:
0 1
1 418
2 418
3 418
4 1
...
69862 609
69863 1
69864 211
69865 609
69866 211
I guess there has to be a way to convert these categories [1, 211, 418, 609] into [0, 1, 2, 3] in order to get the desired result, but I can't come up with a solution.
Or maybe there's a better approach to the problem.
Edit: I did this and I think it's working:
rfm_df['recency_normalized_cluster'] = rfm_df.groupby('recency_cluster')['recency'].transform('min').astype('category').cat.codes
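An equivalent way to see it: densely rank each cluster's minimum, so the cluster holding the smallest recency gets code 0, the next smallest gets 1, and so on. A self-contained sketch; the toy frame below is a hypothetical stand-in for rfm_df:
import pandas as pd

# Hypothetical stand-in for rfm_df, with the same column names.
rfm_df = pd.DataFrame({
    'recency':         [21, 443, 559, 437, 133, 311, 340, 39, 625, 644],
    'recency_cluster': [0, 1, 1, 1, 0, 3, 3, 0, 2, 2],
})

# Rank every row by its cluster's minimum recency; dense ranks start at 1,
# so subtract 1 to get 0-based cluster labels ordered by recency.
cluster_min = rfm_df.groupby('recency_cluster')['recency'].transform('min')
rfm_df['recency_normalized_cluster'] = cluster_min.rank(method='dense').astype(int) - 1
print(rfm_df)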

Chi-square with the contingency table applied to the die

Hello,
I am trying to apply the χ² contingency-table test from my exploratory statistics course to the results given in Example 1 on Wikipedia.
Rolling a die 600 times in a row gave the following results:
number rolled    1    2    3    4    5    6
observed count  88  109  107   94  105   97
The number of degrees of freedom is 6 - 1 = 5.
We wish to test the hypothesis that the die is not rigged, with a risk α = 0.05.
The null hypothesis here is therefore: "The die is balanced".
Considering this hypothesis to be true, the variable T defined above is:
T = (88 - 100)²/100 + (109 - 100)²/100 + (107 - 100)²/100 + (94 - 100)²/100 + (105 - 100)²/100 + (97 - 100)²/100 = 3.44
The χ2 distribution with five degrees of freedom gives the value below which we consider the draw to be compliant with a risk α = 0.05: P(T < 11.07) = 0.95.
Since 3.44 < 11.07, we cannot reject the null hypothesis: this statistical data does not allow us to consider that the die is rigged.
I tried to reproduce this result with pandas:
import pandas as pd
from scipy.stats import chi2_contingency

dico = {'face': [1, 2, 3, 4, 5, 6], 'numbers': [88, 109, 107, 94, 105, 97]} #[100, 100, 101, 99, 101, 99]}
tab = pd.DataFrame(dico)
print(tab.head(6))
ta = pd.crosstab(tab['face'], tab['numbers'])
print(ta)
test = chi2_contingency(tab)
test
   face  numbers
0     1       88
1     2      109
2     3      107
3     4       94
4     5      105
5     6       97

numbers  88   94   97   105  107  109
face
1         1    0    0    0    0    0
2         0    0    0    0    0    1
3         0    0    0    0    1    0
4         0    1    0    0    0    0
5         0    0    0    1    0    0
6         0    0    1    0    0    0
(4.86, 0.432, 5)
This is not the expected result (and with ta it is the same).
I then presented the problem as follows:
dico = {'error': [-12, 9, 7, -6, 5, -3], 'number': [88, 109, 107, 94, 105, 97]} #[100, 100, 101, 99, 101, 99]}
tab = pd.DataFrame(dico)
print(tab.head(6))
ta = pd.crosstab(tab['error'], tab['number'])
test = chi2_contingency(ta)
test
   error  number
0    -12      88
1      9     109
2      7     107
3     -6      94
4      5     105
5     -3      97
(10.94, 0.052, 5) ... the same. I expected something like (3.44, a p-value between 0.5 and 0.9, 5).
Something is wrong, but what?
Regards,
Leloup
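The mismatch comes from the choice of test rather than from pandas: chi2_contingency tests the independence of two categorical variables, and the crosstab merely one-hot encodes six distinct counts. The Wikipedia example is a one-sample goodness-of-fit test, which is scipy.stats.chisquare. A minimal sketch, which should reproduce the expected numbers:
from scipy.stats import chisquare

observed = [88, 109, 107, 94, 105, 97]
expected = [100] * 6  # 600 rolls over six equally likely faces

stat, pvalue = chisquare(observed, f_exp=expected)
print(stat, pvalue)  # 3.44 and roughly 0.63, with df = 6 - 1 = 5
The p-value of roughly 0.63 lands in the expected 0.5-0.9 range.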

How to combine two sliders on Altair chart?

I have the following DataFrame:
num_tra num_ts Year Value
0 0 0 1 100
1 0 0 2 90
2 0 0 3 80
3 0 1 1 90
4 0 1 2 81
5 0 1 3 72
6 1 0 1 81
7 1 0 2 73
8 1 0 3 65
9 1 1 1 73
10 1 1 2 66
11 1 1 3 58
12 2 0 1 142
13 2 0 2 160
14 2 0 3 144
15 2 1 1 128
16 2 1 2 144
17 2 1 3 130
Based on the Multiple Interactions Altair example, I tried to build a chart with two sliders based (in this example) on the values of columns num_tra [0 to 2] and num_ts [0 to 1], but it doesn't work:
import altair as alt
from vega_datasets import data
base = alt.Chart(df, width=500, height=300).mark_line(color="Red").encode(
    x=alt.X('Year:Q'),
    y='Value:Q',
    tooltip="Value:Q"
)
# Slider filter
tra_slider = alt.binding_range(min=0, max=2, step=1)
ts_slider = alt.binding_range(min=0, max=1, step=1)
slider1 = alt.selection_single(bind=tra_slider, fields=['num_tra'], name="TRA")
slider2 = alt.selection_single(bind=ts_slider, fields=['num_ts'], name="TS")
filter_TRA = base.add_selection(
    slider1, slider2
).transform_filter(
    slider1, slider2
).properties(title="Sensi_TRA")
filter_TRA
=> TypeError: transform_filter() takes 2 positional arguments but 3 were given
No problem with one slider, but as mentioned, I wasn't able to combine two or more sliders on the same chart.
If you have any ideas, they would be much appreciated.
There are a couple of ways to do this. If you want the filters to be applied sequentially, you can use two transform_filter statements:
filter_TRA = base.add_selection(
    slider1, slider2
).transform_filter(
    slider1
).transform_filter(
    slider2
)
Alternatively, you can use a single transform_filter statement and use the & or | operators to filter on the intersection or union of the slider values, respectively:
filter_TRA = base.add_selection(
    slider1, slider2
).transform_filter(
    slider1 & slider2
)
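For filtering, the two versions coincide: chaining transform_filter calls keeps only rows that pass both selections, which is exactly the slider1 & slider2 intersection; only the | (union) form behaves differently.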

How can I merge two files while printing a given value on resulting empty fields using AWK?

I have two files:
01File:
1 2051
2 1244
7 917
X 850
22 444
21 233
Y 47
KI270728_1 6
KI270727_1 4
KI270734_1 3
KI270726_1 2
KI270713_1 2
GL000195_1 2
GL000194_1 2
KI270731_1 1
KI270721_1 1
KI270711_1 1
GL000219_1 1
GL000218_1 1
GL000213_1 1
GL000205_2 1
GL000009_2 1
and 02File:
1 248956422
2 242193529
7 159345973
X 156040895
Y 56887902
22 50818468
21 46709983
KI270728_1 1872759
KI270727_1 448248
KI270726_1 43739
GL000009_2 201709
KI270322_1 21476
GL000226_1 15008
KI270311_1 12399
KI270366_1 8320
KI270511_1 8127
KI270448_1 7992
I need to merge these two files based on field 1 and print "0" in the resulting empty fields.
I was trying to accomplish this using the following command:
awk 'FNR==NR{a[$1]=$2 FS $3;next}{ print $0 "\t" a[$1]}' 01File 02File
Which results in the following output:
1 248956422 2051
2 242193529 1244
7 159345973 917
X 156040895 850
Y 56887902 47
22 50818468 444
21 46709983 233
KI270728_1 1872759 6
KI270727_1 448248 4
KI270726_1 43739 2
GL000009_2 201709 1
KI270322_1 21476
GL000226_1 15008
KI270311_1 12399
KI270366_1 8320
KI270511_1 8127
KI270448_1 7992
However, I am having trouble adapting the command to print a zero ("0") in the resulting empty fields, so as to generate the following output:
1 248956422 2051
2 242193529 1244
7 159345973 917
X 156040895 850
Y 56887902 47
22 50818468 444
21 46709983 233
KI270728_1 1872759 6
KI270727_1 448248 4
KI270726_1 43739 2
GL000009_2 201709 1
KI270322_1 21476 0
GL000226_1 15008 0
KI270311_1 12399 0
KI270366_1 8320 0
KI270511_1 8127 0
KI270448_1 7992 0
I would be grateful if you could get me going in the right direction.
Use a conditional expression in place of a[$1]: instead of the empty string, "0" will be printed if no line matched.
awk 'FNR==NR{a[$1]=$2;next} {print $0 "\t" ($1 in a? a[$1]: "0")}' 01File 02File
I also simplified the first action, since 01File has only two fields.
Output:
1 248956422 2051
2 242193529 1244
7 159345973 917
X 156040895 850
Y 56887902 47
22 50818468 444
21 46709983 233
KI270728_1 1872759 6
KI270727_1 448248 4
KI270726_1 43739 2
GL000009_2 201709 1
KI270322_1 21476 0
GL000226_1 15008 0
KI270311_1 12399 0
KI270366_1 8320 0
KI270511_1 8127 0
KI270448_1 7992 0
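As a side note, the $1 in a membership test is deliberate: merely referencing a[$1] in a comparison would silently create an empty entry for that key, whereas in checks for the key without modifying the array.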

How to assign multiple output values to multiple new columns of a dataframe?

I have the following function:
def sum(x):  # note: this shadows Python's built-in sum
    oneS = x.iloc[0:len(x)//10].agg('sum')
    twoS = x.iloc[len(x)//10:2*len(x)//10].agg('sum')
    threeS = x.iloc[2*len(x)//10:3*len(x)//10].agg('sum')
    fourS = x.iloc[3*len(x)//10:4*len(x)//10].agg('sum')
    fiveS = x.iloc[4*len(x)//10:5*len(x)//10].agg('sum')
    sixS = x.iloc[5*len(x)//10:6*len(x)//10].agg('sum')
    sevenS = x.iloc[6*len(x)//10:7*len(x)//10].agg('sum')
    eightS = x.iloc[7*len(x)//10:8*len(x)//10].agg('sum')
    nineS = x.iloc[8*len(x)//10:9*len(x)//10].agg('sum')
    tenS = x.iloc[9*len(x)//10:len(x)].agg('sum')
    return [oneS, twoS, threeS, fourS, fiveS, sixS, sevenS, eightS, nineS, tenS]
How can I assign the outputs of this function to columns of a dataframe (which already exists)?
The dataframe I am applying the function to is as below:
Cycle Type Time
1 1 101
1 1 102
1 1 103
1 1 104
1 1 105
1 1 106
9 1 101
9 1 102
9 1 103
9 1 104
9 1 105
9 1 106
The dataframe I want to add the columns to is something like below; the new columns OneS, TwoS, ... should be added as shown and filled with the results of the function.
Cycle Type OneS TwoS ThreeS
1 1
9 1
8 1
10 1
3 1
5 2
6 2
7 2
If I write a function for just one value and apply it like the following, it is possible:
grouped_data['fm'] = data_train_bel1800.groupby(['Cycle', 'Type'])['Time'].apply(lambda x: fm(x))
But I want to do it all at once so that it is neat and clear.
You can use:
def f(x):
    out = []
    for i in range(10):
        out.append(x.iloc[i*len(x)//10:(i+1)*len(x)//10].agg('sum'))
    return pd.Series(out)

df1 = (data_train_bel1800.groupby(['Cycle', 'Type'])['Time']
         .apply(f)
         .unstack()
         .add_prefix('new_')
         .reset_index())
print(df1)
Cycle Type new_0 new_1 new_2 new_3 new_4 new_5 new_6 new_7 new_8 \
0 1 1 0 101 102 205 207 209 315 211 211
1 9 1 0 101 102 205 207 209 315 211 211
new_9
0 106
1 106
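If you want the OneS, TwoS, ... names from the question instead of the generated new_ prefix, a small follow-up sketch (the name list is an assumption based on the question):
# Hypothetical mapping from the generated column names onto the ones in the question.
names = ['OneS', 'TwoS', 'ThreeS', 'FourS', 'FiveS',
         'SixS', 'SevenS', 'EightS', 'NineS', 'TenS']
df1 = df1.rename(columns=dict(zip([f'new_{i}' for i in range(10)], names)))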