I have created a set of 4 clusters using k-means, but I'd like to reorder the clusters in ascending order so the script outputs the analysis in a predictable way every time it runs.
The resulting df with the clusters is something like:
            customer_id  recency  frequency  monetary_value  recency_cluster  frequency_cluster  monetary_value_cluster
0     44792907512250289       21          1           43.76                0                  1                       0
1   4277896431638207047      443          1           73.13                1                  1                       0
2   1509512561185834874      559          1           37.50                1                  1                       0
3  -8259919882769629944      437          1           34.38                1                  1                       0
4   8269311313560571571      133          2          324.78                0                  0                       1
5   6521698907264712834      311          1            6.32                3                  1                       0
6   9102795320443090762      340          1          174.99                3                  1                       1
7   6203217338400763719       39          1           77.50                0                  1                       0
8   7633758030510673403      625          1           95.26                2                  1                       0
9  -2417721548925747504      644          1           76.84                2                  1                       0
The recency clusters are not ordered by the data; for example, I'd like recency cluster 0 to be the one containing the minimum recency value (min = 1.0), which is currently recency cluster 1:
recency_cluster count mean std min 25% 50% 75% max
0 17609.0 700.900960 56.895995 609.0 651.0 697.0 749.0 807.0
1 16458.0 102.692672 62.952229 1.0 47.0 101.0 159.0 210.0
2 17166.0 515.971746 56.592490 418.0 466.0 517.0 567.0 608.0
3 18634.0 317.599227 58.852980 211.0 269.0 319.0 367.0 416.0
Using something like:
rfm_df.groupby('recency_cluster')['recency'].transform('min')
will return a column with the minimum recency value of each row's cluster:
0 1
1 418
2 418
3 418
4 1
...
69862 609
69863 1
69864 211
69865 609
69866 211
I guess there has to be a way to convert these categories [1, 211, 418, 609] into [0, 1, 2, 3] to get the desired result, but I can't come up with a solution.
Or maybe there's a better approach to the problem.
Edit: I did this and I think it's working:
rfm_df['recency_normalized_cluster'] = rfm_df.groupby('recency_cluster')['recency'].transform('min').astype('category').cat.codes
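That works because categorical codes are assigned in ascending order of the per-cluster minima. For reference, an equivalent sketch that builds an explicit old-label-to-new-label mapping instead (this rank-based variant is my own suggestion, not part of the original edit):

# Rank clusters by their minimum recency and relabel them 0..n-1 in that order
order = (rfm_df.groupby('recency_cluster')['recency'].min()
               .rank(method='first')  # 1..4, ascending by minimum recency
               .sub(1)
               .astype(int))
rfm_df['recency_normalized_cluster'] = rfm_df['recency_cluster'].map(order)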
I've got the following DataFrame:

id  sec
1    45
2     1
3   176
1    19
1   876
3   123
I want to split it into groups by id and session, or create multiple DataFrames of these sessions. I want the sessions of each id, where a new session starts when more than 30 seconds have passed between user actions.
For example:
sessions for id 1: [45, 19], [876]
I tried groupby and cat, but I have no idea how to implement this.
To identify the session you can use:
df['session'] = (df.sort_values(by=['id', 'sec'])
                   .groupby('id', group_keys=False)['sec']  # group_keys=False keeps the original index so the assignment aligns
                   .apply(lambda s: s.diff().gt(30).cumsum().add(1))
                )
Output:
id sec session
0 1 45 1
1 2 1 1
2 3 176 2
3 1 19 1
4 1 876 2
5 3 123 1
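From there, a minimal sketch of collecting the actual per-id session lists the question asks for (note the values come out in sorted order within each session):

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 1, 1, 3],
                   'sec': [45, 1, 176, 19, 876, 123]})
df['session'] = (df.sort_values(by=['id', 'sec'])
                   .groupby('id', group_keys=False)['sec']
                   .apply(lambda s: s.diff().gt(30).cumsum().add(1)))

# One list of sec values per (id, session) pair
sessions = df.groupby(['id', 'session'])['sec'].apply(list)
print(sessions.loc[1].tolist())  # [[19, 45], [876]]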
I have the following DataFrame:
num_tra num_ts Year Value
0 0 0 1 100
1 0 0 2 90
2 0 0 3 80
3 0 1 1 90
4 0 1 2 81
5 0 1 3 72
6 1 0 1 81
7 1 0 2 73
8 1 0 3 65
9 1 1 1 73
10 1 1 2 66
11 1 1 3 58
12 2 0 1 142
13 2 0 2 160
14 2 0 3 144
15 2 1 1 128
16 2 1 2 144
17 2 1 3 130
Based on the Multiple Interactions Altair example, I tried to build a chart with two sliders based (in this example) on the values of columns num_tra [0 to 2] and num_ts [0 to 1], but it doesn't work:
import altair as alt
from vega_datasets import data
base = alt.Chart(df, width=500, height=300).mark_line(color="Red").encode(
    x=alt.X('Year:Q'),
    y='Value:Q',
    tooltip="Value:Q"
)
# Slider filter
tra_slider = alt.binding_range(min=0, max=2, step=1)
ts_slider = alt.binding_range(min=0, max=1, step=1)
slider1 = alt.selection_single(bind=tra_slider, fields=['num_tra'], name="TRA")
slider2 = alt.selection_single(bind=ts_slider, fields=['num_ts'], name="TS")
filter_TRA = base.add_selection(
    slider1, slider2
).transform_filter(
    slider1, slider2
).properties(title="Sensi_TRA")
filter_TRA
=> TypeError: transform_filter() takes 2 positional arguments but 3 were given
No problem with one slider but as mentioned, I wasn't able to combine two or more sliders on the same chart.
If you have any idea, it would be much appreciated.
There are a couple of ways to do this. If you want the filters to be applied sequentially, you can use two transform_filter statements:
filter_TRA = base.add_selection(
    slider1, slider2
).transform_filter(
    slider1
).transform_filter(
    slider2
)
Alternatively, you can use a single transform_filter statement with the & or | operator to filter on the intersection or union of the slider values, respectively:
filter_TRA = base.add_selection(
    slider1, slider2
).transform_filter(
    slider1 & slider2
)
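For completeness, a self-contained sketch with the DataFrame rebuilt from the question (selection_single and add_selection are the Altair 4 API the question already uses):

import altair as alt
import pandas as pd

df = pd.DataFrame({
    'num_tra': [0] * 6 + [1] * 6 + [2] * 6,
    'num_ts': [0, 0, 0, 1, 1, 1] * 3,
    'Year': [1, 2, 3] * 6,
    'Value': [100, 90, 80, 90, 81, 72,
              81, 73, 65, 73, 66, 58,
              142, 160, 144, 128, 144, 130],
})

base = alt.Chart(df, width=500, height=300).mark_line(color="Red").encode(
    x=alt.X('Year:Q'),
    y='Value:Q',
    tooltip="Value:Q"
)

tra_slider = alt.binding_range(min=0, max=2, step=1)
ts_slider = alt.binding_range(min=0, max=1, step=1)
slider1 = alt.selection_single(bind=tra_slider, fields=['num_tra'], name="TRA")
slider2 = alt.selection_single(bind=ts_slider, fields=['num_ts'], name="TS")

filter_TRA = base.add_selection(
    slider1, slider2
).transform_filter(
    slider1 & slider2
).properties(title="Sensi_TRA")
filter_TRA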
I am learning Python and pandas, and I do not know how to pivot a DataFrame into columns with a multilevel index. I have the following pivot table:
import numpy as np

df = df.pivot_table(index=['FECHA', 'Planta'],
                    aggfunc={'Menor_F0': np.sum,
                             'Menor_fc': np.sum,
                             'Total_Muestras': 'count'})
It gives the following (P.S.: it is correct):
Menor_F0 Menor_fc Total_Muestras
FECHA Planta
01/2014 455 0 0 2
470 1 2 5
01/2016 455 0 0 1
470 0 1 2
But I want to visualize it in this form; how can I do it?
FECHA 01/2014 01/2016
Menor_F0 Menor_fc Total_Muestras Menor_F0 Menor_fc Total_Muestras
PLANTA
455 0 0 2 0 0 1
470 1 2 5 0 1 2
You can try stack and unstack:
df.stack().unstack(level=(0,2))
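A minimal self-contained sketch reproducing the reshape with the values from the table above:

import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('01/2014', 455), ('01/2014', 470), ('01/2016', 455), ('01/2016', 470)],
    names=['FECHA', 'Planta'])
df = pd.DataFrame({'Menor_F0': [0, 1, 0, 0],
                   'Menor_fc': [0, 2, 0, 1],
                   'Total_Muestras': [2, 5, 1, 2]}, index=idx)

# Move the measure names into the index, then pull FECHA and the
# measure names back out as a two-level column index
out = df.stack().unstack(level=(0, 2))
print(out)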
For each row, I need to calculate the integer part of dividing val by 4. For each subsequent row, the remainder of the previous division is added to the current value, and we again take the integer part and the remainder of dividing by 4. Consider the example below:
id val
1 22
2 1
3 1
4 2
5 1
6 6
7 1
After dividing by 4, we look at the integer part and the remainder. For each id we keep accumulating the remainders until they add up to another multiple of 4:

id  val  wh1  rem1  wh2        rem2     RESULT (wh1+wh2)
1   22   5    2     0          2        5
2   1    0    1     (3/4=0)    3%4=3    0
3   1    0    1     (4/4=1)    4%4=0    1
4   2    0    2     (2/4=0)    2%4=2    0
5   1    0    1     (3/4=0)    3%4=3    0
6   6    1    2     (5/4=1)    5%4=1    2
7   1    0    1     (2/4=0)    2%4=2    0
How can I compute the RESULT column in SQL?
Data of project:
http://sqlfiddle.com/#!18/9e18f/2
The integer part of the division by 4 is easy; the problem is accumulating the remainders across rows for each id and working out when they add up to another multiple of 4.
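The question asks for SQL, but the arithmetic itself reduces to a running total: the carried remainder is always the cumulative sum modulo 4, so RESULT is exactly the increase of floor(cumsum / 4) from one row to the next. A minimal pandas sketch to sanity-check that identity on the sample data:

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                   'val': [22, 1, 1, 2, 1, 6, 1]})

# RESULT = floor(running total / 4) - floor(previous running total / 4)
cum = df['val'].cumsum()
df['result'] = cum // 4 - (cum // 4).shift(fill_value=0)
print(df)  # result: 5, 0, 1, 0, 0, 2, 0

In SQL the same identity can be written with a windowed running sum such as SUM(val) OVER (ORDER BY id), taking the difference of the integer divisions between the current and previous row.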
I have a pandas DataFrame:
id no_of_rows
1 2689
2 1515
3 3826
4 814
5 1650
6 2292
7 1867
8 2096
9 1618
10 923
11 766
12 191
I want to divide the ids into 5 different bins based on their number of rows, such that every bin has approximately the same total number of rows, and assign the result as a new column bin.
One approach I thought of:

df.no_of_rows.sum()  # 20247
div_factor = 20247 // 5  # 4049

If we add the 1st and 2nd rows, their sum is 2689 + 1515 = 4204 > div_factor, so assign bin = 1 where id = 1, then look for the next ones:
id no_of_rows bin
1 2689 1
2 1515 2
3 3826 3
4 814 4
5 1650 4
6 2292 5
7 1867
8 2096
9 1618
10 923
11 766
12 191
But this method proved wrong. Is there a way to form 5 bins such that every bin gets a roughly equal total number of rows?
You can use an approach based on percentiles.
n_bins = 5
dfa = df.sort_values(by='no_of_rows').cumsum()
# Map each cumulative total to a bin; the min() keeps the last row in bin n_bins - 1
df['bin'] = dfa.no_of_rows.apply(
    lambda x: min(int(n_bins * x / dfa.no_of_rows.max()), n_bins - 1))
And then you can check with
df.groupby('bin').sum()
The more records you have, the fairer the result will be in terms of dispersion.
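A quick self-contained run on the data from the question (a sketch; with only 12 ids the bin totals are only roughly balanced):

import pandas as pd

df = pd.DataFrame({'id': range(1, 13),
                   'no_of_rows': [2689, 1515, 3826, 814, 1650, 2292,
                                  1867, 2096, 1618, 923, 766, 191]})

n_bins = 5
dfa = df.sort_values(by='no_of_rows').cumsum()
df['bin'] = dfa.no_of_rows.apply(
    lambda x: min(int(n_bins * x / dfa.no_of_rows.max()), n_bins - 1))

# Bin totals come out as 2694, 4783, 3963, 2292, 6515 (summing to 20247)
print(df.groupby('bin')['no_of_rows'].sum())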