How to combine two sliders on an Altair chart?

I have the following DataFrame:
    num_tra  num_ts  Year  Value
0         0       0     1    100
1         0       0     2     90
2         0       0     3     80
3         0       1     1     90
4         0       1     2     81
5         0       1     3     72
6         1       0     1     81
7         1       0     2     73
8         1       0     3     65
9         1       1     1     73
10        1       1     2     66
11        1       1     3     58
12        2       0     1    142
13        2       0     2    160
14        2       0     3    144
15        2       1     1    128
16        2       1     2    144
17        2       1     3    130
Based on the Multiple Interactions Altair example, I tried to build a chart with two sliders driven (in this example) by the values of the columns num_tra [0 to 2] and num_ts [0 to 1], but it doesn't work:
import altair as alt
from vega_datasets import data
base = alt.Chart(df, width=500, height=300).mark_line(color="Red").encode(
    x=alt.X('Year:Q'),
    y='Value:Q',
    tooltip="Value:Q"
)
# Slider filter
tra_slider = alt.binding_range(min=0, max=2, step=1)
ts_slider = alt.binding_range(min=0, max=1, step=1)
slider1 = alt.selection_single(bind=tra_slider, fields=['num_tra'], name="TRA")
slider2 = alt.selection_single(bind=ts_slider, fields=['num_ts'], name="TS")
filter_TRA = base.add_selection(
    slider1, slider2
).transform_filter(
    slider1, slider2
).properties(title="Sensi_TRA")
filter_TRA
=> TypeError: transform_filter() takes 2 positional arguments but 3 were given
No problem with one slider but as mentioned, I wasn't able to combine two or more sliders on the same chart.
If you have any ideas, they would be very much appreciated.

There are a couple of ways to do this. If you want the filters to be applied sequentially, you can use two transform_filter calls:
filter_TRA = base.add_selection(
    slider1, slider2
).transform_filter(
    slider1
).transform_filter(
    slider2
)
Alternatively, you can use a single transform_filter call and use the & or | operator to filter on the intersection or union of the slider values, respectively:
filter_TRA = base.add_selection(
    slider1, slider2
).transform_filter(
    slider1 & slider2
)
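For the union case mentioned above, the same pattern with | should work; a minimal sketch reusing the selections defined earlier:
# Sketch (not from the original answer): keep rows matching either slider instead of both
filter_TRA_union = base.add_selection(
    slider1, slider2
).transform_filter(
    slider1 | slider2
)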

Related

Reindex kmeans clustered dataframe in ascending order of values

I have created a set of 4 clusters using kmeans, but I'd like to reorder the clusters in an ascending manner to have a predictable way of outputting an analysis every time the script is executed.
The resulting df with the clusters is something like:
            customer_id  recency  frequency  monetary_value  recency_cluster  frequency_cluster  monetary_value_cluster
0     44792907512250289       21          1           43.76                0                  1                       0
1   4277896431638207047      443          1           73.13                1                  1                       0
2   1509512561185834874      559          1           37.50                1                  1                       0
3  -8259919882769629944      437          1           34.38                1                  1                       0
4   8269311313560571571      133          2          324.78                0                  0                       1
5   6521698907264712834      311          1            6.32                3                  1                       0
6   9102795320443090762      340          1          174.99                3                  1                       1
7   6203217338400763719       39          1           77.50                0                  1                       0
8   7633758030510673403      625          1           95.26                2                  1                       0
9  -2417721548925747504      644          1           76.84                2                  1                       0
The recency clusters are not ordered by the underlying values; I'd like, for example, recency cluster 0 to be the one with the smallest minimum recency (currently that is recency cluster 1, whose min is 1.0).
recency_cluster count mean std min 25% 50% 75% max
0 17609.0 700.900960 56.895995 609.0 651.0 697.0 749.0 807.0
1 16458.0 102.692672 62.952229 1.0 47.0 101.0 159.0 210.0
2 17166.0 515.971746 56.592490 418.0 466.0 517.0 567.0 608.0
3 18634.0 317.599227 58.852980 211.0 269.0 319.0 367.0 416.0
Using something like:
rfm_df.groupby('recency_cluster')['recency'].transform('min')
will return a column with the min value of each cluster:
0 1
1 418
2 418
3 418
4 1
...
69862 609
69863 1
69864 211
69865 609
69866 211
I guess there's got to be a way to convert these categories [1, 211, 418, 609] into [0, 1, 2, 3] in order to get the desired result, but I can't come up with a solution.
Or maybe there's a better approach to the problem.
Edit: I did this and I think it's working:
rfm_df['recency_normalized_cluster'] = rfm_df.groupby('recency_cluster')['recency'].transform('min').astype('category').cat.codes
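For reference, an alternative sketch that should give the same ordering (it assumes the same rfm_df and column names as above) is to dense-rank the per-cluster minimum recency, so the cluster with the smallest minimum becomes 0, the next becomes 1, and so on:
# Sketch: rank clusters by their minimum recency instead of round-tripping through category codes
cluster_min = rfm_df.groupby('recency_cluster')['recency'].transform('min')
rfm_df['recency_normalized_cluster'] = (cluster_min.rank(method='dense') - 1).astype(int)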

Find Max Gradient by Row in For Loop Pandas

I have a 15 x 4 df and I'm trying to compute the maximum gradient in a North (N) minus South (S) direction for each row, using an "S" and an "N" value for each min or max in the rows below. I'm not sure this is the most pythonic way to do it. My df "ms" looks like this:
minSlats minNlats maxSlats maxNlats
0 57839.4 54917.0 57962.6 56979.9
0 57763.2 55656.7 58120.0 57766.0
0 57905.2 54968.6 58014.3 57031.6
0 57796.0 54810.2 57969.0 56848.2
0 57820.5 55156.4 58019.5 57273.2
0 57542.7 54330.6 58057.6 56145.1
0 57829.8 54755.4 57978.8 56777.5
0 57796.0 54810.2 57969.0 56848.2
0 57639.4 54286.6 58087.6 56140.1
0 57653.3 56182.7 57996.5 57975.8
0 57665.1 56048.3 58069.7 58031.4
0 57559.9 57121.3 57890.8 58043.0
0 57689.7 55155.5 57959.4 56440.8
0 57649.4 56076.5 58043.0 58037.4
0 57603.9 56290.0 57959.8 57993.9
My loop structure looks like this:
J = len(ms)
grad = pd.DataFrame()
for i in range(J):
    if ms.maxSlats.iloc[i] > ms.maxNlats.iloc[i]:
        gr = (ms.maxSlats.iloc[i] - ms.minNlats.iloc[i]) * -1
        grad[gr] = [i+1, i]
    elif ms.maxNlats.iloc[i] > ms.maxSlats.iloc[i]:
        gr = ms.maxNlats.iloc[i] - ms.minSlats.iloc[i]
        grad[gr] = [i+1, i]

grad = grad.T  # need to transpose
print(grad)
I obtain the correct answer but I'm wondering if there is a cleaner way to do this to obtain the same answer below:
grad.T
Out[317]:
0 1
-3045.6 1 0
-2463.3 2 1
-3045.7 3 2
-3158.8 8 7
-2863.1 5 4
-3727.0 6 5
-3223.4 7 6
-3801.0 9 8
-1813.8 10 9
-2021.4 11 10
483.1 12 11
-2803.9 13 12
-1966.5 14 13
390.0 15 14
thank you,
Use np.where to compute the gradient, then keep only the last occurrence of any duplicated index value.
import numpy as np
import pandas as pd

grad = np.where(ms.maxSlats > ms.maxNlats,
                (ms.maxSlats - ms.minNlats) * -1,
                ms.maxNlats - ms.minSlats)
df = pd.DataFrame({'A': pd.RangeIndex(1, len(ms) + 1),
                   'B': pd.RangeIndex(0, len(ms))},
                  index=grad)
df = df[~df.index.duplicated(keep='last')]
>>> df
A B
-3045.6 1 0
-2463.3 2 1
-3045.7 3 2
-2863.1 5 4
-3727.0 6 5
-3223.4 7 6
-3158.8 8 7
-3801.0 9 8
-1813.8 10 9
-2021.4 11 10
483.1 12 11
-2803.9 13 12
-1966.5 14 13
390.0 15 14
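If the row numbers aren't actually needed, a slightly simpler sketch (same column names from the question assumed) keeps the gradient as a plain column on ms rather than using it as the index:
# Sketch: store the np.where gradient as a column instead of an index
ms['grad'] = np.where(ms.maxSlats > ms.maxNlats,
                      -(ms.maxSlats - ms.minNlats),
                      ms.maxNlats - ms.minSlats)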

List of Pandas Dataframes: Merging Function Outputs

I've researched previous similar questions, but couldn't find any applicable leads:
I have a dataframe, called "df" which is roughly structured as follows:
Income Income_Quantile Score_1 Score_2 Score_3
0 100000 5 75 75 100
1 97500 5 80 76 94
2 80000 5 79 99 83
3 79000 5 88 78 91
4 70000 4 55 77 80
5 66348 4 65 63 57
6 67931 4 60 65 57
7 69232 4 65 59 62
8 67948 4 64 64 60
9 50000 3 66 50 60
10 49593 3 58 51 50
11 49588 3 58 54 50
12 48995 3 59 59 60
13 35000 2 61 50 53
14 30000 2 66 35 77
15 12000 1 22 60 30
16 10000 1 15 45 12
Using the "Income_Quantile" column and the following "for-loop", I divided the dataframe into a list of 5 subset dataframes (which each contain observations from the same income quantile):
dfs = []
for level in df.Income_Quantile.unique():
    df_temp = df.loc[df.Income_Quantile == level]
    dfs.append(df_temp)
Now, I would like to apply the following function for calculating the spearman correlation, p-value and t-statistic to the dataframe (fyi: scipy.stats functions are used in the main function):
def create_list_of_scores(df):
    df_result = pd.DataFrame(columns=cols)
    df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
    df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
    df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
    return df_result
The functions that "create_list_of_scores" uses, i.e. "ttest_ind" and "spearmanr", can be accessed from scipy.stats as follows:
from scipy.stats import ttest_ind
from scipy.stats import spearmanr
I tested the function on one subset of the dataframe:
data = dfs[1]
result = create_list_of_scores(data)
It works as expected.
However, when it comes to applying the function to the entire list of dataframes, "dfs", a lot of issues arise. If I apply it to the list of dataframes as follows:
result = pd.concat([create_list_of_scores(d) for d in dfs], axis=1)
I get the output as the columns "Score_1, Score_2, and Score_3" x 5.
I would like to:
Have just three columns "Score_1, Score_2, and Score_3".
Index the output using the t-statistic, p-value and correlation as the first-level index, and the "Income_Quantile" as the second-level index.
Here is what I have in mind:
Score_1 Score_2 Score_3
t-statistic 1
2
3
4
5
p-value 1
2
3
4
5
correlation 1
2
3
4
5
Any idea on how I can merge the output of my function as requested?
I think it is better to use GroupBy.apply:
cols = ['Score_1','Score_2','Score_3']

def create_list_of_scores(df):
    df_result = pd.DataFrame(columns=cols)
    df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
    df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
    df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
    return df_result

df = df.groupby('Income_Quantile').apply(create_list_of_scores).swaplevel(0,1).sort_index()
print (df)
Score_1 Score_2 Score_3
Income_Quantile
correlation 1 NaN NaN NaN
2 NaN NaN NaN
3 6.837722e-01 0.000000e+00 1.000000e+00
4 4.337662e-01 6.238377e-01 4.818230e-03
5 2.000000e-01 2.000000e-01 2.000000e-01
p-value 1 8.190692e-03 8.241377e-03 8.194933e-03
2 5.887943e-03 5.880440e-03 5.888611e-03
3 3.606128e-13 3.603267e-13 3.604996e-13
4 5.584822e-14 5.587619e-14 5.586583e-14
5 3.861801e-06 3.862192e-06 3.864736e-06
t-statistic 1 1.098143e+01 1.094719e+01 1.097856e+01
2 1.297459e+01 1.298294e+01 1.297385e+01
3 2.391611e+02 2.391927e+02 2.391736e+02
4 1.090548e+02 1.090479e+02 1.090505e+02
5 1.594605e+01 1.594577e+01 1.594399e+01
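As a small follow-up, individual statistic blocks can then be pulled out of the first index level with .xs; a sketch assuming the df produced above:
# Sketch: cross-section on the first index level
correlations = df.xs('correlation')   # rows indexed by Income_Quantile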

How to split a column in a data frame containing only numbers into multiple columns in pandas

I have a .dat file containing the following data:
0001100000101010100
110101000001111
101100011001110111
0111111010100
1010111111100011
I need to count the number of zeros and ones in each row.
I have tried with Pandas.
Step-1: Read the data file
Step-2: Give the column a name
Step-3: Try to split the values into multiple columns, but I could not succeed
df1 = pd.read_csv('data.dat', header=None)
df1.head()
0 1100000101010100
1 110101000001111
2 101100011001110111
3 111111010100
4 1010111111100011
df1.columns=['kirti']
df1.head()
Kirti
_______________________
0 1100000101010100
1 110101000001111
2 101100011001110111
3 111111010100
4 1010111111100011
I need to split the data frame into multiple columns, one column per digit in each row.
The maximum number of columns will be equal to the length of the longest row (i.e. the largest total count of 0s and 1s in any row of the data frame).
First create a one-column DataFrame, using the names parameter and dtype=str so that the column is read as strings:
import pandas as pd
from io import StringIO

temp="""0001100000101010100
110101000001111
101100011001110111
0111111010100
1010111111100011"""
#after testing replace 'StringIO(temp)' with the filename
df = pd.read_csv(StringIO(temp), header=None, names=['kirti'], dtype=str)
print (df)
kirti
0 0001100000101010100
1 110101000001111
2 101100011001110111
3 0111111010100
4 1010111111100011
Then create a new DataFrame by converting each value to a list of characters:
df = pd.DataFrame([list(x) for x in df['kirti']])
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
0 0 0 0 1 1 0 0 0 0 0 1 0 1 0 1 0 1 0 0
1 1 1 0 1 0 1 0 0 0 0 0 1 1 1 1 None None None None
2 1 0 1 1 0 0 0 1 1 0 0 1 1 1 0 1 1 1 None
3 0 1 1 1 1 1 1 0 1 0 1 0 0 None None None None None None
4 1 0 1 0 1 1 1 1 1 1 1 0 0 0 1 1 None None None
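Since the question also asks for the number of zeros and ones per row, a short sketch built on the split DataFrame above:
# Sketch: count zeros and ones per row; None cells from shorter rows simply compare False
zeros_per_row = (df == '0').sum(axis=1)
ones_per_row = (df == '1').sum(axis=1)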
If your data is in a list of strings, then use the count method:
>> data = ["0001100000101010100", "110101000001111", "101100011001110111", "0111111010100", "1010111111100011"]
>> for i in data:
       print(i.count("0"))
13
7
7
5
5
If your data is in a .dat file with whitespace separation as you described, then I would recommend loading your data as follows:
data = pd.read_csv("data.dat", lineterminator=" ",dtype="str", header=None, names=["Kirti"])
Kirti
0 0001100000101010100
1 110101000001111
2 101100011001110111
3 0111111010100
4 1010111111100011
The lineterminator argument ensures that every entry is in a new row. The dtype argument ensures that it's read as a string; otherwise you will lose the leading zeros.
If your data is in a DataFrame, you can use the count method (inspired from here):
>> data["Kirti"].str.count("0")
0 13
1 7
2 7
3 5
4 5
Name: Kirti, dtype: int64

How would you do this task using SQL or R library sqldf?

I need to implement the following function (ideally in R or SQL): given two data frames (each has a userid column, and the remaining columns are boolean attributes permitted to be only 0 or 1), return a new data frame with two columns (userid and count), where count is the number of matching 0s and 1s for each user across both tables. A user could occur in both data frames or in only one; in that last case, I need to return NA for that user's count. Here is an example:
DF1
ID c1 c2 c3 c4 c5
1 0 1 0 1 1
10 1 0 1 0 0
5 0 1 1 1 0
20 1 1 0 0 1
3 1 1 0 0 1
6 0 0 1 1 1
71 1 0 1 0 0
15 0 1 1 1 0
80 0 0 0 1 0
DF2
ID c1 c2 c3 c4 c5
5 1 0 1 1 0
6 0 1 0 0 1
15 1 0 0 1 1
80 1 1 1 0 0
78 1 1 1 0 0
98 0 0 1 1 1
1 0 1 0 0 1
2 1 0 0 1 1
9 0 0 0 1 0
My function must return something like this: (the following is a subset)
DF_Return
ID Count
1 4
2 NA
80 1
20 NA
.
.
.
Could you give me any suggestions on how to carry this out? I'm not an expert in SQL.
Here is the R code that generates the example data used above:
id1=c(1,10,5,20,3,6,71,15,80)
c1=c(0,1,0,1,1,0,1,0,0)
c2=c(1,0,1,1,1,0,0,1,0)
c3=c(0,1,1,0,0,1,1,1,0)
c4=c(1,0,1,0,0,1,0,1,1)
c5=c(1,0,0,1,1,1,0,0,0)
DF1=data.frame(ID=id1,c1=c1,c2=c2,c3=c3,c4=c4,c5=c5)
DF2=data.frame(ID=c(5,6,15,80,78,98,1,2,9),c1=c2,c2=c1,c3=c5,c4=c4,c5=c3)
Many thanks in advance.
Best Regards!
Here are two approaches for you. The first hardcodes the columns to compare, while the second is more general and agnostic to how many columns DF1 and DF2 have:
#Merge together using all = TRUE for the equivalent of an outer join
DF3 <- merge(DF1, DF2, by = "ID", all = TRUE, suffixes= c(".1", ".2"))
#Calculate the rowSums where the same columns match
out1 <- data.frame(ID = DF3[, 1], count = rowSums(DF3[, 2:6] == DF3[, 7:ncol(DF3)]))
#Approach that is agnostic to the number of columns you have
library(reshape2)
library(plyr)
DF3.m <- melt(DF3, id.vars = 1)
DF3.m[, c("level", "DF")] <- with(DF3.m, colsplit(variable, "\\.", c("level", "DF")))
out2 <- dcast(data = DF3.m, ID + level ~ DF, value.var="value")
colnames(out2)[3:4] <- c("DF1", "DF2")
out2 <- ddply(out2, "ID", summarize, count = sum(DF1 == DF2))
#Are they the same?
all.equal(out1, out2)
#[1] TRUE
> head(out1)
ID count
1 1 4
2 2 NA
3 3 NA
4 5 3
5 6 2
6 9 NA
SELECT
    COALESCE(DF1.ID, DF2.ID) AS ID,
    CASE WHEN DF1.c1 = DF2.c1 THEN 1 ELSE 0 END +
    CASE WHEN DF1.c2 = DF2.c2 THEN 1 ELSE 0 END +
    CASE WHEN DF1.c3 = DF2.c3 THEN 1 ELSE 0 END +
    CASE WHEN DF1.c4 = DF2.c4 THEN 1 ELSE 0 END +
    CASE WHEN DF1.c5 = DF2.c5 THEN 1 ELSE 0 END AS count_of_matches
FROM
    DF1
    FULL OUTER JOIN DF2
        ON DF1.ID = DF2.ID
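If you want to run that query from R, the sqldf package can pass it to its SQLite backend; a sketch (this assumes a SQLite build recent enough to support FULL OUTER JOIN, otherwise the join has to be emulated, e.g. with a LEFT JOIN plus a UNION):
# Sketch: run the query above through sqldf (FULL OUTER JOIN support assumed)
library(sqldf)
DF_Return <- sqldf("
  SELECT COALESCE(DF1.ID, DF2.ID) AS ID,
         CASE WHEN DF1.c1 = DF2.c1 THEN 1 ELSE 0 END +
         CASE WHEN DF1.c2 = DF2.c2 THEN 1 ELSE 0 END +
         CASE WHEN DF1.c3 = DF2.c3 THEN 1 ELSE 0 END +
         CASE WHEN DF1.c4 = DF2.c4 THEN 1 ELSE 0 END +
         CASE WHEN DF1.c5 = DF2.c5 THEN 1 ELSE 0 END AS count_of_matches
  FROM DF1
  FULL OUTER JOIN DF2 ON DF1.ID = DF2.ID")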
There's probably a more elegant way, but this works:
x <- merge(DF1,DF2,by="ID",all=TRUE)
pre <- paste("c",1:5,sep="")
x$Count <- rowSums(x[,paste(pre,"x",sep=".")]==x[,paste(pre,"y",sep=".")])
DF_Return <- x[,c("ID","Count")]
We could use safe_full_join from my package safejoin, and apply == between conflicting columns. This will yield a new data frame with logical c* columns that we can use rowSums on.
# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
library(dplyr)
safe_full_join(DF1, DF2, by = "ID", conflict = `==`) %>%
  transmute(ID, count = rowSums(.[-1]))
# ID count
# 1 1 4
# 2 10 NA
# 3 5 3
# 4 20 NA
# 5 3 NA
# 6 6 2
# 7 71 NA
# 8 15 1
# 9 80 1
# 10 78 NA
# 11 98 NA
# 12 2 NA
# 13 9 NA
You can use the apply function to handle this. To get the sum of each row, you can use:
sums <- apply(df1[2:ncol(df1)], 1, sum)
cbind(df1[1], sums)
which will return the sum of all but the first column, then bind that to the first column to get the ID back.
You could do that on both data frames. I'm not really clear what the desired behavior is after that, but maybe look at the merge function.