I'm working with a dataset in which a subset of observations has both initial and final values. I created an id that lets me identify those observations, so after applying this:
df['aux'] = df.duplicated(subset=['id'], keep=False)  # flag every id that appears more than once
df_dup = df[df.aux == True]
df_dup = df_dup.sort_values(by='id').reset_index()
I get something like this:
index id status value
88 1 'initial' 8
95 1 'final' 12
63 2 'initial' 9
52 2 'final' 13
What I want to achieve is to replace each initial value with its corresponding final value:
index id status value
88 1 'initial' 12
95 1 'final' 12
63 2 'initial' 13
52 2 'final' 13
I tried several things; my last attempt was this:
df_dup[df_dup.status=='initial'].reset_index().value= \
df_dup[df_dup.status=='final'].reset_index().value
But that fills the initial values with NaN:
index id status value
88 1 'initial' nan
95 1 'final' 12
63 2 'initial' nan
52 2 'final' 13
What am I missing?
Thanks
Use GroupBy.transform with 'last'. It also processes ids that occur only once, but for those it simply returns the row's own value. (Your attempt likely produced NaN because assignment between two differently filtered frames aligns on the index, and the 'initial' and 'final' rows carry different index labels, so nothing matches; it also modifies a temporary copy, so df_dup itself is never updated.)
df['value'] = df.groupby('id')['value'].transform('last')
print (df)
index id status value
0 88 1 'initial' 12
1 95 1 'final' 12
2 63 2 'initial' 13
3 52 2 'final' 13
If you want to replace only the duplicated id rows (worthwhile when there are many unique ids, for better performance):
mask = df.duplicated(subset=['id'], keep=False)
df.loc[mask, 'value'] = df[mask].groupby('id')['value'].transform('last')
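For reproducibility, a minimal sketch of the transform approach, with the sample values taken from the question:

import pandas as pd

# sample frame from the question (original index labels kept)
df = pd.DataFrame({'id': [1, 1, 2, 2],
                   'status': ['initial', 'final', 'initial', 'final'],
                   'value': [8, 12, 9, 13]},
                  index=[88, 95, 63, 52])

# broadcast each id's last ('final') value to every row of that id
df['value'] = df.groupby('id')['value'].transform('last')
print (df)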
Without groupby, based on your drop_duplicates: build an id-to-last-value lookup with set_index and map every row's id through it:
df['value'] = df['id'].map(df.drop_duplicates('id', keep='last').set_index('id')['value'])
print (df)
index id status value
0 88 1 'initial' 12
1 95 1 'final' 12
2 63 2 'initial' 13
3 52 2 'final' 13
I want to drop both rows in a pandas DataFrame where the (account, recharge_number) pair is not duplicated but the recharge_number itself is, i.e. the same recharge_number appears under different accounts. An illustrative example:
import pandas as pd

data = {'account': [43, 43, 43, 43, 45, 45],
        'recharge_number': [17777, 17777, 17999, 17888, 17222, 17999],
        'year': [2021, 2021, 2021, 2021, 2020, 2020],
        'month': [2, 3, 5, 6, 2, 9]}
df = pd.DataFrame(data)
account recharge_number year month
43 17777 2021 2
43 17777 2021 3
43 17999 2021 5
43 17888 2021 6
45 17222 2020 2
45 17999 2020 9
Output:
account recharge_number year month
43 17777 2021 2
43 17777 2021 3
43 17888 2021 6
45 17222 2020 2
Another method is to drop the unwanted rows instead of keeping the others:
>>> df.drop(df[~df.duplicated(['id', 'number'], keep=False)
& df.duplicated('number', keep=False)].index)
id number
0 5 10
1 5 10
3 6 20
5 7 40
The first condition protects all records duplicated on ('id', 'number'); the second condition removes all records whose 'number' repeats elsewhere.
Basically, you want rows where "the full row (or just the two columns, in a larger dataframe) is duplicated" or "number is not duplicated".
You can use duplicated:
df[df[['id', 'number']].duplicated(keep=False) | ~df['number'].duplicated(keep=False)]
Output:
id number
0 5 10
1 5 10
3 6 20
5 7 40
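The 'id'/'number' frame the two answers above operate on is not shown in the thread; a hypothetical input consistent with their printed output could look like this:

import pandas as pd

# hypothetical input reconstructed from the output above
df = pd.DataFrame({'id':     [5, 5, 6, 6, 7, 7],
                   'number': [10, 10, 30, 20, 30, 40]})

# keep rows whose ('id', 'number') pair is duplicated,
# or whose 'number' does not repeat at all
out = df.drop(df[~df.duplicated(['id', 'number'], keep=False)
                 & df.duplicated('number', keep=False)].index)
print (out)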
Solution with pd.crosstab: the crosstab counts occurrences of each recharge_number per account, .ne(0).sum() counts how many distinct accounts each recharge_number appears under, and .gt(1) flags the numbers shared by more than one account:
mask = pd.crosstab(df["account"], df["recharge_number"]).ne(0).sum().gt(1)
print(df[~df["recharge_number"].isin(mask[mask].index)])
Prints:
account recharge_number year month
0 43 17777 2021 2
1 43 17777 2021 3
3 43 17888 2021 6
4 45 17222 2020 2
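For intuition, printing the intermediate mask for the sample data shows that only the number shared by both accounts (17999) gets flagged:

mask = pd.crosstab(df["account"], df["recharge_number"]).ne(0).sum().gt(1)
print (mask)
# recharge_number
# 17222    False
# 17777    False
# 17888    False
# 17999     True
# dtype: bool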
I have the following DataFrame:
num_tra num_ts Year Value
0 0 0 1 100
1 0 0 2 90
2 0 0 3 80
3 0 1 1 90
4 0 1 2 81
5 0 1 3 72
6 1 0 1 81
7 1 0 2 73
8 1 0 3 65
9 1 1 1 73
10 1 1 2 66
11 1 1 3 58
12 2 0 1 142
13 2 0 2 160
14 2 0 3 144
15 2 1 1 128
16 2 1 2 144
17 2 1 3 130
Based on the Multiple Interactions Altair example, I tried to build a chart with two sliders driven (in this example) by the values of the columns num_tra [0 to 2] and num_ts [0 to 1], but it doesn't work:
import altair as alt
from vega_datasets import data
base = alt.Chart(df, width=500, height=300).mark_line(color="Red").encode(
    x=alt.X('Year:Q'),
    y='Value:Q',
    tooltip="Value:Q"
)
# Slider filter
tra_slider = alt.binding_range(min=0, max=2, step=1)
ts_slider = alt.binding_range(min=0, max=1, step=1)
slider1 = alt.selection_single(bind=tra_slider, fields=['num_tra'], name="TRA")
slider2 = alt.selection_single(bind=ts_slider, fields=['num_ts'], name="TS")
filter_TRA = base.add_selection(
    slider1, slider2
).transform_filter(
    slider1, slider2
).properties(title="Sensi_TRA")
filter_TRA
=> TypeError: transform_filter() takes 2 positional arguments but 3 were given
There is no problem with one slider, but as mentioned, I wasn't able to combine two or more sliders on the same chart.
Any ideas would be much appreciated.
There are a couple of ways to do this (transform_filter accepts a single filter expression, which is why passing two selections raises the TypeError). If you want the filters to be applied sequentially, you can use two transform_filter statements:
filter_TRA = base.add_selection(
    slider1, slider2
).transform_filter(
    slider1
).transform_filter(
    slider2
)
Alternatively, you can use a single transform_filter statement with the & or | operators, to filter on the intersection or union of the slider values, respectively:
filter_TRA = base.add_selection(
    slider1, slider2
).transform_filter(
    slider1 & slider2
)
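For reference, a minimal end-to-end sketch with the question's data, kept on the same selection_single API the question uses:

import altair as alt
import pandas as pd

# data from the question
df = pd.DataFrame({
    'num_tra': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    'num_ts':  [0, 0, 0, 1, 1, 1] * 3,
    'Year':    [1, 2, 3] * 6,
    'Value':   [100, 90, 80, 90, 81, 72, 81, 73, 65,
                73, 66, 58, 142, 160, 144, 128, 144, 130],
})

base = alt.Chart(df, width=500, height=300).mark_line(color="Red").encode(
    x=alt.X('Year:Q'),
    y='Value:Q',
    tooltip="Value:Q"
)

tra_slider = alt.binding_range(min=0, max=2, step=1)
ts_slider = alt.binding_range(min=0, max=1, step=1)
slider1 = alt.selection_single(bind=tra_slider, fields=['num_tra'], name="TRA")
slider2 = alt.selection_single(bind=ts_slider, fields=['num_ts'], name="TS")

filter_TRA = base.add_selection(
    slider1, slider2
).transform_filter(
    slider1 & slider2
).properties(title="Sensi_TRA")
filter_TRA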
I've researched previous similar questions, but couldn't find any applicable leads:
I have a dataframe, called "df" which is roughly structured as follows:
Income Income_Quantile Score_1 Score_2 Score_3
0 100000 5 75 75 100
1 97500 5 80 76 94
2 80000 5 79 99 83
3 79000 5 88 78 91
4 70000 4 55 77 80
5 66348 4 65 63 57
6 67931 4 60 65 57
7 69232 4 65 59 62
8 67948 4 64 64 60
9 50000 3 66 50 60
10 49593 3 58 51 50
11 49588 3 58 54 50
12 48995 3 59 59 60
13 35000 2 61 50 53
14 30000 2 66 35 77
15 12000 1 22 60 30
16 10000 1 15 45 12
Using the "Income_Quantile" column and the following "for-loop", I divided the dataframe into a list of 5 subset dataframes (which each contain observations from the same income quantile):
dfs = []
for level in df.Income_Quantile.unique():
    df_temp = df.loc[df.Income_Quantile == level]
    dfs.append(df_temp)
Now, I would like to apply the following function, which calculates the Spearman correlation, p-value and t-statistic, to each dataframe in the list (fyi: scipy.stats functions are used in the main function):
cols = ['Score_1', 'Score_2', 'Score_3']  # the score columns compared against Income

def create_list_of_scores(df):
    df_result = pd.DataFrame(columns=cols)
    df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
    df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
    df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
    return df_result
The functions that "create_list_of_scores" uses, i.e. "ttest_ind" and "spearmanr", can be imported from scipy.stats as follows:
from scipy.stats import ttest_ind
from scipy.stats import spearmanr
I tested the function on one subset of the dataframe:
data = dfs[1]
result = create_list_of_scores(data)
It works as expected.
However, when it comes to applying the function to the entire list of dataframes, "dfs", a lot of issues arise. If I apply it to the list of dataframes as follows:
result = pd.concat([create_list_of_scores(d) for d in dfs], axis=1)
I get the columns "Score_1, Score_2, and Score_3" repeated five times, side by side.
I would like to:
Have just three columns "Score_1, Score_2, and Score_3".
Index the output using the t-statistic, p-value and correlation as the first-level index, and the "Income_Quantile" as the second-level index.
Here is what I have in mind:
Score_1 Score_2 Score_3
t-statistic 1
2
3
4
5
p-value 1
2
3
4
5
correlation 1
2
3
4
5
Any idea on how I can merge the output of my function as requested?
I think it is better to use GroupBy.apply: it runs the function once per quantile and concatenates the results with Income_Quantile as the outer index level; swaplevel(0, 1) then puts the statistic names first, and sort_index groups them together:
cols = ['Score_1', 'Score_2', 'Score_3']

def create_list_of_scores(df):
    df_result = pd.DataFrame(columns=cols)
    df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
    df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
    # spearmanr returns (correlation, p-value); [1] picks the p-value,
    # use [0] if you want the correlation coefficient itself
    df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
    return df_result

df = df.groupby('Income_Quantile').apply(create_list_of_scores).swaplevel(0, 1).sort_index()
print (df)
Score_1 Score_2 Score_3
Income_Quantile
correlation 1 NaN NaN NaN
2 NaN NaN NaN
3 6.837722e-01 0.000000e+00 1.000000e+00
4 4.337662e-01 6.238377e-01 4.818230e-03
5 2.000000e-01 2.000000e-01 2.000000e-01
p-value 1 8.190692e-03 8.241377e-03 8.194933e-03
2 5.887943e-03 5.880440e-03 5.888611e-03
3 3.606128e-13 3.603267e-13 3.604996e-13
4 5.584822e-14 5.587619e-14 5.586583e-14
5 3.861801e-06 3.862192e-06 3.864736e-06
t-statistic 1 1.098143e+01 1.094719e+01 1.097856e+01
2 1.297459e+01 1.298294e+01 1.297385e+01
3 2.391611e+02 2.391927e+02 2.391736e+02
4 1.090548e+02 1.090479e+02 1.090505e+02
5 1.594605e+01 1.594577e+01 1.594399e+01
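To pull out a single block afterwards, e.g. all t-statistics, select on the outer index level:

print (df.loc['t-statistic'])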
Following are the dataframes I have:
score_df
col1_id col2_id score
1 2 10
5 6 20
records_df
date col_id
D1 6
D2 4
D3 1
D4 2
D5 5
D6 7
I would like to compute a score based on the following criteria:
When 2 occurs after 1, the score 10 should be assigned; likewise, when 1 occurs after 2, the score 10 should be assigned.
i.e. if (1,2) gives a score of 10, then (2,1) also gets the same score of 10.
Considering (1,2): when 1 occurs for the first time, we don't assign a score; we flag the row and wait for 2 to occur. When 2 occurs in the column, we give the score 10.
Considering (2,1): when 2 comes first, we assign 0 and wait for 1 to occur. When 1 occurs, we give the score 10.
So, for the first occurrence, don't assign the score; wait for the corresponding event to occur and then assign it.
So, my result dataframe should look something like this:
result
date col_id score
D1 6 0 -- even though 6 is in the score list, it occurred for the first time, so 0
D2 4 0 -- 4 is not even in the list
D3 1 0 -- 1 occurred for the first time, so 0
D4 2 10 -- 1 occurred previously and 2 occurs now, so we can assign 10
D5 5 20 -- 6 occurred previously, so we can assign 20
D6 7 0 -- 7 is not in the list
I have around 100k rows in both score_df and records_df, and looping to assign the scores takes too long. Can someone help with logic that avoids looping over the entire dataframe?
From what I understand, you can try melt for unpivoting and then merge (note: this assumes an extra uid column that is present in the original data but not in the frames shown above). Keeping the index from the melted df, we check where that index is duplicated, and return the score from the merge where it is, else 0.
m = score_df.reset_index().melt(['index', 'uid', 'score'],
                                var_name='col_name', value_name='col_id')
# merge each record onto the melted pairs; 'index' identifies the pair a row came from
final = records_df.merge(m.drop(columns='col_name'), on=['uid', 'col_id'], how='left')
# only the second hit on the same pair index earns the score
c = final.duplicated(['index']) & final['index'].notna()
final = final.drop(columns='index').assign(score=lambda x: x['score'].where(c, 0))
print(final)
uid date col_id score
0 123 D1 6 0.0
1 123 D2 4 0.0
2 123 D3 1 0.0
3 123 D4 2 10.0
4 123 D5 5 20.0
5 123 D6 7 0.0
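Adapted to the exact frames shown in the question (no uid column), a minimal sketch of the same idea:

import pandas as pd

score_df = pd.DataFrame({'col1_id': [1, 5], 'col2_id': [2, 6], 'score': [10, 20]})
records_df = pd.DataFrame({'date': ['D1', 'D2', 'D3', 'D4', 'D5', 'D6'],
                           'col_id': [6, 4, 1, 2, 5, 7]})

# unpivot: one row per (pair index, member id)
m = score_df.reset_index().melt(['index', 'score'], value_name='col_id')
final = records_df.merge(m.drop(columns='variable'), on='col_id', how='left')
# only the second occurrence of a pair keeps its score
c = final.duplicated('index') & final['index'].notna()
final = final.drop(columns='index').assign(score=lambda x: x['score'].where(c, 0))
print (final)
#   date  col_id  score
# 0   D1       6    0.0
# 1   D2       4    0.0
# 2   D3       1    0.0
# 3   D4       2   10.0
# 4   D5       5   20.0
# 5   D6       7    0.0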
I will explain my problem with a simple example:
create table foo(id int,idx int,idy int,fld int,fldx varchar);
insert into foo values (1,2,3,55,'AA'),(2,3,4,77,'AB'),(3,4,8,55,'AX'),(9,10,15,77,'AR'),
(3,4,8,11,'AX'),(3,4,8,65,'AX'),(3,4,8,77,'AX');
id  idx  idy  fld  fldx
 1    2    3   55  AA
 2    3    4   77  AB
 3    4    8   55  AX
 9   10   15   77  AR
 3    4    8   11  AX
 3    4    8   65  AX
 3    4    8   77  AX
I need to select column fld together with the total count of each fld value, in descending order of the count.
Expected Result :
fld count
---------
77 3
55 2
11 1
65 1
select fld
,count(fld) rw_count
from foo
group by fld
order by rw_count desc
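Note that rows tied on the count (here 11 and 65, each with count 1) can come back in either order; if the expected ordering matters, add a tiebreaker such as order by rw_count desc, fld.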
Group By with positional column references (1 = the first selected column, 2 = the second):
select fld, count(*) from foo group by 1 order by 2 desc;