I have a pandas DataFrame that looks like this:
   age   score
5   72  99.424
6   70  99.441
7   69  99.442
8   67  99.443
9   71  99.448
mean score: 99.4396
The mean is taken over the whole score column. How can I slice/get the age values whose score is within, say, +/- 0.001 of the mean score?
So in this case: 67 and 69
mean = df['score'].mean()
df[df['score'].between(mean - .001, mean + .001)]['age']
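For example, with the sample data above (a minimal sketch; a tolerance of 0.002 is used here because no score in the sample lies within 0.001 of the mean):

import pandas as pd

df = pd.DataFrame({'age': [72, 70, 69, 67, 71],
                   'score': [99.424, 99.441, 99.442, 99.443, 99.448]})
tol = 0.002  # the +/- window around the mean; adjust as needed
mean = df['score'].mean()
print(df[df['score'].between(mean - tol, mean + tol)]['age'].tolist())  # [70]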
import pandas as pd
df = pd.DataFrame({"age": [72, 70, 69, 67, 71], "score": [99.424, 99.441, 99.442, 99.443, 99.448]})
# absolute distance of every score from the column mean
df["diff"] = (df["score"] - df["score"].mean()).abs()
You get:
age score diff
0 72 99.424 0.0156
1 70 99.441 0.0014
2 69 99.442 0.0024
3 67 99.443 0.0034
4 71 99.448 0.0084
Then:
x = 0.002
ages = list(df.loc[df["diff"] < x]["age"])
[Out]: [70]
x is your parameter for the allowed difference from the mean.
EDIT: note that we cannot reproduce your expected result (67 and 69), as we do not have your whole score column.
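If you want the single closest age regardless of any tolerance, a small follow-up sketch reusing the diff column built above:

closest_age = df.loc[df['diff'].idxmin(), 'age']
print(closest_age)  # 70, the age whose score is nearest the mean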
I'm using Python 3.7, and I want to compare two Excel files that have the same columns (140 columns) but a different number of rows. I looked online, but I didn't find a solution for my case.
Here is an example :
df1 (old report) :
id qte d1 d2
A 10 23 35
B 43 63 63
C 15 61 62
df2 (new report) :
id qte d1 d2
A 20 23 35
C 15 61 62
E 38 62 16
F 63 20 51
and the result should be:
the modified rows in yellow, with the changed values in red
the new rows in green
the deleted rows in red
id qte d1 d2
A 20 23 35
C 15 61 62
B 43 63 63
E 38 62 16
F 63 20 51
The code:
import pandas as pd
import numpy as np
df1= pd.read_excel(r'C .....\data novembre.xlsx','Sheet1',na_values=['NA'])
df2= pd.read_excel(r'C.....\data decembre.xlsx','Sheet1',na_values=['NA'])
merged_data=df1.merge(df2, left_on = 'id', right_on = 'id', how = 'outer')
Joining the data, though, is not what I want!
I'm just starting to learn Python so I really need help!
An Excel diff can quickly become a funky beast, but we should be able to do this with some concats and boolean statements.
Assuming your DataFrames are called df1 and df2:
df1 = df1.set_index('id')
df2 = df2.set_index('id')
df3 = pd.concat([df1,df2],sort=False)
df3a = df3.stack().groupby(level=[0,1]).unique().unstack(1).copy()
df3a.loc[~df3a.index.isin(df2.index),'status'] = 'deleted' # if not in df2 index then deleted
df3a.loc[~df3a.index.isin(df1.index),'status'] = 'new' # if not in df1 index then new
idx = df3.stack().groupby(level=[0,1]).nunique() # get modified cells.
df3a.loc[idx.mask(idx <= 1).dropna().index.get_level_values(0),'status'] = 'modified'
df3a['status'] = df3a['status'].fillna('same') # assume that anything not fulfilled by the above rules is the same.
print(df3a)
d1 d2 qte status
id
A [23] [35] [10, 20] modified
B [63] [63] [43] deleted
C [61] [62] [15] same
E [62] [16] [38] new
F [20] [51] [63] new
If you don't mind the performance hit of turning all your datatypes into strings, then this could work. I don't recommend it though; use a fact or slowly changing dimension schema to hold such data, and you'll thank yourself in the future.
df3a.stack().explode().astype(str).groupby(level=[0,1]).agg('-->'.join).unstack(1)
d1 d2 qte status
id
A 23 35 10-->20 modified
B 63 63 43 deleted
C 61 62 15 same
E 62 16 38 new
F 20 51 63 new
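For the colouring the question asks about, a minimal hedged sketch using the pandas Styler on the string-joined frame (the df4 name, the colours, and the output filename are illustrative assumptions; Styler.to_excel needs openpyxl, and this colours whole rows by status rather than individual changed cells):

df4 = df3a.stack().explode().astype(str).groupby(level=[0,1]).agg('-->'.join).unstack(1)

def colour_row(row):
    # one CSS string per cell in the row, chosen by the row's status
    colours = {'modified': 'background-color: yellow',
               'new': 'background-color: lightgreen',
               'deleted': 'background-color: red'}
    return [colours.get(row['status'], '')] * len(row)

df4.style.apply(colour_row, axis=1).to_excel('diff_report.xlsx')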
I'm working with Python Pandas trying to sort some student testing data. On occasion, students will test twice during the same testing window, and I want to save only the highest of the two tests. Here's an example of my dataset.
Name Score
Alice 32
Alice 75
John 89
Mark 40
Mark 70
Amy 60
Any ideas of how can I save only the higher score for each student?
If your data is in the dataframe df, you can sort by the score in descending order and drop duplicate names, keeping the first:
df.sort_values(by='Score', ascending=False).drop_duplicates(subset='Name', keep='first')
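If you want the surviving rows back in their original order afterwards, a small optional follow-up:

(df.sort_values(by='Score', ascending=False)
   .drop_duplicates(subset='Name', keep='first')
   .sort_index())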
You can do this with groupby. It works like this:
df.groupby('Name').agg({'Score': 'max'})
It results in:
Score
Name
Alice 75
Amy 60
John 89
Mark 70
By the way, in this particular setup you could also use drop_duplicates to make the names unique after sorting on the score. This yields the same result, but is less extensible (e.g. if you later want to add the average score as well, as sketched below). It looks like this:
df.sort_values(['Name', 'Score']).drop_duplicates(['Name'], keep='last')
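As an illustration of the extensibility point, the groupby version extends naturally to more statistics; a small sketch using named aggregation (available since pandas 0.25):

df.groupby('Name').agg(best=('Score', 'max'), average=('Score', 'mean'))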
From the test data you posted:
import pandas as pd
from io import StringIO
sio= StringIO("""Name Score
Alice 32
Alice 75
John 89
Mark 40
Mark 70
Amy 60 """)
df = pd.read_csv(sio, sep=r'\s+')
There are multiple ways to do that, two of them are:
In [8]: df = pd.DataFrame({"Score" : [32, 75, 89, 40, 70, 60],
...: "Name" : ["Alice", "Alice", "John", "Mark", "Mark", "Amy"]})
...: df
Out[8]:
Score Name
0 32 Alice
1 75 Alice
2 89 John
3 40 Mark
4 70 Mark
5 60 Amy
In [13]: %time df.groupby("Name").max()
CPU times: user 2.26 ms, sys: 286 µs, total: 2.54 ms
Wall time: 2.11 ms
Out[13]:
Score
Name
Alice 75
Amy 60
John 89
Mark 70
In [14]: %time df.sort_values("Name").drop_duplicates(subset="Name", keep="last")
CPU times: user 2.25 ms, sys: 0 ns, total: 2.25 ms
Wall time: 1.89 ms
Out[14]:
Score Name
1 75 Alice
5 60 Amy
2 89 John
4 70 Mark
This question has already been answered on Stack Overflow.
You can merge the two pandas DataFrames and then calculate the maximum in each row. df1 and df2 are the DataFrames of student scores:
import pandas as pd
df1 = pd.DataFrame({'Alice': 3,
'John': 8,
'Mark': 7.5,
'Amy': 0},
index=[0])
df2 = pd.DataFrame({'Alice': 7,
'Mark': 7},
index=[0])
result = pd.concat([df1, df2], sort=True)
result = result.T
result["maxvalue"] = result.max(axis=1)
I've researched previous similar questions, but couldn't find any applicable leads:
I have a dataframe, called "df" which is roughly structured as follows:
Income Income_Quantile Score_1 Score_2 Score_3
0 100000 5 75 75 100
1 97500 5 80 76 94
2 80000 5 79 99 83
3 79000 5 88 78 91
4 70000 4 55 77 80
5 66348 4 65 63 57
6 67931 4 60 65 57
7 69232 4 65 59 62
8 67948 4 64 64 60
9 50000 3 66 50 60
10 49593 3 58 51 50
11 49588 3 58 54 50
12 48995 3 59 59 60
13 35000 2 61 50 53
14 30000 2 66 35 77
15 12000 1 22 60 30
16 10000 1 15 45 12
Using the "Income_Quantile" column and the following "for-loop", I divided the dataframe into a list of 5 subset dataframes (which each contain observations from the same income quantile):
dfs = []
for level in df.Income_Quantile.unique():
    df_temp = df.loc[df.Income_Quantile == level]
    dfs.append(df_temp)
Now, I would like to apply the following function, which calculates the Spearman correlation, p-value and t-statistic, to the dataframes (cols is the list of score-column names; scipy.stats functions are used inside the function):
def create_list_of_scores(df):
    df_result = pd.DataFrame(columns=cols)
    df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
    df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
    df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
    return df_result
The functions that "create_list_of_scores" uses, i.e. "ttest_ind" and "spearmanr", can be imported from scipy.stats as follows:
from scipy.stats import ttest_ind
from scipy.stats import spearmanr
I tested the function on one subset of the dataframe:
data = dfs[1]
result = create_list_of_scores(data)
It works as expected.
However, when it comes to applying the function to the entire list of dataframes, "dfs", a lot of issues arise. If I apply it to the list of dataframes as follows:
result = pd.concat([create_list_of_scores(d) for d in dfs], axis=1)
I get the output with the columns Score_1, Score_2, and Score_3 repeated five times (once per quantile).
I would like to:
Have just three columns "Score_1, Score_2, and Score_3".
Index the output using the t-statistic, p-value and correlation as the first index level, and the "Income_Quantile" as the second index level.
Here is what I have in mind:
Score_1 Score_2 Score_3
t-statistic 1
2
3
4
5
p-value 1
2
3
4
5
correlation 1
2
3
4
5
Any idea on how I can merge the output of my function as requested?
I think it is better to use GroupBy.apply:
cols = ['Score_1','Score_2','Score_3']
def create_list_of_scores(df):
    df_result = pd.DataFrame(columns=cols)
    df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
    df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
    # note: spearmanr(...)[1] is the p-value; index [0] would give the correlation coefficient
    df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
    return df_result
df = df.groupby('Income_Quantile').apply(create_list_of_scores).swaplevel(0,1).sort_index()
print (df)
Score_1 Score_2 Score_3
Income_Quantile
correlation 1 NaN NaN NaN
2 NaN NaN NaN
3 6.837722e-01 0.000000e+00 1.000000e+00
4 4.337662e-01 6.238377e-01 4.818230e-03
5 2.000000e-01 2.000000e-01 2.000000e-01
p-value 1 8.190692e-03 8.241377e-03 8.194933e-03
2 5.887943e-03 5.880440e-03 5.888611e-03
3 3.606128e-13 3.603267e-13 3.604996e-13
4 5.584822e-14 5.587619e-14 5.586583e-14
5 3.861801e-06 3.862192e-06 3.864736e-06
t-statistic 1 1.098143e+01 1.094719e+01 1.097856e+01
2 1.297459e+01 1.298294e+01 1.297385e+01
3 2.391611e+02 2.391927e+02 2.391736e+02
4 1.090548e+02 1.090479e+02 1.090505e+02
5 1.594605e+01 1.594577e+01 1.594399e+01
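Once the frame is indexed this way, one statistic block can be pulled out via the first index level, for example:

print(df.loc['t-statistic'])  # one row per Income_Quantile, one column per score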
How do I add up rows and columns?
The last column, Sum, needs to be the row-wise sum of R0 + R1 + R2.
The last row needs to be the column-wise sums of those same columns.
import pandas as pd
# initialize list of lists
data = [['AP',16,20,78], ['AP+', 10,14,55], ['SP',32,26,90],['Total',0, 0, 0]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Type', 'R0', 'R1', 'R2'])
The result:
Type R0 R1 R2 Sum
0 AP 16 20 78 NaN
1 AP+ 10 14 55 NaN
2 SP 32 26 90 NaN
3 Total 0 0 0 NaN
Let us try .iloc position selection:
df.iloc[-1,1:]=df.iloc[:-1,1:].sum()
df['Sum']=df.iloc[:,1:].sum(axis=1)
df
Type R0 R1 R2 Sum
0 AP 16 20 78 114
1 AP+ 10 14 55 79
2 SP 32 26 90 148
3 Total 58 60 223 341
In general it may be better practice to specify column names:
import pandas as pd
# initialize list of lists
data = [['AP',16,20,78], ['AP+', 10,14,55], ['SP',32,26,90],['Total',0, 0, 0]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Type', 'R0', 'R1', 'R2'])
# List columns
cols_to_sum=['R0', 'R1', 'R2']
# Access last row and sum columns-wise
df.loc[df.index[-1], cols_to_sum] = df[cols_to_sum].sum(axis=0)
# Create 'Sum' column summing row-wise
df['Sum']=df[cols_to_sum].sum(axis=1)
df
Type R0 R1 R2 Sum
0 AP 16 20 78 114
1 AP+ 10 14 55 79
2 SP 32 26 90 148
3 Total 58 60 223 341
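As a quick illustration of why the explicit list is safer, suppose a non-numeric column is added later (a hypothetical example, not from the question):

df['Notes'] = ''  # hypothetical extra, non-numeric column added later
df['Sum'] = df[cols_to_sum].sum(axis=1)  # still sums only R0, R1 and R2

The positional df.iloc[:, 1:] version would instead pick up the new column as well, and with a non-numeric column that can fail or give surprising results.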
import numpy as np
import pandas as pd
xlist = np.arange(1, 100).tolist()
df = pd.DataFrame(xlist, columns=['Numbers'], dtype=int)
pd.cut(df['Numbers'],5)
How do I assign a column name to each distinct interval created?
IIUC, you can use the pd.concat function and join them in a new data frame based on indexes:
# get indexes
l = df.index.tolist()
n = 20
indexes = [l[i:i + n] for i in range(0, len(l), n)]
# create new data frame
new_df = pd.concat([df.iloc[x].reset_index(drop=True) for x in indexes], axis=1)
new_df.columns = ['Numbers'+str(x) for x in range(new_df.shape[1])]
print(new_df)
Numbers0 Numbers1 Numbers2 Numbers3 Numbers4
0 1 21 41 61 81.0
1 2 22 42 62 82.0
2 3 23 43 63 83.0
3 4 24 44 64 84.0
4 5 25 45 65 85.0
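If instead you want each new column labelled with the interval pd.cut created, a hedged alternative sketch grouping on the bins themselves (the bin sizes are uneven, so shorter columns get NaN padding):

bins = pd.cut(df['Numbers'], 5)
new_df = pd.concat(
    {str(interval): grp['Numbers'].reset_index(drop=True)
     for interval, grp in df.groupby(bins)},
    axis=1)
print(new_df.columns.tolist())
# ['(0.902, 20.6]', '(20.6, 40.2]', '(40.2, 59.8]', '(59.8, 79.4]', '(79.4, 99.0]']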