filter df based on index condition - pandas

I have a df with lots of rows:
13790226       0.320  0.001976    
9895d5dis 182.600  0.040450     
105066007     18.890  0.006432     
109067019     52.500  0.034011     
111845014     16.400  0.023974     
11668574e      7.180  0.070714     
113307021      4.110  0.017514     
113679I37      8.180  0.010837     
I would like to filter this df to obtain the rows where the last character of the index is not a digit.
Desired df:
9895d5dis 182.600 0.040450
11668574e 7.180 0.070714
How can I do it?

df['is_digit'] = [i[-1].isdigit() for i in df.index]
df[~df['is_digit']]  # ~ negates the boolean column
But I like regex better:
df[df.index.str.contains('[A-Za-z]$')]
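As a self-contained check of both approaches (the column name b is invented for this sketch): note that a bare [A-z] character class also matches the six ASCII characters between Z and a, such as _, which is why the explicit [A-Za-z] form is used here.

```python
import pandas as pd

# Toy frame with a string index; column b is a made-up name for this sketch
df = pd.DataFrame({"b": [0.320, 182.600, 7.180]},
                  index=["13790226", "9895d5dis", "11668574e"])

# Boolean approach: keep rows whose last index character is not a digit
mask = [not i[-1].isdigit() for i in df.index]
out1 = df[mask]

# Regex approach: keep rows whose index ends with a letter
out2 = df[df.index.str.contains(r"[A-Za-z]$")]
```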

Is the field you are filtering on the index or a column? If it's a column:
df1 = df[df[0].str.contains('[A-Za-z]')]
Returns
0 1 2
1 9895d5dis 182.60 0.040450
5 11668574e 7.18 0.070714
7 113679I37 8.18 0.010837 #looks like read_clipboard is reading 1 in 113679137 as I
If it's the index, first do
df = df.reset_index()
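A sketch of that route (the column name val is hypothetical): move the index into a regular column, filter on it, then restore it as the index.

```python
import pandas as pd

# val is a made-up column name for this sketch
df = pd.DataFrame({"val": [0.32, 182.6, 7.18]},
                  index=["13790226", "9895d5dis", "11668574e"])

# An unnamed index becomes a column called "index" after reset_index
df1 = df.reset_index()
df1 = df1[df1["index"].str.contains(r"[A-Za-z]$")].set_index("index")
```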

Here's a concise way without creating a new temp column:
df
b c
a
9895d5dis 182.60 0.040450
105066007 18.89 0.006432
109067019 52.50 0.034011
111845014 16.40 0.023974
11668574e 7.18 0.070714
113307021 4.11 0.017514
113679I37 8.18 0.010837
df[~df.index.str[-1].str.isnumeric()]
b c
a
9895d5dis 182.60 0.040450
11668574e 7.18 0.070714

Throwing this into the mix:
df.loc[[x for x in df.index if x[-1].isalpha()]]
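This works because .loc accepts a list of index labels; a quick self-contained sketch (column b is invented):

```python
import pandas as pd

df = pd.DataFrame({"b": [182.60, 18.89, 7.18]},
                  index=["9895d5dis", "105066007", "11668574e"])

# Build the list of labels whose last character is a letter, then select them
result = df.loc[[x for x in df.index if x[-1].isalpha()]]
```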

Related

GroupBy-Apply even for empty DataFrame

I am using groupby-apply to create a new DataFrame from a given DataFrame. But if the given DataFrame is empty, the result looks like the given DataFrame with group keys, not like the target new DataFrame. So to get the shape of the target new DataFrame, I have to use if..else with a length check, and if the given DataFrame is empty, manually create a DataFrame with the specified columns and indexes.
This breaks the flow of the code. Also, if the structure of the target DataFrame changes in the future, I would have to fix the code in two places instead of one.
Is there a way to get the shape of the target DataFrame even if the given DataFrame is empty, with GroupBy only (or without if..else)?
Simplified example:
import pandas as pd

def some_func(df: pd.DataFrame):
    return df.values.sum() + pd.DataFrame([[1, 1, 1], [2, 2, 2], [3, 3, 3]],
                                          columns=['new_col1', 'new_col2', 'new_col3'])
df1 = pd.DataFrame([[1,1], [1,2], [2,1], [2,2]], columns=['col1', 'col2'])
df2 = pd.DataFrame(columns=['col1', 'col2'])
df1_grouped = df1.groupby(['col1'], group_keys=False).apply(lambda df: some_func(df))
df2_grouped = df2.groupby(['col1'], group_keys=False).apply(lambda df: some_func(df))
Result for df1 is ok:
new_col1 new_col2 new_col3
0 6 6 6
1 7 7 7
2 8 8 8
0 8 8 8
1 9 9 9
2 10 10 10
And not ok for df2:
Empty DataFrame
Columns: [col1, col2]
Index: []
If..else to get expected result for df2:
df = df2
if df.empty:
df_grouped = pd.DataFrame(columns=['new_col1', 'new_col2', 'new_col3'])
else:
df_grouped = df.groupby(['col1'], group_keys=False).apply(lambda df: some_func(df))
Gives what I need:
Empty DataFrame
Columns: [new_col1, new_col2, new_col3]
Index: []
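For reference, a runnable version of the workaround above (this just packages the question's own if..else, not a new answer):

```python
import pandas as pd

def some_func(df: pd.DataFrame) -> pd.DataFrame:
    # Per-group result whose shape is unrelated to the input group
    return df.values.sum() + pd.DataFrame([[1, 1, 1], [2, 2, 2], [3, 3, 3]],
                                          columns=["new_col1", "new_col2", "new_col3"])

df2 = pd.DataFrame(columns=["col1", "col2"])

if df2.empty:
    # groupby-apply on an empty frame would echo the input columns instead
    df2_grouped = pd.DataFrame(columns=["new_col1", "new_col2", "new_col3"])
else:
    df2_grouped = df2.groupby(["col1"], group_keys=False).apply(some_func)
```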

Is there a way to use .loc on column names instead of the values inside the columns?

I am wondering if there is a way to use .loc to filter a df on its column names matching something in another df. I know you can usually use it to check whether a value is == to something, but what about the actual column name itself?
ex.
df1 = [ 0, 1, 2, 3]
df2 .columns = [2,4,6]
Is there a way to display only the df2 values whose column name is == something in df1, without hardcoding it like df2.loc[:, ==2]?
IIUC, you can use df2.columns.intersection to keep only the columns of df2 that are also present in df1:
>>> df1
A B D F
0 0.431332 0.663717 0.922112 0.562524
1 0.467159 0.549023 0.139306 0.168273
>>> df2
A B C D E F
0 0.451493 0.916861 0.257252 0.600656 0.354882 0.109236
1 0.676851 0.585368 0.467432 0.594848 0.962177 0.714365
>>> df2[df2.columns.intersection(df1.columns)]
A B D F
0 0.451493 0.916861 0.600656 0.109236
1 0.676851 0.585368 0.594848 0.714365
One solution:
df3 = df2[[c for c in df2.columns if c in df1]]
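Both answers can be sanity-checked on a small deterministic frame (the column names here are placeholders):

```python
import pandas as pd

df1 = pd.DataFrame([[0, 1]], columns=["A", "B"])
df2 = pd.DataFrame([[10, 20, 30]], columns=["A", "B", "C"])

# Keep only df2's columns that also appear in df1
via_intersection = df2[df2.columns.intersection(df1.columns)]

# Equivalent list comprehension; `c in df1` tests column membership
via_listcomp = df2[[c for c in df2.columns if c in df1]]
```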

selecting rows with min and max values in pandas dataframe

My df:
df = pd.DataFrame({'A': ['Adam', 'Adam', 'Adam', 'Adam'], 'B': [24, 90, 67, 12]})
I want to select, for each name, only the rows with the min and max value of 'B' in this df.
I can do that using this code:
df_max = df[df['B'] == df.groupby(['A'])['B'].transform('max')]
df_min = df[df['B'] == df.groupby(['A'])['B'].transform('min')]
df = pd.concat([df_max, df_min])
Is there any way to do this in one line? I prefer not to create two additional df's and concat them at the end.
Thanks
Use GroupBy.agg with DataFrameGroupBy.idxmax and DataFrameGroupBy.idxmin, reshape by DataFrame.melt, and select rows by DataFrame.loc:
df1 = df.loc[df.groupby('A')['B'].agg(['idxmax','idxmin']).melt()['value']].drop_duplicates()
Or DataFrame.stack:
df2 = df.loc[df.groupby('A')['B'].agg(['idxmax','idxmin']).stack()].drop_duplicates()
print(df2)
A B
1 Adam 90
3 Adam 12
A solution using groupby, apply and loc to select only the min or max value of column 'B'.
ddf = df.groupby('A').apply(lambda x : x.loc[(x['B'] == x['B'].min()) | (x['B'] == x['B'].max())]).reset_index(drop=True)
The result is:
A B
0 Adam 90
1 Adam 12
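A runnable check of the idxmax/idxmin one-liner on the question's frame (drop_duplicates guards the case where a group's min and max are the same row):

```python
import pandas as pd

df = pd.DataFrame({"A": ["Adam", "Adam", "Adam", "Adam"], "B": [24, 90, 67, 12]})

# Per group, find the row labels of the max and min, then select them
out = df.loc[df.groupby("A")["B"].agg(["idxmax", "idxmin"]).stack()].drop_duplicates()
```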

Assigning index column to empty pandas dataframe

I am creating an empty dataframe that I then want to add data to one row at a time. I want to index on the first column, 'customer_ID'.
I have this:
In[1]: df = pd.DataFrame(columns = ['customer_ID','a','b','c'],index=['customer_ID'])
In[2]: df
Out[3]:
customer_ID a b c
customer_ID NaN NaN NaN NaN
So there is already a row of NaN that I don't want.
Can I point the index to the first column without adding a row of data?
The answer, I think, as hinted at by @JD Long, is to set the index in a separate instruction:
In[1]: df = pd.DataFrame(columns = ['customer_ID','a','b','c'])
In[2]: df.set_index('customer_ID', drop=False, inplace=True)  # drop=False keeps the column, as shown below
In[3]: df
Out[3]:
Empty DataFrame
Columns: [customer_ID, a, b, c]
Index: []
I can then add rows:
In[4]: id='x123'
In[5]: df.loc[id]=[id,4,5,6]
In[6]: df
Out[7]:
customer_ID a b c
x123 x123 4.0 5.0 6.0
yes... and you can dropna at any time if you are so inclined:
df = df.set_index('customer_ID').dropna()
df
That's because passing index=['customer_ID'] creates a row with that label even though there is no data yet. If you build the frame with data, you can set the index directly:
df= pd.DataFrame({'customer_ID': ['2'],'a': ['1'],'b': ['A'],'c': ['1']})
df.set_index('customer_ID',drop=False)
df
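If you don't need customer_ID repeated as a regular column, the default drop=True variant is a bit leaner (a sketch; the row values are invented):

```python
import pandas as pd

# set_index moves customer_ID into the index, so each new row has three values
df = pd.DataFrame(columns=["customer_ID", "a", "b", "c"]).set_index("customer_ID")
df.loc["x123"] = [4, 5, 6]
```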

Pandas: Selecting rows by list

I tried the following code to select columns from a dataframe. My dataframe has about 50 values. At the end, I want to create the sum of the selected columns, create a new column with these sum values, and then delete the selected columns.
I started with
columns_selected = ['A','B','C','D','E']
df = df[df.column.isin(columns_selected)]
but it said AttributeError: 'DataFrame' object has no attribute 'column'
Regarding the sum: As I don't want to write for the sum
df['sum_1'] = df['A']+df['B']+df['C']+df['D']+df['E']
I also thought that something like
df['sum_1'] = df[columns_selected].sum(axis=1)
would be more convenient.
You want df[columns_selected] to sub-select the df by a list of columns; you can then do df['sum_1'] = df[columns_selected].sum(axis=1).
To filter the df to just the cols of interest, pass a list of the columns: df = df[columns_selected]. Note that it's a common error to pass just the strings, df = df['a','b','c'], which will raise a KeyError.
Note that you had a typo in your original attempt:
df = df.loc[:, df.columns.isin(columns_selected)]
The above would have worked: firstly you needed columns, not column; secondly you can use the boolean mask against the columns by passing it to loc as the column-selection arg (older answers used .ix, which has since been removed from pandas):
In [49]:
df = pd.DataFrame(np.random.randn(5,5), columns=list('abcde'))
df
Out[49]:
a b c d e
0 -0.778207 0.480142 0.537778 -1.889803 -0.851594
1 2.095032 1.121238 1.076626 -0.476918 -0.282883
2 0.974032 0.595543 -0.628023 0.491030 0.171819
3 0.983545 -0.870126 1.100803 0.139678 0.919193
4 -1.854717 -2.151808 1.124028 0.581945 -0.412732
In [50]:
cols = ['a','b','c']
df.loc[:, df.columns.isin(cols)]
Out[50]:
a b c
0 -0.778207 0.480142 0.537778
1 2.095032 1.121238 1.076626
2 0.974032 0.595543 -0.628023
3 0.983545 -0.870126 1.100803
4 -1.854717 -2.151808 1.124028
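Putting the question's whole goal together (sum the selected columns into a new column, then drop them), a sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6], "F": [7, 8]})
columns_selected = ["A", "B", "C"]

# Row-wise sum of the selected columns, then remove those columns
df["sum_1"] = df[columns_selected].sum(axis=1)
df = df.drop(columns=columns_selected)
```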