pandas fillna with fuzzy search on col names

I have a dataframe with many column names containing _paid (e.g. A_paid, B_paid, etc.). I need to fill missing values in any column that has _paid as part of its name. (Note: I am not allowed to replace missing values in columns that do not have _paid in the name.)
I tried to use .fillna(), but I am not sure how to make it do a fuzzy search on column names.

If you want to select any column that has _paid in it:
paid_cols = df.filter(like="_paid").columns
or
paid_cols = df.columns[df.columns.str.contains("_paid", regex=False)]
and then
df[paid_cols] = df[paid_cols].fillna(...)
If you need _paid to be at the end only, use the $ anchor in a regex:
paid_cols = df.filter(regex="_paid$").columns
or
paid_cols = df.columns[df.columns.str.contains("_paid$")]
then apply the same fillna as above.
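For reference, a minimal runnable sketch (the column names and the fill value are made up for illustration):
import pandas as pd
import numpy as np

# toy frame: two _paid columns plus one column whose NaN must survive
df = pd.DataFrame({'A_paid': [1.0, np.nan],
                   'B_paid': [np.nan, 2.0],
                   'C': [np.nan, 3.0]})

paid_cols = df.filter(like='_paid').columns
df[paid_cols] = df[paid_cols].fillna(0)   # 'C' keeps its NaN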

Related

pandas contains with regular expression - special character and full word

I am trying to remove rows that contain any of these characters (##%+*=) and also a full word:
col
ahoi*
word
be
df = df[~df[col].str.contains(r'[##%+*=](word))', regex=True)]
I managed to remove the special characters with .str.contains(r'[##%+*=]'); however, I cannot remove the row with the full word.
What am I missing?
This is the expected result.
col
be
IIUC, you need to use the OR operator (|) instead of parentheses:
df = df[~df["col"].str.contains(r'[##%+*=]|word', regex=True)]
​
Output :
print(df)
col
2 be
You can try:
>>> df[~df['col'].str.contains(r'(?:[##%+*=]|word)', regex=True)]
col
2 be
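For reference, a self-contained sketch rebuilt from the sample data shown above:
import pandas as pd

df = pd.DataFrame({'col': ['ahoi*', 'word', 'be']})
# [##%+*=] matches any single special character; |word adds the full word as an alternative
print(df[~df['col'].str.contains(r'[##%+*=]|word', regex=True)])
#   col
# 2  be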

How to use OR in a regex?

I am trying to select only columns from a dataframe that start with a p or that contain an s. I am using the following:
df2 = (df.filter(regex ='(^p)' or '(s)'))
df2
But that only selects columns that start with a p; it ignores the second part and doesn't select columns that have an s in their name. Does anyone know how I can filter so that both conditions apply, selecting columns that start with p as well as columns that contain an s?
Use the pipe character |, which is equivalent to OR in a regex.
df2 = df.filter(regex='^p|s')
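As an aside on why the original attempt fails: Python evaluates '(^p)' or '(s)' before filter ever sees it, and or returns its first truthy operand, so the second pattern is silently dropped. A quick check:
# a non-empty string is truthy, so `or` short-circuits to the first operand
pattern = '(^p)' or '(s)'
print(pattern)  # (^p)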

Pandas splitting a column with new line separator

I am extracting tables from a PDF using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns?
Suppose the column looks like this.
A\nB
1\n2
2\n3
3\n4
Desired output:
|A|B|
|-|-|
|1|2|
|2|3|
|3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True), and that splits it into two columns; however, I want the new column names to be A and B, not 0 and 1. Also, I need to pass a generalized column label instead of the actual column name, since I need to implement this for several docs that may have different column names. I can determine such a column name in my dataframe using
colNew = df.columns[df.columns.str.contains(pat = '\n')]
However, when I pass colNew to the split function, it throws an AttributeError:
df[colNew].str.split('\n', 2, expand=True)
AttributeError: 'DataFrame' object has no attribute 'str'
You can take advantage of the Pandas split function.
import pandas as pd
# recreate your pandas series from above
df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4']})
# first: make sure the column is of str type
# second: split the column on the separator \n
# third: pass expand=True so the split yields two new columns
test = df['A\nB'].astype('str').str.split('\n', expand=True)
# rename the resulting columns
test.columns = ['A', 'B']
I hope this is helpful.
I reproduced the error on my side... I guess the issue is that df[colNew] is still a DataFrame, because colNew is an Index of labels rather than a single name.
But .str.split() only works on a Series. So, taking your code as an example, I would convert the DataFrame to a Series using iloc[:, 0].
Then another line to fix the column headers:
df2 = df[colNew].iloc[:, 0].str.split('\n', n=2, expand=True)
df2.columns = 'A\nB'.split('\n')
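If several documents each carry such a merged column, a hedged generalization (assuming each merged header splits into as many names as the data has parts) is to derive the new names from the header itself:
import pandas as pd

df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4']})

# handle every header that still contains the newline separator
for merged in df.columns[df.columns.str.contains('\n')]:
    parts = df[merged].str.split('\n', expand=True)
    parts.columns = merged.split('\n')      # new names come from the merged header
    df = df.drop(columns=merged).join(parts)

print(df.columns.tolist())  # ['A', 'B']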

Multiple column selection on a Julia DataFrame

Imagine I have the following DataFrame :
10 rows x 26 columns named A to Z
What I would like to do is to make a multiple subset of the columns by their name (not the index). For instance, assume that I want columns A to D and P to Z in a new DataFrame named df2.
I tried something like this but it doesn't seem to work :
df2=df[:,[:A,:D ; :P,:Z]]
syntax: unexpected semicolon in array expression
top-level scope at Slicing.jl:1
Any idea how to do it?
Thanks for any help.
df2 = select(df, Between(:A,:D), Between(:P,:Z))
or
df2 = df[:, All(Between(:A,:D), Between(:P,:Z))]
If you are sure your columns run only from :A to :Z, you can also write:
df2 = select(df, Not(Between(:E, :O)))
or
df2 = df[:, Not(Between(:E, :O))]
Finally, you can easily find the index of a column using the columnindex function, e.g.:
columnindex(df, :A)
and later use column numbers, if that is what you would prefer.
In Julia you can also build ranges of Chars, so when your columns are named with single letters, yet another option is:
df[:, Symbol.(vcat('A':'D', 'P':'Z'))]

How do you split all columns in a large pandas dataframe?

I have a very large dataframe, and I want to split ALL of the columns except the first two on a comma delimiter. So I need to reference the column names logically, in a loop or some other way, to split all the columns in one swoop.
In my testing of the split method:
I have been able to explicitly refer to (i.e. HARD CODE) a single column name (rs145629793) as one of the required parameters, and the result was 2 new columns, as I wanted.
See the Python code below.
HARDCODED COLUMN NAME --
df[['rs1','rs2']] = df.rs145629793.str.split(",", expand = True)
The problem:
It is not feasible to refer to the actual column names and repeat code.
I then replaced the actual column name rs145629793 with columns[2] in the split method parameter list.
It results in an ERROR:
'str' object has no attribute 'str'
You can index columns by position rather than name using iloc. For example, to get the third column:
df.iloc[:, 2]
Thus you can easily loop over the columns you need.
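A minimal sketch of that loop (the data and the rs* column names below are invented; the df[[...]] assignment is the same pattern that already worked in the hard-coded case):
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'name': ['x', 'y'],
                   'rs1': ['0,1', '1,1'], 'rs2': ['1,0', '0,0']})

for i in range(2, df.shape[1]):             # skip the first two columns
    col = df.columns[i]
    df[[col + '_a', col + '_b']] = df.iloc[:, i].str.split(',', expand=True)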
I know what you are asking, but it's still helpful to provide some input data and the expected output. I have included random input data in my code below, so you can copy, paste, and run this, and then apply it to your dataframe:
import pandas as pd

your_dataframe = pd.DataFrame({'a': ['1,2,3', '9,8,7'],
                               'b': ['4,5,6', '6,5,4'],
                               'c': ['7,8,9', '3,2,1']})

def split_cols(df):
    for col in df.columns.to_list():
        # split the column on the comma; prefix the new columns with the old name
        var = df[col].str.split(',', expand=True).add_prefix(col)
        # merge the new columns back on the index and drop the original column
        df = pd.merge(df, var, how='left', left_index=True, right_index=True).drop(col, axis=1)
    return df

split_cols(your_dataframe)
Essentially, in this solution you build a list of the columns you want to loop through. Then you loop through that list, run split() on each column, and merge everything back together on the index. I also:
included a prefix of the original column name, so the new columns don't collide and are easier to identify
dropped the old column that the split was done on.
Just call the split_cols() function above and pass in your dataframe.
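Since the question asked to leave the first two columns untouched, a hedged tweak is to loop over df.columns[2:] instead of every column (the skip parameter below is an assumption made to match the question):
def split_cols_after(df, skip=2):
    # split only the columns after the first `skip` ones; assumes they hold comma-joined strings
    for col in df.columns[skip:].to_list():
        var = df[col].str.split(',', expand=True).add_prefix(col)
        df = pd.merge(df, var, how='left', left_index=True, right_index=True).drop(col, axis=1)
    return df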