How do I prevent str.contains() from searching for a sub-string? - pandas

I want pandas to search my data frame for the complete string, not a substring. Here is a minimal working example that shows my problem:
import pandas as pd
data = [['tom', 'wells fargo', 'retired'], ['nick', 'bank of america', 'partner'], ['juli', 'chase', 'director - oil well']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Place', 'Position'])
# print dataframe.
df
val = 'well'
df.loc[df.apply(lambda col: col.str.contains(val, case=False)).any(axis="columns")]
This returns the output below, but correct code would have returned only the second of these rows (index 2), not the first (index 0):
Name Place Position
0 tom wells fargo retired
2 juli chase director - oil well
Update: My intention is to have a search that looks for the exact string requested. While looking for "well", the algorithm shouldn't extract "wells". Based on the comments, I understand how my question might have been misleading.

IIUC, you can wrap the search term in word boundaries (\b):
>>> df[df['Position'].str.contains(fr'\b{val}\b')]
Name Place Position
2 juli chase director - oil well
And for all columns:
>>> df[df.apply(lambda x: x.str.contains(fr'\b{val}\b', case=False)).any(axis=1)]
Name Place Position
2 juli chase director - oil well

The regular expression anchor \b, which matches a word boundary, is what you want.
I added additional data to your code to illustrate further:
import pandas as pd
data = [
    ['tom', 'wells fargo', 'retired'],
    ['nick', 'bank of america', 'partner'],
    ['john', 'bank of welly', 'blah'],
    ['jan', 'bank of somewell knwon', 'well that\'s it'],
    ['juli', 'chase', 'director - oil well'],
]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Place', 'Position'])
# print dataframe.
df
val = 'well'
df.loc[df.apply(lambda col: col.str.contains(fr"\b{val}\b", case=False)).any(axis="columns")]
EDIT
In Python 3, prefixing a string literal with f lets you substitute a variable into it, and prefixing it with r makes it a raw string, so the backslashes in \b reach the regex engine unchanged. Combining the two (fr"...") lets you build the pattern from val. Thanks, @smci.
The output looks like this:
Name Place Position
3 jan bank of somewell knwon well that's it
4 juli chase director - oil well
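If val can come from user input, it may contain regex metacharacters such as "." or "(". Here is a minimal sketch of the same word-boundary search with the term escaped first. The helper name whole_word_rows is hypothetical (it is not part of pandas), and the sketch assumes every column holds strings and that the term starts and ends with a word character, since \b only matches next to one.

```python
import re

import pandas as pd

df = pd.DataFrame(
    [
        ["tom", "wells fargo", "retired"],
        ["nick", "bank of america", "partner"],
        ["juli", "chase", "director - oil well"],
    ],
    columns=["Name", "Place", "Position"],
)


def whole_word_rows(frame, term):
    """Hypothetical helper: rows where any column contains `term` as a whole word."""
    # re.escape neutralises regex metacharacters in the search term.
    pattern = rf"\b{re.escape(term)}\b"
    mask = frame.apply(
        lambda col: col.str.contains(pattern, case=False, regex=True, na=False)
    )
    return frame.loc[mask.any(axis="columns")]


result = whole_word_rows(df, "well")
print(result)
```

With the original three-row frame this keeps only the "juli" row, because "wells" has no word boundary after "well".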

Related

Trying to print entire dataframe after str.replace on one column

I can't figure out why this is throwing the error:
KeyError(f"None of [{key}] are in the [{axis_name}]")
here is the code:
def get_company_name(df):
    company_name = [col for col in df if col.lower().startswith('comp')]
    return company_name
df = df[df[get_company_name(master_leads)[0]].str.replace(punc, '', regex=True)]
this is what df.head() looks like:
Company / Account Website
0 Big Moose RV, & Boat Sales, Service, Camper Re... https://bigmooservsales.com/
1 Holifield Pest Management of Hattiesburg NaN
2 Steve Nichols Insurance NaN
3 Sandel Law Firm sandellaw.com
4 Duplicate - Checkered Flag FIAT of Newport News NaN
I have tried putting the [] in every possible place, but I must be missing something. I was under the impression that this is how you run transformations on one column of the dataframe without pulling the series out of the dataframe.
Thanks!
You can get the name of the first company column with:
company_name_col = [col for col in df if col.lower().startswith('comp')][0]
You can preview the cleaned-up company names with:
df[company_name_col].str.replace(punc, "", regex=True)
To apply the replacement, assign the result back to the column:
df[company_name_col] = df[company_name_col].str.replace(punc, "", regex=True)
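The KeyError in the question most likely comes from the outer df[...]: a Series of cleaned strings used as a key is treated as a list of column labels, and none of those strings is a column. The sketch below reproduces that and then assigns the result back instead. The two sample rows and the punc pattern are assumptions for illustration only, since the question does not show how punc is defined.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Company / Account": ["Big Moose RV, & Boat Sales", "Sandel Law Firm"],
        "Website": ["https://bigmooservsales.com/", "sandellaw.com"],
    }
)
punc = r"[^\w\s]"  # assumed pattern: anything that is not a word char or whitespace

company_name_col = [col for col in df if col.lower().startswith("comp")][0]
cleaned = df[company_name_col].str.replace(punc, "", regex=True)

# Indexing with the cleaned strings looks them up as column names and fails.
try:
    df[cleaned]
    raised = False
except KeyError:
    raised = True

# Assigning back keeps the whole dataframe and only changes the one column.
df[company_name_col] = cleaned
print(df)
```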

Pandas Lambda expression to search substring not working for numeric pincode

I have the complete address split across several columns, and the city and pincode in separate columns of the same dataframe. The expression below returns the correct result for the city but not for the pincode, which is a 6-digit number.
ConAddress is the concatenation of all 5 client address columns.
import pandas as pd
import numpy as np
df = pd.read_excel('Rural_Data.xlsx')
df['ConAddress'] = df['CLIENT_ADDRESS_1'].astype(str)+' '+df['CLIENT_ADDRESS_2'].astype(str)+' '+df['CLIENT_ADDRESS_3'].astype(str)+' '+df['CLIENT_ADDRESS_4'].astype(str)+' '+df['CLIENT_ADDRESS_5'].astype(str)
# Fill NaN with '--' so that a blank cell is not reported as a match in the address columns above
df.update(df[['VILLAGENAME','TALUKANAME','DISTRICTNAME','PINCODENEW']].fillna('--'))
df_given_columns =df[['VILLAGENAME','TALUKANAME','DISTRICTNAME','PINCODENEW']]
print(df['PINCODENEW'].dtype)
for gcol in list(df_given_columns.columns.values):
    result_column_name = str(gcol)[:3]
    df[gcol] = df[gcol].astype(str)
    # df[result_column_name] = df.apply(lambda x: x[gcol] in x['ConAddress'], axis=1).astype(int)
    df[result_column_name] = (df.apply(lambda x: str(x[gcol]) in x['ConAddress'], axis=1)).astype(int)
df_result_columns = df[['VIL','TAL','DIS','PIN']]
print(df_result_columns['PIN'].head())
df.to_csv('outputs.csv')
Sample Data
https://drive.google.com/file/d/1lusfgHHX_qmqYuaw0xexDF2hovkcU8py/view?usp=sharing
ConAddress DISTRICTNAME PINCODENEW
AP MOHI MANTAL MANDIST SATARA 415508 MAHA SATARA 415508
AP BHAGAT MALA VADIYERAYBAG SATARA SATARA 415305 SATARA 415305
AT POST ,NHAVI,TAL-INDAPUR PUNE MAHARASHTRA PUNE
AT POST ,NHAVI,TAL-INDAPUR PUNE MAHARASHTRA Delhi
I had a look at your data: the column shows the green marker that Excel displays for a number-format warning.
I had a similar issue when searching for mobile numbers. Add the lines below just before your for loop; I hope that makes it work.
df['PINCODENEW'] = df['PINCODENEW'].astype(int, errors='ignore')
df['PINCODENEW'] = df['PINCODENEW'].astype(str).replace(r'\.0$', '', regex=True)
Convert the value to a string with str:
df['Result'] = (df.apply(lambda x: str(x['PINCODENEW']) in x['ConAddress'], axis=1)
                  .astype(int))
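A small sketch of why the pincode lookup can fail, assuming the column was read as floats (which happens when the column contains blanks). The two sample rows are made up for illustration; the column names come from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ConAddress": ["AP MOHI SATARA 415508 MAHA", "AT POST NHAVI PUNE MAHARASHTRA"],
        "PINCODENEW": [415508.0, None],  # float column because of the missing value
    }
)

# Converting the float directly gives '415508.0', which never occurs in the address.
as_text = df["PINCODENEW"].astype(str)
naive = [pin in addr for pin, addr in zip(as_text, df["ConAddress"])]

# Stripping the trailing '.0' first makes the substring test behave as intended.
clean = as_text.str.replace(r"\.0$", "", regex=True)
fixed = [pin in addr for pin, addr in zip(clean, df["ConAddress"])]

print(naive, fixed)
```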

How to put a range of columns into one column when reading a csv with pandas?

I would like to select specific columns when reading a CSV with pandas, but I would also like to keep the last 5 to 8 columns as a single column, because they all represent "genres" in my case.
I have tried passing usecols=[0,1,2,np.arange(5,8)] to pd.read_csv, but it does not work.
If I use usecols=[0,1,2,5], I get just one genre in the last column and the others (6, 7, 8) are lost.
I have tried the following but without succeeding:
items = pd.read_csv(filename_item,
                    sep='|',
                    engine='python',
                    encoding='latin-1',
                    usecols=[0,1,2,np.arange(5,23)],
                    names=['movie_id', 'title', 'date','genres'])
My CSV looks like:
2|Scream of Stone (Schrei aus Stein)|(1991)|08-Mar-1996|dd|xx|drama|comedia|fun|romantic
And I would like to get:
2 - Scream of Stone (Schrei aus Stein) - (1991) - 08-Mar-1996 - drama|comedia|fun|romantic
where each item separated by " - " should be one column of the dataframe.
Thank you
You may need to do this in two passes. First, read the CSV in as is:
In[56]:
import pandas as pd
import numpy as np
import io
t="""2|Scream of Stone (Schrei aus Stein)|(1991)|08-Mar-1996|dd|xx|drama|comedia|fun|romantic"""
df = pd.read_csv(io.StringIO(t), sep='|', usecols=[0,1,2,3,*np.arange(6,10)], header=None)
df
Out[56]:
0 1 2 3 6 7 \
0 2 Scream of Stone (Schrei aus Stein) (1991) 08-Mar-1996 drama comedia
8 9
0 fun romantic
Then we can join all the genres together using apply:
In[57]:
df['genres'] = df.iloc[:,4:].apply('|'.join,axis=1)
df
Out[57]:
0 1 2 3 6 7 \
0 2 Scream of Stone (Schrei aus Stein) (1991) 08-Mar-1996 drama comedia
8 9 genres
0 fun romantic drama|comedia|fun|romantic
My solution is based on a piece of code proposed at:
How to pre-process data before pandas.read_csv()
The idea is to write a "file wrapper" class, which can be passed
to read_csv.
import re

class InFile(object):
    """File-like wrapper that rewrites each line before read_csv sees it."""
    def __init__(self, infile):
        self.infile = open(infile)
    def __next__(self):
        return self.next()
    def __iter__(self):
        return self
    def read(self, *args, **kwargs):
        return self.__next__()
    def next(self):
        try:
            line = self.infile.readline()
            # Turn only the first 6 pipes into commas; the genres keep theirs.
            return re.sub(r'\|', ',', line, count=6)
        except:
            self.infile.close()
            raise StopIteration
Each source line is reformatted by:
re.sub(r'\|', ',', line, count=6)
which changes the first six | characters into commas, so you can read the file without sep='|'.
To read your CSV file, run:
df = pd.read_csv(InFile('Films.csv'), usecols=[0, 1, 2, 3, 6],
                 names=['movie_id', 'title', 'prod', 'date', 'genres'])
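A different, simpler technique (not the wrapper class above) is to split each line yourself with a bounded number of splits and build the DataFrame from the pieces. This is only a sketch: it uses the single sample line from the question held in an in-memory buffer, and it assumes, as the wrapper does, that the genres always start at the seventh field.

```python
import io

import pandas as pd

raw = io.StringIO(
    "2|Scream of Stone (Schrei aus Stein)|(1991)|08-Mar-1996|dd|xx|drama|comedia|fun|romantic\n"
)

rows = []
for line in raw:
    # maxsplit=6 yields 7 pieces; the last one keeps its internal pipes.
    parts = line.rstrip("\n").split("|", maxsplit=6)
    rows.append(parts[:4] + [parts[6]])

items = pd.DataFrame(rows, columns=["movie_id", "title", "prod", "date", "genres"])
print(items)
```

All values stay strings here, so movie_id would need an explicit astype(int) if a numeric column is wanted.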

Multiply two data frames based on columns, skipping rows that do not fulfill a condition

I have two data frames, the first one has two indexes (country and product) and the value of the variable associated. I have 20 countries and 7 products. Note that I can have two rows with the same country and product in this data frame because each row corresponds to a different observation.
df1
                                         value
Country   Product
Guatemala Hydro                     259.420233
          Oil                         4.211656
          Oil                       341.550360
          Coal, peat and oil shale    4.311316
          Coal, peat and oil shale         NaN
          Hydro                      24.433527
Colombia  Oil                               10
          Coal, peat and oil shale    4.311316
.
.
.
The second data frame is exactly as shown below:
df2
                                    mult
Country   Product
Argentina Natural gas                  1
Colombia  Oil                        161
Mexico    Coal, peat and oil shale     9
          Natural gas                  2
I am trying to multiply the two data frames. The final data frame must have the same rows as the first data frame. When df2 has no value to multiply a row of df1 by (e.g. Guatemala/Oil), the value in df1 must stay unchanged.
I would really appreciate your help. I have tried many options and none of them works.
First, it’s not a great idea to index on columns that will generate duplicates. If you are really that thirsty to violate that best practice, you can still follow my instructions below and then change it back to the original index.
import pandas as pd
import numpy as np
df1 = df1.reset_index(drop=False)
df2 = df2.reset_index(drop=False)
df3 = df1.merge(df2, on=['Country', 'Product'], how='left')
df3['result'] = np.where(df3.mult.isnull(), df3.value, df3.value * df3.mult)
# now, disrespect all that is holy
df3 = df3.set_index(['Country', 'Product'])
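For a self-contained check of the merge approach, here is a sketch on a cut-down version of the two frames from the question. It replaces a missing multiplier with 1 instead of using np.where, which should give the same result under the assumption that df2 has at most one row per Country/Product pair.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"value": [259.420233, 4.211656, 10.0, 4.311316]},
    index=pd.MultiIndex.from_tuples(
        [
            ("Guatemala", "Hydro"),
            ("Guatemala", "Oil"),
            ("Colombia", "Oil"),
            ("Colombia", "Coal, peat and oil shale"),
        ],
        names=["Country", "Product"],
    ),
)
df2 = pd.DataFrame(
    {"mult": [1, 161, 9, 2]},
    index=pd.MultiIndex.from_tuples(
        [
            ("Argentina", "Natural gas"),
            ("Colombia", "Oil"),
            ("Mexico", "Coal, peat and oil shale"),
            ("Mexico", "Natural gas"),
        ],
        names=["Country", "Product"],
    ),
)

# A left merge keeps every row of df1; unmatched rows get NaN in 'mult'.
merged = df1.reset_index().merge(
    df2.reset_index(), on=["Country", "Product"], how="left"
)
# A missing multiplier is treated as 1, so the value stays unchanged.
merged["result"] = merged["value"] * merged["mult"].fillna(1)
merged = merged.set_index(["Country", "Product"])
print(merged)
```

Only the Colombia/Oil row changes (10 becomes 1610); a NaN in value would remain NaN in result.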

Create a pandas DataFrame from multiple dicts [duplicate]

This question already has answers here:
Convert list of dictionaries to a pandas DataFrame
(7 answers)
Closed 4 years ago.
I'm new to pandas and this is my first question on Stack Overflow. I'm trying to do some analytics with pandas.
I have some text files with data records that I want to process. Each line of a file corresponds to one record, whose fields sit at fixed positions and have fixed lengths. Different kinds of records appear in the same file; all of them share the first field, two characters that identify the type of record. As an example:
Some file:
01Jhon Smith 555-1234
03Cow Bos primigenius taurus 00401
01Jannette Jhonson 00100000000
...
field    start  length
type     1      2       (common to all records; example: 01 = person, 03 = animal)
name     3      10
surname  13     10
phone    23     8
credit   31     11
(the remainder is filled with spaces)
I'm writing some code to convert one record to a dictionary:
person1 = {'type': 1, 'name': 'Jhon', 'surname': 'Smith', 'phone': '555-1234'}
person2 = {'type': 1, 'name': 'Jannette', 'surname': 'Jhonson', 'credit': 1000000.00}
animal1 = {'type': 3, 'cname': 'cow', 'sciname': 'Bos....', 'legs': 4, 'tails': 1 }
If a field is empty (filled with spaces), it will not be in the dictionary.
From all records of one kind I want to create a pandas DataFrame with the dict keys as column names. I've tried pandas.DataFrame.from_dict() without success.
And here comes my question: is there any way to do this with pandas so that the dict keys become column names? Is there any other standard method to deal with this kind of file?
To make a DataFrame from dictionaries, you can pass a list of dictionaries:
>>> person1 = {'type': 1, 'name': 'Jhon', 'surname': 'Smith', 'phone': '555-1234'}
>>> person2 = {'type': 1, 'name': 'Jannette', 'surname': 'Jhonson', 'credit': 1000000.00}
>>> animal1 = {'type': 3, 'cname': 'cow', 'sciname': 'Bos....', 'legs': 4, 'tails': 1 }
>>> pd.DataFrame([person1])
type name surname phone
0 1 Jhon Smith 555-1234
>>> pd.DataFrame([person1, person2])
type name surname phone credit
0 1 Jhon Smith 555-1234 NaN
1 1 Jannette Jhonson NaN 1000000.0
>>> pd.DataFrame.from_dict([person1, person2])
type name surname phone credit
0 1 Jhon Smith 555-1234 NaN
1 1 Jannette Jhonson NaN 1000000.0
For the more fundamental issue of differently formatted record types intermixed in one file, and assuming the files aren't so big that we can't read them and store them in memory, I'd use StringIO to make an object which is sort of like a file but which only has the lines we want, and then use read_fwf (fixed-width file). For example:
from io import StringIO

def get_filelike_object(filename, line_prefix):
    # Copy only the lines of the wanted record type into an in-memory buffer.
    s = StringIO()
    with open(filename, "r") as fp:
        for line in fp:
            if line.startswith(line_prefix):
                s.write(line)
    s.seek(0)
    return s
and then
>>> type01 = get_filelike_object("animal.dat", "01")
>>> df = pd.read_fwf(type01, names="type name surname phone credit".split(),
...                  widths=[2, 10, 10, 8, 11], header=None)
>>> df
type name surname phone credit
0 1 Jhon Smith 555-1234 NaN
1 1 Jannette Jhonson NaN 100000000
should work. Of course you could also separate the files into different types before pandas ever sees them, which might be easiest of all.
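The same idea can be tried without a file on disk. The sketch below builds three sample lines in memory, padded to the widths from the question's field table; the exact padding and the layout of the animal line are assumptions. It passes dtype=str so that leading zeros such as "01" survive.

```python
import io

import pandas as pd

WIDTHS = [2, 10, 10, 8, 11]
NAMES = ["type", "name", "surname", "phone", "credit"]


def make_line(*fields):
    """Hypothetical helper: pad each field to its fixed width."""
    return "".join(str(f).ljust(w) for f, w in zip(fields, WIDTHS))


sample = "\n".join(
    [
        make_line("01", "Jhon", "Smith", "555-1234", ""),
        "03Cow       Bos primigenius taurus",  # animal record, different layout
        make_line("01", "Jannette", "Jhonson", "", "00100000000"),
    ]
) + "\n"

# Keep only the person records (prefix "01") before handing them to read_fwf.
person_lines = io.StringIO(
    "".join(line for line in io.StringIO(sample) if line.startswith("01"))
)
people = pd.read_fwf(person_lines, widths=WIDTHS, names=NAMES, header=None, dtype=str)
print(people)
```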