Trying to print entire dataframe after str.replace on one column - pandas

I can't figure out why this is throwing the error:
KeyError(f"None of [{key}] are in the [{axis_name}]")
here is the code:
def get_company_name(df):
    company_name = [col for col in df if col.lower().startswith('comp')]
    return company_name

df = df[df[get_company_name(master_leads)[0]].str.replace(punc, '', regex=True)]
this is what df.head() looks like:
Company / Account Website
0 Big Moose RV, & Boat Sales, Service, Camper Re... https://bigmooservsales.com/
1 Holifield Pest Management of Hattiesburg NaN
2 Steve Nichols Insurance NaN
3 Sandel Law Firm sandellaw.com
4 Duplicate - Checkered Flag FIAT of Newport News NaN
I have tried putting the [] in every place possible, but I must be missing something. I was under the impression that this is how you run transformations on one column of the dataframe without pulling the series out of the dataframe.
Thanks!

You can get the first column name for the company with:
company_name_col = [col for col in df if col.lower().startswith('comp')][0]
You can preview the cleaned-up company names with:
df[company_name_col].str.replace(punc, "", regex=True)
To apply the replacement in place:
df[company_name_col] = df[company_name_col].str.replace(punc, "", regex=True)
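For context, a minimal end-to-end sketch (assuming punc is a punctuation regex, which is not shown in the question): the KeyError comes from passing the cleaned strings back into df[...] as if they were column labels, so assign the result back to the column and then print the whole dataframe.
import pandas as pd

# assumption: `punc` is a regex character class matching punctuation
punc = r"[^\w\s]"

df = pd.DataFrame({
    "Company / Account": ["Big Moose RV, & Boat Sales", "Sandel Law Firm"],
    "Website": ["https://bigmooservsales.com/", "sandellaw.com"],
})

company_name_col = [col for col in df if col.lower().startswith('comp')][0]

# assign the cleaned series back to its column instead of using it as an indexer
df[company_name_col] = df[company_name_col].str.replace(punc, "", regex=True)
print(df)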

Replace string taken from another column in pandas

I am trying to do a replace across one column of a pandas dataframe like the below.
From:
a b
house ho
cheese ee
king ng
To:
a b
use ho
chse ee
ki ng
My attempt is to use:
df['a'] = df['a'].str.replace(df['b'], "")
but I get TypeError: 'Series' objects are mutable, thus they cannot be hashed
I have done it by iterating row by row across the dataframe, but it's 200,000 rows so it would take hours. Does anyone know how I can make this work?
Because performance is important here, it is possible to use a list comprehension with replace to perform the replacement per row:
df['a'] = [a.replace(b, "") for a, b in df[['a','b']].values]
Another, slower solution uses DataFrame.apply:
df['a'] = df.apply(lambda x: x.a.replace(x.b, ""), axis=1)
print (df)
a b
0 use ho
1 chse ee
2 ki ng
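For reference, a self-contained version of the list-comprehension approach, with the example frame from the question reconstructed (a sketch):
import pandas as pd

df = pd.DataFrame({'a': ['house', 'cheese', 'king'],
                   'b': ['ho', 'ee', 'ng']})

# replace the substring from column b within column a, row by row
df['a'] = [a.replace(b, "") for a, b in df[['a', 'b']].values]
print(df)
#       a   b
# 0   use  ho
# 1  chse  ee
# 2    ki  ng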

Find rows in dataframe column containing questions

I have a TSV file that I loaded into a pandas dataframe to do some preprocessing, and I want to find out which rows contain a question and output 1 or 0 in a new column. Since it is a TSV, this is how I'm loading it:
import pandas as pd
df = pd.read_csv('queries-10k-txt-backup', sep='\t')
Here's a sample of what it looks like:
QUERY FREQ
0 hindi movies for adults 595
1 are panda dogs real 383
2 asuedraw winning numbers 478
3 sentry replacement keys 608
4 rebuilding nicad battery packs 541
After dropping empty rows, duplicates, and the FREQ column (not needed for this), I wrote a simple function to check the QUERY column to see if it contains any words that make the string a question:
df_test = df.drop_duplicates()
df_test = df_test.dropna()
df_test = df_test.drop(['FREQ'], axis = 1)
def questions(row):
    questions_list = ["what", "when", "where", "which", "who", "whom", "whose", "why",
                      "why don't", "how", "how far", "how long", "how many", "how much",
                      "how old", "how come", "?"]
    if row['QUERY'] in questions_list:
        return 1
    else:
        return 0

df_test['QUESTIONS'] = df_test.apply(questions, axis=1)
But when I check the new dataframe, even though it creates the new column, all the values are 0. I'm not sure if my logic is wrong in the function; I've used something similar with dataframe columns that contain just one word, and if it matches it outputs a 1 or 0. However, that same logic doesn't seem to work when the column contains a phrase/sentence like in this use case. Any input is really appreciated!
If you want to check whether a string in the dataframe contains any of the substrings from questions_list, you should use the str.contains method:
import re

questions_list = ["what", "when", "where", "which", "who", "whom", "whose", "why",
                  "why don't", "how", "how far", "how long", "how many",
                  "how much", "how old", "how come", "?"]
# escape the entries so literal characters such as "?" are not treated as regex syntax
pattern = "|".join(map(re.escape, questions_list))  # generate regex from your list
df_test['QUESTIONS'] = df_test['QUERY'].str.contains(pattern)
Simplified example:
df = pd.DataFrame({
'QUERY': ['how do you like it', 'what\'s going on?', 'quick brown fox'],
'ID': [0, 1, 2]})
Create a pattern:
pattern = '|'.join(['what', 'how'])
pattern
Out: 'what|how'
Use it:
df['QUERY'].str.contains(pattern)
Out[12]:
0 True
1 True
2 False
Name: QUERY, dtype: bool
If you're not familiar with regexes, there's a quick Python re reference. For the symbol '|', the explanation is:
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way.
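Since the question asks for a 1/0 column rather than booleans, a small follow-up sketch (not in the original answer) casts the mask to integers:
# cast the boolean mask from str.contains to 1/0
df_test['QUESTIONS'] = df_test['QUERY'].str.contains(pattern).astype(int)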
IIUC, you need to check whether the first word of the string is in the questions list; if yes, return 1, else 0. In your function, rather than checking whether the entire string is in the questions list, split the string and check whether the first element is in it.
def questions(row):
    questions_list = ["are", "what", "when", "where", "which", "who", "whom", "whose",
                      "why", "why don't", "how", "how far", "how long", "how many",
                      "how much", "how old", "how come", "?"]
    if row['QUERY'].split()[0] in questions_list:
        return 1
    else:
        return 0

df['QUESTIONS'] = df.apply(questions, axis=1)
You get
QUERY FREQ QUESTIONS
0 hindi movies for adults 595 0
1 are panda dogs real 383 1
2 asuedraw winning numbers 478 0
3 sentry replacement keys 608 0
4 rebuilding nicad battery packs 541 0
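For a larger frame, the same first-word check can also be done without apply; a hedged, vectorized sketch (questions_list shortened here for brevity):
import pandas as pd

df = pd.DataFrame({'QUERY': ['hindi movies for adults', 'are panda dogs real',
                             'sentry replacement keys'],
                   'FREQ': [595, 383, 608]})

questions_list = ["are", "what", "when", "where", "which", "who", "whom", "whose",
                  "why", "how"]

# take the first word of each query, test membership in the list, then cast to 1/0
df['QUESTIONS'] = df['QUERY'].str.split().str[0].isin(questions_list).astype(int)
print(df)
#                      QUERY  FREQ  QUESTIONS
# 0  hindi movies for adults   595          0
# 1      are panda dogs real   383          1
# 2  sentry replacement keys   608          0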

Sum column's values from duplicate rows python3

I have an old.csv like this:
Name,State,Brand,Model,Price
Adam,MO,Toyota,RV4,26500
Berry,KS,Toyota,Camry,18000
Berry,KS,Toyota,Camry,12000
Kavin,CA,Ford,F150,23000
Yuke,OR,Nissan,Murano,31000
and I need a new.csv like this:
Name,State,Brand,Model,Price
Adam,MO,Toyota,RV4,26500
Berry,KS,Toyota,Camry,30000
Kavin,CA,Ford,F150,23000
Yuke,OR,Nissan,Murano,31000
As you can see, the difference between the two is:
Berry,KS,Toyota,Camry,18000
Berry,KS,Toyota,Camry,12000
merge to
Berry,KS,Toyota,Camry,30000
Here is my code:
import pandas as pd
df=pd.read_csv('old.csv')
df1 = df.sort_values('Name').groupby('Name','State','Brand','Model') \
        .agg({'Name':'first','Price':'sum'})
print(df1[['Name','State','Brand','Model','Price']])
and it didn't work; I got this error:
File "------\venv\lib\site-packages\pandas\core\frame.py", line 4421, in sort_values stacklevel=stacklevel)
File "------- \venv\lib\site-packages\pandas\core\generic.py", line 1382, in _get_label_or_level_values raise KeyError(key)
KeyError: 'Name'
I am totally new to Python, and I found a solution on Stack Overflow:
Sum values from Duplicated rows
The question above is similar to mine, but it's SQL, not Python.
Any help would be greatly appreciated.
import pandas as pd
df = pd.read_csv('old.csv')
Group by the 4 fields ('Name', 'State', 'Brand', 'Model'), select the Price column, and apply the aggregate sum to it:
df1 = df.groupby(['Name', 'State', 'Brand', 'Model'])['Price'].agg(['sum'])
print(df1)
This will give you the required output:
sum
Name State Brand Model
Adam MO Toyota RV4 26500
Berry KS Toyota Camry 30000
Kavin CA Ford F150 23000
Yuke OR Nissan Murano 31000
Note: there is only one column, sum, in df1. The other 4 columns are index levels, so to write it to a csv we first need to convert these index levels back into dataframe columns.
list(df1['sum'].index.get_level_values('Name')) will give you an output like this,
['Adam', 'Berry', 'Kavin', 'Yuke']
Now, for all index levels, do this:
df2 = pd.DataFrame()
cols = ['Name', 'State', 'Brand', 'Model']
for col in cols:
    df2[col] = list(df1['sum'].index.get_level_values(col))
df2['Price'] = df1['sum'].values
Now, just write df2 to a csv file like this:
df2.to_csv('new.csv', index = False)
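For reference, a more compact equivalent (a sketch, not part of the original answer) keeps the group keys as ordinary columns by passing as_index=False, so no index conversion is needed afterwards:
import pandas as pd

df = pd.read_csv('old.csv')

# group on the four key columns, sum Price, and keep the keys as regular columns
df2 = df.groupby(['Name', 'State', 'Brand', 'Model'], as_index=False)['Price'].sum()
df2.to_csv('new.csv', index=False)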

Pandas Merge function only giving column headers - Update

What I want to achieve.
I have two dataframes, DF1 and DF2, each read from a different Excel file.
DF1 has 9 columns and 3000 rows, one of which is named "Code Group".
DF2 has 2 columns and 20 rows, one of which is also named "Code Group". In the same dataframe another column, "Code Management Method", gives the explanation of the code group. For example, H001 is treated as recyclable, H002 is treated as landfill.
What happens
When I use the command data = pd.merge(DF1,DF2, on='Code Group') I only get 10 column names but no rows underneath.
What I expect
I want DF1 and DF2 to be merged so that wherever the Code Group value matches, the Code Management Method is added as the explanation.
Additional information
Following are the datatypes for DF1:
Entity object
Address object
State object
Site object
Disposal Facility object
Pounds float64
Waste Description object
Shipment Date datetime64[ns]
Code Group object
Following are the datatypes for DF2:
Code Management Method object
Code Group object
What I tried
I tried to follow the suggestion from similar posts on SO that the datatypes on both sides should be the same, and Code Group is an object in both, so I don't know what the issue is. I also tried the concat function.
Code
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
CH = "C:\Python\Waste\Shipment.xls"
Code = "C:\Python\Waste\Code.xlsx"
Data = pd.read_excel(Code)
data1 = pd.read_excel(CH)
data1.rename(columns={'generator_name':'Entity','generator_address':'Address', 'Site_City':'Site','final_disposal_facility_name':'Disposal Facility', 'wst_dscrpn':'Waste Description', 'drum_wgt':'Pounds', 'wst_dscrpn' : 'Waste Description', 'genrtr_sgntr_dt':'Shipment Date','generator_state': 'State','expected_disposal_management_methodcode':'Code Group'},
inplace=True)
data2 = data1[['Entity','Address','State','Site','Disposal Facility','Pounds','Waste Description','Shipment Date','Code Group']]
data2
merged = data2.merge(Data, on='Code Group')
Getting a Warning
C:\Anaconda\lib\site-packages\pandas\core\generic.py:5890: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._update_inplace(new_data)
import pandas as pd
df1 = pd.DataFrame({'Region': [1,2,3],
'zipcode':[12345,23456,34567]})
df2 = pd.DataFrame({'ZipCodeLowerBound': [10000,20000,30000],
'ZipCodeUpperBound': [19999,29999,39999],
'Region': [1,2,3]})
df1.merge(df2, on='Region')
This is how the example is set up; the two frames look like this:
Region zipcode
0 1 12345
1 2 23456
2 3 34567
Region ZipCodeLowerBound ZipCodeUpperBound
0 1 10000 19999
1 2 20000 29999
2 3 30000 39999
and the merge results in:
Region zipcode ZipCodeLowerBound ZipCodeUpperBound
0 1 12345 10000 19999
1 2 23456 20000 29999
2 3 34567 30000 39999
I hope this is what you want to do
After multiple tries I found that the column had some garbage in it, so I used the code below and it worked perfectly. The funny thing is that I never encountered this problem with two other datasets that I imported from Excel files.
data2['Code'] = data2['Code'].str.strip()
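For completeness, a hedged sketch of that fix applied to the join key on both frames, continuing from the question's data2 and Data (the key column name is assumed to be 'Code Group', as in the question):
# take an explicit copy to avoid the SettingWithCopyWarning mentioned above,
# then strip stray whitespace from the join key on both sides before merging
data2 = data2.copy()
data2['Code Group'] = data2['Code Group'].astype(str).str.strip()
Data['Code Group'] = Data['Code Group'].astype(str).str.strip()

merged = data2.merge(Data, on='Code Group')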

How to apply a function on a multi-index pandas dataframe elegantly, like on a panel?

Suppose I have a dataframe like:
ticker MS AAPL
field price volume price volume
0 -0.861210 -0.319607 -0.855145 0.635594
1 -1.986693 -0.526885 -1.765813 1.696533
2 -0.154544 -1.152361 -1.391477 -2.016119
3 0.621641 -0.109499 0.143788 -0.050672
generated by the following code (please ignore the numbers, which are just examples):
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_tuples([('MS', 'price'), ('MS', 'volume'), ('AAPL', 'price'), ('AAPL', 'volume')], names=['ticker', 'field'])
data = np.random.randn(4, 4)
df = pd.DataFrame(data, columns=columns)
Now, I would like to calculate pct_change() or any user-defined function on each price column, and add a new column on the 'field' level to store the result.
I know how to do this elegantly if the data is a Panel, which has been deprecated since version 0.20. Suppose the panel's 3 axes are date, ticker and field:
p[:,:, 'ret'] = p[:,:,'price'].pct_change()
That's all. But I have not found a similarly elegant way to do it with a multi-index dataframe.
You can use pd.IndexSlice:
df.loc[:,pd.IndexSlice[:,'price']].apply(pd.Series.pct_change).rename(columns={'price':'ret'})
Out[1181]:
ticker MS AAPL
field ret ret
0 NaN NaN
1 -1.420166 -0.279805
2 3.011155 0.062529
3 -1.609004 0.759954
def cstm(s):
    return s.pct_change()

new = pd.concat(
    [df.xs('price', 1, 1).apply(cstm)],
    axis=1, keys=['new']
).swaplevel(0, 1, 1)

df.join(new).sort_index(1)
ticker AAPL MS
field new price volume new price volume
0 NaN -0.855145 0.635594 NaN -0.861210 -0.319607
1 1.064928 -1.765813 1.696533 1.306863 -1.986693 -0.526885
2 -0.211991 -1.391477 -2.016119 -0.922211 -0.154544 -1.152361
3 -1.103335 0.143788 -0.050672 -5.022430 0.621641 -0.109499
Or
def cstm(s):
    return s.pct_change()
df.stack(0).assign(
new=lambda d: d.groupby('ticker').price.apply(cstm)
).unstack().swaplevel(0, 1, 1).sort_index(1)