Creating a column based on pattern matching in pandas

I have a data frame containing two columns, 'Name' and 'Task'. I want to create a third column called 'task_category' based on matching conditions from a list. Please note the data below is only an example; I actually have hundreds of patterns to look for instead of the three shown here.
import pandas as pd

df = pd.DataFrame(
    {'Name': ["a", "b", "c"],
     'Task': ['went to trip', 'Mall Visit', 'Cinema']})
task_category = ['trip', 'Mall', 'Cinema']
Expected output:
  Name          Task task_category
0    a  went to trip          trip
1    b    Mall Visit          Mall
2    c        Cinema        Cinema

Use Series.str.extract():
pat = r'({})'.format('|'.join(task_category))
# '(trip|Mall|Cinema)'
df['task_category'] = df.Task.str.extract(pat)
print(df)
  Name          Task task_category
0    a  went to trip          trip
1    b    Mall Visit          Mall
2    c        Cinema        Cinema
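If your real category list (hundreds of patterns) might contain regex metacharacters, it is safer to escape each pattern first; a sketch of the same approach, assuming the categories are plain substrings:

```python
import re
import pandas as pd

df = pd.DataFrame({'Name': ['a', 'b', 'c'],
                   'Task': ['went to trip', 'Mall Visit', 'Cinema']})
task_category = ['trip', 'Mall', 'Cinema']

# re.escape guards against categories containing regex metacharacters;
# expand=False makes extract return a Series rather than a DataFrame.
pat = '({})'.format('|'.join(map(re.escape, task_category)))
df['task_category'] = df['Task'].str.extract(pat, expand=False)
```

Rows with no match come back as NaN, which is a convenient way to spot uncategorized tasks.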

I am using findall, since it will help you find all the matching keywords in the same line:
df.Task.str.findall('|'.join(task_category)).str[0]
Out[1008]:
0 trip
1 Mall
2 Cinema
Name: Task, dtype: object
Sample
df = pd.DataFrame(
{'Name': ["a","b","c"],
'Task': ['went to trip Cinema','Mall Visit','Cinema']})
df.Task.str.findall('|'.join(task_category))
Out[1012]:
0 [trip, Cinema]
1 [Mall]
2 [Cinema]
Name: Task, dtype: object
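If you want to keep every match rather than only the first, you could join each list back into a string (a sketch, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['a', 'b', 'c'],
                   'Task': ['went to trip Cinema', 'Mall Visit', 'Cinema']})
task_category = ['trip', 'Mall', 'Cinema']

# findall returns a list of matches per row; str.join collapses
# each list back into a single comma-separated string.
df['task_category'] = df['Task'].str.findall('|'.join(task_category)).str.join(', ')
```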

Related

Is there anything in Pandas similar to dplyr's 'list columns'

I'm currently transitioning from R to Python for my data analysis, and there's one thing I haven't seen in any tutorials out there: is there anything in pandas similar to dplyr's 'list columns'?
Link to reference:
https://www.rstudio.com/resources/webinars/how-to-work-with-list-columns/
pandas will accept any object type, including lists, in an object-dtype column.
df = pd.DataFrame()
df['genre']=['drama, comedy, action', 'romance, sci-fi, drama','horror']
df.genre = df.genre.str.split(', ')
print(df, '\n', df.genre.dtype, '\n', type(df.genre[0]))
# Output:
genre
0 [drama, comedy, action]
1 [romance, sci-fi, drama]
2 [horror]
object
<class 'list'>
We can see that:
genre is a column of lists.
The dtype of the genre column is object
The type of the first value of genre is list.
There are a number of str functions that work with lists.
For example:
print(df.genre.str.join(' | '))
# Output:
0 drama | comedy | action
1 romance | sci-fi | drama
2 horror
Name: genre, dtype: object
print(df.genre.str[::2])
# Output:
0 [drama, action]
1 [romance, drama]
2 [horror]
Name: genre, dtype: object
Others can typically be done with an apply function if there isn't a built-in method:
print(df.genre.apply(lambda x: max(x)))
# Output:
0 drama
1 sci-fi
2 horror
Name: genre, dtype: object
See the pandas documentation on str functions for more.
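As a side note, if you're on pandas 0.25 or later, Series.explode() can unnest a list column into one row per element, which is often the easiest way to work with list columns; a minimal sketch:

```python
import pandas as pd

genre = pd.Series([['drama', 'comedy', 'action'],
                   ['romance', 'sci-fi', 'drama'],
                   ['horror']])

# explode() turns each list element into its own row,
# repeating the original index label for each element.
flat = genre.explode()
```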
As for nesting dataframes within one another, it is possible, but I believe it's considered an anti-pattern, and pandas will fight you the whole way there:
data = {'df1': df, 'df2': df}
df2 = pd.Series(list(data.values()), index=list(data.keys())).to_frame()
df2.columns = ['dfs']
print(df2)
# Output:
dfs
df1 genre
0 [drama, comedy...
df2 genre
0 [drama, comedy...
print(df2['dfs'][0])
# Output:
genre
0 [drama, comedy, action]
1 [romance, sci-fi, drama]
2 [horror]
A possibly acceptable workaround would be storing them as numpy arrays:
import numpy as np

df2 = df2.applymap(np.array)
print(df2)
print(df2['dfs'][0])
# Output:
dfs
df1 [[[drama, comedy, action]], [[romance, sci-fi,...
df2 [[[drama, comedy, action]], [[romance, sci-fi,...
array([[list(['drama', 'comedy', 'action'])],
[list(['romance', 'sci-fi', 'drama'])],
[list(['horror'])]], dtype=object)

Trying to print entire dataframe after str.replace on one column

I can't figure out why this is throwing the error:
KeyError(f"None of [{key}] are in the [{axis_name}]")
Here is the code:
def get_company_name(df):
    company_name = [col for col in df if col.lower().startswith('comp')]
    return company_name

df = df[df[get_company_name(master_leads)[0]].str.replace(punc, '', regex=True)]
this is what df.head() looks like:
Company / Account Website
0 Big Moose RV, & Boat Sales, Service, Camper Re... https://bigmooservsales.com/
1 Holifield Pest Management of Hattiesburg NaN
2 Steve Nichols Insurance NaN
3 Sandel Law Firm sandellaw.com
4 Duplicate - Checkered Flag FIAT of Newport News NaN
I have tried putting the [] in every place possible, but I must be missing something. I was under the impression that this is how you run transformations on one column of a dataframe without pulling the series out of it.
Thanks!
The KeyError comes from the outer df[...]: it tries to use the cleaned-up strings as column labels instead of assigning them back to the column.
You can get the first company column name with:
company_name_col = [col for col in df if col.lower().startswith('comp')][0]
You can see the cleaned-up company names with:
df[company_name_col].str.replace(punc, "", regex=True)
And to apply the replacement:
df[company_name_col] = df[company_name_col].str.replace(punc, "", regex=True)
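Since punc is never defined in the question, here is a self-contained sketch of the whole flow, assuming punc is a regex character class matching punctuation (the pattern and the sample row are hypothetical):

```python
import pandas as pd

# Hypothetical pattern: anything that isn't a word character or whitespace.
punc = r'[^\w\s]'

df = pd.DataFrame({'Company / Account': ['Big Moose RV, & Boat Sales!'],
                   'Website': ['https://bigmooservsales.com/']})

# Find the first column whose name starts with 'comp' (case-insensitive),
# then strip punctuation from its values in place.
company_name_col = [col for col in df if col.lower().startswith('comp')][0]
df[company_name_col] = df[company_name_col].str.replace(punc, '', regex=True)
```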

How to deal with the pandas "id variables need to uniquely identify each row" error?

I have a dataframe like:
df = pd.DataFrame({
    'name1': ['A', 'A', 'C', 'B', 'C', 'A', 'D'],
    'name2': ['D', 'B', 'A', 'D', 'B', 'C', 'A'],
    'text': ['cars', 'cars', 'flower', 'tea', 'ball', 'phone', 'ice'],
    'time': ['10/01', '10/01', '10/01', '10/01', '10/02', '10/02', '10/02'],
    'Flag1': [1, 1, 2, 0, 2, 1, 0],
    'Flag2': [0, 0, 1, 0, 0, 2, 1]})
Expected:
pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'A', 'B', 'C'],
    'text': ['cars,flower', 'cars,tea', 'flower', 'cars,tea', 'phone,ice', 'ball', 'phone'],
    'time': ['10/01', '10/01', '10/01', '10/01', '10/02', '10/02', '10/02'],
    'Flag': [1, 0, 2, 0, 1, 0, 2]})
I want to combine information according to 'time':
'name': 'name1' and 'name2' are merged into 'name';
'text': on each day, a user's words are merged once they show up in that user's rows;
'time': the date the user shows up;
'Flag': 'Flag1' and 'Flag2' are merged into 'Flag'. Each user has a unique 'Flag' ('0', '1', '2') no matter what the date is.
But when I do:
pd.wide_to_long(df, stubnames=["name", "Flag"], i=["text", "time"], j="ref",
                suffix=r"\d*").reset_index().groupby(
    ["name", "time"], as_index=False).agg(
    {"text": ",".join, "Flag": "first"}).sort_values(["time", "name"])
I get:
id variables need to uniquely identify each row
How do I deal with that?
Let me know if this works for you. Try:
Reshape, pulling the index in to serve as the unique identifier i:
m = pd.wide_to_long(df.reset_index(), stubnames=["name", "Flag"], i="index", j="num")
Munge to get the desired output, using groupby and some text manipulation:
(
m.groupby(["name", "time"])
.agg(Flag=("Flag", "first"), text=("text", lambda x: ",".join(set(x))))
.reset_index()
.sort_values("time")
)
name time Flag text
0 A 10/01 1 flower,cars
2 B 10/01 0 tea,cars
4 C 10/01 2 flower
6 D 10/01 0 tea,cars
1 A 10/02 1 phone,ice
3 B 10/02 0 ball
5 C 10/02 2 phone,ball
7 D 10/02 0 ice
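If wide_to_long keeps fighting you, an alternative that needs no unique identifier at all is to stack the (name1, Flag1) and (name2, Flag2) pairs by hand with concat; a sketch of that approach (sorting the merged words is an assumption, since the question doesn't fix their order):

```python
import pandas as pd

df = pd.DataFrame({
    'name1': ['A', 'A', 'C', 'B', 'C', 'A', 'D'],
    'name2': ['D', 'B', 'A', 'D', 'B', 'C', 'A'],
    'text': ['cars', 'cars', 'flower', 'tea', 'ball', 'phone', 'ice'],
    'time': ['10/01', '10/01', '10/01', '10/01', '10/02', '10/02', '10/02'],
    'Flag1': [1, 1, 2, 0, 2, 1, 0],
    'Flag2': [0, 0, 1, 0, 0, 2, 1]})

# Stack each (nameN, FlagN) pair into long form, keeping text and time.
m = pd.concat([
    df[['name1', 'Flag1', 'text', 'time']].rename(columns={'name1': 'name', 'Flag1': 'Flag'}),
    df[['name2', 'Flag2', 'text', 'time']].rename(columns={'name2': 'name', 'Flag2': 'Flag'}),
], ignore_index=True)

# One row per user per day: first Flag, deduplicated merged words.
out = (m.groupby(['name', 'time'], as_index=False)
        .agg(Flag=('Flag', 'first'), text=('text', lambda x: ','.join(sorted(set(x)))))
        .sort_values(['time', 'name']))
```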

How can I combine same-named columns into one in a pandas dataframe so all the columns are unique?

I have a dataframe that looks like this:
In [268]: dft.head()
Out[268]:
ticker BYND UBER UBER UBER ... ZM ZM BYND ZM
0 analyst worlds uber revenue ... company owning pet things
1 moskow apac note uber ... try things humanization users
2 growth anheuserbusch growth target ... postipo unicorn products revenue
3 stock kong analysts raised ... software revenue things million
4 target uberbeating stock rising ... earnings million pets direct
[5 rows x 500 columns]
In [269]: dft.columns.unique()
Out[269]: Index(['BYND', 'UBER', 'LYFT', 'SPY', 'WORK', 'CRWD', 'ZM'], dtype='object', name='ticker')
How do I combine the columns so there is only a single unique column name for each ticker?
Maybe you should try making a copy of the column you wish to join, then extend the first column with the copy you have.
Code:
First convert all the column names to one case, either lower or upper, so that there is no mismatch in header case.
def merge_(df):
    '''Return the data-frame with same-named (ignoring case) columns merged'''
    # Get the set of unique column names in lowercase
    columns = set(map(str.lower, df.columns))
    # Start from empty strings so that string concatenation works
    df1 = pd.DataFrame('', index=df.index, columns=list(columns))
    # Merging the matching columns
    for col in df.columns:
        df1[col.lower()] += df[col]  # words are in str format so '+' will concatenate
    return df1
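Note that this function assumes the names differ only by case; when column names are literally identical, as in the question's dft, df[col] returns a DataFrame and the loop breaks. One transpose-based sketch that handles true duplicates (joining each group's strings with a space is an assumption):

```python
import pandas as pd

# Toy frame with genuinely duplicated column names, as in the question.
df = pd.DataFrame([['analyst', 'worlds', 'uber'],
                   ['moskow', 'apac', 'note']],
                  columns=['BYND', 'UBER', 'UBER'])

# Transpose so the duplicated labels become the row index, group them,
# join each group's strings per original row, then transpose back.
merged = df.T.groupby(level=0).agg(' '.join).T
```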

Pandas Merge function only giving column headers - Update

What I want to achieve:
I have two data frames, DF1 and DF2, each read from a different Excel file.
DF1 has 9 columns and 3000 rows, one of which is named "Code Group".
DF2 has 2 columns and 20 rows, one of which is also named "Code Group". In this same dataframe another column, "Code Management Method", gives the explanation of the code group. For example, H001 is treated as recyclable, H002 is treated as landfill.
What happens:
When I use the command data = pd.merge(DF1, DF2, on='Code Group') I only get 10 column names but no rows underneath.
What I expect:
I want DF1 and DF2 to be merged so that wherever the Code Group number is the same, the Code Management Method is pasted in as the explanation.
Additional information:
Following are the datatypes for DF1:
Entity object
Address object
State object
Site object
Disposal Facility object
Pounds float64
Waste Description object
Shipment Date datetime64[ns]
Code Group object
Following are the datatypes for DF2:
Code Management Method object
Code Group object
What I tried:
I tried to follow the suggestion from similar posts on SO that the datatypes on both sides should be the same, and Code Group is an object on both sides, so I don't know what the issue is. I also tried the concat function.
Code:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
CH = "C:\Python\Waste\Shipment.xls"
Code = "C:\Python\Waste\Code.xlsx"
Data = pd.read_excel(Code)
data1 = pd.read_excel(CH)
data1.rename(columns={'generator_name':'Entity','generator_address':'Address', 'Site_City':'Site','final_disposal_facility_name':'Disposal Facility', 'wst_dscrpn':'Waste Description', 'drum_wgt':'Pounds', 'wst_dscrpn' : 'Waste Description', 'genrtr_sgntr_dt':'Shipment Date','generator_state': 'State','expected_disposal_management_methodcode':'Code Group'},
inplace=True)
data2 = data1[['Entity','Address','State','Site','Disposal Facility','Pounds','Waste Description','Shipment Date','Code Group']]
data2
merged = data2.merge(Data, on='Code Group')
Getting a Warning
C:\Anaconda\lib\site-packages\pandas\core\generic.py:5890: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._update_inplace(new_data)
import pandas as pd
df1 = pd.DataFrame({'Region': [1,2,3],
'zipcode':[12345,23456,34567]})
df2 = pd.DataFrame({'ZipCodeLowerBound': [10000,20000,30000],
'ZipCodeUpperBound': [19999,29999,39999],
'Region': [1,2,3]})
df1.merge(df2, on='Region')
This is how the example is given, and the result for it is:
Region zipcode
0 1 12345
1 2 23456
2 3 34567
Region ZipCodeLowerBound ZipCodeUpperBound
0 1 10000 19999
1 2 20000 29999
2 3 30000 39999
and the merge will result in:
Region zipcode ZipCodeLowerBound ZipCodeUpperBound
0 1 12345 10000 19999
1 2 23456 20000 29999
2 3 34567 30000 39999
I hope this is what you want to do
After multiple tries I found that the column had some garbage in it, so I used the code below and it worked perfectly. The funny thing is I never encountered the problem on two other datasets that I imported from Excel files.
data2['Code'] = data2['Code'].str.strip()
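To see why stray whitespace makes merge return only headers, here is a hypothetical reproduction (toy values, not the question's real data):

```python
import pandas as pd

# A trailing space on one key makes it fail to match its counterpart.
df1 = pd.DataFrame({'Code Group': ['H001 ', 'H002'],
                    'Pounds': [10.0, 20.0]})
df2 = pd.DataFrame({'Code Group': ['H001', 'H002'],
                    'Code Management Method': ['Recyclable', 'Landfill']})

bad = df1.merge(df2, on='Code Group')   # 'H001 ' != 'H001', so that row drops out

# Stripping the whitespace restores the match.
df1['Code Group'] = df1['Code Group'].str.strip()
good = df1.merge(df2, on='Code Group')
```

When every key has this kind of garbage, the inner merge drops every row, which is exactly the "column headers but no rows" symptom described above.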