How to ignore terms in replace if they do not exist in the pandas dataframe? - pandas

I have the following code to replace one term with another, but it only works if the value exists in the pandas DataFrame. I assume I need to wrap gdf[montype] = gdf[montype].replace(dict(montype), regex=True) in an if statement? How would I do this, or is there a better way?
montype = [
    ['HIS_COP_', ''],
    ['_Ply', ''],
    ['_Pt', ''],
    ['BURIAL', 'burial'],
    ['CUT', 'CUT'],
    ['MODERN', 'MODERN'],
    ['NATURAL', 'NATURAL'],
    ['STRUCTURE', 'STRUCTURE'],
    ['SURFACE', 'SURFACE'],
    ['TREETHROW', 'natural feature'],
    ['FURROW', 'FURROW'],
    ['FIELD_DRAIN', 'FIELD_DRAIN'],
    ['DEPOSIT_FILL', 'DEPOSIT_FILL'],
    ['POSTHOLE', ''],
    ['TIMBER', ''],
    ['', '']
]
gdf[montype] = gdf[montype].replace(dict(montype), regex=True)
When the term does not exist I get the error: raise KeyError(f"None of [{key}] are in the [{axis_name}]")
Edit:
mtype = {
    'HIS_COP_': '',
    '_Ply': '',
    '_Pt': '',
    'BURIAL': 'burial',
    'CUT': 'CUT',
    'MODERN': 'MODERN',
    'NATURAL': 'NATURAL',
    'STRUCTURE': 'STRUCTURE',
    'SURFACE': 'SURFACE',
    'TREETHROW': 'natural feature',
    'FURROW': 'FURROW',
    'FIELD_DRAIN': 'FIELD_DRAIN',
    'DEPOSIT_FILL': 'DEPOSIT_FILL',
    'POSTHOLE': '',
    'TIMBER': ''
}  # dict(montype)
gdf['montype'] = gdf['montype'].map(mtype).fillna(gdf['montype'])

You can try this:
# Convert your list to a dict
Montype = {'His_cop': '', 'Modern': 'Modern', etc...}  # dict(montype)
gdf[montype] = gdf[montype].map(Montype).fillna('whatever value you want')
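A minimal self-contained sketch of the original replace approach (the column name 'montype' and the sample values below are assumptions). The KeyError above likely comes from indexing with the list itself, gdf[montype], instead of the column name; Series.replace with a dict and regex=True simply leaves non-matching values untouched, so no if statement is needed:

```python
import pandas as pd

# Hypothetical data; the real GeoDataFrame column is assumed to be named 'montype'
gdf = pd.DataFrame({'montype': ['HIS_COP_BURIAL_Ply', 'TREETHROW_Pt', 'DITCH']})

mtype = {'HIS_COP_': '', '_Ply': '', '_Pt': '',
         'BURIAL': 'burial', 'TREETHROW': 'natural feature'}

# replace() with regex=True does substring substitution and leaves
# values with no matching pattern unchanged -- 'DITCH' survives as-is.
gdf['montype'] = gdf['montype'].replace(mtype, regex=True)
```

Note the quoted 'montype' (a column label), not the bare list variable, when selecting the column.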

Related

I have several problems with Logistic Regression using pandas

Can't create X_train and X_test DataFrames (from 2 different csv files) and also can't use them as integer
data = pd.read_csv('action_train.csv', delimiter=';', header=0)
data = (data.replace(to_replace='[act1_]', value='', regex=True)
            .replace(to_replace='[act2_]', value='', regex=True)
            .replace(to_replace='[type ]', value='', regex=True))
print(data.shape)
print(list(data.columns))
data1 = pd.read_csv('action_test.csv', delimiter=';', header=0)
data1 = (data1.replace(to_replace='[act1_]', value='', regex=True)
              .replace(to_replace='[act2_]', value='', regex=True)
              .replace(to_replace='[type ]', value='', regex=True))
print(data1.shape)
print(list(data1.columns))
X_train=data['action_id', 'char_1', 'char_2', 'char_3', 'char_4', 'char_5', 'char_6', 'char_7', 'char_8', 'char_9', 'char_10']
print(X_train)
y_train=data['result']
X_test=data1['action_id', 'char_1', 'char_2', 'char_3', 'char_4', 'char_5', 'char_6', 'char_7', 'char_8', 'char_9', 'char_10']
print(X_test)
y_test=data1['result']
I tried to use them in different ways but got a tuple instead of an array. I also can't convert the object type to integer.
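A hedged sketch of the likely fix (toy data standing in for the real CSVs; column names taken from the snippet above). Writing data['action_id', 'char_1', ...] looks up a single tuple key, which is why a tuple appears instead of a column selection; a list inside the brackets selects multiple columns:

```python
import pandas as pd

# Toy stand-in for the real action_train.csv
data = pd.DataFrame({'action_id': [1, 2], 'char_1': [3, 4], 'result': [0, 1]})

# data['action_id', 'char_1'] would look up the single tuple key
# ('action_id', 'char_1') -- hence "got tuple instead of array".
X_train = data[['action_id', 'char_1']]  # double brackets: list of columns
y_train = data['result']
```

Once selected this way, a cleaned object column can usually be cast with .astype(int).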

youtube_dl video descriptions

I have a df containing a set of videoIDs from YT:
import pandas as pd
data = {'Order': ['1', '2', '3'],
        'VideoID': ['jxwHmAoKte4', 'LsXM502SpiU', '1I3f27iQ4pM']}
df = pd.DataFrame(data, columns=['Order', 'VideoID'])
print(df)
and want to download the video descriptions and save them in the same df in an extra column.
I tried to use youtube_dl in Jupyter this way:
import youtube_dl

def all_descriptions(URL):
    videoID = df['VideoId']
    URL = 'https://www.youtube.com/watch?v=' + videoID
    ydl_opts = {
        'forcedescription': True,
        'skip_download': True,
        'youtube-skip-dash-manifest': True,
        'no_warnings': True,
        'ignoreerrors': True
    }
    try:
        youtube_dl.YoutubeDL(ydl_opts).download(URL)
        return webpage
    except:
        pass

df['descriptions'] = all_descriptions(URL)
I see the output of the code as text, but the df column contains only "None".
Apparently I'm not passing the output of the function to the df in the proper way.
Can you suggest how to get it right?
Thank you in advance for help.
@perl
I modify the df to include two URLs that are causing two types of error:
import pandas as pd

data = {'Order': ['1', '2', '3', '4', '5'],
        'VideoId': ['jxwHmAoKte4', 'LsXM502SpiU', '1I3f27iQ4pM', 'MGQOX2rK5s', 'wNayw_E7lIA']}
df = pd.DataFrame(data, columns=['Order', 'VideoId'])
print(df)
Then I test it in the way you suggested, including my definition of ydl_opts:
videoID = df['VideoId']
URL = 'https://www.youtube.com/watch?v=' + videoID
ydl_opts = {
    'forcedescription': True,
    'skip_download': True,
    'youtube-skip-dash-manifest': True,
    'no_warnings': True,
    'ignoreerrors': True
}
df['description'] = [
    youtube_dl.YoutubeDL(ydl_opts).extract_info(
        u, download=False)['description'] for u in URL]
df
On reaching the first failing video I get the output:
TypeError: 'NoneType' object is not subscriptable
After that I replace 'forcedescription' in my code with 'extract_info':
def all_descriptions(URL):
    videoID = df['VideoId']
    URL = 'https://www.youtube.com/watch?v=' + videoID
    ydl_opts = {
        'forcedescription': True,
        'skip_download': True,
        'youtube-skip-dash-manifest': True,
        'no_warnings': True,
        'ignoreerrors': True
    }
    try:
        youtube_dl.YoutubeDL(ydl_opts).download(URL)
        return webpage
    except:
        pass
It skips all errors, but as a result there is nothing in the 'description' column.
Any suggestions?
You can use extract_info method:
df['description'] = [
    youtube_dl.YoutubeDL().extract_info(
        u, download=False)['description'] for u in URL]
df
Output:
Order VideoID description
0 1 jxwHmAoKte4 Bundesweit gelten sie nun ab heute, die schärf...
1 2 LsXM502SpiU Wie sicher ist der Impfstoff? Wäre eine Impfpf...
2 3 1I3f27iQ4pM Impfen ja oder nein, diese Frage stellen sich ...
P.S. The forcedescription parameter only prints the description to standard output; it doesn't return it.
Update: extract_info returns None if it fails, so if some videos may fail, we can check that the info is not None before taking the description from it:
ydl = youtube_dl.YoutubeDL(ydl_opts)
infos = [ydl.extract_info(u, download=False) for u in URL]
df['description'] = [
    info['description'] if info is not None else ''
    for info in infos]

GroupBy Function Not Applying

I am trying to group by the following specializations but I am not getting the expected result (or any, for that matter). The data stays ungrouped even after this step. Any idea what's wrong in my code?
cols_specials = ['Enterprise ID','Specialization','Specialization Branches','Specialization Type']
specials = pd.read_csv(agg_specials, engine='python')
specials = specials.merge(roster, left_on='Enterprise ID', right_on='Enterprise ID', how='left')
specials = specials[cols_specials]
specials = specials.groupby(['Enterprise ID'])['Specialization'].transform(lambda x: '; '.join(str(x)))
specials.to_csv(end_report_specials, index=False, encoding='utf-8-sig')
Please try using agg:
import pandas as pd
df = pd.DataFrame(
    [
        ['john', 'eng', 'build'],
        ['john', 'math', 'build'],
        ['kevin', 'math', 'asp'],
        ['nick', 'sci', 'spi']
    ],
    columns=['id', 'spec', 'type']
)
df.groupby(['id'])[['spec']].agg(lambda x: ';'.join(x))
results in:
If you need to preserve the original number of rows, use transform; transform returns a single column:
df['spec_grouped'] = df.groupby(['id'])[['spec']].transform(lambda x: ';'.join(x))
df
results in:
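To make the difference concrete on the same toy frame: agg collapses each group to one row, while transform keeps one value per original row (results computed from the code above):

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['john', 'eng', 'build'],
        ['john', 'math', 'build'],
        ['kevin', 'math', 'asp'],
        ['nick', 'sci', 'spi']
    ],
    columns=['id', 'spec', 'type']
)

# agg: one row per group, indexed by id
agg_result = df.groupby('id')['spec'].agg(lambda x: ';'.join(x))

# transform: same length as df, so it can be assigned back as a column
df['spec_grouped'] = df.groupby('id')['spec'].transform(lambda x: ';'.join(x))
```

The original code's lambda x: '; '.join(str(x)) joins the characters of the stringified Series; joining the Series itself, as here, joins the values.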

Conditional mapping among columns of two data frames with Pandas Data frame

I need your advice on how to map columns between DataFrames.
I have put it in a simple way so that it's easier for you to understand:
df = dataframe
EXAMPLE:
df1 = pd.DataFrame({
    "X": [],
    "Y": [],
    "Z": []
})
df2 = pd.DataFrame({
    "A": ['', '', 'A1'],
    "C": ['', '', 'C1'],
    "D": ['D1', 'Other', 'D3'],
    "F": ['', '', ''],
    "G": ['G1', '', 'G3'],
    "H": ['H1', 'H2', 'H3']
})
Requirement:
1st step:
We need to fill the X column of df1 from columns A, C, D in that order. The search should stop at the first value found and select it.
2nd step:
If the selected value is "Other", the X column of df1 should instead search columns F, G, and H in that order until it finds a value.
Result:
X
0 D1
1 H2
2 A1
Thank you so much in advance
Try this:
def first_non_empty(df, cols):
    """Return the first non-empty, non-null value among the specified columns per row"""
    return df[cols].replace('', pd.NA).bfill(axis=1).iloc[:, 0]

col_x = first_non_empty(df2, ['A', 'C', 'D'])
col_x = col_x.mask(col_x == 'Other', first_non_empty(df2, ['F', 'G', 'H']))
df1['X'] = col_x
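Putting the answer together with the df2 from the question as a self-contained sketch: '' is treated as missing, bfill(axis=1) pulls the first real value leftwards across the row, and mask swaps in the F/G/H fallback only where the first pass found "Other":

```python
import pandas as pd

df2 = pd.DataFrame({
    "A": ['', '', 'A1'],
    "C": ['', '', 'C1'],
    "D": ['D1', 'Other', 'D3'],
    "F": ['', '', ''],
    "G": ['G1', '', 'G3'],
    "H": ['H1', 'H2', 'H3']
})

def first_non_empty(df, cols):
    # Treat '' as missing, back-fill across columns, take the first column
    return df[cols].replace('', pd.NA).bfill(axis=1).iloc[:, 0]

col_x = first_non_empty(df2, ['A', 'C', 'D'])
col_x = col_x.mask(col_x == 'Other', first_non_empty(df2, ['F', 'G', 'H']))
```

Row 1 illustrates both steps: A/C/D gives 'Other', so the fallback scan of F/G/H supplies 'H2', matching the requested result.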

python3 variable expansion "a"*x to add empty elements in a list

How can I use python3 variable expansion to add empty elements into a list?
>>> "a"*5
'aaaaa'
This builds a list of three elements, one append at a time.
l = ['']
>>> l
['']
>>> l.append('')
>>> l.append('')
>>> l
['', '', '']
When I try to add 5 empty elements I get just one.
>>> l=['' * 5]
>>> l
['']
I am writing this list into a csv and want a cheap way to add empty columns (elements) to a row, where I build the row as a list.
It was just a matter of semantics: where I did the multiplication. '' * 5 multiplies the string, still giving one element, while [''] * 5 multiplies the list.
>>> l = [''] * 5
>>> l
['', '', '', '', '']
or
>>> l=[]
>>> l.extend([''] * 5)
>>> l
['', '', '', '', '']
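For the csv use case mentioned above, the list multiplication plugs straight into csv.writer (the header names here are made up for illustration):

```python
import csv
import io

# Three empty columns between two real ones, built as one list
row = ['header_a'] + [''] * 3 + ['header_b']

buf = io.StringIO()
csv.writer(buf).writerow(row)
# Each empty string becomes an empty field between commas
```

csv.writer uses '\r\n' line endings by default, so the buffer holds 'header_a,,,,header_b\r\n'.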