Pandas DataFrame error: nothing to repeat at position 17217

I am trying to use str.contains to check whether values from one dataframe's column appear in a column of another dataframe. Here is the code:
pattern = fr"(?:{'|'.join(strategic_accounts['Account Name'])})"
all_leads['in_strategic_list'] = all_leads['Company'].str.contains(pattern).astype(int)
Here are the heads of both dataframes as well as position 17217 of the all_leads dataframe. I don't understand the error because there doesn't appear to be anything abnormal at position 17217. Also, all related errors online seem to refer to "nothing to repeat at position 0", which seems like a different error since mine came up at position 17217. Any insights appreciated! Thanks!
This mock example works perfectly with the same code:
df1 = pd.DataFrame({'name': ['Marc', 'Jake', 'Sam', 'Brad', 'SpongeBob']})
df2 = pd.DataFrame({'IDs': ['Jake', 'John', 'Marc', 'Tony', 'Bob']})
pattern = fr"(?:{'|'.join(df2['IDs'])})"
df1['In_df2'] = df1['name'].str.contains(pattern).astype(int)
Update:
I have managed to figure out that the error refers to position 17217 in pattern, not in the strategic_accounts df. Printing position 17217 of pattern returns '*'. I have tried to apply the function below to pattern before passing it into str.contains, but I can't seem to get the character removed.
import re
pattern = fr"(?:{'|'.join(strategic_accounts['Account Name'])})"
def esc_spec_char(pattern):
    for p in pattern:
        # '\*' is a two-character string (backslash + asterisk), so it never
        # equals the single character p, and this branch is never reached.
        # Even if it were, re.sub('*', '1', p) would raise the same
        # "nothing to repeat" error ('*' alone is an invalid pattern),
        # and its result is discarded rather than written back.
        if p == '\*':
            re.sub('*', '1', p)
        else:
            continue
    return pattern
pattern = esc_spec_char(pattern)
pattern[17217]
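To illustrate what is going on, here is a minimal sketch (with made-up names) that reproduces the same error: an unescaped '*' sitting right after the '|' alternator gives the regex engine a quantifier with nothing to repeat.
import pandas as pd

names = pd.Series(['Acme Corp', '*Unknown'])   # hypothetical names; the leading '*' is the culprit
pattern = fr"(?:{'|'.join(names)})"            # -> '(?:Acme Corp|*Unknown)'
pd.Series(['Acme Corp Ltd']).str.contains(pattern)
# re.error: nothing to repeat at position 13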
New Update:
I have applied @LiamFiddler's method of turning the string into a re.Pattern object and run it on a dummy df, and while it does seem to escape the '*', it doesn't seem to find the 'N'. Not sure if I made some mistake. Here is the code:
sries = pd.Series(['x', 'y', '$', '%', '^', 'N', '*'])
ac = '|'.join(sries)
p = re.compile(re.escape(ac))
df1 = pd.DataFrame(data={'Id': [123, 232, 344, 455, 566, 377],
                         'col2': ["N", "X", "Y", '*', "W", "Z"]})
df1['col2'].str.contains(p, regex=True).astype(int)

EDIT
I realized that re.escape() also escapes the | delimiter, so I think the appropriate solution is to map re.escape() to the series before joining the names:
strategic_accounts['Escaped Accounts'] = strategic_accounts['Account Name'].apply(lambda x: re.escape(x))
pattern = re.compile('|'.join(strategic_accounts['Escaped Accounts']))
Then you can proceed as in the Original answer below, using Series.str.contains(). On your sample dataframe, here is what I get:
sries = pd.Series(['x','y','$','%','^','N','*'])
ac = sries.apply(lambda x: re.escape(x))
p = re.compile('|'.join(ac))
df1 = pd.DataFrame(data={'Id': [123, 232, 344, 455, 566, 377],
                         'col2': ["N", "X", "Y", '*', "W", "Z"]})
df1['col2'].str.contains(p, regex=True).astype(int)
Out:
0    1
1    0
2    0
3    1
4    0
5    0
Original
Ok, so based on the discovery of the special character, I think this is your solution:
First, we need to escape the special characters in the strings so that they don't mess up the regex. Fortunately, Python's re module has an escape() function specifically for escaping special characters.
import re
accounts = '|'.join(strategic_accounts['Account Name'])
pattern = re.compile(re.escape(accounts))
Now we can proceed as before:
all_leads['in_strategic_list'] = all_leads['Company'].str.contains(pattern, regex=True).astype(int)
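As the EDIT above notes, this Original version falls short: applying re.escape() to the already-joined string escapes the '|' delimiters too, so the pattern stops being an alternation and becomes one long literal. A quick sketch of the effect:
import re

re.escape('Acme|Star')   # -> 'Acme\\|Star': the '|' is now a literal pipe, not "or"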

Related

New column with word at nth position of string from other column pandas

import numpy as np
import pandas as pd
d = {'ABSTRACT_ID': [14145090, 1900667, 8157202, 6784974],
     'TEXT': [
         "velvet antlers vas are commonly used in tradit",
         "we have taken a basic biologic RPA to elucidat4",
         "ceftobiprole bpr is an investigational cephalo",
         "lipoperoxidationderived aldehydes for example"],
     'LOCATION': [1, 4, 2, 1]}
df = pd.DataFrame(data=d)
df
def word_at_pos(x, y):
    pos = x
    string = y
    count = 0
    res = ""
    for word in string:
        if word == ' ':
            count = count + 1
            if count == pos:
                break
            res = ""
        else:
            res = res + word
    print(res)

word_at_pos(df.iloc[0, 2], df.iloc[0, 1])
For this df I want to create a new column WORD that contains the word from TEXT at the position indicated by LOCATION. e.g. first line would be "velvet".
I can do this for a single row with the isolated function word_at_pos(x, y), but I can't work out how to apply it to the whole column. I have created new columns with lambda functions before, but can't work out how to fit this function into a lambda.
Looping over TEXT and LOCATION is probably the best approach here, because splitting the strings creates a jagged array, so filtering with NumPy advanced indexing won't be possible.
df["WORDS"] = [txt.split()[loc] for txt, loc in zip(df["TEXT"], df["LOCATION"]-1)]
print(df)
  ABSTRACT_ID  ...                    WORDS
0    14145090  ...                   velvet
1     1900667  ...                        a
2     8157202  ...                      bpr
3     6784974  ...  lipoperoxidationderived

[4 rows x 4 columns]
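An equivalent row-wise alternative (a sketch, not part of the original answer) is DataFrame.apply, which reads a little more explicitly at the cost of being slower than the list comprehension:
df["WORDS"] = df.apply(lambda row: row["TEXT"].split()[row["LOCATION"] - 1], axis=1)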

sort dataframe by string and set a new id

Is there a possibility to adjust the strings according to their order, for example 1.wav, 2.wav, 3.wav etc., and the ID accordingly with ID: 1, 2, 3 etc.? I have already tried several sorting options; do any of you have any ideas?
Thank you in advance
dataframe output
import os
from pathlib import Path
import pandas as pd

def createSampleDF(audioPath):
    data = []
    for file in Path(audioPath).glob('*.wav'):
        print(file)
        data.append([os.path.basename(file), file])
    df_dataSet = pd.DataFrame(data, columns=['audio_name', 'filePath'])
    df_dataSet['ID'] = df_dataSet.index + 1
    df_dataSet = df_dataSet[['ID', 'audio_name', 'filePath']]
    df_dataSet.sort_values(by=['audio_name'], inplace=True)
    return df_dataSet

def createSamples(myAudioPath, savePath, sampleLength, overlap=0):
    cutSamples(myAudioPath=myAudioPath, savePath=savePath, sampleLength=sampleLength)
    df_dataSet = createSampleDF(audioPath=savePath)
    return df_dataSet
You can split the string, convert it to an integer, and then sort on multiple columns. See pandas.DataFrame.sort_values for more info. If your file names are more complicated, you may need to design a regex to pull out the integers you want to sort on using pandas.Series.str.extract.
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'audio_name': ['1.wav', '10.wav', '96.wav', '3.wav', '55.wav']})
(df
 .assign(audio_name=lambda df_: df_.audio_name.str.split('.', expand=True).iloc[:, 0].astype('int'))
 .sort_values(by=['audio_name', 'ID']))
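Here is also a sketch of the str.extract() route mentioned above for messier file names, combined with reassigning the ID after sorting; the regex is an assumption about what the names look like:
# pull the leading integer out of names like '1.wav', sort on it, then renumber
df['num'] = df['audio_name'].str.extract(r'(\d+)', expand=False).astype(int)
df = df.sort_values('num').drop(columns='num').reset_index(drop=True)
df['ID'] = df.index + 1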

series.str.split(expand=True) returns error: Wrong number of items passed 2, placement implies 1

I have a series of web addresses, which I want to split by the first '.'. For example, return 'google' if the web address is 'google.co.uk'.
d1 = {'id':['1', '2', '3'], 'website':['google.co.uk', 'google.com.au', 'google.com']}
df1 = pd.DataFrame(data=d1)
d2 = {'id':['4', '5', '6'], 'website':['google.co.jp', 'google.com.tw', 'google.kr']}
df2 = pd.DataFrame(data=d2)
df_list = [df1, df2]
I use enumerate to iterate the dataframe list
for i, df in enumerate(df_list):
    df_list[i]['website_segments'] = df['website'].str.split('.', n=1, expand=True)
Received error: ValueError: Wrong number of items passed 2, placement implies 1
You are splitting the website, which gives you a list-like data structure: think ['google', 'co.uk']. You just want the first element of that list, so:
for i, df in enumerate(df_list):
    df_list[i]['website_segments'] = df['website'].str.split('.', n=1, expand=True)[0]
Another alternative is to use extract. It is also ~40% faster for your data:
for i, df in enumerate(df_list):
    df_list[i]['website_segments'] = df['website'].str.extract(r'(.*?)\.')

GroupBy Function Not Applying

I am trying to groupby for the following specializations but I am not getting the expected result (or any for that matter). The data stays ungrouped even after this step. Any idea what's wrong in my code?
cols_specials = ['Enterprise ID','Specialization','Specialization Branches','Specialization Type']
specials = pd.read_csv(agg_specials, engine='python')
specials = specials.merge(roster, left_on='Enterprise ID', right_on='Enterprise ID', how='left')
specials = specials[cols_specials]
specials = specials.groupby(['Enterprise ID'])['Specialization'].transform(lambda x: '; '.join(str(x)))
specials.to_csv(end_report_specials, index=False, encoding='utf-8-sig')
Please try using agg:
import pandas as pd

df = pd.DataFrame(
    [
        ['john', 'eng', 'build'],
        ['john', 'math', 'build'],
        ['kevin', 'math', 'asp'],
        ['nick', 'sci', 'spi']
    ],
    columns=['id', 'spec', 'type']
)
df.groupby(['id'])[['spec']].agg(lambda x: ';'.join(x))
results in:
           spec
id
john   eng;math
kevin      math
nick        sci
If you need to preserve the original number of rows, use transform; it returns a single column aligned with the original index:
df['spec_grouped'] = df.groupby(['id'])[['spec']].transform(lambda x: ';'.join(x))
df
results in:
      id  spec   type spec_grouped
0   john   eng  build     eng;math
1   john  math  build     eng;math
2  kevin  math    asp         math
3   nick   sci    spi          sci

Concatenate DataFrames.DataFrame in Julia

I have a problem when I try to concatenate multiple DataFrames (a datastructure from the DataFrames package!) with the same columns but different row numbers. Here's my code:
using DataFrames
DF = DataFrame()
DF[:x1] = 1:1000
DF[:x2] = rand(1000)
DF[:time] = append!( [0] , cumsum( diff(DF[:x1]).<0 ) ) + 1
DF1 = DF[DF[:time] .==1,:]
DF2 = DF[DF[:time] .==round(maximum(DF[:time])),:]
DF3 = DF[DF[:time] .==round(maximum(DF[:time])/4),:]
DF4 = DF[DF[:time] .==round(maximum(DF[:time])/2),:]
DF1[:T] = "initial"
DF2[:T] = "final"
DF3[:T] = "1/4"
DF4[:T] = "1/2"
DF = [DF1;DF2;DF3;DF4]
The last line gives me the error
MethodError: Cannot `convert` an object of type DataFrames.DataFrame to an object of type LastMain.LastMain.LastMain.DataFrames.AbstractDataFrame
This may have arisen from a call to the constructor LastMain.LastMain.LastMain.DataFrames.AbstractDataFrame(...),
since type constructors fall back to convert methods.
I don't understand this error message. Can you help me out? Thanks!
I just ran into this exact problem on Julia 0.5.0 x86_64-linux-gnu, DataFrames 0.8.5, with both hcat and vcat.
Neither clearing the workspace nor reloading DataFrames solved the problem, but restarting the REPL fixed it immediately.