series.str.split(expand=True) returns error: Wrong number of items passed 2, placement implies 1 - pandas

I have a series of web addresses that I want to split at the first '.'. For example, 'google.co.uk' should return 'google'.
import pandas as pd

d1 = {'id': ['1', '2', '3'], 'website': ['google.co.uk', 'google.com.au', 'google.com']}
df1 = pd.DataFrame(data=d1)
d2 = {'id': ['4', '5', '6'], 'website': ['google.co.jp', 'google.com.tw', 'google.kr']}
df2 = pd.DataFrame(data=d2)
df_list = [df1, df2]
I use enumerate to iterate over the list of dataframes:
for i, df in enumerate(df_list):
    df_list[i]['website_segments'] = df['website'].str.split('.', n=1, expand=True)
Received error: ValueError: Wrong number of items passed 2, placement implies 1

With expand=True, the split gives you a DataFrame with one column per piece, think ['google', 'co.uk'], and two columns cannot be placed into the single column you are assigning to. You just want the first piece, so:
for i, df in enumerate(df_list):
    df_list[i]['website_segments'] = df['website'].str.split('.', n=1, expand=True)[0]
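To see where the error comes from, here is a minimal check on df1 (a quick sketch, not part of the original answer; the exact display depends on your pandas version):
parts = df1['website'].str.split('.', n=1, expand=True)
print(parts.shape)  # (3, 2): two columns cannot be assigned to the single 'website_segments' column
print(parts[0])     # the first piece of each address, which is what we want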
Another alternative is to use extract. It is also ~40% faster for your data:
for i, df in enumerate(df_list):
    df_list[i]['website_segments'] = df['website'].str.extract(r'(.*?)\.')
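On the sample data above, either approach leaves a column like this (a quick check, using df1 from the question):
print(df1[['website', 'website_segments']])
#          website website_segments
# 0   google.co.uk           google
# 1  google.com.au           google
# 2     google.com           google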


sort dataframe by string and set a new id

Is there a possibility to adjust the strings according to the order, for example 1.wav, 2.wav, 3.wav etc., and the ID accordingly with ID: 1, 2, 3 etc.?
I have already tried several sorting options; do any of you have any ideas?
Thank you in advance.
dataframe output
import os
from pathlib import Path

import pandas as pd

def createSampleDF(audioPath):
    data = []
    for file in Path(audioPath).glob('*.wav'):
        print(file)
        data.append([os.path.basename(file), file])
    df_dataSet = pd.DataFrame(data, columns=['audio_name', 'filePath'])
    df_dataSet['ID'] = df_dataSet.index + 1
    df_dataSet = df_dataSet[['ID', 'audio_name', 'filePath']]
    df_dataSet.sort_values(by=['audio_name'], inplace=True)
    return df_dataSet

def createSamples(myAudioPath, savePath, sampleLength, overlap=0):
    cutSamples(myAudioPath=myAudioPath, savePath=savePath, sampleLength=sampleLength)
    df_dataSet = createSampleDF(audioPath=savePath)
    return df_dataSet
You can split the string, convert it to an integer, and then sort on multiple columns; see pandas.DataFrame.sort_values for more info. If your file names are more complicated, you may need to design a regex to pull out the integers you want to sort on, using pandas.Series.str.extract.
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'audio_name': ['1.wav', '10.wav', '96.wav', '3.wav', '55.wav']})

(df
 .assign(audio_name=lambda df_: df_.audio_name.str.split('.', expand=True).iloc[:, 0].astype('int'))
 .sort_values(by=['audio_name', 'ID']))
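If the names are messier than plain '<number>.wav', the str.extract route mentioned above could look like this; a minimal sketch using the df defined above (the 'sort_key' column name is just for illustration):
df['sort_key'] = df['audio_name'].str.extract(r'(\d+)', expand=False).astype(int)
df = df.sort_values(by=['sort_key', 'ID']).drop(columns='sort_key')
df['ID'] = range(1, len(df) + 1)  # re-number the IDs to follow the new order
print(df)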

In Pandas Dataframe error: nothing to repeat at position 17217

I am trying to use str.contains to identify elements of one column of a dataframe in another column of another dataframe. Here is the code:
pattern = fr"(?:{'|'.join(strategic_accounts['Account Name'])})"
all_leads['in_strategic_list'] = all_leads['Company'].str.contains(pattern).astype(int)
Here are the heads of both dataframes as well as position 17217 of the all_leads dataframe. I don't understand the error, because it looks like there isn't anything abnormal at position 17217. Also, all related errors online seem to refer to 'nothing to repeat at position 0', which seems like it would be a different error, since mine came up at position 17217. Any insights appreciated! Thanks!
This mock example works perfectly with the same code:
df1 = pd.DataFrame({'name': ['Marc', 'Jake', 'Sam', 'Brad', 'SpongeBob']})
df2 = pd.DataFrame({'IDs': ['Jake', 'John', 'Marc', 'Tony', 'Bob']})
pattern = fr"(?:{'|'.join(df2['IDs'])})"
df1['In_df2'] = df1['name'].str.contains(pattern).astype(int)
Update:
I have managed to figure out that the error refers to position 17217 in pattern, not in the strategic_accounts df. Printing pattern[17217] returns '*'. I have tried applying this function to pattern before passing it to str.contains, but I can't seem to get the '*' removed.
import re

pattern = fr"(?:{'|'.join(strategic_accounts['Account Name'])})"

def esc_spec_char(pattern):
    for p in pattern:
        if p == '\*':
            re.sub('*', '1', p)
        else:
            continue
    return pattern

pattern = esc_spec_char(pattern)
pattern[17217]
New_Update:
I have applied @LiamFiddler's method of turning the string into a re.Pattern object and run it on a dummy df; while it does seem to escape the *, it doesn't seem to find the N. Not sure if I made some mistake. Here is the code:
sries = pd.Series(['x', 'y', '$', '%', '^', 'N', '*'])
ac = '|'.join(sries)
p = re.compile(re.escape(ac))
df1 = pd.DataFrame(data={'Id': [123, 232, 344, 455, 566, 377],
                         'col2': ["N", "X", "Y", '*', "W", "Z"]})
df1['col2'].str.contains(p, regex=True).astype(int)
EDIT
I realized that re.escape() also escapes the | delimiter, so I think the appropriate solution is to map re.escape() to the series before joining the names:
strategic_accounts['Escaped Accounts'] = strategic_accounts['Account Name'].apply(lambda x: re.escape(x))
pattern = re.compile('|'.join(strategic_accounts['Escaped Accounts']))
Then you can proceed as below with using Series.str.contains(). On your sample dataframe, here is what I get:
sries = pd.Series(['x', 'y', '$', '%', '^', 'N', '*'])
ac = sries.apply(lambda x: re.escape(x))
p = re.compile('|'.join(ac))
df1 = pd.DataFrame(data={'Id': [123, 232, 344, 455, 566, 377],
                         'col2': ["N", "X", "Y", '*', "W", "Z"]})
df1['col2'].str.contains(p, regex=True).astype(int)
Out:
0    1
1    0
2    0
3    1
4    0
5    0
Original
Ok, so based on the discovery of the special character, I think this is your solution:
First, we need to escape the special characters in the strings so that they don't mess up the regex. Fortunately, Python's re module has an .escape() method specifically for escaping special characters.
import re
accounts = '|'.join(strategic_accounts['Account Name'])
pattern = re.compile(re.escape(accounts))
Now we can proceed as before:
all_leads['in_strategic_list'] = all_leads['Company'].str.contains(pattern, regex=True).astype(int)
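A quick illustration of why the EDIT above is needed: re.escape() escapes the '|' separators too, so the joined string stops being an alternation (a small sketch with made-up names):
import re

names = ['Acme', 'Star*Ind']                   # made-up account names for illustration
print(re.escape('|'.join(names)))              # Acme\|Star\*Ind  -> '|' escaped, no alternation left
print('|'.join(re.escape(n) for n in names))   # Acme|Star\*Ind   -> alternation kept, '*' still escaped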

Dataframe loc with multiple string value conditions

Hi, given this dataframe, is it possible to fetch the Number value associated with certain conditions using df.loc? This is what I came up with so far:
if df.loc[(df["Tags"]=="Brunei") & (df["Type"]=="Host"),"Number"]:
I want the output to be 1. Is this the correct way to do it?
You're on the right track, but you have to add ".values[0]" at the end of the .loc statement to extract the single value from the pandas Series it returns.
df = pd.DataFrame({
    'Tags': ['Brunei', 'China'],
    'Type': ['Host', 'Address'],
    'Number': [1, 1192]
})
display(df)

series = df.loc[(df["Tags"] == "Brunei") & (df["Type"] == "Host"), "Number"]
print(type(series))

value = df.loc[(df["Tags"] == "Brunei") & (df["Type"] == "Host"), "Number"].values[0]
print(type(value))
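If the conditions might not match any row, .values[0] raises an IndexError; a small defensive variant (a sketch, not part of the original answer):
match = df.loc[(df["Tags"] == "Brunei") & (df["Type"] == "Host"), "Number"]
if not match.empty:
    print(match.iloc[0])   # same value as .values[0], here 1
else:
    print("no matching row")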

Quantile across rows and down columns using selected columns only [duplicate]

I have a dataframe with column names, and I want to find the one that contains a certain string, but does not exactly match it. I'm searching for 'spike' in column names like 'spike-2', 'hey spike', 'spiked-in' (the 'spike' part is always continuous).
I want the column name to be returned as a string or a variable, so I access the column later with df['name'] or df[name] as normal. I've tried to find ways to do this, to no avail. Any tips?
Just iterate over DataFrame.columns; here is an example in which you end up with a list of the column names that match:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
spike_cols = [col for col in df.columns if 'spike' in col]
print(list(df.columns))
print(spike_cols)
Output:
['spike-2', 'hey spke', 'spiked-in', 'no']
['spike-2', 'spiked-in']
Explanation:
df.columns returns an Index of the column names
[col for col in df.columns if 'spike' in col] iterates over the list df.columns with the variable col and adds it to the resulting list if col contains 'spike'. This syntax is list comprehension.
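For readers less familiar with comprehensions, the equivalent explicit loop gives the same result:
spike_cols = []
for col in df.columns:
    if 'spike' in col:
        spike_cols.append(col)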
If you only want the resulting data set with the columns that match you can do this:
df2 = df.filter(regex='spike')
print(df2)
Output:
   spike-2  spiked-in
0        1          7
1        2          8
2        3          9
This answer uses the DataFrame.filter method to do this without list comprehension:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6]}
df = pd.DataFrame(data)
print(df.filter(like='spike').columns)
This will output just 'spike-2'. You can also use a regex, as some people suggested in the comments above:
print(df.filter(regex='spike|spke').columns)
Will output both columns: ['spike-2', 'hey spke']
You can also use df.columns[df.columns.str.contains(pat = 'spike')]
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
colNames = df.columns[df.columns.str.contains(pat = 'spike')]
print(colNames)
This will output the column names: 'spike-2', 'spiked-in'
More about pandas.Series.str.contains.
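The returned Index can then be used directly to subset the dataframe (a quick follow-up, not part of the original answer):
print(df[colNames])
#    spike-2  spiked-in
# 0        1          7
# 1        2          8
# 2        3          9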
# select columns containing 'spike'
df.filter(like='spike', axis=1)
You can also select by name or regular expression; refer to pandas.DataFrame.filter.
df.loc[:,df.columns.str.contains("spike")]
Another solution that returns a subset of the df with the desired columns:
df[df.columns[df.columns.str.contains("spike|spke")]]
You can also use this code:
spike_cols =[x for x in df.columns[df.columns.str.contains('spike')]]
Getting names and subsetting based on starts-with, contains, and ends-with:
# from: https://stackoverflow.com/questions/21285380/find-column-whose-name-contains-a-specific-string
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
# from: https://cmdlinetips.com/2019/04/how-to-select-columns-using-prefix-suffix-of-column-names-in-pandas/
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html
import pandas as pd
data = {'spike_starts': [1,2,3], 'ends_spike_starts': [4,5,6], 'ends_spike': [7,8,9], 'not': [10,11,12]}
df = pd.DataFrame(data)
print("\n")
print("----------------------------------------")
colNames_contains = df.columns[df.columns.str.contains(pat = 'spike')].tolist()
print("Contains")
print(colNames_contains)
print("\n")
print("----------------------------------------")
colNames_starts = df.columns[df.columns.str.contains(pat = '^spike')].tolist()
print("Starts")
print(colNames_starts)
print("\n")
print("----------------------------------------")
colNames_ends = df.columns[df.columns.str.contains(pat = 'spike$')].tolist()
print("Ends")
print(colNames_ends)
print("\n")
print("----------------------------------------")
df_subset_start = df.filter(regex='^spike',axis=1)
print("Starts")
print(df_subset_start)
print("\n")
print("----------------------------------------")
df_subset_contains = df.filter(regex='spike',axis=1)
print("Contains")
print(df_subset_contains)
print("\n")
print("----------------------------------------")
df_subset_ends = df.filter(regex='spike$',axis=1)
print("Ends")
print(df_subset_ends)

Managing MultiIndex Dataframe objects

So, yet another problem using grouped DataFrames that I am getting so confused over...
I have defined an aggregation dictionary as:
aggregations_level_1 = {
    'A': {
        'mean': 'mean',
    },
    'B': {
        'mean': 'mean',
    },
}
And now I have two grouped DataFrames that I have aggregated using the above, then joined:
grouped_top = df1.groupby(['group_lvl']).agg(aggregations_level_1)
grouped_bottom = df2.groupby(['group_lvl']).agg(aggregations_level_1)
Joining these:
df3 = grouped_top.join(grouped_bottom, how='left', lsuffix='_top_10', rsuffix='_low_10')
           A_top_10   A_low_10   B_top_10   B_low_10
               mean       mean       mean       mean
group_lvl
a          3.711413  14.515901   3.711413  14.515901
b          4.024877  14.442106   3.694689  14.209040
c          3.694689  14.209040   4.024877  14.442106
Now, if I call index and columns I have:
print df3.index
>> Index([u'a', u'b', u'c'], dtype='object', name=u'group_lvl')
print df3.columns
>> MultiIndex(levels=[[u'A_top_10', u'A_low_10', u'B_top_10', u'B_low_10'], [u'mean']],
labels=[[0, 1, 2, 3], [0, 0, 0, 0]])
So, it looks as though I have a regular DataFrame-object with index a,b,c but each column is a MultiIndex-object. Is this a correct interpretation?
How do I slice and call this? Say I would like to have only A_top_10, A_low_10 for all a,b,c?
Only A_top_10, B_top_10 for a and c?
I am pretty confused so any overall help would be great!
You need slicers, but first sort the columns with sort_index, otherwise you get this error:
UnsortedIndexError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (1), lexsort depth (0)'
df = df.sort_index(axis=1)
idx = pd.IndexSlice
df1 = df.loc[:, idx[['A_low_10', 'A_top_10'], :]]
print (df1)
            A_low_10  A_top_10
                mean      mean
group_lvl
a          14.515901  3.711413
b          14.442106  4.024877
c          14.209040  3.694689
And:
idx = pd.IndexSlice
df2 = df.loc[['a','c'], idx[['A_top_10', 'B_top_10'], :]]
print (df2)
           A_top_10  B_top_10
               mean      mean
group_lvl
a          3.711413  3.711413
c          3.694689  4.024877
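For whole top-level column groups you can also skip the slicer and index by the first level directly; a small sketch on the same df (not part of the original answer):
# plain [] selects whole groups from the first level of the column MultiIndex
top_groups = df[['A_low_10', 'A_top_10']]

# a single sub-level can be pulled across all groups with xs
means = df.xs('mean', axis=1, level=1)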
EDIT:
So, it looks as though I have a regular DataFrame-object with index a,b,c but each column is a MultiIndex-object. Is this a correct interpretation?
I think that is very close; it is more accurate to say that the columns are a MultiIndex.