I have a list of NBA team names that have doubled up. How do I remove the entries with \xa0?
This is the output I get.
['Atlanta Hawks', 'Atlanta Hawks\xa0', 'Boston Celtics', 'Boston Celtics\xa0', etc.
I am removing the seed from the web-scraping output with:
teams["Team"] = teams["Team"].str.replace("(1)", "", regex=False)
teams["Team"] = teams["Team"].str.replace("(2)", "", regex=False)
teams["Team"] = teams["Team"].str.replace("(15)", "", regex=False)
How do I get my list to not include the \xa0 entries?
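A minimal sketch of one way to handle this, assuming the names still live in the teams["Team"] column as above: strip the non-breaking space (\xa0) the same way the seed markers are stripped, then drop the duplicate rows that remain.
# Assumption: teams["Team"] exists as in the snippet above.
teams["Team"] = teams["Team"].str.replace("\xa0", "", regex=False).str.strip()
# 'Atlanta Hawks' and 'Atlanta Hawks\xa0' now collapse to the same string,
# so the doubled-up rows can be dropped.
teams = teams.drop_duplicates(subset="Team").reset_index(drop=True)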
Is there a way to adjust the strings according to their numeric order, for example 1.wav, 2.wav, 3.wav, etc., and assign the ID accordingly (ID: 1, 2, 3, etc.)?
I have already tried several sorting options. Does anyone have any ideas?
Thank you in advance.
dataframe output
import os
from pathlib import Path
import pandas as pd

def createSampleDF(audioPath):
    data = []
    for file in Path(audioPath).glob('*.wav'):
        print(file)
        data.append([os.path.basename(file), file])
    df_dataSet = pd.DataFrame(data, columns=['audio_name', 'filePath'])
    df_dataSet['ID'] = df_dataSet.index + 1
    df_dataSet = df_dataSet[['ID', 'audio_name', 'filePath']]
    df_dataSet.sort_values(by=['audio_name'], inplace=True)
    return df_dataSet

def createSamples(myAudioPath, savePath, sampleLength, overlap=0):
    cutSamples(myAudioPath=myAudioPath, savePath=savePath, sampleLength=sampleLength)  # cutSamples is defined elsewhere
    df_dataSet = createSampleDF(audioPath=savePath)
    return df_dataSet
You can split the string, convert it to an integer, and then sort on multiple columns. See pandas.DataFrame.sort_values for more info. If your file names are more complicated, you may need to design a regex to pull out the integers you want to sort on, using pandas.Series.str.extract.
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'audio_name': ['1.wav', '10.wav', '96.wav', '3.wav', '55.wav']})

(df
 .assign(audio_name=lambda df_: df_.audio_name.str.split('.', expand=True).iloc[:, 0].astype('int'))
 .sort_values(by=['audio_name', 'ID']))
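If the file names carry extra text, say something like sample_1.wav (an assumption for illustration, not from the question), pandas.Series.str.extract can pull out just the digits for sorting:
# Hypothetical names with a prefix (assumption).
df2 = pd.DataFrame({'audio_name': ['sample_10.wav', 'sample_2.wav', 'sample_1.wav']})

# expand=False returns a Series for the single capture group; cast to int for a numeric sort.
df2['num'] = df2['audio_name'].str.extract(r'(\d+)', expand=False).astype(int)
df2 = df2.sort_values('num')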
I have a pandas DataFrame like the following:
import pandas as pd

df = pd.DataFrame({'text': ['the weather is nice though', 'How are you today', 'the beautiful girl and the nice boy']})
df['sentence_number'] = df.index + 1
df['token'] = df['text'].str.split().tolist()
df = df.explode('token').reset_index(drop=True)
I have to have a column for tokens as I need it for another project. I have applied the following to my dataframe.
import spacy

nlp = spacy.load("en_core_web_sm")
dep_children_sm = []

def dep_children_tagger(txt):
    children = [[[child for child in n.children] for n in doc] for doc in nlp.pipe(txt)]
    dep_children_sm.append(children)

dep_children_tagger(df.text)
Since n.children has to be applied at the sentence level, I have to use the text column and not the token column, so the output contains repeated lists. I would now like to remove these repetitions from my list dep_children_sm, and I have done the following:
import itertools

children_flattened = [item for sublist in dep_children_sm for item in sublist]
list(k for k, _ in itertools.groupby(children_flattened))
but nothing happens, and I still have the repeated lists. I have also tried adding drop_duplicates() to the text column when calling the function, but the problem is that I have duplicate sentences in my original dataframe, so unfortunately I cannot do that.
desired output = [[[], [the], [weather, nice, though], [], []], [[], [How, you, today], [], []], [[], [], [the, beautiful, and, boy], [], [], [], [the, nice]]]
It seems that you want to apply your function to the unique texts. Thus you can first use the pandas.Series.unique method on df.text:
>>> df['text'].unique()
array(['the weather is nice though', 'How are you today',
'the beautiful girl and the nice boy'], dtype=object)
Then I would simplify your function to directly output the result. There is no need for a global list. Also, your function was adding an extra level of list, which seems unwanted.
def dep_children_tagger(txt):
return [[[child for child in n.children] for n in doc] for doc in nlp.pipe(txt)]
Finally, apply your function on the unique texts:
dep_children_sm = dep_children_tagger(df['text'].unique())
This gives:
>>> dep_children_sm
[[[], [the], [weather, nice, though], [], []],
[[], [How, you, today], [], []],
[[], [], [the, beautiful, and, boy], [], [], [], [the, nice]]]
OK, I figured out how to solve this issue. The problem was that running nlp over the text outputs a list of lists of lists of spaCy tokens, and since there is no string anywhere in this nested list, the itertools approach does not work.
Since I cannot remove the duplicates from the text column in my analysis, I have done the following instead.
d = [' '.join([str(c) for c in lst]) for lst in children_flattened]
list(set(d))
this outputs a list of strings excluding duplicates
# ['[] [How, you, today] [] []',
# '[] [the] [weather, nice, though] [] []',
# '[] [] [the, beautiful, and, boy] [] [] [] [the, nice]']
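If the spaCy Token objects themselves are needed later (rather than their string form), a sketch of an alternative, assuming children_flattened as built above, is to keep the first occurrence of each list, keyed by its string representation:
seen = set()
unique_children = []
for lst in children_flattened:
    key = ' '.join(str(c) for c in lst)   # string key, since lists of Tokens aren't hashable
    if key not in seen:
        seen.add(key)
        unique_children.append(lst)       # the original Token objects are preserved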
Hi, given this DataFrame, is it possible to fetch the Number value associated with certain conditions using df.loc? This is what I came up with so far.
if df.loc[(df["Tags"]=="Brunei") & (df["Type"]=="Host"),"Number"]:
I want the output to be 1. Is this the correct way to do it?
You're on the right track, but you have to append .values[0] to the end of the .loc statement to extract the single value from the resulting pandas Series.
import pandas as pd

df = pd.DataFrame({
    'Tags': ['Brunei', 'China'],
    'Type': ['Host', 'Address'],
    'Number': [1, 1192]
})
display(df)  # display() assumes a Jupyter notebook

# Without .values[0] you get a one-element Series, not the scalar itself.
series = df.loc[(df["Tags"] == "Brunei") & (df["Type"] == "Host"), "Number"]
print(type(series))   # <class 'pandas.core.series.Series'>

value = df.loc[(df["Tags"] == "Brunei") & (df["Type"] == "Host"), "Number"].values[0]
print(type(value))    # <class 'numpy.int64'>
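Alternatively, a sketch assuming exactly one row matches the condition: Series.item() also returns the scalar, and raises an error if the match is not unique, which can be a useful safety check.
value = df.loc[(df["Tags"] == "Brunei") & (df["Type"] == "Host"), "Number"].item()
print(value)   # 1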
Good Morning,
I have a dictionary organized as shown below, and what I want to do is use each dictionary value as the column number for its key, as shown below.
My first idea was to loop through the dictionary and create a text file where dico_values = tabs, and then transform this new file into an Excel file, but this seems like one step too many. Cheers to all.
You could perhaps try this:
new_dico = {value: [] for value in dico.values()}  # {1: [], 2: [], 3: [], ...}
for key, value in dico.items():
    new_dico[value].append(key)
    for otherkey in new_dico.keys():
        if otherkey == value:
            continue
        else:
            new_dico[otherkey].append("")
# new_dico == {1: ["President of ...", "Members of ..", "", ...], 2: ["", "", "President's Office", ...], ...}
# Then you can make a dataframe of 'new_dico' with pandas, for instance
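Since every list in new_dico ends up the same length (each key receives either the label or an empty string per iteration), the last step can be a direct DataFrame construction. A sketch with a hypothetical dico, since the real one isn't shown in the question:
import pandas as pd

# Hypothetical input (assumption): labels mapped to column numbers.
dico = {"President of the Board": 1, "Members of the Board": 1, "President's Office": 2}

new_dico = {value: [] for value in dico.values()}
for key, value in dico.items():
    new_dico[value].append(key)
    for otherkey in new_dico.keys():
        if otherkey == value:
            continue
        new_dico[otherkey].append("")

df = pd.DataFrame(new_dico)                          # columns 1 and 2, padded with ""
df.to_excel("columns_by_value.xlsx", index=False)    # hypothetical file name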
I have the following dataframe of securities and computed a 'liquidity score' in the last column, where 1 = liquid, 2 = less liquid, and 3 = illiquid. I want to group the securities (dynamically) by their liquidity. Is there a way to group them and include some kind of header for each group? How can this best be achieved? Below is the code and an example of how it is supposed to look.
import pandas as pd
df = pd.DataFrame({'ID':['XS123', 'US3312', 'DE405'], 'Currency':['EUR', 'EUR', 'USD'], 'Liquidity score':[2,3,1]})
df = df.sort_values(by=["Liquidity score"])
print(df)
# 1 = liquid, 2 = less liquid, 3 = illiquid
Add labels for liquidity score
The following replaces the numbers in Liquidity score with labels:
df['grp'] = df['Liquidity score'].replace({1:'Liquid', 2:'Less liquid', 3:'Illiquid'})
Headers for each group
As per your comment, find below a solution to do this.
Let's illustrate this with a small data example.
df = pd.DataFrame({'ID':['XS223', 'US934', 'US905', 'XS224', 'XS223'], 'Currency':['EUR', 'USD', 'USD','EUR','EUR',]})
Insert a header on specific rows using np.insert.
import numpy as np

df = pd.DataFrame(np.insert(df.values, 0, values=["Liquid", ""], axis=0))
df = pd.DataFrame(np.insert(df.values, 2, values=["Less liquid", ""], axis=0))
df.columns = ['ID', 'Currency']
Using the pandas Styler, we can add a background color, change the font weight to bold, and align the text to the left.
df.style.hide_index().set_properties(subset = pd.IndexSlice[[0,2], :], **{'font-weight' : 'bold', 'background-color' : 'lightblue', 'text-align': 'left'})
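On newer pandas versions (1.4+), Styler.hide_index() is deprecated in favour of Styler.hide(axis='index'), so the equivalent call would be:
df.style.hide(axis='index').set_properties(subset=pd.IndexSlice[[0, 2], :], **{'font-weight': 'bold', 'background-color': 'lightblue', 'text-align': 'left'})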
You can add a new column like this:
import numpy as np

df['group'] = np.select(
    [
        df['Liquidity score'].eq(1),
        df['Liquidity score'].eq(2)
    ],
    [
        'Liquid', 'Less liquid'
    ],
    default='Illiquid'
)
And try setting it as the index, so you can filter using the index:
df.set_index(['group', 'ID'], inplace=True)
df.loc['Less liquid', :]
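To print the groups dynamically with a header line above each block, a sketch assuming the 'group' column from above (i.e. before it is moved into the index) could use groupby:
for label, block in df.groupby('group', sort=False):
    print(f"--- {label} ---")                               # header for the group
    print(block[['ID', 'Currency']].to_string(index=False)) # the securities in that group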