columns.values is not returning the strings - pandas

I have a dataframe with column name msg that has string values.
I am trying to get these values using:
df['msg'].values
But I am getting integers (probably the index of the dataframe), not the text values.
What am I doing wrong?

Say you have a pandas dataframe with column 'msg':
df['msg'] = ['red', 'orange', 'yellow', 'green', 'blue', 'purple']
You can print just the string values with just:
df['msg'].values --> ['red', 'orange', 'yellow', 'green', 'blue', 'purple']
In order to print the index:
df['msg'].index.to_list() --> [0, 1, 2, 3, 4, 5]
You can print certain string values by indexing. If you wanted the first string value:
df['msg'][0] --> 'red'
Or last value:
df['msg'][5] --> 'purple'
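A minimal runnable version of the example above (the column name and values are taken from the question):

```python
import pandas as pd

# A DataFrame with a string column 'msg'
df = pd.DataFrame({'msg': ['red', 'orange', 'yellow', 'green', 'blue', 'purple']})

print(df['msg'].values)           # the six strings, not the index
print(df['msg'].index.to_list())  # [0, 1, 2, 3, 4, 5]
print(df['msg'][0])               # 'red'
print(df['msg'][5])               # 'purple'
```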

Related

Spacy NER output, How to save ent.label,ent.text?

I am using this code:
c = []
for i, j in article.iterrows():
    c.append(j)

d = []
for i in c:
    e = {}
    e['Urls'] = i[0]
    a = str(i[2])
    doc = ner(a)
    for ent in doc.ents:
        e[ent.label_] = ent.text
    d.append(e)
My output looks something like this:
[{'Urls': 'https://somewebsite.com',
'Fruit': 'Apple',
'Fruit_colour': 'Red'},
{'Urls': 'https://some_other_website.com/',
'Fruit': 'Papaya',
'Fruit_Colour': 'Yellow'}]
I have multiple values for Fruit; the desired output looks like:
{'Urls': 'https://somewebsite.com',
'Fruit': 'Apple',
'Fruit': 'orange',
'Fruit': 'watermelon',
'Fruit_colour': 'Red',
'Fruit_colour': 'orange',
'Fruit_colour': 'Green'}
{'Urls': 'https://some_other_website.com/',
'Fruit': 'Papaya',
'Fruit': 'Peach',
'Fruit': 'Mango',
'Fruit_Colour': 'Yellow',
'Fruit_Colour': 'Yellow',
'Fruit_Colour': 'Green'}
Your help and time are much appreciated, thank you.
It sounds like you want to save multiple values in a single key. You can use a defaultdict with lists for that.
from collections import defaultdict

out = defaultdict(list)
doc = ...  # get it from spaCy
for ent in doc.ents:
    out[ent.label_].append(ent.text)
print(out)
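A self-contained sketch of that pattern applied to one article. The spaCy `ner` model and the `article` dataframe from the question aren't available here, so the `(ent.label_, ent.text)` pairs and the URL are stand-ins:

```python
from collections import defaultdict

# Fake (label, text) pairs in place of spaCy's doc.ents for one article
entities = [('Fruit', 'Apple'), ('Fruit', 'orange'), ('Fruit', 'watermelon'),
            ('Fruit_colour', 'Red'), ('Fruit_colour', 'Green')]

e = defaultdict(list)
e['Urls'].append('https://somewebsite.com')
for label, text in entities:
    e[label].append(text)

# Each key now holds a list, so repeated labels no longer overwrite each other
print(dict(e))
```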

Creating multiple columns in pandas with lambda function

I'm trying to create a set of new columns with growth rates within my df in a more efficient way than computing them one by one.
My df has 100+ variables, but for simplicity, assume the following:
consumption = [5, 10, 15, 20, 25, 30, 35, 40]
wage = [10, 20, 30, 40, 50, 60, 70, 80]
period = [1, 2, 3, 4, 5, 6, 7, 8]
id = [1, 1, 1, 1, 1, 1, 1, 1]
tup = list(zip(id, period, wage, consumption))
df = pd.DataFrame(tup,
columns=['id', 'period', 'wage', 'consumption'])
With two variables I could simply do this:
df['wage_chg']= df.sort_values(by=['id', 'period']).groupby(['id'])['wage'].apply(lambda x: (x/x.shift(4)-1)).fillna(0)
df['consumption_chg']= df.sort_values(by=['id', 'period']).groupby(['id'])['consumption'].apply(lambda x: (x/x.shift(4)-1)).fillna(0)
But maybe by using a for loop or something I could iterate over my column names creating new growth rate columns with the name columnname_chg as in the example above.
Any ideas?
Thanks
You can try a DataFrame operation rather than a Series operation in groupby.apply:
cols = ['wage', 'consumption']
out = df.join(df.sort_values(by=['id', 'period'])
.groupby(['id'])[cols]
.apply(lambda g: (g/g.shift(4)-1)).fillna(0)
.add_suffix('_chg'))
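A runnable version of that idea with the question's sample data. `group_keys=False` is added so the apply result keeps the original row index and joins back cleanly (behaviour of `groupby.apply` on this point varies across pandas versions):

```python
import pandas as pd

df = pd.DataFrame({'id': [1] * 8,
                   'period': [1, 2, 3, 4, 5, 6, 7, 8],
                   'wage': [10, 20, 30, 40, 50, 60, 70, 80],
                   'consumption': [5, 10, 15, 20, 25, 30, 35, 40]})

cols = ['wage', 'consumption']
chg = (df.sort_values(by=['id', 'period'])
         .groupby('id', group_keys=False)[cols]
         .apply(lambda g: g / g.shift(4) - 1)   # growth rate vs. 4 periods ago
         .fillna(0)
         .add_suffix('_chg'))
out = df.join(chg)
print(out)
```

The same lambda is applied to every column in `cols` at once, so adding more variables only means extending the list.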

Is there a faster method using Numpy instead of Pandas groupby to calculate the cumulative mean?

I am trying to calculate, as time-efficiently as possible, the cumulative mean for each Player when they play a specific Position, for each stat column. However, since for this specific application I am only interested in past performance, I need to exclude the current value (shift the data once). I have created a list of all the stat columns, named Stats, to save time instead of looping through them.
Right now I am using the Pandas group-by function which is sadly too slow for my data. Although the question suggests using Numpy as an alternative, I am really just after the absolute fastest method.
This is my current code, with a minimum reproducible example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Player': ['Sam', 'Bob', 'Amy', 'Sam', 'Bob', 'Amy', 'Sam', 'Bob', 'Amy','Sam', 'Bob', 'Amy', 'Sam', 'Bob', 'Amy', 'Sam', 'Bob', 'Amy'],
'Position': ['Off', 'Def', 'Def', 'Def', 'Off', 'Def', 'Def', 'Off', 'Off', 'Off', 'Def', 'Def', 'Def', 'Off', 'Def', 'Def', 'Off', 'Off'],
'Stat A': [10, 20, 30, 25, 15, 10, 20, 20, 15, 15, 25, 35, 20, 10, 15, 25, 25, 10],
'Stat B': [15, 25, 35, 20, 10, 15, 25, 25, 10, 10, 20, 30, 25, 15, 10, 20, 20, 15]})
Stats = ['Stat A', 'Stat B']
dfgroupby = df[['Player', 'Position']]
dfshift1 = df.groupby(['Player', 'Position'])[Stats].shift(1)
dfshift2 = pd.concat([dfgroupby, dfshift1], axis = 1)
dfcumsum = dfshift2.groupby(['Player', 'Position'])[Stats].cumsum()
dfcumcount1 = dfshift2.groupby(['Player', 'Position'])[Stats].cumcount()
dfcumcount2 = pd.concat([dfcumcount1] * len(Stats), axis = 1)
dfcummean1 = pd.DataFrame(dfcumsum.values / dfcumcount2.values, columns = Stats).add_suffix(' - CumMean')
dfcummean2 = pd.concat([dfgroupby, dfcummean1], axis = 1)
dfcummean2
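For reference, the shift/cumsum/cumcount pipeline above computes the same quantity as a shifted expanding mean; a sketch with the question's data (not claimed to be faster, just more compact):

```python
import pandas as pd

df = pd.DataFrame({'Player': ['Sam', 'Bob', 'Amy', 'Sam', 'Bob', 'Amy', 'Sam', 'Bob', 'Amy',
                              'Sam', 'Bob', 'Amy', 'Sam', 'Bob', 'Amy', 'Sam', 'Bob', 'Amy'],
                   'Position': ['Off', 'Def', 'Def', 'Def', 'Off', 'Def', 'Def', 'Off', 'Off',
                                'Off', 'Def', 'Def', 'Def', 'Off', 'Def', 'Def', 'Off', 'Off'],
                   'Stat A': [10, 20, 30, 25, 15, 10, 20, 20, 15, 15, 25, 35, 20, 10, 15, 25, 25, 10],
                   'Stat B': [15, 25, 35, 20, 10, 15, 25, 25, 10, 10, 20, 30, 25, 15, 10, 20, 20, 15]})
Stats = ['Stat A', 'Stat B']

# Per (Player, Position) group: drop the current row via shift(1),
# then take the running mean of everything seen so far
past_mean = (df.groupby(['Player', 'Position'], group_keys=False)[Stats]
               .apply(lambda g: g.shift(1).expanding().mean()))
print(past_mean)
```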

Python: Without Itertools, how do I aggregate values in a list of lists for only the years that are included in that list?

I'm attempting to iterate over a list of lists that includes a date range of 1978-2020, but with only built-in Python modules. For instance, my nested list looks something like:
listing =[['0010', 'green', '1978', 'light'], ['0020', 'blue', '1978', 'dark'], ... ['2510', 'red', '2020', 'light']]
As I am iterating through, I am trying to make an aggregated count of colors and shades for that year, and then append that year's totals into a new list such as:
# ['year', 'blues', 'greens', 'light', 'dark']
annual_totals = [['1978', 12, 34, 8, 16], ['1979', 14, 40, 13, 9], ... , ['2020', 48, 98, 14, 10]]
So my failed code looks something like this:
annual_totals = []
for i in range(1978, 2021):
    for line in listing:
        while i == line[2]:  # if year in list same as year in iterated range, count tally for year
            blue = 0
            green = 0
            light = 0
            dark = 0
            if line[1] == 'blue':
                blue += 1
            if line[1] == 'green':
                green += 1
            if line[3] == 'light':
                light += 1
            if line[3] == 'dark':
                dark += 1
            tally = [i, blue, green, light, dark]
            annual_totals.append(tally)
Of course, I never get out of the While loop to get a new year for iterable i.
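One built-ins-only way around the while-loop problem is to skip the year range entirely and accumulate counts in a dict keyed by year; a sketch using the question's sample rows:

```python
# Sample rows from the question: [code, colour, year, shade]
listing = [['0010', 'green', '1978', 'light'],
           ['0020', 'blue', '1978', 'dark'],
           ['2510', 'red', '2020', 'light']]

totals = {}  # year -> [blues, greens, light, dark]
for code, colour, year, shade in listing:
    counts = totals.setdefault(year, [0, 0, 0, 0])
    if colour == 'blue':
        counts[0] += 1
    elif colour == 'green':
        counts[1] += 1
    if shade == 'light':
        counts[2] += 1
    elif shade == 'dark':
        counts[3] += 1

# Convert to the desired list-of-lists layout, sorted by year
annual_totals = [[year] + counts for year, counts in sorted(totals.items())]
print(annual_totals)
```

Only years that actually occur in `listing` get a row, which also avoids emitting all-zero years.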

Pandas Series .loc() access error after appending

I have a multi-index pandas series as below. I want to add a new entry (new_series) to multi_df, calling it multi_df_appended. However I don't understand the change in behaviour between multi_df and multi_df_appended when I try to access a non-existing multi-index.
Below is the code that reproduces the problem. I want the penultimate line of code: multi_df_appended.loc['five', 'black', 'hard', 'square' ] to return an empty Series like it does with multi_df but instead I get the error given. What am I doing wrong here?
import pandas as pd

df = pd.DataFrame({'id' : range(1,9),
'code' : ['one', 'one', 'two', 'three',
'two', 'three', 'one', 'two'],
'colour': ['black', 'white','white','white',
'black', 'black', 'white', 'white'],
'texture': ['soft', 'soft', 'hard','soft','hard',
'hard','hard','hard'],
'shape': ['round', 'triangular', 'triangular','triangular','square',
'triangular','round','triangular']
}, columns= ['id','code','colour', 'texture', 'shape'])
multi_df = df.set_index(['code','colour','texture','shape']).sort_index()['id']
# try to access a non-existing multi-index combination:
multi_df.loc['five', 'black', 'hard', 'square' ]
Series([], dtype: int64) # returns an empty Series as desired/expected.
# append multi_df with a new row
new_series = pd.Series([9], index = [('four', 'black', 'hard', 'round')] )
multi_df_appended = multi_df.append(new_series)
# now try again to access a non-existing multi-index combination:
multi_df_appended.loc['five', 'black', 'hard', 'square' ]
error: 'MultiIndex lexsort depth 0, key was length 4' # now instead of the empty Series, I get an error!?
As @Jeff answered, if I do .sortlevel(0) and then run .loc() for an unknown index, it does not give the "lexsort depth" error:
multi_df_appended_sorted = multi_df.append(new_series).sortlevel(0)
multi_df_appended_sorted.loc['five', 'black', 'hard', 'square' ]
Series([], dtype: int64)
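In recent pandas, Series.append and sortlevel have been removed; pd.concat and sort_index play the same roles. The underlying point is unchanged: appending breaks the lexical sort order of the MultiIndex, and sorting restores it. A small sketch with made-up labels:

```python
import pandas as pd

s = pd.Series([1, 2], index=pd.MultiIndex.from_tuples([('a', 'x'), ('b', 'y')]))
new = pd.Series([3], index=pd.MultiIndex.from_tuples([('a', 'z')]))

appended = pd.concat([s, new])  # ('a', 'z') lands after ('b', 'y')
print(appended.index.is_monotonic_increasing)   # no longer sorted

sorted_s = appended.sort_index()                # restores lexsort order
print(sorted_s.index.is_monotonic_increasing)
```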