spaCy NER output: how to save ent.label_ and ent.text? - spacy

I am using this code:
c = []
for i, j in article.iterrows():
    c.append(j)
d = []
for i in c:
    e = {}
    e['Urls'] = i[0]
    a = str(i[2])
    doc = ner(a)
    for ent in doc.ents:
        e[ent.label_] = ent.text
    d.append(e)
My output looks something like this:
[{'Urls': 'https://somewebsite.com',
'Fruit': 'Apple',
'Fruit_colour': 'Red'},
{'Urls': 'https://some_other_website.com/',
'Fruit': 'Papaya',
'Fruit_Colour': 'Yellow'}]
Each article has multiple values per label (for example, several fruits), but only one value per label is kept. The desired output looks like this:
{'Urls': 'https://somewebsite.com',
'Fruit': 'Apple',
'Fruit': 'orange',
'Fruit': 'watermelon',
'Fruit_colour': 'Red',
'Fruit_colour': 'orange',
'Fruit_colour': 'Green'}
{'Urls': 'https://some_other_website.com/',
'Fruit': 'Papaya',
'Fruit': 'Peach',
'Fruit': 'Mango',
'Fruit_Colour': 'Yellow',
'Fruit_Colour': 'Yellow',
'Fruit_Colour': 'Green'}
Your help and time are much appreciated. Thank you.

It sounds like you want to save multiple values in a single key. You can use a defaultdict with lists for that.
from collections import defaultdict
out = defaultdict(list)
doc = ner(a)  # the spaCy Doc, obtained as in the question
for ent in doc.ents:
    out[ent.label_].append(ent.text)
print(out)
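To tie this back to the loop in the question, here is a minimal sketch that builds one dictionary per URL, with a list of values under each label. The `FakeEnt` tuple and the `group_entities` helper are hypothetical names used only for illustration; `FakeEnt` stands in for spaCy's entity objects so the example runs without a trained model, and in real code you would pass `doc.ents` instead.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for spaCy entities: only .label_ and .text are used.
FakeEnt = namedtuple("FakeEnt", ["label_", "text"])


def group_entities(url, ents):
    """Collect every entity text under its label, keeping all values."""
    grouped = defaultdict(list)
    for ent in ents:
        grouped[ent.label_].append(ent.text)
    # Return a plain dict with the URL first, then one list per label.
    return {"Urls": url, **grouped}


# Made-up rows mirroring the question: (url, entities found in the text).
rows = [
    ("https://somewebsite.com",
     [FakeEnt("Fruit", "Apple"), FakeEnt("Fruit", "orange"),
      FakeEnt("Fruit_colour", "Red"), FakeEnt("Fruit_colour", "orange")]),
    ("https://some_other_website.com/",
     [FakeEnt("Fruit", "Papaya"), FakeEnt("Fruit_colour", "Yellow")]),
]

d = [group_entities(url, ents) for url, ents in rows]
print(d)
```

A Python dict cannot hold the same key twice, which is why the repeated 'Fruit' keys in the desired output have to become a single key holding a list.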

Inputting first and last name to output a value in Pandas Dataframe

I am trying to create an input function that returns a value for the corresponding first and last name.
For this example I'd like to be able to enter "Emily" and "Bell" and return "attempts: 2".
Here's my code so far:
import pandas as pd
import numpy as np
data = {
'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
'lastname': ['Thompson','Wu', 'Downs','Hunter','Bell','Cisneros', 'Becker', 'Sims', 'Gallegos', 'Horne'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no',
'yes', 'yes', 'no', 'no', 'yes']
}
data
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
df
fname = input()
lname = input()
print(f"{fname} {lname}'s number of attempts: {???}")
I thought there would be specific documentation for this, but I can't find any in the pandas DataFrame documentation. I assume it's pretty simple, but I can't find it.
fname = input()
lname = input()
# use loc to filter the row and then capture the value from attempts columns
print(f"{fname} {lname}'s number of attempts:{df.loc[df['name'].eq(fname) & df['lastname'].eq(lname)]['attempts'].squeeze()}")
Emily
Bell
Emily Bell's number of attempts:2
Alternatively, to avoid a mismatch due to case:
fname = input().lower()
lname = input().lower()
print(f"{fname} {lname}'s number of attempts:{df.loc[(df['name'].str.lower() == fname) & (df['lastname'].str.lower() == lname)]['attempts'].squeeze()}")
emily
BELL
emily bell's number of attempts:2
Try this:
df[(df['name'] == fname) & (df['lastname'] == lname)]['attempts'].squeeze()
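One thing the snippets above do not cover is what happens when no row matches: `squeeze()` on an empty selection returns an empty Series rather than a number. The following is a minimal sketch of a lookup helper that returns None in that case. The function name `attempts_for` and the shortened three-row frame are assumptions made for illustration.

```python
import pandas as pd

# Shortened version of the question's data, for illustration only.
df = pd.DataFrame({
    'name': ['Anastasia', 'Dima', 'Emily'],
    'lastname': ['Thompson', 'Wu', 'Bell'],
    'attempts': [1, 3, 2],
})


def attempts_for(frame, fname, lname):
    """Return the attempts for a case-insensitive name match, or None."""
    mask = (
        (frame['name'].str.lower() == fname.lower())
        & (frame['lastname'].str.lower() == lname.lower())
    )
    matches = frame.loc[mask, 'attempts']
    if matches.empty:
        return None
    # If several rows match, this sketch simply takes the first one.
    return int(matches.iloc[0])


print(attempts_for(df, 'emily', 'BELL'))
print(attempts_for(df, 'Nobody', 'Here'))
```

Passing the mask and the column name to `.loc` together also avoids the chained `df.loc[...]['attempts']` lookup.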

columns.values is not returning the strings

I have a dataframe with a column named msg that holds string values.
I am trying to get these values using:
df['msg'].values
But I am getting integers (probably the index of the dataframe) and not the texts.
What am I doing wrong?
Say you have a pandas dataframe with column 'msg':
df['msg'] = ['red', 'orange', 'yellow', 'green', 'blue', 'purple']
You can get just the string values with:
df['msg'].values --> array(['red', 'orange', 'yellow', 'green', 'blue', 'purple'], dtype=object)
In order to get the index:
df['msg'].index.to_list() --> [0, 1, 2, 3, 4, 5]
You can get individual string values by indexing. If you wanted the first string value:
df['msg'][0] --> 'red'
Or the last value:
df['msg'][5] --> 'purple'
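As a small self-contained check of the difference between a column's values and its index, here is a sketch using a made-up frame with a non-default index (the index labels 10, 20, 30 are an assumption, chosen so they cannot be confused with positions):

```python
import pandas as pd

# Made-up data with a non-default index so labels and positions differ.
df = pd.DataFrame({'msg': ['red', 'orange', 'yellow']}, index=[10, 20, 30])

values = df['msg'].tolist()     # the strings stored in the column
labels = df.index.tolist()      # the index labels, not the data
first = df['msg'].iloc[0]       # first value by position
by_label = df['msg'].loc[20]    # value stored under index label 20

print(values, labels, first, by_label)
```

If integers come back where strings are expected, it may be worth checking whether the code is reading `df.index` (or a different column) rather than `df['msg']`.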

how to extract data from column which looks like a dictionary in Pandas?

Hi I am new to pandas/python and trying to read a txt file in pandas
I want to extract key, value pairs for each row.
Make the key as new column name and its respective value as values.
Input
data
{'Name': 'Tim', 'Class': 'Ninth', 'Hobbies' : 'Football'}
{'Name': 'Tom', 'Class': 'Ninth', 'Hobbies' : 'Football'}
{'Name': 'Jim', 'Class': 'Ninth', 'Hobbies' : 'Football'}
{'Name': 'John', 'Class': 'Ninth'}
Expected Output:
Name Class Hobbies
Tim Ninth Football
Tom Ninth Football
Jim Ninth Football
John Ninth NA
import pandas as pd
df1 = pd.read_csv('9data.txt',sep = '\t')
df1['Name'] = df1['data'].apply(lambda x : x.values()[1])
print(df1)
Error: AttributeError: 'str' object has no attribute 'values'
Is there any way I can do this in pandas?
Given the way the data was being read, I could build a new dataframe using eval(). This iterates over each cell, creates a one-row dataframe from it, and then concatenates them. Note that eval() executes arbitrary code, so only use it on data you trust.
import io
import pandas as pd
data='''data
{'Name': 'Tim', 'Class': 'Ninth', 'Hobbies' : 'Football'}
{'Name': 'Tom', 'Class': 'Ninth', 'Hobbies' : 'Football'}
{'Name': 'Jim', 'Class': 'Ninth', 'Hobbies' : 'Football'}
{'Name': 'John', 'Class': 'Ninth'}'''
df = pd.read_csv(io.StringIO(data), sep='\t', engine='python')
df1 = pd.concat([pd.json_normalize(eval(x)) for x in df['data']])
Output
Name Class Hobbies
0 Tim Ninth Football
0 Tom Ninth Football
0 Jim Ninth Football
0 John Ninth NaN
If you can get your data to look like this, there is a simpler method, which Anurag Dabas alludes to. You might consider reading the file into a list first and then creating the dataframe, rather than creating a dataframe from a dataframe.
datal = [{'Name': 'Tim', 'Class': 'Ninth', 'Hobbies' : 'Football'},
{'Name': 'Tom', 'Class': 'Ninth', 'Hobbies' : 'Football'},
{'Name': 'Jim', 'Class': 'Ninth', 'Hobbies' : 'Football'},
{'Name': 'John', 'Class': 'Ninth'}]
df = pd.DataFrame(datal)
df
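A sketch of the "read into a list first" route, assuming each line of the file is a Python dict literal as in the question. It swaps in `ast.literal_eval` from the standard library instead of `eval()`, since it only parses literals and does not execute code. The `raw` string stands in for the contents of the text file.

```python
import ast

import pandas as pd

# Stand-in for the file contents; in practice, read this from the file.
raw = """data
{'Name': 'Tim', 'Class': 'Ninth', 'Hobbies' : 'Football'}
{'Name': 'Tom', 'Class': 'Ninth', 'Hobbies' : 'Football'}
{'Name': 'Jim', 'Class': 'Ninth', 'Hobbies' : 'Football'}
{'Name': 'John', 'Class': 'Ninth'}"""

# Skip the 'data' header line and ignore any blank lines.
lines = [line for line in raw.splitlines()[1:] if line.strip()]

# Parse each line into a dict, then let pandas align the keys as columns.
records = [ast.literal_eval(line) for line in lines]
df = pd.DataFrame(records)
print(df)
```

Rows that lack a key, such as John's missing 'Hobbies', come out as NaN in that column.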

Convert pandas to dictionary defining the columns used for the key values

I have the pandas dataframe 'test_df' shown below. My aim is to convert it to a dictionary.
id Name Gender Age
0 1 'Peter' 'M' 32
1 2 'Lara' 'F' 45
Therefore I run this:
test_dict = test_df.set_index('id').T.to_dict()
The output is this:
{1: {'Name': 'Peter', 'Gender': 'M', 'Age': 32}, 2: {'Name': 'Lara', 'Gender': 'F', 'Age': 45}}
Now, I want to choose only the 'Name' and 'Gender' columns as the values of the dictionary's keys. I'm trying to modify the above script into something like this:
test_dict = test_df.set_index('id')['Name']['Gender'].T.to_dict()
with no success!
Any suggestion please?!
You were very close; select a subset of columns with [['Name','Gender']]:
test_dict = test_df.set_index('id')[['Name','Gender']].T.to_dict()
print (test_dict)
{1: {'Name': 'Peter', 'Gender': 'M'}, 2: {'Name': 'Lara', 'Gender': 'F'}}
Also, T is not necessary if you use the parameter orient='index':
test_dict = test_df.set_index('id')[['Name','Gender']].to_dict(orient='index')
print (test_dict)
{1: {'Name': 'Peter', 'Gender': 'M'}, 2: {'Name': 'Lara', 'Gender': 'F'}}
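For reference, here is a self-contained sketch that rebuilds the question's frame and runs both variants. It also shows what selecting a single column gives, which may help explain why `['Name']['Gender']` fails: `['Name']` already returns a Series, so the second `['Gender']` is treated as an index label lookup rather than a column selection. The variable names other than `test_df` are assumptions.

```python
import pandas as pd

test_df = pd.DataFrame({
    'id': [1, 2],
    'Name': ['Peter', 'Lara'],
    'Gender': ['M', 'F'],
    'Age': [32, 45],
})

subset = test_df.set_index('id')[['Name', 'Gender']]

via_transpose = subset.T.to_dict()
via_orient = subset.to_dict(orient='index')

# A single column is a Series; its to_dict() maps id -> value directly.
names_only = test_df.set_index('id')['Name'].to_dict()

print(via_transpose)
print(via_orient)
print(names_only)
```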

Pandas Series .loc() access error after appending

I have a multi-index pandas series as below. I want to add a new entry (new_series) to multi_df, calling it multi_df_appended. However I don't understand the change in behaviour between multi_df and multi_df_appended when I try to access a non-existing multi-index.
Below is the code that reproduces the problem. I want the penultimate line of code: multi_df_appended.loc['five', 'black', 'hard', 'square' ] to return an empty Series like it does with multi_df but instead I get the error given. What am I doing wrong here?
df = pd.DataFrame({'id' : range(1,9),
'code' : ['one', 'one', 'two', 'three',
'two', 'three', 'one', 'two'],
'colour': ['black', 'white','white','white',
'black', 'black', 'white', 'white'],
'texture': ['soft', 'soft', 'hard','soft','hard',
'hard','hard','hard'],
'shape': ['round', 'triangular', 'triangular','triangular','square',
'triangular','round','triangular']
}, columns= ['id','code','colour', 'texture', 'shape'])
multi_df = df.set_index(['code','colour','texture','shape']).sort_index()['id']
# try to access a non-existing multi-index combination:
multi_df.loc['five', 'black', 'hard', 'square' ]
Series([], dtype: int64) # returns an empty Series as desired/expected.
# append multi_df with a new row
new_series = pd.Series([9], index = [('four', 'black', 'hard', 'round')] )
multi_df_appended = multi_df.append(new_series)
# now try again to access a non-existing multi-index combination:
multi_df_appended.loc['five', 'black', 'hard', 'square' ]
error: 'MultiIndex lexsort depth 0, key was length 4' # now instead of the empty Series, I get an error!?
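As @Jeff answered, appending leaves the MultiIndex unsorted, which is what triggers the "lexsort depth" error. If I call .sortlevel(0) after appending and then run .loc[] for an unknown index, the error goes away (in newer pandas versions, where sortlevel() and Series.append() have been removed, sort_index() and pd.concat() are the replacements):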
multi_df_appended_sorted = multi_df.append(new_series).sortlevel(0)
multi_df_appended_sorted.loc['five', 'black', 'hard', 'square' ]
Series([], dtype: int64)
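For newer pandas versions, the same idea can be sketched with `pd.concat()` and `sort_index()`. The frame below is a shortened, made-up version of the question's data. The sketch only checks that the index is unsorted after appending and sorted again afterwards; how `.loc` responds to a key that does not exist (an empty Series or a KeyError) may depend on the pandas version, so it is not shown here.

```python
import pandas as pd

# Shortened version of the question's data, for illustration only.
df = pd.DataFrame({
    'id': range(1, 5),
    'code': ['one', 'one', 'two', 'three'],
    'colour': ['black', 'white', 'white', 'white'],
    'texture': ['soft', 'soft', 'hard', 'soft'],
    'shape': ['round', 'triangular', 'triangular', 'triangular'],
})
multi_df = df.set_index(['code', 'colour', 'texture', 'shape']).sort_index()['id']

# New entry with a matching four-level MultiIndex.
new_series = pd.Series(
    [9],
    index=pd.MultiIndex.from_tuples(
        [('four', 'black', 'hard', 'round')],
        names=multi_df.index.names,
    ),
)

# Appending puts 'four' after 'two', so the index is no longer sorted.
appended = pd.concat([multi_df, new_series])
appended_sorted = appended.sort_index()

print(appended.index.is_monotonic_increasing)
print(appended_sorted.index.is_monotonic_increasing)
print(appended_sorted.loc[('four', 'black', 'hard', 'round')])
```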