Counter calling in pandas?

I want to use the Counter values inside a pandas DataFrame.
My effort so far:
from __future__ import unicode_literals
from collections import Counter
import spacy, en_core_web_sm
import pandas as pd

nlp = en_core_web_sm.load()
c = Counter(token.pos_ for token in nlp('The cat sat on the mat.'))
sbase = sum(c.values())
for el, cnt in c.items():
    print(el, '{0:2.2f}%'.format((100.0 * cnt) / sbase))
df = pd.DataFrame.from_dict(c, orient='index').reset_index()
print(df)
Current Output:
index 0
0 NOUN 2
1 VERB 1
2 DET 2
3 ADP 1
4 PUNCT 1
Expected Output:
The below inside dataframe:
(u'NOUN', u'28.57%')
(u'VERB', u'14.29%')
(u'DET', u'28.57%')
(u'ADP', u'14.29%')
(u'PUNCT', u'14.29%')
How do I get el and cnt into the data frame?
This is a follow-up to an earlier question where I wanted the percentage POS distribution listed:
Percentage Count Verb, Noun using Spacy?
My understanding was that I need to put el and cnt in place of c below:
df = pd.DataFrame.from_dict(c, orient='index').reset_index()

I can only fix your output, since I do not have the original data:
(df['0']/df['0'].sum()).map("{0:.2%}".format)
Out[827]:
0 28.57%
1 14.29%
2 28.57%
3 14.29%
4 14.29%
Name: 0, dtype: object
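Putting the pieces together, here is a self-contained sketch of building the full DataFrame with a percentage column directly from the Counter (the POS counts are hard-coded to match the question's output, so the spaCy model itself is not needed):

```python
from collections import Counter

import pandas as pd

# Stand-in for the POS counts from the question; the spaCy pipeline
# is not required to demonstrate the DataFrame construction.
c = Counter({'NOUN': 2, 'DET': 2, 'VERB': 1, 'ADP': 1, 'PUNCT': 1})

# Each (el, cnt) pair becomes one row; then derive the percentage column.
df = pd.DataFrame(sorted(c.items()), columns=['pos', 'count'])
df['percent'] = (df['count'] / df['count'].sum()).map('{0:.2%}'.format)
print(df)
```

This avoids the manual loop entirely: the division is vectorized over the whole column and `map` applies the percentage formatting.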

Related

I'm trying to match up an int with a row and get that row of data in TensorFlow

I'm new to TensorFlow and I need to use it for an algorithm. I need it to match up an age with a row, fetch the data for that age, and print it. How would I do this? (The data comes from a CSV, by the way.)
You can use loc to select the rows that match a certain condition.
import pandas as pd
df = pd.read_csv('data.csv')
print(df.loc[df['age'] == 5])
Output
sex age kids marveldc sports ... action documentries sci fy sitcom thriller
0 0 5 1 1 0 ... 0 0 0 0 0
References:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
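Since data.csv is not available, here is a self-contained sketch of the same loc filtering on an inline DataFrame (the column names and values are made up to mimic the output above):

```python
import pandas as pd

# Hypothetical stand-in for data.csv.
df = pd.DataFrame({'sex':  [0, 1, 0],
                   'age':  [5, 12, 5],
                   'kids': [1, 0, 1]})

# Boolean mask: keep only the rows where age equals 5.
matches = df.loc[df['age'] == 5]
print(matches)
```

The expression `df['age'] == 5` produces a boolean Series, and `loc` keeps only the rows where it is True.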

Increment or reset counter based on an existing value of a data frame column in Pandas

I have a dataframe imported from a csv file, along the lines of the below:
Value Counter
1. 5 0
2. 15 1
3. 15 2
4. 15 3
5. 10 0
6. 15 1
7. 15 1
I want to increment the counter only if the value is 15, and otherwise reset it to 0. I tried cumsum but am stuck on how to reset it back to zero on a non-match.
Here is my code
import pandas as pd
import csv
import numpy as np
dfs = []
df = pd.read_csv("H:/test/test.csv")
df["Counted"] = (df["Value"] == 15).cumsum()
dfs.append(df)
big_frame = pd.concat(dfs, sort=True, ignore_index=False)
big_frame.to_csv('H:/test/List.csv' , index=False)
Thanks for your help
Here's my approach:
s = df.Value.ne(15)
df['Counter'] = (~s).groupby(s.cumsum()).cumsum()
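A quick check of that approach on the data from the question (a sketch; the dataframe is rebuilt inline rather than read from the csv):

```python
import pandas as pd

df = pd.DataFrame({'Value': [5, 15, 15, 15, 10, 15, 15]})

# True where Value is NOT 15; each such row starts a new group.
s = df.Value.ne(15)
# Within each group, cumulatively count the 15s. The non-15 row
# that opens the group contributes 0, which resets the counter.
df['Counter'] = (~s).groupby(s.cumsum()).cumsum()
print(df)
```

The trick is that `s.cumsum()` assigns the same group label to a non-15 row and the run of 15s that follows it, so the cumulative sum restarts at every reset point.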

how to convert a column of pandas series without the header

It is quite odd, as I hadn't experienced this issue until now when converting a data series.
So I have wind speed data by date & hour at different heights, retrieved from NREL.
file09 = 'wind/wind_yr2009.txt'
wind09 = pd.read_csv(file09, encoding = "utf-8", names = ['DATE (MM/DD/YYYY)', 'HOUR-MST', 'AWS#20m [m/s]', 'AWS#50m [m/s]', 'AWS#80m [m/s]', 'AMPLC(2-80m)'])
file10 = 'wind/wind_yr2010.txt'
wind10 = pd.read_csv(file10, encoding = "utf-8", names = ['DATE (MM/DD/YYYY)', 'HOUR-MST', 'AWS#20m [m/s]', 'AWS#50m [m/s]', 'AWS#80m [m/s]', 'AMPLC(2-80m)'])
I merge the two readings of .txt files below
wind = pd.concat([wind09, wind10], join='inner')
Then I drop the duplicate headings:
wind = wind.reset_index().drop_duplicates(keep='first').set_index('index')
print(wind['HOUR-MST'])
Printing returns something like the following:
index
0 HOUR-MST
1 1
2 2
I wasn't sure at first, but apparently index 0 holds HOUR-MST, which is the column heading. Python does recognize it, as I can refer to the column data using that specific header. Yet when I try converting to int
temp = hcodebook.iloc[wind['HOUR-MST'].astype(int) - 1]
both of the errors below were returned (I later also tried converting to float):
ValueError: invalid literal for int() with base 10: 'HOUR-MST'
ValueError: could not convert string to float: 'HOUR-MST'
I verified that it is only the 0th index that holds strings, using try/except in a for loop.
I think the reason is that I didn't use the sep parameter when reading these files, as that is the only difference from previous attempts with other files where the conversion worked fine.
Yet that doesn't tell me how to address it.
Kindly advise.
MCVE:
from io import StringIO
import pandas as pd
cfile = StringIO("""A B C D
1 2 3 4
5 6 7 8""")
pd.read_csv(cfile, names=['a','b','c','d'], sep=r'\s+')
Header included in data:
a b c d
0 A B C D
1 1 2 3 4
2 5 6 7 8
Use skiprows=1 to skip the header row:
from io import StringIO
import pandas as pd

cfile = StringIO("""A B C D
1 2 3 4
5 6 7 8""")
pd.read_csv(cfile, names=['a','b','c','d'], sep=r'\s+', skiprows=1)
No headers:
a b c d
0 1 2 3 4
1 5 6 7 8
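An alternative sketch of the same idea: pass header=0 together with names, which tells read_csv that row 0 is a header line to be discarded and replaced by your own names:

```python
from io import StringIO

import pandas as pd

cfile = StringIO("""A B C D
1 2 3 4
5 6 7 8""")

# header=0 consumes the first row as a header; names then overrides it,
# so the original 'A B C D' line never appears in the data.
df = pd.read_csv(cfile, names=['a', 'b', 'c', 'd'], sep=r'\s+', header=0)
print(df)
```

This is handy when you want to rename columns at read time without knowing how many header lines to skip by hand.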

Create a Combined CSV Files

I have two CSV files reviews_positive.csv and reviews_negative.csv. How can I combine them into one CSV file, but in the following condition:
Have odd rows filled with reviews from reviews_positive.csv and even rows filled with reviews from reviews_negative.csv.
I am using pandas.
I need this specific order because I want to build a balanced dataset for training with neural networks.
Here is a working example:
from io import StringIO
import pandas as pd
pos = """rev
a
b
c"""
neg = """rev
e
f
g
h
i"""
pos_df = pd.read_csv(StringIO(pos))
neg_df = pd.read_csv(StringIO(neg))
Solution
Use pd.concat with the keys parameter to label the source dataframes and preserve the desired order, positive first. Then sort_index with sort_remaining=False:
pd.concat(
[pos_df, neg_df],
keys=['pos', 'neg']
).sort_index(level=1, sort_remaining=False)
rev
pos 0 a
neg 0 e
pos 1 b
neg 1 f
pos 2 c
neg 2 g
3 h
4 i
That said, you don't have to interweave them to take balanced samples. You can use groupby with sample:
pd.concat(
[pos_df, neg_df],
keys=['pos', 'neg']
).groupby(level=0).apply(pd.DataFrame.sample, n=3)
rev
pos pos 1 b
2 c
0 a
neg neg 1 f
4 i
3 h
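To actually produce the single combined CSV the question asks for, the interleaved frame can be flattened and written out. A sketch (the output filename is made up, and the two frames stand in for the real review files):

```python
from io import StringIO

import pandas as pd

# Stand-ins for reviews_positive.csv and reviews_negative.csv.
pos_df = pd.read_csv(StringIO("rev\na\nb\nc"))
neg_df = pd.read_csv(StringIO("rev\ne\nf\ng\nh\ni"))

# Interleave as above, then drop the MultiIndex and write one file.
combined = pd.concat(
    [pos_df, neg_df], keys=['pos', 'neg']
).sort_index(level=1, sort_remaining=False).reset_index(drop=True)
combined.to_csv('reviews_combined.csv', index=False)  # hypothetical filename
print(combined)
```

reset_index(drop=True) discards the pos/neg labels so the CSV contains only the review column, already alternating positive and negative rows.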

Pandas: find most frequent values in columns of lists

x animal
0 5 [dog, cat]
1 6 [dog]
2 8 [elephant]
I have a dataframe like this. How can I find the most frequent animal contained across all the lists in the column?
The value_counts() method considers each list as one element, so I can't use it.
Something along these lines?
import pandas as pd
df = pd.DataFrame({'x': [5, 6, 8],
                   'animal': [['dog', 'cat'], ['dog'], ['elephant']]})
x = sum(df.animal, [])
#x
#Out[15]: ['dog', 'cat', 'dog', 'elephant']
from collections import Counter
c = Counter(x)
c.most_common(1)
#Out[17]: [('dog', 2)]
Maybe take a step back and redefine your data structure? pandas is better suited when your dataframe is "flat".
Instead of:
x animal
0 5 [dog, cat]
1 6 [dog]
2 8 [elephant]
Do:
x animal
0 5 dog
1 5 cat
2 6 dog
3 8 elephant
Now you can count easily with len(df[df['animal'] == 'dog']) as well as many other Pandas things!
To flatten your dataframe, reference this answer:
Flatten a column with value of type list while duplicating the other column's value accordingly in Pandas
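With pandas 0.25+ you can also do the flattening directly with explode, after which value_counts works as usual. A sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'x': [5, 6, 8],
                   'animal': [['dog', 'cat'], ['dog'], ['elephant']]})

# explode turns each list element into its own row; value_counts
# then tallies the flattened column.
counts = df['animal'].explode().value_counts()
print(counts.idxmax())  # label of the most frequent animal
```

`counts` is an ordinary Series of frequencies, so the full ranking is available too, not just the top entry.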