Python KeyError when using pandas

I'm following a tutorial on NLP but have encountered a KeyError when trying to group my raw data into good and bad reviews. Here is the tutorial link: https://towardsdatascience.com/detecting-bad-customer-reviews-with-nlp-d8b36134dc7e
#reviews.csv
I am so angry about the service
Nothing was wrong, all good
The bedroom was dirty
The food was great
#nlp.py
import pandas as pd
#read data
reviews_df = pd.read_csv("reviews.csv")
# append the positive and negative text reviews
reviews_df["review"] = reviews_df["Negative_Review"] +
reviews_df["Positive_Review"]
reviews_df.columns
I'm seeing the following error:
File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Negative_Review'
Why is this happening?

You're getting this error because your data isn't structured the way that code expects.
When you do df['reviews'] = df['Positive_reviews'] + df['Negative_reviews'], you're actually concatenating the values of the Positive_reviews column with the Negative_reviews column (neither of which currently exists) into a 'reviews' column (which also does not exist).
Your csv is nothing more than a plain-text file with one review per row. Also, since you're working with text, remember to enclose every string in quotation marks ("), otherwise any commas inside a review will create fake columns.
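For example, the reviews.csv above is safer with every review quoted, so the comma in the second line stays inside a single field:
"I am so angry about the service"
"Nothing was wrong, all good"
"The bedroom was dirty"
"The food was great"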
With your approach, it seems that you'll still be tagging all your reviews manually (usually, when working with machine learning, you do the labelling outside the code and load the already-labelled data into your machine-learning script).
In order for your code to work, you want to do the following:
import pandas as pd
df = pd.read_csv('TestFileFolder/57886076.csv', names=['text'])
## Fill with placeholder values
df['Positive_review']=0
df['Negative_review']=1
df.head()
Result:
text Positive_review Negative_review
0 I am so angry about the service 0 1
1 Nothing was wrong, all good 0 1
2 The bedroom was dirty 0 1
3 The food was great 0 1
However, I would recommend having a single column (is_review_positive) and setting it to True or False. You can easily encode it later on.
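A minimal sketch of that approach (the labels below are placeholders you would assign yourself, not values read from the file):
import pandas as pd
df = pd.read_csv('TestFileFolder/57886076.csv', names=['text'])  # same file as above
# hypothetical manual labels, one per review
df['is_review_positive'] = [False, True, False, True]
# encode to 0/1 later on, e.g. for a model
df['label'] = df['is_review_positive'].astype(int)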

Related

How to split a column into many different columns?

I have a dataset, and in one of its columns I have many values like the following that I want to convert into new columns:
"{'availabilities': {'bikes': 4, 'stands': 28, 'mechanicalBikes': 4, 'electricalBikes': 0, 'electricalInternalBatteryBikes': 0, 'electricalRemovableBatteryBikes': 0}, 'capacity': 32}"
I tried to use str.split() and str.extract(), and received the error below because of the pattern.
bikes_table_ready[['availabilities', 'bikes', 'stands', 'mechanicalBikes', 'electricalBikes',
                   'electricalInternalBatteryBikes', 'electricalRemovableBatteryBikes',
                   'capacity']] = bikes_table_ready.totalStands.str.extract('{.}', expand=True)
ValueError: pattern contains no capture groups
Which pattern should I use to get this done?
IIUC, use ast.literal_eval with pandas.json_normalize.
With a dataframe df with two columns, an identifier (id) and the column to be split (col), it gives this:
import ast
import pandas as pd

df["col"] = df["col"].apply(lambda x: ast.literal_eval(x.strip('"')))

out = df.join(pd.json_normalize(df.pop("col").str["availabilities"]))
# Output :
print(out.to_string())
id bikes stands mechanicalBikes electricalBikes electricalInternalBatteryBikes electricalRemovableBatteryBikes
0 id001 4 28 4 0 0 0
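If you also want to keep the capacity value as its own column, a possible variant (just a sketch, assuming df["col"] has already been parsed with ast.literal_eval as above) is to normalize the whole dictionary rather than only its 'availabilities' part:
out = df.join(pd.json_normalize(df.pop("col").tolist()))
# yields a 'capacity' column plus flattened columns such as 'availabilities.bikes'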
Welcome to Stack Overflow! Please provide a minimal reproducible example demonstrating the problem. To learn more about this community and how we can help you, please start with the tour and read How to Ask and its linked resources.
That being said, it seems that the data you are calling str.split() on is not actually a string. Check this to find out more about data types. It seems you are trying to retrieve the information from a Python list ("[xxx]") or dictionary ("{'key': 'value'}"). If that's the case, try checking this link, which talks about how to use Python lists, or this one, which talks about dictionaries.

Trying to load an hdf5 table with dataframe.to_hdf before I die of old age

This sounds like it should be REALLY easy to answer with Google, but I'm finding it impossible to answer the majority of my nontrivial pandas/pytables questions this way. All I'm trying to do is to load about 3 billion records from about 6000 different CSV files into a single table in a single HDF5 file. It's a simple table, 26 fields, a mixture of strings, floats and ints. I'm loading the CSVs with df = pandas.read_csv() and appending them to my hdf5 file with df.to_hdf().

I really don't want to use df.to_hdf(data_columns = True) because it looks like that will take about 20 days, versus about 4 days for df.to_hdf(data_columns = False). But apparently when you use df.to_hdf(data_columns = False) you end up with some pile of junk that you can't even recover the table structure from (or so it appears to my uneducated eye). Only the columns that were identified in the min_itemsize list (the 4 string columns) are identifiable in the hdf5 table; the rest are being dumped by data type into values_block_0 through values_block_4:
table = h5file.get_node('/tbl_main/table')
print(table.colnames)
['index', 'values_block_0', 'values_block_1', 'values_block_2', 'values_block_3', 'values_block_4', 'str_col1', 'str_col2', 'str_col3', 'str_col4']
And any query like df = pd.DataFrame.from_records(table.read_where(condition)) fails with error "Exception: Data must be 1-dimensional"
So my questions are: (1) Do I really have to use data_columns = True which takes 5x as long? I was expecting to do a fast load and then index just a few columns after loading the table. (2) What exactly is this pile of garbage I get using data_columns = False? Is it good for anything if I need my table back with query-able columns? Is it good for anything at all?
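As a possible middle ground (a sketch with placeholder file names and sizes, separate from the pytables approach in the answer below): data_columns also accepts a list of column names, so only the columns you actually plan to query become individually indexed, query-able data columns while the rest stay packed in blocks:
import pandas as pd
df = pd.read_csv('part_0001.csv')  # placeholder file name
df.to_hdf('combined.h5', key='tbl_main', mode='a', append=True, format='table',
          data_columns=['str_col1', 'str_col2'],       # only these become query-able columns
          min_itemsize={'str_col1': 30, 'str_col2': 30})  # placeholder sizes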
This is how you can create an HDF5 file from CSV data using pytables. You could also use a similar process to create the HDF5 file with h5py.
1. Use a loop to read the CSV files with np.genfromtxt into a NumPy array.
2. After reading the first CSV file, write the data with the .create_table() method, referencing the array created in Step 1.
3. For additional CSV files, append the data with the .append() method, referencing the array created in Step 1.
4. End of loop.
Updated on 6/2/2019 to read a date field (mm/dd/YYYY) and convert it to a datetime object. Note the changes to the genfromtxt() arguments! Data used is added below the updated code.
import numpy as np
import tables as tb
from datetime import datetime

csv_list = ['SO_56387241_1.csv', 'SO_56387241_2.csv']
my_dtype = np.dtype([('a', int), ('b', 'S20'), ('c', float), ('d', float), ('e', 'S20')])

with tb.open_file('SO_56387241.h5', mode='w') as h5f:
    for PATH_csv in csv_list:
        csv_data = np.genfromtxt(PATH_csv, names=True, dtype=my_dtype, delimiter=',', encoding=None)
        # modify date in fifth field 'e'
        for row in csv_data:
            datetime_object = datetime.strptime(row['my_date'].decode('UTF-8'), '%m/%d/%Y')
            row['my_date'] = datetime_object
        # append to the table if it already exists, otherwise create it
        if h5f.__contains__('/CSV_Data'):
            dset = h5f.root.CSV_Data
            dset.append(csv_data)
        else:
            dset = h5f.create_table('/', 'CSV_Data', obj=csv_data)
        dset.flush()
# the with block closes the file, so no explicit h5f.close() is needed
Data for testing:
SO_56387241_1.csv:
my_int,my_str,my_float,my_exp,my_date
0,zero,0.0,0.00E+00,01/01/1980
1,one,1.0,1.00E+00,02/01/1981
2,two,2.0,2.00E+00,03/01/1982
3,three,3.0,3.00E+00,04/01/1983
4,four,4.0,4.00E+00,05/01/1984
5,five,5.0,5.00E+00,06/01/1985
6,six,6.0,6.00E+00,07/01/1986
7,seven,7.0,7.00E+00,08/01/1987
8,eight,8.0,8.00E+00,09/01/1988
9,nine,9.0,9.00E+00,10/01/1989
SO_56387241_2.csv:
my_int,my_str,my_float,my_exp,my_date
10,ten,10.0,1.00E+01,01/01/1990
11,eleven,11.0,1.10E+01,02/01/1991
12,twelve,12.0,1.20E+01,03/01/1992
13,thirteen,13.0,1.30E+01,04/01/1993
14,fourteen,14.0,1.40E+01,04/01/1994
15,fifteen,15.0,1.50E+01,06/01/1995
16,sixteen,16.0,1.60E+01,07/01/1996
17,seventeen,17.0,1.70E+01,08/01/1997
18,eighteen,18.0,1.80E+01,09/01/1998
19,nineteen,19.0,1.90E+01,10/01/1999

Pandas, importing JSON-like file using read_csv

I would like to import data from a .txt file into a dataframe. I cannot import it using a classical pd.read_csv; with different kinds of sep it throws errors. The data I want to import, Cell_Phones_&_Accessories.txt.gz, is in the following format:
product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A1RXYH9ROBAKEZ
review/profileName: A. Igoe
review/helpfulness: 0/0
review/score: 1.0
review/time: 1233360000
review/summary: Don't buy!
review/text: First of all, the company took my money and sent me an email telling me the product was shipped. A week and a half later I received another email telling me that they are sorry, but they don't actually have any of these items, and if I received an email telling me it has shipped, it was a mistake.When I finally got my money back, I went through another company to buy the product and it won't work with my phone, even though it depicts that it will. I have sent numerous emails to the company - I can't actually find a phone number on their website - and I still have not gotten any kind of response. What kind of customer service is that? No one will help me with this problem. My advice - don't waste your money!
product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
....
You can use ¥ as the separator (a character that should not occur in the data, so each line is read as a single column), then split by the first : and pivot:
df = pd.read_csv('Cell_Phones_&_Accessories.txt', sep='¥', names=['data'], engine='python')
df1 = df.pop('data').str.split(':', n=1, expand=True)
df1.columns = ['a','b']
df1 = df1.assign(c=(df1['a'] == 'product/productId').cumsum())
df1 = df1.pivot('c','a','b')
A pure Python solution with defaultdict and the DataFrame constructor for improved performance:
from collections import defaultdict
import pandas as pd

data = defaultdict(list)
with open("Cell_Phones_&_Accessories.txt") as f:
    for line in f.readlines():
        if len(line) > 1:
            key, value = line.strip().split(':', 1)
            data[key].append(value)

df = pd.DataFrame(data)
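Since the file in the question is gzip-compressed (.txt.gz), a possible variation (just a sketch) is to run the same loop directly on the compressed file using the standard gzip module:
import gzip
from collections import defaultdict
import pandas as pd

data = defaultdict(list)
# 'rt' opens the compressed file in text mode, so lines come back as str
with gzip.open("Cell_Phones_&_Accessories.txt.gz", "rt") as f:
    for line in f:
        if len(line) > 1:
            key, value = line.strip().split(':', 1)
            data[key].append(value)
df = pd.DataFrame(data)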

How to stop Jupyter outputting truncated results when using pd.Series.value_counts()?

I have a DataFrame and I want to display the frequencies for certain values in a certain Series using pd.Series.value_counts().
The problem is that I only see truncated results in the output. I'm coding in Jupyter Notebook.
I have tried unsuccessfully a couple of methods:
df = pd.DataFrame(...) # assume df is a DataFrame with many columns and rows
# 1st method
df.col1.value_counts()
# 2nd method
print(df.col1.value_counts())
# 3rd method
vals = df.col1.value_counts()
vals  # print(vals) doesn't work either
# All output something like this
value1 100000
value2 10000
...
value1000 1
Currently this is what I'm using, but it's quite cumbersome:
print(df.col1.value_counts()[:50])
print(df.col1.value_counts()[50:100])
print(df.col1.value_counts()[100:150])
# etc.
Also, I have read this related Stack Overflow question, but haven't found it helpful.
So how to stop outputting truncated results?
If you want the higher limit to apply to every subsequent print:
pd.options.display.max_rows = 1000
print(vals)
If you want it for a single print only:
with pd.option_context("display.max_rows", 1000):
    print(vals)
Relevant documentation here.
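If you truly want no limit at all, display.max_rows can also be set to None (a sketch; be careful with very long outputs):
with pd.option_context("display.max_rows", None):
    print(vals)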
I think you need option_context, set to some large number, e.g. 999. The advantage of this solution is:
option_context context manager has been exposed through the top-level API, allowing you to execute code with given option values. Option values are restored automatically when you exit the with block.
# temporarily display 999 rows
with pd.option_context('display.max_rows', 999):
    print(df.col1.value_counts())

Pandas HDF5 Select with Where on non natural-named columns

In my continuing spree of exotic pandas/HDF5 issues, I encountered the following:
I have a series of non-naturally-named columns (nb: for a good reason, with negative numbers being "system" ids etc.), which normally doesn't give an issue:
fact_hdf.select('store_0_0', columns=['o', 'a-6', 'm-13'])
However, my select statement does fall over it:
>>> fact_hdf.select('store_0_0', columns=['o', 'a-6', 'm-13'], where=[('a-6', '=', [0, 25, 28])])
blablabla
File "/srv/www/li/venv/local/lib/python2.7/site-packages/tables/table.py", line 1251, in _required_expr_vars
raise NameError("name ``%s`` is not defined" % var)
NameError: name ``a`` is not defined
Is there any way to work around it? I could rename my negative-numbered columns from "a-1" to "a_1", but that means reloading all of the data in my system, which is rather a lot! :)
Suggestions are very welcome!
Here's a test table
In [1]: df = DataFrame({ 'a-6' : [1,2,3,np.nan] })
In [2]: df
Out[2]:
a-6
0 1
1 2
2 3
3 NaN
In [3]: df.to_hdf('test.h5','df',mode='w',table=True)
In [5]: df.to_hdf('test.h5','df',mode='w',table=True,data_columns=True)
/usr/local/lib/python2.7/site-packages/tables/path.py:99: NaturalNameWarning: object name is not a valid Python identifier: 'a-6'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
NaturalNameWarning)
/usr/local/lib/python2.7/site-packages/tables/path.py:99: NaturalNameWarning: object name is not a valid Python identifier: 'a-6_kind'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
NaturalNameWarning)
/usr/local/lib/python2.7/site-packages/tables/path.py:99: NaturalNameWarning: object name is not a valid Python identifier: 'a-6_dtype'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
NaturalNameWarning)
There is a way, but you would have to build this into the code itself. You can do a variable substitution on the column names as follows. Here is the existing routine (in master):
def select(self):
    """
    generate the selection
    """
    if self.condition is not None:
        return self.table.table.readWhere(self.condition.format(), start=self.start, stop=self.stop)
    elif self.coordinates is not None:
        return self.table.table.readCoordinates(self.coordinates)
    return self.table.table.read(start=self.start, stop=self.stop)
If instead you do this
(Pdb) self.table.table.readWhere("(x>2.0)",
condvars={ 'x' : getattr(self.table.table.cols,'a-6')})
array([(2, 3.0)],
dtype=[('index', '<i8'), ('a-6', '<f8')])
e.g. by substituting x with the column reference, you can get the data.
This could be done on detection of invalid column names, but is pretty tricky.
Unfortunately I would suggest renaming your columns.
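If you do end up renaming, a minimal sketch (file names, key and the where clause are placeholders based on the test table above) of rewriting the store with underscore names so where queries work:
import pandas as pd
df = pd.read_hdf('test.h5', 'df')
# replace '-' with '_' so every column is a valid identifier for where clauses
df = df.rename(columns=lambda c: c.replace('-', '_'))
df.to_hdf('test_renamed.h5', 'df', mode='w', format='table', data_columns=True)
# now queries like this work:
out = pd.read_hdf('test_renamed.h5', 'df', where='a_6 == 2')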