Pandas Keyerror in Simple calling of array elements

Pandas Keyerror in Simple calling of array elements - pandas

I am loading a .csv from Pandas. It has a column for country, month, year, and date. Since I'm only interested in these, I overwrite the imported dataframe with a simpler version.
df = pd.read_csv('data.csv')
idx_USA = df['Country'] == 'United States'
df = df.loc[:,['Year','Month','Date']]
print(df[:4])
This yeilds
Year Month Date
1 2007 1 1
2 2004 10 2
4 1999 10 14
7 2000 10 5
Now,oddly, when I try to access the years in a loop, I get a keyerror! This is so simple-- what is going on? Thanks.
for i in range(1,N):
print "Yr = ", df['Year'][i]
Yr = 2007
Yr = 2004
Yr =
Traceback (most recent call last):
File "testimport.py", line 19, in <module>
print "Yr = ", df['Year'][i]
File "/usr/lib/python2.7/dist-packages/pandas/core/series.py", line 491, in __getitem__
result = self.index.get_value(self, key)
File "/usr/lib/python2.7/dist-packages/pandas/core/index.py", line 1032, in get_value
return self._engine.get_value(s, k)
File "index.pyx", line 97, in pandas.index.IndexEngine.get_value (pandas/index.c:2957)
File "index.pyx", line 105, in pandas.index.IndexEngine.get_value (pandas/index.c:2772)
File "index.pyx", line 149, in pandas.index.IndexEngine.get_loc (pandas/index.c:3498)
File "hashtable.pyx", line 382, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:6930)
File "hashtable.pyx", line 388, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:6871)
KeyError: 3

Instead of resetting the index you can just iterate through the underlying array.
arr = df['year']
for i in range(1,N):
print arr[i]

Related

How can I output the "Name" value in pandas?

After I do an iloc (df.iloc[3]) , I get an output with all the column names and their values for a given row.
What should the code be if I only want to out put the "Name" value for the same row?
Eg:
Columns 1 Value 1
Columns 2 Value 2
Name: Row 1, dtype: object
So, in this case "Row 1".

>>> df = pd.DataFrame({'Name': ['Uncle', 'Sam', 'Martin', 'Jacob'], 'Salary': [1000, 2000, 3000, 1500]})
>>> df
Name Salary
0 Uncle 1000
1 Sam 2000
2 Martin 3000
3 Jacob 1500
df.iloc[3] gives the following:
>>> df.iloc[3]
Name Jacob
Salary 1500
Name: 3, dtype: object
However, df.iloc[3, 'Name'] throws the following exception:
>>> df.iloc[3, 'Name']
Traceback (most recent call last):
File "/home/nikhil/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 235, in _has_valid_tuple
self._validate_key(k, i)
File "/home/nikhil/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 2035, in _validate_key
"a [{types}]".format(types=self._valid_types)
ValueError: Can only index by location with a [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/nikhil/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1418, in __getitem__
return self._getitem_tuple(key)
File "/home/nikhil/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 2092, in _getitem_tuple
self._has_valid_tuple(tup)
File "/home/nikhil/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 239, in _has_valid_tuple
"[{types}] types".format(types=self._valid_types)
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
Use df.loc[3, 'Name'] instead:
>>> df.loc[3, 'Name']
'Jacob'

df.iloc is a series
df.iloc['3'].name will return the name
Example:
>> df=pd.DataFrame({'data': [100,200]})
>> df=df.set_index(pd.Index(['A','B']))
>> df.iloc[1]
data 200
Name: B, dtype: int64
>> df.iloc[1].name
'B'

Can't create Dask dataframe although Pandas dataframe gets created for the same query (sqlalchemy.exc.NoSuchTableError)

Hello I am trying to create a Dask Dataframe by pulling data from an Oracle Database as:
import cx_Oracle
import pandas as pd
import dask
import dask.dataframe as dd
# Build connection string/URL
user='user'
pw='pw'
host = 'xxx-yyy-x000'
port = '9999'
sid= 'XXXXX000'
ora_uri = 'oracle+cx_oracle://{user}:{password}#{sid}'.format(user=user, password=pw, sid=cx_Oracle.makedsn(host,port,sid))
tstquery ="select ID from EXAMPLE where rownum <= 5"
# Create Pandas Dataframe from ORACLE Query pull
tstdf1 = pd.read_sql(tstquery
,con = ora_uri
)
print("Dataframe tstdf1 created by pd.read_sql")
print(tstdf1.info())
# Create Dask Dataframe from ORACLE Query pull
tstdf2 = dd.read_sql_table(table = tstquery
,uri = ora_uri
,index_col = 'ID'
)
print(tstdf2.info())
As you can see the Pandas DF gets created but not the Dask DF. Following is the stdout:
Dataframe tstdf1 created by pd.read_sql
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 1 columns):
ID 5 non-null int64
dtypes: int64(1)
memory usage: 120.0 bytes
None
Traceback (most recent call last):
File "dk_test.py", line 40, in <module>
,index_col = 'ID'
File "---------------------------python3.6/site-packages/dask/dataframe/io/sql.py", line 103, in read_sql_table
table = sa.Table(table, m, autoload=True, autoload_with=engine, schema=schema)
File "<string>", line 2, in __new__
File "---------------------------python3.6/site-packages/sqlalchemy/util/deprecations.py", line 130, in warned
return fn(*args, **kwargs)
File "---------------------------python3.6/site-packages/sqlalchemy/sql/schema.py", line 496, in __new__
metadata._remove_table(name, schema)
File "---------------------------python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
compat.reraise(exc_type, exc_value, exc_tb)
File "---------------------------python3.6/site-packages/sqlalchemy/util/compat.py", line 154, in reraise
raise value
File "---------------------------python3.6/site-packages/sqlalchemy/sql/schema.py", line 491, in __new__
table._init(name, metadata, *args, **kw)
File "---------------------------python3.6/site-packages/sqlalchemy/sql/schema.py", line 585, in _init
resolve_fks=resolve_fks,
File "---------------------------python3.6/site-packages/sqlalchemy/sql/schema.py", line 609, in _autoload
_extend_on=_extend_on,
File "---------------------------python3.6/site-packages/sqlalchemy/engine/base.py", line 2147, in run_callable
return conn.run_callable(callable_, *args, **kwargs)
File "---------------------------python3.6/site-packages/sqlalchemy/engine/base.py", line 1604, in run_callable
return callable_(self, *args, **kwargs)
File "---------------------------python3.6/site-packages/sqlalchemy/engine/default.py", line 429, in reflecttable
table, include_columns, exclude_columns, resolve_fks, **opts
File "---------------------------python3.6/site-packages/sqlalchemy/engine/reflection.py", line 653, in reflecttable
raise exc.NoSuchTableError(table.name)
sqlalchemy.exc.NoSuchTableError: select ID from EXAMPLE where rownum <= 5
Needless to say, the table exists (As demonstrated by the creation of the Pandas DF), the Index is on
the col ID as well. What is the problem ?

Pandas and timeseries

I have a dictionary of dataframes. I want to convert each dataframe in it to its respective timeseries. I am able to convert one nicely. But, if I do it within an iterator, it complains. Eg:
This works:
df = dfDict[4]
df['start_date'] = pd.to_datetime(df['start_date'])
df.set_index('start_date', inplace = True)
df.sort_index(inplace = True)
print df.head() works nicely.
But, this doesn't work:
tsDict = {}
for id, df in dfDict.iteritems():
df['start_date'] = pd.to_datetime(df['start_date'])
df.set_index('start_date', inplace = True)
df.sort_index(inplace = True)
tsDict[id] = df
It gives the following error message:
Traceback (most recent call last):
File "tsa.py", line 105, in <module>
main()
File "tsa.py", line 84, in main
df['start_date'] = pd.to_datetime(df['start_date'])
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 1997, in __getitem__
return self._getitem_column(key)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 2004, in _getitem_column
return self._get_item_cache(key)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 1350, in _get_item_cache
values = self._data.get(item)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 3290, in get
loc = self.items.get_loc(item)
File "/usr/local/lib/python2.7/dist-packages/pandas/indexes/base.py", line 1947, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)
File "pandas/index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)
File "pandas/hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12368)
File "pandas/hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12322)
KeyError: 'start_date'
I am unable to see the subtle problem here...

Arithmetic in pandas HDF5 queries

Why am I getting an error when I try to do simple arithmetic on constants in an HDF5 where clause? Here's an example:
>>> import pandas
>>> import numpy as np
>>> d = pandas.DataFrame({"A": np.arange(10), "B": np.random.randint(1, 100, 10)})
>>> store = pandas.HDFStore('teststore.h5', mode='w')
>>> store.append('thingy', d, format='table', data_columns=True, append=False)
>>> store.select('thingy', where="B>50")
A B
0 0 61
1 1 63
6 6 80
7 7 79
8 8 52
9 9 82
>>> store.select('thingy', where="B>40+10")
Traceback (most recent call last):
File "<pyshell#26>", line 1, in <module>
store.select('thingy', where="B>40+10")
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 682, in select
return it.get_result()
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 1365, in get_result
results = self.func(self.start, self.stop, where)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 675, in func
columns=columns, **kwargs)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 4006, in read
if not self.read_axes(where=where, **kwargs):
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 3212, in read_axes
self.selection = Selection(self, where=where, **kwargs)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 4527, in __init__
self.condition, self.filter = self.terms.evaluate()
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 580, in evaluate
self.condition = self.terms.prune(ConditionBinOp)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 122, in prune
res = pr(left.value, right.prune(klass))
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 118, in prune
res = pr(left.value, right.value)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 113, in pr
encoding=self.encoding).evaluate()
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 317, in evaluate
raise ValueError("query term is not valid [%s]" % self)
ValueError: query term is not valid [[Condition : [None]]]
Querying directly on the underlying pytables object seems to work:
>>> for row in store.get_storer('thingy').table.where("B>40+10"):
... print(row[:])
(0L, 0, 61)
(1L, 1, 63)
(6L, 6, 80)
(7L, 7, 79)
(8L, 8, 52)
(9L, 9, 82)
So what is going on here?

This is simply not supported. I suppose it could fail with a slightly better message. it is trying to and the 2 nodes (the comparison and the +10) and doesn't know how to deal with it as it's not a comparison operation.
I suppose it could be implemented but IMHO is needlessly complex

Incompatible indexer with Series

Why do I get an error:
import pandas as pd
a = pd.Series(index=[4,5,6], data=0)
print a.loc[4:5]
a.loc[4:5] += 1
Output:
4 0
5 0
Traceback (most recent call last):
File "temp1.py", line 9, in <module>
dtype: int64
a.loc[4:5] += 1
File "lib\site-packages\pandas\core\indexing.py", line 88, in __setitem__
self._setitem_with_indexer(indexer, value)
File "lib\site-packages\pandas\core\indexing.py", line 177, in _setitem_with_indexer
value = self._align_series(indexer, value)
File "lib\site-packages\pandas\core\indexing.py", line 206, in _align_series
raise ValueError('Incompatible indexer with Series')
ValueError: Incompatible indexer with Series
Pandas 0.12.

I think this is a bug, you can work around this by use tuple index:
import pandas as pd
a = pd.Series(index=[4,5,6], data=0)
print a.loc[4:5]
a.loc[4:5,] += 1

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Pandas Keyerror in Simple calling of array elements - pandas

Instead of resetting the index you can just iterate through the underlying array. arr = df['year'] for i in range(1,N): print arr[i]

Related

How can I output the "Name" value in pandas?

Can't create Dask dataframe although Pandas dataframe gets created for the same query (sqlalchemy.exc.NoSuchTableError)

Pandas and timeseries

Arithmetic in pandas HDF5 queries

Incompatible indexer with Series

Categories

Resources