Incompatible indexer with Series - pandas

Why do I get an error:
import pandas as pd
a = pd.Series(index=[4,5,6], data=0)
print a.loc[4:5]
a.loc[4:5] += 1
Output:
4 0
5 0
dtype: int64
Traceback (most recent call last):
File "temp1.py", line 9, in <module>
a.loc[4:5] += 1
File "lib\site-packages\pandas\core\indexing.py", line 88, in __setitem__
self._setitem_with_indexer(indexer, value)
File "lib\site-packages\pandas\core\indexing.py", line 177, in _setitem_with_indexer
value = self._align_series(indexer, value)
File "lib\site-packages\pandas\core\indexing.py", line 206, in _align_series
raise ValueError('Incompatible indexer with Series')
ValueError: Incompatible indexer with Series
Pandas 0.12.

I think this is a bug. You can work around it by using a tuple index:
import pandas as pd
a = pd.Series(index=[4,5,6], data=0)
print a.loc[4:5]
a.loc[4:5,] += 1
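
For anyone on Python 3 and a recent pandas release, here is a minimal sketch of the question's snippet with print() calls. As far as I know the plain slice form no longer raises there, but I have not checked which release changed this, so treat that as an assumption.

```python
import pandas as pd

a = pd.Series(index=[4, 5, 6], data=0)

# .loc slices by label, so both end points (4 and 5) are included
print(a.loc[4:5])

# In-place add on the label slice; only the rows labelled 4 and 5 change
a.loc[4:5] += 1
print(a)
```

If the plain slice still fails on your version, the tuple form above remains the thing to try.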

Related

Exporting pandas df with column of tuples to BQ throws pyarrow error

I have the following pandas dataframe:
import pandas as pd
df = pd.DataFrame({"id": [1,2,3], "items": [('a', 'b'), ('a', 'b', 'c'), tuple('d')]})
>print(df)
id items
0 1 (a, b)
1 2 (a, b, c)
2 3 (d,)
After registering my GCP/BQ credentials in the normal way...
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path_to_my_creds.json"
... I try to export it to a BQ table:
import pandas_gbq
pandas_gbq.to_gbq(df, "my_table_name", if_exists="replace")
but I keep getting the following error:
Traceback (most recent call last):
File "<string>", line 4, in <module>
File "/Users/max.epstein/opt/anaconda3/envs/rec2env/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 1205, in to_gbq
...
File "/Users/max.epstein/opt/anaconda3/envs/rec2env/lib/python3.7/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 342, in bq_to_arrow_array
return pyarrow.Array.from_pandas(series, type=arrow_type)
File "pyarrow/array.pxi", line 915, in pyarrow.lib.Array.from_pandas
File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 122, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'tuple' object
I have tried converting the tuple column to string with df = df.astype({"items":str}) and adding a table_schema param to the pandas_gbq.to_gbq... line but I keep getting this same error.
I have also tried replacing the pandas_gbq.to_gbq... line with the bq_client.load_table_from_dataframe method described here but still get the same pyarrow.lib.ArrowTypeError: Expected bytes, got a 'tuple' object error...

I think this comes from pandas dtypes being separate from Python types: astype converts the type of the values but not the pandas dtype of the column. Try also converting the dtype to match the type after the astype statement.
That is, this line:
df = df.astype({"items": str})
is replaced with:
df = df.astype({"items": str})
df = df.convert_dtypes()
Let me know if this works.
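
To see what the two steps do locally before uploading anything, here is a small sketch using the question's dataframe. It only inspects dtypes and does not touch BigQuery; the names as_str and converted are mine. I have not verified that this makes the pyarrow error go away.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3],
                   "items": [("a", "b"), ("a", "b", "c"), tuple("d")]})

# Step 1: turn every tuple into its string representation
as_str = df.astype({"items": str})

# Step 2: let pandas pick the best matching dtype for each column
converted = as_str.convert_dtypes()

# Compare the column dtype after each step (exact names vary by pandas version)
print(as_str["items"].dtype)
print(converted["items"].dtype)

# Each value is now a plain Python string, no longer a tuple
print(type(converted["items"].iloc[0]))
```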

Why the dtype parameter of pd.Series cannot be used to convert integer strings to ints, but can convert to floats?

Can someone please explain the following ValueError for me:
>>> import pandas as pd
>>> pd.__version__
'1.3.5'
>>> pd.Series(['1', '2'], dtype='float64')
0 1.0
1 2.0
dtype: float64
>>> pd.Series(['1', '2'], dtype='int64')
C:\Users\a\AppData\Local\Programs\Python\Python310\lib\site-packages\numpy\core\numeric.py:2446: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
return bool(asarray(a1 == a2).all())
Traceback (most recent call last):
File "C:\Users\a\AppData\Local\Programs\Python\Python310\lib\site-packages\IPython\core\interactiveshell.py", line 3444, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-20-171b1b2dceb0>", line 1, in <module>
pd.Series(['1', '2'], dtype='int64')
File "C:\Users\a\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\series.py", line 439, in __init__
data = sanitize_array(data, index, dtype, copy)
File "C:\Users\a\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\construction.py", line 569, in sanitize_array
subarr = _try_cast(data, dtype, copy, raise_cast_failure)
File "C:\Users\a\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\construction.py", line 754, in _try_cast
subarr = maybe_cast_to_integer_array(arr, dtype)
File "C:\Users\a\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\dtypes\cast.py", line 2094, in maybe_cast_to_integer_array
raise ValueError(f"values cannot be losslessly cast to {dtype}")
ValueError: values cannot be losslessly cast to int64
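
I can't say why the constructor refuses this cast, but as a workaround sketch you can build the Series of strings first and convert it afterwards. Both calls below are standard pandas; the variable names are mine.

```python
import pandas as pd

raw = pd.Series(["1", "2"])

# Parse the strings as numbers; integer-looking strings come back as int64
as_int = pd.to_numeric(raw)

# Explicit cast after construction, instead of dtype= in the constructor
also_int = raw.astype("int64")

print(as_int.dtype, also_int.dtype)
```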

Can't create Dask dataframe although Pandas dataframe gets created for the same query (sqlalchemy.exc.NoSuchTableError)

Hello, I am trying to create a Dask DataFrame by pulling data from an Oracle database as follows:
import cx_Oracle
import pandas as pd
import dask
import dask.dataframe as dd
# Build connection string/URL
user='user'
pw='pw'
host = 'xxx-yyy-x000'
port = '9999'
sid= 'XXXXX000'
ora_uri = 'oracle+cx_oracle://{user}:{password}@{sid}'.format(user=user, password=pw, sid=cx_Oracle.makedsn(host,port,sid))
tstquery ="select ID from EXAMPLE where rownum <= 5"
# Create Pandas Dataframe from ORACLE Query pull
tstdf1 = pd.read_sql(tstquery
,con = ora_uri
)
print("Dataframe tstdf1 created by pd.read_sql")
print(tstdf1.info())
# Create Dask Dataframe from ORACLE Query pull
tstdf2 = dd.read_sql_table(table = tstquery
,uri = ora_uri
,index_col = 'ID'
)
print(tstdf2.info())
As you can see, the Pandas DF gets created but the Dask DF does not. The stdout follows:
Dataframe tstdf1 created by pd.read_sql
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 1 columns):
ID 5 non-null int64
dtypes: int64(1)
memory usage: 120.0 bytes
None
Traceback (most recent call last):
File "dk_test.py", line 40, in <module>
,index_col = 'ID'
File "---------------------------python3.6/site-packages/dask/dataframe/io/sql.py", line 103, in read_sql_table
table = sa.Table(table, m, autoload=True, autoload_with=engine, schema=schema)
File "<string>", line 2, in __new__
File "---------------------------python3.6/site-packages/sqlalchemy/util/deprecations.py", line 130, in warned
return fn(*args, **kwargs)
File "---------------------------python3.6/site-packages/sqlalchemy/sql/schema.py", line 496, in __new__
metadata._remove_table(name, schema)
File "---------------------------python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
compat.reraise(exc_type, exc_value, exc_tb)
File "---------------------------python3.6/site-packages/sqlalchemy/util/compat.py", line 154, in reraise
raise value
File "---------------------------python3.6/site-packages/sqlalchemy/sql/schema.py", line 491, in __new__
table._init(name, metadata, *args, **kw)
File "---------------------------python3.6/site-packages/sqlalchemy/sql/schema.py", line 585, in _init
resolve_fks=resolve_fks,
File "---------------------------python3.6/site-packages/sqlalchemy/sql/schema.py", line 609, in _autoload
_extend_on=_extend_on,
File "---------------------------python3.6/site-packages/sqlalchemy/engine/base.py", line 2147, in run_callable
return conn.run_callable(callable_, *args, **kwargs)
File "---------------------------python3.6/site-packages/sqlalchemy/engine/base.py", line 1604, in run_callable
return callable_(self, *args, **kwargs)
File "---------------------------python3.6/site-packages/sqlalchemy/engine/default.py", line 429, in reflecttable
table, include_columns, exclude_columns, resolve_fks, **opts
File "---------------------------python3.6/site-packages/sqlalchemy/engine/reflection.py", line 653, in reflecttable
raise exc.NoSuchTableError(table.name)
sqlalchemy.exc.NoSuchTableError: select ID from EXAMPLE where rownum <= 5
Needless to say, the table exists (as demonstrated by the creation of the Pandas DF), and the index is on the ID column as well. What is the problem?
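
Judging from the traceback, read_sql_table hands its table argument to sa.Table(..., autoload=True), so the whole SELECT string is looked up as if it were a table name. The sketch below reproduces only that reflection step. It uses an in-memory SQLite database instead of Oracle, and the helper can_reflect is a made-up name, so it is an illustration under those assumptions and not a test of the Dask call itself.

```python
import sqlalchemy as sa
from sqlalchemy.exc import NoSuchTableError

# Stand-in database with a table called EXAMPLE (SQLite, not Oracle)
engine = sa.create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(sa.text("CREATE TABLE EXAMPLE (ID INTEGER PRIMARY KEY)"))

def can_reflect(name):
    """Return True if SQLAlchemy can reflect a table with this exact name."""
    try:
        sa.Table(name, sa.MetaData(), autoload_with=engine)
        return True
    except NoSuchTableError:
        return False

print(can_reflect("EXAMPLE"))                 # a real table name
print(can_reflect("select ID from EXAMPLE"))  # a query string, not a table
```

If that reading is right, passing just the table name ('EXAMPLE') as the first argument to dd.read_sql_table and filtering the rows afterwards would be the thing to try.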

groupby.rolling.count() yield `non-unique multi-index` exception

I'm using train_sample.csv here.
Consider the following:
import pandas as pd
df = pd.read_csv('train_sample.csv')
df = df.drop(['attributed_time'], axis=1)
df['click_time'] = pd.to_datetime(df['click_time'])
df = df.set_index('click_time')
df = df.sort_index()
df['clicks_last_hour'] = df.groupby(['ip']).rolling('1H').count()
Here I'm trying to create a new column that counts the number of times a given ip clicked in the last hour.
I'm getting:
Traceback (most recent call last):
File "train_sample.py", line 11, in <module>
df['clicks_last_hour'] = df.groupby(['ip']).rolling('1H').count()
File "C:\Users\galah\Miniconda3\envs\venv\lib\site-packages\pandas\core\frame.py", line 3119, in __setitem__
self._set_item(key, value)
File "C:\Users\galah\Miniconda3\envs\venv\lib\site-packages\pandas\core\frame.py", line 3194, in _set_item
value = self._sanitize_column(key, value)
File "C:\Users\galah\Miniconda3\envs\venv\lib\site-packages\pandas\core\frame.py", line 3378, in _sanitize_column
value = reindexer(value).T
File "C:\Users\galah\Miniconda3\envs\venv\lib\site-packages\pandas\core\frame.py", line 3358, in reindexer
raise e
File "C:\Users\galah\Miniconda3\envs\venv\lib\site-packages\pandas\core\frame.py", line 3353, in reindexer
value = value.reindex(self.index)._values
File "C:\Users\galah\Miniconda3\envs\venv\lib\site-packages\pandas\util\_decorators.py", line 187, in wrapper
return func(*args, **kwargs)
File "C:\Users\galah\Miniconda3\envs\venv\lib\site-packages\pandas\core\frame.py", line 3566, in reindex
return super(DataFrame, self).reindex(**kwargs)
File "C:\Users\galah\Miniconda3\envs\venv\lib\site-packages\pandas\core\generic.py", line 3689, in reindex
fill_value, copy).__finalize__(self)
File "C:\Users\galah\Miniconda3\envs\venv\lib\site-packages\pandas\core\frame.py", line 3501, in _reindex_axes
fill_value, limit, tolerance)
File "C:\Users\galah\Miniconda3\envs\venv\lib\site-packages\pandas\core\frame.py", line 3509, in _reindex_index
tolerance=tolerance)
File "C:\Users\galah\Miniconda3\envs\venv\lib\site-packages\pandas\core\indexes\multi.py", line 2068, in reindex
raise Exception("cannot handle a non-unique multi-index!")
Exception: cannot handle a non-unique multi-index!
Though from what I checked, there are no duplicates based on the same ip and click_time.
What am I doing wrong?
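
The traceback ends in indexes\multi.py, which suggests the value being assigned carries an (ip, click_time) MultiIndex while df is indexed by click_time alone. Below is a sketch of one possible way around that, on a tiny made-up frame and not train_sample.csv. It counts a single helper column and joins the result back on both keys. The column name one, the sample values and the lowercase "1h" alias are my choices, and the join assumes each (ip, click_time) pair is unique.

```python
import pandas as pd

# Invented stand-in data; the real file has more columns
df = pd.DataFrame({
    "ip": [1, 2, 1, 1, 2],
    "click_time": pd.to_datetime([
        "2017-11-06 10:00", "2017-11-06 10:10", "2017-11-06 10:30",
        "2017-11-06 11:45", "2017-11-06 12:00",
    ]),
})
df["one"] = 1  # helper column so exactly one column is counted

# Rolling one-hour count per ip; the result is indexed by (ip, click_time)
counts = (
    df.set_index("click_time")
      .sort_index()
      .groupby("ip")["one"]
      .rolling("1h")
      .count()
      .rename("clicks_last_hour")
)

# Join back on both keys instead of assigning the MultiIndex result directly
result = df.merge(counts.reset_index(), on=["ip", "click_time"])
print(result[["ip", "click_time", "clicks_last_hour"]])
```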

Arithmetic in pandas HDF5 queries

Why am I getting an error when I try to do simple arithmetic on constants in an HDF5 where clause? Here's an example:
>>> import pandas
>>> import numpy as np
>>> d = pandas.DataFrame({"A": np.arange(10), "B": np.random.randint(1, 100, 10)})
>>> store = pandas.HDFStore('teststore.h5', mode='w')
>>> store.append('thingy', d, format='table', data_columns=True, append=False)
>>> store.select('thingy', where="B>50")
A B
0 0 61
1 1 63
6 6 80
7 7 79
8 8 52
9 9 82
>>> store.select('thingy', where="B>40+10")
Traceback (most recent call last):
File "<pyshell#26>", line 1, in <module>
store.select('thingy', where="B>40+10")
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 682, in select
return it.get_result()
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 1365, in get_result
results = self.func(self.start, self.stop, where)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 675, in func
columns=columns, **kwargs)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 4006, in read
if not self.read_axes(where=where, **kwargs):
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 3212, in read_axes
self.selection = Selection(self, where=where, **kwargs)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 4527, in __init__
self.condition, self.filter = self.terms.evaluate()
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 580, in evaluate
self.condition = self.terms.prune(ConditionBinOp)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 122, in prune
res = pr(left.value, right.prune(klass))
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 118, in prune
res = pr(left.value, right.value)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 113, in pr
encoding=self.encoding).evaluate()
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 317, in evaluate
raise ValueError("query term is not valid [%s]" % self)
ValueError: query term is not valid [[Condition : [None]]]
Querying directly on the underlying pytables object seems to work:
>>> for row in store.get_storer('thingy').table.where("B>40+10"):
... print(row[:])
(0L, 0, 61)
(1L, 1, 63)
(6L, 6, 80)
(7L, 7, 79)
(8L, 8, 52)
(9L, 9, 82)
So what is going on here?

This is simply not supported, though I suppose it could fail with a slightly better message. It is trying to AND the two nodes (the comparison and the +10) and doesn't know how to deal with the second one, as it's not a comparison operation.
I suppose it could be implemented, but IMHO it is needlessly complex.
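
Given that, one workaround sketch is to do the arithmetic in Python and pass only the finished comparison to select. The names low, delta and threshold are made up for illustration, and the select call is left as a comment because it needs the store from the question.

```python
low, delta = 40, 10

# Evaluate the constant expression in Python, outside the where clause
threshold = low + delta

# Only a plain comparison is left for the HDF5 query parser
where = "B>{}".format(threshold)
print(where)

# With the store from the question open:
# store.select('thingy', where=where)
```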