Pandasql returns error with a basic example - pandas

Running the following code:
import pandas as pd
from pandasql import sqldf
df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [10, 20, 30, 40]})
query = "SELECT * FROM df WHERE col1 > 2"
result = sqldf(query, globals())
print(result)
gives the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
File ~/.virtualenvs/r-reticulate/lib64/python3.11/site-packages/sqlalchemy/engine/base.py:1410, in Connection.execute(self, statement, parameters, execution_options)
1409 try:
-> 1410 meth = statement._execute_on_connection
1411 except AttributeError as err:
AttributeError: 'str' object has no attribute '_execute_on_connection'
The above exception was the direct cause of the following exception:
ObjectNotExecutableError Traceback (most recent call last)
Cell In[1], line 11
8 query = "SELECT * FROM df WHERE col1 > 2"
10 # Execute the query using pandasql
---> 11 result = sqldf(query, globals())
13 print(result)
File ~/.virtualenvs/r-reticulate/lib64/python3.11/site-packages/pandasql/sqldf.py:156, in sqldf(query, env, db_uri)
124 def sqldf(query, env=None, db_uri='sqlite:///:memory:'):
125 """
126 Query pandas data frames using sql syntax
127 This function is meant for backward compatibility only. New users are encouraged to use the PandaSQL class.
(...)
154 >>> sqldf("select avg(x) from df;", locals())
...
1416 distilled_parameters,
1417 execution_options or NO_OPTIONS,
1418 )
ObjectNotExecutableError: Not an executable object: 'SELECT * FROM df WHERE col1 > 2'
Could someone please help me?

The problem is caused by SQLAlchemy 2.x, which pandasql does not support; it can be fixed by downgrading SQLAlchemy:
pip install SQLAlchemy==1.4.46
See the bug report for more details.
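If downgrading is not an option, one workaround is to skip pandasql entirely and run the SQL through the standard-library sqlite3 module with pandas. A minimal sketch, using the same DataFrame and query as above:

import sqlite3
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [10, 20, 30, 40]})

# Load the frame into an in-memory SQLite database and query it back.
conn = sqlite3.connect(':memory:')
df.to_sql('df', conn, index=False)
result = pd.read_sql_query("SELECT * FROM df WHERE col1 > 2", conn)
conn.close()
print(result)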

Related

upload pandas dataframe to redshift - relation "sqlite_master" does not exist

I am trying to write a dataframe from pandas to Redshift. Here is the code:
df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])
from sqlalchemy import create_engine
import sqlalchemy
sql_engine = create_engine('postgresql://username:password@host:port/dbname')
conn = sql_engine.raw_connection()
df.to_sql('tmp_table', conn, index=False, if_exists='replace')
However, I get the following error
---------------------------------------------------------------------------
UndefinedTable Traceback (most recent call last)
~/opt/anaconda3/envs/UserExperience/lib/python3.7/site-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
1594 else:
-> 1595 cur.execute(*args)
1596 return cur
UndefinedTable: relation "sqlite_master" does not exist
...
...
...
1593 cur.execute(*args, **kwargs)
1594 else:
-> 1595 cur.execute(*args)
1596 return cur
1597 except Exception as exc:
DatabaseError: Execution failed on sql 'SELECT name FROM sqlite_master WHERE type='table' AND name=?;': relation "sqlite_master" does not exist
I tried pandas_redshift, but it seems one first has to upload to an S3 bucket and then copy into Redshift. I would like to upload directly. Similarly, here I see the answer suggests uploading to S3 first and then to Redshift.
I can read and run queries on the database using the same connection.
Try using sql_engine instead of conn.
I just had the same issue, and using the engine did the trick; try the following:
import sqlalchemy
engine = sqlalchemy.create_engine('postgresql://username:password@url:5439/db_name')
print(bool(engine))  # <- just to keep track of the process

with engine.connect() as conn:
    print(bool(conn))  # <- just to keep track of the process
    df.to_sql(name=table_name, con=engine)
    print("end")  # <- just to keep track of the process

Problems using pandas read sql with a connection using cx_Oracle 6.0b2

When using cx_Oracle 5.3 I did not have this issue, but for a particularly large query that I am trying to run using:
connection = cx_Oracle.connect('Username/Password@host/dbname')
pd.read_sql(Query,connection)
I get the following value error:
ValueError Traceback (most recent call last)
<ipython-input-22-916f315e0bf6> in <module>()
----> 1 OracleEx = pd.read_sql(x,connection)
2 OracleEx.head()
C:\Users\kevinb\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\sql.py in read_sql(sql, con, index_col, coerce_float, params, parse_dates, columns, chunksize)
497 sql, index_col=index_col, params=params,
498 coerce_float=coerce_float, parse_dates=parse_dates,
--> 499 chunksize=chunksize)
500
501 try:
C:\Users\kevinb\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\sql.py in read_query(self, sql, index_col, coerce_float, params, parse_dates, chunksize)
1606 parse_dates=parse_dates)
1607 else:
-> 1608 data = self._fetchall_as_list(cursor)
1609 cursor.close()
1610
C:\Users\kevinb\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\sql.py in _fetchall_as_list(self, cur)
1615
1616 def _fetchall_as_list(self, cur):
-> 1617 result = cur.fetchall()
1618 if not isinstance(result, list):
1619 result = list(result)
ValueError: invalid literal for int() with base 10: '8.9'
Setting up my own cursor and using cur.fetchall() I get a similar result:
ValueError Traceback (most recent call last)
<ipython-input-46-d32c0f219cdf> in <module>()
----> 1 y=x.fetchall()
2 pd.DataFrame(y)
ValueError: invalid literal for int() with base 10: '7.3'
The values '8.9' and '7.3' change with every run.
Any ideas on why I am getting these value errors?
pd.read_sql and cur.fetchall() have worked for some queries, but not for this particular one, which worked in previous versions of cx_Oracle.
Please try with the release candidate instead of beta 2. There was an issue when retrieving certain numeric expressions.
python -m pip install cx_Oracle --upgrade --pre
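After upgrading, one quick sanity check (just an illustration, not part of the original answer) is to confirm the installed version before re-running the query:

import cx_Oracle
print(cx_Oracle.version)  # should report a 6.0 release candidate or later, not 6.0b2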

How to call unique() on dask DataFrame

How do I call unique() on a dask DataFrame?
I get the following error if I try to call it the same way as for a regular pandas dataframe:
In [27]: len(np.unique(ddf[['col1','col2']].values))
AttributeError Traceback (most recent call last)
<ipython-input-27-34c0d3097aab> in <module>()
----> 1 len(np.unique(ddf[['col1','col2']].values))
/dir/anaconda2/lib/python2.7/site-packages/dask/dataframe/core.pyc in __getattr__(self, key)
1924 return self._constructor_sliced(merge(self.dask, dsk), name,
1925 meta, self.divisions)
-> 1926 raise AttributeError("'DataFrame' object has no attribute %r" % key)
1927
1928 def __dir__(self):
AttributeError: 'DataFrame' object has no attribute 'values'
For both Pandas and Dask.dataframe you should use the drop_duplicates method
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1, 1, 2], 'y': [10, 10, 20]})
In [3]: df.drop_duplicates()
Out[3]:
x y
0 1 10
2 2 20
In [4]: import dask.dataframe as dd
In [5]: ddf = dd.from_pandas(df, npartitions=2)
In [6]: ddf.drop_duplicates().compute()
Out[6]:
x y
0 1 10
2 2 20
This works with dask==2022.11.1 (here symbol is the name of the column you want unique values from):
ddf.symbol.unique().compute()
I'm not too familiar with Dask, but they appear to have a subset of Pandas functionality, and that subset doesn't seem to include the DataFrame.values attribute.
http://dask.pydata.org/en/latest/dataframe-api.html
You could try this:
sum(ddf[['col1','col2']].apply(pd.Series.nunique, axis=0))
I don't know how it fares performance-wise, but it should give you the total number of distinct values in col1 and col2 from the ddf DataFrame.
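A dask-native variant of the same idea (a sketch only; col1 and col2 are the column names from the question) is to count distinct values per column with Series.nunique and sum the results:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'col1': [1, 1, 2], 'col2': [10, 10, 20]})
ddf = dd.from_pandas(pdf, npartitions=2)

# nunique() returns a lazy scalar per column; compute() triggers the work.
total = sum(ddf[col].nunique().compute() for col in ['col1', 'col2'])
print(total)  # 2 distinct values in col1 + 2 in col2 = 4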

How to implement aggregation functions for pandas groupby objects?

Here's the setup for this question:
import numpy as np
import pandas as pd
import collections as co
data = [['a', 1],
        ['a', 2],
        ['a', 3],
        ['a', 4],
        ['b', 5],
        ['b', 6],
        ['b', 7]]
varnames = tuple('PQ')
df = pd.DataFrame(co.OrderedDict([(varnames[i], [row[i] for row in data])
                                  for i in range(len(varnames))]))
gdf = df.groupby(df.ix[:, 0])
After evaluating the above, df looks like this:
>>> df
P Q
0 a 1
1 a 2
2 a 3
3 a 4
4 b 5
5 b 6
6 b 7
gdf is a DataFrameGroupBy object associated with df, where the groups are determined by the values in the first column of df.
Now, watch this:
>>> gdf.aggregate(sum)
Q
P
a 10
b 18
...but repeating the same thing after replacing sum with a pass-through wrapper for it, bombs:
>>> mysum = lambda *a, **k: sum(*a, **k)
>>> mysum(range(10)) == sum(range(10))
True
>>> gdf.aggregate(mysum)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1699, in aggregate
result = self._aggregate_generic(arg, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1757, in _aggregate_generic
return self._aggregate_item_by_item(func, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1782, in _aggregate_item_by_item
result[item] = colg.aggregate(func, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1426, in aggregate
result = self._aggregate_named(func_or_funcs, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1508, in _aggregate_named
output = func(group, *args, **kwargs)
File "<stdin>", line 1, in <lambda>
TypeError: unsupported operand type(s) for +: 'int' and 'str'
Here's a subtler (though probably related) issue. Recall that the result of gdf.aggregate(sum) was a dataframe with a single column, Q. Now, note the result below contains two columns, P and Q:
>>> import random as rn
>>> gdf.aggregate(lambda *a, **k: rn.random())
P Q
P
a 0.344457 0.344457
b 0.990507 0.990507
I have not been able to find anything in the documentation that would explain
why should gdf.aggregate(mysum) fail? (IOW, does this failure agree with documented behavior, or is it a bug in pandas?)
why should gdf.aggregate(lambda *a, **k: rn.random()) produce a two-column output while gdf.aggregate(sum) produce a one-column output?
what signatures (input and output) should an aggregation function foo have so that gdf.aggregate(foo) will return a table having only column Q (like the result of gdf.aggregate(sum))?
Your problems all come down to the columns that are included in the GroupBy. I think you want to group by P and compute statistics on Q. To do that, use
gdf = df.groupby('P')
instead of your method. Then any aggregations will not include the P column.
The sum in your function is Python's built-in sum. GroupBy.sum() is written in Cython and only acts on numeric dtypes; that's why you get the error about adding ints to strs.
Your other two questions are related to that. You're passing two columns, P and Q, into gdf.agg, so you get two columns out of gdf.aggregate(lambda *a, **k: rn.random()). gdf.sum() ignores the string column.
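As a quick sketch of that suggestion (same data as above, rebuilt inline): grouping by the column name leaves only Q to aggregate, so the pass-through wrapper works too:

import pandas as pd

df = pd.DataFrame({'P': list('aaaabbb'), 'Q': [1, 2, 3, 4, 5, 6, 7]})

mysum = lambda s: sum(s)           # plain Python sum over each group's Q values
print(df.groupby('P').aggregate(mysum))
#     Q
# P
# a  10
# b  18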

Error in filtering groupby results in pandas

I am trying to filter groupby results in pandas using the example provided at:
http://pandas.pydata.org/pandas-docs/dev/groupby.html#filtration
but getting the following error (pandas 0.12):
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-d0014484ff78> in <module>()
1 grouped = my_df.groupby('userID')
----> 2 grouped.filter(lambda x: len(x) >= 5)
/Users/zz/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in filter(self, func, dropna, *args, **kwargs)
2092 res = path(group)
2093
-> 2094 if res:
2095 indexers.append(self.obj.index.get_indexer(group.index))
2096
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
What does it mean and how can it be resolved?
EDIT:
Code to replicate the problem in pandas 0.12 stable:
dff = pd.DataFrame({'A': list('222'), 'B': list('123'), 'C': list('123') })
dff.groupby('A').filter(lambda x: len(x) > 2)
This was a quasi-bug in 0.12 and will be fixed in 0.13; res is now protected by a type check:
if isinstance(res, (bool, np.bool_)):
    if res:
        add_indices()
I'm not quite sure how you got this error, however; the docs are compiled and run against actual pandas, so you should make sure you're reading the docs for the version you have installed (in this case you were linking to dev rather than stable, although the API is largely unchanged).
The standard workaround is to do this using transform, which in this case would be something like:
In [10]: g = dff.groupby('A')
In [11]: dff[g.B.transform(lambda x: len(x) > 2)]
Out[11]:
A B C
0 2 1 1
1 2 2 2
2 2 3 3
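For reference, once the fix landed (pandas 0.13 and later), the original filter call works directly; a minimal check using the same toy frame:

import pandas as pd

dff = pd.DataFrame({'A': list('222'), 'B': list('123'), 'C': list('123')})

# In pandas >= 0.13, a boolean-returning function is accepted by filter().
print(dff.groupby('A').filter(lambda x: len(x) > 2))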