Can pandas.read_sql read array columns directly into native structures?

Is there any way to get pandas to read a table with array-typed columns directly into native structures? By default, an int[] column ends up as an object column containing a Python list of Python ints. There are ways to convert this into a column of Series, or better, a column with a MultiIndex, but these are very slow (~10 seconds) for 500M rows. It would be much faster if the data were loaded into that shape to begin with. I don't want to unroll the array in SQL because I have very many array columns.
url = "postgresql://u:p#host:5432/dname"
engine = sqlalchemy.create_engine(url)
df = pd.read_sql_query("select 1.0 as a, 2.2 as b, array[1,2,3] as c;", engine)
print df
print type(df.loc[0,'c']) # list
print type(df.loc[0,'c'][0]) # int

Does it help if you use read_sql_table instead of read_sql_query? Also, type detection can fail due to missing values; maybe that is the cause.
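If you have to stay with read_sql_query, one way to get the arrays into a proper numeric block after loading is to expand the object column of lists into separate columns. A rough sketch (the new column names are made up, and it assumes every array has the same length); note it still materialises the Python lists first, so it may not reach the speed you want for 500M rows:

c = pd.DataFrame(df['c'].tolist(), index=df.index)   # one column per array position
c.columns = ['c_%d' % i for i in range(c.shape[1])]  # made-up column names
df = pd.concat([df.drop(columns='c'), c], axis=1)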

Related

Improving for loop over groupby keys of a pandas dataframe and concatenating

I am currently in a scenario in which I have a (really inefficient) for loop that processes a pandas dataframe df with a complicated aggregation function (say complicated_function_1) over some groupby keys (stored in groupby_keys), as follows:
final_result = []
for groupby_key in groupby_keys:
    small_df = complicated_function_1(df.loc[df['groupby_key'] == groupby_key])
    final_result.append(small_df)
df_we_want = pd.concat(final_result)
I am trying to see if there is a more efficient way of dealing with this, rather than having to use a for loop, especially if there are many groupby_keys.
Is it possible to convert all of this into a single groupby with an agg/apply or pipe, instead of concatenating by hand? Or is this procedure doomed to stay a for loop? I have tried multiple combinations but have been getting an assortment of errors (I have not really been able to pin down anything specific).
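For what it's worth, if complicated_function_1 takes a sub-frame and returns a dataframe, the loop can usually be collapsed into a single groupby/apply. A rough sketch, assuming the key column really is named 'groupby_key':

df_we_want = (
    df.groupby('groupby_key', group_keys=False)
      .apply(complicated_function_1)
)

apply still calls the function once per group, so most of the gain comes from not re-scanning df with a boolean mask for every key.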

how to merge small dataframes into a large one without copying

I have a large pandas dataframe and want to merge a couple of smaller dataframes into it, adding more columns. However, there seems to be an implicit copy of the large dataframe after each merge, which I want to avoid. What's the most efficient way to do this? (Note that the resulting dataframe has the same rows; it only grows with more columns.) map seems better, as it keeps the original dataframe, but there is overhead in creating the dictionary, and I am not sure it works when merging multiple columns into the main frame. Or maybe merge does not deep-copy everything internally after all?
Base case:
id(df) # before merge
df = df.merge(df1[["sid", "col1"]], how="left", on=["sid"])
id(df) # will be different <-- trying to avoid copying df every time a smaller frame is merged into it
df = df.merge(df2[["sid", "col2"]], how="left", on=["sid", "key2"])
id(df) # will be different
...
Using map():
d_col1 = {d["sid"]:d["col1"] for d in df1[["sid", "col1"]].to_dict("records")}
df["col1"] = df["sid"].map(d_col1)
id(df) # this is the same object
Some posts refer to dask; I haven't tested that yet.
Here is another way. First, the map can be done with a Series instead of a dictionary; since df1 is already built, I don't know whether that is more or less efficient than building a dictionary, though.
df["col1"] = df["sid"].map(df1.set_index('sid')['col1'])
Now, with two or more key columns, you can play with the index:
df['col2'] = (
    df2.set_index(['sid', 'key2'])['col2']
    .reindex(pd.MultiIndex.from_frame(df[['sid', 'key2']]))
    .to_numpy()
)
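The same reindex trick extends to several columns at once; a sketch, where 'col3' is a made-up extra column assumed to exist in df2:

cols = ['col2', 'col3']  # 'col3' is hypothetical
aligned = (
    df2.set_index(['sid', 'key2'])[cols]
    .reindex(pd.MultiIndex.from_frame(df[['sid', 'key2']]))
)
for col in cols:
    df[col] = aligned[col].to_numpy()  # plain column assignment, so id(df) stays the same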

Pandas, turn the whole data frame into unique categorical values

I am relatively new to pandas and to Python, and I am trying to find out how to turn all the content (all fields are strings) of a pandas DataFrame into categorical codes.
All the values from all rows and columns have to be treated as one big data set before being turned into categorical numbers.
So far I have been able to write the following piece of code,
for col_name in X.columns:
    if X[col_name].dtype == 'object':
        X[col_name] = X[col_name].astype('category')
        X[col_name] = X[col_name].cat.codes
This works on a data frame X with multiple columns: it takes the strings and turns them into unique numbers.
What I am not sure about is that my for loop works per column, so I don't know whether the codes assigned are unique per column or across the whole data frame (the latter is what I want).
Can you please provide advice on how I can turn my code to provide unique numbers considering all the values of the data frame?
I would like to thank you in advance for your help.
Regards
Alex
Use DataFrame.stack with Series.unstack, so the object columns are encoded as one MultiIndex Series and the codes are unique across the whole frame:
cols = df.select_dtypes('object').columns
df[cols] = df[cols].stack().astype('category').cat.codes.unstack()
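If you prefer to see the intermediate steps, pd.factorize over the stacked values does the same job. The integers differ from cat.codes (factorize numbers by order of appearance rather than sorted categories), but they are unique across the whole frame, which is what you asked for:

stacked = df[cols].stack()
codes, uniques = pd.factorize(stacked)
df[cols] = pd.Series(codes, index=stacked.index).unstack()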

Using dask to read data from Hive

I am using the as_pandas utility from impala.util to read data fetched from Hive into a dataframe. However, I don't think pandas will be able to handle a large amount of data, and it will also be slower. I have been reading about dask, which provides excellent functionality for reading large data files. How can I use it to fetch data from Hive efficiently?
def as_dask(cursor):
    """Return a DataFrame out of an impyla cursor.

    This will pull the entire result set into memory. For richer pandas-
    like functionality on distributed data sets, see the Ibis project.

    Parameters
    ----------
    cursor : `HiveServer2Cursor`
        The cursor object that has a result set waiting to be fetched.

    Returns
    -------
    DataFrame
    """
    import pandas as pd
    import dask
    import dask.dataframe as dd

    names = [metadata[0] for metadata in cursor.description]
    dfs = dask.delayed(pd.DataFrame.from_records)(cursor.fetchall(),
                                                  columns=names)
    return dd.from_delayed(dfs).compute()
There is currently no straightforward way to do this. You would do well to look at the implementation of dask.dataframe.read_sql_table and similar code in intake-sql: you will probably want a way to partition your data, and have each of your workers fetch one partition via a call to delayed(). dd.from_delayed and dd.concat could then be used to stitch the pieces together.
-edit-
Your function has the delayed idea back to front. You are delaying and then immediately materialising the data within a function that operates on a single cursor: it can't be parallelised, and it will break your memory if the data is big (which is the reason you are trying this).
Let's suppose you can form a set of 10 queries, where each query gets a different part of the data; do not use OFFSET, use a condition on some column that is indexed by Hive.
You want to do something like:
queries = [SQL_STATEMENT.format(i) for i in range(10)]

def query_to_df(query):
    # note: impyla has no top-level execute(); get a cursor from an impala.dbapi connection
    cursor = impala.dbapi.connect(...).cursor()  # connection details elided
    cursor.execute(query)
    return pd.DataFrame.from_records(cursor.fetchall())
Now you have a function that returns a partition and has no dependence on global objects - it only takes as input a string.
parts = [dask.delayed(query_to_df)(q) for q in queries]
df = dd.from_delayed(parts)
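df is still lazy at this point; nothing is fetched until you call .compute() or write the result out. If you know the columns and dtypes up front, you can also pass meta= to dd.from_delayed so dask does not have to guess them. A sketch with made-up column names:

meta = pd.DataFrame({'station': pd.Series(dtype='object'),
                     'rainfall': pd.Series(dtype='float64')})  # hypothetical schema
df = dd.from_delayed(parts, meta=meta)
result = df.groupby('station')['rainfall'].mean().compute()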

pandas read_sql not reading all rows

I am running the exact same query both through pandas' read_sql and through an external app (DbVisualizer).
DbVisualizer returns 206 rows, while pandas returns 178.
I have tried reading the data from pandas in chunks, based on the information provided at How to create a large pandas dataframe from an sql query without running out of memory?, but it didn't make a difference.
What could be the cause of this, and how can I remedy it?
The query:
select *
from rainy_days
where year='2010' and day='weekend'
The columns contain: date, year, weekday, amount of rain on that day, temperature, geo_location (one row per location), wind measurements, amount of rain the day before, etc.
The exact Python code (minus connection details) is:
import pandas
from sqlalchemy import create_engine

engine = create_engine(
    'postgresql://user:pass@server.com/weatherhist?port=5439',
)
query = """
    select *
    from rainy_days
    where year='2010' and day='weekend'
"""
df = pandas.read_sql(query, con=engine)
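One way to narrow this down is to count the rows through the same engine, bypassing read_sql entirely; a diagnostic sketch using the table and filter from the question:

from sqlalchemy import text

with engine.connect() as conn:
    n = conn.execute(text(
        "select count(*) from rainy_days where year='2010' and day='weekend'"
    )).scalar()
print(n, len(df))  # if these match, rows are lost on the pandas side; if not, check the query itself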
https://github.com/xzkostyan/clickhouse-sqlalchemy/issues/14
If you use a plain engine.execute you have to take care of the result format manually.
The problem is that pandas returns a packed dataframe (DF). For some reason this is always on by default, and the results vary widely as to what is shown. The solution is to use the unpacking operator (*) before/when trying to print the df, like this:
print(*df)
(This is also known as the splat operator for Ruby enthusiasts.)
To read more about this, please check out these references & tutorials:
https://treyhunner.com/2018/10/asterisks-in-python-what-they-are-and-how-to-use-them/
https://www.geeksforgeeks.org/python-star-or-asterisk-operator/
https://medium.com/understand-the-python/understanding-the-asterisk-of-python-8b9daaa4a558
https://towardsdatascience.com/unpacking-operators-in-python-306ae44cd480
It's not a fix, but what worked for me was to rebuild the indices:
- drop the indices
- export the whole thing to a csv
- delete all the rows: DELETE FROM table
- import the csv back in (pandas: df = read_csv(..) then df.to_sql(..))
- rebuild the indices
If that works, then at least you know you have a problem somewhere with the indices keeping up to date.
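For the read_csv / to_sql part of that round trip, a minimal sketch (the file name is made up; if_exists="append" assumes the table and its indices were kept and only the rows were deleted):

df = pandas.read_csv("rainy_days_export.csv")  # hypothetical export file
df.to_sql("rainy_days", engine, if_exists="append", index=False)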