Dask dataframe groupby fails with type error, but identical pandas groupby succeeds - pandas

I have created a dask dataframe from geopandas futures that each yield a pandas dataframe following the example here: https://gist.github.com/mrocklin/e7b7b3a65f2835cda813096332ec73ca
daskdf = dd.from_delayed(lazy_dataframes, meta=lazy_dataframes[0].compute())
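For context, the delayed objects were built roughly like this (a sketch; the shapefile paths are placeholders):

import dask
import geopandas as gpd

shp_paths = ["site1.shp", "site2.shp"]  # placeholder inputs
lazy_dataframes = [dask.delayed(gpd.read_file)(path) for path in shp_paths]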
All dtypes seem reasonable
daskdf.dtypes
left float64
bottom float64
right float64
top float64
score object
label object
height float64
area float64
geometry geometry
shp_path object
geo_index object
Year int64
Site object
dtype: object
but the dask dataframe groupby operation fails:
daskdf.groupby(['Site']).height.mean().compute()
...
"/Users/ben/miniconda3/envs/crowns/lib/python3.7/site-packages/dask/dataframe/utils.py", line 577, in _nonempty_series
data = np.array([entry, entry], dtype=dtype)
builtins.TypeError: data type not understood
whereas pandas has no problem with the same process on the same data.
daskdf.compute().groupby(['Site']).height.mean()
Site
SOAP 15.102355
Name: height, dtype: float64
What might be happening here with the metadata types that could cause this? As I scale my workflow, I would like to perform distributed operations on persisted data.

The problem is the 'geometry' dtype, which comes from geopandas: my pandas dataframe came from loading a shapefile with geopandas.read_file(). Future users beware: drop this column when creating a dask dataframe. (I know there was a dask-geopandas attempt some time ago.) This was harder to track down because the statement
daskdf.groupby(['Site']).height.mean().compute()
does not involve the geometry column. Dask must check the dtypes of all columns, not just the ones used in an operation. Be careful!
Dropping the geometry column yields the expected result.
daskdf = daskdf.drop(columns="geometry")
daskdf.groupby(['Site']).height.mean().compute()
Tagging with geopandas in hopes future users can find this.

Related

Sparse columns in pandas: directly access the indices of non-null values

I have a large dataframe (approx. 10^8 rows) with some sparse columns. I would like to be able to quickly access the non-null values in a given column, i.e. the values that are actually saved in the array. I figured that this could be achieved by df.<column name>[<indices of non-null values>]. However, I can't see how to access <indices of non-null values> directly, i.e. without any computation. When I try df.<column name>.index it tells me that it's a RangeIndex, which doesn't help. I can even see <indices of non-null values> when I run df.<column name>.values, but looking through dir(df.<column name>.values) I still can't see a way to access them.
To make clear what I mean, here is a toy example:
In this example <indices of non-null values> is [0, 2, 4].
EDIT: The answer below by @Piotr Żak is a viable solution, but it requires computation. Is there a way to access <indices of non-null values> directly via an attribute of the column or array?
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1], [np.nan], [4], [np.nan], [9]]),
                  columns=['a'])
Just filter out the NaN values:
filtered_df = df[df['a'].notnull()]
Transform the column from the df to an array:
s_array = filtered_df[["a"]].to_numpy()
Or transform the indexes from the df to an array:
filtered_df.index.tolist()
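If the column actually uses pandas' sparse dtype, the stored positions are kept on the SparseArray itself, so they can be read without filtering the full column. A sketch of that, based on my understanding of pandas' sparse API:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1], [np.nan], [4], [np.nan], [9]]),
                  columns=['a'])

# Store the column sparsely: only the non-NaN entries are kept.
sparse_a = df['a'].astype(pd.SparseDtype('float64', fill_value=np.nan))

# The positions of the stored (non-null) values live on the sparse index.
positions = sparse_a.array.sp_index.to_int_index().indices
print(positions)  # [0 2 4]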

Writing data frame with object dtype to HDF5 only works after converting to string

I have a big dataframe and I want to write it to disk for quick retrieval. I believe to_hdf(...) infers the data type of the columns and sometimes gets it wrong. I wonder what the correct way is to cope with this.
import pandas as pd
import numpy as np
length = 10
df = pd.DataFrame({"a": np.random.randint(1e7, 1e8, length)})
# df.loc[1, "a"] = "abc"
# df["a"] = df["a"].astype(str)
print(df.dtypes)
df.to_hdf("df.hdf5", key="data", format="table")
Uncommenting the various lines leads me to the following:
Just filling the column with numbers leads to a data type of int32, and it stores without a problem.
Setting one element to abc changes the dtype to object, but it seems that to_hdf internally infers another data type and throws an error: TypeError: object of type 'int' has no len()
Explicitly converting the column to str leads to success, and to_hdf stores the data.
Now I am wondering what is happening in the second case, and is there a way to prevent it? The only way I found was to go through all columns, check if they are dtype('O'), and explicitly convert them to str.
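For reference, that per-column workaround looks roughly like this (a sketch; it needs PyTables installed for to_hdf to work):

import numpy as np
import pandas as pd

length = 10
df = pd.DataFrame({"a": np.random.randint(10**7, 10**8, length)})
df.loc[1, "a"] = "abc"  # mixing in a string turns the column into object dtype

# Cast every object column to str so to_hdf sees a uniform type.
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].astype(str)

df.to_hdf("df.hdf5", key="data", format="table")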
Instead of using hdf5, I have found a generic pickling library which seems to be perfect for the job: joblib
Storing and loading data is straightforward:
import joblib
joblib.dump(df, "file.jl")
df2 = joblib.load("file.jl")

Check if a value in one Dask dataframe is in another Dask dataframe

I found the answer in this post for pandas dataframe
But when I try to apply it to a Dask dataframe I get a “not implemented” error.
You need to load the values into memory first: Dask is lazy, so the data is not in memory until you execute compute(). Also, isin in Dask requires a list-like parameter, not a Dask object. The official documentation states:
Parameters: values : set or list-like
So we can adapt the pandas example to:
Df1.name.isin(Df2.IDs.compute()).astype(int).compute()
0 1
1 1
2 0
3 0
Name: name, dtype: int64
Refer: Dask.compute
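A self-contained version of that, with made-up Df1/Df2 mirroring the pandas example in the linked post:

import pandas as pd
import dask.dataframe as dd

Df1 = dd.from_pandas(pd.DataFrame({'name': [1, 2, 3, 4]}), npartitions=1)
Df2 = dd.from_pandas(pd.DataFrame({'IDs': [1, 2, 9]}), npartitions=1)

# compute() materializes Df2.IDs as an in-memory pandas Series, which isin accepts.
print(Df1.name.isin(Df2.IDs.compute()).astype(int).compute())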

Pandas- Groupby Plot is not working for object

I am new to Pandas and doing some analysis of a csv file. I have successfully read the csv and shown all the details. I have two columns of object type which I need to plot. I have done a groupby on those two columns and can get the first rows and all the data; however, I am not sure how to plot these object-type columns in Pandas. Below is a sample of my groupby, with event_type and event_description being the columns I need to plot. If I can plot Application and Network for event_type, that would be a great help.
import pandas as pd
data = pd.read_csv('/Users/temp/Downloads/sample.csv')
data.head()
grouped_df = data.groupby([ "event_type", "event_description"])
grouped_df.first()
As commented - need more info, but IIUC, try:
df['event_type'].value_counts(sort=True).plot(kind='barh')
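If the goal is to see both columns in one figure, something along these lines might work (a sketch with a tiny made-up frame standing in for the csv; plotting needs matplotlib):

import pandas as pd

data = pd.DataFrame({
    'event_type': ['Application', 'Network', 'Application', 'Network', 'Application'],
    'event_description': ['login', 'timeout', 'crash', 'timeout', 'login'],
})

# Count each (event_type, event_description) pair and plot the counts per type.
counts = data.groupby(['event_type', 'event_description']).size()
counts.unstack('event_description').plot(kind='barh', stacked=True)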

create dask DataFrame from a list of dask Series

I need to create a dask DataFrame from a set of dask Series,
analogously to constructing a pandas DataFrame from lists
pd.DataFrame({'l1': list1, 'l2': list2})
I am not seeing anything in the API. The dask DataFrame constructor is not supposed to be called by users directly and takes a computation graph as its main argument.
In general I agree that it would be nice for the dd.DataFrame constructor to behave like the pd.DataFrame constructor.
If your series have well defined divisions then you might try dask.dataframe.concat with axis=1.
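For example (a sketch; the two series are made up and share divisions because they come from the same frame):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'l1': [1, 2, 3], 'l2': [4.0, 5.0, 6.0]})
ddf = dd.from_pandas(pdf, npartitions=2)

s1, s2 = ddf['l1'], ddf['l2']
combined = dd.concat([s1, s2], axis=1)  # aligns on the shared divisions
print(combined.compute())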
You could also try converting one of the series into a DataFrame and then use assignment syntax:
L = ...  # list of dask Series
df = L[0].to_frame()
for s in L[1:]:
    df[s.name] = s