Groupby and filter with dask - pandas

I want to use dask to perform a groupby.
Using pandas I would write the following to do a basic group by and filter.
My dataset has two index columns: ORDER_ID and PROD_ID. For each order, identified by its ORDER_ID, there can be one or more products, each identified by its PROD_ID.
My objective is to remove orders (ORDER_IDs) that contain only one product.
Using pandas I can do it this way:
df = df.groupby('ORDER_ID').filter(lambda x: len(x) >= 2)
I didn't find any suitable solution with dask.

https://docs.dask.org/en/latest/dataframe-best-practices.html discusses when to use pandas versus dask:
For data that fits into RAM, Pandas can often be faster and easier to
use than Dask DataFrame. While “Big Data” tools can be exciting, they
are almost always worse than normal data tools while those remain
appropriate.
So is this task failing in pandas because it takes too much memory?
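If it is, one possible workaround in dask (not from the original thread; the file path is hypothetical) is to count the products per order with a regular groupby aggregation and then keep the qualifying ORDER_IDs via an inner merge, since dask's groupby does not appear to offer filter(). A sketch:
import dask.dataframe as dd

# hypothetical input file with ORDER_ID and PROD_ID columns
df = dd.read_csv('orders.csv')

# count products per order, keep only orders with at least 2 products,
# then inner-merge to drop the single-product orders from the data
counts = (df.groupby('ORDER_ID')['PROD_ID']
            .count()
            .to_frame('n_products')
            .reset_index())
keep = counts[counts['n_products'] >= 2][['ORDER_ID']]
df = df.merge(keep, on='ORDER_ID', how='inner')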

Related

pandas to_numeric a large wide dataframe

I need to apply pd.to_numeric to a long and wide (1000+ columns) dataframe, where invalid values are coerced to NaN.
Currently I'm using
df.apply(pd.to_numeric, errors="coerce")
which can take a substantial amount of time due to the number of columns.
df.astype()
does not work either, as it does not accept a coerce option.
Any comment is appreciated.
As has already been commented, the amount of data you're working with makes it pretty hard for pandas transformations not to be extremely slow.
I recommend you set up a PySpark session inside your local machine, transform the DataFrame column types and proceed to convert to Pandas at the end if you really need it.
In PySpark, you can convert all of your dataframe's columns to float like this (col comes from pyspark.sql.functions):
from pyspark.sql.functions import col
df = df.select(*(col(c).cast("float").alias(c) for c in df.columns))
Afterwards you can just save your DataFrame back to where you want it to be (or maybe stick to PySpark and join the group!):
df.toPandas().to_csv('my_file.csv')
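For context, here is a minimal end-to-end sketch of that approach, assuming a local Spark session and a hypothetical input file my_wide_file.csv (neither is from the original answer):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# start a local Spark session using all cores of the machine
spark = SparkSession.builder.master("local[*]").appName("to_numeric").getOrCreate()

# read the wide CSV directly with Spark instead of pandas
sdf = spark.read.csv("my_wide_file.csv", header=True)

# cast every column to float; invalid values become null (Spark's analogue of coerce -> NaN)
sdf = sdf.select(*(col(c).cast("float").alias(c) for c in sdf.columns))

# convert back to pandas only if you really need it
sdf.toPandas().to_csv("my_file.csv", index=False)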

Combining pandas dataframes and ArcGIS feature classes

I have a bit of a general question about the compatibility of Pandas dataframes and Arc featureclasses.
My current project is within ArcGIS, so I am mapping mostly with feature classes. I am, however, most familiar with using pandas to perform simple data analysis with tables. Therefore, I am attempting to work with dataframes for the most part and then join their data to feature classes for final mapping, using some key field common between the sets.
Attempts:
1. I have come to find that arcpy AddJoin does not accept dataframes.
2. I am currently trying to convert the dataframe to a CSV and then do an AddJoin; however, I am unsure whether this is supported, and I far prefer the functionality of filtering dataframes with "df.loc" etc.
3. An update cursor seems to be a good option; however, I am experiencing issues accessing the key field of the "row" in my loop to match records (see the sketch after this question). I will post another question about this as it is a separate issue.
Which of these or other options is the best for this purpose?
Thanks!
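(Not an answer from the original thread.) On the update cursor route in attempt 3, here is a minimal sketch of matching records by a key field with arcpy.da.UpdateCursor; the feature class name, field names and CSV path are all hypothetical:
import arcpy
import pandas as pd

# hypothetical attribute table with a key field and the value to transfer
df = pd.read_csv(r"data\attributes.csv")
lookup = dict(zip(df["KEY_FIELD"], df["VALUE_FIELD"]))

# fields come back in the order requested, so row[0] is the key field
with arcpy.da.UpdateCursor("MyFeatureClass", ["KEY_FIELD", "TARGET_FIELD"]) as cursor:
    for row in cursor:
        if row[0] in lookup:
            row[1] = lookup[row[0]]
            cursor.updateRow(row)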
Esri introduced something called Spatially Enabled DataFrame:
The Spatially Enabled DataFrame inserts a custom namespace called spatial into the popular Pandas DataFrame structure to give it spatial abilities. This allows you to use intuitive, pandorable operations on both the attribute and spatial columns.
import arcpy
import pandas as pd
# important as it "enhances" Pandas by importing these classes
from arcgis.features import GeoAccessor, GeoSeriesAccessor
# from a shape file
df = pd.DataFrame.spatial.from_featureclass(r"data\hospitals.shp")
# from a map layer
project = arcpy.mp.ArcGISProject('CURRENT')
map = project.activeMap
first_layer = map.listLayers()[0]
layer_name = first_layer.name
df = pd.DataFrame.spatial.from_featureclass(layer_name)
# or directly by name
df = pd.DataFrame.spatial.from_featureclass("Streets")
# or if nested within a group layer (e.g. Buildings)
df = pd.DataFrame.spatial.from_featureclass("Buildings\Residential")
# save to shapefile
df.spatial.to_featureclass(location=r"c:\temp\residential_buildings.shp")
However, you have to use intermediate files if you go back and forth (to my knowledge). Although it's a bit tricky having geopandas installed alongside arcpy, it may be worth looking into (only) using geopandas.
IMHO, I would recommend that you avoid unnecessarily going back and forth between arcpy and pandas. Pandas lets you merge, join and concatenate dataframes, so the table work can happen there, as in the sketch below. Or, you may be able to do everything in geopandas without needing to touch arcpy functions at all.
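A minimal sketch of that idea, assuming a hypothetical layer name, CSV path and key field; the join happens once in pandas and the result is written out only at the end:
import pandas as pd
from arcgis.features import GeoAccessor, GeoSeriesAccessor

# spatial data as a Spatially Enabled DataFrame
sdf = pd.DataFrame.spatial.from_featureclass("Parcels")
# plain attribute table
attrs = pd.read_csv(r"data\attributes.csv")

# one pandas merge instead of repeated arcpy AddJoin round trips
joined = sdf.merge(attrs, on="PARCEL_ID", how="left")

# write back to a feature class only once, at the end
joined.spatial.to_featureclass(location=r"c:\temp\parcels_joined.shp")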

Speeding up pandas multi row assignment with loc()

I am trying to assign a value to a column for all rows selected based on a condition. Solutions for achieving this are discussed in several questions like this one.
The standard solution has the following syntax:
df.loc[row_mask, cols] = assigned_val
Unfortunately, this standard solution takes forever. In fact, in my case, I didn't manage to get even one assignment complete.
Update: more info about my dataframe: I have ~2 million rows, and I am trying to update the value of one column for rows that are selected based on a condition. On average, the selection condition is satisfied by ~10 rows.
Is it possible to speed up this assignment operation? Also, are there any general guidelines for multi-row assignments in pandas?
I believe the difference between .loc and .at is what you're looking for. .at is meant to be faster, based on this answer.
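For illustration, a minimal sketch of .at (not from the original answer); note that .at takes a single row label and column at a time, so for a condition you loop over the matching labels:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# find the labels of the rows that satisfy the condition
for idx in df.index[df['A'] > 1]:
    df.at[idx, 'B'] = 0   # fast scalar assignment, one cell at a time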
You could give np.where a try.
Here is a simple example of np.where:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df['B'] = np.where(df['B']< 50, 100000, df['B'])
The question np.where() do nothing if condition fails has another example.
In your case, it might be
df[col] = np.where(df[col]==row_condition, assigned_val, df[col])
I was thinking it might be a little quicker because it is going straight to numpy instead of going through pandas to the underlying numpy mechanism. This article talks about Pandas vs Numpy on large datasets: https://towardsdatascience.com/speed-testing-pandas-vs-numpy-ffbf80070ee7#:~:text=Numpy%20was%20faster%20than%20Pandas,exception%20of%20simple%20arithmetic%20operations.

Is it possible to use Series.str.extract with Dask?

I'm currently processing a large dataset with Pandas and I have to extract some data using pandas.Series.str.extract.
It looks like this:
df['output_col'] = df['input_col'].str.extract(r'.*"mytag": "(.*?)"', expand=False).str.upper()
It works well; however, as it has to be done about ten times (using various source columns), the performance isn't very good. To improve performance by using several cores, I wanted to try Dask, but it doesn't seem to be supported (I cannot find any reference to an extract method in the dask documentation).
Is there any way to perform such a Pandas operation in parallel?
I have found this method where you basically split your dataframe into multiple ones, create a process per subframe, and then concatenate them back.
You should be able to do this just like in pandas. It's mentioned in this segment of the documentation, but it might be valuable to expand it.
import pandas as pd
import dask.dataframe as dd

s = pd.Series(["example", "strings", "are useful"])
ds = dd.from_pandas(s, 2)
ds.str.extract(r"[a-z\s]{4}(.{2})", expand=False).str.upper().compute()
0    PL
1    NG
2    US
dtype: object
Your best bet is to use map_partitions, which enables you to apply general pandas operations to the parts of your dataframe, and acts like a managed version of the multiprocessing method you linked.
def inner(df):
    df['output_col'] = df['input_col'].str.extract(
        r'.*"mytag": "(.*?)"', expand=False).str.upper()
    return df

out = df.map_partitions(inner)
Since this is a string operation, you probably want processes (e.g., the distributed scheduler) rather than threads. Note that your performance will be far better if you load your data using dask (e.g., dd.read_csv) rather than creating the dataframe in memory and then passing it to dask.
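Putting those pieces together, a hedged sketch; the file path, column names and the choice of scheduler are assumptions on my part:
import dask.dataframe as dd

# load the data directly with dask rather than handing it a pandas frame
df = dd.read_csv('big_file.csv')

def extract_tag(part):
    # runs as plain pandas on each partition
    part['output_col'] = part['input_col'].str.extract(
        r'.*"mytag": "(.*?)"', expand=False).str.upper()
    return part

out = df.map_partitions(extract_tag)
result = out.compute(scheduler='processes')   # processes rather than threads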

Transform after groupby in dask

I'm looking to perform a transform-like operation after a groupby using dask dataframes. Looking through the docs, it doesn't seem like dask currently offers this option, but does anyone have a workaround?
In practice: I'm looking to subtract the means of column B (after doing a groupby on A) from the raw values of B. In pure pandas, it looks like this:
import numpy as np

def demean_and_log(x):
    x_log = np.log(x)
    x_log_mean = x_log.mean()
    return x_log - x_log_mean

log_demean_col = X.groupby(['A'])['B'].transform(demean_and_log)
However, this is very slow: I'm dealing with very large dataframes, and since I'm using a custom transformation function in Python, pandas doesn't release the GIL.
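One possible dask workaround (an assumption on my part, not something from the thread): compute the per-group mean of log(B) with a regular groupby aggregation, merge it back, and subtract, which avoids the unsupported transform. A sketch, with a hypothetical input file:
import numpy as np
import dask.dataframe as dd

X = dd.read_csv('data.csv')

# log-transform B; dask series work with numpy ufuncs
X['B_log'] = np.log(X['B'])

# per-group means via aggregation instead of transform
means = X.groupby('A')['B_log'].mean().to_frame('B_log_mean').reset_index()
X = X.merge(means, on='A', how='left')

log_demean_col = (X['B_log'] - X['B_log_mean']).compute()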