Is it possible to use Series.str.extract with Dask? - pandas

I'm currently processing a large dataset with Pandas and I have to extract some data using pandas.Series.str.extract.
It looks like this:
df['output_col'] = df['input_col'].str.extract(r'.*"mytag": "(.*?)"', expand=False).str.upper()
It works well, however, as it has to be done about ten times (using various source columns) the performance aren't very good. To improve the performance by using several cores, I wanted to try Dask but it doesn't seem to be supported (I cannot find any reference to an extract method in the dask's documentation).
Is there any way to performance such Pandas action in parallel?
I have found this method where you basically split your dataframe into multiple ones, create a process per subframes and then concatenate them back.

You should be able to do this like in pandas. It's mentioned in this segment of the documentation, but it might be valuable to expand it.
import pandas as pd
import dask.dataframe as dd
​
s = pd.Series(["example", "strings", "are useful"])
ds = dd.from_pandas(s, 2)
ds.str.extract("[a-z\s]{4}(.{2})", expand=False).str.upper().compute()
0 PL
1 NG
2 US
dtype: object

Your best bet is to use map_partitions, which enables you to perform general pandas operations to the parts of your series, and acts like a managed version of the multiprocessing method you linked.
def inner(df):
df['output_col'] = df['input_col'].str.extract(
r'.*"mytag": "(.*?)"', expand=False).str.upper()
return df
out = df.map_partitions(inner)
Since this is a string operation, you probably want processes (e.g., the distributed scheduler) rather than threads. Note, that your performance will be far better if you load your data using dask (e.g., dd.read_csv) rather than create the dataframe in memory and then pass it to dask.

Related

Combining pandas dataframes and ArcGIS feature classes

I have a bit of a general question about the compatibility of Pandas dataframes and Arc featureclasses.
My current project is within ArcGIS and so I am mapping mostly with featureclasses. I am however, most familiar with using pandas to perform simple data analysis with tables. Therefore, I am attempting to work with dataframes for the most part, and then join their data to feature classes for final mapping using some key field common between sets.
Attempts:
1.I have come to find that arcpy AddJoin does not accept dfs.
2.I am currently trying convert df to csv and then do an Addjoin however I am unsure if this is supported and I far prefer the functionality of filtering dfs with "df.loc" etc.
Update cursor seems to be a good option, however, I am experiencing issues accessing the key field of the "row" in my loop to match records. I will post another question about this as it is a separate issue.
Which of these or other options is the best for this purpose?
Thanks!
Esri introduced something called Spatially Enabled DataFrame:
The Spatially Enabled DataFrame inserts a custom namespace called spatial into the popular Pandas DataFrame structure to give it spatial abilities. This allows you to use intutive, pandorable operations on both the attribute and spatial columns.
import arcpy
import pandas as pd
# important as it "enhances" Pandas by importing these classes
from arcgis.features import GeoAccessor, GeoSeriesAccessor
# from a shape file
df = pd.DataFrame.spatial.from_featureclass(r"data\hospitals.shp")
# from a map layer
project = arcpy.mp.ArcGISProject('CURRENT')
map = project.activeMap
first_layer = map.listLayers()[0]
layer_name = first_layer.name
df = pd.DataFrame.spatial.from_featureclass(layer_name)
# or directly by name
df = pd.DataFrame.spatial.from_featureclass("Streets")
# of if nested within a group layer (e.g. Buildings)
df = pd.DataFrame.spatial.from_featureclass("Buildings\Residential")
# save to shapefile
df.spatial.to_featureclass(location=r"c:\temp\residential_buildings.shp")
However, you have to use intermediate files if you go back and forth (to my knowledge). Although it's a bit tricky having geopandas installed along arcpy, it may be worth looking into (only) using geopandas.
IMHO, I would recommend that you avoid unnecessarily going back and forth between arcpy and pandas. Pandas allows to merge, join and concat dataframes. Or, you may be able to do everything in geopandas without needing to touch arcpy functions at all.

joblib.Memory and pandas.DataFrame inputs

I've been finding that joblib.Memory.cache results in unreliable caching when using dataframes as inputs to the decorated functions. Playing around, I found that joblib.hash results in inconsistent hashes, at least in some cases. If I understand correctly, joblib.hash is used by joblib.Memory, so this is probably the source of the problem.
Problems seem to occur when new columns are added to dataframes followed by a copy, or when a dataframe is saved and loaded from disk. The following example compares the inconsistent hash output when applied to dataframes, or the consistent results when applied to the equivalent numpy data.
import pandas as pd
import joblib
df = pd.DataFrame({'A':[1,2,3],'B':[4.,5.,6.], })
df.index.name='MyInd'
df['B2'] = df['B']**2
df_copy = df.copy()
df_copy.to_csv("df.csv")
df_fromfile = pd.read_csv('df.csv').set_index('MyInd')
print("DataFrame Hashes:")
print(joblib.hash(df))
print(joblib.hash(df_copy))
print(joblib.hash(df_fromfile))
def _to_tuple(df):
return (df.values, df.columns.values, df.index.values, df.index.name)
print("Equivalent Numpy Hashes:")
print(joblib.hash(_to_tuple(df)))
print(joblib.hash(_to_tuple(df_copy)))
print(joblib.hash(_to_tuple(df_fromfile)))
results in output:
DataFrame Hashes:
4e9352c1ffc14fb4bb5b1a5ad29a3def
2d149affd4da6f31bfbdf6bd721e06ef
6843f7020cda9d4d3cbf05dfc47542d4
Equivalent Numpy Hashes:
6ad89873c7ccbd3b76ae818b332c1042
6ad89873c7ccbd3b76ae818b332c1042
6ad89873c7ccbd3b76ae818b332c1042
The "Equivalent Numpy Hashes" is the behavior I'd like. I'm guessing the problem is due to some kind of complex internal metadata that DataFrames utililize. Is there any canonical way to use joblib.Memory.cache on pandas DataFrames so it will cache based upon the data values only?
A "good enough" workaround would be if there is a way a user can tell joblib.Memory.cache to utilize something like my _to_tuple function above for specific arguments.

Groupby and filter with dask

I want to use dask to make a groupby.
Using pandas I would have to write this to make a basic a basic group by and filter.
My dataset contains 2 indexes : ORDER_ID and PROD_ID. Each ORDER defined by ORDER_ID, we can have 1 or more product defined by its PROD_ID.
My objective is to remove ORDER_ID that contain 1 product.
Using pandas I can do it this way:
df = df.groupby('ORDER_ID').filter(lambda x: len(x) >= 2)
I didn't find any suitable solution with dask.
https://docs.dask.org/en/latest/dataframe-best-practices.html discusses the issues with pandas and dask.
For data that fits into RAM, Pandas can often be faster and easier to
use than Dask DataFrame. While “Big Data” tools can be exciting, they
are almost always worse than normal data tools while those remain
appropriate.
So this task is not working in pandas as it takes too much memory?

One Hot Encoding of large dataset

I want to build recommendation system using association rules with implemented in mlxtend library apriori algorithm. In my sales data there is information about 36 millions of transactions and 50k unique products.
I tried to use sklearn OneHotEncoder and pandas get_dummies() but both are giving OOM error as they are not able to create frame in shape of (36 mil, 50k)
MemoryError: Unable to allocate 398. GiB for an array with shape (36113798, 50087) and data type uint8
Is there any other solution?
Like you, I too had out of memory error with mlxtend at first, but the following small changes fixed the problem completely.
`
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd
te = TransactionEncoder()
#te_ary = te.fit(itemSetList).transform(itemSetList)
#df = pd.DataFrame(te_ary, columns=te.columns_)
fitted = te.fit(itemSetList)
te_ary = fitted.transform(itemSetList, sparse=True) # seemed to work good
df = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_) # seemed to work good
# now you can call mlxtend's fpgrowth() followed by association_rules()
`
You should also use fpgrowth instead of apriori on the big transaction datasets because apriori is too primitive. fpgrowth is more intelligent and modern than apriori but gives equivalent results. The mlxtend lib supports both apriori and fpgrowth.
I think a good solution would be to use embeddings instead of one-hot encoding for your problem. In addition, I recommend that you split your dataset into smaller subsets to further avoid the memory consumption problems.
You should also consult this thread : https://datascience.stackexchange.com/questions/29851/one-hot-encoding-vs-word-embeding-when-to-choose-one-or-another

Transform after groupby in dask

I'm looking to perform a transform-like operation after a groupby using dask dataframes. Looking through the docs, it doesn't seem like dask currently gives this option - but does anyone have any work around?
In practice: I'm looking to subtract the means of column B (after doing a groupby on A) from the raw values of B. In pure pandas, it looks like this:
def demean_and_log(x):
x_log = np.log(x)
x_log_mean = x_log.mean()
return x_log - x_log_mean
log_demean_col = X.groupby(['A'])['B'].transform(demean_and_log)
However, this is very slow - as I'm dealing with very large dataframes, and since I'm using a custom transformation function in Python, pandas doesn't release the GIL.