joblib.Memory and pandas.DataFrame inputs

I've been finding that joblib.Memory.cache results in unreliable caching when using dataframes as inputs to the decorated functions. Playing around, I found that joblib.hash results in inconsistent hashes, at least in some cases. If I understand correctly, joblib.hash is used by joblib.Memory, so this is probably the source of the problem.
Problems seem to occur when new columns are added to a dataframe followed by a copy, or when a dataframe is saved to and loaded from disk. The following example compares the inconsistent hash output when applied to dataframes with the consistent results when applied to the equivalent numpy data.
import pandas as pd
import joblib
df = pd.DataFrame({'A':[1,2,3],'B':[4.,5.,6.], })
df.index.name='MyInd'
df['B2'] = df['B']**2
df_copy = df.copy()
df_copy.to_csv("df.csv")
df_fromfile = pd.read_csv('df.csv').set_index('MyInd')
print("DataFrame Hashes:")
print(joblib.hash(df))
print(joblib.hash(df_copy))
print(joblib.hash(df_fromfile))
def _to_tuple(df):
    return (df.values, df.columns.values, df.index.values, df.index.name)
print("Equivalent Numpy Hashes:")
print(joblib.hash(_to_tuple(df)))
print(joblib.hash(_to_tuple(df_copy)))
print(joblib.hash(_to_tuple(df_fromfile)))
results in output:
DataFrame Hashes:
4e9352c1ffc14fb4bb5b1a5ad29a3def
2d149affd4da6f31bfbdf6bd721e06ef
6843f7020cda9d4d3cbf05dfc47542d4
Equivalent Numpy Hashes:
6ad89873c7ccbd3b76ae818b332c1042
6ad89873c7ccbd3b76ae818b332c1042
6ad89873c7ccbd3b76ae818b332c1042
The "Equivalent Numpy Hashes" is the behavior I'd like. I'm guessing the problem is due to some kind of complex internal metadata that DataFrames utililize. Is there any canonical way to use joblib.Memory.cache on pandas DataFrames so it will cache based upon the data values only?
A "good enough" workaround would be if there is a way a user can tell joblib.Memory.cache to utilize something like my _to_tuple function above for specific arguments.

Related

Combining pandas dataframes and ArcGIS feature classes

I have a bit of a general question about the compatibility of Pandas dataframes and ArcGIS feature classes.
My current project is within ArcGIS, and so I am mapping mostly with feature classes. I am, however, most familiar with using pandas to perform simple data analysis with tables. Therefore, I am attempting to work with dataframes for the most part, and then join their data to feature classes for final mapping using some key field common between the sets.
Attempts:
1. I have come to find that arcpy AddJoin does not accept dataframes.
2. I am currently trying to convert the df to csv and then do an AddJoin; however, I am unsure if this is supported, and I far prefer the functionality of filtering dfs with "df.loc" etc.
An update cursor seems to be a good option; however, I am experiencing issues accessing the key field of the "row" in my loop to match records. I will post another question about this as it is a separate issue.
Which of these or other options is the best for this purpose?
Thanks!
Esri introduced something called Spatially Enabled DataFrame:
The Spatially Enabled DataFrame inserts a custom namespace called spatial into the popular Pandas DataFrame structure to give it spatial abilities. This allows you to use intuitive, pandorable operations on both the attribute and spatial columns.
import arcpy
import pandas as pd
# important as it "enhances" Pandas by importing these classes
from arcgis.features import GeoAccessor, GeoSeriesAccessor
# from a shape file
df = pd.DataFrame.spatial.from_featureclass(r"data\hospitals.shp")
# from a map layer
project = arcpy.mp.ArcGISProject('CURRENT')
map = project.activeMap
first_layer = map.listLayers()[0]
layer_name = first_layer.name
df = pd.DataFrame.spatial.from_featureclass(layer_name)
# or directly by name
df = pd.DataFrame.spatial.from_featureclass("Streets")
# or if nested within a group layer (e.g. Buildings)
df = pd.DataFrame.spatial.from_featureclass(r"Buildings\Residential")
# save to shapefile
df.spatial.to_featureclass(location=r"c:\temp\residential_buildings.shp")
However, you have to use intermediate files if you go back and forth (to my knowledge). Although it's a bit tricky having geopandas installed alongside arcpy, it may be worth looking into (only) using geopandas.
IMHO, I would recommend that you avoid unnecessarily going back and forth between arcpy and pandas. Pandas lets you merge, join, and concatenate dataframes. Or, you may be able to do everything in geopandas without needing to touch arcpy functions at all.
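As a rough sketch of that join-in-pandas approach (the file paths, layer name, and key field key_id below are just placeholders):
import pandas as pd
from arcgis.features import GeoAccessor, GeoSeriesAccessor  # enables the .spatial accessor
# read the feature class into a spatially enabled dataframe (hypothetical path)
sdf = pd.DataFrame.spatial.from_featureclass(r"data\parcels.shp")
# regular pandas dataframe with the attributes to attach (hypothetical file)
attrs = pd.read_csv("analysis_results.csv")
# join on a shared key field, entirely in pandas
joined = sdf.merge(attrs, on="key_id", how="left")
# write the result back out for mapping (hypothetical output path)
joined.spatial.to_featureclass(location=r"c:\temp\parcels_joined.shp")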

Speeding up pandas multi row assignment with loc()

I am trying to assign value to a column for all rows selected based on a condition. Solutions for achieving this are discussed in several questions like this one.
The standard solutions are of the following form:
df.loc[row_mask, cols] = assigned_val
Unfortunately, this standard solution takes forever. In fact, in my case, I didn't manage to get even one assignment complete.
Update: More info about my dataframe: I have ~2 Million rows in my dataframe and I am trying to update the value of one column in my dataframe for rows that are selected based on a condition. On average, the selection condition is satisfied by ~10 rows.
Is it possible to speed up this assignment operation? Also, are there any general guidelines for multi-row assignments in pandas?
I believe the difference between .loc and .at is what you're looking for; .at is meant to be faster, based on this answer.
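A minimal sketch of that idea, assuming the condition matches only a handful of rows (the column names and condition here are made up):
import numpy as np
import pandas as pd
df = pd.DataFrame({"group": np.random.randint(0, 200_000, 2_000_000), "value": 0.0})
# find the few matching index labels once, then assign cell by cell with .at
matches = df.index[df["group"] == 12345]
for idx in matches:
    df.at[idx, "value"] = 99.0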
You could give np.where a try.
Here is a simple example of np.where:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df['B'] = np.where(df['B']< 50, 100000, df['B'])
The question "np.where() do nothing if condition fails" has another example.
In your case, it might be
df[col] = np.where(df[col]==row_condition, assigned_val, df[col])
I was thinking it might be a little quicker because it goes straight to numpy instead of going through pandas to the underlying numpy mechanism. This article talks about Pandas vs Numpy on large datasets: https://towardsdatascience.com/speed-testing-pandas-vs-numpy-ffbf80070ee7#:~:text=Numpy%20was%20faster%20than%20Pandas,exception%20of%20simple%20arithmetic%20operations.

Is it possible to use Series.str.extract with Dask?

I'm currently processing a large dataset with Pandas and I have to extract some data using pandas.Series.str.extract.
It looks like this:
df['output_col'] = df['input_col'].str.extract(r'.*"mytag": "(.*?)"', expand=False).str.upper()
It works well; however, as it has to be done about ten times (using various source columns), the performance isn't very good. To improve performance by using several cores, I wanted to try Dask, but it doesn't seem to be supported (I cannot find any reference to an extract method in Dask's documentation).
Is there any way to perform such a Pandas operation in parallel?
I have found this method where you basically split your dataframe into multiple ones, create a process per subframe, and then concatenate them back.
You should be able to do this like in pandas. It's mentioned in this segment of the documentation, but it might be valuable to expand it.
import pandas as pd
import dask.dataframe as dd

s = pd.Series(["example", "strings", "are useful"])
ds = dd.from_pandas(s, 2)
ds.str.extract(r"[a-z\s]{4}(.{2})", expand=False).str.upper().compute()
0 PL
1 NG
2 US
dtype: object
Your best bet is to use map_partitions, which enables you to perform general pandas operations to the parts of your series, and acts like a managed version of the multiprocessing method you linked.
def inner(df):
    df['output_col'] = df['input_col'].str.extract(
        r'.*"mytag": "(.*?)"', expand=False).str.upper()
    return df
out = df.map_partitions(inner)
Since this is a string operation, you probably want processes (e.g., the distributed scheduler) rather than threads. Note that your performance will be far better if you load your data using dask (e.g., dd.read_csv) rather than creating the dataframe in memory and then passing it to dask.
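A hedged sketch tying those pieces together (the file name, column names, and scheduler choice are assumptions on my part):
import dask.dataframe as dd
def inner(df):
    # same extraction as above, applied independently to each partition
    df['output_col'] = df['input_col'].str.extract(
        r'.*"mytag": "(.*?)"', expand=False).str.upper()
    return df
ddf = dd.read_csv("big_file.csv")  # hypothetical input; partitions are built lazily
out = ddf.map_partitions(inner)
# run with process-based workers, since regex extraction is GIL-bound string work
result = out.compute(scheduler="processes")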

Writing data frame with object dtype to HDF5 only works after converting to string

I have a big dataframe and I want to write it to disk for quick retrieval. I believe to_hdf(...) infers the data type of the columns and sometimes gets it wrong. I wonder what the correct way is to cope with this.
import pandas as pd
import numpy as np
length = 10
df = pd.DataFrame({"a": np.random.randint(1e7, 1e8, length),})
# df.loc[1, "a"] = "abc"
# df["a"] = df["a"].astype(str)
print(df.dtypes)
df.to_hdf("df.hdf5", key="data", format="table")
Uncommenting various lines leads me to the following:
1. Just filling the column with numbers leads to a data type of int32, and the frame stores without problems.
2. Setting one element to "abc" changes the dtype to object, but it seems that to_hdf internally infers another data type and throws an error: TypeError: object of type 'int' has no len()
3. Explicitly converting the column to str leads to success, and to_hdf stores the data.
Now I am wondering what is happening in the second case, and is there a way to prevent it? The only way I found was to go through all columns, check whether they are dtype('O'), and explicitly convert them to str.
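That column-by-column conversion can be wrapped in a small helper; a sketch, assuming casting every object column to str is acceptable for your data:
import pandas as pd
def stringify_object_columns(df):
    # cast every object-dtype column to str so to_hdf's type inference
    # doesn't trip over mixed int/str values
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        out[col] = out[col].astype(str)
    return out
stringify_object_columns(df).to_hdf("df.hdf5", key="data", format="table")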
Instead of using hdf5, I have found a generic pickling library which seems to be perfect for the job: joblib.
Storing and loading data is straightforward:
import joblib
joblib.dump(df, "file.jl")
df2 = joblib.load("file.jl")

Fastest way to load multiple numpy arrays into spark rdd?

I'm new to Spark. In my application, I would like to create an RDD from many numpy arrays. Each numpy array has shape (10,000, 5,000). Currently, I'm trying the following:
import pandas as pd

rdd_list = []
for np_array in np_arrays:
    pandas_df = pd.DataFrame(np_array)
    spark_df = sqlContext.createDataFrame(pandas_df)  ## SLOW STEP
    rdd_list.append(spark_df.rdd)
big_rdd = sc.union(rdd_list)
All of the steps are fast except converting the Pandas dataframe to a Spark dataframe, which is very slow. If I use a subset of the numpy array, such as (10,000, 500), it takes a couple of minutes to convert it to a Spark dataframe. But if I use the full numpy array (10,000, 5,000), it just hangs.
Is there anything I can do to speed up my workflow? Or should I be doing this in a completely different way? (FYI, I'm kind of stuck with the initial numpy arrays.)
For my application, I used the ArrayRDD class from the sparkit-learn project to load numpy arrays into Spark RDDs. I had no complaints, but your mileage may vary.
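If your Spark version supports it, another thing that may be worth trying (my assumption, not something covered above) is enabling Apache Arrow so that createDataFrame converts pandas frames column-wise instead of row by row. Here spark is the SparkSession, and the config key differs between Spark 2.x and 3.x:
import numpy as np
import pandas as pd
# Spark 2.3+ uses "spark.sql.execution.arrow.enabled"; Spark 3.x uses the key below
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
np_array = np.random.rand(10_000, 500)  # smaller stand-in for the real arrays
pandas_df = pd.DataFrame(np_array)
spark_df = spark.createDataFrame(pandas_df)  # should now go through Arrow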