How to override type inference for partition columns in Hive partitioned dataset using PyArrow? - pandas

I have a partition column in my Hive-style partitioned parquet dataset (written by PyArrow from Pandas Dataframe) with an entry like "TYPE=3860877578". When trying to read from this dataset, I get an error:
ArrowInvalid: error parsing '3760212050' as scalar of type int32
This is the first partition key that does not fit into an int32 (other partitions have smaller integer values, so I think the type is inferred from the first one encountered). It looks like it should be possible to override the inferred type (to int64 or even string) at the dataset level, but I can't figure out how to get there from here :-) So far I have been using the pandas.read_parquet() interface and passing filters, columns, etc. down to PyArrow. I think I will need to use the PyArrow APIs directly, but don't know where to start.
How can I tell PyArrow to treat this column as an int64 or string type instead of trying to infer the type?
Example of dataset partition values that causes this problem:
/mydataset/TYPE=12345/*.parquet
/mydataset/TYPE=3760212050/*.parquet
Code that reproduces the problem with Pandas 1.1.1 and PyArrow 1.0.1:
import pandas as pd
# pyarrow is available and used
df = pd.read_parquet("mydataset")
The issue can't be avoided by filtering out the problematic value, because the partition values all appear to be parsed before the filter is applied, i.e.:
import pandas as pd
# pyarrow is available and used
df = pd.read_parquet("mydataset", filters=[[('TYPE','=','12345')]])
I figured out since my original post that I can do what I want with the PyArrow API directly like this:
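import pyarrow as pa
from pyarrow.dataset import HivePartitioning
from pyarrow.parquet import ParquetDataset

# Example filter only; use whatever you would otherwise pass to read_parquet()
filters = [[("TYPE", "=", 12345)]]
# Declare the partition column type explicitly instead of letting PyArrow infer it
partitioning = HivePartitioning(pa.schema([("TYPE", pa.int64())]))
df = ParquetDataset("mydataset",
                    filters=filters,
                    partitioning=partitioning,
                    use_legacy_dataset=False).read_pandas().to_pandas()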
I'd like to be able to pass that info down through the Pandas read_parquet() interface but it doesn't appear possible at this time.

Related

Combining pandas dataframes and ArcGIS feature classes

I have a bit of a general question about the compatibility of Pandas dataframes and Arc featureclasses.
My current project is within ArcGIS, so I am mapping mostly with feature classes. I am, however, most familiar with using pandas to perform simple data analysis on tables. Therefore, I am attempting to work with DataFrames for the most part, and then join their data to feature classes for final mapping using a key field common to both sets.
Attempts:
1. I have found that arcpy AddJoin does not accept DataFrames.
2. I am currently trying to convert the DataFrame to CSV and then do an AddJoin, but I am unsure whether this is supported, and I far prefer filtering DataFrames with "df.loc" etc.
An update cursor seems to be a good option; however, I am having trouble accessing the key field of the "row" in my loop to match records. I will post another question about this, as it is a separate issue.
Which of these or other options is the best for this purpose?
Thanks!
Esri introduced something called Spatially Enabled DataFrame:
The Spatially Enabled DataFrame inserts a custom namespace called spatial into the popular Pandas DataFrame structure to give it spatial abilities. This allows you to use intuitive, pandorable operations on both the attribute and spatial columns.
import arcpy
import pandas as pd
# important as it "enhances" Pandas by importing these classes
from arcgis.features import GeoAccessor, GeoSeriesAccessor
# from a shape file
df = pd.DataFrame.spatial.from_featureclass(r"data\hospitals.shp")
# from a map layer
project = arcpy.mp.ArcGISProject('CURRENT')
map = project.activeMap
first_layer = map.listLayers()[0]
layer_name = first_layer.name
df = pd.DataFrame.spatial.from_featureclass(layer_name)
# or directly by name
df = pd.DataFrame.spatial.from_featureclass("Streets")
# or if nested within a group layer (e.g. Buildings)
df = pd.DataFrame.spatial.from_featureclass(r"Buildings\Residential")
# save to shapefile
df.spatial.to_featureclass(location=r"c:\temp\residential_buildings.shp")
However, you have to use intermediate files if you go back and forth (to my knowledge). Although it's a bit tricky having geopandas installed along arcpy, it may be worth looking into (only) using geopandas.
IMHO, I would recommend that you avoid unnecessarily going back and forth between arcpy and pandas. Pandas lets you merge, join and concat DataFrames. Or you may be able to do everything in geopandas without needing to touch arcpy functions at all.
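As a rough illustration of staying in pandas for the join step, here is a small sketch. The two frames and the key field "HOSP_ID" are invented for the example; in practice the first frame would be something loaded with from_featureclass() and the second your analysis results.

```python
import pandas as pd

# Hypothetical attribute table, standing in for a frame loaded from a feature class.
features = pd.DataFrame({"HOSP_ID": [1, 2, 3],
                         "NAME": ["North", "South", "East"]})

# Hypothetical analysis results computed in plain pandas.
results = pd.DataFrame({"HOSP_ID": [1, 3],
                        "visits": [120, 45]})

# A left join keeps every feature; features without a match get NaN.
joined = features.merge(results, on="HOSP_ID", how="left")
print(joined)
```

With how="left" no feature is dropped, which is usually what you want before writing the result back out for mapping.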

joblib.Memory and pandas.DataFrame inputs

I've been finding that joblib.Memory.cache results in unreliable caching when using dataframes as inputs to the decorated functions. Playing around, I found that joblib.hash results in inconsistent hashes, at least in some cases. If I understand correctly, joblib.hash is used by joblib.Memory, so this is probably the source of the problem.
Problems seem to occur when new columns are added to dataframes followed by a copy, or when a dataframe is saved and loaded from disk. The following example compares the inconsistent hash output when applied to dataframes, or the consistent results when applied to the equivalent numpy data.
import pandas as pd
import joblib
df = pd.DataFrame({'A':[1,2,3],'B':[4.,5.,6.], })
df.index.name='MyInd'
df['B2'] = df['B']**2
df_copy = df.copy()
df_copy.to_csv("df.csv")
df_fromfile = pd.read_csv('df.csv').set_index('MyInd')
print("DataFrame Hashes:")
print(joblib.hash(df))
print(joblib.hash(df_copy))
print(joblib.hash(df_fromfile))
def _to_tuple(df):
    return (df.values, df.columns.values, df.index.values, df.index.name)
print("Equivalent Numpy Hashes:")
print(joblib.hash(_to_tuple(df)))
print(joblib.hash(_to_tuple(df_copy)))
print(joblib.hash(_to_tuple(df_fromfile)))
results in output:
DataFrame Hashes:
4e9352c1ffc14fb4bb5b1a5ad29a3def
2d149affd4da6f31bfbdf6bd721e06ef
6843f7020cda9d4d3cbf05dfc47542d4
Equivalent Numpy Hashes:
6ad89873c7ccbd3b76ae818b332c1042
6ad89873c7ccbd3b76ae818b332c1042
6ad89873c7ccbd3b76ae818b332c1042
The "Equivalent Numpy Hashes" output is the behavior I'd like. I'm guessing the problem is due to some kind of complex internal metadata that DataFrames utilize. Is there any canonical way to use joblib.Memory.cache on pandas DataFrames so that it caches based on the data values only?
A "good enough" workaround would be if there is a way a user can tell joblib.Memory.cache to utilize something like my _to_tuple function above for specific arguments.
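One possible direction, sketched here without joblib, is to derive a key from the values only. The helper name frame_key is made up, and it swaps in pandas.util.hash_pandas_object plus hashlib rather than joblib.hash, so it is a value-based key in the spirit of _to_tuple and not a drop-in for joblib.Memory.

```python
import hashlib
import io

import pandas as pd


def frame_key(df):
    """Hypothetical helper: hash a DataFrame by values, columns and index only."""
    # One uint64 hash per row, computed from the cell values and the index.
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    digest = hashlib.sha256()
    digest.update(row_hashes.values.tobytes())
    # Column labels and the index name are not covered by the row hashes.
    digest.update(repr(list(df.columns)).encode())
    digest.update(repr(df.index.name).encode())
    return digest.hexdigest()


df = pd.DataFrame({'A': [1, 2, 3], 'B': [4., 5., 6.]})
df.index.name = 'MyInd'
df['B2'] = df['B'] ** 2
df_copy = df.copy()
# Round trip through CSV text in memory instead of a file on disk.
df_fromfile = pd.read_csv(io.StringIO(df.to_csv())).set_index('MyInd')
df_changed = df.assign(A=[9, 2, 3])

for name, frame in [("df", df), ("df_copy", df_copy),
                    ("df_fromfile", df_fromfile), ("df_changed", df_changed)]:
    print(name, frame_key(frame))
```

A key like this could then be passed to the cached function as an extra argument, with the DataFrame itself excluded from hashing, if the caching layer allows that.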

Writing data frame with object dtype to HDF5 only works after converting to string

I have a large DataFrame and I want to write it to disk for quick retrieval. I believe to_hdf(...) infers the data type of the columns and sometimes gets it wrong. I wonder what the correct way to cope with this is.
import pandas as pd
import numpy as np
length = 10
df = pd.DataFrame({"a": np.random.randint(1e7, 1e8, length),})
# df.loc[1, "a"] = "abc"
# df["a"] = df["a"].astype(str)
print(df.dtypes)
df.to_hdf("df.hdf5", key="data", format="table")
Uncommenting various lines leads me to the following.
Just filling the column with numbers leads to data type int32, and the data is stored without problems.
Setting one element to "abc" changes the dtype to object, but it seems that to_hdf internally infers another data type and throws an error: TypeError: object of type 'int' has no len()
Explicitly converting the column to str leads to success, and to_hdf stores the data.
Now I am wondering what is happening in the second case, and whether there is a way to prevent it. The only way I found was to go through all columns, check whether they are dtype('O'), and explicitly convert them to str.
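The column-by-column workaround described above could look roughly like this. The helper name stringify_object_columns is made up for the example, and the to_hdf call is left as a comment because it needs PyTables installed.

```python
import pandas as pd


def stringify_object_columns(df):
    """Hypothetical helper: return a copy with every object column cast to str."""
    out = df.copy()
    for col in [c for c in out.columns if out[c].dtype == object]:
        out[col] = out[col].astype(str)
    return out


# Mixed int/str values make column "a" an object column; "b" stays float64.
df = pd.DataFrame({"a": [12345678, "abc", 23456789], "b": [1.0, 2.0, 3.0]})
clean = stringify_object_columns(df)
print(clean.dtypes)

# With PyTables available, the cleaned frame can then be stored:
# clean.to_hdf("df.hdf5", key="data", format="table")
```

Note that this changes the stored values to strings, so numeric entries in such a column have to be converted back after loading.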
Instead of using HDF5, I have found a generic pickling library that seems to be perfect for the job: joblib
Storing and loading data is straightforward:
import joblib
joblib.dump(df, "file.jl")
df2 = joblib.load("file.jl")

Implementing pythonic statistical functions on spark dataframes

I have very large datasets in spark dataframes that are distributed across the nodes.
I can do simple statistics like mean, stdev, skewness, kurtosis etc. using the Spark library pyspark.sql.functions.
If I want to use advanced statistical tests like Jarque-Bera (JB) or Shapiro-Wilk (SW) etc., I use Python libraries like scipy, since the standard Apache PySpark libraries don't have them. But in order to do that, I have to convert the Spark DataFrame to pandas, which means forcing the data onto the master node like so:
import scipy.stats as stats
pandas_df=spark_df.toPandas()
JBtest=stats.jarque_bera(pandas_df)
SWtest=stats.shapiro(pandas_df)
I have multiple features, and each feature ID corresponds to a dataset on which I want to perform the test statistic.
My question is:
Is there a way to apply these pythonic functions on a spark dataframe while the data is still distributed across the nodes, or do I need to create my own JB/SW test statistic functions in spark?
Thank you for any valuable insight
You should be able to define a vectorized user-defined function that wraps the Pandas function (https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html), like this:
from pyspark.sql.functions import pandas_udf, PandasUDFType
import scipy.stats as stats

@pandas_udf('double', PandasUDFType.SCALAR)
def vector_jarque_bera(x):
    # Caveat: stats.jarque_bera returns a single (statistic, p-value) pair for the whole
    # input, while a SCALAR UDF must return one value per row; a grouped-aggregate UDF may fit better.
    return stats.jarque_bera(x)

# test:
spark_df.withColumn('y', vector_jarque_bera(spark_df['x']))
Note that the vectorized function takes a column as its argument and returns a column.
(N.B. The @pandas_udf decorator is what transforms the Pandas function defined right below it into a vectorized function. Each element of the returned vector is itself a scalar, which is why the argument PandasUDFType.SCALAR is passed.)
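On the "write my own JB statistic" part of the question: the Jarque-Bera statistic depends only on the sample size, skewness and excess kurtosis, JB = n/6 * (S^2 + K^2/4), so it may be possible to build it from the aggregates you already compute in Spark. The sketch below shows the formula in plain NumPy, with a made-up function name and the biased (population) moments. Whether Spark's skewness and kurtosis use exactly these definitions is an assumption worth checking before relying on it.

```python
import numpy as np


def jarque_bera_stat(x):
    """Hypothetical helper: JB statistic from biased sample moments."""
    x = np.asarray(x, dtype=float)
    n = x.size
    d = x - x.mean()
    m2 = np.mean(d ** 2)                       # biased variance
    skew = np.mean(d ** 3) / m2 ** 1.5         # sample skewness S
    excess_kurt = np.mean(d ** 4) / m2 ** 2 - 3.0  # excess kurtosis K
    return n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)


sample = [-2.0, -1.0, 0.0, 1.0, 2.0]
jb = jarque_bera_stat(sample)
print(jb)
```

Because the inputs are three per-group aggregates, the same arithmetic could in principle be applied per feature ID after a groupBy, without collecting the raw data. Shapiro-Wilk has no comparably simple closed form.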

Converting NaN floats to other types in Parquet format

I am currently processing a bunch of CSV files and transforming them into Parquet. I use these with Hive and query the files directly. I would like to switch over to Dask for my data processing. The data I am reading has optional columns, some of which are Boolean types. I know Pandas does not support optional bool types at this time, but is there any way to specify to either FastParquet or PyArrow what type I would like a field to be? I am fine with the data being a float64 in my DF, but can't have it as such in my Parquet store because existing files already use an optional Boolean type.
You should try using the fastparquet engine, and the following keyword argument
object_encoding={'bool_col': 'bool'}
Also, pandas now allows boolean columns with NaNs as an extension type, but it is not yet the default. That should work directly.
Example
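import fastparquet as fp
import pandas as pd

# object column holding 0, 1 and a missing-value marker
df = pd.DataFrame({'a': [0, 1, 'nan']})
fp.write('out.parq', df, object_encoding={'a': 'bool'})
# the same column as float64 (the string 'nan' becomes NaN)
fp.write('out.parq', df.astype(float), object_encoding={'a': 'bool'})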
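For the extension-type route mentioned above, a minimal sketch of converting a float64 column with NaNs to the nullable "boolean" dtype might look like this. The column name "flag" is made up, and whether a given parquet engine then writes it as an optional boolean field is an assumption to verify against your existing files.

```python
import numpy as np
import pandas as pd

# A float64 column where NaN marks a missing boolean.
df = pd.DataFrame({"flag": [0.0, 1.0, np.nan]})

# Cast to the nullable boolean extension dtype; NaN becomes <NA>.
out = df.astype({"flag": "boolean"})
print(out.dtypes)
print(out)

# With a parquet engine installed, the next step would be something like:
# out.to_parquet("out.parq")
```

The cast only succeeds when the non-missing values are 0 or 1, which acts as a cheap sanity check on the column.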