Polybase from Parquet error: Cannot cast Java.lang.Double to - pandas

I'm loading Azure SQL Data Warehouse via PolyBase, reading Parquet files that sit on Azure Blob Storage.
First I created an external table in SQL pointing to the Parquet file, and then I load it with CTAS. No matter what data type I use in SQL, I get the type-casting error below. I've tried DECIMAL, NUMERIC, and FLOAT; loading into VARCHAR works fine, though.
I suspect it has something to do with how the Parquet file was created, which is from a Python pandas DataFrame using df.to_parquet with pyarrow. Digging into the source code and experimenting, I see that the data type once it's in Arrow (the step before Parquet) is Double. Maybe that's why?
Also, I tried both Gzip and Snappy as the compression type, both when creating the file and when creating the SQL external file format. No dice.
Going crazy from this. Any ideas?
Steps to reproduce
Environment:
conda create -n testenv python=3.6
conda install -n testenv -c conda-forge pyarrow
conda list -n testenv
# Name Version Build Channel
arrow-cpp 0.13.0 py36hee3af98_1 conda-forge
boost-cpp 1.68.0 h6a4c333_1000 conda-forge
brotli 1.0.7 he025d50_1000 conda-forge
ca-certificates 2019.3.9 hecc5488_0 conda-forge
certifi 2019.3.9 py36_0 conda-forge
gflags 2.2.2 he025d50_1001 conda-forge
glog 0.3.5 h6538335_1
intel-openmp 2019.3 203
libblas 3.8.0 5_mkl conda-forge
libcblas 3.8.0 5_mkl conda-forge
liblapack 3.8.0 5_mkl conda-forge
libprotobuf 3.7.1 h1a1b453_0 conda-forge
lz4-c 1.8.1.2 h2fa13f4_0
mkl 2019.3 203
numpy 1.16.2 py36h8078771_1 conda-forge
openssl 1.1.1b hfa6e2cd_2 conda-forge
pandas 0.24.2 py36h6538335_0 conda-forge
parquet-cpp 1.5.1 2 conda-forge
pip 19.0.3 py36_0
pyarrow 0.13.0 py36h8c67754_0 conda-forge
python 3.6.8 h9f7ef89_7
python-dateutil 2.8.0 py_0 conda-forge
pytz 2019.1 py_0 conda-forge
re2 2019.04.01 vc14h6538335_0 [vc14] conda-forge
setuptools 41.0.0 py36_0
six 1.12.0 py36_1000 conda-forge
snappy 1.1.7 h6538335_1002 conda-forge
sqlite 3.27.2 he774522_0
thrift-cpp 0.12.0 h59828bf_1002 conda-forge
vc 14.1 h0510ff6_4
vs2015_runtime 14.15.26706 h3a45250_0
wheel 0.33.1 py36_0
wincertstore 0.2 py36h7fe50ca_0
zlib 1.2.11 h2fa13f4_1004 conda-forge
zstd 1.3.3 vc14_1 conda-forge
Python:
>>> import pandas as pd
>>> df = pd.DataFrame({'ticker':['AAPL','AAPL','AAPL'],'price':[101,102,103]})
>>> df
ticker price
0 AAPL 101
1 AAPL 102
2 AAPL 103
>>> df.to_parquet('C:/aapl_test.parquet',engine='pyarrow',compression='snappy',index=False)
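(A quick way to confirm what actually got written, as a sketch assuming the file path used above: pyarrow can print the Parquet schema of the file.)
import pyarrow.parquet as pq
# Inspect the physical types pyarrow wrote; with integer inputs like 101, 102, 103
# the price column should show up as INT64, i.e. a long, not a double.
print(pq.ParquetFile('C:/aapl_test.parquet').schema)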
Azure steps:
Uploaded the Parquet file to Azure Blob
Using Azure Data Warehouse Gen2, size: DW400c
Per the docs and a tutorial, created the DATABASE SCOPED CREDENTIAL, EXTERNAL DATA SOURCE, and EXTERNAL FILE FORMAT
SQL Code:
CREATE EXTERNAL FILE FORMAT [ParquetFileSnappy] WITH (
FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = N'org.apache.hadoop.io.compress.SnappyCodec'
)
GO
CREATE EXTERNAL DATA SOURCE [AzureBlobStorage] WITH (
TYPE = HADOOP,
LOCATION = N'wasbs://[redacted: containerName]@[redacted: storageAccountName].blob.core.windows.net',
CREDENTIAL = [AzureQuantBlobStorageCredential] -- created earlier
)
GO
CREATE EXTERNAL TABLE ext.technicals(
[ticker] VARCHAR(5) NOT NULL ,
[close_px] DECIMAL(8,2) NULL
) WITH (
LOCATION='/aapl_test.parquet',
DATA_SOURCE=AzureBlobStorage,
FILE_FORMAT=ParquetFileSnappy
);
CREATE TABLE [dbo].TechnicalFeatures
WITH
(
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED COLUMNSTORE INDEX
)
AS SELECT * FROM [ext].technicals
OPTION (LABEL = 'CTAS : Load [dbo].[TechnicalFeatures]')
;
And here is the error:
Msg 106000, Level 16, State 1, Line 20
HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: ClassCastException: class java.lang.Long cannot be cast to class parquet.io.api.Binary (java.lang.Long is in module java.base of loader 'bootstrap'; parquet.io.api.Binary is in unnamed module of loader 'app')
Edit:
Also tried using fastparquet instead of pyarrow, same error.

I repeated your Python file creation ... you owe me a beer for the pain and suffering inflicted by an Anaconda install ;)
Examining the file with parquet-tools shows that your data values are being written as long integers (101, 102, 103), but you're trying to map them to decimals in your CREATE EXTERNAL TABLE statement.
If you change the DECIMAL(8,2) to BIGINT, your data will load.
Alternatively, write your data values as doubles by adding a decimal point (101.0, 102.0, 103.0); then you can read them by changing DECIMAL(8,2) to DOUBLE PRECISION, or even FLOAT, since they are small, exactly representable numbers in this case.
(just kidding about the beer)
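A minimal sketch of the second option, assuming the same DataFrame as in the question: cast the column to float before writing, so the Parquet file stores doubles instead of longs.
import pandas as pd

df = pd.DataFrame({'ticker': ['AAPL', 'AAPL', 'AAPL'], 'price': [101, 102, 103]})
df['price'] = df['price'].astype('float64')  # int64 -> float64, so Parquet stores DOUBLE
df.to_parquet('C:/aapl_test.parquet', engine='pyarrow', compression='snappy', index=False)
The external table column can then be declared as FLOAT or DOUBLE PRECISION instead of DECIMAL(8,2).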

Related

`conda search PKG --info` shows different dependencies than what conda wants to install?

I'm building a new conda environment using python=3.9 for the
osx-arm64 architecture.
conda create -n py39 python=3.9 numpy
conda list
...
numpy 1.21.1 py39h1a24bff_2
...
python 3.9.7 hc70090a_1
So far so good: numpy=1.21.1 is the one I want. Now I want to add
scipy, and the first one seems to fit the bill:
conda search scipy --info
scipy 1.7.1 py39h2f0f56f_2
--------------------------
file name : scipy-1.7.1-py39h2f0f56f_2.conda
name : scipy
version : 1.7.1
build : py39h2f0f56f_2
build number: 2
size : 14.8 MB
license : BSD 3-Clause
subdir : osx-arm64
url : https://repo.anaconda.com/pkgs/main/osx-arm64/scipy-1.7.1-py39h2f0f56f_2.conda
md5 : edbd5a5399e973d1d0325147b7118f79
timestamp : 2021-08-25 16:12:39 UTC
dependencies:
- blas * openblas
- libcxx >=12.0.0
- libgfortran 5.*
- libgfortran5 >=11.1.0
- libopenblas >=0.3.17,<1.0a0
- numpy >=1.19.5,<2.0a0
- python >=3.9,<3.10.0a0
In particular, python >=3.9 and numpy >=1.19 seem just right.
But when I try the install:
conda install scipy
...
The following packages will be DOWNGRADED:
numpy 1.21.1-py39h1a24bff_2 --> 1.19.5-py39habd9f23_3
(I have bumped into various constraints with numpy=1.19 (numba,
pandas) and am trying to avoid it.)
Why isn't the scipy package happy with the numpy=1.21 version I
have?!
The only possible clue is that conda reports a different python
version (3.8.11) than the v3.9 I specified for this environment:
conda info
active environment : py39
active env location : .../miniconda3/envs/py39
shell level : 1
user config file : .../.condarc
populated config files : .../.condarc
conda version : 4.11.0
conda-build version : not installed
python version : 3.8.11.final.0 <-------------------
virtual packages : __osx=12.1=0
...
but all the environment's pointers seem to be set correctly:
(py39) % which python
.../miniconda3/envs/py39/bin/python
(py39) % python
Python 3.9.7 (default, Sep 16 2021, 23:53:23)
[Clang 12.0.0 ] :: Anaconda, Inc. on darwin
Thanks, any hints as to what's broken will be greatly appreciated!
I now have things working, but I'm afraid I can't point to a satisfying "answer." Others (e.g. @merv) seem not to be having the same problems, and I can't identify the difference.
The one thing I did find that seemed to create issues in my install was what seems to be some mislabeling of the pandas package: pandas v1.3.5 breaks a numpy==1.19.5 requirement that is the only way I've been able to push it through. I posted a pandas issue comment.

'Could not find a version that satisfies the requirement matplotlib==3.4.3' problem on Python 3.10

I need help with pip install -r matplotlib==3.4.3 on Python 3.10.
Here's my CMD output:
Collecting matplotlib==3.4.3
Downloading matplotlib-3.4.3.tar.gz (37.9 MB)
Preparing metadata (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: 'C:\Users\eob_o\venv\Scripts\python.exe' -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'
C:\\Users\\eob_o\\AppData\\Local\\Temp\\pip-install-txwy9aql\\matplotlib_201a53d35123474cbeaa8a08acd5c0c5\\setup.py'"'"'
ERROR: Command errored out with exit status 1:
command: 'C:\Users\eob_o\venv\Scripts\python.exe' 'C:\Users\eob_o\venv\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py' build_wheel
'C:\Users\eob_o\AppData\Local\Temp\tmpqjub6dxu'
Complete output (200 lines):
setup.py:63: RuntimeWarning: NumPy 1.21.2 may not yet support Python 3.10.
warnings.warn(#Running from numpy source directory.
C:\Users\eob_o\AppData\Local\Temp\pip-wheel_qobiqz_\numpy_24f149b83cd943538729a21c1b35fa75\tools\cythonize.py:69:
DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives from distutils.version import LooseVersion
Processing numpy/random\_bounded_integers.pxd.in
Processing numpy/random\bit_generator.pyx
Processing numpy/random\mtrand.pyx
Processing numpy/random\_bounded_integers.pyx.in
Processing numpy/random\_common.pyx
Processing numpy/random\_generator.pyx
Processing numpy/random\_mt19937.pyx
Processing numpy/random\_pcg64.pyx
Processing numpy/random\_philox.pyx
Processing numpy/random\_sfc64.pyx
...
BUILDING MATPLOTLIB
matplotlib: yes [3.4.3]
python: yes [3.10.0 (tags/v3.10.0:b494f59, Oct 4 2021, 19:00:18) [MSC
v.1929 64 bit (AMD64)]]
platform: yes [win32]
tests: no [skipping due to configuration]
macosx: no [Mac OS-X only]
----------------------------------------
WARNING: Discarding
https://files.pythonhosted.org/packages/21/37/197e68df384ff694f78d687a49ad39f96c67b8d75718bc61503e1676b617/matplotlib-3.4.3.tar.gz#sha256=fc4f526dfdb31c9bd6b8ca06bf9fab663ca12f3ec9cdf4496fb44bc680140318 (from https://pypi.org/simple/matplotlib/) (requires-python:>=3.7).
Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement matplotlib==3.4.3 (from versions: 0.86, 0.86.1, 0.86.2, 0.91.0, 0.91.1, 1.0.1, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0, 1.3.1, 1.4.0, 1.4.1rc1, 1.4.1, 1.4.2, 1.4.3, 1.5.0, 1.5.1, 1.5.2, 1.5.3, 2.0.0b1, 2.0.0b2, 2.0.0b3, 2.0.0b4, 2.0.0rc1, 2.0.0rc2, 2.0.0, 2.0.1, 2.0.2, 2.1.0rc1, 2.1.0, 2.1.1, 2.1.2, 2.2.0rc1, 2.2.0, 2.2.2, 2.2.3, 2.2.4, 2.2.5, 3.0.0rc2, 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0rc1, 3.1.0rc2, 3.1.0, 3.1.1, 3.1.2, 3.1.3, 3.2.0rc1, 3.2.0rc3, 3.2.0, 3.2.1, 3.2.2, 3.3.0rc1, 3.3.0, 3.3.1, 3.3.2, 3.3.3, 3.3.4, 3.4.0rc1, 3.4.0rc2, 3.4.0rc3, 3.4.0, 3.4.1, 3.4.2, 3.4.3, 3.5.0b1, 3.5.0rc1)
ERROR: No matching distribution found for matplotlib==3.4.3
By following the clue setup.py:63: RuntimeWarning: NumPy 1.21.2 may not yet support Python 3.10., I decided to uninstall Python 3.10 and replace it with Python 3.9. That solved my problem!

Conda package bug? Binary incompatibility

I'm working in a remote Jupyter notebook on a system where I don't have root access, or even a shell in which to make many adjustments. I can retrieve packages from Conda's archive and run functions in notebook cells that install packages like this
!conda install /path/to/package-vvv.tar.bz2
I've run into situations where I guess wrong on the version number and install something that is incompatible. The error messages are like the one I reproduce below: a binary incompatibility in numpy or mkl.
Now I'm re-tracing the problem on an Ubuntu 20.10 notebook where I have admin access, so I have a reproducible problem to show and share.
Create an environment with the same versions of python, numpy, and pandas as on the remote machine:
$ conda create -n cenv-py368 python=3.6.8 pandas=1.1.2 numpy=1.15.4
Solving environment: done
==> WARNING: A newer version of conda exists. <==
current version: 4.5.12
latest version: 4.9.2
Please update conda by running
$ conda update -n base -c defaults conda
## Package Plan ##
environment location: /home/pauljohn/LinuxDownloads/miniconda3/envs/cenv-py368
added / updated specs:
- numpy=1.15.4
- pandas=1.1.2
- python=3.6.8
The following packages will be downloaded:
package | build
---------------------------|-----------------
libffi-3.2.1 | hf484d3e_1007 52 KB
python-3.6.8 | h0371630_0 34.4 MB
libgcc-ng-9.1.0 | hdf63c60_0 8.1 MB
libstdcxx-ng-9.1.0 | hdf63c60_0 4.0 MB
blas-1.0 | mkl 6 KB
_libgcc_mutex-0.1 | main 3 KB
------------------------------------------------------------
Total: 46.6 MB
The following NEW packages will be INSTALLED:
_libgcc_mutex: 0.1-main
blas: 1.0-mkl
ca-certificates: 2021.1.19-h06a4308_0
certifi: 2020.12.5-py36h06a4308_0
intel-openmp: 2020.2-254
libedit: 3.1.20191231-h14c3975_1
libffi: 3.2.1-hf484d3e_1007
libgcc-ng: 9.1.0-hdf63c60_0
libgfortran-ng: 7.3.0-hdf63c60_0
libstdcxx-ng: 9.1.0-hdf63c60_0
mkl: 2020.2-256
mkl-service: 2.3.0-py36he8ac12f_0
mkl_fft: 1.2.0-py36h23d657b_0
mkl_random: 1.1.1-py36h0573a6f_0
ncurses: 6.2-he6710b0_1
numpy: 1.15.4-py36h7e9f1db_0
numpy-base: 1.15.4-py36hde5b4d6_0
openssl: 1.1.1i-h27cfd23_0
pandas: 1.1.2-py36he6710b0_0
pip: 20.3.3-py36h06a4308_0
python: 3.6.8-h0371630_0
python-dateutil: 2.8.1-pyhd3eb1b0_0
pytz: 2021.1-pyhd3eb1b0_0
readline: 7.0-h7b6447c_5
setuptools: 52.0.0-py36h06a4308_0
six: 1.15.0-pyhd3eb1b0_0
sqlite: 3.33.0-h62c20be_0
tk: 8.6.10-hbc83047_0
wheel: 0.36.2-pyhd3eb1b0_0
xz: 5.2.5-h7b6447c_0
zlib: 1.2.11-h7b6447c_3
Proceed ([y]/n)? y
Downloading and Extracting Packages
libffi-3.2.1 | 52 KB | ##################################### | 100%
python-3.6.8 | 34.4 MB | ##################################### | 100%
libgcc-ng-9.1.0 | 8.1 MB | ##################################### | 100%
libstdcxx-ng-9.1.0 | 4.0 MB | ##################################### | 100%
blas-1.0 | 6 KB | ##################################### | 100%
_libgcc_mutex-0.1 | 3 KB | ##################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
# $ conda activate cenv-py368
#
# To deactivate an active environment, use
#
# $ conda deactivate
Activate that environment.
Install, for example, the package called "fastparquet":
(cenv-py368) $ conda install fastparquet
Solving environment: done
==> WARNING: A newer version of conda exists. <==
current version: 4.5.12
latest version: 4.9.2
Please update conda by running
$ conda update -n base -c defaults conda
## Package Plan ##
environment location: /home/pauljohn/LinuxDownloads/miniconda3/envs/cenv-py368
added / updated specs:
- fastparquet
The following packages will be downloaded:
package | build
---------------------------|-----------------
pyparsing-2.4.7 | pyhd3eb1b0_0 59 KB
packaging-20.9 | pyhd3eb1b0_0 35 KB
------------------------------------------------------------
Total: 95 KB
The following NEW packages will be INSTALLED:
fastparquet: 0.5.0-py36h6323ea4_1
libllvm10: 10.0.1-hbcb73fb_5
llvmlite: 0.34.0-py36h269e1b5_4
numba: 0.51.2-py36h0573a6f_1
packaging: 20.9-pyhd3eb1b0_0
pyparsing: 2.4.7-pyhd3eb1b0_0
thrift: 0.11.0-py36hf484d3e_0
Proceed ([y]/n)? y
Downloading and Extracting Packages
pyparsing-2.4.7 | 59 KB | ##################################### | 100%
packaging-20.9 | 35 KB | ##################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Observe the failure on import:
(cenv-py368) $ python
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import fastparquet
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/pauljohn/LinuxDownloads/miniconda3/envs/cenv-py368/lib/python3.6/site-packages/fastparquet/__init__.py", line 5, in <module>
from .core import read_thrift
File "/home/pauljohn/LinuxDownloads/miniconda3/envs/cenv-py368/lib/python3.6/site-packages/fastparquet/core.py", line 9, in <module>
from . import encoding
File "/home/pauljohn/LinuxDownloads/miniconda3/envs/cenv-py368/lib/python3.6/site-packages/fastparquet/encoding.py", line 13, in <module>
from .speedups import unpack_byte_array
File "fastparquet/speedups.pyx", line 1, in init fastparquet.speedups
ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject
Do you agree I found a bug?
Seems like either Conda should work, or it should say there is no compatible version of fastparquet.
That error usually indicates that NumPy is older than the version the library using it (in this case fastparquet) was built against. Try updating Python to 3.7 or 3.8; Python 3.6 and NumPy 1.15 are not within the recommended support window today. (Updating Python to 3.7+ should also update NumPy; this does not usually happen when you just run conda update ....) Some recipes pin a minimum compatible version; this one did not seem to.
https://numpy.org/neps/nep-0029-deprecation_policy.html#support-table
It is a flaw in the preparation of some Python libraries you are importing. When the authors of a package like fastparquet do not correctly set the minimum compatible version of numpy or python for their package, the Conda environment reconciliation has no way to know that the package is incorrect. Conda offers up the package as a solution, although in fact it is not.
In a larger sense, this is a flaw in the way Conda finds compatible packages. Perhaps it is working as intended, so it is not a bug. But it is a flaw, in the sense that when the user pins numpy=1.15, the correct answer from Conda should be "there is no compatible package". However, because Conda relies on the version dependencies declared by contributed packages, it is not able to say so.
I've not encountered the same problem with packaging for Red Hat or Debian Linux systems; they tend to report "nothing" rather than provide an inaccurate match.

Inconsistent scipy.find_peaks results from pandas_udf with pyspark 3.0

I'm trying to use scipy's find_peaks inside pyspark's pandas_udf. A bare-bones example:
from pyspark.sql import SparkSession, SQLContext, Row
from pyspark.sql.functions import pandas_udf, col
from pyspark.sql.types import DoubleType
import pandas as pd
import numpy as np
from scipy.signal import find_peaks
spark = SparkSession.builder.master("yarn") \
.appName("UDF_debug") \
.config("spark.yarn.dist.archives", "hdfs://PATH/TO/MY/USERFOLDER/envs/my_env.zip#MYENV")\
.config("spark.submit.deployMode", "client")\
.config("spark.yarn.queue", "root.dev")\
.enableHiveSupport()\
.getOrCreate()
# Create a sample dataframe and a corresponding pandas data frame, for cross-checking
df = spark.createDataFrame(
[Row(id=1, c=3),
Row(id=2, c=6),
Row(id=3, c=2),
Row(id=4, c=9),
Row(id=5, c=7)])
dfp = df.toPandas()
def peak_finder(C: pd.Series) -> pd.Series:
    # Find peaks (maxima)
    pos_peaks, pos_properties = find_peaks(C)
    # Create an empty series of appropriate length
    r = pd.Series(np.full(len(C), np.nan))
    # Wherever a peak was found ...
    for idx in pos_peaks:
        # ... mark it by noting its height
        r[idx] = C[idx]
    return r
# Peak finding using pyspark's pandas_udf
peak_finder_udf = pandas_udf(peak_finder, returnType=DoubleType())
df = df.withColumn('peak', peak_finder_udf(df.c))
df.show()
# Peak finding directly on a pandas df
dfp["peaks_pandas"] = peak_finder(dfp["c"])
print(dfp)
The result of the two prints is as follows. First, peak finding with pandas_udf:
+---+---+----+
| id| c|peak|
+---+---+----+
| 1| 3|null|
| 2| 6|null|
| 3| 2|null|
| 4| 9| 9.0|
| 5| 7|null|
+---+---+----+
Second, using just stock pandas and numpy on the edge node:
id c peaks_pandas
0 1 3 NaN
1 2 6 6.0
2 3 2 NaN
3 4 9 9.0
4 5 7 NaN
The line with id=2 is inconsistent.
This might be understandable from the pyspark documentation, stating:
Internally, PySpark will execute a Pandas UDF by splitting columns
into batches and calling the function for each batch as a subset of
the data, then concatenating the results together.
It seems weird that such a small split should occur, but maybe ...
Question 1: Is this inconsistent behaviour expected? Can I avoid it?
EDIT: Answer: Yes, this is due to partitioning. See my comment below.
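(A quick way to confirm the partitioning explanation, assuming the df built above: check how many partitions the five rows are spread across.)
# Each partition is handed to the pandas_udf separately, so if the 5 rows are split
# across several partitions, find_peaks never sees the full series at once.
print(df.rdd.getNumPartitions())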
Another weird behaviour might point to the solution (but to me raises further questions). Continuing with the above code:
fname = "debug.parquet"
df.dropna().write.parquet(fname)
dfnew = spark.read.parquet(fname)
dfnew.show()
It yields the result
+---+---+----+
| id| c|peak|
+---+---+----+
| 4| 9|null|
+---+---+----+
The peak is no longer 9.0, as it should be, but null.
Question 2: Can anyone explain this data loss during saving?
Relevant packages in conda env:
# Name Version Build Channel
_libgcc_mutex 0.1 main defaults
arrow-cpp 0.15.1 py38h7cd5009_5 defaults
attrs 19.3.0 py_0 defaults
backcall 0.2.0 py_0 defaults
blas 1.0 mkl defaults
bleach 3.1.5 py_0 defaults
boost-cpp 1.71.0 h7b6447c_0 defaults
brotli 1.0.7 he6710b0_0 defaults
brotlipy 0.7.0 py38h7b6447c_1000 defaults
bzip2 1.0.8 h7b6447c_0 defaults
c-ares 1.15.0 h7b6447c_1001 defaults
ca-certificates 2020.6.24 0 defaults
certifi 2020.6.20 py38_0 defaults
cffi 1.14.0 py38he30daa8_1 defaults
chardet 3.0.4 py38_1003 defaults
cryptography 2.9.2 py38h1ba5d50_0 defaults
dbus 1.13.16 hb2f20db_0 defaults
decorator 4.4.2 py_0 defaults
defusedxml 0.6.0 py_0 defaults
double-conversion 3.1.5 he6710b0_1 defaults
entrypoints 0.3 py38_0 defaults
expat 2.2.9 he6710b0_2 defaults
fontconfig 2.13.0 h9420a91_0 defaults
freetype 2.10.2 h5ab3b9f_0 defaults
gflags 2.2.2 he6710b0_0 defaults
glib 2.65.0 h3eb4bd4_0 defaults
glog 0.4.0 he6710b0_0 defaults
grpc-cpp 1.26.0 hf8bcb03_0 defaults
gst-plugins-base 1.14.0 hbbd80ab_1 defaults
gstreamer 1.14.0 hb31296c_0 defaults
icu 58.2 he6710b0_3 defaults
idna 2.10 py_0 defaults
importlib-metadata 1.7.0 py38_0 defaults
importlib_metadata 1.7.0 0 defaults
intel-openmp 2020.1 217 defaults
ipykernel 5.3.0 py38h5ca1d4c_0 defaults
ipython 7.16.1 py38h5ca1d4c_0 defaults
ipython_genutils 0.2.0 py38_0 defaults
ipywidgets 7.5.1 py_0 defaults
jedi 0.17.1 py38_0 defaults
jinja2 2.11.2 py_0 defaults
jpeg 9b h024ee3a_2 defaults
json5 0.9.5 py_0 defaults
jsonschema 3.2.0 py38_0 defaults
jupyter 1.0.0 py38_7 defaults
jupyter_client 6.1.3 py_0 defaults
jupyter_console 6.1.0 py_0 defaults
jupyter_core 4.6.3 py38_0 defaults
jupyterlab 2.1.5 py_0 defaults
jupyterlab_server 1.1.5 py_0 defaults
ld_impl_linux-64 2.33.1 h53a641e_7 defaults
libboost 1.71.0 h97c9712_0 defaults
libedit 3.1.20191231 h7b6447c_0 defaults
libevent 2.1.8 h1ba5d50_0 defaults
libffi 3.3 he6710b0_2 defaults
libgcc-ng 9.1.0 hdf63c60_0 defaults
libgfortran-ng 7.3.0 hdf63c60_0 defaults
libpng 1.6.37 hbc83047_0 defaults
libprotobuf 3.11.4 hd408876_0 defaults
libsodium 1.0.18 h7b6447c_0 defaults
libstdcxx-ng 9.1.0 hdf63c60_0 defaults
libuuid 1.0.3 h1bed415_2 defaults
libxcb 1.14 h7b6447c_0 defaults
libxml2 2.9.10 he19cac6_1 defaults
lz4-c 1.8.1.2 h14c3975_0 defaults
markupsafe 1.1.1 py38h7b6447c_0 defaults
mistune 0.8.4 py38h7b6447c_1000 defaults
mkl 2020.1 217 defaults
mkl-service 2.3.0 py38he904b0f_0 defaults
mkl_fft 1.1.0 py38h23d657b_0 defaults
mkl_random 1.1.1 py38h0573a6f_0 defaults
nbconvert 5.6.1 py38_0 defaults
nbformat 5.0.7 py_0 defaults
ncurses 6.2 he6710b0_1 defaults
notebook 6.0.3 py38_0 defaults
numpy 1.18.5 py38ha1c710e_0 defaults
numpy-base 1.18.5 py38hde5b4d6_0 defaults
openssl 1.1.1g h7b6447c_0 defaults
packaging 20.4 py_0 defaults
pandas 1.0.5 py38h0573a6f_0 defaults
pandoc 2.9.2.1 0 defaults
pandocfilters 1.4.2 py38_1 defaults
parso 0.7.0 py_0 defaults
pcre 8.44 he6710b0_0 defaults
pexpect 4.8.0 py38_0 defaults
pickleshare 0.7.5 py38_1000 defaults
pip 20.1.1 py38_1 defaults
prometheus_client 0.8.0 py_0 defaults
prompt-toolkit 3.0.5 py_0 defaults
prompt_toolkit 3.0.5 0 defaults
ptyprocess 0.6.0 py38_0 defaults
py4j 0.10.9 py_0 defaults
pyarrow 0.15.1 py38h0573a6f_0 defaults
pycparser 2.20 py_0 defaults
pygments 2.6.1 py_0 defaults
pyopenssl 19.1.0 py38_0 defaults
pyparsing 2.4.7 py_0 defaults
pyqt 5.9.2 py38h05f1152_4 defaults
pyrsistent 0.16.0 py38h7b6447c_0 defaults
pysocks 1.7.1 py38_0 defaults
pyspark 3.0.0 py_0 defaults
python 3.8.3 hcff3b4d_2 defaults
python-dateutil 2.8.1 py_0 defaults
pytz 2020.1 py_0 defaults
pyzmq 19.0.1 py38he6710b0_1 defaults
qt 5.9.7 h5867ecd_1 defaults
qtconsole 4.7.5 py_0 defaults
qtpy 1.9.0 py_0 defaults
re2 2019.08.01 he6710b0_0 defaults
readline 8.0 h7b6447c_0 defaults
requests 2.24.0 py_0 defaults
scipy 1.5.0 py38h0b6359f_0 defaults
send2trash 1.5.0 py38_0 defaults
setuptools 47.3.1 py38_0 defaults
sip 4.19.13 py38he6710b0_0 defaults
six 1.15.0 py_0 defaults
snappy 1.1.8 he6710b0_0 defaults
sqlite 3.32.3 h62c20be_0 defaults
terminado 0.8.3 py38_0 defaults
testpath 0.4.4 py_0 defaults
thrift-cpp 0.11.0 h02b749d_3 defaults
tk 8.6.10 hbc83047_0 defaults
tornado 6.0.4 py38h7b6447c_1 defaults
traitlets 4.3.3 py38_0 defaults
uriparser 0.9.3 he6710b0_1 defaults
urllib3 1.25.9 py_0 defaults
wcwidth 0.2.5 py_0 defaults
webencodings 0.5.1 py38_1 defaults
wheel 0.34.2 py38_0 defaults
widgetsnbextension 3.5.1 py38_0 defaults
xz 5.2.5 h7b6447c_0 defaults
zeromq 4.3.2 he6710b0_2 defaults
zipp 3.1.0 py_0 defaults
zlib 1.2.11 h7b6447c_3 defaults
zstd 1.3.7 h0b5b093_0 defaults
I also tried with pyspark 2.4.5 (combined with pyarrow 0.8). Exact same results.
Question 1: The inconsistent behaviour is indeed due to partitioning.
Question 2: Workaround found: converting first to an RDD and then immediately back to a DataFrame solved the issue (i.e. adding .rdd.toDF()). I am unclear about the reason; probably something is going on in the background that I don't understand.
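A minimal sketch of both workarounds, assuming the df, peak_finder_udf, and fname defined in the question. For a toy dataset like this, repartitioning to a single partition keeps all rows in one batch so find_peaks sees the whole series; the .rdd.toDF() round trip is the Question 2 workaround described above (the underlying reason remains unclear).
# Question 1: collapse to one partition before applying the UDF (only sensible for small data)
df_single = df.select('id', 'c').repartition(1)
df_single = df_single.withColumn('peak', peak_finder_udf(col('c')))
df_single.show()

# Question 2: round-trip through an RDD before writing
df.dropna().rdd.toDF().write.mode('overwrite').parquet(fname)
spark.read.parquet(fname).show()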

Issue with tables and HDF5 python package

Here is my code:
import os
import pandas as pd
def load_hdf(filename):
    """
    Load the first key of an HDF file
    """
    hdf = pd.HDFStore(filename, mode='r')
    keys = hdf.keys()
    if not keys:
        hdf.close()
        return pd.DataFrame()
    data_df = hdf.get(keys[0])
    hdf.close()
    return data_df
And when I do:
load_hdf(os.path.join(PATH, 'crm.hd5'))
I have this error:
HDFStore requires PyTables, "No module named 'tables'" problem importing
When I try:
pip install tables
I have the error:
Using Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 26 2018, 23:26:24)
* USE_PKGCONFIG: False
.. ERROR:: Could not find a local HDF5 installation.
You may need to explicitly state where your local HDF5 headers and library can be found by setting the ``HDF5_DIR`` environment variable or by using the ``--hdf5`` command-line option.
...
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/2s/sn3gzfwd6_37v0ggqd0n8qy00000gn/T/pip-install-1mx6wjd3/tables/
I already have PyTables and HDF5 in my Anaconda. I have Python 3.7.
I also had pytables installed and could not find a solution. What worked for me was to install the release candidate h5py 2.8.0rc1 (as seen here). It seems that the HDF5 version that pandas installs is not fully compatible.
So try:
pip install h5py==2.8.0rc1
Hope it helps.
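Once the tables import succeeds, here is a slightly more defensive version of the question's load_hdf (a sketch with the same behaviour, using HDFStore as a context manager so the file is closed even if reading fails):
import pandas as pd

def load_hdf(filename):
    """Load the first key of an HDF file, or an empty DataFrame if it has no keys."""
    with pd.HDFStore(filename, mode='r') as hdf:  # closed automatically, even on error
        keys = hdf.keys()
        if not keys:
            return pd.DataFrame()
        return hdf.get(keys[0])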