For my workload, I need to serialize Pandas DataFrames (text + data) to disk, each around 5 GB in size.
I came across various solutions:
HDF5: issues with strings
Feather: not stable
CSV: OK, but large file size.
pickle: OK, cross-platform; can we do better?
gzip: same as CSV (slow for read access).
SFrame: good, but not maintained anymore.
Just wondering: is there any alternative to pickle for storing string-heavy DataFrames on disk?
Parquet is the best format here; it is what big tech companies use to store petabytes of data.
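As a hedged sketch, writing and reading Parquet from pandas looks roughly like this (requires the pyarrow or fastparquet engine; the file and column names are illustrative):

import pandas as pd

# illustrative DataFrame with both text and numeric columns
df = pd.DataFrame({"text": ["foo", "bar"], "value": [1.0, 2.0]})

# write with columnar compression, then read back
df.to_parquet("df.parquet", engine="pyarrow", compression="snappy")
df2 = pd.read_parquet("df.parquet", engine="pyarrow")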
I suggest reading this article: https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d
The author concludes that Feather is the most efficient serialization format. However, it is not suitable for long-term storage; for that, CSV is likely the better choice.
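For illustration, a minimal Feather round trip in pandas (requires pyarrow; the file name is illustrative):

import pandas as pd

df = pd.DataFrame({"text": ["foo", "bar"], "value": [1.0, 2.0]})
df.to_feather("df.feather")        # fast, but better suited to short-lived interchange files
df2 = pd.read_feather("df.feather")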
Related
I have a Polars DataFrame and I want to save it into a Parquet file. And then the next DataFrame too, and the next.
The code df.write_parquet("path.parquet") only overwrites the file. How can I do this in Polars?
Polars does not support appending to Parquet files, and most tools do not; see for example this SO post.
Your best bet would be to cast the DataFrame to an Arrow table using .to_arrow(), and use pyarrow.dataset.write_dataset. In particular, see the comment on the parameter existing_data_behavior. Still, that requires organizing your data in partitions, which effectively means you have a separate Parquet file per partition, stored in the same directory. So each df you have becomes its own Parquet file, and you abstract away from that on read. Polars does not support writing partitions as far as I'm aware, but there is support for reading; see the source argument in pl.read_parquet.
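A rough sketch of that approach (assuming a reasonably recent pyarrow; the directory name and the unique-filename trick are just illustrative choices):

import uuid
import polars as pl
import pyarrow.dataset as ds

def append_df(df: pl.DataFrame, base_dir: str = "data_parquet") -> None:
    # each call adds new parquet file(s) to the same directory; a unique
    # basename keeps existing files from being overwritten
    ds.write_dataset(
        df.to_arrow(),
        base_dir,
        format="parquet",
        existing_data_behavior="overwrite_or_ignore",
        basename_template=f"part-{uuid.uuid4()}-{{i}}.parquet",
    )

# later, read all the pieces back as one Polars DataFrame
full = pl.read_parquet("data_parquet/*.parquet")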
I have these 30 GiB SAS7BDAT files which correspond to a year's worth of data. When I try importing the file using pd.read_sas() I get a memory-related error. Upon research, I hear mentions of using Dask, segmenting the files into smaller chunks, or SQL. These answers sound pretty broad, and since I'm new, I don't really know where to begin. Would appreciate if someone could share some details with me. Thanks.
I am not aware of a partitioned loader for this sort of data in Dask. However, the pandas API apparently allows you to stream the data in chunks, so you could write those chunks to other files in any convenient format and then process them either serially or with Dask. The best value of chunksize will depend on your data and available memory.
The following should work, but I don't have any of this sort of data to try it on.
import pandas as pd

with pd.read_sas(..., chunksize=100000) as file_reader:
    for i, df in enumerate(file_reader):
        # write each chunk out as its own parquet file
        df.to_parquet(f"{i}.parq")
then you can load the parts (in parallel) with
import dask.dataframe as dd
ddf = dd.read_parquet("*.parq")
It is well known [1] [2] that numpy.loadtxt is not particularly fast in loading simple text files containing numbers.
I have been googling around for alternatives, and of course I stumbled across pandas.read_csv and astropy io.ascii. However, these readers don’t appear to be easy to decouple from their library, and I’d like to avoid adding a 200 MB, 5-seconds-import-time gorilla just for reading some ascii files.
The files I usually read are simple, no missing data, no malformed rows, no NaNs, floating point only, space or comma separated. But I need numpy arrays as output.
Does anyone know if any of the parsers above can be used standalone or about any other quick parser I could use?
Thank you in advance.
[1] Numpy loading csv TOO slow compared to Matlab
[2] http://wesmckinney.com/blog/a-new-high-performance-memory-efficient-file-parser-engine-for-pandas/
[Edit 1]
For the sake of clarity and to reduce background noise: as I stated at the beginning, my ASCII files contain simple floats, no scientific notation, no Fortran-specific data, no funny stuff, nothing but simple floats.
Sample:
import numpy as np

arr = np.random.rand(1000, 100)
np.savetxt('float.csv', arr)
Personally I just use pandas and astropy for this. Yes, they are big and slow to import, but they are very widely available and on my machine they import in under a second, so they aren't so bad. I haven't tried, but I would assume that extracting the CSV reader from pandas or astropy and getting it to build and run standalone isn't so easy, and probably not a good way to go.
Is writing your own CSV to Numpy array reader an option? If the CSV is simple, it should be possible to do with ~ 100 lines of e.g. C / Cython, and if you know your CSV format you can get performance and package size that can't be beaten by a generic solution.
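Before reaching for C / Cython, a plain NumPy sketch may already be enough for files as clean as the sample above. This assumes whitespace-separated floats with no header or missing values (the helper name is made up):

import numpy as np

def load_simple_floats(path):
    # count columns from the first line, then let np.fromfile parse the
    # whole file as one flat float64 array and reshape it; only works for
    # clean, whitespace-separated numeric files
    with open(path) as f:
        ncols = len(f.readline().split())
    data = np.fromfile(path, dtype=float, sep=" ")
    return data.reshape(-1, ncols)

arr = load_simple_floats("float.csv")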
Another option you could look at is https://odo.readthedocs.io/ . I don't have experience with it, but from a quick look I didn't see direct CSV -> NumPy. However, it does make fast CSV -> database simple, and I'm sure there are fast database -> NumPy array options. So it might be possible to get fast CSV -> in-memory SQLite -> NumPy array via odo and possibly a second package.
The Spark CSV readers are not as flexible as pandas.read_csv and do not seem to be able to handle parsing dates of different formats, etc. Is there a good way of passing pandas DataFrames to Spark DataFrames in an ETL map step? Spark's createDataFrame does not appear to always work, likely because the type mapping has not been handled exhaustively. Paratext looks promising, but it is new and not yet heavily used.
For example here: Get CSV to Spark dataframe
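If it helps, here is a hedged sketch of that ETL pattern: let pandas do the flexible date parsing, then hand Spark an explicit schema so createDataFrame does not have to guess types. The file and column names below are made up:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.getOrCreate()

# parse the awkward dates with pandas, which is more forgiving about formats
pdf = pd.read_csv("data.csv", parse_dates=["event_time"])

# give Spark an explicit schema instead of relying on type inference
schema = StructType([
    StructField("event_time", TimestampType(), True),
    StructField("name", StringType(), True),
    StructField("value", DoubleType(), True),
])
sdf = spark.createDataFrame(pdf, schema=schema)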
I am planning to use Pandas HDFStore as a temporary file for out-of-core CSV operations.
(CSV --> HDFStore --> out-of-core operations in pandas).
Just wondering:
What the practical size limit of HDF5 is for real-life usage on one machine
(not the theoretical one...)
The cost of operations for pivot tables (100 columns, fixed VARCHAR, numerical).
Whether I would need to switch to Postgres (load the CSV into Postgres) and do DB stuff...
I tried to find benchmarks on Google for HDF5 size limits vs. computation time, but could not find any.
The total size of the CSV data is around 500 GB to 1 TB (uncompressed).
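For reference, a minimal sketch of the CSV --> HDFStore step, reading the CSV in chunks so memory stays bounded (requires PyTables; file names, column names, and the chunk size are made up):

import pandas as pd

store = pd.HDFStore("store.h5", complevel=9, complib="blosc")

for chunk in pd.read_csv("big.csv", chunksize=1_000_000):
    # append each chunk to one queryable table; data_columns makes those
    # columns usable in `where=` queries, min_itemsize fixes string width
    store.append("data", chunk, data_columns=["key"], min_itemsize={"key": 64})

store.close()

# later: select a subset without loading the full table into memory
subset = pd.read_hdf("store.h5", "data", where="key == 'some_value'")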