Reading partitioned Parquet file with Pyarrow uses too much memory - pandas

I have a large Impala database composed of partitioned Parquet files.
I copied one Parquet partition to the local disk using HDFS directly. The partition is 15GB in total and is composed of lots of files of about 10MB each. I'm trying to read it using Pandas with the Pyarrow engine or Pyarrow directly, but it consumes more than 60GB of RAM without even finishing reading the dataset. What could be the reason for such large memory usage?

The size of Parquet files on disk and in memory can differ by up to an order of magnitude. Parquet uses efficient encoding and compression techniques to store columns; when you load the data into RAM, it is unpacked into its uncompressed form. So for a dataset of files totalling 15G on disk, a RAM usage of 150G can be expected.
If you're unsure whether this is your problem, load a single file with df = pandas.read_parquet and inspect its memory usage with df.memory_usage(deep=True). This should give you a good indication of the disk-to-RAM scaling for your whole dataset.
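For example, a quick check along those lines (a minimal sketch; the file path is a placeholder for one of your ~10MB partition files):

    import os
    import pandas as pd

    # Placeholder path: point this at one of the ~10MB files in the partition.
    path = "partition/part-00000.parquet"
    df = pd.read_parquet(path, engine="pyarrow")

    disk_mb = os.path.getsize(path) / 1024**2
    ram_mb = df.memory_usage(deep=True).sum() / 1024**2
    print(f"on disk: {disk_mb:.1f} MB, in memory: {ram_mb:.1f} MB, "
          f"ratio: {ram_mb / disk_mb:.1f}x")

If the ratio comes out around 4-5x, a 15GB partition blowing past 60GB of RAM is consistent with plain decompression overhead rather than a bug.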

Related

tf.data.experimental.save VS TFRecords

I have noticed that the method tf.data.experimental.save (added in r2.3) allows you to save a tf.data.Dataset to file in just one line of code, which seems extremely convenient. Are there still benefits to serializing a tf.data.Dataset and writing it into a TFRecord ourselves, or is this save function meant to replace that process?
TFRecords have several benefits, especially with large datasets. If you are working with large datasets, using a binary file format to store your data can have a significant impact on the performance of your import pipeline and, as a consequence, on the training time of your model. Binary data takes up less space on disk, takes less time to copy, and can be read much more efficiently from disk. This is especially true if your data is stored on spinning disks, due to their much lower read/write performance compared with SSDs.
tf.data.experimental.save and tf.data.experimental.load will be useful if you are not worried about the performance of your import pipeline.
tf.data.experimental.save - The dataset is saved in multiple file "shards". By default, the dataset output is divided into shards in a round-robin fashion. Datasets saved through tf.data.experimental.save should only be consumed through tf.data.experimental.load, which is guaranteed to be backwards compatible.
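For reference, here is a minimal sketch of the two approaches side by side (assuming TF 2.3+; the paths and the dataset are placeholders, and in TF 2.3 tf.data.experimental.load needs the element_spec passed explicitly):

    import tensorflow as tf

    # Placeholder dataset standing in for a real input pipeline.
    dataset = tf.data.Dataset.range(100).map(lambda x: x * 2)

    # One-liner: snapshot the dataset to disk...
    tf.data.experimental.save(dataset, "/tmp/my_dataset")

    # ...and read it back; in TF 2.3 the element_spec must be supplied.
    restored = tf.data.experimental.load(
        "/tmp/my_dataset", element_spec=dataset.element_spec
    )

    # Hand-rolled alternative: serialize each element into a TFRecord yourself.
    with tf.io.TFRecordWriter("/tmp/my_dataset.tfrecord") as writer:
        for value in dataset.as_numpy_iterator():
            example = tf.train.Example(features=tf.train.Features(feature={
                "value": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(value)]))
            }))
            writer.write(example.SerializeToString())

The TFRecord route costs you the Example/Feature boilerplate but gives you a portable, language-agnostic file format; save/load is quicker to write but ties the files to tf.data.experimental.load.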

Optimal maximum Parquet file size in S3

I'm trying to work out the optimal file size when partitioning Parquet data on S3. AWS recommends avoiding files smaller than 128MB. But is there also a recommended maximum file size?
Databricks recommends files should be around 1GB, but it's not clear to me whether this only applies to HDFS. I know that the optimal file size is dependent on the HDFS block size. However, S3 doesn't have any concept of block size.
Any thoughts?
You should probably consider two things:
1) In the case of pure object stores such as S3, your block size does not matter on the S3 side - you don't need to align to anything.
2) What matters more is how, and with what, you are going to read the data.
Consider partitioning, pruning, row groups and predicate pushdown - and also how you are going to join this data.
e.g.: Presto (Athena) prefers files that are over 128MB, but files that are too big cause poor parallelisation - I usually aim for 1-2GB files (see the pyarrow sketch after the links below for one way to cap file size).
Redshift prefers to be massively parallel, so e.g. 4 nodes with 160 files will be better than 4 nodes with 4 files :)
suggested read:
https://www.upsolver.com/blog/aws-athena-performance-best-practices-performance-tuning-tips
https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
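If you are writing the files yourself with pyarrow, one way to stay near a target size is to cap the rows per output file (a sketch, assuming pyarrow >= 6.0 for max_rows_per_file; the table, row counts and output directory are made up for illustration):

    import numpy as np
    import pyarrow as pa
    import pyarrow.dataset as ds

    # Stand-in table; in practice this would be your real data.
    n = 1_000_000
    table = pa.table({"id": np.arange(n), "value": np.random.rand(n)})

    # Cap rows per file (and per row group) so each output Parquet file
    # lands near your target size; tune the counts to your actual row width.
    ds.write_dataset(
        table,
        "staging/",                 # local staging dir; an S3 filesystem also works
        format="parquet",
        max_rows_per_file=500_000,
        max_rows_per_group=100_000,
    )

You would pick the row caps by measuring the bytes per row of your real data and solving for the 1-2GB (Athena) or smaller-but-more-numerous (Redshift) targets mentioned above.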

optimal size of a tfrecord file

In your experience, what would be an ideal size for a .tfrecord file that works well across a wide variety of devices (hard disk, SSD, NVMe) and storage locations (local machine, HPC cluster with network mounts)?
If I get slower performance on a technically more powerful computer in the cloud than on my local PC, could the size of the tfrecord dataset be the root cause of the bottleneck?
Thanks
The official TensorFlow performance guide recommends ~100MB (https://docs.w3cub.com/tensorflow~guide/performance/performance_guide/):
"Reading large numbers of small files significantly impacts I/O performance. One approach to get maximum I/O throughput is to preprocess input data into larger (~100MB) TFRecord files. For smaller data sets (200MB-1GB), the best approach is often to load the entire data set into memory."
Currently (19-09-2020) Google recommends the following rule of thumb:
"In general, you should shard your data across multiple files so that you can parallelize I/O (within a single host or across multiple hosts). The rule of thumb is to have at least 10 times as many files as there will be hosts reading data. At the same time, each file should be large enough (at least 10+MB and ideally 100MB+) so that you benefit from I/O prefetching. For example, say you have X GBs of data and you plan to train on up to N hosts. Ideally, you should shard the data to ~10N files, as long as ~X/(10N) is 10+ MBs (and ideally 100+ MBs). If it is less than that, you might need to create fewer shards to trade off parallelism benefits and I/O prefetching benefits."
Source: https://www.tensorflow.org/tutorials/load_data/tfrecord
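Put as code, the rule of thumb works out roughly like this (a sketch with made-up numbers; the data size, host count and file names are placeholders):

    import tensorflow as tf

    # Hypothetical figures for illustration.
    data_size_gb = 50              # X GB of training data
    num_hosts = 4                  # N hosts reading in parallel
    num_shards = 10 * num_hosts    # rule of thumb: ~10 files per host

    # Each shard then holds ~X/(10N) GB; keep it at 10+ MB, ideally 100+ MB,
    # otherwise reduce num_shards.
    shard_mb = data_size_gb * 1024 / num_shards
    print(f"{num_shards} shards of ~{shard_mb:.0f} MB each")

    # Writing the shards: Dataset.shard(num_shards, i) keeps every
    # num_shards-th element starting at offset i.
    dataset = tf.data.Dataset.range(100_000)   # stand-in for real examples
    for i in range(num_shards):
        shard = dataset.shard(num_shards, i)
        path = f"train-{i:05d}-of-{num_shards:05d}.tfrecord"
        with tf.io.TFRecordWriter(path) as writer:
            for value in shard.as_numpy_iterator():
                example = tf.train.Example(features=tf.train.Features(feature={
                    "value": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(value)]))
                }))
                writer.write(example.SerializeToString())

With the numbers above that works out to 40 shards of roughly 1.25GB each, comfortably above the 100MB guideline.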

Memory limit in Azure Data Lake Analytics

I have implemented a custom extractor for NetCDF files and load the variables into arrays in memory before outputting them. Some arrays can be quite big, so I wonder what the memory limit is in ADLA. Is there some max amount of memory you can allocate?
Each vertex has 6GB available. Keep in mind that this memory is shared between the OS, the U-SQL runtime, and the user code running on the vertex.
In addition to Saveen's reply: Please note that a row can contain at most 4MB of data, so your SqlArray will also be limited by the maximum row size once you return it from your extractor.

Create discrepancy between size on disk and actual size in NTFS

I keep finding files which show a size of 10kB but a size on disk of 10GB. I'm trying to figure out how this is done - anyone have any ideas?
You can make sparse files on NTFS, as well as on any real filesystem. :-)
Seek to (10 GB - 10 kB), write 10 kB of data. There, you have a so-called 10 GB file, which in reality is only 10 kB big. :-)
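A minimal sketch of that trick in Python (the file name is a placeholder; note that on NTFS you would additionally mark the file sparse, e.g. with fsutil sparse setflag, for the unwritten range to stay unallocated):

    size = 10 * 1024**3      # nominal file size: 10 GB
    tail = 10 * 1024         # data actually written: 10 kB

    with open("sparse_demo.bin", "wb") as f:
        f.seek(size - tail)          # jump almost to the end without writing
        f.write(b"\x00" * tail)      # only these 10 kB are ever written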
You can create streams in NTFS files. It's like a separate file, but with the same filename. See here: Alternate Data Streams
I'm not sure about your case (or it might be a mistake in your question), but when you create an NTFS sparse file it will show different sizes for these fields.
When I create a 10MB sparse file and fill it with 1MB of data, Windows Explorer will show:
Size: 10MB
Size on disk: 1MB
But in your case it's the opposite (or a mistake).