Amazon Elastic MapReduce: Does input fragment size matter? - amazon-emr

Suppose I need to process 20 GB of input using 10 instances.
Does it make a difference whether I have 10 input files of 2 GB each or 4 input files of 5 GB each?
In the latter case, can Amazon Elastic MapReduce automatically distribute the load of 4 input files across 10 instances? (I'm using the Streaming method, as my mapper is written in Ruby.)

The only thing that matters is whether the files are splittable.
If the files are uncompressed plain text or compressed with LZO, then Hadoop will sort out the splitting.
5 x 2 GB files will result in ~80 splits and hence ~80 map tasks (10 GB / 128 MB (the EMR block size) ≈ 80).
10 x 1 GB files will again result in ~80 splits and hence, again, ~80 map tasks.
If the files are gzip or bzip2 compressed, then Hadoop (at least, the version running on EMR) will not split the files:
5 x 2 GB files will result in only 5 splits (and hence only 5 map tasks).
10 x 1 GB files will result in only 10 splits (and hence only 10 map tasks).
Mat
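
For a rough illustration of the arithmetic above, here is a small sketch (assuming a 128 MB split size and treating gzip/bzip2 input as non-splittable, as described above) that estimates how many map tasks a set of input files would produce:

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB split size (assumed)

def estimated_map_tasks(file_sizes_bytes, splittable=True):
    # Non-splittable (gzip/bzip2) input: one split, and one map task, per file.
    if not splittable:
        return len(file_sizes_bytes)
    # Splittable input: each file is cut into roughly ceil(size / block size) splits.
    return sum(-(-size // BLOCK_SIZE) for size in file_sizes_bytes)

GB = 1024 ** 3
print(estimated_map_tasks([2 * GB] * 5))                    # 80 (plain text or LZO)
print(estimated_map_tasks([2 * GB] * 5, splittable=False))  # 5  (gzip or bzip2)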

Related

Disk I/O extremely slow on P100-NC6s-V2

I am training an image segmentation model on an Azure ML pipeline. During the testing step, I'm saving the output of the model to the associated blob storage. Then I want to find the IoU (Intersection over Union) between the calculated output and the ground truth. Both sets of images lie on the blob storage. However, the IoU calculation is extremely slow, and I think it's disk bound. In my IoU calculation code I'm just loading the two images (other code is commented out), and it's still taking close to 6 seconds per iteration, while training and testing were fast enough.
Is this behavior normal? How do I debug this step?
A few notes on the drives that an AzureML remote run has available:
Here is what I see when I run df on a remote run (in this one, I am using a blob Datastore via as_mount()):
Filesystem 1K-blocks Used Available Use% Mounted on
overlay 103080160 11530364 86290588 12% /
tmpfs 65536 0 65536 0% /dev
tmpfs 3568556 0 3568556 0% /sys/fs/cgroup
/dev/sdb1 103080160 11530364 86290588 12% /etc/hosts
shm 2097152 0 2097152 0% /dev/shm
//danielscstorageezoh...-620830f140ab 5368709120 3702848 5365006272 1% /mnt/batch/tasks/.../workspacefilestore
blobfuse 103080160 11530364 86290588 12% /mnt/batch/tasks/.../workspaceblobstore
The interesting items are overlay, /dev/sdb1, //danielscstorageezoh...-620830f140ab and blobfuse:
overlay and /dev/sdb1 are both the mount of the local SSD on the machine (I am using a STANDARD_D2_V2 which has a 100GB SSD).
//danielscstorageezoh...-620830f140ab is the mount of the Azure File Share that contains the project files (your script, etc.). It is also the current working directory for your run.
blobfuse is the blob store that I had requested to mount in the Estimator as I executed the run.
I was curious about the performance differences between these 3 types of drives. My mini benchmark was to download and extract this file: http://download.tensorflow.org/example_images/flower_photos.tgz (it is a 220 MB tar file that contains about 3600 jpeg images of flowers).
Here are the results:
Filesystem/Drive           Download and save    Extract
Local SSD                  2s                   2s
Azure File Share           9s                   386s
Premium File Share         10s                  120s
Blobfuse                   10s                  133s
Blobfuse w/ Premium Blob   8s                   121s
In summary, writing small files is much, much slower on the network drives, so it is highly recommended to write to /tmp (or use Python's tempfile module) if you are writing smaller files.
For reference, here is the script I ran to measure: https://gist.github.com/danielsc/9f062da5e66421d48ac5ed84aabf8535
And this is how I ran it: https://gist.github.com/danielsc/6273a43c9b1790d82216bdaea6e10e5c
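
Following that recommendation, here is a minimal sketch (the blob mount path is just a placeholder, and the archive URL is the one from the benchmark above) that downloads and extracts the archive into a local temporary directory first, and only then copies the result onto the mounted blob store:

import shutil
import tarfile
import tempfile
import urllib.request

ARCHIVE_URL = "http://download.tensorflow.org/example_images/flower_photos.tgz"
BLOB_MOUNT = "/mnt/batch/tasks/.../workspaceblobstore"  # placeholder path

with tempfile.TemporaryDirectory() as tmp:
    # Download and extract on the local SSD, where small-file writes are fast.
    archive_path = f"{tmp}/flower_photos.tgz"
    urllib.request.urlretrieve(ARCHIVE_URL, archive_path)
    with tarfile.open(archive_path) as tar:
        tar.extractall(tmp)
    # Copy the extracted folder to the (much slower) network drive in one pass.
    shutil.copytree(f"{tmp}/flower_photos", f"{BLOB_MOUNT}/flower_photos")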

Extremely high memory usage with pyarrow reading gzipped parquet files

I have a (set of) gzipped parquet files with about 210 columns, of which I am loading about 100 columns into a pandas dataframe. It works fine and very fast when the file size is about 1 MB (about 50 rows); the python3 process consumes < 500 MB of RAM. However, when the file is > 1.5 MB (70+ rows), it starts consuming 9-10 GB of RAM without ever loading the dataframe. If I specify just 2-3 columns, it is able to load them from the "big" file (still consuming that kind of RAM), but anything beyond that seems impossible. All columns are text.
I am currently using pandas.read_parquet, but I have also tried pyarrow.read_table with the same results.
Any ideas what could be going on? I just don't understand why loading that amount of data should blow up RAM like that and become unusable. My objective is to load the parquet data into a database, so if there are better ways to do it, that would be great to know as well.
The code is below; it's just a simple usage of pandas.read_parquet.
import pandas as pd
df = pd.read_parquet(bytesIO_from_file, columns=[...])
There was a memory usage issue in pyarrow 0.14 that has been resolved: https://issues.apache.org/jira/browse/ARROW-6060
The upcoming 0.15 release will have this fix, as well as a bunch of other optimizations in Parquet reading. If you're curious to try it now, see the docs for installing the development version.
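
If you need a workaround in the meantime, one option for getting the data into a database is to read the file one row group at a time with pyarrow instead of materializing everything at once. A minimal sketch (the path, column list, and write_rows_to_db helper are placeholders; this only helps if the file actually contains multiple row groups):

import pyarrow.parquet as pq

COLUMNS = ["col_a", "col_b"]  # the ~100 columns you actually need (placeholder names)

pf = pq.ParquetFile("data.parquet")  # placeholder path
for i in range(pf.num_row_groups):
    # Read one row group at a time so only a slice of the file is held in memory.
    table = pf.read_row_group(i, columns=COLUMNS)
    df = table.to_pandas()
    write_rows_to_db(df)  # hypothetical helper that inserts the rows into your database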

Using CoreNLP ColumnDataClassifier for document classification with a large corpus

I'm trying to use the CoreNLP ColumnDataClassifier to classify a large number of documents. I have a little more than 1 million documents with about 20,000 labels.
Is this even possible in terms of memory requirements? (I currently have only 16 GB.)
Is it somehow possible to train the classifier in an iterative way, splitting the input into many smaller files?
As an experiment I ran:
1.) 500,000 documents, each with 100 random words
2.) a label set of 10,000
This crashed with a memory error even when I gave it 40 GB of RAM.
I also ran:
1.) same 500,000 documents
2.) a label set of 6
This ran successfully to completion with 16 GB of RAM.
I'm not sure at what point growing the label set will cause a crash, but my advice would be to shrink the possible label set and experiment.
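
For intuition about why the label set dominates memory, here is a rough back-of-the-envelope sketch. It assumes (an assumption about the implementation, not something stated above) that the underlying classifier keeps one 8-byte weight per (feature, label) pair:

def weight_matrix_gb(num_features, num_labels, bytes_per_weight=8):
    # One double-precision weight per (feature, label) pair (assumed).
    return num_features * num_labels * bytes_per_weight / 1024 ** 3

# If the 500,000 documents of 100 random words yield ~1,000,000 distinct
# word features (a hypothetical figure), then:
print(weight_matrix_gb(1_000_000, 10_000))  # ~74.5 GB, consistent with crashing at 40 GB
print(weight_matrix_gb(1_000_000, 6))       # ~0.04 GB, easily fits in 16 GB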

neo4j breadth first traversal memory issue

I have a graph with a million nodes and 3 million edges loaded into Neo4j. It crashes while doing a breadth-first traversal over it, complaining of insufficient memory on an 8 GB machine. Each node label string has an average length of 40 characters.
What kind of internal representation does Neo4j use that requires so much memory, especially for traversal? Given that Neo4j is able to represent the entire graph, why does it fail while trying to maintain the set of visited nodes required for breadth-first traversal?
As per my understanding, an adjacency-list representation of the above graph should only take up on the order of MBs.
Calculation assuming a 64-bit representation of each node and edge:
3 million edges * 8 bytes * 3 = 72 MB (node, link, node)
1 million nodes * (40 + 8 + 8 bytes) = 56 MB (64-bit hash linked to the node's string label)
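The same estimate in a few lines, for reference (the byte counts are the same assumptions as in the calculation above, not Neo4j's actual storage layout):

edges = 3_000_000
nodes = 1_000_000

edge_bytes = edges * 8 * 3         # (node, link, node) as three 8-byte ids
node_bytes = nodes * (40 + 8 + 8)  # 40-char label + 64-bit hash + 64-bit link

print(edge_bytes / 1e6, "MB")  # 72.0 MB
print(node_bytes / 1e6, "MB")  # 56.0 MB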
You might have 8 GB available, but are you configuring Neo4j to allow it to use that space? Can you see how much it's taking up when it's working?
Here are some resources:
http://neo4j.com/developer/guide-performance-tuning/
http://neo4j.com/docs/stable/server-performance.html
http://neo4j.com/docs/stable/configuration-settings.html#config_neostore.nodestore.db.mapped_memory
http://neo4j.com/developer/guide-sizing-and-hardware-calculator/
Bingo, @brian-underwood! You are right.
I hadn't configured Neo4j to use more memory.
Since the issue was related to nodes only, the following is what I modified:
neostore.nodestore.db.mapped_memory=256M # increased
neostore.relationshipstore.db.mapped_memory=3G # unchanged
neostore.propertystore.db.mapped_memory=256M # increased
neostore.propertystore.db.strings.mapped_memory=200M # unchanged
neostore.propertystore.db.arrays.mapped_memory=200M # unchanged
I also enabled auto-indexing for nodes and their keys:
node_auto_indexing=true
node_keys_indexable=key_name

Memory mapped file for numpy arrays

I need to read in parts of a huge numpy array stored in a memory-mapped file, process the data, and repeat for another part of the array. The whole numpy array takes up around 50 GB and my machine has 8 GB of RAM.
I initially created the memory-mapped file using numpy.memmap by reading in a lot of smaller files, processing their data, and then writing the processed data to the memmap file. During the creation of the memmap file I had no memory issues (I was calling memmap.flush() periodically). Here's how I created the memory-mapped file:
import numpy as np

mmapData = np.memmap(mmapFile, mode='w+', shape=(large_no1, large_no2))
for i1 in np.arange(numFiles):
    auxData = load_data_from(file[i1])
    mmapData[i1, :] = auxData
    mmapData.flush()  # Do this every 10 iterations or so
However, when I try to access small portions (< 10 MB) of the memmap file, it floods my whole RAM as soon as the memmap object is created. The machine slows down drastically and I can't do anything. Here's how I try to read the data from the memory-mapped file:
mmapData = np.memmap(mmapFile, mode='r', shape=(large_no1, large_no2))
aux1 = mmapData[5, 1:10000000]
I thought using mmap or numpy.memmap should allow me to access parts of massive arrays without trying to load the whole thing into memory. What am I missing?
Am I using the wrong tool to access parts of a large numpy array (> 20 GB) stored on disk?
Could it be that you're looking at virtual, rather than physical memory consumption, and the slowdown is coming from something else?
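
One way to check that (a sketch, assuming the psutil package is available; the file name and shape here are made up) is to compare the process's resident and virtual memory before and after touching a small slice of the memmap:

import numpy as np
import psutil

proc = psutil.Process()

def report(label):
    mem = proc.memory_info()
    # rss = physical (resident) memory, vms = virtual address space
    print(f"{label}: rss={mem.rss / 1e6:.0f} MB, vms={mem.vms / 1e6:.0f} MB")

report("before")
# Opening the memmap should only map address space, not read data.
mmapData = np.memmap("big_array.dat", dtype=np.float64, mode="r",
                     shape=(1000, 10_000_000))
report("after opening memmap")      # vms jumps, rss should barely move
aux1 = np.array(mmapData[5, 1:10_000_000])  # actually read one row slice
report("after reading one slice")   # rss should grow only by roughly the slice size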