Hadoop S3 No Space Left On Device - amazon-s3

I am running a MapReduce job that takes a small input (~3MB, a list of z integers) together with a cached sparse matrix of size n x m, and outputs z sparse vectors of dimension n x 1. The output is large (~2TB). I am running 20 m1.small nodes on Amazon EC2, with S3 storage for both input and output.
However, I am getting an IOException: No space left on device.
The Hadoop logs report S3 bytes written, but no files are created.
When I use a smaller input (smaller z), the output is there after the job finishes.
So I believe the job is running out of some temporary storage.
Is there a way to check where this temporary storage is?
Also, oddly, the log says that all the bytes were written to S3, but I see no files and don't know where these bytes are being written.
Thank you for your help.
Example code (I have also tried splitting this into separate map and reduce jobs, with the same error):
public void map(LongWritable key, Text value,
        Mapper<LongWritable, Text, LongWritable, VectorWritable>.Context context)
        throws IOException, InterruptedException {
    // Assume the input is id \t number
    String[] input = value.toString().split("\t");
    int idx = Integer.parseInt(input[0]) - 1;
    // Some operations to do, but basically outputting a vector
    // Collect the output
    context.write(new LongWritable(idx), new VectorWritable(matrix.getColumn(idx)));
}

Amazon EMR supports several Hadoop versions. These are the default values for 0.20.205:
hadoop.tmp.dir - /tmp/hadoop-${user.name} - A base for other temporary directories.
mapred.local.dir - ${hadoop.tmp.dir}/mapred/local - The local directory where MapReduce stores intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk i/o. Directories that do not exist are ignored.
mapred.temp.dir - ${hadoop.tmp.dir}/mapred/temp - A shared directory for temporary files.
Run du --max-depth=7 /home/xyz | sort -n, with /home/xyz replaced by the hadoop.tmp.dir path, and check which directory occupies the most space. Although the name hadoop.tmp.dir suggests temporary data, it also holds system and data files.
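The same check can be sketched in Python for nodes where scripting the inspection is more convenient than du. This is only a rough illustration: the function names dir_sizes and largest_first are hypothetical, and it sums apparent file sizes rather than allocated disk blocks, so the numbers will not match du exactly.

```python
import os

def dir_sizes(root):
    """Return {immediate subdirectory of root: total bytes of the files beneath it}."""
    sizes = {}
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue  # only subdirectories are ranked, like du's per-directory lines
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                try:
                    total += os.lstat(os.path.join(dirpath, name)).st_size
                except OSError:
                    pass  # file vanished or is unreadable; skip it
        sizes[entry.path] = total
    return sizes

def largest_first(sizes):
    """Sort the {path: bytes} mapping so the biggest directory comes first."""
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)
```

Calling largest_first(dir_sizes(path)) with path set to the configured hadoop.tmp.dir on a task node lists the subdirectories that fill the disk, biggest first.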


Reading and handling many small CSV-s to concatenate one large Dataframe

I have two folders, each containing about 8,000 small CSV files: one with an aggregate size of around 2GB and the other with an aggregate size of around 200GB.
The files are stored this way so that they are easier to update on a daily basis. However, when I conduct EDA, I would like all of them loaded into a single variable. For example:
path = "some random path"
df = pd.concat([pd.read_csv(f"{path}//{files}") for files in os.listdir(path)])
Reading the 2GB dataset takes much less time on my local machine than on the supercomputer cluster. And reading the 200GB dataset on the local machine is impossible without some kind of scaled-out Pandas solution. The situation does not seem to improve on the cluster, even with popular open-source tools like Dask and Modin.
Is there an effective way to read these CSV files in this situation?
Q :"Is there an effective way to read these CSV files ... ?"
A :Oh, sure, there is :
The CSV format ( standardisation attempted in RFC 4180 ) is not unambiguous and the standard is not obeyed under all circumstances ( commas inside fields, header present or not ), so some caution & care is needed here. Given you are your own data curator, you should be able to decide on plausible steps for handling your own data properly.
So, the as-is state is :
# in <_folder_1_>
:::::::: # 8000 CSV-files ~ 2GB in total
||||||||||||||||||||||||||||||||||||||||||| # 8000 CSV-files ~ 200GB in total
# in <_folder_2_>
Speaking of efficiency, the O/S coreutils provide stable, proven and most efficient tools ( as system tools always have been ) for the phase of merging thousands and thousands of plain CSV-files' content :
###################### if need be,
###################### use an in-place remove of all CSV-file headers first :
for F in *.csv; do sed -i '1d' "$F"; done
This helps in the case where we cannot avoid headers on the CSV-exporter side. It works like this :
(base):~$ cat ?.csv
(base):~$ for i in ?.csv; do sed -i '1d' "$i"; done
(base):~$ cat ?.csv
Now, the merging phase :
###################### join
cat *.csv > ___all_CSVs_JOINED.csv
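Where the coreutils are not available, the same merge can be sketched in Python. This is a variant, not the shell method above: it keeps the header of the first file instead of stripping every header in place, and it streams bytes so no file is loaded into RAM as a whole. The function name merge_csvs is hypothetical, and the sketch assumes every CSV file ends with a newline.

```python
import glob
import os
import shutil

def merge_csvs(folder, out_path, has_header=True):
    """Concatenate every *.csv in folder into out_path, keeping a single header."""
    out_abs = os.path.abspath(out_path)
    # Exclude the output file itself in case it lives in the same folder.
    paths = sorted(p for p in glob.glob(os.path.join(folder, "*.csv"))
                   if os.path.abspath(p) != out_abs)
    with open(out_path, "wb") as out:
        for i, path in enumerate(paths):
            with open(path, "rb") as src:
                if has_header:
                    header = src.readline()   # consume the header line
                    if i == 0:
                        out.write(header)     # keep only the first file's header
                # Stream the rest in blocks (assumes the file ends with "\n").
                shutil.copyfileobj(src, out)
    return len(paths)
```

The return value is the number of files merged, which makes it easy to verify that all ~8000 files in a folder were picked up.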
Given the nature of the said file-storage policy, performance can be boosted by using more processes that handle the small-file folder and the large-file folder independently, with the logic defined above put inside a pair of conversion_script_?.sh shell-scripts :
parallel --jobs 2 conversion_script_{1}.sh ::: $( seq -f "%1g" 1 2 )
The transformation is a "just"-[CONCURRENT] flow of processing for the sake of removing the CSV-headers, but a pure-[SERIAL] one for the merging itself. For larger numbers of files, it might become interesting to use a multi-staged tree of trees ( several stages of [SERIAL]-collections of [CONCURRENT]-ly pre-processed leaves ), yet for just 8000 files, not knowing the actual file-system details, the latency-masking from a just-[CONCURRENT] processing of both directories independently will be fine to start with.
Last but not least, the final pair of ___all_CSVs_JOINED.csv files is safe to open in a way that prevents moving all disk-stored data into RAM at once ( using a chunk-size-fused file-reading iterator, avoiding RAM-spillovers by using the mmap-ed mode, as a context manager ) :
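import pandas

SAFE_CHUNK_SIZE = 100_000  # rows per chunk; an assumed value, tune it to the available RAM
with pandas.read_csv( "<_folder_1_>//___all_CSVs_JOINED.csv",
                      header     = None,  # only if the headers were stripped before merging
                      chunksize  = SAFE_CHUNK_SIZE,
                      memory_map = True,
                      ) \
     as df_reader_MMAPer_CtxMGR:
    for df_chunk in df_reader_MMAPer_CtxMGR:
        ...  # process one chunk ( a DataFrame of at most SAFE_CHUNK_SIZE rows ) at a time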
When tweaking for ultimate performance, details matter and depend on physical hardware bottlenecks ( disk-I/O-wise, filesystem-wise, RAM-I/O-wise ), so due care may bring further improvement in minimising the repetitively performed end-to-end processing times. Sometimes it even pays to turn the data into a compressed / zipped form, in cases where CPU/RAM resources provide a sufficient performance advantage over the limited disk-I/O throughput : moving fewer bytes is so much faster that the CPU/RAM-decompression costs are still lower than the costs of moving 200+ [GB] of uncompressed plain-text data.
Details matter : tweak options, benchmark, tweak options, benchmark, tweak options, benchmark.
It would be nice if you posted your progress on testing the performance :
end-2-end duration of strategy ... [s] AS-IS now
end-2-end duration of strategy ... [s] with parallel --jobs 2 ...
end-2-end duration of strategy ... [s] with parallel --jobs 4 ...
end-2-end duration of strategy ... [s] with parallel --jobs N ... + compression ...
Keep us posted.

Snowflake - Azure File upload - How can i partition the file if size is more than 40MB

I have to upload data from a Snowflake table to Azure Blob storage using the COPY INTO command. The copy command I have works with the SINGLE = TRUE property, but I want to break the output into multiple files if the size exceeds 40MB.
For example, there is a table 'TEST' in Snowflake holding 100MB, and I want to upload this data to Azure Blob storage.
The COPY INTO command should create files in the format below:
TEST_1.csv (40MB)
TEST_2.csv (40MB)
TEST_3.csv (20MB)
--COPY INTO Command I am using
copy into @stage/test.csv from snowflake.test file_format = (format_name = PRW_CSV_FORMAT) header=true OVERWRITE = TRUE SINGLE = TRUE max_file_size = 40000000
We cannot control the exact output size of file unloads, only the maximum file size. The number and size of the files are chosen for maximum performance, since the operation is parallelized. Note that with SINGLE = TRUE the unload always produces one file; multiple files, each capped by MAX_FILE_SIZE, are only written with SINGLE = FALSE (the default). If you want to control the exact number or size of the files, that would be a feature request. Otherwise, work out a process outside of Snowflake to combine the files afterward. For more details about unloading, please refer to the blog
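As one possible shape for such an outside-of-Snowflake step, here is a rough Python sketch that re-chunks a single unloaded CSV file into numbered parts of at most a given size, repeating the header in each part. The function name split_csv and the TEST_n.csv naming are assumptions taken from the example in the question, not anything Snowflake provides, and the sketch splits on physical lines, so it assumes no quoted field contains a newline.

```python
def split_csv(src_path, prefix, max_bytes):
    """Split src_path into prefix_1.csv, prefix_2.csv, ... of at most max_bytes each.

    A single row that is larger than max_bytes still gets a part of its own.
    """
    parts = []
    out = None
    written = 0
    with open(src_path, "rb") as src:
        header = src.readline()
        for line in src:
            # Start a new part when there is none yet or the row would not fit.
            if out is None or written + len(line) > max_bytes:
                if out is not None:
                    out.close()
                part = f"{prefix}_{len(parts) + 1}.csv"
                parts.append(part)
                out = open(part, "wb")
                out.write(header)          # every part repeats the header
                written = len(header)
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts
```

For the example in the question this would be called as split_csv("test.csv", "TEST", 40_000_000) on the downloaded file, and the resulting parts uploaded to the container afterwards.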

Why received ZFS dataset uses less space than original?

I have a dataset on server1 that I want to back up to a second server, server2.
Server1 (original):
zfs list -o name,used,avail,refer,creation,usedds,usedsnap,origin,compression,compressratio,refcompressratio,mounted,atime,lused storage/iscsi/webhost-old produces:
storage/iscsi/webhost-old 67,8G 1,87T 67,8G Út kvě 31 6:54 2016 67,8G 16K - lz4 1.00x 1.00x - - 67,4G
Sending volume to the 2nd server:
zfs send storage/iscsi/webhost-old | pv | ssh -c arcfour,aes128-gcm@openssh.com root@ zfs receive -Fduv pool/bkp-storage
received 69,6GB stream in 378 seconds (189MB/sec)
Server2 zfs list produces:
pool/bkp-storage/iscsi/webhost-old 36,1G 3,01T 36,1G Pá pro 29 10:25 2017 36,1G 0 - lz4 1.15x 1.15x - - 28,4G
Why is there such a difference in sizes? Thanks.
From what you posted, I noticed 3 things that seemed odd:
(1) the compressratio is 1.15x on system 2, but 1.00x on system 1
(2) on system 2, used is 1.27x higher than logicalused
(3) the logicalused and the number zfs receive reports are ~2.4x higher on system 1 than on system 2
These terms are all defined in the man page, but are still confusing to reverse-engineer explanations for in practice.
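The ratios in that list can be re-derived from the figures in the question. A quick arithmetic check, with the values copied from the two zfs list outputs and the zfs receive message as printed (the variable names are only illustrative):

```python
# Figures as printed in the question (decimal commas converted to points).
used_2 = 36.1    # "used" on system 2
lused_2 = 28.4   # "lused" (logicalused) on system 2
lused_1 = 67.4   # "lused" (logicalused) on system 1
stream = 69.6    # size of the stream reported by zfs receive

used_vs_logical = round(used_2 / lused_2, 2)     # 1.27
logical_1_vs_2 = round(lused_1 / lused_2, 2)     # 2.37
stream_vs_logical = round(stream / lused_2, 2)   # 2.45

print(used_vs_logical, logical_1_vs_2, stream_vs_logical)
```

So the data on system 2 takes about 1.27x more space than its logical size, while the logical size on system 1 and the stream size are both roughly 2.4x the logical size on system 2.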
(1) could happen if you enabled compression on the source dataset after you wrote all the data to it, since ZFS doesn't rewrite the data to compress it when you enable that setting. The data sent by zfs send is uncompressed unless you use -c, but system 2 will try to compress it as it runs zfs receive if the setting is enabled on the destination dataset. If both system 1 and system 2 had the same compression settings before the data was written, they would have the same compressratio as well.
(2) can happen due to metadata written along with your data, but in this case it's too high for "normal" metadata, which accounts for 1-2% of most pools. It's probably caused by a pool-wide setting, like configuring RAID-Z, or a weird combination of striping and mirroring (like 4 stripes, but with one of them being a mirror).
For (3), I re-read the man page to try to figure it out:
The amount of space that is "logically" consumed by this dataset and
all its descendents. See the used property. The logical space
ignores the effect of the compression and copies properties, giving a
quantity closer to the amount of data that applications see.
If you were sending a dataset (instead of a single iSCSI volume) and the send size matched system 2's logicalused value (instead of system 1's), I would guess you forgot to send some child datasets (i.e. by using zfs send -R). However, neither of those are true in this case.
I had to do some additional digging -- this blog post from 2005 might contain the explanation. If system 1 didn't have compression enabled when the data was written (like I guessed above for (1)), the function responsible for not writing zeroed-out blocks (zio_compress_data) would not be run, so you probably have a bunch of empty blocks written to disk, and accounted for in the logicalused size. However, since lz4 is configured on system 2, it would run there, and those blocks would not be counted.

Content of the fsimage hdfs

I have a question about what the metadata in the fsimage actually covers. I read that all mutations to the file system namespace, such as file renames, permission changes, file creations, and block allocations, are recorded in the fsimage. But is the block location data in there as well?
Does it also contain the information about where (on which datanode) the blocks are stored?
I gather from this source: http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/ that the metadata on where blocks are stored is built from the block reports of the datanodes.
Is this true? So the fsimage does not contain information about the block locations?
The namenode maintains two types of data:
Block location data: since files are chopped into blocks, the NN needs to know which piece is where. This data is kept in memory and never persisted to disk; DNs talk to the NN periodically and share their block reports.
File system metadata: the file system hierarchy, permissions, etc. This information is persisted to disk.
When the namenode starts up, it loads a "snapshot" of the file system from the fsimage and applies the edit logs from edits onto it; this process yields a new snapshot. From this point on, the namenode can accept file system requests from clients / DNs.
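The split between persisted and in-memory state can be illustrated with a toy Python model. This is not HDFS code: the function names, the edit-log tuple format and the dictionaries are all hypothetical simplifications, and the sketch assumes the fsimage records which block IDs belong to each file but not where those blocks live.

```python
def start_namenode(fsimage, edit_log):
    """Replay the edit log on top of the fsimage snapshot (toy model).

    fsimage:  {path: [block_id, ...]}  -- persisted namespace, no locations
    edit_log: [(op, path, arg), ...]   -- mutations made since the snapshot
    """
    namespace = {path: list(blocks) for path, blocks in fsimage.items()}
    for op, path, arg in edit_log:
        if op == "create":
            namespace[path] = []
        elif op == "add_block":
            namespace[path].append(arg)
        elif op == "rename":
            namespace[arg] = namespace.pop(path)
        elif op == "delete":
            del namespace[path]
    return namespace

def apply_block_reports(reports):
    """Build the in-memory block -> datanodes map from datanode block reports."""
    locations = {}
    for datanode, block_ids in reports.items():
        for block_id in block_ids:
            locations.setdefault(block_id, set()).add(datanode)
    return locations
```

In this model the result of start_namenode is what could be written back as a new snapshot, while the result of apply_block_reports exists only in memory and has to be rebuilt from the datanodes after every restart.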
Yes, as far as I know the fsimage does not contain any information about block locations. That information is held by the datanodes; the namenode gets it from them when it starts up.
Hadoop provides a tool that converts the fsimage file into human-readable formats: http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html
Sample output:
bin/hdfs oiv -i fsimagedemo -p Indented -o fsimage.txt
ImageVersion = -19
NamespaceID = 2109123098
GenerationStamp = 1003
INodes [NumInodes = 12]
INodePath =
Replication = 0
ModificationTime = 2009-03-16 14:16
AccessTime = 1969-12-31 16:00
BlockSize = 0
Blocks [NumBlocks = -1]
NSQuota = 2147483647
DSQuota = -1
Username = theuser
GroupName = supergroup
PermString = rwxr-xr-x
...remaining output omitted...

Pig local mode, group, or join = java.lang.OutOfMemoryError: Java heap space

Using Apache Pig version (reported),
CentOS release 6.3 (Final), jdk1.6.0_31 (The Hortonworks Sandbox v1.2 on Virtualbox, with 3.5 GB RAM)
$ cat data.txt
$ cat GrpTest.pig
A = LOAD 'data.txt' USING PigStorage(',') AS (f1:int,f2:int,f3:int);
B = GROUP A BY f1;
pig -x local GrpTest.pig
[Thread-12] WARN org.apache.hadoop.mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
[Thread-12] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
[Thread-13] INFO org.apache.hadoop.mapred.Task - Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@19a9bea3
[Thread-13] INFO org.apache.hadoop.mapred.MapTask - io.sort.mb = 100
[Thread-13] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0002
java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:949)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:674)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
[main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias B
The java.lang.OutOfMemoryError: Java heap space error occurs every time I use GROUP or JOIN in a Pig script executed in local mode. There is no error when the script is executed in mapreduce mode on HDFS.
Question 1: Why is there an OutOfMemoryError when the data sample is minuscule and local mode is supposed to use fewer resources than HDFS mode?
Question 2: Is there a way to successfully run a small Pig script with GROUP or JOIN in local mode?
Solution: force Pig to allocate less memory via the Java property io.sort.mb.
I set it to 10 MB here and the error disappears. I am not sure what the best value would be, but at least this allows practising Pig syntax in local mode.
$ cat GrpTest.pig
--avoid java.lang.OutOfMemoryError: Java heap space (execmode: -x local)
set io.sort.mb 10;
A = LOAD 'data.txt' USING PigStorage(',') AS (f1:int,f2:int,f3:int);
B = GROUP A BY f1;
The reason is you have less memory allocated to Java locally than you do on your Hadoop cluster machines. This is actually a pretty common error in Hadoop. It mostly occurs when you create a really long relation in Pig at any point, and happens because Pig always wants to load an entire relation into memory and doesn't want to lazy load it in any way.
When you do something like GROUP BY where the tuple you're grouping by is non-sparse over many records, you frequently wind up creating single long relations at least temporarily since you're basically taking a whole bunch of individual relations and cramming them all into one single long relation. Either change your code so you don't wind up creating single very long relations at any point (i.e. group by something more sparse), or increase the memory available to Java.