Content of the fsimage hdfs - apache

I have a question on what is the metadata in the fsimage all about. I read that All mutations to the file system namespace, such as file renames, permission changes, file creations, block allocations are inside the fsimage. But the block location data as well?
Does it contain the information about where (on which datanode) the blocks are stores as well?
I get from this source: http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/ that the metadata on where blocks is stored is build by the block repots of the datanodes.
Is this true? So the Fsimage does not contain information about the block locations?

Namenode maintains two type of data
Block Location data : Since files are chopped into blocks, NN should know which piece is where.
This data is kept in memory and never persisted on disk, DNs talk to NN periodically and share the blockreport.
file system (metadata) : such as the file system hierarchy, permissions, etc. This info is persisted to the disk
when namenodes starts up it loads "snapshot" of filesystem from fsimage and applies the edit logs from edits onto it, after this process we get a new snapshot. from this point on namenode can accept files system requests from clients / DNs

Yes as far as I know fsimage does not contains any information about blocks. This information is stored by data nodes. Namenode gets this information when it starts up from datanodes.

Hadoop provides a tool that converts the fsimage file into human readable formats. http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html
Sample output:
bin/hdfs oiv -i fsimagedemo -p Indented -o fsimage.txt
FSImage
ImageVersion = -19
NamespaceID = 2109123098
GenerationStamp = 1003
INodes [NumInodes = 12]
Inode
INodePath =
Replication = 0
ModificationTime = 2009-03-16 14:16
AccessTime = 1969-12-31 16:00
BlockSize = 0
Blocks [NumBlocks = -1]
NSQuota = 2147483647
DSQuota = -1
Permissions
Username = theuser
GroupName = supergroup
PermString = rwxr-xr-x
...remaining output omitted...

Related

Snowflake - Azure File upload - How can i partition the file if size is more than 40MB

I have to upload the data from a Snowflake table to Azure BLOB using COPYINTO command. The copy command I have is working for SINGLE = TRUE property but I want to break the in multiple files if the size exceeds 40MB.
For example, There is a table 'TEST' in snowflake with 100MB, I want to upload this data in azure BLOB.
The copy into command should create files in below format
TEST_1.csv (40MB)
TEST_2.csv (40MB)
TEST_3.csv (20MB)
--COPY INTO Command I am using
copy into #stage/test.csv from snowflake.test file_format = (format_name = PRW_CSV_FORMAT) header=true OVERWRITE = TRUE SINGLE = TRUE max_file_size = 40000000
We cannot control the output size of file unloads, only the max file size. The number and size of the files are based on maximum performance as it parallelizes the operation. If you want to control the number/size of files, that would be a feature request. Otherwise, just work out a process outside of Snowflake to combine the files afterward. For more details about unloading, please refer to the blog

Why received ZFS dataset uses less space than original?

I have a dataset on the server1 that I want to back up to the second server2.
Server1 (original):
zfs list -o name,used,avail,refer,creation,usedds,usedsnap,origin,compression,compressratio,refcompressratio,mounted,atime,lused storage/iscsi/webhost-old produces:
NAME USED AVAIL REFER CREATION USEDDS USEDSNAP ORIGIN COMPRESS RATIO REFRATIO MOUNTED ATIME LUSED
storage/iscsi/webhost-old 67,8G 1,87T 67,8G Út kvě 31 6:54 2016 67,8G 16K - lz4 1.00x 1.00x - - 67,4G
Sending volume to the 2nd server:
zfs send storage/iscsi/webhost-old | pv | ssh -c arcfour,aes128-gcm#openssh.com root#10.0.0.2 zfs receive -Fduv pool/bkp-storage
received 69,6GB stream in 378 seconds (189MB/sec)
Server2 zfs list produces:
NAME USED AVAIL REFER CREATION USEDDS USEDSNAP ORIGIN COMPRESS RATIO REFRATIO MOUNTED ATIME LUSED
pool/bkp-storage/iscsi/webhost-old 36,1G 3,01T 36,1G Pá pro 29 10:25 2017 36,1G 0 - lz4 1.15x 1.15x - - 28,4G
Why is there such a difference in sizes? Thanks.
From what you posted, I noticed 3 things that seemed odd:
the compressratio is 1.15x on system 2, but 1.00x on system 1
on system 2, used is 1.27x higher than logicalused
the logicalused and the number zfs receive report are ~2.3x higher on system 1 than system 2
These terms are all defined in the man page, but are still confusing to reverse-engineer explanations for in practice.
(1) could happen if you enabled compression on the source dataset after you wrote all the data to it, since ZFS doesn't rewrite the data to compress it when you enable that setting. The data sent by zfs send is uncompressed unless you use -c, but system 2 will try to compress it as it runs zfs receive if the setting is enabled on the destination dataset. If both system 1 and system 2 had the same compression settings before the data was written, they would have the same compressratio as well.
(2) can happen due to metadata written along with your data, but in this case it's too high for "normal" metadata, which accounts for 1-2% of most pools. It's probably caused by a pool-wide setting, like configuring RAID-Z, or a weird combination of striping and mirroring (like 4 stripes, but with one of them being a mirror).
For (3), I re-read the man page to try to figure it out:
logicalused
The amount of space that is "logically" consumed by this dataset and
all its descendents. See the used property. The logical space
ignores the effect of the compression and copies properties, giving a
quantity closer to the amount of data that applications see.
If you were sending a dataset (instead of a single iSCSI volume) and the send size matched system 2's logicalused value (instead of system 1's), I would guess you forgot to send some child datasets (i.e. by using zfs send -R). However, neither of those are true in this case.
I had to do some additional digging -- this blog post from 2005 might contain the explanation. If system 1 didn't have compression enabled when the data was written (like I guessed above for (1)), the function responsible for not writing zeroed-out blocks (zio_compress_data) would not be run, so you probably have a bunch of empty blocks written to disk, and accounted for in the logicalused size. However, since lz4 is configured on system 2, it would run there, and those blocks would not be counted.

How to use the taildir source in Flume to append only newest lines of a .txt file?

I recently asked the question Apache Flume - send only new file contents
I am rephrasing the question in order to learn more and provide more benefitto future users of Flume.
Setup: Two servers, one with a .txt file that gets lines appended to it regularly.
Goal: Use flume TAILDIR source to append the most recently written line to a file on the other server.
Issue: Whenever the source file has a new line of data added, the current configuration appends everything in file on server 1 to the file in server 2. This results in duplicate lines in file 2 and does not properly recreate the file from server 1.
Configuration on server 1:
#configure the agent
agent.sources=r1
agent.channels=k1
agent.sinks=c1
#using memort channel to hold upto 1000 events
agent.channels.k1.type=memory
agent.channels.k1.capacity=1000
agent.channels.k1.transactionCapacity=100
#connect source, channel,sink
agent.sources.r1.channels=k1
agent.sinks.c1.channel=k1
#define source
agent.sources.r1.type=TAILDIR
agent.sources.r1.channels=k1
agent.sources.r1.filegroups=f1
agent.sources.r1.filegroups.f1=/home/tail_test_dir/test.txt
agent.sources.r1.maxBackoffSleep=1000
#connect to another box using avro and send the data
agent.sinks.c1.type=avro
agent.sinks.c1.hostname=10.10.10.4
agent.sinks.c1.port=4545
Configuration on server 2:
#configure the agent
agent.sources=r1
agent.channels=k1
agent.sinks=c1
#using memory channel to hold up to 1000 events
agent.channels.k1.type=memory
agent.channels.k1.capacity=1000
agent.channels.k1.transactionCapacity=100
#connect source, channel, sink
agent.sources.r1.channels=k1
agent.sinks.c1.channel=k1
#here source is listening at the specified port using AVRO for data
agent.sources.r1.type=avro
agent.sources.r1.bind=0.0.0.0
agent.sources.r1.port=4545
#use file_roll and write file at specified directory
agent.sinks.c1.type=file_roll
agent.sinks.c1.sink.directory=/home/Flume_dump
You have to set position json file. Then the source check the position and write only new added lines to sink.
ex) agent.sources.s1.positionFile = /var/log/flume/tail_position.json

Use flume to stream data to S3

I am trying flume for something very simple, where I would like to push content from my log files to S3. I was able to create a flume agent that would read the content from an apache access log file and use a logger sink. Now I am trying to find a solution where I can replace the logger sink with an "S3 sink". (I know this does not exist by default)
I was looking for some pointers to direct me in the correct path. Below is my test properties file that I am using currently.
a1.sources=src1
a1.sinks=sink1
a1.channels=ch1
#source configuration
a1.sources.src1.type=exec
a1.sources.src1.command=tail -f /var/log/apache2/access.log
#sink configuration
a1.sinks.sink1.type=logger
#channel configuration
a1.channels.ch1.type=memory
a1.channels.ch1.capacity=1000
a1.channels.ch1.transactionCapacity=100
#links
a1.sources.src1.channels=ch1
a1.sinks.sink1.channel=ch1
S3 is built over HDFS so you can use HDFS sink, you must replace hdfs path to your bucket in this way. Don't forget replace AWS_ACCESS_KEY and AWS_SECRET_KEY.
agent.sinks.s3hdfs.type = hdfs
agent.sinks.s3hdfs.hdfs.path = s3n://<AWS.ACCESS.KEY>:<AWS.SECRET.KEY>#<bucket.name>/prefix/
agent.sinks.s3hdfs.hdfs.fileType = DataStream
agent.sinks.s3hdfs.hdfs.filePrefix = FilePrefix
agent.sinks.s3hdfs.hdfs.writeFormat = Text
agent.sinks.s3hdfs.hdfs.rollCount = 0
agent.sinks.s3hdfs.hdfs.rollSize = 67108864 #64Mb filesize
agent.sinks.s3hdfs.hdfs.batchSize = 10000
agent.sinks.s3hdfs.hdfs.rollInterval = 0
This makes sense, but can rollSize of this value be accompanied by
agent_messaging.sinks.AWSS3.hdfs.round = true
agent_messaging.sinks.AWSS3.hdfs.roundValue = 30
agent_messaging.sinks.AWSS3.hdfs.roundUnit = minute

Hadoop S3 No Space Left On Device

I am running a map reduce job that takes a small input (~3MB, list of integers of size z),
with a sparse matrix cache of size n x m, and basically outputs z sparse vectors of dimension (n x 1). The output here is pretty big (~2TB). I am running 20 m1.small nodes on Amazon EC2 with S3 storage as inputs and output.
However, I am getting a IOException: No space left on device.
It seems like there are s3 bytes written on Hadoop logs, but no files are created.
When I used a smaller input (smaller z), the output is correctly there after the job is done.
Thus, I believe that it runs out on a temporary storage.
Is there way to check where this temporary storage is?
Also, funny thing is that the log is saying that all the bytes are written to s3, but I see no files and don't know where these bytes are being written.
Thank you for your help.
Example code (Have also tried to split into map and reduce job with same error)
public void map(LongWritable key, Text value,
Mapper<LongWritable, Text, LongWritable, VectorWritable>.Context context)
throws IOException, InterruptedException
{
// Assume the input is id \t number
String[] input = value.toString().split("\t");
int idx = Integer.parseInt(input[0]) - 1;
// Some operations to do, but basically outputting a vector
// Collect the output
context.write(new LongWritable(idx), new VectorWritable(matrix.getColumn(idx)));
};
Amazon EMR supports a couple of versions. These are the default values 0.20.205
hadoop.tmp.dir - /tmp/hadoop-${user.name} - A base for other temporary directories.
mapred.local.dir - ${hadoop.tmp.dir}/mapred/local - The local directory where MapReduce stores intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk i/o. Directories that do not exist are ignored.
mapred.temp.dir - ${hadoop.tmp.dir}/mapred/temp - A shared directory for temporary files.
Run the du --max-depth=7 /home/xyz | sort -n command on the hadoop.tmp.dir and check which directory is occupying the most space. Although hadoop.tmp.dir says temporary, it stores system and data files also.