How to analyze big JVM dump files - jvm

Our Hadoop cluster has 1000+ nodes and 20PB of data, so our NameNode dump files are over 100GB, and we find them hard to analyze with any tool.
Does anyone have suggestions on how to analyze such big JVM dump files?

MAT analyzes the dump file using index files. Could the big file be mapped into many little files, with the index files stored on a common file system (e.g. FastDFS)?

Related

Optimal maximum Parquet file size in S3

I'm trying to work out the optimal file size when partitioning Parquet data on S3. AWS recommends avoiding files smaller than 128MB. But is there also a recommended maximum file size?
Databricks recommends files should be around 1GB, but it's not clear to me whether this only applies to HDFS. I know that the optimal file size is dependent on the HDFS block size. However, S3 doesn't have any concept of block size.
Any thoughts?
You should probably consider two things:
1) In the case of pure object stores such as S3, the block size does not matter on the S3 side - you don't need to align to anything.
2) What matters more is how, and with what, you are going to read the data.
Consider partitioning, pruning, row groups and predicate pushdown - and also how you're going to join this data.
E.g. Presto (Athena) prefers files that are over 128MB, but files that are too big cause poor parallelisation - I usually aim for 1-2GB files (see the sketch after the suggested reading for one way to control the layout when writing).
Redshift prefers to be massively parallel, so with e.g. 4 nodes, 160 files will be better than 4 files :)
Suggested reading:
https://www.upsolver.com/blog/aws-athena-performance-best-practices-performance-tuning-tips
https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
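As a rough sketch of the row-group / pruning / pushdown knobs mentioned above, here is some hypothetical PyArrow code; the data, column names, sizes and the local directory are made up, and in practice the dataset would sit on S3 rather than on local disk:

```python
# Hypothetical sketch: made-up data and a local path standing in for S3.
import os
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

df = pd.DataFrame({
    "user_id": range(1_000_000),
    "amount": [float(i % 100) for i in range(1_000_000)],
})

# Write with an explicit row-group size; engines like Presto/Athena can then
# skip whole row groups using their min/max statistics (predicate pushdown).
os.makedirs("events", exist_ok=True)
pq.write_table(
    pa.Table.from_pandas(df),
    "events/part-0000.parquet",
    row_group_size=128_000,
)

# Read back only what is needed: column pruning plus a pushed-down filter.
dataset = ds.dataset("events", format="parquet")
table = dataset.to_table(
    columns=["amount"],
    filter=ds.field("amount") > 50,
)
print(table.num_rows)
```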

Reading partitioned Parquet file with Pyarrow uses too much memory

I have a large Impala database composed of partitioned Parquet files.
I copied one Parquet partition to the local disk using HDFS directly. This partition is 15GB in total and is composed of lots of files of about 10MB each. I'm trying to read it using Pandas with the PyArrow engine, or PyArrow directly, but it uses more than 60GB of RAM and runs out of memory before reading the entire dataset. What could be the reason for such large memory usage?
The size of Parquet files on disk and in memory can differ by up to an order of magnitude. Parquet uses efficient encoding and compression techniques to store columns; when you load this data into RAM, it is unpacked into its uncompressed form. So for a dataset of files totalling 15GB on disk, RAM usage of 150GB would be expected.
If you're unsure whether this is your problem, load a single file using df = pandas.read_parquet and inspect its memory usage with df.memory_usage(deep=True). This should give you a good indication of the scaling between disk and RAM for your whole dataset.
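A minimal sketch of that check; the file name is a hypothetical placeholder for one of the ~10MB partition files:

```python
# Sketch: compare a single Parquet file's size on disk with its size in RAM.
import os
import pandas as pd

path = "part-00000.parquet"          # hypothetical: one of the partition files
df = pd.read_parquet(path, engine="pyarrow")

disk_bytes = os.path.getsize(path)
ram_bytes = df.memory_usage(deep=True).sum()
print(f"on disk: {disk_bytes / 2**20:.1f} MiB")
print(f"in RAM : {ram_bytes / 2**20:.1f} MiB")
print(f"expansion factor: {ram_bytes / disk_bytes:.1f}x")

# If the expansion is large, reading only the columns you actually need
# usually helps, e.g.:
# df = pd.read_parquet(path, columns=["col_a", "col_b"])
```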

What are the guidelines to improve Flume performance?

I have a use case where I have to transfer one million or more files into HDFS. File sizes can vary from 10KB to 50KB.
I am using a spooling directory source, an HDFS sink and a file channel.
I am also using the BLOB deserializer, as I do not want to break up my source data: each complete file should be transferred as a single event, which I am able to achieve.
So far my Flume agent design is: spooling directory source → file channel → HDFS sink.
Still, I am not able to get good performance.
I also want to understand whether Hadoop cluster configuration can help improve the performance.
AFAIK, there is no silver bullet for performance tuning. As usual, you will need to experiment and learn based on your data and infrastructure. The following articles discuss the various knobs (and general guidance) available to fine tune Flume performance:
Cloudera - Flume Performance Tuning, DZone - Flume Performance Tuning
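For concreteness, here is a sketch of the kind of agent configuration and knobs those articles discuss (source/sink batch sizes, channel capacities, HDFS roll settings). The property names follow the Flume user guide, but the agent/component names, paths and values are hypothetical placeholders rather than a tuned recommendation:

```properties
# Sketch only: names, paths and numbers are placeholders, not tuned values.
agent1.sources  = r1
agent1.channels = c1
agent1.sinks    = k1

# Spooling directory source delivering each whole file as one event (BLOB).
agent1.sources.r1.type = spooldir
agent1.sources.r1.spoolDir = /data/inbox
agent1.sources.r1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
agent1.sources.r1.batchSize = 1000
agent1.sources.r1.channels = c1

# File channel: capacity and transactionCapacity are common tuning knobs.
agent1.channels.c1.type = file
agent1.channels.c1.checkpointDir = /data/flume/checkpoint
agent1.channels.c1.dataDirs = /data/flume/data
agent1.channels.c1.capacity = 1000000
agent1.channels.c1.transactionCapacity = 10000

# HDFS sink: batchSize and the roll* settings control how many (small) files
# end up in HDFS, which matters a lot with millions of 10-50KB inputs.
agent1.sinks.k1.type = hdfs
agent1.sinks.k1.channel = c1
agent1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
agent1.sinks.k1.hdfs.fileType = DataStream
agent1.sinks.k1.hdfs.batchSize = 1000
agent1.sinks.k1.hdfs.rollInterval = 300
agent1.sinks.k1.hdfs.rollSize = 134217728
agent1.sinks.k1.hdfs.rollCount = 0
```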

Do files on your server influence website speed

If you have a webserver for your website, does it make a difference if there are a lot of other files on the server, even if they aren't used?
Example
An average webserver has an SSD with 500 GB of space. It's hosting a single active website, but also holds a ton of other websites which are inactive. Though that single website is only 1GB in size, the drive is 50% full. Will that influence site speed?
And does SSD vs. HDD make a difference here, apart from the general speed difference between the two types?
Edit: I've read somewhere that the number of files on your server influences its speed, and that sounds logical given Andrei's answer about having to search through more files. However, I've had a discussion about it with someone, and he firmly states that it makes no difference.
Having other/unused files always has some impact on performance, but the question is how big that impact is. Usually it's not much, and you will not notice it at all.
But think about how files are read from disk. First, you need to locate the file record in the file allocation table (FAT). Searching the table is similar to searching a tree-like data structure, as we have to deal with folders that contain other folders, and so on.
The more files you have, the bigger the FAT gets, and the slower the search becomes.
All in all, with memory caching and other tricks, this is not an issue.
You will notice the impact when you have thousands of files in one folder. That's why picture-hosting services that store large numbers of images usually keep them in a folder structure that holds only a limited number of files per folder. For example, a file named '12345678.jpg' would be stored at the path '/1/2/3/4/5/12345678.jpg', alongside the other files whose names run from '12345000' to '12345999'. That way only 1000 files end up in each folder.
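A minimal sketch of that sharding scheme; the 8-digit numeric names come from the example above, while the root directory is an assumption:

```python
# Sketch of the folder-sharding scheme above: the leading digits of an
# 8-digit file name become nested folders, so each folder holds at most
# the 1000 files '...000' to '...999'. Root directory is hypothetical.
import os

def sharded_path(filename: str, root: str = "/images") -> str:
    """Map '12345678.jpg' to '/images/1/2/3/4/5/12345678.jpg'."""
    stem, _ext = os.path.splitext(filename)
    if len(stem) != 8 or not stem.isdigit():
        raise ValueError(f"expected an 8-digit numeric name, got {filename!r}")
    # One folder level per leading digit, leaving the last three digits
    # (000-999) to vary within a single folder.
    folders = list(stem[:-3])
    return os.path.join(root, *folders, filename)

print(sharded_path("12345678.jpg"))  # /images/1/2/3/4/5/12345678.jpg
```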

Performance implications of storing 600,000+ images in the same folder (NTFS)

I need to store about 600,000 images on a web server that uses NTFS. Am I better off storing images in 20,000-image chunks in subfolders? (Windows Server 2008)
I'm concerned about incurring operating system overhead during image retrieval.
Go for it. As long as you have an external index and a direct file path to each file, without listing the contents of the directory, you are OK.
I have a folder that is over 500 GB in size with over 4 million folders (which contain more folders and files). I have somewhere in the order of 10 million files in total.
If I accidentally open this folder in Windows Explorer it gets stuck at 100% CPU usage (for one core) until I kill the process. But as long as you refer to the file/folder directly, performance is great (meaning I can access any of those 10 million files with no overhead).
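To illustrate the point about direct access versus directory listing, here is a rough, hypothetical sketch (the folder and file name are placeholders) comparing the two from Python:

```python
# Hypothetical sketch: opening one known file directly vs. enumerating the
# whole directory. Folder and file name are made-up placeholders.
import os
import time

folder = r"D:\images"                          # hypothetical image store
target = os.path.join(folder, "00123456.jpg")  # hypothetical file name

t0 = time.perf_counter()
with open(target, "rb") as f:                  # direct lookup: fast even with 600k siblings
    f.read(1)
print("direct open:", time.perf_counter() - t0, "s")

t0 = time.perf_counter()
entries = os.listdir(folder)                   # full directory scan: slow with huge folders
print("listing", len(entries), "entries:", time.perf_counter() - t0, "s")
```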
Provided NTFS has directory indexes, it should be alright at the application level.
I mean that opening files by name, deleting, renaming, etc. programmatically should work nicely.
But the problem is always the tools. Third-party tools (such as Windows Explorer, your backup tool, etc.) are likely to suck, or at least be extremely unusable, with large numbers of files per directory.
Anything which does a directory scan is likely to be quite slow; worse, some of these tools have poor algorithms which don't scale to even modest (10k+) numbers of files per directory.
NTFS folders store an index file with links to all of their contents. With a large number of images, that file is going to grow a lot and impact your performance negatively. So yes, on that argument alone, you are better off storing the images in chunks in subfolders. Fragmentation inside indexes is a pain.