How to find less frequently accessed files in HDFS - hive

Besides using Cloudera Navigator, how can I find the less frequently accessed files in HDFS?

I assume that you are looking for the time a file was last accessed (open, read, etc.), on the reasoning that a file whose last access lies further in the past is probably accessed less often.
Whereas in Linux you can do this quite simply via ls -l -someMoreOptions, more work is necessary in HDFS.
Maybe you could monitor the /hdfs-audit.log for cmd=open of the file in question. Or you could implement a small function that reads FileStatus.getAccessTime(), as mentioned under Is there anyway to get last access time of HDFS files? or How to get last access time of any files in HDFS? in Cloudera Community.
In other words, it will be necessary to create a small program which scans all the files, reads out the properties
...
status = fs.getFileStatus(new Path(line));
...
long lastAccessTimeLong = status.getAccessTime();
Date lastAccessTimeDate = new Date(lastAccessTimeLong);
...
and orders the results. With that you will be able to find files which have not been accessed for a long time.
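The audit-log route mentioned above can be sketched in a few lines of Python. This is only a sketch under assumptions: it assumes each audit line starts with a timestamp of the form "2020-10-10 08:00:00,123" and contains tab- or space-separated cmd=open and src=<path> fields, and the function names are made up for illustration. Paths containing whitespace would be truncated by this simple pattern.

```python
import re
from datetime import datetime

# Assumed line layout: "<date> <time>,<ms> ... cmd=open<TAB>src=/some/path<TAB>..."
AUDIT_OPEN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*?\bcmd=open\s+src=(?P<src>\S+)"
)

def last_open_times(lines):
    """Map each path to the most recent timestamp at which it was opened."""
    latest = {}
    for line in lines:
        match = AUDIT_OPEN.search(line)
        if match is None:
            continue  # not a cmd=open entry, or a layout this sketch does not know
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
        src = match.group("src")
        if src not in latest or ts > latest[src]:
            latest[src] = ts
    return latest

def least_recently_opened(lines):
    """Return (path, last_open) pairs, oldest last access first."""
    return sorted(last_open_times(lines).items(), key=lambda item: item[1])
```

Files that never appear in the log window are not in the result at all, so they would have to be found by comparing against a full file listing.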

Related

How to efficiently filter a dataframe from an S3 bucket

I want to pull a specified number of days from an S3 bucket that is partitioned by year/month/day/hour. This bucket has new files added every day and will grow to be rather large. I want to do spark.read.parquet(<path>).filter(<condition>), however when I ran this it took significantly longer (1.5 hr) than specifying the paths (.5 hr). I don't understand why it takes longer. Should I be adding a .partitionBy() when reading from the bucket? Or is it because of the volume of data in the bucket that has to be filtered?
The problem you are facing is related to partition discovery. If you point Spark at the path where your Parquet files are, with spark.read.parquet("s3://my_bucket/my_folder"), Spark will trigger a task called
Listing leaf files and directories for <number> paths
This is the partition discovery step. Why does that happen? When you read by path, Spark has no place to look up where the partitions are or how many there are, so it has to list the folders itself.
In my case if I run a count like this:
spark.read.parquet("s3://my_bucket/my_folder/").filter('date === "2020-10-10").count()
It will trigger the listing, which takes 19 seconds for around 1700 folders. Plus the 7 seconds to count, that is a total of 26 seconds.
To avoid this overhead you should use a metastore. AWS provides a good solution with AWS Glue, which can be used just like the Hive Metastore in a Hadoop environment.
With Glue you can store the table metadata and all the partitions. Instead of giving the Parquet path, you point to the table like this:
spark.table("my_db.my_table").filter('date === "2020-10-10").count()
For the same data with the same filter, the file listing doesn't happen and the whole count took only 9 seconds.
In your case, partitioning by year, month, day and hour, we are talking about 8760 folders per year.
I would recommend you take a look at this link and this link
These show how you can use Glue as your Hive Metastore. That will help a lot to improve the speed of partition queries.
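Without a metastore, the faster variant from the question (passing explicit paths) can be generated rather than typed by hand, so that only the requested days are listed. A minimal sketch, assuming Hive-style folder names such as year=2020/month=10/day=10 with zero-padded values; the real layout of the bucket is not shown in the question, and the helper name is hypothetical.

```python
from datetime import date, timedelta

def daily_partition_paths(base, end_day, num_days):
    """Build one path per day for the num_days ending at end_day (inclusive)."""
    root = base.rstrip("/")
    paths = []
    for offset in range(num_days - 1, -1, -1):
        day = end_day - timedelta(days=offset)
        # Assumed folder naming; adjust to the bucket's actual layout.
        paths.append(f"{root}/year={day.year}/month={day.month:02d}/day={day.day:02d}")
    return paths

paths = daily_partition_paths("s3://my_bucket/my_folder", date(2020, 10, 10), 3)

# With PySpark the list could then be passed on, for example:
# df = spark.read.option("basePath", "s3://my_bucket/my_folder").parquet(*paths)
```

Setting basePath keeps year, month and day available as columns even though the read starts below the table root.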

Cache larger-than-memory dataframe to local disk with Dask

I have a bunch of files in S3 which comprise a larger-than-memory dataframe.
Currently, I use Dask to read the files into a dataframe, perform an inner-join with a smaller dataset (which will change on each call to this function, whereas huge_df is basically the full dataset & does not change), call compute to get a much smaller pandas dataframe, and then do some processing. E.g:
huge_df = ddf.read_csv("s3://folder/**/*.part")
merged_df = huge_df.join(small_df, how='inner', ...)
merged_df = merged_df.compute()
...other processing...
Most of the time is spent downloading the files from S3. My question is: is there a way to use Dask to cache the files from S3 on disk, so that on subsequent calls to this code, I could just read the dataframe files from disk, rather than from S3? I figure I can't just call huge_df.to_csv(./local-dir/) since that will bring huge_df into memory which won't work.
I'm sure there is a way to do this using a combination of other tools plus standard Python IO utilities, but I wanted to see if there was a way to use Dask to download the file contents from S3 and store them on the local disk without bringing everything into memory.
Doing huge_df.to_csv would have worked, because it would write each partition to a separate file locally, and so the whole thing would not have been in memory at once.
However, to answer the specific question, dask uses fsspec to manage file operations, and it allows for local caching, e.g., you could do
huge_df = ddf.read_csv("simplecache::s3://folder/**/*.part")
By default, this will store the files in a temporary folder, which gets cleaned up when you exit the python session, but you can provide options using an optional argument storage_options={"simplecache": {..}} to specify the cache location, or use "filecache" instead of "simplecache" if you want to enable the local copies to expire after some time or to check the target for updated versions.
Note that, obviously, these will work with a distributed cluster only if all the workers have access to the same cache location, since the loading of a partition might happen on any of your workers.
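To make the storage_options part concrete, here is a small sketch of how the cache location might be passed. It assumes that fsspec's caching filesystems take the cache directory through a cache_storage option; the helper names and the ./s3-cache directory are made up for illustration, so check the option name against the fsspec version you have installed.

```python
def cached_read_args(remote_glob, cache_dir):
    """Return the chained URL and storage_options for a simplecache read."""
    url = f"simplecache::{remote_glob}"
    # Options are keyed by protocol name; "cache_storage" is the assumed option name.
    storage_options = {"simplecache": {"cache_storage": cache_dir}}
    return url, storage_options

def read_cached_csv(remote_glob, cache_dir):
    """Read CSV parts through the local cache (requires dask and s3fs)."""
    import dask.dataframe as ddf  # imported lazily so the helper above has no dependencies

    url, storage_options = cached_read_args(remote_glob, cache_dir)
    return ddf.read_csv(url, storage_options=storage_options)

# Example call (not executed here because it needs S3 access):
# huge_df = read_cached_csv("s3://folder/**/*.part", "./s3-cache")
```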

ETL file loading: files created today, or files not already loaded?

I need to automate a process to load new data files into a database. My question is about the best way to determine which files are "new" in an automated fashion.
Files are retrieved from a directory that is synced nightly, so the list of files keeps growing. I don't have the option to wipe out files that I have already retrieved.
New records are stored in a raw data table that has a field indicating the filename where each record originated, so I could compare all filenames currently in the directory with filenames already in the raw data table, and process only those filenames that aren't in common.
Or I could use timestamps that are in the filenames, and process only those files that were created since the last time the import process was run.
I am leaning toward using the first approach since it seems less prone to error, but I haven't had much luck finding whether this is actually true. What are the pitfalls of determining new files in this manner, by comparing all filenames with the filenames already in the database?
File name comparison:
If you have millions of files, then comparison might not be what you are looking for.
You must be sure that the files in the said folder never get deleted.
Get filenames by date:
Since these filenames are retrieved once a day, this can guarantee accuracy (even if files are created only milliseconds apart).
Will be efficient if there are many files.
Pentaho gives the modified date, not the created date.
To do either of the above, you can use the following Pentaho step.
Configuration of the Get File Names step:
File/Directory: Give the path of the folder that contains the files.
Wildcard (RegExp): .*\.* to get all or .*\.pdf to get a specific format.
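Outside Pentaho, the file name comparison itself comes down to a set difference. A minimal Python sketch, with hypothetical function names; the list of already loaded names is assumed to come from a distinct query on the filename field of the raw data table.

```python
import os

def select_new_files(directory_names, loaded_names):
    """Return names present in the directory but not yet in the raw data table."""
    loaded = set(loaded_names)  # set membership keeps each lookup cheap
    return sorted(name for name in directory_names if name not in loaded)

def list_directory(directory):
    """List the regular files in the nightly-synced directory."""
    return [entry.name for entry in os.scandir(directory) if entry.is_file()]

# Example: new_names = select_new_files(list_directory("/path/to/sync"), names_from_db)
```

A name that is in the table but no longer in the directory is simply ignored here, so a deleted file does not break the comparison, although it would go unnoticed.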

Processing Files - Keeping Track

Currently we have an application that picks files out of a folder and processes them. It's simple enough but there are two pretty major issues with it. The processing is simply converting images to a base64 string and putting that into a database.
Problem
The problem is after the file has been processed, it won't need processing again and for performance reasons we don't really want it to be so.
Moving the files after processing is also not an option as these image files need to always be available in the same directory for other parts of the system to use.
This program must be written in VB.NET as it is an extension of a product already using this.
Ideal Solution
What we are looking for really is a way of keeping track of which files have been processed so we can develop a kind of ignore list when running the application.
For every processed image file Image0001.ext, once processed create a second file Image0001.ext.done. When looking for files to process, use a filter on the extension type of your images, and as each filename is found check for the existence of a .done file.
This approach will get incrementally slower as the number of files increases, but unless you move (or delete) files this is inevitable. On NTFS you should be OK until you get well into the tens of thousands of files.
EDIT: My approach would be to apply KISS:
Everything is in one folder, therefore cannot be a big number of images: I don't need to handle hundreds of files per hour every hour of every day (first run might be different).
Writing a console application to convert one file (passed on the command line) is easy. Left as an exercise.
There is no indication of any urgency to the conversion: can schedule to run every 15min (say). Also left as an exercise.
Use PowerShell to run the program for all images not already processed:
cd $TheImageFolder;
# .png assumed as image type. Can have multiple filters here for more image types.
# ProcessFile stands for the console application described above.
Get-ChildItem -Filter *.png |
Where-Object { -not (Test-Path -Path ($_.FullName + '.done')) } |
ForEach-Object { ProcessFile $_.FullName; New-Item ($_.FullName + '.done') -ItemType File }
In a table, store the file name and file size (and the file hash if you need to be more sure about the file) for each file processed. Now, when you're taking a new file to process, you can compare it with your table entries (a simple query would do). Using hashes might degrade your performance, but you can be a bit more certain about an already processed file.
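The table-based approach could look roughly like the following. It is written in Python with SQLite purely to illustrate the idea (the question itself requires VB.NET), and the table name, column names and function names are all hypothetical.

```python
import hashlib
import os
import sqlite3

def open_tracker(db_path=":memory:"):
    """Open (or create) the tracking table of processed files."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed ("
        "name TEXT PRIMARY KEY, size INTEGER NOT NULL, sha256 TEXT NOT NULL)"
    )
    return conn

def file_fingerprint(path):
    """Return (size, sha256 hex digest), reading the file in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()

def needs_processing(conn, path):
    """True if the file is unknown or its size/hash differs from the stored row."""
    row = conn.execute(
        "SELECT size, sha256 FROM processed WHERE name = ?",
        (os.path.basename(path),),
    ).fetchone()
    return row is None or row != file_fingerprint(path)

def mark_processed(conn, path):
    """Record the file as processed, replacing any older entry for that name."""
    size, sha = file_fingerprint(path)
    conn.execute(
        "INSERT OR REPLACE INTO processed (name, size, sha256) VALUES (?, ?, ?)",
        (os.path.basename(path), size, sha),
    )
    conn.commit()
```

Comparing the size first and hashing only on a size match would reduce the hashing cost mentioned above; the sketch always hashes to stay short.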

Hadoop Input Files

Is there a difference between having say n files with 1 line each in the input folder and having 1 file with n lines in the input folder when running hadoop?
If there are n files, does the "InputFormat" just see it all as 1 continuous file?
There's a big difference. It's frequently referred to as "the small files problem", and has to do with the fact that Hadoop expects to split giant inputs into smaller tasks, but not to collect small inputs into larger tasks.
Take a look at this blog post from Cloudera:
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/
If you can avoid creating lots of files, do so. Concatenate when possible. Large splittable files are MUCH better for Hadoop.
I once ran Pig on the Netflix dataset. It took hours to process just a few gigs. I then concatenated the input files (I think it was a file per movie, or a file per user) into a single file and had my result in minutes.
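The concatenation step can be done before the data is uploaded. A rough sketch in Python, assuming the small files sit in a local directory, that each one ends with a newline, and that the output file is written outside that directory; the function name is made up for illustration.

```python
import os
import shutil

def concatenate_files(input_dir, output_path):
    """Append every regular file in input_dir to output_path, in name order."""
    names = sorted(
        name for name in os.listdir(input_dir)
        if os.path.isfile(os.path.join(input_dir, name))
    )
    with open(output_path, "wb") as out:
        for name in names:
            with open(os.path.join(input_dir, name), "rb") as src:
                shutil.copyfileobj(src, out)  # streams, so large inputs are fine
    return len(names)
```

The return value is the number of files merged, which is handy as a sanity check against the expected input count.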