NFS + Hard Links?

I know it is a condition of hard links that they cannot span filesystems. Does this apply to NFS mounts? Given the following directory structure, would I be able to create a hard link in directory A that points to a file in directory B?
/root
/A
/B <-NFS mount
For example, I'd like to run ln /root/B/file.txt /root/A/linkedfile.txt

It'd be a good idea to first understand exactly what a hard link is.
Usually on a Unix-like system, a filename in a directory points to an inode number - essentially an ID number for the file. A "hard link" is just another filename with that same inode number. Now you have two different names that point to the same numbered file.
But notice that there's not really a direct connection between those two names. The relationship is that Name1 and Name2 both have their inode number set to 12756 - but there's nothing you can hold up and say "this thing in my hand is the link between two files". They're just two database entries that share an id number. You can do a query (slow, since you're walking through every file entry on the system) for filenames that share an id number, but that's it.
So it doesn't mean anything to create a "hard link between two filesystems": two filesystems have different numbering schemes (inode 1234 on one filesystem and inode 1234 on another point to completely different files), and since the only thing a directory entry stores is a name+inodeNumber, there's nothing to be done.
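Here is a minimal sketch of this in Python (the /root/A file names are made up for illustration, and the second call assumes /root/B really is a separate NFS filesystem):
import errno, os

# Same filesystem: the new name simply reuses the original's inode number.
os.link("/root/A/original.txt", "/root/A/another-name.txt")
print(os.stat("/root/A/original.txt").st_ino == os.stat("/root/A/another-name.txt").st_ino)  # True

# Across filesystems (here /root/B is the NFS mount): the kernel refuses with EXDEV.
try:
    os.link("/root/B/file.txt", "/root/A/linkedfile.txt")
except OSError as e:
    print(errno.errorcode[e.errno])  # 'EXDEV' - "Invalid cross-device link"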

Well, since /B is a separate file system (a mounted NFS file system) you cannot make a hard link between it and /A, because they are not on the same file system.
It's because a hard link doesn't make a copy of the data but only another pointer to that data, so both names have to be in the same "address space".

How to find less frequently accessed files in HDFS

Besides using Cloudera Navigator, how can I find the less frequently accessed files in HDFS?
I assume that you are looking for the time a file was last accessed (opened, read, etc.), because the longer ago that was, the less frequently the file has been accessed.
While you can do this in Linux quite simply via ls -l plus a few more options, in HDFS more work is necessary.
Maybe you could monitor /hdfs-audit.log for cmd=open entries for the file in question. Or you could implement a small function that reads out FileStatus.getAccessTime(), as mentioned under "Is there anyway to get last access time of HDFS files?" or "How to get last access time of any files in HDFS?" in the Cloudera Community.
In other words, it will be necessary to create a small program which scans all the files and reads out the relevant properties
...
FileStatus status = fs.getFileStatus(new Path(line));
...
long lastAccessTimeLong = status.getAccessTime();
Date lastAccessTimeDate = new Date(lastAccessTimeLong);
...
and then orders the results. With that you will be able to find files which have not been accessed for a long time.
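A rough sketch of the same idea in Python over the WebHDFS REST interface (this assumes WebHDFS is enabled, the namenode address below is a placeholder, and access times are only recorded if dfs.namenode.accesstime.precision is non-zero):
import requests
from datetime import datetime

NAMENODE = "http://namenode.example.com:9870"  # placeholder address

def list_status(path):
    # LISTSTATUS returns accessTime/modificationTime in milliseconds since the epoch
    r = requests.get(f"{NAMENODE}/webhdfs/v1{path}?op=LISTSTATUS")
    r.raise_for_status()
    return r.json()["FileStatuses"]["FileStatus"]

def walk(path, results):
    # Recursively collect (accessTime, path) pairs for every file under `path`.
    for st in list_status(path):
        child = path.rstrip("/") + "/" + st["pathSuffix"]
        if st["type"] == "DIRECTORY":
            walk(child, results)
        else:
            results.append((st["accessTime"], child))

files = []
walk("/data", files)  # hypothetical starting directory
for atime, name in sorted(files)[:20]:  # the 20 least recently accessed files
    print(datetime.fromtimestamp(atime / 1000), name)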

How to name the second element?

As a programmer with OCD (Obsessive-Compulsive Disorder), I often wonder how people usually name the second element (variable, file name, etc.) in the programming world.
For example, I create a file with the name file. I do NOT expect there to be another one in this series.
However, one day I get a second one. What do you usually name it?
For example, it can be file1, or file2, or file0, or file_b, or fileB, or file_, or file (1) ...
There could be a lot. Which one is better (for some reasons)?
I am mostly concerned about file2 VS file1, as elements start from 0 in the computer science world, whereas the real world starts from 1.
Depending on how exactly it should read, I think most people will do file_001 or file_002, but I've seen it done many different ways in professionally written code, though all the numbering schemes use numbers and not letters.
Also, always name your files with leading zeros so that the files don't sort out of order: without them, file11 would come before file2 in a lexicographic listing, so do something like file002 and file011 instead.
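A quick Python illustration of the zero-padding point (the width of three digits is just an example):
# Zero-padded names sort correctly as plain strings; unpadded ones do not.
padded = [f"file_{i:03d}" for i in (1, 2, 11)]  # ['file_001', 'file_002', 'file_011']
unpadded = [f"file_{i}" for i in (1, 2, 11)]    # ['file_1', 'file_2', 'file_11']
print(sorted(padded))    # keeps numeric order
print(sorted(unpadded))  # 'file_11' sorts before 'file_2'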
It's usually not a big deal, but open source projects may specify a way to name files in the readme. If file naming is important to you, it never hurts to explain how you name your files in your project readme.
As is often the case, it's better to refactor instead of patching up the current "code": rename the first file as well (to file1, file_01, file_a or whatever), unless that would cause too much trouble (but even in that case, it would make sense to consider using a "view": leave the current file but add a file_01 hardlink/softlink to it - or probably better, a softlink from file to file_01).
For filenames in particular, leaving file as-is will be annoying because it will usually get placed after the numbered files in directory listings.
And in the last paragraph, I imagine you meant file0 VS file1...?
If so, I'd say to go with 1, it's much more common in my experience.
And it's not true that elements start from 0 in the computer science world; that's indeed what most programming languages and almost all low-level stuff do, but it's not a must. From personal experience I can guarantee you that, when you can do it without too much risk, starting from 1 in many cases helps readability a lot, and this base-0 thing is one of the many mantras that should be let go in software development.
But in any case for naming files and stuff in general (as opposed to array-indexing) it's more common to start from 1 (in my experience).

ETL file loading: files created today, or files not already loaded?

I need to automate a process to load new data files into a database. My question is about the best way to determine which files are "new" in an automated fashion.
Files are retrieved from a directory that is synced nightly, so the list of files keeps growing. I don't have the option to wipe out files that I have already retrieved.
New records are stored in a raw data table that has a field indicating the filename where each record originated, so I could compare all filenames currently in the directory with filenames already in the raw data table, and process only those filenames that aren't in common.
Or I could use timestamps that are in the filenames, and process only those files that were created since the last time the import process was run.
I am leaning toward using the first approach since it seems less prone to error, but I haven't had much luck finding whether this is actually true. What are the pitfalls of determining new files in this manner, by comparing all filenames with the filenames already in the database?
File name comparison:
If you have millions of files, then comparison might not be what you are looking for.
You must be sure that the files in the said folder never get deleted.
(A small sketch of this approach is shown after the Pentaho notes below.)
Get filenames by date:
Since these filenames are retrieved only once a day, the timestamps can guarantee accuracy (even if the files were created milliseconds apart).
It will be efficient if there are many files.
Note that Pentaho gives the modified date, not the created date.
To do either of the above, you can use the following Pentaho step.
Configuration of the Get File Names step:
File/Directory: give the path of the folder that contains the files.
Wildcard (RegExp): .*\.* to get all files, or .*\.pdf to get a specific format.
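A minimal sketch of the filename-comparison approach in Python (the table and column names raw_data/source_filename, the directory path, and the use of sqlite3 are all stand-ins for whatever database and layout you actually have):
import os
import sqlite3

INCOMING_DIR = "/data/incoming"          # hypothetical sync directory

conn = sqlite3.connect("warehouse.db")   # placeholder connection
loaded = {row[0] for row in conn.execute("SELECT DISTINCT source_filename FROM raw_data")}

on_disk = set(os.listdir(INCOMING_DIR))
new_files = sorted(on_disk - loaded)     # present in the folder but not yet in the table

for name in new_files:
    path = os.path.join(INCOMING_DIR, name)
    # load_file(path) would parse the file and insert its rows, tagged with `name`
    print("would load:", path)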

boto3's atomic test and create?

On normal file systems it is common to have the pattern of trying to create a file and failing if it already exists, to guarantee that you are creating a unique filename.
How can the same be achieved with S3: if I have many parallel tasks creating keys with random names on S3, how can I "test and write" atomically, to guarantee that chance doesn't create a race and leave me with messed-up data?
Thanks
After a few days of thinking, I believe I have found a very decent solution to my own problem: activate versioning on the bucket and freely save under whatever key name you want. From the response, take the VersionId and encode the object URL in an agreed format (e.g. s3://your-bucket/your-key?versionId=XXXXX). This URL always refers to the object you wanted to save in the first place, with no possibility of clashes/races.
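A minimal boto3 sketch of that approach (the bucket and key names are placeholders, and the versioning call only needs to be done once per bucket):
import boto3

s3 = boto3.client("s3")
bucket = "your-bucket"  # placeholder

# One-time setup: turn on versioning for the bucket.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Every put now returns a VersionId, even if another task writes the same key.
resp = s3.put_object(Bucket=bucket, Key="some/key.txt", Body=b"payload")
url = f"s3://{bucket}/some/key.txt?versionId={resp['VersionId']}"
print(url)  # always refers to exactly the object this task wrote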

Processing Files - Keeping Track

Currently we have an application that picks files out of a folder and processes them. It's simple enough but there are two pretty major issues with it. The processing is simply converting images to a base64 string and putting that into a database.
Problem
The problem is that once a file has been processed it won't need processing again, and for performance reasons we really don't want it to be processed a second time.
Moving the files after processing is also not an option as these image files need to always be available in the same directory for other parts of the system to use.
This program must be written in VB.NET as it is an extension of a product already using this.
Ideal Solution
What we are looking for really is a way of keeping track of which files have been processed so we can develop a kind of ignore list when running the application.
For every image file Image0001.ext, once it has been processed, create a second file Image0001.ext.done. When looking for files to process, filter on the extension of your image files, and as each filename is found check for the existence of a corresponding .done file.
This approach will get incrementally slower as the number of files increases, but unless you move (or delete) files this is inevitable. On NTFS you should be OK until you get well into the tens of thousands of files.
EDIT: My approach would be to apply KISS:
Everything is in one folder, therefore it cannot be a huge number of images: I don't need to handle hundreds of files per hour, every hour of every day (the first run might be different).
Writing a console application to convert one file (passed on the command line) is easy. Left as an exercise.
There is no indication of any urgency to the conversion: can schedule to run every 15min (say). Also left as an exercise.
Use PowerShell to run the program for all images not already processed:
cd $TheImageFolder;
# .png assumed as image type. Can have multiple filters here for more image types.
Get-ChildItem -Filter *.png |
Where-Object { -not (Test-Path -Path ($_.FullName + '.done')) } |
ForEach-Object { ProcessFile $_.FullName; New-Item ($_.FullName + '.done') -ItemType File }
In a table, store the file name, file size (and file hash if you need to be more sure about the file) for each file processed. Now, when you're picking up a new file to process, you can compare it with your table entries (a simple query would do). Using hashes might degrade your performance, but it makes you a bit more certain that a file has already been processed.
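A small sketch of that bookkeeping in Python (sqlite3 and the table/column names are just an illustration; the same idea translates directly to VB.NET and whatever database the product already uses):
import hashlib
import os
import sqlite3

conn = sqlite3.connect("processed.db")  # placeholder database
conn.execute("CREATE TABLE IF NOT EXISTS processed (name TEXT PRIMARY KEY, size INTEGER, sha256 TEXT)")

def already_processed(path):
    # Compare name, size and hash against what was recorded when the file was converted.
    row = conn.execute("SELECT size, sha256 FROM processed WHERE name = ?",
                       (os.path.basename(path),)).fetchone()
    if row is None:
        return False
    digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
    return row[0] == os.path.getsize(path) and row[1] == digest

def mark_processed(path):
    digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
    conn.execute("INSERT OR REPLACE INTO processed VALUES (?, ?, ?)",
                 (os.path.basename(path), os.path.getsize(path), digest))
    conn.commit()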