S3FS directory listings are slow - is caching somehow possible?

I'm wondering if there is any way to practically speed up directory listings on an s3fs mount. I have a WebDAV server, used only for read operations, that basically serves files from my s3fs mount. The problem is that listing directories is slow, while transfer speed is fine.
So I started to look around the web a bit and stumbled across "JuiceFS"; sadly this was not an option for several reasons. Then I tried "vmtouch" to index the mounted S3 storage into local memory, but this also does not work, as it's a shared resource managed by the FUSE kernel module.
Even using the S3FS built-in cache does not solve the issue; it actually makes things worse, as each file first gets downloaded from S3 into the local cache and is then served via WebDAV ...
Is there no way to just speed up directory listings on S3? Basically, this is all I need in the end, and not a fancy POSIX-compatible block device like JuiceFS, which basically creates its own logic on top of your S3 bucket ... Not what I was searching for.

Unfortunately s3fs 1.91 has poor readdir performance. There are a few open issues and pull requests that track future improvements:
1. Option to not use head requests
2. Consider changing -o notsup_compat_dir default
3. Consider changing -o noobj_cache default
4. Increase -o multireq_max
5. Issue parallel requests in get_object_attribute
You can toggle #2-#4 via command-line flags today, but #5 is still in progress. #1 is the big win that could give a 100x speedup, but it trades away some POSIX compatibility, e.g., no UID/GID and no permissions. One alternative that you can try today is goofys, which implements #1.
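For reference, here is a hedged sketch of a mount line that turns on some of those knobs; the bucket name and mount point are placeholders, and the exact option names, defaults, and safe values depend on your s3fs release, so check its man page first:
# placeholder bucket/mount point; verify options against your s3fs version
s3fs mybucket /mnt/s3 -o notsup_compat_dir -o multireq_max=50 -o stat_cache_expire=300
# goofys alternative: skips the per-object HEAD requests (#1) at the cost of POSIX metadata
goofys mybucket /mnt/s3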

Host Disk Usage: Warning message regarding disk usage

I've downloaded version HDF_3.0.2.0_vmware of the Hortonworks Sandbox. I am using VMware Player version 6.0.7 on my laptop. Shortly after starting up and logging into Ambari, I see this alert:
The message that is cut off reads: "Capacity Used: [60.11%, 32.3 GB], Capacity Total: [53.7 GB], path=/usr/hdp". I'd hoped that I would be able to focus on NiFi/Storm development rather than administering the sandbox itself; however, it looks like the VM is undersized. Here are the VM settings I have for storage. How do I go about correcting the underlying issue prompting the alert?
I had a similar issue; it's about node partitioning and the data directories configured under HDFS -> Configs -> Settings -> DataNode.
You can check your node partitioning using the command below:
lsblk -o NAME,FSTYPE,SIZE,MOUNTPOINT,LABEL
Usually the HDFS NameNode or DataNode directories point to the root partition. You can change the alert threshold values as a temporary measure; for a permanent solution, add additional data directories.
The links below can be helpful for doing this; see the sketch after them for the basic commands.
https://community.hortonworks.com/questions/21212/configure-storage-capacity-of-hadoop-cluster.html
From the above link: I think your partitioning is wrong and you are not using "/" for the HDFS directory. If you want to use the full disk capacity, you can create a folder under "/", for example /data/1, on every data node using the command "mkdir -p /data/1", add it to dfs.datanode.data.dir, and restart the HDFS service.
https://hadooptips.wordpress.com/2015/10/16/fixing-ambari-agent-disk-usage-alert-critical/
https://community.hortonworks.com/questions/21687/how-to-increase-the-capacity-of-hdfs.html
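A minimal sketch of that procedure, assuming the new directory lives on your larger partition at /data/1 and that the HDFS user/group on your sandbox is hdfs:hadoop (adjust both to your environment):
# on every DataNode, create the directory and hand it to HDFS
sudo mkdir -p /data/1
sudo chown -R hdfs:hadoop /data/1
# then append /data/1 to dfs.datanode.data.dir in Ambari (HDFS -> Configs) and restart HDFS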
I am not currently able to replicate this, but based on the screenshots the warning is just that there is less space available than recommended. If this is the case everything should still work.
Given that this is a sandbox that should never be used for production, feel free to ignore the warning.
If you want to get rid of the warning sign, it may be possible to do a quick fix by changing the warning threshold via the alert definition.
If this is still not sufficient, or you want to leverage more storage, please follow the steps outlined by #manohar

Rsnapshot without hard links?

I'm using Rsnapshot to back up all my servers onto an EncFS-encrypted partition. The partition was created with the default paranoia mode offered by EncFS, so it doesn't support hard links.
I'm able to run Rsnapshot the first time (creating daily.0, weekly.0, monthly.0) but not the second time.
Is there a way to use Rsnapshot without the hardlinking feature? I know it sounds a bit silly, but my rsnapshot.conf is very well configured and I don't want to switch to other software or erase and recreate the EncFS volume.
Thank you
Look for this section in /etc/rsnapshot.conf file:
# If your version of rsync supports --link-dest, consider enabling this.
# This is the best way to support special files (FIFOs, etc) cross-platform.
# The default is 0 (off).
#
#link_dest 0
Make sure "link_dest" is disabled (set to 0 or left commented out). It is used as a flag when the rsync command is called in the background. As per the rsync man page:
--link-dest=DIR hardlink to files in DIR when unchanged
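As a quick check, you can ask rsnapshot to print the rsync commands it would run and confirm that --link-dest no longer appears; this assumes a backup interval named "daily" in your rsnapshot.conf:
# verify the configuration parses cleanly
rsnapshot configtest
# dry run: show the commands for the "daily" interval without executing them
rsnapshot -t daily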

Managing files on Amazon S3

I have a git repository that stores audio files.
Obviously, it's not the best usage of git, and the repo has become quite large.
As an alternative, I would like to be able to manipulate these audio files at the command line, "committing" when some work is done.
Is this kind of workflow possible when manipulating Amazon S3 files at the command line?
Or do you, for example, scp files to S3?
There are some rsync-to-S3 tools that may work for you; here is an example which I have not tried: http://www.s3rsync.com/
How important are the older versions of the audio? Amazon S3 buckets can have 'versioning' turned on, and you get full versioning support. You pay full price for each stored version - I don't know if you have 10 GB or 10 TB to store, what your budget is, etc. The Amazon versioning is nice, but there are not a lot of tools that fully support it.
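If you go that route, versioning is a single per-bucket switch; a sketch using the current AWS CLI with a placeholder bucket name:
# every overwrite or delete then keeps the previous object version
aws s3api put-bucket-versioning --bucket my-audio-bucket --versioning-configuration Status=Enabled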
To manipulate S3 files you will first have to download them and then upload them again when you are done; this is relatively simple to do.
However, if the number of files you have is truly large, the slow transfer rate and bandwidth charges will kill you. If you don't have that many files, Dropbox is built on top of S3 and has syncing and rudimentary version control, and bandwidth is not charged.
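For the download/edit/upload round trip, a minimal sketch with the AWS CLI (bucket and key names are placeholders):
# pull the file down, edit it locally, then push it back
aws s3 cp s3://my-audio-bucket/track01.wav ./track01.wav
# ... edit track01.wav with your audio tools ...
aws s3 cp ./track01.wav s3://my-audio-bucket/track01.wav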
I feel like using a good networked storage system and git on your LAN is still the better idea.

NFS file open in C code

If I open a file in my C/C++/Java code using a pathname that goes to an NFS directory, how does the read and write syntax work, with NFS being stateless and all? I have tried but can't find example code accessing NFS-mounted files. My current understanding is that it is the job of the NFS client to keep state (like the read and write pointer) and the application uses the same syntax.
A related question is regarding VFS and UFS. Are all files on a current Unix machine accessed through their vnodes first and then (depending on local vs. remote) inode or rnode structures?
NFS (short of file locking) is no different than local storage to user-level applications. It might be slower, or it might drop out unexpectedly, but that can happen to local storage too. That's probably why you can't find specific NFS-centric example code.
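One way to see this for yourself is to trace an ordinary tool on a local file and on an NFS-mounted one and compare the system calls; a sketch assuming a mount at /mnt/nfs (whether you see open or openat depends on your libc):
# same open/read/close sequence regardless of where the file lives
strace -e trace=openat,read,close cat /tmp/local.txt
strace -e trace=openat,read,close cat /mnt/nfs/remote.txt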

Fastest / best way copy data between S3 to EC2?

I have a fairly large amount of data (~30G, split into ~100 files) I'd like to transfer between S3 and EC2: when I fire up the EC2 instances I'd like to copy the data from S3 to EC2 local disks as quickly as I can, and when I'm done processing I'd like to copy the results back to S3.
I'm looking for a tool that'll do a fast / parallel copy of the data back and forth. I have several scripts hacked up, including one that does a decent job, so I'm not looking for pointers to basic libraries; I'm looking for something fast and reliable.
Unfortunately, Adam's suggestion won't work, as his understanding of EBS is wrong (although I wish he were right and have often thought myself that it should work that way). EBS has nothing to do with S3; it only gives you an "external drive" for EC2 instances that is separate from, but attachable to, the instances. You still have to copy between S3 and EC2, even though there are no data transfer costs between the two.
You didn't mention the operating system of your instance, so I cannot give tailored information. A popular command-line tool I use is http://s3tools.org/s3cmd ... it is based on Python and therefore, according to its website, it should work on Windows as well as Linux, although I use it all the time on Linux. You could easily whip up a quick script that uses its built-in "sync" command, which works similarly to rsync, and have it triggered every time you're done processing your data. You could also use the recursive put and get commands to transfer data only when needed.
There are also graphical tools like CloudBerry Pro with some command-line options for Windows that let you set up scheduled commands. http://s3tools.org/s3cmd is probably the easiest.
By now there is a sync command in the AWS Command Line Interface that should do the trick: http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
On startup:
aws s3 sync s3://mybucket /mylocalfolder
before shutdown:
aws s3 sync /mylocalfolder s3://mybucket
Of course, the details are always fun to work out, e.g., how parallel it is (and whether you can make it more parallel, and whether that is any faster given the virtual nature of the whole setup).
Btw hope you're still working on this... or somebody is. ;)
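On the parallelism question: the AWS CLI already splits transfers into concurrent parts, and the concurrency is configurable; a sketch, with 20 picked arbitrarily:
# raise the number of concurrent S3 requests used by aws s3 cp/sync
aws configure set default.s3.max_concurrent_requests 20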
I think you might be better off using an Elastic Block Store to store your files instead of S3. An EBS is akin to a 'drive' on S3 that can be mounted into your EC2 instance without having to copy the data each time, thereby allowing you to persist your data between EC2 instances without having to write to or read from S3 each time.
http://aws.amazon.com/ebs/
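For completeness, a sketch of attaching and mounting an EBS volume; the volume ID, instance ID, device names, and mount point are all placeholders, and device naming varies by instance type:
# attach an existing volume to a running instance
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0123456789abcdef0 --device /dev/sdf
# on the instance: create a filesystem the first time, then mount it
sudo mkfs -t ext4 /dev/xvdf
sudo mount /dev/xvdf /mnt/data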
Install the s3cmd package with
yum install s3cmd
or
sudo apt-get install s3cmd
depending on your OS
then copy data with:
s3cmd get s3://tecadmin/file.txt
s3cmd ls can also list the files.
For more details, see this.
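To push results back, s3cmd also has put and sync; a sketch reusing the example bucket name as a placeholder:
# upload a single result file
s3cmd put results.tar.gz s3://tecadmin/results.tar.gz
# or mirror a whole local directory into the bucket
s3cmd sync /mylocalfolder/ s3://tecadmin/results/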
For me the best approach is:
wget http://s3.amazonaws.com/my_bucket/my_folder/my_file.ext
from PuTTY