Set Cache Limit or Expiration for S3FS - amazon-s3

I have the option "-o use_cache=/tmp" set when I mount my S3 bucket. Is there a limit on how much room it will try to use in /tmp? Is there a way to limit that, or otherwise expire items after X amount of time?

You could use the (unsupported) sample_delcache.sh script from the s3fs-fuse project. Set up a cron job to run it every so often. There'd still be the risk of running out of space (or inodes, as I just did) before the next run of the cleanup script, but you should be able to dial it in.

There is now also an ensure_diskfree option to preserve some free space.

The local cache's growth is apparently unbounded but it is truly a "cache" (as opposed to what might be called a "working directory") in the sense that it can be safely purged at any time, such as with a cron job that removes files after a certain age, combining find and xargs and rm.
(xargs isn't strictly necessary, but it avoids issues that can occur when too many files are found to remove in one invocation.)
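Such a cron-driven cleanup could be sketched as follows. This is a minimal sketch, assuming the cache lives under /tmp/&lt;bucket-name&gt; and a 7-day age threshold; since cache entries are safe to delete at any time (per the above), the script can run while the bucket is mounted:

```python
import os
import time

def purge_cache(cache_dir, max_age_seconds):
    """Delete cached files whose modification time is older than the cutoff."""
    cutoff = time.time() - max_age_seconds
    removed = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed.append(path)
    return removed

# Hypothetical invocation: purge entries older than 7 days.
# purge_cache("/tmp/my-bucket", 7 * 24 * 3600)
```

Run it from cron (e.g. hourly), pointed at the bucket's cache directory; this is the same idea as the find/xargs/rm pipeline above, just easier to extend (for example, to also enforce a total-size cap).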

Related

Snakemake 200000 job submission

I have 200000 fasta sequences. I am using GATK to call variants and have created a wildcard for every sequence. Now I would like to submit 200000 jobs using snakemake. Will this cause a problem for the cluster? Is there a way to submit jobs in sets of 10-20?
First off, it might take some time to calculate the DAG, though I have been told that DAG calculation has recently been greatly improved. Either way, it might be wise to split the jobs up into batches.
Most clusters won't allow you to submit more than X jobs at the same time, usually in the range of 100-1000. The documentation is not fully clear on this, but when using --cluster, I believe the --jobs argument controls the number of jobs submitted at the same time, so with snakemake --jobs 20 --cluster "myclustercommand" you should be able to control this. Note that this limits the number of submitted jobs, not running jobs; it may be that all your jobs just sit in the queue. So it is probably best to check with your cluster administrator what the maximum number of submitted jobs is, and get as close to that number as you can.

How to find less frequently accessed files in HDFS

Besides using Cloudera Navigator, how can I find the less frequently accessed files in HDFS?
I assume you are looking for the time a file was last accessed (opened, read, etc.), on the premise that the longer ago the last access, the less frequently the file is used.
While in Linux you can do this quite simply via ls -lu (the -u option shows access time), in HDFS more work is necessary.
Maybe you could monitor the /hdfs-audit.log for cmd=open entries for the files in question. Or you could implement a small function that reads out FileStatus.getAccessTime(), as mentioned under "Is there anyway to get last access time of HDFS files?" or "How to get last access time of any files in HDFS?" in the Cloudera Community.
In other words, it will be necessary to write a small program which scans all the files and reads out this property
...
// fetch the metadata for the scanned path
status = fs.getFileStatus(new Path(line));
...
// the access time comes back as epoch milliseconds
long lastAccessTimeLong = status.getAccessTime();
Date lastAccessTimeDate = new Date(lastAccessTimeLong);
...
and sorts the result. That way you will be able to find files which have not been accessed for a long time. Note that HDFS records access times only if dfs.namenode.accesstime.precision is not set to 0 (by default, access times are tracked with a precision of one hour).
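The "order it" step is just a sort on access time. As a minimal sketch (in Python, over pairs you would collect with FileStatus.getAccessTime() as above; the paths and timestamps here are hypothetical):

```python
def least_recently_accessed(entries, n):
    """Return the n paths with the oldest access times.

    entries: iterable of (path, access_time_millis) pairs, e.g. collected
    while walking the HDFS namespace with FileStatus.getAccessTime().
    """
    ranked = sorted(entries, key=lambda e: e[1])  # oldest access first
    return [path for path, _atime in ranked[:n]]

# Hypothetical sample: three files with epoch-millis access times.
sample = [
    ("/data/a.parquet", 1_600_000_000_000),
    ("/data/b.parquet", 1_500_000_000_000),
    ("/data/c.parquet", 1_700_000_000_000),
]
```

Sorting ascending and taking the head gives the least recently accessed files; sort descending instead to find the hot ones.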

Error on Write operation (code 22) after calling Truncate - C# client

When I try to use Aerospike client Write() I obtain this error:
22 AS_PROTO_RESULT_FAIL_FORBIDDEN
The error occurs only when the Write operation is called after a Truncate() and only on specific keys.
I tried to:
change the key type (string, long, small numbers, big numbers)
change the Key type passed (Value, long, string)
change the retries number on WritePolicy
add a delay (200ms, 500ms) before every write
generate completely new keys (GUID.NewGuid().ToString())
None of these solved the problem, so I think the only remaining cause is the Truncate operation.
The error is systematic: for the same set of keys, it fails on exactly the same keys.
The error also occurs when, after calling Truncate, I wait X seconds and the Management Console shows the object count on the set as "0".
I have to wait minutes (1 to 5) before running the process to be sure the problem is gone.
The cluster has 3 nodes with a replication factor of 2, and SSD persistence.
I'm using the NuGet C# Aerospike.Client v 3.4.4
Running the process on a single local node (docker, in memory) does not give any error.
How can I know when the Truncate() process (the delete operation behind it) has completely terminated and I can safely use the set?
[Solution]
As suggested, our devops engineer checked the time synchronization. He found that NTP was not enabled on the machine images (by mistake).
He enabled it and tested again: no more errors.
Thanks,
Alex
Sounds like a potential issue with time synchronization across nodes; make sure you have NTP set up correctly. That would be my only guess at this point, especially as you mention it does work on a single node. The truncate command captures the current time (if you don't specify one) and uses it to prevent records written 'prior' to that time from being written again. Check (from the top of my head, sorry if not exactly this) /opt/aerospike/smd/truncate.smd on each node for the timestamp of the truncate command, and compare the time across the different nodes.
[Thanks #kporter for the comment. So the time would be the same in all the truncate.smd files, but a time discrepancy between machines would then still cause writes to fail against some of the nodes.]

Alternatives to slow DEL large key

There is async UNLINK in the upcoming Redis 4, but until then, what are some good alternatives to implementing DELete of large set keys with no or minimal blocking?
Is RENAME to some unique name followed by an EXPIRE of 1 second a good solution? RENAME first, so that the original key name becomes available for use. Freeing the memory right away is not an immediate concern; Redis can do the garbage collection asynchronously when it can.
EXPIRE will not eliminate the delay, only postpone it until the server actually expires the value (note that Redis uses an approximate expiration algorithm). Once the server gets around to expiring the value, it issues a DEL that blocks the server until the value is deleted.
If you are unable to use v4's UNLINK, the best way you could go about deleting a large set is by draining it incrementally. This can be easily accomplished with a server-side Lua script to reduce the bandwidth, such as this one:
local target = KEYS[1]
-- batch size, defaulting to 100 if no argument is provided
local count = tonumber(ARGV[1]) or 100
-- SPOP removes and returns up to `count` random members of the set
local reply = redis.call('SPOP', target, count)
if reply then
  return #reply
else
  return nil
end
To drain the set, call the script above repeatedly with the key-to-be-deleted's name until it signals that the set is empty: a nil Redis reply when called without a count argument, or 0 when called with one (SPOP with a count returns an empty array rather than nil once the set is gone).
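To see the shape of the calling loop, here is a minimal sketch in Python. FakeSpopScript is a stand-in for the cached Lua script (so the control flow can run without a server); it mimics the script's return values: the popped count while members remain, 0 once the set is empty.

```python
class FakeSpopScript:
    """Stand-in for the cached SPOP script: pops up to `count` members
    and returns how many were popped (0 once the set is drained)."""
    def __init__(self, members):
        self._members = set(members)

    def __call__(self, count=100):
        batch = min(count, len(self._members))
        for _ in range(batch):
            self._members.pop()
        return batch

def drain(script, count=100):
    """Repeatedly invoke the SPOP script until the set is empty."""
    total = 0
    while True:
        popped = script(count)
        if not popped:  # nil (no count) or 0 (empty set) -> done
            return total
        total += popped
```

With a real client, `script` would be the object returned by loading the Lua snippet (e.g. redis-py's `register_script`), and each iteration blocks the server only for one small batch instead of one huge DEL.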

What Redis data type fit the most for following example

I have following scenario:
Fetch array of numbers (from REDIS) conditionally
For each number do some async stuff (fetch something from DB based on number)
For each thing in result set from DB do another async stuff
Periodically repeat 1. 2. 3., because new numbers will be constantly added to the Redis structure. Those numbers represent unix timestamps in milliseconds, so out of the box they will always be sorted by time of addition.
Conditionally means: fetch those unix timestamps from Redis that are less than or equal to the current unix timestamp in milliseconds (Date.now()).
The question is which Redis data type fits this use case best, bearing in mind that this code will be scaled up to N instances, so N instances will share access to a single Redis instance. To share the load equally, each instance will read, for example, the first (oldest) 5 numbers from Redis. The numbers are unique (adding the same number should fail silently), so a Redis SET seems like a good choice, but reading the first M elements from a Redis set seems impossible.
To prevent two different instances of the code from reading the same numbers, the Redis read operation should be atomic: it should read the numbers and delete them. If any async operation fails for a specific number (steps 2 and 3), the number should be added back to Redis to be handled again. It should be re-added to the head, not the end, so it is handled again as soon as possible. As far as I know, SADD would push it to the tail.
SMEMBERS key would read everything; it looks like a hammer to me. I would need application logic to get the first five, then check what is less than or equal to Date.now(), then delete those, and somehow wrap everything in a single transaction. Besides that, the set cardinality can be huge.
SSCAN sounds interesting, but I don't have any clue how it works in a "scaled" environment like the one described above. Besides that, per the Redis docs: "The SCAN family of commands only offer limited guarantees about the returned elements since the collection that we incrementally iterate can change during the iteration process." As described above, the collection will change frequently.
A more appropriate data structure would be the Sorted Set: members have a float score, which is very suitable for storing a timestamp, and you can perform range searches (i.e. anything less than or equal to a given value).
The relevant starting points are the ZADD, ZRANGEBYSCORE and ZREMRANGEBYSCORE commands.
To ensure atomicity when reading and removing members, you can choose between the following options: Redis transactions, a Redis Lua script, and, in the next version (v4), a Redis module.
Transactions
Using transactions simply means running the following from your instances:
MULTI
ZRANGEBYSCORE <keyname> -inf <now-timestamp>
ZREMRANGEBYSCORE <keyname> -inf <now-timestamp>
EXEC
Where <keyname> is your key's name and <now-timestamp> is the current time.
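The fetch-then-remove pattern can be sketched in Python. The FakeZSet below is an in-memory stand-in for the sorted set (so the logic runs without a server), and pop_due corresponds to the MULTI/EXEC pair above; with a real client such as redis-py you would issue the same two commands inside a transactional pipeline.

```python
class FakeZSet:
    """Minimal in-memory model of a sorted set: member -> score."""
    def __init__(self):
        self._scores = {}

    def zadd(self, member, score):
        self._scores[member] = score

    def zrangebyscore(self, min_score, max_score):
        hits = [(m, s) for m, s in self._scores.items()
                if min_score <= s <= max_score]
        return [m for m, _s in sorted(hits, key=lambda e: e[1])]

    def zremrangebyscore(self, min_score, max_score):
        for m in self.zrangebyscore(min_score, max_score):
            del self._scores[m]

def pop_due(zset, now_millis):
    """Fetch and remove all members whose score (timestamp) is <= now.
    On the real server the two calls run atomically inside MULTI/EXEC."""
    due = zset.zrangebyscore(float("-inf"), now_millis)
    zset.zremrangebyscore(float("-inf"), now_millis)
    return due
```

Each worker would call pop_due with Date.now(); because the read and the delete commit together, no two instances can receive the same timestamps.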
Lua script
A Lua script can be cached and runs embedded in the server, so in some cases it is a preferable approach. It is definitely the best approach for short snippets of atomic logic when you need flow control (remember that a MULTI transaction returns its values only after EXEC). Such a script would look as follows:
local r = redis.call('ZRANGEBYSCORE', KEYS[1], '-inf', ARGV[1])
redis.call('ZREMRANGEBYSCORE', KEYS[1], '-inf', ARGV[1])
return r
To run this, first cache it using SCRIPT LOAD and then call it with EVALSHA like so:
EVALSHA <script-sha> 1 <key-name> <now-timestamp>
Where <script-sha> is the sha1 of the script returned by SCRIPT LOAD.
Redis modules
In the near future, once v4 is GA, you'll be able to write and use modules. Once this becomes a reality, you'll be able to use this module we've made, which provides the ZPOP command and could be extended to cover this use case as well.