S3FS - Recursive CHOWN/CHMOD takes a LONG time - amazon-s3

Any recursive chown or chmod command on an s3fs mount takes a long time when you have a few directories (about 70), each with quite a few files.
Either of these commands is likely to take almost 24 hours. I have to do this or the Apache process cannot access the files/directories. The same command on a normal mount takes about 20 seconds.
Mounting with:
/storage -o noatime -o allow_other -o use_cache=/s3fscache -o default_acl=public-read-write
In /etc/fuse.conf:
user_allow_other
Using latest version: 1.78
Any thoughts on how to do this faster?

After a while, I found it better to parallelize the processes in order to speed things up. Example:
find /s3fsmount/path/to/somewhere -print | xargs --max-args=1 --max-procs=100 chmod 777
It is still slow, but nowhere near as slow as it was.
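A variant of the same idea (a sketch; the path and mode are placeholders) uses null-delimited names so paths with spaces survive, and skips entries that already have the desired permissions:
# Only touch entries whose mode is not already 777; -print0/-0 handles spaces in names
find /s3fsmount/path/to/somewhere \! -perm 777 -print0 \
    | xargs -0 --max-args=1 --max-procs=100 chmod 777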

Using the AWS CLI may help.
What I do:
Use the AWS CLI to get the full file list of the target directory.
Write a script that runs chmod 777 on each file in parallel (with > /dev/null 2>&1 &).
Watching ps -ef, I found that the chmod jobs finished almost immediately.
My PHP code:
<?php
// List the keys with the AWS CLI, then fire off backgrounded chmod jobs in batches.
$s3_dir   = 'path/to/target/';
$s3fs_dir = '/mnt/s3-drive/' . $s3_dir;
echo 'Fetching file list...' . "\n\n";
usleep(1500000); // pause 1.5 seconds (sleep() only takes whole seconds)
$cmd = 'aws s3 ls --recursive s3://<bucket_name>/' . $s3_dir;
exec($cmd, $output, $return);
$num = 0;
if ( is_array($output) ) {
    foreach ($output as $file_str) {
        // After every 100 background jobs, pause so the box is not flooded with processes.
        if ( $num > 100 ) {
            sleep(4);
            $num = 0;
        }
        // aws s3 ls prints date, time, size, key; keys containing spaces will not parse here.
        $n = sscanf($file_str, "%s\t%s\t%s\t" . $s3_dir . "%s", $none1, $none2, $none3, $file);
        $cmd = 'chmod 777 ' . escapeshellarg($s3fs_dir . $file) . ' > /dev/null 2>&1 &';
        echo $cmd . "\n";
        exec($cmd);
        $num += 1;
    }
}
?>
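The same approach can be sketched directly in the shell without PHP (assumptions: <bucket_name> and the paths are placeholders, and the object keys contain no spaces):
# List keys, strip the date/time/size columns, then chmod each path in parallel
aws s3 ls --recursive s3://<bucket_name>/path/to/target/ \
    | awk '{print $4}' \
    | xargs -I{} --max-procs=100 chmod 777 /mnt/s3-drive/{}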

For changing the owner:
find /s3fsmount/path/to/somewhere -print | xargs --max-args=1 --max-procs=100 sudo chown -R user:user
It works for me.
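A hedged variant of the same command (the path and user:user are placeholders): since find already visits every entry, the -R flag is not needed, and -print0/-0 protects names containing spaces:
find /s3fsmount/path/to/somewhere -print0 \
    | xargs -0 --max-args=1 --max-procs=100 sudo chown user:user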

Related

Running a script when connecting to server with ssh

I use the kitty terminal emulator, so when I connect to a new server I (usually) need to add its terminfo (at least, that way it seems to work). To do this I wrote a script. While I was at it, I added a bit of code to add a public key if the user wants to.
Not really relevant for the question, but here is the code:
#!/bin/bash
host=$1
ip=$(echo $host | cut -d# -f2 | cut -d: -f1)
# Check if it is an unknown host
if [[ -z $(ssh-keygen -F $ip) ]]; then
    # Check if there are any ssh keys (2> /dev/null so a missing glob does not break the count)
    if [ $(ls $HOME/.ssh/*.pub 2> /dev/null | wc -l) -ne 0 ]; then
        keys=$(echo $( (cd $HOME/.ssh/ && ls *.pub) | sed "s/.pub//g" ))
        ssh -q -o PubkeyAuthentication=yes -o PasswordAuthentication=no $host "ls > /dev/null 2>&1"
        # Check if the server already accepts one of the public keys
        if [ $? -ne 0 ]; then
            echo "Do you want to add an SSH key to the server?"
            while true; do
                read -p " Choose [$keys] or leave empty to skip: " key
                if [[ -z $key ]]; then
                    break
                elif [[ -e $HOME/.ssh/$key ]]; then
                    # Give the server a public key
                    ssh $host "mkdir -p ~/.ssh && chmod 700 ~/.ssh && echo \"$(cat $HOME/.ssh/$key.pub)\" >> ~/.ssh/authorized_keys"
                    break
                else
                    echo "No key with the name \"$key\" found."
                fi
            done
        fi
    fi
    # Copy terminfo to the server
    ssh -t $host "echo \"$(infocmp -x)\" > \"\$TERM.info\" && tic -x \"\$TERM.info\" && rm \$TERM.info"
fi
It is not the best code, but it seems to work. Tips are of course welcome.
The problem is that I need to run this script every time I connect to a new remote server (or I need to keep track of which servers are new, which is even worse). Is there a way to run this script every time I connect to a server? (The script itself checks whether the IP is already a known host.)
Or is there another way to do this? Adding the public keys is nice to have, but not very important.
I hope someone can help,
Thanks!
There is a trick to identify that you are logged in to the target machine via ssh:
pgrep -af "sshd.*"$USER | wc -l
The above command counts the user's sshd processes.
You can add it on the target machine to test whether you are connected via ssh: put it in your .profile or .bash_profile there, so that your initialization script runs only when you log in via ssh.
Sample .bash_profile on target machine
#!/bin/bash
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi

# User specific environment and startup programs
if [[ $(pgrep -af "sshd.*"$USER | wc -l) -gt 0 ]]; then
    your_init_script
fi
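An alternative check, not from the answer above but commonly used, is to test one of the environment variables that sshd sets for its sessions (a sketch; your_init_script is the same placeholder as above):
# SSH_CONNECTION and SSH_TTY are set by sshd for remote sessions
if [ -n "$SSH_CONNECTION" ] || [ -n "$SSH_TTY" ]; then
    your_init_script
fi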

Is there a simple test in Bash to test if a logical path is a directory

The question is best explained through a small example:
Create an infrastructure of directories as follows:
rm -rf /tmp/work
mkdir -p /tmp/work/a/b/c
cd /tmp/work
ln -s a/b/c s
mkdir t
tree
This results in the following infrastructure:
.
├── a
│   └── b
│       └── c
├── s -> a/b/c
└── t
There is a very simple way to check for the existence of a directory:
cd /tmp/work
if test -d s/../t; then echo EXISTS; else echo DOES NOT EXIST; fi
However, it seems that the '-d' test only checks the physical path (a/b/t, which does not exist). The following clearly shows that the logical path/directory does exist:
cd s/../t
pwd
I am hoping for a simple test that handles this situation...
Checking similar questions on Stack Overflow, I haven't found an answer that differentiates logical paths from physical paths...
I made a brute-force (silly) attempt to get some more insight:
for o in -b -c -d -e -f -g -G -k -h -L -O -p -r -S -s -t -u -w -x; do
    if test $o s/../t; then echo EXISTS; else echo DOES NOT EXIST; fi
done
As expected, all of the tests indicate failure...
The closest I have come to a simple solution is the following test:
if (unset CDPATH; cd -L s/../t &>/dev/null); then echo EXISTS; else echo DOES NOT EXIST; fi
If using '-P' rather than '-L', then this syntax tests for physical paths (the equivalent of 'test -d').
Note that if we're sure that we're using builtins, then the syntax is just a bit simpler:
if (unset CDPATH; cd s/../t &>/dev/null); then echo EXISTS; else echo DOES NOT EXIST; fi
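For reuse, that check can be wrapped in a small helper (a sketch; the function name is illustrative, and the subshell keeps the cd from affecting the caller):
# Succeeds if the *logical* path resolves to an existing directory
is_logical_dir() {
    (unset CDPATH; cd -L -- "$1" &>/dev/null)
}
# Usage:
if is_logical_dir s/../t; then echo EXISTS; else echo DOES NOT EXIST; fi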
But there must be a better way...

CasperJS and Parallel Crawling

I'm trying to crawl a website. However, my crawling process takes so long that I need to use multiple instances to shorten it. I've looked for other ways and aborted all unnecessary resource requests, but it's still way too slow for me (around 8-9 seconds).
What is the easiest way to parallelize CasperJS instances, or even just run two CasperJS processes at the same time to crawl in parallel?
I have used GNU Parallel from a blog post I found, but it seems that although the processes are alive, they are not crawling in parallel, because the total execution time is the same as with one instance.
Should I use a Node.js server to create the instances?
What is the easiest and most practical way?
Can you adapt this:
https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Breadth-first-parallel-web-crawler-mirrorer
#!/bin/bash
# E.g. http://gatt.org.yeslab.org/
URL=$1
# Stay inside the start dir
BASEURL=$(echo $URL | perl -pe 's:#.*::; s:(//.*/)[^/]*:$1:')
URLLIST=$(mktemp urllist.XXXX)
URLLIST2=$(mktemp urllist.XXXX)
SEEN=$(mktemp seen.XXXX)
# Spider to get the URLs
echo $URL >$URLLIST
cp $URLLIST $SEEN
while [ -s $URLLIST ] ; do
    cat $URLLIST |
        parallel lynx -listonly -image_links -dump {} \; \
            wget -qm -l1 -Q1 {} \; echo Spidered: {} \>\&2 |
        perl -ne 's/#.*//; s/\s+\d+.\s(\S+)$/$1/ and do { $seen{$1}++ or print }' |
        grep -F $BASEURL |
        grep -v -x -F -f $SEEN | tee -a $SEEN > $URLLIST2
    mv $URLLIST2 $URLLIST
done
rm -f $URLLIST $URLLIST2 $SEEN
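If the goal is simply to run several CasperJS instances at once, a more direct sketch (crawl.js and urls.txt are hypothetical placeholders) is to let GNU Parallel launch one instance per URL, a few at a time:
# Run at most 4 CasperJS processes concurrently, one URL per process
parallel --jobs 4 casperjs crawl.js {} :::: urls.txt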

sudo useradd wont make home directory

I have an automated script which works, except that it never makes a home directory. The data is extracted from a database.
Here's the script:
$SQL -s -e "SELECT uid, password FROM registrations WHERE processed = 0" \
| while read A B; do
sudo useradd $A -p $B -m /home/
As you can see, the -m is there, but it seems to be ignored and a home directory is never made, and I have no idea why. I must be missing something, but I've no idea what.
If you run man useradd you'll see that the -m does not expect a parameter.
Running it this way should do the trick (or at least it just did on my Debian Squeeze):
useradd $A -p $B -m
In the man page you'll also find other useful options, such as -d or -b.
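For example, if the intent of /home/ in the original command was to control where the home directory goes, -d sets it explicitly (a sketch reusing the $A and $B variables from the loop above):
# -m creates the home directory, -d sets its path explicitly
sudo useradd "$A" -p "$B" -m -d "/home/$A"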

background xargs/wget not adhering to -P and -n limits

I'm having a problem with xargs and wget when run as shell scripts in an AppleScript app. I want wget to run 4 parallel processes in the background. The problem: basically, when I try to run the process in the background with
cat urls.txt | xargs -P 4 -n 1 /usr/local/bin/wget -q -E -b 1> NUL 2> NUL
a Wget process is apparently started for each URL passed in from the .txt file. This is too burdensome on the user's memory. When I run it in the foreground, however, with something like:
cat urls.txt | xargs -P 4 -n 1 /usr/local/bin/wget -q -E
I seem to get the four parallel Wget processes I need. Does anybody know how to get this script to run in the background with only 4 processes? I'm a bit of a novice, and I'm afraid I can't figure out why backgrounding the process causes this change.
With wget's -b flag, each wget detaches and exits immediately, so xargs sees its slot as free and starts the next URL right away. Instead, drop -b and run xargs itself in the background so the -P 4 limit stays effective:
cat urls.txt | xargs -P4 -n1 wget -q &
Or if you want to return control to the AppleScript, disown the xargs process:
do shell script "cat urls.txt | xargs -P4 -n1 /usr/local/bin/wget -q & disown $!"
As far as I can tell, I have solved the problem with
cat urls.txt| (xargs -P4 -n1 wget -q -E >/dev/null 2>&1) &
There may well be a better solution, though...