Handle gsutil ls and rm command errors if no files present - gsutil

I am running the following command to remove files from a gcs bucket prior to loading new files there.
gsutil -m rm gs://mybucket/subbucket/*
If there are no files in the bucket, it fails with "CommandException: One or more URLs matched no objects".
I would like it to delete the files if they exist, without throwing the error.
The same error occurs with gsutil ls gs://mybucket/subbucket/*
How can I rewrite this without having to handle the exception explicitly? Or, how can I best handle these exceptions in a batch script?

Try this:
gsutil -m rm gs://mybucket/foo/* 2> /dev/null || true
Or:
gsutil -m ls gs://mybucket/foo/* 2> /dev/null || true
This suppresses stderr (it is redirected to /dev/null) and returns a successful exit code even when the command fails.

You might not want to ignore all errors, as an error might indicate something other than a missing file. The following script ignores only the 'One or more URLs matched no objects' error and still shows you any other error. If there is no error, it just deletes the files:
gsutil -m rm gs://mybucket/subbucket/* 2> temp
if [ $? -ne 0 ]; then
    # gsutil failed: check whether it was only the "nothing matched" case
    if grep -q 'One or more URLs matched no objects' temp; then
        echo "no such file"
    else
        # any other error is printed in full
        cat temp
    fi
fi
rm temp
This redirects stderr to a temp file and checks the message to decide whether to ignore it or show it.
And it also works for single file deletions. I hope it helps.
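The same check can be wrapped in a function so that it works for any command line. This is only a sketch: rm_if_present is a hypothetical helper name, the temp file comes from mktemp rather than a fixed file called temp, and the error text is assumed to match the message quoted in the question.

```shell
# Hypothetical helper: run the given command and tolerate only the
# "matched no objects" error; every other failure is passed through.
rm_if_present() {
    local errfile status
    errfile="$(mktemp)" || return 1
    "$@" 2> "$errfile"
    status=$?
    if [ "$status" -ne 0 ]; then
        if grep -q 'One or more URLs matched no objects' "$errfile"; then
            status=0              # nothing to delete: treat as success
        else
            cat "$errfile" >&2    # report any other error unchanged
        fi
    fi
    rm -f "$errfile"
    return "$status"
}

# Usage (bucket path taken from the question):
# rm_if_present gsutil -m rm 'gs://mybucket/subbucket/*'
```

Note that, as in the script above, anything the command writes to stderr on success is discarded.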
Refs:
How to grep standard error stream
Bash Reference Manual - Redirections

You may like rsync to sync files and folders to a bucket. I used this for clearing a folder in a bucket and replacing it with new files from my build script.
gsutil rsync -d newdata gs://mybucket/data  # replaces the data folder with newdata

Related

GitLab pipeline - Copy file if exists

I have a pipeline that needs to copy some files from a folder to a new one only if the files exists in the source folder.
This is my script line:
script:
- cp source_folder/file.txt dest_folder/ 2>/dev/null
I have also tried this:
script:
- test -f source_folder/file.txt && cp source_folder/file.txt dest_folder/ 2>/dev/null
but it still fails if the file does not exist.
Cleaning up project directory and file based variables.
ERROR: Job failed: exit code 1
How can I check for the file and copy it only if it exists?
EDIT:
This command is executed on a server that the pipeline logs into over ssh.
Check for the existence of the file (-f) and, if it exists, copy it.
script:
  - |
    files=(conf.yaml log.txt)
    # iterate over every array element, not just the first one
    for file in "${files[@]}"; do
      if [[ -f "source_folder/$file" ]]; then
        cp "source_folder/$file" dest_folder
      fi
    done
Take a look at other answers for one-shot less-flexible statements.
Note: I haven't tested the script above, but I'm quite accustomed to GitLab pipelines and bash.
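The test -f ... && cp ... line from the question still fails because, when the file is missing, test -f returns 1, so the whole && list returns 1 and the job stops on the non-zero exit code. An if statement whose condition is false and that has no else branch returns 0 instead. A minimal sketch of that difference, with copy_if_exists as a hypothetical helper name:

```shell
# Hypothetical helper: copy_if_exists SOURCE_FILE DEST_DIR
copy_if_exists() {
    if [ -f "$1" ]; then
        cp -- "$1" "$2"
    fi
    # When the file is missing, no branch runs and the `if` returns 0,
    # so the function (and the job step) does not fail.
}

# Usage with the paths from the question:
# copy_if_exists source_folder/file.txt dest_folder/
```

A real cp failure (for example a missing destination folder) still returns a non-zero status, so genuine errors are not hidden.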

CommandException: Caught non-retryable exception - aborting rsync

After using gsutil for more than 1 year I suddenly have this error:
.....
At destination listing 8350000...
At destination listing 8360000...
CommandException: Caught non-retryable exception - aborting rsync
.....
I tried to locate the files that cause this sync problem, but I was not able to. Is there a "skip error" option, or is there a way to make gsutil more verbose?
My command line is like this:
gsutil -V -m rsync -d -r -U -P -C -e -x 'Download/*' /opt/ gs://mybucket1/kraanloos/
I have created a script to split the problem per directory. This gives me more information to work towards a solution.
#!/bin/bash
array=(
3ware
AirTime
Amsterdam
BigBag
Download
guide
home
Install
Holding
Multimedia
newsite
Overig
Trak-r
)
for i in "${array[@]}"
do
echo Processing : $i
# Build the command as an array so the quoted -x pattern survives expansion
PROCESS=(/usr/bin/gsutil -m rsync -d -r -U -P -C -e -x 'Backup/*' "/opt/$i/" "gs://mybucket1/kraanloos/$i/")
echo "${PROCESS[@]}"
"${PROCESS[@]}"
echo ""
echo ""
done
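The per-directory split can also report which directories fail instead of relying on reading the output by eye. The following is a sketch under assumptions: find_failing_dirs and sync_one are hypothetical names, and sync_one uses only a subset of the flags from the command above.

```shell
# Hypothetical helper: find_failing_dirs SYNC_COMMAND DIR...
# Runs SYNC_COMMAND once per directory and prints the names that failed.
find_failing_dirs() {
    local sync_cmd="$1" dir
    shift
    for dir in "$@"; do
        # send the command's own output to stderr so that stdout
        # carries only the list of failing directories
        "$sync_cmd" "$dir" >&2 || printf '%s\n' "$dir"
    done
}

# Hypothetical wrapper around the rsync call for one directory
sync_one() {
    /usr/bin/gsutil -m rsync -d -r "/opt/$1/" "gs://mybucket1/kraanloos/$1/"
}

# Usage:
# find_failing_dirs sync_one 3ware AirTime Amsterdam > failed_dirs.txt
```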
I've been struggling with the same problem the last few days. One way to make it super verbose is to put the -D flag before the rsync argument, as in:
gsutil -D rsync ...
By doing that, I found that my problem is due to having # characters in filenames, as in this question.
In my case, it was because of a broken link to a directory.
As blambert said, use the -D option to see exactly what file causes the problem.
I struggled with this problem as well, and I have figured it out now.
You need to re-authenticate your Google Cloud SDK Shell and set the target project again.
It seems that rsync does not show the correct error message.
Try cp instead; it will guide you to authenticate and set the correct primary project:
gsutil cp OBJECT_LOCATION gs://DESTINATION_BUCKET_NAME/
After that, your gsutil rsync should run fine.

Apache pig script delete a folder if exists

I want to delete the output folder of the previous execution through the apache pig script. This command works fine.
sh [ -e /home/LocalPig/test ] && rm -rf /home/LocalPig/test
but if I write
sh OutputFile=/home/LocalPig/test
sh [ -e $OutputFile ] && rm -rf $OutputFile
I get an error about OutputFile:
ERROR 2997: Encountered IOException. org.apache.pig.tools.parameters.ParameterSubstitutionException: Undefined parameter : OutputFile
Does anybody have any idea?
Thanks
Hope this solves the problem. It's simply the command below, placed in the .pig script file. You don't have to write any shell command; it can be done from within the Pig environment using the built-in fs command.
For example, put a statement like the one below in your Pig script. It will not error out if the folder does not exist: it deletes the folder if it exists and otherwise exits gracefully.
fs -rm -f -r -R /user/horton/denver_total;
Of course you can also do a lot of this work outside Pig, but it is very useful to perform any delete within the script that controls creation of the data. It makes it simpler to trace the lineage of creating and destroying those files.
Reference: Parameter Substitution
%declare OutputFile '/home/LocalPig/test'
sh [ -e '$OutputFile' ] && rm -rf '$OutputFile'

How to ignore certain files when branching / checking out?

I'd like to compare a few files from the bazaar branch lp:ubuntu/nvidia-graphics-drivers. I'm mainly interested in the debian subdirectory inside that branch, but due to the binary blob in http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/oneiric/nvidia-graphics-drivers/oneiric/files, it takes ages to get just the text files. I've already downloaded 555MB and it's still counting.
Is it possible to retrieve a bazaar branch, including or excluding certain files by one of the following properties:
file size
file extension
file name (include only debian/ for example)
I do not need to push back any changes, nor do I need to view the history of a file. I just want to compare two files in the debian/ directory, files with the .in extension and files without.
As far as I'm aware, no. You're downloading the branch history, not just the individual files. And each file is an integral part of the branch's history.
On the bright side, you only have to check it out once. Unless those binary files change, they'll be skipped the next time you pull from Launchpad.
Depending on the branch's history, you may be able to cut down on the download size if you use a lightweight checkout (bzr checkout --lightweight). But of course, that may come back and bite you later, as it means you won't get a local copy of the branch, only the checked-out files. So it'll work much like SVN, where every operation has to go through the server. And as long as you don't need to look at the branch history, or commit your changes, that should serve you just fine, I believe.
I ended up doing some dirty grepping on the HTTP response, since bzr info "$branch" and bzr ls -d "$branch" "$directory" did not give me enough information.
The Bash script below relies on the behavior of Launchpad's front end, Loggerhead. It recursively downloads from a given URL. Currently, it ignores *.run files. Save it as bzrdl in a directory available from $PATH and run it with bzrdl http://launchpad.net/~ubuntu-branches/ubuntu/oneiric/nvidia-graphics-drivers/oneiric/files/head:/debian/. All files are saved in the current directory, so make sure that it is empty to avoid conflicts.
#!/bin/bash
max_retries=5
rooturl="$1"
if ! [[ $rooturl =~ /$ ]]; then
echo "Usage: ${0##*/} URL"
echo "URL must end with a slash. Example URL:"
echo "http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/oneiric/nvidia-graphics-drivers/oneiric/files/head:/"
exit 1
fi
tmpdir="$(mktemp -d)"
target="$(pwd)"
# used for holding HTTP response before extracting data
tmp="$(mktemp)"
# url_filter reads download URLs from stdin (piped)
url_filter() {
grep -v '\.run$'
}
get_files_from_dir() {
local slash=/
local dir="$1"
# to avoid name collision: a/b/c/ -> a.d/b.d/c.d/
local storedir="${dir//$slash/.d${slash}}"
mkdir -p "$tmpdir/$storedir" "$target/$dir"
local i subdir
for ((i=0; i<$max_retries; i++ )); do
if wget -O "$tmp" "$rooturl$dir"; then
# store file list
grep -F -B 1 '<img src="/static/images/ico_file_download.gif" alt="Download File" />' "$tmp" |\
grep '^<a' | cut -d '"' -f 2 | url_filter \
> "$tmpdir/$storedir/files"
IFS=$'\n'
for subdir in $(grep -F -B 1 '<img src="/static/images/ico_folder.gif" ' "$tmp" | \
grep -F '<a ' | rev | cut -d / -f 2 | rev); do
IFS=$' \t\n'
get_files_from_dir "$dir$subdir/"
done
return
fi
done
echo "Failed to download directory listing of: $dir" >> "$tmpdir/errors"
}
download_files() {
local slash=/
local dir="$1"
# to avoid name collision: a/b/c/ -> a.d/b.d/c.d/
local storedir="${dir//$slash/.d${slash}}"
local done=false
local subdir
cd "$tmpdir/$storedir"
for ((i=0; i<$max_retries; i++)); do
if wget -B "$rooturl$dir" -nc -i files -P "$target/$dir"; then
done=true
break
fi
done
$done || echo "Failed to download all files from $dir" >> "$tmpdir/errors"
for subdir in *.d; do
download_files "$dir${subdir%%.d}/"
done
}
get_files_from_dir ''
# make *.d expand to nothing if no directories are found
shopt -s nullglob
download_files ''
echo "TMP dir: $tmpdir"
echo "Errors : $(cat "$tmpdir/errors" 2>/dev/null | wc -l)"
The temporary directory and file are not removed afterwards; that must be done manually. Any errors (failures to download) are written to $tmpdir/errors.
It's confirmed to work with:
bzrdl http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/oneiric/nvidia-settings/oneiric/files/head:/debian/
Feel free to correct any mistakes or add improvements.
There is no way to selectively check out a specific directory from a Bazaar branch at the moment, although we do have plans to add such support in the future.
There is definitely too much traffic for the clone you are doing, considering the size of the branch. It's probably a bug in the client implementation.
Here on bzr 2.4 it is still quite slow but not too bad (60s):
localhost:/tmp% bzr branch http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/oneiric/nvidia-settings/oneiric
Most recent Ubuntu Oneiric version: 275.09.07-0ubuntu1
Packaging branch status: CURRENT
Branched 37 revision(s).
From the log:
[11866] 2011-07-31 00:56:57.007 INFO: Branched 37 revision(s).
56.786 Transferred: 5335kB (95.8kB/s r:5314kB w:21kB)

Is there a curl/wget option that prevents saving files in case of http errors?

I want to download a lot of urls in a script but I do not want to save the ones that lead to HTTP errors.
As far as I can tell from the man pages, neither curl nor wget provides such functionality.
Does anyone know of another downloader that does?
I think the -f option to curl does what you want:
-f, --fail
(HTTP) Fail silently (no output at all) on server errors. This is mostly done to better
enable scripts etc to better deal with failed attempts. In normal cases when an HTTP
server fails to deliver a document, it returns an HTML document stating so (which often
also describes why and more). This flag will prevent curl from outputting that and
return error 22. [...]
However, if the response was actually a 301 or 302 redirect, that still gets saved, even if its destination would result in an error:
$ curl -fO http://google.com/aoeu
$ cat aoeu
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
here.
</BODY></HTML>
To follow the redirect to its dead end, also give the -L option:
-L, --location
(HTTP/HTTPS) If the server reports that the requested page has moved to a different
location (indicated with a Location: header and a 3XX response code), this option will
make curl redo the request on the new place. [...]
One liner I just setup for this very purpose:
(works only with a single file, might be useful for others)
A=$$; ( wget -q "http://foo.com/pipo.txt" -O $A.d && mv $A.d pipo.txt ) || (rm $A.d; echo "Removing temp file")
This attempts to download the file from the remote host. If there is an error, the file is not kept. In all other cases, it is kept and renamed.
Ancient thread.. landed here looking for a solution... ended up writing some shell code to do it.
if [ "$(curl -s -w "%{http_code}" --compressed -o /tmp/something \
http://example.com/my/url/)" = "200" ]; then
echo "yay"; cp /tmp/something /path/to/destination/filename
fi
This downloads the output to a temp file, and creates or overwrites the output file only if the status was 200. My use case is slightly different: the output takes more than 10 seconds to generate, and I did not want the destination file to remain blank for that duration.
NOTE: I am aware that this is an older question, but I believe I have found a better solution for those using wget than any of the above answers provide.
wget -q $URL 2>/dev/null
This saves the target file to the local directory if and only if the HTTP status code is within the 200 range (OK).
Additionally, if you want to print a message whenever the request fails, you can check the wget exit code for non-zero values like so:
wget -q $URL 2>/dev/null
if [ $? != 0 ]; then
echo "There was an error!"
fi
I hope this is helpful to someone out there facing the same issues I was.
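wget's exit status can also tell you what kind of failure occurred. The GNU Wget manual documents, among others, 4 for a network failure and 8 for an error response from the server (such as a 404). A small sketch that turns those codes into messages; wget_status_message is a hypothetical helper name:

```shell
# Hypothetical helper: translate a wget exit status into a short message.
# Only a few of the documented codes are handled explicitly.
wget_status_message() {
    case "$1" in
        0) echo "ok" ;;
        4) echo "network failure" ;;
        8) echo "server issued an error response" ;;
        *) echo "other wget error (exit code $1)" ;;
    esac
}

# Usage:
# wget -q "$URL" 2>/dev/null
# wget_status_message "$?"
```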
Update:
I just put this into a more script-able form for my own project, and thought I'd share:
function dl {
    pushd . > /dev/null
    # quote the arguments so paths with spaces survive
    cd "$(dirname "$1")"
    wget -q "$BASE_URL/$1" 2> /dev/null
    if [ $? != 0 ]; then
        echo ">> ERROR could not download file \"$1\"" 1>&2
        exit 1
    fi
    popd > /dev/null
}
I have a workaround to propose. It does download the file, but it then removes the file if its size is 0 (which happens if a 404 occurs).
wget -O <filename> <url/to/file>
if [[ $(du <filename> | cut -f 1) == 0 ]]; then
rm <filename>;
fi;
It works in zsh, but you can adapt it for other shells.
Note that wget only creates the empty file in the first place if you provide the -O option.
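A variant of the size check that does not depend on du output, assuming a POSIX shell where [ -s FILE ] is true only for an existing, non-empty file. remove_if_empty is a hypothetical helper name:

```shell
# Hypothetical helper: delete FILE only if it exists and is zero bytes long.
remove_if_empty() {
    if [ -e "$1" ] && [ ! -s "$1" ]; then
        rm -f -- "$1"
    fi
}

# Usage (file name and URL are placeholders, as in the answer above):
# wget -O myfile "$URL"
# remove_if_empty myfile
```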
As an alternative, you can download to a temporary file and rotate it into place:
wget http://example.net/myfile.json -O myfile.json.tmp -t 3 -q && mv myfile.json.tmp myfile.json
The command above always downloads to "myfile.json.tmp", but the file is only rotated to "myfile.json" when the wget exit status is 0.
This prevents the final file from being overwritten when a network failure occurs.
The advantage of this method is that, if something goes wrong, you can inspect the temporary file and see what error message was returned.
The "-t" parameter sets how many times wget attempts the download in case of error.
The "-q" parameter enables quiet mode, which is important under cron because cron will report any output from wget.
The "-O" parameter sets the output file path and name.
Remember that for cron schedules it is very important to always provide the full path for all files, and in this case for the "wget" program itself as well.
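The download-to-temp-then-rename pattern can be wrapped in a small helper. This is a sketch rather than a feature of wget or curl: save_on_success is a hypothetical name, and the download command is assumed to write the body to stdout and to return a non-zero status on failure (for example wget -q -O - or curl -fsSL).

```shell
# Hypothetical helper: save_on_success DEST COMMAND [ARGS...]
# Runs COMMAND with stdout going to a temp file next to DEST and
# renames it to DEST only if COMMAND succeeds.
save_on_success() {
    local dest="$1" tmp status
    shift
    tmp="$(mktemp "${dest}.XXXXXX")" || return 1
    if "$@" > "$tmp"; then
        mv -- "$tmp" "$dest"
    else
        status=$?
        rm -f -- "$tmp"      # drop the partial download
        return "$status"
    fi
}

# Usage:
# save_on_success myfile.json wget -q -t 3 -O - http://example.net/myfile.json
```

Creating the temp file in the same directory as the destination keeps the final mv a rename within one filesystem, so an existing file is never left half-written. Unlike the command above, this sketch deletes the temp file on failure instead of keeping it for inspection.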
You can download the file without saving it by using the "-O -" option, as in:
wget -O - http://jagor.srce.hr/
You can get more information at http://www.gnu.org/software/wget/manual/wget.html#Advanced-Usage