Perl: Scrape a website and download PDF files from it using Perl Selenium::Chrome

So I'm studying scraping websites using Selenium::Chrome in Perl, and I'm just wondering how I can download all the PDF files from 2017 to 2021 and store them in a folder, from this website https://www.fda.gov/drugs/warning-letters-and-notice-violation-letters-pharmaceutical-companies/untitled-letters-2021 . So far this is what I've done:
use strict;
use warnings;
use Time::Piece;
use POSIX qw(strftime);
use Selenium::Chrome;
use File::Slurp;
use File::Copy qw(copy);
use File::Path;
use File::Path qw(make_path remove_tree);
use LWP::Simple;
my $collection_name = "mre_zen_test3";
make_path("$collection_name");
#DECLARE SELENIUM DRIVER
my $driver = Selenium::Chrome->new;
#NAVIGATE TO SITE
print "trying to get toc_url\n";
$driver->navigate('https://www.fda.gov/drugs/warning-letters-and-notice-violation-letters-pharmaceutical-companies/untitled-letters-2021');
sleep(8);
#GET PAGE SOURCE
my $toc_content = $driver->get_page_source();
$toc_content =~ s/[^\x00-\x7f]//g;
write_file("toc.html", $toc_content);
print "writing toc.html\n";
sleep(5);
$toc_content = read_file("toc.html");
This script only downloads the entire content of the page. I hope someone here can help me and teach me. Thank you very much.

Here is some working code, to help you get going hopefully
use warnings;
use strict;
use feature 'say';
use Path::Tiny; # only convenience
use Selenium::Chrome;
my $base_url = q(https://www.fda.gov/drugs/)
. q(warning-letters-and-notice-violation-letters-pharmaceutical-companies/);
my $show = 1; # to see navigation. set to false for headless operation
# A little demo of how to set some browser options
my %chrome_capab = do {
my @cfg = ($show)
? ('window-position=960,10', 'window-size=950,1180')
: 'headless';
'extra_capabilities' => { 'goog:chromeOptions' => { args => [ @cfg ] } }
};
my $drv = Selenium::Chrome->new( %chrome_capab );
my @years = 2017..2021;
foreach my $year (@years) {
my $url = $base_url . "untitled-letters-$year";
$drv->get($url);
say "\nPage title: ", $drv->get_title;
sleep 1 if $show;
my $elem = $drv->find_element(
q{//li[contains(text(), 'PDF')]/a[contains(text(), 'Untitled Letter')]}
);
sleep 1 if $show;
# Downloading the file is surprisingly not simple with Selenium (see text)
# But as we found the link we can get its url and then use Selenium-provided
# user-agent (it's LWP::UserAgent)
my $href = $elem->get_attribute('href');
say "pdf's url: $href";
my $response = $drv->ua->get($href);
die $response->status_line if not $response->is_success;
say "Downloading 'Content-Type': ", $response->header('Content-Type');
my $filename = "download_$year.pdf";
say "Save as $filename";
path($filename)->spew( $response->decoded_content );
}
This takes shortcuts, switches approaches, and sidesteps some issues (which one needs to resolve for fuller use of this useful tool). It downloads one PDF from each page; to download all of them we need to change the XPath expression used to locate them
my @hrefs =
map { $_->get_attribute('href') }
$drv->find_elements(
# There's no ends-with(...) in XPath 1.0 (nor matches() with regex)
q{//li[contains(text(), '(PDF)')]}
. q{/a[starts-with(@href, '/media/') and contains(@href, '/download')]}
);
Now loop over the links, forming filenames more carefully, and download each like in the program above. I can fill the gaps further if there's need for that.
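For example, a minimal sketch of that loop; the absolute-URL prefix and the filename scheme are my assumptions, not something the page dictates:
# Sketch only: download every PDF link collected in @hrefs above.
# Assumes $drv, $year and Path::Tiny's path() from the surrounding program.
foreach my $href (@hrefs) {
    # Links on the page are site-relative (/media/.../download), so prefix the host
    my $url = $href =~ m{^https?://} ? $href : 'https://www.fda.gov' . $href;
    # Derive a filename from the /media/<id>/ part of the URL (assumed scheme)
    my ($id) = $url =~ m{/media/(\d+)};
    my $filename = sprintf "untitled-letter-%s-%s.pdf", $year, $id // 'unknown';
    my $response = $drv->ua->get($url);
    if (not $response->is_success) {
        warn "Skipping $url: " . $response->status_line . "\n";
        next;
    }
    path($filename)->spew( $response->decoded_content );
    say "Saved $filename";
}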
The code puts the PDF files on disk, in its working directory. Please review that before running this, to make sure that nothing gets overwritten!
See Selenium::Remote::Driver for starters.
Note: there is no need for Selenium for this particular task; it's all straight-up HTTP requests, no JavaScript. So LWP::UserAgent or Mojo would do it just fine. But I take it that you want to learn how to use Selenium, since it often is needed and is useful.
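For completeness, a minimal sketch of that non-Selenium route with Mojo::UserAgent; the CSS selector mirrors the XPath above, and the URL and filename details are assumptions:
# Sketch only: fetch each year's page and download the linked PDFs with Mojolicious.
use Mojo::UserAgent;
use Mojo::URL;
my $ua   = Mojo::UserAgent->new(max_redirects => 5);
my $base = 'https://www.fda.gov/drugs/'
    . 'warning-letters-and-notice-violation-letters-pharmaceutical-companies/';
for my $year (2017 .. 2021) {
    my $page = Mojo::URL->new($base . "untitled-letters-$year");
    my $dom  = $ua->get($page)->result->dom;
    # Same idea as the XPath above, expressed as a CSS selector
    my @hrefs = $dom->find('li a[href*="/media/"][href*="/download"]')
        ->map(attr => 'href')->each;
    for my $href (@hrefs) {
        my $url  = Mojo::URL->new($href)->to_abs($page);
        my ($id) = $url->path =~ m{/media/(\d+)};    # assumed URL scheme
        $ua->get($url)->result->save_to("untitled-letter-$year-" . ($id // 'x') . ".pdf");
    }
}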

Related

Perl web server: How to route

As seen in my code below, I am using Apache to serve my Perl web server. I need Perl to have multiple routes for my client, as seen in my %dispatch. If I figure one out, I'm sure the rest will be very similar. If we look at my subroutine resp_index, how can I modify it to link to my index.html file located in my root directory /var/www/perl?
/var/www/perl/perlServer.pl:
#!/usr/bin/perl
{
package MyWebServer;
use HTTP::Server::Simple::CGI;
use base qw(HTTP::Server::Simple::CGI);
my %dispatch = (
'/index.html' => \&resp_index,
# ...
);
sub handle_request {
my $self = shift;
my $cgi = shift;
my $path = $cgi->path_info();
my $handler = $dispatch{$path};
if (ref($handler) eq "CODE") {
print "HTTP/1.0 200 OK\r\n";
$handler->($cgi);
} else {
print "HTTP/1.0 404 Not found\r\n";
print $cgi->header,
$cgi->start_html('Not found'),
$cgi->h1('Not found'),
$cgi->end_html;
}
}
sub resp_index {
my $cgi = shift; # CGI.pm object
return if !ref $cgi;
my $who = $cgi->param('name');
print $cgi->header,
$cgi->start_html("index"),
$cgi-h1("THIS IS INDEX"),
$cgi->end_html;
}
}
my $pid = MyWebServer->new()->background();
print "Use 'kill $pid' to stop server.\n";
I think what you're asking is how do you serve a file from your web server? Open it and print it, like any other file.
use autodie;
sub resp_index {
my $cgi = shift;
return if !ref $cgi;
print $cgi->header;
open my $fh, "<", "/var/www/perl/index.html";
print <$fh>;
}
Unless this is an exercise, really, really, REALLY don't write your own web framework. It's going to be slow, buggy, and insecure. Consider a small routing framework like Dancer.
For example, mixing documents like index.html and executable code like perlServer.pl in the same directory invites a security hole. Executable code should be isolated in its own directory so it can be given wholly different permissions and stronger protection.
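For illustration, here is a minimal Dancer2 sketch of the same idea (a hypothetical layout; Dancer2 serves static files such as index.html from its public/ directory, so routes only need to cover the dynamic parts):
#!/usr/bin/perl
use Dancer2;
# index.html would live in public/ and be served automatically;
# send_file lets a route return it explicitly as well.
get '/' => sub {
    send_file 'index.html';
};
# A dynamic route, roughly what a %dispatch entry would become
get '/hello/:name' => sub {
    my $who = route_parameters->get('name');
    return "Hello, $who";
};
start;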
Let's talk about this line...
return if !ref $cgi;
This line is hiding an error. If your function is passed the wrong argument, or no argument, it will silently return and you (or the person using this) will have no idea why nothing happened. This should be an error...
use Carp;
croak "resp_index() was not given a CGI object" if !ref $cgi;
...but really you should use one of the existing function signature modules such as Method::Signatures.
use Method::Signatures;
func resp_index(CGI $cgi) {
...
}

Rancid/ Looking Glass perl script hitting an odd error: $router unavailable

I am attempting to set up a small test environment (homelab) using CentOS 6.6, Rancid 3.1, Looking Glass, and some Cisco Switches/Routers, with httpd acting as the handler. I have picked up a little perl by means of this endeavor, but python (more 2 than 3) is my background. Right now, everything on the rancid side of things works without issue: bin/clogin successfully logs into all of the equipment in the router.db file, and logging of the configs is working as expected. All switches/routers to be accessed are available and online, verified by ssh connection to devices as well as using bin/clogin.
Right now, I have placed the lg.cgi and lgform.cgi files into var/www/cgi-bin/ which allows the forms to be run as cgi scripts. I had to modify the files to split on ';' instead of ':' due to the change in the .db file in Rancid 3.1: @record = split('\:', $_); was replaced with: @record = split('\;', $_); etc. Once that change was made, I was able to load the lgform.cgi with the proper router.db parsing. At this point, it seemed like everything should be good to go. When I attempt to ping from one of those devices out to 8.8.8.8, the file correctly redirects to lg.cgi, and the page loads, but with
main is unavailable. Try again later.
as the error, where 'main' is the router hostname. Using this output, I was able to find the function responsible for this output. Here it is before I added anything:
sub DoRsh
{
my ($router, $mfg, $cmd, $arg) = @_;
my($ctime) = time();
my($val);
my($lckobj) = LockFile::Simple->make(-delay => $lock_int,
-max => $max_lock_wait, -hold => $max_lock_hold);
if ($pingcmd =~ /\d$/) {
`$pingcmd $router`;
} else {
`$pingcmd $router 56 1`;
}
if ($?) {
print "$router is unreachable. Try again later.\n";
return(-1);
}
if ($LG_SINGLE) {
if (! $lckobj->lock("$cache_dir/$router")) {
print "$router is busy. Try again later.\n";
return(-1);
}
}
$val = &DoCmd($router, $mfg, $cmd, $arg);
if ($LG_SINGLE) {
$lckobj->unlock("$cache_dir/$router");
}
return($val);
}
In order to dig in a little deeper, I peppered that function with several print statements. Here is the modified function, followed by the output from the loaded lg.cgi page:
sub DoRsh
{
my ($router, $mfg, $cmd, $arg) = @_;
my($ctime) = time();
my($val);
my($lckobj) = LockFile::Simple->make(-delay => $lock_int,
-max => $max_lock_wait, -hold => $max_lock_hold);
if ($pingcmd =~ /\d$/) {
`$pingcmd $router`;
} else {
`$pingcmd $router 56 1`;
}
print "About to test the ($?) branch.\n";
print "Also who is the remote_user?:' $remote_user'\n";
print "What about the ENV{REMOTE_USER} '$ENV{REMOTE_USER}'\n";
print "Here is the ENV{HOME}: '$ENV{HOME}'\n";
if ($?) {
print "$lckobj is the lock object.\n";
print "#_ something else to look at.\n";
print "$? whatever this is suppose to be....\n";
print "Some variables:\n";
print "$mfg is the mfg.\n";
print "$cmd was the command passed in with $arg as the argument.\n";
print "$pingcmd $router\n";
print "$cloginrc - Is the cloginrc pointing correctly?\n";
print "$LG_SINGLE the next value to be tested.\n";
print "$router is unreachable. Try again later.\n";
return(-1);
}
if ($LG_SINGLE) {
if (! $lckobj->lock("$cache_dir/$router")) {
print "$router is busy. Try again later.\n";
return(-1);
}
}
$val = &DoCmd($router, $mfg, $cmd, $arg);
if ($LG_SINGLE) {
$lckobj->unlock("$cache_dir/$router");
}
return($val);
}
OUTPUT:
About to test the (512) branch.
Also who is the remote_user?:' '
What about the ENV{REMOTE_USER} ''
Here is the ENV{HOME}: '.'
LockFile::Simple=HASH(0x1a13650) is the lock object.
main cisco ping 8.8.8.8 something else to look at.
512 whatever this is suppose to be....
Some variables:
cisco is the mfg.
ping was the command passed in with 8.8.8.8 as the argument.
/bin/ping -c 1 main
./.cloginrc - Is the cloginrc pointing correctly?
1 the next value to be tested.
main is unreachable. Try again later.
I can provide the code for when DoRsh is called, if necessary, but it looks mostly like this: &DoRsh($router, $mfg, $cmd, $arg);
From what I can tell, the '$?' special variable (or at least according to this reference it is a special var) is returning the 512 value, which is causing that branch to test true. The problem is I don't know what that 512 means, nor where it is coming from. Using the ref site's description ("The status returned by the last pipe close, backtick (``) command, or system operator.") and the formation of the conditional tree above, I can see that it is an error of some kind, but I don't know how else to proceed with this inspection. I'm wondering if maybe it is in response to some permission issue, since the remote_user variable is null, when I didn't expect it to be. Any guidance anyone may be able to provide would be helpful. Furthermore, if there is any information that I may have skipped over, that I didn't think to include, or that may prove helpful, please ask, and I will provide it to the best of my ability.
Maybe you could put in something like
my $pingret = `$pingcmd ...`;
print 'Ping result was: ' . $pingret;
and check the returned string?
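For reference, a small sketch of decoding $? (the standard pattern from perldoc perlvar); 512 >> 8 is 2, i.e. ping itself exited with status 2, which on Linux usually means an error other than "no reply" (for example a name-resolution or permission problem):
# $pingcmd and $router are the variables already used in DoRsh.
my $output = `$pingcmd $router 56 1`;
if ($? == -1) {
    print "failed to execute ping: $!\n";
}
elsif ($? & 127) {
    printf "ping died with signal %d\n", ($? & 127);
}
else {
    printf "ping exited with status %d\n", $? >> 8;   # 512 >> 8 == 2 here
}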

Yii + Zend gdata. Youtube upload

I want to upload videos in the following way:
I just upload a file to the server (as usual)
My server-side Yii application takes that video and uploads it to YouTube from a special YouTube account
What do I have:
My YouTube (Google) account name and email: "name" or "name@gmail.com"
My password
A developer Key, which I found in Google's "Product Dashboard"
The name of the application, which is 'myapp':
Product Dashboard: myapp
So, I read some docs from Google and decided that the best way for me is to use the ClientLogin auth type, because I have only one account to use and I have all the necessary data. I found an example for Zend Framework's GData and imported it into my Yii application.
I deliberately simplified the code to upload just one single video from the /upload directory, to test that it works. I expected to find an uploaded video in my YT account. Of course there is no video, and here I am :-) The complete code of the action is below:
Yii::import('application.vendors.*');
require_once 'Zend/Loader.php';
Zend_Loader::loadClass('Zend_Gdata_YouTube');
Zend_Loader::loadClass('Zend_Gdata_ClientLogin');
$yt_user = 'myYTname';
$yt_pass = 'myYTpass';
$yt_source = 'myapp';
$yt_api_key = 'veryVERYlongKEYhere';
$authenticationURL= 'https://www.google.com/accounts/ClientLogin';
$httpClient = Zend_Gdata_ClientLogin::getHttpClient(
$username = $yt_user,
$password = $yt_pass,
$service = 'youtube',
$client = null,
$source = $yt_source,
$loginToken = null,
$loginCaptcha = null,
$authenticationURL
);
$yt = new Zend_Gdata_YouTube($httpClient, $yt_source, null, $yt_api_key);
$myVideoEntry = new Zend_Gdata_YouTube_VideoEntry();
$filesource = $yt->newMediaFileSource(Yii::getpathOfAlias('webroot').'/upload/videos/video.mp4');
$filesource->setContentType('video/mp4');
$filesource->setSlug('video.mp4');
$myVideoEntry->setMediaSource($filesource);
$myVideoEntry->setVideoTitle('My Test Movie');
$myVideoEntry->setVideoDescription('My Test Movie description');
$myVideoEntry->setVideoCategory('Autos');
$myVideoEntry->SetVideoTags('cars, funny');
$myVideoEntry->setVideoDeveloperTags(array('mydevtag', 'anotherdevtag'));
$uploadUrl = "http://uploads.gdata.youtube.com/feeds/api/users/{$yt_user}/uploads";
try {
$newEntry = $yt->insertEntry($myVideoEntry, $uploadUrl, 'Zend_Gdata_YouTube_VideoEntry');
} catch (Zend_Gdata_App_HttpException $httpException) {
echo $httpException->getRawResponseBody();
} catch (Zend_Gdata_App_Exception $e) {
echo $e->getMessage();
}
As you can see, there is a lot of default code from the official example. But it doesn't work. No echo shows me any information. But when I removed the try-catch, I got an error:
Zend_Gdata_App_HttpException
Read timed out after 10 seconds
So, I solved this problem myself :)
First of all: don't try to upload from localhost!
Then, in my case, I got an error that I hadn't supplied my dev key. So, if you get the same error, try to change this:
$newEntry = $yt->insertEntry($myVideoEntry, $uploadUrl, 'Zend_Gdata_YouTube_VideoEntry');
by adding the 4th parameter - extra headers:
$yt->insertEntry($myVideoEntry, $uploadUrl, 'Zend_Gdata_YouTube_VideoEntry', array(
'X-GData-Key' => 'key=yourBIGbigBIGdeveloperKEYhere'
));
Good luck and have fun with the YouTube API!

How can I implement incremental (find-as-you-type) search on command line?

I'd like to write small scripts which feature incremental search (find-as-you-type) on the command line.
Use case: I have my mobile phone connected via USB. Using gammu --sendsms TEXT I can send text messages. I have the phonebook as CSV, and want to search-as-I-type on that.
What's the easiest/best way to do it? It might be in bash/zsh/Perl/Python or any other scripting language.
Edit:
Solution: Modifying Term::Complete did what I wanted. See below for the answer.
I get the impression GNU Readline supports this kind of thing, though I have not used it myself. Here is a C++ example of custom auto-completion, which could easily be done in C too. There is also a Python API for readline.
This StackOverflow question gives examples in Python, one of which is ...
import readline

def completer(text, state):
    options = [x for x in addrs if x.startswith(text)]
    if state < len(options):
        return options[state]
    else:
        return None

readline.set_completer(completer)
This article on Bash autocompletion may help. This article also gives examples of programming Bash's autocomplete feature.
Following Aiden Bell's hint, I tried Readline in Perl.
Solution 1 using Term::Complete (also used by CPAN, I think):
use Term::Complete;
my $F;
open($F,"<","bin/phonebook.csv");
my @terms = <$F>; chomp(@terms);
close($F);
my $input;
while (!defined $input) {
$input = Complete("Enter a name or number: ", @terms);
my ($name,$number) = split(/\t/,$input);
print("Sending SMS to $name ($number).\n");
system("sudo gammu --sendsms TEXT $number");
}
Press Tab to complete, press Ctrl-D to see all possibilities.
Solution 2: Ctrl-D is one keystroke too many, so using the standard Term::ReadLine allows completion and the display of possible completions using only Tab.
use Term::ReadLine;
my $F;
open($F,"<","bin/phonebook.csv");
my @terms = <$F>; chomp(@terms);
close($F);
my $term = new Term::ReadLine;
$term->Attribs->{completion_function} = sub { return @terms; };
my $prompt = "Enter name or number >> ";
my $OUT = $term->OUT || \*STDOUT;
while ( defined (my $input = $term->readline($prompt)) ) {
my ($name,$number) = split(/\t/,$input);
print("Sending SMS to $name ($number).\n");
system("sudo gammu --sendsms TEXT $number");
}
This solution still needs a Tab for completion.
Edit: Final Solution
Modifying Term::Complete (http://search.cpan.org/~jesse/perl-5.12.0/lib/Term/Complete.pm) does give me on the fly completion.
Source code: http://search.cpan.org/CPAN/authors/id/J/JE/JESSE/perl-5.12.0.tar.gz
Solution number 1 works with this modification. I will put the whole sample online somewhere else if it can be of use to somebody.
Modifications of Complete.pm (just reusing its code for Ctrl-D and Tab on each character):
170c172,189
my $redo=0;
@match = grep(/^\Q$return/, @cmp_lst);
unless ($#match < 0) {
$l = length($test = shift(@match));
foreach $cmp (@match) {
until (substr($cmp, 0, $l) eq substr($test, 0, $l)) {
$l--;
}
}
print("\a");
print($test = substr($test, $r, $l - $r));
$redo = $l - $r == 0;
if ($redo) { print(join("\r\n", '', grep(/^\Q$return/, @cmp_lst)), "\r\n"); }
$r = length($return .= $test);
}
if ($redo) { redo LOOP; } else { last CASE; }

Delete Amazon S3 buckets? [closed]

I've been interacting with Amazon S3 through S3Fox and I can't seem to delete my buckets. I select a bucket, hit delete, confirm the delete in a popup, and... nothing happens. Is there another tool that I should use?
It is finally possible to delete all the files in one go using the new Lifecycle (expiration) rules feature. You can even do it from the AWS console.
Simply right click on the bucket name in AWS console, select "Properties" and then in the row of tabs at the bottom of the page select "lifecycle" and "add rule". Create a lifecycle rule with the "Prefix" field set blank (blank means all files in the bucket, or you could set it to "a" to delete all files whose names begin with "a"). Set the "Days" field to "1". That's it. Done. Assuming the files are more than one day old they should all get deleted, then you can delete the bucket.
I only just tried this for the first time so I'm still waiting to see how quickly the files get deleted (it wasn't instant but presumably should happen within 24 hours) and whether I get billed for one delete command or 50 million delete commands... fingers crossed!
Remember that S3 buckets need to be empty before they can be deleted. The good news is that most 3rd-party tools automate this process. If you are running into problems with S3Fox, I recommend trying S3FM for a GUI or S3Sync for the command line. Amazon has a great article describing how to use S3Sync. After setting up your variables, the key command is
./s3cmd.rb deleteall <your bucket name>
Deleting buckets with lots of individual files tends to crash a lot of S3 tools because they try to display a list of all files in the directory. You need to find a way to delete in batches. The best GUI tool I've found for this purpose is Bucket Explorer. It deletes files in an S3 bucket in 1,000-file chunks and does not crash when trying to open large buckets, unlike S3Fox and S3FM.
I've also found a few scripts that you can use for this purpose. I haven't tried these scripts yet but they look pretty straightforward.
RUBY
require 'aws/s3'
AWS::S3::Base.establish_connection!(
:access_key_id => 'your access key',
:secret_access_key => 'your secret key'
)
bucket = AWS::S3::Bucket.find('the bucket name')
while(!bucket.empty?)
begin
puts "Deleting objects in bucket"
bucket.objects.each do |object|
object.delete
puts "There are #{bucket.objects.size} objects left in the bucket"
end
puts "Done deleting objects"
rescue SocketError
puts "Had socket error"
end
end
PERL
#!/usr/bin/perl
use Net::Amazon::S3;
my $aws_access_key_id = 'your access key';
my $aws_secret_access_key = 'your secret access key';
my $increment = 50; # 50 at a time
my $bucket_name = 'bucket_name';
my $s3 = Net::Amazon::S3->new({aws_access_key_id => $aws_access_key_id, aws_secret_access_key => $aws_secret_access_key, retry => 1, });
my $bucket = $s3->bucket($bucket_name);
print "Incrementally deleting the contents of $bucket_name\n";
my $deleted = 1;
my $total_deleted = 0;
while ($deleted > 0) {
print "Loading up to $increment keys...\n";
$response = $bucket->list({'max-keys' => $increment, }) or die $s3->err . ": " . $s3->errstr . "\n";
$deleted = scalar(@{ $response->{keys} }) ;
$total_deleted += $deleted;
print "Deleting $deleted keys($total_deleted total)...\n";
foreach my $key ( @{ $response->{keys} } ) {
my $key_name = $key->{key};
$bucket->delete_key($key->{key}) or die $s3->err . ": " . $s3->errstr . "\n";
}
}
print "Deleting bucket...\n";
$bucket->delete_bucket or die $s3->err . ": " . $s3->errstr;
print "Done.\n";
SOURCE: Tarkblog
Hope this helps!
Recent versions of s3cmd have --recursive
e.g.,
~/$ s3cmd rb --recursive s3://bucketwithfiles
http://s3tools.org/kb/item5.htm
With s3cmd:
Create a new empty directory
s3cmd sync --delete-removed empty_directory s3://yourbucket
This may be a bug in S3Fox, because it is generally able to delete items recursively. However, I'm not sure if I've ever tried to delete a whole bucket and its contents at once.
The JetS3t project, as mentioned by Stu, includes a Java GUI applet you can easily run in a browser to manage your S3 buckets: Cockpit. It has both strengths and weaknesses compared to S3Fox, but there's a good chance it will help you deal with your troublesome bucket. Though it will require you to delete the objects first, then the bucket.
Disclaimer: I'm the author of JetS3t and Cockpit
SpaceBlock also makes it simple to delete s3 buckets - right click bucket, delete, wait for job to complete in transfers view, done.
This is the free and open source windows s3 front-end that I maintain, so shameless plug alert etc.
I've implemented bucket-destroy, a multi-threaded utility that does everything it takes to delete a bucket. I handle non-empty buckets, as well as version-enabled bucket keys.
You can read the blog post here http://bytecoded.blogspot.com/2011/01/recursive-delete-utility-for-version.html and the instructions here http://code.google.com/p/bucket-destroy/
I've successfully used it to delete a bucket that contains double '//' in key names, versioned keys, and DeleteMarker keys. Currently I'm running it on a bucket that contains ~40,000,000 objects; so far I've been able to delete 1,200,000 in several hours on an m1.large. Note that the utility is multi-threaded but does not (yet) implement shuffling (which would allow horizontal scaling, launching the utility on several machines).
If you use Amazon's console and need to clear out a bucket on a one-time basis: browse to your bucket, select the top key, scroll to the bottom, hold Shift on your keyboard, and click the bottom one. It will select everything in between; then you can right-click and delete.
If you have ruby (and rubygems) installed, install aws-s3 gem with
gem install aws-s3
or
sudo gem install aws-s3
create a file delete_bucket.rb:
require "rubygems" # optional
require "aws/s3"
AWS::S3::Base.establish_connection!(
:access_key_id => 'access_key_id',
:secret_access_key => 'secret_access_key')
AWS::S3::Bucket.delete("bucket_name", :force => true)
and run it:
ruby delete_bucket.rb
Since Bucket#delete returned timeout exceptions a lot for me, I have expanded the script:
require "rubygems" # optional
require "aws/s3"
AWS::S3::Base.establish_connection!(
:access_key_id => 'access_key_id',
:secret_access_key => 'secret_access_key')
while AWS::S3::Bucket.find("bucket_name")
begin
AWS::S3::Bucket.delete("bucket_name", :force => true)
rescue
end
end
I guess the easiest way would be to use S3fm, a free online file manager for Amazon S3. No applications to install, no third-party web site registrations. It runs directly from Amazon S3, secure and convenient.
Just select your bucket and hit delete.
One technique that can be used to avoid this problem is putting all objects in a "folder" in the bucket, allowing you to just delete the folder then go along and delete the bucket. Additionally, the s3cmd tool available from http://s3tools.org can be used to delete a bucket with files in it:
s3cmd rb --force s3://bucket-name
I hacked together a script for doing it from Python; it successfully removed my 9,000 objects. See this page:
https://efod.se/blog/archive/2009/08/09/delete-s3-bucket
One more shameless plug: I got tired of waiting for individual HTTP delete requests when I had to delete 250,000 items, so I wrote a Ruby script that does it multithreaded and completes in a fraction of the time:
http://github.com/sfeley/s3nuke/
This is one that works much faster in Ruby 1.9 because of the way threads are handled.
This is a hard problem. My solution is at http://stuff.mit.edu/~jik/software/delete-s3-bucket.pl.txt. It describes all of the things I've determined can go wrong in a comment at the top. Here's the current version of the script (if I change it, I'll put a new version at the URL but probably not here).
#!/usr/bin/perl
# Copyright (c) 2010 Jonathan Kamens.
# Released under the GNU General Public License, Version 3.
# See <http://www.gnu.org/licenses/>.
# $Id: delete-s3-bucket.pl,v 1.3 2010/10/17 03:21:33 jik Exp $
# Deleting an Amazon S3 bucket is hard.
#
# * You can't delete the bucket unless it is empty.
#
# * There is no API for telling Amazon to empty the bucket, so you have to
# delete all of the objects one by one yourself.
#
# * If you've recently added a lot of large objects to the bucket, then they
# may not all be visible yet on all S3 servers. This means that even after the
# server you're talking to thinks all the objects are all deleted and lets you
# delete the bucket, additional objects can continue to propagate around the S3
# server network. If you then recreate the bucket with the same name, those
# additional objects will magically appear in it!
#
# It is not clear to me whether the bucket delete will eventually propagate to
# all of the S3 servers and cause all the objects in the bucket to go away, but
# I suspect it won't. I also suspect that you may end up continuing to be
# charged for these phantom objects even though the bucket they're in is no
# longer even visible in your S3 account.
#
# * If there's a CR, LF, or CRLF in an object name, then it's sent just that
# way in the XML that gets sent from the S3 server to the client when the
# client asks for a list of objects in the bucket. Unfortunately, the XML
# parser on the client will probably convert it to the local line ending
# character, and if it's different from the character that's actually in the
# object name, you then won't be able to delete it. Ugh! This is a bug in the
# S3 protocol; it should be enclosing the object names in CDATA tags or
# something to protect them from being munged by the XML parser.
#
# Note that this bug even affects the AWS Web Console provided by Amazon!
#
# * If you've got a whole lot of objects and you serialize the delete process,
# it'll take a long, long time to delete them all.
use threads;
use strict;
use warnings;
# Keys can have newlines in them, which screws up the communication
# between the parent and child processes, so use URL encoding to deal
# with that.
use CGI qw(escape unescape); # Easiest place to get this functionality.
use File::Basename;
use Getopt::Long;
use Net::Amazon::S3;
my $whoami = basename $0;
my $usage = "Usage: $whoami [--help] --access-key-id=id --secret-access-key=key
--bucket=name [--processes=#] [--wait=#] [--nodelete]
Specify --processes to indicate how many deletes to perform in
parallel. You're limited by RAM (to hold the parallel threads) and
bandwidth for the S3 delete requests.
Specify --wait to indicate seconds to require the bucket to be verified
empty. This is necessary if you create a huge number of objects and then
try to delete the bucket before they've all propagated to all the S3
servers (I've seen a huge backlog of newly created objects take *hours* to
propagate everywhere). See the comment at the top of the script for more
information about this issue.
Specify --nodelete to empty the bucket without actually deleting it.\n";
my($aws_access_key_id, $aws_secret_access_key, $bucket_name, $wait);
my $procs = 1;
my $delete = 1;
die if (! GetOptions(
"help" => sub { print $usage; exit; },
"access-key-id=s" => \$aws_access_key_id,
"secret-access-key=s" => \$aws_secret_access_key,
"bucket=s" => \$bucket_name,
"processess=i" => \$procs,
"wait=i" => \$wait,
"delete!" => \$delete,
));
die if (! ($aws_access_key_id && $aws_secret_access_key && $bucket_name));
my $increment = 0;
print "Incrementally deleting the contents of $bucket_name\n";
$| = 1;
my(@procs, $current);
for (1..$procs) {
my($read_from_parent, $write_to_child);
my($read_from_child, $write_to_parent);
pipe($read_from_parent, $write_to_child) or die;
pipe($read_from_child, $write_to_parent) or die;
threads->create(sub {
close($read_from_child);
close($write_to_child);
my $old_select = select $write_to_parent;
$| = 1;
select $old_select;
&child($read_from_parent, $write_to_parent);
}) or die;
close($read_from_parent);
close($write_to_parent);
my $old_select = select $write_to_child;
$| = 1;
select $old_select;
push(@procs, [$read_from_child, $write_to_child]);
}
my $s3 = Net::Amazon::S3->new({aws_access_key_id => $aws_access_key_id,
aws_secret_access_key => $aws_secret_access_key,
retry => 1,
});
my $bucket = $s3->bucket($bucket_name);
my $deleted = 1;
my $total_deleted = 0;
my $last_start = time;
my($start, $waited);
while ($deleted > 0) {
$start = time;
print "\nLoading ", ($increment ? "up to $increment" :
"as many as possible")," keys...\n";
my $response = $bucket->list({$increment ? ('max-keys' => $increment) : ()})
or die $s3->err . ": " . $s3->errstr . "\n";
$deleted = scalar(@{ $response->{keys} }) ;
if (! $deleted) {
if ($wait and ! $waited) {
my $delta = $wait - ($start - $last_start);
if ($delta > 0) {
print "Waiting $delta second(s) to confirm bucket is empty\n";
sleep($delta);
$waited = 1;
$deleted = 1;
next;
}
else {
last;
}
}
else {
last;
}
}
else {
$waited = undef;
}
$total_deleted += $deleted;
print "\nDeleting $deleted keys($total_deleted total)...\n";
$current = 0;
foreach my $key ( @{ $response->{keys} } ) {
my $key_name = $key->{key};
while (! &send(escape($key_name) . "\n")) {
print "Thread $current died\n";
die "No threads left\n" if (#procs == 1);
if ($current == @procs-1) {
pop @procs;
$current = 0;
}
else {
$procs[$current] = pop @procs;
}
}
$current = ($current + 1) % @procs;
threads->yield();
}
print "Sending sync message\n";
for ($current = 0; $current < @procs; $current++) {
if (! &send("\n")) {
print "Thread $current died sending sync\n";
if ($current == @procs-1) {
pop @procs;
last;
}
$procs[$current] = pop @procs;
$current--;
}
threads->yield();
}
print "Reading sync response\n";
for ($current = 0; $current < @procs; $current++) {
if (! &receive()) {
print "Thread $current died reading sync\n";
if ($current == @procs-1) {
pop @procs;
last;
}
$procs[$current] = pop @procs;
$current--;
}
threads->yield();
}
}
continue {
$last_start = $start;
}
if ($delete) {
print "Deleting bucket...\n";
$bucket->delete_bucket or die $s3->err . ": " . $s3->errstr;
print "Done.\n";
}
sub send {
my($str) = @_;
my $fh = $procs[$current]->[1];
print($fh $str);
}
sub receive {
my $fh = $procs[$current]->[0];
scalar <$fh>;
}
sub child {
my($read, $write) = @_;
threads->detach();
my $s3 = Net::Amazon::S3->new({aws_access_key_id => $aws_access_key_id,
aws_secret_access_key => $aws_secret_access_key,
retry => 1,
});
my $bucket = $s3->bucket($bucket_name);
while (my $key = <$read>) {
if ($key eq "\n") {
print($write "\n") or die;
next;
}
chomp $key;
$key = unescape($key);
if ($key =~ /[\r\n]/) {
my(@parts) = split(/\r\n|\r|\n/, $key, -1);
my(@guesses) = shift @parts;
foreach my $part (@parts) {
@guesses = (map(($_ . "\r\n" . $part,
$_ . "\r" . $part,
$_ . "\n" . $part), #guesses));
}
foreach my $guess (@guesses) {
if ($bucket->get_key($guess)) {
$key = $guess;
last;
}
}
}
$bucket->delete_key($key) or
die $s3->err . ": " . $s3->errstr . "\n";
print ".";
threads->yield();
}
return;
}
I am one of the developers on the Bucket Explorer team. We provide different options to delete a bucket, as per the user's choice:
1) Quick Delete - this option deletes your data from the bucket in chunks of 1,000.
2) Permanent Delete - this option deletes objects in a queue.
How to delete Amazon S3 files and bucket?
Amazon recently added a new feature, "Multi-Object Delete", which allows up to 1,000 objects to be deleted at a time with a single API request. This should simplify the process of deleting huge numbers of files from a bucket.
The documentation for the new feature is available here: http://docs.amazonwebservices.com/AmazonS3/latest/dev/DeletingMultipleObjects.html
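For example, a minimal Perl sketch of that batched approach; it assumes a recent Net::Amazon::S3 whose bucket object provides delete_multi_object (check the version you have installed):
use Net::Amazon::S3;
my $s3 = Net::Amazon::S3->new({
    aws_access_key_id     => 'your access key',
    aws_secret_access_key => 'your secret key',
    retry                 => 1,
});
my $bucket = $s3->bucket('bucket_name');
while (1) {
    my $response = $bucket->list({ 'max-keys' => 1000 })
        or die $s3->err . ": " . $s3->errstr;
    my @keys = map { $_->{key} } @{ $response->{keys} };
    last unless @keys;
    # One Multi-Object Delete request removes up to 1,000 objects
    $bucket->delete_multi_object(@keys)
        or die $s3->err . ": " . $s3->errstr;
    print "Deleted ", scalar(@keys), " objects\n";
}
$bucket->delete_bucket or die $s3->err . ": " . $s3->errstr;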
I've always ended up using their C# API and little scripts to do this. I'm not sure why S3Fox can't do it, but that functionality appears to be broken within it at the moment. I'm sure that many of the other S3 tools can do it as well, though.
Delete all of the objects in the bucket first. Then you can delete the bucket itself.
Apparently, one cannot delete a bucket with objects in it and S3Fox does not do this for you.
I've had other little issues with S3Fox myself, like this, and now use a Java-based tool, jets3t, which is more forthcoming about error conditions. There must be others, too.
You must make sure you have the correct write permission set for the bucket, and that the bucket contains no objects.
Some useful tools that can assist your deletion: CrossFTP, to view and delete buckets like an FTP client; the jets3t tool, as mentioned above.
I'll have to have a look at some of these alternative file managers. I've used (and like) BucketExplorer, which you can get from - surprisingly - http://www.bucketexplorer.com/.
It's a 30 day free trial, then (currently) costing US$49.99 per licence (US$49.95 on the purchase cover page).
Try https://s3explorer.appspot.com/ to manage your S3 account.
This is what I use. Just simple ruby code.
case bucket.size
when 0
puts "Nothing left to delete"
when 1..1000
bucket.objects.each do |item|
item.delete
puts "Deleting - #{bucket.size} left"
end
end
Use the Amazon web management console, with Google Chrome for speed. It deleted the objects a lot faster than Firefox (about 10 times faster). I had 60,000 objects to delete.