I have set up two Ceph clusters, each with a RADOS Gateway on one of its nodes.
What I'm trying to achieve is to transfer all objects from a bucket "A" with an endpoint in my cluster "1" to a bucket "B" which can be reached from another endpoint on my cluster "2". It doesn't really matter for my issue but at least you understand the context.
I created a script in Python using the boto3 module.
The script is really simple: I just wanted to put an object in a bucket.
The relevant part is below:
s3 = boto3.resource('s3',
                    endpoint_url=credentials['endpoint_url'],
                    aws_access_key_id=credentials['access_key'],
                    aws_secret_access_key=credentials['secret_key'],
                    use_ssl=False)
s3.Object('my-bucket', 'hello.txt').put(Body=open('/tmp/hello.txt', 'rb'))
(hello.txt just contains a word)
Let's say this script is written and runs from a node (which is the radosgw endpoint node) in my cluster 1. It works well when the "endpoint_url" is the node I'm running the script from but it does not work when I'm trying to reach my other endpoint (the radosgw, located in another node within my cluster "2").
I got this error :
botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL
The weird thing is that I can create a bucket without any error :
s3_src.create_bucket(Bucket=bucket_name)
s3_dest.create_bucket(Bucket=bucket_name)
I can even list the buckets of my two endpoints.
Do you have any idea why I can do pretty much everything but not put a single object in my second endpoint ?
I hope this makes sense.
Ultimately, I found that the issue was not related to boto but to the Ceph pool that contains my data.
The bucket pool was healthy, which is why I could create my buckets, whereas the data pool was unhealthy, hence the failure when I tried to put an object in a bucket.
I have an AWS setup that requires me to assume a role and get the corresponding credentials in order to write to S3. For example, to write with the AWS CLI, I need to use the --profile readwrite flag. If I wrote the code myself with boto, I'd assume the role via STS, get credentials, and create a new session.
However, there is a bunch of applications and packages relying on boto3's configuration, e.g. internal code runs like this:
s3 = boto3.resource('s3')
result_s3 = s3.Object(bucket, s3_object_key)
result_s3.put(
    Body=value.encode(content_encoding),
    ContentEncoding=content_encoding,
    ContentType=content_type,
)
According to the documentation, boto3 can be set to use a default profile through (among others) the AWS_PROFILE environment variable, and it clearly "works" in the sense that boto3.Session().profile_name matches the variable, but the applications still won't write to S3.
What would be the cleanest/correct way to set this up properly? I tried to pull credentials from STS and write them as AWS_SECRET_TOKEN etc., but that didn't work for me...
Have a look at the answer here:
How to choose an AWS profile when using boto3 to connect to CloudFront
You can get boto3 to use the other profile like so:
rw = boto3.session.Session(profile_name='readwrite')
s3 = rw.resource('s3')
I think the correct answer to my question is one shared by Nathan Williams in the comment.
In my specific case, since I had to start the code from Python and was a bit worried about setting AWS settings that might spill into other operations, I used the fact that boto3 has a DEFAULT_SESSION singleton, used each time, and just overwrote it with a session that assumed the proper role:
hook = S3Hook(aws_conn_id=aws_conn_id)
boto3.DEFAULT_SESSION = hook.get_session()
(Here, S3Hook is Airflow's S3 handling object.) After that, within the same runtime, everything worked perfectly.
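For anyone without Airflow, the same idea can be sketched with plain STS. This is only an illustration under assumptions: the role ARN, the session name and both helper names are made up for the example, and the STS client is passed in so the mapping logic can be tried without touching AWS. boto3.setup_default_session() is the public way to replace the default session, which avoids assigning to DEFAULT_SESSION by hand.

```python
def credentials_to_session_kwargs(assume_role_response):
    """Map an STS AssumeRole response onto boto3 Session keyword arguments."""
    creds = assume_role_response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


def assume_role_session_kwargs(sts_client, role_arn, session_name="readwrite-session"):
    """Assume the role and return kwargs suitable for boto3.setup_default_session()."""
    response = sts_client.assume_role(RoleArn=role_arn, RoleSessionName=session_name)
    return credentials_to_session_kwargs(response)


# Usage (needs boto3 and real credentials, so it is left commented out;
# the role ARN below is a placeholder):
# import boto3
# kwargs = assume_role_session_kwargs(
#     boto3.client("sts"), "arn:aws:iam::123456789012:role/readwrite")
# boto3.setup_default_session(**kwargs)
# s3 = boto3.resource("s3")  # built from the assumed-role default session
```

Keep in mind that assumed-role credentials expire, so a long-running process would have to repeat this before the expiration time.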
I am completely new to working with AWS. Currently I am in the following situation: My lambda function starts an EC2 instance. This instance will need the information contained in the 'ID' variable. I was wondering how I could transfer this data from my lambda function to the EC2 instance. Is this even possible?
import os
import boto3

region = 'eu-west-1'
instances = ['AnEC2Instance-ID']
ec2 = boto3.client('ec2', region_name=region)

def lambda_handler(event, context):
    ID = event.get('ID')
    ec2.start_instances(InstanceIds=instances)
    print('started your instance: ' + str(instances))
Here 'AnEC2Instance-ID' is supposed to be an EC2 instance ID.
This Lambda function is triggered by API Gateway. The ID is obtained from the API Gateway event using the line: ID = event.get('ID')
These EC2 instances have already been launched and in this lambda are being started via boto3 ec2.start_instances. Prior to this you would have to do some clever AWS stuff to modify the instance's user-data and also have the instance configured to re-run the user-data at start (not just launch). Quite complex IMHO.
Two alternate suggestions:
Revisit your need to start an existing EC2 instance, as you can easily pass data to a new instance with boto3 in the client.run_instances function.
Or, if you truly need to revive an existing EC2 instance, you might need a third component to manage the correlation between EC2 instance IDs and your event IDs: how about DynamoDB? First, your script above writes a key-value pair of the instance ID and the event ID. Then it invokes ec2.start_instances, and when the EC2 instance starts, it is pre-configured to run curl http://169.254.169.254/latest/meta-data/instance-id and use that value to query DynamoDB.
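A rough sketch of the DynamoDB variant, under assumptions: the table has a string partition key named InstanceId, and the attribute name EventId and both helper names are invented for this example. The table argument is expected to behave like a boto3 DynamoDB Table resource (put_item / get_item).

```python
def record_pending_event(table, instance_id, event_id):
    """Lambda side: remember which event ID belongs to which instance."""
    table.put_item(Item={"InstanceId": instance_id, "EventId": event_id})


def fetch_pending_event(table, instance_id):
    """Instance side: look up the event ID stored for this instance, if any."""
    response = table.get_item(Key={"InstanceId": instance_id})
    item = response.get("Item")  # "Item" is absent when the key does not exist
    return item["EventId"] if item else None


# Usage (requires boto3 and an existing table; the table name is a placeholder):
# import boto3
# table = boto3.resource("dynamodb").Table("instance-events")
# record_pending_event(table, "i-0123456789abcdef0", event.get("ID"))
```

On the instance, the instance ID passed to fetch_pending_event would come from the metadata endpoint mentioned above.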
When you launch an Amazon EC2 instance, you can provide data in the User Data parameter.
This data will then be accessible on the instance via:
http://169.254.169.254/latest/user-data/
This technique is also used to pass a startup script to an instance. There is software provided on the standard Amazon AMIs that will run the script if it starts with specific identifiers. However, you can simply pass any data via User Data to make it available to the instance.
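As an illustration of passing arbitrary data this way, here is a minimal sketch. The helper name, the AMI ID and the JSON payload layout are assumptions for the example; the EC2 client is passed in. With boto3, run_instances takes the user data as a plain string and base64-encodes it for you.

```python
import json


def launch_with_payload(ec2_client, image_id, instance_type, payload):
    """Launch one new instance and hand it `payload` (a dict) as JSON user data."""
    response = ec2_client.run_instances(
        ImageId=image_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
        UserData=json.dumps(payload),  # readable on the instance at /latest/user-data/
    )
    return response["Instances"][0]["InstanceId"]


# Usage inside the Lambda handler (the AMI ID is a placeholder):
# import boto3
# ec2 = boto3.client("ec2", region_name="eu-west-1")
# launch_with_payload(ec2, "ami-xxxxxxxx", "t2.micro", {"ID": event.get("ID")})
```

The instance can then fetch http://169.254.169.254/latest/user-data/ and parse the JSON to recover the ID.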
I'm trying to modify an RDS DB instance launched in a VPC through the AWS API, using the ModifyDBInstance action. I'm not changing the instance type (the instance was launched as db.m1.small and that has not changed), but I'm receiving this message:
AWS Error. Request ModifyDBInstance failed. Cannot modify the instance class because there are no instances of the requested class available in the current instance's availability zone. Please try your request again at a later time. (RequestID: xxx).
According to AWS docs
To determine the instance classes that are available for a particular DB engine, use the DescribeOrderableDBInstanceOptions action. Note that not all instance classes are available in all regions for all DB engines.
So I have two questions:
Is it possible to get, through the API, only the instance types available in a specific AZ? The response of the DescribeOrderableDBInstanceOptions action contains many instance types that are not available. I also checked the response of the DescribeReservedDBInstancesOfferings action, and it doesn't fit either.
Why is it possible to launch a DB instance with a given instance type, yet run into trouble when modifying that DB instance without changing the instance type?
Any ideas?
It looks like one of the return values listed in this AWS RDS CLI call is AvailabilityZones
AvailabilityZones -> (list)
    A list of Availability Zones for the orderable DB instance.
    (structure)
        Contains Availability Zone information.
        This data type is used as an element in the following data type:
            OrderableDBInstanceOption
        Name -> (string)
            The name of the availability zone.
Generally the CLI allows you to filter, but for some reason filtering is not supported for RDS:
--filters (list)
This parameter is not currently supported.
The API returns the object OrderableDBInstanceOption which also has the AZ listed.
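Since server-side filtering is not available, the AZ filtering can be done on the client. A small sketch, assuming boto3's response layout (an OrderableDBInstanceOptions list whose entries carry DBInstanceClass and AvailabilityZones); both helper names are made up for the example, and the RDS client is passed in.

```python
def list_orderable_options(rds_client, engine):
    """Collect every orderable option for one engine, following pagination."""
    paginator = rds_client.get_paginator("describe_orderable_db_instance_options")
    options = []
    for page in paginator.paginate(Engine=engine):
        options.extend(page["OrderableDBInstanceOptions"])
    return options


def classes_available_in_az(orderable_options, az_name):
    """Return the sorted instance classes that list `az_name` among their AZs."""
    classes = set()
    for option in orderable_options:
        zones = {zone["Name"] for zone in option.get("AvailabilityZones", [])}
        if az_name in zones:
            classes.add(option["DBInstanceClass"])
    return sorted(classes)


# Usage (requires boto3 and credentials; region and AZ are placeholders):
# import boto3
# rds = boto3.client("rds", region_name="eu-west-1")
# print(classes_available_in_az(list_orderable_options(rds, "mysql"), "eu-west-1a"))
```

This only tells you what is orderable in that AZ; it does not guarantee that capacity is available at the moment you make the request.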
To answer #2: AWS does have capacity issues from time to time, like any other cloud or service provider, though it generally handles them better than others. Which AZ are you trying to use, and what instance size? If you continue to have issues, I would open a support ticket with AWS.
The easiest way is to select any one of the RDS instances in your infrastructure and click Modify; there will be a drop-down option, something like dbInstanceTypes, where you can find the instance types available in that particular region.
Currently, I have the following that downloads all files under an AWS (Amazon Web Services) S3 bucket.
To do this in boto3, you could do something like this:
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')
for object_summary in bucket.objects.all():
    print(object_summary.key)
    print(object_summary.last_modified)
In this case boto3 will handle all pagination so you won't be limited to only the first page of results.
Is that what you are trying to do?
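If the goal is to actually download every object rather than just list them, a rough sketch could look like the following. The helper name and the local directory layout are assumptions; the bucket argument is expected to behave like a boto3 Bucket resource (objects.all() and download_file).

```python
import os


def download_all(bucket, target_dir):
    """Download every object in `bucket` under `target_dir`, mirroring key paths."""
    downloaded = []
    for summary in bucket.objects.all():
        if summary.key.endswith("/"):
            continue  # skip "folder" placeholder keys, which have no file content
        local_path = os.path.join(target_dir, *summary.key.split("/"))
        # Create intermediate directories for keys such as "logs/2015/file.txt".
        os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
        bucket.download_file(summary.key, local_path)
        downloaded.append(local_path)
    return downloaded


# Usage (requires boto3 and credentials):
# import boto3
# download_all(boto3.resource("s3").Bucket("mybucket"), "/tmp/mybucket")
```

Keys are used as relative paths as-is here, so this is only suitable for buckets whose key names you trust.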
From Documentation:
http://boto3.readthedocs.io/en/latest/reference/services/s3.html?highlight=s3#S3.Client.list_objects_v2
Returns some or all (up to 1000) of the objects in a bucket.
response = client.list_objects_v2(
    Bucket='string',
    Delimiter='string',
    EncodingType='url',
    MaxKeys=123,
    Prefix='string',
    ContinuationToken='string',
    FetchOwner=True|False,
    StartAfter='string',
    RequestPayer='requester'
)
MaxKeys (integer) -- Sets the maximum number of keys returned in the response. The response might contain fewer keys but will never contain more.
Do you know how many objects you have in S3?
Limitation: boto3's list_objects() and list_objects_v2() return at most 1000 keys per call. To keep listing, pass the "ContinuationToken" parameter (for list_objects_v2; list_objects uses Marker instead). A resource collection such as bucket.objects.all() returns an iterator that pages for you; currently there is no known limit.
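A minimal sketch of the continuation-token loop with the low-level client. The helper name is made up and the client is passed in; the response fields used (Contents, IsTruncated, NextContinuationToken) are the ones list_objects_v2 returns.

```python
def list_all_keys(s3_client, bucket_name, prefix=""):
    """Return every key under `prefix`, following continuation tokens."""
    keys = []
    kwargs = {"Bucket": bucket_name, "Prefix": prefix}
    while True:
        response = s3_client.list_objects_v2(**kwargs)
        # "Contents" is missing entirely when a page has no objects.
        keys.extend(obj["Key"] for obj in response.get("Contents", []))
        if not response.get("IsTruncated"):
            return keys
        kwargs["ContinuationToken"] = response["NextContinuationToken"]


# Usage (requires boto3 and credentials):
# import boto3
# print(len(list_all_keys(boto3.client("s3"), "mybucket")))
```

The client's built-in paginator, get_paginator('list_objects_v2'), does the same bookkeeping if you would rather not write the loop yourself.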
However, you should know the following:
Downloading from S3 to your local system over the internet WILL cost you, e.g. 0.09/GB. If you need to process such files frequently, you may want to run the download inside EC2 or use Lambda for the post-processing.
Downloading S3 data to EC2 within the same region is free. You can test a massive download (say, 100 GB+) without worrying about the bill; you just pay for the EC2 instance.
Download performance and session continuity from S3 to your local network depend on your local internet connectivity setup: restrictions in your own firewall policies, your router, your ISP, etc.
I want to read an S3 file from my (local) machine, through Spark (pyspark, really). Now, I keep getting authentication errors like
java.lang.IllegalArgumentException: AWS Access Key ID and Secret
Access Key must be specified as the username or password
(respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId
or fs.s3n.awsSecretAccessKey properties (respectively).
I looked everywhere here and on the web, tried many things, but apparently S3 has been changing over the last year or months, and all methods failed but one:
pyspark.SparkContext().textFile("s3n://user:password@bucket/key")
(note the s3n [s3 did not work]). Now, I don't want to use a URL with the user and password because they can appear in logs, and I am also not sure how to get them from the ~/.aws/credentials file anyway.
So, how can I read locally from S3 through Spark (or, better, pyspark) using the AWS credentials from the now standard ~/.aws/credentials file (ideally, without copying the credentials there to yet another configuration file)?
PS: I tried os.environ["AWS_ACCESS_KEY_ID"] = … and os.environ["AWS_SECRET_ACCESS_KEY"] = …, it did not work.
PPS: I am not sure where to "set the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties" (Google did not come up with anything). However, I did try many ways of setting these: SparkContext.setSystemProperty(), sc.setLocalProperty(), and conf = SparkConf(); conf.set(…); conf.set(…); sc = SparkContext(conf=conf). Nothing worked.
Yes, you have to use s3n instead of s3. s3 is some weird abuse of S3 the benefits of which are unclear to me.
You can pass the credentials to the sc.hadoopFile or sc.newAPIHadoopFile calls:
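rdd = sc.hadoopFile('s3n://my_bucket/my_file',
                    'org.apache.hadoop.mapred.TextInputFormat',  # inputFormatClass
                    'org.apache.hadoop.io.LongWritable',         # keyClass
                    'org.apache.hadoop.io.Text',                 # valueClass
                    conf={
                        'fs.s3n.awsAccessKeyId': '...',
                        'fs.s3n.awsSecretAccessKey': '...',
                    })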
The problem was actually a bug in Amazon's boto Python module, related to the fact that the MacPorts version is old: installing boto through pip solved the problem, and ~/.aws/credentials was then read correctly.
Now that I have more experience, I would say that in general (as of the end of 2015) Amazon Web Services tools and Spark/PySpark have patchy documentation and can have serious bugs that are very easy to run into. For the first problem, I would recommend first updating the AWS command line interface, boto and Spark every time something strange happens: this has "magically" solved a few issues for me already.
Here is a solution on how to read the credentials from ~/.aws/credentials. It makes use of the fact that the credentials file is an INI file which can be parsed with Python's configparser.
import os
import configparser
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))
aws_profile = 'default' # your AWS profile to use
access_id = config.get(aws_profile, "aws_access_key_id")
access_key = config.get(aws_profile, "aws_secret_access_key")
See also my gist at https://gist.github.com/asmaier/5768c7cda3620901440a62248614bbd0 .
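To connect this with the Spark side, the parsed values can be turned into the conf dict that sc.hadoopFile accepts. This is only a sketch: the helper name is made up, and the fs.s3n.* property names are the ones quoted in the error message from the question.

```python
import configparser
import os


def s3n_conf_from_credentials(profile="default", path="~/.aws/credentials"):
    """Build a Hadoop conf dict for s3n:// from an AWS credentials (INI) file."""
    config = configparser.ConfigParser()
    # config.read() returns the list of files it could actually parse.
    if not config.read(os.path.expanduser(path)):
        raise FileNotFoundError(path)
    return {
        "fs.s3n.awsAccessKeyId": config.get(profile, "aws_access_key_id"),
        "fs.s3n.awsSecretAccessKey": config.get(profile, "aws_secret_access_key"),
    }


# Usage with an existing SparkContext `sc` (not created here):
# rdd = sc.hadoopFile("s3n://my_bucket/my_file",
#                     "org.apache.hadoop.mapred.TextInputFormat",
#                     "org.apache.hadoop.io.LongWritable",
#                     "org.apache.hadoop.io.Text",
#                     conf=s3n_conf_from_credentials())
```

This keeps the keys out of the URL and out of any second configuration file, which was the concern in the question.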
Environment variables setup could help.
The Spark FAQ, under the question "How can I access data in S3?", suggests setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
I cannot say much about the Java objects you have to give to the hadoopFile function, only that this function already seems to be deprecated in favor of "newAPIHadoopFile". The documentation on this is quite sketchy, and I feel like you need to know Scala/Java to really get to the bottom of what everything means.
In the meantime, I figured out how to actually get some S3 data into pyspark, and I thought I would share my findings.
The Spark API documentation says that it uses a dict that gets converted into a Java configuration (XML). I found the configuration for Java, which should reflect the values to put into the dict: How to access S3/S3n from local hadoop installation
bucket = "mycompany-mydata-bucket"
prefix = "2015/04/04/mybiglogfile.log.gz"
filename = "s3n://{}/{}".format(bucket, prefix)
config_dict = {"fs.s3n.awsAccessKeyId": "FOOBAR",
               "fs.s3n.awsSecretAccessKey": "BARFOO"}
# TextInputFormat yields (byte offset, line) pairs: LongWritable keys, Text values.
rdd = sc.hadoopFile(filename,
                    'org.apache.hadoop.mapred.TextInputFormat',
                    'org.apache.hadoop.io.LongWritable',
                    'org.apache.hadoop.io.Text',
                    conf=config_dict)
This code snippet loads the file from the bucket and prefix (file path in the bucket) specified on the first two lines.