I have developed a Tornado API that fetches the contents of an AWS S3 bucket. Below is the code snippet, which runs perfectly with Boto; however, it doesn't work for buckets in some other locations.
The method returns a list (resp) consisting of the filename, size, and file type of each entry.
I want to achieve the same thing using Boto3. I have tried a lot, but the Boto3 methods return all of the bucket's contents with their full paths.
def post(self):
    try:
        resp = []
        path = self.get_argument('path')
        bucket_name = self.get_argument('bucket_name')
        path_len = len(path)
        conn = S3Connection()
        bucket = conn.get_bucket(bucket_name)
        folders = bucket.list(path, "/")
        for folder in folders:
            if folder.name == path:
                continue
            if str(folder.name).endswith("/"):
                file_type = 'd'
                file_name = str(folder.name)[path_len:-1]
            else:
                _file_size = self.filesize(folder.size)
                file_type = 'f'
                file_name = str(folder.name)[path_len:]
            resp.append({"bucket": bucket_name, "path": path, "name": file_name, "type": file_type,
                         "size": _file_size if file_type == 'f' else ""})
        self.write(json.dumps(resp))
    except Exception as e:  # error handling elided in the original snippet
        self.write(json.dumps({"error": str(e)}))
Razvan Tudorica built a small replacement for Boto3's upload and delete methods that uses Tornado's AsyncHTTPClient; he published a blog post here about the work and posted his code on GitHub.
Since the original SO enquiry highlights that the supplied code snippet "doesn't work for the buckets in some different location", of specific interest here is Razvan's note that "the main idea around [his] replacement is to use botocore to build the request (AWS wants the requests to be signed using different algorithms based on AWS zones and request data) and only to use the AsyncHTTPClient for the actual asynchronous call."
I hope Razvan's work still proves useful to you or, minimally, to others researching similar efforts (as I was recently).
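As for the Boto3 listing itself, here is a minimal synchronous sketch (separate from Razvan's async approach, and the helper name is only illustrative) showing how list_objects_v2 with a Delimiter reproduces the folder/file split from the Boto version:

import boto3

def list_prefix(bucket_name, path):
    # Returns entries directly under `path`, mimicking bucket.list(path, "/") in boto.
    s3 = boto3.client('s3')
    resp = []
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket_name, Prefix=path, Delimiter='/'):
        # "Sub-folders" come back as CommonPrefixes when a Delimiter is set.
        for cp in page.get('CommonPrefixes', []):
            resp.append({"bucket": bucket_name, "path": path,
                         "name": cp['Prefix'][len(path):-1], "type": 'd', "size": ""})
        # Objects directly under the prefix come back in Contents.
        for obj in page.get('Contents', []):
            if obj['Key'] == path:
                continue
            resp.append({"bucket": bucket_name, "path": path,
                         "name": obj['Key'][len(path):], "type": 'f',
                         "size": obj['Size']})  # raw bytes; humanize as needed
    return resp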
I am a complete beginner in using GCP functions/products.
I have written the code below, which takes a list of cities from a local folder and calls in weather data for each city in that list, eventually uploading those weather values into a table in BigQuery. I don't need to change the code anymore, as it creates new tables when a new week begins. Now I would like to "deploy" it (I'm not even sure that's the right word) in the cloud so it runs there automatically. I tried using App Engine and Cloud Functions but faced issues in both places.
import requests, json, sqlite3, os, csv, datetime, re
from google.cloud import bigquery
#from google.cloud import storage

list_city = []
with open("list_of_cities.txt", "r") as pointer:
    for line in pointer:
        list_city.append(line.strip())

API_key = "PLACEHOLDER"
Base_URL = "http://api.weatherapi.com/v1/history.json?key="

yday = datetime.date.today() - datetime.timedelta(days=1)
Date = yday.strftime("%Y-%m-%d")

table_id = f"sonic-cat-315013.weather_data.Historical_Weather_{yday.isocalendar()[0]}_{yday.isocalendar()[1]}"

credentials_path = r"PATH_TO_JSON_FILE"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = credentials_path
client = bigquery.Client()

try:
    schema = [
        bigquery.SchemaField("city", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("Date", "Date", mode="REQUIRED"),
        bigquery.SchemaField("Hour", "INTEGER", mode="REQUIRED"),
        bigquery.SchemaField("Temperature", "FLOAT", mode="REQUIRED"),
        bigquery.SchemaField("Humidity", "FLOAT", mode="REQUIRED"),
        bigquery.SchemaField("Condition", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("Chance_of_rain", "FLOAT", mode="REQUIRED"),
        bigquery.SchemaField("Precipitation_mm", "FLOAT", mode="REQUIRED"),
        bigquery.SchemaField("Cloud_coverage", "INTEGER", mode="REQUIRED"),
        bigquery.SchemaField("Visibility_km", "FLOAT", mode="REQUIRED")
    ]

    table = bigquery.Table(table_id, schema=schema)
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="Date",  # name of column to use for partitioning
    )
    table = client.create_table(table)  # Make an API request.
    print(
        "Created table {}.{}.{}".format(table.project, table.dataset_id, table.table_id)
    )
except:
    print("Table {}_{} already exists".format(yday.isocalendar()[0], yday.isocalendar()[1]))


def get_weather():
    try:
        x["location"]
    except:
        print(f"API could not call city {city_name}")

    global day, time, dailytemp, dailyhum, dailycond, chance_rain, Precipitation, Cloud_coverage, Visibility_km
    day = []
    time = []
    dailytemp = []
    dailyhum = []
    dailycond = []
    chance_rain = []
    Precipitation = []
    Cloud_coverage = []
    Visibility_km = []

    for i in range(24):
        dayval = re.search(r"^\S*\s", x["forecast"]["forecastday"][0]["hour"][i]["time"])
        timeval = re.search(r"\s(.*)", x["forecast"]["forecastday"][0]["hour"][i]["time"])
        day.append(dayval.group()[:-1])
        time.append(timeval.group()[1:])
        dailytemp.append(x["forecast"]["forecastday"][0]["hour"][i]["temp_c"])
        dailyhum.append(x["forecast"]["forecastday"][0]["hour"][i]["humidity"])
        dailycond.append(x["forecast"]["forecastday"][0]["hour"][i]["condition"]["text"])
        chance_rain.append(x["forecast"]["forecastday"][0]["hour"][i]["chance_of_rain"])
        Precipitation.append(x["forecast"]["forecastday"][0]["hour"][i]["precip_mm"])
        Cloud_coverage.append(x["forecast"]["forecastday"][0]["hour"][i]["cloud"])
        Visibility_km.append(x["forecast"]["forecastday"][0]["hour"][i]["vis_km"])

    for i in range(len(time)):
        time[i] = int(time[i][:2])


def main():
    i = 0
    while i < len(list_city):
        try:
            global city_name
            city_name = list_city[i]
            complete_URL = Base_URL + API_key + "&q=" + city_name + "&dt=" + Date
            response = requests.get(complete_URL, timeout=10)
            global x
            x = response.json()
            get_weather()
            table = client.get_table(table_id)
            varlist = []
            for j in range(24):
                variables = city_name, day[j], time[j], dailytemp[j], dailyhum[j], dailycond[j], chance_rain[j], Precipitation[j], Cloud_coverage[j], Visibility_km[j]
                varlist.append(variables)
            client.insert_rows(table, varlist)
            print(f"City {city_name}, ({i+1} out of {len(list_city)}) successfully inserted")
            i += 1
        except Exception as e:
            print(e)
            continue
In the code there are direct references to two files located locally: one is the list of cities and the other is the JSON file containing the credentials to access my project in GCP. I thought uploading these files to Cloud Storage and referencing them there wouldn't be an issue, but then I realised that I can't actually access my buckets in Cloud Storage without using the credentials file.
This leaves me unsure whether the whole process is possible at all: how do I authenticate from the cloud in the first place if I need to reference the credentials file locally first? It seems like an endless circle, where I'd authenticate with the file in Cloud Storage, but I'd need authentication first just to access that file.
I'd really appreciate some help here. I have no idea where to go from this, and I also don't have a strong background in SE/CS; I only know Python, R, and SQL.
For Cloud Functions, the deployed function will run with the project service account credentials by default, without needing a separate credentials file. Just make sure this service account is granted access to whatever resources it will be trying to access.
You can read more info about this approach here (along with options for using a different service account if you desire): https://cloud.google.com/functions/docs/securing/function-identity
This approach is very easy, and keeps you from having to deal with a credentials file at all on the server. Note that you should remove the os.environ line, as it's unneeded. The BigQuery client will use the default credentials as noted above.
If you want the code to run the same whether on your local machine or deployed to the cloud, simply set a "GOOGLE_APPLICATION_CREDENTIALS" environment variable permanently in the OS on your machine. This is similar to what you're doing in the code you posted; however, you're temporarily setting it every time using os.environ rather than permanently setting the environment variable on your machine. The os.environ call only sets that environment variable for that one process execution.
If for some reason you don't want to use the default service account approach outlined above, you can instead reference the credentials file directly when you instantiate the bigquery.Client():
https://cloud.google.com/bigquery/docs/authentication/service-account-file
You just need to package the credentials file with your code (i.e. in the same folder as your main.py file) and deploy it alongside so it's in the execution environment. In that case it is referenceable/loadable from your script without needing any special permissions or credentials; just provide the relative path to the file (i.e. assuming it's in the same directory as your Python script, reference only the filename).
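For illustration, a minimal sketch of that explicit approach (the key filename here is just a placeholder for whatever file you bundle with the deployment):

from google.cloud import bigquery
from google.oauth2 import service_account

# Load the bundled key file by relative path and pass the credentials explicitly.
credentials = service_account.Credentials.from_service_account_file("service_account_key.json")
client = bigquery.Client(credentials=credentials, project=credentials.project_id)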
There are different flavors and options for deploying your application, and the right one depends on your application's semantics and execution constraints.
It would be too hard to cover all of them here, and the official Google Cloud Platform documentation covers each of them in great detail:
Google Compute Engine
Google Kubernetes Engine
Google App Engine
Google Cloud Functions
Google Cloud Run
Based on my understanding of your application design, the most suitable ones would be:
Google App Engine
Google Cloud Functions
Google Cloud Run: check these criteria to see if your application is a good fit for this deployment style
I would suggest using Cloud Functions as your deployment option, in which case your application will default to using the project's App Engine service account to authenticate itself and perform allowed actions. Hence, you should only check that the default account PROJECT_ID@appspot.gserviceaccount.com has proper access to the needed APIs (BigQuery in your case) under the IAM configuration section.
In such a setup, you won't need to push your service account key to Cloud Storage (which I would recommend avoiding in either case), and you won't need to pull it either, as the runtime will handle authenticating the function for you.
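As a rough sketch (assuming an HTTP-triggered Python Cloud Function and reusing the main() from the question; the entry-point name is arbitrary), the deployment boils down to wrapping the existing script in a function that receives the request object:

# main.py for Cloud Functions: paste the question's script above this function so
# that main(), get_weather(), etc. are defined in the same module.
from google.cloud import bigquery

# No key file and no os.environ call needed: the client picks up the
# function's default service account automatically.
client = bigquery.Client()


def run_weather_job(request):
    # Cloud Functions passes a Flask request object to HTTP-triggered functions.
    main()  # assumed: the main() from the question's script, defined above
    return "Weather data loaded", 200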
I've been experimenting with Scrapy for some time now and recently have been trying to upload files (data and images) to an S3 bucket. If the directory is static, it's pretty straightforward and I didn't hit any roadblocks. But what I want to achieve is to dynamically create directories based on a certain field of the extracted data and place the data and media in those directories. The template path, if you will, is below:
s3://<bucket-name>/crawl_data/<account_id>/<media_type>/<file_name>
For example if the account_id is 123, then the images should be placed in the following directory:
s3://<bucket-name>/crawl_data/123/images/file_name.jpeg
and the data file should be placed in the following directory:
s3://<bucket-name>/crawl_data/123/data/file_name.json
I have been able to achieve this for the media downloads (a somewhat crude way to segregate media types, as of now) with the following custom files pipeline:
import os
from urllib.parse import urlparse

from itemadapter import ItemAdapter
from scrapy.pipelines.files import FilesPipeline


class CustomFilepathPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        adapter = ItemAdapter(item)
        account_id = adapter["account_id"]
        file_name = os.path.basename(urlparse(request.url).path)
        if ".mp4" in file_name:
            media_type = "video"
        else:
            media_type = "image"
        file_path = f"crawl_data/{account_id}/{media_type}/{file_name}"
        return file_path
The following settings have been configured at a spider level with custom_settings:
custom_settings = {
    'FILES_STORE': 's3://<my_s3_bucket_name>/',
    'FILES_RESULT_FIELD': 's3_media_url',
    'DOWNLOAD_WARNSIZE': 0,
    'AWS_ACCESS_KEY_ID': <my_access_key>,
    'AWS_SECRET_ACCESS_KEY': <my_secret_key>,
}
So, the media part works flawlessly and I have been able to download the images and videos into their separate directories in the S3 bucket, based on the account_id. My question is:
Is there a way to achieve the same results with the data files as well? Maybe another custom pipeline?
I have tried to experiment with the first example on the Item Exporters page but couldn't make any headway. One thing I thought might help is to use boto3 to establish a connection and then upload the files, but that would probably require me to segregate the files locally first and upload them together, using a combination of pipelines (to split the data) and signals (to upload the files to S3 once the spider is closed).
Any thoughts and/or guidance on this or a better approach would be greatly appreciated.
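For what it's worth, here is a rough, untested sketch of that boto3-plus-signals idea (all names are illustrative, and close_spider plays the role of the spider-closed signal): collect items per account_id in a regular item pipeline, then upload one JSON data file per account when the spider finishes.

import json

import boto3
from itemadapter import ItemAdapter


class S3DataFilePipeline:
    def open_spider(self, spider):
        self.items_by_account = {}
        self.s3 = boto3.client("s3")

    def process_item(self, item, spider):
        # Group scraped items by the same account_id field the media pipeline uses.
        adapter = ItemAdapter(item)
        self.items_by_account.setdefault(adapter["account_id"], []).append(adapter.asdict())
        return item

    def close_spider(self, spider):
        # Write one data file per account under the same templated prefix as the media.
        for account_id, items in self.items_by_account.items():
            self.s3.put_object(
                Bucket="<my_s3_bucket_name>",
                Key=f"crawl_data/{account_id}/data/{spider.name}.json",
                Body=json.dumps(items).encode("utf-8"),
            )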
I have a list of .zip files on S3 which is dynamically created and passed to the Flask view /download. My problem doesn't seem to be in looping through this list, but rather in returning a response so that all the files from the list are downloaded to the user's computer. If I have the view return a response, it only downloads the first file, as return closes the connection.
I have tried a number of things and looked at similar issues (like this one: Boto3 to download all files from a S3 Bucket), but so far I've had no luck resolving this. I have also looked at streaming (as in here: http://flask.pocoo.org/docs/1.0/patterns/streaming/) and tried creating a subfunction which is a generator, but the same issue persists, as I still have to pass a return value to the view function. Here is that last code example:
from flask import Flask, Response, request
import boto3

app = Flask(__name__)
S3_BUCKET = "my-bucket"  # defined elsewhere in the original app


@app.route('/download', methods=['POST'])
def download():
    download = request.form.getlist('checked')

    def generate(result):
        s3_resource = boto3.resource('s3')
        my_bucket = s3_resource.Bucket(S3_BUCKET)
        d_object = my_bucket.Object(result).get()
        yield d_object['Body'].read()

    for filename in download:
        return Response(generate(filename), mimetype='application/zip',
                        headers={'Content-Disposition': 'attachment;filename=' + filename})
What would be the best way of doing this so that all the files are downloaded?
Is there an easier way to pass a list to boto3, or to any other Flask module, to download those files?
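One common workaround (not taken from the question itself, so treat it as an assumption about the use case) is to bundle the selected objects into a single archive and return that as one response, since a single HTTP response cannot trigger several separate downloads without client-side help:

import io
import zipfile

import boto3
from flask import Flask, Response, request

app = Flask(__name__)
S3_BUCKET = "my-bucket"  # placeholder


@app.route('/download', methods=['POST'])
def download():
    keys = request.form.getlist('checked')
    s3 = boto3.client('s3')

    # Pack every selected S3 object into one in-memory zip archive.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w', zipfile.ZIP_DEFLATED) as archive:
        for key in keys:
            body = s3.get_object(Bucket=S3_BUCKET, Key=key)['Body'].read()
            archive.writestr(key, body)
    buffer.seek(0)

    return Response(buffer.getvalue(), mimetype='application/zip',
                    headers={'Content-Disposition': 'attachment; filename=download.zip'})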
I am trying to simulate a directory listing for my bucket on AWS S3. Currently I am creating "index.html" locally as follows:
for root, dirs, files in os.walk(job_dir):
    objects = []
    for obj in dirs + files:
        m_time_epoch = os.stat(os.path.join(path, obj)).st_mtime
        mtime = datetime.fromtimestamp(m_time_epoch).strftime('%c')
        size = os.stat(os.path.join(path, obj)).st_size
        type = 'dir' if os.path.isdir(os.path.join(path, obj)) else 'file'
        objects.append({'name': obj,
                        'mtime': mtime,
                        'size': size,
                        'type': type})
    generate_index(objects, dest_path)
I then pass it, together with the destination path (the bucket URL), to a function which creates "index.html" from a Jinja template.
Is there a better way to do this? I would like to avoid JavaScript, though. I did some googling but so far haven't found an elegant solution.
What would be the easiest alternative to "os.walk" using the boto3 Python client?
I found some snippets e.g. here:
How do I list directory contents of an S3 bucket using Python and Boto3?
But isn't there a simpler solution?
Thanks...
I'd recommend using the list_objects_v2 method in boto3.
import boto3

s3 = boto3.client('s3')

paginator = s3.get_paginator('list_objects_v2')
response_iterator = paginator.paginate(
    Bucket='MyBucket'
)

objects = []
for response in response_iterator:
    for r in response['Contents']:
        print("File is called {}".format(r['Key']))
While iterating through the objects in the bucket, you could build a list of entries to pass to a Jinja template to create the index.html page.
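For example, a small sketch (field names chosen to mirror the os.walk version; the Prefix/Delimiter handling is an assumption about how you want to group "directories"):

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

objects = []
for response in paginator.paginate(Bucket='MyBucket', Prefix='some/prefix/', Delimiter='/'):
    # Keys that share the delimiter show up as pseudo-directories.
    for prefix in response.get('CommonPrefixes', []):
        objects.append({'name': prefix['Prefix'], 'mtime': '', 'size': '', 'type': 'dir'})
    # Regular objects carry their size and last-modified timestamp.
    for obj in response.get('Contents', []):
        objects.append({'name': obj['Key'],
                        'mtime': obj['LastModified'].strftime('%c'),
                        'size': obj['Size'],
                        'type': 'file'})
# `objects` can now be rendered with the same Jinja template as before.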
How can I create a folder under a bucket using the boto library for Amazon S3?
I followed the manual and created keys with permissions, metadata, etc., but nowhere does boto's documentation describe how to create a folder under a bucket, or a folder under a folder in a bucket.
There is no concept of folders or directories in S3. You can create file names like "abc/xys/uvw/123.jpg", which many S3 access tools like S3Fox display as a directory structure, but it's actually just a single file in a bucket.
Say you want to create the folder abc/123/ in your bucket; it's a piece of cake with Boto:
k = bucket.new_key('abc/123/')
k.set_contents_from_string('')
Or use the console
Use this:
import boto3
s3 = boto3.client('s3')
bucket_name = "YOUR-BUCKET-NAME"
directory_name = "DIRECTORY/THAT/YOU/WANT/TO/CREATE"  # the name of the folder(s) you want to create
s3.put_object(Bucket=bucket_name, Key=(directory_name+'/'))
With the AWS SDK for .NET it works perfectly; just add "/" at the end of the folder name string:
var folderKey = folderName + "/"; //end the folder name with "/"
AmazonS3 client = Amazon.AWSClientFactory.CreateAmazonS3Client(AWSAccessKey, AWSSecretKey);
var request = new PutObjectRequest();
request.WithBucketName(AWSBucket);
request.WithKey(folderKey);
request.WithContentBody(string.Empty);
S3Response response = client.PutObject(request);
Then refresh your AWS console, and you will see the folder
I tried many of the methods above, and adding a forward slash / to the end of the key name to create a directory didn't work for me:
client.put_object(Bucket="foo-bucket", Key="test-folder/")
You have to supply the Body parameter in order to create the directory:
client.put_object(Bucket='foo-bucket', Body='', Key='test-folder/')
Source: ryantuck in a boto3 issue
Append "_$folder$" to your folder name and call put.
String extension = "_$folder$";
s3.putObject("MyBucket", "MyFolder"+ extension, new ByteArrayInputStream(new byte[0]), null);
see:
http://www.snowgiraffe.com/tech/147/creating-folders-programmatically-with-amazon-s3s-api-putting-babies-in-buckets/
Update for 2019: if you want to create a folder with the path bucket_name/folder1/folder2, you can use this code:
from boto3 import client, resource


class S3Helper:
    def __init__(self):
        self.client = client("s3")
        self.s3 = resource("s3")

    def create_folder(self, path):
        path_arr = path.rstrip("/").split("/")
        if len(path_arr) == 1:
            return self.client.create_bucket(Bucket=path_arr[0])
        parent = path_arr[0]
        bucket = self.s3.Bucket(parent)
        status = bucket.put_object(Key="/".join(path_arr[1:]) + "/")
        return status


s3 = S3Helper()
s3.create_folder("bucket_name/folder1/folder2")
It's really easy to create folders. Actually it's just creating keys.
You can see my code below; I was creating a folder with utc_time as its name.
Do remember to end the key with '/' as below; this indicates it is a folder-like key:
Key='folder1/' + utc_time + '/'
import time
import datetime

import boto3

client = boto3.client('s3')
utc_timestamp = time.time()


def lambda_handler(event, context):
    UTC_FORMAT = '%Y%m%d'
    utc_time = datetime.datetime.utcfromtimestamp(utc_timestamp)
    utc_time = utc_time.strftime(UTC_FORMAT)

    print('start to create folder for => ' + utc_time)

    putResponse = client.put_object(Bucket='mybucketName',
                                    Key='folder1/' + utc_time + '/')
    print(putResponse)
Although you can create a folder by appending "/" to your folder_name, under the hood S3 maintains a flat structure, unlike your regular NFS:
var params = {
    Bucket: bucketName,
    Key: folderName + "/"
};
s3.putObject(params, function (err, data) {});
S3 doesn't have a folder structure, but there is something called keys.
We can create a key like /2013/11/xyz.xls and it will be shown as a folder in the console, but the storage layer of S3 treats the whole thing as the file name.
Even when retrieving, we can see the files in a particular "folder" (i.e. under a key prefix) by using the ListObjects method with the Prefix parameter.
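For instance, a quick sketch of that prefix-based retrieval with boto3 (bucket and prefix names are just examples):

import boto3

s3 = boto3.client('s3')

# List only the keys that start with the "2013/11/" prefix, i.e. the pseudo-folder.
response = s3.list_objects_v2(Bucket='my-bucket', Prefix='2013/11/')
for obj in response.get('Contents', []):
    print(obj['Key'])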
Apparently you can now create folders in S3. I'm not sure since when, but I have a bucket in "Standard" zone and can choose Create Folder from Action dropdown.
This question is more relevant to the future, so adding this update.
I am using the upload_file method as shown below.
import boto3

s3_client = boto3.client("s3")

fold = '/my/system/filePath/tabmcq/Tables/auto/18.tsv'

s3_client.upload_file(
    Filename=fold,  # the absolute path of the file on your system
    Bucket="tab-mcq-de",
    Key=f"{fold.split('/')[-3]}/{fold.split('/')[-2]}/{fold.split('/')[-1]}"
)
The idea is that the "Filename" parameter requires the absolute file path on your system.
The "Key" parameter is the relative file path from the source directory where your files are located.
In this example, the Key parameter has to contain the value "Tables/auto/18.tsv" for the client to create the folders.
Hope this helps.
The following works using Python boto3
s3 = boto3.client("s3")
s3.put_object(Bucket="dest_bucket", Key='folder_name/')