How to read s3 rasters with accompanying ".aux.xml" metadata file using rasterio? - amazon-s3

Suppose a GeoTIFF raster on a S3 bucket which has - next to the raw TIF file - an associated .aux.xml metadata file:
s3://my_s3_bucket/myraster.tif
s3://my_s3_bucket/myraster.tif.aux.xml
I'm trying to load this raster directly from the bucket using rasterio:
fn = 's3://my_s3_bucket/myraster.tif'
with rasterio.Env(session, **rio_gdal_options):
with rasterio.open(fn) as src:
src_nodata = src.nodata
scales = src.scales
offsets = src.offsets
bands = src.tags()['bands']
And this seems to be a problem. The raster file itself is successfully opened, but because rasterio did not automatically load the associated .aux.xml, the metadata was never loaded. Therefore, no band tags, no proper scales and offsets.
I should add that doing exactly the same on a local file does work perfectly. The .aux.xml automatically gets picked up and all relevant metadata is correctly loaded.
Is there a way to make this work on s3 as well? And if not, could there be a workaround for this problem? Obviously, metadata was too large to get coded into the TIF file. Rasterio (GDAL under the hood) generated the .aux.xml automatically when creating the raster.

Finally got it to work. It appears to be essential that in the GDAL options passed to the rasterio.Env module, .xml is added as an allowed extension to CPL_VSIL_CURL_ALLOWED_EXTENSIONS:
The documentation of this option states:
Consider that only the files whose extension ends up with one that is listed in CPL_VSIL_CURL_ALLOWED_EXTENSIONS exist on the server.
And while almost all examples to be found online only set .tif as allowed extension because it can dramatically speed up file opening, any .aux.xml files are not seen by rasterio/GDAL.
So in case we expect there to be associated .aux.xml metadata files with the .tif files, we have to change our example to:
rio_gdal_options = {
    'AWS_VIRTUAL_HOSTING': False,
    'AWS_REQUEST_PAYER': 'requester',
    'GDAL_DISABLE_READDIR_ON_OPEN': 'FALSE',
    'CPL_VSIL_CURL_ALLOWED_EXTENSIONS': '.tif,.xml', # Adding .xml is essential!
    'VSI_CACHE': False
}
with rasterio.Env(session, **rio_gdal_options):
with rasterio.open(fn) as src: # The associated .aux.xml file will automatically be found and loaded now
src_nodata = src.nodata
scales = src.scales
offsets = src.offsets
bands = src.tags()['bands']

Related

Scrapy upload files to dynamically created directories in S3 based on field

I've been experimenting with Scrapy for sometime now and recently have been trying to upload files (data and images) to an S3 bucket. If the directory is static, it is pretty straightforward and I didn't hit any roadblocks. But what I want to achieve is to dynamically create directories based on a certain field from the extract data and place the data & media in those directories. The template path, if you will, is below:
s3://<bucket-name>/crawl_data/<account_id>/<media_type>/<file_name>
For example if the account_id is 123, then the images should be placed in the following directory:
s3://<bucket-name>/crawl_data/123/images/file_name.jpeg
and the data file should be placed in the following directory:
s3://<bucket-name>/crawl_data/123/data/file_name.json
I have been able to achieve this for the media downloads (kind of a crude way to segregate media types, as of now), with the following custom File Pipeline:
class CustomFilepathPipeline(FilesPipeline):
def file_path(self, request, response=None, info=None, *, item=None):
adapter = ItemAdapter(item)
account_id = adapter["account_id"]
file_name = os.path.basename(urlparse(request.url).path)
if ".mp4" in file_name:
media_type = "video"
else:
media_type = "image"
file_path = f"crawl_data/{account_id}/{media_type}/{file_name}"
return file_path
The following settings have been configured at a spider level with custom_settings:
custom_settings = {
'FILES_STORE': 's3://<my_s3_bucket_name>/',
'FILES_RESULT_FIELD': 's3_media_url',
'DOWNLOAD_WARNSIZE': 0,
'AWS_ACCESS_KEY_ID': <my_access_key>,
'AWS_SECRET_ACCESS_KEY': <my_secret_key>,
}
So, the media part works flawlessly and I have been able to download the images and videos in their separate directories based on the account_id, in the S3 bucket. My questions is:
Is there a way to achieve the same results with the data files as well? Maybe another custom pipeline?
I have tried to experiment with the 1st example on the Item Exporters page but couldn't make any headway. One thing that I thought might help is to use boto3 to establish connection and then upload files but that might possibly require me to segregate files locally and upload those files together, by using a combination of Pipelines (to split data) and Signals (once spider is closed to upload the files to S3).
Any thoughts and/or guidance on this or a better approach would be greatly appreciated.

How to reuse S3 image?

I uploaded images to S3 with carrier wave gem.(Ruby on Rails, Vue.js)
But I want to reuse uploaded image as a file.
I have no idea about how to reuse uploaded S3 images as a file.
To be specific,
I made a model "reaction"
and the image is saved as a column of "reaction".
I can reuse image object like #reaction.image
(#reaction is a "reaction"`s object)
No trial
I totally dont know how to deal with it.
Once you get the image url from your table "Reaction". We can basically downloading an image located at a given URL and saved it as a file locally.
def download_aws_s3(url_aws_s3, filename)
uri = URI(url_aws_s3)
response = Net::HTTP.get_response(uri)
File.open(filename, 'wb'){|f| f.write(response.body)}
end
You can use "open-uri" or "down" gem as an alternative.

Google Drive - use WebViewLink vs thumbnailLink

I'm using the Google Drive API where I can gain access to 2 pieces of data that I need to display a jpg file oin my program. WebViewLink is the "large" size image while thumbnailLink is the "thumb" smaller size of the same image.
I'm having an issue with downloading the WebViewLink that I do not have with the thumbnailLink. Part of my code calls either exif_imagetype($filename) or getimagesize($filename) so I can retrieve the type, height & width etc for the $filename. This is successful for the thumbnailView but not the WebViewLink...
code snippet...
$WebViewLink = "https://drive.google.com/a/treering.com/file/d/blablabla";
$type = exif_imagetype($WebViewLink);
--- results in the error
"PHP Warning: exif_imagetype(): stream does not support seeking..."
where as...
$thumbnailLink = "https://lh6.googleusercontent.com/blablabla";
$type = exif_imagetype($thumbnailLink);
--- successful
where $type = 2 // a .jpg file
Not sure what I need to do to gain a usable WebViewLink... maybe use the "export" function to copy to a file on my server that is accessible, then use that exported file for the functions that fail above?
Thanks for any help.
John
I think you are using the wrong property to get the image of the file.
WebViewLink
A link for opening the file in a relevant Google editor or viewer in a browser.
thumbnailLink
A short-lived link to the file's thumbnail, if available. Typically lasts on the order of hours.
You can try using the iconLink():
A static, unauthenticated link to the file's icon.
Sample image of thumbnailLink:
Sample image of a iconLink:
It will still show relevant image about the file.
Hope it helps!

How to create a http request that contains multiple FileHeaders?

I am trying to test a uploading service that supports multiple files uploading,and I found this:
golang POST data using the Content-Type multipart/form-data
that introduced how to create a request to upload a single file,but I need to upload multiple files,is there simple way to create this kind of request?
update:
please check line:38 and 39 in post:to support html5 multiple files uploading
line 38 files := m.File["myfiles"]
line 29 for i, _ := range files {
It seems that it needs to set single name for multiple file headers to stimulate the html5 multiple files uploading.
For each file, call CreateFormFile to create the header for the file. Call Write on the writer returned from CreateFormFile one or more times to write data to the file. When done with all files, close the multipart writer.
The top answer in the linked question uploads two files, one named "image" and one named "key". The data for the "image" is copied from a file. The data for "key" is simply the bytes "KEY".
The field name is the first argument to CreateFormFile. If you want to upload multiple files with the same name, use the same name each time you call CreateFormFile.

Locally calculate dropbox hash of files

Dropbox rest api, in function metatada has a parameter named "hash" https://www.dropbox.com/developers/reference/api#metadata
Can I calculate this hash locally without call any remote api rest function?
I need know this value to reduce upload bandwidth.
https://www.dropbox.com/developers/reference/content-hash explains how Dropbox computes their file hashes. A Python implementation of this is below:
import hashlib
import math
import os
DROPBOX_HASH_CHUNK_SIZE = 4*1024*1024
def compute_dropbox_hash(filename):
file_size = os.stat(filename).st_size
with open(filename, 'rb') as f:
block_hashes = b''
while True:
chunk = f.read(DROPBOX_HASH_CHUNK_SIZE)
if not chunk:
break
block_hashes += hashlib.sha256(chunk).digest()
return hashlib.sha256(block_hashes).hexdigest()
The "hash" parameter on the metadata call isn't actually the hash of the file, but a hash of the metadata. It's purpose is to save you having to re-download the metadata in your request if it hasn't changed by supplying it during the metadata request. It is not intended to be used as a file hash.
Unfortunately I don't see any way via the Dropbox API to get a hash of the file itself. I think your best bet for reducing your upload bandwidth would be to keep track of the hash's of your files locally and detect if they have changed when determining whether to upload them. Depending on your system you also likely want to keep track of the "rev" (revision) value returned on the metadata request so you can tell whether the version on Dropbox itself has changed.
This won't directly answer your question, but is meant more as a workaround; The dropbox sdk gives a simple updown.py example that uses file size and modification time to check the currency of a file.
an abbreviated example taken from updown.py:
dbx = dropbox.Dropbox(api_token)
...
# returns a dictionary of name: FileMetaData
listing = list_folder(dbx, folder, subfolder)
# name is the name of the file
md = listing[name]
# fullname is the path of the local file
mtime = os.path.getmtime(fullname)
mtime_dt = datetime.datetime(*time.gmtime(mtime)[:6])
size = os.path.getsize(fullname)
if (isinstance(md, dropbox.files.FileMetadata) and mtime_dt == md.client_modified and size == md.size):
print(name, 'is already synced [stats match]')
As far as I am concerned, No you can't.
The only way is using Dropbox API which is explained here.
The rclone go program from https://rclone.org has exactly what you want:
rclone hashsum dropbox localfile
rclone hashsum dropbox localdir
It can't take more than one path argument but I suspect that's something you can work with...
t0|todd#tlaptop/p8 ~/tmp|295$ echo "Hello, World!" > dropbox-hash-demo/hello.txt
t0|todd#tlaptop/p8 ~/tmp|296$ rclone copy dropbox-hash-demo/hello.txt dropbox-ttf:demo
t0|todd#tlaptop/p8 ~/tmp|297$ rclone hashsum dropbox dropbox-hash-demo
aa4aeabf82d0f32ed81807b2ddbb48e6d3bf58c7598a835651895e5ecb282e77 hello.txt
t0|todd#tlaptop/p8 ~/tmp|298$ rclone hashsum dropbox dropbox-ttf:demo
aa4aeabf82d0f32ed81807b2ddbb48e6d3bf58c7598a835651895e5ecb282e77 hello.txt