AWS S3 SDK to Google Cloud Storage - how to get uploaded VersionID?

I have a versioning enabled bucket in GCS.
The application I'm working on uses the AWS S3 .NET SDK to connect to an S3-compatible object storage and has been working just fine for this use case.
Now we're also about to support GCS object storage. From my research and testing, GCS offers S3 compatibility through its XML API. I've tested this and, sure enough, GetObject/PutObject/multipart uploads/chunked downloads all work fine with the existing code using the S3 library.
However, the compatibility seems to stop at the versioning feature: our application makes heavy use of versioned buckets, requesting non-current versions by their VersionID.
With the S3 library connecting to the S3 object storage, everything works fine: PutObject and multipart uploads (= CompleteMultipartUpload response) return the VersionID properly.
For GCS, though, these calls do not return its equivalent of a VersionID (the object generation).
In the response, VersionID is null, even though the bucket is definitely versioning-enabled. I would have expected GCS to return the generation as the VersionID here, since the two are conceptually the same.
I would just write another implementation class that uses the GCS .NET SDK, but our application relies heavily on chunked uploading, where we retrieve chunks of data from an external source one by one (so we never have the full Stream of data). This works well with S3's multipart upload (each chunk is uploaded in a separate UploadPart call), but GCS's resumable upload expects a Stream that already contains all the data. So it looks like we really need the multipart upload functionality, which we can use through the S3 library with GCS's XML API. If anyone has suggestions on how to upload chunk by chunk in separate calls to construct an object on GCS, much like multipart upload, that would also be greatly appreciated.
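For what it's worth, the underlying GCS resumable-upload protocol does accept data chunk by chunk, even if the high-level client wants a single Stream. Below is a minimal sketch of that protocol in Python (the app itself is .NET); the bucket/object names and the fetch_chunks_from_external_source generator are placeholders I made up, and every chunk except the last has to be a multiple of 256 KiB:

import requests
import google.auth
from google.auth.transport.requests import AuthorizedSession

# Start a resumable upload session (JSON API).  The Location header is the
# session URI that every subsequent chunk PUT goes to.
credentials, _ = google.auth.default()
session = AuthorizedSession(credentials)

bucket, name = "my-bucket", "big-object.bin"  # placeholders
init = session.post(
    f"https://storage.googleapis.com/upload/storage/v1/b/{bucket}/o"
    f"?uploadType=resumable&name={name}",
    headers={"Content-Type": "application/octet-stream"},
)
upload_url = init.headers["Location"]

offset = 0
total = "*"  # total size stays unknown until the final chunk
for chunk, is_last in fetch_chunks_from_external_source():  # hypothetical source
    end = offset + len(chunk) - 1
    if is_last:
        total = str(end + 1)
    resp = requests.put(
        upload_url,
        data=chunk,
        headers={"Content-Range": f"bytes {offset}-{end}/{total}"},
    )
    # 308 = "resume incomplete": chunk stored, GCS expects more.
    # 200/201 = the object is complete and visible in the bucket.
    offset = end + 1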
So my questions are: does getting the VersionID back after an upload simply not work when using the AWS S3 SDK against Google Cloud Storage, or am I doing it wrong? Do I have to look elsewhere in the response for it, or configure some setting to get it returned properly?
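For reference, here is a minimal sketch of the setup described above, in Python/boto3 rather than the .NET SDK just to keep it short. The HMAC interoperability keys and names are placeholders, and the idea that GCS only exposes the generation via the x-goog-generation response header (instead of mapping it to VersionID) is an assumption to verify against your own bucket:

import boto3

# Minimal sketch: point the S3 client at GCS's XML endpoint using
# interoperability (HMAC) keys.  All values below are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://storage.googleapis.com",
    aws_access_key_id="GOOG1EXAMPLEHMACKEY",
    aws_secret_access_key="EXAMPLEHMACSECRET",
)

resp = s3.put_object(Bucket="my-versioned-bucket", Key="file.bin", Body=b"data")

# The SDK only maps S3's x-amz-version-id header into VersionId, so that stays
# empty against GCS.  The raw response headers are still available, and GCS
# appears to report the generation as x-goog-generation (assumption to verify).
print(resp.get("VersionId"))
print(resp["ResponseMetadata"]["HTTPHeaders"].get("x-goog-generation"))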

Related

Is there a way to upload a large file in chunks/blocks through the Amazon WorkDocs API or its client provider?

I have completed the file upload part in WorkDocs via the REST APIs.
But how does WorkDocs handle large file uploads? I saw the InitiateDocumentVersionUpload API, but it does not indicate any restrictions on file size.
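For reference, a rough boto3 sketch of the flow described above, assuming the signed UploadUrl takes the content as a single PUT (the folder id and file name are placeholders; whether there is a hard size limit is exactly the open question):

import boto3
import requests

# Hedged sketch of the single-PUT upload flow; folder id and file name are
# placeholders.
workdocs = boto3.client("workdocs")

init = workdocs.initiate_document_version_upload(
    ParentFolderId="FOLDER_ID",
    Name="big-file.bin",
    ContentType="application/octet-stream",
)
upload = init["UploadMetadata"]

# WorkDocs hands back one signed URL; the content goes up as a single PUT.
with open("big-file.bin", "rb") as fp:
    requests.put(upload["UploadUrl"], data=fp, headers=upload["SignedHeaders"])

# Mark the freshly uploaded version as active.
workdocs.update_document_version(
    DocumentId=init["Metadata"]["Id"],
    VersionId=init["Metadata"]["LatestVersionMetadata"]["Id"],
    VersionStatus="ACTIVE",
)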

Google Cloud Storage: Alternative to signed URLs for folders

Our application's data storage is backed by Google Cloud Storage (and S3 and Azure Blob Storage). We need to give access to this storage to arbitrary outside tools (upload from local disk using CLI tools, unload from analytical databases like Redshift, Snowflake and others). The specific use case is that users need to upload many large files (think of it much like m3u8 playlists for streaming video: one m3u8 playlist and thousands of small video files). The tools and users MAY not be affiliated with Google in any way (they may not have a Google account). We also absolutely need the data transfer to go directly to the storage, not through our servers.
In S3 we use federation tokens to give access to a part of the S3 bucket.
So model scenario on AWS S3:
customer requests some data upload via our API
we give the customer S3 credentials that are scoped to s3://customer/project/uploadId, allowing upload of new files (see the sketch below)
client uses any tool to upload the data
client uploads s3://customer/project/uploadId/file.manifest, s3://customer/project/uploadId/file.00001, s3://customer/project/uploadId/file.00002, ...
other data (be it other uploadId or project) in the bucket is safe because the given credentials are scoped
In ABS we use a SAS token for the same purpose.
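As a reference point, the scoped-credentials step in the S3 scenario above could look roughly like this with boto3 and STS (bucket and naming conventions are placeholders):

import json
import boto3

# Short-lived credentials scoped to a single upload prefix via GetFederationToken.
sts = boto3.client("sts")

def scoped_upload_credentials(bucket, customer, project, upload_id):
    prefix = f"{customer}/{project}/{upload_id}/"
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
        }],
    }
    token = sts.get_federation_token(
        Name=f"upload-{upload_id}"[:32],  # federated user name, max 32 chars
        Policy=json.dumps(policy),
        DurationSeconds=3600,
    )
    return token["Credentials"]  # AccessKeyId / SecretAccessKey / SessionToken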
GCS does not seem to have anything similar, except for signed URLs. Signed URLs have a problem, though: each one refers to a single object. That would either require us to know in advance how many files will be uploaded (we don't), or the client would have to request a signed URL for each file separately (a strain on our API, and slow).
ACLs seemed like a solution, but they are only tied to Google-related identities, and those can't be created on demand and quickly. Service accounts are also an option, but their creation is slow and, as far as I understand, they are generally discouraged for this use case.
Is there a way to create short-lived credentials that are limited to a subset of a GCS bucket?
The ideal scenario would be that the service account we use in the app could generate a short-lived token that only has access to a subset of the bucket, but nothing like that seems to exist.
Unfortunately, no. For retrieving objects, signed URLs need to be for exact objects. You'd need to generate one per object.
Using a wildcard lets you target every object under a prefix. For example, gsutil signurl -d 120s key.json gs://bucketname/folderName/** will create a signed URL for each of the files under that prefix, but not a single URL for the entire folder/subdirectory.
Reason: subdirectories are just an illusion of folders in a bucket; they are really part of object names that contain a '/'. Every file in a 'subdirectory' therefore gets its own signed URL, and there is no way to create a single signed URL for a specific subdirectory that makes all of its files temporarily available.
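So in practice 'one per object' means generating a batch of signed URLs, one for each expected object name. A minimal sketch with the google-cloud-storage Python client, assuming placeholder names:

from datetime import timedelta
from google.cloud import storage

# One V4 signed URL per object; there is no folder-level equivalent.
client = storage.Client()            # uses application default credentials
bucket = client.bucket("my-bucket")  # placeholder bucket name

def signed_upload_url(object_name, minutes=15):
    return bucket.blob(object_name).generate_signed_url(
        version="v4",
        expiration=timedelta(minutes=minutes),
        method="PUT",  # lets the caller upload exactly this object
    )

urls = [signed_upload_url(f"customer/project/uploadId/file.{i:05d}") for i in range(3)]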
There is an ongoing feature request for this https://issuetracker.google.com/112042863. Please raise your concern here and look for further updates.
For now, one way to accomplish this would be to write a small App Engine app that clients download from instead of going to GCS directly. It would check authentication according to whatever mechanism you're using and, if that passes, generate a signed URL for the requested resource and redirect the user.
Reference : https://stackoverflow.com/a/40428142/15803365
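A minimal sketch of that redirect approach, written Flask-style (the is_authorized check is a placeholder for whatever mechanism you actually use):

from datetime import timedelta
from flask import Flask, abort, redirect, request
from google.cloud import storage

app = Flask(__name__)
client = storage.Client()
bucket = client.bucket("my-bucket")  # placeholder

def is_authorized(req):
    # Placeholder for your real authentication check.
    return req.headers.get("X-Api-Key") == "expected-key"

@app.route("/download/<path:object_name>")
def download(object_name):
    if not is_authorized(request):
        abort(403)
    # Mint a short-lived signed URL for just this object and bounce the client to it.
    url = bucket.blob(object_name).generate_signed_url(
        version="v4", expiration=timedelta(minutes=10), method="GET"
    )
    return redirect(url)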

Does Google Cloud Storage Bucket use Amazon S3?

I set up a Google Cloud Storage bucket with index.html and test.html, and this is what I see when I go to my domain.
Note the doc.s3.amazonaws.com/2006-03-01 namespace in:
<?xml version='1.0' encoding='UTF-8'?>
<ListBucketResult xmlns='http://doc.s3.amazonaws.com/2006-03-01'>
  <Name>my-domain.com</Name>
  <Prefix></Prefix>
  <Marker></Marker>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>index.html</Key>
    <Generation>1555969892676799</Generation>
    <MetaGeneration>1</MetaGeneration>
    <LastModified>2019-04-22T21:51...</LastModified>
    <ETag>"...."</ETag>
    <Size>25</Size>
  </Contents>
  <Contents>
    <Key>test.html</Key>
    etc..
I do not have an Amazon account.
Despite that string being present in the namespace, the response to that request is not coming from AWS.
Google Cloud Storage (GCS) has two APIs. One is JSON-based and looks like most of Google's APIs (called the JSON API), and the other is XML-based and is designed to be interoperable with some cloud storage tools and libraries that work with S3. The idea is that, if you already use such a tool, such as the Python boto library, using GCS can be accomplished by changing the URL and credentials. Clients parsing XML responses likely validate XML namespaces, and so they expect to see something like the string "http://doc.s3.amazonaws.com/2006-03-01" as part of the protocol.
You're sending a request to the XML API (either via storage.googleapis.com, BUCKET_NAME.storage.googleapis.com, or via a CNAME DNS redirect to Cloud Storage) , and so the resulting message tries to provide an interoperable response.
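You can see this without any SDK at all: a bare GET against the XML endpoint returns the same S3-namespaced ListBucketResult shown above (this assumes the bucket listing is publicly readable, and the bucket name is a placeholder):

import requests

# Plain HTTP GET against the XML API; no AWS involvement anywhere.
resp = requests.get("https://storage.googleapis.com/my-domain.com")
print(resp.text[:120])  # <?xml ...?><ListBucketResult xmlns='http://doc.s3.amazonaws.com/2006-03-01'>...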
If we look at the documentation for the XML API, everything here is as expected. What we are seeing is an XML document that uses an XML namespace called http://doc.s3.amazonaws.com/2006-03-01. Think of this as a declaration of the use of a named data type. This data type (a ListBucketResult) was specified by AWS (Amazon), and GCP decided to re-use that specification in its own implementation rather than invent a completely new one that would likely have been semantically identical to what already existed. Re-use of interfaces is normally a good thing for everyone: it means easier portability and less vendor lock-in should you want to choose a different cloud provider.
I'm going to guess that AWS was the first to provide cloud blob storage and so set the precedent. It is quite common to see XML namespaces that describe open standards, and I'm also going to guess that there is no current open-standards specification for what a cloud storage provider should offer. So Amazon has S3, Google has Google Cloud Storage, and Azure has Azure Blob Storage.

Streaming Upload to AmazonS3 with boto or simples3 API

Does the boto API (Python) for Amazon S3 support streaming upload?
There is another API called simples3; I think nobody has heard of it.
http://pypi.python.org/pypi/simples3
It has a function call for streaming upload, but I would like to use boto if it has a streaming upload option.
I know about multipart in boto. I don't want to use multipart because I don't want to split the files on disk and end up with one huge file plus its splits; I believe that's a waste of space.
What would be the difference between boto and simples3?
If by "streaming upload" you mean chunked transfer encoding, then I don't think boto, simples3 or any other package will support it because S3 doesn't support it. Boto has methods for streaming but they are only supported by Google Cloud Storage.

S3 Signed URL and multipart upload

I am accessing S3 using signed URLs and the JetS3t library. I have built a cipher stream over the InputStream so that encryption also happens on the fly during upload.
Can I use the multipart upload feature, where the file is divided into multiple parts and uploaded to S3 in parallel?
Does JetS3t provide any support for this case?
You can use the multi-threaded ThreadedS3Service class as your service object. It has a method multipartStartUploads(java.lang.String bucketName, java.util.List objects) that does exactly what I think you are asking for.