In InfluxDB 2.0, how to find a bucket's size in bytes on disk?

I can certainly find out the size on disk of ALL buckets (du -sh . from the data dir), but it's unclear how I do this for a single bucket because the data doesn't seem organized in the file system by bucket.
It would be nice if this info was visible in their web UI admin interface.
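One workaround, sketched below, assumes a default InfluxDB 2.x OSS install where the storage engine keeps TSM data in per-bucket-ID directories under the engine data directory (which is why the layout doesn't look organized by bucket name); the paths shown are the defaults and may differ in your setup, and <BUCKET_ID> is a placeholder:
# assumption: default OSS on-disk layout, directories keyed by bucket ID
influx bucket list                              # note the ID of the bucket you care about
du -sh ~/.influxdbv2/engine/data/<BUCKET_ID>    # TSM data for that bucket
du -sh ~/.influxdbv2/engine/wal/<BUCKET_ID>     # plus its WAL, if you want to count that too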

Related

Copy Files from S3 Signed URL to GCS Signed URL

I am developing a service in which two different cloud storage providers are involved. I am trying to copy data from an S3 bucket to GCS.
To access the data I have been given signed URLs, and to upload the data to GCS I also have signed URLs available which allow me to write content into a specified storage path.
Is there a way to move this data "in cloud"? Downloading from S3 and re-uploading the content to GCS would create bandwidth problems.
I must also mention that this is an on-demand job and it only moves a small number of files; I cannot do a full bucket transfer.
Kind regards
You can use Skyplane to move data across cloud object stores. To move a single file from S3 to Google Cloud Storage, you can use the command:
skyplane cp s3://<BUCKET>/<FILE> gcs://<BUCKET>/<FILE>
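For completeness, a minimal setup sketch; the package extras and the init command reflect Skyplane's documented install flow as far as I know and may have changed, and the bucket and file names are placeholders:
pip install "skyplane[aws,gcp]"
skyplane init                                   # configures credentials for each cloud
skyplane cp s3://<SRC_BUCKET>/<FILE> gcs://<DST_BUCKET>/<FILE>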

How to resolve this error in Google Data Fusion: "Stage x contains a task of very large size (2803 KB). The maximum recommended task size is 100 KB."

I need to move data from a parameterized S3 bucket into Google Cloud Storage. Basic data dump. I don't own the S3 bucket. It has the following syntax:
s3://data-partner-bucket/mykey/folder/date=2020-10-01/hour=0
I was able to transfer data at the hourly granularity using the Amazon S3 client provided by Data Fusion. I wanted to bring over a day's worth of data, so I reset the path in the client to:
s3://data-partner-bucket/mykey/folder/date=2020-10-01
It seemed like it was working until it stopped. The status is "Stopped." When I review the logs just before it stopped I see a warning, "Stage 0 contains a task of very large size (2803 KB). The maximum recommended task size is 100 KB."
I examined the data in the S3 bucket. Each folder contains a series of log files. None of them are "big". The largest folder contains a total of 3MB of data.
I saw a similar question for this error, but the answer involved Spark coding that I don't have access to in Data Fusion.
Screenshot of Advanced Settings in Amazon S3 Client
These are the settings I see in the client. Maybe there is another setting somewhere I need to set? What do I need to do so that Data Fusion can import these files from S3 to GCS?
When you deploy the pipeline you are redirected to a new page with a ribbon at the top; one of the tools in the ribbon is Configure.
In the Resources section of the Configure modal you can specify the memory resources. I fiddled around with the numbers: 1000 MB worked, 6 MB was not enough (for me).
I processed 756K records in about 46 minutes.

Need suggestions on a cloud file storage system where reads greatly outnumber writes

I need a cloud service for saving file objects for my project.
Requirements:
1. Files will be written (pushed) far less often than they are read from storage.
2. Need to maintain versions of the file objects
3. File objects must be indexed for fast retrieval.
Note:
We have already considered using an Amazon S3 bucket, but given our project requirements, reading file objects would happen roughly a million times more often than writing a file into storage.
Since Amazon charges S3 usage more on the number of reads than on writes, it really is the last option for us to use.
Can anybody kindly provide suggestions on what we can use here?
Thanks!

Does S3 multipart upload actually create multiple objects in my bucket?

Here is an example I am using to try to understand the under-the-hood mechanism.
I decide to upload a 2GB file to my S3 bucket, and I decide to use a part size of 128MB. Then I will have
(2 * 1024) / 128 => 16 parts
Here are my questions:
1. Am I going to see 16 128MB objects in my bucket, or a single 2GB object?
2. How does S3 know the order of the parts (1 -> 2 -> ... -> 16) and reassemble them into a single 2GB file when I download it back? Is there an extra 'meta' object (see the question above) that I need to download first to give the client the information it needs for reassembly?
3. When the S3 client downloads the above in parallel, at what point does it create the file for this 2GB object in the local file system (I guess it does not know all the needed information before all the parts have been downloaded)?
While the individual parts are being uploaded, they are stored in Amazon S3 as an in-progress multipart upload, which you can view with the ListMultipartUploads command.
When completing a multipart upload with the CompleteMultipartUpload command, you must specify the list of individual parts in the correct order. The parts are then combined into a single object.
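As a rough sketch of that flow with the AWS CLI (the bucket, key, and part file names are placeholders; parts.json is a file you create listing each part's PartNumber and the ETag returned by upload-part, in order):
aws s3api create-multipart-upload --bucket <BUCKET> --key big-file.bin        # returns an UploadId
aws s3api upload-part --bucket <BUCKET> --key big-file.bin --part-number 1 --body part-01 --upload-id <UPLOAD_ID>   # repeat per part; each call returns an ETag
aws s3api list-multipart-uploads --bucket <BUCKET>                            # the in-progress upload shows up here, not as 16 objects
aws s3api complete-multipart-upload --bucket <BUCKET> --key big-file.bin --upload-id <UPLOAD_ID> --multipart-upload file://parts.json
Only after the final command does a single 2GB object appear in the bucket.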
Downloading depends upon the client/code you use -- you could download an object in parallel or just single-threaded.

What are the data size limitations when using the GET and PUT methods to get and store objects in the Amazon S3 cloud?

What is the maximum size of data that can be sent using the GET and PUT methods to store and retrieve data from the Amazon S3 cloud? I would also like to know where I can learn more about the APIs available for storage in Amazon S3, other than the documentation that is already provided.
The PUT method is addressed in the respective Amazon S3 FAQ How much data can I store?:
The total volume of data and number of objects you can store are unlimited. Individual Amazon S3 objects can range in size from 1 byte to 5 terabytes. The largest object that can be uploaded in a single PUT is 5 gigabytes. For objects larger than 100 megabytes, customers should consider using the Multipart Upload capability. [emphasis mine]
As mentioned, Uploading Objects Using the Multipart Upload API is already recommended for objects larger than 100MB, and it is required for objects larger than 5GB.
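In practice the AWS CLI handles this automatically, switching to multipart upload once an object exceeds a configurable threshold; the 100MB threshold and file name below are only illustrative:
aws configure set default.s3.multipart_threshold 100MB     # objects above this size are uploaded in parts
aws s3 cp ./big-file.bin s3://<BUCKET>/big-file.bin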
The GET method is essentially unlimited. Please note that S3 supports the BitTorrent protocol out of the box, which (depending on your use case) might ease working with large files considerably, see Using BitTorrent with Amazon S3:
Amazon S3 supports the BitTorrent protocol so that developers can save costs when distributing content at high scale. [...]