Can I load every item in a Google Cloud Storage bucket into a BigQuery Table without listing every filename? - google-bigquery

I write new log files to a Google Cloud Storage bucket every 2-3 minutes with data from my webserver (pipe-separated-values). I have thousands of ~1MB files in a single Google Cloud Storage bucket, and want to load all the files into a BigQuery table.
The "bq load" command seems to require individual files, and can't take an entire bucket, or bucket with prefix.
What's the best way to load thousands of files in a gs bucket? Do I really have to get the URI of every single file, as opposed to just specifying the bucket name or bucket and prefix to BigQuery?

You can use glob-style wildcards. E.g. gs://bucket/prefix*.txt.
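A minimal sketch of such a load, assuming pipe-delimited files and hypothetical dataset, table, bucket, and schema-file names (BigQuery accepts a single * wildcard in the URI):
bq load --source_format=CSV --field_delimiter='|' my_dataset.my_logs 'gs://my-bucket/logs/*' ./schema.json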

Related

Importing data from S3 to BigQuery via Bigquery Omni (location restrictions)

I am trying to import some data from an S3 bucket into BigQuery, and I ended up seeing the BigQuery Omni option.
However, when I try to connect to the S3 bucket, I am given a fixed set of regions to choose from; in my case, aws-us-east-1 and aws-ap-northeast-2, as in the attached screenshot.
My data in the S3 bucket is in the region eu-west-2.
I am wondering why BQ only lets us pick from specific S3 regions.
What should I do so that I can query data from an S3 bucket in the region where the data was uploaded?
The S3 service is unusual in that bucket names are globally unique and the bucket's ARN carries no region information. Buckets do live in a specific region, but they can be accessed from any region.
My best guess here is that the connection location is the S3 API endpoint BigQuery will connect to when it fetches the data. If you don't see eu-west-2 as an option, try using us-east-1. From that endpoint it is always possible to find out the bucket's location and then create the appropriate S3 client.
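As an illustration, the bucket's region can be discovered from any endpoint with the AWS CLI (the bucket name is a placeholder):
aws s3api get-bucket-location --bucket my-bucket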

Copy Files from S3 SignedURL to GCS Signed URL

I am developing a service in which two different cloud storage providers are involved. I am trying to copy data from S3 bucket to GCS.
To access the data I have been given signed URLs, and to upload the data to GCS I also have signed URLs available which allow me to write content to a specified storage path.
Is there a possibility to move this data "in cloud"? Downloading from S3 and re-uploading the content to GCS would create bandwidth problems.
I must also mention that this is an on-demand job and it only moves a small number of files; I cannot do a full bucket transfer.
Kind regards
You can use Skyplane to move data across cloud object stores. To move a single file from S3 to Google Cloud Storage, you can use the command:
skyplane cp s3://<BUCKET>/<FILE> gcs://<BUCKET>/<FILE>

How to load large text file in Google BigQuery

I was going through the Google BigQuery documentation and I see there is a limit of 5TB per file for unencrypted file loads and 4TB per file for encrypted file loads in BigQuery, with 15TB per load job.
I have a hypothetical question: how can I load a text file larger than 16TB (assuming encryption will bring it into the range of 4TB)? I also see the GCS Cloud Storage limit is 5TB per file.
I have never done it, but here is how I imagine a possible approach; I am not sure and am looking for confirmation. First, split the file. Next, encrypt the chunks and transfer them to GCS. Finally, load them into the Google BigQuery table.
You are on the right track, I guess. Split the file into smaller chunks, then distribute them across 2 or 3 different GCS buckets.
Once the chunks are in the buckets, you can load them into BQ; a rough sketch of the commands is given below.
Hope it helps.
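A rough sketch of that workflow, leaving the encryption step aside and assuming GNU split plus hypothetical bucket, dataset, and file names:
split -C 1T huge_file.txt chunk_                # split at line boundaries into ~1TB pieces
gsutil -m cp chunk_* gs://my-bucket/chunks/     # parallel upload to Cloud Storage
bq load --source_format=CSV my_dataset.my_table 'gs://my-bucket/chunks/chunk_*' ./schema.json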

How can I find the changes made in an `s3` bucket between two timestamps?

I am using an S3 bucket to store my data, and I keep pushing data to this bucket every single day. I wonder whether there is a feature that lets me compare the differences between the files in my bucket at two dates. If not, is there a way for me to build one via the AWS CLI or SDK?
The reason I want to check this is that I have an S3 bucket and my clients keep pushing data to it. I want to see how much data they have pushed since the last time I loaded it. Is there a pattern in AWS that supports this query, or do I have to set up some rules on the S3 bucket to analyse it?
Listing from Amazon S3
You can activate Amazon S3 Inventory, which can provide a daily file listing the contents of an Amazon S3 bucket. You could then compare differences between two inventory files.
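As a rough sketch, assuming the two inventories have been downloaded and unzipped to CSV (where the object key is the second column) and using hypothetical file names, the keys added between the two days can be found with:
cut -d, -f2 inventory-day1.csv | sort > day1-keys.txt
cut -d, -f2 inventory-day2.csv | sort > day2-keys.txt
comm -13 day1-keys.txt day2-keys.txt            # keys present on day 2 but not on day 1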
List it yourself and store it
Alternatively, you could list the contents of the bucket and look for objects dated since the last listing. However, if objects are deleted, you will only know this if you keep a list of objects that were previously in the bucket. It's probably easier to use S3 Inventory.
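For example, a simple date-filtered listing with the AWS CLI (the bucket name and cut-off date are placeholders; the awk filter relies on the ISO date in the first column of the listing):
aws s3 ls s3://my-bucket --recursive | awk '$1 >= "2024-01-01"'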
Process it in real-time
Instead of thinking about files in batches, you could configure Amazon S3 Events to trigger something whenever a new file is uploaded to the Amazon S3 bucket. The event can:
Trigger a notification via Amazon Simple Notification Service (SNS), such as an email
Invoke an AWS Lambda function to run some code you provide. For example, the code could process the file and send it somewhere.
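A hedged sketch of wiring up the SNS option with the AWS CLI (the bucket name and topic ARN are placeholders; the topic's policy must already allow S3 to publish to it):
aws s3api put-bucket-notification-configuration --bucket my-bucket --notification-configuration '{"TopicConfigurations": [{"TopicArn": "arn:aws:sns:us-east-1:123456789012:new-objects", "Events": ["s3:ObjectCreated:*"]}]}'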

What's the use of periodically scheduling an AWS Glue crawler? Running it once seems to be enough

I've created an AWS Glue table based on the contents of an S3 bucket. This allows me to query the data in this S3 bucket using AWS Athena. I've defined an AWS Glue crawler and run it once to auto-determine the schema of the data. This all works nicely.
Afterwards, all newly uploaded data in the S3 bucket is nicely reflected in the table (verified by doing a select count(*) ... in Athena).
Why then would I need to periodically run (i.e. schedule) an AWS Glue crawler? After all, as said, updates to the S3 bucket seem to be properly reflected in the table. Is it to update statistics on the table so the query planner can be optimized, or something?
A crawler is needed to register new data partitions in the Data Catalog. For example, suppose your data is located in the folder /data and partitioned by date (/data/year=2018/month=9/day=11/<data-files>). Each day, files arrive in a new folder (day=12, day=13, etc.). To make the new data available for querying, these partitions must be registered in the Data Catalog, which can be done by running a crawler. An alternative solution is to run 'MSCK REPAIR TABLE {table-name}' in Athena.
Besides that, a crawler can detect changes in the schema and take appropriate actions depending on your configuration.
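As a small illustration (the crawler name and schedule are hypothetical), the crawler can be run on demand or put on a daily schedule from the AWS CLI:
aws glue start-crawler --name my-crawler
aws glue update-crawler --name my-crawler --schedule "cron(0 2 * * ? *)"    # run daily at 02:00 UTC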