TBs of data need to be uploaded to BigQuery - google-bigquery

We have TBs of data that need to be uploaded to BigQuery. I remember one of Felipe Hoffa's videos mentioning that we can ship a hard drive overnight to Google and they will take care of it. Can the Google BigQuery team provide more info on whether this is possible?

This is the Offline Import mechanism from Google Cloud Storage. You can read about it here:
https://developers.google.com/storage/docs/early-access
Basically, you'd use this mechanism to import to Google Cloud Storage, then run BigQuery import jobs to import to BigQuery from there.
Depending on how many TB of data you are importing, you might be better off uploading directly to Google Cloud Storage. gsutil and other tools can do resumable uploads, so an interrupted transfer does not have to start over.
If you are talking about 100s of TB or more, you might want to talk to a Google Cloud Support person about your scenarios in detail. They may be able to help you optimize your usage of BigQuery and Cloud Storage.
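The two-step route described above (upload to Cloud Storage, then load into BigQuery from there) might look roughly like this. It is only a sketch, assuming the gsutil and bq command-line tools are installed and authenticated and that the files are CSV; the directory, bucket, dataset, and table names are hypothetical placeholders.

```python
import subprocess


def build_commands(local_dir, bucket, dataset, table):
    """Return the gsutil and bq command lines for a two-step import."""
    gcs_prefix = f"gs://{bucket}/import"
    # -m copies files in parallel; gsutil expands the local wildcard itself.
    upload = ["gsutil", "-m", "cp", f"{local_dir}/*.csv", f"{gcs_prefix}/"]
    # Load every CSV under the prefix and let BigQuery infer the schema.
    load = [
        "bq", "load", "--source_format=CSV", "--autodetect",
        f"{dataset}.{table}", f"{gcs_prefix}/*.csv",
    ]
    return upload, load


def run_import(local_dir, bucket, dataset, table, dry_run=True):
    """Print the commands, and execute them only when dry_run is False."""
    for cmd in build_commands(local_dir, bucket, dataset, table):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical names; replace them before running with dry_run=False.
    run_import("/data/exports", "my-bucket", "my_dataset", "my_table")
```

With dry_run left at its default, the script only prints the two commands, which is a cheap way to check the paths before moving terabytes.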

Related

Pull data from HTTP request API to Google Cloud

I have an app that sends me data through an API. The data is semi-structured (JSON).
I would like to send this data to Google BigQuery in order to store all the information.
However, I can't find out how to do it properly.
So far I have used Node.js on my own server to receive the data via POST requests.
Could you please help me? Thanks.
You can use the BigQuery API to do streaming inserts.
You can also write the data to Pub/Sub or Google Cloud Storage and use Dataflow pipelines to load it into BigQuery (either with streaming inserts, which incur costs, or with batch load jobs, which are free).
You can also log to Stackdriver and from there select entries and send them to BigQuery (GCP already has a direct option for this; note that under the hood it performs streaming inserts).
If you feel that setting up Dataflow is complicated, you can store your files and run batch load jobs by calling the BigQuery API directly. Note that there is a limit on the number of batch loads you can make per day into a particular table (1000 per day).
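For the first option (streaming inserts through the BigQuery API), a minimal sketch with the google-cloud-bigquery Python client could look like this. The batch size of 500 and the table ID passed in are assumptions for illustration, and the table is assumed to exist already with a matching schema.

```python
def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def stream_rows(table_id, rows, batch_size=500):
    """Stream JSON-compatible dicts into BigQuery; return any row errors."""
    # Imported here so chunked() stays usable without the library installed.
    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()
    errors = []
    for batch in chunked(rows, batch_size):
        # insert_rows_json returns a list of per-row errors (empty on success).
        errors.extend(client.insert_rows_json(table_id, batch))
    return errors
```

A call such as stream_rows("my-project.my_dataset.events", parsed_rows) would then send the rows in batches; both names in that call are hypothetical.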
There is a page in the official documentation that lists all the possibilities of loading data to BigQuery.
For simplicity, you can just send data from your local data source. You should use the Google Cloud client libraries for BigQuery. Here you have a guide on how to do that as well as a relevant code example.
But my honest recommendation is to send the data to Google Cloud Storage and load it into BigQuery from there. This way the whole process will be more stable.
You can check all the options from the first link that I've posted and choose the one you think fits your workflow best.
Keep in mind the limitations of this process.
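As a rough illustration of the simple route above (loading from a local source with the client library), the JSON records can be serialized as newline-delimited JSON and sent in a single batch load job. This is a sketch under assumptions: the table ID is supplied by the caller, and schema autodetection is good enough for the data.

```python
import io
import json


def to_ndjson(records):
    """Serialize an iterable of dicts as newline-delimited JSON bytes."""
    lines = (json.dumps(record, sort_keys=True) for record in records)
    return "".join(line + "\n" for line in lines).encode("utf-8")


def load_records(table_id, records):
    """Run one batch load job for `records` and wait for it to finish."""
    # Imported here so to_ndjson() works without the library installed.
    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # assumption: let BigQuery infer the schema
    )
    job = client.load_table_from_file(
        io.BytesIO(to_ndjson(records)), table_id, job_config=job_config
    )
    return job.result()  # blocks until the load job completes
```

Collecting records for a while and loading them in one job keeps the number of load jobs per table low, which matters given the daily limit mentioned earlier.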

Automatically detect changes in GCS for BigQuery

I have a BigQuery table whose data source is a bucket in GCS (Google Cloud Storage).
The bucket changes constantly as new files are added. Is there a mechanism for BigQuery to automatically detect the changes in GCS and sync with the latest data?
Thanks!
There is a very cool beta feature you can use to do that. Check out BigQuery Cloud Storage Transfer. You can schedule transfers, run backfills, and much more.
Read "limitations" to see if it can work for you.
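A scheduled Cloud Storage transfer can also be created from code. The following is only a sketch, assuming the google-cloud-bigquery-datatransfer Python client; the data source ID and the parameter names are written from my reading of the transfer documentation and should be verified against it, and every project, dataset, bucket, and table name is a hypothetical placeholder.

```python
def gcs_transfer_params(bucket, pattern, table, file_format="CSV"):
    """Build the params dict for a Cloud Storage transfer configuration."""
    # Parameter names are assumptions to check against the transfer docs.
    return {
        "data_path_template": f"gs://{bucket}/{pattern}",
        "destination_table_name_template": table,
        "file_format": file_format,
    }


def create_gcs_transfer(project_id, dataset_id, bucket, pattern, table):
    """Create a daily transfer from a GCS path into a BigQuery table."""
    # pip install google-cloud-bigquery-datatransfer
    from google.cloud import bigquery_datatransfer

    client = bigquery_datatransfer.DataTransferServiceClient()
    config = bigquery_datatransfer.TransferConfig(
        destination_dataset_id=dataset_id,
        display_name=f"GCS to {table}",
        data_source_id="google_cloud_storage",
        params=gcs_transfer_params(bucket, pattern, table),
        schedule="every 24 hours",
    )
    return client.create_transfer_config(
        parent=client.common_project_path(project_id),
        transfer_config=config,
    )
```

The destination table generally has to exist before the first run, which is one of the limitations worth checking.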

How to track Amazon AWS S3 bucket downloads using Google Analytics Measurement Protocol?

I'm using AWS S3 as my CDN to store files. Often these are directly linked from places all over the world. I'd like to track the file downloads in the S3 bucket using Google Analytics. It appears Google Analytics Measurement Protocol may be able to do this. But since I'm new to both the AWS environment and GAMP, I was hoping I'm not the first to ever do this. Anyone know of a way this can be accomplished?
I doubt this is possible without you doing extra work on top.
You could create a proxy site that, when hit, records an event to Google Analytics and then redirects to the download page/bucket.
You could also maybe have some script/job/etc scrape events from the AWS dashboards and write them to Google Analytics, although this would probably be less than real-time.
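The proxy idea could be sketched like this: record an event hit, then hand back a redirect to the real S3 URL. The sketch assumes the Universal Analytics (v1) Measurement Protocol, which is what the question refers to; GA4 uses a different endpoint and payload format. The event category and action values, the tracking ID, and the bucket URL are hypothetical.

```python
import uuid
from urllib.parse import urlencode
from urllib.request import urlopen

GA_ENDPOINT = "https://www.google-analytics.com/collect"


def build_download_hit(tracking_id, file_key, client_id=None):
    """Return the form-encoded body for a Measurement Protocol v1 event hit."""
    return urlencode({
        "v": "1",                               # protocol version
        "tid": tracking_id,                     # property ID, e.g. UA-XXXXX-Y
        "cid": client_id or str(uuid.uuid4()),  # anonymous client ID
        "t": "event",                           # hit type
        "ec": "download",                       # event category (hypothetical)
        "ea": "s3",                             # event action (hypothetical)
        "el": file_key,                         # event label: the file requested
    })


def track_and_redirect(tracking_id, bucket_url, file_key):
    """Record the hit, then return a (status, location) pair for a redirect."""
    body = build_download_hit(tracking_id, file_key).encode("ascii")
    urlopen(GA_ENDPOINT, data=body, timeout=5)  # sent as POST because data is set
    return 302, f"{bucket_url.rstrip('/')}/{file_key}"
```

Whatever web framework serves the proxy would call track_and_redirect and turn the returned pair into an HTTP response with a Location header.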
You can turn on logging for the buckets you care about, then download the little logfile fragments that Amazon delivers and feed them into an off-the-shelf analytics package such as Webalizer, provided you're willing to spend the time and effort to build a pipeline and massage the data so that it fits.
I've written about how to do that here:
https://www.expatsoftware.com/articles/2007/11/roll-your-own-web-stats-for-amazon-s3.html
If you just want the reports today, there are a handful of 3rd party services built around doing this for you, so if you have ~$10/month to spend that's probably the best solution.
S3stat (https://www.s3stat.com/) is my suggestion. But then it should be since it's also my product.
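For the roll-your-own route based on bucket logging, the core of such a pipeline is parsing the delivered log lines. Here is a minimal sketch that counts successful object downloads per key; the regular expression assumes the space-delimited layout of S3 server access logs and only captures the leading fields, so check it against your own log files.

```python
import re
from collections import Counter

# Assumed field order: owner, bucket, [time], remote IP, requester,
# request ID, operation, key, "request URI", HTTP status, ...
LOG_PATTERN = re.compile(
    r'(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+) '
    r'"(?P<request_uri>[^"]*)" (?P<status>\S+)'
)


def count_downloads(lines):
    """Count successful object GETs per key across access-log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed layout
        is_get = match["operation"] == "REST.GET.OBJECT"
        if is_get and match["status"].startswith("2"):
            counts[match["key"]] += 1
    return counts
```

The resulting counts could then be reported however you like, including being forwarded to an analytics service.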

Loading data from a Google Persistent Disk into BigQuery?

What's the recommended way of loading data into BigQuery that is currently located in a Google Persistent Disk? Are there any special tools or best practises for this particular use case?
Copy to GCS (Google Cloud Storage), point BigQuery to load from GCS.
There's currently no direct connection between a persistent disk and BigQuery. You could send the data straight to BigQuery with the bq CLI, but that makes everything slower if you ever need to retry, because the upload has to be repeated along with the load.
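From a VM with the disk mounted, the copy-then-load approach can also be scripted with the Python client libraries. This is a sketch under assumptions: the files are CSV, schema autodetection is acceptable, and the mount path, bucket, prefix, and table ID are hypothetical values supplied by the caller.

```python
import os


def plan_uploads(root_dir, prefix):
    """Map each file under root_dir to the object name it would get in GCS."""
    plan = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            local_path = os.path.join(dirpath, name)
            relative = os.path.relpath(local_path, root_dir)
            object_name = f"{prefix}/{relative.replace(os.sep, '/')}"
            plan.append((local_path, object_name))
    return plan


def copy_and_load(root_dir, bucket_name, prefix, table_id):
    """Upload every file to GCS, then load them all with one load job."""
    # pip install google-cloud-storage google-cloud-bigquery
    from google.cloud import bigquery, storage

    bucket = storage.Client().bucket(bucket_name)
    for local_path, object_name in plan_uploads(root_dir, prefix):
        bucket.blob(object_name).upload_from_filename(local_path)

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV, autodetect=True
    )
    job = bigquery.Client().load_table_from_uri(
        f"gs://{bucket_name}/{prefix}/*", table_id, job_config=job_config
    )
    return job.result()  # blocks until the load job completes
```

If the load fails, only the load job needs to be retried; the files are already in the bucket.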

Loading data from Bigquery to google storage bucket in CSV file format

I run a query in BigQuery on a daily basis and need to export the resulting dataset to my Google Cloud Storage bucket. The dataset is larger than 10MB, which means I'm unable to use Apps Script.
Essentially, I'd like to automate a data load with my BigQuery script that exports the dataset as a CSV file to Google Cloud Storage.
Can anyone point me in the right direction in terms of which program or method to use? Please also share your experiences.
Thanks
Here you can find some details on how to export data from BigQuery to Cloud Storage along with a sample written in Python.
https://cloud.google.com/bigquery/exporting-data-from-bigquery
You can implement a simple application running on App Engine that contains a cron job scheduled to run once a day and perform the steps described in the tutorial above.
https://cloud.google.com/appengine/docs/python/config/cron
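The export step that such a daily job would run might look like the following. It is a sketch assuming the google-cloud-bigquery Python client and that the data to export already sits in a table; the dated path layout and all project, dataset, table, and bucket names are hypothetical.

```python
from datetime import date


def export_uri(bucket, table, day):
    """Build a dated, wildcarded destination URI for the CSV files."""
    # The * lets BigQuery split an export that is too large for one file.
    return f"gs://{bucket}/exports/{table}/{day.isoformat()}/part-*.csv"


def export_table(project, dataset, table, bucket):
    """Export one table to Cloud Storage as CSV and wait for the job."""
    # Imported here so export_uri() works without the library installed.
    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client(project=project)
    job_config = bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.CSV
    )
    job = client.extract_table(
        f"{project}.{dataset}.{table}",
        export_uri(bucket, table, date.today()),
        job_config=job_config,
    )
    return job.result()  # blocks until the extract job completes
```

The cron handler would simply call export_table once a day; writing each run under its own date keeps earlier exports from being overwritten.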