Inserting real-time data into BigQuery from a file on Compute Engine? - google-bigquery

I'm downloading real-time data into a CSV file on a Google Compute Engine instance and want to load this file into BigQuery for real-time analysis.
Is there a way for me to do this without first uploading the file to Cloud Storage?
I tried this: https://cloud.google.com/bigquery/streaming-data-into-bigquery but since my file isn't in JSON, this fails.

Have you tried the bq command-line tool? You can upload CSVs with it.
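For example, a local CSV can be loaded straight from the Compute Engine instance with bq load, with no Cloud Storage staging step; the dataset, table, and file names below are placeholders:
bq load --source_format=CSV --skip_leading_rows=1 mydataset.mytable ./realtime.csv ./schema.json
Repeating the command as new files arrive appends rows to the table by default, which gives a simple near-real-time pipeline. For true row-by-row streaming, the tabledata.insertAll API from the linked docs expects JSON-formatted rows, so the CSV rows would need to be converted first.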

Related

Loading data into Google Colab from a central data repository?

What's the best place to upload data for use in Google Colaboratory notebooks? I'm planning to make some notebooks that load netCDF data using python, and I'd like to be able to send the notebooks to other people and have them load the same data without difficulty.
I know I can load data from my own Google Drive, but if I sent other people the notebooks, then I'd have to send them the data files too, right?
Is it possible to have a central data repository that multiple people can load data from? The files that I'd like to use are ~10-100 MB. I only need to read data, not write it. Thanks!
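One possible approach (a sketch, not from the original thread; the bucket and file names are hypothetical) is to host the files in a publicly readable Google Cloud Storage bucket and have each notebook fetch them in its first cell:
!gsutil cp gs://shared-netcdf-data/example.nc /content/
!wget https://storage.googleapis.com/shared-netcdf-data/example.nc
Either command works in Colab; anyone running the notebook then reads the same copy without the data files being passed around by hand.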

BigQuery: Unloading Large Data to a Single GZIP File

I'm using the BigQuery console and was planning to extract a table and put the results into Google Cloud Storage as a GZIP file, but I encountered an error asking me to wildcard the filename. According to the Google docs, this is a limitation for large volumes of data: the extract needs to be split into multiple files.
https://cloud.google.com/bigquery/docs/exporting-data#console
By any chance, is there a workaround so I could have a single compressed file in Google Cloud Storage instead of multiple files? I was using Redshift previously, and this wasn't an issue.
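One workaround (just a sketch, not an official feature; bucket and table names are placeholders) is to export with the required wildcard and then stitch the shards back together, since concatenated gzip members still decompress as a single stream:
bq extract --compression=GZIP mydataset.mytable gs://mybucket/export/part-*.csv.gz
gsutil cat gs://mybucket/export/part-*.csv.gz > single.csv.gz
gsutil cp single.csv.gz gs://mybucket/export/single.csv.gz
gsutil compose can do the concatenation server-side instead, but it is limited to 32 source objects per call, so the cat-and-reupload route is safer for very large exports.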

Does Google BigQuery support the Parquet file format?

I was wondering if Google BigQuery currently supports Parquet file format or if there are plans to support it?
I know that it currently supports CSV and JSON formats.
** As of 1 March 2018, support for loading Parquet 1.0 files is available.
In the BigQuery CLI, there is a --source_format PARQUET option, which is described in the output of bq --help.
I never got to use it, because when I was experimenting with this feature, it was still invite-only, and I did not request the invite.
My use case was that the Parquet file is half the size of the Avro file. I wanted to try something new and to upload data efficiently (in that order).
% bq load --source_format PARQUET test.test3 data.avro.parquet schema.json
Upload complete.
Waiting on bqjob_r5b8a2b16d964eef7_0000015b0690a06a_1 ... (0s) Current status: DONE
[...]
At this time BigQuery does not support the Parquet file format. However, we are interested in hearing more about your use case: are you interested in import, export, or both? How do you intend to use it? Understanding the scenarios better will help the BigQuery team plan accordingly.
If you want to share a file format between BigQuery and Hadoop, you can use newline-delimited JSON records.
BigQuery supports these for import and export.
Hadoop supports this as well. Searching the internet turns up many recipes for making it work. Here's one: Processing JSON using java Mapreduce
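For the BigQuery side of that exchange, a newline-delimited JSON export to Cloud Storage looks roughly like this (table, bucket, and path are placeholders):
bq extract --destination_format=NEWLINE_DELIMITED_JSON mydataset.mytable gs://mybucket/export/data-*.json
The resulting files can then be pulled into HDFS (for example with hadoop distcp via the Cloud Storage connector) and read by a JSON-aware MapReduce or Hive job.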
When you are dealing with hundreds of millions of rows and need to move data to an on-premises Hadoop cluster, that is, exporting from BigQuery, JSON is just not a feasible option and Avro is not much better. The only other efficient option today for such data movement is gzip-compressed exports, which unfortunately cannot be read natively in Hadoop. Parquet is the only efficient format for this use case; we do not have any other efficient option.
Example (part-* is the secret sauce here):
bq load --source_format=PARQUET --replace=true abc.def gs://abc/def/part-*

Upload multiple CSVs from Google Cloud Storage to BigQuery

I need to upload multiple CSV files from my Google Cloud Storage bucket. I tried pointing to the bucket when creating the dataset, but I received an error. I also tried
gsutil load <projectID:dataset.table> gs://mybucket
it didn't work.
I need to upload multiple files at a time, as my total data is 2-3 TB and there are a large number of files.
You're close. Google Cloud Storage uses gsutil, but BigQuery's command-line utility is "bq". The command you're looking for is bq load <table> gs://mybucket/file.csv.
bq's documentation is over here: https://developers.google.com/bigquery/bq-command-line-tool
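Since there are many files, note that bq load also accepts wildcard Cloud Storage URIs, so a whole prefix can be loaded in one job; a sketch with placeholder names:
bq load --source_format=CSV --skip_leading_rows=1 projectID:dataset.table "gs://mybucket/*.csv" ./schema.json
A comma-separated list of URIs also works if the files don't share a common pattern.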

Query regarding cloud file storage services: can I append data to an existing file?

I am working to create an application where some files will be stored in Amazon S3/Rackspace Cloud Files/other similar cloud file storage providers.
There are a couple of scenarios where it would be easier for me if I could append data to an existing file... Is this possible? Or do I have to download the file from Amazon S3, then append data to it, and finally upload the modified file back to Amazon S3?
There is no way to append anything to existing files in S3.
You will have to download it and upload it again after modifying it.
If you wish, though, you can always upload the new data with a tag (a timestamp or a counter), e.g. file_201201011344. Then, when reading, you get all files matching your pattern and append them on the client side.
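A minimal sketch of that download-append-reupload cycle with the AWS CLI (bucket and file names are hypothetical):
aws s3 cp s3://my-bucket/log.csv .
cat new_rows.csv >> log.csv
aws s3 cp log.csv s3://my-bucket/log.csv
For the tag-and-merge alternative, each chunk would instead be uploaded as its own object (file_201201011344, file_201201011401, ...) and concatenated on the client side when reading.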