Export Google Cloud Datastore and import to BigQuery programmatically

I'm looking for a method to export my Cloud Datastore data and import it into BigQuery daily. The manual way is described on Google's documentation page, but I can't find a clean way to automate it.

There isn't a simple way to do this, but you can separate out the two parts: creating App Engine backups and loading them into BigQuery.
You can use scheduled backups to create datastore backups periodically (https://cloud.google.com/appengine/articles/scheduled_backups).
You can then use Apps Script to automate the BigQuery portion (https://developers.google.com/apps-script/advanced/bigquery#load_csv_data) or use an App Engine cron to do the same thing.
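For the load step, here is a minimal sketch with the BigQuery Python client instead of Apps Script; the bucket path, kind and table names are placeholders, and it assumes a scheduled backup has already written a .backup_info file for the kind:

from google.cloud import bigquery

# Minimal sketch: load one kind's scheduled-backup metadata file into BigQuery.
# Bucket path, kind and table names are placeholders.
client = bigquery.Client(project="my-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.DATASTORE_BACKUP,
)

client.load_table_from_uri(
    "gs://my-backup-bucket/2016-01-01/MyKind.backup_info",
    "my_dataset.my_kind",
    job_config=job_config,
).result()  # wait for the load job to finish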

As of last week there's a proper way to automate this. The most important part is gcloud beta datastore export.
I created a script around it: https://github.com/chees/datastore2bigquery
You could run this in a cron job.
See here for a demo of how it works: https://www.youtube.com/watch?v=dGyQCE3bWkU
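For reference, a rough sketch of what such a daily cron job could do using the gcloud CLI plus the BigQuery Python client; project, bucket, kind and dataset names are placeholders, and the metadata path shown is the layout produced by exporting a single kind in the default namespace:

import subprocess
from datetime import datetime, timezone

from google.cloud import bigquery

# Placeholders: adjust project, bucket, kind and dataset names.
stamp = datetime.now(timezone.utc).strftime("%Y%m%d")
export_prefix = f"gs://my-backup-bucket/{stamp}"

# 1. Export the Datastore kind to Cloud Storage (assumes gcloud is installed
#    and authenticated on the machine running the cron job).
subprocess.run(
    ["gcloud", "beta", "datastore", "export", export_prefix,
     "--kinds=MyKind", "--project=my-project"],
    check=True,
)

# 2. Load the kind's export metadata file into BigQuery, replacing yesterday's table.
client = bigquery.Client(project="my-project")
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.DATASTORE_BACKUP,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.load_table_from_uri(
    f"{export_prefix}/all_namespaces/kind_MyKind/all_namespaces_kind_MyKind.export_metadata",
    "my_dataset.my_kind",
    job_config=job_config,
).result()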

Building on Jordan's answer above, the steps to do this would be:
1) Make a storage bucket
2) Export datastore entities to this bucket
3) Open the BigQuery Web UI, and load using the Google Cloud Storage file path.
Full tutorial with images is available at this post.

It is possible using the following code. It basically uses App Engine cron jobs and the BigQuery API.
https://github.com/wenzhe/appengine_datastore_bigquery

Related

BigQuery: Export to GCS option disappeared from BigQuery UI

The option to export to a GCS bucket has disappeared from the BigQuery UI and was replaced with "Export to Google Drive". It's a feature I used a lot for large results, and exporting to Drive is not useful at all: it takes very long, and I can't work with the file in Drive the same way I would in GCS. Is there any way I can still export to GCS from the BigQuery UI?
The "workaround" for BigQuery UI is to save result as a table (or just have destination table set for query) and after result is available in the table - just use "Export to GCS" option which is "still" available in both Classic and New BQ UI

Google Dataflow - BigQuery job backed by GoogleSheets

I have a Google Dataflow batch job written in Java.
This Java code accesses BigQuery, performs a few transformations,
and then outputs back into BigQuery.
The code can access the BigQuery tables just fine.
But when I choose a table that is backed by a federated source like Google Sheets, it doesn't work:
it says no OAuth token with Google Drive scope was found.
Pipeline options
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p1 = Pipeline.create(options);
Any ideas?
Can you try:
gcloud auth login --enable-gdrive-access
before you launch the Dataflow job?
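If that doesn't help, the underlying issue is that the job's credentials lack a Google Drive scope. This is not the Dataflow fix as such, but for comparison, here is how that scope is attached when reading a Sheets-backed table directly with the BigQuery Python client (project and table names are placeholders):

import google.auth
from google.cloud import bigquery

# Request credentials that carry both the BigQuery and the Drive scope, which is
# what a Sheets-backed (federated) table needs; names are placeholders.
credentials, project = google.auth.default(
    scopes=[
        "https://www.googleapis.com/auth/bigquery",
        "https://www.googleapis.com/auth/drive",
    ]
)
client = bigquery.Client(project=project, credentials=credentials)
rows = client.query(
    "SELECT * FROM `my-project.my_dataset.sheet_backed_table`"
).result()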
Answering my own question: to get around this issue I'm going to use Google Apps Script to upload to BigQuery as a native table.
Please see this link.
I'm just going to modify the "Load CSV data" code snippet to load into BigQuery and then create an installable trigger to execute this function every night.
Beware that you can't use simple triggers like onEdit or onOpen for anything that requires authorization.

Extract data from MarkLogic 8.0.6 to AWS S3

I'm using MarkLogic 8.0.6, and we also have JSON documents in it. I need to extract a lot of data from MarkLogic and store it in AWS S3. We tried running mlcp locally and then uploading the data to AWS S3, but it's very slow because it generates a lot of files.
Our MarkLogic platform is already connected to S3 to perform backups. Is there a way to extract a specific database to AWS S3?
It would be fine for me to end up with one big file with one JSON document per line.
Thanks,
Romain.
I don't know about getting it to S3, but you can use CORB2 to extract MarkLogic documents into one big file with one JSON document per line.
S3:// is a native file path in MarkLogic, so you can also iterate through all your docs and export them with xdmp:save("s3://...").
If you want to build aggregates, you may want to combine this idea with Sam's suggestion of CORB2 to control the process and help group your whole database into multiple manageable aggregate documents, then use a post-back task to run xdmp:save.
Thanks guys for your answers. I didn't know about CORB2; that's a great solution! Unfortunately, due to bad I/O, I would prefer a solution that writes directly to S3.
I can use a basic ML query and dump to s3:// with the native connector, but I always hit a memory error, even when launching it with the "spawn" function to generate a background process.
Do you have any XQuery example that extracts each document to S3 one by one without running into memory errors?
Thanks

Google Cloud Logging export to Big Query does not seem to work

I am using the Google Cloud Logging web UI to export Google Compute Engine logs to a BigQuery dataset. According to the docs, you can even create the BigQuery dataset from this web UI (it simply asks you to give the dataset a name), and it automatically sets up the correct permissions on the dataset.
It seems to save the export configuration without errors, but a couple of hours have passed and I don't see any tables created in the dataset. According to the docs, exporting the logs will stream the logs to BigQuery and create tables named with the following template:
my_bq_dataset.compute_googleapis_com_activity_log_YYYYMMDD
https://cloud.google.com/logging/docs/export/using_exported_logs#log_entries_in_google_bigquery
I can't think of anything else that might be wrong. I am the owner of the project and the dataset is created in the correct project (I only have one project).
I also tried exporting the logs to a Google Cloud Storage bucket and still had no luck there. I set the permissions correctly using gsutil according to this:
https://cloud.google.com/logging/docs/export/configure_export#setting_product_name_short_permissions_for_writing_exported_logs
And finally I made sure that the 'source' I am trying to export actually has some log entries.
Thanks for the help!
Have you ingested any log entries since configuring the export? Cloud Logging only exports entries to BigQuery or Cloud Storage that arrive after the export configuration is set up. See https://cloud.google.com/logging/docs/export/using_exported_logs#exported_logs_availability.
You might not have given edit permission to cloud-logs@google.com in the BigQuery console. Refer to this.
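If you need to set that permission programmatically rather than in the console, here is a sketch with the BigQuery Python client; the dataset name is a placeholder, and it assumes the missing piece is indeed the dataset's access list:

from google.cloud import bigquery

# Grant the Cloud Logging service account WRITER access on the target dataset.
client = bigquery.Client(project="my-project")
dataset = client.get_dataset("my-project.my_bq_dataset")

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="WRITER",
        entity_type="userByEmail",
        entity_id="cloud-logs@google.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])  # push the updated ACL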

How to download all data in a Google BigQuery dataset?

Is there an easy way to directly download all the data contained in a certain dataset on Google BigQuery? I'm currently downloading it "as CSV", making one query after another, but that doesn't let me get more than 15k rows, and the rows I need to download number over 5M.
Thank you
You can run BigQuery extraction jobs using the Web UI, the command line tool, or the BigQuery API. The data can be extracted to Google Cloud Storage.
For example, using the command line tool:
First install and auth using these instructions:
https://developers.google.com/bigquery/bq-command-line-tool-quickstart
Then make sure you have an available Google Cloud Storage bucket (see Google Cloud Console for this purpose).
Then, run the following command:
bq extract my_dataset.my_table gs://mybucket/myfilename.csv
More on extracting data via API here:
https://developers.google.com/bigquery/exporting-data-from-bigquery
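For completeness, a minimal sketch of the same extraction through the API with the Python client; dataset, table and bucket names are placeholders, and the wildcard lets BigQuery split a large table into multiple files:

from google.cloud import bigquery

# Extract a table to Cloud Storage as gzipped CSV shards.
client = bigquery.Client(project="my-project")

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.CSV,
    compression=bigquery.Compression.GZIP,
)
client.extract_table(
    "my-project.my_dataset.my_table",
    "gs://mybucket/myfilename_*.csv.gz",
    job_config=job_config,
).result()  # wait for the extract job to finish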
Detailed step-by-step to download large query output:
1. Enable billing. You have to give your credit card number to Google to export the output, and you might have to pay. But the free quota (1 TB of processed data) should suffice for many hobby projects.
2. Create a project.
3. Associate billing with the project.
4. Do your query.
5. Create a new dataset.
6. Click "Show options" and enable "Allow Large Results" if the output is very large.
7. Export the query result to a table in the dataset.
8. Create a bucket on Cloud Storage.
9. Export the table to the bucket you created on Cloud Storage. Make sure to pick GZIP compression and use a name like <bucket>/prefix.gz. If the output is very large, the file name must contain an asterisk * and the output will be split into multiple files.
10. Download the table from Cloud Storage to your computer. It does not seem possible to download multiple files from the web interface if the large file got split up, but you can install gsutil and run:
gsutil -m cp -r 'gs://<bucket>/prefix_*' .
See also: Download files and folders from Google Storage bucket to a local folder
There is a gsutil package in Ubuntu 16.04, but it is an unrelated package. You must install and set it up as documented at: https://cloud.google.com/storage/docs/gsutil
11. Unzip locally:
for f in *.gz; do gunzip "$f"; done
Here is a sample project I needed this for, which motivated this answer.
For Python you can use the following code; it will download the data as a DataFrame.

from google.cloud import bigquery

def read_from_bqtable(bq_projectname, bq_query):
    # Run the query and return the result as a pandas DataFrame
    client = bigquery.Client(project=bq_projectname)
    bq_data = client.query(bq_query).to_dataframe()
    return bq_data

bigQueryTableData_df = read_from_bqtable('gcp-project-id', 'SELECT * FROM `gcp-project-id.dataset-name.table-name`')
Yes, the steps suggested by Michael Manoochehri are correct and an easy way to export data from Google BigQuery.
I have written a bash script so that you don't have to do these steps every time; just use my bash script.
Here is the GitHub URL:
https://github.com/rajnish4dba/GoogleBigQuery_Scripts
Scope:
1. Export data based on your BigQuery SQL.
2. Export data based on your table name.
3. Transfer your export file to an SFTP server.
Try it and let me know your feedback.
For help, use ExportDataFromBigQuery.sh -h