Failed to import large data as dataframe, from Google BigQuery to Google Cloud DataLab - pandas

I tried two approaches to import a large table from Google BigQuery (about 50,000,000 rows, 18 GB) into a dataframe in Google Datalab, in order to do machine learning with TensorFlow.
First I use (all needed modules are imported):
data = bq.Query('SELECT {ABOUT_30_COLUMNS...} FROM `TABLE_NAME`').execute().result().to_dataframe()
It just keeps running, seemingly forever.
Even with LIMIT 1000000, nothing changes.
Second, I use:
data = pd.read_gbq(query='SELECT {ABOUT_30_COLUMNS...} FROM `TABLE_NAME` LIMIT 1000000', dialect ='standard', project_id='PROJECT_ID')
It runs well at first, but at about 450,000 rows (calculated from the percentage and the total row count) it gets stuck at:
Got page: 32; 45.0% done. Elapsed 293.1 s.
And I cannot find how to enable allowLargeResults in read_gbq().
Following its documentation, I try:
data = pd.read_gbq(query='SELECT {ABOUT_30_COLUMNS...} FROM `TABLE_NAME` LIMIT 1000000', dialect ='standard', project_id='PROJECT_ID', configuration = {'query': {'allowLargeResult': True}})
Then I get:
read_gbq() got an unexpected keyword argument 'configuration'
So I failed to import even 1,000,000 rows into Google Cloud Datalab, and I actually want to import 50 times that amount.
Any ideas?
Thanks

Before loading large datasets into Google Cloud Datalab: Make sure to consider alternatives such as those mentioned in the comments of this answer. Use sampled data for the initial analysis, determine the correct model for the problem and then use a pipeline approach, such as Google Cloud Dataflow, to process the large dataset.
There is an interesting discussion regarding Datalab performance improvements when downloading data from BigQuery to Datalab here. Based on these performance tests, a performance improvement was merged into Google Cloud Datalab in Pull Request #339. This improvement does not appear to be mentioned in the release notes for Datalab, but I believe the fixes are included as part of Datalab 1.1.20170406. Please check the version of Google Cloud Datalab to make sure that you're running at least version 1.1.20170406. To check the version, first click on the user icon in the top-right corner of the navigation bar in Cloud Datalab, then click About Datalab.
Regarding the pandas.read_gbq() command that appears to be stuck, I would like to offer a few suggestions:
Open a new issue in the pandas-gbq repository here.
Try extracting data from BigQuery to Google Cloud Storage in CSV format, for example, which you can then load into a dataframe using pd.read_csv. Here are two methods to do this:
Using Google BigQuery/Cloud Storage CLI tools:
Using the bq command line tool and the gsutil command line tool, extract the data from BigQuery to Google Cloud Storage, and then download the object to Google Cloud Datalab. To do this, type bq extract <source_table> <destination_uris>, followed by gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [LOCAL_OBJECT_LOCATION]
Using Google Cloud Datalab:
import google.datalab.bigquery as bq
import google.datalab.storage as storage
bq.Query(<your query>).execute(output_options=bq.QueryOutput.file(path='gs://<your_bucket>/<object name>', use_cache=False)).result()
result = storage.Bucket(<your_bucket>).object(<object name>).download()
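If the export was written as a CSV, the downloaded contents can then be loaded into a dataframe. A minimal sketch (bucket and object names are placeholders, and it assumes download() returns the raw file contents):
import io
import pandas as pd
import google.datalab.storage as storage
raw = storage.Bucket('my-bucket').object('my_export.csv').download()
data = pd.read_csv(io.BytesIO(raw) if isinstance(raw, bytes) else io.StringIO(raw))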
Regarding the error read_gbq() got an unexpected keyword argument 'configuration': the ability to pass arbitrary keyword arguments (configuration) was added in version 0.20.0. I believe this error is caused by the fact that pandas is not up to date. You can check the version of pandas installed by running
import pandas
pandas.__version__
To upgrade to version 0.20.0, run pip install --upgrade pandas pandas-gbq. This will also install pandas-gbq, which is an optional dependency of pandas.
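Once upgraded, read_gbq() should accept the configuration keyword. A minimal sketch (the project, table and configuration values below are placeholders, not taken from the question):
import pandas as pd
data = pd.read_gbq(
    query='SELECT * FROM `my-project.my_dataset.my_table` LIMIT 1000',
    project_id='my-project',
    dialect='standard',
    configuration={'query': {'useQueryCache': True}})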
Alternatively, you could try iterating over the table in Google Cloud Datalab. This works, but it's likely slower. This approach was mentioned in another Stack Overflow answer here: https://stackoverflow.com/a/43382995/5990514
I hope this helps! Please let me know if you have any issues so I can improve this answer.
Anthonios Partheniou
Contributor at Cloud Datalab
Project Maintainer at pandas-gbq

Related

Why does GCP get killed when reading a partitioned parquet file from Google Storage but not locally?

I am running a Notebook instance from the AI Platform on an E2 high-memory VM with 4 vCPUs and 32 GB of RAM.
I need to read a partitioned Parquet file of about 1.8 GB from Google Storage using pandas.
It needs to be completely loaded in RAM, and I can't use Dask compute for it.
Nonetheless, I tried loading it through that route as well, and it gave the same problem.
When I download the file locally to the VM, I can read it with pd.read_parquet.
The RAM consumption goes up to about 13 GB and then down to 6 GB when the file is loaded. It works.
df = pd.read_parquet("../data/file.parquet",
engine="pyarrow")
When I try to load it directly from Google Storage, the RAM goes up to about 13 GB and then the kernel dies. No logs, warnings or errors are raised.
df = pd.read_parquet("gs://path_to_file/file.parquet",
engine="pyarrow")
Some info on the package versions:
Python 3.7.8
pandas==1.1.1
pyarrow==1.0.1
What could be causing it?
I found a thread where it is explained how to execute this task in a different way.
For your scenario, using the gcsfs library is a good option, for example:
import pyarrow.parquet as pq
import gcsfs
fs = gcsfs.GCSFileSystem(project='my-project-name')
f = fs.open('my_bucket/path/to/file.parquet')
myschema = pq.ParquetFile(f).schema
print(myschema)
If you want to know more about this library, take a look at this document.
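To go one step further and materialize the file as a dataframe through the same file handle, something like the following should work (a sketch, assuming the whole table still fits in RAM; the project, bucket and path are placeholders):
import pyarrow.parquet as pq
import gcsfs
fs = gcsfs.GCSFileSystem(project='my-project-name')
with fs.open('my_bucket/path/to/file.parquet') as f:
    df = pq.ParquetFile(f).read().to_pandas()  # load the full Parquet file into pandas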
The problem was being caused by a deprecated image version on the VM.
According to GCP support, you can check whether the image is deprecated as follows:
1. Go to GCE and click on "VM instances".
2. Click on the VM instance in question.
3. Look for the "Boot disk" section and click on the Image link.
4. If the image has been deprecated, there will be a field showing it.
The solution is to create a new Notebook instance and export/import whatever you want to keep. That way the new VM will have an updated image, which hopefully includes a fix for the problem.

Authentication into Google BigQuery without using Environment Variables

I'm making a simple script in Google Colab (Jupyter Notebook) that can grab stuff from our big data environment (in BigQuery) and analyze it. I'm avoiding environment variables, as most of the engineers won't know how to set them up. Ideally, I'm looking for a way to authenticate using our Google username/password. Does anyone have any experience authenticating into GBQ this way? Thanks
The Colab docs contain an example showing how to issue an authenticated BigQuery query.
from google.colab import auth
auth.authenticate_user()
print('Authenticated')
Then,
# Display query output immediately
%%bigquery --project yourprojectid
SELECT
COUNT(*) as total_rows
FROM `bigquery-public-data.samples.gsod`
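If you prefer plain Python over the cell magic, the same authenticated session can also be used with the BigQuery client library (a sketch; yourprojectid is a placeholder):
from google.cloud import bigquery
client = bigquery.Client(project='yourprojectid')
query = 'SELECT COUNT(*) AS total_rows FROM `bigquery-public-data.samples.gsod`'
df = client.query(query).to_dataframe()  # run the query and load the result into pandas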

Import data to Google SQL from a gsheet

What is the best way to import data into a Google Cloud SQL database from a spreadsheet file?
I have to import two files with 4k rows each into a DB.
I've tried to load 4k rows (one file) using Apps Script, and the result was:
Execution succeeded [294.336 seconds total runtime]
Ideas?
Code here
https://pastebin.com/3RiM1CNb
Depends a bit on how often you need this done. From your comment "No, this files will be uploaded two times for month in gdrive.", I think you mean 2 times/month.
If you need this done programmatically, I suggest using a cron job and having either App Engine or a local machine run it.
You can access the spreadsheet with a service account (add it to the users of that spreadsheet like any other user) using the client libraries (check the quickstarts for your language of preference) and process the data with that. How to actually process the data depends on your language of choice, but in the end it's simply inserting rows into MySQL; a rough sketch follows below.
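As an illustration of that approach, here is a minimal Python sketch (every ID, credential file, table and column name below is a placeholder; it assumes the Sheets API client and a MySQL driver such as PyMySQL are installed):
from google.oauth2 import service_account
from googleapiclient.discovery import build
import pymysql
# Read the sheet with a service account that has been shared on the spreadsheet.
creds = service_account.Credentials.from_service_account_file(
    'service-account.json',
    scopes=['https://www.googleapis.com/auth/spreadsheets.readonly'])
sheets = build('sheets', 'v4', credentials=creds)
rows = sheets.spreadsheets().values().get(
    spreadsheetId='SPREADSHEET_ID', range='Sheet1!A2:D').execute().get('values', [])
# Insert the rows into the Cloud SQL (MySQL) instance.
conn = pymysql.connect(host='CLOUD_SQL_IP', user='USER', password='PASSWORD', db='DB_NAME')
with conn.cursor() as cur:
    cur.executemany(
        'INSERT INTO my_table (col_a, col_b, col_c, col_d) VALUES (%s, %s, %s, %s)', rows)
conn.commit()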
The simplest option would be to export to CSV and import this into Cloud SQL. Note that you may need to reformat this into something Cloud SQL understands, but that depends on the source data in Google Sheets.
As for the error you're getting, you're exceeding the maximum allowed runtime for Apps Script, which is 6 minutes.

Google Cloud Logging export to Big Query does not seem to work

I am using the Google Cloud Logging web UI to export Google Compute Engine logs to a BigQuery dataset. According to the docs, you can even create the BigQuery dataset from this web UI (it simply asks you to give the dataset a name). It also automatically sets up the correct permissions on the dataset.
It seems to save the export configuration without errors, but a couple of hours have passed and I don't see any tables created in the dataset. According to the docs, exporting the logs will stream the logs to BigQuery and will create the table with the following template:
my_bq_dataset.compute_googleapis_com_activity_log_YYYYMMDD
https://cloud.google.com/logging/docs/export/using_exported_logs#log_entries_in_google_bigquery
I can't think of anything else that might be wrong. I am the owner of the project and the dataset is created in the correct project (I only have one project).
I also tried exporting the logs to a Google Storage bucket, and still no luck there. I set the permissions correctly using gsutil according to this:
https://cloud.google.com/logging/docs/export/configure_export#setting_product_name_short_permissions_for_writing_exported_logs
And finally I made sure that the 'source' I am trying to export actually has some log entries.
Thanks for the help!
Have you ingested any log entries since configuring the export? Cloud Logging only exports entries to BigQuery or Cloud Storage that arrive after the export configuration is set up. See https://cloud.google.com/logging/docs/export/using_exported_logs#exported_logs_availability.
You might not have given edit permission to 'cloud-logs@google.com' in the BigQuery console. Refer to this.

How to download all data in a Google BigQuery dataset?

Is there an easy way to directly download all the data contained in a certain dataset on Google BigQuery? I'm actually downloading it "as CSV", making one query after another, but it doesn't allow me to get more than 15k rows, and the rows I need to download number over 5M.
Thank you
You can run BigQuery extraction jobs using the Web UI, the command line tool, or the BigQuery API. The data can be extracted to Google Cloud Storage.
For example, using the command line tool:
First, install and authenticate using these instructions:
https://developers.google.com/bigquery/bq-command-line-tool-quickstart
Then make sure you have an available Google Cloud Storage bucket (see Google Cloud Console for this purpose).
Then, run the following command:
bq extract my_dataset.my_table gs://mybucket/myfilename.csv
More on extracting data via API here:
https://developers.google.com/bigquery/exporting-data-from-bigquery
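The same extraction can also be triggered from Python with the client library; a minimal sketch (project, dataset, table and bucket names are placeholders):
from google.cloud import bigquery
client = bigquery.Client(project='my-project-id')
# The wildcard lets BigQuery split a large table across multiple CSV files.
job = client.extract_table('my-project-id.my_dataset.my_table',
                           'gs://mybucket/myfilename-*.csv')
job.result()  # wait for the extract job to complete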
Detailed step-by-step to download large query output
1. Enable billing. You have to give your credit card number to Google to export the output, and you might have to pay. But the free quota (1 TB of processed data) should suffice for many hobby projects.
2. Create a project.
3. Associate billing with the project.
4. Do your query.
5. Create a new dataset.
6. Click "Show options" and enable "Allow Large Results" if the output is very large.
7. Export the query result to a table in the dataset.
8. Create a bucket on Cloud Storage.
9. Export the table to the created bucket on Cloud Storage. Make sure to click GZIP compression and use a name like <bucket>/prefix.gz. If the output is very large, the file name must contain an asterisk * and the output will be split into multiple files.
10. Download the table from Cloud Storage to your computer. It does not seem possible to download multiple files from the web interface if the large file got split up, but you can install gsutil and run:
gsutil -m cp -r 'gs://<bucket>/prefix_*' .
See also: Download files and folders from Google Storage bucket to a local folder
There is a gsutil package in Ubuntu 16.04, but it is an unrelated package. You must install and set it up as documented at: https://cloud.google.com/storage/docs/gsutil
11. Unzip locally:
for f in *.gz; do gunzip "$f"; done
Here is a sample project I needed this for, which motivated this answer.
For Python, you can use the following code; it will download the data as a dataframe.
from google.cloud import bigquery

def read_from_bqtable(bq_projectname, bq_query):
    client = bigquery.Client(bq_projectname)
    bq_data = client.query(bq_query).to_dataframe()
    return bq_data  # return dataframe

bigQueryTableData_df = read_from_bqtable('gcp-project-id', 'SELECT * FROM `gcp-project-id.dataset-name.table-name`')
Yes, the steps suggested by Michael Manoochehri are correct and an easy way to export data from Google BigQuery.
I have written a bash script so that you don't have to do these steps every time; just use my bash script.
Below is the GitHub URL:
https://github.com/rajnish4dba/GoogleBigQuery_Scripts
Scope:
1. Export data based on your BigQuery SQL.
2. Export data based on your table name.
3. Transfer your export file to an SFTP server.
Try it and let me know your feedback.
For help, use ExportDataFromBigQuery.sh -h