I'm making a simple script in Google Colab (Jupyter notebook) that can grab data from our big data environment (in BigQuery) and analyze it. I'm avoiding environment variables because most of the engineers won't know how to set them up. Ideally, I'm looking for a way to authenticate using our Google username/password. Does anyone have any experience authenticating into GBQ this way? Thanks
The Colab docs contain an example showing how to issue an authenticated BigQuery query.
from google.colab import auth
auth.authenticate_user()
print('Authenticated')
Then,
# Display query output immediately
%%bigquery --project yourprojectid
SELECT
COUNT(*) as total_rows
FROM `bigquery-public-data.samples.gsod`
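If you'd rather work with the results in pandas instead of the cell magic, a minimal sketch using the google-cloud-bigquery client (yourprojectid is a placeholder for your own project ID) could look like this:
from google.cloud import bigquery

# Reuses the credentials established by auth.authenticate_user() above
client = bigquery.Client(project='yourprojectid')

sql = '''
SELECT COUNT(*) AS total_rows
FROM `bigquery-public-data.samples.gsod`
'''

# Run the query and pull the result into a pandas DataFrame
df = client.query(sql).to_dataframe()
print(df)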
Goal
I'd like to have a credentials file for things like API keys stored in a place where someone I've shared the Colab notebook with can't access the file or the information.
Situation
I'm calling several APIs in a Colab notebook and have multiple keys for different APIs. If there are approaches at different levels of complexity, I'd prefer a simpler one.
Current attempts
I'm storing the keys in the main Python notebook while I research the best way to approach this. I'm pretty new to authentication, so I'd prefer a simpler solution. I haven't seen any articles addressing this directly.
Greatly appreciate any input on this.
You can store the credential files in your Google Drive.
After you mount your Drive, they are available at /content/drive/MyDrive/ and only you can access them; other people need their own credential files in their own Drive.
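For example, a minimal sketch (the folder, file, and key names below are placeholders for wherever you keep your keys):
import json
from google.colab import drive

# Prompts for authorisation and mounts your Drive under /content/drive
drive.mount('/content/drive')

# Hypothetical path: a JSON file of API keys stored in your own Drive
with open('/content/drive/MyDrive/secrets/api_keys.json') as f:
    api_keys = json.load(f)

some_key = api_keys['my_service']  # placeholder key name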
Is there a way to check how many people are using Google Colab at the same time you are?
I tried looking this up via Google and other sources and couldn't find any concrete information about the number of users using the GPUs at once.
No, there's no way to view overall Colab usage.
You could add analytics reporting to individual notebooks using Python APIs like this one:
https://developers.google.com/analytics/devguides/reporting/core/v4/quickstart/service-py
But, that would only report usage for users of a given notebook who execute code rather than users of Colab overall.
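As a rough sketch of what per-notebook reporting could look like with the Analytics Reporting API v4 from the quickstart above (the key file, view ID, and metric are placeholders, and you would still need to send hits from the notebook for there to be anything to report):
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']
KEY_FILE = 'service-account.json'  # placeholder: your service account key file
VIEW_ID = '123456789'              # placeholder: your Analytics view ID

credentials = service_account.Credentials.from_service_account_file(
    KEY_FILE, scopes=SCOPES)
analytics = build('analyticsreporting', 'v4', credentials=credentials)

# Pull the number of sessions recorded for the tracked property over the last week
response = analytics.reports().batchGet(body={
    'reportRequests': [{
        'viewId': VIEW_ID,
        'dateRanges': [{'startDate': '7daysAgo', 'endDate': 'today'}],
        'metrics': [{'expression': 'ga:sessions'}],
    }]
}).execute()
print(response)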
I have a Google Dataflow batch job written in Java.
This Java code accesses BigQuery, performs a few transformations, and then outputs back into BigQuery.
The code can access the BigQuery tables just fine.
But when I choose a table that is backed by a federated source like Google Sheets, it doesn't work.
It fails with: no OAuth token with Google Drive scope found.
Pipeline options
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p1 = Pipeline.create(options);
Any ideas?
Can you try:
gcloud auth login --enable-gdrive-access
before you launch the Dataflow job?
Answering my own question, but to get around this issue I'm going to use Google Apps Script to upload to BigQuery as a native table.
Please see this link.
I'm just going to modify the 'Load CSV data into BigQuery' code snippet and then create an installable trigger that executes this function every night to upload to BigQuery.
Beware that simple triggers like onEdit and onOpen can't perform operations that require authorisation, which is why an installable trigger is needed.
I tried two approaches to import a large table from Google BigQuery (about 50,000,000 rows, 18 GB) into a dataframe in Google Datalab, in order to do machine learning using TensorFlow.
First I use (all needed modules are imported):
data = bq.Query('SELECT {ABOUT_30_COLUMNS...} FROM `TABLE_NAME`').execute().result().to_dataframe()
Then it just keeps running... forever.
Even if I add LIMIT 1000000, nothing changes.
Second, I use:
data = pd.read_gbq(query='SELECT {ABOUT_30_COLUMNS...} FROM `TABLE_NAME` LIMIT 1000000', dialect='standard', project_id='PROJECT_ID')
It runs well at first, but when it gets to about 450,000 rows (calculated from the percentage and the total row count), it gets stuck at:
Got page: 32; 45.0% done. Elapsed 293.1 s.
And I cannot find how to enable allowLargeResults in read_gbq().
As its documentation suggests, I try:
data = pd.read_gbq(query='SELECT {ABOUT_30_COLUMNS...} FROM `TABLE_NAME` LIMIT 1000000', dialect ='standard', project_id='PROJECT_ID', configuration = {'query': {'allowLargeResult': True}})
Then I get:
read_gbq() got an unexpected keyword argument 'configuration'
So I failed to import even 1,000,000 rows into Google Cloud Datalab.
I actually want to import 50 times the data size.
Any idea about it?
Thanks
Before loading large datasets into Google Cloud Datalab, make sure to consider alternatives such as those mentioned in the comments of this answer. Use sampled data for the initial analysis, determine the correct model for the problem, and then use a pipeline approach, such as Google Cloud Dataflow, to process the large dataset.
There is an interesting discussion regarding Datalab performance improvements when downloading data from BigQuery to Datalab here. Based on these performance tests, a performance improvement was merged into Google Cloud Datalab in Pull Request #339. This improvement does not appear to be mentioned in the release notes for Datalab, but I believe the fixes are included as part of Datalab 1.1.20170406. Please check the version of Google Cloud Datalab to make sure that you're running at least version 1.1.20170406. To check the version, first click on the user icon in the top-right corner of the navigation bar in Cloud Datalab, then click About Datalab.
Regarding the pandas.read_gbq() command that appears to be stuck, I would like to offer a few suggestions:
Open a new issue in the pandas-gbq repository here.
Try extracting data from BigQuery to Google Cloud Storage in CSV format, for example, which you can then load into a dataframe using pd.read_csv (see the sketch after the two methods below). Here are two methods to do this:
Using Google BigQuery/Cloud Storage CLI tools:
Using the bq command-line tool and the gsutil command-line tool, extract data from BigQuery to Google Cloud Storage, and then download the object into Google Cloud Datalab. To do this, run bq extract <source_table> <destination_uris>, followed by gsutil cp gs://[DESTINATION_BUCKET_NAME]/[OBJECT_NAME] [LOCAL_OBJECT_LOCATION]
Using Google Cloud Datalab
import google.datalab.bigquery as bq
import google.datalab.storage as storage
bq.Query(<your query>).execute(output_options=bq.QueryOutput.file(path='gs://<your_bucket>/<object name>', use_cache=False)).result()
result = storage.Bucket(<your_bucket>).object(<object name>).download()
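Either way, the last step is just reading the CSV into pandas. A minimal sketch, assuming the extract was written as CSV and that download() above returns the file contents as bytes:
import io
import pandas as pd

# Read the in-memory bytes downloaded above into a dataframe
df = pd.read_csv(io.BytesIO(result))

# If you instead copied the file locally with gsutil, it is simply:
# df = pd.read_csv('extracted_table.csv')  # placeholder local file name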
Regarding the error read_gbq() got an unexpected keyword argument 'configuration': the ability to pass arbitrary keyword arguments (such as configuration) was added in version 0.20.0. I believe this error is caused by the fact that pandas is not up to date. You can check the version of pandas installed by running
import pandas
pandas.__version__
To upgrade to version 0.20.0 or later, run pip install --upgrade pandas pandas-gbq. This will also install pandas-gbq, which is an optional dependency for pandas.
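Once pandas and pandas-gbq are upgraded, the configuration keyword is accepted; a minimal sketch (the query and project ID are placeholders, and the keys mirror the BigQuery query job configuration):
import pandas as pd

df = pd.read_gbq(
    query='SELECT {ABOUT_30_COLUMNS...} FROM `TABLE_NAME` LIMIT 1000000',  # placeholder query
    project_id='PROJECT_ID',                                               # placeholder project
    dialect='standard',
    configuration={'query': {'useQueryCache': False}}  # example configuration key
)
Note that in the BigQuery API the allowLargeResults flag applies to legacy SQL and also requires a destination table to be set in the same configuration.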
Alternatively, you could try iterating over the table in Google Cloud Datalab. This works, but it's likely slower. This approach was mentioned in another StackOverflow answer here: https://stackoverflow.com/a/43382995/5990514
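If you want to page through the table with the plain google-cloud-bigquery client rather than the Datalab API, here is a sketch under that assumption (the project and table names are placeholders):
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client(project='PROJECT_ID')             # placeholder project
table = client.get_table('PROJECT_ID.dataset.table_name')  # placeholder table

# Stream the table a page at a time instead of materialising everything at once
frames = []
for page in client.list_rows(table, page_size=100000).pages:
    frames.append(pd.DataFrame([dict(row) for row in page]))

data = pd.concat(frames, ignore_index=True)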
I hope this helps! Please let me know if you have any issues so I can improve this answer.
Anthonios Partheniou
Contributor at Cloud Datalab
Project Maintainer at pandas-gbq
I am using the Google Cloud Logging web UI to export Google Compute Engine logs to a BigQuery dataset. According to the docs, you can even create the BigQuery dataset from this web UI (it simply asks you to give the dataset a name). It also automatically sets up the correct permissions on the dataset.
It seems to save the export configuration without errors, but a couple of hours have passed and I don't see any tables created in the dataset. According to the docs, exporting the logs will stream the logs to BigQuery and will create tables with the following name template:
my_bq_dataset.compute_googleapis_com_activity_log_YYYYMMDD
https://cloud.google.com/logging/docs/export/using_exported_logs#log_entries_in_google_bigquery
I can't think of anything else that might be wrong. I am the owner of the project and the dataset is created in the correct project (I only have one project).
I also tried exporting the logs to a Google Cloud Storage bucket and still had no luck there. I set the permissions correctly using gsutil according to this:
https://cloud.google.com/logging/docs/export/configure_export#setting_product_name_short_permissions_for_writing_exported_logs
And finally I made sure that the 'source' I am trying to export actually has some log entries.
Thanks for the help!
Have you ingested any log entries since configuring the export? Cloud Logging exports to BigQuery or Cloud Storage only those entries that arrive after the export configuration is set up. See https://cloud.google.com/logging/docs/export/using_exported_logs#exported_logs_availability.
You might not have given edit permission to 'cloud-logs@google.com' in the BigQuery console. Refer to this.