How can I use BigQuery's dataset in Data Lab - google-bigquery

I have some datasets in BigQuery, and I wonder if there is a way to use the same datasets in Datalab? As the datasets are big, I can't download them and reload them in Datalab.
Thank you very much.

The BigQuery Python client library supports querying data stored in BigQuery. To load the magic commands from the client library, paste the following code into the first cell of the notebook:
%load_ext google.cloud.bigquery
%load_ext is one of the many Jupyter built-in magic commands.
The BigQuery client library provides the %%bigquery cell magic, which runs a SQL query and returns the results as a pandas DataFrame.
You can query data from a public dataset or from the datasets in your project:
%%bigquery
SELECT *
FROM `MY_PROJECT.MY_DATASET.MY_TABLE`
LIMIT 50
I was able to successfully get data from the dataset without any issues.
You can follow this tutorial. I hope it helps.
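If you want the results in a DataFrame variable for further processing, the cell magic also accepts a variable name (e.g. %%bigquery my_df), or you can call the client library directly. A minimal sketch, reusing the same placeholder table path as above:
from google.cloud import bigquery

# In Datalab / notebook environments the client picks up the default project credentials.
client = bigquery.Client()

sql = """
    SELECT *
    FROM `MY_PROJECT.MY_DATASET.MY_TABLE`
    LIMIT 50
"""

# Run the query and materialize the result as a pandas DataFrame
df = client.query(sql).to_dataframe()
print(df.head())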

Related

Enable sync between BigQuery and Snowflake

We are using BigQuery and Snowflake (Azure-hosted), and we often export data from BigQuery and import it into Snowflake and vice versa. Is there any easy way to integrate the two systems, such as automatically syncing a BigQuery table to Snowflake, rather than exporting to a file and importing it?
You should have a look at Change Data Capture (CDC) solutions for automated sync.
Some of them have native BigQuery and Snowflake connectors.
Some examples :
HVR
Qlik Replicate
Striim
...
There are many ways to implement this, and the best one will depend on the nature of your data.
For example, if every day you have new data in BigQuery, then all you need to do is set up a daily export of the new data from BigQuery to GCS. Then it's easy to set up Snowflake to ingest new data from GCS whenever it shows up, using Snowpipe:
https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-gcs.html
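As a rough sketch of the export step with the BigQuery Python client (the project, dataset, table, and bucket names here are placeholders):
from google.cloud import bigquery

client = bigquery.Client()

# Export the day's table to newline-delimited JSON files in GCS (placeholder names)
job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
)
extract_job = client.extract_table(
    "my-project.my_dataset.daily_table",   # source table
    "gs://my-export-bucket/daily/*.json",  # destination URI pattern
    job_config=job_config,
)
extract_job.result()  # wait for the export to finish
A Snowpipe with auto-ingest on that bucket, as described in the link above, can then load the files as they arrive.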
But then how often do you want to sync this data? Is it append only, or does it need to account for past data changing? How do you solve conflicts when the same row changes in different ways on both sides? Etc.
I have the same scenario. I've built this template in a Jupyter Notebook. I've done a gap analysis after a few days and, at least in our case, it seems that Firebase/Google Analytics adds more rows to the already compiled daily tables even a few days later. We have about 10% more rows in an older BigQuery daily table than what was captured in Snowflake, so be mindful of the gap. To this date, the template I've created is not able to handle the missing rows. For us it works because we look at aggregated values (daily active users, retention, etc.) and the gap there is minimal.
You could use Sling, which I worked on. It is a tool that allows you to copy data between databases (including a BigQuery source and a Snowflake destination) using bulk loading methodologies. There is a free CLI version and a Cloud (hosted) version. I actually wrote a blog entry about this in detail (albeit with an AWS destination, but similar logic), but essentially, if you use the CLI version, you can run one command after setting up your credentials:
$ sling run --src-conn BIGQUERY --src-stream segment.team_activity --tgt-conn SNOWFLAKE --tgt-object public.activity_teams --mode full-refresh
11:37AM INF connecting to source database (bigquery)
11:37AM INF connecting to target database (snowflake)
11:37AM INF reading from source database
11:37AM INF writing to target database [mode: full-refresh]
11:37AM INF streaming data
11:37AM INF dropped table public.activity_teams
11:38AM INF created table public.activity_teams
11:38AM INF inserted 77668 rows
11:38AM INF execution succeeded

Creating a table in BigQuery by uploading a CSV

I am new to BigQuery. I am trying to create a table by uploading a CSV; its size is 290 KB. Even if I fill in all the required information, the three dots beside "Create table" keep moving (like loading), but even after waiting for a long time the table doesn't get created.
You can upload the CSV to a bucket and then reference it from the BigQuery table creation panel.
Here is the official guide from Google, with screenshots; it should be rather simple: https://cloud.google.com/bigquery/docs/schema-detect
In step 4 of the creation form, select the path to the file and the CSV format.
In step 5 you can either keep everything as it is or select "External table" (which I recommend), so that you can delete the table in case of error without losing the CSV.
BigQuery should automatically handle the rest. Please share more detailed information in case of error.
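If the console load keeps spinning, loading the file from the bucket with the BigQuery Python client is another option. A minimal sketch with schema auto-detection (the project, dataset, table, and bucket names are placeholders):
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder destination table and source file
table_id = "my-project.my_dataset.my_table"
uri = "gs://my-bucket/my_file.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to complete
print(client.get_table(table_id).num_rows, "rows loaded")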
There are a couple of ways through which you can upload a CSV file to BigQuery, as given below:
Write an Apache Beam pipeline (Python/Java) that reads the file and loads the data into BigQuery; you can combine the sample code for reading and writing (see the Beam sketch after the script below).
Write a Python script which is responsible for loading the data into BigQuery, for example:
import os
import pandas as pd

# Read the source CSV into a DataFrame
dept_dt = pd.read_csv('dept_data')
#print(dept_dt)

# Replace with your project ID
project = 'xxxxx-aaaaa-de'

# Replace with the path to your service account key file
path_service_account = 'xxxxx-aaaa-jhhhhh9874.json'
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = path_service_account

# Load the DataFrame into BigQuery (requires the pandas-gbq package);
# fails if the destination table already exists
dept_dt.to_gbq(destination_table='test1.Emp_data1', project_id=project, if_exists='fail')
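For the Apache Beam option above, a rough Python sketch; the project, bucket, column names, and schema are hypothetical and would need to match your CSV:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical project and bucket; temp_location is needed for BigQuery file loads
options = PipelineOptions(project='my-project', temp_location='gs://my-bucket/tmp')

def parse_csv_line(line):
    # Assumes a simple two-column CSV without quoted commas
    dept_id, dept_name = line.split(',')
    return {'dept_id': dept_id, 'dept_name': dept_name}

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadCSV' >> beam.io.ReadFromText('gs://my-bucket/dept_data.csv', skip_header_lines=1)
     | 'ParseRows' >> beam.Map(parse_csv_line)
     | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
         'my-project:test1.Emp_data1',
         schema='dept_id:STRING,dept_name:STRING',
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))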

Google BigQuery Share Dataset Access - how can I know who have been given the shared access to these datasets/tables

We can share access to an individual dataset through these steps:
https://cloud.google.com/bigquery/docs/dataset-access-controls
Select the dataset
Share the dataset
We can do this one by one, and we can even share an individual BigQuery table this way.
But how can I know who has been given shared access to all these datasets/tables in my GCP project, instead of going to each dataset and each table and checking the "Share" link manually?
Thanks!
I personally use a Python function that shows that information for all datasets:
from google.cloud import bigquery
import os

client = bigquery.Client()
datasets = list(client.list_datasets())
project = client.project

print("Datasets in project {}:".format(project))
for dataset in datasets:
    # "bq show" prints each dataset's metadata, including its access controls
    os.system('bq show --project_id {} {}'.format(project, dataset.dataset_id))
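If you'd rather stay within the Python client instead of shelling out to bq, a minimal sketch that reads each dataset's access entries directly (the printed field names are my own formatting):
from google.cloud import bigquery

client = bigquery.Client()
for item in client.list_datasets():
    # Fetch the full dataset resource to get its access control list
    dataset = client.get_dataset(item.reference)
    print("Dataset: {}".format(dataset.dataset_id))
    for entry in dataset.access_entries:
        # Each entry pairs a role (e.g. READER, WRITER, OWNER) with who holds it
        print("  role={} entity_type={} entity_id={}".format(
            entry.role, entry.entity_type, entry.entity_id))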
You can do:
bq show --project_id PROJECT DATASET

Imported data into BigQuery but can only access the 'table' via job history and can't see it in datasets

I've imported some data into BigQuery; however, I can only query the table from Job History and can't seem to add it as a dataset.
What do I need to do in order to convert this into a dataset?
How I imported the data: it was done via a third-party app (StitchData) which had access to my Google Analytics.
Here are some additional import details.
From your screenshot, "Destination table" should be in the format [DATASET].[TABLE].
Also, "Table Info" > "Table ID" should show the same information.
I guess you already have a dataset and just need a way to see it.
If so, this video may help you locate the dataset in the BigQuery Classic UI.

How to save Google Cloud Datalab output into BigQuery using R

I am using R in Google Cloud Datalab and I want to save my output, which is a table of strings created in the code itself, to BigQuery. I know there is a way to do it with Python by using bqr_create_table, so I am looking for the equivalent in R.
I have found this blog post from Gus Class on Google Cloud Platform which uses this code to write to BigQuery:
# Install bigrquery if you haven't already...
# install.packages("devtools")
# devtools::install_github("rstats-db/bigrquery")
library(bigrquery)
insert_upload_job("your-project-id", "test_dataset", "stash", stash)
Where "test_dataset" is the dataset in BigQuery, "stash" is the table inside the dataset and stash is any dataframe you have define with your data.
There is more information on how to authorize with bigrquery