I would like to use public data from bigquery on datalab, and then into a pandas dataframe. How would I go about doing that. I have tried 3 different versions:
from google.cloud import bigquery
client = bigquery.Client()
QUERY = (
'SELECT pickup_datetime, dropoff_datetime FROM `bigquery-public-
data.new_york.tlc_yellow_trips_20*`') --also tried without the ` and wildcard
query = client.run_sync_query('%s LIMIT 100' % QUERY)
query.timeout_ms = 10000
query.run()
Error: BadRequest
import pandas as pd
df=pd.io.gbq.read_gbq("""
SELECT pickup_datetime, dropoff_datetime
FROM bigquery-public-data.new_york.tlc_yellow_trips_20*
LIMIT 10
""", project_id='bigquery-public-data')
Error: I am asked to give access to pandas, but when I agree, I get This site can’t be reached localhost refused to connect.
%%bq query
SELECT pickup_datetime, dropoff_datetime
FROM bigquery-public-data.new_york.tlc_yellow_trips_20*
LIMIT 10
Error: Just keeps Running
Any help on what I am doing wrong would be appreciated.
The codes above should work after some minor changes and after you granted google access to your local machine with your email using gcloud, install and initialize.
Get the project ID by typing bq after you initialized gcloud with gcloud init.
In my first code above use client = bigquery.Client(project_id='your project id')
Since you granted access, the second code should work as well, just update your project ID. If you dont use the limit function, then this may take a long time to load since pandas will transform the data to a dataframe.
And the third code will work as well.
Related
I am trying to load a CSV into BQ using a custom operator in Airflow.
My custom operator is using
load_job_config = bigquery.LoadJobConfig(
schema=self.schema_fields,
skip_leading_rows=self.skip_leading_rows,
source_format=bigquery.SourceFormat.CSV
)
load_job = client.load_table_from_uri(
'gs://' + self.source_bucket + "/" + self.source_object, self.dsp_tmp_dataset_table,
job_config=load_job_config
) #
The issue I am facing is that I always get errors
google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table nonprod-cloud-composer:dsp_data_transformation.tremorvideo_daily_datafeed. Field Date has changed type from TIMESTAMP to DATE
The exact same code when run outside of Airflow as a stand alone python works fine.
I am using the exactly same schema object , same source CSV file just that the environment is different.
Below is the high level steps followed
Created table in BQ
Using the
LOAD DATA OVERWRITE XXXX
FROM FILES (
format = 'CSV',
uris = ['gs://xxx.csv']);
This worked fine and the data was loaded into the table.
3. Truncated the table and tried to run the custom operator what has the code above listed. Then faced errors.
4. Created a simple python program with to test the bq load job and that works fine too.
Its just that when ever the same load job is triggered using Airflow the schema detection fails and leads to all sorts of errors.
I really liked BigQuery's Data Transfer Service. I have flat files in the exact schema sitting to be loaded into BQ. It would have been awesome to just setup DTS schedule that picked up GCS files that match a pattern and load the into BQ. I like the built in option to delete source files after copy and email in case of trouble. But the biggest bummer is that the minimum interval is 60 minutes. That is crazy. I could have lived with a 10 min delay perhaps.
So if I set up the DTS to be on demand, how can I invoke it from an API? I am thinking create a cronjob that calls it on demand every 10 mins. But I can’t figure out through the docs how to call it.
Also, what is my second best most reliable and cheapest way of moving GCS files (no ETL needed) into bq tables that match the exact schema. Should I use Cloud Scheduler, Cloud Functions, DataFlow, Cloud Run etc.
If I use Cloud Function, how can I submit all files in my GCS at time of invocation as one bq load job?
Lastly, anyone know if DTS will lower the limit to 10 mins in future?
So if I set up the DTS to be on demand, how can I invoke it from an API? I am thinking create a cronjob that calls it on demand every 10 mins. But I can’t figure out through the docs how to call it.
StartManualTransferRuns is part of the RPC library but does not have a REST API equivalent as of now. How to use that will depend on your environment. For instance, you can use the Python Client Library (docs).
As an example, I used the following code (you'll need to run pip install google-cloud-bigquery-datatransfer for the depencencies):
import time
from google.cloud import bigquery_datatransfer_v1
from google.protobuf.timestamp_pb2 import Timestamp
client = bigquery_datatransfer_v1.DataTransferServiceClient()
PROJECT_ID = 'PROJECT_ID'
TRANSFER_CONFIG_ID = '5e6...7bc' # alphanumeric ID you'll find in the UI
parent = client.project_transfer_config_path(PROJECT_ID, TRANSFER_CONFIG_ID)
start_time = bigquery_datatransfer_v1.types.Timestamp(seconds=int(time.time() + 10))
response = client.start_manual_transfer_runs(parent, requested_run_time=start_time)
print(response)
Note that you'll need to use the right Transfer Config ID and the requested_run_time has to be of type bigquery_datatransfer_v1.types.Timestamp (for which there was no example in the docs). I set a start time 10 seconds ahead of the current execution time.
You should get a response such as:
runs {
name: "projects/PROJECT_NUMBER/locations/us/transferConfigs/5e6...7bc/runs/5e5...c04"
destination_dataset_id: "DATASET_NAME"
schedule_time {
seconds: 1579358571
nanos: 922599371
}
...
data_source_id: "google_cloud_storage"
state: PENDING
params {
...
}
run_time {
seconds: 1579358581
}
user_id: 28...65
}
and the transfer is triggered as expected (nevermind the error):
Also, what is my second best most reliable and cheapest way of moving GCS files (no ETL needed) into bq tables that match the exact schema. Should I use Cloud Scheduler, Cloud Functions, DataFlow, Cloud Run etc.
With this you can set a cron job to execute your function every ten minutes. As discussed in the comments, the minimum interval is 60 minutes so it won't pick up files less than one hour old (docs).
Apart from that, this is not a very robust solution and here come into play your follow-up questions. I think these might be too broad to address in a single StackOverflow question but I would say that, for on-demand refresh, Cloud Scheduler + Cloud Functions/Cloud Run can work very well.
Dataflow would be best if you needed ETL but it has a GCS connector that can watch a file pattern (example). With this you would skip the transfer, set the watch interval and the load job triggering frequency to write the files into BigQuery. VM(s) would be running constantly in a streaming pipeline as opposed to the previous approach but a 10-minute watch period is possible.
If you have complex workflows/dependencies, Airflow has recently introduced operators to start manual runs.
If I use Cloud Function, how can I submit all files in my GCS at time of invocation as one bq load job?
You can use wildcards to match a file pattern when you create the transfer:
Also, this can be done on a file-by-file basis using Pub/Sub notifications for Cloud Storage to trigger a Cloud Function.
Lastly, anyone know if DTS will lower the limit to 10 mins in future?
There is already a Feature Request here. Feel free to star it to show your interest and receive updates
Now your can easy manual run transfer Bigquery data use RESTApi:
HTTP request
POST https://bigquerydatatransfer.googleapis.com/v1/{parent=projects/*/locations/*/transferConfigs/*}:startManualRuns
About this part > {parent=projects//locations//transferConfigs/*}, check on CONFIGURATION of your Transfer then notice part like image bellow.
Here
More here:
https://cloud.google.com/bigquery-transfer/docs/reference/datatransfer/rest/v1/projects.locations.transferConfigs/startManualRuns
following the Guillem's answer and the API updates, this is my new code:
import time
from google.cloud.bigquery import datatransfer_v1
from google.protobuf.timestamp_pb2 import Timestamp
client = datatransfer_v1.DataTransferServiceClient()
config = '34y....654'
PROJECT_ID = 'PROJECT_ID'
TRANSFER_CONFIG_ID = config
parent = client.transfer_config_path(PROJECT_ID, TRANSFER_CONFIG_ID)
start_time = Timestamp(seconds=int(time.time()))
request = datatransfer_v1.types.StartManualTransferRunsRequest(
{ "parent": parent, "requested_run_time": start_time }
)
response = client.start_manual_transfer_runs(request, timeout=360)
print(response)
For this to work, you need to know the correct TRANSFER_CONFIG_ID.
In my case, I wanted to list all the BigQuery Scheduled queries, to get a specific ID. You can do it like that :
# Put your projetID here
PROJECT_ID = 'PROJECT_ID'
from google.cloud import bigquery_datatransfer_v1
bq_transfer_client = bigquery_datatransfer_v1.DataTransferServiceClient()
parent = bq_transfer_client.project_path(PROJECT_ID)
# Iterate over all results
for element in bq_transfer_client.list_transfer_configs(parent):
# Print Display Name for each Scheduled Query
print(f'[Schedule Query Name]:\t{element.display_name}')
# Print name of all elements (it contains the ID)
print(f'[Name]:\t\t{element.name}')
# Extract the IDs:
TRANSFER_CONFIG_ID= element.name.split('/')[-1]
print(f'[TRANSFER_CONFIG_ID]:\t\t{TRANSFER_CONFIG_ID}')
# You can print the entire element for debug purposes
print(element)
We are using the Beta Scheduled query feature of BigQuery.
Details: https://cloud.google.com/bigquery/docs/scheduling-queries
We have few ETL scheduled queries running overnight to optimize the aggregation and reduce query cost. It works well and there hasn't been much issues.
The problem arises when the person who scheduled the query using their own credentials leaves the organization. I know we can do "update credential" in such cases.
I read through the document and also gave it some try but couldn't really find if we can use a service account instead of individual accounts to schedule queries.
Service accounts are cleaner and ties up to the rest of the IAM framework and is not dependent on a single user.
So if you have any additional information regarding scheduled queries and service account please share.
Thanks for taking time to read the question and respond to it.
Regards
BigQuery Scheduled Query now does support creating a scheduled query with a service account and updating a scheduled query with a service account. Will these work for you?
While it's not supported in BigQuery UI, it's possible to create a transfer (including a scheduled query) using python GCP SDK for DTS, or from BQ CLI.
The following is an example using Python SDK:
r"""Example of creating TransferConfig using service account.
Usage Example:
1. Install GCP BQ python client library.
2. If it has not been done, please grant p4 service account with
iam.serviceAccout.GetAccessTokens permission on your project.
$ gcloud projects add-iam-policy-binding {user_project_id} \
--member='serviceAccount:service-{user_project_number}#'\
'gcp-sa-bigquerydatatransfer.iam.gserviceaccount.com' \
--role='roles/iam.serviceAccountTokenCreator'
where {user_project_id} and {user_project_number} are the user project's
project id and project number, respectively. E.g.,
$ gcloud projects add-iam-policy-binding my-test-proj \
--member='serviceAccount:service-123456789#'\
'gcp-sa-bigquerydatatransfer.iam.gserviceaccount.com'\
--role='roles/iam.serviceAccountTokenCreator'
3. Set environment var PROJECT to your user project, and
GOOGLE_APPLICATION_CREDENTIALS to the service account key path. E.g.,
$ export PROJECT_ID='my_project_id'
$ export GOOGLE_APPLICATION_CREDENTIALS=./serviceacct-creds.json'
4. $ python3 ./create_transfer_config.py
"""
import os
from google.cloud import bigquery_datatransfer
from google.oauth2 import service_account
from google.protobuf.struct_pb2 import Struct
PROJECT = os.environ["PROJECT_ID"]
SA_KEY_PATH = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
credentials = (
service_account.Credentials.from_service_account_file(SA_KEY_PATH))
client = bigquery_datatransfer.DataTransferServiceClient(
credentials=credentials)
# Get full path to project
parent_base = client.project_path(PROJECT)
params = Struct()
params["query"] = "SELECT CURRENT_DATE() as date, RAND() as val"
transfer_config = {
"destination_dataset_id": "my_data_set",
"display_name": "scheduled_query_test",
"data_source_id": "scheduled_query",
"params": params,
}
parent = parent_base + "/locations/us"
response = client.create_transfer_config(parent, transfer_config)
print response
As far as I know, unfortunately you can't use a service account to directly schedule queries yet. Maybe a Googler will correct me, but the BigQuery docs implicitly state this:
https://cloud.google.com/bigquery/docs/scheduling-queries#quotas
A scheduled query is executed with the creator's credentials and
project, as if you were executing the query yourself
If you need to use a service account (which is great practice BTW), then there are a few workarounds listed here. I've raised a FR here for posterity.
This question is very old and came on this thread while I was searching for same.
Yes, It is possible to use service account to schedule big query jobs.
While creating schedule query job, click on "Advance options", you will get option to select service account.
By default is uses credential of requesting user.
Image from bigquery "create schedule query"1
Fairly new to Python so apologies if this is a simple ask. I have browsed other answered questions but can't seem to get it functioning consistently.
I found the below script which prints the top result from google for a set of defined terms. It will work the first few times that I run it but will display the following error when I have searched 20 or so terms:
Traceback (most recent call last):
File "term2url.py", line 28, in <module>
results = json['responseData']['results']
TypeError: 'NoneType' object has no attribute '__getitem__'
From what I can gather, this indicates that one of the attributes does not have a defined value (potentially a result of google blocking me?). I attempted to solve the issue by adding in the else clause though I still run into the same problem.
Any help would be greatly appreciated; I have pasted the full code below.
Thanks!
#
# This is a quick and dirty script to pull the most likely url and description
# for a list of terms. Here's how you use it:
#
# python term2url.py < {a txt file with a list of terms} > {a tab delimited file of results}
#
# You'll must install the simpljson module to use it
#
import urllib
import urllib2
import simplejson
import sys
# Read the terms we want to convert into URL from info redirected from the command line
terms = sys.stdin.readlines()
for term in terms:
# Define the query to pass to Google Search API
query = urllib.urlencode({'q' : term.rstrip("\n")})
url = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s" % (query)
# Fetch the results and convert to JSON format
search_results = urllib2.urlopen(url)
json = simplejson.loads(search_results.read())
# Process the results by pulling the first record, which has the best match
results = json['responseData']['results']
for r in results[:1]:
if results is not None:
url = r['url']
desc = r['content'].encode('ascii', 'replace')
else:
url = "none"
desc = "none"
# Print the results to stdout. Use redirect to capture the output
print "%s\t%s" % (term.rstrip("\n"), url)
import time
time.sleep(1)
Here are some Python details for you first:
None is a valid object in Python, of the type NoneType:
print(type(None))
Produces:
< class 'NoneType' >
And the no attribute error you got is normal when you try to access some method or attribute of an object that doesn't have that attribute. In this case, you were attempting to use the __getitem__ syntax (object[item_index]), which NoneType objects don't support because it doesn't have the __getitem__ method.
The point of the previous explanation is that your assumption about what your error means is correct: your results object is essentially empty.
As for why you're hitting this in the first place, I believe you are running up against Google's API limits. It looks like you're using the old API that is now deprecated. The number of search results (not queries) used to be limited to around 64 per query, and there used to be no rate or per-day limit. However, since it's been deprecated for over 5 years now, there may be new undocumented limits.
I don't think it necessarily has anything to do with SPAM, but I do believe it is an undocumented limit.
I need to write a "standalone" script in Python to upload sales taxes to the account_tax table in the database using ONLY the ORM module of OpenERP. What I would like to do is something like the pseudo code below.
Can someone provide me a more details on the following:
1) what sys.path's do I need to set
2) what modules do I need to import before importing the "account" module. Currently when I import the "account" module I get the following error:
AssertionError: The report "report.custom" already exists!
3) What is the proper way to get my database cursor. In the code below I am simply calling psycopg2 directly to get a cursor.
If this approach cannot work, can anyone suggest an alternative approach other than writing XML files to load the data from the OpenERP application itself. This process needs to run outside of the the standard OpenERP application.
PSEUDO CODE:
import sys
# set Python paths to access openerp modules
sys.path.append("./openerp")
sys.path.append("./openerp/addons")
# import OpenERP
import openerp
# import the account addon modules that contains the tables
# to be populated.
import account
# define connection string
conn_string2 = "dbname='test2' user='xyz' password='password'"
# get a db connection
conn = psycopg2.connect(conn_string2)
# conn.cursor() will return a cursor object
cursor = conn.cursor()
# and finally use the ORM to insert data into table.
If you wanna do it via web service then have look at the OpenERP XML-RPC Web services
Example code top work with OpenERP Web Services :
import xmlrpclib
username = 'admin' #the user
pwd = 'admin' #the password of the user
dbname = 'test' #the database
# OpenERP Common login Service proxy object
sock_common = xmlrpclib.ServerProxy ('http://localhost:8069/xmlrpc/common')
uid = sock_common.login(dbname, username, pwd)
#replace localhost with the address of the server
# OpenERP Object manipulation service
sock = xmlrpclib.ServerProxy('http://localhost:8069/xmlrpc/object')
partner = {
'name': 'Fabien Pinckaers',
'lang': 'fr_FR',
}
#calling remote ORM create method to create a record
partner_id = sock.execute(dbname, uid, pwd, 'res.partner', 'create', partner)
More clearly you can also use the OpenERP Client lib
Example Code with client lib :
import openerplib
connection = openerplib.get_connection(hostname="localhost", database="test", \
login="admin", password="admin")
user_model = connection.get_model("res.users")
ids = user_model.search([("login", "=", "admin")])
user_info = user_model.read(ids[0], ["name"])
print user_info["name"]
You see both way are good but when you use the client lib, code is less and easy to understand while using xmlrpc proxy is lower level calls that you will handle
Hope this will help you.
As per my view one must go for XMLRPC or NETSVC services provided by Open ERP for such needs.
You don't need to import accounts module of Open ERP, there are possibilities that other modules have inherited accounts.tax object and had altered its behaviour as per your business needs.
Eventually if you feed data by calling those methods manually without using Open ERP Web service its possible you'll get undesired result / unexpected failures / inconsistent database state.
You can use Erppeek to browse data, but not sure if you can really upload data to DB, personally I use/prefer XMLRPC
Why don't you use the xmlrpc call of openerp.
it will not need to import account or openerp . and even you can have all orm functionality.
You can use python library to access openerp server using xmlrpc service.
Please check https://github.com/OpenERP/openerp-client-lib
It is officially supported by OpenERP SA.
If you want to interacti directly with the DB, you could just import psycopg2 and:
conn = psycopg2.connect(dbname='dbname', user='dbuser', password='dbpassword', host='dbhost')
cur = conn.cursor()
cur.execute('select * from table where id = %d' % table_id)
cur.execute('insert into table(column1, column2) values(%d, %d)' % (value1, value2))
cur.close()
conn.close()
Why you want to fix it like that?! You should create a localization module and define data in XML files. This is the standard way to fix such a problem in OpenERP.
You want to insert sales taxes for which country? Explain more plz.
from openerp.modules.registry import RegistryManager
registry = RegistryManager.get("databasename")
with registry.cursor() as cr:
user = registry.get('res.users').browse(cr, userid, listids)
print user