Making a Google BigQuery from Python on Windows - sql

I am trying to do something which is very simple in other data services. I am trying to make a relatively simple SQL query and return it as a dataframe in python. I am on Windows 10 and using Phython 2.7 (specifically Canopy 1.7.4)
Typically this would be done with pandas.read_sql_query but due to some specifics with BigQuery they require a different method pandas.io.gbq.read_gbq
This method works fine unless you want to make a Big Query. If you make a Big Query on BigQuery you get the error
GenericGBQException: Reason: responseTooLarge, Message: Response too large to return. Consider setting allowLargeResults to true in your job configuration. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors
This was asked and answered before in this ticket but neither of the solutions are relevant for my case
Python BigQuery allowLargeResults with pandas.io.gbq
One solution is for python 3 so it is a nonstarter. The other is giving an error due to me being unable to set my credentials as an environment variable in windows.
ApplicationDefaultCredentialsError: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
I was able to download the JSON credentials file and I have set it as an environment variable in the few ways I know how but I still get the above error. Do I need to load this in some way in python? It seems to be looking for it but unable to find is correctly. Is there a special way to set it as an environment variable in this case?

You can do it in Python 2.7 by changing the default dialect from legacy to standard in pd.read_gbq function.
pd.read_gbq(query, 'my-super-project', dialect='standard')
Indeed, you can read in Big Query documentation for the parameter AllowLargeResults:
AllowLargeResults: For standard SQL queries, this flag is
ignored and large results are always allowed.

I have found two ways of directly importing the JSON credentials file. Both based on the original answer in Python BigQuery allowLargeResults with pandas.io.gbq
1) Credit to Tim Swast
First
pip install google-api-python-client
pip install google-auth
pip install google-cloud-core
then
replace
credentials = GoogleCredentials.get_application_default()
in create_service() with
from google.oauth2 import service_account
credentials = service_account.Credentials.from_service_account_file('path/file.json')
2)
Set the environment variable manually in the code like
import os,os.path
os.environ['GOOGLE_APPLICATION_CREDENTIALS']=os.path.expanduser('path/file.json')
I prefer method 2 since it does not require new modules to be installed and is also closer to the intended use of the JSON credentials.
Note:
You must create a destinationTable and add the information to run_query()

Here is a code that fully works within python 2.7 on Windows:
import pandas as pd
my_qry="<insert your big query here>"
### Here Put the data from your credentials file of the service account - all fields are available from there###
my_file="""{
"type": "service_account",
"project_id": "cb4recs",
"private_key_id": "<id>",
"private_key": "<your private key>\n",
"client_email": "<email>",
"client_id": "<id>",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://accounts.google.com/o/oauth2/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "<x509 url>"
}"""
df=pd.read_gbq(qry,project_id='<your project id>',private_key=my_file)
That's it :)

Related

Is there a way to automate this Python script in GCP?

I am a complete beginner in using GCP functions/products.
I have written the following code below, that takes a list of cities from a local folder, and call in weather data for each city in that list, eventually uploading those weather values into a table in BigQuery. I don't need to change the code anymore, as it creates new tables when a new week begins, now I would want to "deploy" (I am not even sure if this is called deploying a code) in the cloud for it to automatically run there. I tried using App Engine and Cloud Functions but faced issues in both places.
import requests, json, sqlite3, os, csv, datetime, re
from google.cloud import bigquery
#from google.cloud import storage
list_city = []
with open("list_of_cities.txt", "r") as pointer:
for line in pointer:
list_city.append(line.strip())
API_key = "PLACEHOLDER"
Base_URL = "http://api.weatherapi.com/v1/history.json?key="
yday = datetime.date.today() - datetime.timedelta(days = 1)
Date = yday.strftime("%Y-%m-%d")
table_id = f"sonic-cat-315013.weather_data.Historical_Weather_{yday.isocalendar()[0]}_{yday.isocalendar()[1]}"
credentials_path = r"PATH_TO_JSON_FILE"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = credentials_path
client = bigquery.Client()
try:
schema = [
bigquery.SchemaField("city", "STRING", mode="REQUIRED"),
bigquery.SchemaField("Date", "Date", mode="REQUIRED"),
bigquery.SchemaField("Hour", "INTEGER", mode="REQUIRED"),
bigquery.SchemaField("Temperature", "FLOAT", mode="REQUIRED"),
bigquery.SchemaField("Humidity", "FLOAT", mode="REQUIRED"),
bigquery.SchemaField("Condition", "STRING", mode="REQUIRED"),
bigquery.SchemaField("Chance_of_rain", "FLOAT", mode="REQUIRED"),
bigquery.SchemaField("Precipitation_mm", "FLOAT", mode="REQUIRED"),
bigquery.SchemaField("Cloud_coverage", "INTEGER", mode="REQUIRED"),
bigquery.SchemaField("Visibility_km", "FLOAT", mode="REQUIRED")
]
table = bigquery.Table(table_id, schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
type_=bigquery.TimePartitioningType.DAY,
field="Date", # name of column to use for partitioning
)
table = client.create_table(table) # Make an API request.
print(
"Created table {}.{}.{}".format(table.project, table.dataset_id, table.table_id)
)
except:
print("Table {}_{} already exists".format(yday.isocalendar()[0], yday.isocalendar()[1]))
def get_weather():
try:
x["location"]
except:
print(f"API could not call city {city_name}")
global day, time, dailytemp, dailyhum, dailycond, chance_rain, Precipitation, Cloud_coverage, Visibility_km
day = []
time = []
dailytemp = []
dailyhum = []
dailycond = []
chance_rain = []
Precipitation = []
Cloud_coverage = []
Visibility_km = []
for i in range(24):
dayval = re.search("^\S*\s" ,x["forecast"]["forecastday"][0]["hour"][i]["time"])
timeval = re.search("\s(.*)" ,x["forecast"]["forecastday"][0]["hour"][i]["time"])
day.append(dayval.group()[:-1])
time.append(timeval.group()[1:])
dailytemp.append(x["forecast"]["forecastday"][0]["hour"][i]["temp_c"])
dailyhum.append(x["forecast"]["forecastday"][0]["hour"][i]["humidity"])
dailycond.append(x["forecast"]["forecastday"][0]["hour"][i]["condition"]["text"])
chance_rain.append(x["forecast"]["forecastday"][0]["hour"][i]["chance_of_rain"])
Precipitation.append(x["forecast"]["forecastday"][0]["hour"][i]["precip_mm"])
Cloud_coverage.append(x["forecast"]["forecastday"][0]["hour"][i]["cloud"])
Visibility_km.append(x["forecast"]["forecastday"][0]["hour"][i]["vis_km"])
for i in range(len(time)):
time[i] = int(time[i][:2])
def main():
i = 0
while i < len(list_city):
try:
global city_name
city_name = list_city[i]
complete_URL = Base_URL + API_key + "&q=" + city_name + "&dt=" + Date
response = requests.get(complete_URL, timeout = 10)
global x
x = response.json()
get_weather()
table = client.get_table(table_id)
varlist = []
for j in range(24):
variables = city_name, day[j], time[j], dailytemp[j], dailyhum[j], dailycond[j], chance_rain[j], Precipitation[j], Cloud_coverage[j], Visibility_km[j]
varlist.append(variables)
client.insert_rows(table, varlist)
print(f"City {city_name}, ({i+1} out of {len(list_city)}) successfully inserted")
i += 1
except Exception as e:
print(e)
continue
In the code, there is direct reference to two files that is located locally, one is the list of cities and the other is the JSON file containing the credentials to access my project in GCP. I believed that uploading these files in Cloud Storage and referencing them there won't be an issue, but then I realised that I can't actually access my Buckets in Cloud Storage without using the credential files.
This leads me to being unsure whether the entire process would be possible at all, how do I authenticate in the first place from the cloud, if I need to reference that first locally? Seems like an endless circle, where I'd authenticate from the file in Cloud Storage, but I'd need authentication first to access that file.
I'd really appreciate some help here, I have no idea where to go from this, and I also don't have great knowledge in SE/CS, I only know Python R and SQL.
For Cloud Functions, the deployed function will run with the project service account credentials by default, without needing a separate credentials file. Just make sure this service account is granted access to whatever resources it will be trying to access.
You can read more info about this approach here (along with options for using a different service account if you desire): https://cloud.google.com/functions/docs/securing/function-identity
This approach is very easy, and keeps you from having to deal with a credentials file at all on the server. Note that you should remove the os.environ line, as it's unneeded. The BigQuery client will use the default credentials as noted above.
If you want the code to run the same whether on your local machine or deployed to the cloud, simply set a "GOOGLE_APPLICATION_CREDENTIALS" environment variable permanently in the OS on your machine. This is similar to what you're doing in the code you posted; however, you're temporarily setting it every time using os.environ rather than permanently setting the environment variable on your machine. The os.environ call only sets that environment variable for that one process execution.
If for some reason you don't want to use the default service account approach outlined above, you can instead directly reference it when you instantiate the bigquery.Client()
https://cloud.google.com/bigquery/docs/authentication/service-account-file
You just need to package the credential file with your code (i.e. in the same folder as your main.py file), and deploy it alongside so it's in the execution environment. In that case, it is referenceable/loadable from your script without needing any special permissions or credentials. Just provide the relative path to the file (i.e. assuming you have it in the same directory as your python script, just reference only the filename)
There may be different flavors and options to deploy your application and these will depend on your application semantics and execution constraints.
It will be too hard to cover all of them and the official Google Cloud Platform documentation cover all of them in great details:
Google Compute Engine
Google Kubernetes Engine
Google App Engine
Google Cloud Functions
Google Cloud Run
Based on my understanding of your application design, the most suitable ones would be:
Google App Engine
Google Cloud Functions
Google Cloud Run: Check these criteria to see if you application is a good fit for this deployment style
I would suggest using Cloud Functions as you deployment option in which case your application will default to using the project App Engine service account to authenticate itself and perform allowed actions. Hence, you should only check if the default account PROJECT_ID#appspot.gserviceaccount.com under the IAM configuration section has proper access to needed APIs (BigQuery in your case).
In such a setup, you want need to push your service account key to Cloud Storage which I would recommend to avoid in either cases, and you want need to pull it either as the runtime will handle authentication the function for you.

Acessing a database using zeep

I am trying to programmatically retrieve information from a database(BRENDA) using Zeep.
The following is the code.
import zeep
import hashlib
wsdl = "https://www.brenda-enzymes.org/soap/brenda.wsdl"
password = hashlib.sha256("xx".encode('utf-8')).hexdigest()
parameters = "xxx," + password + ",ecNumber*{}#organism*{}#".format("2.7.1.2", "Homo sapiens")
client = zeep.Client(wsdl=wsdl)
print(client)
km_string = client.getKmValue(parameters)
However, I get the following error
AttributeError: 'Client' object has no attribute 'getKmValue'
Could someone help me with this?
The above code works fine while using SOAPpy library in python 2. However, I couldn't successfully install SOAPpy in python 3, therefore I tried Zeep.
The sample code that shows SOAP implementation is available here
We fixed the webservice. It should work, now. Please have a look at the SOAP documentation on our website.
not the resolution but some hints.
1) with zeep you need to put .service between client and the name of the method. the correct syntax is client.service.getKmValue(parameters) (take a look at documentation)
anyway for zeep, getKmValue doesn't exists (but it exists on the wsdl schema and SoapUi see it).
you can also try py-suds,
but for some reason i obtain a 403 calling the wsdl.
from suds.client import Client
import hashlib
client = Client("https://www.brenda-enzymes.org/soap/brenda.wsdl")

Is it possible to use service accounts to schedule queries in BigQuery "Schedule Query" feature ?

We are using the Beta Scheduled query feature of BigQuery.
Details: https://cloud.google.com/bigquery/docs/scheduling-queries
We have few ETL scheduled queries running overnight to optimize the aggregation and reduce query cost. It works well and there hasn't been much issues.
The problem arises when the person who scheduled the query using their own credentials leaves the organization. I know we can do "update credential" in such cases.
I read through the document and also gave it some try but couldn't really find if we can use a service account instead of individual accounts to schedule queries.
Service accounts are cleaner and ties up to the rest of the IAM framework and is not dependent on a single user.
So if you have any additional information regarding scheduled queries and service account please share.
Thanks for taking time to read the question and respond to it.
Regards
BigQuery Scheduled Query now does support creating a scheduled query with a service account and updating a scheduled query with a service account. Will these work for you?
While it's not supported in BigQuery UI, it's possible to create a transfer (including a scheduled query) using python GCP SDK for DTS, or from BQ CLI.
The following is an example using Python SDK:
r"""Example of creating TransferConfig using service account.
Usage Example:
1. Install GCP BQ python client library.
2. If it has not been done, please grant p4 service account with
iam.serviceAccout.GetAccessTokens permission on your project.
$ gcloud projects add-iam-policy-binding {user_project_id} \
--member='serviceAccount:service-{user_project_number}#'\
'gcp-sa-bigquerydatatransfer.iam.gserviceaccount.com' \
--role='roles/iam.serviceAccountTokenCreator'
where {user_project_id} and {user_project_number} are the user project's
project id and project number, respectively. E.g.,
$ gcloud projects add-iam-policy-binding my-test-proj \
--member='serviceAccount:service-123456789#'\
'gcp-sa-bigquerydatatransfer.iam.gserviceaccount.com'\
--role='roles/iam.serviceAccountTokenCreator'
3. Set environment var PROJECT to your user project, and
GOOGLE_APPLICATION_CREDENTIALS to the service account key path. E.g.,
$ export PROJECT_ID='my_project_id'
$ export GOOGLE_APPLICATION_CREDENTIALS=./serviceacct-creds.json'
4. $ python3 ./create_transfer_config.py
"""
import os
from google.cloud import bigquery_datatransfer
from google.oauth2 import service_account
from google.protobuf.struct_pb2 import Struct
PROJECT = os.environ["PROJECT_ID"]
SA_KEY_PATH = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
credentials = (
service_account.Credentials.from_service_account_file(SA_KEY_PATH))
client = bigquery_datatransfer.DataTransferServiceClient(
credentials=credentials)
# Get full path to project
parent_base = client.project_path(PROJECT)
params = Struct()
params["query"] = "SELECT CURRENT_DATE() as date, RAND() as val"
transfer_config = {
"destination_dataset_id": "my_data_set",
"display_name": "scheduled_query_test",
"data_source_id": "scheduled_query",
"params": params,
}
parent = parent_base + "/locations/us"
response = client.create_transfer_config(parent, transfer_config)
print response
As far as I know, unfortunately you can't use a service account to directly schedule queries yet. Maybe a Googler will correct me, but the BigQuery docs implicitly state this:
https://cloud.google.com/bigquery/docs/scheduling-queries#quotas
A scheduled query is executed with the creator's credentials and
project, as if you were executing the query yourself
If you need to use a service account (which is great practice BTW), then there are a few workarounds listed here. I've raised a FR here for posterity.
This question is very old and came on this thread while I was searching for same.
Yes, It is possible to use service account to schedule big query jobs.
While creating schedule query job, click on "Advance options", you will get option to select service account.
By default is uses credential of requesting user.
Image from bigquery "create schedule query"1

How do I make a Bigquery dataset public using command line tool or Python?

I'm making an open data website powered by BigQuery. How do I make a Bigquery dataset public using command line tool or Python?
Note I tried to make every dataset in my project public but got an unexplained error. In project permission settings via WebUI under "Add members" I put
allAuthenticatedUsers and did the permission Data Viewer. The error was "Error
Sorry, there’s a problem. If you entered information, check it and try again. Otherwise, the problem might clear up on its own, so check back later."
I wasn't able to find any command line examples for updating permissions. I also can't find a JSON string to pass to https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/update
To achieve this programatically, you need to use a dataset patch request and use the specialGroup item with the value allAuthenticatedUsers, like so:
{
"datasetReference":{
"projectId":"<removed>",
"datasetId":"<removed>"
},
"access":[
... //other access roles
{
"specialGroup":"allAuthenticatedUsers",
"role":"READER"
}
]
}
Note: You should use a read-modify-write cycle as described here & here:
Note about arrays: Patch requests that contain arrays replace the existing array with the one you provide. You cannot modify, add, or delete items in an array in a piecemeal fashion.

getting a "need project id error" in Keen

I get the following error:
Keen.delete(:iron_worker_analytics, filters: [{:property_name => 'start_time', :operator => 'eq', :property_value => '0001-01-01T00:00:00Z'}])
Keen::ConfigurationError: Keen IO Exception: Project ID must be set
However, when I set the value, I get the following:
warning: already initialized constant KEEN_PROJECT_ID
iron.io/env.rb:36: warning: previous definition of KEEN_PROJECT_ID was here
Keen works fine when I run the app and load the values from a env.rb file but from the console I cannot get past this.
I am using the ruby gem.
I figured it out. The documentation is confusing. Per the documentation:
https://github.com/keenlabs/keen-gem
The recommended way to set keys is via the environment. The keys you
can set are KEEN_PROJECT_ID, KEEN_WRITE_KEY, KEEN_READ_KEY and
KEEN_MASTER_KEY. You only need to specify the keys that correspond to
the API calls you'll be performing. If you're using foreman, add this
to your .env file:
KEEN_PROJECT_ID=aaaaaaaaaaaaaaa
KEEN_MASTER_KEY=xxxxxxxxxxxxxxx
KEEN_WRITE_KEY=yyyyyyyyyyyyyyy KEEN_READ_KEY=zzzzzzzzzzzzzzz If not,
make a script to export the variables into your shell or put it before
the command you use to start your server.
But I had to set it explicitly as Keen.project_id after doing a Keen.methods.
It's sort of confusing since from the docs, I assumed I just need to set the variables. Maybe I am misunderstanding the docs but it was confusing at least to me.