I am trying to access BigQuery with Python in Anaconda. It had been working; however, I switched to a new project ID and am now getting "Access Denied: Project id: User does not have bigquery.jobs.create permission in project id".
The same code has been working with a different project ID.
Thank you in advance.
import pandas as pd
sql = "SELECT instance_index, collection_id, machine_id, start_time, end_time FROM `google.com:google-cluster-data.clusterdata_2019_a.instance_usage` WHERE machine_id = 102894374053"
project_id = 'vibrant-victory-370720'
a_102894374053_df = pd.read_gbq(sql, project_id=project_id, dialect='standard', use_bqstorage_api=True)
I don't think it's a problem with the new project itself, since the same code works on Google Colab, just not in Anaconda.
Should I somehow clear the cached authentication from the previous project_id?
I got it to work by setting the credentials with a service account, following the pandas-gbq authentication guide:
https://pandas-gbq.readthedocs.io/en/latest/howto/authentication.html
from google.oauth2 import service_account
import pandas as pd

# Use explicit service-account credentials instead of the cached user credentials
credentials = service_account.Credentials.from_service_account_file(
    'vibrant-victory-370720-3bafb9420314.json')

# sql and project_id are the same as in the question above
a_102894374053_df = pd.read_gbq(sql, project_id=project_id, dialect='standard',
                                credentials=credentials, use_bqstorage_api=True)
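As an aside, the same authentication guide also documents setting the credentials and billing project once per session via pandas_gbq.context, so you don't have to pass them on every call. A minimal sketch, reusing the service-account file above (the file path is from this answer and is an assumption about your setup):
import pandas_gbq
from google.oauth2 import service_account

# Set credentials and billing project once; subsequent read_gbq calls reuse them
pandas_gbq.context.credentials = service_account.Credentials.from_service_account_file(
    'vibrant-victory-370720-3bafb9420314.json')
pandas_gbq.context.project = 'vibrant-victory-370720'

df = pandas_gbq.read_gbq(sql, dialect='standard')  # sql as defined in the question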
I'm trying to execute the following DAG in Airflow (Cloud Composer) on Google Cloud and I keep getting the same error:
The conn_id hard_coded_project_name isn't defined
Maybe someone can point me in the right direction?
from airflow.models import DAG
import os
from airflow.operators.dummy import DummyOperator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
import datetime
import pandas as pd
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.providers.google.cloud.operators import bigquery
from airflow.contrib.hooks.bigquery_hook import BigQueryHook

default_args = {
    'start_date': datetime.datetime(2020, 1, 1),
}

PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "hard_coded_project_name")

def list_dates_in_df():
    hook = BigQueryHook(bigquery_conn_id=PROJECT_ID,
                        use_legacy_sql=False)
    bq_client = bigquery.Client(project=hook._get_field("project"),
                                credentials=hook._get_credentials())
    query = "select count(*) from LP_RAW.DIM_ACCOUNT;"
    df = bq_client.query(query).to_dataframe()

with DAG(
    'df_test',
    schedule_interval=None,
    catchup=False,
    default_args=default_args
) as dag:
    list_dates = PythonOperator(
        task_id='list_dates',
        python_callable=list_dates_in_df
    )

    list_dates
It means that PROJECT_ID, as seen in the line
PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "hard_coded_project_name")
was assigned the value hard_coded_project_name because the GCP_PROJECT_ID environment variable has no value.
Then, at the line
hook = BigQueryHook(bigquery_conn_id=PROJECT_ID...
the string hard_coded_project_name is interpreted as an Airflow connection id, and no connection with that id is defined.
To avoid this error, you can do either of the following.
Create a connection for both GCP_PROJECT_ID and hard_coded_project_name, so that both are guaranteed to resolve. If you don't want to create a connection for GCP_PROJECT_ID, at least make sure that hard_coded_project_name exists so the fallback works. You can do this by:
Opening your Airflow instance.
Clicking "Admin" > "Connections".
Clicking "Create".
Filling in "Conn Id" and "Conn Type" as "hard_coded_project_name" and "Google Cloud Platform" respectively.
Filling in "Project Id" with your actual project id.
Repeating these steps to create GCP_PROJECT_ID.
At minimum, filling in the Project Id is enough for the connection to work, but feel free to also add the keyfile (or its contents) and scope so you won't run into authentication problems later on.
Alternatively, you can use bigquery_default instead of hard_coded_project_name, so that by default it points to the project that runs the Airflow instance.
Your updated PROJECT_ID assignment would then be
PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "bigquery_default")
Also, when testing your code you might encounter an error at the line
bq_client = bigquery.Client(project = hook._get_field("project")...
because Client() does not exist in airflow.providers.google.cloud.operators.bigquery; you should use from google.cloud import bigquery instead.
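Putting both changes together, a rough sketch of the task function (assuming the connection described above exists and keeping your original query):
from airflow.contrib.hooks.bigquery_hook import BigQueryHook
from google.cloud import bigquery  # note: google.cloud, not airflow.providers.google.cloud.operators

def list_dates_in_df():
    # PROJECT_ID resolves to an Airflow connection id, e.g. "hard_coded_project_name" or "bigquery_default"
    hook = BigQueryHook(bigquery_conn_id=PROJECT_ID, use_legacy_sql=False)
    bq_client = bigquery.Client(project=hook._get_field("project"),
                                credentials=hook._get_credentials())
    query = "select count(*) from LP_RAW.DIM_ACCOUNT;"
    df = bq_client.query(query).to_dataframe()
    print(df)  # make the result visible in the task logs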
I tested this by creating only the hard_coded_project_name connection, so PROJECT_ID fell back to it; getting the count of one of my tables worked.
I ran the same test using bigquery_default and it worked as well.
I am having trouble using the Google BigQuery package in pandas. I have installed google-api-python-client as well as pandas-gbq. But for some reason, when I go to query a table I get a DistributionNotFound: The 'google-api-python-client' distribution was not found and is required by the application error. Here is a snippet of my code:
import pandas as pd
from pandas.io import gbq
count_block = gbq.read_gbq('SELECT count(int64_field_0) as count_blocks FROM Data.bh', projectid)
Using a virtual environment in this scenario can help you rule out problems with your library installations; the snippet below shows one way to check what the interpreter you are actually running has installed.
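This is only a diagnostic sketch, using the same pkg_resources machinery that raises DistributionNotFound in the first place:
import sys
import pkg_resources

print(sys.executable)  # which Python interpreter is actually in use
try:
    dist = pkg_resources.get_distribution('google-api-python-client')
    print(dist.project_name, dist.version, dist.location)
except pkg_resources.DistributionNotFound:
    print('google-api-python-client is not visible to this environment')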
So I created a model for storing credentials from Gmail users.
I wanted to make migrations but it says that there is no such table:
django.db.utils.OperationalError: no such table: mainApp_credentialsmodel
My models:
from django.contrib.auth.models import User
from django.db import models
import json

# Create your models here.
class CredentialsModel(models.Model):
    id = models.ForeignKey(User, primary_key=True, on_delete=models.CASCADE)
    credential = models.CharField(max_length=1000)
Calling that model for checking authorization:
SCOPES = 'https://www.googleapis.com/auth/gmail.readonly'
store = CredentialsModel.objects.all()
creds = store.get()
if not creds or creds.invalid:
    flow = client.flow_from_clientsecrets('mainApp/client_secret.json', SCOPES)
    creds = tools.run_flow(flow, store)
service = build('gmail', 'v1', http=creds.authorize(Http()))
python manage.py makemigrations
If that error keeps happening, check your migrations folder and the files inside it. Also check whether your database is reachable, in case you are using a hosted database; I ran into this problem last week and it turned out to be an issue with Azure.
As a last resort I would create the table (model) again under a slightly different name, but if you have a significant amount of data in that table, then I don't think you can do that.
It looks like your authorization code - including the query on CredentialsModel - is at module level. This means it runs when the module is imported, which happens before the migration has had a chance to run.
You must ensure that any database-accessing code is inside a function or method and is not invoked globally.
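For instance, the authorization snippet from the question could be moved into a helper that is only called from a view. This is just a sketch: the import paths assume the usual oauth2client / google-api-python-client setup implied by the question, and get_gmail_service is a hypothetical name.
from googleapiclient.discovery import build
from httplib2 import Http
from oauth2client import client, tools

from .models import CredentialsModel

SCOPES = 'https://www.googleapis.com/auth/gmail.readonly'

def get_gmail_service():
    # Runs only when called (e.g. from a view), not at import time,
    # so makemigrations/migrate can create the table first.
    store = CredentialsModel.objects.all()
    creds = store.get()
    if not creds or creds.invalid:
        flow = client.flow_from_clientsecrets('mainApp/client_secret.json', SCOPES)
        creds = tools.run_flow(flow, store)
    return build('gmail', 'v1', http=creds.authorize(Http()))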
I would like to use public data from BigQuery in Datalab and load it into a pandas DataFrame. How would I go about doing that? I have tried three different versions:
from google.cloud import bigquery

client = bigquery.Client()
QUERY = (
    'SELECT pickup_datetime, dropoff_datetime '
    'FROM `bigquery-public-data.new_york.tlc_yellow_trips_20*`')  # also tried without the backticks and wildcard
query = client.run_sync_query('%s LIMIT 100' % QUERY)
query.timeout_ms = 10000
query.run()
Error: BadRequest
import pandas as pd
df=pd.io.gbq.read_gbq("""
SELECT pickup_datetime, dropoff_datetime
FROM bigquery-public-data.new_york.tlc_yellow_trips_20*
LIMIT 10
""", project_id='bigquery-public-data')
Error: I am asked to grant access to pandas, but when I agree I get "This site can’t be reached: localhost refused to connect."
%%bq query
SELECT pickup_datetime, dropoff_datetime
FROM bigquery-public-data.new_york.tlc_yellow_trips_20*
LIMIT 10
Error: Just keeps Running
Any help on what I am doing wrong would be appreciated.
The code above should work after some minor changes and after you grant Google access to your local machine with your email using gcloud (install and initialize it first).
Get the project ID by typing bq after you have initialized gcloud with gcloud init.
In the first code block above, use client = bigquery.Client(project='your project id') (the keyword argument is project, not project_id).
Since you have granted access, the second code block should work as well; just change the project ID to a project of your own rather than bigquery-public-data. If you don't use LIMIT, it may take a long time to load, since pandas will transform the data into a DataFrame.
The third code block will work as well.
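For reference, here is roughly what the first two versions look like with those changes applied. This is only a sketch: 'your-own-project-id' is a placeholder for a project you can run jobs in (the query is billed there even though the data lives in bigquery-public-data), and it uses client.query(), which replaced run_sync_query() in newer google-cloud-bigquery releases.
from google.cloud import bigquery
import pandas as pd

client = bigquery.Client(project='your-own-project-id')
QUERY = (
    'SELECT pickup_datetime, dropoff_datetime '
    'FROM `bigquery-public-data.new_york.tlc_yellow_trips_20*` LIMIT 100')
df = client.query(QUERY).to_dataframe()

# pandas-gbq version: project_id must be your own project, not bigquery-public-data
df2 = pd.io.gbq.read_gbq(QUERY, project_id='your-own-project-id', dialect='standard')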
I am new to Python. I just learnt how to create an API using Flask-Restless and Flask-SQLAlchemy. I would, however, like to seed the database with random values. How do I achieve this? Please help.
Here is the API code:
import flask
import flask.ext.sqlalchemy
import flask.ext.restless
import datetime

DATABASE = 'sqlite:///tmp/test.db'

# Create the Flask application and the Flask-SQLAlchemy object
app = flask.Flask(__name__)
app.config['DEBUG'] = True
app.config['SQLALCHEMY_DATABASE_URI'] = DATABASE
db = flask.ext.sqlalchemy.SQLAlchemy(app)

# Create Flask-SQLAlchemy models
class TodoItem(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    todo = db.Column(db.Unicode)
    priority = db.Column(db.SmallInteger)
    due_date = db.Column(db.Date)

# Create database tables
db.create_all()

# Create Flask-Restless API manager
manager = flask.ext.restless.APIManager(app, flask_sqlalchemy_db=db)

# Create API endpoints
manager.create_api(TodoItem, methods=['GET', 'POST', 'DELETE', 'PUT'], results_per_page=20)

# Start Flask loop
app.run()
I had a similar question, did some research, and found something that worked.
The pattern I am seeing is based on registering a custom Flask CLI command, something like flask seed.
Given your example, this would look as follows. First, import the following into your API code file (let's say you have it named server.py):
import click
from flask.cli import with_appcontext
(I see you use import flask, but I would just add that you should change these to from flask import what_you_need. The import click is added here because app.cli.add_command() below expects a click command object.)
Next, create a function that does the seeding for your project:
@click.command()  # makes seed a click command that app.cli.add_command() can register
@with_appcontext
def seed():
    """Seed the database."""
    # TodoItem(...).save() assumes your model has a save() helper;
    # otherwise use db.session.add(...) followed by db.session.commit().
    todo1 = TodoItem(...).save()
    todo2 = TodoItem(...).save()
    todo3 = TodoItem(...).save()
Finally, register the command with your Flask application:
def register_commands(app):
    """Register CLI commands."""
    app.cli.add_command(seed)
After you've configured your application, make sure you call register_commands to register the commands:
register_commands(app)
At this point, you should be able to run: flask seed. You can add more functions (maybe a flask reset) using the same pattern.
From another newbie: the forgerypy and forgerypy3 libraries are available for this purpose (though they look like they haven't been touched in a while).
A simple example of how to use them, by adding a generator method to your model:
class TodoItem(db.Model):
    ....

    @staticmethod
    def generate_fake_data(records=10):
        import forgery_py
        from random import randint

        # records is a count, so iterate over range(records)
        for record in range(records):
            todo = TodoItem(todo=forgery_py.lorem_ipsum.word(),
                            due_date=forgery_py.date.date(),
                            priority=randint(1, 4))
            db.session.add(todo)
        try:
            db.session.commit()
        except:
            db.session.rollback()
You would then call the generate_fake_data method in a shell session.
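For example, in an interactive session where your application's module (and therefore TodoItem and db) has been imported, something like:
TodoItem.generate_fake_data(records=25)  # inserts 25 fake to-do rows and commits them
print(TodoItem.query.count())  # quick check that the rows were inserted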
And Miguel Grinberg's Flask Web Development (the O'Reilly book, not blog) chapter 11 is a good resource for this.