Airflow - select bigquery table data into a dataframe - google-bigquery

I'm trying to execute the following DAG in Airflow (Cloud Composer on Google Cloud) and I keep getting the same error:
The conn_id hard_coded_project_name isn't defined
Maybe someone can point me in the right direction?
from airflow.models import DAG
import os
from airflow.operators.dummy import DummyOperator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
import datetime
import pandas as pd
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.providers.google.cloud.operators import bigquery
from airflow.contrib.hooks.bigquery_hook import BigQueryHook

default_args = {
    'start_date': datetime.datetime(2020, 1, 1),
}

PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "hard_coded_project_name")

def list_dates_in_df():
    hook = BigQueryHook(bigquery_conn_id=PROJECT_ID,
                        use_legacy_sql=False)
    bq_client = bigquery.Client(project=hook._get_field("project"),
                                credentials=hook._get_credentials())
    query = "select count(*) from LP_RAW.DIM_ACCOUNT;"
    df = bq_client.query(query).to_dataframe()

with DAG(
    'df_test',
    schedule_interval=None,
    catchup=False,
    default_args=default_args
) as dag:

    list_dates = PythonOperator(
        task_id='list_dates',
        python_callable=list_dates_in_df
    )

    list_dates

The error means that PROJECT_ID, as seen in the line
PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "hard_coded_project_name")
was assigned the value hard_coded_project_name because the environment variable GCP_PROJECT_ID has no value.
Then, at the line
hook = BigQueryHook(bigquery_conn_id=PROJECT_ID...
the string hard_coded_project_name is treated as an Airflow connection id, and no connection with that id exists.
You can take either of the following steps to fix this.
Create a connection for both GCP_PROJECT_ID and hard_coded_project_name so that both ids resolve to something. If you don't want to create a connection for GCP_PROJECT_ID, at least make sure hard_coded_project_name exists so the fallback works. You can do this by:
Opening your Airflow instance.
Clicking "Admin" > "Connections".
Clicking "Create".
Filling in "Conn Id" with "hard_coded_project_name" and "Conn Type" with "Google Cloud Platform".
Filling in "Project Id" with your actual project id.
Repeating these steps to create GCP_PROJECT_ID if you want that connection as well.
At a minimum, providing the Project Id is enough for the connection to work, but feel free to add the keyfile (or its contents) and scopes so you won't run into authentication problems later.
Alternatively, you can use bigquery_default instead of hard_coded_project_name; by default it points to the project that runs the Airflow instance.
Your updated PROJECT_ID assignment code will be
PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "bigquery_default")
Also, when testing your code you might encounter an error at the line
bq_client = bigquery.Client(project = hook._get_field("project")...
because Client() does not exist in airflow.providers.google.cloud.operators.bigquery; you should use from google.cloud import bigquery instead.
I tested this twice: once with only a hard_coded_project_name connection created (so PROJECT_ID falls back to it), and once using bigquery_default. In both cases the task returned the count of one of my tables.
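Putting the fixes together, a minimal sketch of the corrected task could look like the following, assuming a working Google Cloud connection named bigquery_default (adjust the connection id, dataset and table to your setup):

from airflow.contrib.hooks.bigquery_hook import BigQueryHook
from google.cloud import bigquery  # the client library, not the Airflow operators module

def list_dates_in_df():
    # The hook expects an Airflow connection id here, not a GCP project id.
    hook = BigQueryHook(bigquery_conn_id="bigquery_default", use_legacy_sql=False)
    # Build a BigQuery client from the project and credentials stored on that connection.
    bq_client = bigquery.Client(project=hook._get_field("project"),
                                credentials=hook._get_credentials())
    query = "select count(*) from LP_RAW.DIM_ACCOUNT;"
    df = bq_client.query(query).to_dataframe()
    print(df)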

Related

Access to Big Query denied

I am trying to access BigQuery with Python in Anaconda. It had been working; however, I switched to a new project id and I am now getting "Access Denied: Project id: User does not have bigquery.jobs.create permission in project id".
The same code has been working with a different project id.
Thank you in advance.
import pandas as pd
sql = "SELECT instance_index, collection_id, machine_id, start_time, end_time FROM `google.com:google-cluster-data.clusterdata_2019_a.instance_usage` WHERE machine_id = 102894374053"
project_id = 'vibrant-victory-370720'
a_102894374053_df = pd.read_gbq(sql, project_id = project_id, dialect='standard', use_bqstorage_api=True)
I don't think it's a problem with the new project, since the query works on Google Colab, just not in Anaconda.
Should I somehow clear the cached authentication of the previous project_id?
Got it to work by setting the credentials with a service account:
https://pandas-gbq.readthedocs.io/en/latest/howto/authentication.html
from google.oauth2 import service_account
import pandas as pd

credentials = service_account.Credentials.from_service_account_file(
    'vibrant-victory-370720-3bafb9420314.json')
a_102894374053_df = pd.read_gbq(sql, project_id=project_id, dialect='standard',
                                credentials=credentials, use_bqstorage_api=True)
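As a side note on clearing cached authentication: pandas-gbq caches user credentials between runs, and read_gbq has a reauth flag that should force a fresh authentication flow if you prefer not to use a service account, for example:

# Force re-authentication rather than reusing cached user credentials.
a_102894374053_df = pd.read_gbq(sql, project_id=project_id, dialect='standard',
                                reauth=True, use_bqstorage_api=True)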

Recursively copy s3 objects from one s3 prefix to another in airflow

I am trying to copy files that arrive hourly in my incoming bucket with the format below:
s3://input-bucket/source_system1/prod/2022-09-27-00/input_folder/filename1.csv
s3://input-bucket/source_system1/prod/2022-09-27-00/input_folder/filename2.csv
s3://input-bucket/source_system1/prod/2022-09-27-01/input_folder/filename3.csv
s3://input-bucket/source_system1/prod/2022-09-27-11/input_folder/filename3.csv
I want to copy the objects into a destination folder with a single Airflow task for a specific source system.
I tried:
s3_copy = S3CopyObjectOperator(
    task_id=f"copy_s3_objects_{TC_ENV.lower()}",
    source_bucket_key="s3://input-bucket/source_system1/prod/2022-09-27-*",
    dest_bucket_name="destination-bucket",
    dest_bucket_key=f"producers/prod/event_type=source_system/execution_date={EXECUTION_DATE}",
    aws_conn_id=None
)
The problem with the above is that I can't use wildcards in the source bucket key; it needs to be a complete prefix of a specific S3 object. I also tried a combination of S3ListOperator and S3FileTransformOperator, but those created one task per object. I need one Airflow task per source system that copies all the data matching this wildcard pattern:
s3://input-bucket/source_system1/prod/2022-09-27-*
How can I achieve this?
If you want to achieve this in a single task, I recommend using the PythonOperator together with the S3Hook, as follows:
from airflow import DAG
from airflow.models import Variable
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from datetime import datetime

def s3_copy(**kwargs):
    hook = S3Hook(aws_conn_id='aws_default')
    source_bucket = Variable.get('source_bucket')
    # List every key in the source bucket; narrow this down with a prefix
    # (e.g. 'source_system1/prod/') if you only want a subset.
    keys = hook.list_keys(bucket_name=source_bucket, prefix='')
    for key in keys:
        hook.copy_object(
            source_bucket_name=source_bucket,
            dest_bucket_name=Variable.get('dest_bucket'),
            source_bucket_key=key,
            dest_bucket_key=key,
            acl_policy='bucket-owner-full-control'
        )

with DAG('example_dag',
         schedule_interval='0 1 * * *',
         start_date=datetime(2023, 1, 1),
         catchup=False
         ):
    e0 = EmptyOperator(task_id='start')

    t1 = PythonOperator(
        task_id='example_copy',
        python_callable=s3_copy
    )

    e0 >> t1
You could improve on this base logic to make it more performant or add some filtering, etc.
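For example, a rough, untested sketch of the filtering idea could reuse the hourly prefix from the question (the function name and the literal prefix here are just placeholders):

def s3_copy_day(**kwargs):
    hook = S3Hook(aws_conn_id='aws_default')
    source_bucket = Variable.get('source_bucket')
    # list_keys accepts a prefix, so a literal prefix stands in for the
    # wildcard "source_system1/prod/2022-09-27-*" from the question.
    keys = hook.list_keys(bucket_name=source_bucket,
                          prefix='source_system1/prod/2022-09-27-')
    for key in keys or []:
        hook.copy_object(
            source_bucket_name=source_bucket,
            dest_bucket_name=Variable.get('dest_bucket'),
            source_bucket_key=key,
            dest_bucket_key=key,
            acl_policy='bucket-owner-full-control'
        )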

Data overwrite google sheet - Jupyter connection

I created a connection between my Jupyter notebook and a Google Sheet.
My idea was to create a log, so every time I run the notebook it would update my Google Sheet with the new data. But I don't want to overwrite the existing data, I want to append to it. I tried many solutions but none of them worked.
Currently my code is:
## Connect to our service account
import gspread
from oauth2client.service_account import ServiceAccountCredentials

scope = ["https://spreadsheets.google.com/feeds",
         'https://www.googleapis.com/auth/spreadsheets',
         "https://www.googleapis.com/auth/drive.file",
         "https://www.googleapis.com/auth/drive"]
credentials = ServiceAccountCredentials.from_json_keyfile_name('jupyter-and-gsheet-303208-63903bea8f5d.json', scope)
gc = gspread.authorize(credentials)

spreadsheet_key = '1RbPnMdJ-EcJHbly280vrJxc8UvqwiBPkUTFLyo4efEA'

from df2gspread import df2gspread as d2g

wks_name = 'Data04'
d2g.upload(df_apn1, spreadsheet_key, wks_name, credentials=credentials)
It works perfectly, but it always overwrites the existing data.
Does anybody know how I can append instead of replace?
Thank you.
The df2gspread documentation for upload() indicates that:
"if spreadsheet already exists, all data of provided worksheet (or first as default) will be replaced with data of given DataFrame, make sure that this is what you need!"
A workaround is to convert your dataframe to a list of lists and use gspread's append_rows().
Example:
import gspread
import pandas as pd
gc = gspread.service_account()
sh = gc.open_by_key("someid").sheet1
df = pd.DataFrame({'Name': ['Bea', 'Andrew', 'Mike'], 'Age': [20, 19, 23]})
values = df.values.tolist()
sh.append_rows(values)
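As a small usage note: if the target sheet is empty and you also want a header row, you could append the column names first, since append_rows simply takes a list of rows, for example:

# Prepend the DataFrame's column names as the first appended row.
sh.append_rows([df.columns.tolist()] + values)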
You may also check the following libraries:
gspread-pandas
gspread-dataframe
Reference:
gspread

How to create a new table in Postgresql from a .json file using Python and psycopg2?

I am new to PostgreSQL. I would like to insert information from a .json file and create a new table in PostgreSQL using Python/psycopg2. I have looked over some StackOverflow posts and the psycopg2 documentation without getting much further.
The closest question is here, from which I derived the following:
The test .json file is as follows (which only has 1-level i.e. no nested .json structure):
[{"last_update": "2019-02-01"}]
Attempted python code:
import psycopg2
from psycopg2.extras import Json
from psycopg2 import Error
from unipath import Path
import io

def insert_into_table(json_data):
    try:
        with psycopg2.connect(user="thisuser",
                              password="somePassword",
                              host="127.0.0.654165",
                              port='5455',
                              database="SqlTesting") as conn:
            cursor = conn.cursor()
            read_json = io.open(data_path, encoding='utf-8')
            read_json_all = read_json.readlines()
            query = "INSERT INTO new_table VALUES (%s)"
            cursor.executemany(query, (read_json_all,))
            conn.commit()
            print("Json data import successful")
    except (Exception, psycopg2.Error) as error:
        print("Failed json import: {}".format(error))

insert_into_table(data_path)
The above code didn't work regardless of whether new_table didn't exist or had been created manually as a placeholder.
Rather, it produced the following error message:
Failed json import: relation "new_table" does not exist
LINE 1: INSERT INTO new_table VALUES ('[{"last_update": "2019-02-01"...
During debugging, I saw:
for i in read_json:
    print(i)
# will result in
# [{"last_update": "2019-02-01"}]
And:
print(read_json_all)
# will result in
# ['[{"last_update": "2019-02-01"}]']
I think you might want to use SQLAlchemy to put your data into the Postgres DB. Below, I load a very simple JSON file into a pandas DataFrame and then use SQLAlchemy to write it to the database. Check the code below; it should get you where you want to go.
import psycopg2
import pandas as pd
import sqlalchemy
from sqlalchemy import create_engine
import json
from pandas.io.json import json_normalize  # in newer pandas, use pd.json_normalize

# example_1.json is assumed to hold records with "color", "fruit" and "size" fields
with open('example_1.json') as data_file:
    d = json.load(data_file)

def create_table():
    conn = psycopg2.connect("dbname='SqlTesting' user='thisuser' password='somePassword' host='localhost' port='5432'")
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS new_table (color TEXT, fruit TEXT, size TEXT)")
    conn.commit()
    conn.close()

create_table()

df = json_normalize(d)
engine = create_engine("postgresql+psycopg2://thisuser:somePassword@localhost:5432/SqlTesting")
df.to_sql("new_table", engine, index=False, if_exists='append')
print("Done")

How do I seed a flask sql-alchemy database

I am new to Python. I just learnt how to create an API using Flask-Restless and Flask-SQLAlchemy. However, I would like to seed the database with random values. How do I achieve this? Please help.
Here is the API code:
import flask
import flask.ext.sqlalchemy
import flask.ext.restless
import datetime

DATABASE = 'sqlite:///tmp/test.db'

# Create the Flask application and the Flask-SQLAlchemy object
app = flask.Flask(__name__)
app.config['DEBUG'] = True
app.config['SQLALCHEMY_DATABASE_URI'] = DATABASE
db = flask.ext.sqlalchemy.SQLAlchemy(app)

# Create Flask-SQLAlchemy models
class TodoItem(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    todo = db.Column(db.Unicode)
    priority = db.Column(db.SmallInteger)
    due_date = db.Column(db.Date)

# Create database tables
db.create_all()

# Create Flask-Restless API manager
manager = flask.ext.restless.APIManager(app, flask_sqlalchemy_db=db)

# Create API endpoints
manager.create_api(TodoItem, methods=['GET', 'POST', 'DELETE', 'PUT'], results_per_page=20)

# Start the Flask loop
app.run()
I had a similar question, did some research, and found something that worked.
The pattern I am seeing is based on registering a custom Flask CLI command, something like flask seed.
Given your example, it would look like this. First, import the following into your API code file (let's say you have it named server.py):
from flask.cli import with_appcontext
(I see you do import flask, but I would suggest changing these to from flask import <what_you_need>.)
Next, create a function that does the seeding for your project:
import click  # needed so app.cli.add_command() below can register this as a CLI command

@click.command(name='seed')
@with_appcontext
def seed():
    """Seed the database."""
    # .save() assumes your model has a save() helper (e.g. from a mixin) that adds and commits.
    todo1 = TodoItem(...).save()
    todo2 = TodoItem(...).save()
    todo3 = TodoItem(...).save()
Finally, register this command with your Flask application:
def register_commands(app):
    """Register CLI commands."""
    app.cli.add_command(seed)
After you've configured your application, make sure you call register_commands to register the commands:
register_commands(app)
At this point, you should be able to run: flask seed. You can add more functions (maybe a flask reset) using the same pattern.
From another newbie: the forgerypy and forgerypy3 libraries are available for this purpose (though they look like they haven't been touched in a while).
A simple example of how to use them is to add a generator method to your model:
class TodoItem(db.Model):
    ....

    @staticmethod
    def generate_fake_data(records=10):
        import forgery_py
        from random import randint

        for record in range(records):
            todo = TodoItem(todo=forgery_py.lorem_ipsum.word(),
                            due_date=forgery_py.date.date(),
                            priority=randint(1, 4))
            db.session.add(todo)
            try:
                db.session.commit()
            except:
                db.session.rollback()
You would then call the generate_fake_data method in a shell session.
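For instance, assuming the module is named server (as above) and app.run() is guarded by if __name__ == '__main__': so the import doesn't start the server, a shell session could look like:

>>> from server import db, TodoItem
>>> TodoItem.generate_fake_data(records=25)  # inserts 25 random todo rows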
And Miguel Grinberg's Flask Web Development (the O'Reilly book, not blog) chapter 11 is a good resource for this.