Recursively copy s3 objects from one s3 prefix to another in airflow - amazon-s3

I am trying to copy files that I receive hourly into my incoming bucket; they arrive in the format below:
s3://input-bucket/source_system1/prod/2022-09-27-00/input_folder/filename1.csv
s3://input-bucket/source_system1/prod/2022-09-27-00/input_folder/filename2.csv
s3://input-bucket/source_system1/prod/2022-09-27-01/input_folder/filename3.csv
s3://input-bucket/source_system1/prod/2022-09-27-11/input_folder/filename3.csv
I want to copy the objects into a destination folder with a single airflow task for a specific source system.
I tried -
s3_copy = S3CopyObjectOperator(
    task_id=f"copy_s3_objects_{TC_ENV.lower()}",
    source_bucket_key="s3://input-bucket/source_system1/prod/2022-09-27-*",
    dest_bucket_name="destination-bucket",
    dest_bucket_key=f"producers/prod/event_type=source_system/execution_date={EXECUTION_DATE}",
    aws_conn_id=None
)
The problem with the above is that I am not able to use wildcards for the source bucket key; it needs to be a complete, specific prefix of the S3 object. I also tried a combination of S3ListOperator and S3FileTransformOperator, but those created a separate task for each object. I need one Airflow task per source system, so that a single task copies all the data matching this wildcard pattern:
s3://input-bucket/source_system1/prod/2022-09-27-*
How can I achieve this?

If you wish to achieve this in one specific task, I recommend using the PythonOperator to interact with the S3Hook as follows:
from airflow import DAG
from airflow.models import Variable
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from datetime import datetime

def s3_copy(**kwargs):
    hook = S3Hook(aws_conn_id='aws_default')
    source_bucket = Variable.get('source_bucket')
    # List every key in the source bucket (an empty prefix matches everything),
    # then copy each object to the destination bucket under the same key.
    keys = hook.list_keys(bucket_name=source_bucket, prefix='')
    for key in keys:
        hook.copy_object(
            source_bucket_name=source_bucket,
            dest_bucket_name=Variable.get('dest_bucket'),
            source_bucket_key=key,
            dest_bucket_key=key,
            acl_policy='bucket-owner-full-control'
        )

with DAG('example_dag',
         schedule_interval='0 1 * * *',
         start_date=datetime(2023, 1, 1),
         catchup=False
         ):
    e0 = EmptyOperator(task_id='start')
    t1 = PythonOperator(
        task_id='example_copy',
        python_callable=s3_copy
    )
    e0 >> t1
You could improve the base logic to make it more performant, add some filtering, etc.
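For the wildcard pattern in the question, a minimal sketch (assuming the bucket names from the question and a hypothetical date_prefix value built from the execution date) would pass the fixed part of the key as the prefix argument to list_keys, so only the matching objects are copied in a single task:
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

def s3_copy_source_system(**kwargs):
    hook = S3Hook(aws_conn_id='aws_default')
    source_bucket = 'input-bucket'
    # Hypothetical value built from the execution date, e.g. '2022-09-27-'.
    date_prefix = '2022-09-27-'
    # list_keys accepts a key prefix, so 'source_system1/prod/2022-09-27-' matches
    # every hourly folder for that day without needing wildcard support.
    keys = hook.list_keys(
        bucket_name=source_bucket,
        prefix=f'source_system1/prod/{date_prefix}'
    )
    for key in keys or []:
        hook.copy_object(
            source_bucket_name=source_bucket,
            dest_bucket_name='destination-bucket',
            source_bucket_key=key,
            # Keep the original key layout under the destination prefix (adjust as needed).
            dest_bucket_key=f'producers/prod/event_type=source_system/{key}',
        )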

Related

Bigquery LoadJobConfig Delete Source Files After Transfer

When creating a Bigquery Data Transfer Service Job Manually through the UI, I can select an option to delete source files after transfer. When I try to use the CLI or the Python Client to create on-demand Data Transfer Service Jobs, I do not see an option to delete the source files after transfer. Do you know if there is another way to do so? Right now, my Source URI is gs://<bucket_path>/*, so it's not trivial to delete the files myself.
This snippet works for me (replace YOUR-... with your data):
from google.cloud import bigquery_datatransfer
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "YOUR-CRED-FILE-PATH"

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

destination_project_id = "YOUR-PROJECT-ID"
destination_dataset_id = "YOUR-DATASET-ID"

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id=destination_dataset_id,
    display_name="YOUR-TRANSFER-NAME",
    data_source_id="google_cloud_storage",
    params={
        "data_path_template": "gs://PATH-TO-YOUR-DATA/*.csv",
        "destination_table_name_template": "YOUR-TABLE-NAME",
        "file_format": "CSV",
        "skip_leading_rows": "1",
        "delete_source_files": True
    },
)

transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path(destination_project_id),
    transfer_config=transfer_config,
)

print(f"Created transfer config: {transfer_config.name}")
print(f"Created transfer config: {transfer_config.name}")
In this example, table YOUR-TABLE-NAME must already exist in BigQuery, otherwise the transfer will crash with error "Not found: Table YOUR-TABLE-NAME".
I used these packages:
google-cloud-bigquery-datatransfer>=3.4.1
google-cloud-bigquery>=2.31.0
Pay attention to the attribute delete_source_files in params. From the docs:
Optional param delete_source_files will delete the source files after each successful transfer. (Delete jobs do not retry if the first effort to delete the source files fails.) The default value for the delete_source_files is false.
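Since the destination table has to exist before the first transfer run (per the "Not found" error mentioned above), one way to guard against that is to create it up front with the google-cloud-bigquery client. A minimal sketch, where the schema and placeholder names are assumptions for illustration only:
from google.cloud import bigquery

bq_client = bigquery.Client(project="YOUR-PROJECT-ID")

# Placeholder schema; it must match the columns of your CSV files.
schema = [
    bigquery.SchemaField("column_1", "STRING"),
    bigquery.SchemaField("column_2", "INTEGER"),
]

table = bigquery.Table("YOUR-PROJECT-ID.YOUR-DATASET-ID.YOUR-TABLE-NAME", schema=schema)

# exists_ok=True makes this a no-op if the table is already there.
table = bq_client.create_table(table, exists_ok=True)
print(f"Table {table.full_table_id} is ready")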

Airflow - select bigquery table data into a dataframe

I'm trying to execute the following DAG in Airflow Composer on Google Cloud and I keep getting the same error:
The conn_id hard_coded_project_name isn't defined
Maybe someone can point me in the right direction?
from airflow.models import DAG
import os
from airflow.operators.dummy import DummyOperator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
import datetime
import pandas as pd
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.providers.google.cloud.operators import bigquery
from airflow.contrib.hooks.bigquery_hook import BigQueryHook

default_args = {
    'start_date': datetime.datetime(2020, 1, 1),
}

PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "hard_coded_project_name")

def list_dates_in_df():
    hook = BigQueryHook(bigquery_conn_id=PROJECT_ID,
                        use_legacy_sql=False)
    bq_client = bigquery.Client(project=hook._get_field("project"),
                                credentials=hook._get_credentials())
    query = "select count(*) from LP_RAW.DIM_ACCOUNT;"
    df = bq_client.query(query).to_dataframe()

with DAG(
    'df_test',
    schedule_interval=None,
    catchup=False,
    default_args=default_args
) as dag:
    list_dates = PythonOperator(
        task_id='list_dates',
        python_callable=list_dates_in_df
    )

    list_dates
It means that PROJECT_ID, as seen in the line
PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "hard_coded_project_name")
was assigned the value hard_coded_project_name because the environment variable GCP_PROJECT_ID has no value.
Then at the line
hook = BigQueryHook(bigquery_conn_id=PROJECT_ID...
the string hard_coded_project_name is interpreted as an Airflow connection id, and no connection with that id exists.
To avoid this error, you can do either of the following:
Create a connection id for both GCP_PROJECT_ID and hard_coded_project_name, just so you are sure that both have values. If you don't want to create a connection for GCP_PROJECT_ID, make sure that hard_coded_project_name has a value so there is a fallback option. You can do this by:
Opening your Airflow instance.
Clicking "Admin" > "Connections".
Clicking "Create".
Filling in "Conn Id" and "Conn Type" as "hard_coded_project_name" and "Google Cloud Platform" respectively.
Filling in "Project Id" with your actual project id value.
Repeat these steps to create GCP_PROJECT_ID.
At minimum, providing the Project Id will work, but feel free to add the keyfile (or its content) and scope so you won't have authentication problems moving forward.
Alternatively, you can use bigquery_default instead of hard_coded_project_name so that by default it points to the project that runs the Airflow instance.
Your updated PROJECT_ID assignment code will be
PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "bigquery_default")
Also, when testing your code you might encounter an error at the line
bq_client = bigquery.Client(project = hook._get_field("project")...
because Client() does not exist in airflow.providers.google.cloud.operators.bigquery; you should use from google.cloud import bigquery instead.
I tested both setups (creating only the hard_coded_project_name connection, and using bigquery_default); in each test I got the count of one of my tables and it worked.
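Putting both fixes together (the bigquery_default fallback and the google.cloud import), the callable could look roughly like the sketch below; it is an illustration rather than a drop-in replacement:
import os
from airflow.contrib.hooks.bigquery_hook import BigQueryHook
from google.cloud import bigquery  # note: google.cloud, not the Airflow operators module

PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "bigquery_default")

def list_dates_in_df():
    # The connection id falls back to bigquery_default, which points to the
    # project that runs the Airflow instance.
    hook = BigQueryHook(bigquery_conn_id=PROJECT_ID, use_legacy_sql=False)
    bq_client = bigquery.Client(
        project=hook._get_field("project"),
        credentials=hook._get_credentials()
    )
    query = "select count(*) from LP_RAW.DIM_ACCOUNT;"
    df = bq_client.query(query).to_dataframe()
    print(df)
    return df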

Load data only once on the RAM using Python

Hopefully someone can help me. I have a set of static data files for some data analysis; however, every time I run my script it takes a really long time to see what is happening, because the data is loaded every time. Is there a way to load the data once and afterwards just work with it?
I have been using Jupyter notebooks and they work really well, but I would like a way to fix this problem using Python code.
The sequence of my code is:
File 1: contains all the functions;
File 2: contains all the variables; it calls File 1 in order to know what to do with the data.
File 1 (functions.py):
import numpy as np

def dict_files(filepath_lst):
    dictoffiles = {}
    for namefile in filepath_lst:
        content_file = np.loadtxt(namefile)
        dictoffiles[namefile] = content_file
    # Sorting files according to smallest timestamp to largest
    sorted_dictoffiles = {keys: values for keys, values in sorted(dictoffiles.items(), key=lambda item: item[1][0, 0])}
    return sorted_dictoffiles
File 2:
import glob
from os.path import join as filejoin  # assuming filejoin refers to os.path.join
import functions as f

### ----------File Path -----------###
directory = 'some_file_path'
file_path = glob.glob(filejoin(directory, '*.dat'))
dictionary_of_files = f.dict_files(file_path)

Denormalize a GCS file before uploading to BigQuery

I have written a Cloud Run API in .Net Core that reads files from a GCS location and then is supposed to denormalize (i.e. add more information for each row to include textual descriptions) and then write that to a BigQuery table. I have two options:
My cloud run API could create denormalized CSV files and write them to another GCS location. Then another cloud run API could pick up those denormalized CSV files and write them straight to BigQuery.
My cloud run API could read the original CSV file, denormalize them in memory (filestream) and then somehow write from the in memory filestream straight to the BigQuery table.
What is the best way to write to BigQuery in this scenario if performance (speed) and cost (monetary) are my goals? These files are roughly 10KB each before denormalizing. Each row is roughly 1000 characters; after denormalizing it is about three times as much. I do not need to keep the denormalized files after they are successfully loaded into BigQuery. I am concerned about performance, as well as any specific BigQuery daily quotas around inserts/writes. I don't think there are any unless you are doing DML statements, but correct me if I'm wrong.
I would use Cloud Functions that are triggered when you upload a file to a bucket.
It is so common that Google has a tutorial (with a companion repo) for exactly this, for JSON files: Streaming data from Cloud Storage into BigQuery using Cloud Functions.
Then, I would modify the example main.py file from:
def streaming(data, context):
    '''This function is executed whenever a file is added to Cloud Storage'''
    bucket_name = data['bucket']
    file_name = data['name']
    db_ref = DB.document(u'streaming_files/%s' % file_name)
    if _was_already_ingested(db_ref):
        _handle_duplication(db_ref)
    else:
        try:
            _insert_into_bigquery(bucket_name, file_name)
            _handle_success(db_ref)
        except Exception:
            _handle_error(db_ref)
to this, which accepts CSV files:
import json
import csv
import logging
import os
import traceback
from datetime import datetime

from google.api_core import retry
from google.cloud import bigquery
from google.cloud import storage
import pytz

PROJECT_ID = os.getenv('GCP_PROJECT')
BQ_DATASET = 'fromCloudFunction'
BQ_TABLE = 'mytable'

CS = storage.Client()
BQ = bigquery.Client()

def streaming(data, context):
    '''This function is executed whenever a file is added to Cloud Storage'''
    bucket_name = data['bucket']
    file_name = data['name']

    newRows = postProcessing(bucket_name, file_name)

    # It is recommended that you save
    # what you process for debugging reasons.
    destination_bucket = 'post-processed'  # gs://post-processed/
    destination_name = file_name
    # saveRowsToBucket(newRows, destination_bucket, destination_name)
    rowsInsertIntoBigquery(newRows)

class BigQueryError(Exception):
    '''Exception raised whenever a BigQuery error happened'''

    def __init__(self, errors):
        super().__init__(self._format(errors))
        self.errors = errors

    def _format(self, errors):
        err = []
        for error in errors:
            err.extend(error['errors'])
        return json.dumps(err)

def postProcessing(bucket_name, file_name):
    blob = CS.get_bucket(bucket_name).blob(file_name)
    my_str = blob.download_as_string().decode('utf-8')
    csv_reader = csv.DictReader(my_str.split('\n'))
    newRows = []
    for row in csv_reader:
        modified_row = row  # Add your logic
        newRows.append(modified_row)
    return newRows

def rowsInsertIntoBigquery(rows):
    table = BQ.dataset(BQ_DATASET).table(BQ_TABLE)
    errors = BQ.insert_rows_json(table, rows)
    if errors != []:
        raise BigQueryError(errors)
You would still need to define your map(row -> newRow) and the saveRowsToBucket function if you need it.
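As a sketch of what the map(row -> newRow) step could look like for the denormalization described in the question (the lookup dictionary and column names below are made-up placeholders, not part of the original code):
# Hypothetical lookup used to add a textual description to each row.
CATEGORY_DESCRIPTIONS = {
    '1': 'Retail customer',
    '2': 'Wholesale customer',
}

def denormalize_row(row):
    """Return a copy of the CSV row with extra descriptive fields added."""
    new_row = dict(row)
    # 'category_id' and 'category_description' are placeholder column names.
    new_row['category_description'] = CATEGORY_DESCRIPTIONS.get(
        row.get('category_id'), 'Unknown'
    )
    return new_row
Inside postProcessing you would then set modified_row = denormalize_row(row) instead of passing the row through unchanged.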

How to list all databases and tables in AWS Glue Catalog?

I created a Development Endpoint in the AWS Glue console and now I have access to SparkContext and SQLContext in gluepyspark console.
How can I access the catalog and list all databases and tables? The usual sqlContext.sql("show tables").show() does not work.
What might help is the CatalogConnection class, but I have no idea which package it is in. I tried importing it from awsglue.context with no success.
I spent several hours trying to find some info about the CatalogConnection class but haven't found anything (even in the aws-glue-libs repository https://github.com/awslabs/aws-glue-libs).
In my case I needed the table names in the Glue Job Script console.
In the end I used the boto3 library and retrieved the database and table names with the Glue client:
import boto3

client = boto3.client('glue', region_name='us-east-1')

responseGetDatabases = client.get_databases()
databaseList = responseGetDatabases['DatabaseList']

for databaseDict in databaseList:
    databaseName = databaseDict['Name']
    print('\ndatabaseName: ' + databaseName)

    responseGetTables = client.get_tables(DatabaseName=databaseName)
    tableList = responseGetTables['TableList']

    for tableDict in tableList:
        tableName = tableDict['Name']
        print('\n-- tableName: ' + tableName)
The important thing is to set up the region properly.
Reference:
get_databases - http://boto3.readthedocs.io/en/latest/reference/services/glue.html#Glue.Client.get_databases
get_tables - http://boto3.readthedocs.io/en/latest/reference/services/glue.html#Glue.Client.get_tables
Glue returns one page per response. If you have more than 100 tables, make sure you use NextToken to retrieve all of them.
import boto3

# glue_client was not defined in the original snippet; assuming a default Glue client.
glue_client = boto3.client('glue')

def get_glue_tables(database=None):
    next_token = ""
    while True:
        response = glue_client.get_tables(
            DatabaseName=database,
            NextToken=next_token
        )
        for table in response.get('TableList'):
            print(table.get('Name'))
        next_token = response.get('NextToken')
        if next_token is None:
            break
The boto3 API also supports pagination, so you could use the following instead:
import boto3

glue = boto3.client('glue')

# The paginator follows NextToken for you.
paginator = glue.get_paginator('get_tables')
page_iterator = paginator.paginate(
    DatabaseName='database_name'
)

for page in page_iterator:
    print(page['TableList'])
That way you don't have to mess with while loops or the next token.
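To cover the "all databases and tables" part of the question in one go, the two paginators can be combined; a minimal sketch, assuming default credentials and region:
import boto3

glue = boto3.client('glue')

# Iterate over every database in the Data Catalog, then over every table in each database.
for db_page in glue.get_paginator('get_databases').paginate():
    for database in db_page['DatabaseList']:
        db_name = database['Name']
        print(f'Database: {db_name}')
        for tbl_page in glue.get_paginator('get_tables').paginate(DatabaseName=db_name):
            for table in tbl_page['TableList']:
                print(f'  Table: {table["Name"]}')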