How to list all databases and tables in AWS Glue Catalog?

I created a Development Endpoint in the AWS Glue console and now I have access to SparkContext and SQLContext in gluepyspark console.
How can I access the catalog and list all databases and tables? The usual sqlContext.sql("show tables").show() does not work.
What might help is the CatalogConnection class, but I have no idea which package it is in. I tried importing from awsglue.context with no success.

I spent several hours trying to find some info about the CatalogConnection class but haven't found anything (even in the aws-glue-libs repository: https://github.com/awslabs/aws-glue-libs).
In my case I needed the table names in the Glue Job Script console.
Finally I used the boto3 library and retrieved database and table names with the Glue client:
import boto3

client = boto3.client('glue', region_name='us-east-1')

responseGetDatabases = client.get_databases()
databaseList = responseGetDatabases['DatabaseList']
for databaseDict in databaseList:
    databaseName = databaseDict['Name']
    print('\ndatabaseName: ' + databaseName)
    responseGetTables = client.get_tables(DatabaseName=databaseName)
    tableList = responseGetTables['TableList']
    for tableDict in tableList:
        tableName = tableDict['Name']
        print('\n-- tableName: ' + tableName)
The important thing is to set up the region properly.
Reference:
get_databases - http://boto3.readthedocs.io/en/latest/reference/services/glue.html#Glue.Client.get_databases
get_tables - http://boto3.readthedocs.io/en/latest/reference/services/glue.html#Glue.Client.get_tables

Glue returns one page per response. If you have more than 100 tables, make sure you use NextToken to retrieve them all:
import boto3

glue_client = boto3.client('glue')

def get_glue_tables(database=None):
    next_token = ""
    while True:
        response = glue_client.get_tables(
            DatabaseName=database,
            NextToken=next_token
        )
        for table in response.get('TableList'):
            print(table.get('Name'))
        next_token = response.get('NextToken')
        if next_token is None:
            break

The boto3 api also supports pagination, so you could use the following instead:
import boto3

glue = boto3.client('glue')
paginator = glue.get_paginator('get_tables')
page_iterator = paginator.paginate(
    DatabaseName='database_name'
)
for page in page_iterator:
    print(page['TableList'])
That way you don't have to mess with while loops or the next token.

Related

Recursively copy s3 objects from one s3 prefix to another in airflow

I am trying to copy files that I receive hourly in my incoming bucket, which have the below format:
s3://input-bucket/source_system1/prod/2022-09-27-00/input_folder/filename1.csv
s3://input-bucket/source_system1/prod/2022-09-27-00/input_folder/filename2.csv
s3://input-bucket/source_system1/prod/2022-09-27-01/input_folder/filename3.csv
s3://input-bucket/source_system1/prod/2022-09-27-11/input_folder/filename3.csv
I want to copy the objects into a destination folder with a single airflow task for a specific source system.
I tried -
s3_copy = S3CopyObjectOperator(
    task_id=f"copy_s3_objects_{TC_ENV.lower()}",
    source_bucket_key="s3://input-bucket/source_system1/prod/2022-09-27-*",
    dest_bucket_name="destination-bucket",
    dest_bucket_key=f"producers/prod/event_type=source_system/execution_date={EXECUTION_DATE}",
    aws_conn_id=None
)
The problem with the above is that I am not able to use wildcards for the source bucket key; it needs to be a specific, complete prefix of the S3 object. I also tried a combination of S3ListOperator and S3FileTransformOperator, but both created a single task for each object. I need one Airflow task per source system that copies all the data matching this wildcard pattern:
s3://input-bucket/source_system1/prod/2022-09-27-*
How can I achieve this?
If you wish to achieve this in one specific task, I recommend using the PythonOperator together with the S3Hook, as follows:
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

def s3_copy(**kwargs):
    hook = S3Hook(aws_conn_id='aws_default')
    source_bucket = Variable.get('source_bucket')
    keys = hook.list_keys(bucket_name=source_bucket, prefix='')
    for key in keys:
        hook.copy_object(
            source_bucket_name=source_bucket,
            dest_bucket_name=Variable.get('dest_bucket'),
            source_bucket_key=key,
            dest_bucket_key=key,
            acl_policy='bucket-owner-full-control'
        )

with DAG('example_dag',
         schedule_interval='0 1 * * *',
         start_date=datetime(2023, 1, 1),
         catchup=False
         ):
    e0 = EmptyOperator(task_id='start')
    t1 = PythonOperator(
        task_id='example_copy',
        python_callable=s3_copy
    )
    e0 >> t1
You could improve on this base logic to make it more performant, or add some filtering, etc.
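One such filtering improvement: since `list_keys` returns plain key strings, the question's wildcard pattern can be applied client-side with the standard `fnmatch` module before copying. A minimal sketch (the `filter_keys` helper and the sample keys are illustrative, not part of any Airflow API):

```python
import fnmatch

def filter_keys(keys, pattern):
    """Return only the keys matching a shell-style wildcard pattern."""
    return [key for key in keys if fnmatch.fnmatch(key, pattern)]

# Sample keys shaped like the question's layout.
keys = [
    "source_system1/prod/2022-09-27-00/input_folder/filename1.csv",
    "source_system1/prod/2022-09-27-01/input_folder/filename3.csv",
    "source_system2/prod/2022-09-27-00/input_folder/filename9.csv",
]
matched = filter_keys(keys, "source_system1/prod/2022-09-27-*")
# matched contains only the two source_system1 keys
```

Inside the `s3_copy` callable you would filter the result of `hook.list_keys(...)` this way and copy only the matches.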

How to convert S3 bucket content(.csv format) into a dataframe in AWS Lambda

I am trying to ingest S3 data (a CSV file) into RDS (MSSQL) through Lambda. Sample code:
import boto3
from urllib.parse import unquote_plus

s3 = boto3.client('s3')

if event:
    file_obj = event["Records"][0]
    bucketname = str(file_obj["s3"]["bucket"]["name"])
    csv_filename = unquote_plus(str(file_obj["s3"]["object"]["key"]))
    print("Filename: ", csv_filename)
    csv_fileObj = s3.get_object(Bucket=bucketname, Key=csv_filename)
    file_content = csv_fileObj["Body"].read().decode("utf-8").split()
I have tried putting my CSV contents into a list, but it didn't work:
results = []
for row in csv.DictReader(file_content):
    results.append(row.values())
print(results)
print(file_content)
return {
    'statusCode': 200,
    'body': json.dumps('S3 file processed')
}
Is there any way I could convert "file_content" into a dataframe in Lambda? I have multiple columns to load.
Later I would follow this approach to load the data into RDS:
import pyodbc
import pandas as pd

# insert data from csv file into dataframe(df).
server = 'yourservername'
database = 'AdventureWorks'
username = 'username'
password = 'yourpassword'
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=' + server + ';DATABASE=' + database + ';UID=' + username + ';PWD=' + password)
cursor = cnxn.cursor()

# Insert Dataframe into SQL Server:
for index, row in df.iterrows():
    cursor.execute(
        "INSERT INTO HumanResources.DepartmentTest (DepartmentID,Name,GroupName) values(?,?,?)",
        row.DepartmentID, row.Name, row.GroupName
    )
cnxn.commit()
cursor.close()
Can anyone suggest how to go about it?
You can use io.BytesIO to get the bytes data into memory and then use pandas read_csv to transform it into a dataframe. Note that there is a strange SSL download limit for dataframes that will lead to issues when downloading data larger than 2 GB. That is why I have used chunking in the code below.
import io

obj = s3.get_object(Bucket=bucketname, Key=csv_filename)
# This should prevent the 2GB download limit from a python ssl internal
chunks = (chunk for chunk in obj["Body"].iter_chunks(chunk_size=1024 ** 3))
data = io.BytesIO(b"".join(chunks))  # This keeps everything fully in memory
df = pd.read_csv(data)  # here you can also provide any necessary args and kwargs
It appears that your goal is to load the contents of a CSV file from Amazon S3 into SQL Server.
You could do this without using Dataframes:
Loop through the Event Records (multiple can be passed-in)
For each object:
Download the object to /tmp/
Use the Python CSVReader to loop through the contents of the file
Generate INSERT statements to insert the data into the SQL Server table
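The CSVReader-to-INSERT steps above could be sketched as follows. The table and column names are borrowed from the question's pyodbc example, and the `csv_to_insert_params` helper is hypothetical, not part of any library:

```python
import csv
import io

def csv_to_insert_params(csv_text, table, columns):
    """Turn CSV text into (sql, params) pairs for a parameterized INSERT.

    `table` and `columns` must match your real SQL Server schema; the
    names used below come from the question's pyodbc example.
    """
    placeholders = ",".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({','.join(columns)}) VALUES ({placeholders})"
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield sql, tuple(row[col] for col in columns)

sample = "DepartmentID,Name,GroupName\n1,Engineering,R&D\n2,Sales,Field\n"
stmts = list(csv_to_insert_params(
    sample, "HumanResources.DepartmentTest",
    ["DepartmentID", "Name", "GroupName"]))
```

Each (sql, params) pair can then be passed to `cursor.execute(sql, *params)`; parameterized statements avoid injecting raw CSV values into the SQL string.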
You might also consider using aws-data-wrangler: Pandas on AWS, which is available as a Lambda Layer.

AWS S3 and PowerBI

Has anyone been successful in connecting PowerBI to AWS S3? Is it possible? Please provide any helpful insight as to how to accomplish this.
I have seen a couple of posts about an AWS S3 API. I have no familiarity with APIs, so I don't know where to begin. I have also tried using the Web connector in Power BI Desktop, thinking that's where I should begin...
In Power BI, from Get Data you can select Python script and then use boto3; an example function to download a .csv file from S3 is given below:
import io
import boto3
import pandas as pd

ACCESS_KEY_ID = 'your key id here'
SECRET_ACCESS_KEY = 'your access key here'

s3 = boto3.client('s3', aws_access_key_id=ACCESS_KEY_ID, aws_secret_access_key=SECRET_ACCESS_KEY)

def read_csv_file_from_s3(s3_url):
    assert s3_url.startswith('s3://'), 'Url does not start with s3://'
    bucket_name, key_name = s3_url[5:].split('/', 1)
    response = s3.get_object(Bucket=bucket_name, Key=key_name)
    return pd.read_csv(io.BytesIO(response['Body'].read()))

s3_url = 's3://yourbucket/example.csv'
df = read_csv_file_from_s3(s3_url)
df will appear in the data section in Power BI. Some other examples of using boto3 for importing data into Power BI are given here and here.
Note: You can check and change the python interpreter Power BI is using from Options -> Global -> Python Scripting and install the required libraries/modules accordingly.

Denormalize a GCS file before uploading to BigQuery

I have written a Cloud Run API in .Net Core that reads files from a GCS location and then is supposed to denormalize (i.e. add more information for each row to include textual descriptions) and then write that to a BigQuery table. I have two options:
My cloud run API could create denormalized CSV files and write them to another GCS location. Then another cloud run API could pick up those denormalized CSV files and write them straight to BigQuery.
My cloud run API could read the original CSV file, denormalize them in memory (filestream) and then somehow write from the in memory filestream straight to the BigQuery table.
What is the best way to write to BigQuery in this scenario if performance (speed) and cost (monetary) are my goals? These files are roughly 10 KB each before denormalizing. Each row is roughly 1000 characters. After denormalizing it is about three times as much. I do not need to keep the denormalized files after they are successfully loaded into BigQuery. I am concerned about performance, as well as any specific BigQuery daily quotas around inserts/writes. I don't think there are any unless you are doing DML statements, but correct me if I'm wrong.
I would use Cloud Functions that are triggered when you upload a file to a bucket.
It is so common that Google has a tutorial for exactly this for JSON files: Streaming data from Cloud Storage into BigQuery using Cloud Functions.
Then, I would modify the example main.py file from:
def streaming(data, context):
    '''This function is executed whenever a file is added to Cloud Storage'''
    bucket_name = data['bucket']
    file_name = data['name']
    db_ref = DB.document(u'streaming_files/%s' % file_name)
    if _was_already_ingested(db_ref):
        _handle_duplication(db_ref)
    else:
        try:
            _insert_into_bigquery(bucket_name, file_name)
            _handle_success(db_ref)
        except Exception:
            _handle_error(db_ref)
To this, which accepts CSV files:
import json
import csv
import logging
import os
import traceback
from datetime import datetime

from google.api_core import retry
from google.cloud import bigquery
from google.cloud import storage
import pytz

PROJECT_ID = os.getenv('GCP_PROJECT')
BQ_DATASET = 'fromCloudFunction'
BQ_TABLE = 'mytable'
CS = storage.Client()
BQ = bigquery.Client()

def streaming(data, context):
    '''This function is executed whenever a file is added to Cloud Storage'''
    bucket_name = data['bucket']
    file_name = data['name']

    newRows = postProcessing(bucket_name, file_name)

    # It is recommended that you save
    # what you process for debugging reasons.
    destination_bucket = 'post-processed'  # gs://post-processed/
    destination_name = file_name
    # saveRowsToBucket(newRows, destination_bucket, destination_name)
    rowsInsertIntoBigquery(newRows)

class BigQueryError(Exception):
    '''Exception raised whenever a BigQuery error happened'''

    def __init__(self, errors):
        super().__init__(self._format(errors))
        self.errors = errors

    def _format(self, errors):
        err = []
        for error in errors:
            err.extend(error['errors'])
        return json.dumps(err)

def postProcessing(bucket_name, file_name):
    blob = CS.get_bucket(bucket_name).blob(file_name)
    my_str = blob.download_as_string().decode('utf-8')
    csv_reader = csv.DictReader(my_str.split('\n'))
    newRows = []
    for row in csv_reader:
        modified_row = row  # Add your logic
        newRows.append(modified_row)
    return newRows

def rowsInsertIntoBigquery(rows):
    table = BQ.dataset(BQ_DATASET).table(BQ_TABLE)
    errors = BQ.insert_rows_json(table, rows)
    if errors != []:
        raise BigQueryError(errors)
It would still be necessary to define your map (row -> newRow), and the function saveRowsToBucket if you need it.
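The map (row -> newRow) depends entirely on your data, but as a hedged sketch, where the lookup table and the 'type_code'/'type_description' field names are invented purely for illustration, it could look like:

```python
# Hypothetical lookup table: maps a code found in the CSV to a textual
# description (this is the "denormalization" the question describes).
DESCRIPTIONS = {
    "001": "Standard shipment",
    "002": "Express shipment",
}

def denormalize_row(row):
    """Return a copy of the row enriched with a textual description.

    The 'type_code'/'type_description' field names are illustrative only.
    """
    new_row = dict(row)
    new_row["type_description"] = DESCRIPTIONS.get(row.get("type_code"), "Unknown")
    return new_row

row = {"id": "42", "type_code": "002"}
new_row = denormalize_row(row)
```

In postProcessing you would set modified_row = denormalize_row(row) instead of passing the row through unchanged.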

export big query table locally

I have a BigQuery table that I would like to read into a pandas DataFrame. The table is big, and the usual pd.read_gbq() function gets stuck and does not manage to retrieve the data.
I implemented a chunking mechanism using pandas that works, but it takes a long time to fetch (an hour for 9M rows), so I'm looking for a new solution.
I would like to download the table as a CSV file and then read it. I saw this code in the Google Cloud docs:
# from google.cloud import bigquery
# client = bigquery.Client()
# bucket_name = 'my-bucket'
project = 'bigquery-public-data'
dataset_id = 'samples'
table_id = 'shakespeare'

destination_uri = 'gs://{}/{}'.format(bucket_name, 'shakespeare.csv')
dataset_ref = client.dataset(dataset_id, project=project)
table_ref = dataset_ref.table(table_id)

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    # Location must match that of the source table.
    location='US')  # API request
extract_job.result()  # Waits for job to complete.

print('Exported {}:{}.{} to {}'.format(
    project, dataset_id, table_id, destination_uri))
but all the URIs shown in the examples are Google Cloud Storage bucket URIs, not local ones, and I didn't manage to download the table (I tried a local URI, which gave me an error).
Is there a way to download the table's data as csv file without using a bucket?
As mentioned here, the limitation with BigQuery export is:
You cannot export data to a local file or to Google Drive, but you can save query results to a local file. The only supported export location is Cloud Storage.
Is there a way to download the table's data as csv file without using a bucket?
So, now that we know we can save query results locally, you can use something like this:
from google.cloud import bigquery

client = bigquery.Client()

# Perform a query.
QUERY = (
    'SELECT * FROM `project_name.dataset_name.table_name`')
query_job = client.query(QUERY)  # API request
rows = query_job.result()  # Waits for query to finish

for row in rows:
    print(row.name)
This rows variable will have all the table rows, and you can either use it directly or write it to a local file.
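Writing those rows to a local CSV file could be sketched as below. The `rows_to_csv` helper and the sample rows are illustrative; with real BigQuery results you would build each dict from the Row object (for example with dict(row)) before passing it in:

```python
import csv

def rows_to_csv(rows, path, fieldnames):
    """Write an iterable of dict-like rows to a local CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            writer.writerow({name: row[name] for name in fieldnames})

# Stand-in rows shaped like the shakespeare sample mentioned earlier.
rows = [{"name": "hamlet", "word_count": 5318},
        {"name": "kinglear", "word_count": 4869}]
rows_to_csv(rows, "export.csv", ["name", "word_count"])
```

This sidesteps the bucket entirely, at the cost of streaming the whole result through the client instead of using a server-side extract job.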