How to change the name of the Athena results stored in S3?

The results of an Athena query are saved under the query ID (a long string) in S3. I was wondering if there's a way to save the results of the query under a pre-specified name (one that can later be easily looked up)?

You can do so with a simple AWS Lambda function.
Change names of AWS Athena results stored in S3 bucket
import time

import boto3

client = boto3.client('athena')
s3 = boto3.resource("s3")

# run query (put your own query inside the triple-quoted string)
queryStart = client.start_query_execution(
    QueryString = '''
        SELECT *
        FROM "db_name"."table_name"
        WHERE value > 50
    ''',
    QueryExecutionContext = {
        'Database': "covid_data"  # YOUR_ATHENA_DATABASE_NAME
    },
    ResultConfiguration = {
        # query result output location you configured in AWS Athena
        "OutputLocation": "s3://bucket-name-X/folder-Y/"
    }
)

# executes the query and waits 3 seconds for it to finish
queryId = queryStart['QueryExecutionId']
time.sleep(3)

# copies the newly generated csv file to the destination with a readable name
# source: the query result output location you configured in AWS Athena
queryLoc = "bucket-name-X/folder-Y/" + queryId + ".csv"
# destination bucket and file name
s3.Object("bucket-name-A", "report-2018.csv").copy_from(CopySource = queryLoc)

# deletes the Athena-generated csv and its metadata file
s3.Object("bucket-name-X", "folder-Y/" + queryId + ".csv").delete()
s3.Object("bucket-name-X", "folder-Y/" + queryId + ".csv.metadata").delete()

print('report-2018.csv generated')

Unfortunately no (at least not yet)! The best way to do this as of now is to write a script that goes through the results of each run and renames (copies + deletes) the files in that S3 bucket.
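For example, a minimal sketch of such a script (the bucket, prefix and the new naming scheme are placeholders, not taken from the question):

import boto3

s3 = boto3.client('s3')
bucket = 'bucket-name-X'   # placeholder: bucket holding the Athena results
prefix = 'folder-Y/'       # placeholder: Athena result prefix

# go through every query result and "rename" it (copy + delete)
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get('Contents', []):
        key = obj['Key']
        # skip non-csv objects (e.g. .metadata files) and already-renamed files
        if not key.endswith('.csv') or key.startswith(prefix + 'report-'):
            continue
        # placeholder naming scheme: use the result's last-modified timestamp
        new_key = prefix + 'report-' + obj['LastModified'].strftime('%Y%m%d-%H%M%S') + '.csv'
        s3.copy_object(Bucket=bucket, Key=new_key,
                       CopySource={'Bucket': bucket, 'Key': key})
        s3.delete_object(Bucket=bucket, Key=key)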

For named queries your results location will be structured as follows:
s3://athena-query-results-<account>-<region>/<query-name>/<year>/<month>/<day>/<UUID>.csv
I don't know of any way to specify the UUID from the client. But you could look for the newest file within the S3 folder of your named query.
Alternatively, you could use the s3 API or the aws cli to copy the result into a location of your choice.
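For instance, a rough sketch of that approach with boto3 (the bucket name, query prefix and destination key are placeholders, not from the answer):

import boto3

s3 = boto3.client('s3')
bucket = 'athena-query-results-ACCOUNT-REGION'  # placeholder
prefix = 'my-query-name/2020/07/01/'            # placeholder named-query prefix

# find the newest csv under the named query's result prefix
objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get('Contents', [])
csv_objects = [o for o in objects if o['Key'].endswith('.csv')]
newest = max(csv_objects, key=lambda o: o['LastModified'])

# copy it to a location and name of your choice
s3.copy_object(Bucket=bucket, Key='reports/latest-result.csv',
               CopySource={'Bucket': bucket, 'Key': newest['Key']})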
Does this answer your question?

import boto3

def delete_metadata():
    s3 = boto3.resource('s3')
    client_s3 = boto3.client('s3')
    bucket = s3.Bucket('testing')
    # delete the .metadata files Athena leaves next to each csv result
    for obj in bucket.objects.filter(Prefix='prepared/'):
        if obj.key.endswith('.metadata'):
            print(obj.key)
            client_s3.delete_object(Bucket=bucket.name, Key=obj.key)

Related

How to convert S3 bucket content(.csv format) into a dataframe in AWS Lambda

I am trying to ingest S3 data (a csv file) into RDS (MSSQL) through Lambda. Sample code:
import csv
import json

import boto3
from urllib.parse import unquote_plus

s3 = boto3.client('s3')
if event:
    file_obj = event["Records"][0]
    bucketname = str(file_obj["s3"]["bucket"]["name"])
    csv_filename = unquote_plus(str(file_obj["s3"]["object"]["key"]))
    print("Filename: ", csv_filename)
    csv_fileObj = s3.get_object(Bucket=bucketname, Key=csv_filename)
    file_content = csv_fileObj["Body"].read().decode("utf-8").split()
I have tried to put my csv contents into a list, but it didn't work.
results = []
for row in csv.DictReader(file_content):
    results.append(row.values())
print(results)
print(file_content)
return {
    'statusCode': 200,
    'body': json.dumps('S3 file processed')
}
Is there any way I could convert "file_content" into a dataframe in Lambda? I have multiple columns to load.
Later I would follow this approach to load the data into RDS
import pyodbc
import pandas as pd

# insert data from csv file into dataframe (df).
server = 'yourservername'
database = 'AdventureWorks'
username = 'username'
password = 'yourpassword'
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)
cursor = cnxn.cursor()

# Insert Dataframe into SQL Server:
for index, row in df.iterrows():
    cursor.execute("INSERT INTO HumanResources.DepartmentTest (DepartmentID,Name,GroupName) values(?,?,?)", row.DepartmentID, row.Name, row.GroupName)
cnxn.commit()
cursor.close()
Can anyone suggest how to go about it?
You can use io.BytesIO to get the bytes into memory and then use pandas' read_csv to transform it into a dataframe. Note that there is a strange SSL download limit that will lead to issues when downloading data > 2 GB; that is why I have used chunking in the code below.
import io

import pandas as pd

obj = s3.get_object(Bucket=bucketname, Key=csv_filename)

# This should prevent the 2GB download limit from a python ssl internal
chunks = (chunk for chunk in obj["Body"].iter_chunks(chunk_size=1024**3))
data = io.BytesIO(b"".join(chunks))  # This keeps everything fully in memory
df = pd.read_csv(data)  # here you can also provide any necessary args and kwargs
It appears that your goal is to load the contents of a CSV file from Amazon S3 into SQL Server.
You could do this without using Dataframes:
Loop through the Event Records (multiple records can be passed in)
For each object:
Download the object to /tmp/
Use the Python CSVReader to loop through the contents of the file
Generate INSERT statements to insert the data into the SQL Server table (see the sketch below)
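A minimal sketch of that approach, assuming a hypothetical table and column layout and a connection string supplied via an environment variable (none of these names come from the question):

import csv
import os
from urllib.parse import unquote_plus

import boto3
import pyodbc

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # assumed connection details -- replace with your own
    cnxn = pyodbc.connect(os.environ['MSSQL_CONNECTION_STRING'])
    cursor = cnxn.cursor()

    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = unquote_plus(record['s3']['object']['key'])

        # download the object to /tmp/ (the only writable path in Lambda)
        local_path = '/tmp/' + os.path.basename(key)
        s3.download_file(bucket, key, local_path)

        # loop through the csv and insert row by row
        with open(local_path, newline='') as f:
            reader = csv.reader(f)
            next(reader)  # skip the header row (assumed to exist)
            for row in reader:
                # hypothetical 3-column table -- adjust to your schema
                cursor.execute(
                    "INSERT INTO dbo.MyTable (ColA, ColB, ColC) VALUES (?, ?, ?)",
                    row[0], row[1], row[2]
                )

    cnxn.commit()
    cursor.close()
    return {'statusCode': 200, 'body': 'S3 file processed'}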
You might also consider using aws-data-wrangler: Pandas on AWS, which is available as a Lambda Layer.
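For example, a short sketch assuming the awswrangler Lambda Layer is attached and reusing bucketname and csv_filename from the question:

import awswrangler as wr

# read the uploaded csv straight from S3 into a pandas DataFrame
df = wr.s3.read_csv(path=f"s3://{bucketname}/{csv_filename}")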

export big query table locally

I have a BigQuery table that I would like to work on using a pandas DataFrame. The table is big, and the pd.read_gbq() function gets stuck and does not manage to retrieve the data.
I implemented a chunking mechanism using pandas that works, but it takes a long time to fetch (an hour for 9M rows). So I'm looking for a new solution.
I would like to download the table as a csv file and then read it. I saw this code in the Google Cloud docs:
# from google.cloud import bigquery
# client = bigquery.Client()
# bucket_name = 'my-bucket'
project = 'bigquery-public-data'
dataset_id = 'samples'
table_id = 'shakespeare'

destination_uri = 'gs://{}/{}'.format(bucket_name, 'shakespeare.csv')
dataset_ref = client.dataset(dataset_id, project=project)
table_ref = dataset_ref.table(table_id)

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    # Location must match that of the source table.
    location='US')  # API request
extract_job.result()  # Waits for job to complete.

print('Exported {}:{}.{} to {}'.format(
    project, dataset_id, table_id, destination_uri))
but all the URIs shown in the examples are Google Cloud Storage bucket URIs, not local ones, and I didn't manage to download the table (I tried a local URI, which gave me an error).
Is there a way to download the table's data as csv file without using a bucket?
As mentioned here
The limitation with BigQuery export is: you cannot export data to a local file or to Google Drive, but you can save query results to a local file. The only supported export location is Cloud Storage.
Is there a way to download the table's data as csv file without using a bucket?
Since we know that we can save query results to a local file, you can use something like this:
from google.cloud import bigquery

client = bigquery.Client()

# Perform a query.
QUERY = (
    'SELECT * FROM `project_name.dataset_name.table_name`')
query_job = client.query(QUERY)  # API request
rows = query_job.result()  # Waits for query to finish

for row in rows:
    print(row.name)
This rows variable will hold all the table rows, and you can either use it directly or write it to a local file.
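For example, a minimal sketch of writing those rows to a local csv file (the output filename and the choice to dump every column are assumptions, not from the answer):

import csv

# rows is the iterator returned by query_job.result() above
with open('table_export.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    header_written = False
    for row in rows:
        if not header_written:
            writer.writerow(row.keys())   # column names
            header_written = True
        writer.writerow(row.values())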

How to list all databases and tables in AWS Glue Catalog?

I created a Development Endpoint in the AWS Glue console and now I have access to SparkContext and SQLContext in gluepyspark console.
How can I access the catalog and list all databases and tables? The usual sqlContext.sql("show tables").show() does not work.
What might help is the CatalogConnection class, but I have no idea which package it is in. I tried importing it from awsglue.context with no success.
I spent several hours trying to find some info about the CatalogConnection class but haven't found anything. (Even in the aws-glue-libs repository https://github.com/awslabs/aws-glue-libs)
In my case I needed the table names in the Glue Job script console.
In the end I used the boto3 library and retrieved the database and table names with the Glue client:
import boto3

client = boto3.client('glue', region_name='us-east-1')

responseGetDatabases = client.get_databases()
databaseList = responseGetDatabases['DatabaseList']

for databaseDict in databaseList:
    databaseName = databaseDict['Name']
    print('\ndatabaseName: ' + databaseName)

    responseGetTables = client.get_tables(DatabaseName=databaseName)
    tableList = responseGetTables['TableList']

    for tableDict in tableList:
        tableName = tableDict['Name']
        print('\n-- tableName: ' + tableName)
The important thing is to set up the region properly.
Reference:
get_databases - http://boto3.readthedocs.io/en/latest/reference/services/glue.html#Glue.Client.get_databases
get_tables - http://boto3.readthedocs.io/en/latest/reference/services/glue.html#Glue.Client.get_tables
Glue returns one page per response. If you have more than 100 tables, make sure you use NextToken to retrieve all of them.
import boto3

glue_client = boto3.client('glue')

def get_glue_tables(database=None):
    next_token = ""
    while True:
        response = glue_client.get_tables(
            DatabaseName=database,
            NextToken=next_token
        )
        for table in response.get('TableList'):
            print(table.get('Name'))
        next_token = response.get('NextToken')
        if next_token is None:
            break
The boto3 api also supports pagination, so you could use the following instead:
import boto3

glue = boto3.client('glue')

paginator = glue.get_paginator('get_tables')
page_iterator = paginator.paginate(
    DatabaseName='database_name'
)
for page in page_iterator:
    print(page['TableList'])
That way you don't have to mess with while loops or the next token.

How to rename objects boto3 S3?

I have about 1000 objects in S3 which are named like
abcyearmonthday1
abcyearmonthday2
abcyearmonthday3
...
and want to rename them to
abc/year/month/day/1
abc/year/month/day/2
abc/year/month/day/3
How could I do this through boto3? Is there an easier way of doing it?
As explained in Boto3/S3: Renaming an object using copy_object,
you cannot rename an object in S3; you have to copy the object under a new name and then delete the old object:
import boto3

s3 = boto3.resource('s3')
s3.Object('my_bucket', 'my_file_new').copy_from(CopySource='my_bucket/my_file_old')
s3.Object('my_bucket', 'my_file_old').delete()
There is no direct way to rename an S3 object.
The following two steps need to be performed:
Copy the S3 object to the same location under the new name.
Then delete the older object.
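A minimal sketch of those two steps with the low-level client (bucket and key names are placeholders):

import boto3

s3 = boto3.client('s3')

# Step 1: copy the object under its new key
s3.copy_object(
    Bucket='my_bucket',
    Key='my_file_new',
    CopySource={'Bucket': 'my_bucket', 'Key': 'my_file_old'}
)

# Step 2: delete the old object
s3.delete_object(Bucket='my_bucket', Key='my_file_old')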
I had the same problem (in my case I wanted to rename files generated in S3 by the Redshift UNLOAD command). I solved it by creating a boto3 session and then copy-deleting file by file.
Like this:
import boto3

s3_session = boto3.session.Session(
    aws_access_key_id=my_access_key_id,
    aws_secret_access_key=my_secret_access_key
).resource('s3')

# Save in a list the tuples of filenames (with prefix): [(old_s3_file_path, new_s3_file_path), ..., ()]
# e.g. of tuple: ('prefix/old_filename.csv000', 'prefix/new_filename.csv')
s3_files_to_rename = []
s3_files_to_rename.append((old_file, new_file))

for pair in s3_files_to_rename:
    old_file = pair[0]
    new_file = pair[1]
    s3_session.Object(s3_bucket_name, new_file).copy_from(CopySource=s3_bucket_name + '/' + old_file)
    s3_session.Object(s3_bucket_name, old_file).delete()
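For the roughly 1000 objects in the question, a rough sketch that paginates over the bucket and builds the new slash-separated keys could look like this (the bucket name and the assumed key format abc<YYYY><MM><DD><n> are mine, not the asker's):

import re

import boto3

s3 = boto3.client('s3')
bucket = 'my_bucket'  # placeholder bucket name
# assumed key format: abc<YYYY><MM><DD><n>, e.g. abc201809051
pattern = re.compile(r'^abc(\d{4})(\d{2})(\d{2})(\d+)$')

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix='abc'):
    for obj in page.get('Contents', []):
        match = pattern.match(obj['Key'])
        if not match:
            continue
        new_key = 'abc/{}/{}/{}/{}'.format(*match.groups())
        s3.copy_object(Bucket=bucket, Key=new_key,
                       CopySource={'Bucket': bucket, 'Key': obj['Key']})
        s3.delete_object(Bucket=bucket, Key=obj['Key'])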

Amazon S3 boto - how to create a folder?

How can I create a folder under a bucket using boto library for Amazon s3?
I followed the manual and created keys with permissions, metadata, etc., but nowhere in boto's documentation does it describe how to create folders under a bucket, or how to create a folder under a folder in a bucket.
There is no concept of folders or directories in S3. You can create file names like "abc/xys/uvw/123.jpg", which many S3 access tools like S3Fox show as a directory structure, but it's actually just a single file in a bucket.
Assume you want to create the folder abc/123/ in your bucket; it's a piece of cake with boto:
k = bucket.new_key('abc/123/')
k.set_contents_from_string('')
Or use the console
Use this:
import boto3
s3 = boto3.client('s3')
bucket_name = "YOUR-BUCKET-NAME"
directory_name = "DIRECTORY/THAT/YOU/WANT/TO/CREATE"  # the folder path you want to create
s3.put_object(Bucket=bucket_name, Key=(directory_name+'/'))
With the AWS SDK for .NET this works perfectly; just add "/" at the end of the folder name string:
var folderKey = folderName + "/"; //end the folder name with "/"
AmazonS3 client = Amazon.AWSClientFactory.CreateAmazonS3Client(AWSAccessKey, AWSSecretKey);
var request = new PutObjectRequest();
request.WithBucketName(AWSBucket);
request.WithKey(folderKey);
request.WithContentBody(string.Empty);
S3Response response = client.PutObject(request);
Then refresh your AWS console, and you will see the folder
I tried many of the methods above, and adding a forward slash / to the end of the key name to create a directory didn't work for me:
client.put_object(Bucket="foo-bucket", Key="test-folder/")
You have to supply the Body parameter in order to create the directory:
client.put_object(Bucket='foo-bucket',Body='', Key='test-folder/')
Source: ryantuck in boto3 issue
Append "_$folder$" to your folder name and call put.
String extension = "_$folder$";
s3.putObject("MyBucket", "MyFolder"+ extension, new ByteArrayInputStream(new byte[0]), null);
see:
http://www.snowgiraffe.com/tech/147/creating-folders-programmatically-with-amazon-s3s-api-putting-babies-in-buckets/
Update for 2019: if you want to create a folder with the path bucket_name/folder1/folder2, you can use this code:
from boto3 import client, resource

class S3Helper:
    def __init__(self):
        self.client = client("s3")
        self.s3 = resource('s3')

    def create_folder(self, path):
        path_arr = path.rstrip("/").split("/")
        if len(path_arr) == 1:
            return self.client.create_bucket(Bucket=path_arr[0])
        parent = path_arr[0]
        bucket = self.s3.Bucket(parent)
        status = bucket.put_object(Key="/".join(path_arr[1:]) + "/")
        return status

s3 = S3Helper()
s3.create_folder("bucket_name/folder1/folder2")
It's really easy to create folders; actually it's just creating keys.
In my code below I create a folder named after utc_time.
Remember to end the key with '/' as below; this indicates it's a folder key:
Key='folder1/' + utc_time + '/'
import datetime
import time

import boto3

client = boto3.client('s3')
utc_timestamp = time.time()

def lambda_handler(event, context):
    UTC_FORMAT = '%Y%m%d'
    utc_time = datetime.datetime.utcfromtimestamp(utc_timestamp)
    utc_time = utc_time.strftime(UTC_FORMAT)

    print('start to create folder for => ' + utc_time)

    putResponse = client.put_object(Bucket='mybucketName',
                                    Key='folder1/' + utc_time + '/')
    print(putResponse)
You can create a folder by appending "/" to your folder name, although under the hood S3 maintains a flat structure, unlike your regular NFS:
var params = {
    Bucket: bucketName,
    Key: folderName + "/"
};
s3.putObject(params, function (err, data) {});
S3 doesn't have a folder structure, but there is something called keys.
We can create keys like /2013/11/xyz.xls, and they will be shown as folders in the console, but the storage layer of S3 treats the whole key as the file name.
Even when retrieving, we can see the files in a particular folder (or under particular keys) by using the ListObjects method with the Prefix parameter.
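For instance, a short sketch of listing the keys under such a "folder" with boto3 (the bucket name and prefix are placeholders):

import boto3

s3 = boto3.client('s3')

# list everything that sits "inside" the 2013/11/ folder
response = s3.list_objects_v2(Bucket='my-bucket', Prefix='2013/11/')
for obj in response.get('Contents', []):
    print(obj['Key'])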
Apparently you can now create folders in S3. I'm not sure since when, but I have a bucket in the "Standard" zone and can choose Create Folder from the Actions dropdown.
This question is more relevant to the future, so adding this update.
I am using the upload_file method as shown below.
import boto3

s3_client = boto3.client('s3')

fold = '/my/system/filePath/tabmcq/Tables/auto/18.tsv'

s3_client.upload_file(
    Filename = fold,  # absolute path to the file on your system
    Bucket = "tab-mcq-de",
    Key = f"{fold.split('/')[-3]}/{fold.split('/')[-2]}/{fold.split('/')[-1]}"
)
The idea is that the "Filename" parameter requires the absolute file path on your system.
The "Key" parameter is the relative file path from the source directory where your files are located.
In this example, the Key parameter has to contain the value "Tables/auto/18.tsv" for the client to create the folders.
Hope this helps.
The following works using Python boto3
import boto3

s3 = boto3.client("s3")
s3.put_object(Bucket="dest_bucket", Key='folder_name/')