Has anyone been successful in connecting Power BI to AWS S3? Is it possible? Please provide any helpful insight on how to accomplish this.
I have seen a couple of posts about an AWS S3 API, but I have no familiarity with APIs, so I don't know where to begin. I have also tried using the Web connector in Power BI Desktop, thinking that's where I should begin...
In Power BI, from Get Data you can select Python script and then use boto3. An example function to download a .csv file from S3 is given below:
import io
import boto3
import pandas as pd

ACCESS_KEY_ID = 'your key id here'
SECRET_ACCESS_KEY = 'your access key here'

s3 = boto3.client('s3', aws_access_key_id=ACCESS_KEY_ID, aws_secret_access_key=SECRET_ACCESS_KEY)

def read_csv_file_from_s3(s3_url):
    # Download the object from S3 and load it into a pandas DataFrame.
    assert s3_url.startswith('s3://'), 'URL does not start with s3://'
    bucket_name, key_name = s3_url[5:].split('/', 1)
    response = s3.get_object(Bucket=bucket_name, Key=key_name)
    return pd.read_csv(io.BytesIO(response['Body'].read()))

s3_url = 's3://yourbucket/example.csv'
df = read_csv_file_from_s3(s3_url)
df will appear in the data section in Power BI. Some other examples of using boto3 to import data into Power BI are given here and here.
Note: You can check and change the python interpreter Power BI is using from Options -> Global -> Python Scripting and install the required libraries/modules accordingly.
I have a Glue script in which I want to use the Google Cloud BigQuery Python API to manage the data in my BigQuery tables and combine it with an S3 catalog table.
I added the following job parameter to the script:
Key: --additional-python-modules
Value: s3://MYS3PATH/google_cloud_bigquery-2.34.0-py2.py3-none-any.whl
My job's code uses:
import os
from google.cloud import bigquery

path_s3_google_credential = "s3://MYS3PATH/service_key_google.json"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = path_s3_google_credential

project_id = "myproj-google-bigquery"
client_bq = bigquery.Client(project=project_id)
When I run the code, I receive:
DefaultCredentialsError: File s3://MYS3PATH/service_key_google.json was not found.
I also have the service account already set up in AWS Secrets Manager, with the same service key that is defined inside my service_key_google.json.
How can I proceed to allow my job to connect to BigQuery?
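For context, GOOGLE_APPLICATION_CREDENTIALS has to point to a file on the local filesystem, not an s3:// URL, so one possible approach is to download the key to local disk first. A minimal sketch, assuming the job role can read the key object (the bucket and key names below are placeholders):

import os
import boto3

# Assumption: copy the service key from S3 to local disk, then point
# GOOGLE_APPLICATION_CREDENTIALS at the local copy.
local_key_path = "/tmp/service_key_google.json"
s3 = boto3.client("s3")
s3.download_file("MY-BUCKET", "path/to/service_key_google.json", local_key_path)

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = local_key_path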
I have written a Cloud Run API in .NET Core that reads files from a GCS location, denormalizes them (i.e. adds more information to each row, such as textual descriptions), and then writes the result to a BigQuery table. I have two options:
Option 1: My Cloud Run API could create denormalized CSV files and write them to another GCS location. Then another Cloud Run API could pick up those denormalized CSV files and write them straight to BigQuery.
Option 2: My Cloud Run API could read the original CSV file, denormalize it in memory (as a file stream), and then somehow write from the in-memory stream straight to the BigQuery table.
What is the best way to write to BigQuery in this scenario if performance (speed) and cost (monetary) are my goals? These files are roughly 10 KB each before denormalizing. Each row is roughly 1,000 characters; after denormalizing it is about three times as much. I do not need to keep the denormalized files after they are successfully loaded into BigQuery. I am concerned about performance, as well as any BigQuery daily quotas around inserts/writes. I don't think there are any unless you are doing DML statements, but correct me if I'm wrong.
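For reference, option 2 corresponds to a BigQuery load job fed from an in-memory stream. A minimal Python sketch of that mechanism (the service in the question is .NET Core, so this is only illustrative, and the table ID below is a placeholder):

import io
from google.cloud import bigquery

bq_client = bigquery.Client()

def load_csv_from_memory(csv_bytes, table_id):
    # table_id is a hypothetical "project.dataset.table" identifier.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    load_job = bq_client.load_table_from_file(
        io.BytesIO(csv_bytes), table_id, job_config=job_config
    )
    load_job.result()  # wait for the load job to complete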
I would use Cloud Functions that are triggered when you upload a file to a bucket.
It is so common that Google has a tutorial for exactly this for JSON files: Streaming data from Cloud Storage into BigQuery using Cloud Functions.
Then, I would modify the example main.py file from:
def streaming(data, context):
    '''This function is executed whenever a file is added to Cloud Storage'''
    bucket_name = data['bucket']
    file_name = data['name']
    db_ref = DB.document(u'streaming_files/%s' % file_name)
    if _was_already_ingested(db_ref):
        _handle_duplication(db_ref)
    else:
        try:
            _insert_into_bigquery(bucket_name, file_name)
            _handle_success(db_ref)
        except Exception:
            _handle_error(db_ref)
To this, which accepts CSV files:
import json
import csv
import logging
import os
import traceback
from datetime import datetime

from google.api_core import retry
from google.cloud import bigquery
from google.cloud import storage
import pytz

PROJECT_ID = os.getenv('GCP_PROJECT')
BQ_DATASET = 'fromCloudFunction'
BQ_TABLE = 'mytable'

CS = storage.Client()
BQ = bigquery.Client()

def streaming(data, context):
    '''This function is executed whenever a file is added to Cloud Storage'''
    bucket_name = data['bucket']
    file_name = data['name']

    newRows = postProcessing(bucket_name, file_name)

    # It is recommended that you save
    # what you process for debugging reasons.
    destination_bucket = 'post-processed'  # gs://post-processed/
    destination_name = file_name
    # saveRowsToBucket(newRows, destination_bucket, destination_name)
    rowsInsertIntoBigquery(newRows)

class BigQueryError(Exception):
    '''Exception raised whenever a BigQuery error happened'''

    def __init__(self, errors):
        super().__init__(self._format(errors))
        self.errors = errors

    def _format(self, errors):
        err = []
        for error in errors:
            err.extend(error['errors'])
        return json.dumps(err)

def postProcessing(bucket_name, file_name):
    blob = CS.get_bucket(bucket_name).blob(file_name)
    my_str = blob.download_as_string().decode('utf-8')
    csv_reader = csv.DictReader(my_str.split('\n'))
    newRows = []
    for row in csv_reader:
        modified_row = row  # Add your logic
        newRows.append(modified_row)
    return newRows

def rowsInsertIntoBigquery(rows):
    table = BQ.dataset(BQ_DATASET).table(BQ_TABLE)
    errors = BQ.insert_rows_json(table, rows)
    if errors != []:
        raise BigQueryError(errors)
It would still be necessary to define your map(row -> newRow) logic and the saveRowsToBucket function if you need it.
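A minimal sketch of what saveRowsToBucket could look like, reusing the CS storage client from the example above (writing the rows as newline-delimited JSON is my assumption, not part of the original tutorial):

def saveRowsToBucket(rows, destination_bucket, destination_name):
    # Assumption: persist the processed rows as newline-delimited JSON
    # so they can be inspected or replayed later.
    blob = CS.bucket(destination_bucket).blob(destination_name)
    blob.upload_from_string(
        '\n'.join(json.dumps(row) for row in rows),
        content_type='application/json'
    )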
I am testing image recognition (Amazon Rekognition) from AWS. So far so good. What I am having problems with is indexing faces from the CLI. I can index one at a time, but I would like to tell AWS to index all faces in a bucket. To index a single face I call this:
aws rekognition index-faces --image "S3Object={Bucket=bname,Name=123.jpg}" --collection-id "myCollection" --detection-attributes "ALL" --external-image-id "myImgID"
How do I tell it to index all images in the "bname" bucket?
I tried this:
aws rekognition index-faces --image "S3Object={Bucket=bname}" --collection-id "myCollection" --detection-attributes "ALL" --external-image-id "myImgID"
No luck.
You currently can't index multiple faces in one index-faces call. A script that calls list-objects on the bucket and then loops through the results would accomplish what you want.
In case it helps anyone in the future: I had a similar need, so I wrote this Python 3.6 script to do exactly what @chris-adzima recommends, and I executed it from a Lambda function.
import boto3
import concurrent.futures

bucket_name = "MY_BUCKET_NAME"
collection_id = "MY_COLLECTION_ID"

rekognition = boto3.client('rekognition')
s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket_name)

def handle_image(key):
    rekognition.index_faces(
        CollectionId=collection_id,
        Image={
            'S3Object': {
                'Bucket': bucket_name,
                'Name': key
            }
        }
    )

def lambda_handler(event, context):
    pic_keys = [o.key for o in bucket.objects.all() if o.key.endswith('.png')]
    with concurrent.futures.ThreadPoolExecutor() as executor:
        executor.map(handle_image, pic_keys)
I created a Development Endpoint in the AWS Glue console and now I have access to SparkContext and SQLContext in the gluepyspark console.
How can I access the catalog and list all databases and tables? The usual sqlContext.sql("show tables").show() does not work.
What might help is the CatalogConnection class, but I have no idea which package it is in. I tried importing it from awsglue.context with no success.
I spent several hours trying to find some info about the CatalogConnection class but haven't found anything (not even in the aws-glue-libs repository: https://github.com/awslabs/aws-glue-libs).
In my case I needed the table names in the Glue job script console.
Finally I used the boto3 library and retrieved the database and table names with the Glue client:
import boto3

client = boto3.client('glue', region_name='us-east-1')

responseGetDatabases = client.get_databases()
databaseList = responseGetDatabases['DatabaseList']

for databaseDict in databaseList:
    databaseName = databaseDict['Name']
    print('\ndatabaseName: ' + databaseName)

    responseGetTables = client.get_tables(DatabaseName=databaseName)
    tableList = responseGetTables['TableList']

    for tableDict in tableList:
        tableName = tableDict['Name']
        print('\n-- tableName: ' + tableName)
The important thing is to set up the region properly.
Reference:
get_databases - http://boto3.readthedocs.io/en/latest/reference/services/glue.html#Glue.Client.get_databases
get_tables - http://boto3.readthedocs.io/en/latest/reference/services/glue.html#Glue.Client.get_tables
Glue returns one page per response. If you have more than 100 tables, make sure you use NextToken to retrieve all of them.
def get_glue_tables(database=None):
    next_token = ""
    while True:
        response = glue_client.get_tables(
            DatabaseName=database,
            NextToken=next_token
        )
        for table in response.get('TableList'):
            print(table.get('Name'))
        next_token = response.get('NextToken')
        if next_token is None:
            break
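For completeness, the snippet above assumes a module-level Glue client; a minimal usage sketch (the region and database name are placeholders):

import boto3

glue_client = boto3.client('glue', region_name='us-east-1')
get_glue_tables(database='my_database')  # prints every table name in the database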
The boto3 API also supports pagination, so you could use the following instead:
import boto3
glue = boto3.client('glue')
paginator = glue.get_paginator('get_tables')
page_iterator = paginator.paginate(
DatabaseName='database_name'
)
for page in page_iterator:
    print(page['TableList'])
That way you don't have to mess with while loops or the next token.
I am trying to upload content to Amazon S3 but I am getting this error:
boto3.exceptions.UnknownAPIVersionError: The 's3' resource does not an API version of: ...
Valid API versions are: 2006-03-01
import boto3

s3 = boto3.resource('s3', AWS_ACCESS_KEY_ID, AWS_PRIVATE_KEY)
bucket = s3.Bucket(NAME_OF_BUCKET)
obj = bucket.Object(KEY)
obj.upload_fileobj(FILE_OBJECT)
The error is caused by a DataNotFound exception raised in the boto3 Session source code: because the credentials are passed positionally, boto3 interprets them as the region name and the API version, and it cannot find a resource model for that bogus API version. Perhaps the developers didn't realize people would make this mistake, which is why the message is so confusing.
If you read the boto3 documentation example, this is the correct way to upload data (credentials as keyword arguments, and a file object passed to upload_fileobj):
import boto3

s3 = boto3.resource(
    's3',
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_PRIVATE_KEY
)
bucket = s3.Bucket(NAME_OF_BUCKET)
obj = bucket.Object("prefix/object_key_name")

# You must pass a file object!
with open('filename', 'rb') as fileobject:
    obj.upload_fileobj(fileobject)
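If you want to sanity-check the upload afterwards, one option (a small sketch of my own, reusing the placeholder names above) is to read the object's metadata back:

# Hypothetical check: the upload succeeded if the object exists and reports a size.
uploaded = s3.Object(NAME_OF_BUCKET, "prefix/object_key_name")
print(uploaded.content_length)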