How can I set GOOGLE_APPLICATION_CREDENTIALS in AWS Glue? - google-bigquery

I have an Glue script where I desire to use the GoogleCloud API py to manage the data of my BigQuery tables and combine them with the S3 catalog table.
I added to the script the Job parameter:
Key --additional-python-modules
Value s3://MYS3PATH/google_cloud_bigquery-2.34.0-py2.py3-none-any.whl
The code of my job uses:
from google.cloud import bigquery
path_s3_google_credential = "s3://MYS3PATH/service_key_google.json"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = path_s3_google_credential
project_id = "myproj-google-bigquery"
client_bq = bigquery.Client(project=project_id)
When I run the code, I receive:
DefaultCredentialsError: File s3://MYS3PATH/service_key_google.json was not found.
I also have in AWS Secrets Manager the service account already set with the same service key defined inside my service_key_google.json.
How can I proceed to allow my job to connect to BigQuery?

Related

Bigquery LoadJobConfig Delete Source Files After Transfer

When creating a Bigquery Data Transfer Service Job Manually through the UI, I can select an option to delete source files after transfer. When I try to use the CLI or the Python Client to create on-demand Data Transfer Service Jobs, I do not see an option to delete the source files after transfer. Do you know if there is another way to do so? Right now, my Source URI is gs://<bucket_path>/*, so it's not trivial to delete the files myself.
For me works this snippet (replace YOUR-... with your data):
from google.cloud import bigquery_datatransfer
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "YOUR-CRED-FILE-PATH"
transfer_client = bigquery_datatransfer.DataTransferServiceClient()
destination_project_id = "YOUR-PROJECT-ID"
destination_dataset_id = "YOUR-DATASET-ID"
transfer_config = bigquery_datatransfer.TransferConfig(
destination_dataset_id=destination_dataset_id,
display_name="YOUR-TRANSFER-NAME",
data_source_id="google_cloud_storage",
params={
"data_path_template":"gs://PATH-TO-YOUR-DATA/*.csv",
"destination_table_name_template":"YOUR-TABLE-NAME",
"file_format":"CSV",
"skip_leading_rows":"1",
"delete_source_files": True
},
)
transfer_config = transfer_client.create_transfer_config(
parent=transfer_client.common_project_path(destination_project_id),
transfer_config=transfer_config,
)
print(f"Created transfer config: {transfer_config.name}")
In this example, table YOUR-TABLE-NAME must already exist in BigQuery, otherwise the transfer will crash with error "Not found: Table YOUR-TABLE-NAME".
I used this packages:
google-cloud-bigquery-datatransfer>=3.4.1
google-cloud-bigquery>=2.31.0
Pay attention to the attribute delete_source_files in params. From docs:
Optional param delete_source_files will delete the source files after each successful transfer. (Delete jobs do not retry if the first effort to delete the source files fails.) The default value for the delete_source_files is false.

AWS S3 and PowerBI

Has anyone been successful in connecting PowerBI to AWS S3? Is it possible? Please provide any helpful insight as to how to accomplish this.
I have seen a couple posts about an AWS S3 API. I have no familiarity to APIs so I don't know where to begin. I have also tried using the Web connector in PowerBI Desktop thinking that's where I should begin...
In Power BI, from Get Data you can select Python script and then use the boto3, an example function to download a .csv file from s3 is given below:
import io
import boto3
import pandas as pd
ACCESS_KEY_ID = 'your key id here'
SECRET_ACCESS_KEY = 'your access key here'
s3 = boto3.client('s3', aws_access_key_id = ACCESS_KEY_ID, aws_secret_access_key = SECRET_ACCESS_KEY)
def read_csv_file_from_s3(s3_url):
assert s3_url.startswith('s3://'), 'Url does not starts with s3://'
bucket_name, key_name = s3_url[5:].split('/', 1)
response = s3.get_object(Bucket=bucket_name, Key=key_name)
return pd.read_csv(io.BytesIO(response['Body'].read()))
s3_url = 's3://yourbucket/example.csv'
df = read_csv_file_from_s3(s3_url)
df will appear in data section in Power BI. Also some other examples of using boto3 for importing data to Power BI are given here and here.
Note: You can check and change the python interpreter Power BI is using from Options -> Global -> Python Scripting and install the required libraries/modules accordingly.

Nexus 3 Repository Manager Create (Or Run Pre-generated) Task Without Using User Interface

This question arose when I was trying to reboot my Nexus3 container on a weekly schedule and connect to an S3 bucket I have. I have my container set up to connect to the S3 bucket just fine (it creates a new [A-Z,0-9]-metrics.properties file each time) but the previous artifacts are not found when looking though the UI.
I used the Repair - Reconcile component database from blob store task from the UI settings and it works great!
But... all the previous steps are done automatically through scripts and I would like the same for the final step of Reconciling the blob store.
Connecting to the S3 blob store is done with reference to examples from nexus-book-examples. As below:
Map<String, String> config = new HashMap<>()
config.put("bucket", "nexus-artifact-storage")
blobStore.createS3BlobStore('nexus-artifact-storage', config)
AWS credentials are provided during the docker run step so the above is all that is needed for the blob store set up. It is called by a modified version of provision.sh, which is a script from the nexus-book-examples git page.
Is there a way to either:
Create a task with a groovy script? or,
Reference one of the task types and run the task that way with a POST?
depending on the specific version of repository manager that you are using, there may be REST endpoints for listing and running scheduled tasks. This was introduced in 3.6.0 according to this ticket: https://issues.sonatype.org/browse/NEXUS-11935. For more information about the REST integration in 3.x, check out the following: https://help.sonatype.com/display/NXRM3/Tasks+API
For creating a scheduled task, you will have to add some groovy code. Perhaps the following would be a good start:
import org.sonatype.nexus.scheduling.TaskConfiguration
import org.sonatype.nexus.scheduling.TaskInfo
import org.sonatype.nexus.scheduling.TaskScheduler
import groovy.json.JsonOutput
import groovy.json.JsonSlurper
class TaskXO
{
String typeId
Boolean enabled
String name
String alertEmail
Map<String, String> properties
}
TaskXO task = new JsonSlurper().parseText(args)
TaskScheduler scheduler = container.lookup(TaskScheduler.class.name)
TaskConfiguration config = scheduler.createTaskConfigurationInstance(task.typeId)
config.enabled = task.enabled
config.name = task.name
config.alertEmail = task.alertEmail
task.properties?.each { key, value -> config.setString(key, value) }
TaskInfo taskInfo = scheduler.scheduleTask(config, scheduler.scheduleFactory.manual())
JsonOutput.toJson(taskInfo)

Is it possible to use service accounts to schedule queries in BigQuery "Schedule Query" feature ?

We are using the Beta Scheduled query feature of BigQuery.
Details: https://cloud.google.com/bigquery/docs/scheduling-queries
We have few ETL scheduled queries running overnight to optimize the aggregation and reduce query cost. It works well and there hasn't been much issues.
The problem arises when the person who scheduled the query using their own credentials leaves the organization. I know we can do "update credential" in such cases.
I read through the document and also gave it some try but couldn't really find if we can use a service account instead of individual accounts to schedule queries.
Service accounts are cleaner and ties up to the rest of the IAM framework and is not dependent on a single user.
So if you have any additional information regarding scheduled queries and service account please share.
Thanks for taking time to read the question and respond to it.
Regards
BigQuery Scheduled Query now does support creating a scheduled query with a service account and updating a scheduled query with a service account. Will these work for you?
While it's not supported in BigQuery UI, it's possible to create a transfer (including a scheduled query) using python GCP SDK for DTS, or from BQ CLI.
The following is an example using Python SDK:
r"""Example of creating TransferConfig using service account.
Usage Example:
1. Install GCP BQ python client library.
2. If it has not been done, please grant p4 service account with
iam.serviceAccout.GetAccessTokens permission on your project.
$ gcloud projects add-iam-policy-binding {user_project_id} \
--member='serviceAccount:service-{user_project_number}#'\
'gcp-sa-bigquerydatatransfer.iam.gserviceaccount.com' \
--role='roles/iam.serviceAccountTokenCreator'
where {user_project_id} and {user_project_number} are the user project's
project id and project number, respectively. E.g.,
$ gcloud projects add-iam-policy-binding my-test-proj \
--member='serviceAccount:service-123456789#'\
'gcp-sa-bigquerydatatransfer.iam.gserviceaccount.com'\
--role='roles/iam.serviceAccountTokenCreator'
3. Set environment var PROJECT to your user project, and
GOOGLE_APPLICATION_CREDENTIALS to the service account key path. E.g.,
$ export PROJECT_ID='my_project_id'
$ export GOOGLE_APPLICATION_CREDENTIALS=./serviceacct-creds.json'
4. $ python3 ./create_transfer_config.py
"""
import os
from google.cloud import bigquery_datatransfer
from google.oauth2 import service_account
from google.protobuf.struct_pb2 import Struct
PROJECT = os.environ["PROJECT_ID"]
SA_KEY_PATH = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
credentials = (
service_account.Credentials.from_service_account_file(SA_KEY_PATH))
client = bigquery_datatransfer.DataTransferServiceClient(
credentials=credentials)
# Get full path to project
parent_base = client.project_path(PROJECT)
params = Struct()
params["query"] = "SELECT CURRENT_DATE() as date, RAND() as val"
transfer_config = {
"destination_dataset_id": "my_data_set",
"display_name": "scheduled_query_test",
"data_source_id": "scheduled_query",
"params": params,
}
parent = parent_base + "/locations/us"
response = client.create_transfer_config(parent, transfer_config)
print response
As far as I know, unfortunately you can't use a service account to directly schedule queries yet. Maybe a Googler will correct me, but the BigQuery docs implicitly state this:
https://cloud.google.com/bigquery/docs/scheduling-queries#quotas
A scheduled query is executed with the creator's credentials and
project, as if you were executing the query yourself
If you need to use a service account (which is great practice BTW), then there are a few workarounds listed here. I've raised a FR here for posterity.
This question is very old and came on this thread while I was searching for same.
Yes, It is possible to use service account to schedule big query jobs.
While creating schedule query job, click on "Advance options", you will get option to select service account.
By default is uses credential of requesting user.
Image from bigquery "create schedule query"1

How to list all databases and tables in AWS Glue Catalog?

I created a Development Endpoint in the AWS Glue console and now I have access to SparkContext and SQLContext in gluepyspark console.
How can I access the catalog and list all databases and tables? The usual sqlContext.sql("show tables").show() does not work.
What might help is the CatalogConnection Class but I have no idea in which package it is. I tried importing from awsglue.context and no success.
I spend several hours trying to find some info about CatalogConnection class but haven't found anything. (Even in the aws-glue-lib repository https://github.com/awslabs/aws-glue-libs)
In my case I needed table names in Glue Job Script console
Finally I used boto library and retrieved database and table names with Glue client:
import boto3
client = boto3.client('glue',region_name='us-east-1')
responseGetDatabases = client.get_databases()
databaseList = responseGetDatabases['DatabaseList']
for databaseDict in databaseList:
databaseName = databaseDict['Name']
print '\ndatabaseName: ' + databaseName
responseGetTables = client.get_tables( DatabaseName = databaseName )
tableList = responseGetTables['TableList']
for tableDict in tableList:
tableName = tableDict['Name']
print '\n-- tableName: '+tableName
Important thing is to setup the region properly
Reference:
get_databases - http://boto3.readthedocs.io/en/latest/reference/services/glue.html#Glue.Client.get_databases
get_tables - http://boto3.readthedocs.io/en/latest/reference/services/glue.html#Glue.Client.get_tables
Glue returns back one page per response. If you have more than 100 tables, make sure you use NextToken to retrieve all tables.
def get_glue_tables(database=None):
next_token = ""
while True:
response = glue_client.get_tables(
DatabaseName=database,
NextToken=next_token
)
for table in response.get('TableList'):
print(table.get('Name'))
next_token = response.get('NextToken')
if next_token is None:
break
The boto3 api also supports pagination, so you could use the following instead:
import boto3
glue = boto3.client('glue')
paginator = glue.get_paginator('get_tables')
page_iterator = paginator.paginate(
DatabaseName='database_name'
)
for page in page_iterator:
print(page['TableList'])
That way you don't have to mess with while loops or the next token.