Using the django-s3-sqlite package to store a SQLite database on Amazon S3

I am deploying my Django app to AWS Lambda using Zappa.
I am trying to store my SQLite database in an S3 bucket using django_s3_sqlite (following these instructions: https://github.com/FlipperPA/django-s3-sqlite), but when I run the 'zappa manage [instance] s3_sqlite_vacuum' command, I get this message:
'Task timed out after 30.03 seconds'. Does anyone know what causes this? Thank you!

In this case, the command is probably hitting the Zappa Lambda function timeout, which is 30 seconds here. Setting "timeout_seconds": 300 in zappa_settings.json allows the Lambda function to run for longer.
Thank you FlipperPA for the django-s3-sqlite package!
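As a minimal sketch of that change, the snippet below adds the key to one stage of a settings document. The stage name "dev", the settings module and the bucket name are placeholders invented for illustration; only "timeout_seconds" comes from the answer above. In a real project you would simply edit zappa_settings.json by hand and redeploy.

```python
import json


def with_timeout(settings_text, stage, seconds):
    """Return the settings JSON with timeout_seconds set for one stage."""
    settings = json.loads(settings_text)
    settings[stage]["timeout_seconds"] = seconds
    return json.dumps(settings, indent=4, sort_keys=True)


# Hypothetical zappa_settings.json content; the values are placeholders.
example = """
{
    "dev": {
        "django_settings": "mysite.settings",
        "s3_bucket": "my-zappa-bucket"
    }
}
"""

updated = with_timeout(example, "dev", 300)
print(updated)
```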

sqlite3.OperationalError: When trying to connect to S3 Airflow Hook

I'm currently exploring implementing hooks in some of my DAGs. For instance, in one DAG, I'm trying to connect to S3 to send a CSV file to a bucket, which then gets copied to a Redshift table.
I have a custom module which I import to run this process. I am now trying to set up an S3Hook to handle this process instead, but I'm a little confused about setting up the connection and how everything works.
First, I import the hook
from airflow.hooks.S3_hook import S3Hook
Then I try to make the hook instance
s3_hook = S3Hook(aws_conn_id='aws-s3')
Next I try to set up the client
s3_client = s3_hook.get_conn()
However when I run the client line above, I received this error
OperationalError: (sqlite3.OperationalError)
no such table: connection
[SQL: SELECT connection.password AS connection_password, connection.extra AS connection_extra, connection.id AS connection_id, connection.conn_id AS connection_conn_id, connection.conn_type AS connection_conn_type, connection.description AS connection_description, connection.host AS connection_host, connection.schema AS connection_schema, connection.login AS connection_login, connection.port AS connection_port, connection.is_encrypted AS connection_is_encrypted, connection.is_extra_encrypted AS connection_is_extra_encrypted
FROM connection
WHERE connection.conn_id = ?
LIMIT ? OFFSET ?]
[parameters: ('aws-s3', 1, 0)]
(Background on this error at: http://sqlalche.me/e/13/e3q8)
I'm trying to diagnose the error, but the traceback is long. I'm a little confused about why sqlite3 is involved when I'm trying to use S3. Can anyone unpack this? Why is this error thrown when I set up the client?
Thanks
Airflow is not just a library; it's also an application.
To execute Airflow code you must have an Airflow instance running, which also means having a metadata database with the required schema. That is why sqlite3 appears: the hook looks up the 'aws-s3' connection in that database (SQLite by default), and the connection table does not exist yet.
To create the tables, run airflow db init (airflow initdb on Airflow 1.10).
Edit:
After the discussion in the comments: your issue is that you have a working Airflow application inside Docker, but your DAGs are written on your local disk. Docker is an isolated environment, so if you want Airflow to recognize your DAGs you must move the files into the DAG folder inside the container.
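The failing lookup can be reproduced with the standard library alone. The sketch below is a simulation, not Airflow code: the two-column connection table is a simplified stand-in for Airflow's real schema, and lookup_connection is a hypothetical helper. It shows that the same kind of SELECT fails on an empty database and succeeds once the table exists and holds the connection.

```python
import sqlite3


def lookup_connection(db, conn_id):
    """Mimic a hook resolving a conn_id against the metadata database."""
    row = db.execute(
        "SELECT conn_type FROM connection WHERE conn_id = ? LIMIT 1",
        (conn_id,),
    ).fetchone()
    return row[0] if row else None


# A fresh database with no schema, like an Airflow install before init.
db = sqlite3.connect(":memory:")

try:
    lookup_connection(db, "aws-s3")
    error = None
except sqlite3.OperationalError as exc:
    error = str(exc)
print(error)

# Simplified stand-in for what the init command does: create the table.
db.execute("CREATE TABLE connection (conn_id TEXT, conn_type TEXT)")
db.execute("INSERT INTO connection VALUES (?, ?)", ("aws-s3", "aws"))
found = lookup_connection(db, "aws-s3")
print(found)
```

The first lookup raises before any S3 call is attempted, which matches the traceback in the question: the hook never gets as far as talking to AWS.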

Karate JDBC Connection

In a Karate script, is there a way to cache the DB connections? To be more specific, the DB connections are made through a Java program, and every time we make DB calls the connection call is also
def dbDemo=Java.type('tests.DataBaseAssertions')
The above line of code is used in all the feature files. Is there a way to cache this object so that all scripts can refer to it?
Application level
Sounds like you are looking for the callSingle() syntax, please refer to the docs:
https://github.com/intuit/karate#hooks
var result = karate.callSingle('classpath:jdbc.feature');

Is it possible to use service accounts to schedule queries in the BigQuery "Schedule Query" feature?

We are using the Beta Scheduled query feature of BigQuery.
Details: https://cloud.google.com/bigquery/docs/scheduling-queries
We have a few ETL scheduled queries running overnight to optimize the aggregation and reduce query cost. It works well and there haven't been many issues.
The problem arises when the person who scheduled the query using their own credentials leaves the organization. I know we can do "update credential" in such cases.
I read through the documentation and also tried it out, but couldn't find whether we can use a service account instead of individual accounts to schedule queries.
Service accounts are cleaner, tie in with the rest of the IAM framework, and are not dependent on a single user.
So if you have any additional information regarding scheduled queries and service accounts, please share.
Thanks for taking time to read the question and respond to it.
Regards
BigQuery Scheduled Query now supports both creating and updating a scheduled query with a service account. Will these work for you?
While it's not supported in the BigQuery UI, it's possible to create a transfer (including a scheduled query) using the Python GCP SDK for DTS, or from the BQ CLI.
The following is an example using the Python SDK:
r"""Example of creating TransferConfig using service account.

Usage Example:
1. Install GCP BQ python client library.
2. If it has not been done, please grant p4 service account with
   iam.serviceAccounts.getAccessToken permission on your project.
     $ gcloud projects add-iam-policy-binding {user_project_id} \
       --member='serviceAccount:service-{user_project_number}@'\
     'gcp-sa-bigquerydatatransfer.iam.gserviceaccount.com' \
       --role='roles/iam.serviceAccountTokenCreator'
   where {user_project_id} and {user_project_number} are the user project's
   project id and project number, respectively. E.g.,
     $ gcloud projects add-iam-policy-binding my-test-proj \
       --member='serviceAccount:service-123456789@'\
     'gcp-sa-bigquerydatatransfer.iam.gserviceaccount.com'\
       --role='roles/iam.serviceAccountTokenCreator'
3. Set environment var PROJECT_ID to your user project, and
   GOOGLE_APPLICATION_CREDENTIALS to the service account key path. E.g.,
     $ export PROJECT_ID='my_project_id'
     $ export GOOGLE_APPLICATION_CREDENTIALS='./serviceacct-creds.json'
4. $ python3 ./create_transfer_config.py
"""
import os

from google.cloud import bigquery_datatransfer
from google.oauth2 import service_account
from google.protobuf.struct_pb2 import Struct

PROJECT = os.environ["PROJECT_ID"]
SA_KEY_PATH = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]

# Authenticate as the service account so the config is created with its
# credentials rather than a user's.
credentials = (
    service_account.Credentials.from_service_account_file(SA_KEY_PATH))
client = bigquery_datatransfer.DataTransferServiceClient(
    credentials=credentials)

# Full resource path of the project and location that will own the config.
# Built by hand because the project_path() helper is not available in
# newer releases of the client library.
parent = "projects/{}/locations/us".format(PROJECT)

params = Struct()
params["query"] = "SELECT CURRENT_DATE() as date, RAND() as val"
transfer_config = {
    "destination_dataset_id": "my_data_set",
    "display_name": "scheduled_query_test",
    "data_source_id": "scheduled_query",
    "params": params,
}

# Keyword arguments are used here instead of positional ones.
response = client.create_transfer_config(
    parent=parent, transfer_config=transfer_config)
print(response)
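The --member value in step 2 is the easiest part to get wrong. It names the project's BigQuery Data Transfer Service agent, whose address has the form service-<project number>@gcp-sa-bigquerydatatransfer.iam.gserviceaccount.com. A small sketch that assembles the gcloud arguments follows; both function names are hypothetical helpers, and the project id and number are the example values from the docstring.

```python
def dts_service_agent_member(project_number):
    """Build the IAM member string for the Data Transfer Service agent."""
    return (
        "serviceAccount:service-{}@"
        "gcp-sa-bigquerydatatransfer.iam.gserviceaccount.com"
    ).format(project_number)


def grant_token_creator_command(project_id, project_number):
    """Return the gcloud invocation from step 2 as an argument list."""
    return [
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        "--member=" + dts_service_agent_member(project_number),
        "--role=roles/iam.serviceAccountTokenCreator",
    ]


# Example values taken from the docstring above.
command = grant_token_creator_command("my-test-proj", 123456789)
print(" ".join(command))
```

Returning a list rather than a single string means the result can be passed straight to subprocess.run without shell quoting.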
As far as I know, unfortunately you can't use a service account to directly schedule queries yet. Maybe a Googler will correct me, but the BigQuery docs implicitly state this:
https://cloud.google.com/bigquery/docs/scheduling-queries#quotas
A scheduled query is executed with the creator's credentials and
project, as if you were executing the query yourself
If you need to use a service account (which is great practice BTW), then there are a few workarounds listed here. I've raised a FR here for posterity.
This question is very old; I came across this thread while searching for the same thing.
Yes, it is possible to use a service account to schedule BigQuery jobs.
While creating a scheduled query, click on "Advanced options" and you will get the option to select a service account.
By default it uses the credentials of the requesting user.
[Image: the BigQuery "create scheduled query" dialog]

Making a Google BigQuery from Python on Windows

I am trying to do something which is very simple in other data services: make a relatively simple SQL query and return the result as a dataframe in Python. I am on Windows 10 and using Python 2.7 (specifically Canopy 1.7.4).
Typically this would be done with pandas.read_sql_query, but due to some specifics of BigQuery it requires a different method, pandas.io.gbq.read_gbq.
This method works fine unless you want to make a Big Query. If you make a Big Query on BigQuery you get the error
GenericGBQException: Reason: responseTooLarge, Message: Response too large to return. Consider setting allowLargeResults to true in your job configuration. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors
This was asked and answered before in this ticket, but neither of the solutions is relevant for my case:
Python BigQuery allowLargeResults with pandas.io.gbq
One solution is for python 3 so it is a nonstarter. The other is giving an error due to me being unable to set my credentials as an environment variable in windows.
ApplicationDefaultCredentialsError: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
I was able to download the JSON credentials file and I have set it as an environment variable in the few ways I know how, but I still get the above error. Do I need to load this in some way in Python? It seems to be looking for it but is unable to find it correctly. Is there a special way to set it as an environment variable in this case?
You can do it in Python 2.7 by changing the default dialect from legacy to standard in pd.read_gbq function.
pd.read_gbq(query, 'my-super-project', dialect='standard')
Indeed, the BigQuery documentation says this about the AllowLargeResults parameter:
AllowLargeResults: For standard SQL queries, this flag is
ignored and large results are always allowed.
I have found two ways of directly importing the JSON credentials file. Both are based on the original answer in Python BigQuery allowLargeResults with pandas.io.gbq
1) Credit to Tim Swast
First
pip install google-api-python-client
pip install google-auth
pip install google-cloud-core
then
replace
credentials = GoogleCredentials.get_application_default()
in create_service() with
from google.oauth2 import service_account
credentials = service_account.Credentials.from_service_account_file('path/file.json')
2)
Set the environment variable manually in the code like
import os,os.path
os.environ['GOOGLE_APPLICATION_CREDENTIALS']=os.path.expanduser('path/file.json')
I prefer method 2 since it does not require new modules to be installed and is also closer to the intended use of the JSON credentials.
Note:
You must create a destinationTable and add the information to run_query()
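When method 2 still produces the ApplicationDefaultCredentialsError, it can help to confirm what the running Python process actually sees before any BigQuery call. The following is a minimal sketch using only the standard library; resolve_credentials_path is a hypothetical helper, and the demo uses a temporary empty file as a stand-in for the real key file.

```python
import os
import tempfile


def resolve_credentials_path(environ=None):
    """Return the absolute key-file path, or raise with a clear reason."""
    environ = os.environ if environ is None else environ
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError(
            "GOOGLE_APPLICATION_CREDENTIALS is not set in this process")
    path = os.path.abspath(os.path.expanduser(path))
    if not os.path.isfile(path):
        raise RuntimeError(
            "GOOGLE_APPLICATION_CREDENTIALS points to a missing file: " + path)
    return path


# Demo: a temporary file stands in for the downloaded JSON key.
handle, demo_path = tempfile.mkstemp(suffix=".json")
os.close(handle)
resolved = resolve_credentials_path(
    {"GOOGLE_APPLICATION_CREDENTIALS": demo_path})
print(resolved)
os.remove(demo_path)
```

Called with no argument, the helper checks the real environment. A variable set in a Windows dialog after the interpreter or IDE was started is not visible to that process, which is one common reason the lookup fails.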
Here is code that works fully within Python 2.7 on Windows:
import pandas as pd
my_qry = "<insert your big query here>"
# Put the contents of your service account credentials file here; all fields are available from it.
my_file = """{
  "type": "service_account",
  "project_id": "cb4recs",
  "private_key_id": "<id>",
  "private_key": "<your private key>\n",
  "client_email": "<email>",
  "client_id": "<id>",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://accounts.google.com/o/oauth2/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "<x509 url>"
}"""
# Pass the query variable defined above (the original referenced an undefined name, qry).
df = pd.read_gbq(my_qry, project_id='<your project id>', private_key=my_file)
That's it :)

Accessing S3 directly from EMR map/reduce task

I am trying to figure out how to write directly from an EMR map task to an S3 bucket. I would like to run a Python streaming job which gets some data from the internet and saves it to S3, without passing it on to a reduce job. Can anyone help me with that?
Why don't you just set the output of your MR job to be a s3 directory and tell it that there is no reducer:
./elastic-mapreduce ..... --output s3n://bucket/outputfiles --reducer NONE
That should do what you want it to.
Then your script can do something like this (sorry, ruby):
STDIN.each do |url|
  puts extract_data(url)
end
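Since the question asks for a Python streaming job, here is a rough Python equivalent of the Ruby mapper above. extract_data is a hypothetical placeholder, just as in the Ruby version; the real function would fetch the URL and return whatever should end up in the output files.

```python
import io


def extract_data(url):
    """Hypothetical placeholder: fetch the URL and return one output line."""
    return "fetched:" + url


def run_mapper(lines, out):
    """Write one extracted record per non-empty input line."""
    for line in lines:
        url = line.strip()
        if url:
            out.write(extract_data(url) + "\n")


# Demo with in-memory input. In the actual mapper script the call would be
# run_mapper(sys.stdin, sys.stdout), since streaming feeds input on stdin.
buffer = io.StringIO()
run_mapper(["http://example.com/a\n", "\n", "http://example.com/b\n"], buffer)
print(buffer.getvalue())
```

With no reducer configured, whatever the mapper writes to standard output is what lands in the S3 output location given by --output.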