sqlite3.OperationalError: When trying to connect to S3 Airflow Hook - amazon-s3

I'm currently exploring hooks in some of my DAGs. For instance, in one DAG I'm trying to connect to S3 to send a CSV file to a bucket, which then gets copied to a Redshift table.
I have a custom module that I import to run this process, and I'm now trying to set up an S3Hook to do the same job instead. But I'm a little confused about how to set up the connection and how everything works.
First, I import the hook
from airflow.hooks.S3_hook import S3Hook
Then I try to make the hook instance
s3_hook = S3Hook(aws_conn_id='aws-s3')
Next I try to set up the client
s3_client = s3_hook.get_conn()
However, when I run the client line above, I receive this error:
OperationalError: (sqlite3.OperationalError)
no such table: connection
[SQL: SELECT connection.password AS connection_password, connection.extra AS connection_extra, connection.id AS connection_id, connection.conn_id AS connection_conn_id, connection.conn_type AS connection_conn_type, connection.description AS connection_description, connection.host AS connection_host, connection.schema AS connection_schema, connection.login AS connection_login, connection.port AS connection_port, connection.is_encrypted AS connection_is_encrypted, connection.is_extra_encrypted AS connection_is_extra_encrypted
FROM connection
WHERE connection.conn_id = ?
LIMIT ? OFFSET ?]
[parameters: ('aws-s3', 1, 0)]
(Background on this error at: http://sqlalche.me/e/13/e3q8)
I'm trying to diagnose the error, but the traceback is long. I'm a little confused about why sqlite3 is involved when I'm trying to use S3. Can anyone unpack this? Why is this error thrown when I try to set up the client?
Thanks

Airflow is not just a library; it is also an application.
To execute Airflow code you need a running Airflow instance, which also means having a metadata database with the required schema. Hooks look up their connection (here aws-s3) in that database, which is why SQLite, Airflow's default backend, shows up in the traceback.
To create the tables, run airflow db init (airflow initdb on Airflow 1.10).
Edit:
Following the discussion in the comments: your issue is that you have a working Airflow application inside Docker, but your DAGs are written on your local disk. Docker is an isolated environment, so if you want Airflow to recognize your DAGs you must move the files to the DAG folder inside the container.
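To make the sqlite3 part of the traceback less mysterious, here is a minimal sketch using only Python's standard library. It does not use Airflow at all: lookup_connection is a hypothetical stand-in for the query a hook issues, and the CREATE TABLE statement is a simplified stand-in for what the schema initialisation would create.

```python
import sqlite3


def lookup_connection(db, conn_id):
    """Hypothetical stand-in for a hook's lookup: read one row from `connection`."""
    cursor = db.execute(
        "SELECT conn_id, conn_type FROM connection WHERE conn_id = ? LIMIT 1",
        (conn_id,),
    )
    return cursor.fetchone()


# A fresh database whose schema has never been initialised.
db = sqlite3.connect(":memory:")

try:
    lookup_connection(db, "aws-s3")
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)
print(error_message)  # no such table: connection

# Simplified stand-in for schema initialisation plus one stored connection.
db.execute("CREATE TABLE connection (conn_id TEXT, conn_type TEXT)")
db.execute("INSERT INTO connection VALUES (?, ?)", ("aws-s3", "aws"))
row = lookup_connection(db, "aws-s3")
print(row)  # ('aws-s3', 'aws')
```

The first lookup fails in the same way as the error in the question because the table does not exist yet; once the table and a row for the connection id are present, the same lookup succeeds.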

Related

using django s3 sqlite package for using sqlite db

I am deploying my django app with aws lambda using zappa.
I am trying to store my sqlite db in an AWS S3 bucket, using django_s3_sqlite (following these instructions: https://github.com/FlipperPA/django-s3-sqlite), but when I run the 'zappa manage [instance] s3_sqlite_vacuum' command, I get this message:
'Task timed out after 30.03 seconds'. Does anyone know what causes this? Thank you!
In this case, it is probably hitting the Zappa Lambda function timeout limit. You can set "timeout_seconds": 300 in zappa_settings.json, which allows the Lambda to run for longer.
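As a rough sketch of where that setting goes: zappa_settings.json is keyed by stage name, and timeout_seconds sits inside the stage's block. The stage name "production" and the other values below are hypothetical placeholders, and with_timeout is just an illustrative helper, not part of Zappa.

```python
import json


def with_timeout(settings, stage, seconds):
    """Return a copy of the settings with timeout_seconds set for one stage."""
    updated = {name: dict(config) for name, config in settings.items()}
    updated[stage]["timeout_seconds"] = seconds
    return updated


# Hypothetical existing settings; only the shape matters here.
settings = {
    "production": {
        "django_settings": "mysite.settings",
        "s3_bucket": "my-zappa-deployments",
    }
}

new_settings = with_timeout(settings, "production", 300)
print(json.dumps(new_settings, indent=4))
```

After editing the real file, the stage has to be redeployed (zappa update with the stage name) for the new timeout to take effect. AWS Lambda caps the timeout at 900 seconds.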
Thank you FlipperPA for the django-s3-sqlite package!

Is it possible to use service accounts to schedule queries in BigQuery "Schedule Query" feature ?

We are using the Beta Scheduled query feature of BigQuery.
Details: https://cloud.google.com/bigquery/docs/scheduling-queries
We have a few ETL scheduled queries running overnight to optimize aggregation and reduce query cost. This works well and there haven't been many issues.
The problem arises when the person who scheduled the query using their own credentials leaves the organization. I know we can do "update credential" in such cases.
I read through the documentation and also gave it a try, but couldn't find out whether we can use a service account instead of individual accounts to schedule queries.
Service accounts are cleaner, tie in with the rest of the IAM framework, and are not dependent on a single user.
So if you have any additional information regarding scheduled queries and service account please share.
Thanks for taking time to read the question and respond to it.
Regards
BigQuery Scheduled Query now does support creating a scheduled query with a service account and updating a scheduled query with a service account. Will these work for you?
While it's not supported in BigQuery UI, it's possible to create a transfer (including a scheduled query) using python GCP SDK for DTS, or from BQ CLI.
The following is an example using Python SDK:
r"""Example of creating TransferConfig using service account.
Usage Example:
1. Install GCP BQ python client library.
2. If it has not been done, please grant the p4 service account the
iam.serviceAccounts.getAccessToken permission on your project.
$ gcloud projects add-iam-policy-binding {user_project_id} \
--member='serviceAccount:service-{user_project_number}@'\
'gcp-sa-bigquerydatatransfer.iam.gserviceaccount.com' \
--role='roles/iam.serviceAccountTokenCreator'
where {user_project_id} and {user_project_number} are the user project's
project id and project number, respectively. E.g.,
$ gcloud projects add-iam-policy-binding my-test-proj \
--member='serviceAccount:service-123456789@'\
'gcp-sa-bigquerydatatransfer.iam.gserviceaccount.com' \
--role='roles/iam.serviceAccountTokenCreator'
3. Set environment var PROJECT_ID to your user project, and
GOOGLE_APPLICATION_CREDENTIALS to the service account key path. E.g.,
$ export PROJECT_ID='my_project_id'
$ export GOOGLE_APPLICATION_CREDENTIALS='./serviceacct-creds.json'
4. $ python3 ./create_transfer_config.py
"""
import os
from google.cloud import bigquery_datatransfer
from google.oauth2 import service_account
from google.protobuf.struct_pb2 import Struct
PROJECT = os.environ["PROJECT_ID"]
SA_KEY_PATH = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
credentials = service_account.Credentials.from_service_account_file(SA_KEY_PATH)
client = bigquery_datatransfer.DataTransferServiceClient(credentials=credentials)
# Get full path to project
parent_base = client.project_path(PROJECT)
params = Struct()
params["query"] = "SELECT CURRENT_DATE() as date, RAND() as val"
transfer_config = {
    "destination_dataset_id": "my_data_set",
    "display_name": "scheduled_query_test",
    "data_source_id": "scheduled_query",
    "params": params,
}
parent = parent_base + "/locations/us"
response = client.create_transfer_config(parent, transfer_config)
# print is a function in Python 3
print(response)
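The script above relies on the positional-argument style and the project_path helper of older client releases. As a hedged sketch, assuming version 2 or later of google-cloud-bigquery-datatransfer, the same call can be written with a request object that names the service account explicitly. build_parent, build_transfer_config and create_scheduled_query are hypothetical helper names, and the dataset, display name and schedule are placeholders.

```python
def build_parent(project_id, location="us"):
    """Resource name of the project/location that will own the transfer config."""
    return f"projects/{project_id}/locations/{location}"


def build_transfer_config(dataset_id, display_name, query, schedule="every 24 hours"):
    """Plain-dict description of a scheduled query (placeholder values are assumptions)."""
    return {
        "destination_dataset_id": dataset_id,
        "display_name": display_name,
        "data_source_id": "scheduled_query",
        "params": {"query": query},
        "schedule": schedule,
    }


def create_scheduled_query(project_id, service_account_email, config, location="us"):
    """Create the scheduled query so that it runs as the given service account."""
    # Imported lazily so the helpers above work without the client library installed.
    from google.cloud import bigquery_datatransfer

    client = bigquery_datatransfer.DataTransferServiceClient()
    request = bigquery_datatransfer.CreateTransferConfigRequest(
        parent=build_parent(project_id, location),
        transfer_config=bigquery_datatransfer.TransferConfig(**config),
        service_account_name=service_account_email,
    )
    return client.create_transfer_config(request=request)


config = build_transfer_config(
    "my_data_set",
    "scheduled_query_test",
    "SELECT CURRENT_DATE() as date, RAND() as val",
)
print(build_parent("my-test-proj"))  # projects/my-test-proj/locations/us
```

Calling create_scheduled_query needs real credentials and the permissions described in the docstring above, so it is left uncalled here.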
As far as I know, unfortunately you can't use a service account to directly schedule queries yet. Maybe a Googler will correct me, but the BigQuery docs implicitly state this:
https://cloud.google.com/bigquery/docs/scheduling-queries#quotas
A scheduled query is executed with the creator's credentials and
project, as if you were executing the query yourself
If you need to use a service account (which is great practice BTW), then there are a few workarounds listed here. I've raised a FR here for posterity.
This question is very old; I came across this thread while searching for the same thing.
Yes, it is possible to use a service account to schedule BigQuery jobs.
While creating a scheduled query, click on "Advanced options" and you will get the option to select a service account.
By default it uses the credentials of the requesting user.
[Image: BigQuery "create scheduled query" dialog]

How to test Luigi with FakeS3?

I'm trying to test my Luigi pipelines inside a Vagrant machine using FakeS3 to simulate my S3 endpoints. For boto to be able to interact with FakeS3, the connection must be set up with the OrdinaryCallingFormat, as in:
from boto.s3.connection import S3Connection, OrdinaryCallingFormat
conn = S3Connection('XXX', 'XXX', is_secure=False,
                    port=4567, host='localhost',
                    calling_format=OrdinaryCallingFormat())
but when using Luigi this connection is buried in the s3 module. I was able to pass most of the options by modifying my luigi.cfg and adding an s3 section as in
[s3]
host=127.0.0.1
port=4567
aws_access_key_id=XXX
aws_secret_access_key=XXXXXX
is_secure=0
but I don't know how to pass the required object for the calling_format.
Now I'm stuck and don't know how to proceed. Options I can think of:
1) Figure out how to pass the OrdinaryCallingFormat to S3Connection through luigi.cfg
2) Figure out how to force boto to always use this calling format in my Vagrant machine, by setting some option I don't know about in either .aws/config or boto.cfg
3) Make FakeS3 accept the default calling_format used by boto, which happens to be SubdomainCallingFormat (whatever that means).
Any ideas about how to fix this?
Can you not pass it into the constructor as a keyword argument for the S3Client?
client = S3Client(aws_access_key, aws_secret_key,
                  calling_format=OrdinaryCallingFormat())
target = S3Target('s3://somebucket/test', client=client)
I did not encounter any problems when using boto3 to connect to FakeS3.
import boto3
s3 = boto3.client(
    "s3", region_name="fakes3",
    use_ssl=False,
    aws_access_key_id="",
    aws_secret_access_key="",
    endpoint_url="http://localhost:4567"
)
No special calling format is required.
Perhaps I am wrong and you really do need OrdinaryCallingFormat. If my code doesn't work, please go through the GitHub issue on boto3 support:
https://github.com/boto/boto3/issues/334
You can set it with the calling_format parameter. Here is a configuration example for fake-s3:
[s3]
aws_access_key_id=123
aws_secret_access_key=abc
host=fake-s3
port=4569
is_secure=0
calling_format=boto.s3.connection.OrdinaryCallingFormat
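For anyone wondering what the two calling formats mentioned in the question actually change, here is a simplified sketch of the request URLs they lead to. It does not use boto; ordinary_url and subdomain_url are hypothetical helpers that only illustrate the addressing difference.

```python
def ordinary_url(host, port, bucket, key, secure=False):
    """Path-style addressing: the bucket is the first path segment."""
    scheme = "https" if secure else "http"
    return f"{scheme}://{host}:{port}/{bucket}/{key}"


def subdomain_url(host, port, bucket, key, secure=False):
    """Subdomain-style addressing: the bucket is prepended to the host name."""
    scheme = "https" if secure else "http"
    return f"{scheme}://{bucket}.{host}:{port}/{key}"


path_style = ordinary_url("localhost", 4567, "somebucket", "test")
subdomain_style = subdomain_url("localhost", 4567, "somebucket", "test")
print(path_style)       # http://localhost:4567/somebucket/test
print(subdomain_style)  # http://somebucket.localhost:4567/test
```

With the subdomain style the client has to resolve a host name like somebucket.localhost, which typically is not set up on a local test machine. That is why the path-style format is the usual choice for a local fake S3 endpoint.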

Hadoop: wrong classpath in map reduce job

I'm running a Cloudera cluster in 3 virtual machines and am trying to execute an HBase bulk load via a MapReduce job, but I always get the error:
error: Class org.apache.hadoop.hbase.mapreduce.HFileOutputFormat not found
So it seems that the map process doesn't find the class. I tried the following:
1) add the hbase.jar to the HADOOP_CLASSPATH on every node
2) adding TableMapReduceUtil.addDependencyJars(job) / TableMapReduceUtil.addDependencyJars(myConf, HFileOutputFormat.class) to my source code
Nothing worked. I have absolutely no idea why the class is not found, because the jar/class is definitely available on the classpath.
If I take a look into the job.xml I see the following entry:
name=tmpjars value=file:/C:/Users/Thomas/.m2/repository/org/apache/zookeeper/zookeeper/3.4.5-cdh4.3.0/zookeeper-3.4.5-cdh4.3.0.jar,file:/C:/Users/Thomas/.m2/repository/org/apache/hbase/hbase/0.94.6-cdh4.3.0/hbase-0.94.6-cdh4.3.0.jar,file:/C:/Users/Thomas/.m2/repository/org/apache/hadoop/hadoop-core/2.0.0-mr1-cdh4.3.0/hadoop-core-2.0.0-mr1-cdh4.3.0.jar,file:/C:/Users/Thomas/.m2/repository/com/google/guava/guava/11.0.2/guava-11.0.2.jar,file:/C:/Users/Thomas/.m2/repository/com/google/protobuf/protobuf-java/2.4.0a/protobuf-java-2.4.0a.jar
This seems a little odd to me: these are my local jars on the Windows system. Maybe these should be the HDFS jars? If so, how can I change the value of "tmpjars"?
Here is the Java code I am trying to execute:
configuration = new Configuration(false);
configuration.set("mapred.job.tracker", "192.168.2.41:8021");
configuration.set("fs.defaultFS", "hdfs://192.168.2.41:8020/");
configuration.set("hbase.zookeeper.quorum", "192.168.2.41");
configuration.set("hbase.zookeeper.property.clientPort", "2181");
Job job = new Job(configuration, "HBase Bulk Import for " + tablename);
job.setJarByClass(HBaseKVMapper.class);
job.setMapperClass(HBaseKVMapper.class);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(KeyValue.class);
job.setOutputFormatClass(HFileOutputFormat.class);
job.setPartitionerClass(TotalOrderPartitioner.class);
job.setInputFormatClass(TextInputFormat.class);
HFileOutputFormat.configureIncrementalLoad(job, hTable);
FileInputFormat.addInputPath(job, new Path("myfile1"));
FileOutputFormat.setOutputPath(job, new Path("myfile2"));
job.waitForCompletion(true);
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(configuration);
loader.doBulkLoad(new Path("myFile3"), hTable);
EDIT:
I tried a little more and it's totally strange. I added the following line to the Java code:
job.setJarByClass(HFileOutputFormat.class);
After I executed this, the error was gone, but another class-not-found exception appeared:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class mypackage.bulkLoad.HBaseKVMapper not found
HBaseKVMapper is my custom Mapper class that I want to execute. So I tried to add it with "job.setJarByClass(HBaseKVMapper.class)", but that doesn't work since it's only a class file and not a jar. So I generated a jar file including HBaseKVMapper.class. After that, I executed it again and got the HFileOutputFormat class-not-found exception again.
After debugging a little, I found out that the setJarByClass method only copies the local jar file to .staging/job_#number/job.jar on HDFS. So setJarByClass() only works for one jar file, because calling it again with another jar overwrites job.jar.
While searching for the error I saw the following structure in the job staging directory:
and inside the libjars directory I saw the relevant jar files
So the HBase jar is inside the libjars directory, but the jobtracker doesn't use it for executing the job. Why?
I would try using Cloudera Manager (free version) as it takes care of these issues for you. Otherwise note the following:
Both your own classes and the HBase Class HFileOutputFormat need to be available on the classpath locally and remotely.
Submitting the job
Meaning getting the classpath right locally for when your driver runs:
$ env HADOOP_CLASSPATH=$(hbase classpath) hadoop jar path/to/jar class....
On the server
In your hadoop-env.sh
export HADOOP_CLASSPATH=$(hbase classpath)
or use
TableMapReduceUtil.addDependencyJars
I found a "hacked" solution which worked for me, but I'm not happy with it because it's not really practicable.
My "hacked" solution:
1) create one big jar with all necessary class files (I called it "big.jar") and add it to the local (Eclipse) classpath
2) add the line job.setJarByClass(MyMapperClass.class); MyMapperClass has to be in big.jar
When I execute this, big.jar is copied to the filesystem for every job. No more errors. The problem is that the jar is 80 MB in size and has to be copied every time.
If anyone knows a better way, I would be thankful if they could tell me how.
EDIT:
Now I am trying to execute jobs with Apache Pig and have exactly the same problem. My hacked solution doesn't work in this case because Pig creates the jobs automatically. Here is the Pig error:
java.lang.ClassNotFoundException: Class org.apache.hadoop.hbase.mapreduce.TableSplit not found

How to give input Mapreduce jobs which use s3 data from java code

I know that we can normally pass the parameters when running the jar file on an EC2 instance.
But how do we give inputs through code?
I am trying this because I want to call my Java code from a JSP, so in the Java code I want to pick up data directly from S3 and proceed. I tried it like this, but in vain:
DataExtractor.getRelevantData("s3n://syamk/revanthinput/", "999999", "94645", "20120606",
        "s3n://revanthufl/gen/testoutput" + "interm");
Here s3n://syamk/revanthinput/ is what I was using as input and s3n://revanthufl/gen/testoutput as output.
I use the same strings (s3n://syamk/revanthinput/ and s3n://revanthufl/gen/testoutput) as parameters when running the jar. But doing the same from code throws an exception:
[java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).] with root cause
Based on my usage of Flume, it appears that you need to format your URL like s3n://AWS_ACCESS_KEY:AWS_SECRET_KEY@syamk/revanthinput/ when calling S3 from within code.
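A small sketch of building a URL in that form, written in Python purely for illustration (the question's code is Java). s3n_url is a hypothetical helper and the key values are made up. Percent-encoding the secret is an assumption on my part: secrets often contain a slash, which would otherwise break the URL, but whether a given Hadoop version accepts the escaped form should be checked.

```python
from urllib.parse import quote


def s3n_url(access_key, secret_key, bucket, path=""):
    """Build an s3n URL with the credentials embedded as user and password."""
    return "s3n://{}:{}@{}/{}".format(
        quote(access_key, safe=""),
        quote(secret_key, safe=""),  # a "/" in the secret becomes %2F
        bucket,
        path.lstrip("/"),
    )


# Made-up credentials; the bucket and path are the ones from the question.
url = s3n_url("AKIAEXAMPLE", "abc/def", "syamk", "revanthinput/")
print(url)  # s3n://AKIAEXAMPLE:abc%2Fdef@syamk/revanthinput/
```

Embedding credentials in the URL means they can end up in logs and job listings. The error message in the question also names the alternative: setting the fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties on the job configuration.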