How to give input to MapReduce jobs which use S3 data from Java code

I know that we can normally give the parameters while running the jar file on an EC2 instance.
But how do we give the inputs through code?
I am trying this because I want to call my Java code from a JSP, so in the Java code I want to pick up the data directly from S3 and proceed. I tried like this, but in vain:
DataExtractor.getRelevantData("s3n://syamk/revanthinput/", "999999", "94645", "20120606",
"s3n://revanthufl/gen/testoutput" + "interm");
Here s3n://syamk/revanthinput/ is what I was using as the input and s3n://revanthufl/gen/testoutput as the output, and I use the same strings (s3n://syamk/revanthinput/ and s3n://revanthufl/gen/testoutput) as the parameters when running the jar. But calling it like this from code throws an exception,
[java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).] with root cause

Based on my usage of Flume, it would appear that you need to format your URL like s3n://AWS_ACCESS_KEY:AWS_SECRET_KEY@syamk/revanthinput/ when calling S3 from within code. Alternatively, as the exception message itself suggests, you can set the fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties on the job configuration instead of embedding the keys in the URL.

Related

sqlite3.OperationalError: When trying to connect to S3 Airflow Hook

I'm currently exploring implementing hooks in some of my DAGs. For instance, in one DAG I'm trying to connect to S3 to send a CSV file to a bucket, which then gets copied to a Redshift table.
I have a custom module written which I import to run this process. I am currently trying to set up an S3Hook to handle this process instead, but I'm a little confused about setting up the connection and how everything works.
First, I import the hook
from airflow.hooks.S3_hook import S3Hook
Then I try to make the hook instance
s3_hook = S3Hook(aws_conn_id='aws-s3')
Next I try to set up the client
s3_client = s3_hook.get_conn()
However, when I run the client line above, I receive this error:
OperationalError: (sqlite3.OperationalError)
no such table: connection
[SQL: SELECT connection.password AS connection_password, connection.extra AS connection_extra, connection.id AS connection_id, connection.conn_id AS connection_conn_id, connection.conn_type AS connection_conn_type, connection.description AS connection_description, connection.host AS connection_host, connection.schema AS connection_schema, connection.login AS connection_login, connection.port AS connection_port, connection.is_encrypted AS connection_is_encrypted, connection.is_extra_encrypted AS connection_is_extra_encrypted
FROM connection
WHERE connection.conn_id = ?
LIMIT ? OFFSET ?]
[parameters: ('aws-s3', 1, 0)]
(Background on this error at: http://sqlalche.me/e/13/e3q8)
I'm trying to diagnose the error, but the traceback is long. I'm a little confused about why sqlite3 is involved when I'm trying to use S3. Can anyone unpack this? Why is this error thrown when trying to set up the client?
Thanks
Airflow is not just a library - it's also an application.
To execute Airflow code you must have an Airflow instance running, which also means having a database with the needed schema.
To create the tables you must run airflow db init (airflow initdb on Airflow 1.x).
Edit:
After the discussion in the comments: your issue is that you have a working Airflow application inside Docker, but your DAGs are written on your local disk. Docker is a closed environment; if you want Airflow to recognize your DAGs you must move the files into the DAG folder inside the container.
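Once the metadata database is initialized and an aws-s3 connection has been created (through the UI or the CLI), the hook calls from the question should work. A minimal sketch of uploading a CSV with the hook, assuming Airflow 1.10-style imports; the file path, key, and bucket name below are placeholders:
from airflow.hooks.S3_hook import S3Hook

# 'aws-s3' must exist as a connection in the Airflow metadata database
s3_hook = S3Hook(aws_conn_id='aws-s3')

# Upload a local CSV to a bucket (placeholder paths and bucket name)
s3_hook.load_file(
    filename='/tmp/my_data.csv',
    key='incoming/my_data.csv',
    bucket_name='my-example-bucket',
    replace=True,
)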

How to save result from OS Process Sampler to a variable?

The JMeter OS Process Sampler is set up, works fine and saves its result (a token produced by a PowerShell script) to a file.
Is it possible to somehow save the result from the PowerShell script directly into a JMeter variable instead?
What should I add for that?
Normally you should be using JMeter Post-Processors in order to extract data from a Sampler's response.
If the token is the only thing your PowerShell script returns, you can extract it using e.g. the Boundary Extractor: just provide the desired variable name and leave everything else empty.
If there is some other text surrounding the token, adjust the boundaries accordingly or go for the Regular Expression Extractor.

Unable to figure out why my code is printing "None" even though it should return values

I am trying to access an S3 object, and when I print the data objects it prints "None". Basically I am unable to move beyond that point since my "if" condition is failing. Could anyone please assist?
if s3_resource.Bucket(s3_bucket).creation_date:
    print("UTKARSH")
The piece of code you have specified is correct. Check whether you are able to access the bucket first, and then go for the if condition.
s3_object = boto3.resource('s3').Bucket("Your Bucket Name")
Or you can also get the bucket info using the S3 client, like:
bucket_list = boto3.client('s3').list_buckets()
Try to get the details first, then you can filter them according to your need.
creation_date is one of the properties of an S3 bucket; it stores the date that particular bucket was created.
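Putting that together, a minimal sketch of both approaches (the bucket name is a placeholder):
import boto3

# List all buckets with the client and filter by name
s3_client = boto3.client('s3')
for bucket in s3_client.list_buckets()['Buckets']:
    if bucket['Name'] == 'my-bucket-name':
        print(bucket['Name'], bucket['CreationDate'])

# creation_date is None when the bucket does not exist or is not accessible,
# so the if condition from the question only passes for a reachable bucket
bucket = boto3.resource('s3').Bucket('my-bucket-name')
if bucket.creation_date:
    print('bucket exists')
else:
    print('bucket not found or not accessible')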
What is your s3_bucket? Is it a string? Check my code.
import boto3
session = boto3.session.Session(region_name='ap-northeast-2')
s3 = session.resource('s3')
if s3.Bucket('my-bucket-name').creation_date:
    print(s3.Bucket('my-bucket-name').name, 'Done')
It gives the correct result.
my-bucket-name Done

Qlik Sense: how to specify path in Google Drive?

I have a Google drive account divided into some folders (say, Folder1, Folder2, etc.), with some subfolders in it.
I successfully managed to connect my Qlik Sense app to it.
I need to make it look for files only in a given subfolder.
At the moment, my load statement reads as follows ([...] is the location):
(URL IS [[...]connectorID=GoogleDriveConnector&table=ListSpreadsheets&appID=], qvx);
It works and reloads successfully, but I need it to filter the Spreadsheets properly. How could I get what I need?
To connect to Google Drive you in fact use a web connector. Once the web connector is installed it can be started as a service or manually from its folder.
Once it is installed (a recent version can be downloaded from https://qliksupport.force.com/apex/QS_Home_Page, but it seems you've already got it, since Google Drive is part of it), it is much nicer to configure connections to online drives there.
You just go to http://localhost:5555/web and generate ready-made code.
In my implementation I used the following options step by step to get the data I wanted:
1) CanAuthenticate to generate a permanent token
2) ListSpreadsheets
3) ListWorksheets
4) GetWorksheet
You can't just specify a path, but it is possible to retrieve the path from the QWC services. Use an algorithm like this:
Use tables like ListFiles/ListWorksheets.
Iterate through every row with a FOR loop (closed with NEXT at the end):
FOR i=0 to (NoOfRows('Google_ListWorksheets')-1);
Let vWorksheetKey = Peek('worksheetKey', $(i), 'Google_ListWorksheets');
Let vTitle = left(Peek('title', $(i), 'Google_ListWorksheets'),3);
Using an IF statement, find the desired folder id/worksheet key by its name (stored in the vTitle variable) and use it:
load * FROM [$(vQwcConnectionName)]
(URL IS [http://localhost:5555/data?connectorID=GoogleDriveConnector&table=GetWorksheet&worksheetKey=$(vWorksheetKey)&appID=], qvx);
At the end you will get your files by their location.

How to read values dynamically from a file for a property in updateAttribute?

I added some custom properties in the 'UpdateAttribute' processor using the '+' button. For example: I declared a property 'DBConnectionURL' and gave the value as 'jdbc:mysql://localhost:3306/test'. Then, in the 'DBCPConnectionPool' controller service, I simply used the value '${DBConnectionURL}' for the 'Database Connection URL' property. But I manually gave the value for the 'DBConnectionURL' property. I want a way to feed the value dynamically from a file, so that I just need to change the value in the file and the value for 'DBConnectionURL' changes dynamically based on the value present in the file. Is there a way to do it?
Rishab,
You have to use the NiFi variable registry.
In conf/nifi.properties, you can configure the property below to dynamically update a value in your data flow:
nifi.variable.registry.properties=./dynamic.properties
You can define your variables in that dynamic.properties file; it should be present in the conf directory.
For example, if the dynamic.properties file contains the value below
DBCPURL= jdbc://<host>:<port>
you can use it in your data flow as ${DBCPURL}.
Note: You should restart NiFi if you change any configuration in conf/nifi.properties, otherwise your changes will not take effect in the data flow.
Feel free to accept it as the answer if it worked for you.