Is there a way PigActivity in AWS Data Pipeline can read the schema from Athena tables created on S3 buckets?

I have a lot of legacy Pig scripts that run on an on-prem cluster, and we are trying to move to AWS Data Pipeline (PigActivity). I want these Pig scripts to read data from the S3 buckets where my source data would reside. The on-prem Pig scripts use the HCatalog loader to read the Hive table schema. So, if I create Athena tables on those S3 buckets, is there a way to read the schema from those Athena tables inside the Pig scripts, using some sort of loader similar to HCatLoader?
Current: the code below works, but I have to define the schema inside the Pig script:
%default SOURCE_LOC 's3://s3bucket/input/abc'
inp_data = LOAD '$SOURCE_LOC' USING PigStorage('\001') AS
(id: bigint, val_id: int, provision: chararray);
Want:
Read from an Athena table instead.
Athena table: database_name.abc (schema as id:bigint, val_id:int, provision:string)
So I am looking for something like the below, so that I do not have to define the schema inside the Pig script:
%default SOURCE_LOC 'database_name.abc'
inp_data = LOAD '$SOURCE_LOC' USING athenaloader();
Is there a loader utility to read from Athena, or is there an alternative solution to my need? Please help.

Related

Query S3 Bucket With Amazon Athena and modify values

I have an S3 bucket with 500 CSV files that are identical except for the number values in each file.
How do I write a query that grabs dividendsPaid, makes it positive for each file, and sends that back to S3?
Amazon Athena is a query engine that can perform queries on objects stored in Amazon S3. It cannot modify files in an S3 bucket. If you want to modify those input files in-place, then you'll need to find another way to do it.
However, it is possible for Amazon Athena to create a new table with the output files stored in a different location. You could use the existing files as input and then store new files as output.
The basic steps are:
Create a table definition (DDL) for the existing data (I would recommend using an AWS Glue crawler to do this for you)
Use CREATE TABLE AS to select data from the table and write it to a different location in S3. The command can include an SQL SELECT statement to modify the data (changing the negatives).
See: Creating a table from query results (CTAS) - Amazon Athena
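As a rough sketch of those two steps driven from boto3 (the database, table, and bucket names are placeholders, and I'm assuming ABS() is an acceptable way to make dividendsPaid positive):

import time
import boto3

athena = boto3.client("athena")

# CTAS: read the crawled CSV table and write corrected output to a new S3 location.
# Everything named my_* or dividends_* below is a placeholder.
ctas_sql = """
CREATE TABLE my_database.dividends_fixed
WITH (
    format = 'PARQUET',
    external_location = 's3://my-output-bucket/dividends_fixed/'
) AS
SELECT
    ABS(dividendsPaid) AS dividendsPaid  -- make the value positive
    -- list the remaining source columns here unchanged
FROM my_database.dividends_csv
"""

query_id = athena.start_query_execution(
    QueryString=ctas_sql,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-query-results-bucket/"},
)["QueryExecutionId"]

# Poll until the CTAS query finishes.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        print("CTAS finished with state:", state)
        break
    time.sleep(2)

The same statement can also be run once, interactively, from the Athena console; the boto3 wrapper is only useful if you want to repeat it.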

How to dynamically create a table in Snowflake, getting the schema from a parquet file stored in AWS

Could you help me load a couple of parquet files into Snowflake?
I've got about 250 parquet files stored in an AWS stage.
250 files = 250 different tables.
I'd like to dynamically load them into Snowflake tables.
So, I need to:
Get the schema from the parquet file... I've read that I could get the schema from a parquet file using parquet-tools (Apache).
Create a table using the schema from the parquet file.
Load the data from the parquet file into this table.
Could anyone help me with how to do that? Is there an efficient way to do it (by using the Snowflake GUI, for example)? I can't find one.
Thanks.
If the schema of the files is the same, you can put them in a single stage and use the INFER_SCHEMA function. This will give you the schema of the parquet files.
https://docs.snowflake.com/en/sql-reference/functions/infer_schema.html
In case the files all have different schemas, then I'm afraid you have to infer the schema of each file separately.
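As a rough sketch of scripting that (this assumes the Snowflake Python connector, a stage called @my_stage with one folder per table, and a named Parquet file format called my_parquet_format; all connection details and names are placeholders):

import snowflake.connector

# Connection parameters are placeholders.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="my_schema",
)
cur = conn.cursor()

# 1. Inspect the schema INFER_SCHEMA detects from the staged Parquet files.
cur.execute("""
    SELECT *
    FROM TABLE(INFER_SCHEMA(
        LOCATION => '@my_stage/table_001/',
        FILE_FORMAT => 'my_parquet_format'))
""")
for column_name, column_type, nullable, *_ in cur.fetchall():
    print(column_name, column_type, nullable)

# 2. Create the table straight from the inferred schema.
cur.execute("""
    CREATE TABLE table_001
    USING TEMPLATE (
        SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
        FROM TABLE(INFER_SCHEMA(
            LOCATION => '@my_stage/table_001/',
            FILE_FORMAT => 'my_parquet_format')))
""")

# 3. Load the file(s) into the new table.
cur.execute("""
    COPY INTO table_001
    FROM @my_stage/table_001/
    FILE_FORMAT = (FORMAT_NAME = 'my_parquet_format')
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")

Looping that over your 250 stage paths would give you 250 tables without hand-writing any DDL.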

Using Glue to get data from EC2 MySQL to Redshift

I'm trying to pull a table from a MySQL database on an EC2 instance through to S3 to query in Redshift. My current pipeline is: I crawl the MySQL database table with an AWS Glue crawler to get the schema into the Data Catalog. Then I set up an AWS Glue ETL job to pull the data into an S3 bucket. Then I crawl the data in the S3 bucket again with another crawler to get the schema of the S3 data into the Data Catalog, and finally I run the script below in the Redshift query window to pull the schema into Redshift. It seems like a lot of steps. Is there a more efficient way to do this? For example, is there a way to re-use the schema from the first crawler so I don't have to crawl the data twice? It's the same table and columns.
script:
create external schema schema1
from data catalog database 'database1'
iam_role 'arn:aws:iam::228276746111:role/sfada'
region 'us-west-2'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
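For context, the crawl, ETL, crawl-again sequence above is easy enough to script with boto3 (the crawler and job names below are placeholders), so my concern is less about automation and more about the duplicate crawl:

import time
import boto3

glue = boto3.client("glue", region_name="us-west-2")

def wait_for_crawler(name):
    # Block until the crawler has gone back to the READY state.
    time.sleep(10)
    while glue.get_crawler(Name=name)["Crawler"]["State"] != "READY":
        time.sleep(15)

# 1. Crawl the MySQL table so its schema lands in the Data Catalog.
glue.start_crawler(Name="mysql-source-crawler")
wait_for_crawler("mysql-source-crawler")

# 2. Run the ETL job that copies the table to the S3 bucket.
run_id = glue.start_job_run(JobName="mysql-to-s3-job")["JobRunId"]
while glue.get_job_run(JobName="mysql-to-s3-job", RunId=run_id)["JobRun"]["JobRunState"] in ("STARTING", "RUNNING", "STOPPING"):
    time.sleep(15)

# 3. Crawl the S3 output so the external schema created above can see it.
glue.start_crawler(Name="s3-output-crawler")
wait_for_crawler("s3-output-crawler")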

Athena Write to DynamoDB

What are my options to programmatically run a SQL query on Athena and store the result set in DynamoDB as items or views?
I am looking for a built-in AWS Athena function/API which takes a query as input and outputs the result set into DynamoDB.
Today the Athena API can run a given query and store the result set to S3. I am looking for an Athena-to-DynamoDB API.
There is no built-in AWS functionality that allows you to get data from Athena to DynamoDB directly.
What you can do instead is have AWS Data Pipeline run a Python script that uses the boto3 package to connect to Athena, get the data (and maybe do some pre-processing), and then post it to DynamoDB.
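A minimal sketch of such a script (the query, the results bucket, and a DynamoDB table named athena_results with a matching key schema are all assumptions here):

import time
import boto3

athena = boto3.client("athena")
table = boto3.resource("dynamodb").Table("athena_results")  # placeholder table name

# 1. Run the query; Athena still writes its result set to S3 behind the scenes.
query_id = athena.start_query_execution(
    QueryString="SELECT id, val FROM my_database.my_table LIMIT 100",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)["QueryExecutionId"]

# 2. Wait for the query to finish.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state == "SUCCEEDED":
        break
    if state in ("FAILED", "CANCELLED"):
        raise RuntimeError("Athena query ended in state " + state)
    time.sleep(2)

# 3. Read the result set and put each row into DynamoDB.
rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
header = [col["VarCharValue"] for col in rows[0]["Data"]]  # first row holds the column names
with table.batch_writer() as batch:
    for row in rows[1:]:
        # Athena returns every value as a string; convert types here if you need to.
        item = {name: col.get("VarCharValue") for name, col in zip(header, row["Data"])}
        batch.put_item(Item=item)

Note that get_query_results returns at most 1,000 rows per call, so for larger result sets you would paginate, or read the CSV that Athena drops in the S3 output location instead.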

Running HIVE queries directly from S3 input files

I am using an interactive Hive session in Elastic MapReduce to run Hive. Previously I was loading data from S3 into Hive tables. Now, I want to run some scripts on S3 input files without loading the data into Hive tables.
Is this possible? If yes, how can this be achieved?
You can run queries on data right in S3.
CREATE EXTERNAL TABLE mydata (key STRING, value INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n'
LOCATION 's3n://mys3bucket/';
or similar