Can Matillion ETL convert multiple CSV files to Parquet?

I have the following folder hierarchy in an S3 bucket:
January/10
16b516c0-8f2a-eabd-770a-b8bbc83c5859.csv, 16b516c0-8f2a-eabd-770a-b8bbc83c5859.csv, …
In other words, every folder represents a calendar day.
I would like Matillion ETL to do the following transform:
January/10
AsingleParquetFile.Parquet
How can I implement this in Matillion ETL?

Matillion ETL would do this by using the cloud data warehouse (CDW) it's attached to. The exact answer depends on which CDW you are using, but it would typically involve an external table and an unload component.
For example, using Matillion ETL for Redshift, you would define an external table over the CSV folder, followed by an unload component that writes the Parquet output.
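Under the hood this amounts to two Redshift SQL statements: an external table over the CSV folder, then an UNLOAD that writes Parquet. Below is a minimal sketch of that SQL issued through the Redshift Data API with boto3; the cluster, external schema, IAM role, and column names are placeholder assumptions, not what Matillion generates verbatim.

# Sketch of the Redshift SQL behind the two components; all names are placeholders.
import time
import boto3

rsd = boto3.client("redshift-data")

# Assumes an external (Spectrum) schema named "spectrum_schema" already exists.
create_external = """
CREATE EXTERNAL TABLE spectrum_schema.january_10 (
    id VARCHAR(64),
    payload VARCHAR(4096)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/January/10/'
"""

# UNLOAD rewrites the folder's rows as Parquet under the target prefix;
# additional options may be needed if a single output file is required.
unload_parquet = """
UNLOAD ('SELECT * FROM spectrum_schema.january_10')
TO 's3://my-bucket/parquet/January/10/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS PARQUET
"""

for sql in (create_external, unload_parquet):
    # The Data API is asynchronous, so wait for each statement to finish.
    resp = rsd.execute_statement(
        ClusterIdentifier="my-cluster", Database="mydb",
        DbUser="etl_user", Sql=sql)
    while rsd.describe_statement(Id=resp["Id"])["Status"] not in (
            "FINISHED", "FAILED", "ABORTED"):
        time.sleep(2)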

Related

DynamoDB data to S3 in Kinesis Firehose output format

Kinesis Data Firehose has a default format for writing files into separate partitions in the S3 bucket, which looks like: s3://bucket/prefix/yyyy/MM/dd/HH/file.extension
I have created event streams to dump data from DynamoDB to S3 using Firehose. There is a transformation Lambda in between which converts the DDB records into TSV (tab-separated) format.
All of this was added on top of an existing table which already contains a huge amount of data. I need to backfill the existing data from DynamoDB to the S3 bucket while keeping parity with the existing Firehose output format.
Solution I tried:
Step 1: Export the table to S3 using the DynamoDB export feature, and use a Glue crawler to create a Data Catalog table.
Step 2: Use Athena's CREATE TABLE AS SELECT query to imitate the transformation done by the intermediate Lambda, storing the output in an S3 location.
Step 3: However, Athena CTAS applies a default compression that cannot be turned off. So I wrote a Glue job that reads from the previous table and writes to another S3 location. This job also takes care of adding the partitions based on year/month/day/hour, as in the Firehose format, and writes decompressed tab-separated files to S3.
However, the problem is that Glue creates Hive-style partitions which look like:
s3://bucket/prefix/year=2021/month=02/day=02/, and I need to match the Firehose block-style S3 partitions instead.
I am looking for an approach to achieve this. I couldn't find a way to add block-style partitions using Glue. Another option would be to use the AWS CLI s3 mv command to move all this data into separate folders with the correct file names, but that is neither clean nor optimised.
Leaving the solution I ended up implementing here in case it helps anyone.
I created a Lambda and added an S3 event trigger on this bucket. The Lambda did the job of moving each file from the Hive-style partitioned S3 folder to the correctly structured block-style S3 folder.
The Lambda used the copy and delete functions of the boto3 S3 client to implement this.
It worked like a charm even though I had more than 10^6 output files split across different partitions.
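A minimal sketch of that kind of Lambda, assuming the Hive-style keys look like prefix/year=YYYY/month=MM/day=DD/file and the handler is triggered by S3 ObjectCreated events; the exact key pattern is an assumption based on the example above.

# Hypothetical Lambda: copy objects from Hive-style partitions
# (prefix/year=2021/month=02/day=02/...) to Firehose block-style
# partitions (prefix/2021/02/02/...) and delete the originals.
import re
import urllib.parse

import boto3

s3 = boto3.client("s3")

HIVE_KEY = re.compile(
    r"^(?P<prefix>.*)/year=(?P<y>\d{4})/month=(?P<m>\d{2})/day=(?P<d>\d{2})/(?P<name>[^/]+)$")


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        match = HIVE_KEY.match(key)
        if not match:
            continue  # not a Hive-style partitioned object

        # Rebuild the key in the block-style yyyy/MM/dd layout.
        new_key = "{prefix}/{y}/{m}/{d}/{name}".format(**match.groupdict())

        s3.copy_object(
            Bucket=bucket,
            Key=new_key,
            CopySource={"Bucket": bucket, "Key": key},
        )
        s3.delete_object(Bucket=bucket, Key=key)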

Automatically ETL data before loading to BigQuery

I have CSV files added to a GCS bucket daily or weekly; each file name contains a date and a specific parameter.
The files contain two columns (id, name), and we need to automatically load/ingest these files into a BigQuery table so that the final table has four columns (id, name, date, specific parameter).
We have tried Dataflow templates, but we couldn't get the date and the specific parameter from the file name into Dataflow.
We also tried a Cloud Function (we can get the date and the specific parameter from the file name), but we couldn't add them as columns during ingestion.
Any suggestions?
Disclaimer: I have authored an article on this kind of problem using Cloud Workflows, for when you want to extract parts of the filename to use in the table definition later.
We will create a Cloud Workflow to load data from Google Storage into BigQuery. The linked article is a complete guide on how to work with workflows: connecting any Google Cloud APIs, working with subworkflows and arrays, extracting segments, and calling BigQuery load jobs.
Let’s assume we have all our source files in Google Storage. Files are organized in buckets, folders, and could be versioned.
Our workflow definition will have multiple steps.
(1) We will start by using the GCS API to list files in a bucket, by using a folder as a filter.
(2) For each file, we will then use parts of the filename in the generated BigQuery table name.
(3) The workflow’s last step will be to load the GCS file into the indicated BigQuery table.
We are going to use BigQuery query syntax to parse and extract the segments from the URL and return them as a single row result. This way we will have an intermediate lesson on how to query from BigQuery and process the results.
Full article with lots of Code Samples is here: Using Cloud Workflows to load Cloud Storage files into BigQuery
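For readers who prefer a client library over Workflows, here is a rough Python sketch of the same list-parse-load idea, using the google-cloud-storage and google-cloud-bigquery clients. The bucket, dataset, table names, and the filename pattern (name_YYYY-MM-DD_param.csv) are assumptions, not part of the article.

# Rough Python equivalent of the flow described above: list files in GCS,
# pull the date and parameter out of each filename, load the CSV into a
# staging table, then append the rows with the two extra columns.
import re

from google.cloud import bigquery, storage

BUCKET = "my-bucket"
DATASET = "my_dataset"
# e.g. "sales_2021-02-02_region7.csv" -> date=2021-02-02, param=region7
FILENAME = re.compile(r"^.*_(\d{4}-\d{2}-\d{2})_(.+)\.csv$")

bq = bigquery.Client()
gcs = storage.Client()

for blob in gcs.list_blobs(BUCKET, prefix="incoming/"):
    match = FILENAME.match(blob.name.rsplit("/", 1)[-1])
    if not match:
        continue
    file_date, param = match.groups()

    # Load the two-column CSV into a staging table (assumes a header row).
    staging = f"{DATASET}.staging_load"
    job = bq.load_table_from_uri(
        f"gs://{BUCKET}/{blob.name}",
        staging,
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            schema=[
                bigquery.SchemaField("id", "INTEGER"),
                bigquery.SchemaField("name", "STRING"),
            ],
            write_disposition="WRITE_TRUNCATE",
        ),
    )
    job.result()

    # Append to the final four-column table, adding the values parsed
    # from the filename as constant columns.
    bq.query(
        f"""
        INSERT INTO `{DATASET}.final_table` (id, name, load_date, specific_parameter)
        SELECT id, name, DATE '{file_date}', '{param}'
        FROM `{staging}`
        """
    ).result()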

Is there a way PigActivity in AWS Pipeline can read schema from Athena tables created on S3 buckets

I have a lot of legacy Pig scripts that run on an on-prem cluster. We are trying to move to AWS Data Pipeline (PigActivity), and I want these Pig scripts to be able to read data from the S3 buckets where my source data resides. The on-prem Pig scripts use the HCatalog loader to read Hive table schemas. So, if I create Athena tables on those S3 buckets, is there a way to read the schema from those Athena tables inside the Pig scripts, using some sort of loader similar to HCatLoader?
Current: the code below works, but I have to define the schema inside the Pig script.
%default SOURCE_LOC 's3://s3bucket/input/abc'
inp_data = LOAD '$SOURCE_LOC' USING PigStorage('\001') AS
(id: bigint, val_id: int, provision: chararray);
Want:
Read from an Athena table instead.
Athena table: database_name.abc (schema: id:bigint, val_id:int, provision:string)
So I am looking for something like the code below, so that I do not have to define the schema inside the Pig script:
%default SOURCE_LOC 'database_name.abc'
inp_data = LOAD '$SOURCE_LOC' USING athenaloader();
Is there a loader utility to read from Athena, or is there an alternative solution to my need? Please help.
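I am not aware of an Athena loader for Pig. One possible direction (a sketch, not a drop-in loader) is to fetch the table schema from the Glue Data Catalog, which backs Athena, before the Pig step runs, and generate the AS (...) clause as a Pig parameter. The database/table names and the type mapping below are assumptions.

# Hypothetical pre-processing step: read the table schema from the Glue
# Data Catalog and emit the Pig AS(...) clause, so the schema is not
# hard-coded in the script.
import boto3

# Map a few Athena/Glue types onto Pig types (illustrative, not exhaustive).
TYPE_MAP = {"bigint": "bigint", "int": "int", "string": "chararray",
            "double": "double", "boolean": "boolean"}

glue = boto3.client("glue")
table = glue.get_table(DatabaseName="database_name", Name="abc")["Table"]

columns = table["StorageDescriptor"]["Columns"]
as_clause = ", ".join(
    f"{col['Name']}: {TYPE_MAP.get(col['Type'], 'chararray')}" for col in columns)

# Substitute this into the Pig script via a parameter, e.g.:
#   inp_data = LOAD '$SOURCE_LOC' USING PigStorage('\001') AS ($SCHEMA);
print(as_clause)  # id: bigint, val_id: int, provision: chararray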

AWS Glue: Is it possible to pull only specific data from a database?

I need to transform a fairly big database table to CSV with AWS Glue. However, I only need the newest table rows from the past 24 hours. There is a column which specifies the creation date of each row. Is it possible to transform just these rows, without copying the whole table into the CSV file? I am using a Python script with Spark.
Thank you very much in advance!
There are some built-in transforms in AWS Glue which are used to process your data. These transforms can be called from ETL scripts.
Please refer to the link below:
https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html
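For instance, the built-in Filter transform can keep only the rows whose creation date falls within the last 24 hours. A minimal sketch, assuming a Data Catalog table and a timestamp column; database, table, column, and path names are placeholders.

# Minimal sketch of the built-in Filter transform; names are placeholders,
# and creation_date is assumed to be a timestamp column in the catalog.
from datetime import datetime, timedelta

from awsglue.context import GlueContext
from awsglue.transforms import Filter
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

source = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table")

cutoff = datetime.utcnow() - timedelta(hours=24)
recent = Filter.apply(frame=source,
                      f=lambda row: row["creation_date"] >= cutoff)

glue_context.write_dynamic_frame.from_options(
    frame=recent,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/last-24h/"},
    format="csv")

Note that Filter runs after the source has been read, so the whole table is still scanned; the JDBC query option described in the next answer pushes the filter into the database instead.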
You haven't mentioned the type of database you are trying to connect to. In any case, for JDBC connections Spark has a query option, with which you can issue a regular SQL query so that only the rows you need are returned.
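A minimal sketch of that query option in a PySpark script, assuming a PostgreSQL source; the URL, credentials, and table/column names are placeholders. The WHERE clause runs inside the database, so only the last 24 hours of rows are transferred to Spark.

# Minimal sketch of Spark's JDBC "query" option (Spark 2.4+).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("recent-rows-to-csv").getOrCreate()

recent = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/mydb")
    .option("user", "etl_user")
    .option("password", "change-me")
    .option("query",
            "SELECT * FROM my_table "
            "WHERE creation_date >= now() - interval '24 hours'")
    .load()
)

# Write only the matching rows out as CSV.
recent.write.mode("overwrite").option("header", "true").csv("s3://my-bucket/last-24h-csv/")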

Loading or pointing to multiple parquet paths for data analysis with hive or prestodb

I have a couple of Spark jobs that produce Parquet files in AWS S3. Every once in a while I need to run some ad-hoc queries on a given date range of this data. I don't want to do this in Spark because I want our QA team, which has no knowledge of Spark, to be able to do it. What I would like to do is spin up an AWS EMR cluster, load the Parquet files into HDFS, and run my queries against them. I have figured out how to create tables with Hive and point them at one S3 path, but that limits my data to only one day, because each day of data has multiple files under a path like
s3://mybucket/table/date/(parquet files 1 ... n).
So problem one is to figure out how to load multiple days of data into Hive, i.e.
s3://mybucket/table_a/day_1/(parquet files 1 ... n).
s3://mybucket/table_a/day_2/(parquet files 1 ... n).
s3://mybucket/table_a/day_3/(parquet files 1 ... n).
...
s3://mybucket/table_b/day_1/(parquet files 1 ... n).
s3://mybucket/table_b/day_2/(parquet files 1 ... n).
s3://mybucket/table_b/day_3/(parquet files 1 ... n).
I know Hive supports partitions, but my S3 files are not set up that way.
I have also looked into PrestoDB, which looks like the favorite tool for this type of data analysis. The fact that it supports ANSI SQL makes it a great tool for people who have SQL knowledge but know very little about Hadoop or Spark. I did install it on my cluster and it works great. But it looks like you can't really load data into your tables, and you have to rely on Hive to do that part. Is this the right way to use PrestoDB? I watched a Netflix presentation about their use of PrestoDB with S3 in place of HDFS. If this works, great, but I wonder how the data is moved into memory. At what point are the Parquet files moved from S3 to the cluster? Do I need a cluster that can load the entire dataset into memory? How is this generally set up?
You can install Hive and create Hive tables over your data in S3, as described in the blog post here: https://blog.mustardgrain.com/2010/09/30/using-hive-with-existing-files-on-s3/
Then install Presto on AWS and configure it to connect to the Hive catalog you set up previously. You can then query your data on S3 with Presto using SQL.
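Since the S3 layout is not Hive-style (there is no day= in the path), you can still declare a partitioned external table and point each partition at its folder explicitly. A sketch via PyHive against HiveServer2 on the EMR master; the host, column names, and day list are placeholder assumptions.

# Sketch: declare a partitioned external Hive table over the existing
# s3://mybucket/table_a/day_N/ layout by registering each day's folder
# as an explicit partition location.
from pyhive import hive

conn = hive.Connection(host="emr-master", port=10000, username="hadoop")
cursor = conn.cursor()

cursor.execute("""
CREATE EXTERNAL TABLE IF NOT EXISTS table_a (
    id BIGINT,
    value STRING
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 's3://mybucket/table_a/'
""")

# The folders are not named dt=..., so add each one explicitly.
for day in ["day_1", "day_2", "day_3"]:
    cursor.execute(
        f"ALTER TABLE table_a ADD IF NOT EXISTS PARTITION (dt='{day}') "
        f"LOCATION 's3://mybucket/table_a/{day}/'"
    )

Presto then queries the same table through the Hive metastore, reading the Parquet files from S3 as part of query execution rather than preloading the whole dataset into cluster memory.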
Rather than trying to load multiple files, you could instead use the API to concatenate the days you want into a single object, which you can then load through the means you already mention.
AWS has a blog post highlighting how to do this exact thing purely through the API (without downloading + re-uploading the data):
https://ruby.awsblog.com/post/Tx2JE2CXGQGQ6A4/Efficient-Amazon-S3-Object-Concatenation-Using-the-AWS-SDK-for-Ruby
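That approach translates to boto3 as a multipart upload where each part is copied server-side from an existing object. A sketch follows; the bucket and keys are placeholders, and every part except the last must be at least 5 MB for the copy to succeed.

# Sketch of server-side concatenation with an S3 multipart upload and
# upload_part_copy; no data is downloaded or re-uploaded.
import boto3

s3 = boto3.client("s3")

BUCKET = "mybucket"
SOURCE_KEYS = [
    "table_a/day_1/part-00000",
    "table_a/day_2/part-00000",
    "table_a/day_3/part-00000",
]
TARGET_KEY = "table_a/combined/days_1_to_3"

mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=TARGET_KEY)

parts = []
for number, key in enumerate(SOURCE_KEYS, start=1):
    part = s3.upload_part_copy(
        Bucket=BUCKET,
        Key=TARGET_KEY,
        UploadId=mpu["UploadId"],
        PartNumber=number,
        CopySource={"Bucket": BUCKET, "Key": key},
    )
    parts.append({"ETag": part["CopyPartResult"]["ETag"], "PartNumber": number})

s3.complete_multipart_upload(
    Bucket=BUCKET,
    Key=TARGET_KEY,
    UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)

Note that byte-level concatenation like this only yields a valid combined file for row-oriented text formats such as CSV/TSV; Parquet files carry footer metadata, so combining them requires reading and rewriting the data (for example with a small Spark or Hive job).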