Kinesis Firehose to S3 in Parquet format (and Snappy compression) - amazon-s3

I have set up AWS DMS and taken data from my SQLServer RDS into Kinesis Streams and then into Kinesis Firehose and now want to write the data into S3 in Parquet format compressed (Snappy).
On the Console for Firehose it says I need to specify a schema for the source records. Is this necessary?...if so how do I take the source data definition and create a Glue Catalog to use at this point? Also, if I add to the table columns with additional optional columns that Kinesis Streams can add (eg the time of the transaction), will these need defining in the catalog also?

On the Console for Firehose it says I need to specify a schema for the source records. Is this necessary?
Yes. Parquet is columnar in nature. Kinesis needs to know the schema to convert to Parquet format.
if I add to the table columns with additional optional columns that Kinesis Streams can add (eg the time of the transaction), will these need defining in the catalog also?
Yes. All columns you send that are not defined in schema will be ignored and lost. They need to be defined in the schema

Related

DynamoDB data to S3 in Kinesis Firehose output format

Kinesis data firehose has a default format to add files into separate partitions in S3 bucket which looks like : s3://bucket/prefix/yyyy/MM/dd/HH/file.extension
I have created event streams to dump data from DynamoDB to S3 using Firehose. There is a transformation lambda in between which converts DDB records into TSV format (tab separated).
All of this is added on an existing table which already contains huge data. I need to backfill the existing data from DynamoDB to S3 bucket maintaining the parity in format with existing Firehose output style.
Solution I tried :
Step 1 : Export the Table to S3 using DDB Export feature. Use Glue crawler to create Data catalog Table.
Step 2 : Used Athena's CREATE TABLE AS SELECT Query to imitate the transformation done by the intermediate Lambda and storing that Output to S3 location.
Step 3 : However, Athena CTAS applies a default compression that cannot be done away with. So I wrote a Glue Job that reads from the previous table and writes to another S3 location. This job also takes care of adding the partitions based on year/month/day/hour as is the format with Firehose, and writes the decompressed S3 tab-separated format files.
However, the problem is that Glue creates Hive-style partitions which look like :
s3://bucket/prefix/year=2021/month=02/day=02/. And I need to match the firehose block style S3 partitions instead.
I am looking for an approach to help achieve this. Couldn't find a way to add block style partitions using Glue. Another approach I have is, to use AWS CLI S3 mv command to move all this data into separate folders with correct file-name which is not clean and optimised.
Leaving the solution I ended up implementing here in case it helps anyone.
I created a Lambda and added S3 event trigger on this bucket. The Lambda did the job of moving the file from Hive-style partitioned S3 folder to correctly structured block-style S3 folder.
The Lambda used Copy and delete function from boto3 s3Client to implement the same.
It worked like a charm even though I had like > 10^6 output files split across different partitions.

Read hive table (or HDFS data in parquet format) in Streamsets DC

Is it possible to read hive table (or HDFS data in parquet format) in Streamsets Data collector? I don't want to use Transformer for this.
Reading the raw files in parquet is counter to the way that data collector works so that would be a better use case for transformer.
But I have successfully used the jdbc origin either from Impala or hive to achieve this, there are some additional hurdles to jump with the jdbc source.

Are the relations preserved when transferring data from a relational DB to S3?

There are options for transferring a DB snapshot from a relational database to S3 in AWS.
But S3 is an object store, so it only stores files (e.g. parquet).
Are the relationships (like keys) between tables in the relational DB somehow carried over to S3? Can queries still be made against the files in S3 that would allow joins to be made between tables?
There are no "keys" like foreign key, primary key in the exported parquet files in S3, but you can still query the the exported data directly through tools like Amazon Athena or Amazon Redshift Spectrum. For more information on using Athena to read Parquet data, see Parquet SerDe in the Amazon Athena User Guide. For more information on using Redshift Spectrum to read Parquet data, see COPY from columnar data formats in the Amazon Redshift Database Developer Guide.
The time it takes for the export to complete depends on the data stored in the database. For example, tables with well distributed numeric primary key or index columns will export the fastest. Tables that don't contain a column suitable for partitioning and tables with only one index on a string-based column will take longer because the export uses a slower single threaded process. For example if a table got a numeric pk and got 100,000 rows, during export data will be "partitioned" in a few portion, each portion are a directory in the S3 bucket, so that when you query data in Athena/Redshift spectrum with that id, AWS know what buckets to scan to get the data and thus improve performance and speed.
In summary, after data exported as columnar format like parquet in S3, you can do inplace query by Athena, load the data to redshift or data store for more analytics, etc..

Which file format I have to use which supports appending?

Currently We use orc file format to store the incoming traffic in s3 for fraud detection analysis
We did choose orc file format for following reasons
compression
and ability to query the data using athena
Problem :
As the orc files are read only as soon and we want to update the file contents constantly every 20 minutes
which implies we
need to download the orc files from s3,
read the file
write to the end of file
and finally upload it back to s3
This was not a problem but as the data grows significantly every day ~2GB every day. It is highly costly process to download 10Gb files read it and write and upload it
Question :
Is there any way to use another file format which also offers appends/inserts and can be used by athena to query?
From this article it says avro is file format, but not sure
If athena can be used for querying ?
any other issues ?
Note: My skill on big data technologies is on beginner level
If your table is not partitioned, can simply copy (aws s3 cp) your new orc files to the target s3 path for the table and they will be available instantly for querying via Athena.
If your table is partitioned, you can copy new files to the paths corresponding to your specific partitions. At the end of copying new files to the partition, you need to add or update that partition into Athena's metastore.
For example, if your table is partitioned by date, then you need to run this query to ensure your partition gets added/updated:
alter table dataset.tablename add if not exists
partition (date = YYYYMMDD)
location 's3://your-bucket/path_to_table/date=YYYYMMDD/'

Presto query engine with Azure Data Lake

I have a requirement to deploy a presto server which can help me query data stored in ADLS in Avro file formats.
I have gone through this tutorial and it seems that the Hive is used as a catalogue/connector in presto to query from ADLS. Can I bypass Hive and have any connector to extract data from ADLS?
Can I bypass Hive and have any connector to extract data from ADLS?
No.
Hive here plays two roles here:
storage for metadata. It contains information like:
schema and table name
columns
data format
data location
execution
it is capable to read data from (HDFS) distributed file systems (like HDFS, S3, ADLS)
it tells how execution can be distributed.