Flume with multiple Avro schemas

I am using Flume to combine small Avro files (each containing a couple of Avro events) into larger files to be stored in HDFS. I am using the Spooling Directory source and the HDFS sink, with the Avro serializer. My spool directory contains files with 3 different schemas. Is it possible to configure Flume in such a way that it combines Avro files with each different schema into different sink files?
Thanks in advance

Yes, in fact it is. What Flume actually does is wrap your Avro objects into another Avro container object of type Event, which consists of headers and a body. That body contains your Avro objects.
In order to have those files spooled to different directories in HDFS, you will have to set headers, which you can reference in your path, e.g.:
agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode/%{avro_type}/
avro_type being the name of the header you set.
In order to set that header, you need to use an interceptor. An interceptor is a custom class which implements org.apache.flume.interceptor.Interceptor. In the public Event intercept(Event event) method, you will have to determine the type of the Avro object and call event.getHeaders().put("avro_type", <something>);
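A minimal sketch of such an interceptor, assuming the header name avro_type from above (the determineType helper is hypothetical -- how you identify the schema depends on your deserializer and file layout):

package com.example;

import java.util.List;
import java.util.Map;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class AvroTypeInterceptor implements Interceptor {

    @Override
    public void initialize() {
        // nothing to set up in this sketch
    }

    @Override
    public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        // determineType() is a placeholder: inspect the event body (or a schema
        // header set by your deserializer) to work out which of the three
        // schemas this event belongs to.
        headers.put("avro_type", determineType(event.getBody()));
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() {
        // nothing to clean up
    }

    private String determineType(byte[] body) {
        // Hypothetical logic -- replace with whatever identifies your schemas,
        // e.g. parsing the embedded Avro schema name out of the body.
        return "schema_a";
    }

    // Flume instantiates interceptors through a Builder.
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new AvroTypeInterceptor();
        }

        @Override
        public void configure(Context context) {
            // no configuration needed for this sketch
        }
    }
}

It is then wired into the agent configuration via its Builder (the source name spool-source is an assumption):
agent.sources.spool-source.interceptors = avrotype
agent.sources.spool-source.interceptors.avrotype.type = com.example.AvroTypeInterceptor$Builder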
That's basically it.

Related

Copy and Merge files to another S3 bucket

I have a source bucket where small 5KB JSON files will be inserted every second.
I want to use AWS Athena to query the files by using an AWS Glue Datasource and crawler.
For better query performance AWS Athena recommends larger file sizes.
So I want to copy the files from the source bucket to bucket2 and merge them.
I am planning to use S3 events to put a message in AWS SQS for each file created; a Lambda will then be invoked with a batch of x SQS messages, read the data in those files, combine them, and save the result to the destination bucket. bucket2 will then be the source of the AWS Glue crawler.
Will this be the best approach or am I missing something?
Instead of receiving a 5KB JSON file every second in Amazon S3, the best situation would be to receive this data via Amazon Kinesis Data Firehose, which can automatically combine data based on either size or time period. It would output fewer, larger files.
You could also achieve this with a slight change to your current setup:
When a file is uploaded to S3, trigger an AWS Lambda function
The Lambda function reads the file and sends it to Amazon Kinesis Data Firehose (a rough sketch of this follows the list)
Kinesis Firehose then batches the data by size or time
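A minimal sketch of that Lambda, assuming the AWS SDK for Java v2 and a delivery stream named my-delivery-stream (the stream name and class names are placeholders):

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.firehose.FirehoseClient;
import software.amazon.awssdk.services.firehose.model.PutRecordRequest;
import software.amazon.awssdk.services.firehose.model.Record;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

public class ForwardToFirehose implements RequestHandler<S3Event, Void> {

    private final S3Client s3 = S3Client.create();
    private final FirehoseClient firehose = FirehoseClient.create();

    // Name of your Firehose delivery stream -- an assumption for this sketch.
    private static final String STREAM = "my-delivery-stream";

    @Override
    public Void handleRequest(S3Event event, Context context) {
        event.getRecords().forEach(rec -> {
            String bucket = rec.getS3().getBucket().getName();
            String key = rec.getS3().getObject().getKey();

            // Read the newly created JSON object...
            byte[] body = s3.getObjectAsBytes(
                    GetObjectRequest.builder().bucket(bucket).key(key).build()).asByteArray();

            // ...and hand it to Firehose, which batches the data by size or time.
            firehose.putRecord(PutRecordRequest.builder()
                    .deliveryStreamName(STREAM)
                    .record(Record.builder().data(SdkBytes.fromByteArray(body)).build())
                    .build());
        });
        return null;
    }
}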
Alternatively, you could use Amazon Athena to read data from multiple S3 objects and output the results into a new table that uses Snappy-compressed Parquet files. This file format is very efficient for querying. However, your issue is that the files are arriving every second, so it is difficult to query the incoming files in batches (you need to know which files have been loaded and which have not). A kludge could be a script that does the following (a rough sketch follows the steps):
Create an external table in Athena that points to a batching directory (eg batch/)
Create an external table in Athena that points to the final data (eg final/)
Have incoming files come into incoming/
At regular intervals, trigger a Lambda function that will list the objects in incoming/, copy them to batch/ and delete those source objects from incoming/ (any objects that arrive during this copy process will be left for the next batch)
In Athena, run INSERT INTO final SELECT * FROM batch
Delete the contents of the batch/ directory
This will append the data into the final table in Athena, in a format that is good for querying.
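A rough Java sketch of that scheduled step, assuming the AWS SDK for Java v2 and placeholder bucket, database and output-location names (pagination and waiting for the Athena query to finish are omitted for brevity):

import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.model.QueryExecutionContext;
import software.amazon.awssdk.services.athena.model.ResultConfiguration;
import software.amazon.awssdk.services.athena.model.StartQueryExecutionRequest;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CopyObjectRequest;
import software.amazon.awssdk.services.s3.model.DeleteObjectRequest;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;

public class BatchAndInsert {

    private static final String BUCKET = "my-data-bucket";                        // placeholder
    private static final String DATABASE = "my_database";                         // placeholder
    private static final String RESULTS = "s3://my-data-bucket/athena-results/";  // placeholder

    private final S3Client s3 = S3Client.create();
    private final AthenaClient athena = AthenaClient.create();

    public void run() {
        // 1. Move everything currently in incoming/ to batch/ (copy, then delete).
        s3.listObjectsV2(ListObjectsV2Request.builder()
                .bucket(BUCKET).prefix("incoming/").build())
          .contents().forEach(obj -> {
            String target = obj.key().replaceFirst("^incoming/", "batch/");
            s3.copyObject(CopyObjectRequest.builder()
                    .sourceBucket(BUCKET).sourceKey(obj.key())
                    .destinationBucket(BUCKET).destinationKey(target).build());
            s3.deleteObject(DeleteObjectRequest.builder()
                    .bucket(BUCKET).key(obj.key()).build());
        });

        // 2. Append the batch into the final table via Athena.
        athena.startQueryExecution(StartQueryExecutionRequest.builder()
                .queryString("INSERT INTO final SELECT * FROM batch")
                .queryExecutionContext(QueryExecutionContext.builder().database(DATABASE).build())
                .resultConfiguration(ResultConfiguration.builder().outputLocation(RESULTS).build())
                .build());

        // 3. Once the query has completed, empty batch/ the same way as step 1
        //    (left out here).
    }
}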
However, the Kinesis Firehose option is simpler, even if you need to trigger Lambda to send the files to the Firehose.
You can probably achieve that using Glue itself. Have a look here: https://github.com/aws-samples/aws-glue-samples/blob/master/examples/join_and_relationalize.md
Here is what I think would be simpler:
Have an input folder input/ where the 5KB/1KB files land, and a data/ folder that holds JSON files with a maximum size of 200MB.
Have a Lambda that runs every minute, reads a set of files from input/, and appends them to the last file in the data/ folder, using Go or Java (a sketch follows below).
The Lambda (with maximum concurrency set to 1) copies a set of 5KB files from input/ and the current (X MB) file from data/ into its /tmp folder, merges them, uploads the merged file back to data/, and then deletes the processed files from input/.
Whenever the file size crosses 200MB, create a new file in the data/ folder.
The advantage here is that at any instant, if somebody wants the data, it is the union of the input/ and data/ folders; in other words, with a few tweaks here and there you can expose a view on top of the input/ and data/ folders that presents a de-duplicated snapshot of the final data.
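A rough sketch of that merge Lambda in Java, assuming the AWS SDK for Java v2 and placeholder bucket and key names (it buffers the files in memory rather than in /tmp, purely for brevity, and leaves out rolling over to a new data/ file at 200MB):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.DeleteObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.NoSuchKeyException;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.s3.model.S3Object;

public class MergeSmallFiles {

    private static final String BUCKET = "my-bucket";            // placeholder
    private static final String CURRENT = "data/current.json";   // placeholder: the file being appended to

    private final S3Client s3 = S3Client.create();

    public void run() throws IOException {
        ByteArrayOutputStream merged = new ByteArrayOutputStream();

        // Start with the current (partially filled) file from data/, if any.
        merged.write(read(CURRENT));

        // Append every small file currently sitting in input/.
        List<S3Object> smallFiles = s3.listObjectsV2(ListObjectsV2Request.builder()
                .bucket(BUCKET).prefix("input/").build()).contents();
        for (S3Object obj : smallFiles) {
            merged.write(read(obj.key()));
        }

        // Write the merged result back to data/ ...
        s3.putObject(PutObjectRequest.builder().bucket(BUCKET).key(CURRENT).build(),
                RequestBody.fromBytes(merged.toByteArray()));

        // ... and only then remove the consumed input files.
        for (S3Object obj : smallFiles) {
            s3.deleteObject(DeleteObjectRequest.builder().bucket(BUCKET).key(obj.key()).build());
        }
    }

    private byte[] read(String key) {
        try {
            return s3.getObjectAsBytes(GetObjectRequest.builder().bucket(BUCKET).key(key).build())
                     .asByteArray();
        } catch (NoSuchKeyException e) {
            return new byte[0]; // no current file yet
        }
    }
}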

Read a Hive table (or HDFS data in Parquet format) in StreamSets DC

Is it possible to read a Hive table (or HDFS data in Parquet format) in StreamSets Data Collector? I don't want to use Transformer for this.
Reading raw Parquet files runs counter to the way Data Collector works, so that would be a better use case for Transformer.
However, I have successfully used the JDBC origin with either Impala or Hive to achieve this; there are some additional hurdles to jump with the JDBC source.

How can I load data into Snowflake from S3 whilst specifying data types

I'm aware that it's possible to load data from files in S3 (e.g. CSV, Parquet or JSON) into Snowflake by creating an external stage with file format type CSV and then loading it into a table with one column of type VARIANT. But this needs a manual step to cast the data into the correct types to create a view which can be used for analysis.
Is there a way to automate this loading process from S3 so that the table column data types are either inferred from the CSV file or specified elsewhere by some other means? (Similar to how a table can be created in Google BigQuery from CSV files in GCS with an inferred table schema.)
As of today, the single VARIANT column solution you are adopting is the closest you can get with Snowflake's out-of-the-box tools to achieving your goal, which, as I understand from your question, is to let the loading process infer the source file structure.
In fact, the COPY command needs to know the structure of the expected file that it is going to load data from, through FILE_FORMAT.
More details: https://docs.snowflake.com/en/user-guide/data-load-s3-copy.html#loading-your-data
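For completeness, a minimal sketch of that explicit-structure approach driven from Java via the Snowflake JDBC driver (all connection details, the table, its columns and the stage name my_s3_stage are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;

public class CopyFromStage {
    public static void main(String[] args) throws SQLException {
        // Connection details are placeholders.
        Properties props = new Properties();
        props.put("user", "MY_USER");
        props.put("password", "MY_PASSWORD");
        props.put("warehouse", "MY_WH");
        props.put("db", "MY_DB");
        props.put("schema", "PUBLIC");

        try (Connection conn = DriverManager.getConnection(
                "jdbc:snowflake://myaccount.snowflakecomputing.com", props);
             Statement stmt = conn.createStatement()) {

            // The column types are declared up front instead of being inferred.
            stmt.execute("CREATE TABLE IF NOT EXISTS events ("
                       + " id NUMBER, created_at TIMESTAMP_NTZ, payload VARCHAR)");

            // COPY is told the file structure via FILE_FORMAT; my_s3_stage is an
            // external stage already pointing at the S3 location.
            stmt.execute("COPY INTO events"
                       + " FROM @my_s3_stage/path/"
                       + " FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)");
        }
    }
}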

Naming a Parquet file in a Glue job

How do you assign a predefined name to the Parquet files in an AWS Glue job?
For example, after my job runs, a Parquet file gets stored in the specified folder with a name like:
part-00000-fc95461f-00da-437a-9396-93c7ea473720.snappy.parquet,
part-00000-tc95431f-00ds-437b-9396-93c7ea473720.snappy.parquet
I want the files to be stored with a predefined or structured name like:
part-00000-12Jan2018.snappy.parquet,
part-00000-13Jan2018.snappy.parquet
etc.
Due to the nature of how Spark works, we can't name the files to our liking at present.
An alternative approach would be to rename the files as soon as they are written to S3 / the data lake.
I found these answers to be helpful.
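For example, a rough sketch of that rename, assuming the AWS SDK for Java v2 and placeholder bucket/prefix names (S3 has no real rename, so it is a copy followed by a delete):

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CopyObjectRequest;
import software.amazon.awssdk.services.s3.model.DeleteObjectRequest;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;

public class RenameGlueOutput {

    private static final String BUCKET = "my-output-bucket";  // placeholder
    private static final String PREFIX = "output/";           // placeholder

    public static void main(String[] args) {
        S3Client s3 = S3Client.create();
        String date = LocalDate.now().format(DateTimeFormatter.ofPattern("ddMMMyyyy"));

        s3.listObjectsV2(ListObjectsV2Request.builder().bucket(BUCKET).prefix(PREFIX).build())
          .contents().stream()
          .filter(o -> o.key().contains("part-") && o.key().endsWith(".snappy.parquet"))
          .forEach(o -> {
              // e.g. output/part-00000-<uuid>.snappy.parquet -> output/part-00000-12Jan2018.snappy.parquet
              int start = o.key().indexOf("part-");
              String partNumber = o.key().substring(start, start + 10); // "part-00000"
              String newKey = PREFIX + partNumber + "-" + date + ".snappy.parquet";

              // S3 cannot rename in place, so copy to the new key and delete the original.
              s3.copyObject(CopyObjectRequest.builder()
                      .sourceBucket(BUCKET).sourceKey(o.key())
                      .destinationBucket(BUCKET).destinationKey(newKey).build());
              s3.deleteObject(DeleteObjectRequest.builder().bucket(BUCKET).key(o.key()).build());
          });
    }
}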

Validate Avro files in Java

Is there any API to validate an Avro file, to make sure that the file is not corrupt?
Currently I am using DataFileWriter.getSchema() to check that the Avro file is not corrupt, but checking only the schema doesn't ensure that the file is not corrupt.
I am left with the option of reading every record. Is there any other way to validate an Avro file?
Thanks
Nope, there is no other way. An Avro file can be thought of as a delimited file with a header. The header specifies the fields and their types, and the records are just binary delimited data. Just as with text-delimited data, the only way to be 100% sure is to check every record.
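For reference, a minimal check using the Avro Java API that reads every record back (a truncated or corrupted file will typically fail with an exception along the way):

import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroFileCheck {

    // Returns true only if every record in the file can be read back.
    public static boolean isReadable(File avroFile) {
        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(avroFile, new GenericDatumReader<GenericRecord>())) {
            GenericRecord record = null;
            while (reader.hasNext()) {
                record = reader.next(record); // reuse the record object to limit allocations
            }
            return true;
        } catch (IOException | RuntimeException e) {
            // Corruption usually surfaces here as an IOException or AvroRuntimeException.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isReadable(new File(args[0])) ? "OK" : "corrupt");
    }
}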