Kafka-connect without schema registry - amazon-s3

I have a Kafka topic that I would like to feed with Avro data (it currently receives JSON). I know the "proper" way to do this is to use the Schema Registry, but for testing purposes I would like to make it work without it.
So I am sending the Avro data as an Array[Byte] instead of regular JSON objects:
import java.io.{ByteArrayOutputStream, File}
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter}
import org.apache.avro.io.EncoderFactory
// Parse the schema file and write the record using the Avro binary encoding
val schema = new Schema.Parser().parse(new File("mySchema.avsc"))
val writer = new GenericDatumWriter[GenericData.Record](schema)
val out = new ByteArrayOutputStream
val encoder = EncoderFactory.get.binaryEncoder(out, null)
writer.write(myAvroData, encoder)
encoder.flush()
out.close()
out.toByteArray
The schema is embedded in each message; how can I make this work with Kafka Connect? The Kafka Connect configuration currently has the following properties (data is written to S3 as json.gz files), and I want to write Parquet files instead:
{
"name": "someName",
"connector.class": "io.confluent.connect.s3.S3SinkConnector",
"tasks.max": "120",
"topics": "user_sync",
"s3.region": "someRegion",
"s3.bucket.name": "someBucket",
"s3.part.size": "5242880",
"s3.compression.type": "gzip",
"filename.offset.zero.pad.width": "20",
"flush.size": "5000",
"rotate.interval.ms": "600000",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"format.class": "io.confluent.connect.s3.format.json.JsonFormat",
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"path.format": "YYYY/MM/dd/HH",
"timezone" : "UTC",
"locale": "en",
"partition.duration.ms": "600000",
"timestamp.extractor": "RecordField",
"timestamp.field" : "ts",
"schema.compatibility": "NONE"
I suppose I need to change "format.class" to "io.confluent.connect.hdfs.parquet.ParquetFormat"? But is that enough?
Thanks a lot!

JsonConverter will be unable to consume Avro-encoded data, and the registry-based Avro binary format carries a schema ID that has to be extracted before the converter can determine what the data looks like.
You'll want to use the registryless-avro-converter, which will create a Connect Struct object that can then be converted into a Parquet record.
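For reference, a rough sketch of the relevant sink properties with this approach. The converter class and its schema.path property are what I recall from the registryless-avro-converter project (double-check the exact names in its README), and ParquetFormat for the S3 sink is only available in recent versions of the connector, so verify both against what you are actually running:
"value.converter": "me.frmr.kafka.connect.RegistrylessAvroConverter",
"value.converter.schema.path": "/path/to/mySchema.avsc",
"format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
"storage.class": "io.confluent.connect.s3.storage.S3Storage"
With this setup "value.converter.schemas.enable" is no longer needed, since the schema now comes from the .avsc file rather than from an embedded JSON envelope.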

Related

Read JSON Key Value and store it in an XSLT Variable

Please help me with XSLT code that will work in DataPower for the following input
Input: {
"Timestamp": "2018-12-19T10:52:21.0870605-05:00",
"ResponseType": "Success",
"Name": [
{
"Code": "1001",
"Description": "ABC",
"Number": "123"
},
{
"Code": "1002",
"Description": "XYZ",
"Number": "123"
},
{
"Code": "1003",
"Description": "PQA",
"Number": "123"
},
{
"Code": "1004",
"Description": "MNO",
"Number": "123"
}
]
}
Output:
XSLT Variable
xsl:variable_code = 1001,1002,1003,1004
xsl:variable_Name : ABC,XYZ,PQA,MNO
XSLT will not work with this format natively (XSLT input is always XML, but the output can be whatever you like).
There are ways to get around this.
1 - Use a GatewayScript transformation instead. You can find examples in your own DataPower "samples" folders; the files end with ".js".
2 - You can still do it in XSLT, but you need to auto-convert the JSON to XML using the input settings and a special, hidden, "magic" variable.
How-to:
In your object (XML Firewall or Multi-Protocol Gateway), specify the input as "JSON".
At the step in the rule where you want to use XSLT to interpret this input, do not use the variable "PIPE" or "INPUT" as input, but "__JSONASJSONX".
This will allow you to navigate the JSON file after conversion to XML.
Here is an example of the conversion.
The rest is just normal XSLT programming on the DataPower; you can create JSON or XML output, your choice!
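As a rough illustration of that last step, here is an XSLT 1.0 sketch that builds the two comma-separated variables from the converted input. It assumes the standard DataPower JSONX namespace and element names (json:object, json:array, json:string), so adjust the paths to whatever the converted document actually looks like on your appliance:
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:json="http://www.ibm.com/xmlns/prod/2009/jsonx">
  <xsl:output method="xml"/>
  <xsl:template match="/">
    <!-- Comma-separated list of the Code values in the "Name" array -->
    <xsl:variable name="code">
      <xsl:for-each select="//json:array[@name='Name']/json:object">
        <xsl:value-of select="json:string[@name='Code']"/>
        <xsl:if test="position() != last()">,</xsl:if>
      </xsl:for-each>
    </xsl:variable>
    <!-- Comma-separated list of the Description values -->
    <xsl:variable name="names">
      <xsl:for-each select="//json:array[@name='Name']/json:object">
        <xsl:value-of select="json:string[@name='Description']"/>
        <xsl:if test="position() != last()">,</xsl:if>
      </xsl:for-each>
    </xsl:variable>
    <!-- Use the variables however you need; here they are simply emitted -->
    <result code="{$code}" names="{$names}"/>
  </xsl:template>
</xsl:stylesheet>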

Handle lags in Kafka S3 Connector

We're using Kafka Connect [distributed mode, Confluent 4.0].
It works very well, except that there always remain uncommitted messages in the topic that the connector listens to. The behavior is probably related to the S3 connector configuration "flush.size": "20000"; the lag in the topic is always below the flush size.
Our data comes in batches; I don't want to wait until the next batch arrives, nor reduce flush.size and create tons of files.
Is there a way to set a timeout after which the S3 connector will flush the data even if it hasn't reached 20000 events?
Thanks!
"config": {
"connector.class": "io.confluent.connect.s3.S3SinkConnector",
"topics": "event",
"tasks.max": "3",
"topics.dir": "connect",
"s3.region": "some_region",
"s3.bucket.name": "some_bucket",
"s3.part.size": "5242880",
"flush.size": "20000",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
"schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
"schema.compatibility": "FULL",
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"path.format": "'\''day_ts'\''=YYYYMMdd/'\''hour_ts'\''=H",
"partition.duration.ms": "3600000",
"locale": "en_US",
"timezone": "UTC",
"timestamp.extractor": "RecordField",
"timestamp.field": "time"
}
}
To flush outstanding records periodically on low-volume topics with the S3 Connector you may use the configuration property:
rotate.schedule.interval.ms
(Complete list of configs here)
Keep in mind that by using the property above you might see duplicate messages in the event of reprocessing or recovery from errors, regardless of which partitioner you are using.
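For example, adding something like the following to the connector config above forces a flush every ten minutes of wall-clock time (in the configured timezone, which is already set to UTC here), even if fewer than 20000 records have arrived:
"rotate.schedule.interval.ms": "600000"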

Populating Django model sql tables with a json file

I would like to use a JSON file to populate instances of a Django model. I have essentially flattened the structure in the JSON to a few tables/classes. How do you map the JSON data to the Django tables?
What is the most efficient way of doing this?
Thanks.
$ python manage.py loaddata yourjsonfile.json
Let's say you want to populate the standard Django user table with 2 users, John Lennon and Yoko Ono. Your JSON will look something like:
[
{
"pk": 1,
"model": "auth.user",
"fields": {
"username": "john",
"first_name": "John",
"last_name": "Lennon",
"is_active": true,
"is_superuser": true,
"is_staff": true,
"last_login": "2015-06-03T14:07:31.392Z",
"groups": [],
"user_permissions": [],
"password": "pbaasdf_sha256$12001$9Ser7lc1k1pWQFqk0x3u/T6I3",
"email": "john#lennon.com",
"date_joined": "2015-03-10T15:38:34.406Z"
}
},
{
"pk": 2,
"model": "auth.user",
"fields": {
"username": "yoko",
"first_name": "Yoko",
"last_name": "Ono",
"is_active": true,
"is_superuser": false,
"is_staff": false,
"last_login": "2015-05-19T13:36:58.444Z",
"groups": [],
"user_permissions": [],
"password": "baasdf_sha256$12cJskLs9Ser7lc1k1pWQFqk0x3u/T6I3",
"email": "yoko#ono.com",
"date_joined": "2014-05-19T13:36:58.444Z"
}
}
]
"Providing initial data for models"
It’s sometimes useful to pre-populate your database with hard-coded data when you’re first setting up an app. You can provide initial data via fixtures.
A fixture is a collection of data that Django knows how to import into a database. The most straightforward way of creating a fixture if you’ve already got some data is to use the manage.py dumpdata command. Or, you can write fixtures by hand; fixtures can be written as JSON, XML or YAML (with PyYAML installed) documents. The serialization documentation has more details about each of these supported serialization formats.
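For example, if you already have a populated database, you can generate a fixture like the one shown above with dumpdata and load it back later:
python manage.py dumpdata auth.user --indent 2 > users.json
python manage.py loaddata users.json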

Multistorage with avro?

I have a single file containing multiple Avro records. Each record contains a unique "name". How do I load and store the records so that each output file holds the record(s) corresponding to a given name?
Here is my avro schema:
{
"type": "records",
"name": "XXItem",
"namespace": "com.xxx.xxx",
"fields": [
{
"name": "data",
"type": {"type": "map", "values" : ["string", "long", "int"]}
}
]
}
A quick check seems to indicate that Avro is simply using JSON for data storage.
By looking for solutions for handling JSON in general, you should be able to come up with something that works for you.
This could be a starting point: Hadoop for JSON files

AWS Data pipeline CSV data from S3 to DynamoDB

I am trying to transfer CSV data from an S3 bucket to DynamoDB using AWS Data Pipeline. The following is my pipeline script; it is not working properly.
CSV file structure
Name, Designation,Company
A,TL,C1
B,Prog, C2
DynamoDB: N_Table, with Name as the hash key
{
"objects": [
{
"id": "Default",
"scheduleType": "cron",
"name": "Default",
"role": "DataPipelineDefaultRole",
"resourceRole": "DataPipelineDefaultResourceRole"
},
{
"id": "DynamoDBDataNodeId635",
"schedule": {
"ref": "ScheduleId639"
},
"tableName": "N_Table",
"name": "MyDynamoDBData",
"type": "DynamoDBDataNode"
},
{
"emrLogUri": "s3://onlycsv/error",
"id": "EmrClusterId636",
"schedule": {
"ref": "ScheduleId639"
},
"masterInstanceType": "m1.small",
"coreInstanceType": "m1.xlarge",
"enableDebugging": "true",
"installHive": "latest",
"name": "ImportCluster",
"coreInstanceCount": "1",
"logUri": "s3://onlycsv/error1",
"type": "EmrCluster"
},
{
"id": "S3DataNodeId643",
"schedule": {
"ref": "ScheduleId639"
},
"directoryPath": "s3://onlycsv/data.csv",
"name": "MyS3Data",
"dataFormat": {
"ref": "DataFormatId1"
},
"type": "S3DataNode"
},
{
"id": "ScheduleId639",
"startDateTime": "2013-08-03T00:00:00",
"name": "ImportSchedule",
"period": "1 Hours",
"type": "Schedule",
"endDateTime": "2013-08-04T00:00:00"
},
{
"id": "EmrActivityId637",
"input": {
"ref": "S3DataNodeId643"
},
"schedule": {
"ref": "ScheduleId639"
},
"name": "MyImportJob",
"runsOn": {
"ref": "EmrClusterId636"
},
"maximumRetries": "0",
"myDynamoDBWriteThroughputRatio": "0.25",
"attemptTimeout": "24 hours",
"type": "EmrActivity",
"output": {
"ref": "DynamoDBDataNodeId635"
},
"step": "s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://elasticmapreduce/libs/hive/dynamodb/importDynamoDBTableFromS3,-d,DYNAMODB_OUTPUT_TABLE=#{output.tableName},-d,S3_INPUT_BUCKET=#{input.directoryPath},-d,DYNAMODB_WRITE_PERCENT=#{myDynamoDBWriteThroughputRatio},-d,DYNAMODB_ENDPOINT=dynamodb.us-east-1.amazonaws.com"
},
{
"id": "DataFormatId1",
"name": "DefaultDataFormat1",
"column": [
"Name",
"Designation",
"Company"
],
"columnSeparator": ",",
"recordSeparator": "\n",
"type": "Custom"
}
]
}
While executing the pipeline, two of the four steps finish, but it does not execute completely.
Currently (2015-04) the default import pipeline template does not support importing CSV files.
If your CSV file is not too big (under 1 GB or so) you can create a ShellCommandActivity to convert the CSV to the DynamoDB JSON format first, and then feed that to an EmrActivity that imports the resulting JSON file into your table.
As a first step you can create a sample DynamoDB table including all the field types you need, populate it with dummy values, and then export the records using a pipeline (Export/Import button in the DynamoDB console). This will give you an idea of the format that the Import pipeline expects. The type names are not obvious, and the Import activity is very sensitive about the correct case (e.g. you should have bOOL for a boolean field).
Afterwards it should be easy to create an awk script (or any other text converter; at least with awk you can use the default AMI image for your shell activity), which you can feed to your ShellCommandActivity. Don't forget to enable the "staging" flag, so your output is uploaded back to S3 for the Import activity to pick it up.
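Purely as an illustration of that converter step (not of the exact import format), here is a small Python sketch in the spirit of the awk approach; the separators and the {"s": ...} type tag are placeholders that you should replace with whatever you actually see in the sample export described above:
import csv
import json
import sys

# Placeholder separators: take the real ones from a sample export produced by the
# Export pipeline, as described above -- that export format is what the Import
# pipeline expects.
FIELD_SEP = "\x02"   # assumed separator between attributes
VALUE_SEP = "\x03"   # assumed separator between attribute name and typed value

def convert(csv_path):
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # One output line per CSV row; {"s": ...} marks a string attribute here
            # purely as an assumption -- verify the type tags against your sample export.
            attrs = [
                name.strip() + VALUE_SEP + json.dumps({"s": value.strip()})
                for name, value in row.items()
            ]
            sys.stdout.write(FIELD_SEP.join(attrs) + "\n")

if __name__ == "__main__":
    convert(sys.argv[1])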
If you are using the template data pipeline for importing data from S3 to DynamoDB, these data formats won't work. Instead, use the format described in the link below for the input S3 data file: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb-pipelinejson-verifydata2.html
This is the format of the output file generated by the template data pipeline that exports data from DynamoDB to S3.
Hope that helps.
I would recommend using the CSV data format provided by Data Pipeline instead of Custom.
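For example, the Custom format object in the pipeline definition above could be replaced with something along these lines (the type name "CSV" is what I recall from the Data Pipeline data format documentation, so verify it before relying on it):
{
  "id": "DataFormatId1",
  "name": "DefaultDataFormat1",
  "type": "CSV",
  "column": [
    "Name",
    "Designation",
    "Company"
  ]
}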
For debugging the errors on cluster, you can lookup the jobflow in EMR console and look at the log files for the tasks that failed.
See the link below for a solution that works (in the question section), albeit for EMR 3.x. Just change the delimiter to "columnSeparator": ",". Personally, I wouldn't use CSV unless you are certain the data is sanitized correctly.
How to upgrade Data Pipeline definition from EMR 3.x to 4.x/5.x?