Presto configuration through EMR launch config - amazon-emr

I am trying to deploy Presto on EMR through our EMR launch config JSON. I picked the config properties as advised in this GitHub issue for Presto and added the following Presto properties to the launch config:
{
    "Classification": "presto-connector-hive",
    "Properties": {
        "hive.metastore.glue.datacatalog.enabled": "true",
        "hive.table-statistics-enabled": "true"
    },
    "Configurations": []
},
{
    "Classification": "presto-config",
    "Properties": {
        "query.max-memory": "150G",
        "query.max-memory-per-node": "20G",
        "query.max-total-memory-per-node": "30G",
        "memory.heap-headroom-per-node": "10G",
        "query.initial-hash-partitions": "15"
    },
    "Configurations": []
}
The EMR cluster gets created, but Presto fails to start with the following errors:
1) Explicit bindings are required and com.facebook.presto.memory.LowMemoryKiller is not explicitly bound.
while locating com.facebook.presto.memory.LowMemoryKiller
for parameter 7 at com.facebook.presto.memory.ClusterMemoryManager.<init>(ClusterMemoryManager.java:123)
at com.facebook.presto.server.CoordinatorModule.setup(CoordinatorModule.java:189) (via modules: com.facebook.presto.server.ServerMainModule -> com.facebook.presto.server.CoordinatorModule)
2) Error: Could not coerce value '150G' to io.airlift.units.DataSize (property 'query.max-memory') in order to call [public com.facebook.presto.memory.MemoryManagerConfig com.facebook.presto.memory.MemoryManagerConfig.setMaxQueryMemory(io.airlift.units.DataSize)]
3) Error: Could not coerce value '20G' to io.airlift.units.DataSize (property 'query.max-memory-per-node') in order to call [public com.facebook.presto.memory.NodeMemoryConfig com.facebook.presto.memory.NodeMemoryConfig.setMaxQueryMemoryPerNode(io.airlift.units.DataSize)]
4) Configuration property 'memory.heap-headroom-per-node' was not used
at io.airlift.bootstrap.Bootstrap.lambda$initialize$2(Bootstrap.java:234)
5) Configuration property 'query.max-memory' was not used
at io.airlift.bootstrap.Bootstrap.lambda$initialize$2(Bootstrap.java:234)
6) Configuration property 'query.max-memory-per-node' was not used
at io.airlift.bootstrap.Bootstrap.lambda$initialize$2(Bootstrap.java:234)
7) Configuration property 'query.max-total-memory-per-node' was not used
at io.airlift.bootstrap.Bootstrap.lambda$initialize$2(Bootstrap.java:234)
7 errors
at com.google.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:466)
at com.google.inject.internal.InternalInjectorCreator.initializeStatically(InternalInjectorCreator.java:155)
at com.google.inject.internal.InternalInjectorCreator.build(InternalInjectorCreator.java:107)
at com.google.inject.Guice.createInjector(Guice.java:96)
at io.airlift.bootstrap.Bootstrap.initialize(Bootstrap.java:241)
at com.facebook.presto.server.PrestoServer.run(PrestoServer.java:114)
at com.facebook.presto.server.PrestoServer.main(PrestoServer.java:66)
My config.properties file
coordinator=true
node-scheduler.include-coordinator=false
discovery.uri=X.X.X.X:YYYY
http-server.threads.max=500
discovery-server.enabled=true
sink.max-buffer-size=1GB
query.max-memory=150G
query.max-memory-per-node=20G
query.max-history=40
query.min-expire-age=30m
http-server.http.port=8889
http-server.log.path=/var/log/presto/http-request.log
http-server.log.max-size=67108864B
http-server.log.max-history=5
log.max-size=268435456B
log.max-history=5
query.initial-hash-partitions=15
memory.heap-headroom-per-node=10G
query.max-total-memory-per-node=30G

Setup fails because:
You need to use "GB" (not "G") as the unit when setting data-size config properties.
Your version (0.194) doesn't support some of the properties you're setting (memory.heap-headroom-per-node and query.max-total-memory-per-node).
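A corrected presto-config classification might look like the sketch below: the same memory values re-expressed with full "GB" units, with the two properties that 0.194 does not recognize removed (keep them only if you move to an EMR release that ships a newer Presto):
{
    "Classification": "presto-config",
    "Properties": {
        "query.max-memory": "150GB",
        "query.max-memory-per-node": "20GB",
        "query.initial-hash-partitions": "15"
    },
    "Configurations": []
}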

Related

Filepulse Connector error with S3 provider (Source Connector)

I am trying to poll CSV files from S3 buckets using the FilePulse source connector. When the task starts I get the following error. What additional libraries do I need to add to make this work from an S3 bucket? Config file below.
Where did I go wrong?
Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:208)
java.nio.file.FileSystemNotFoundException: Provider "s3" not installed
at java.base/java.nio.file.Path.of(Path.java:212)
at java.base/java.nio.file.Paths.get(Paths.java:98)
at io.streamthoughts.kafka.connect.filepulse.fs.reader.LocalFileStorage.exists(LocalFileStorage.java:62)
Config file:
{
    "name": "FilePulseConnector_3",
    "config": {
        "connector.class": "io.streamthoughts.kafka.connect.filepulse.source.FilePulseSourceConnector",
        "filters": "ParseCSVLine, Drop",
        "filters.Drop.if": "{{ equals($value.artist, 'U2') }}",
        "filters.Drop.invert": "true",
        "filters.Drop.type": "io.streamthoughts.kafka.connect.filepulse.filter.DropFilter",
        "filters.ParseCSVLine.extract.column.name": "headers",
        "filters.ParseCSVLine.trim.column": "true",
        "filters.ParseCSVLine.seperator": ";",
        "filters.ParseCSVLine.type": "io.streamthoughts.kafka.connect.filepulse.filter.DelimitedRowFilter",
        "fs.cleanup.policy.class": "io.streamthoughts.kafka.connect.filepulse.fs.clean.LogCleanupPolicy",
        "fs.cleanup.policy.triggered.on": "COMMITTED",
        "fs.listing.class": "io.streamthoughts.kafka.connect.filepulse.fs.AmazonS3FileSystemListing",
        "fs.listing.filters": "io.streamthoughts.kafka.connect.filepulse.fs.filter.RegexFileListFilter",
        "fs.listing.interval.ms": "10000",
        "file.filter.regex.pattern": ".*\\.csv$",
        "offset.policy.class": "io.streamthoughts.kafka.connect.filepulse.offset.DefaultSourceOffsetPolicy",
        "offset.attributes.string": "name",
        "skip.headers": "1",
        "topic": "connect-file-pulse-quickstart-csv",
        "tasks.reader.class": "io.streamthoughts.kafka.connect.filepulse.fs.reader.LocalRowFileInputReader",
        "tasks.file.status.storage.class": "io.streamthoughts.kafka.connect.filepulse.state.KafkaFileObjectStateBackingStore",
        "tasks.file.status.storage.bootstrap.servers": "172.27.157.66:9092",
        "tasks.file.status.storage.topic": "connect-file-pulse-status",
        "tasks.file.status.storage.topic.partitions": 10,
        "tasks.file.status.storage.topic.replication.factor": 1,
        "tasks.max": 1,
        "aws.access.key.id": "<<>>",
        "aws.secret.access.key": "<<>>",
        "aws.s3.bucket.name": "mytestbucketamtrak",
        "aws.s3.region": "us-east-1"
    }
}
What should I put in the libraries to make this work? Note: the Lenses connector sources from the S3 bucket without issues, so it's not a credentials issue.
As mentioned in the comments by @OneCricketeer, the suggestion to follow github.com/streamthoughts/kafka-connect-file-pulse/issues/382 pointed to the root cause.
Modifying the config file to use this property sourced the file:
"tasks.reader.class": "io.streamthoughts.kafka.connect.filepulse.fs.reader.AmazonS3RowFileInputReader"

Use variables in Azure Stream Analytics properties

I want to reduce the number of overrides during the deployment of my ASA by using environment variables in my properties.
Expectations
Having variables defined in the asaproj.json file, the JobConfig.json file, or a .env file.
{
    ...
    "variables": [
        "environment": "dev"
    ]
}
Call those variables in a properties file, such as an SQL Reference input properties file:
{
    "Name": "sql-query",
    "Type": "Reference data",
    "DataSourceType": "SQL Database",
    "SqlReferenceProperties": {
        "Database": "${environment}-sql-bdd",
        "Server": "${environment}-sql",
        "User": "user",
        "Password": null,
        "FullSnapshotPath": "sql-query.snapshot.sql",
        "RefreshType": "Execute periodically",
        "RefreshRate": "06:00:00",
        "DeltaSnapshotPath": null
    },
    "DataSourceCredentialDomain": null,
    "ScriptType": "Input"
}
Attempt
I could use a PowerShell script to override values from the ARM variables file generated by the npm package azure-streamanalytics-cicd, but it's not clean at all.
Problem
I can't find any resources about environment variables in Azure Stream Analytics online. Does such a thing exist? If so, can you point me to some documentation?

Fargate environment variable redis.yaml

I have a microservice and I need to pass in a file redis.yaml to configure ElastiCache for Redis.
Assume I have a file called redis.yaml with contents:
clusterServersConfig:
  idleConnectionTimeout: 10000
  pingTimeout: 1000
  connectTimeout: 10000
  timeout: 60000
  retryAttempts: 3
  retryInterval: 60000
And in my application.properties I use:
redis.config.location=file:/opt/usr/conf/redis.yaml
In Kubernetes, I can just create a secret with --from-file redis.yaml and the application runs properly.
I do not know how to do the same with AWS Fargate. I believe it could be done with AWS SSM but any help/steps on how to do it would be appreciated.
For externalized configuration, Fargate supports environment variables, which can be passed in the task definition:
"environment": [
{ "name": "env_name1", "value": "value1" },
{ "name": "env_name2", "value": "value2" }
]
If it's sensitive information, store it in the AWS SSM Parameter Store (you can use KMS) and reference the parameter in the task definition:
{
    "containerDefinitions": [{
        "secrets": [{
            "name": "environment_variable_name",
            "valueFrom": "arn:aws:ssm:region:aws_account_id:parameter/parameter_name"
        }]
    }]
}
In your case, you can convert your YAML to JSON, store it in the Parameter Store, and reference it in the task definition.
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/specifying-sensitive-data.html
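Applied to this case, a sketch of the container definition might look like the following. The parameter name /config/redis-yaml, the account ID, and the REDIS_CONFIG variable name are made-up placeholders; your application would then need to read its Redis settings from that environment variable (or write it back out to /opt/usr/conf/redis.yaml on startup) instead of expecting the file to already be present:
{
    "containerDefinitions": [{
        "name": "my-microservice",
        "secrets": [{
            "name": "REDIS_CONFIG",
            "valueFrom": "arn:aws:ssm:us-east-1:123456789012:parameter/config/redis-yaml"
        }]
    }]
}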

How to configure Sensu with RMQ and InfluxDB

I am trying to get started with a monitoring server solution. I have the Sensu clients, RabbitMQ, and Uchiwa configured. I first tried Graphite, but there were so many parts to configure that I switched to InfluxDB instead, and now I am stuck connecting Sensu to InfluxDB.
Is there a part missing in the below configuration?
Client [Sensu] > RabbitMQ <> Sensu Server <> InfluxDB <> Grafana
Any suggestions?
cat influx.json
{
    "influxdb": {
        "hosts": ["192.168.1.1"],
        "host": "192.168.1.1",
        "port": "8086",
        "database": "sensumetrics",
        "time_precision": "s",
        "use_ssl": false,
        "verify_ssl": false,
        "initial_delay": 0.01,
        "max_delay": 30,
        "open_timeout": 5,
        "read_timeout": 300,
        "retry": null,
        "prefix": "",
        "denormalize": true,
        "status": true
    }
}
cat handler.json
{
    "handlers": {
        "influxdb": {
            "type": "pipe",
            "command": "/opt/sensu/embedded/bin/metrics-influxdb.rb"
        }
    }
}
checks1:
{
    "checks": {
        "check_memory_linux": {
            "handlers": ["influxdb", "default"],
            "command": "/opt/sensu/embedded/bin/check-memory-percent.rb -w 90 -c 95",
            "interval": 60,
            "occurrences": 5,
            "subscribers": ["TEST"]
        }
    }
}
checks2:
{
    "checks": {
        "check_cpu_linux-elkctrl-pipe": {
            "type": "metric",
            "command": "/opt/sensu/embedded/bin/check-cpu.rb -w 80 -c 90",
            "subscribers": ["TEST"],
            "interval": 10,
            "handlers": ["debug", "influxdb"]
        }
    }
}
To use InfluxDB to persist your data, you must have:
The InfluxDB plugin installed (installation and usage instructions are in the plugin's documentation)
Definitions for the plugin (an influxdb.json containing at least the host, port, user, password, and database to be used by Sensu)
The definition, like other config files, must be in /etc/sensu/conf.d/
Handler configuration set properly (also in conf.d)
A mutator for InfluxDB (extensions)
Your checks must send results to the handler, so their definition must contain:
"handlers": [
"influxdb"
]
Or whatever name you gave your handler.
If the influxdb config you provided above is the full extent of your configuration, it seems to be missing the username/password attributes required by the InfluxDB handler configuration. If they're present but just not included in the post, no big deal. However, I'd recommend checking your Sensu logs:
grep -i influxdb /var/log/sensu/sensu-server.log
to see whether the check results are getting sent to your InfluxDB instance. If they are, you should see an error there that points a bit more directly at what's going on.
You can also check your influxdb logs to see if they're getting a post from your Sensu server:
journalctl -u influxdb.service -f
But yeah, if the username/password is missing from the configuration, that's the first place I'd start.
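For illustration only, an influxdb.json under /etc/sensu/conf.d/ with credentials added might look like the sketch below. The "sensu_user" and "sensu_pass" values are placeholders, and the exact credential key names (user vs. username) vary between versions of the metrics-influxdb.rb plugin, so check the README of the version you installed:
{
    "influxdb": {
        "host": "192.168.1.1",
        "port": "8086",
        "database": "sensumetrics",
        "user": "sensu_user",
        "password": "sensu_pass"
    }
}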

Where is the ModeShape binary store?

First I configured my ModeShape configuration file like this:
"storage" : {
"persistence" : {
"type" : "db",
"connectionUrl": "${database.url}",
"driver": "${database.driver}",
"username": "${database.user}",
"password": "${database.password}",
"tableName": "GOVERNANCE_MODESHAPE",
"poolSize" : 5,
"createOnStart" : true,
"dropOnExit" : false
}
}
After I create a node, set a property on it, and save it in my local environment, I can still find the node and the property locally, but it can't be found in my colleague's local environment.
Then I changed the configuration like this:
"storage" : {
"persistence" : {
"type" : "db",
"connectionUrl": "${database.url}",
"driver": "${database.driver}",
"username": "${database.user}",
"password": "${database.password}",
"tableName": "GOVERNANCE_MODESHAPE",
"poolSize" : 5,
"createOnStart" : true,
"dropOnExit" : false
},
"binaryStorage" : {
"type" : "file",
"directory": "/var/thinkbig/modeshape",
"minimumBinarySizeInBytes" : 5000000
}
}
Now I can find the node and property created in my local environment, and my colleague can also find them in his local environment. But I can't find the directory /var/thinkbig/modeshape.
So I want to know where the ModeShape binary store actually lives, and why, after I add the "binaryStorage" config to the configuration file, everybody can find the node and property. Thanks in advance!
Per the doc, minimumBinarySizeInBytes is "the minimum size (in bytes) above which binary values will be stored in the store. Any binary value lower in size will be stored together with the other node information."
This means that binaries smaller than the specified size are stored in the database, rather than the file system. You could change this to a value of 1 byte if you want to ensure that all binaries get stored in the file system.
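For example, a binaryStorage block along these lines would push essentially every binary value out to the file-system store (1 byte is just the threshold suggested above):
"binaryStorage": {
    "type": "file",
    "directory": "/var/thinkbig/modeshape",
    "minimumBinarySizeInBytes": 1
}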