I am trying to write to an S3 sink.
private static StreamingFileSink<String> createS3SinkFromStaticConfig(
        final Map<String, Properties> applicationProperties
) {
    Properties sinkProperties = applicationProperties.get(SINK_PROPERTIES);
    String s3SinkPath = sinkProperties.getProperty(SINK_S3_PATH_KEY);
    return StreamingFileSink
            .forRowFormat(
                    new Path(s3SinkPath),
                    new SimpleStringEncoder<String>(StandardCharsets.UTF_8.toString())
            )
            .build();
}
The following code works and I can see the results in S3:
input.map(value -> { // Parse the JSON
        JsonNode jsonNode = jsonParser.readValue(value, JsonNode.class);
        return new Tuple2<>(jsonNode.get("ticker").asText(), jsonNode.get("price").asDouble());
    }).returns(Types.TUPLE(Types.STRING, Types.DOUBLE))
    .keyBy(0) // Logically partition the stream per stock symbol
    .timeWindow(Time.seconds(10), Time.seconds(5)) // Sliding window definition
    .min(1) // Calculate minimum price per stock over the window
    .setParallelism(3) // Set parallelism for the min operator
    .map(value -> value.f0 + ": ----- " + value.f1.toString() + "\n")
    .addSink(createS3SinkFromStaticConfig(applicationProperties));
But the following doesn't write anything to S3:
KeyedStream<EnrichedMetric, EnrichedMetricKey> input = env.addSource(new EnrichedMetricSource())
        .assignTimestampsAndWatermarks(
                WatermarkStrategy.<EnrichedMetric>forMonotonousTimestamps()
                        .withTimestampAssigner(((event, l) -> event.getEventTime()))
        ).keyBy(new EnrichedMetricKeySelector());
DataStream<String> statsStream = input
        .window(TumblingEventTimeWindows.of(Time.seconds(5)))
        .process(new PValueStatisticsWindowFunction());
statsStream.addSink(createS3SinkFromStaticConfig(applicationProperties));
PValueStatisticsWindowFunction is a ProcessWindowFunction, as shown below:
@Override
public void process(EnrichedMetricKey enrichedMetricKey,
                    Context context,
                    Iterable<EnrichedMetric> in,
                    Collector<String> out) throws Exception {
    int count = 0;
    for (EnrichedMetric m : in) {
        count++;
    }
    out.collect("Count: " + count);
}
When I run the Flink app locally, statsStream.print() prints the results to log/flink-*-taskexecutor-*.out.
In the cluster, I can see that checkpointing is enabled, and the Flink dashboard shows the checkpoint history. I also made sure the S3 path is in the format s3a://<bucket>.
Not sure what I am missing here.
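One thing that may be worth ruling out: the failing pipeline uses event-time windows, and an event-time window only emits once the watermark reaches the last timestamp the window covers. The plain-Java sketch below illustrates that firing rule for a 5-second tumbling window. It has no Flink dependency, the class and method names are hypothetical, and it assumes non-negative timestamps in epoch milliseconds with no window offset. It illustrates the semantics only and is not a diagnosis of this job.

```java
public class TumblingWindowSketch {

    /** Start of the tumbling window of the given size that contains the timestamp.
     *  Assumption: non-negative timestamps and no window offset. */
    public static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    /** An event-time window [start, start + size) can fire once the watermark
     *  has reached the window's last covered timestamp (start + size - 1). */
    public static boolean canFire(long windowStartMs, long sizeMs, long watermarkMs) {
        long maxTimestamp = windowStartMs + sizeMs - 1;
        return watermarkMs >= maxTimestamp;
    }

    public static void main(String[] args) {
        long size = 5_000L;                       // 5-second window, as in the question
        long start = windowStart(12_345L, size);  // window containing an event at t = 12345 ms
        System.out.println("window start = " + start);
        System.out.println("fires at watermark 14000? " + canFire(start, size, 14_000L));
        System.out.println("fires at watermark 14999? " + canFire(start, size, 14_999L));
    }
}
```

If the watermark never advances far enough in the cluster (for example because the timestamps are not in milliseconds, or the source goes quiet), canFire stays false and nothing reaches the sink, so comparing the operator watermarks between the local and cluster runs may help narrow things down.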
A follow up to this:
one SCDF source, 2 processors but only 1 processes each item
The 2 processors (del-1 and del-2) in the picture are receiving the same data within milliseconds of each other. I'm trying to rig this so del-2 never receives the same thing as del-1 and vice versa. So obviously I've got something configured incorrectly but I'm not sure where.
My processor has the following application.properties
spring.application.name=${vcap.application.name:sample-processor}
info.app.name=@project.artifactId@
info.app.description=@project.description@
info.app.version=@project.version@
management.endpoints.web.exposure.include=health,info,bindings
spring.autoconfigure.exclude=org.springframework.boot.autoconfigure.security.servlet.SecurityAutoConfiguration
spring.cloud.stream.bindings.input.group=input
Is "spring.cloud.stream.bindings.input.group" specified correctly?
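For reference, the consumer-group model behind spring.cloud.stream.bindings.input.group can be sketched in plain Java: every group bound to a destination receives its own copy of each message, while consumers inside one group compete for it. The simulation below uses no Spring classes. The class, the round-robin choice within a group, and the consumer names are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ConsumerGroupSketch {
    // group name -> consumers bound to the destination under that group
    private final Map<String, List<String>> groups = new LinkedHashMap<>();
    // group name -> number of messages delivered so far (drives the round robin)
    private final Map<String, Integer> delivered = new LinkedHashMap<>();

    /** Binds a (hypothetical) consumer to the destination under the given group. */
    public void bind(String group, String consumer) {
        groups.computeIfAbsent(group, g -> new ArrayList<>()).add(consumer);
        delivered.putIfAbsent(group, 0);
    }

    /** Publishes one message and returns the consumers that receive it:
     *  exactly one consumer per group. */
    public List<String> publish() {
        List<String> receivers = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : groups.entrySet()) {
            List<String> members = entry.getValue();
            int count = delivered.get(entry.getKey());
            receivers.add(members.get(count % members.size()));
            delivered.put(entry.getKey(), count + 1);
        }
        return receivers;
    }

    public static void main(String[] args) {
        // Two consumers in two different groups: both receive every message.
        ConsumerGroupSketch separate = new ConsumerGroupSketch();
        separate.bind("group-1", "processor-a");
        separate.bind("group-2", "processor-b");
        System.out.println("separate groups: " + separate.publish());

        // Two consumers in one shared group: each message goes to only one of them.
        ConsumerGroupSketch shared = new ConsumerGroupSketch();
        shared.bind("shared", "processor-a");
        shared.bind("shared", "processor-b");
        System.out.println("shared group, message 1: " + shared.publish());
        System.out.println("shared group, message 2: " + shared.publish());
    }
}
```

Under this model, del-1 and del-2 would each see every message if they end up bound under different group names, and would split the messages only if both are bound under the same group.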
Here's the processor code:
@Transformer(inputChannel = Processor.INPUT, outputChannel = Processor.OUTPUT)
public Object transform(String inputStr) throws InterruptedException{
    ApplicationLog log = new ApplicationLog(this, "timerMessageSource");
    String message = " I AM [" + inputStr + "] AND I HAVE BEEN PROCESSED!!!!!!!";
    log.info("SampleProcessor.transform() incoming inputStr="+inputStr);
    return message;
}
Is the @Transformer annotation the proper way to link this bit of code with "spring.cloud.stream.bindings.input.group" from application.properties? Are there any other annotations necessary?
Here's my source:
private String format = "EEEEE dd MMMMM yyyy HH:mm:ss.SSSZ";
@Bean
@InboundChannelAdapter(value = Source.OUTPUT, poller = @Poller(fixedDelay = "1000", maxMessagesPerPoll = "1"))
public MessageSource<String> timerMessageSource() {
    ApplicationLog log = new ApplicationLog(this, "timerMessageSource");
    String message = new SimpleDateFormat(format).format(new Date());
    log.info("SampleSource.timeMessageSource() message=["+message+"]");
    return () -> new GenericMessage<>(new SimpleDateFormat(format).format(new Date()));
}
I'm confused about the "value = Source.OUTPUT". Does this mean my processor needs to be named differently?
Is the inclusion of @Poller causing me a problem somehow?
This is how I define the 2 processor streams (del-1 and del-2) in SCDF shell:
stream create del-1 --definition ":split > processor-that-does-everything-sleeps5 --spring.cloud.stream.bindings.applicationMetrics.destination=metrics > :merge"
stream create del-2 --definition ":split > processor-that-does-everything-sleeps5 --spring.cloud.stream.bindings.applicationMetrics.destination=metrics > :merge"
Do I need to do anything differently there?
All of this is running in Docker/K8s.
RabbitMQ is given by bitnami/rabbitmq:3.7.2-r1 and is configured with the following props:
RABBITMQ_USERNAME: user
RABBITMQ_PASSWORD: <redacted>
RABBITMQ_ERL_COOKIE: <redacted>
RABBITMQ_NODE_PORT_NUMBER: 5672
RABBITMQ_NODE_TYPE: stats
RABBITMQ_NODE_NAME: rabbit@localhost
RABBITMQ_CLUSTER_NODE_NAME:
RABBITMQ_DEFAULT_VHOST: /
RABBITMQ_MANAGER_PORT_NUMBER: 15672
RABBITMQ_DISK_FREE_LIMIT: "6GiB"
Are any other environment variables necessary?
I've been experimenting with Ignite Near Caches. In doing so I'm configuring a client node with two server nodes in the cluster. I instantiated a near cache and would like to see the associated metrics on the cache hits/misses. Functionally everything works fine, but I can't figure out where the near cache metrics are.
I've tried to extract the cache metrics with the following calls:
NearCacheConfiguration<Integer, Integer> nearCfg =
        new NearCacheConfiguration<>();
nearCfg.setNearEvictionPolicyFactory(new LruEvictionPolicyFactory<>(100));
nearCfg.setNearStartSize(50);
IgniteCache<Integer, Integer> cache = ignite.getOrCreateCache(
        new CacheConfiguration<Integer, Integer>("myCache"), nearCfg);
// run some cache puts and gets
for (int i = 0; i < 10000; i++) { cache.put(i, i); }
for (int i = 0; i < 10000; i++) { cache.get(i); }
// then try to retrieve metrics
System.out.println(cache.localMetrics());
System.out.println(cache.metrics());
Output:
CacheMetricsSnapshot [reads=0, puts=0, hits=0, misses=0, txCommits=0, txRollbacks=0, evicts=0, removes=0, putAvgTimeNanos=0.0, getAvgTimeNanos=0.0, rmvAvgTimeNanos=0.0, commitAvgTimeNanos=0.0, rollbackAvgTimeNanos=0.0, cacheName=myCache, offHeapGets=0, offHeapPuts=0, offHeapRemoves=0, offHeapEvicts=0, offHeapHits=0, offHeapMisses=0, offHeapEntriesCnt=0, heapEntriesCnt=0, offHeapPrimaryEntriesCnt=0, offHeapBackupEntriesCnt=0, offHeapAllocatedSize=0, size=0, keySize=0, isEmpty=true, dhtEvictQueueCurrSize=0, txThreadMapSize=0, txXidMapSize=0, txCommitQueueSize=0, txPrepareQueueSize=0, txStartVerCountsSize=0, txCommittedVersionsSize=0, txRolledbackVersionsSize=0, txDhtThreadMapSize=0, txDhtXidMapSize=0, txDhtCommitQueueSize=0, txDhtPrepareQueueSize=0, txDhtStartVerCountsSize=0, txDhtCommittedVersionsSize=0, txDhtRolledbackVersionsSize=0, isWriteBehindEnabled=false, writeBehindFlushSize=-1, writeBehindFlushThreadCnt=-1, writeBehindFlushFreq=-1, writeBehindStoreBatchSize=-1, writeBehindTotalCriticalOverflowCnt=0, writeBehindCriticalOverflowCnt=0, writeBehindErrorRetryCnt=0, writeBehindBufSize=-1, totalPartitionsCnt=0, rebalancingPartitionsCnt=0, keysToRebalanceLeft=0, rebalancingKeysRate=0, rebalancingBytesRate=0, rebalanceStartTime=0, rebalanceFinishTime=0, keyType=java.lang.Object, valType=java.lang.Object, isStoreByVal=true, isStatisticsEnabled=false, isManagementEnabled=false, isReadThrough=false, isWriteThrough=false, isValidForReading=true, isValidForWriting=true]
CacheMetricsSnapshot [reads=0, puts=0, hits=0, misses=0, txCommits=0, txRollbacks=0, evicts=0, removes=0, putAvgTimeNanos=0.0, getAvgTimeNanos=0.0, rmvAvgTimeNanos=0.0, commitAvgTimeNanos=0.0, rollbackAvgTimeNanos=0.0, cacheName=myCache, offHeapGets=0, offHeapPuts=0, offHeapRemoves=0, offHeapEvicts=0, offHeapHits=0, offHeapMisses=0, offHeapEntriesCnt=0, heapEntriesCnt=100, offHeapPrimaryEntriesCnt=0, offHeapBackupEntriesCnt=0, offHeapAllocatedSize=0, size=0, keySize=0, isEmpty=true, dhtEvictQueueCurrSize=-1, txThreadMapSize=0, txXidMapSize=0, txCommitQueueSize=0, txPrepareQueueSize=0, txStartVerCountsSize=0, txCommittedVersionsSize=0, txRolledbackVersionsSize=0, txDhtThreadMapSize=0, txDhtXidMapSize=-1, txDhtCommitQueueSize=0, txDhtPrepareQueueSize=0, txDhtStartVerCountsSize=0, txDhtCommittedVersionsSize=-1, txDhtRolledbackVersionsSize=-1, isWriteBehindEnabled=false, writeBehindFlushSize=-1, writeBehindFlushThreadCnt=-1, writeBehindFlushFreq=-1, writeBehindStoreBatchSize=-1, writeBehindTotalCriticalOverflowCnt=-1, writeBehindCriticalOverflowCnt=-1, writeBehindErrorRetryCnt=-1, writeBehindBufSize=-1, totalPartitionsCnt=0, rebalancingPartitionsCnt=0, keysToRebalanceLeft=0, rebalancingKeysRate=0, rebalancingBytesRate=0, rebalanceStartTime=-1, rebalanceFinishTime=-1, keyType=java.lang.Object, valType=java.lang.Object, isStoreByVal=true, isStatisticsEnabled=false, isManagementEnabled=false, isReadThrough=false, isWriteThrough=false, isValidForReading=true, isValidForWriting=true]
It looks like there are no meaningful metrics. I thought statistics might have to be enabled on the NearCacheConfiguration, as is the case with CacheConfiguration, but it has no such option.
Any idea?
I figured it out. I had not enabled statistics on the CacheConfiguration passed in when creating the IgniteCache.
The code snippet is:
CacheConfiguration<Integer, Integer> cacheConfiguration = new CacheConfiguration<Integer, Integer>("myCache");
cacheConfiguration.setStatisticsEnabled(true);
IgniteCache<Integer, Integer> cache = ignite.getOrCreateCache(cacheConfiguration, nearCfg);
After all the cache operations are run, I now see the statistics.
After recent investigation and a Stack Overflow question, I realise that cluster sharding is a better option than a cluster consistent-hash router. But I am having trouble getting a two-process cluster going.
One process is the Seed and the other is the Client. The Seed node seems to continuously log dead letter messages (see the end of this question).
The Seed HOCON follows:
akka {
  loglevel = "INFO"
  actor {
    provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"
    serializers {
      wire = "Akka.Serialization.WireSerializer, Akka.Serialization.Wire"
    }
    serialization-bindings {
      "System.Object" = wire
    }
  }
  remote {
    dot-netty.tcp {
      hostname = "127.0.0.1"
      port = 5000
    }
  }
  persistence {
    journal {
      plugin = "akka.persistence.journal.sql-server"
      sql-server {
        class = "Akka.Persistence.SqlServer.Journal.SqlServerJournal, Akka.Persistence.SqlServer"
        schema-name = dbo
        auto-initialize = on
        connection-string = "Data Source=localhost;Integrated Security=True;MultipleActiveResultSets=True;Initial Catalog=ClusterExperiment01"
        plugin-dispatcher = "akka.actor.default-dispatcher"
        connection-timeout = 30s
        table-name = EventJournal
        timestamp-provider = "Akka.Persistence.Sql.Common.Journal.DefaultTimestampProvider, Akka.Persistence.Sql.Common"
        metadata-table-name = Metadata
      }
    }
    sharding {
      connection-string = "Data Source=localhost;Integrated Security=True;MultipleActiveResultSets=True;Initial Catalog=ClusterExperiment01"
      auto-initialize = on
      plugin-dispatcher = "akka.actor.default-dispatcher"
      class = "Akka.Persistence.SqlServer.Journal.SqlServerJournal, Akka.Persistence.SqlServer"
      connection-timeout = 30s
      schema-name = dbo
      table-name = ShardingJournal
      timestamp-provider = "Akka.Persistence.Sql.Common.Journal.DefaultTimestampProvider, Akka.Persistence.Sql.Common"
      metadata-table-name = ShardingMetadata
    }
  }
  snapshot-store {
    sharding {
      class = "Akka.Persistence.SqlServer.Snapshot.SqlServerSnapshotStore, Akka.Persistence.SqlServer"
      plugin-dispatcher = "akka.actor.default-dispatcher"
      connection-string = "Data Source=localhost;Integrated Security=True;MultipleActiveResultSets=True;Initial Catalog=ClusterExperiment01"
      connection-timeout = 30s
      schema-name = dbo
      table-name = ShardingSnapshotStore
      auto-initialize = on
    }
  }
  cluster {
    seed-nodes = ["akka.tcp://my-cluster-system@127.0.0.1:5000"]
    roles = ["Seed"]
    sharding {
      journal-plugin-id = "akka.persistence.sharding"
      snapshot-plugin-id = "akka.snapshot-store.sharding"
    }
  }
}
I have a method that essentially turns the above into a Config like so:
var config = NodeConfig.Create(/* HOCON above */).WithFallback(ClusterSingletonManager.DefaultConfig());
Without the "WithFallback" I get a null reference exception out of the config generation.
And then generates the system like so:
var system = ActorSystem.Create("my-cluster-system", config);
The client creates its system in the same manner and the HOCON is almost identical aside from:
{
  remote {
    dot-netty.tcp {
      hostname = "127.0.0.1"
      port = 5001
    }
  }
  cluster {
    seed-nodes = ["akka.tcp://my-cluster-system@127.0.0.1:5000"]
    roles = ["Client"]
    role.["Seed"].min-nr-of-members = 1
    sharding {
      journal-plugin-id = "akka.persistence.sharding"
      snapshot-plugin-id = "akka.snapshot-store.sharding"
    }
  }
}
The Seed node creates the sharding like so:
ClusterSharding.Get(system).Start(
    typeName: "company-router",
    entityProps: Props.Create(() => new CompanyDeliveryActor()),
    settings: ClusterShardingSettings.Create(system),
    messageExtractor: new RouteExtractor(100)
);
And the client creates a sharding proxy like so:
ClusterSharding.Get(system).StartProxy(
    typeName: "company-router",
    role: "Seed",
    messageExtractor: new RouteExtractor(100));
The RouteExtractor is:
public class RouteExtractor : HashCodeMessageExtractor
{
    public RouteExtractor(int maxNumberOfShards) : base(maxNumberOfShards)
    {
    }
    public override string EntityId(object message) => (message as IHasRouting)?.Company?.VolumeId.ToString();
    public override object EntityMessage(object message) => message;
}
In this scenario the VolumeId is always the same (just for experiment sake).
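Because the shard is derived from a hash of the entity id, a constant VolumeId means every message lands on the same shard and the same entity. The sketch below shows that idea in plain Java. It assumes a simple hashCode-modulo mapping, which is not necessarily the exact hash function HashCodeMessageExtractor uses, and the class and method names are hypothetical.

```java
public class ShardIdSketch {

    /** Maps an entity id to one of maxNumberOfShards shard ids.
     *  Assumption: String.hashCode stands in for whatever hash the real extractor uses. */
    public static String shardId(String entityId, int maxNumberOfShards) {
        // floorMod keeps the result in [0, maxNumberOfShards) even for negative hash codes
        return Integer.toString(Math.floorMod(entityId.hashCode(), maxNumberOfShards));
    }

    public static void main(String[] args) {
        // The same entity id always resolves to the same shard...
        System.out.println(shardId("42", 100));
        System.out.println(shardId("42", 100));
        // ...while a different id may resolve to a different one.
        System.out.println(shardId("43", 100));
    }
}
```

With a single fixed id, only one shard (and therefore one hosting node) ever receives traffic, which is fine for checking connectivity but says nothing about how entities would be distributed.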
Both processes come to life but the Seed keeps throwing this error to the log:
[INFO][7/05/2017 9:00:58 AM][Thread 0003][akka://my-cluster-system/user/sharding/company-routerCoordinator/singleton/coordinator] Message Register from akka.tcp://my-cluster-system@127.0.0.1:5000/user/sharding/company-router to akka://my-cluster-system/user/sharding/company-routerCoordinator/singleton/coordinator was not delivered. 4 dead letters encountered.
Ps. I am not using Lighthouse.
From a quick look, you're starting a cluster sharding proxy on your client node and telling it that the sharded nodes are those with the Seed role. This doesn't match the cluster sharding definition on the seed node, where you haven't specified any role.
Since there is no role to limit it, cluster sharding on the seed node will treat all nodes in the cluster as perfectly capable of hosting sharded actors, including the client node, which doesn't have (non-proxy) cluster sharding instantiated on it.
This may not be the only issue, but you could either host cluster sharding on all of your nodes, or use ClusterShardingSettings.Create(system).WithRole("Seed") to limit your shards to a specific subset of nodes (those with the Seed role) in the cluster.
Thanks Horusiath, that's fixed it:
return sharding.Start(
    typeName: "company-router",
    entityProps: Props.Create(() => new CompanyDeliveryActor()),
    settings: ClusterShardingSettings.Create(system).WithRole("Seed"),
    messageExtractor: new RouteExtractor(100)
);
The clustered shard is now communicating between the 2 processes. Thanks very much for that bit.
The project I am working on uses Git in an unusual way. Essentially, it writes and pushes one commit at a time, so a single branch could end up with hundreds of thousands of commits. When testing, we found that after only about 500 commits the performance of the Git push started to degrade. After further investigation with a process monitor, we believe the degradation is due to a walk of the entire tree for the branch being pushed. Since we only ever push one new commit at a time, is there any way to optimize this?
Alternatively, is there a way to limit the commit history to something like 50 commits to reduce this overhead?
I am using LibGit2Sharp Version 0.20.1.0
Update 1
To test I wrote the following code:
void Main()
{
    string remotePath = @"E:\GIT Test\Remote";
    string localPath = @"E:\GIT Test\Local";
    string localFilePath = Path.Combine(localPath, "TestFile.txt");
    Repository.Init(remotePath, true);
    Repository.Clone(remotePath, localPath);
    Repository repo = new Repository(localPath);
    for(int i = 0; i < 2000; i++)
    {
        File.WriteAllText(localFilePath, RandomString((i % 2 + 1) * 10));
        repo.Stage(localFilePath);
        Commit commit = repo.Commit(
            string.Format("Commit number: {0}", i),
            new Signature("TestAuthor", "TestEmail@Test.com", System.DateTimeOffset.Now),
            new Signature("TestAuthor", "TestEmail@Test.com", System.DateTimeOffset.Now));
        Stopwatch pushWatch = Stopwatch.StartNew();
        Remote defaultRemote = repo.Network.Remotes["origin"];
        repo.Network.Push(defaultRemote, "refs/heads/master:refs/heads/master");
        pushWatch.Stop();
        Trace.WriteLine(string.Format("Push {0} took {1}ms", i, pushWatch.ElapsedMilliseconds));
    }
}
private const string Characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
private static readonly Random Random = new Random();
/// <summary>
/// Get a Random string of the specified length
/// </summary>
public static string RandomString(int size)
{
    char[] buffer = new char[size];
    for (int i = 0; i < size; i++)
    {
        buffer[i] = Characters[Random.Next(Characters.Length)];
    }
    return new string(buffer);
}
And ran the process monitor found here:
http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx
The time for each push was generally low, with large spikes that increased in both frequency and duration. Looking at the process monitor output, I believe these spikes lined up with long stretches in which objects in the .git\objects folder were being accessed. For some reason, occasionally on a pull there are large reads of the objects which, on closer inspection, appear to be a walk through the commits and objects.
The above flow is a condensed version of what we actually do in the project. In the actual flow we first create a new branch "Temp" from "Master", make a commit to "Temp", push "Temp", merge "Temp" into "Master", then push "Master". When we timed each part of that flow, we found the push was by far the longest-running operation, and its elapsed time increased as the commits piled up on "Master".
Update 2
I recently updated to use libgit2sharp version 0.20.1.0 and this problem still exists. Does anyone know why this occurs?
Update 3
We changed some of our code to create the temporary branch off the very first commit on the "Master" branch, hoping to reduce the commit tree traversal overhead, but found that it still occurs. Below is an example that should be easy to compile and run. It shows that the tree traversal happens when you create a new branch, regardless of the commit position. To see the tree traversal I used the process monitor tool above and command-line Git Bash to examine what each opened object was. Does anyone know why this happens? Is it expected behavior, or am I just doing something wrong? It appears to be the push that causes the issue.
void Main()
{
    string remotePath = @"E:\GIT Test\Remote";
    string localPath = @"E:\GIT Test\Local";
    string localFilePath = Path.Combine(localPath, "TestFile.txt");
    Repository.Init(remotePath, true);
    Repository.Clone(remotePath, localPath);
    // Setup Initial Commit
    string newBranch;
    using (Repository repo = new Repository(localPath))
    {
        CommitRandomFile(repo, 0, localFilePath, "master");
        newBranch = CreateNewBranch(repo, "master");
        repo.Checkout(newBranch);
    }
    // Commit 1000 times to the new branch
    for(int i = 1; i < 1001; i++)
    {
        using(Repository repo = new Repository(localPath))
        {
            CommitRandomFile(repo, i, localFilePath, newBranch);
        }
    }
    // Create a single new branch from the first commit ever
    // For some reason seems to walk the entire commit tree
    using(Repository repo = new Repository(localPath))
    {
        CreateNewBranch(repo, "master");
    }
}
private const string Characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
private static readonly Random Random = new Random();
/// <summary>
/// Generate and commit a random file to the specified branch
/// </summary>
public static void CommitRandomFile(Repository repo, int seed, string rootPath, string branch)
{
    File.WriteAllText(rootPath, RandomString((seed % 2 + 1) * 10));
    repo.Stage(rootPath);
    Commit commit = repo.Commit(
        string.Format("Commit: {0}", seed),
        new Signature("TestAuthor", "TestEmail@Test.com", System.DateTimeOffset.Now),
        new Signature("TestAuthor", "TestEmail@Test.com", System.DateTimeOffset.Now));
    Stopwatch pushWatch = Stopwatch.StartNew();
    repo.Network.Push(repo.Network.Remotes["origin"], "refs/heads/" + branch + ":refs/heads/" + branch);
    pushWatch.Stop();
    Trace.WriteLine(string.Format("Push {0} took {1}ms", seed, pushWatch.ElapsedMilliseconds));
}
/// <summary>
/// Create a new branch from the specified source
/// </summary>
public static string CreateNewBranch(Repository repo, string sourceBranch)
{
    Branch source = repo.Branches[sourceBranch];
    string newBranch = Guid.NewGuid().ToString();
    repo.Branches.Add(newBranch, source.Tip);
    Stopwatch pushNewBranchWatch = Stopwatch.StartNew();
    repo.Network.Push(repo.Network.Remotes["origin"], "refs/heads/" + newBranch + ":refs/heads/" + newBranch);
    pushNewBranchWatch.Stop();
    Trace.WriteLine(string.Format("Push of new branch {0} took {1}ms", newBranch, pushNewBranchWatch.ElapsedMilliseconds));
    return newBranch;
}
/// <summary>
/// Get a Random string of the specified length
/// </summary>
public static string RandomString(int size)
{
    char[] buffer = new char[size];
    for (int i = 0; i < size; i++)
    {
        buffer[i] = Characters[Random.Next(Characters.Length)];
    }
    return new string(buffer);
}