StreamingFileSink doesn't work sometimes when trying to write to S3 - amazon-s3

I am trying to write to a S3 sink.
private static StreamingFileSink<String> createS3SinkFromStaticConfig(
final Map<String, Properties> applicationProperties
) {
Properties sinkProperties = applicationProperties.get(SINK_PROPERTIES);
String s3SinkPath = sinkProperties.getProperty(SINK_S3_PATH_KEY);
return StreamingFileSink
.forRowFormat(
new Path(s3SinkPath),
new SimpleStringEncoder<String>(StandardCharsets.UTF_8.toString())
)
.build();
}
The following code works and I can see the results in S3
input.map(value -> { // Parse the JSON
JsonNode jsonNode = jsonParser.readValue(value, JsonNode.class);
return new Tuple2<>(jsonNode.get("ticker").asText(), jsonNode.get("price").asDouble());
}).returns(Types.TUPLE(Types.STRING, Types.DOUBLE))
.keyBy(0) // Logically partition the stream per stock symbol
.timeWindow(Time.seconds(10), Time.seconds(5)) // Sliding window definition
.min(1) // Calculate minimum price per stock over the window
.setParallelism(3) // Set parallelism for the min operator
.map(value -> value.f0 + ": ----- " + value.f1.toString() + "\n")
.addSink(createS3SinkFromStaticConfig(applicationProperties));
But the following doesn't write anything to S3.
KeyedStream<EnrichedMetric, EnrichedMetricKey> input = env.addSource(new EnrichedMetricSource())
.assignTimestampsAndWatermarks(
WatermarkStrategy.<EnrichedMetric>forMonotonousTimestamps()
.withTimestampAssigner(((event, l) -> event.getEventTime()))
).keyBy(new EnrichedMetricKeySelector());
DataStream<String> statsStream = input
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.process(new PValueStatisticsWindowFunction());
statsStream.addSink(createS3SinkFromStaticConfig(applicationProperties));
PValueStatisticsWindowFunction is a ProcessWindowFunction as below.
#Override
public void process(EnrichedMetricKey enrichedMetricKey,
Context context,
Iterable<EnrichedMetric> in,
Collector<String> out) throws Exception {
int count = 0;
for (EnrichedMetric m : in) {
count++;
}
out.collect("Count: " + count);
}
When I run the Flink app locally, statsStream.print() prints the results to log/flink-*-taskexecutor-*.out.
In the cluster, I can see checkpoint is enabled and the various checkpoints history from the Flink dashboard. I also made sure the S3 path is in the format s3a://<bucket>
Not sure what I am missing here.

Related

Cosmos DB - changeFeedWorker not working properly after Cosmos db reach 50 gb size and splitted db into 2 partitions

Just want to know if someone already get some issues with changeFeedWorker after that cosmos db reach 50Gb size limit for one partition and automatically split it into 2 partitions.
Since this split we have observed that after adding many new items to the db, changefeedworkers was not always triggered and our view that should get changes is partially updated.
In the lease container we can see now that we moved to 2 "LeaseToken" (1 and 2) was 0 before.
If someone has an idea where to look as it was working fine before the db split.
Here is how I start my worker:
async Task IChangeFeedWorker.StartAsync()
{
await _semaphoreSlim.WaitAsync();
string operationName = $"{nameof(CosmosChangeFeedWorker)}.{nameof(IChangeFeedWorker.StartAsync)}";
using var _ = BeginChangeFeedWorkerScope(_logger, operationName, _processorName, _instanceName);
try
{
if (Active)
{
return;
}
Container eventContainer = _cosmosClient.GetContainer(_databaseId, _eventContainerId);
Container leaseContainer = _cosmosClient.GetContainer(_databaseId, _leaseContainerId);
_changeFeedProcessor = eventContainer
.GetChangeFeedProcessorBuilder<CosmosEventChange>(_processorName, HandleChangesAsync)
.WithInstanceName(_instanceName)
.WithLeaseContainer(leaseContainer)
.WithStartTime(_startTimeOfTrackingChanges)
.Build();
await _changeFeedProcessor.StartAsync();
Active = true;
_logger.LogInformation(
"Change feed processor instance has been started.",
_processorName, _instanceName);
}
catch (Exception e)
{
_logger.LogError(e,
"Starting of change feed processor instance has failed.",
_processorName, _instanceName);
_changeFeedProcessor = null;
throw;
}
finally
{
_semaphoreSlim.Release();
}
}

Plc4x addressing system

I am discovering the Plc4x java implementation which seems to be of great interest in our field. But the youth of the project and the documentation makes us hesitate. I have been able to implement the basic hello world for reading out of our PLCs, but I was unable to write. I could not find how the addresses are handled and what the maskwrite, andMask and orMask fields mean.
Please can somebody explain to me the following example and detail how the addresses should be used?
#Test
void testWriteToPlc() {
// Establish a connection to the plc using the url provided as first argument
try( PlcConnection plcConnection = new PlcDriverManager().getConnection( "modbus:tcp://1.1.2.1" ) ){
// Create a new read request:
// - Give the single item requested the alias name "value"
var builder = plcConnection.writeRequestBuilder();
builder.addItem( "value-" + 1, "maskwrite:1[1]/2/3", 2 );
var writeRequest = builder.build();
LOGGER.info( "Synchronous request ..." );
var syncResponse = writeRequest.execute().get();
}catch(Exception e){
e.printStackTrace();
}
}
I have used PLC4x for writing using the modbus driver with success. Here is some sample code I am using:
public static void writePlc4x(ProtocolConnection connection, String registerName, byte[] writeRegister, int offset)
throws InterruptedException {
// modbus write works ok writing one record per request/item
int size = 1;
PlcWriteRequest.Builder writeBuilder = connection.writeRequestBuilder();
if (writeRegister.length == 2) {
writeBuilder.addItem(registerName, "register:" + offset + "[" + size + "]", writeRegister);
}
...
PlcWriteRequest request = writeBuilder.build();
request.execute().whenComplete((writeResponse, error) -> {
assertNotNull(writeResponse);
});
Thread.sleep((long) (sleepWait4Write * writeRegister.length * 1000));
}
In the case of modbus writing there is an issue regarding the return of the writer Future, but the write is done. In the modbus use case I don't need any mask stuff.

Plc4x cannot read more than 9 registers at once

I am trying to understand the address system in the plac4x java implementation. Below an example of the reading code of the plcs:
#Test
void testReadingFromPlc() {
// Establish a connection to the plc using the url provided as first argument
try( PlcConnection plcConnection = new PlcDriverManager().getConnection( "modbus:tcp://1.1.2.1" ) ){
// Create a new read request:
// - Give the single item requested the alias name "value"
var builder = plcConnection.readRequestBuilder();
builder.addItem( "value-" + 1, "register:1[9]" );
builder.addItem( "value-" + 2, "coil:1000[8]" );
var readRequest = builder.build();
LOGGER.info( "Synchronous request ..." );
var syncResponse = readRequest.execute().get();
// Simply iterating over the field names returned in the response.
var bytes = syncResponse.getAllByteArrays( "value-1" );
bytes.forEach( item -> System.out.println( TopicsMapping.byteArray2IntegerArray( item )[0] ) );
var booleans = syncResponse.getAllBooleans( "value-2" );
booleans.forEach( System.out::println );
}catch(Exception e){
e.printStackTrace();
}
}
Our PLCs manage 16 registers, but the regex of the addresses don't allow to have a quantity bigger than 9. Is it possible to change this?
Moreover, if I try to add an other field with the same purpose then no reading happen:
var builder = plcConnection.readRequestBuilder();
builder.addItem( "value-" + 0, "register:26[8]" );
builder.addItem( "value-" + 1, "register:34[8]" );
builder.addItem( "value-" + 2, "coil:1000[8]" );
var readRequest = builder.build();
Any help much appreciated. Could you also show me where I can find more information on this framework?
I am reading and writing using the modbus driver in PLC4x with success. I have attached some writing code to your other question at: Plc4x addressing system
About reading, here is some code:
public static PlcReadResponse readModbusTestData(ProtocolClient client,
String registerName,
int offset,
int size,
String registerType)
throws ExecutionException, InterruptedException, TimeoutException {
PlcReadRequest readRequest = client.getConnection().readRequestBuilder()
.addItem(registerName, registerType + ":" + offset + "[" + size + "]").build();
return readRequest.execute().get(2, TimeUnit.SECONDS);
}
The multiple read adding more items to the PlcReadRequest has not been tested yet by me, but it should work. Writing several items is working.
In any case, in order to understand how PLC4x works for modbus or opc-ua I have needed to dive into the source code. It works, but you need to read the source code for the details at its current state.

Apache-ignite: Persistent Storage

My understanding for Ignite Persistent Storage is that the data is not only saved in memory, but also written to disk.
When the node is restarted, it should read the data from disk to memory.
So, I am using this example to test it out. But I update it a little bit because I don't want to use xml.
This is my slightly updated code.
public class PersistentIgniteExpr {
/**
* Organizations cache name.
*/
private static final String ORG_CACHE = "CacheQueryExample_Organizations";
/** */
private static final boolean UPDATE = true;
public void test(String nodeId) {
// Apache Ignite node configuration.
IgniteConfiguration cfg = new IgniteConfiguration();
// Ignite persistence configuration.
DataStorageConfiguration storageCfg = new DataStorageConfiguration();
// Enabling the persistence.
storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
// Applying settings.
cfg.setDataStorageConfiguration(storageCfg);
List<String> addresses = new ArrayList<>();
addresses.add("127.0.0.1:47500..47502");
TcpDiscoverySpi tcpDiscoverySpi = new TcpDiscoverySpi();
tcpDiscoverySpi.setIpFinder(new TcpDiscoveryMulticastIpFinder().setAddresses(addresses));
cfg.setDiscoverySpi(tcpDiscoverySpi);
try (Ignite ignite = Ignition.getOrStart(cfg.setIgniteInstanceName(nodeId))) {
// Activate the cluster. Required to do if the persistent store is enabled because you might need
// to wait while all the nodes, that store a subset of data on disk, join the cluster.
ignite.active(true);
CacheConfiguration<Long, Organization> cacheCfg = new CacheConfiguration<>(ORG_CACHE);
cacheCfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL);
cacheCfg.setBackups(1);
cacheCfg.setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_SYNC);
cacheCfg.setIndexedTypes(Long.class, Organization.class);
IgniteCache<Long, Organization> cache = ignite.getOrCreateCache(cacheCfg);
if (UPDATE) {
System.out.println("Populating the cache...");
try (IgniteDataStreamer<Long, Organization> streamer = ignite.dataStreamer(ORG_CACHE)) {
streamer.allowOverwrite(true);
for (long i = 0; i < 100_000; i++) {
streamer.addData(i, new Organization(i, "organization-" + i));
if (i > 0 && i % 10_000 == 0)
System.out.println("Done: " + i);
}
}
}
// Run SQL without explicitly calling to loadCache().
QueryCursor<List<?>> cur = cache.query(
new SqlFieldsQuery("select id, name from Organization where name like ?")
.setArgs("organization-54321"));
System.out.println("SQL Result: " + cur.getAll());
// Run get() without explicitly calling to loadCache().
Organization org = cache.get(54321l);
System.out.println("GET Result: " + org);
}
}
}
When I run the first time, it works as intended.
After running it one time, I am assuming that data is written to disk since the code is about persistent storage.
When I run the second time, I commented out this part.
if (UPDATE) {
System.out.println("Populating the cache...");
try (IgniteDataStreamer<Long, Organization> streamer = ignite.dataStreamer(ORG_CACHE)) {
streamer.allowOverwrite(true);
for (long i = 0; i < 100_000; i++) {
streamer.addData(i, new Organization(i, "organization-" + i));
if (i > 0 && i % 10_000 == 0)
System.out.println("Done: " + i);
}
}
}
That is the part where data is written. When the sql query is executed, it is returning null. That means data is not written to disk?
Another question is I am not very clear about TcpDiscoverySpi. Can someone explain about it as well?
Thanks in advance.
Do you have any exceptions at node startup?
Very probably, you don't have IGNITE_HOME env variable configured. And the Work Directory for persistence is chosen somehow differently each time you run a node.
You can either setup IGNITE_HOME env variable or add a code line to setup workDirectory explicitly: cfg.setWorkDirectory("C:\\workDirectory");
TcpDiscoverySpi provides a way to discover remote nodes in a grid, so the starting node can join a cluster. It is better to use TcpDiscoveryVmIpFinder if you know the list of IPs. TcpDiscoveryMulticastIpFinder broadcasts UDP messages to a network to discover other nodes. It does not require IPs list at all.
Please see https://apacheignite.readme.io/docs/cluster-config for more details.

Libgit2sharp Push Performance Degrading With Many Commits

The project I am working on uses GIT in a weird way. Essentially it writes and pushes one commit at a time. The project could result in one branch having hundreds of thousands of commits. When testing we found that after only about 500 commits the performance of the GIT push started to degrade. Upon further investigation using a process monitor we believe that the degradation is due to a walk of the entire tree for the branch being pushed. Since we are only ever pushing one new commit at any given time is there any way to optimize this?
Alternatively is there a way to limit the commit history to be something like 50 commits to reduce this overhead?
I am using LibGit2Sharp Version 0.20.1.0
Update 1
To test I wrote the following code:
void Main()
{
string remotePath = #"E:\GIT Test\Remote";
string localPath = #"E:\GIT Test\Local";
string localFilePath = Path.Combine(localPath, "TestFile.txt");
Repository.Init(remotePath, true);
Repository.Clone(remotePath, localPath);
Repository repo = new Repository(localPath);
for(int i = 0; i < 2000; i++)
{
File.WriteAllText(localFilePath, RandomString((i % 2 + 1) * 10));
repo.Stage(localFilePath);
Commit commit = repo.Commit(
string.Format("Commit number: {0}", i),
new Signature("TestAuthor", "TestEmail#Test.com", System.DateTimeOffset.Now),
new Signature("TestAuthor", "TestEmail#Test.com", System.DateTimeOffset.Now));
Stopwatch pushWatch = Stopwatch.StartNew();
Remote defaultRemote = repo.Network.Remotes["origin"];
repo.Network.Push(defaultRemote, "refs/heads/master:refs/heads/master");
pushWatch.Stop();
Trace.WriteLine(string.Format("Push {0} took {1}ms", i, pushWatch.ElapsedMilliseconds));
}
}
private const string Characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
private static readonly Random Random = new Random();
/// <summary>
/// Get a Random string of the specified length
/// </summary>
public static string RandomString(int size)
{
char[] buffer = new char[size];
for (int i = 0; i < size; i++)
{
buffer[i] = Characters[Random.Next(Characters.Length)];
}
return new string(buffer);
}
And ran the process monitor found here:
http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx
The time for each push ended up being generally low with large spikes in time increasing both in frequency and in latency. When looking at the output from the process monitor I believe these spikes lined up with a long stretch where objects in the .git\objects folder were being accessed. For some reason occasionally on a pull there are large reads of the objects which when looked at closer appears to be a walk through the commits and objects.
The above flow is a condensed version of the actual flow we were actually doing in the project. In our actual flow we would first create a new branch "Temp" from "Master", make a commit to "Temp", push "Temp", merge "Temp" with "Master" then push "Master". When we timed each part of that flow we found the push was by far the longest running operation and it was increasing in elapsed time as the commits piled up on "Master".
Update 2
I recently updated to use libgit2sharp version 0.20.1.0 and this problem still exists. Does anyone know why this occurs?
Update 3
We change some of our code to create the temporary branch off of the first commit ever on the "Master" branch to reduce the commit tree traversal overhead but found it still exists. Below is an example that should be easy to compile and run. It shows the tree traversal happens when you create a new branch regardless of the commit position. To see the tree traversal I used the process monitor tool above and command line GIT Bash to examine what each object it opened was. Does anyone know why this happens? Is it expected behavior or am I just doing something wrong? It appears to be the push that causes the issue.
void Main()
{
string remotePath = #"E:\GIT Test\Remote";
string localPath = #"E:\GIT Test\Local";
string localFilePath = Path.Combine(localPath, "TestFile.txt");
Repository.Init(remotePath, true);
Repository.Clone(remotePath, localPath);
// Setup Initial Commit
string newBranch;
using (Repository repo = new Repository(localPath))
{
CommitRandomFile(repo, 0, localFilePath, "master");
newBranch = CreateNewBranch(repo, "master");
repo.Checkout(newBranch);
}
// Commit 1000 times to the new branch
for(int i = 1; i < 1001; i++)
{
using(Repository repo = new Repository(localPath))
{
CommitRandomFile(repo, i, localFilePath, newBranch);
}
}
// Create a single new branch from the first commit ever
// For some reason seems to walk the entire commit tree
using(Repository repo = new Repository(localPath))
{
CreateNewBranch(repo, "master");
}
}
private const string Characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
private static readonly Random Random = new Random();
/// <summary>
/// Generate and commit a random file to the specified branch
/// </summary>
public static void CommitRandomFile(Repository repo, int seed, string rootPath, string branch)
{
File.WriteAllText(rootPath, RandomString((seed % 2 + 1) * 10));
repo.Stage(rootPath);
Commit commit = repo.Commit(
string.Format("Commit: {0}", seed),
new Signature("TestAuthor", "TestEmail#Test.com", System.DateTimeOffset.Now),
new Signature("TestAuthor", "TestEmail#Test.com", System.DateTimeOffset.Now));
Stopwatch pushWatch = Stopwatch.StartNew();
repo.Network.Push(repo.Network.Remotes["origin"], "refs/heads/" + branch + ":refs/heads/" + branch);
pushWatch.Stop();
Trace.WriteLine(string.Format("Push {0} took {1}ms", seed, pushWatch.ElapsedMilliseconds));
}
/// <summary>
/// Create a new branch from the specified source
/// </summary>
public static string CreateNewBranch(Repository repo, string sourceBranch)
{
Branch source = repo.Branches[sourceBranch];
string newBranch = Guid.NewGuid().ToString();
repo.Branches.Add(newBranch, source.Tip);
Stopwatch pushNewBranchWatch = Stopwatch.StartNew();
repo.Network.Push(repo.Network.Remotes["origin"], "refs/heads/" + newBranch + ":refs/heads/" + newBranch);
pushNewBranchWatch.Stop();
Trace.WriteLine(string.Format("Push of new branch {0} took {1}ms", newBranch, pushNewBranchWatch.ElapsedMilliseconds));
return newBranch;
}
/// <summary>
/// Get a Random string of the specified length
/// </summary>
public static string RandomString(int size)
{
char[] buffer = new char[size];
for (int i = 0; i < size; i++)
{
buffer[i] = Characters[Random.Next(Characters.Length)];
}
return new string(buffer);
}