Apache Beam : RabbitMqIO watermark doesn't advance

Apache Beam : RabbitMqIO watermark doesn't advance - rabbitmq

I need some help please. I'm trying to use Apache beam with RabbitMqIO source (version 2.11.0) and AfterWatermark.pastEndOfWindow trigger. It seems like the RabbitMqIO's watermark doesn't advance and remain the same. Because of this behavior, the AfterWatermark trigger doesn't work. When I use others triggers which doesn't take watermark in consideration, that works (eg: AfterProcessingTime, AfterPane) Below, my code, thanks :
public class Main {
private static final Logger LOGGER = LoggerFactory.getLogger(Main.class);
// Window declaration with trigger
public static Window<RabbitMqMessage> window() {
return Window. <RabbitMqMessage>into(FixedWindows.of(Duration.standardSeconds(60)))
.triggering(AfterWatermark.pastEndOfWindow())
.withAllowedLateness(Duration.ZERO)
.accumulatingFiredPanes();
}
public static void main(String[] args) {
SpringApplication.run(Main.class, args);
// pipeline creation
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
Pipeline pipeline = Pipeline.create(options);
// Using RabbitMqIO
PCollection<RabbitMqMessage> messages = pipeline
.apply(RabbitMqIO.read().withUri("amqp://guest:guest#localhost:5672").withQueue("test"));
PCollection<RabbitMqMessage> windowedData = messages.apply("Windowing", window());
windowedData.apply(Combine.globally(new MyCombine()).withoutDefaults());
pipeline.run();
}
}
class MyCombine implements SerializableFunction<Iterable<RabbitMqMessage>, RabbitMqMessage> {
private static final Logger LOGGER = LoggerFactory.getLogger(MyCombineKafka.class);
/**
*
*/
private static final long serialVersionUID = 6143898367853230506L;
#Override
public RabbitMqMessage apply(Iterable<RabbitMqMessage> input) {
LOGGER.info("After trigger launched");
return null;
}
}

I spent a lot of time looking into this. After opening https://issues.apache.org/jira/browse/BEAM-8347 I left some notes in the ticket on what I think the problems are with the current implementation.
Re-stated here:
The documentation for UnboundedSource.getWatermark reads:
[watermark] can be approximate. If records are read that violate this guarantee, they will be considered late, which will affect how
they will be processed. ...
However, this value should be as late as possible. Downstream windows may not be able to close until this watermark passes their
end.
For example, a source may know that the records it reads will be in timestamp order. In this case, the watermark can be the timestamp
of the last record read. For a source that does not have natural
timestamps, timestamps can be set to the time of reading, in which
case the watermark is the current clock time.
The implementation in UnboundedRabbitMqReader uses the oldest timestamp as the watermark, in violation of the above suggestion.
Further, the timestamp applied is delivery time, which should be monotonically increasing. We should reliably be able to increase the watermark on every message delivered, which mostly solves the issue.
Finally, we can make provisions for increasing the watermark even when no messages have come in. In the event where there are no new messages, it should be ok to advance the watermark following the approach taken in the kafka io TimestampPolicyFactory when the stream is 'caught up'. In this case, we would increment the watermark to, e.g., max(current watermark, NOW - 2 seconds) when we see no new messages, just to ensure windows/triggers can fire without requiring new data.
Unfortunately, it's difficult to make these slight modifications locally as the Rabbit implementations are closed to extension, and are mostly private or package-private.
Update: I've opened a PR upstream to address this. Changes here: https://github.com/apache/beam/pull/9820

Related

Changing the GemFire query ResultSender batch size

I am experiencing a performance issue related to the default batch size of the query ResultSender using client/server config. I believe the default value is 100.
If I run a simple query to get keys (with some order by columns due to the PARTITION Region type), this default batch size causes too many chunks being sent back for even 1000 records. In my tests, even the total query time is only less than 100 ms, however, the app takes more than 10 seconds to process those chunks.

Reading between the lines in your problem statement, it seems you are:
Executing an OQL query on a PARTITION Region (PR).
Running the query inside a Function as recommended when executing queries on a PR.
Sending batch results (as opposed to streaming the results).
I also assume since you posted exclusively in the #spring-data-gemfire channel, that you are using Spring Data GemFire (SDG) to:
Execute the query (e.g. by using the SDG GemfireTemplate; Of course, I suppose you could also be using the GemFire Query API inside your Function directly, too)?
Implemented the server-side Function using SDG's Function annotation support?
And, are possibly (indirectly) using SDG's BatchingResultSender, as described in the documentation?
NOTE: The default batch size in SDG is 0, NOT 100. Zero means stream the results individually.
Regarding #2 & #3, your implementation might look something like the following:
#Component
class MyApplicationFunctions {
#GemfireFunction(id = "MyFunction", batchSize = "1000")
public List<SomeApplicationType> myFunction(FunctionContext functionContext) {
RegionFunctionContext regionFunctionContext =
(RegionFunctionContext) functionContext;
Region<?, ?> region = regionFunctionContext.getDataSet();
if (PartitionRegionHelper.isPartitionRegion(region)) {
region = PartitionRegionHelper.getLocalDataForContext(regionFunctionContext);
}
GemfireTemplate template = new GemfireTemplate(region);
String OQL = "...";
SelectResults<?> results = template.query(OQL); // or `template.find(OQL, args);`
List<SomeApplicationType> list = ...;
// process results, convert to SomeApplicationType, add to list
return list;
}
}
NOTE: Since you are most likely executing this Function "on Region", the FunctionContext type will actually be a RegionFunctionContext in this case.
The batchSize attribute on the SDG #GemfireFunction annotation (used for Function "implementations") allows you to control the batch size.
Of course, instead of using SDG's GemfireTemplate to execute queries, you can, of course, use the GemFire Query API directly, as mentioned above.
If you need even more fine grained control over "result sending", then you can simply "inject" the ResultSender provided by GemFire to the Function, even if the Function is implemented using SDG, as shown above. For example you can do:
#Component
class MyApplicationFunctions {
#GemfireFunction(id = "MyFunction")
public void myFunction(FunctionContext functionContext, ResultSender resultSender) {
...
SelectResults<?> results = ...;
// now process the results and use the `resultSender` directly
}
}
This allows you to "send" the results however you see fit, as required by your application.
You can batch/chunk results, stream, whatever.
Although, you should be mindful of the "receiving" side in this case!
The 1 thing that might not be apparent to the average GemFire user is that GemFire's default ResultCollector implementation collects "all" the results first before returning them to the application. This means the receiving side does not support streaming or batching/chunking of the results, allowing them to be processed immediately when the server sends the results (either streamed, batched/chunked, or otherwise).
Once again, SDG helps you out here since you can provide a custom ResultCollector on the Function "execution" (client-side), for example:
#OnRegion("SomePartitionRegion", resultCollector="myResultCollector")
interface MyApplicationFunctionExecution {
void myFunction();
}
In your Spring configuration, you would then have:
#Configuration
class ApplicationGemFireConfiguration {
#Bean
ResultCollector myResultCollector() {
return ...;
}
}
Your "custom" ResultCollector could return results as a stream, a batch/chunk at a time, etc.
In fact, I have prototyped a "streaming" ResultCollector implementation that will eventually be added to SDG, here.
Anyway, this should give you some ideas on how to handle the performance problem you seem to be experiencing. 1000 results is not a lot of data so I suspect your problem is mostly self-inflicted.
Hope this helps!

John,
Just to clarify, I use client/server topology(actually wan, but that is not important in here). My client is a spring boot web app which has kendo grid as ui. Users can filter/sort on any combination of the columns, which will be passed to the spring boot app for generating dynamic OQL and create the pagination. Till now, except for being dynamic, my OQL queries are quite straight forward. I do not want to introduce server side functions due to the complexity of our global deployment process. But I can if you think that is something I have to do.
Again, thanks for your answers.

Read SQL Server Broker messages and publish them using NServiceBus

I am very new to NServiceBus, and in one of our project, we want to accomplish following -
Whenever table data is modified in Sql server, construct a message and insert in sql server broker queue
Read the broker queue message using NServiceBus
Publish the message again as another event so that other subscribers
can handle it.
Now it is point 2, that I do not have much clue, how to get it done.
I have referred the following posts, after which I was able to enter the message in broker queue, but unable to integrate with NServiceBus in our project, as the NServiceBus libraries are of older version and also many methods used are deprecated. So using them with current versions is getting very troublesome, or if I was doing it in improper way.
http://www.nullreference.se/2010/12/06/using-nservicebus-and-servicebroker-net-part-2
https://github.com/jdaigle/servicebroker.net
Any help on the correct way of doing this would be invaluable.
Thanks.

I'm using the current version of nServiceBus (5), VS2013 and SQL Server 2008. I created a Database Change Listener using this tutorial, which uses SQL Server object broker and SQLDependency to monitor the changes to a specific table. (NB This may be deprecated in later versions of SQL Server).
SQL Dependency allows you to use a broad selection of all the basic SQL functionality, although there are some restrictions that you need to be aware of. I modified the code from the tutorial slightly to provide better error information:
void NotifyOnChange(object sender, SqlNotificationEventArgs e)
{
// Check for any errors
if (#"Subscribe|Unknown".Contains(e.Type.ToString())) { throw _DisplayErrorDetails(e); }
var dependency = sender as SqlDependency;
if (dependency != null) dependency.OnChange -= NotifyOnChange;
if (OnChange != null) { OnChange(); }
}
private Exception _DisplayErrorDetails(SqlNotificationEventArgs e)
{
var message = "useful error info";
var messageInner = string.Format("Type:{0}, Source:{1}, Info:{2}", e.Type.ToString(), e.Source.ToString(), e.Info.ToString());
if (#"Subscribe".Contains(e.Type.ToString()) && #"Invalid".Contains(e.Info.ToString()))
messageInner += "\r\n\nThe subscriber says that the statement is invalid - check your SQL statement conforms to specified requirements (http://stackoverflow.com/questions/7588572/what-are-the-limitations-of-sqldependency/7588660#7588660).\n\n";
return new Exception(messageMain, new Exception(messageInner));
}
I also created a project with a "database first" Entity Framework data model to allow me do something with the changed data.
[The relevant part of] My nServiceBus project comprises two "Run as Host" endpoints, one of which publishes event messages. The second endpoint handles the messages. The publisher has been setup to IWantToRunAtStartup, which instantiates the DBListener and passes it the SQL statement I want to run as my change monitor. The onChange() function is passed an anonymous function to read the changed data and publish a message:
using statements
namespace Sample4.TestItemRequest
{
public partial class MyExampleSender : IWantToRunWhenBusStartsAndStops
{
private string NOTIFY_SQL = #"SELECT [id] FROM [dbo].[Test] WITH(NOLOCK) WHERE ISNULL([Status], 'N') = 'N'";
public void Start() { _StartListening(); }
public void Stop() { throw new NotImplementedException(); }
private void _StartListening()
{
var db = new Models.TestEntities();
// Instantiate a new DBListener with the specified connection string
var changeListener = new DatabaseChangeListener(ConfigurationManager.ConnectionStrings["TestConnection"].ConnectionString);
// Assign the code within the braces to the DBListener's onChange event
changeListener.OnChange += () =>
{
/* START OF EVENT HANDLING CODE */
//This uses LINQ against the EF data model to get the changed records
IEnumerable<Models.TestItems> _NewTestItems = DataAccessLibrary.GetInitialDataSet(db);
while (_NewTestItems.Count() > 0)
{
foreach (var qq in _NewTestItems)
{
// Do some processing, if required
var newTestItem = new NewTestStarted() { ... set properties from qq object ... };
Bus.Publish(newTestItem);
}
// Because there might be a number of new rows added, I grab them in small batches until finished.
// Probably better to use RX to do this, but this will do for proof of concept
_NewTestItems = DataAccessLibrary.GetNextDataChunk(db);
}
changeListener.Start(string.Format(NOTIFY_SQL));
/* END OF EVENT HANDLING CODE */
};
// Now everything has been set up.... start it running.
changeListener.Start(string.Format(NOTIFY_SQL));
}
}
}
Important The OnChange event firing causes the listener to stop monitoring. It basically is a single event notifier. After you have handled the event, the last thing to do is restart the DBListener. (You can see this in the line preceding the END OF EVENT HANDLING comment).
You need to add a reference to System.Data and possibly System.Data.DataSetExtensions.
The project at the moment is still proof of concept, so I'm well aware that the above can be somewhat improved. Also bear in mind I had to strip out company specific code, so there may be bugs. Treat it as a template, rather than a working example.
I also don't know if this is the right place to put the code - that's partly why I'm on StackOverflow today; to look for better examples of ServiceBus host code. Whatever the failings of my code, the solution works pretty effectively - so far - and meets your goals, too.
Don't worry too much about the ServiceBroker side of things. Once you have set it up, per the tutorial, SQLDependency takes care of the details for you.

The ServiceBroker Transport is very old and not supported anymore, as far as I can remember.
A possible solution would be to "monitor" the interesting tables from the endpoint code using something like a SqlDependency (http://msdn.microsoft.com/en-us/library/62xk7953(v=vs.110).aspx) and then push messages into the relevant queues.
.m

App Folder files not visible after un-install / re-install

I noticed this in the debug environment where I have to do many re-installs in order to test persistent data storage, initial settings, etc... It may not be relevant in production, but I mention this anyway just to inform other developers.
Any files created by an app in its App Folder are not 'visible' to queries after manual un-install / re-install (from IDE, for instance). The same applies to the 'Encoded DriveID' - it is no longer valid.
It is probably 'by design' but it effectively creates 'orphans' in the app folder until manually cleaned by 'drive.google.com > Manage Apps > [yourapp] > Options > Delete hidden app data'. It also creates problem if an app relies on finding of files by metadata, title, ... since these seem to be gone. As I said, not a production problem, but it can create some frustration during development.
Can any of friendly Googlers confirm this? Is there any other way to get to these files after re-install?

Try this approach:
Use requestSync() in onConnected() as:
#Override
public void onConnected(Bundle connectionHint) {
super.onConnected(connectionHint);
Drive.DriveApi.requestSync(getGoogleApiClient()).setResultCallback(syncCallback);
}
Then, in its callback, query the contents of the drive using:
final private ResultCallback<Status> syncCallback = new ResultCallback<Status>() {
#Override
public void onResult(#NonNull Status status) {
if (!status.isSuccess()) {
showMessage("Problem while retrieving results");
return;
}
query = new Query.Builder()
.addFilter(Filters.and(Filters.eq(SearchableField.TITLE, "title"),
Filters.eq(SearchableField.TRASHED, false)))
.build();
Drive.DriveApi.query(getGoogleApiClient(), query)
.setResultCallback(metadataCallback);
}
};
Then, in its callback, if found, retrieve the file using:
final private ResultCallback<DriveApi.MetadataBufferResult> metadataCallback =
new ResultCallback<DriveApi.MetadataBufferResult>() {
#SuppressLint("SetTextI18n")
#Override
public void onResult(#NonNull DriveApi.MetadataBufferResult result) {
if (!result.getStatus().isSuccess()) {
showMessage("Problem while retrieving results");
return;
}
MetadataBuffer mdb = result.getMetadataBuffer();
for (Metadata md : mdb) {
Date createdDate = md.getCreatedDate();
DriveId driveId = md.getDriveId();
}
readFromDrive(driveId);
}
};
Job done!
Hope that helps!

It looks like Google Play services has a problem. (https://stackoverflow.com/a/26541831/2228408)
For testing, you can do it by clearing Google Play services data (Settings > Apps > Google Play services > Manage Space > Clear all data).
Or, at this time, you need to implement it by using Drive SDK v2.

I think you are correct that it is by design.
By inspection I have concluded that until an app places data in the AppFolder folder, Drive does not sync down to the device however much to try and hassle it. Therefore it is impossible to check for the existence of AppFolder placed by another device, or a prior implementation. I'd assume that this was to try and create a consistent clean install.
I can see that there are a couple of strategies to work around this:
1) Place dummy data on AppFolder and then sync and recheck.
2) Accept that in the first instance there is the possibility of duplicates, as you cannot access the existing file by definition you will create a new copy, and use custom metadata to come up with a scheme to differentiate like-named files and choose which one you want to keep (essentially implement your conflict merge strategy across the two different files).
I've done the second, I have an update number to compare data from different devices and decide which version I want so decide whether to upload, download or leave alone. As my data is an SQLite DB I also have some code to only sync once updates have settled down and I deliberately consider people updating two devices at once foolish and the results are consistent but undefined as to which will win.

uiitem.WaitForControlExist(milliseconds); Waits too long

When I use the method uiitem.WaitForControlExist(milliseconds); Execution waits too long. Muchmore of the specified parameter.
Any idea?
Just an example on UIMap.cs file:
public void AnyAlertClickNo(int seconds)
{
#region Variable Declarations
WinWindow uIAlert = this.UIAlertWindow;
WinButton uINoButton = this.UIAlertWindow.UIAnswerPanel.UINoButton;
#endregion
if(uIAlert.WaitForControlExist(seconds*1000)){
Mouse.Click(uINoButton, new Point(20, 10));
}
}
Te calls could be:
Any_UIMap aaa = new Any_UIMap();
aaa.AnyAlertClickNo(3);
I don't know why this code are waiting for this alert arround 15-20 seconds.
thanks in advance

The code is unlikely to be
uiitem.WaitForControlExist(milliseconds);
There are often several levels of UI control, so the code is more likely to be of the form:
UiMap.uiOne.uiTwo.uiThree.WaitForControlExist(milliseconds);
A line like the above has a meaning like the following, provided that it has the first use of the three UI controls:
var one = UiMap.uiOne.Find();
var two = one.uiTwo.Find();
two.uiThree.WaitForControlExist(milliseconds);
I suspect that your Coded UI test is spending some time on the ...Find() calls. You might do some diagnostics to check where the time is spent. Look here and here for some good ideas on speeding up Coded UI tests.

Well it is supposed to.
UITestControl.WaitForControlExist Method (Int32)
When the wait operation causes an implicit search for the control or, when the application is busy, the actual wait time could be more than the time-out specified.

Random error when testing with NHibernate on an in-Memory SQLite db

I have a system which after getting a message - enqueues it (write to a table), and another process polls the DB and dequeues it for processing. In my automatic tests I've merged the operations in the same process, but cannot (conceptually) merge the NH sessions from the two operations.
Naturally - problems arise.
I've read everything I could about getting the SQLite-InMemory-NHibernate combination to work in the testing world, but I've now ran into RANDOMLY failing tests, due to "no such table" errors. To make it clear - "random" means that the same test with the same exact configuration and code will sometimes fail.
I have the following SQLite configuration:
return SQLiteConfiguration
.Standard
.ConnectionString(x => x.Is("Data Source=:memory:; Version=3; New=True; Pooling=True; Max Pool Size=1;"))
.Raw(NHibernate.Cfg.Environment.ReleaseConnections, "on_close");
At the beginning of my test (every test) I fetch the "static" session provider, and kindly ask it to flush the existing DB clean, and recreate the schema:
public void PurgeDatabaseOrCreateNew()
{
using (var session = GetNewSession())
using (var tx = session.BeginTransaction())
{
PurgeDatabaseOrCreateNew(session);
tx.Commit();
}
}
private void PurgeDatabaseOrCreateNew(ISession session)
{
//http://ayende.com/Blog/archive/2009/04/28/nhibernate-unit-testing.aspx
new SchemaExport(_Configuration)
.Execute(false, true, false, session.Connection, null);
}
So yes, it's on a different session, but the connection is pooled on SQLite, so the next session I create will see the generated schema. Yet, while most of the times it works - sometimes the later "enqueue" operation will fail because it cannot see a table for my incoming messages.
Also - that seems to happen at max one or twice per test suite run; not all the tests are failing, just the first one (and sometimes another one. Not quite sure if it's the second or not).
The worst part is the randomness, naturally. I've told myself I've fixed this several times now, just because it simply "stopped failing". At random.
This happens on FW4.0, System.Data.SQLite x86 version, Win7 64b and 2008R2 (three differen machine in total), NH2.1.2, configured with FNH, on TestDriven.NET 32b precesses and NUnit console 32b processes.
Help?

Hi I'm pretty sure I have the exact same problem as you. I open and close multiple sessions per integration test. After digging through the SQLite connection pooling and some experimenting of my own, I've come to the following conclusion:
The SQLite pooling code caches the connection using WeakReferences, which isn't the best option for caching, since the reference to the connection(s) will be cleared when there is no normal (strong) reference to the connection and the GC runs. Since you can't predict when the GC runs, this explains the "randomness". Try and add a GC.Collect(); between closing one and opening another session, your test will always fail.
My solution was to cache the connection myself between opening sessions, like this:
public class BaseIntegrationTest
{
private static ISessionFactory _sessionFactory;
private static Configuration _configuration;
private static SchemaExport _schemaExport;
// I cache the whole session because I don't want it and the
// underlying connection to get closed.
// The "Connection" property of the ISession is what we actually want.
// Using the NHibernate SQLite Driver to get the connection would probably
// work too.
private static ISession _keepConnectionAlive;
static BaseIntegrationTest()
{
_configuration = new Configuration();
_configuration.Configure();
_configuration.AddAssembly(typeof(Product).Assembly);
_sessionFactory = _configuration.BuildSessionFactory();
_schemaExport = new SchemaExport(_configuration);
_keepConnectionAlive = _sessionFactory.OpenSession();
}
[SetUp]
protected void RecreateDB()
{
_schemaExport.Execute(false, true, false, _keepConnectionAlive.Connection, null);
}
protected ISession OpenSession()
{
return _sessionFactory.OpenSession(_keepConnectionAlive.Connection);
}
}
Each of my integrationtests inherits from this class, and calls OpenSession() to get a session. RecreateDB is called by NUnit before each test because of the [SetUp] attribute.
I hope this helps you or anyone else who gets this error.

Only thing that comes into mind that you are randomly leaving session open after the test. You must make sure any existing ISession is closed before you open another one. If you are not using the using() statement or calling Dispose() manually the session might still be alive somewhere causing those random exceptions.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Apache Beam : RabbitMqIO watermark doesn't advance - rabbitmq

Related

Changing the GemFire query ResultSender batch size

Read SQL Server Broker messages and publish them using NServiceBus

App Folder files not visible after un-install / re-install

uiitem.WaitForControlExist(milliseconds); Waits too long

Random error when testing with NHibernate on an in-Memory SQLite db

Categories

Resources