On-the-fly data generation for benchmarking Beam - datasource

My goal is to benchmark the latency and the throughput of Apache Beam on a streaming data use-case with different window queries.
I want to create my own data with an on-the-fly data generator so I can control the generation rate manually, and consume this data directly from a pipeline without a pub/sub mechanism, i.e. I don't want to read the data from a broker etc., to avoid bottlenecks.
Is there a way of doing something similar to what I want to achieve, or is there any source code for such a use case with the Beam SDKs?
So far I couldn't find a starting point; the existing code samples use a pub/sub mechanism and assume the data comes from somewhere.
Thank you for suggestions in advance.

With regards to on-the-fly data, one option would be to make use of GenerateSequence, for example:
pipeline.apply(GenerateSequence.from(0).withRate(RATE,Duration.millis(1000)))
To create other types of objects, you can add a ParDo afterwards that consumes the Long values and turns them into something else:
Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

p.apply(GenerateSequence.from(0).withRate(2, Duration.millis(1000)))
 .apply(Window.into(FixedWindows.of(Duration.standardSeconds(1))))
 .apply(FlatMapElements.into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
     .via(i -> IntStream.range(0, 2)
         .mapToObj(k -> KV.of(String.format("Gen Value %s", i), String.format("FlatMap Value %s ", k)))
         .collect(Collectors.toList())))
 .apply(ParDo.of(new DoFn<KV<String, String>, String>() {
     @ProcessElement
     public void process(@Element KV<String, String> input) {
         LOG.info("Value was {}", input);
     }
 }));

p.run();
That should generate values like:
Value was KV{Gen Value 0, FlatMap Value 0 }
Value was KV{Gen Value 0, FlatMap Value 1 }
Value was KV{Gen Value 1, FlatMap Value 0 }
Value was KV{Gen Value 1, FlatMap Value 1 }
Value was KV{Gen Value 2, FlatMap Value 0 }
Value was KV{Gen Value 2, FlatMap Value 1 }
Some other things to keep in mind for your pipeline's performance testing:
The Direct runner is designed for unit testing; it does useful things like simulating failures, which helps catch issues that would otherwise only show up in a production pipeline. It is not designed to help with performance testing, however. I would recommend always using one of the main runners for those types of integration tests.
Please be aware of the fusion optimization (see the docs): when using an artificial data source like GenerateSequence, you may need to do a GroupByKey (GBK) as the next step to allow the work to be parallelized (a small sketch follows below). For the Dataflow runner, more info can be found in the Dataflow docs.
In general for performance testing, I would recommend testing the whole end-to-end pipeline. There are interactions with sources and sinks (for example, watermarks) which will not be exercised in a standalone pipeline.
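As a rough illustration of the fusion point above, here is a minimal sketch, written in Kotlin against the same Beam Java SDK (ExpensiveFn is a hypothetical stand-in for whatever per-element work you want to benchmark). Reshuffle.viaRandomKey() is a commonly used shorthand for the add-key / GroupByKey / drop-key pattern mentioned above:

import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.GenerateSequence
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.DoFn
import org.apache.beam.sdk.transforms.DoFn.Element
import org.apache.beam.sdk.transforms.DoFn.ProcessElement
import org.apache.beam.sdk.transforms.ParDo
import org.apache.beam.sdk.transforms.Reshuffle
import org.joda.time.Duration

// Hypothetical stand-in for the real per-element work being benchmarked.
class ExpensiveFn : DoFn<Long, Void>() {
    @ProcessElement
    fun process(@Element n: Long) {
        Math.sqrt(n.toDouble()) // pretend this is expensive
    }
}

fun main() {
    val p = Pipeline.create(PipelineOptionsFactory.create())
    p.apply(GenerateSequence.from(0).withRate(1000, Duration.millis(1000)))
        // Forces a shuffle boundary so the generator and ExpensiveFn are not
        // fused into a single step and the work can be spread across workers.
        .apply(Reshuffle.viaRandomKey())
        .apply(ParDo.of(ExpensiveFn()))
    p.run().waitUntilFinish()
}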
Hope that helps.

Related

How to print size of flow in kotlin

Hey, I am new to Kotlin Flow. I am trying to print the size of a flow. A list has a size property; do we have something similar for a flow?
val list = mutableListOf(1, 2, 3)
println(list.size)
output
3
How do we get the size of a flow?
dataMutableStateFlow.collectLatest { data ->
???
}
Thanks
A Flow doesn't know its size at any moment, because there is an unknown number of future values to be emitted. Also, Flows do not keep a record of how many values they have emitted in the past.
Sequences have the same problem. With both Flows and Sequences, you can only get the count by doing something terminal with them, something that iterates through it all.
The only way to get the size of the Flow is to do something that iterates through the entire Flow. For instance, you can call the suspend function count() on a Flow to get its size. The more complicated way to do it would be to create a count variable and then increment the count inside a collect call. However, counting the emissions of a Flow is only usable for finite cold Flows. Hot flows (SharedFlow and StateFlow) are never finite, and many cold Flows are also infinite.
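For illustration, here is a minimal sketch of both approaches described above, using a small made-up finite flow:

import kotlinx.coroutines.flow.*
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    val numbers = flowOf(1, 2, 3) // finite cold flow, so counting it makes sense

    // count() is a terminal operator: it collects the whole flow and returns
    // how many values were emitted.
    println(numbers.count()) // 3

    // The manual alternative: increment a counter inside collect.
    var size = 0
    numbers.collect { size++ }
    println(size) // 3
}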

Chronicle Queue - reader/tailer latency when run at same time while writing

I'm setting up market data back-testing using Chronicle Queue (CQ): reading data from a binary file, writing it into a single CQ, and simultaneously reading the data from that CQ and dumping the statistics. I am doing a POC to replace our existing real-time market data feed handler worker queue.
While doing basic read/write testing on a Linux/SSD setup, I see reads lagging behind writes - in fact the latency is accumulating. Both the Appender and the Tailer run as separate processes on the same host.
I would like to know if there is any issue in the code I am using.
Below is the code snippet -
Writer -
In constructor -
myQueue = SingleChronicleQueueBuilder.binary(queueName).build();
myAppender = myQueue.acquireAppender();
In data callback -
myAppender.writeDocument(myDataPacket);
myQueue.close();
where myDataPacket is Java object wrapping the byte[] and other fields.
Tailer -
In Constructor -
myQueue = SingleChronicleQueueBuilder.binary(queueName).build();
myTailer = myQueue.createTailer();
In Read method -
while (notLastRecord)
{
if(myTailer.readDocument(myDataPacket))
{
notLastRecord = ;
//do stuff
}
}
myQueue.close();
Any help is highly appreciated.
Thanks,
Pavan
First of all, I assume that by "reads are lagging behind writes - in fact latency is accumulating" you mean that for every subsequent message, the time at which the message is read from the queue is further from the time at which it was written to the queue.
If you see latency accumulating like that, most likely the data is produced much quicker than you can consume it, which from the use case you described is very much possible - if all you need to do on the write side is parse a simple text line and dump it into a queue file, that's quick, but if you do some processing when you read the entry from the queue, it might be slower.
From the code it's not clear what or how much work your code is doing. The code looks OK to me, except that you probably shouldn't call queue.close() after each appender.writeDocument() call - but most likely you are not doing that, otherwise it would blow up.
Without seeing actual code or test case it's impossible to say more.

Kotlin: Why is Sequence more performant in this example?

Currently, I am looking into Kotlin and have a question about Sequences vs. Collections.
I read a blog post about this topic, and there you can find these code snippets:
List implementation:
val list = generateSequence(1) { it + 1 }
.take(50_000_000)
.toList()
measure {
list
.filter { it % 3 == 0 }
.average()
}
// 8644 ms
Sequence implementation:
val sequence = generateSequence(1) { it + 1 }
.take(50_000_000)
measure {
sequence
.filter { it % 3 == 0 }
.average()
}
// 822 ms
The point here is that the Sequence implementation is about 10x faster.
However, I do not really understand WHY that is. I know that with a Sequence you get lazy evaluation, but I cannot find any reason why that helps reduce the processing in this example.
In the following example, however, I do know why a Sequence is generally faster:
val result = sequenceOf("a", "b", "c")
.map {
println("map: $it")
it.toUpperCase()
}
.any {
println("any: $it")
it.startsWith("B")
}
Because a Sequence processes the data "vertically" (element by element), as soon as an element maps to something that starts with "B", you don't have to map the rest of the elements. That makes sense here.
So, why is it also faster in the first example?
Let's look at what those two implementations are actually doing:
The List implementation first creates a List in memory with 50 million elements.  This will take a bare minimum of 200MB, since an integer takes 4 bytes.
(In fact, it's probably far more than that.  As Alexey Romanov pointed out, since it's a generic List implementation and not an IntList, it won't be storing the integers directly, but will be ‘boxing’ them — storing references to Int objects.  On the JVM, each reference could be 8 or 16 bytes, and each Int could take 16, giving 1–2GB.  Also, depending how the List gets created, it might start with a small array and keep creating larger and larger ones as the list grows, copying all the values across each time, using more memory still.)
Then it has to read all the values back from the list, filter them, and create another list in memory.
Finally, it has to read all those values back in again, to calculate the average.
The Sequence implementation, on the other hand, doesn't have to store anything!  It simply generates the values in order, and as it does each one it checks whether it's divisible by 3 and if so includes it in the average.
(That's pretty much how you'd do it if you were implementing it ‘by hand’.)
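For comparison, roughly what that hand-written version might look like (a sketch, not literally what the Sequence code compiles to): one pass, a running sum and count, and no intermediate collections.

// One pass over the numbers: filter and average in the same loop,
// which is essentially the work the Sequence pipeline ends up doing.
fun averageOfMultiplesOf3(limit: Int): Double {
    var sum = 0.0
    var count = 0
    for (value in 1..limit) {
        if (value % 3 == 0) {
            sum += value
            count++
        }
    }
    return if (count == 0) Double.NaN else sum / count
}

fun main() {
    println(averageOfMultiplesOf3(50_000_000))
}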
You can see that in addition to the divisibility checking and average calculation, the List implementation is doing a massive amount of memory access, which will take a lot of time.  That's the main reason it's far slower than the Sequence version, which doesn't!
Seeing this, you might ask why we don't use Sequences everywhere…  But this is a fairly extreme example.  Setting up and then iterating the Sequence has some overhead of its own, and for smallish lists that can outweigh the memory overhead.  So Sequences only have a clear advantage in cases when the lists are very large, are processed strictly in order, there are several intermediate steps, and/or many items are filtered out along the way (especially if the Sequence is infinite!).
In my experience, those conditions don't occur very often.  But this question shows how important it is to recognise them when they do!
Leveraging lazy-evaluation allows avoiding the creation of intermediate objects that are irrelevant from the point of the end goal.
Also, the benchmarking method used in the mentioned article is not super accurate. Try to repeat the experiment with JMH.
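For example, a minimal JMH harness for the two variants might look roughly like this (a sketch that assumes the JMH annotations are on the classpath; the class and method names are made up):

import org.openjdk.jmh.annotations.Benchmark
import org.openjdk.jmh.annotations.Scope
import org.openjdk.jmh.annotations.State

// JMH requires benchmark classes to be non-final, hence the open modifier.
@State(Scope.Benchmark)
open class ListVsSequenceBenchmark {

    // Built once, outside the measured code, like the article's list variant.
    private val list = generateSequence(1) { it + 1 }.take(50_000_000).toList()

    @Benchmark
    fun listAverage(): Double =
        list.filter { it % 3 == 0 }.average()

    @Benchmark
    fun sequenceAverage(): Double =
        generateSequence(1) { it + 1 }
            .take(50_000_000)
            .filter { it % 3 == 0 }
            .average()
}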
Initial code produces a list containing 50_000_000 objects:
val list = generateSequence(1) { it + 1 }
.take(50_000_000)
.toList()
then iterates through it and creates another list containing a subset of its elements:
.filter { it % 3 == 0 }
... and then proceeds with calculating the average:
.average()
Using sequences allows you to avoid doing all those intermediate steps. The code below doesn't produce 50_000_000 elements; it's just a representation of that 1..50_000_000 sequence:
val sequence = generateSequence(1) { it + 1 }
.take(50_000_000)
adding a filter to it doesn't trigger the calculation either; it just derives a new sequence from the existing one (3, 6, 9...):
.filter { it % 3 == 0 }
and eventually, a terminal operation is called that triggers the evaluation of the sequence and the actual calculation:
.average()
Some relevant reading:
Kotlin: Beware of Java Stream API Habits
Kotlin Collections API Performance Antipatterns

Kotlin stdlib operations vs for loops

I wrote the following code:
val src = (0 until 1000000).toList()
val dest = ArrayList<Double>(src.size / 2 + 1)
for (i in src)
{
if (i % 2 == 0) dest.add(Math.sqrt(i.toDouble()))
}
IntelliJ (in my case Android Studio) is asking me if I want to replace the for loop with operations from the stdlib. This results in the following code:
val src = (0 until 1000000).toList()
val dest = ArrayList<Double>(src.size / 2 + 1)
src.filter { it % 2 == 0 }
.mapTo(dest) { Math.sqrt(it.toDouble()) }
Now I must say, I like the changed code. I find it easier to write than for loops when I run into similar situations. However, upon reading what the filter function does, I realized that this is much slower code compared to the for loop: filter creates a new list containing only the elements from src that match the predicate. So there is one more list created and one more loop in the stdlib version of the code. Of course, for small lists it might not matter, but in general this does not sound like a good alternative. Especially if you chain more methods like this, you can get a lot of additional loops that could be avoided by writing a for loop.
My question is: what is considered good practice in Kotlin? Should I stick to for loops, or am I missing something and it does not work the way I think it does?
If you are concerned about performance, what you need is Sequence. For example, your above code will be
val src = (0 until 1000000).toList()
val dest = ArrayList<Double>(src.size / 2 + 1)
src.asSequence()
.filter { it % 2 == 0 }
.mapTo(dest) { Math.sqrt(it.toDouble()) }
In the above code, filter returns another Sequence, which represents an intermediate step. Nothing is really created yet - no object or array creation (except a new Sequence wrapper). Only when mapTo, a terminal operator, is called is the resulting collection created.
If you have learned the Java 8 Stream API, you may find the above explanation somewhat familiar. Actually, Sequence is roughly the Kotlin equivalent of a Java 8 Stream. They share a similar purpose and performance characteristics. The only difference is that Sequence isn't designed to work with ForkJoinPool, and is thus a lot easier to implement.
When there are multiple steps involved or the collection may be large, it's suggested to use Sequence instead of plain .filter {...}.mapTo{...}. I also suggest using the Sequence form instead of your imperative form because it's easier to understand. Imperative code may become complex, and thus hard to understand, when there are 5 or more steps involved in the data processing. If there is just one step, you don't need a Sequence, because it just creates garbage and gives you nothing useful.
You're missing something. :-)
In this particular case, you can use an IntProgression:
val progression = 0 until 1_000_000 step 2
You can then create your desired list of square roots in various ways:
// may make the list larger than necessary
// its internal array is copied each time the list grows beyond its capacity
// code is very straight forward
progression.map { Math.sqrt(it.toDouble()) }
// will make the list the exact size needed
// no copies are made
// code is more complicated
progression.mapTo(ArrayList(progression.last / 2 + 1)) { Math.sqrt(it.toDouble()) }
// will make the list the exact size needed
// a single intermediate list is made
// code is minimal and makes sense
progression.toList().map { Math.sqrt(it.toDouble()) }
My advice would be to choose whichever coding style you prefer. Kotlin is both an object-oriented and a functional language, meaning both of your approaches are correct.
Usually, functional constructs favor readability over performance; however, in some cases, procedural code will also be more readable. You should try to stick with one style as much as possible, but don't be afraid to switch some code if you feel like it's better suited to your constraints, either readability, performance, or both.
The converted code does not need the manual creation of the destination list, and can be simplified to:
val src = (0 until 1000000).toList()
val dest = src.filter { it % 2 == 0 }
.map { Math.sqrt(it.toDouble()) }
And as mentioned in the excellent answer by @glee8e, you can use a sequence to do lazy evaluation. The simplified code for using a sequence:
val src = (0 until 1000000).toList()
val dest = src.asSequence() // change to lazy
.filter { it % 2 == 0 }
.map { Math.sqrt(it.toDouble()) }
.toList() // create the final list
Note that the toList() at the end changes the sequence back into a final list, which is the only copy made during the processing. You can omit that step to remain with a sequence.
It is important to highlight the comments by @hotkey saying that you should not always assume that another iteration or a copy of a list causes worse performance than lazy evaluation. @hotkey says:
Sometimes several loops, even if they copy the whole collection, show good performance because of good locality of reference. See: Kotlin's Iterable and Sequence look exactly same. Why are two types required?
And excerpted from that link:
... in most cases it has good locality of reference thus taking advantage of CPU cache, prediction, prefetching etc. so that even multiple copying of a collection still works good enough and performs better in simple cases with small collections.
@glee8e says that there are similarities between Kotlin sequences and Java 8 streams; for detailed comparisons see: What Java 8 Stream.collect equivalents are available in the standard Kotlin library?

How to use Redis and geo proximity search to find two users at the same location?

I want to implement a service that, given users' geo coordinates, can detect whether two users are at the very same location in real time.
In order to do this in real time and at scale, it seems I should go with a distributed in-memory datastore like Redis. I have researched using geohashing, but the problem is that points close to each other may not always share the same hash prefix. And geohashing may be overkill, since I'm only interested in whether two users are close enough to be standing next to each other.
The simple solution of course is just to test whether pairs of geo coordinates fall within a small distance of each other. But AFAIK, Redis and other in-memory datastores don't have the geospatial indexing to support that kind of look-up.
What is the best way to go about implementing this?
This functionality is baked into Redis 3.2+.
But for older versions the problem still exists. I've taken Yin Qiwen's answer and created a module for Node, and you can see how it uses Redis by examining the code. His instructions are perfect and I was able to follow them for great results.
https://github.com/arjunmehta/node-georedis
The same algorithm is essentially what is used for the native commands.
It is very fast, and avoids any kind of intersections/haversine type operations. The coolest thing (I think) about Yin Qiwen's method is that the most computationally intense parts of the algorithm can be distributed to clients (instead of all happening in the DB or on the server).
It's not 100% precise and uses preconfigured distance steps, but for most applications you won't need exact precision I'd imagine.
I've also paraphrased Yin Qiwen's article at the GIS stack exchange.
Sorry for all the linkage. :P
Generally, this can be done with GeoHash and Redis's sorted sets. Here is a design I wrote a while ago describing how to implement a spatial index service on Redis:
https://github.com/yinqiwen/ardb/wiki/Spatial-Index
Maybe you can try this one:
Redis Geography Edition
You really want to try it, it works awesome.
:)
I realize this doesn't answer your question... but I don't think that it's the correct tool.
PostgreSQL + PostGIS can perform really, really well. You can configure PostgreSQL to keep pretty much as much of the database in memory as will fit.
PostGIS uses (I think) R-tree indexes, so it's incredibly fast at performing the kind of lookup you are interested in.
Using a backend that fires off websocket requests would let you operate pretty much in real time: any time your backend receives a person's GPS coordinates, it performs the spatial lookup and notifies the applicable clients through websockets.
The Redis geography edition mentioned by other answers in this thread has been integrated into Redis since version 3.2 (also see this earlier comment).
You can find the new commands here (in beta for now):
GEOADD
GEODIST
GEOHASH
GEOPOS
GEORADIUS
GEORADIUSBYMEMBER
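To make that concrete, here is a small sketch of calling a few of these commands through Jedis from Kotlin (assumptions: a Jedis version recent enough to expose the geo API directly, with import paths matching the Jedis 2.x/3.x line; the key and member names are made up):

import redis.clients.jedis.GeoUnit
import redis.clients.jedis.Jedis

fun main() {
    Jedis("localhost", 6379).use { jedis ->
        // GEOADD key longitude latitude member
        jedis.geoadd("users:locations", 13.361389, 38.115556, "user:1")
        jedis.geoadd("users:locations", 13.361390, 38.115557, "user:2")

        // GEODIST: distance between two members, here in meters
        val meters = jedis.geodist("users:locations", "user:1", "user:2", GeoUnit.M)
        println("distance: $meters m")

        // GEORADIUS: every member within 5 meters of a point
        val nearby = jedis.georadius("users:locations", 13.361389, 38.115556, 5.0, GeoUnit.M)
        nearby.forEach { println(it.memberByString) }
    }
}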
The Tarantool database keeps data in memory, persists it to disk as transaction logs, has an R-tree spatial index (not only 2-dimensional) and a number of nice operations on such an index (containment, overlapping, distance).
I use it in a commercial project for storing and querying records which describe objects in 3D space.
http://tarantool.org/doc/book/box/box_index.html
https://github.com/tarantool/tarantool/wiki/R-tree-index-quick-start-and-usage
The standard client and examples are in Lua, but there are a couple of other clients developed by the database authors. I use the Java client in a Scala application with success.
The database is also very fast - here's a scientific comparison with other databases (putting aside the spatial-db aspect):
http://airccse.org/journal/ijdms/papers/6314ijdms01.pdf
I would like to share some sample Java code for the Redis geography edition.
public void geoadd(String objectId, BigDecimal latitude, BigDecimal longitude) {
    log.info("geoadd(): {} {} {}", objectId, latitude, longitude);
    try (Jedis jedis = jedisPool.getResource()) {
        if (geoaddSha == null) {
            String script = "return redis.call('geoadd','" + GEOSET + "', ARGV[1], ARGV[2], KEYS[1])";
            geoaddSha = jedis.scriptLoad(script);
        }
        log.info("geoaddSha: {}", geoaddSha);
        log.info(jedis.evalsha(geoaddSha, 1, objectId, latitude.toString(), longitude.toString()).toString());
    }
}

@SuppressWarnings("unchecked")
public List<String> georadius(BigDecimal latitude, BigDecimal longitude, int radius, Unit unit) {
    log.info("georadius(): {} {} {} {}", latitude, longitude, radius, unit);
    try (Jedis jedis = jedisPool.getResource()) {
        if (georadiusSha == null) {
            String script = "return redis.call('georadius','" + GEOSET + "', ARGV[1], ARGV[2], ARGV[3], ARGV[4])";
            georadiusSha = jedis.scriptLoad(script);
        }
        log.info("georadiusSha: {}", georadiusSha);
        List<String> objectIdList = (List<String>) jedis.evalsha(georadiusSha, 0, latitude.toString(), longitude.toString(), String.valueOf(radius), unit.toString());
        log.info("objectIdList: {}", objectIdList);
        return objectIdList;
    }
}

public void remove(String objectId) {
    log.info("remove(): {}", objectId);
    try (Jedis jedis = jedisPool.getResource()) {
        jedis.zrem(GEOSET, objectId);
    }
}