stateSnapshots on demand in mapWithState - apache-spark-sql

I am streaming data from Kafka (batch interval 10 sec), converting each RDD to a PairRDD, and then storing the data into state using mapWithState(). Below is the code:
JavaPairDStream<String, Object> transformedStream = stream
    .mapToPair(record -> new Tuple2<>(record.getKey(), record))
    .mapWithState(StateSpec.function(updateDataFuncGDM).numPartitions(32))
    .stateSnapshots();

transformedStream.foreachRDD(rdd -> {
    // if flag is true, put the RDD into a SQL table and run a query to do some aggregations like sum, avg, etc.
    // if flag is false, return;
});
Now I keep updating data in the state, and on a certain event I set the flag to true, put the data into the table, and run my calculations.
The problem is that since I take stateSnapshots() in every batch, this is not efficient: mapWithState keeps a lot of data in memory, and as the state grows it will only get worse. Also, since mapWithState checkpoints its data every 10 iterations, checkpointing takes a long time because the state is very large.
I want to take a snapshot of the state only on demand, i.e. only in the iteration of foreachRDD when the flag is true.
But I haven't found many ways to manipulate the state.
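For concreteness, the flag-gated foreachRDD described above might look like the sketch below; flag (an AtomicBoolean toggled by the external event) and the two helper methods are placeholders, not part of the original code:

transformedStream.foreachRDD(rdd -> {
    if (!flag.get()) {
        return; // ordinary batch: skip the expensive work
    }
    saveToSqlTable(rdd);  // e.g. convert to a DataFrame and write to the SQL table
    runAggregations();    // sum, avg, etc.
    flag.set(false);      // reset until the next event
});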

Related

Does Flux.collectList() introduce blocking behavior

Assume we have a Finder that returns multiple objects
fun findAll(): Flux<CarEntity> {
    return carRepository.findAll()
}
Because we want to apply some logic to all cars and passengers at the same time, we convert it to a Mono through .collectList()
val carsMono = carsFinder.findAll().collectList()
val passengerMono = carsFinder.findAll().collectList()
return Mono.zip(carsMono, passengerMono)
In other words,
We have a list of entities of undefined length
We gather every item into a list until there are no more - how is this done without blocking the thread?
No, collectList() is not a blocking operator, but we need to be careful when we use it.
With this operator we wait until the upstream has emitted all of its elements. If the upstream is a never-ending stream (for example one fed by a KafkaTemplate or a processor), collectList() will keep collecting elements until it fails with an OutOfMemoryError.
Likewise, when we have a big stream, such as findAll() from a database, it will collect one big list, consuming a lot of RAM and possibly also running out of memory.
If you know you are dealing with a small number of elements, you are safe.
If you can avoid collecting and instead process the elements as they stream through, that would be better.
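As a minimal sketch of that last point (the names enrich and EnrichedCar are hypothetical, not from the question), the per-element logic can stay inside the Flux so memory use stays bounded:

// Sketch: process each car as it is emitted instead of collecting the whole list first
fun findAllEnriched(): Flux<EnrichedCar> {
    return carsFinder.findAll()        // Flux<CarEntity>, emitted one element at a time
        .map { car -> enrich(car) }    // apply the per-car logic in streaming fashion
}

If the logic genuinely needs both complete lists (as with the Mono.zip above), collectList() is the right tool; just make sure the upstream is finite and reasonably small.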

How to implement a time based length queue in F#?

This is a follow-up to the question: How to optimize this moving average calculation, in F#
To summarize the original question: I need to make a moving average of a set of data I collect; each data point has a timestamp and I need to process data up to a certain timestamp.
This means that I have a list of variable size to average.
The original question implemented this as a queue where elements get added and are eventually removed as they get too old.
But, in the end, iterating through a queue to make the average is slow.
Originally the bulk of the CPU time was spent finding the data to average, but once that problem was removed by keeping only the needed data in the first place, the Seq.average call proved to be very slow.
It looks like the original mechanism (based on Queue<>) is not appropriate and this question is about finding a new one.
I can think of two solutions:
implement this as a circular buffer large enough to accommodate the worst-case scenario; this would allow using an array and computing the sum in at most two contiguous passes.
quantize the data in buckets and pre-sum it, but I'm not sure if the extra complexity will help performance.
Is there any implementation of a circular buffer that would behave similarly to a Queue<>?
The fastest code, so far, is:
open System
open System.Collections.Generic

module PriceMovingAverage =

    // moving average queues
    let private timeQueue = Queue<DateTime>()
    let private priceQueue = Queue<float>()

    // update moving average
    let updateMovingAverage (tradeData: TradeData) priceBasePeriod =
        // add the new price
        timeQueue.Enqueue(tradeData.Timestamp)
        priceQueue.Enqueue(float tradeData.Price)
        // remove the items older than the price base period
        let removeOlderThan = tradeData.Timestamp - priceBasePeriod
        let rec dequeueLoop () =
            let p = timeQueue.Peek()
            if p < removeOlderThan then
                timeQueue.Dequeue() |> ignore
                priceQueue.Dequeue() |> ignore
                dequeueLoop()
        dequeueLoop()

    // get the moving average
    let getPrice () =
        try
            Some (
                priceQueue
                |> Seq.average // <- all CPU time goes here
                |> decimal
            )
        with _ ->
            None
Based on a queue length of 10-15k I'd say there's definitely scope to consider batching trades into precomputed blocks of maybe around 100 trades.
Add a few types:
type TradeBlock = {
    data: TradeData array
    startTime: DateTime
    endTime: DateTime
    sum: float
    count: int
}

type AvgTradeData =
    | Trade of TradeData
    | Block of TradeBlock
I'd then make the moving average use a DList<AvgTradeData> (https://fsprojects.github.io/FSharpx.Collections/reference/fsharpx-collections-dlist-1.html). The first element in the DList is summed manually if its startTime falls outside the price period, and it is removed from the list once the price period exceeds its endTime. The most recent elements are kept at the tail as Trade values until 100 have been appended, and then they are all removed from the tail and turned into a single TradeBlock.
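Independently of the block scheme, a simpler alternative worth sketching (not the DList version; the module and function names below are new, not from the question) is to maintain a running sum alongside the queues, which makes the price lookup O(1):

open System
open System.Collections.Generic

module RunningMovingAverage =

    let private timeQueue = Queue<DateTime>()
    let private priceQueue = Queue<float>()
    let mutable sum = 0.0  // running sum of everything currently in priceQueue

    let update (timestamp: DateTime) (price: float) (priceBasePeriod: TimeSpan) =
        timeQueue.Enqueue(timestamp)
        priceQueue.Enqueue(price)
        sum <- sum + price
        // evict entries older than the base period, keeping the sum in sync
        let removeOlderThan = timestamp - priceBasePeriod
        while timeQueue.Count > 0 && timeQueue.Peek() < removeOlderThan do
            timeQueue.Dequeue() |> ignore
            sum <- sum - priceQueue.Dequeue()

    // O(1): no iteration over the queue at all
    let getPrice () =
        if priceQueue.Count = 0 then None
        else Some (decimal (sum / float priceQueue.Count))

The trade-off is floating-point drift in a long-running sum; the TradeBlock design bounds that drift because each block's sum is recomputed over at most ~100 trades.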

StackExchange.Redis: Does a transaction hit the server multiple times?

When I execute a transaction (MULTI/EXEC) via SE.Redis, does it hit the server multiple times? For example,
ITransaction tran = Database.CreateTransaction();
tran.AddCondition(Condition.HashExists(cacheKey, oldKey));
HashEntry hashEntry = GetHashEntry(newKeyValuePair);
Task fieldDeleteTask = tran.HashDeleteAsync(cacheKey, oldKey);
Task hashSetTask = tran.HashSetAsync(cacheKey, new[] { hashEntry });
if (await tran.ExecuteAsync())
{
    await fieldDeleteTask;
    await hashSetTask;
}
Here I am executing two tasks in the transaction. Does this mean I hit the server 4 times? 1 for MULTI, 1 for delete, 1 for set, 1 for exec? Or is SE.Redis smart enough to buffer the tasks in local memory and send everything in one shot when we call ExecuteAsync?
It has to send multiple commands, but it doesn't pay latency costs per command; specifically, when you call Execute[Async] (and not before) it issues a pipeline (all together, not waiting for replies) of:
WATCH cacheKey                   // observes any competing changes to cacheKey
HEXISTS cacheKey oldKey          // see if the existing field exists
MULTI                            // starts the transacted commands
HDEL cacheKey oldKey             // delete the existing field
HSET cacheKey newField newValue  // assign the new field
then it pays latency costs to get the result of the HEXISTS, because only when that is known can it decide whether to proceed with the transaction (issuing EXEC and checking the result - which can be negative if the WATCH detects a conflict), or whether to throw everything away (DISCARD).
So: either way six commands are going to be issued, but in terms of latency you're paying for two round trips, due to the need for a decision point before the final EXEC/DISCARD. In many cases, though, this can itself be further masked by the reality that the result of the HEXISTS could already be on its way back to you before we've even got as far as checking it, especially if you have any non-trivial bandwidth, for example a large newValue.
However! As a general rule: anything you can do with redis MULTI/EXEC can be done faster, more reliably, and with fewer bugs by using a Lua script instead. It looks like what we're actually trying to do here is:
for the hash cacheKey, if (and only if) the field oldField exists: remove oldField and set newField to newValue
We can do this very simply in Lua, because Lua scripts are executed at the server from start to finish without interruption from competing connections. This means that we don't need to worry about atomicity, i.e. other connections changing data that we're making decisions with. So:
var success = (bool)await db.ScriptEvaluateAsync(@"
    if redis.call('hdel', KEYS[1], ARGV[1]) == 1 then
        redis.call('hset', KEYS[1], ARGV[2], ARGV[3])
        return true
    else
        return false
    end
", new RedisKey[] { cacheKey }, new RedisValue[] { oldField, newField, newValue });
The verbatim string literal here is our Lua script, noting that we don't need to do a separate HEXISTS/HDEL any more - we can make our decision based on the result of the HDEL. Behind the scenes, the library performs SCRIPT LOAD operations as needed, so: if you are doing this lots of times, it doesn't need to send the script itself over the network more than once.
From the perspective of the client: you are now only paying a single latency fee, and we're not sending the same things repeatedly (the original code sent cacheKey four times, and oldKey twice).
(A note on the choice of KEYS vs ARGV: the distinction between keys and values is important for routing purposes, in particular on sharded environments such as redis-cluster. Sharding is done based on the key, and the only key here is cacheKey; the field identifiers in hashes do not impact sharding, so for the purposes of routing they are values, not keys - and as such, you should convey them via ARGV, not KEYS. This won't impact you on a single redis-server, but on redis-cluster the difference is very important: if you get it wrong, the server will most likely reject your script, thinking that you are attempting a cross-slot operation. Multi-key commands on redis-cluster are only supported when all the keys are on the same slot, usually achieved via "hash tags" - for example, {order:42}:items and {order:42}:status hash to the same slot because only the part inside the braces is hashed.)

How to cache data during the first epoch correctly (Tensorflow, dataset)?

I'm trying to use the cache transformation for a dataset. Here is my current code (simplified):
dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=1)
dataset = dataset.apply(tf.contrib.data.shuffle_and_repeat(buffer_size=5000, count=1))
dataset = dataset.map(_parser_a, num_parallel_calls=12)
dataset = dataset.padded_batch(
    20,
    padded_shapes=padded_shapes,
    padding_values=padding_values
)
dataset = dataset.prefetch(buffer_size=1)
dataset = dataset.cache()
After the first epoch, I received the following error message:
The calling iterator did not fully read the dataset we were attempting
to cache. In order to avoid unexpected truncation of the sequence, the
current [partially cached] sequence will be dropped. This can occur if
you have a sequence similar to dataset.cache().take(k).repeat().
Instead, swap the order (i.e. dataset.take(k).cache().repeat())
Then the code proceeded, but it still read data from the hard drive instead of from the cache. So, where should I place dataset.cache() to avoid the error?
Thanks.
The implementation of the Dataset.cache() transformation is fairly simple: it builds up a list of the elements that pass through it as you iterate over it completely the first time, and it returns elements from that list on subsequent attempts to iterate over it. If the first pass performs only a partial pass over the data, the list is incomplete, and TensorFlow doesn't try to use the cached data, because it doesn't know whether the remaining elements will be needed; in general it might need to reprocess all the preceding elements to compute them.
If you modify your program to consume the entire dataset, iterating over it until tf.errors.OutOfRangeError is raised, the cache will contain a complete list of the elements in the dataset, and it will be used on all subsequent iterations.
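As a sketch of a placement that avoids the error (same tf.contrib-era API as the question; num_epochs is a placeholder for however many epochs you train), put cache() before the shuffle/repeat so every epoch after the first is served from memory:

dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=1)
dataset = dataset.map(_parser_a, num_parallel_calls=12)
dataset = dataset.cache()  # filled during the first complete pass over the data
dataset = dataset.apply(tf.contrib.data.shuffle_and_repeat(buffer_size=5000, count=num_epochs))
dataset = dataset.padded_batch(
    20,
    padded_shapes=padded_shapes,
    padding_values=padding_values
)
dataset = dataset.prefetch(buffer_size=1)

Because cache() snapshots whatever is upstream of it, keeping it before the shuffle also means each epoch still gets a fresh shuffle order while reading the parsed records from memory.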

Replay Recorded Data Stream in F#

I have recorded real-time stock quotes in an SQL database with fields Id, Last, and TimeStamp, where Last is the current stock price (as a double) and TimeStamp is the DateTime when the price change was recorded.
I would like to replay this stream in the same way it came in, meaning if a price change was originally 12 seconds apart then the price change event firing (or something similar) should be 12 seconds apart.
In C# I might create a collection, sort it by the DateTime then fire an event using the difference in time to know when to fire off the next price change. I realize F# has a whole lot of cool new stuff relating to events, but I don't know how I would even begin this in F#. Any thoughts/code snippets/helpful links on how I might proceed with this?
I think you'll love the F# solution :-).
To keep the example simple, I'm storing the prices and timestamps in a list of tuples (the first element is the delay since the previous update and the second element is the price). It shouldn't be too difficult to turn your input data into this format. For example:
let prices = [ (0, 10.0); (1000, 10.5); (500, 9.5); (2500, 8.5) ]
Now we can create a new event that will be used to replay the process. After creating it, we immediately attach a handler that will print the price updates:
let evt = new Event<float>()
evt.Publish.Add(printfn "Price updated: %f")
The last step is to implement the replay - this can be done using asynchronous workflow that loops over the values, asynchronously waits for the specified time and then triggers the event:
async { for delay, price in prices do
            do! Async.Sleep(delay)
            evt.Trigger(price) }
|> Async.StartImmediate
I'm starting the workflow using StartImmediate which means that it will run on the current thread (the waiting is asynchronous, so it doesn't block the thread). Keeping everything single-threaded makes it a bit simpler (e.g. you can safely access GUI controls).
EDIT To wrap the functionality in some component that could be used from other parts of the application, you could define a type like this:
type ReplayDataStream(prices) =
    let evt = new Event<float>()
    member x.Replay() =
        // Start the asynchronous workflow here
    member x.PriceChanged =
        evt.Publish
The users can then create an instance of the type, attach their event handlers using stream.PriceChanged.Add(...), and then start replaying the recorded changes using Replay().
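For completeness, here is a sketch of the filled-in type, reusing the workflow from above (prices is assumed to be the same (delay, price) list):

type ReplayDataStream(prices: (int * float) list) =
    let evt = new Event<float>()

    // Replays the recorded stream, firing PriceChanged with the original timing
    member x.Replay() =
        async {
            for delay, price in prices do
                do! Async.Sleep(delay)
                evt.Trigger(price)
        }
        |> Async.StartImmediate

    member x.PriceChanged = evt.Publish

Usage: create an instance, subscribe via stream.PriceChanged.Add(printfn "Price updated: %f"), then call stream.Replay().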