stream data into my vega view progressively - vega

I am using Papaparse to parse the CSV, and on each chunk of data I run an insert into the view, like so:
Papa.parse(createReadStream('geo.csv'), {
  header: true,
  chunk(data) {
    console.log('chunk: ', data.data.length)
    // data.data.length > 0 && tally.push(...data.data)
    view.insert('test1', data.data)
  },
  complete() {
    view.data('test1').length // this will return 0
    console.log('memory:', process.memoryUsage().heapUsed / 1024 / 1024, ` == time: ${Date.now() - start}`)
  },
})
The only way I've found to keep the inserted data is to either:
call run() after each insert, i.e. insert('test1', data.data).run(), to "commit" it, but I do not need the view to run yet, not until I have all of the data (which is why I call run() in the complete() callback); or
parse everything at once in memory and then pass it using data('test1', allRows), which I think will use a lot more memory.
How do I progressively stream data into my Vega view? Note that I am running this inside a web worker; as far as I know, the Vega loader does not support the browser's File instance (only URLs in the browser environment), which is why I'm using Papaparse.

You need to run runAsync and await it before inserting more data into the view, otherwise updates may be lost. See https://github.com/vega/vega/issues/2513 for more information on this.
If you don't care about intermediate updates while more data comes in, I would recommend collecting all the data you want to insert and then adding it at once. Memory won't be an issue, since Vega keeps the full dataset in memory anyway.
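For illustration, a minimal sketch of both options, reusing the names from the question (view, 'test1', geo.csv); the second argument Papa Parse passes to chunk is the parser handle, which lets you pause the stream while the dataflow catches up:
// Option 1: await the dataflow between chunks so no inserts are lost.
Papa.parse(createReadStream('geo.csv'), {
  header: true,
  chunk(results, parser) {
    parser.pause()                               // hold back the next chunk
    view.insert('test1', results.data)           // queue the rows
    view.runAsync().then(() => parser.resume())  // evaluate, then keep parsing
  },
  complete() {
    console.log('rows in view:', view.data('test1').length)
  },
})

// Option 2 (recommended above): buffer everything, then insert and run once.
const rows = []
Papa.parse(createReadStream('geo.csv'), {
  header: true,
  chunk(results) {
    rows.push(...results.data)
  },
  complete() {
    view.insert('test1', rows).runAsync()
  },
})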

Related

Neo4j 3.5's embedded database does not seem to persist data

I am trying to build a small command line tool that will store data in a Neo4j graph. To do this, I have started experimenting with Neo4j 3.5's embedded databases. After putting together the following example, I have found that either the nodes I am creating are not being saved to the database, or the method of database creation is overwriting my previous run.
The Example:
fun main() {
    //Spin up data base
    val graphDBFactory = GraphDatabaseFactory()
    val graphDB = graphDBFactory.newEmbeddedDatabase(File("src/main/resources/neo4j"))
    registerShutdownHook(graphDB)
    val tx = graphDB.beginTx()
    graphDB.createNode(Label.label("firstNode"))
    graphDB.createNode(Label.label("secondNode"))
    val result = graphDB.execute("MATCH (a) RETURN COUNT(a)")
    println(result.resultAsString())
    tx.success()
}

private fun registerShutdownHook(graphDb: GraphDatabaseService) {
    // Registers a shutdown hook for the Neo4j instance so that it
    // shuts down nicely when the VM exits (even if you "Ctrl-C" the
    // running application).
    Runtime.getRuntime().addShutdownHook(object : Thread() {
        override fun run() {
            graphDb.shutdown()
        }
    })
}
I would expect that every time I run main the resulting query count will increase by 2.
That is currently not the case and I can find nothing in the docs that references a different method of opening an already created embedded database. Am I trying to use the embedded database incorrectly or am I missing something? Any help or info would be appreciated.
Build info:
Kotlin JVM 1.4.21
Neo4j-community-3.5.35
Transactions in Neo4j 3.x have a three-stage model:
create
success / failure
close
You missed the third stage, close, which is what actually commits or rolls back the transaction.
You can use Kotlin's use, since Transaction is an AutoCloseable.
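For illustration, a minimal sketch of the question's main rewritten with use (this assumes the standard Kotlin/JVM setup where the AutoCloseable.use extension from kotlin-stdlib-jdk7 is on the classpath):
import org.neo4j.graphdb.Label
import org.neo4j.graphdb.factory.GraphDatabaseFactory
import java.io.File

fun main() {
    val graphDB = GraphDatabaseFactory()
        .newEmbeddedDatabase(File("src/main/resources/neo4j"))

    // use() calls close() when the block exits; because success() was called,
    // close() commits the transaction instead of rolling it back.
    graphDB.beginTx().use { tx ->
        graphDB.createNode(Label.label("firstNode"))
        graphDB.createNode(Label.label("secondNode"))
        val result = graphDB.execute("MATCH (a) RETURN COUNT(a)")
        println(result.resultAsString())
        tx.success()
    }

    // shut down explicitly here instead of relying on the question's shutdown hook
    graphDB.shutdown()
}
With the transaction actually closed (and therefore committed), the count returned by the query should grow by 2 on every run.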

Changing the GemFire query ResultSender batch size

I am experiencing a performance issue related to the default batch size of the query ResultSender using client/server config. I believe the default value is 100.
If I run a simple query to get keys (with some order-by columns due to the PARTITION Region type), this default batch size causes too many chunks to be sent back, even for 1000 records. In my tests, even though the total query time is less than 100 ms, the app takes more than 10 seconds to process those chunks.
Reading between the lines in your problem statement, it seems you are:
Executing an OQL query on a PARTITION Region (PR).
Running the query inside a Function as recommended when executing queries on a PR.
Sending batch results (as opposed to streaming the results).
I also assume, since you posted exclusively in the #spring-data-gemfire channel, that you are using Spring Data GemFire (SDG) to:
Execute the query (e.g. by using the SDG GemfireTemplate; of course, you could also be using the GemFire Query API inside your Function directly)?
Implement the server-side Function using SDG's Function annotation support?
And possibly (indirectly) use SDG's BatchingResultSender, as described in the documentation?
NOTE: The default batch size in SDG is 0, NOT 100. Zero means stream the results individually.
Regarding #2 & #3, your implementation might look something like the following:
@Component
class MyApplicationFunctions {

    @GemfireFunction(id = "MyFunction", batchSize = "1000")
    public List<SomeApplicationType> myFunction(FunctionContext functionContext) {

        RegionFunctionContext regionFunctionContext =
            (RegionFunctionContext) functionContext;

        Region<?, ?> region = regionFunctionContext.getDataSet();

        if (PartitionRegionHelper.isPartitionRegion(region)) {
            region = PartitionRegionHelper.getLocalDataForContext(regionFunctionContext);
        }

        GemfireTemplate template = new GemfireTemplate(region);
        String OQL = "...";
        SelectResults<?> results = template.query(OQL); // or `template.find(OQL, args);`

        List<SomeApplicationType> list = ...;
        // process results, convert to SomeApplicationType, add to list
        return list;
    }
}
NOTE: Since you are most likely executing this Function "on Region", the FunctionContext type will actually be a RegionFunctionContext in this case.
The batchSize attribute on the SDG @GemfireFunction annotation (used for Function "implementations") allows you to control the batch size.
Instead of using SDG's GemfireTemplate to execute queries, you can, of course, use the GemFire Query API directly, as mentioned above.
If you need even more fine-grained control over "result sending", then you can simply "inject" the ResultSender provided by GemFire into the Function, even if the Function is implemented with SDG, as shown above. For example:
@Component
class MyApplicationFunctions {

    @GemfireFunction(id = "MyFunction")
    public void myFunction(FunctionContext functionContext, ResultSender resultSender) {
        ...
        SelectResults<?> results = ...;
        // now process the results and use the `resultSender` directly
    }
}
This allows you to "send" the results however you see fit, as required by your application.
You can batch/chunk results, stream, whatever.
Although, you should be mindful of the "receiving" side in this case!
The one thing that might not be apparent to the average GemFire user is that GemFire's default ResultCollector implementation collects all the results before returning them to the application. This means the receiving side does not support streaming or batching/chunking of the results, i.e. it cannot process them immediately as the server sends them (whether streamed, batched/chunked, or otherwise).
Once again, SDG helps you out here since you can provide a custom ResultCollector on the Function "execution" (client-side), for example:
@OnRegion("SomePartitionRegion", resultCollector = "myResultCollector")
interface MyApplicationFunctionExecution {
    void myFunction();
}
In your Spring configuration, you would then have:
@Configuration
class ApplicationGemFireConfiguration {

    @Bean
    ResultCollector myResultCollector() {
        return ...;
    }
}
Your "custom" ResultCollector could return results as a stream, a batch/chunk at a time, etc.
In fact, I have prototyped a "streaming" ResultCollector implementation that will eventually be added to SDG, here.
Anyway, this should give you some ideas on how to handle the performance problem you seem to be experiencing. 1000 results is not a lot of data so I suspect your problem is mostly self-inflicted.
Hope this helps!
John,
Just to clarify, I use a client/server topology (actually WAN, but that is not important here). My client is a Spring Boot web app with a Kendo grid as the UI. Users can filter/sort on any combination of the columns, which is passed to the Spring Boot app to generate dynamic OQL and build the pagination. So far, apart from being dynamic, my OQL queries are quite straightforward. I do not want to introduce server-side Functions due to the complexity of our global deployment process, but I can if you think that is something I have to do.
Again, thanks for your answers.

Issues uploading/downloading files in akka-http/akka-streams

I'm trying to use akka-streams and akka-http and the alpakka library to download/upload files to Amazon S3. I am seeing two issues which might be related...
I can only download very small files, the largest being about 8 KB.
I can't upload larger files; the upload fails with the message
Error during processing of request: 'Substream Source has not been
materialized in 5000 milliseconds'. Completing with 500 Internal
Server Error response. To change default exception handling behavior,
provide a custom ExceptionHandler.
akka.stream.impl.SubscriptionTimeoutException:
Substream Source has not been materialized in 5000 milliseconds
Here are my routes
pathEnd {
  post {
    fileUpload("attachment") {
      case (metadata, byteSource) => {
        val writeResult: Future[MultipartUploadResult] =
          byteSource.runWith(client.multipartUpload("bucketname", key))
        onSuccess(writeResult) { result =>
          complete(result.location.toString())
        }
      }
    }
  }
} ~
path("key" / Segment) {
  (sourceSystem, sourceTable, sourceId) =>
    get {
      val result: Future[ByteString] =
        client.download("bucketname", key).runWith(Sink.head)
      onSuccess(result) {
        complete(_)
      }
    }
}
Trying to download a file of, say, 100 KB ends up fetching a truncated version of the file, usually around 16-25 KB.
Any help appreciated
Edit: For the download issue, I took Stefano's suggestion and got
[error] found : akka.stream.scaladsl.Source[akka.util.ByteString,akka.NotUsed]
[error] required: akka.http.scaladsl.marshalling.ToResponseMarshallable
This made it work
complete(HttpEntity(ContentTypes.`application/octet-stream`, client.download("bucketname", key).runWith(Sink.head)))
1) On the download issue: by calling
val result: Future[ByteString] =
  client.download("bucketname", key).runWith(Sink.head)
you are streaming all the data from S3 into memory, and then serve the result.
Akka HTTP has streaming support that allows you to stream bytes straight from a source, without buffering them all in memory. More info on this can be found in the docs. Practically, this means the complete directive can take a Source[ByteString, _], as in:
...
get {
  complete(client.download("bucketname", key))
}
2) On the upload issue: you can try to tweak Akka HTTP's akka.http.server.parsing.max-content-length setting:
# Default maximum content length which should not be exceeded by incoming request entities.
# Can be changed at runtime (to a higher or lower value) via the `HttpEntity::withSizeLimit` method.
# Note that it is not necessarily a problem to set this to a high value as all stream operations
# are always properly backpressured.
# Nevertheless you might want to apply some limit in order to prevent a single client from consuming
# an excessive amount of server resources.
#
# Set to `infinite` to completely disable entity length checks. (Even then you can still apply one
# programmatically via `withSizeLimit`.)
max-content-length = 8m
Resulting code to test this would be something along the lines of:
withoutSizeLimit {
  fileUpload("attachment") {
    ...
  }
}

Limiting simultaneous downloads using RxAlamofire

My app will download files from a server, and I only want one download to be in progress at a time. How could this be done with RxAlamofire? I might simply be missing an Rx operator.
Here's the rough code:
Observable
    .from(paths)
    .flatMapWithIndex({ (ip, idx) -> Observable<(Int, Video)> in
        let v = self.files![ip.row] as! Video
        return Observable.from([(idx, v)])
    })
    .flatMap { (item) -> Observable<Video> in
        let req = URLRequest(url: item.1.downloadURL())
        return Api.alamofireManager()
            .rx
            .download(req, to: { (url, response) -> (destinationURL: URL, options: DownloadRequest.DownloadOptions) in
                ...
            })
            .flatMap({ $0.rx.progress() })
            .flatMap { (progress) -> Observable<Float> in
                // Update a progress bar
                ...
            }
            // Only propagate finished items
            .filter { $0 >= 1.0 }
            // Return the item itself
            .flatMap { _ in Observable.from([item.1]) }
    }
    .subscribe(
        onNext: { (res) in
            ...
        },
        onError: { (error) in
            ...
        },
        onCompleted: {
            ...
        }
    )
My problem is that a) RxAlamofire will download multiple items at the same time, and b) the (progress) block is called multiple times for those various items (with different progress info for each, causing the UI to behave a bit strangely).
How to ensure the downloads are done one by one instead of simultaneously?
Does alamofireManager().rx.download() download concurrently or serially?
I'm not sure which it does, so test that first. Isolate this code and see if it does execute multiple downloads at once. If it does, then read up on the documentation for serial downloads instead of concurrent downloads.
If it downloads one at a time, then the progress bar update issue has something to do with your Rx code. If it doesn't download one at a time, then we just need to read up on Alamofire's documentation on how to download one at a time.
Complex transformations and side effects
Something to consider is that your data streams are becoming more complex and difficult to debug because so many things are happening in one stream. Because of the multiple flatMaps, there can be many more emissions affecting the progress bar update; the numerous flatMap operations that each produce an Observable may well be the cause of the progress bar being updated multiple times.
Complex data streams
In one data stream you (a) performed the network call, (b) updated the progress bar, (c) filtered finished videos, and (d) went back to the video you wanted, by using flatMapWithIndex at the start to pair the index with the video model so that you can return to the model at the end. Kind of complicated... My guess is that the weird progress bar updates might be caused by creating a hot observable on each call of $0.rx.progress().
I made a GitHub gist of my Rx Playground that tries to model what you're trying to do.
In functional reactive programming, it would be much more readable and easier to debug if you first define your data streams/observables. In my gist, I began with the observables and how I planned to model the download progress.
This code will avoid the concurrency issues if the RxAlamofire query downloads one at a time, and it properly presents the progress value for a UIProgressBar.
Side note
Do you need to track the individual download progress for each item, or do you want your progress bar to simply increment per finished download item?
Also, be wary of the possible dangers of misusing a chain of multiple flatMaps, as explained here.
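If testing shows the downloads do run concurrently, one RxSwift operator worth looking at is concatMap, which subscribes to each inner observable only after the previous one has completed. A minimal sketch with a placeholder download function (the real pipeline would be the RxAlamofire call from the question, mapped back to the Video once it finishes):
import Foundation
import RxSwift

struct Video {
    let url: URL
}

// Placeholder: stands in for the RxAlamofire download pipeline. The key
// property is that the returned observable completes when the download is done.
func download(_ video: Video) -> Observable<Video> {
    return Observable.just(video)
}

let disposeBag = DisposeBag()
let videos = [
    Video(url: URL(string: "https://example.com/a.mp4")!),
    Video(url: URL(string: "https://example.com/b.mp4")!)
]

// concatMap only subscribes to the next download after the previous one
// completes, so the files are fetched one at a time.
Observable.from(videos)
    .concatMap { download($0) }
    .subscribe(onNext: { video in
        print("finished \(video.url)")
    })
    .disposed(by: disposeBag)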

mongoskin and bulk operations? (mongodb 3.2, mongoskin 2.1.0 & 2.2.0)

I've read the various bits of literature, and I'm seeing the same problem that the questioner in
https://stackoverflow.com/a/25636911
was seeing.
My code looks like this:
coll = db.collection('foobar');
bulk = coll.initializeUnorderedBulkOp();
for (entry of messages) {
  bulk.insert(entry);
}
bulk.execute(function (err, result) {
  if (err) throw err;
  inserted += result.nInserted;
});
bulk is an object
bulk.insert works just fine
bulk.execute is undefined
The answer in the Stack Overflow question said "only the callback flavor of db.collection() works", so I tried:
db.collection('foobar', function (err, coll) {
  logger.debug('got here');
  if (err) throw err;
  bulk = coll.initializeUnorderedBulkOp();
  // ... same code as before
We never get to "got here", implying that the "callback flavor" of db.collection() was dropped in 3.0?
Unfortunately, my Python is way better than my JS prototyping skills, so looking at the mongoskin source code doesn't make much sense to me.
What is the right way, with mongoskin 2.1.0 and the 2.2.0 mongodb JS driver, to do a bulk operation, or is this not implemented at all anymore?
There are at least two answers:
(1) Use insert, but in its array form, so you insert multiple documents with one call. Works like a charm (a minimal sketch is at the end of this answer).
(2) If you really need bulk operations, you'll need to switch from mongoskin to the native mongo interface, but just for that one call.
This kinda sucks because it's using a private interface in mongoskin, but it's also the most efficient way to stick with mongoskin:
(example in coffeescript)
# bulk write all the messages in "messages" to a collection
# and insert the server's current time in the recorded field of
# each message
# use the _native interface and wait for the callback to get the collection
db._native.collection collectionName, (err, collection) ->
  bulk = collection.initializeUnorderedBulkOp()
  for message in messages
    bulk.find
      _id: message._id
    .upsert().updateOne
      $set: message
      $currentDate:
        recorded: true
  bulk.execute (err, result) ->
    # ... error and result checking code
Or (3), if you want to implement that $currentDate behavior and not a generic bulk operation: refer to solution (1), but use the not-very-well-documented BSON Timestamp() object with no arguments:
for msg in messages
  msg.recorded = Timestamp()
db.mycollection.insert messages, (err, result) ->
  # ... error and result checking code
which will do a bulk insert and set the recorded field to the DB server's time at the moment each record is written to the db.
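For option (1), a minimal sketch of the array form, reusing the collection and variable names from the question (the insertedCount field shown is the 2.x driver's insert result shape):
coll = db.collection('foobar')
// One round trip: pass the whole messages array instead of looping over bulk.insert()
coll.insert(messages, function (err, result) {
  if (err) throw err
  // with the 2.x driver the callback result should expose insertedCount
  console.log('inserted', result.insertedCount, 'documents')
})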