How do I safely (i.e. crash-resistantly) serialize docs in spaCy? - serialization

I have a pretty big corpus of texts that I use to make a few million Doc objects.
I am somewhat familiar with the usage of DocBin(), but it looks like it keeps all the Docs in memory before dumping them to a file. That seems risky and crash-prone. Ideally I want to be able to continue where it stopped should a crash or unexpected exit occur.
I came up with two options:
1. Write every doc as a single file with Doc.to_disk.
2. After the first x iterations, write the DocBin to a file; then every x iterations after that, load the DocBin file, merge it with the Docs in memory, and write the merged DocBin back to the file.
Anything I've missed here? What are the up- and downsides to these methods? Any other suggestions?
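One cheaper variant of option 2 is to write each batch of x docs to its own numbered shard instead of re-reading and re-merging one growing file; shards that already exist are skipped on restart, which gives the resume-after-crash behaviour. Here is a minimal sketch, with `pickle` standing in for spaCy's `DocBin.to_disk` (the real version would build a `DocBin`, call `add(doc)` per doc, and write it with `to_disk`); all names are hypothetical:

```python
import pickle
from pathlib import Path

def serialize_in_shards(docs, out_dir, shard_size=1000):
    """Write docs in fixed-size shards; shards that already exist are
    skipped, so a crashed run can resume where it stopped."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard, shard_idx = [], 0

    def flush():
        nonlocal shard, shard_idx
        path = out_dir / f"shard_{shard_idx:05d}.bin"
        if not path.exists():  # resume: don't redo finished shards
            tmp = path.with_suffix(".tmp")
            # stand-in for DocBin(...).to_disk(tmp)
            tmp.write_bytes(pickle.dumps(shard))
            tmp.rename(path)  # rename last, so a crash never leaves a partial shard
        shard = []
        shard_idx += 1

    for doc in docs:
        shard.append(doc)
        if len(shard) == shard_size:
            flush()
    if shard:  # final, possibly short, shard
        flush()
```

On a rerun after a crash, iterating the same corpus in the same order lets the writer skip the finished shards; to load everything back, read the shard files in sorted order (with real spaCy, `DocBin().from_disk(path)` per shard).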


How to watermark a wave file with NAudio?

My application outputs wave files and I have the requirement to implement watermarking. By this I mean taking another wave file containing, say, 4 seconds of someone saying "This file is copyrighted material" and overlay this into the original file every 30 seconds.
I have tried many things but none quite works. I can pad the watermark to 30 seconds using OffsetSampleProvider and mix this with MixingSampleProvider, but that would only mix in one copy.
LoopStream is a WaveStream, not an ISampleProvider, and even if I could do the conversion, I think this would mix forever, because the LoopStream would never stop returning data.
This seems like a fairly basic use case, but I cannot figure out how to do it!
You're not too far off. You can turn a LoopStream into an ISampleProvider with the ToSampleProvider extension method, and then you can use the Take extension method to limit the duration of the looped stream to match the duration of the file you're mixing it with.

!Dimension too large when knitting file to PDF using rmarkdown in RStudio

I'm receiving the following error when I try to knit to a PDF:
! Dimension too large.
\fb@put@frame ...p \ifdim \dimen@ >\ht \@tempboxa
\fb@putboxa #1\fb@afterfra...
It's an extremely long block of code that I need to knit into a PDF (about 5000 lines), mostly data preprocessing. The output itself is quite small, maybe a line or so. Has anyone had this issue with huge blocks of code? If so, could you tell me how you solved it? I'm open to suggestions.
That's a LaTeX framed package error. RMarkdown tries to put all of that code into a single environment (I believe it's a snugshade environment, but I might be wrong), and that environment isn't ready for something that's going to stretch over many pages. The most I managed to get was about 1300 lines which were broken up into 48 pages of code.
The simplest solution would be to break that up into 4 or 5 pieces, but that might not be easy for you to do.
Next simplest might be not to show it as code at all: use echo = FALSE in the code chunk that runs it, and include it some other way (e.g. in a verbatim environment, or using the listings package). With that much code, showing it as a couple of hundred pages of text doesn't really make much sense anyway.

Realm database performance

I'm trying to use this database with React Native. First of all, I've found out that it can't retrieve plain objects; I have to retrieve all of the properties in the desired object tree recursively. And it takes about a second per object (~50 numeric props). Slow-ish!
Now, I've somehow imported ~9000 objects into it (each up to 1000 chars including titles). It looks like there is no easy way to import data, or at least it is not described in the docs. Anyway, that's acceptable. But now I've found that my database file (default.realm) is 3.49 GB (!). The JSON file I was importing is only 6.5 MB. I've opened default.realm with Realm Browser and it shows only those ~9000 objects, nothing else. Why so heavy?
Either I don't understand something very fundamental about this database, or it is complete garbage. I really hope I'm wrong. What am I doing wrong?
Please make sure you are not running in Chrome debug mode; that is probably why things seem so slow. As for the file size issue, it would be helpful if you posted your code so we can figure out why that is happening.

Best strategy to read file only once at launch?

I am reading a file, datafile, at the launch of my application. This is just a self-learning drill.
On Apple's developer website, under the heading Minimize File Access at Launch Time,
it says:
If you must read a file at launch time, do so only once.
So my question is: is there a standard or preferred way of doing this? At the moment I have an NSArray instance variable that I populate in - (void)viewDidUnload and never garbage collect. Is that good enough? Or should I be making use of the application object (I am not even sure if it is logical to say this).
There is no standard way to optimize, but there are some guidelines.
One basic idea of optimization is to do less. E.g., per the advice you cited: if the data of a file may be needed at multiple points in your code, it is better to read it from disk once and then pass a pointer to that data around within your program.
If the file is big enough to cause a stutter when you start your application, i.e. it takes more than ~20 ms to read and parse, you should consider reading it on a background thread/task and showing a 'loading…' state to the user.
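The "read it only once" guideline is language-agnostic. As an illustration (in Python rather than Objective-C, with hypothetical names), a memoized loader guarantees the file is read and parsed on the first call only, while every later caller gets the cached result:

```python
from functools import lru_cache
from pathlib import Path

@lru_cache(maxsize=1)
def load_datafile(path):
    # Reads and parses the file on the first call; subsequent calls
    # with the same path return the cached list without touching disk.
    return Path(path).read_text().splitlines()
```

Any part of the program can then call `load_datafile(...)` freely; only the first call pays the I/O cost, which is the same effect as populating an instance variable once and keeping it alive.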

Is it possible to use HTML tags in a GML file?

Is there any way to use an HTML anchor tag in a GML file? I want to create a hyperlink to a location/point in a GML file.
How can I do so?
Thanks in advance.
This is a little-known GML technique that GREATLY increases the power of Game Maker, and is well worth learning. As a note, it does NOT work in Studio, because of the countless new restrictions on commands. Go back to GM8.1 (I only ever use that now), and you should have no problem making use of this technique.
The technique is to write a program in another language through GML (batch, VBS, etc., or in this case, HTML), execute it through GML, then delete the program.
Quite simply, use the file_text commands to create a file with the correct content and extension, execute it with execute_program, and then delete it with file_delete.
Specifically for this script:
argument0 is the link, including the protocol.
argument1 is the anchor, minus the # (that's handled for you).
argument2 is the full browser path.
argument3 is important. This is the time in milliseconds the program will wait before deleting the temporary link file.
(The execute_program command, even when told to wait for the program to complete, continues as soon as the temp file is loaded. If external, the redirect takes some time depending on your connection, so deleting the temporary file halfway through will cause it to fail. 10 milliseconds worked fine for me. The program will hang for this time in this setup, but if you would like to set up an alarm based system to stop it from hanging, that wouldn't be too hard.)
In other uses of this technique without the use of the internet (I use small batch and vbs files a lot), the "hang time" (pun not intended) is usually not necessary.
In addition, the browser location will need to be changed for each different computer:
// build a temporary HTML page that redirects to the anchor when it loads
file=file_text_open_write(temp_directory+"\tempLink.html")
file_text_write_string(file,'<!DOCTYPE html>')
file_text_writeln(file)
file_text_write_string(file,'<html>')
file_text_writeln(file)
file_text_write_string(file,'<body onload="')
file_text_write_string(file,"location.href='")
file_text_write_string(file,argument0+"#"+argument1+"';")
file_text_write_string(file,'">')
file_text_writeln(file)
file_text_write_string(file,'</body>')
file_text_writeln(file)
file_text_write_string(file,'</html>')
file_text_close(file)
// open the page in the given browser, wait for the redirect, then clean up
execute_program(argument2,temp_directory+"\tempLink.html",true)
sleep(argument3)
file_delete(temp_directory+"\tempLink.html")
Sorry, I wish it were possible, but it's not unless you want to spend a lot of time with DLLs. BUT you can create a script and reuse it everywhere in your code...
script0(argument0, argument1...)