Which will perform better: parsing a string each iteration, or parsing once and storing? - vb.net

I'm creating a vb.net winforms application that will take in user-given strings, parse them, and print out labels with variable information. The given string will be used in all the labels, but the variable part of the string will change with each label.
My question is: is it better to parse the strings one time, then store those values in arrays, or to parse the string each time a label is printed? Which will perform better? Which is better practice? What is the proper way to test something like this?

If saving data in memory is done for performance, my preference would be not to do it until I knew the parsing was actually a performance problem, which I detect by random-pausing.
Generally, my rule of thumb is: the less data structure, the better.
If you have data structure, you have to worry about it getting stale, and any time you change the program you have more code to modify and put bugs into.
Besides, parsing does not need to be slow, especially compared to whatever else you're doing.

Parsing is definitely heavy stuff.
In this particular instance, if you do not have any memory constraints, then you should parse once and store to speed up your application, especially if it is going to be something that is used by people :)
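To the "what is the proper way to test this" part of the question: time both approaches against your real data. Below is a minimal Stopwatch sketch, assuming a semicolon-delimited input and a made-up label count (Split stands in for whatever your actual parsing does):

    ' Rough micro-benchmark; run in Release mode with realistic data sizes.
    Dim input As String = "Widget;Lot 42;2015-06-01;Aisle 7"
    Dim labelCount As Integer = 100000

    Dim sw = System.Diagnostics.Stopwatch.StartNew()
    For i As Integer = 1 To labelCount
        Dim parts As String() = input.Split(";"c)       ' re-parse for every label
    Next
    sw.Stop()
    Console.WriteLine("Parse every label:    " & sw.ElapsedMilliseconds & " ms")

    sw.Restart()
    Dim cached As String() = input.Split(";"c)          ' parse once up front
    For i As Integer = 1 To labelCount
        Dim parts As String() = cached                  ' reuse the stored values
    Next
    sw.Stop()
    Console.WriteLine("Parse once and store: " & sw.ElapsedMilliseconds & " ms")

If the "parse every label" figure is negligible next to the time spent actually printing, caching buys you little; if it dominates, store the parsed arrays.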

Related

Isn't Redis storing all of its data as strings an overhead in terms of memory usage?

Isn't storing everything as strings a big overhead in terms of memory consumption?
e.g.: storing "304.2" as a string in any application is more expensive than storing 304.2 as a float/double.
Even if the value were indeed stored internally as a numeric value, isn't delegating to every client the responsibility of "parsing" the string another source of inefficiency?
I was getting super excited to start using Redis, but my use case is to cache a key x value structure like "string" x "double[]". Even if it would probably pay off compared with disk, those two points really put me off adopting the technology.
I would love to be proven wrong, this is why I'm asking the question.
Thank you,
For point 1: You can't store 304.2 as a float/double; you can only store a close approximation to it. To store it exactly, you need e.g. a dedicated decimal type or a more general rational type. Or a string.
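A quick way to see that approximation in .NET (the exact digits printed may vary by runtime, but the Double value is not exactly 304.2):

    Dim d As Double = 304.2
    Dim m As Decimal = 304.2D
    ' "G17" forces enough digits to expose the binary approximation.
    Console.WriteLine(d.ToString("G17"))   ' prints something like 304.19999999999999
    Console.WriteLine(m.ToString())        ' prints 304.2 (Decimal is a base-10 type)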
For point 2:
RESP is a compromise between the following things:
Simple to implement.
Fast to parse.
Human readable.
Human readable means that no matter how numbers are stored internally, they will still be sent as strings and clients will have to parse them.
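For illustration, here is roughly what a SET of a numeric value looks like on the wire in RESP (a sketch written from a .NET client's point of view; the key name is made up):

    ' RESP frames "SET price 304.2" as an array of three bulk strings,
    ' each prefixed with its byte length.
    Dim value As Double = 304.2
    Dim valueText As String = value.ToString(System.Globalization.CultureInfo.InvariantCulture)
    Dim resp As String =
        "*3" & vbCrLf &
        "$3" & vbCrLf & "SET" & vbCrLf &
        "$5" & vbCrLf & "price" & vbCrLf &
        "$" & valueText.Length & vbCrLf & valueText & vbCrLf
    Console.Write(resp)   ' the number travels as the text "304.2", not as a binary double

So even if the server kept numbers in a binary form internally, the protocol would still hand your client a string to parse.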
In the end I chose Infinispan, which gave me the APIs I was looking for. Pro of the chosen solution: the ability to treat the cache as a generic key x value concurrent map. Con: probably less flexible in terms of out-of-the-box client language support, even though you can always use Google protobuf.

What is the best way to represent a form with hundreds of questions in a model

I am trying to design income tax return software.
What is the best way to represent/store a form with hundreds of questions in a model?
Just for this example, I need at least 6 models (T4, T4A(OAS), T4A(P), T1032, UCCB, T4E) which possibly contain hundreds of fields.
Is it by creating hundreds of fields? Storing values in a map? An array?
One very generic approach could be XML
XML allows you to
nest your data to any degree
combine values and meta information (attributes and elements)
describe your data in detail with XSD
store it externally
maintain it easily
even combine it with additional information (look at processing instructions)
and (last but not least) store the real data in almost the same format as the model...
and (laster but even not leaster :-) ) there is XSLT to transform your XML data into any other format (such as HTML for nice presentation)
There is high support for XML in all major languages and database systems.
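As a small illustration (a sketch in VB.NET only because it has XML literals; the form and field names are invented), values and meta information can sit side by side and be navigated without declaring hundreds of members:

    ' Hypothetical T4-style fields: attributes carry meta information, elements carry values.
    Dim t4 As XElement =
        <form id="T4" taxYear="2015">
            <field name="EmploymentIncome" box="14" type="decimal">
                <value>48250.00</value>
            </field>
            <field name="IncomeTaxDeducted" box="22" type="decimal">
                <value>9120.35</value>
            </field>
        </form>

    ' Walk the fields with XML axis properties instead of dedicated data members.
    For Each f In t4.<field>
        Console.WriteLine(f.@name & " = " & f.<value>.Value)
    Next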
Another way could be a typical parts list (or bill of materials/BOM)
This tree structure is - typically - implemented as a table with a self-referencing parentID. Working with such a table needs a lot of recursion...
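As a rough sketch of that idea in memory (the type and member names are made up), each row knows its parent, and reading the tree back is a recursive walk:

    ' One row of the self-referencing table, held in memory.
    Public Class FormNode
        Public Property Id As Integer
        Public Property ParentId As Integer?          ' Nothing for the root node
        Public Property Name As String
        Public Property Value As String
        Public Property Children As New List(Of FormNode)
    End Class

    ' The typical recursive walk such a structure requires.
    Public Sub PrintTree(node As FormNode, indent As Integer)
        Console.WriteLine(New String(" "c, indent) & node.Name & " = " & node.Value)
        For Each child In node.Children
            PrintTree(child, indent + 2)
        Next
    End Sub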
It is highly recommended to store your data type-safely. Either use a character storage format plus a type identifier (which means you have to cast your values here and there), or use separate type-safe side tables referenced by key.
Furthermore, if your data is to be filled from lists, you should define a data source to load the selection lists dynamically.
Conclusion
What is best for you mainly depends on your needs: How often will the model change? How many rules are there to guarantee the data's integrity? Are you using an RDBMS? Which language/tools are you using?
With a case like this, the monolithic aggregate is probably unavoidable (unless you can deduce common fields). I'm going to exclude RDBMS since the topic seems to focus more on lower-level data structures and a more proprietary-style solution, though that could be a very valid option that can manage all these fields.
In this case, I think it becomes less about formalities and more about daily practicalities.
Probably worst from that standpoint in this case is a formal object aggregating fields, like a class or struct with a boatload of data members. Those tend to be the most awkward and the most unattractive as monoliths, since they tend to have a static nature about them. Depending on the language, declaration/definition/initialization could be separate, which means 2-3 lines of code to maintain per field. If you want to read/write these fields from a file, you have to write a separate line of code for each and every field, and maintain and update all that code if new fields are added or existing ones are removed. If you start approaching anything resembling polymorphic needs in this case, you might have to write a boatload of branching code for each and every field, and that too has to be maintained.
So I'd say hundreds of fields in a static kind of aggregate is, by far, the most unmaintainable.
Arrays and maps are effectively the same thing to me here in a very language-agnostic sense provided that you need those key/value pairs, with only potential differences in where you store the keys and what kind of algorithmic complexity is involved. Whatever you do, probably a key search in this monolith should be logarithmic time or better. 'Maps/associative arrays' in most languages tend to inherently have this quality.
Those can be far more suitable, and you can achieve the kind of runtime flexibility that you like on top of those (like being able to manage these from a file and add the fields on the fly with no pre-existing knowledge). They'll be far more forgiving here.
So if the choice is between a bunch of fields in a class and something resembling a map, I'd suggest going for a map. The dynamic nature of it will be far more forgiving for these kinds of cases and will typically far outweigh the compile-time benefits of, say, checking to make sure a field actually exists and producing a syntax error otherwise. That kind of checking is easy to add back in and more if we just accept that it will occur at runtime.
An exception that might make the field solution more appealing is if you involve reflection and more dynamic techniques to generate an object with the appropriate fields on the fly. Then you get back those dynamic benefits and flexibility at runtime. But that might be more unwieldy to initialize the structure, could involve leaning a lot more heavily on heavy-duty (and possibly very computationally-expensive) introspection and type manipulation and code generation mechanisms, and also end up with more funky code that's hard to maintain.
So I think the safest bet is the map or associative array, and a language that lets you easily add new fields, inspect existing ones, etc. with very fast turnaround. If the language doesn't inherently have that quality, you could look to an external file to dynamically add fields, and just maintain the file.
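A minimal sketch of that map approach in VB.NET (the field names and key scheme are invented); fields can come from data rather than code, and the "does this field exist?" check happens at runtime:

    ' Field definitions could be loaded from a file or table instead of being hard-coded.
    Dim fields As New Dictionary(Of String, String)
    fields("T4.EmploymentIncome") = "48250.00"
    fields("T4.IncomeTaxDeducted") = "9120.35"
    fields("T1032.ElectedSplitAmount") = "0.00"

    Dim value As String = Nothing
    If fields.TryGetValue("T4.EmploymentIncome", value) Then
        Console.WriteLine("Employment income: " & value)
    Else
        Console.WriteLine("Unknown field")   ' the compile-time check becomes a runtime check
    End If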

Storing and loading large plists

I am parsing some XML, storing the result in a plist, and saving it to a file. I later frequently use that plist to search, add/remove stuff, and then save it back.
Now, I don't have any problem with this, everything works fine; I'm just wondering if there's a better/more efficient/faster way of doing this?
About the plist: an array of 200 dictionaries with 150 entries each. Some of those entries are arrays themselves with sub-dictionaries of 50-100 entries... (you get the point)
Thanks.
Unless you are running into performance problems, I would suggest not worrying about it and just focusing on getting the rest of the app completed. Premature optimization is the root of all evil (someone had to say it, right?).
If you decide that the time has come to make that part of your app as efficient as possible, then we would need to see the actual code you are using to determine whether there are more efficient ways to do it. Given your description of the plist, I would guess that if there were anything incredibly inefficient in your strategy and/or implementation, you would already be running up against it performance-wise.

Should I load everything in memory upon application start?

I'm using VB.Net, and I have a set of data which I have to be able to filter through fairly quickly. Basically, the program is like Google Suggest, but instead of a drop-down menu, I'm using a listbox. When a user enters a word, I compare the word using LINQ and filter those entries that contain the user's input. The data are all strings of variable length (from 0 to 200 characters, most around the 150-character mark), and I have 240,000+ of these strings and counting - all stored in an XML file.
A colleague of mine told me that loading all of that to memory (using VB.Net's XML serializer plus collections of string/objects) is not practical, and would slow the 'startup' time of the program. I haven't finished building the program yet and I'm having second thoughts about continuing this path.
So, my question is: Should I continue with my current approach on the problem (which is load everything to memory on startup), or is there a better way of solving my dilemma?
If you want to avoid a slow startup and keeping the data in memory isn't a performance problem, then load it asynchronously. Although loading 240,000+ strings from an XML file and keeping them in memory doesn't sound like the greatest idea. A database would probably be the better approach. Or at least some format like JSON that's faster to parse.
Depends on a number of things:
If ((you know the strings will not hugely increase in number) &&
    (you know the spec of the machines that will run your app) &&
    (you are able to test that the load time is good enough on the above spec))
{
    don't bother changing approach.
}
else
{
    change approach.
}
The alternative approach is obviously some kind of async lazy load.
You're talking about loading roughly 36 MB of strings. While this isn't a daunting amount by any means (though you could probably load it faster reading the XML yourself... I wouldn't go with the serialization engine if I were worried about performance), it's also a non-trivial amount. You're looking at adding a couple of seconds to your startup time, assuming you don't do it asynchronously as Mircea suggests.
If you do do it asynchronously, you'll have to ensure that any UI process that relies on the data doesn't occur until after it has loaded. That may be a difficult thing to ensure.
The question seems to imply an online application. A few suggestions if that is the case:
The data could / should be zipped. I suspect it would compress very nicely.
Maybe the data could be cached across multiple sessions, possibly delivered as HTML content with an appropriate cache expiry date. This would save loading it every time, and may be feasible if the data isn't updated frequently.
The suggestion feature could be initially disabled (i.e. showing a "loading..." message while the application initializes the cache asynchronously). In this fashion the application would be quickly available upon startup, even though the suggest feature may lag by up to, say, 30 seconds or so.
Edit: Independently of how the data gets downloaded and cached, I second the opinion of Mircea Grelus that an xml file of this size is a poor substitute for a database.
It may not be a bad idea to load the XML into memory when the app starts up. But if you go this route I'd look into using the BackgroundWorker thread. The idea would be to load the XML into memory asynchronously so the UI is still responsive as this is going on. As far as the user is concerned the app shouldn't appear to start any slower, and yet once done the Google-suggest-like feature should be significantly faster.
I must say that even in memory this is an inherently inefficient operation since you have no advantage of using an index when querying an XML file in this way. This is something that would be 10X faster in SQL with full-text searching.
Of course XML has the advantage of being self-contained and requiring no additional components. And that makes it a decent choice for small desktop apps that query small amounts of data. Otherwise I would consider using a database for better performance.
You might be better served by using binary serialization rather than XML serialization to persist the data that your app reads on startup, particularly if you end up implementing a data structure that's faster to search than a StringCollection. You'd still maintain the XML version of the data somewhere, of course.
And by all means, use a BackgroundWorker to load the data asynchronously if that'll make your application feel more responsive.
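A minimal BackgroundWorker sketch along those lines (the file name, XML element name, list type, and status label are all placeholders):

    ' Load the strings off the UI thread, then switch the suggest feature on.
    Private WithEvents loader As New System.ComponentModel.BackgroundWorker()
    Private suggestions As List(Of String)

    Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
        loader.RunWorkerAsync()
    End Sub

    Private Sub loader_DoWork(sender As Object,
            e As System.ComponentModel.DoWorkEventArgs) Handles loader.DoWork
        ' Worker thread: parse the XML into a plain list of strings.
        e.Result = XDocument.Load("suggestions.xml") _
            .Descendants("entry") _
            .Select(Function(x) x.Value) _
            .ToList()
    End Sub

    Private Sub loader_RunWorkerCompleted(sender As Object,
            e As System.ComponentModel.RunWorkerCompletedEventArgs) Handles loader.RunWorkerCompleted
        ' Back on the UI thread: publish the data and flip the UI from "loading..." to ready.
        suggestions = DirectCast(e.Result, List(Of String))
        StatusLabel.Text = "Ready"
    End Sub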

What's the canonical way to store arbitrary (possibly marked up) text in SQL?

What do wikis/stackoverflow/etc. do when it comes to storing text? Is the text broken at newlines? Is it broken into fixed-length chunks? How do you best store arbitrarily long chunks of text?
nvarchar(max) ftw. Because overcomplicating simple things is bad, mmkay?
I guess if you need to offer the ability to store large chunks of text and you don't mind not being able to look into their content too much when querying, you can use CLOBs.
This all depends on the RDBMS that you are using as well as the types of text that you are going to store. If the text is formatted into sizable chunks of data that mean something in and of themselves, like, say header/body, then you might want to break the data up into columns of these types. It may take multiple tables to use this method depending on the content that you are dealing with.
I don't know how other RDBMSs handle it, but I know that it's not a good idea to have more than one open-ended column (text or varchar(max)) in each table. So you will want to make sure that only one column has unlimited characters.
Regarding PostgreSQL - use type TEXT or BYTEA. If you need to read random chunks you may consider large objects.
If you need to worry about keeping things like formatting strings, quotes, and other "cruft" in the text, as code would likely have, then the special characters need to be completely escaped first - otherwise, on submission to the db, they might end up causing an invalid command to be issued.
Most scripting languages have tools to do this built-in natively.
I guess it depends on where you want to store the text, if you need things like transactions etc.
Databases like SQL Server have a type that can store long text fields. In SQL Server 2005 this would primarily be nvarchar(max) for long unicode text strings. By using a database you can benefit from transactions and easy backup/restore assuming you are using the database for other things like StackOverflow.com does.
The alternative is to store text in files on disk. This may be fairly simple to implement and can work in environments where a database is not available or overkill.
Regarding the format of the text that is stored in a database or file, it is probably very close to the input. If it's HTML then you would just push it through a function that correctly escapes it.
Something to remember is that you probably want to be using unicode or UTF-8 from creation to storage and vice-versa. This will allow you to support additional languages. Any problem with this encoding mechanism will corrupt your text. Historically people may have defaulted to ASCII based on the assumption they were saving disk space etc.
For SQL Server:
Use a varchar(max) to store. I think the upper limit is 2 GB.
Don't try to escape the text yourself. Pass the text through a parameterizing structure that will do the escapes properly for you. In .Net you'd add a parameter to a SqlCommand, or just use LinqToSQL (which then manages the SqlCommand for you).
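A sketch of what that looks like in VB.NET with System.Data.SqlClient (the table, column, and variable names are made up):

    ' The parameter carries the text; no hand-escaping of quotes or markup.
    Dim connectionString As String = "..."        ' your connection string
    Dim postText As String = "arbitrary text, possibly marked up"

    Using conn As New System.Data.SqlClient.SqlConnection(connectionString)
        conn.Open()
        Using cmd As New System.Data.SqlClient.SqlCommand(
                "INSERT INTO Posts (Body) VALUES (@body)", conn)
            cmd.Parameters.Add("@body", System.Data.SqlDbType.NVarChar, -1).Value = postText   ' -1 = nvarchar(max)
            cmd.ExecuteNonQuery()
        End Using
    End Using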
I suspect StackOverflow is storing text in Markdown format in an arbitrarily-sized 'text' column. Maybe as UTF-8 (but it might be UTF-16 or something; I'm guessing it's SQL Server, which I don't know much about).
As a general rule you want to store stuff in your database in the 'rawest' form possible. That is, do all your decoding, and possibly cleaning, but don't do anything else with it (for example, if it's Markdown, don't encode it to HTML, leave it in its original 'raw' format)