While implementing a simple app I ran into the problem of trying to update a nested record. I found a solution online but it really seems like a whole lot of bloated code.
As I was looking for alternatives I found Dictionaries. These seem like a solution to that problem: if I use a dictionary inside of a record, I can avoid all that bloated code and get nested updates.
Seeing dictionaries and records next to each other made me wonder, why would I use a record instead of a dictionary, or vice versa? The two seem really similar to me, so I am not sure I see the advantage of one or the other. Of course I can see that there is a difference in syntax, but is that all?
I learned somewhere that the access time complexity of Dict is O(log(n)) (does it do a binary search on the keys?), but I can't find the access time complexity for records. I am wondering whether that is O(1) and whether that is one of the advantages.
Either way, they both seem to map to a single data structure in other languages (e.g. Python's dictionaries, JS objects, Java hash tables), so why do we need two in Elm?
Dicts and records might seem very similar when coming from JavaScript, but in a statically typed language they are actually very different. I think just about the only property they have in common is that they are both key-value containers.
The biggest differences, I think, are that Dicts are homogeneous, meaning values must be of the same type, and "dynamically" keyed and sized, meaning keys are not statically checked (i.e. at compile time) and key-value pairs can be added at runtime. Records, on the other hand, include the key names and value types in the record type, which means they can hold values of different types, but also can't have keys added or removed at runtime without changing the type itself.
The benefit of easily being able to insert into and update a Dict is something you pay for when you try to get a value back out. Dict.get returns a Maybe which you'll then have to handle, because the type doesn't give any guarantee that it contains anything at all. You also won't get a compiler error if you mistype the name of a key.
Overall, a Dict forsakes most of the benefits of static typing. I think a good rule of thumb is that if you know the key names, you should most likely go with records. If you don't, go with Dict.
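To make the contrast concrete, here is a rough analogy in Python rather than Elm itself. The User class and the lookup_score helper are made up for illustration. A frozen dataclass plays the role of a record (fixed field names, mixed value types), and a plain dict plays the role of Dict (one value type, keys added at runtime, lookups that may find nothing). A static checker such as mypy would flag a misspelled field on the dataclass, while a misspelled dict key only shows up as a missing value at runtime.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class User:
    # Record-like: the field names and their types are part of the type itself.
    name: str
    age: int


def lookup_score(scores: Dict[str, int], player: str) -> Optional[int]:
    # Dict-like: the key may be absent, so the result is optional
    # (the counterpart of Elm's Dict.get returning a Maybe).
    return scores.get(player)


user = User(name="Ada", age=36)

scores: Dict[str, int] = {"ada": 10}
scores["bob"] = 7  # keys can be added at runtime

present = lookup_score(scores, "ada")
missing = lookup_score(scores, "adaa")  # a mistyped key is not an error, just None
```

The caller has to handle the None case for every dict lookup, which is the price described above.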
You also seem about right regarding performance, but I think that's a secondary concern. Record access should be equivalent to accessing the elements of an array by index, since so much information is known at compile time that it can essentially be compiled down to a fixed-size array.
Related
I am trying to design income tax return software.
What is the best way to represent/store a form with hundreds of questions in a model?
Just for this example, I need at least 6 models (T4, T4A(OAS), T4A(P), T1032, UCCB, T4E) which possibly contain hundreds of fields.
Is it by creating hundreds of fields? Storing values in a map? An array?
One very generic approach could be XML
XML allows you to
nest your data to any degree
combine values and meta information (attributes and elements)
describe your data in detail with XSD
store it externally
maintain it easily
even combine it with additional information (look at processing instructions)
and (last but not least) store the real data in almost the same format as the model...
and (laster but even not leaster :-) ) there is XSLT to transform your XML data into any other format (such as HTML for nice presentation)
There is high support for XML in all major languages and database systems.
Another way could be a typical parts list (or bill of materials/BOM)
This tree structure is - typically - implemented as a table with a self-referenced parentID. Working with such a table needs a lot of recursion...
It is very highly recommended to store your data type-safe. Either use a character storage format and a type identifier (that means you have to cast all your values here and there), or you use different type-safe side tables via reference.
Furthermore, if your data is to be filled from lists, you should define a datasource to load a selection list dynamically.
Conclusion
What is best for you mainly depends on your needs: How often will the model change? How many rules are there to guarantee the data's integrity? Are you using an RDBMS? Which language/tools are you using?
With a case like this, the monolithic aggregate is probably unavoidable (unless you can deduce common fields). I'm going to exclude RDBMS since the topic seems to focus more on lower-level data structures and a more proprietary-style solution, though that could be a very valid option that can manage all these fields.
In this case, I think it becomes less about formalities and more about daily practicalities.
Probably worst from that standpoint in this case is a formal object aggregating fields, like a class or struct with a boatload of data members. Those tend to be the most awkward and the most unattractive as monoliths, since they tend to have a static nature about them. Depending on the language, declaration/definition/initialization could be separate, which means 2-3 lines of code to maintain per field. If you want to read/write these fields from a file, you have to write a separate line of code for each and every field, and maintain and update all that code if new fields are added or existing ones are removed. If you start approaching anything resembling polymorphic needs in this case, you might have to write a boatload of branching code for each and every field, and that too has to be maintained.
So I'd say hundreds of fields in a static kind of aggregate is, by far, the most unmaintainable.
Arrays and maps are effectively the same thing to me here in a very language-agnostic sense provided that you need those key/value pairs, with only potential differences in where you store the keys and what kind of algorithmic complexity is involved. Whatever you do, probably a key search in this monolith should be logarithmic time or better. 'Maps/associative arrays' in most languages tend to inherently have this quality.
Those can be far more suitable, and you can achieve the kind of runtime flexibility that you like on top of those (like being able to manage these from a file and add the fields on the fly with no pre-existing knowledge). They'll be far more forgiving here.
So if the choice is between a bunch of fields in a class and something resembling a map, I'd suggest going for a map. The dynamic nature of it will be far more forgiving for these kinds of cases and will typically far outweigh the compile-time benefits of, say, checking to make sure a field actually exists and producing a syntax error otherwise. That kind of checking is easy to add back in and more if we just accept that it will occur at runtime.
An exception that might make the field solution more appealing is if you involve reflection and more dynamic techniques to generate an object with the appropriate fields on the fly. Then you get back those dynamic benefits and flexibility at runtime. But that might be more unwieldy to initialize the structure, could involve leaning a lot more heavily on heavy-duty (and possibly very computationally-expensive) introspection and type manipulation and code generation mechanisms, and also end up with more funky code that's hard to maintain.
So I think the safest bet is the map or associative array, and a language that lets you easily add new fields, inspect existing ones, etc. with very fast turnaround. If the language doesn't inherently have that quality, you could look to an external file to dynamically add fields, and just maintain the file.
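As a minimal sketch of that idea in Python: the field names below and the set_field helper are invented for illustration, and the JSON string stands in for an external definitions file. Each form is just a map, and the existence and type checks that a class would give you at compile time are done at runtime against the definitions.

```python
import json
from typing import Any, Dict

# Hypothetical field definitions; in practice these could live in an external file
# so that fields can be added without touching the code.
FIELD_DEFS_JSON = """
{
  "T4":  {"employment_income": "number", "income_tax_deducted": "number"},
  "T4E": {"ei_benefits": "number", "recipient_name": "text"}
}
"""

FIELD_DEFS: Dict[str, Dict[str, str]] = json.loads(FIELD_DEFS_JSON)
PY_TYPES = {"number": (int, float), "text": (str,)}


def set_field(form: Dict[str, Any], form_name: str, field: str, value: Any) -> None:
    """Store a value, checking at runtime that the field exists and has the right type."""
    defs = FIELD_DEFS[form_name]
    if field not in defs:
        raise KeyError(f"{form_name} has no field named {field!r}")
    if not isinstance(value, PY_TYPES[defs[field]]):
        raise TypeError(f"{field} expects a {defs[field]}")
    form[field] = value


t4: Dict[str, Any] = {}
set_field(t4, "T4", "employment_income", 52000.0)
```

Adding a new question to a form then means editing the definitions, not adding a data member plus its load/save code.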
I have developed an application in VB.NET, but recently I came across the requirement of supporting multiple languages in it. I don't know if there is a 'common' way of doing this kind of thing, but my approach is the following:
First, I'll search the code for components, error messages and everything else displayed in the GUI that needs to be translated.
Second, I'll create a class that stores an in-memory dictionary of everything that will be translated.
After that, I'll replace each piece of text to be translated with an entry of the dictionary.
Then, when the application starts, I'll load the dictionary.
Later on, I'll replace the static dictionary and load it into memory from the database.
So for example, my dictionary class:
Dim dictionary As New Dictionary(Of String, String)
dictionary.Add("00011", "hello there!")
Somewhere in my code i'll replace:
mylabel.text = "hello there!"
With:
mylabel.text = dictionary.item("00011")
Later on, instead of having a static dictionary, I will build that dictionary from a database table like this (and load it at the start of the application):
_______________________________________
word_code ### word_EN ### word_FR
_______________________________________
00011 ### hello there ### bonjour il
I will load the dictionary considering which language is selected.
I'm not very comfortable with this approach and I have no idea if this is the right thing to do, but if so I have a couple of questions:
is a dictionary the best data structure for this?
will this be memory-heavy considering I'll have 1000 entries, 1m entries or 10m entries?
is there a more logical and faster way of accomplishing the same?
Thank you so much in advance,
J
It's a common way of doing it: a system name alongside a language code is used to look up a translated value. However, generally speaking I'd only advise you to do this for something like system texts and smaller text segments.
The reason is that in, for example, CMS/ecommerce systems, pages with lots of text will likely need a data model that supports translation to begin with, and then you already have the language division.
So in that situation, you're better off making a page structure with a translated data model where the detail is specific to each language on your website.
For example, you'll have a product -> product_detail where detail keeps the translated values for said product. Similar for article -> article_detail and so on.
But for general translations and system texts which need to be displayed, it's a common way to do it.
And as you suggest yourself, a structure like a dictionary is a good choice for fast lookups and can be cached in the system so you do not need to retrieve the values all the time.
One way to expand on it is to subdivide your translations into groups; say you have an order page and a product page. Then you can have translations assigned to "product" and to "order", with a "common" group as well.
It will also make it easier to build smaller cache objects, extract less data from your data storage etc., so a page which only revolves around orders doesn't need to worry about product translations.
It will require memory, but unless you put entire CMS systems into the translations, it should be "minor".
I would however question the need for 10 million translation entries and wonder whether your system actually requires that many. If it does, then consider an alternate approach, such as making multiple versions of the "page" to eliminate the need for translations.
I would also advise you not to use "00011" as a system code to begin with, and to go for a more "readable" version (like "hello") to ease the readability and maintainability of your code. Then if you want a 'system value' like "00011", it's easy to do a search/replace.
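Here is a small sketch of the grouping idea, written in Python rather than VB.NET. The nested TRANSLATIONS data, the key names and the load_texts/translate helpers are all made up for illustration; in a real system the rows would come from the database and be cached per language and group.

```python
from typing import Dict

# Hypothetical in-memory data; rows would normally be loaded from the database.
# Layout: language -> group -> readable key -> translated text.
TRANSLATIONS: Dict[str, Dict[str, Dict[str, str]]] = {
    "en": {
        "common": {"hello": "hello there!"},
        "order": {"order.title": "Your order"},
    },
    "fr": {
        "common": {"hello": "bonjour"},
        "order": {"order.title": "Votre commande"},
    },
}


def load_texts(language: str, *groups: str) -> Dict[str, str]:
    """Merge only the requested groups for one language into a flat lookup table."""
    texts: Dict[str, str] = {}
    for group in groups:
        texts.update(TRANSLATIONS[language].get(group, {}))
    return texts


def translate(texts: Dict[str, str], key: str) -> str:
    # Fall back to the key itself so a missing translation stays visible on screen.
    return texts.get(key, key)


# An order page only needs the "common" and "order" groups.
order_page_texts = load_texts("fr", "common", "order")
```

A call such as translate(order_page_texts, "hello") would then take the place of dictionary.item("00011") in the question.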
I'm trying to use Redis as a primary database for a small game I'm making (mostly to mess around with programming and using Redis).
However I came across a scenario that I couldn't find an answer to:
I wish to store a list of the names of different maps that people can be on (not many of them) along with their id. Note: I never need to get the ID from the name.
The two ways I believe this can be done are either storing the information as a string or as a hash.
i.e:
1) String based:
set maps:0 "Main"
set maps:1 "Island"
etc. (and maybe a maps:id key to store an auto-increment value)
2) Hash based:
hset maps "0" "Main"
hset maps "1" "Island"
etc
My question is which way seems best. Given that there will never be that many maps, I'm leaning towards the single hash object, partly because it provides a nice way to return all the maps in existence. But is there any particular reason that the string-based approach would be more useful?
Hopefully you can give me some clear information.
Thank you,
Pluckerpluck
String-based values are actually discouraged here because they consume a lot more memory than a hash.
Redis optimizes small hashes and encodes them in a memory-efficient manner. This encoding is called zipmap (or ziplist in Redis 2.6). See http://redis.io/topics/memory-optimization, especially the section "Use hashes when possible".
Example case:
We're building a renting service, using SQL Server. Information about items that can be rented is stored in a table. Each item has a state that can be either "Available", "Rented" or "Broken". The different states reside in a lookup table.
ItemState table:
id name
1 'Available'
2 'Rented'
3 'Broken'
In addition, we have a business rule which states that whenever an item is returned, its state is changed from "Rented" to "Available".
This could be done with an update statement like "update Items set state=1 where id=#itemid". In application code we might have an enum that maps to the ItemState ids. However, these contain hard-coded values that could lead to maintenance issues later on. Say a developer were to change the set of states but forgot to fix the related business logic layer...
What good methods or alternate designs are there for dealing with this type of design issues?
Links to related articles are also appreciated in addition to direct answers.
In my experience this is a case where you actually have to hardcode, preferably by using an Enum whose integer values match the ids of your lookup tables. I can't see anything wrong with saying that "1" is always "Available" and so forth.
Most systems that I've seen hard code the lookup table values and live with it. That's because, in practice, code tables rarely change as much as you think they might. And if they ever do change, you generally need to re-compile any programs that rely on that DDL anyway.
That said, if you want to make the code maintainable (a laudable goal), the best approach would be to externalize the values into a properties file. Then you can edit this file later without having to re-code your entire app.
The limiting factor here is that your app depends for its own internal state on the value you get from the lookup table, so that implies a certain amount of coupling.
For lookups where the app doesn't rely on that code, (for instance, if your code table stores a list of two-letter state codes for use in an address drop-down), then you can lazily load the codes into an object and access them only when needed. But that won't work for what you're doing.
When you have your lookup tables as well as enums defined in the code, then you always have an issue with keeping them in sync. There is not much that can be done here. Both live effectively in two different worlds and are generally unaware of each other.
You may wish to reject using lookup tables and only let your business logic operate on these values. In that case you lose the option of relying on referential integrity to back you up on data integrity.
The other option is to build your application in such a way that you never need these values in your code. That means moving part of your business logic to the database layer, that is, putting it in stored procedures and triggers. This also has the benefit of being agnostic to the client: anyone can invoke the SPs and be assured that the data will be kept in a consistent state, in line with your business logic rules.
You'll need to have some predefined value that never changes, be it an integer, a string or something else.
In your case, the numerical value of the state is the state's surrogate PRIMARY KEY which should never change in a well-designed database.
If you're concerned about the consistency, use a CHAR code: A, R or B.
However, you should stick to it as well as to a numerical code so that A always means Available etc.
Your database structure should be documented as well as the code is.
The answer depends entirely on the language you're using: solutions for this are not the same in Java, PHP, Smalltalk or even Assembler...
But let me tell you something: while it's true hard coded values are not a great thing, there are times in which you do need them. And this one is pretty much one of them: you need to declare in your code your current knowledge of the business logic, which includes these hard coded states.
So, in this particular case, I would hard code those values.
Don't overdesign it. Before trying to come up with a solution to this problem, you need to figure out if it's even a problem. Can you think of any legit hypothetical scenario where you would change the values in the itemState table? Not just "What if someone changes this table?" but "Someone wants to change this table in X way for Y reason, what effect would that have?". You need to stay realistic.
New state? You add a row, but it doesn't affect the existing ones.
Removing a state? You have to remove the references to it in code anyway.
Changing the id of a state? There is no legit reason to do that.
Changing the name of a state? There is no legit reason to do that.
So there really should be no reason to worry about this. But if you must have this cleanly maintainable in the case of irrational people who randomly decide to change Available to 2 because it just fits their Feng Shui better, make sure all tables are generated via a script which reads these values from a configuration file, and then make sure all code reads constants from that same configuration file. Then you have one definition location and any time you want to change the value you modify that configuration file instead of the DB/code.
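For what it's worth, the "one definition location" idea could look roughly like this in Python. The ITEM_STATES mapping stands in for the configuration file, and lookup_table_script is a made-up helper name; the generated SQL is only a sketch and would need adapting to your actual schema and deployment scripts.

```python
from enum import IntEnum

# Hypothetical single source of truth; this could equally be read from a config file.
ITEM_STATES = {"Available": 1, "Rented": 2, "Broken": 3}

# Application side: build the enum from the shared definition.
ItemState = IntEnum("ItemState", ITEM_STATES)


def lookup_table_script(table: str = "ItemState") -> str:
    """Database side: generate the INSERT statements from the same definition."""
    rows = [
        f"INSERT INTO {table} (id, name) VALUES ({value}, '{name}');"
        for name, value in ITEM_STATES.items()
    ]
    return "\n".join(rows)


script = lookup_table_script()
```

Both the enum and the table contents now change together when the definition changes.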
I think this is a common problem and a valid concern, that's why I googled and found this article in the first place.
What about creating a public static class to hold all the lookup values, but instead of hard-coding them, we initialize these values when the application is loaded and use names to refer to them?
In my application, we tried this, it worked. Also you can do some checking, e.g. the number of different possible values of a lookup in code should be the same as in db, if it's not, log/email/etc. But I don't want to manually code this for the status of 40+ biz entities.
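A rough sketch of the startup check, in Python with an in-memory SQLite database standing in for the real one. The find_mismatches helper and the matching of names by upper-casing are my own assumptions for the example, not something from our actual application.

```python
import sqlite3
from enum import IntEnum


class ItemState(IntEnum):
    AVAILABLE = 1
    RENTED = 2
    BROKEN = 3


def find_mismatches(conn: sqlite3.Connection) -> list:
    """Compare the enum with the ItemState table and describe every difference."""
    in_db = {name.upper(): id_ for id_, name in conn.execute("SELECT id, name FROM ItemState")}
    in_code = {member.name: int(member) for member in ItemState}
    problems = []
    for name in sorted(in_db.keys() | in_code.keys()):
        if in_db.get(name) != in_code.get(name):
            problems.append(f"{name}: db={in_db.get(name)} code={in_code.get(name)}")
    return problems


# In-memory stand-in for the real database (SQLite here, not SQL Server).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ItemState (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO ItemState (id, name) VALUES (?, ?)",
    [(1, "Available"), (2, "Rented"), (3, "Broken")],
)
ok = find_mismatches(conn)  # enum and table agree

conn.execute("INSERT INTO ItemState (id, name) VALUES (4, 'Reserved')")
drifted = find_mismatches(conn)  # the table gained a state the code does not know
```

At application startup, a non-empty result would be the point to log or send the email.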
Moreover, this can be part of the bigger problem of OR mapping. We're exposed to too many details of the persistence layer, and thus we have to take care of them. With technologies like Entity Framework, we don't need to worry about the "sync" part because it's automated, am I right?
Thanks!
I've used a similar method to what you're describing - a table in the database with values and descriptions (useful for reporting, etc.) and an enum in code. I've handled the synchronization with a comment in code saying something like "these values are taken from table X in database ABC" so that the programmer knows the database needs to be updated. To prevent changes from the database side without the corresponding changes in code I set permissions on the table so that only certain people (who hopefully remember they need to change the code as well) have access.
The values have to be hard-coded, which effectively means that they can't be changed in the database, which means that storing them in the database is redundant.
Therefore, hard-code them and don't have a lookup table in the database. Instead, store the item's state directly in the items table.
You can structure your database so that your application doesn't actually have to care about the codes themselves, but rather the business rules behind them.
I have done both of the following:
Do one or more of your codes have a certain characteristic, such as IsAvailable, that the application cares about? If so, add it as a flag column to the code table, where those that match are set to true (or your DB's equivalent), and those that don't are set to false.
Do you need to use a specific, single code under a certain condition? You can create a singleton table, named something like EnvironmentSettings, with a column such as ItemStateIdOnReturn that's a foreign key to the ItemState table.
If I wanted to avoid declaring an enum in the application, I would use #2 to address the example in the question.
Whether you take this approach depends on your application's priorities. This type of structure comes at the cost of additional development and lookup overhead. Plus, if every individual code comes with its own business rules, then it's not practical to create one new column per required code.
But, it may be worthwhile if you don't want to worry about synchronizing your application with the contents of a code table.
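To illustrate both options, here is a sketch using Python with an in-memory SQLite database as a stand-in for the real one. The column and table names IsAvailable, EnvironmentSettings and ItemStateIdOnReturn follow the description above; the Items table layout and the return_item helper are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for the real database
conn.executescript(
    """
    CREATE TABLE ItemState (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        IsAvailable INTEGER NOT NULL DEFAULT 0   -- option 1: flag column
    );
    INSERT INTO ItemState VALUES (1, 'Available', 1), (2, 'Rented', 0), (3, 'Broken', 0);

    -- option 2: singleton table naming the state to use when an item is returned
    CREATE TABLE EnvironmentSettings (
        ItemStateIdOnReturn INTEGER NOT NULL REFERENCES ItemState(id)
    );
    INSERT INTO EnvironmentSettings VALUES (1);

    CREATE TABLE Items (
        id INTEGER PRIMARY KEY,
        state INTEGER NOT NULL REFERENCES ItemState(id)
    );
    INSERT INTO Items VALUES (42, 2);
    """
)


def return_item(item_id: int) -> None:
    """Apply the 'item returned' rule without any state id in application code."""
    conn.execute(
        "UPDATE Items SET state = (SELECT ItemStateIdOnReturn FROM EnvironmentSettings) "
        "WHERE id = ?",
        (item_id,),
    )


return_item(42)
state_name = conn.execute(
    "SELECT s.name FROM Items i JOIN ItemState s ON s.id = i.state WHERE i.id = 42"
).fetchone()[0]
available_items = conn.execute(
    "SELECT COUNT(*) FROM Items i JOIN ItemState s ON s.id = i.state WHERE s.IsAvailable = 1"
).fetchone()[0]
```

The application only knows the rule names (the settings column and the flag), never the numeric ids.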
I know it's possible to get the top terms within a Lucene index, but is there a way to get the top terms based on a subset of a Lucene index?
I.e. What are the top terms in the Index for documents within a certain date range?
Ideally there'd be a utility somewhere to do this, but I'm not aware of one. However, it's not too hard to do this "by hand" in a reasonably efficient way. I'll assume that you already have a Query and/or Filter object that you can use to define the subset of interest.
First, build a list in memory of all of the document IDs in your index subset. You can use IndexSearcher.search(Query, Filter, HitCollector) to do this very quickly; the HitCollector documentation includes an example that seems like it ought to work, or you can use some other container to store your doc IDs.
Next, initialize an empty HashMap (or whatever) to map terms to total frequency, and populate the map by invoking one of the IndexReader.getTermFreqVector methods for every document and field of interest. The three-argument form seems simpler, but either should be just fine. For the three-argument form, you'd make a TermVectorMapper whose map method checks if term is in the map, associates it with frequency if not, or adds frequency to the existing value if so. Be sure to use the same TermVectorMapper object across all of the calls to getTermFreqVector in this pass, rather than instantiating a new one for each document in the loop. You can also speed things up quite a bit by overriding isIgnoringPositions() and isIgnoringOffsets(); your object should return true for both of those. It looks like your TermVectorMapper might also be forced to define a setExpectations method, but that one doesn't need to do anything.
Once you've built your map, just sort the map items by descending frequency and read off however many top terms you like. If you know in advance how many terms you want, you might prefer a heap-based algorithm that finds the top k items in O(n log k) time instead of using an O(n log n) sort. I imagine the plain old sort will be plenty fast in practice, but it's up to you.
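The counting and top-k steps are independent of Lucene, so here is a language-neutral sketch of them in Python. The subset_vectors data is invented and stands in for whatever term/frequency pairs you read from the term vectors; count_terms and top_terms are hypothetical helper names.

```python
import heapq
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def count_terms(term_vectors: Iterable[Dict[str, int]]) -> Counter:
    """Sum per-document term frequencies into one term -> total frequency map."""
    totals: Counter = Counter()
    for vector in term_vectors:
        totals.update(vector)  # adds to existing terms, inserts new ones
    return totals


def top_terms(totals: Dict[str, int], k: int) -> List[Tuple[str, int]]:
    """Return the k most frequent terms using a bounded heap, O(n log k)."""
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])


# Hypothetical term vectors for the documents in the subset.
subset_vectors = [
    {"lucene": 3, "index": 1},
    {"lucene": 2, "query": 4},
    {"index": 2, "query": 2},
]
totals = count_terms(subset_vectors)
best = top_terms(totals, 2)
```

Using one shared Counter across all documents mirrors the advice above about reusing a single mapper object.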
If you prefer, you can combine the first two stages by having your HitCollector invoke getTermFreqVector directly. This should certainly produce equally correct results, and intuitively seems like it would be simpler and better, but the docs seem to warn that doing so is likely to be quite a bit slower than the two-pass approach (on same page as the HitCollector example above). Or I could be misinterpreting their warning. If you're feeling ambitious you could try it both ways, compare, and let us know.
Counting up the TermVectors will work, but will be slow if there are a lot of documents to iterate over. Also note that if by top terms you mean docFreq, then don't use the count in the TermFreqVector; just count each term as binary (present or not) per document.
Alternatively, you could iterate the terms like facet counts. Use a cached filter for every term; their BitSets can be used for a fast intersection count.
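The intersection count can be sketched outside Lucene like this, using Python integers as bitsets. The postings data and the helper names are invented for illustration; in Lucene the bit sets would come from the cached filters.

```python
from typing import Dict, Iterable


def to_bitset(doc_ids: Iterable[int]) -> int:
    """Pack document ids into an integer used as a bitset (bit i set = doc i matches)."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits


def doc_freq_in_subset(term_bits: Dict[str, int], subset_bits: int) -> Dict[str, int]:
    """Count, per term, how many subset documents contain it (intersection popcount)."""
    return {term: bin(bits & subset_bits).count("1") for term, bits in term_bits.items()}


# Hypothetical postings: which document ids contain each term.
term_bits = {
    "lucene": to_bitset([0, 1, 4]),
    "query": to_bitset([1, 2, 3]),
}
subset = to_bitset([1, 2])  # e.g. the documents inside the date range
counts = doc_freq_in_subset(term_bits, subset)
```

Each count here is a document frequency within the subset, i.e. the "binary" count mentioned above, not a total term frequency.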