String interning with SQLAlchemy

I've been trying out various approaches to "string interning" in a database that's accessed primarily using SQLAlchemy ORM. I've tried a couple things, and so far I'm not really loving any of them. It seems like a common pattern, and I feel like I might be missing some obvious, elegant solution.
To elaborate: the situation is that my database (Postgres, if it matters) table is likely to contain many of the same strings, but they are still arbitrary, and not bounded in a way that would make a native enum type the right solution. I want to collect these strings in another table with an auto-incrementing PK and then reference them in the main table by FK. The goals here include both space savings and string "hygiene" (i.e. I'd like to be able to easily assess and track the growth of this string table).
I've tried the obvious naive solution of creating a separate entity, but this seems to foist the mechanics of the string interning onto every consumer of the entity; i.e. every consumer has to traverse the relationship to get the value, like this: obj.interned_property.value. Absent joinedload hints, it also causes another database hit for every new access. (In general, I try to keep loading strategies out of the model itself, since different use cases often benefit from different loading strategies.) Adding a Python property to traverse the relationship is not a good approach because it can't participate in SQLAlchemy filtering/ordering operations.
I've tried using the AssociationProxy extension, but I've been generally disappointed with it. I discovered that AssociationProxy attributes don't follow the same metadata contract as other SA ORM attributes; they lack an info property, for instance. An info dictionary was relatively simple to graft on, but this was really just the first shoe to drop. After that, I discovered that you can't filter against them in a query (at least not with the LIKE operator). I've gotten to the point where I'm kinda sick of discovering the next thing that AssociationProxy attributes can't do.
The next thought I had was to do all the interning inside the database using triggers and updatable views, but that inherently hampers portability w/r/t database engine, and splits the logic between Python and PL/SQL which makes it harder for future developers coming into this code to figure out what's going on. And, it's a bunch of effort, so if I'm going to do it, I would like to feel more confident that it's the right way to go.
Anyway, it seems like this is a pretty common pattern, and I feel like someone must have figured out an elegant solution by now. So, I'd love to hear from someone who's been down this road before: what's the best way to handle string interning with SQLAlchemy?
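For reference, the table layout described in the question can be sketched independently of any ORM. This is only a minimal illustration using Python's built-in sqlite3 module rather than Postgres or SQLAlchemy; the table names (interned_string, event), the column names, and the intern() helper are all hypothetical, and it does not address the attribute-level ergonomics the question is actually asking about.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE interned_string (
    id    INTEGER PRIMARY KEY AUTOINCREMENT,
    value TEXT NOT NULL UNIQUE
);
CREATE TABLE event (
    id       INTEGER PRIMARY KEY,
    label_id INTEGER NOT NULL REFERENCES interned_string(id)
);
""")


def intern(conn, value):
    """Return the id for value, inserting a row only if it is not stored yet."""
    conn.execute(
        "INSERT OR IGNORE INTO interned_string (value) VALUES (?)", (value,)
    )
    row = conn.execute(
        "SELECT id FROM interned_string WHERE value = ?", (value,)
    ).fetchone()
    return row[0]


# Four rows in the main table, but only two distinct strings are stored.
for label in ["login", "logout", "login", "login"]:
    conn.execute("INSERT INTO event (label_id) VALUES (?)", (intern(conn, label),))

# The size of the string table is easy to track on its own.
string_count = conn.execute("SELECT COUNT(*) FROM interned_string").fetchone()[0]

# Reading the values back requires a join through the foreign key.
labels = [
    r[0]
    for r in conn.execute(
        "SELECT s.value FROM event e "
        "JOIN interned_string s ON s.id = e.label_id ORDER BY e.id"
    )
]
```

The join in the last query is the step that every consumer ends up paying for in the naive ORM mapping, either explicitly or through a lazy load.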

Related

Database Type Agnostic Select Query Encapsulation class

I am upgrading a webapp that will be using two different database types. The existing database is MySQL and is tightly integrated with the current systems; a MongoDB database will handle the extended functionality. The new functionality will also be relying pretty heavily on the MySQL database for environmental variables such as information on the current user, content, etc.
Although I know I can just assemble the queries independently, it got me thinking of a way to make constructing queries much simpler (only for legibility while building; once it's finished, I would convert back to hard-coded queries). It would entail an encapsulation object that would contain:
what data is being selected (including functionally derived data)
source (including joined data; I know that joins are not a good idea for non-relational DBs, but it would be nice to have the facility just in case, and it can be rewritten into two queries later for performance)
where and having conditions (stored as their own object types so they can be processed later, potentially including other select queries that can be interpreted by whatever db is using it)
orders
groupings
limits
This data can then be passed to an interface adapter that can build and execute the query, returning it in an array, or object or whatever is desired.
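To make the idea concrete, here is a rough sketch of what such an encapsulation object and two adapters might look like. It is written in Python purely for illustration; the names SelectSpec, to_sql, and to_mongo are hypothetical, grouping, having, and joins are left out for brevity, and the "?" placeholder style is an assumption that depends on the driver in use. The Mongo adapter only builds a filter and projection as plain dictionaries and does not talk to a server.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SelectSpec:
    """Backend-neutral description of a select query (hypothetical name)."""
    fields: list
    source: str
    where: list = field(default_factory=list)     # (column, operator, value)
    order_by: list = field(default_factory=list)
    limit: Optional[int] = None


def to_sql(spec):
    """Adapter for SQL backends: returns (sql, params) using ? placeholders."""
    sql = f"SELECT {', '.join(spec.fields)} FROM {spec.source}"
    params = []
    if spec.where:
        clauses = []
        for column, op, value in spec.where:
            clauses.append(f"{column} {op} ?")
            params.append(value)
        sql += " WHERE " + " AND ".join(clauses)
    if spec.order_by:
        sql += " ORDER BY " + ", ".join(spec.order_by)
    if spec.limit is not None:
        sql += " LIMIT ?"
        params.append(spec.limit)
    return sql, params


def to_mongo(spec):
    """Adapter for a document store: returns (filter, projection) dicts."""
    ops = {"=": "$eq", ">": "$gt", "<": "$lt"}
    flt = {}
    for column, op, value in spec.where:
        # Merge several conditions on the same column into one sub-document.
        flt.setdefault(column, {})[ops[op]] = value
    projection = {name: 1 for name in spec.fields}
    return flt, projection


spec = SelectSpec(
    fields=["id", "name"],
    source="users",
    where=[("age", ">", 30)],
    order_by=["name"],
    limit=10,
)
sql, params = to_sql(spec)
mongo_filter, mongo_projection = to_mongo(spec)
```

Keeping the conditions as data rather than as strings is what lets each adapter interpret them in its own way.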
Although this sounds good, I have no idea if any code like this exists. If so, can anybody point it out to me? If not, are there any resources on similar projects that might allow me to continue the work and build a basic version?
I know this is a complicated library, but I have been working on this update for the last few days, and constantly switching back and forth has been getting me muddled at times and allowing mistakes to occur.

I would study things like the SQL grammar: http://www.h2database.com/html/grammar.html
Gives you an idea of how queries should be constructed.
You can study existing libraries around LINQ (C#): https://code.google.com/p/linqbridge/
Maybe even check out this link about FQL (Facebook's query language): https://code.google.com/p/mockfacebook/issues/list?q=label:fql
Like you already know, this is a hard problem. It will be a big challenge to make it run efficiently. Maybe consider moving all data from MySQL and Mongo to a third data store that has a copy of all the data, and then running queries against that? You could replicate all writes to something like Redis or Elasticsearch and then write your queries there.
Either way, good luck!

Database access: one master database object or have objects call queries themselves?

For a hobby project I am building an application to keep track of my money. Register everything that comes in and goes out. I am using sqlite as a database backend.
I have two data access models in mind.
Creating one master object as a sort of database connector, which contains methods which execute the queries and provide the required sets of data as a list of objects
Have objects who need data execute the queries themselves
Which one of these is 'the best' and why? Or are there different, better models out there?

The latter option is better. In the first option, you would end up having to touch your universal data access object for just about any update to the code that wasn't purely a change in display logic. If you have different data access objects, then you will have much more testable, maintainable code.
I suggest you read up a bit on the model-view-controller paradigm. The wikipedia article on it is a good start: http://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller.
Also, you didn't say which language/platform you were coding in, but most platforms have numerous options for auto-generating a starting point for your data access classes from your database. You may find something like that useful.

Much of a muchness really; the thing to avoid is having the "same" SQL sprinkled all over your code base.
The key point is this: you've just added a new column to Table1. When you do Find In Files for "Table1", how many hits are you going to get, and where?
If you use one class and there are a lot of db operations, it's going to get very messy very quickly, but if you have one interface (say IModel) with one implementation, you can swap backends very easily.
So how many db operations are there, and how likely is it that you will move away from SQLite?
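As a rough illustration of the "one interface, one implementation" idea, here is a minimal Python sketch. The question does not say which language is in use, so Python is an assumption, and the names TransactionStore, SqliteTransactionStore, and report are hypothetical. All SQL that touches the txn table lives in one class, and the rest of the code depends only on the interface.

```python
import sqlite3
from typing import Protocol


class TransactionStore(Protocol):
    """What the rest of the application is allowed to know about storage."""

    def add(self, description: str, amount_cents: int) -> None: ...

    def balance_cents(self) -> int: ...


class SqliteTransactionStore:
    """The only place that contains SQL for the txn table."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS txn ("
            "id INTEGER PRIMARY KEY, "
            "description TEXT NOT NULL, "
            "amount_cents INTEGER NOT NULL)"
        )

    def add(self, description: str, amount_cents: int) -> None:
        self._conn.execute(
            "INSERT INTO txn (description, amount_cents) VALUES (?, ?)",
            (description, amount_cents),
        )
        self._conn.commit()

    def balance_cents(self) -> int:
        row = self._conn.execute(
            "SELECT COALESCE(SUM(amount_cents), 0) FROM txn"
        ).fetchone()
        return row[0]


def report(store: TransactionStore) -> str:
    """Display logic: works with any object that satisfies TransactionStore."""
    return f"Balance: {store.balance_cents() / 100:.2f}"


store = SqliteTransactionStore()
store.add("salary", 250000)
store.add("rent", -90000)
```

Adding a column to txn then means editing one class, and swapping SQLite for something else means writing a second class with the same two methods.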

Where do ORMs fall through?

I often hear people bashing ORMs for being inflexible and a "leaky abstraction", but you really don't hear why they're problematic. When used properly, what exactly are the faults of ORMs? I'm asking this because I'm working on a PHP ORM and I'd like it to solve problems that a lot of other ORMs fail at, such as lazy loading and the lack of subqueries.
Please be specific with your answers. Show some code or describe a database schema where an ORM struggles. Doesn't matter the language or the ORM.

One of the bigger issues I have noticed with all the ORMs I have used is updating only a few fields without retrieving the object first.
For example, say I have a Project object mapped in my database with the following fields: Id, name, description, owning_user. Say, through ajax, I want to just update the description field. In most ORMs the only way for me to update the database table while only having an Id and description values is to either retrieve the project object from the database, set the description and then send the object back to the database (thus requiring two database operations just for one simple update) or to update it via stored procedures (which is the method I am currently using).
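For comparison, the single round trip being asked for looks like this in plain SQL. This is a minimal sketch using Python's sqlite3 module; the project table mirrors the fields named above, while the sample row and the update_description helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE project ("
    "id INTEGER PRIMARY KEY, name TEXT, description TEXT, owning_user TEXT)"
)
conn.execute("INSERT INTO project VALUES (1, 'Site', 'old text', 'alice')")


def update_description(conn, project_id, description):
    """Update one column by primary key without loading the row first."""
    cur = conn.execute(
        "UPDATE project SET description = ? WHERE id = ?",
        (description, project_id),
    )
    conn.commit()
    return cur.rowcount  # number of rows changed; 0 means the id was not found


changed = update_description(conn, 1, "new text")
row = conn.execute(
    "SELECT name, description, owning_user FROM project WHERE id = 1"
).fetchone()
```

Only the description column is written; the other fields are never read or sent back.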

Objects and database records really aren't all that similar. They have typed slots that you can store stuff in, but that's about it. Databases have a completely different notion of identity than programming languages. They can't handle composite objects well, so you have to use additional tables and foreign keys instead. Most have no concept of type inheritance. And the natural way to navigate a network of objects (follow some of the pointers in one object, get another object, and dereference again) is much less efficient when mapped to the database world, because you have to make multiple round trips and retrieve lots of data that you didn't care about.
In other words: the abstraction cannot be made very good in the first place; it isn't the ORM tools that are bad, but the metaphor that they implement. Instead of a perfect isomorphism it is only a superficial similarity, so the task itself isn't a very good abstraction. (It is still way more useful than having to understand databases intimately, though. The scorn for ORM tools comes mostly from DBAs looking down on mere programmers.)

ORMs also can write code that is not efficient. Since database performance is critical to most systems, they can cause problems that could have been avoided if a human being wrote the code (but which might not have been any better if the human in question didn't understand database performance tuning). This is especially true when the querying gets complex.
I think my biggest problem with them, though, is that by abstracting away the details, junior programmers are getting less understanding of how to write queries, which they need to be able to do to handle the edge cases and the places where the ORM writes really bad code. It's really hard to learn the advanced stuff when you never had to understand the basics. An ORM in the hands of someone who understands joins and group by and advanced querying is a good thing. In the hands of someone who doesn't understand boolean algebra and joins and a bunch of other basic SQL concepts, it is a very bad thing, resulting in very poor design of database and queries.
Relational databases are not objects and shouldn't be treated as such. Trying to make an eagle into a silk purse is generally not successful. Far better to learn what the eagle is good at and why and let the eagle fly than to have a bad purse and a dead eagle.

The way I see it is like this. To use an ORM, you have to usually stack several php functions, and then connect to a database and essentially still run a MySQL query or something similar.
Why all of the abstraction in between code and database? Why can't we just use what we already know? Typically a web dev knows their backend language, their db language (some sort of SQL), and some sort of frontend languages, such as html, css, js, etc...
In essence, we're trying to add a layer of abstraction that includes many functions (and we all know php functions can be slower than assigning a variable). Yes, this is a micro calculation, but still, it adds up.
Not only do we now have several functions to go through, but we also have to learn the way the ORM works, so there's some time wasted there. I thought the whole idea of separation of code was to keep your code separate at all levels. If you're in the LAMP world, just create your query (you should know MySQL) and use the already existing php functionality for prepared statements. DONE!
LAMP WAY:
create query (string);
use mysqli prepared statements and retrieve data into array.
ORM WAY:
run a function that gets the entity
which runs a MySQL query
run another function that adds a conditional
run another function that adds another conditional
run another function that joins
run another function that adds conditionals on the join
run another function that prepares
runs another MySQL query
run another function that fetches the data
runs another MySQL Query
Does anyone else have a problem with the ORM stack? Why are we becoming such lazy developers? Or so creative that we're harming our code? If it ain't broke don't fix it. In turn, fix your dev team to understand the basics of web dev.

ORMs are trying to solve a very complex problem. There are edge cases galore and major design tradeoffs with no clear or obvious solutions. When you optimize an ORM design for situation A, you inherently make it awkward for solving situation B.
There are ORMs that handle lazy loading and subqueries in a "good enough" manner, but it's almost impossible to get from "good enough" to "great".
When designing your ORM, you have to have a pretty good handle on all the possible awkward database designs your ORM will be expected to handle. You have to explicitly make tradeoffs around which situations you are willing to handle awkwardly.
I don't look at ORMs as inflexible or any more leaky than your average complex abstraction. That said, certain ORMs are better than others in those respects.
Good luck reinventing the wheel.

Coldfusion ORM Large Tables

Say I have a large dataset: the table has well over a million records, and the database is normalized, so there are foreign keys and so on. I've set up the relations properly and I get a list of the first object with applications = EntityLoad("entityName"), but because of the relations the page takes about 24 seconds to load. Even when I limit the number of records shown to about 5, it takes an awfully long time to load.
My solution was to create another object that just gets the list, and then, when the user wants to see one, use the object with all the relations and show it to the user. Is this the right way to approach it, or am I missing a big ORM concept?

Are you counting just the time to get the data, or are you perhaps doing a CFDUMP on it or something else visual that could be slow? In other words, have you wrapped the EntityLoad by itself in a cftimer tag to be sure that it is the culprit?

The first thing I would do is enable SQL logging in your Application.cfc. Add logSQL=true to This.ormSettings.
That should allow you to grab the SQL that the ORM generates. Run it in an analyzer. See if the ORM SQL is doing something crazy, or whether there is an index that you missed.
Also are you doing paging as Ray talks about here: http://www.coldfusionjedi.com/index.cfm/2009/8/14/Simple-ColdFusion-9-ORM-Paging-Demo?
If not, have you tried using ORMExecuteQuery and HQL to enable paging?
Those are my thoughts.

When defining complex domain models with Hibernate - you will sometimes need to tweak the mapping to improve performance. This is especially true if you are dealing with inheritance (not sure how much inheritance is in your model). The ultimate goal is to have your query pulling from as few tables as possible while still preserving your domain model. This might require using the advanced inheritance mappings (more on that in a sec).
LOGGING SQL
As Terry mentioned, you will want to be sure you can log the actual SQL that is being passed to your database (yeah, you don't totally get away from SQL with ORM). Here is a great article on setting up logging for Hibernate in CF9 from Rupesh:
http://www.rupeshk.org/blog/index.php/2009/07/coldfusion-orm-how-to-log-sql/
HIBERNATE MAPPING FILES
Anytime you want to do something beyond the basic, you want to be sure that you are looking at the actual Hibernate mapping files that are generated for your CFC's. Be sure to set the following with all of your hibernate options in Application.cfc:
savemapping = true
While the cfproperty properties allow you to define many aspects of the mapping, there are actually some things that can only be done in the Hibernate mapping files (and there are tons of community resources on this).
INHERITANCE MAPPING
As I mentioned earlier, Hibernate provides different inheritance strategies for mapping. They are Table per Hierarchy, Table per subclass, Table per concrete class, and implicit polymorphism. You can read more about these types in the CF9 docs under Advanced Mapping > Inheritance Mapping or in the Hibernate documentation (as it would take forever to explain each of these).
Knowing how your tables are mapped is very important with inheritance (and it is also where Hibernate can generate some HUGE queries if you don't tweak your setup).
Those are the things I can think of - if you can give some additional information about your domain model - we can look to see what other things might be done to tweak it.

There is a good chance Hibernate is doing its caching thing. A fair comparison in my mind (everyone please feel free to add) is this:
EntityLoad("entity_name") is the same as doing a select * from TABLE
So, in this case, what Hibernate might be doing is instantiating the objects in memory and caching them a certain way; your database server might do something similar when you send such a broad SQL instruction.
I have been extremely interested in ORM the past few weeks and it looks to be a very rewarding undertaking.
For this reason, is there a time you would ever load all 500,000 records as a result? I assume not.
I have one large logging table that I will be attacking, and I am finding that the SQL fundamentals must still be there. For example, mark the fields that are indexes as such; this will speed things up incredibly when searching. I am sure the ORM can handle this.
Beyond this:
Find some excellent Hibernate forums, resources, and tutorials so you can learn Hibernate. This isn't really as much a ColdFusion --> ORM issue as what Hibernate might do on its own. I have ordered a few Hibernate books that I'm waiting on to see how they are.
Likewise there seems to be an incredible amount of Hibernate resources out there where you can bring the Performance enhancement solutions of Hibernate into the Coldfusion sphere. I might be making it too simple, but I see the CF-ORM implementation as a wrapper with some code generation to save us time.
Take a look at implementing filters to cut down your data in the EntityLoad() call.
As recommended in other threads, turn on sql logging and see what sql is being generated. Chances are it might not be what you need. Check out HQL to see if you can form a better statement.
Most importantly, share what you find. I'll volunteer to do the same on this as you've tempted me to go try this out in my spare time a bit sooner than planned.

Faisal, we ran into this with LINQ (a C# ORM).
Our solution was to create simple objects not holding the relational data. For instance, along with Users we had a SimpleUsers object which held little or no relation to any other object and had a limited set of columns.
There could be other ways of handling this but this approach helped tremendously with the query speed.
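The idea carries over to any stack. As a rough sketch (in Python with sqlite3, not ColdFusion or LINQ; the users and orders tables, the SimpleUser type, and both helper functions are hypothetical), the list view selects only a couple of columns and pages the result, and the related data is loaded only for the one record the user opens.

```python
import sqlite3
from collections import namedtuple

# Lightweight list item: no relations, limited set of columns.
SimpleUser = namedtuple("SimpleUser", "id name")

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, bio TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    total INTEGER
);
INSERT INTO users VALUES (1, 'Ann', 'long bio'), (2, 'Bob', 'long bio');
INSERT INTO orders VALUES (1, 1, 500), (2, 1, 700);
""")


def list_simple_users(conn, limit, offset=0):
    """List view: two columns, one table, paged in the database."""
    rows = conn.execute(
        "SELECT id, name FROM users ORDER BY id LIMIT ? OFFSET ?",
        (limit, offset),
    )
    return [SimpleUser(*r) for r in rows]


def load_user_detail(conn, user_id):
    """Detail view: pull the full row and its relations for one user only."""
    user = conn.execute(
        "SELECT id, name, bio FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    orders = conn.execute(
        "SELECT id, total FROM orders WHERE user_id = ? ORDER BY id", (user_id,)
    ).fetchall()
    return {"id": user[0], "name": user[1], "bio": user[2], "orders": orders}


first_page = list_simple_users(conn, limit=1)
detail = load_user_detail(conn, first_page[0].id)
```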

Many-to-many relationship: use associative table or delimited values in a column?

Update 2009.04.24
The main point of my question is not developer confusion and what to do about it.
The point is to understand when delimited values are the right solution.
I've seen delimited data used in commercial product databases (Ektron lol).
SQL Server even has an XML datatype, so that could be used for the same purpose as delimited fields.
/end Update
The application I'm designing has some many-to-many relationships. In the past, I've often used associative tables to represent these in the database. This has caused some confusion to the developers.
Here's an example DB structure:
Document
---------------
ID (PK)
Title
CategoryIDs (varchar(4000))
Category
------------
ID (PK)
Title
There is a many-to-many relationship between Document and Category.
In this implementation, Document.CategoryIDs is a big pipe-delimited list of CategoryIDs.
To me, this is bad because it requires use of substring matching in queries -- which cannot make use of indexes. I think this will be slow and will not scale.
With that model, to get all Documents for a Category, you would need something like the following:
select * from documents where categoryids like '%|' + #targetCategoryId + '|%'
My solution is to create an associative table as follows:
Document_Category
-------------------------------
DocumentID (PK)
CategoryID (PK)
This is confusing to the developers. Is there some elegant alternate solution that I'm missing?
I'm assuming there will be thousands of rows in Document. Category may be like 40 rows or so. The primary concern is query performance. Am I over-engineering this?
Is there a case where it's preferred to store lists of IDs in database columns rather than pushing the data out to an associative table?
Consider also that we may need to create many-to-many relationships among documents. This would suggest an associative table Document_Document. Is that the preferred design or is it better to store the associated Document IDs in a single column?
Thanks.

This is confusing to the developers.
Get better developers. That is the right approach.

Your suggestion IS the elegant, powerful, best practice solution.
Since I don't think the other answers said the following strongly enough, I'm going to do it.
If your developers 1) can't understand how to model a many-to-many relationship in a relational database, and 2) strongly insist on storing your CategoryIDs as delimited character data, then they ought to immediately lose all database design privileges. At the very least, they need an actual experienced professional to join their team who has the authority to stop them from doing something this unwise and can give them the database design training they are completely lacking.
Last, you should not refer to them as "database developers" again until they are properly up to speed, as this is a slight to those of us who actually are competent developers & designers.
I hope this answer is very helpful to you.
Update
The main point of my question is not developer confusion and what to do about it.
The point is to understand when delimited values are the right solution.
Delimited values are the wrong solution except in extremely rare cases. When individual values will ever be queried/inserted/deleted/updated this proves it was the wrong decision, because you have to parse and touch all the other values just to work with the desired one. By doing this you're violating first (!!!) normal form (this phrase should sound to you like an unbelievably vile expletive). Using XML to do the same thing is wrong, too. Storing delimited values or multi-value XML in a column could make sense when it is treated as an indivisible and opaque "property bag" that is NOT queried on by the database but is always sent whole to another consumer (perhaps a web server or an EDI recipient).
This takes me back to my initial comment. Developers who think violating first normal form is a good idea are very inexperienced developers in my book.
I will grant there are some pretty sophisticated non-relational data storage implementations out there using text property bags (such as Facebook(?) and other multi-million user sites running on thousands of servers). Well, when your database, user base, and transactions per second are big enough to need that, you'll have the money to develop it. In the meantime, stick with best practice.

It's almost always a big mistake to use comma separated IDs.
RDBMS are designed to store relationships.

My solution is to create an associative table as follows: This is confusing to the developers
Really? This is database 101. If this is confusing to them, then maybe they need to step away from their wizard-generated code and learn some basic DB normalization.
What you propose is the right solution!!

The Document_Category table in your design is certainly the correct way to approach the problem. If it's possible, I would suggest that you educate the developers instead of coming up with a suboptimal solution (and taking a performance hit, and not having referential integrity).
Your other options may depend on the database you're using. For example, in SQL Server you can have an XML column that would allow you to store your array in a pre-defined schema and then do joins based on the contents of that field. Other database systems may have something similar.

The many-to-many mapping you are doing is fine and normalized. It also allows for other data to be added later if needed. For example, say you wanted to add a time that the category was added to the document.
I would suggest having a surrogate primary key on the document_category table as well. And a Unique(documentid, categoryid) constraint if that makes sense to do so.
Why are the developers confused?
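A minimal sketch of that variant, using Python's sqlite3 module for illustration: the surrogate id and the UNIQUE constraint are the ones suggested above, while the added_at column name and the sample ids are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE document_category (
    id          INTEGER PRIMARY KEY,                      -- surrogate key
    document_id INTEGER NOT NULL,
    category_id INTEGER NOT NULL,
    added_at    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,  -- extra data per link
    UNIQUE (document_id, category_id)
)
""")

conn.execute(
    "INSERT INTO document_category (document_id, category_id) VALUES (1, 10)"
)

# The unique constraint still prevents the same pair from being linked twice.
try:
    conn.execute(
        "INSERT INTO document_category (document_id, category_id) VALUES (1, 10)"
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

link_count = conn.execute("SELECT COUNT(*) FROM document_category").fetchone()[0]
```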

The 'this is confusing to the developers' design means you have under-educated developers. It is the better relational database design - you should use it if at all possible.
If you really want to use the list structure, then use a DBMS that understands them. Examples of such databases would be the U2 (Unidata, Universe) DBMS, which are (or were, once upon a long time ago) based on the Pick DBMS. There are likely to be other similar DBMS providers.

This is the classic object-relational mapping problem. The developers are probably not stupid, just inexperienced or unaccustomed to doing things the right way. Shouting "3NF!" over and over again won't convince them of the right way.
I suggest you ask your developers to explain to you how they would get a count of documents by category using the pipe-delimited approach. It would be a nightmare, whereas the link table makes it quite simple.
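With the link table, that count is a single join and GROUP BY. Here is a small sketch using Python's sqlite3 module; the table layout follows the question, and the sample rows are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE category (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE document_category (
    document_id INTEGER NOT NULL REFERENCES document(id),
    category_id INTEGER NOT NULL REFERENCES category(id),
    PRIMARY KEY (document_id, category_id)
);
INSERT INTO document VALUES (1, 'Spec'), (2, 'Memo'), (3, 'Invoice');
INSERT INTO category VALUES (1, 'Legal'), (2, 'Finance');
INSERT INTO document_category VALUES (1, 1), (2, 1), (3, 2);
""")

# Count of documents per category; LEFT JOIN keeps categories with no documents.
counts = conn.execute("""
    SELECT c.title, COUNT(dc.document_id)
    FROM category c
    LEFT JOIN document_category dc ON dc.category_id = c.id
    GROUP BY c.id, c.title
    ORDER BY c.title
""").fetchall()

# All documents for one category is an indexed lookup rather than a LIKE scan.
legal_docs = [
    r[0]
    for r in conn.execute("""
        SELECT d.title
        FROM document d
        JOIN document_category dc ON dc.document_id = d.id
        WHERE dc.category_id = ?
        ORDER BY d.id
    """, (1,))
]
```

The pipe-delimited version of the first query would need to split every CategoryIDs value before anything could be grouped.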

The number one reason that my developers try this "comma-delimited values in a database column" approach is that they have a perception that adding a new table to address the need for multiple values will take too long to add to the data model and the database.
Most of them know that their workaround is bad for all kinds of reasons, but they choose this suboptimal method because they just can. They can do this and maybe never get caught, or they will get caught much later in the project when it is too expensive and risky to fix it. Why do they do this? Because their performance is measured solely on speed and not on quality or compliance.
It could also be, as on one of my projects, that the developers had a table to put the multi values in but were under the impression that duplicating that data in the parent table would speed up performance. They were wrong and they were called out on it.
So while you do need an answer to how to handle these costly, risky, and business-confidence damaging tricks, you should also try to find the reason why the developers believe that taking this course of action is better in the short and the long run for the project and company. Then fix both the perception and the data structures.
Yes, it could just be laziness, malicious intent, or cluelessness, but I'm betting most of the time developers do this stuff because they are constantly being told "just get it done". We on the data model and database design sides need to ensure that we aren't sending the wrong message about how responsive we can be to requests to fulfill a business requirement for a new entity/table/piece of information.
We should also see that data people need to be constantly monitoring the "as-built" part of our data architectures.
Personally, I never authorize the use of comma delimited values in a relational database because it is actually faster to build a new table than it is to build a parsing routine to create, update, and manage multiple values in a column and deal with all the anomalies introduced because sometimes that data has embedded commas, too.
Bottom line, don't do comma delimited values, but find out why the developers want to do it and fix that problem.
