How do I model many-to-many relationships with tables that have similar attributes?

Here's a fairly straightforward many-to-many mapping of Nerf gun toys to the price range that they fall under. The Zombie Strike and Elite Retaliator are pricey, while both the Jolt Blaster and Elite Triad are cheaper (in the $5.00-$9.99 range).
So far so good. But what happens when I want to start tracking the prices of other items? These other items have different columns, but still need PRICE_RANGES mappings. So I can potentially still use the PRICE_RANGES table, but I need other tables for the other items.
Let's add board games. How should I model this new table, and others like it?
Should I add multiple many-to-many tables, one for each new type of item I'm tracking?
Or should I denormalize PRICE_RANGES, get rid of the mapping tables altogether, and just duplicate PRICE_RANGES tables for every item type?
The second solution has the advantage of being much simpler, but at the cost of duplicating all the ranges in PRICE_RANGES (and there may be many thousands of price ranges, depending on how small the increments are). Is that denormalization still a valid solution?
Or maybe there's a third way that's considered better than these two?
Thanks for the help!

Why do you have a "price ranges" table at all? It makes the design highly restrictive, unless there is a really compelling reason I am missing. Here is what I would consider:
Drop the mapping tables
Drop the price ranges tables
Add a min price and max price column to each table for which you want to track price ranges. If there is no range, you can either allow max price to be NULL or make both the same price. Then you can just query the tables to find items within whatever range you want.
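For example, a minimal sketch of that approach (generic SQL; the table and column names are hypothetical):

ALTER TABLE nerf_guns ADD min_price DECIMAL(10,2);
ALTER TABLE nerf_guns ADD max_price DECIMAL(10,2);  -- leave NULL when there is no upper bound

-- items overlapping the $5.00-$9.99 bracket
SELECT name, min_price, max_price
FROM nerf_guns
WHERE min_price <= 9.99
  AND COALESCE(max_price, min_price) >= 5.00;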
Another thought I would consider: how many different types of products are you trying to track? If you are going to make a separate table for every single kind of product, that will quickly become unmanageable if you expect to have hundreds or thousands of items. Consider having a "Product" table with columns for the attributes shared across all products, such as price. It would have a ProductType column that either references a lookup table or just holds the type directly. Then have either a separate key/value table to cover other odds and ends like bolt capacity, or consider putting those in an XML/JSON/blob column to cover all the extra bits of info.
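As a rough sketch of that shape (hypothetical names; the extra-attributes column could equally be XML, a blob, or a key/value table depending on your engine):

CREATE TABLE product (
    product_id    INT PRIMARY KEY,
    product_type  VARCHAR(50) NOT NULL,   -- or a foreign key to a product_type lookup table
    name          VARCHAR(100) NOT NULL,
    min_price     DECIMAL(10,2),
    max_price     DECIMAL(10,2),
    extra_attrs   JSON                    -- e.g. {"dart_capacity": 6} for blasters
);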

Modeling N-to-N with DynamoDB

I'm working on a project that uses DynamoDB for most persistent data. I'm now trying to model a data structure that more closely resembles what one would model in a traditional SQL database, but I'd like to explore the possibilities of a good NoSQL design for this kind of data as well.
As an example, consider a simple N-to-N relation such as items grouped into categories. In SQL, this might be modeled with a connection table such as
items
-----
item_id (PK)
name
categories
----------
category_id (PK)
name
item_categories
---------------
item_id (PK)
category_id (PK)
To list all items in a category, one could perform a join such as
SELECT items.name from items
JOIN item_categories ON items.item_id = item_categories.item_id
WHERE item_categories.category_id = ?
And to list all categories to which an item belongs, the corresponding query could be made:
SELECT categories.name from categories
JOIN item_categories ON categories.category_id = item_categories.category_id
WHERE item_categories.item_id = ?
Is there any hope of modeling a relation like this in a fairly efficient way with a NoSQL database in general, and DynamoDB in particular, for simple use cases like the ones above (ideally not requiring a lot of, possibly even N, separate operations), when there is no equivalent of JOINs?
Or should I just go for RDS instead?
Things I have considered:
Inline categories as an array within item. This makes it easy to find the categories of an item, but does not solve getting all items within a category. And I would need to duplicate the needed attributes such as category name etc within each item. Category updates would be awkward.
Duplicate each item for each category, using category_id as the range key, and add a GSI with the reverse (category_id as hash, item_id as range). Denormalizing is common for NoSQL, but I still have doubts. Possibly split items into items and item_details and only duplicate the most common attributes that are needed in listings etc.
Go for a connection table mapping items to categories and vice versa. Use [item_id, category_id] as the key and [category_id, item_id] as a GSI, to support both queries. Duplicate the most common attributes (name etc.) here. To get all full items for a category I would still need to perform one query followed by N get operations though, which consumes a lot of CUs. Updates of item or category names would require multiple update operations, but that's not too difficult.
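For reference, a sketch of that last option's key layout (the table and index names here are hypothetical):

item_categories (base table)
----------------------------
item_id        (partition key)
category_id    (sort key)
item_name      (duplicated attribute for listings)
category_name  (duplicated attribute for listings)

category_items (GSI)
--------------------
category_id    (partition key)
item_id        (sort key)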
The dilemma I have is that the format of the data itself suits a document database perfectly, while the relations I need fit an SQL database. If possible I'd like to stay with DynamoDB, but obviously not at any cost...
You are already looking in the right direction!
In order to make an informed decision you will need to also consider the cardinality of your data:
Will you be expecting to have just a few (fewer than ten?) categories, or quite a lot (i.e. hundreds, thousands, tens of thousands)?
How about items per category: do you expect to have many categories with a few items in each, or lots of items in a few categories?
Then, you need to consider the cardinality of the total data set and the frequency of the various types of queries. Will you most often need to retrieve only the items in a single category? Or will you mostly be querying for items individually, and you just need statistics such as the number of items per category?
Finally, consider the expected growth of your dataset over time. DynamoDB will generally outperform an RDBMS at scale as long as your queries partition well.
Also consider the acceptable latency for each type of query you expect to perform, especially at scale. For instance, if you expect to have hundreds of categories with hundreds of thousands of items each, what does it mean to retrieve all items in a category? Surely you wouldn't be displaying them all to the user at once.
I encourage you to also consider another type of data store to accompany DynamoDB if you need statistics for your data, such as ElasticSearch or a Redis cluster.
In the end, if aggregate queries or joins are essential to your use case, or if the dataset at scale can generally be processed comfortably on a single RDBMS instance, don't try to fit a square peg in a round hole. A managed RDBMS solution like Aurora might be a better fit.

What's the best way in Postgres to store a bunch of arbitrary boolean values for a row?

I have a database full of recipes, one recipe per row. I need to store a bunch of arbitrary "flags" for each recipe to mark various properties such as Gluten-Free, No Meat, No Red Meat, No Pork, No Animals, Quick, Easy, Low Fat, Low Sugar, Low Calorie, Low Sodium and Low Carb. Users need to be able to search for recipes that contain one or more of those flags by checking checkboxes in the UI.
I'm searching for the best way to store these properties in the Recipes table. My ideas so far:
Have a separate column for each property and create an index on each of those columns. I may have upwards of about 20 of these properties, so I'm wondering if there's any drawbacks with creating a whole bunch of BOOL columns on a single table.
Use a bitmask for all properties and store the whole thing in one numeric column that contains the appropriate number of bits. Create a separate index on each bit so searches will be fast.
Create an ENUM with a value for each tag, then create a column that has an ARRAY of that ENUM type. I believe an index can be used for array operators on such a column, but I have never done this (see the sketch after this list).
Create a separate table that has a one-to-many mapping of recipes to tags. Each tag would be a row in this table. The table would contain a link to the recipe, and an ENUM value for which tag is "on" for that recipe. When querying, I'd have to do a nested SELECT to filter out recipes that didn't contain at least one of these tags. I think this is the more "normal" way of doing this, but it does make certain queries more complicated - If I want to query for 100 recipes and also display all their tags, I'd have to use an INNER JOIN and consolidate the rows, or use a nested SELECT and aggregate on the fly.
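For concreteness, here is roughly what I imagine that array-of-ENUM idea looking like (an untested sketch with hypothetical names):

CREATE TYPE recipe_flag AS ENUM ('gluten_free', 'low_fat', 'quick', 'easy');

CREATE TABLE recipes (
    recipe_id  BIGSERIAL PRIMARY KEY,
    name       TEXT NOT NULL,
    flags      recipe_flag[] NOT NULL DEFAULT '{}'
);

CREATE INDEX recipes_flags_idx ON recipes USING GIN (flags);

-- recipes carrying at least one of the checked flags
-- (the && "overlaps" operator can use the GIN index, unlike a plain ANY clause)
SELECT recipe_id, name
FROM recipes
WHERE flags && ARRAY['gluten_free', 'quick']::recipe_flag[];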
Write performance is not too big of an issue here since recipes are added by a backend process, and search speed is critical (there might be a few hundred thousand recipes eventually). I doubt I will add new tags all that often, but I want it to be at least possible to do without major headaches.
Thanks!
I would advise you to use a normalized setup; setting this up as a denormalized structure from the get-go is not something I would recommend.
Without knowing all the details of what you have going on, I think the best setup would be to have your recipe table plus a new property table and a new recipe_property table. That allows a recipe to have zero or many properties, and normalizes your data, making it fast and easy to maintain and query.
High level structure would be:
CREATE TABLE recipe (recipe_id BIGSERIAL PRIMARY KEY);
CREATE TABLE property (property_id BIGSERIAL PRIMARY KEY);
CREATE TABLE recipe_property (recipe_property_id BIGSERIAL PRIMARY KEY,
    recipe_id BIGINT NOT NULL REFERENCES recipe,
    property_id BIGINT NOT NULL REFERENCES property);
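A search in that shape, finding recipes carrying at least one of the checked flags, could look roughly like this (a sketch, assuming property also has a name column that isn't shown above):

SELECT r.recipe_id
FROM recipe r
WHERE EXISTS (
    SELECT 1
    FROM recipe_property rp
    JOIN property p ON p.property_id = rp.property_id
    WHERE rp.recipe_id = r.recipe_id
      AND p.name IN ('Gluten-Free', 'Low Fat')   -- the properties checked in the UI
);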

How would you implement a very wide "table"?

Let's say you're modeling an entity that has many attributes (2400+), far greater than the physical limit on a given database engine (e.g. ~1000 SQL Server). Knowing nothing about the relative importance of these data points (which ones are hot/used most often) besides the domain/candidate keys, how would you implement it?
A) EAV. (boo... Native relational tools thrown out the window.)
B) Go straight across. The first table has a primary key and 1000 columns, right up to the limit. The next table is 1000, foreign keyed to the first. The last table is the remaining 400, also foreign keyed.
C) Stripe evenly across ceil( n / limit ) tables. Each table has an even number of columns, foreign keying to the first table. 800, 800, 800.
D) Something else...
And why?
Edit: This is more of a philosophical/generic question, not tied to any specific limits or engines.
Edit^2: As many have pointed out, the data was probably not normalized. Per usual, business constraints at the time made deep research an impossibility.
My solution: investigate further. Specifically, establish whether the table is truly normalised (at 2400 columns this seems highly unlikely).
If not, restructure until it is fully normalised (at which point there are likely to be fewer than 1000 columns per table).
If it is already fully normalised, establish (as far as possible) approximate frequencies of population for each attribute. Place the most commonly occurring attributes on the "home" table for the entity, and use 2 or 3 additional tables for the less frequently populated attributes. (Try to make frequency of occurrence the criterion for determining which fields go on which tables.)
Only consider EAV for extremely sparsely populated attributes (preferably, not at all).
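A rough sketch of that split (hypothetical names), with the rarely populated attributes pushed into 1:1 side tables:

CREATE TABLE widget (
    widget_id INT PRIMARY KEY,
    common_attr_1 INT,
    common_attr_2 INT
    -- ... the most frequently populated attributes
);

CREATE TABLE widget_detail_1 (
    widget_id INT PRIMARY KEY REFERENCES widget (widget_id),
    rare_attr_1 INT,
    rare_attr_2 INT
    -- ... less frequently populated attributes
);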
Use Sparse Columns for up to 30000 columns. The great advantage over EAV or XML is that you can use Filtered Indexes in conjunction with sparse columns, for very efficient searches over common attributes.
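In SQL Server terms, that looks roughly like the following (hypothetical names): mark rarely populated columns as SPARSE and add filtered indexes on the attributes you search most.

CREATE TABLE widget_wide (
    widget_id INT PRIMARY KEY,
    attr_0001 INT SPARSE NULL,
    attr_0002 VARCHAR(50) SPARSE NULL
    -- ... a table may hold up to 30,000 sparse columns
);

CREATE NONCLUSTERED INDEX ix_widget_wide_attr_0001
    ON widget_wide (attr_0001)
    WHERE attr_0001 IS NOT NULL;   -- filtered index: small, and well suited to sparse data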
Without having much knowledge in this area, I think an entity with so many attributes really needs a redesign. By that I mean splitting the big thing into smaller parts that are logically connected.
The key item to me is this piece:
Knowing nothing about the relative importance of these data points (which ones are hot/used most often)
If you have an idea of which fields are more important, I would put those more important fields in the "native" table and let an EAV structure handle the rest.
The thing is, without this information you're really shooting blind anyway. Whether you have 2400 fields or just 24, you ought to have some kind of idea about the meaning (and therefore the relative importance, or at least the logical groupings) of your data points.
I'd use a one to many attribute table with a foreign key to the entity.
E.g.:
entities: id,
attrs: id, entity_id, attr_name, value
ADDED
Or as Butler Lampson would say, "all problems in Computer Science can be solved by another level of indirection"
I would rotate the columns and make them rows. Rather than having a column containing the name of the attribute as a string (nvarchar) you could have it as a fkey back to a lookup table which contains a list of all the possible attributes.
Rotating it in this way means you:
don't have masses of tables to record the details of just one item
don't have massively wide tables
only store the info you need, thanks to the rotation (if you don't want to store a particular attribute, then just don't have that row)
I'd look at the data model a lot more carefully. Is it 3rd normal form? Are there groups of attributes that should be logically grouped together into their own tables?
Assuming it is normalized and the entity truly has 2400+ attributes, I wouldn't be so quick to boo an EAV model. IMHO, it's the best, most flexible solution for the situation you've described. It gives you built-in support for sparse data and gives you good searching speed, as the values for any given attribute can be found in a single index.
I would use a vertical approach (increase the number of rows) instead of a horizontal one (increase the number of columns).
You can try something like:
Table -- id, property_name, property_value
The advantage of this approach is that there is no need to alter or create a table when you introduce a new property/column.

One mysql table with many fields or many (hundreds of) tables with fewer fields?

I am designing a system for a client, where he is able to create data forms for various products he sells himself.
The number of fields he will be using will not be more than 600-700 (worst-case scenario); it looks like he will probably be in the range of 400-500 (max).
I had 2 methods in mind for creating the database (using meta data):
a) Create a table for each product, which will hold only the fields necessary for that product; this will result in hundreds of tables, but each with only the necessary fields for its product
or
b) use one single table with all available form fields (anywhere from the current 300 to a max of 700), resulting in one table that will have MANY fields, of which only about 10% will be used for each product entry (a product should usually not use more than 50-80 fields)
Which solution is best, keeping in mind that table maintenance (creation, updates and changes) will be done using metadata, so I will not need to make changes to the table(s) manually?
Thank you!
/**** UPDATE *****/
Just an update: even after all this time (and a lot of additional experience gathered), I need to mention that not normalizing your database is a terrible idea. What is more, an unnormalized database almost always (in my experience, just always) indicates a flawed application design as well.
I would have 3 tables:
product
-------
id
name
whatever else you need
field
-----
id
field_name
anything else you might need
product_field
-------------
id
product_id
field_id
field_value
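To pull one product's form back out as name/value pairs, a query along these lines would do (a sketch using the column names above, with 42 standing in for some product id):

SELECT f.field_name, pf.field_value
FROM product_field pf
JOIN field f ON f.id = pf.field_id
WHERE pf.product_id = 42;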
Your key deciding factor is whether normalization is required. Even though you are only adding data using an application, you'll still need to cater for anomalies, e.g. what happens if someone's phone number changes, and they insert multiple rows over the lifetime of the application? Which row contains the correct phone number?
As an example, you may find that you'll have repeating groups in your data, like one person with several phone numbers; rather than have three columns called "Phone1", "Phone2", "Phone3", you'd break that data into its own table.
There are other issues in normalisation, such as transitive or non-key dependencies. These concepts will hopefully lead you to a database table design without modification anomalies, as you should hope for!
Pulegium's solution is a good way to go.
You do not want to go with the one-table-for-each-product solution, because the structure of your database should not have to change when you insert or delete a product. Only the rows of one or many tables should be inserted or deleted, not the tables themselves.
While it's possible that it may be necessary, having that many fields for something as simple as a product list sounds to me like you probably have a flawed design.
You need to analyze your potential table structures to ensure that each field contains no more than one piece of information (e.g., "2 hammers, 500 nails" in a single field is bad) and that each piece of information has no more than one field where it belongs (e.g., having phone1, phone2, phone3 fields is bad). Either of these situations indicates that you should move that information out into a separate, related table with a foreign key connecting it back to the original table. As pulegium has demonstrated, this technique can quickly break things down to three tables with only about a dozen fields total.

Too many columns design question

I have a design question.
I have to store approx 100 different attributes in a table which should be searchable also. So each attribute will be stored in its own column. The value of each attribute will always be less than 200, so I decided to use TINYINT as data type for each attribute.
Is it a good idea to create a table which will have approx 100 columns (Each of TINYINT)? What could be wrong in this design?
Or should I classify the attributes into some groups (say 4 groups) and store them in 4 different tables (each with approx. 25 columns)?
Or any other data storage technique I have to follow.
Just for example the table is Table1 and it has columns Column1,Column2 ... Column100 of each TINYINT data type.
Since the size of each row is going to be very small, is it OK to do what I explained above?
I just want to know the advantages/disadvantages of it.
If you think that it is not a good idea to have a table with 100 columns, then please suggest other alternatives.
Please note that I don't want to store the information in composite form (e.g. a few XML columns).
Thanks in advance
Wouldn't a many-to-many setup work here?
Say Table A would have a list of widgets, which your attributes would apply to.
Table B has your types of attributes (color, size, weight, etc), each as a different row (not column)
Table C has foreign keys to the widget id (Table A) and the attribute type (Table B) and then it actually has the attribute value
That way you don't have to change your table structure when you've got a new attribute to add; you simply add a new attribute type row to Table B (and the corresponding value rows to Table C).
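Concretely, that might look something like this (hypothetical names, with TINYINT for the values as in the question):

CREATE TABLE widget (
    widget_id INT PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE attribute_type (
    attribute_type_id INT PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE widget_attribute (
    widget_id INT NOT NULL,
    attribute_type_id INT NOT NULL,
    value TINYINT NOT NULL,
    PRIMARY KEY (widget_id, attribute_type_id),
    FOREIGN KEY (widget_id) REFERENCES widget (widget_id),
    FOREIGN KEY (attribute_type_id) REFERENCES attribute_type (attribute_type_id)
);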
It's OK to have 100 columns. Why not? Just employ code generation to reduce the hand-writing of those columns.
I wouldn't worry much about the number of columns per se (unless you're stuck using some really terrible relational engine, in which case upgrading to a decent one would be my most hearty recommendation -- what engine[s] do you plan/need to support, btw?) but about the searchability thereby.
Does the table need to be efficiently searchable by the value of an attribute? If you need 100 indexes on that table, THAT might make insert and update operations slow -- how frequent are such modifications (vs reads to the table and especially searches on attribute values) and how important is their speed to you?
If you do "need it all" there just possibly may be no silver bullet of a "perfect" solution, just compromises among unpleasant alternatives -- more info is needed to weigh them. Are typical rows "sparse", i.e. mostly NULL with just a few of the 100 attributes "active" for any given row (just different subsets for each)? Is there (at least statistically) some correlation among groups of attributes (e.g. most of the time when attribute 12 is worth 93, attribute 41 will be worth 27 or 28 -- that sort of thing)?
Based on your last point, it seems to me that you may have a bad design. What is the nature of these columns? Are you storing information together that shouldn't be together, or storing information that should be in related tables?
So really, to best help you, we need to see the nature of the data you have.
What would be in column1, column3, column10 versus column4, column15, column20, column25?
I had a table with 250 columns. There's nothing wrong with that; for some cases, it's just how it works.
Unless, that is, some of the columns you are defining have a meaning per se, as independent entities that can be shared by multiple rows. In that case, it makes sense to normalize that set of columns out into a different table, and put a column in the original table (possibly with a foreign key constraint).
I think the correct way is to have a table that looks more like:
CREATE TABLE [dbo].[Settings](
[key] [varchar](250) NOT NULL,
[value] tinyint NOT NULL
) ON [PRIMARY]
Put an index on the key column. You can eventually make a page where the user can update the values.
Having done a lot of these in the real world, I don't understand why anyone would advocate having each variable be its own column. You have "approx 100 different attributes" so far; don't you think you are going to want to add to and delete from this list? Every time you do, it's a table change and a production release. You will not be able to build something to hand the maintenance off to a power user. Are your reports going to be hard-coded too? If things take off and you reach the maximum of 1,024 columns, are you going to rework the whole thing?
It is nothing to expand the table above (add Category, LastEditDate, LastEditBy, IsActive, etc.) or to create archiving functionality. Doing the same with the column-based solution is much more awkward.
Performance is not going to be any different with this small amount of data, but to rely on the programmer to make and release a change every time the list changes is unworkable.