I'm working on a project that uses DynamoDB for most persistent data. I'm now trying to model a data structure that more closely resembles what one would model in a traditional SQL database, but I'd like to explore the possibilities of a good NoSQL design for this kind of data as well.
As an example, consider a simple N-to-N relation such as items grouped into categories. In SQL, this might be modeled with a connection table such as
items
-----
item_id (PK)
name
categories
----------
category_id (PK)
name
item_categories
---------------
item_id (PK)
category_id (PK)
To list all items in a category, one could perform a join such as
SELECT items.name FROM items
JOIN item_categories ON items.item_id = item_categories.item_id
WHERE item_categories.category_id = ?
And to list all categories to which an item belongs, the corresponding query could be made:
SELECT categories.name FROM categories
JOIN item_categories ON categories.category_id = item_categories.category_id
WHERE item_categories.item_id = ?
Is there any hope of modeling a relation like this reasonably efficiently with a NoSQL database in general, and DynamoDB in particular (i.e. without requiring a lot of, possibly even N, separate operations) for simple use-cases like the ones above, when there is no equivalent of JOINs?
Or should I just go for RDS instead?
Things I have considered:
Inline categories as an array within item. This makes it easy to find the categories of an item, but does not solve getting all items within a category. And I would need to duplicate the needed attributes such as category name etc within each item. Category updates would be awkward.
Duplicate each item for each category and use category_id as range key, and add a GSI with the reverse (category_id as hash, item_id as range). De-normalizing is common for NoSQL, but I still have doubts. Possibly split items into items and item_details and only duplicate the most common attributes that are needed in listings etc.
Go for a connection table mapping items to categories and vice versa. Use [item_id, category_id] as the key and [category_id, item_id] as a GSI, to support both queries (see the sketch below). Duplicate the most common attributes (name etc) here. To get all full items for a category I would still need to perform one query followed by N get operations though, which consumes a lot of CUs. Updates of item or category names would require multiple update operations, but that's not too difficult.
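For illustration, the two lookups against such a connection table would look roughly like this, written as DynamoDB PartiQL statements (a sketch only; the table, index and key names are made up):

-- all categories for a given item (query the base table on its hash key)
SELECT * FROM "item_categories" WHERE item_id = 'item-123'

-- all items in a given category (query the GSI that reverses the key)
SELECT * FROM "item_categories"."category_id-item_id-index" WHERE category_id = 'cat-42'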
The dilemma I have is that the format of the data itself suits a document database perfectly, while the relations I need fit an SQL database. If possible I'd like to stay with DynamoDB, but obviously not at any cost...
You are already looking in the right direction!
In order to make an informed decision you will need to also consider the cardinality of your data:
Will you be expecting to have just a few (fewer than ten?) categories, or quite a lot (i.e. hundreds, thousands, tens of thousands, etc.)?
How about items per category: do you expect to have many categories with a few items in each, or lots of items in a few categories?
Then, you need to consider the cardinality of the total data set and the frequency of the various types of queries. Will you most often need to retrieve only the items in a single category? Or will you mostly be querying to retrieve items individually, and you just need statistics such as the number of items per category?
Finally, consider the expected growth of your dataset over time. DynamoDB will generally outperform an RDBMS at scale as long as your queries partition well.
Also consider the acceptable latency for each type of query you expect to perform, especially at scale. For instance, if you expect to have hundreds of categories with hundreds of thousands of items each, what does it mean to retrieve all items in a category? Surely you wouldn't be displaying them all to the user at once.
I encourage you to also consider another type of data store to accompany DynamoDB if you need statistics for your data, such as ElasticSearch or a Redis cluster.
In the end, if aggregate queries or joins are essential to your use case, or if the dataset at scale can generally be processed comfortably on a single RDBMS instance, don't try to fit a square peg in a round hole. A managed RDBMS solution like Aurora might be a better fit.
Related
I have two tables users (whose primary key is user_id) and items (whose primary key is item_id).
One user_id (resp. item_id) can be associated with zero to many item_id (resp. user_id).
Under normal circumstances, I would model this many-to-many relationship with a joining table linking the two i.e. users_items (user_id, item_id).
In practice however, most user_id are associated with all items rows, and with millions of users and items the matching between every single pair results in billions of rows and rapidly becomes impractical (indexation, RAM usage, storage...).
For instance, 2 million users each having 1 million items would result in 2 trillion (2,000 billion) rows.
The less-than-ideal temporary solution I came up with is to add a boolean is_owned_by_all_users column to the items table, which seems to me like an inherently bad design (more complex queries, information split between two tables, etc.).
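For example, the "items of a given user" lookup then ends up looking something like this (a sketch; :user_id is the bound parameter):

SELECT i.item_id
FROM items i
WHERE i.is_owned_by_all_users = TRUE
   OR EXISTS (SELECT 1
              FROM users_items ui
              WHERE ui.item_id = i.item_id
                AND ui.user_id = :user_id);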
Is there a better strategy to implement this type of relationship?
While this is a general SQL question, I am also interested in engine-specific implementations that would solve this problem (if more suited to this particular scenario).
The solution I've seen posed to this is that you create a "security profile" table. Each record in the table represents a particular list of users.
Then, instead of a many-to-many bridge table, you have a many-to-one: each item record (many) refers to one security profile record (which actually represents many users).
The security profile record represents basically a particular combination of users.
There is of course overhead in maintaining, generating, and decoding this table, but it stops you from needing a bridge table.
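A rough sketch of what I mean (table and column names are only illustrative):

CREATE TABLE security_profile (profile_id INT PRIMARY KEY);

CREATE TABLE security_profile_user (
    profile_id INT REFERENCES security_profile (profile_id),
    user_id    INT REFERENCES users (user_id),
    PRIMARY KEY (profile_id, user_id)
);

-- each item now points at exactly one profile instead of at many users
ALTER TABLE items ADD COLUMN profile_id INT REFERENCES security_profile (profile_id);

With most users owning all items, the "everyone" combination is stored once and shared by all of those item rows.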
If there are a lot of different combinations, it's also impractical.
If that is not practical, then your original idea sounds fine to me; if that is a particular characteristic of your data, there's nothing wrong with modelling it that way.
Here's a fairly straightforward many-to-many mapping of Nerf gun toys to the price range that they fall under. The Zombie Strike and Elite Retaliator are pricey, while both the Jolt Blaster and Elite Triad are cheaper (in the $5.00-$9.99 range).
So far so good. But what happens when I want to start tracking the prices of other items? These other items have different columns, but still need PRICE_RANGES mappings. So I can potentially still use the PRICE_RANGES table, but I need other tables for the other items.
Let's add board games. How should I model this new table, and others like it?
Should I add multiple many-to-many tables, one for each new type of item I'm tracking?
Or should I denormalize PRICE_RANGES, get rid of the mapping tables altogether, and just duplicate PRICE_RANGES tables for every item type?
The second solution has the advantage of being much simpler, but at the cost of duplicating all the ranges in PRICE_RANGES (and there may be many thousands of PRICE_RANGES, depending on how small the increments are). Is that denormalization still a valid solution?
Or maybe there's a third way that's considered better than these two?
Thanks for the help!
Why do you have a "price ranges" table at all? It seems highly restrictive, unless there is a really compelling reason I am missing. Here is what I would consider:
Drop the mapping tables
Drop the price ranges tables
Add a min price and max price to each table for which you want to track price ranges (see the sketch below). If there is no range, you can either allow max price to be null, or make both the same price. Then you can just query the tables to find items within whatever range you want.
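Something like this, assuming one of the item tables is called nerf_toys (names are made up):

-- items whose price range overlaps the range you are searching for
SELECT name, min_price, max_price
FROM nerf_toys
WHERE min_price <= 9.99
  AND COALESCE(max_price, min_price) >= 5.00;  -- COALESCE covers the case where max_price is allowed to be NULL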
Another thought I would consider: how many different types of products are you trying to track? If you are going to make a separate table for every single kind of product, that will quickly become unmanageable if you expect to have hundreds or thousands of item types. Consider having a "Product" table with columns for the attributes shared across all the products, such as price. It would have a ProductType column that either references a lookup table or just stores the type directly. Then have either a separate key/value table to cover other random things like bolt capacity, or even consider putting that in an XML/JSON/blob column to cover all the extra bits of info.
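Roughly what I have in mind (a sketch only; names and types are illustrative):

CREATE TABLE Product (
    ProductId       INT PRIMARY KEY,
    ProductType     VARCHAR(50),   -- or a foreign key to a ProductType lookup table
    Name            VARCHAR(200),
    MinPrice        DECIMAL(10, 2),
    MaxPrice        DECIMAL(10, 2),
    ExtraAttributes TEXT           -- or an XML/JSON column for type-specific bits like bolt capacity
);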
I'm trying to design the best way to index my data into Azure Search. Let's say my Azure SQL Database contains two tables:
products
orders
In my Azure Search index I want to have not only products (name, category, description etc.), but also count of orders for this product (to use this in the scoring profiles, to boost popular products in search results).
I think that the best way to do this is to create a view (indexed view?) which will contain columns from products and the count of orders for each product, but I'm not sure whether my view (indexed view?) can have its own rowversion column that changes every time the count changes (orders may be withdrawn - DELETED - and placed - INSERTED).
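Roughly the shape of view I have in mind (a sketch; column names are made up):

CREATE VIEW products_with_order_count AS
SELECT p.product_id,
       p.name,
       p.category,
       p.description,
       COUNT(o.order_id) AS order_count
FROM products p
LEFT JOIN orders o ON o.product_id = p.product_id
GROUP BY p.product_id, p.name, p.category, p.description;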
Maybe there is some easier solution to my problem? Any hints are appreciated.
Regards,
MJ
Yes, I believe the way you are looking to do this is a good approach. Something else I have seen people do is to also include types. For example, you could have a Collection field (which is an array of strings), perhaps called OrderTypes, that you would load with all of the associated order types for that product. That way you can use the Azure Search $facets feature to show the total count of specific order types, and you can also use this to drill into the specifics of those orders. For example, you could then filter based on the order type they selected. Certainly, if there are too many types of orders, that might not be viable.
In any case, yes, I think this would work well. Also, don't forget that if you want to periodically update this count, you could simply push just that value (rather than sending all the product fields) to make it more efficient.
A view cannot have its "own" rowversion column - that column should come from either products or orders table. If you make that column indexed, a high water mark change tracking policy will be able to capture new or updated (but not deleted) rows efficiently. If products are deleted, you should look into using a soft-delete approach as described in http://azure.microsoft.com/en-us/documentation/articles/search-howto-connecting-azure-sql-database-to-azure-search-using-indexers-2015-02-28/
HTH,
Eugene
We are currently developing an online advert site for people to buy and sell (similar to Gumtree). The difference is that it will be used by employees who work for the company; it won't be reachable by people outside the company.
Now, we have 15 categories which have sub-categories, and those sub-categories have child categories.
We have a main table called Adverts which consists of ItemId, Title, SubTitle, Description, CreatedBy, BroughtBy, StartDate, EndDate, and ParentCategoryId, SubCategoryId and ChildCategoryId etc.
Now, instead of having one massive table which holds all the details for the item they are selling, we were going to create a separate table (or tables) per category for the details of the item.
So we would have Advert.Vehicle_Spec, which would have all the details about a car they were selling, i.e.
ItemId (which will be a FK to the main Advert table), Make, Model, Colour, Mot, Tax etc
That way, when we query the main Adverts table, we can join onto the relevant spec table, which in a way would keep the tables clean and tidy. Now, my question to you is: is this a good approach? Would there be any performance issues with it? I will create all the relevant FKs where needed to help with queries etc.
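For example (a rough sketch; types abbreviated):

CREATE TABLE Advert.Vehicle_Spec (
    ItemId INT PRIMARY KEY REFERENCES Adverts (ItemId),
    Make   VARCHAR(50),
    Model  VARCHAR(50),
    Colour VARCHAR(30),
    Mot    DATE,
    Tax    DATE
);

-- listing vehicle adverts together with their spec
SELECT a.Title, v.Make, v.Model, v.Colour
FROM Adverts a
JOIN Advert.Vehicle_Spec v ON v.ItemId = a.ItemId;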
I did ask this question on an SQL Server forum, and one individual suggested using XML: each category gets an XML schema, and the XML tags and values are held in a single field, but the data varies depending on what type of item is being sold. This requires more setup but probably has the best overall balance of performance and flexibility. I personally have never worked with XML within SQL, so I can't comment on whether this is a good approach or not.
Each category can have many different statuses, and we already have a variety of tables which hold the description of each status. The queries we will be performing will vary across SELECT, DELETE, INSERT and UPDATE; some queries will have multiple joins onto the Status/User tables. We will also be implementing a "Suggested" form which will show all records suggested for a user depending on what they search for.
Is XML right for this in regards to flexibility and performance?
XML seems to be a good approach for this: you can write stored procedures that query the specific categories you want, organize the results into tables, and display them. You may then want to use something like XSLT to extract the XML data and display it in a table.
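For example, if the category-specific data goes into a single XML column on Adverts (call it Details, purely for illustration), pulling typed values back out in SQL Server could look something like this (the element names are made up):

SELECT a.ItemId,
       a.Title,
       a.Details.value('(/Spec/Make)[1]',  'VARCHAR(50)') AS Make,
       a.Details.value('(/Spec/Model)[1]', 'VARCHAR(50)') AS Model
FROM Adverts a
WHERE a.ParentCategoryId = @VehicleCategoryId;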
I have a database full of recipes, one recipe per row. I need to store a bunch of arbitrary "flags" for each recipe to mark various properties such as Gluten-Free, No Meat, No Red Meat, No Pork, No Animals, Quick, Easy, Low Fat, Low Sugar, Low Calorie, Low Sodium and Low Carb. Users need to be able to search for recipes that contain one or more of those flags by checking checkboxes in the UI.
I'm searching for the best way to store these properties in the Recipes table. My ideas so far:
Have a separate column for each property and create an index on each of those columns. I may have upwards of about 20 of these properties, so I'm wondering if there are any drawbacks to creating a whole bunch of BOOL columns on a single table.
Use a bitmask for all properties and store the whole thing in one numeric column that contains the appropriate number of bits. Create a separate index on each bit so searches will be fast.
Create an ENUM with a value for each tag, then create a column that is an ARRAY of that ENUM type. I believe an ANY clause on an array column can use an INDEX, but I have never done this (see the sketch after this list of options).
Create a separate table that has a one-to-many mapping of recipes to tags. Each tag would be a row in this table. The table would contain a link to the recipe, and an ENUM value for which tag is "on" for that recipe. When querying, I'd have to do a nested SELECT to filter out recipes that didn't contain at least one of these tags. I think this is the more "normal" way of doing this, but it does make certain queries more complicated - If I want to query for 100 recipes and also display all their tags, I'd have to use an INNER JOIN and consolidate the rows, or use a nested SELECT and aggregate on the fly.
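To make the ENUM-array idea concrete, here is roughly what I'm picturing (PostgreSQL syntax; type, column and flag names are just for illustration):

CREATE TYPE recipe_flag AS ENUM ('gluten_free', 'no_meat', 'low_fat', 'low_carb');
ALTER TABLE recipes ADD COLUMN flags recipe_flag[];
CREATE INDEX recipes_flags_idx ON recipes USING GIN (flags);

-- recipes matching any of the checked flags; the overlap operator can use the GIN index
SELECT * FROM recipes WHERE flags && ARRAY['gluten_free', 'low_carb']::recipe_flag[];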
Write performance is not too big of an issue here since recipes are added by a backend process, and search speed is critical (there might be a few hundred thousand recipes eventually). I doubt I will add new tags all that often, but I want it to be at least possible to do without major headaches.
Thanks!
I would advise you to use a normalized setup; setting this up from the get-go as a de-normalized structure is not something I would recommend.
Without knowing all the details of what you have going on, I think the best setup would be to have your recipe table plus a new property table and a new recipe_property table. That allows a recipe to have zero or many properties, and normalizes your data, making it fast and easy to maintain and query.
High level structure would be:
CREATE TABLE recipe (recipe_id INT PRIMARY KEY);
CREATE TABLE property (property_id INT PRIMARY KEY);
CREATE TABLE recipe_property (recipe_property_id INT PRIMARY KEY, recipe_id INT REFERENCES recipe (recipe_id), property_id INT REFERENCES property (property_id));
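Querying for recipes that have at least one of the checked flags is then a straightforward join (a sketch; the property ids stand in for whichever flags the user ticked):

SELECT DISTINCT r.recipe_id
FROM recipe r
JOIN recipe_property rp ON rp.recipe_id = r.recipe_id
WHERE rp.property_id IN (1, 4, 7);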