Query by attributes of referenced entity in RavenDB

I have two collections in RavenDB: one for products and another for pricing strategies for each product. Each pricing strategy references a single product by product ID. For example, a product may look like this:
products/1
{
  "Brand": "Dewalt",
  "Model": "ABC123",
  "Category": "Tools"
}
and a strategy may look like this:
strategies/1
{
  "ProductId": "products/1",
  "PriceCalculation": {
    "$type": "...",
    "Margin": 0.2
  }
}
I need to be able to query strategies by their attributes but also by the attributes of the associated products. For example - return all strategies with a specific margin where the product is in a specific category. If product data was denormalized and stored with the strategy then I could simply add product attributes to the index. Is there a way to do this without denormalization?
I understand that the Include method allows the inclusion of referenced entities in the result set so that they don't have to be loaded, but it doesn't support querying on the included entity. The same is true for live projections - they allow including referenced entities in the result set but don't support querying by attributes of the referenced entity.
I can run two queries across two indexes - one for strategies and another for products and then join the two result sets on product ID. The problem in this case is that the product collection may not always be synchronized with the set of products referenced by strategies. More specifically, the products collection may contain more products than are referenced by strategies and so the query may return products which don't have a pricing strategy and therefore can't be joined to a strategy all while taking up a position in the result set.
What could work is resolving the referenced entity in the Map function and then including attributes of the referenced entity in the index.
EDIT
I seem to be looking for this: https://groups.google.com/d/msg/ravendb/k3qvdEb870U/95OWtjL3U3YJ

After a bit of experimentation, I've found that the best way to do this is with a Multi-Maps / Reduce Index. Two map functions are specified, one for products and another for strategies. The reduce function groups the two result sets by product ID and then merges them in the output. The declaration of this index is a bit awkward because you have to ensure that the shapes of the two map results match - it would be nice if Raven did this automatically with an option to override. Merging results in the reduce function is also awkward because, for each field, you have to select a value from the group which comes from the desired map function. Overall though, a multi-maps / reduce index allows the joining of distinct collections into a single indexed projection.
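The idea can be sketched outside RavenDB in plain Python. This is not the RavenDB index API; it only mimics the technique: two map functions that emit entries of the same shape, and a reduce step that groups by product ID and picks each field from whichever map supplied it. The function names, the second sample product, and the assumption of at most one strategy per product are illustrative only.

```python
from collections import defaultdict

# Sample documents; products/2 is a hypothetical product with no strategy.
products = {
    "products/1": {"Brand": "Dewalt", "Model": "ABC123", "Category": "Tools"},
    "products/2": {"Brand": "Acme", "Model": "XYZ", "Category": "Toys"},
}
strategies = {
    "strategies/1": {"ProductId": "products/1", "PriceCalculation": {"Margin": 0.2}},
}

def map_product(doc_id, doc):
    # Same output shape as map_strategy; strategy fields are left empty.
    return {"ProductId": doc_id, "Category": doc["Category"],
            "StrategyId": None, "Margin": None}

def map_strategy(doc_id, doc):
    # Same output shape as map_product; product fields are left empty.
    return {"ProductId": doc["ProductId"], "Category": None,
            "StrategyId": doc_id, "Margin": doc["PriceCalculation"]["Margin"]}

def reduce_by_product(entries):
    # Group map output by product ID, then merge each group into one row.
    groups = defaultdict(list)
    for entry in entries:
        groups[entry["ProductId"]].append(entry)
    merged = []
    for product_id, group in groups.items():
        row = {"ProductId": product_id}
        for field in ("Category", "StrategyId", "Margin"):
            # Take the value from whichever map function filled it in.
            row[field] = next((e[field] for e in group if e[field] is not None), None)
        merged.append(row)
    return merged

def build_index():
    entries = [map_product(i, d) for i, d in products.items()]
    entries += [map_strategy(i, d) for i, d in strategies.items()]
    return reduce_by_product(entries)

def query(index, category, margin):
    # Rows without a strategy are skipped, so unreferenced products
    # never take up a position in the result set.
    return [r for r in index
            if r["StrategyId"] is not None
            and r["Category"] == category and r["Margin"] == margin]
```

Calling `query(build_index(), "Tools", 0.2)` returns the single merged row for `strategies/1`, while the product without a strategy is filtered out.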

Related

Modeling N-to-N with DynamoDB

I'm working on a project that uses DynamoDB for most persistent data. I'm now trying to model a data structure that more closely resembles what one would model in a traditional SQL database, but I'd also like to explore the possibilities for a good NoSQL design for this kind of data.
As an example, consider a simple N-to-N relation such as items grouped into categories. In SQL, this might be modeled with a connection table such as
items
-----
item_id (PK)
name
categories
----------
category_id (PK)
name
item_categories
---------------
item_id (PK)
category_id (PK)
To list all items in a category, one could perform a join such as
SELECT items.name from items
JOIN item_categories ON items.item_id = item_categories.item_id
WHERE item_categories.category_id = ?
And to list all categories to which an item belongs, the corresponding query could be made:
SELECT categories.name from categories
JOIN item_categories ON categories.category_id = item_categories.category_id
WHERE item_categories.item_id = ?
Is there any hope in modeling a relation like this with a NoSQL database in general, and DynamoDB in particular, in a fairly efficient way (not requiring a lot of (N, even?) separate operations) for a simple use-case like the ones above - when there is no equivalent of JOINs?
Or should I just go for RDS instead?
Things I have considered:
Inline categories as an array within item. This makes it easy to find the categories of an item, but does not solve getting all items within a category. And I would need to duplicate the needed attributes such as category name etc within each item. Category updates would be awkward.
Duplicate each item for each category and use category_id as range key, and add a GSI with the reverse (category_id as hash, item_id as range). Denormalizing is common for NoSQL, but I still have doubts. Possibly split items into items and item_details and only duplicate the most common attributes that are needed in listings etc.
Go for a connection table mapping items to categories and vice versa. Use [item_id, category_id] as key and [category_id, item_id] as GSI, to support both queries. Duplicate the most common attributes (name etc.) here. To get all full items for a category I would still need to perform one query followed by N get operations though, which consumes a lot of capacity units. Updates of item or category names would require multiple update operations, but would not be too difficult.
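The connection-table option can be illustrated with a small in-memory model. This is only a sketch of the access pattern, not DynamoDB client code: one dictionary stands in for the table keyed by [item_id, category_id], a second for the reverse GSI, and the class name, method names, and sample rows are made up for illustration.

```python
class ItemCategoryTable:
    """Toy stand-in for a connection table with a reverse index."""

    def __init__(self):
        self.by_item = {}      # item_id -> {category_id: row}, like the base table
        self.by_category = {}  # category_id -> {item_id: row}, like the GSI

    def put(self, item_id, category_id, item_name, category_name):
        # The commonly listed attributes (names) are duplicated on every row.
        row = {"item_id": item_id, "category_id": category_id,
               "item_name": item_name, "category_name": category_name}
        self.by_item.setdefault(item_id, {})[category_id] = row
        self.by_category.setdefault(category_id, {})[item_id] = row

    def categories_of(self, item_id):
        # One lookup on the base key, no per-category follow-up reads.
        return sorted(r["category_name"] for r in self.by_item.get(item_id, {}).values())

    def items_in(self, category_id):
        # One lookup on the reverse index, no per-item follow-up reads.
        return sorted(r["item_name"] for r in self.by_category.get(category_id, {}).values())

table = ItemCategoryTable()
table.put("i1", "c1", "Hammer", "Tools")
table.put("i1", "c2", "Hammer", "Gifts")
table.put("i2", "c1", "Drill", "Tools")
```

Both listing queries are answered from a single keyed lookup because the names are duplicated on each row; the price is that renaming an item or category means rewriting every row that carries the old name.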
The dilemma I have is that the format of the data itself suits a document database perfectly, while the relations I need fit an SQL database. If possible I'd like to stay with DynamoDB, but obviously not at any cost...
You are already looking in the right direction!
In order to make an informed decision you will also need to consider the cardinality of your data:
Will you be expecting to have just a few (fewer than ten?) categories? Or quite a lot (i.e. hundreds, thousands, tens of thousands etc.)?
How about items per category: do you expect to have many categories with a few items in each, or lots of items in a few categories?
Then, you need to consider the cardinality of the total data set and the frequency of various types of queries. Will you most often need to retrieve only items in a single category? Or will you mostly be querying to retrieve items individually, and you just need statistics for the number of items per category etc.?
Finally, consider the expected growth of your dataset over time. DynamoDB will generally outperform an RDBMS at scale as long as your queries partition well.
Also consider the acceptable latency for each type of query you expect to perform, especially at scale. For instance, if you expect to have hundreds of categories with hundreds of thousands of items each, what does it mean to retrieve all items in a category? Surely you wouldn't be displaying them all to the user at once.
I encourage you to also consider another type of data store to accompany DynamoDB if you need statistics for your data, such as ElasticSearch or a Redis cluster.
In the end, if aggregate queries or joins are essential to your use case, or if the dataset at scale can generally be processed comfortably on a single RDBMS instance, don't try to fit a square peg in a round hole. A managed RDBMS solution like Aurora might be a better fit.

SQL vs NoSQL for data that will be presented to a user after multiple filters have been added

I am about to embark on a project for work that is very outside my normal scope of duties. As a SQL DBA, my initial inclination was to approach the project using a SQL database but the more I learn about NoSQL, the more I believe that it might be the better option. I was hoping that I could use this question to describe the project at a high level to get some feedback on the pros and cons of using each option.
The project is relatively straightforward. I have a set of objects that have various attributes. Some of these attributes are common to all objects whereas some are common only to a subset of the objects. What I am tasked with building is a service where the user chooses a series of filters that are based on the attributes of an object and then is returned a list of objects that matches all^ of the filters. When the user selects a filter, he or she may be filtering on a common or subset attribute but that is abstracted on the front end.
^ There is a chance, depending on user feedback, that the list of objects may match only some of the filters and the quality of the match will be displayed to the user through a score that indicates how many of the criteria were matched.
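The matching rule described above, including the optional partial-match score, can be sketched in a few lines of Python. The attribute names and sample objects are hypothetical, and this ignores storage entirely; it only shows the filter-and-score logic.

```python
def score(obj, filters):
    # Number of filter criteria this object satisfies.
    return sum(1 for attr, wanted in filters.items() if obj.get(attr) == wanted)

def search(objects, filters, require_all=True):
    """Return (score, object) pairs, best matches first.

    With require_all=True only objects matching every filter are returned;
    otherwise any object matching at least one filter is included.
    """
    results = []
    for obj in objects:
        matched = score(obj, filters)
        if matched == len(filters) or (not require_all and matched > 0):
            results.append((matched, obj))
    # sorted() is stable, so equal scores keep their input order.
    return sorted(results, key=lambda pair: pair[0], reverse=True)

# Hypothetical objects: "color" and "size" are common attributes,
# "waterproof" exists only on a subset.
objects = [
    {"name": "A", "color": "red", "size": "L", "waterproof": True},
    {"name": "B", "color": "red", "size": "M"},
    {"name": "C", "color": "blue", "size": "S"},
]
filters = {"color": "red", "size": "L"}
```

Using `obj.get(attr)` means a filter on a subset-only attribute simply fails to match objects that lack it, which is the behaviour the front end abstracts away.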
After watching this talk by Martin Fowler (http://www.youtube.com/watch?v=qI_g07C_Q5I), it would seem that a document-style NoSQL database should suit my needs but given that I have no experience with this approach, it is also possible that I am missing something obvious.
Some additional information - The database will initially have about 5,000 objects with each object containing 10 to 50 attributes but the number of objects will definitely grow over time and the number of attributes could grow depending on user feedback. In addition, I am hoping to have the ability to make rapid changes to the product as I get user feedback so flexibility is very important.
Any feedback would be very much appreciated and I would be happy to provide more information if I have left anything critical out of my discussion. Thanks.
This problem can be solved by using two separate pieces of technology. The first is to use a relatively well designed database schema with a modern RDBMS. By modeling the application using the usual principles of normalization, you'll get really good response out of storage for individual CRUD statements.
Searching this schema, as you've surmised, is going to be a nightmare at scale. Don't do it. Instead look into using Solr/Lucene as your full text search engine. Solr's support for dynamic fields means you can add new properties to your documents/objects on the fly and immediately have the ability to search inside your data if you have designed your Solr schema correctly.
I'm not an expert in NoSQL, so I will not be advocating it. However, I have a few points that can help you address your questions regarding the relational database structure.
The first thing I see right away is that you are talking about inheritance (at least conceptually). Your objects inherit from each other, thus you have additional attributes for derived objects. Say you are adding a new type of object: the first thing you need to do (conceptually) is to find a base/super (parent) object type for it that has a subset of the attributes, and then add the extra attributes on top of them (extending the base object type).
Once you get used to thinking as described above, the next thing is inheritance mapping patterns for relational databases. I'll steal terms from Martin Fowler to describe them here.
You can hold inheritance chain in the database by following one of the 3 ways:
1 - Single table inheritance: Whole inheritance chain is in one table. So, all new types of objects go into the same table.
Advantages: your search query has only one table to search, and it must be faster than a join for example.
Disadvantages: table grows faster than with option 2 for example; you have to add a type column that says what type of object is the row; some rows have empty columns because they belong to other types of objects.
2 - Concrete table inheritance: Separate table for each new type of object.
Advantages: if search affects only one type, you search only one table at a time; each table grows slower than in option 1 for example.
Disadvantages: you need to use union of queries if searching several types at the same time.
3 - Class table inheritance: One table for the base type object with its attributes only, additional tables with additional attributes for each child object type. So, child tables refer to the base table with PK/FK relations.
Advantages: all types are present in one table so easy to search all together using common attributes.
Disadvantages: base table grows fast because it contains part of child tables too; you need to use join to search all types of objects with all attributes.
Which one to choose?
It's a trade-off obviously. If you expect to have many types of objects added, I would go with Concrete table inheritance that gives reasonable query and scaling options. Class table inheritance seems to be not very friendly with fast queries and scalability. Single table inheritance seems to work with small number of types better.
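As a rough illustration of option 3 (class table inheritance), here is a sketch using Python's built-in sqlite3 module. The table and column names (object_base, widget, gadget, color, voltage, weight_kg) are invented for the example; the point is only that a search on common attributes touches the base table alone, while a search that mixes common and type-specific attributes needs a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Base table: attributes common to every object type.
CREATE TABLE object_base (id INTEGER PRIMARY KEY, name TEXT, color TEXT);
-- One child table per type, sharing the base table's key (PK and FK).
CREATE TABLE widget (id INTEGER PRIMARY KEY REFERENCES object_base(id), voltage INTEGER);
CREATE TABLE gadget (id INTEGER PRIMARY KEY REFERENCES object_base(id), weight_kg REAL);

INSERT INTO object_base VALUES (1, 'w1', 'red'), (2, 'g1', 'red'), (3, 'w2', 'blue');
INSERT INTO widget VALUES (1, 12), (3, 24);
INSERT INTO gadget VALUES (2, 1.5);
""")

# Filter on a common attribute: the base table is enough.
common = conn.execute(
    "SELECT name FROM object_base WHERE color = ? ORDER BY id", ("red",)
).fetchall()

# Filter on a common plus a type-specific attribute: a join is required.
specific = conn.execute(
    """SELECT b.name
       FROM object_base b
       JOIN widget w ON w.id = b.id
       WHERE b.color = ? AND w.voltage = ?""",
    ("red", 12),
).fetchall()
```

With single table inheritance the second query would need no join, at the cost of NULL columns for every type that lacks `voltage`.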
Your call, my friend!
May as well make this an answer. I should comment that I'm not strong in NoSQL, so I tend to lean towards SQL.
I'd do this as a three-table set. You will see it referred to as entity-attribute-value (or name-value pair) logic on the web... it's a way of handling multiple dynamic attributes for items. Let's say you have a bunch of products and each one has a few attributes.
Prd 1 - a,b,c
Prd 2 - a,d,e,f
Prd 3 - a,b,d,g
Prd 4 - a,c,d,e,f
So here are 4 products and 7 attributes (a through g)... the same theory will work for hundreds of products and thousands of attributes. The standard way of holding this in one table requires the product info along with 7 columns to store the data (in this setup at least one third of them are null). A new attribute means altering the table to add another column to it and coming up with a script to populate existing rows, or just leaving it null for all existing rows. Not the most fun, and it can be a headache.
The alternative to this is a name-value pair setup. You want a 'header' table to hold the common values amongst your products (like name, or price... things that all products always have). In our example above, you will notice that attribute 'a' is being used on each record... this does mean attribute a can be a part of the header table as well. We'll call the key column here 'header_id'.
The second table is a reference table that simply stores the attributes that can be assigned to each product and assigns an ID to each. We'll call the table attribute, with attr_id for a key. Rather straightforward: each attribute above will be one row.
Quick example:
attr_id, attribute_name, notes
1,b, the length of time the product takes to install
2,c, spare part required
etc...
It's just a list of all of your attributes and what that attribute means. In the future, you will be adding a row to this table to open up a new attribute for each header.
Final table is a mapping table that actually holds the info. You will have your product id, the attribute id, and then the value. Normally called the detail table:
prd1, b, 5 mins
prd1, c, needs spare jack
prd2, d, 'misc text'
prd3, b, 15 mins
See how the data is stored as product key, value label, value? Any future product added can have any combination of any attributes stored in this table. Adding new attributes is adding a new line to the attribute table and then populating the details table as needed.
I believe there is a wiki article for it too... http://en.wikipedia.org/wiki/Entity-attribute-value_model
After this, it's simply a matter of figuring out the best methodology to pivot out your data (I'd recommend Postgres as an open-source DB option here).
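A minimal sketch of the three tables and one way to pivot them, using Python's sqlite3 module rather than Postgres so it runs anywhere. The table names header, attribute and detail follow the description above; the product names and the row for attribute 'd' (whose notes are left NULL) are assumptions added to make the sample data complete. The pivot uses conditional aggregation, which is only one of several possible approaches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE header (header_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE attribute (attr_id INTEGER PRIMARY KEY, attribute_name TEXT, notes TEXT);
CREATE TABLE detail (
    header_id TEXT REFERENCES header(header_id),
    attr_id   INTEGER REFERENCES attribute(attr_id),
    value     TEXT,
    PRIMARY KEY (header_id, attr_id)
);

INSERT INTO header VALUES ('prd1', 'Product 1'), ('prd2', 'Product 2'), ('prd3', 'Product 3');
INSERT INTO attribute VALUES
    (1, 'b', 'the length of time the product takes to install'),
    (2, 'c', 'spare part required'),
    (3, 'd', NULL);
INSERT INTO detail VALUES
    ('prd1', 1, '5 mins'), ('prd1', 2, 'needs spare jack'),
    ('prd2', 3, 'misc text'), ('prd3', 1, '15 mins');
""")

# Pivot: one output column per attribute, one row per product.
pivot = conn.execute("""
SELECT h.header_id,
       MAX(CASE WHEN a.attribute_name = 'b' THEN dt.value END) AS attr_b,
       MAX(CASE WHEN a.attribute_name = 'c' THEN dt.value END) AS attr_c,
       MAX(CASE WHEN a.attribute_name = 'd' THEN dt.value END) AS attr_d
FROM header h
LEFT JOIN detail dt   ON dt.header_id = h.header_id
LEFT JOIN attribute a ON a.attr_id = dt.attr_id
GROUP BY h.header_id
ORDER BY h.header_id
""").fetchall()
```

Each new attribute adds a row to `attribute` but also another `CASE` column to the pivot query, which is the maintenance cost of this model.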

What's the best way in Postgres to store a bunch of arbitrary boolean values for a row?

I have a database full of recipes, one recipe per row. I need to store a bunch of arbitrary "flags" for each recipe to mark various properties such as Gluten-Free, No Meat, No Red Meat, No Pork, No Animals, Quick, Easy, Low Fat, Low Sugar, Low Calorie, Low Sodium and Low Carb. Users need to be able to search for recipes that contain one or more of those flags by checking checkboxes in the UI.
I'm searching for the best way to store these properties in the Recipes table. My ideas so far:
Have a separate column for each property and create an index on each of those columns. I may have upwards of about 20 of these properties, so I'm wondering if there are any drawbacks to creating a whole bunch of BOOL columns on a single table.
Use a bitmask for all properties and store the whole thing in one numeric column that contains the appropriate number of bits. Create a separate index on each bit so searches will be fast.
Create an ENUM with a value for each tag, then create a column that has an ARRAY of that ENUM type. I believe an ANY clause on an array column can use an INDEX, but have never done this.
Create a separate table that has a one-to-many mapping of recipes to tags. Each tag would be a row in this table. The table would contain a link to the recipe, and an ENUM value for which tag is "on" for that recipe. When querying, I'd have to do a nested SELECT to filter out recipes that didn't contain at least one of these tags. I think this is the more "normal" way of doing this, but it does make certain queries more complicated - If I want to query for 100 recipes and also display all their tags, I'd have to use an INNER JOIN and consolidate the rows, or use a nested SELECT and aggregate on the fly.
Write performance is not too big of an issue here since recipes are added by a backend process, and search speed is critical (there might be a few hundred thousand recipes eventually). I doubt I will add new tags all that often, but I want it to be at least possible to do without major headaches.
Thanks!
I would advise you to use a normalized setup. Setting this up from the get-go as a denormalized structure is not what I would advise.
Without knowing all the details of what you have going on, I think the best setup would be to have your recipe table, a new property table and a new recipe_property table. That allows a recipe to have zero or many properties and normalizes your data, making it fast and easy to maintain and query.
High level structure would be:
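CREATE TABLE recipe (recipe_id INTEGER PRIMARY KEY);
CREATE TABLE property (property_id INTEGER PRIMARY KEY);
CREATE TABLE recipe_property (recipe_property_id INTEGER PRIMARY KEY, recipe_id INTEGER REFERENCES recipe (recipe_id), property_id INTEGER REFERENCES property (property_id));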
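To show how the checkbox search could run against this structure, here is a sketch using Python's sqlite3 module (not Postgres, so types and indexing differ). The title and name columns, the sample rows, the composite primary key on recipe_property, and the find_recipes helper are all assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE recipe (recipe_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE property (property_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE recipe_property (
    recipe_id   INTEGER REFERENCES recipe(recipe_id),
    property_id INTEGER REFERENCES property(property_id),
    PRIMARY KEY (recipe_id, property_id)
);
-- Reverse index so lookups by property do not scan the whole table.
CREATE INDEX idx_property_recipe ON recipe_property (property_id, recipe_id);

INSERT INTO recipe VALUES (1, 'Lentil soup'), (2, 'Grilled chicken'), (3, 'Fruit salad');
INSERT INTO property VALUES (1, 'No Meat'), (2, 'Quick'), (3, 'Low Fat');
INSERT INTO recipe_property VALUES (1, 1), (1, 3), (2, 2), (2, 3), (3, 1), (3, 2), (3, 3);
""")

def find_recipes(conn, names, match_all=True):
    """Titles of recipes carrying all (or at least one) of the named flags."""
    if not names:
        return []
    placeholders = ", ".join("?" for _ in names)
    required = len(names) if match_all else 1
    sql = f"""
        SELECT r.title
        FROM recipe r
        JOIN recipe_property rp ON rp.recipe_id = r.recipe_id
        JOIN property p ON p.property_id = rp.property_id
        WHERE p.name IN ({placeholders})
        GROUP BY r.recipe_id, r.title
        HAVING COUNT(DISTINCT p.property_id) >= ?
        ORDER BY r.title
    """
    return [row[0] for row in conn.execute(sql, [*names, required])]
```

Adding a new flag is a single INSERT into `property`, with no schema change.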

How to model a mutually exclusive relationship in SQL Server

I have to add functionality to an existing application and I've run into a data situation that I'm not sure how to model. I am restricted to creating new tables and code. If I need to alter the existing structure I think my client may reject the proposal, although if it's the only way to get it right this is what I will have to do.
I have an Item table that can be linked to any number of tables, and these tables may increase over time. An Item can only be linked to one other table, but the record in the other table may have many items linked to it.
Examples of the tables/entities being linked to are Person, Vehicle, Building, Office. These are all separate tables.
Examples of Items are Pen, Stapler, Cushion, Tyre, A4 Paper, Plastic Bag, Poster, Decoration.
For instance a Poster may be allocated to a Person or Office or Building. In the future if they add a Conference Room table it may also be added to that.
My initial thoughts are:
Item
{
ID,
Name
}
LinkedItem
{
ItemID,
LinkedToTableName,
LinkedToID
}
The LinkedToTableName field will then allow me to identify the correct table to link to in my code.
I'm not overly happy with this solution, but I can't quite think of anything else. Please help! :)
Thanks!
It is not a good practice to store table names as column values. This is a bad hack.
There are two standard ways of doing what you are trying to do. The first is called single-table inheritance. This is easily understood by ORM tools but trades off some normalization. The idea is that all of these entities - Person, Vehicle, whatever - are stored in the same table, often with several unused columns per entry, along with a discriminator field that identifies what type the entity is.
The discriminator field is usually an integer type, that is mapped to some enumeration in your code. It may also be a foreign key to some lookup table in your database, identifying which numbers correspond to which types (not table names, just descriptions).
The other way to do this is multiple-table inheritance, which is better for your database but not as easy to map in code. You do this by having a base table which defines some common properties of all the objects - perhaps just an ID and a name - and all of your "specific" tables (Person etc.) use the base ID as a unique foreign key (usually also the primary key).
In the first case, the exclusivity is implicit, since all entities are in one table. In the second case, the relationship is between the Item and the base entity ID, which also guarantees uniqueness.
Note that with multiple-table inheritance, you have a different problem - you can't guarantee that a base ID is used by exactly one inheritance table. It could be used by several, or not used at all. That is why multiple-table inheritance schemes usually also have a discriminator column, to identify which table is "expected." Again, this discriminator doesn't hold a table name, it holds a lookup value which the consumer may (or may not) use to determine which other table to join to.
Multiple-table inheritance is a closer match to your current schema, so I would recommend going with that unless you need to use this with Linq to SQL or a similar ORM.
See here for a good detailed tutorial: Implementing Table Inheritance in SQL Server.
Find something common to Person, Vehicle, Building, Office. For the lack of a better term I have used Entity. Then implement super-type/sub-type relationship between the Entity and its sub-types. Note that the EntityID is a PK and a FK in all sub-type tables. Now, you can link the Item table to the Entity (owner).
In this model, one item can belong to only one Entity; one Entity can have (own) many items.
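A small sketch of this super-type/sub-type layout, using Python's sqlite3 module instead of SQL Server so it can be run as-is. Only Person and Office are shown as sub-types, and the column names (full_name, room, owner_entity_id) and sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
-- Super-type: one row per owner, whatever its concrete type.
CREATE TABLE entity (entity_id INTEGER PRIMARY KEY);
-- Sub-types: entity_id is both the PK and an FK to entity.
CREATE TABLE person (entity_id INTEGER PRIMARY KEY REFERENCES entity(entity_id), full_name TEXT);
CREATE TABLE office (entity_id INTEGER PRIMARY KEY REFERENCES entity(entity_id), room TEXT);
-- Each item has exactly one owner; an owner can have many items.
CREATE TABLE item (
    item_id INTEGER PRIMARY KEY,
    name TEXT,
    owner_entity_id INTEGER NOT NULL REFERENCES entity(entity_id)
);

INSERT INTO entity VALUES (1), (2);
INSERT INTO person VALUES (1, 'Ann');
INSERT INTO office VALUES (2, 'B-12');
INSERT INTO item VALUES (1, 'Poster', 2), (2, 'Stapler', 2), (3, 'Pen', 1);
""")

# Items allocated to a given office: an ordinary join, no dynamic SQL.
office_items = conn.execute(
    """SELECT i.name
       FROM item i
       JOIN office o ON o.entity_id = i.owner_entity_id
       WHERE o.room = ?
       ORDER BY i.item_id""",
    ("B-12",),
).fetchall()
```

Because `owner_entity_id` is a single NOT NULL foreign key, the "exactly one owner" rule is enforced by the schema rather than by application code; adding a Conference Room later means one new sub-type table and no change to `item`.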
Your link table is OK.
The trouble you will have is that you will need to generate dynamic SQL at runtime. Parameterized SQL does not typically allow the objects in the FROM list to be parameters.
If you want to avoid this, you may be able to denormalize a little - say by creating a table to hold the ID (assuming the IDs are unique across the other tables), the type_id representing which table is the source, and a generated description - e.g. the name value from the initial record.
You would trigger the creation of this denormalized list when the base info is modified, and you could use that for generalized queries - and then resort to your dynamic queries when needed at runtime.

Define Generic Data Model for Custom Product Types

I want to create a product catalog that allows for intricate details on each of the product types in the catalog. The product types have vastly different data associated with them; some with only generic data, some with a few extra fields of data, some with many fields that are specific to that product type. I need to easily add new product types to the system and respect their configuration, and I'd love tips on how to design the data model for these products as well as how to handle persistence and retrieval.
Some products will be very generic and I plan to use a common UI for editing those products. The products that have extensible configuration associated with them will get new views (and controllers) created for their editing. I expect all custom products to have their own model defined but to share a common base class. The base class would represent the generic product that has no custom fields.
Example products that need to be handled:
Generic product
  Description
Light Bulb
  Description
  Type (with an enum of fluorescent, incandescent, halogen, led)
  Wattage
  Style (enum of flood, spot, etc.)
Refrigerator
  Description
  Make
  Model
  Style (with an enum in the domain model)
  Water Filter information
    Part number
    Description
I expect to use MEF for discovering what product types are available in the system. I plan to create assemblies that contain product type models, views, and controllers, drop those assemblies into the bin, and have the application discover the new product types, and show them in the navigation.
Using SQL Server 2008, what would be the best way to store products of these various types, allowing for new types to be added without having to grow the database schema?
When retrieving data from the database, what's the best way to translate these polymorphic entities into their correct domain models?
Updates and Clarifications
To avoid the Inner Platform Effect, if there is a database table for every product type (to store the products of that type), then I still need a way to retrieve all products that spans product types. How would that be achieved?
I talked with Nikhilk in more detail about his SharePoint reference. Specifically, he was talking about this: http://msdn.microsoft.com/en-us/library/ms998711.aspx. It actually seems pretty attractive. No need to parse XML; and there is some indexing that could be done allowing for simple and fast queries over the data. For instance, I could say "find all 75-watt light bulbs" by knowing that the first int column in the row is the wattage when the row represents a light bulb. Something (NHibernate?) in the app tier would define the mapping from the product type to the userdata schema.
Voted down the schema that has the Property Table because this could lead to lots of rows per product. This could lead to index difficulties, plus all queries would have to essentially pivot the data.
Use a Sharepoint-style UserData table, that has a set of string columns, a set of int columns, etc. and a Type column.
Then you have a list of types table that specifies the schema for each type - its properties, and the specific columns they map to in the UserData table.
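A rough sketch of that layout with Python's sqlite3 module. The table names (user_data, type_schema), the generic column names (int1, int2, str1, str2), the sample rows and the find_products helper are all made up for the example; they are not SharePoint's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Fixed set of generic typed columns shared by every product type.
CREATE TABLE user_data (
    id INTEGER PRIMARY KEY, type TEXT, description TEXT,
    int1 INTEGER, int2 INTEGER, str1 TEXT, str2 TEXT
);
-- Per-type schema: which property lives in which generic column.
CREATE TABLE type_schema (
    type TEXT, property TEXT, column_name TEXT,
    PRIMARY KEY (type, property)
);

INSERT INTO type_schema VALUES
    ('light_bulb', 'wattage', 'int1'), ('light_bulb', 'bulb_type', 'str1'),
    ('refrigerator', 'make', 'str1'), ('refrigerator', 'model', 'str2');
INSERT INTO user_data (type, description, int1, str1) VALUES
    ('light_bulb', 'Flood lamp', 75, 'halogen'),
    ('light_bulb', 'Desk bulb', 40, 'led');
INSERT INTO user_data (type, description, str1, str2) VALUES
    ('refrigerator', 'Side-by-side', 'Acme', 'R100');
""")

GENERIC_COLUMNS = {"int1", "int2", "str1", "str2"}

def find_products(conn, product_type, prop, value):
    """Descriptions of products of one type whose property equals value."""
    row = conn.execute(
        "SELECT column_name FROM type_schema WHERE type = ? AND property = ?",
        (product_type, prop),
    ).fetchone()
    # Column names cannot be bound as parameters, so check against a whitelist.
    if row is None or row[0] not in GENERIC_COLUMNS:
        raise KeyError(f"{product_type} has no property {prop!r}")
    sql = f"SELECT description FROM user_data WHERE type = ? AND {row[0]} = ? ORDER BY id"
    return [r[0] for r in conn.execute(sql, (product_type, value))]
```

`find_products(conn, "light_bulb", "wattage", 75)` is the "find all 75-watt light bulbs" query: the mapping table tells the code that wattage is stored in `int1` for that type.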
With things like Azure and other utility computing storage you don't even need to define a table. Every store object is basically a dictionary.
I think you need to go with a data model like --
Product Table
  ProductId (PK)
  ProductName
  Details
Property Table
  PropertyId (PK)
  ProductId (FK)
  ParentPropertyId (FK - Self referenced to categorize properties)
  PropertyName
  PropertyValue
  PropertyValueTypeId
Property Value Lookup Table
  PropertyValueLookupId (PK)
  PropertyId (FK)
  LookupValue
And then have a dynamic view based on this. You could use the PropertyValueTypeId column to identify the type, using a convention like (0 - string, 1 - integer, 2 - float, 3 - image etc.) - but ultimately the values themselves are stored untyped. You could also use this column to select the control template to render the corresponding property to the user.
You can use the Value lookup table to keep lookups for a specific property (so that user can choose it from a list)
Summarizing, let's look at the options under consideration for storing product information:
1) some XML format in the database
2) similar to the post above about having x number of type-defined columns (SharePoint approach)
3) via a generic table with name and type definitions stored in a lookup table and values in a secondary table with columns id, propertyid, value (similar to #2; however, this approach would provide unlimited property information)
4) some hybrid of the above options where the product table would have x common columns (for storage of properties common to all products) with y user-defined columns (this could be m of integer type and n of varchar type). This may be taking the best of #2 and a normalized structure as if you knew all the properties of all products. You would be getting the best SQL performance for the properties that you use the most (probably those that are common across all products) while still allowing custom columns for specific properties of each product.
Are there other options? In my opinion I would consider 4 above as the best hybrid of the combinations.
dave
Put as much of the shared, anticipated structure as possible in a traditional normalized 3NF model, then augment with XML columns as appropriate.
I don't see MEF (or any other ORM) being able to do all this transparently.
I think you should avoid the Inner Platform Effect and actually build tables for your specialized entities. You'll be writing specific code to manage them so why not have proper backing tables too?
It will make your deployment slightly harder - drop in an assembly and run a script - but it will probably save you a lot of pain in the long run.
Jeff,
We currently use an XML field in the Products table to handle all product-specific data. So our Products table has a few common fields that all products share, an XML field which contains whatever a particular product needs additionally, and a few computed fields that reach into the XML and surface some of the frequently queried values as "virtual" fields on the Products table (e.g. "Style" would be set to whatever the current product defines, or NULL if the product doesn't have a Style property).
So far, we've been quite flexible with that approach - if you create some decent XSD schemas for your XML, you can even create C# proxy classes for these fields.
Works nicely for us - joining the best of both the relational and XML worlds.
Marc
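The "virtual field" idea from the last answer can be sketched in Python with the standard xml.etree.ElementTree module. This is not SQL Server's computed-column mechanism; it only illustrates surfacing one frequently queried value from a per-product XML blob and returning None (the NULL case) when the product does not define it. The element names and sample rows are hypothetical.

```python
import xml.etree.ElementTree as ET

# Common fields are plain keys; everything product-specific lives in extra_xml.
products = [
    {"id": 1, "description": "Refrigerator",
     "extra_xml": "<product><Make>Acme</Make><Style>SideBySide</Style></product>"},
    {"id": 2, "description": "Generic product", "extra_xml": "<product/>"},
]

def surfaced_field(product, field):
    """Value of a top-level element in the product's XML, or None if absent."""
    node = ET.fromstring(product["extra_xml"]).find(field)
    return node.text if node is not None else None

def with_virtual_style(product):
    # Mimics a computed "Style" field sitting next to the common fields.
    return {"id": product["id"],
            "description": product["description"],
            "Style": surfaced_field(product, "Style")}

rows = [with_virtual_style(p) for p in products]
```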