I'm new to data warehousing, but I think my question can be answered relatively easily.
I built a star schema, with a dimension table 'product'. This table has a column 'PropertyName' and a column 'PropertyValue'.
The dimension therefore looks a little like this:
surrogate_key | natural_key (productID) | PropertyName | PropertyValue | ...
1             | 5                       | Size         | 20            | ...
2             | 5                       | Color        | red           |
3             | 6                       | Size         | 20            |
4             | 6                       | Material     | wood          |
and so on.
In my fact table I always use the surrogate keys of the dimensions. Because of the PropertyName and PropertyValue columns, my natural key is no longer unique, so I get far too many rows in my fact table.
My question now is: what should I do with the property columns? Would it be best to put each property into a separate dimension, like a size dimension, a color dimension and so on? I have about 30 different properties.
Or shall I create columns for each property in the fact table?
Or make one dimension with all properties?
Thanks in advance for any help.
Your dimension table 'product' should look like this:
surrogate_key | natural_key (productID) | Color | Material | Size | ...
1             | 5                       | red   | wood     | 20   | ...
2             | 6                       | red   | ...
If you have too many properties, try to group some of them into another dimension. For example, Color and Material can be attributes of a separate dimension if the same product, with the same ID and the same price, can come in another color or material. Your fact table can then identify a product with two keys: product_id and colormaterial_id.
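As a rough illustration of the one-row-per-product layout, here is a minimal sketch using Python's built-in sqlite3 module. The table names product_property and dim_product and their column names are assumptions made up for this example; the sample rows come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Staging table in the original one-row-per-property shape (hypothetical name).
    CREATE TABLE product_property (
        product_id     INTEGER NOT NULL,
        property_name  TEXT    NOT NULL,
        property_value TEXT    NOT NULL,
        PRIMARY KEY (product_id, property_name)
    );
    INSERT INTO product_property VALUES
        (5, 'Size', '20'), (5, 'Color', 'red'),
        (6, 'Size', '20'), (6, 'Material', 'wood');

    -- Wide dimension with one row, and so one surrogate key, per product.
    CREATE TABLE dim_product (
        surrogate_key INTEGER PRIMARY KEY AUTOINCREMENT,
        product_id    INTEGER NOT NULL UNIQUE,
        color         TEXT,
        material      TEXT,
        size          TEXT
    );

    -- Conditional aggregation turns property rows into columns.
    INSERT INTO dim_product (product_id, color, material, size)
    SELECT product_id,
           MAX(CASE WHEN property_name = 'Color'    THEN property_value END),
           MAX(CASE WHEN property_name = 'Material' THEN property_value END),
           MAX(CASE WHEN property_name = 'Size'     THEN property_value END)
    FROM product_property
    GROUP BY product_id
    ORDER BY product_id;
""")

dim_rows = conn.execute(
    "SELECT product_id, color, material, size FROM dim_product ORDER BY product_id"
).fetchall()
for row in dim_rows:
    print(row)
```

With this shape, each productID maps to exactly one surrogate key, so the fact table no longer multiplies rows per property. Properties a product does not have simply stay NULL.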
Reading recommendation:
The Data Warehouse Toolkit, Ralph Kimball
Your design is called an EAV (entity-attribute-value) table.
It's a good fit for sparse matrices (a large number of properties, with only a few of them filled in for any given row).
However, it has several drawbacks:
It cannot be indexed (and hence efficiently searched) on two or more properties at once. A query like "get all products made of wood and having a size of 20" will be less efficient.
Implementing constraints that involve several attributes at once is more complex.
etc.
If these are not a problem for you, you can use the EAV design.
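To make the two-property search concrete, here is a small sketch with Python's sqlite3 of what the "wood and size 20" query can look like against an EAV table. The table name product_property and its columns are hypothetical; the rows are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_property (
        product_id     INTEGER NOT NULL,
        property_name  TEXT    NOT NULL,
        property_value TEXT    NOT NULL,
        PRIMARY KEY (product_id, property_name)
    );
    INSERT INTO product_property VALUES
        (5, 'Size', '20'), (5, 'Color', 'red'),
        (6, 'Size', '20'), (6, 'Material', 'wood');
""")

# Each condition lives on a different row, so a plain "A AND B" filter on a
# single row can never match. Match either row, then keep the products that
# matched both conditions.
wood_size_20 = [
    row[0]
    for row in conn.execute("""
        SELECT product_id
        FROM product_property
        WHERE (property_name = 'Material' AND property_value = 'wood')
           OR (property_name = 'Size'     AND property_value = '20')
        GROUP BY product_id
        HAVING COUNT(*) = 2
        ORDER BY product_id
    """)
]
print(wood_size_20)
```

Only product 6 satisfies both conditions. With one column per property, the same search would be a single WHERE clause that a composite index could serve.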
I would like to create a highly scalable system for storing "candidates". The problem is that each candidate has different "features", and sometimes these have different data types. One idea I'd like to try would involve something like this:
candidates:
| id | cType    |
| 1  | 'fabric' |
| 2  | 'belt'   |
candidateFeatures:
| candidateId | featureTable | featureId |
| 1           | 'city'       | 1         |
| 1           | 'colour'     | 1         |
| 1           | 'colour'     | 2         |
| 2           | 'city'       | 2         |
| 2           | 'size'       | 1         |
city:
| id | lat | lng | name     |
| 1  | x   | x   | 'London' |
| 2  | x   | x   | 'Paris'  |
colour:
| id | name    |
| 1  | 'Red'   |
| 2  | 'Green' |
size:
| id | value |
| 1  | 10    |
| 2  | 12    |
Here you can see that there is one fabric candidate in London with Red and Green features and a belt candidate in Paris with size 10.
We do this because we get feedback in a universal way, and I'm trying to write a scalable machine learning solution that will allow new types of candidates to be added seamlessly, as well as new candidate feature types, as they are discovered and added to the db. A candidate is assumed to be able to have more than one of each feature type.
Ultimately I need to be able to extract the data (probably through a materialised view) so that if I want all 'fabric' candidates I would end up with something like:
| id | colourIds | cityIds |
| 1  | [1, 2]    | [1]     |
| 4  | [3]       | [4, 5]  |
but then if one day I find a fabric that doesn't have a colour but instead has a pattern I can easily get a new table for patterns and just add the features to my "candidateFeatures" table:
| id | colourIds | cityIds | patternIds |
| 1  | [1, 2]    | [1]     | null       |
| 4  | [3]       | [4, 5]  | null       |
| 14 | null      | [6]     | [1]        |
This format is suitable for the front end, and the format of "candidateFeatures" is very useful for the back end. We can use it to scale easily without modifying existing tables, and for scalable data analysis, specifically when looking for correlations between user responses to candidates and the presence of categorical features or the values of continuous features.
To me this seems like a really clever idea that hasn't got proper support in SQL… which makes me think it's probably a really dumb idea in disguise. I think it's possible to do this using EXEC, but that does have some risks. Does anyone know of a smarter way to achieve the same result, or how to actually achieve this?
Since execution time isn't such a big concern, I can always run it through a third-party program, e.g. in Python, and put the results into new tables. But ideally I'd use a bunch of materialised views and have them update periodically, because that feels like it would scale better with more data.
This is too long for a comment.
It is neither a good idea nor an awful idea. It is simply not how SQL works. The problem is that queries have a well-defined set of tables and column references. This is quite important for optimizing the query -- a step that generally happens before the query is run.
Queries are not merely strings that permit dynamic substitution when they are processing data.
There are ways to address the data modeling:
Have separate tables for the features, and association tables to match them back to the original data.
Use an entity-attribute-value model, which basically stores key-value pairs.
Use a flexible storage mechanism, such as JSON or arrays.
In addition, Postgres supports something called inheritance, which might be useful for representing this type of data.
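The first option (feature tables plus association tables) can be sketched as follows with Python's sqlite3. The association table names candidate_colour and candidate_city, the c_type column, and the helper functions are assumptions for this example; lat/lng and the size table are left out for brevity. SQLite is only a stand-in here, and group_concat plays the role that an array aggregate would play in another database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE candidates (id INTEGER PRIMARY KEY, c_type TEXT NOT NULL);
    CREATE TABLE colour (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE city   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- One association table per feature type (hypothetical names).
    CREATE TABLE candidate_colour (
        candidate_id INTEGER NOT NULL REFERENCES candidates(id),
        colour_id    INTEGER NOT NULL REFERENCES colour(id),
        PRIMARY KEY (candidate_id, colour_id)
    );
    CREATE TABLE candidate_city (
        candidate_id INTEGER NOT NULL REFERENCES candidates(id),
        city_id      INTEGER NOT NULL REFERENCES city(id),
        PRIMARY KEY (candidate_id, city_id)
    );

    INSERT INTO candidates VALUES (1, 'fabric'), (2, 'belt');
    INSERT INTO colour VALUES (1, 'Red'), (2, 'Green');
    INSERT INTO city   VALUES (1, 'London'), (2, 'Paris');
    INSERT INTO candidate_colour VALUES (1, 1), (1, 2);
    INSERT INTO candidate_city   VALUES (1, 1), (2, 2);
""")


def id_list(joined):
    """Turn a group_concat result such as '1,2' into a sorted list of ints."""
    if joined is None:  # the candidate has no feature of this type
        return None
    return sorted(int(part) for part in joined.split(","))


def features_for(c_type):
    """Return (id, colourIds, cityIds) for every candidate of the given type."""
    rows = conn.execute("""
        SELECT c.id,
               (SELECT group_concat(colour_id) FROM candidate_colour
                 WHERE candidate_id = c.id),
               (SELECT group_concat(city_id) FROM candidate_city
                 WHERE candidate_id = c.id)
        FROM candidates AS c
        WHERE c.c_type = ?
        ORDER BY c.id
    """, (c_type,)).fetchall()
    return [(cid, id_list(colours), id_list(cities)) for cid, colours, cities in rows]


print(features_for("fabric"))
print(features_for("belt"))
```

Adding a new feature type such as patterns then means adding one feature table and one association table. The query text still has to name them explicitly, which is the point made above about queries having a fixed set of table and column references.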
I am sure this question has been asked before, but I'm so new to SQL, I can't even combine the correct search terms to find an answer! So, apologies if this is a repetition.
The db I'm creating has to be created at run-time, then the data is entered after creation. Some fields will have a varying number of entries, but the number is unknown at creation time.
I'm struggling to come up with a db design to handle this variation.
As an (anonymised) example, please see below:
| salad_name | salad_type | salad_ingredients | salad_cost |
| apple | fruity | apple | cheap |
| unlikely | meaty | sausages, chorizo | expensive |
| normal | standard | leaves, cucumber, tomatoes | mid |
As you can see, the contents of "salad_ingredients" varies.
My thoughts were:
Just enter a single, comma-separated string and split it at run-time. This seems hacky, and I couldn't search by salad_ingredients!
Have another table for each salad, such as "apple_ingredients", which could have a varying number of rows, one per ingredient. However, I can't do this, because I don't know the salad_name at creation time! :(
Have a separate salad_ingredients table, where each row is a salad_name and there is an arbitrary number of ingredient fields, say 10, so you could have up to 10 ingredients. Again, this seems slightly hacky, as I don't like unused fields, and what happens if a super-complicated salad comes along?
Is there a solution that I've missed?
Thanks,
Dan
Based on my experience, the best solution is a normalized set of tables:
table salads
  id
  salad_name
  salad_type
  salad_cost
table ingredients
  id
  name
and
table salad_ingredients
  id
  id_salad
  id_ingredients
where id_salad is the corresponding id from salads
and id_ingredients is the corresponding id from ingredients.
Using proper joins, you can select and filter (WHERE) all the values you need.
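The three tables above can be tried out with a short sqlite3 sketch in Python. The table and column names follow the answer; the helper functions add_salad and salads_containing, and the uniqueness constraints, are assumptions added for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE salads (
        id         INTEGER PRIMARY KEY,
        salad_name TEXT NOT NULL UNIQUE,
        salad_type TEXT,
        salad_cost TEXT
    );
    CREATE TABLE ingredients (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE salad_ingredients (
        id             INTEGER PRIMARY KEY,
        id_salad       INTEGER NOT NULL REFERENCES salads(id),
        id_ingredients INTEGER NOT NULL REFERENCES ingredients(id),
        UNIQUE (id_salad, id_ingredients)
    );
""")


def add_salad(name, salad_type, cost, ingredient_names):
    """Insert a salad and link it to any number of ingredients."""
    salad_id = conn.execute(
        "INSERT INTO salads (salad_name, salad_type, salad_cost) VALUES (?, ?, ?)",
        (name, salad_type, cost),
    ).lastrowid
    for ingredient in ingredient_names:
        # Reuse the ingredient row if it already exists.
        conn.execute("INSERT OR IGNORE INTO ingredients (name) VALUES (?)", (ingredient,))
        (ingredient_id,) = conn.execute(
            "SELECT id FROM ingredients WHERE name = ?", (ingredient,)
        ).fetchone()
        conn.execute(
            "INSERT INTO salad_ingredients (id_salad, id_ingredients) VALUES (?, ?)",
            (salad_id, ingredient_id),
        )


def salads_containing(ingredient):
    """Return the names of all salads that use the given ingredient."""
    rows = conn.execute("""
        SELECT s.salad_name
        FROM salads AS s
        JOIN salad_ingredients AS si ON si.id_salad = s.id
        JOIN ingredients AS i ON i.id = si.id_ingredients
        WHERE i.name = ?
        ORDER BY s.salad_name
    """, (ingredient,))
    return [row[0] for row in rows]


add_salad("apple", "fruity", "cheap", ["apple"])
add_salad("unlikely", "meaty", "expensive", ["sausages", "chorizo"])
add_salad("normal", "standard", "mid", ["leaves", "cucumber", "tomatoes"])

print(salads_containing("chorizo"))
```

A salad can have any number of ingredients because each one is just another row in salad_ingredients, and searching by ingredient becomes an ordinary join.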
I am trying to figure out how to do something that I would think is commonplace, but I cannot find out how to do it.
Given two Custom Lists, one with a field that is essentially a primary key, and the other with what is essentially a foreign key, I want to show all the rows from the first in one area of the display, and the related records for the selected row of the first, in a second part of the screen.
I am thinking this would be side-by-side web parts on a web-part page.
So:
 ID  pkID  Data             ID  fkID  Data
 ___________________       ______________________________
| 1  100   Row one. |     |  8  100   Related one/one    |
 ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯      |  9  100   Related one/two    |
  2  113   Row two.       | 10  100   Related one/three  |
  3  118   Row n.          ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
                            11  113   Related two/one
                            12  113   Related two/two
                            13  118   Related n/one
(That is my attempt to show what is established between the two lists: the top row is selected on the left, and the related records from the other list are on the right.)
Surely this is common enough that there is a way to readily do this?
I suppose I might need to create a means of asserting that a row is 'selected.'
You will note that I am not using the ID field that "belongs" to SharePoint.
You can create lookup fields to establish that relationship. SharePoint 2010 even allows you to enforce the relationship, as in a SQL database, so for instance you can declare what happens if you try to delete a parent that has children (Cascade, Prevent, etc.).
Have a read here:
http://office.microsoft.com/en-au/sharepoint-server-help/create-list-relationships-by-using-unique-and-lookup-columns-HA101729901.aspx
As for displaying them visually, you might have to create some web parts for it, as the only out-of-the-box (OOB) support is the link to the child entity from the main entity on the parent list.
I want to structure a table so that column-level filters are expressed as row-level filters, just to avoid adding new columns.
Let's say I have the following table to store cars' details:
-------------------------------------
Type Color Year
-------------------------------------
Mini Silver 2010
Standard Silver 2011
Fullsize White 2011
Luxury Black 2010
Sports Red 2011
Convertible Red 2009
If I want to store the Make of these cars as well, I have to add an additional column, and yet another column if I have automobiles other than cars.
So the question is: how can I structure this table to avoid adding new columns? The structure should only require adding rows to define properties of my records.
[Hint] The structure may have multiple tables, one to store rows/records and another to store columns/properties, and then some kind of mapping between them, OR an entirely new structure.
EDIT
Some of the properties of my data are fixed and some are dynamic. In the sample car model, the fixed properties would be things like Availability and Condition, and the dynamic ones could be anything a person might ask about an automobile. I don't need all columns to be mapped as rows, only a few, and these are dynamic; I don't even know all of them. My apologies for not mentioning this earlier.
You could use the entity-attribute-value design (EAV).
entity attribute value
1 Type Mini
1 Color Silver
1 Year 2010
1 Make Foobar
2 Type Standard
2 Color Silver
etc...
You may also wish to store the attribute names in a separate table.
However, you should consider carefully whether you really need this, as there are a few disadvantages. The value column must have a type that can store all the different types of values (e.g. string). Queries are much more cumbersome to write, as you will need many joins, and they will run more slowly than with a traditional database design.
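To illustrate both points (attribute names in their own table, and the cost in joins and casts), here is a minimal sqlite3 sketch in Python. The table names attribute and entity_value are hypothetical; the rows are taken from the car table in the question, with the Make value from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Attribute names kept in their own table (hypothetical layout).
    CREATE TABLE attribute (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    -- Every value is stored as text, whatever its real type.
    CREATE TABLE entity_value (
        entity       INTEGER NOT NULL,
        attribute_id INTEGER NOT NULL REFERENCES attribute(id),
        value        TEXT    NOT NULL,
        PRIMARY KEY (entity, attribute_id)
    );
    INSERT INTO attribute VALUES (1, 'Type'), (2, 'Color'), (3, 'Year'), (4, 'Make');
    INSERT INTO entity_value VALUES
        (1, 1, 'Mini'),     (1, 2, 'Silver'), (1, 3, '2010'), (1, 4, 'Foobar'),
        (2, 1, 'Standard'), (2, 2, 'Silver'), (2, 3, '2011'),
        (3, 1, 'Fullsize'), (3, 2, 'White'),  (3, 3, '2011');
""")

# "Silver cars from 2011 or later" needs one pair of joins per attribute,
# plus a cast because Year is stored as text.
silver_from_2011 = [
    row[0]
    for row in conn.execute("""
        SELECT c.entity
        FROM entity_value AS c
        JOIN attribute    AS ca ON ca.id = c.attribute_id AND ca.name = 'Color'
        JOIN entity_value AS y  ON y.entity = c.entity
        JOIN attribute    AS ya ON ya.id = y.attribute_id AND ya.name = 'Year'
        WHERE c.value = 'Silver'
          AND CAST(y.value AS INTEGER) >= 2011
        ORDER BY c.entity
    """)
]
print(silver_from_2011)
```

Adding a new property such as Make is only an INSERT into attribute, which is the flexibility asked for. The price is that a two-condition filter already takes four joins.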
To give you a head start: Think about redesigning to allow multi-colored vehicles like motorbikes:
vehicle
Id Year vehicle_type vehicle_make
-------------------------------------------
1 2010 1 1
2 2011 2 2
color
Id Name
-----------
1 Black
2 White
3 Red
4 Blue
vehicle_color
vehicle_id color_id
-----------------------
1 3
2 1
2 2
vehicle_type
Id Name
-----------
1 Car
2 Motorbike
vehicle_make
Id Name
-----------
1 Porsche
2 BMW
Bonus
Since I'm quite familiar with the car domain, I'll throw in an extension for your vehicle colors: there are tons of color names ("Magentafuzzyorangesunset") invented by manufacturers, and you'll want to map them to "real" base color names ("Red", "Blue", "Green", etc.) to enable searching for both.
Your color table could then look like this:
Id Name base_color
-----------------------------
1 Midnight 1
2 Snow 2
and you'll add a base_color table
Id Name
-----------
1 Black
2 White
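Searching by either name can be sketched like this with Python's sqlite3. The color and base_color rows are the ones shown above; the vehicle_color assignments and the helper vehicles_with_color are made up for this example, and the vehicle table is reduced to id and year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE base_color (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE color (
        id         INTEGER PRIMARY KEY,
        name       TEXT    NOT NULL,
        base_color INTEGER NOT NULL REFERENCES base_color(id)
    );
    CREATE TABLE vehicle (id INTEGER PRIMARY KEY, year INTEGER NOT NULL);
    CREATE TABLE vehicle_color (
        vehicle_id INTEGER NOT NULL REFERENCES vehicle(id),
        color_id   INTEGER NOT NULL REFERENCES color(id),
        PRIMARY KEY (vehicle_id, color_id)
    );

    INSERT INTO base_color VALUES (1, 'Black'), (2, 'White');
    INSERT INTO color VALUES (1, 'Midnight', 1), (2, 'Snow', 2);
    INSERT INTO vehicle VALUES (1, 2010), (2, 2011);
    -- Hypothetical assignments: vehicle 1 is Snow, vehicle 2 is Midnight and Snow.
    INSERT INTO vehicle_color VALUES (1, 2), (2, 1), (2, 2);
""")


def vehicles_with_color(term):
    """Find vehicles by manufacturer color name or by base color name."""
    rows = conn.execute("""
        SELECT DISTINCT vc.vehicle_id
        FROM vehicle_color AS vc
        JOIN color      AS c ON c.id = vc.color_id
        JOIN base_color AS b ON b.id = c.base_color
        WHERE c.name = ? OR b.name = ?
        ORDER BY vc.vehicle_id
    """, (term, term))
    return [row[0] for row in rows]


print(vehicles_with_color("Black"))     # matches via the base color
print(vehicles_with_color("Midnight"))  # matches via the manufacturer name
```

Both calls find vehicle 2, once through the base color and once through the manufacturer's name for it.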
I have an ecommerce store that I am building. I am using Rails/ActiveRecord, but that really isn't necessary to answer this question (however, if you are familiar with those things, please feel free to answer in terms of Rails/AR).
One of the store's requirements is that it needs to represent two types of products:
Simple products - these are products that just have one option, such as a band's CD. It has a basic price, and quantity.
Products with variation - these are products that have multiple options, such as a t-shirt that has 3 sizes and 3 colors. Each combination of size and color would have its own price and quantity.
I have done this kind of thing in the past, and done the following:
Have a products table, which has the main information for the product (title, etc).
Have a variants table, which holds the price and quantity information for each type of variant. A Product has_many Variants.
For simple products, they would just have one associated Variant.
Are there better ways I could be doing this?
I worked on an e-commerce product a few years ago, and we did it the way you described. But we added one more layer to handle multiple attributes on the same product (size and color, like you said). We tracked each attribute separately, and we had a "SKUs" table that listed each attribute combination that was allowed for each product. Something like this:
attr_id  attr_name
1        Size
2        Color

sku_id  prod_id  attr_id  attr_val
1       1        1        Small
1       1        2        Blue
2       1        1        Small
2       1        2        Red
3       1        1        Large
3       1        2        Red
Later, we added inventory tracking and other features, and we tied them to the sku IDs so that we could track each one separately.
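Here is a rough sketch of that layout using Python's sqlite3, mainly to show how a chosen size and color resolve to one SKU. The table names attribute, sku_attribute and inventory, the inventory quantities, and the helper functions are all assumptions for this example; the attribute and SKU rows are the ones listed above.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE attribute (attr_id INTEGER PRIMARY KEY, attr_name TEXT NOT NULL);
    CREATE TABLE sku_attribute (
        sku_id   INTEGER NOT NULL,
        prod_id  INTEGER NOT NULL,
        attr_id  INTEGER NOT NULL REFERENCES attribute(attr_id),
        attr_val TEXT    NOT NULL,
        PRIMARY KEY (sku_id, attr_id)
    );
    -- Hypothetical inventory table keyed by SKU, with made-up quantities.
    CREATE TABLE inventory (sku_id INTEGER PRIMARY KEY, quantity INTEGER NOT NULL);

    INSERT INTO attribute VALUES (1, 'Size'), (2, 'Color');
    INSERT INTO sku_attribute VALUES
        (1, 1, 1, 'Small'), (1, 1, 2, 'Blue'),
        (2, 1, 1, 'Small'), (2, 1, 2, 'Red'),
        (3, 1, 1, 'Large'), (3, 1, 2, 'Red');
    INSERT INTO inventory VALUES (1, 10), (2, 0), (3, 4);
""")


def skus_for_product(prod_id):
    """Return {sku_id: {attr_name: attr_val}} for one product."""
    skus = defaultdict(dict)
    rows = conn.execute("""
        SELECT sa.sku_id, a.attr_name, sa.attr_val
        FROM sku_attribute AS sa
        JOIN attribute AS a ON a.attr_id = sa.attr_id
        WHERE sa.prod_id = ?
        ORDER BY sa.sku_id, sa.attr_id
    """, (prod_id,))
    for sku_id, attr_name, attr_val in rows:
        skus[sku_id][attr_name] = attr_val
    return dict(skus)


def find_sku(prod_id, **wanted):
    """Return the SKU whose attributes exactly match the requested combination."""
    for sku_id, attrs in skus_for_product(prod_id).items():
        if attrs == wanted:
            return sku_id
    return None  # this combination is not allowed for the product


sku = find_sku(1, Size="Small", Color="Red")
quantity = conn.execute(
    "SELECT quantity FROM inventory WHERE sku_id = ?", (sku,)
).fetchone()[0]
print(sku, quantity)
```

A combination that is not listed, such as a large blue shirt in this data, has no SKU and so cannot be ordered, which is how the table restricts the allowed combinations per product.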
Your way seems pretty flexible. It would be similar to my first cut.