I want to structure a table so that what would normally be column-level filters become row-level filters, purely to avoid adding new columns.
Let's say I have the following table to store car details:
-------------------------------------
Type          Color     Year
-------------------------------------
Mini          Silver    2010
Standard      Silver    2011
Fullsize      White     2011
Luxury        Black     2010
Sports        Red       2011
Convertible   Red       2009
If I want to store the Make of these cars as well, I have to add an additional column, and yet another column if I have automobiles other than cars.
So the question is: how can I structure this table to avoid adding new columns? The structure should only require adding rows to define the properties of my records.
[Hint] The structure may have multiple tables, one to store rows/records and another to store columns/properties, with some kind of mapping between them, OR it may be an entirely new structure.
EDIT
Some of the properties of my data are fixed and some are dynamic. In terms of the sample Car model, the fixed properties would be things like Availability and Condition, and the dynamic ones could be anything a person may ask about an automobile. So I don't need all columns to be mapped as rows, only a few, and these are dynamic; I don't even know all of them. My apologies that I didn't mention this earlier.
You could use the entity-attribute-value design (EAV).
entity  attribute  value
1       Type       Mini
1       Color      Silver
1       Year       2010
1       Make       Foobar
2       Type       Standard
2       Color      Silver
etc...
You may also wish to store the attribute names in a separate table.
However, you should consider carefully whether you really need this, as there are a few disadvantages. The value column must have a type that can store all the different types of values (e.g. string). Queries are much more cumbersome to write because they need many joins, and they will usually run more slowly than with a traditional database design.
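To make the "many joins" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name car_eav is made up for illustration, and the rows are the ones from the example above. Finding the type of every silver car from 2010 already needs one self-join per attribute:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# car_eav is a hypothetical name; every value is stored as text.
conn.execute("CREATE TABLE car_eav (entity INTEGER, attribute TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO car_eav VALUES (?, ?, ?)",
    [
        (1, "Type", "Mini"), (1, "Color", "Silver"),
        (1, "Year", "2010"), (1, "Make", "Foobar"),
        (2, "Type", "Standard"), (2, "Color", "Silver"), (2, "Year", "2011"),
    ],
)

# One extra join for each attribute that takes part in the filter.
silver_2010 = conn.execute(
    """
    SELECT t.entity, t.value
    FROM car_eav AS t
    JOIN car_eav AS c ON c.entity = t.entity AND c.attribute = 'Color'
    JOIN car_eav AS y ON y.entity = t.entity AND y.attribute = 'Year'
    WHERE t.attribute = 'Type'
      AND c.value = 'Silver'
      AND y.value = '2010'
    """
).fetchall()
print(silver_2010)  # [(1, 'Mini')]
```

Note that the year has to be compared as the string '2010', which is the typing drawback mentioned above.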
To give you a head start: Think about redesigning to allow multi-colored vehicles like motorbikes:
vehicle
Id Year vehicle_type vehicle_make
-------------------------------------------
1 2010 1 1
2 2011 2 2
color
Id Name
-----------
1 Black
2 White
3 Red
4 Blue
vehicle_color
vehicle_id color_id
-----------------------
1 3
2 1
2 2
vehicle_type
Id Name
-----------
1 Car
2 Motorbike
vehicle_make
Id Name
-----------
1 Porsche
2 BMW
Bonus
Since I'm quite familiar with the car domain, I'll throw in an extension for your vehicle colors: there are tons of color names ("Magentafuzzyorangesunset") invented by manufacturers, and you'll want to map them to "real" base color names ("Red", "Blue", "Green", etc.) to enable searching for both.
Your color table could then look like this
Id Name base_color
-----------------------------
1 Midnight 1
2 Snow 2
and you'll add a base_color table
Id Name
-----------
1 Black
2 White
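As a rough sketch of how the search on both names could work, here is the color / base_color pair wired up with Python's sqlite3 module. The vehicle_color rows and the helper vehicles_with_color are made up for illustration; only the table and column names come from the tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE base_color (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE color (
    id INTEGER PRIMARY KEY,
    name TEXT,
    base_color INTEGER REFERENCES base_color(id)
);
CREATE TABLE vehicle_color (vehicle_id INTEGER, color_id INTEGER REFERENCES color(id));

INSERT INTO base_color VALUES (1, 'Black'), (2, 'White');
INSERT INTO color VALUES (1, 'Midnight', 1), (2, 'Snow', 2);
-- Illustrative data: vehicle 1 is Midnight, vehicle 2 is Midnight and Snow.
INSERT INTO vehicle_color VALUES (1, 1), (2, 1), (2, 2);
""")

def vehicles_with_color(name):
    """Return vehicle ids whose manufacturer color OR base color matches name."""
    rows = conn.execute(
        """
        SELECT DISTINCT vc.vehicle_id
        FROM vehicle_color AS vc
        JOIN color AS c ON c.id = vc.color_id
        JOIN base_color AS b ON b.id = c.base_color
        WHERE c.name = ? OR b.name = ?
        ORDER BY vc.vehicle_id
        """,
        (name, name),
    ).fetchall()
    return [row[0] for row in rows]

print(vehicles_with_color("Black"))  # [1, 2]
print(vehicles_with_color("Snow"))   # [2]
```

Because vehicle_color is a many-to-many table, the multi-colored vehicle 2 shows up for both searches without any schema change.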
How would one go about telling a CAML query to sort the results in a thoroughly custom order?
For instance, for a given field:
-- when equal to 'Chestnut' at the top,
-- then equal to 'Zebra' next,
-- then equaling 'House'?
Finally, within those groupings, sort on a second condition (such as 'Name'), normally ascending.
So this
ID Owns Name
————————————————————
1 Zebra Sue
2 House Jim
3 Chestnut Sid
4 House Ken
5 Zebra Bob
6 Chestnut Lou
becomes
ID Owns Name
————————————————————
6 Chestnut Lou
3 Chestnut Sid
5 Zebra Bob
1 Zebra Sue
2 House Jim
4 House Ken
In SQL, this can be done with Case/When. But in CAML? Not so much!
To my knowledge, CAML does not have such a sort operator. A workaround is to add a calculated column to the list, with a number data type and the formula
=IF(Owns="Chestnut",0,IF(Owns="Zebra",1,IF(Owns="House",3,999))).
Now you can order on the calculated column, which translates the custom sort order into numbers, and add Name as the second sort field. Another solution is to create a second list that holds the items to own, with a second column that contains their sort order. You can link the two lists and order by the sort order from the item list. The benefit is that changing the sort order is as easy as editing the respective list items.
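The ranking idea behind the calculated column can be sketched outside SharePoint. This is plain Python, not CAML and not a SharePoint API; SORT_RANK simply mirrors the numbers in the formula above, and custom_sorted is a made-up helper name.

```python
# Rows from the question: (ID, Owns, Name).
rows = [
    (1, "Zebra", "Sue"), (2, "House", "Jim"), (3, "Chestnut", "Sid"),
    (4, "House", "Ken"), (5, "Zebra", "Bob"), (6, "Chestnut", "Lou"),
]

# Same mapping as the calculated-column formula; unknown values sort last.
SORT_RANK = {"Chestnut": 0, "Zebra": 1, "House": 3}

def custom_sorted(items):
    """Sort by the custom rank of Owns first, then by Name ascending."""
    return sorted(items, key=lambda row: (SORT_RANK.get(row[1], 999), row[2]))

ordered = custom_sorted(rows)
print([row[0] for row in ordered])  # [6, 3, 5, 1, 2, 4]
```

In the list itself, the equivalent would be ordering on the calculated column first and on Name second.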
Our organization is currently in the process of building a new data warehouse. We are able to use some techniques borrowed from the DW community, such as ETL processing to conform data, de-normalized dimensions in the Kimball style, etc. Overall, data warehousing is still fairly new to our organization, but we are learning the concepts as we go along.
The problem: We have multiple sources of data, with often conflicting sources of facts. For example, we have a Master Person Index, where we use a score-based matching algorithm during ETL to match an inbound person to an existing person, so even if the inbound record doesn't exactly match, we can score based on other things like zip code radius.
Here's the question: What is the standard way to handle multiple versions of a fact from two or more sources?
I understand that one of the main ideas of the data warehouse is to keep a running history of any fact, which we are doing. That's all fine and dandy when a record is maintained by one inbound source: we keep the history of that fact over time. The problem occurs when two different sources, each updating on a daily basis, hold two different facts. For example, source A says the name is Mary Smith and source B says the name is Mary Jane, and each of them resends its value every day. Based on the matching algorithm we're confident it's the same person, but because of our history-style table the name keeps flopping back and forth every day, since each load is read as a "change" from the other data source.
An example table:
first_name last_name source last_updated
Mary Smith A 5/2/12 1:00am
Mary Jane B 5/2/12 2:00am
Mary Smith A 5/3/12 1:00am
Mary Jane B 5/3/12 2:00am
Mary Smith A 5/4/12 1:00am
Mary Jane B 5/4/12 2:00am
...
Have one table that stores your external data:
id | first_name | last_name | source | external_unique_id | import_date
----+------------+-----------+--------+--------------------+-------------
1 | Mary | Smith | A | abcdefg123 | 5/2/12 1:00am
2 | Mary | Jane | B | 1234567abc | 5/2/12 2:00am
Then have a second table that contains your cleaned data:
id | first_name | last_name
----+------------+-----------
1 | Mary | Jane-Smith (or whatever)
Then have a mapping table between the two.
local_person_id | foreign_person_id
-----------------+-------------------
1 | 1
1 | 2
Or something broadly similar.
The objective is to load the facts from your source once, and keep them.
Then use your fuzzy logic to relate them to master records somewhere. You only need to do this when new facts are loaded or old facts are changed.
You still have to choose which last_name to use, but that can be almost arbitrary in the absence of determining data. For example: pick the last name from the fact loaded most recently.
You can still quickly and simply relate the master to the child facts, to their sources, and to their corresponding data. But you have a unified entity in your warehouse to hang these external facts on.
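Here is a minimal sketch of that three-table layout with Python's sqlite3 module. The table names external_person, person and person_map are placeholders, and the import dates are rewritten as ISO strings (assuming 5/2/12 means 2 May 2012) so that they sort correctly as text. It applies the "most recently loaded fact wins" rule from above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE external_person (
    id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT,
    source TEXT, external_unique_id TEXT, import_date TEXT
);
CREATE TABLE person (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
CREATE TABLE person_map (local_person_id INTEGER, foreign_person_id INTEGER);

INSERT INTO external_person VALUES
    (1, 'Mary', 'Smith', 'A', 'abcdefg123', '2012-05-02 01:00'),
    (2, 'Mary', 'Jane',  'B', '1234567abc', '2012-05-02 02:00');
INSERT INTO person VALUES (1, 'Mary', NULL);
INSERT INTO person_map VALUES (1, 1), (1, 2);
""")

# Rule of thumb only: take last_name from the most recently imported source row.
conn.execute("""
UPDATE person
SET last_name = (
    SELECT e.last_name
    FROM person_map AS m
    JOIN external_person AS e ON e.id = m.foreign_person_id
    WHERE m.local_person_id = person.id
    ORDER BY e.import_date DESC
    LIMIT 1
)
""")

master = conn.execute("SELECT id, first_name, last_name FROM person").fetchall()
print(master)  # [(1, 'Mary', 'Jane')]
```

Each source keeps a single stored row per external id, so a daily reload with unchanged values does not register as a change to the master record.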
One thing about terminology - What you've listed are "Attributes", not "Facts". A fact is a measure that you take on a set of dimensional Attributes. (for example, an order that this "person" places, or the dollar value of this customer's recent order, etc). In this case, you have multiple sources of dimensional attributes, each one considered the "same".
@Dems' method is one way (and a good one) to keep your cleaned data separate from your staging / operational data set.
Another, if you need to have access to both data sets in reporting, while still keeping a "clean" version, would be to put all the attributes on your person/customer dimension:
FIRST_NAME
LAST_NAME
SOURCE1_FIRST_NAME
SOURCE1_LAST_NAME
SOURCE2_FIRST_NAME
SOURCE2_LAST_NAME
For reports on measures where the user community is expecting to see the name from Source 2, you can use the source2 attribute. For people expecting source 1, use that. For people looking for the results of the processing which "conforms" the name, use the main attribute.
Say I have a Person table that stores information about that person (weird, right?). I have select boxes for things like gender, hair color, and eye color. Instead of creating separate tables with a description field for each, is there a good way to use a single table? Maybe a Resources table with Name and Description fields? Is it just that simple?
Resources
=========
ID Name Description
--------------------
1 Gender Male
2 Gender Female
3 Eye Color Blue
4 Eye Color Green
5 Eye Color Brown
6 Hair Color Black
7 Hair Color Brunette
8 Hair Color Blonde
9 Hair Color Red
Person
=========
ID Name Gender Eye_Color Hair_Color
-----------------------------------------------
1 Ryan 1 3 8
Is this the recommended way or is there something better for this?
Yes, it is that simple; IMO your approach is correct. But please note that your approach will not work if you ever need to select, for example, multiple hair colors for one person.
Still, I believe in keeping the code simple until you get a requirement to change it. Read about YAGNI when you have some time :)
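For what it's worth, here is a small sketch of the single-table approach using Python's sqlite3 module, with the table and column names from the question in lowercase. The extra check on resources.name in each join is my own addition, so that a gender id cannot be read back as an eye color by accident.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE resources (id INTEGER PRIMARY KEY, name TEXT, description TEXT);
CREATE TABLE person (
    id INTEGER PRIMARY KEY, name TEXT,
    gender INTEGER REFERENCES resources(id),
    eye_color INTEGER REFERENCES resources(id),
    hair_color INTEGER REFERENCES resources(id)
);
INSERT INTO resources VALUES
    (1, 'Gender', 'Male'), (2, 'Gender', 'Female'),
    (3, 'Eye Color', 'Blue'), (4, 'Eye Color', 'Green'), (5, 'Eye Color', 'Brown'),
    (6, 'Hair Color', 'Black'), (7, 'Hair Color', 'Brunette'),
    (8, 'Hair Color', 'Blonde'), (9, 'Hair Color', 'Red');
INSERT INTO person VALUES (1, 'Ryan', 1, 3, 8);
""")

# Options for one select box.
eye_options = [row[0] for row in conn.execute(
    "SELECT description FROM resources WHERE name = 'Eye Color' ORDER BY id")]

# Resolve a person's ids back to descriptions: one join per lookup column.
profile = conn.execute("""
    SELECT p.name, g.description, e.description, h.description
    FROM person AS p
    JOIN resources AS g ON g.id = p.gender     AND g.name = 'Gender'
    JOIN resources AS e ON e.id = p.eye_color  AND e.name = 'Eye Color'
    JOIN resources AS h ON h.id = p.hair_color AND h.name = 'Hair Color'
""").fetchall()

print(eye_options)  # ['Blue', 'Green', 'Brown']
print(profile)      # [('Ryan', 'Male', 'Blue', 'Blonde')]
```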
You could do it that way and it would be a polymorphic association.
If you don't need to query this information but just need to be able to access it, you can serialize the values and store them all in one column.
So a person record would have a column, let's call it attributes, that would have "eye_color: blue, gender: male", etc...
I'd create a separate table called Physical_attributes and an associative one between Person and Physical_attributes, personal_physical_attributes, where I'd store the person's id, the Physical_attribute's id and the description for that Physical_attribute.
I can't seem to group by multiple data fields and sum a particular grouped column.
I want to group Person to customer and then group customer to price and then sum price. The person with the highest combined sum(price) should be listed in ascending order.
Example:
table customer
-----------
customer | common_id
green 2
blue 2
orange 1
table invoice
----------
person | price | common_id
bob 2330 1
greg 360 2
greg 170 2
SELECT DISTINCT
min(person) As person,min(customer) AS customer, sum(price) as price
FROM invoice a LEFT JOIN customer b ON a.common_id = b.common_id
GROUP BY customer,price
ORDER BY person
The results I desire are:
**BOB:**
Orange, $2230
**GREG:**
green, $360
blue,$170
The colors are the customer, that GREG and Bob handle. Each color has a price.
There are two issues that I can see. One is a bit picky, and one is quite fundamental.
Presentation of data in SQL
SQL returns tabular data sets. It's not able to return sub-sets with headings, looking something like a pivot table.
This means that this is not possible...
**BOB:**
Orange, $2230
**GREG:**
green, $360
blue, $170
But that this is possible...
Bob, Orange, $2230
Greg, Green, $360
Greg, Blue, $170
Relating data
I can visually see how you relate the data together...
table customer table invoice
-------------- -------------
customer | common_id person | price |common_id
green 2 greg 360 2
blue 2 greg 170 2
orange 1 bob 2330 1
But SQL doesn't have any implied ordering. Things can only be related if an expression can state that they are related. For example, the following is equally possible...
table customer table invoice
-------------- -------------
customer | common_id person | price |common_id
green 2 greg 170 2 \ These two have
blue 2 greg 360 2 / been swapped
orange 1 bob 2330 1
This means that you need rules (and likely additional fields) that explicitly state which customer record matches which invoice record, especially when there are multiples in both with the same common_id.
An example of a rule could be, the lowest price always matches with the first customer alphabetically. But then, what happens if you have three records in customer for common_id = 2, but only two records in invoice for common_id = 2? Or do the number of records always match, and do you enforce that?
Most likely you need an extra piece (or pieces) of information to know which records relate to each other.
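To see the ambiguity in action, here is a quick sketch with Python's sqlite3 module that loads the two tables exactly as posted in the question and joins them on common_id, the only relationship the data actually expresses:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer TEXT, common_id INTEGER);
CREATE TABLE invoice (person TEXT, price INTEGER, common_id INTEGER);
INSERT INTO customer VALUES ('green', 2), ('blue', 2), ('orange', 1);
INSERT INTO invoice VALUES ('bob', 2330, 1), ('greg', 360, 2), ('greg', 170, 2);
""")

pairs = conn.execute("""
    SELECT a.person, b.customer, a.price
    FROM invoice AS a
    JOIN customer AS b ON a.common_id = b.common_id
    ORDER BY a.person, b.customer, a.price
""").fetchall()

for row in pairs:
    print(row)
# ('bob', 'orange', 2330)
# ('greg', 'blue', 170)
# ('greg', 'blue', 360)
# ('greg', 'green', 170)
# ('greg', 'green', 360)
```

Every greg invoice pairs with every customer that has common_id = 2, so any SUM over this join counts each price twice, and nothing in the data says whether 360 belongs to green or to blue.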
You should group by all of your selected fields except the sum. Then maybe the function group_concat (MySQL) can help you concatenate the resulting rows of each group.
I'm not sure how you could possibly do this. Greg has 2 colors AND 2 prices, so how do you determine which goes with which?
Greg Blue 170 or Greg Blue 360? Or should Green be attached to either price?
I think the colors need to have unique identifiers, separate from the person's unique identifiers.
Just a thought.
I'm new to data warehousing, but I think my question can be answered relatively easily.
I built a star schema, with a dimension table 'product'. This table has a column 'PropertyName' and a column 'PropertyValue'.
The dimension therefore looks a little like this:
surrogate_key | natural_key (productID) | PropertyName | PropertyValue | ...
1 5 Size 20 ...
2 5 Color red
3 6 Size 20
4 6 Material wood
and so on.
In my fact table I always use the surrogate keys of the dimensions. Because of the PropertyName and PropertyValue columns, my natural key isn't unique / identifying anymore, so I get way too many rows in my fact table.
My question now is, what should I do with the property columns? Would it be best, to put each property into separate dimensions, like dimension size, dimension color and so on? I got about 30 different properties.
Or shall I create columns for each property in the fact table?
Or make one dimension with all properties?
Thanks in advance for any help.
Your dimension table 'product' should look like this:
surrogate_key | natural_key (productID) | Color | Material | Size | ...
1 5 red wood 20 ...
2 6 red ...
If you have too many properties, try to group them in another dimension. For example, Color and Material can be attributes of another dimension if you can have the same product, with the same id and the same price, in another color or material. Your fact table can then identify a product with two keys: product_id and colormaterial_id...
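If the properties currently arrive as name/value rows, one possible way to build the wide dimension is conditional aggregation. This is only a sketch using Python's sqlite3 module: product_property is a made-up name for the name/value source, the rows are the four from the question, and only three of the roughly 30 properties are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_property (
    product_id INTEGER, property_name TEXT, property_value TEXT
);
INSERT INTO product_property VALUES
    (5, 'Size', '20'), (5, 'Color', 'red'),
    (6, 'Size', '20'), (6, 'Material', 'wood');
""")

# One output row per product; MAX skips NULLs, so each CASE picks out
# the single value for that property, or NULL if the product has none.
wide = conn.execute("""
    SELECT product_id,
           MAX(CASE WHEN property_name = 'Color'    THEN property_value END) AS color,
           MAX(CASE WHEN property_name = 'Material' THEN property_value END) AS material,
           MAX(CASE WHEN property_name = 'Size'     THEN property_value END) AS size
    FROM product_property
    GROUP BY product_id
    ORDER BY product_id
""").fetchall()

print(wide)  # [(5, 'red', None, '20'), (6, None, 'wood', '20')]
```

With one row per product, the natural key is unique again and the fact table gets exactly one surrogate key per product.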
Reading recommendation:
The Data Warehouse Toolkit, Ralph Kimball
Your design is called EAV (entity-attribute-value) table.
It's a nice design for the sparse matrices (large number of properties with only few of them filled at the same time).
However, it has several drawbacks.
It cannot be indexed (and hence efficiently searched) on two or more properties at once. A query like "get all products made of wood and having a size of 20" will be less efficient.
Implementing constraints involving several attributes at once is more complex.
etc.
If it's not a problem for you, you can use EAV design.
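To illustrate the first drawback, here is a rough comparison with Python's sqlite3 module. Both table names (product_eav, product_wide) are made up, and the data follows the question. The "wood and size 20" search on the EAV table has to collect matching rows per product and count them, while the wide table answers it with a plain WHERE clause that a composite index on (material, size) could serve.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_eav (
    product_id INTEGER, property_name TEXT, property_value TEXT
);
INSERT INTO product_eav VALUES
    (5, 'Size', '20'), (5, 'Color', 'red'),
    (6, 'Size', '20'), (6, 'Material', 'wood');

CREATE TABLE product_wide (
    product_id INTEGER PRIMARY KEY, color TEXT, material TEXT, size INTEGER
);
INSERT INTO product_wide VALUES (5, 'red', NULL, 20), (6, NULL, 'wood', 20);
""")

# EAV: keep the rows matching either condition, then require both per product.
eav_hits = [row[0] for row in conn.execute("""
    SELECT product_id
    FROM product_eav
    WHERE (property_name = 'Material' AND property_value = 'wood')
       OR (property_name = 'Size' AND property_value = '20')
    GROUP BY product_id
    HAVING COUNT(DISTINCT property_name) = 2
""")]

# Wide table: the same question is a simple two-column filter.
wide_hits = [row[0] for row in conn.execute(
    "SELECT product_id FROM product_wide WHERE material = 'wood' AND size = 20")]

print(eav_hits, wide_hits)  # [6] [6]
```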