I've been asked to create a table that will contain many repeated values, and I'm not sure of the best way to do it.
I must use SQL Server. I would love to use Azure Table Storage and partition keys, but I'm not allowed to.
Imagine that the table Shoes has the columns
id int, customer_name varchar(50), shoe_type varchar(50)
The problem is that the column shoe_type will have millions of repeated values, and I want to have them in their own partition, but as far as I know SQL Server only allows range partitions.
I don't want the repeated values to take more space than needed: if a value is repeated 50 times, it should take roughly the space of one value, not 50.
I thought about using a relationship between the column shoe_type (as an int) and another table which will have its string value, but is that the most I can optimize?
EDIT
Shoes table data
id customer_name shoe_type
-----------------------------
1 a nike
2 b adidas
3 c adidas
4 d nike
5 e adidas
6 f nike
7 g puma
8 h nike
As you can see, the rows contain repeated shoe_type values (nike, adidas, puma).
What I thought about is making shoe_type an int foreign key to another table, but I'm not sure this is the most efficient way to do it. In Azure Table Storage you have partitions and partition keys; in MS SQL Server you also have partitions, but they are range-based only.
The sample data you provide suggests that there is a "shoe type" entity in the business domain, and that all shoes have a mandatory relationship to a single shoe type. It would be different if the values were descriptive text - e.g. "Attractive running shoe, suitable for track and leisure wear". Repeated values are often (but of course not always) an indicator that there is another entity you can extract.
You suggest that the table will have millions of records. In very general terms, I recommend designing your schema to reflect the business domain, and going for exotic optimization options only once you know, and can measure, that you have a performance problem.
In your case, I'd suggest factoring out a separate table called "shoe_types" and including a foreign key relationship from "shoes" to "shoe_types". The primary key of "shoe_types" should be a clustered index, and the "shoe_type_id" column in "shoes" should get a regular (nonclustered) index. All other things being equal, even with (tens of) millions of rows, queries that hit the foreign key index should be very fast.
In addition, supporting queries like "find all shoes where shoe type name starts with 'nik%'" should be much faster, because the shoe_types table should have far fewer rows than "shoes".
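A minimal sketch of that layout (table and column names are illustrative; adapt them to your schema):
CREATE TABLE shoe_types (
    shoe_type_id INT IDENTITY(1,1) PRIMARY KEY,   -- clustered by default in SQL Server
    name VARCHAR(50) NOT NULL UNIQUE
);
CREATE TABLE shoes (
    id INT IDENTITY(1,1) PRIMARY KEY,
    customer_name VARCHAR(50) NOT NULL,
    shoe_type_id INT NOT NULL REFERENCES shoe_types (shoe_type_id)
);
-- Nonclustered index on the foreign key so joins and filters by shoe type stay fast
CREATE NONCLUSTERED INDEX IX_shoes_shoe_type_id ON shoes (shoe_type_id);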
Related
I want to represent a vehicle (think car or truck) in a database. I have up to 62 pieces of information I'd like to store for each. Examples: year, make, model, drive type, brake system, Mfr. body code, steering type, wheel base, etc. The information are Ids which reference a 3rd party database which provides the labels for each Id. The provider has 1 table to list all makes, 1 table to list all "Steering types", etc.
All vehicles will populate the year, make, and model columns. Almost no record (if any) will populate more than 10 columns. But if I looked at all vehicles, then every column would be used by at least one record.
One approach would be to have a single table that has 62 columns. Again most records will have NULL values in most columns.
Alternatively I can do something like this (ignoring indices and primary key for sake of example):
create table vehicles (
    id int identity(1,1),
    year int,
    make int,
    model int
)
create table constraints (
    id int identity(1,1),
    vehicleId int, -- foreign key to vehicles.id
    constraintTypeId int, -- foreign key to constraintTypes.id
    value int
)
create table constraintTypes (
    id int identity(1,1),
    name nvarchar(200) -- Example: "wheel base", "brake system" etc.
)
With this second method if a vehicle only stores 2 pieces of information (aside from year, make, model), then it would have 2 records in table constraints.
Users wish to have a page to view all applications. If I have a table with 62 columns I'd need 62 joins in the query to get the labels. I could store the labels on the vehicle to make retrieval faster, but then when labels change in the source data it might be slow to update my vehicles table.
At current there are over 12 million vehicle records, and the source data changes monthly (additions, deletes, and a few label changes).
Is it a better design to have more columns, even if most are always NULL, or is the second approach better? How does one even evaluate which approach is best? Even if I had 62 columns, they are all valid for every vehicle, but for cataloging purposes most are left empty. For example, if a record should match any "1999 Dodge Viper" (regardless of steering type, body style, etc.) the user doesn't want to have to populate all 62 columns; they just want to see one record for "1999 Dodge Viper".
Your question is a specific case of the general issue of data anomalies and normalisation: https://en.wikipedia.org/wiki/Database_normalization
There is no 'right' answer, although experience suggests there are 'better' and 'worse' answers. So, a question to help you with your planning:
Will the requirements ever change? E.g. will someone one day want to record the brake shoe type, or the driver's seat type? If yes, what are the implications of your 62-column table becoming a 63 (or 99) column table? (In my mind this leads me towards your second method.)
Also remember that, thanks to views, the presentation of the data, even in the DB, does not have to match its storage. E.g. you can have well-normalised tables and a view that shows users 62 (or 63 or 99) columns.
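For example, a view along these lines could pivot selected constraint types back into flat columns for presentation. This is only a sketch against the tables above; the view name, the two pivoted columns, and the constraintTypes names are illustrative:
CREATE VIEW vehicleSummary AS
SELECT v.id,
       v.year,
       v.make,
       v.model,
       MAX(CASE WHEN ct.name = 'wheel base' THEN c.value END) AS wheelBase,
       MAX(CASE WHEN ct.name = 'brake system' THEN c.value END) AS brakeSystem
FROM vehicles v
LEFT JOIN constraints c ON c.vehicleId = v.id
LEFT JOIN constraintTypes ct ON ct.id = c.constraintTypeId
GROUP BY v.id, v.year, v.make, v.model;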
Before I go asking more questions about the coding, I'd like to first figure out the best method to follow for building my database. I'm running into a problem with how to structure it to keep everything minimal, because by its nature it has lots of recurring data that I have to represent.
I design custom shirts and have a variety of different types of shirts for people to choose from that are available in both adult and child sizes of both genders. For example, I have crewneck shirts, raglan sleeves, ringer sleeves and hoodies which are available for men, women, boys, girls and toddlers. The prices are the same for each shirt from the toddler sizes up to 1x in the adult sizes, then 2x, 3x, 4x and 5x are each different prices. Then there's the color options for each kind of shirt which varies, some may have 4 color options, some have 32.
So let's take just the crewneck shirts as an example. Men s-1x, women s-1x, boys xs-1x, girls xs-1x and toddlers NB-18 months is a total of 22 rows that will be represented in the table, all at the same price. 2X and up only apply to men and women, so that's 8 more rows, which makes 30 rows total for just the crewneck shirts. When it gets into the color options, there are 32 different colors available for them. If I were to do each and every size for all of them, that would be 960 total rows for the crewneck shirts alone, with mostly HIGHLY repeated data for just one minor change.
I thought about it and figured it's best to treat these items in the table as actual items in a stock room, because THEY'RE REALLY THERE in the stock room... you don't have just one box of shirts where you can punch a button on the side to turn it into any size or color; you have to deal with the actual shirts and the tedious task of putting them somewhere. So I decided against getting outrageous with a bunch of foreign keys and indexes; besides, that gets just as tedious, and you wind up having to represent just as much, only with a lot more tables, when you could have just put the data they link to in the first table.
If we take the other 3 kinds of shirts and apply that same logic, then with all the colors and sizes those 4 shirts alone come to 3,840 rows; counting the other shirts I haven't included, you could say I'm looking at roughly 10,000 rows of data all in one table. This data will grow over time, and I'm wondering what it might turn into while I try to keep it all organized. So I figured maybe the best logic would be to break it down like they do in an actual retail store, which is to separate the departments into men, women, boys, girls and babies. That way I have 5 separate tables that are only queried when the user decides to "go to that department", so a man who wants the men's shirts doesn't have 7,000+ rows of extra data present that don't even apply to what he's looking for.
Would this be a better way of setting it up, or would it be better to keep it all as one gigantic table and just query the "men" shirts from it in the PHP for the men's section, and do the same for women and kids?
My next issue is all the color options that may be available. As I said before, some shirts will have as few as 4 and some will have as many as 32, so some of those are enough data to form a table all on their own, which means I could really have a separate table for every kind of shirt. I'll be using a query in PHP to populate my items from the tables so I don't have to code so much in the HTML and JavaScript. That way I can do something like SELECT * FROM table WHERE type = 'men' and it will take all the men's shirts and auto-populate the markup for each one. Then as I add and remove things from the tables it'll automatically be updated. I already have an idea for HOW I'm going to do that, but I can only think so far into it because I haven't decided on a good way to set the tables up, which is what the queries have to be structured around.
For example, having all the color options of each shirt on the same table versus breaking them out into other tables linked by foreign keys would be two totally different ways of querying the data, so I'm stuck on this and don't really know where to go with it. Any suggestions?
Typically retail organization is by SKU (stock keeping unit). Department and color are attributes of a garment, not the way you identify the garment for the purpose of accounting or stocking.
CREATE TABLE Skus (
    sku BIGINT UNSIGNED PRIMARY KEY,
    description TEXT,
    department VARCHAR(10) NOT NULL,
    color VARCHAR(10) NOT NULL,
    qty_in_stock INT UNSIGNED NOT NULL DEFAULT 0,
    unit_price NUMERIC(9,2) NOT NULL,
    FOREIGN KEY (department) REFERENCES Departments(department),
    FOREIGN KEY (color) REFERENCES Colors(color)
);
This is better than splitting into five tables, because:
You can quickly get a sum of the total value of all your stock.
You can switch the department of a given SKU easily.
When someone buys a few garments, their order line items reference a single table instead of five different tables (a foreign key can only reference one table, so five would be invalid in a relational database).
There are lots of other examples of tasks that are easier if similar entities are stored in one table.
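For instance, the first two tasks above become one-liners against the single table (a sketch assuming the Skus table as defined; the sku value and department code are made up):
-- Total value of all stock on hand
SELECT SUM(qty_in_stock * unit_price) AS stock_value FROM Skus;
-- Move a SKU to a different department
UPDATE Skus SET department = 'WOMEN' WHERE sku = 1000001;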
I know you don't want to break it out into separate tables, but I think going the multiple table route would be the best. However, I don't think it is as bad as you think. My suggestion would be the following. Obviously, you want to change the names of the fields, but this is a quick representation:
Shirts
- id (primary key)
- description
- men (Y/N)
- women (Y/N)
- boy (Y/N)
- girl (Y/N)
- toddlers (Y/N)
Sizes
- id (primary key)
- shirt_id (foreign key)
- Size
Colors
- id (primary key)
- shirt_id (foreign key)
- Color
Price
- id (primary key)
- shirt_id (foreign key)
- size_id (foreign key)
- price
Having these four tables makes it so that you won't have to store all 10,000 rows in one single table and maintain it, but the data is still all there. Keeping your data separated into its proper place keeps you from replicating needless information.
Want to pull all men's shirts?
SELECT * FROM Shirts WHERE men = 'Y'
To be honest, you should really have at least 5 or 6 tables: one or two containing the labels for sizes and colors (either one table containing both, or one for each) and the other 4 containing the actual data. This will keep your data uniform across everything (example: "Blue" vs "blue"). You know what they say, there is more than one way to skin a cat.
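To show that the data is still all there after splitting it up, a query along these lines (a sketch using the field names above) stitches a men's shirt back together with its sizes and prices:
SELECT s.description, z.Size, p.price
FROM Shirts s
JOIN Sizes z ON z.shirt_id = s.id
JOIN Price p ON p.shirt_id = s.id AND p.size_id = z.id
WHERE s.men = 'Y';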
You need to think about a database term called 'normalization'. Normalization means that everything has its place in the database and should not be listed twice, but reused as needed. The most common mistake people make is not asking or thinking about what will happen down the road: they put up a database that has next to no normalization, consumes massive memory due to large datatypes, has no seeding done, and is completely inflexible, so it comes at a great cost to change later because it was made without thinking of the future.
There are many levels of normalization, but the simplest way to see the idea is a small example that shows some concepts you can apply to larger things later. This assumes you have access to SQL Server Management Studio (SSMS); however, if you are using MySQL or Oracle the principles are still very similar, and the comment sections show what I am getting at. You can run this example yourself if you have SSMS: just paste it in and hit F5. If you don't, just read the comments, although these concepts are easier to see in action than to envision from a description.
Declare @Everything table (PersonID int, OrderID int, PersonName varchar(8), OrderName varchar(8));
insert into @Everything values (1, 1, 'Brett', 'Hat'),(1, 2, 'Brett', 'Shirt'),(1, 3, 'Brett', 'Shoes'),(2, 1, 'John', 'Shirt'),(2, 2, 'John', 'Shoes');
-- Very basic normalization level in that I did not even ATTEMPT to separate entities into different tables for reuse.
-- I just insert EVERYTHING as I get it, in one place. This is great for just getting off the ground or testing things,
-- but in the future you won't be able to change this easily, as everything is here, and if there is a lot of data it is hard
-- to move. When you insert, if you keep adding more and more columns it will get slower, as it requires memory
-- for the rows and the columns.
Select Top 10 * from @Everything
declare @Person table ( PersonID int identity, PersonName varchar(8));
insert into @Person values ('Brett'),('John');
declare @Orders table ( OrderID int identity, PersonID int, OrderName varchar(8));
insert into @Orders values (1, 'Hat'),(1, 'Shirt'),(1, 'Shoes'),(2, 'Shirt'),(2, 'Shoes');
-- I now have two logical things stored in two logical places. If I want to relate them I can use the T-SQL language
-- to do so. I am now using less memory for storage of the individual tables, and if one or another becomes too large I can
-- deal with it in isolation. I also have a seeded identity column (an ever-increasing number) that I can use as a primary key
-- to relate rows and for faster indexing.
Select *
from @Person p
join @Orders o on p.PersonID = o.PersonID
declare @TypeOfOrder table ( OrderTypeID int identity, OrderType varchar(8));
insert into @TypeOfOrder values ('Hat'),('Shirt'),('Shoes');
declare @OrderBridge table ( OrderID int identity, PersonID int, OrderType int);
insert into @OrderBridge values (1, 1),(1, 2),(1, 3),(2, 2),(2, 3);
-- Wow, I have a lot more columns overall, but my ability to expand is now pretty flexible: I could add even MORE products to the type table,
-- or other tables I have not even thought of yet. Now that I have a bridge table I have to list a product type only ONCE, ever, and
-- when someone orders it again I just add a row to the bridge to relate a person to an order type; hence the name bridge, as on its own
-- it serves nothing but relating two different things to each other. This method takes more time to set up, but in the end you need
-- fewer rows in your database overall, as you are REUSING data efficiently and effectively.
Select Top 10 *
from @Person p
join @OrderBridge o on p.PersonID = o.PersonID
join @TypeOfOrder t on o.OrderType = t.OrderTypeID
I am hopping onto a project that sits on top of a SQL Server 2008 DB with what seems to me like an inefficient schema. However, I'm not an expert at anything SQL, so I am seeking guidance.
In general, the schema has tables like this:
ID | A | B
ID is a unique identifier
A contains text, such as animal names. There's very little variety; maybe 3-4 different values in thousands of rows. This could vary with time, but still a small set.
B is one of two options, but stored as text. The set is finite.
My questions are as follows:
Should I create another table for names contained in A, with an ID and a value, and set the ID as the primary key? Or should I just put an index on that column in my table? Right now, to get a list of A's, it does "select distinct(a) from table" which seems inefficient to me.
The table has a multitude of columns for properties of A. It could be like: Color, Age, Weight, etc. I would think that this is better suited in a separate table with: ID, AnimalID, Property, Value. Each property is unique to the animal, so I'm not sure how this schema could enforce this (the current schema implies this as it's a column, so you can only have one value for each property).
Right now the DB is easily readable by a human, but its size is growing fast and I feel like the design is inefficient. There is currently no index at all anywhere. As I said, I'm not a pro, but I will read more on the subject. The goal is to have a fast system. Thanks for your advice!
This sounds like a database that might represent a veterinary clinic.
If the table you describe represents the various patients (animals) that come to the clinic, then properties specific to them are probably best kept on the primary table. But since, as you say, column "A" contains a species name, it might be worthwhile to link that to a secondary table to save on the redundancy of storing those names.
For example:
Patients
--------
ID Name SpeciesID Color DOB Weight
1 Spot 1 Black/White 2008-01-01 20
Species
-------
ID Species
1 Cocker Spaniel
If your main table should instead be grouped by customer or owner, then you may want to add an Animals table and link it:
Customers
---------
ID Name
1 John Q. Sample
Animals
-------
ID CustomerID SpeciesID Name Color DOB Weight
1 1 1 Spot Black/White 2008-01-01 20
...
As for your original column B, consider converting it to a boolean (BIT) if you only need to store two states. Barring that, consider CHAR to store a fixed number of characters.
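In DDL terms, the species lookup and the BIT conversion might look roughly like this (a sketch; table and column names follow the example above, and the flag column is just a stand-in for your real column B):
CREATE TABLE Species (
    ID INT IDENTITY(1,1) PRIMARY KEY,
    Species VARCHAR(50) NOT NULL
);
CREATE TABLE Patients (
    ID INT IDENTITY(1,1) PRIMARY KEY,
    Name VARCHAR(50) NOT NULL,
    SpeciesID INT NOT NULL REFERENCES Species (ID),
    Color VARCHAR(50) NULL,
    DOB DATE NULL,
    Weight DECIMAL(6,2) NULL,
    B BIT NOT NULL DEFAULT 0   -- the old two-value text column, stored as a flag
);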
Like most things, it depends.
Having the animal names directly in the table makes your reporting queries more efficient by removing the need for extra joins.
Going with something like 3rd normal form (having an ID/Name table for the animals) makes your database smaller, but requires more joins for reporting.
Either way, make sure to add some indexes.
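For example (a sketch; your table is literally called "table" in the question, so substitute the real name):
CREATE NONCLUSTERED INDEX IX_table_A ON [table] (A);   -- helps "select distinct A" and filters on A
CREATE NONCLUSTERED INDEX IX_table_B ON [table] (B);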
I am working with an extensive amount of third-party data. Each data set has items with unique identifiers, so it is very easy for me to use a UNIQUE column in SQLite to enforce some data integrity.
Out of thousands of records I have one id from third-party source A matching 2 unique ids from third-party source B.
Is there a way of bending the rules and allowing a duplicate entry in a unique column? If not, how should I reorganise my data to take care of this single edge case?
UPDATE:
CREATE TABLE "trainer" (
"id" INTEGER PRIMARY KEY AUTOINCREMENT,
"name" TEXT NOT NULL,
"betfair_id" INTEGER NOT NULL UNIQUE,
"racingpost_id" INTEGER NOT NULL UNIQUE
);
Problem data:
Miss Beverley J Thomas http://www.racingpost.com/horses/trainer_home.sd?trainer_id=20514
Miss B J Thomas http://www.racingpost.com/horses/trainer_home.sd?trainer_id=11096
vs. Miss Beverley J. Thomas http://form.horseracing.betfair.com/form/trainer/1/00008861
Both Racingpost entries (my primary data source) match a single Betfair entry. This is the only one (so far) out of thousands of records.
If racingpost should have had only 1 match it is an error condition.
If racingpost is allowed to have 2 matches per id, you must either have two ids, select one, or combine the data.
Since racingpost is your primary source, having 2 ids may make sense. However, if you want to improve on that data set, combining the data or selecting the most useful record may be more accurate. The real question is how much data overlaps between these two records, and whether you can detect the overlap reliably. If the overlap is small or you can reliably detect an overlap condition, then combining makes more sense. If the overlap is large and you cannot detect it reliably, then selecting the most recently updated record, or keeping two ids, is more useful.
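If you go the two-ids route, one way to model it (a sketch, and only one of several options) is to move the Racingpost ids into their own table, so each racingpost_id is still unique but a trainer can have more than one:
CREATE TABLE "trainer" (
    "id" INTEGER PRIMARY KEY AUTOINCREMENT,
    "name" TEXT NOT NULL,
    "betfair_id" INTEGER NOT NULL UNIQUE
);
CREATE TABLE "trainer_racingpost" (
    "trainer_id" INTEGER NOT NULL REFERENCES "trainer"("id"),
    "racingpost_id" INTEGER NOT NULL UNIQUE
);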
I am wondering whether it is more useful and practical (in terms of DB size) to create multiple tables in SQL with two columns each (one column containing a foreign key and one column containing the data), or to merge them and create one table containing multiple columns. I am asking because in my scenario a product holding a primary key could have applicable data for only one column while the other columns would be empty.
example a. one table
productID productname weight no_of_pages
1 book 130 500
2 watch 50 null
3 ring null null
example b. three tables
productID productname
1 book
2 watch
3 ring
productID weight
1 130
2 50
productID no_of_pages
1 500
The multi-table approach is more "normal" (in database terms) because it avoids columns that commonly store NULLs. It's also something of a pain in programming terms because you have to JOIN a bunch of tables to get your original entity back.
I suggest adopting a middle way. Weight seems to be a property of most products, if not all (indeed, a ring has a weight, even if small, and you'll probably want to know it for shipping purposes), so I'd leave that in the Products table. But number of pages applies only to a book, as do a slew of other unmentioned properties (author, ISBN, etc.). In this example, I'd use a Products table and a Books table. The Books table would extend the Products table in a fashion similar to class inheritance in object-oriented programming.
All book-specific properties go into the Books table, and you join only Products and Books to get a complete description of a book.
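A minimal sketch of that middle way, using the column names from the example (the extra book columns mentioned are just placeholders):
CREATE TABLE Products (
    productID INT PRIMARY KEY,
    productname VARCHAR(100) NOT NULL,
    weight INT NULL               -- applies to most products, so it stays here
);
CREATE TABLE Books (
    productID INT PRIMARY KEY REFERENCES Products (productID),  -- same key: a book "is a" product
    no_of_pages INT NOT NULL
    -- author, ISBN and other book-only columns would go here
);
-- Full description of a book
SELECT p.productID, p.productname, p.weight, b.no_of_pages
FROM Products p
JOIN Books b ON b.productID = p.productID;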
I think this all depends on how the tables will be used. Maybe your examples are oversimplifying things too much but it seems to me that the first option should be good enough.
You'd really use the second example if you're going to be doing extremely CPU intensive stuff with the first table and will only need the second and third tables when more information about a product is needed.
If you're going to need the information in the second and third tables most times you query the table, then there's no reason to join over every time and you should just keep it in one table.
I would suggest example a if there is a defined set of attributes for a product, and example c if you need a variable number of attributes (new attributes keep coming every now and then):
example c
productID productName
1 book
2 watch
3 ring
attrID productID attrType attrValue
1 1 weight 130
2 1 no_of_pages 500
3 2 weight 50
The table structure you have shown in example b is not normalized: separate id columns would be required in the second and third tables, since productID there would be an FK and not a PK.
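Retrieving a product together with its attributes under example c is then a single join (a sketch; I'm assuming the attribute table is called attributes, since example c doesn't name it):
SELECT p.productID, p.productName, a.attrType, a.attrValue
FROM products p
LEFT JOIN attributes a ON a.productID = p.productID
WHERE p.productID = 1;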
It depends on how many rows you are expecting in your PRODUCTS table. I would say that it would not make sense to normalize your tables to 3NF in this case, because product name, weight, and no_of_pages each describe the products. If you had repeating data such as manufacturers, it would make more sense to normalize your tables at that point.
Without knowing the background (data model), there is no way to tell which variant is more "correct". Both are fine in certain scenarios.
You want three tables, full stop. That's best because there's no chance of watches winding up with pages (no pun intended) and some books without. If you normalize, the server works for you. If you don't, you do the work instead, just not as well. Up to you.
I am asking because in my scenario a product holding a primary key could have applicable data for only one column while the other columns would be empty.
That's always true of nullable columns. Here's the rule: a nullable column has an optional relationship to the key. A nullable column can always be, and usually should be, in a separate table where it can be non-null.
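Applied to the earlier example, that rule means no_of_pages can live in its own table, where it is never NULL (a sketch; the table name is made up):
CREATE TABLE product_pages (
    productID INT PRIMARY KEY REFERENCES products (productID),
    no_of_pages INT NOT NULL      -- a row exists only for products that actually have pages
);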