How to distribute many columns of one table in a SQL Server 2012 database - sql

I am thinking about a problem in our database.
I have one table for our products. It has a few columns that are common to all products. But each product belongs to a manufacturer, and each manufacturer needs some extra columns for the product specification. So I am thinking about how to split up our table.
I think keeping it all in one table is a waste of memory. For example, I have 20k products for Apple, 30k for Asus and 40k for MSI, so if it is all in one table, the Apple columns will be NULL for 70k records.
Another idea was to have a separate table for each manufacturer, with a key in products pointing to the specific table, e.g. the one with the Apple columns. For example, the key could be apple1, apple2 and so on. But with this idea it was quite difficult to show all products with their specific columns.
So I want to ask whether someone has already thought about this problem.
I am using SQL Server 2012 for our database.
Thanks for any help with this problem.

You can use a structure like this...
It's just an example...
If you want to compare specifications, the table with the product data will be more complex...

Databases are built to handle information as a collection of related sets, and to use selection criteria to get you what you want to work with. You should design around how your information elements relate to each other, something like:
Manufacturer table, with information that may be peculiar to a manufacturer (e.g. address, telephone, etc.)
Product table, with a foreign key reference to Manufacturer
You can then select information for, say, Apple with something like
select xxx
from Product p
join Manufacturer m on p.manufacturerID = m.manufacturerID
where m.name = 'Apple'
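A minimal runnable sketch of the Manufacturer/Product layout, using Python's built-in sqlite3 module instead of SQL Server purely so the example is self-contained (that substitution is an assumption, as are the productID and title columns and the sample rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked to
conn.executescript("""
CREATE TABLE Manufacturer (
    manufacturerID INTEGER PRIMARY KEY,
    name           TEXT NOT NULL UNIQUE
);
CREATE TABLE Product (
    productID      INTEGER PRIMARY KEY,   -- hypothetical column
    manufacturerID INTEGER NOT NULL REFERENCES Manufacturer(manufacturerID),
    title          TEXT NOT NULL          -- hypothetical column
);
INSERT INTO Manufacturer VALUES (1, 'Apple'), (2, 'Asus');
INSERT INTO Product VALUES
    (10, 1, 'MacBook Air'),
    (11, 2, 'ZenBook'),
    (12, 1, 'iPad');
""")

# Select only the products that belong to Apple.
apple_titles = [row[0] for row in conn.execute("""
    SELECT p.title
    FROM Product p
    JOIN Manufacturer m ON p.manufacturerID = m.manufacturerID
    WHERE m.name = 'Apple'
    ORDER BY p.productID
""")]
print(apple_titles)
```

The same SELECT should carry over to SQL Server with little change; only the DDL types would need adjusting.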

Related

How to structure SQL tables with one (non-composite) candidate key and all non-primary attributes?

I'm not very familiar with relational databases but here is my question.
I have some raw data that's collected as a result of a customer survey. For each customer who participated, there is only one record and that's uniquely identifiable by the CustomerId attribute. All other attributes I believe fall under the non-prime key description as no other attribute depends on another, apart from the non-composite candidate key. Also, all columns are atomic, as in, none can be split into multiple columns.
For example, the columns are like CustomerId(non-sequential), Race, Weight, Height, Salary, EducationLevel, JobFunction, NumberOfCars, NumberOfChildren, MaritalStatus, GeneralHealth, MentalHealth and I have 100+ columns like this in total.
So, as far as I understand we can't talk about any form of normalization for this kind of dataset, am I correct?
However, given the excessive number of columns, if I wanted to split this monolithic table into tables with fewer columns, ie based on some categorisation of columns like demographics, health, employment etc, is there a specific name for such a structure/approach in the literature? All the tables are still going to be using the CustomerId as their primary key.
Yes, this is part of an assignment and as part of a task, it's required to fit this dataset into a relational DB, not a document DB which I don't think would gain anything in this case anyway.
So there is no direct question beyond what I worded above, but creating a table with 100+ columns doesn't feel right to me. What I am trying to understand is how the theory approaches such blobs. Some concept names or ideas for further investigation would be appreciated, as I don't even know how to look this up.
In relational databases, keeping all of the information in a single table is usually not good practice.
As you mentioned, grouping some of the columns into other tables and joining those tables to a master table works well. With this approach you can also handle one-to-many, many-to-one and many-to-many relationships, for example customers who have more than one address or phone number.
Another approach is to make a table like customer_properties with columns such as property_type and property_value, and store the data as rows:
customer_id  property_type   property_value
1            num_of_child    3
1            age             22
1            marital_status  Single
...
But the first approach is more effective and is the most common one.
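To make the two approaches concrete, here is a small sketch using Python's built-in sqlite3 module. The table names customer_demographics and customer_household and the way the columns are grouped are assumptions for illustration; customer_properties follows the sample rows above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Approach 1: one table per column group, all keyed by customer_id.
CREATE TABLE customer_demographics (
    customer_id    INTEGER PRIMARY KEY,
    age            INTEGER,
    marital_status TEXT
);
CREATE TABLE customer_household (
    customer_id  INTEGER PRIMARY KEY
                 REFERENCES customer_demographics(customer_id),
    num_of_child INTEGER
);
INSERT INTO customer_demographics VALUES (1, 22, 'Single');
INSERT INTO customer_household VALUES (1, 3);

-- Approach 2: one row per property, every value stored as text.
CREATE TABLE customer_properties (
    customer_id    INTEGER NOT NULL,
    property_type  TEXT NOT NULL,
    property_value TEXT,
    PRIMARY KEY (customer_id, property_type)
);
INSERT INTO customer_properties VALUES
    (1, 'num_of_child', '3'),
    (1, 'age', '22'),
    (1, 'marital_status', 'Single');
""")

# Approach 1: a plain join on the shared key rebuilds the wide row.
joined = conn.execute("""
    SELECT d.customer_id, d.age, d.marital_status, h.num_of_child
    FROM customer_demographics d
    JOIN customer_household h ON h.customer_id = d.customer_id
""").fetchone()

# Approach 2: the rows have to be pivoted back into columns.
pivoted = conn.execute("""
    SELECT customer_id,
           MAX(CASE WHEN property_type = 'age' THEN property_value END),
           MAX(CASE WHEN property_type = 'marital_status' THEN property_value END),
           MAX(CASE WHEN property_type = 'num_of_child' THEN property_value END)
    FROM customer_properties
    GROUP BY customer_id
""").fetchone()

print(joined)
print(pivoted)
```

Note that the property table loses the column types (everything comes back as text) and needs one CASE expression per property, which is part of why the first approach tends to be easier to work with.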

Cross-querying multiple tables in an SQLite database without indexed foreign keys

I am working on a research project, using the IMDb dataset as my source of secondary data. I downloaded the entire database in text format from .ftp servers provided by IMDb itself, and used the IMDbPY python package to compile all of the unsorted information into a relational database. I chose to use SQLite as my SQL engine, as it seemed like the least cumbersome option thanks to its ability to create locally-stored databases. After a bit of poking around and a lot of documentation-reading, I ended up with a 9.04 GB im.db file, hosting the entirety of IMDb.
Now I need to isolate my dataset according to my requirements, but due to my lack of experience with SQL I'm finding it difficult to figure out the most optimal way of doing so.
Specifically, I want to look at:
Movies only (i.e. exclude TV series, episodes, etc.);
Produced in the period between 2000-2015, inclusive;
Of feature length (i.e. running time over 40 minutes);
Non-adult (I didn't even know IMDb hosts information on these, but apparently so);
Produced in the USA;
With complete information on crew.
Here's a representation of my database schema. I was confused by some of the database design choices that IMDbPY creators made, but I'm no SQL expert, and this is what I get to work with. Some clarifications:
The title table holds basic information about every instance of films, shows, episodes, and so on, 3,673,485 rows in total. The id column is an auto-incremented primary key, which is referenced as the movie_id foreign key in all other relevant tables. However, it seems that none of the foreign keys in the other tables are indexed properly, so I can't use simple query statements to properly get the necessary information just by knowing a particular film's id value.
Running SELECT count(*) FROM title WHERE kind_id=1 AND production_year BETWEEN 2000 AND 2015; tells me that there are 442,135 instances of movies, produced between 2000-2015. So far so good.
The complete_cast and comp_cast_type tables hold info about the completion status of a film's crew/cast list. Since I only need to consider films with complete crew information, I need to isolate only those instances, where (i) movie_id exists in my previous query (i.e. out of the 442,135 movie rows); (ii) subject_id=2; and (iii) status_id=3 or 4.
This is where it gets tricky for me. The movie_info table holds 20 million rows of information about films and TV shows, including runtimes, genres, countries of production, years of production, etc. Basically all of the information that I need to isolate my dataset. Within that table (i) id is an arbitrary auto-incremented primary key; (ii) movie_id refers to the id values from title; (iii) info_type_id refers to one of the 113 types of information as listed in the table info_type; (iv) info holds the actual information, as integers or strings.
For example: Running SELECT id FROM title WHERE title='2001: A Space Odyssey' AND kind_id=1; returns '2484213'. Running SELECT info FROM movie_info WHERE movie_id=2484213 AND info_type_id=1; returns '142, 161, 149', indicating the running times in minutes of the three available versions of the film. Running SELECT info FROM movie_info WHERE movie_id=2484213 AND info_type_id=8 returns 'USA, UK', indicating the countries involved in production. And so on.
Basically I'm trying to create a new table, populated only with films that fall under my requirements, and I'm having a hard time figuring out the most efficient way of doing so. Here's how I translated my requirements into basic SQL syntax:
SELECT * FROM title WHERE kind_id=1 AND production_year BETWEEN 2000 AND 2015;
Then a bunch of requirements from the movie_info table, which cross-references only those instances, where movie_id exists as id in the query above, and (i) info_type_id=1 AND info>40; (ii) info_type_id=3 AND info!='Adult'; (iii) info_type_id=8 AND info='USA';
Finally, I need to make sure that all of the selections exist in the complete_cast table, WHERE subject_id=2 AND status_id IN (3, 4);
I've been reading SQLite documentation, and suspect that I need to use some combination of INNER/LEFT OUTER JOIN, EXISTS and UNION/INTERSECT/EXCEPT statements, but not sure how to approach this exactly. I would like to write this code efficiently, since brute-forcing queries requirement by requirement takes a while for my computer to process. Thank you in advance for your help.
TL;DR. I can't figure out an efficient way of using INNER/LEFT OUTER JOIN, EXISTS and UNION/INTERSECT/EXCEPT statements to help me isolate a smaller dataset in accordance with multiple requirements, to satisfy which I need to cross-query a number of existing tables without properly indexed foreign keys.
Is an inner join with all the required tables too slow for your needs?
You could create tables that just contain the subset data that you need and then run an inner join on those.
So create a table "movie" and insert only those records from "title" with a kind_id of 1. Then do something like
SELECT *
FROM movie m
INNER JOIN movie_info mi
    ON m.id = mi.movie_id
INNER JOIN complete_cast cc
    ON m.id = cc.movie_id
WHERE
...
Provided your new tables don't have the same kind of volume of data, it should perform better.
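One way to put this together is to index the movie_id columns first and then build the subset table with EXISTS checks. The sketch below runs against a tiny toy copy of the schema through Python's sqlite3 module. The index names, the sample rows and the assumption that info holds a plain number of minutes for info_type_id=1 are illustrative; the column names and id values come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE title (id INTEGER PRIMARY KEY, title TEXT,
                    kind_id INTEGER, production_year INTEGER);
CREATE TABLE movie_info (id INTEGER PRIMARY KEY, movie_id INTEGER,
                         info_type_id INTEGER, info TEXT);
CREATE TABLE complete_cast (id INTEGER PRIMARY KEY, movie_id INTEGER,
                            subject_id INTEGER, status_id INTEGER);

-- Toy rows: only 'Keeper' meets every requirement.
INSERT INTO title VALUES
    (1, 'Keeper', 1, 2005), (2, 'Too Short', 1, 2005),
    (3, 'Too Old', 1, 1995), (4, 'No Crew Info', 1, 2010),
    (5, 'Adult Title', 1, 2008);
INSERT INTO movie_info (movie_id, info_type_id, info) VALUES
    (1, 1, '95'),  (1, 3, 'Drama'),  (1, 8, 'USA'),
    (2, 1, '12'),  (2, 3, 'Drama'),  (2, 8, 'USA'),
    (3, 1, '120'), (3, 3, 'Drama'),  (3, 8, 'USA'),
    (4, 1, '100'), (4, 3, 'Comedy'), (4, 8, 'USA'),
    (5, 1, '90'),  (5, 3, 'Adult'),  (5, 8, 'USA');
INSERT INTO complete_cast (movie_id, subject_id, status_id) VALUES
    (1, 2, 3), (2, 2, 4), (3, 2, 3), (5, 2, 3);

-- Index the foreign keys so each lookup by movie_id is a search, not a scan.
CREATE INDEX idx_movie_info_movie ON movie_info (movie_id, info_type_id);
CREATE INDEX idx_complete_cast_movie ON complete_cast (movie_id);

-- Build the subset table in one statement.
CREATE TABLE movie AS
SELECT t.*
FROM title t
WHERE t.kind_id = 1
  AND t.production_year BETWEEN 2000 AND 2015
  AND EXISTS (SELECT 1 FROM movie_info mi
              WHERE mi.movie_id = t.id AND mi.info_type_id = 1
                AND CAST(mi.info AS INTEGER) > 40)
  AND EXISTS (SELECT 1 FROM movie_info mi
              WHERE mi.movie_id = t.id AND mi.info_type_id = 8
                AND mi.info = 'USA')
  AND NOT EXISTS (SELECT 1 FROM movie_info mi
                  WHERE mi.movie_id = t.id AND mi.info_type_id = 3
                    AND mi.info = 'Adult')
  AND EXISTS (SELECT 1 FROM complete_cast cc
              WHERE cc.movie_id = t.id AND cc.subject_id = 2
                AND cc.status_id IN (3, 4));
""")

kept = [row[0] for row in conn.execute("SELECT title FROM movie ORDER BY id")]
print(kept)
```

EXISTS keeps one row per film even when movie_info holds several runtimes or countries for it, which a plain inner join would multiply. On the real 9 GB file the CREATE INDEX statements will take a while, but they only need to run once.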

Data Warehouse Design/Modeling (based on Figure in Data Mining textbook)

I found a schema in Google Images (see below) that illustrates a problem I am having in my data warehouse design:
My design is different, but this is the simplest figure I could find to convey my question. Given the figure, how could the schema accommodate the following scenario: a product has a unique number assigned to it by the SalesOrg (salesOrg_product_number). For example, a salesOrg sells food items and assigns all food items of the same kind the same unique salesOrg_product_number. A different salesOrg would have a different salesOrg_product_number for that type of product.
I'm inclined to place the salesOrg_product_number attribute in the Product dimension table, but part of me thinks it should be in the salesOrg dimension table instead. Which one of these is the correct way, in a data warehouse (not relational db) design, to maintain the star schema?
In a perfect world, the primary key of a dimension table should be just a surrogate key, without any meaning for the business. Table IDs should be invisible to the end users, but business codes should of course be available.
A possible solution would be to have a product table with a structure like:
Product_id
Product_desc
Product_SO1_number
Product_SO2_number
...
Of course, this requires showing the correct field to the correct Sales Organization. Depending on your reporting tool, this can be more or less difficult. For example, if you write your queries manually, you just need to put the right column in your select.
Another possibility would be to have a product/sales_org table, i.e. a table which combines the Product and the Sales_Org tables:
Product_Sales_Org_id
Product_id
Sales_Org_id
Product_SO_number
...
This table will be a child of the two dimension tables, and the fact table will have a Product_Sales_Org_id column. Depending on the Product and the Sales Organization, Product_SO_number will return the correct number per SO.
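As a rough sketch of this second option, the snippet below builds the three tables plus a hypothetical fact_sales table with Python's sqlite3 module. The fact table, its quantity column, the UNIQUE constraint and all sample values are assumptions added for illustration; the other column names follow the lists above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    product_desc TEXT
);
CREATE TABLE sales_org (
    sales_org_id   INTEGER PRIMARY KEY,
    sales_org_desc TEXT
);
-- Relationship table: one row per (product, sales org) pair.
CREATE TABLE product_sales_org (
    product_sales_org_id INTEGER PRIMARY KEY,
    product_id           INTEGER NOT NULL REFERENCES product(product_id),
    sales_org_id         INTEGER NOT NULL REFERENCES sales_org(sales_org_id),
    product_so_number    TEXT NOT NULL,
    UNIQUE (product_id, sales_org_id)
);
-- Hypothetical fact table pointing at the relationship table.
CREATE TABLE fact_sales (
    product_sales_org_id INTEGER NOT NULL
        REFERENCES product_sales_org(product_sales_org_id),
    quantity INTEGER NOT NULL
);
INSERT INTO product VALUES (1, 'Apples');
INSERT INTO sales_org VALUES (1, 'SO North'), (2, 'SO South');
INSERT INTO product_sales_org VALUES
    (1, 1, 1, 'N-0042'),
    (2, 1, 2, 'S-9001');
INSERT INTO fact_sales VALUES (1, 5), (2, 7), (1, 3);
""")

# Each sales org sees its own number for the same product.
rows = conn.execute("""
    SELECT so.sales_org_desc, pso.product_so_number, SUM(f.quantity)
    FROM fact_sales f
    JOIN product_sales_org pso
      ON pso.product_sales_org_id = f.product_sales_org_id
    JOIN sales_org so ON so.sales_org_id = pso.sales_org_id
    GROUP BY so.sales_org_id, so.sales_org_desc, pso.product_so_number
    ORDER BY so.sales_org_id
""").fetchall()
print(rows)
```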
If you want to keep this in a star schema structure, you can put Product, Sales_Org and Product_Sales_Org together in a single table like:
Product_Sales_Org_id
Product_id
Sales_Org_id
Product_desc
Sales_Org_desc
Product_SO_number
...
Honestly, I would go for the second solution: keep the Product and Sales_Org tables separate, because they are two different business entities, and implement the relationship table in the middle.
I hope this helps.

Product Table Linking Different Types

I have a problem: I am designing a database which will store different products, and each product may have different details.
As an example, it will need to store books with multiple authors and software with different types of descriptions.
This is my current design:
Product_table
|ID|TYPE|COMPANY|
|1|1|1|
attr_table
|ID|NAME|
|1|ISBN10|
|2|ISBN13|
|3|Title|
|4|Author|
details_table
|ID|attr_id|value|
|1|3|Book of adventures|
Connector_table
|id|pro_id|detail_id|
|1|1|1|
So the product table would only store the main product id, the company it belongs to and the type of product it is.
Then I would have the attribute table which lists each attribute a product could have, this will make it easier to add new types of products.
The details table will then hold all the values, such as the different authors, titles, ISBN10s, etc.
And then the connector table would connect the product table and the details table.
My main worry is that the details table will get very large and will be storing lots of different data types.
What I would like is to split up all of the different types into separate tables, such as an ISBN table and an author table.
If this is the case, how could I link these tables up to the attr_table?
Any help would be greatly appreciated.
Don't bother. You do not say what database you are using, but any reasonable database will be able to handle the details table. Databases are designed to handle big tables efficiently.
If it is really big, you might want to consider partitioning the table by some sort of theme.
Otherwise, just be sure that you have an index on the id in the table and probably on the attr_id as well. The structure should work fine.
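As a sketch of what that could look like, here is the design from the question with the suggested index on attr_id, run through Python's sqlite3 module (a stand-in, since the question does not name a database). The index names, the extra index on Connector_table.pro_id and the author rows are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product_table (ID INTEGER PRIMARY KEY, TYPE INTEGER, COMPANY INTEGER);
CREATE TABLE attr_table (ID INTEGER PRIMARY KEY, NAME TEXT NOT NULL UNIQUE);
CREATE TABLE details_table (
    ID      INTEGER PRIMARY KEY,   -- the primary key is already indexed
    attr_id INTEGER NOT NULL REFERENCES attr_table(ID),
    value   TEXT
);
CREATE TABLE Connector_table (
    id        INTEGER PRIMARY KEY,
    pro_id    INTEGER NOT NULL REFERENCES Product_table(ID),
    detail_id INTEGER NOT NULL REFERENCES details_table(ID)
);

-- The index suggested in the answer, plus a hypothetical one for
-- looking up all details of a given product.
CREATE INDEX idx_details_attr ON details_table (attr_id);
CREATE INDEX idx_connector_pro ON Connector_table (pro_id);

INSERT INTO Product_table VALUES (1, 1, 1);
INSERT INTO attr_table VALUES (1, 'ISBN10'), (2, 'ISBN13'), (3, 'Title'), (4, 'Author');
INSERT INTO details_table VALUES
    (1, 3, 'Book of adventures'),
    (2, 4, 'A. Writer'),
    (3, 4, 'B. Author');
INSERT INTO Connector_table VALUES (1, 1, 1), (2, 1, 2), (3, 1, 3);
""")

# All details of product 1, including both authors.
details = conn.execute("""
    SELECT a.NAME, d.value
    FROM Connector_table c
    JOIN details_table d ON d.ID = c.detail_id
    JOIN attr_table a ON a.ID = d.attr_id
    WHERE c.pro_id = 1
    ORDER BY d.ID
""").fetchall()
print(details)

# Ask SQLite how it would look up details by attribute.
for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT value FROM details_table WHERE attr_id = 4"):
    print(row)
```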

Doubt regarding a database design

I have a question about a database design. Suppose a finance/stock application
in which the user can create orders.
Those orders may contain company products or third-party products.
A typical product table:
PRIMARY KEY INT productId
KEY INT productcatId
KEY INT supplierId
VARCHAR(20) name
TEXT description
...
But I also need some more details for the company products, like:
INT instock
DATETIME laststockupdate
...
The question is, how should I store the data?
I'm thinking of 2 options:
1 -
Have both company and third-party products in a single table;
some columns will not be used by third-party products,
and the company products are identified by a supplier id
2 -
Have the company products and third-party products in separate tables
3 - [new, thanks RibaldEddie]
Have a single product table;
company products have additional info in a separate table
Thanks in advance!
You didn't mention anything about needing to store separate bits of Vendor information, just that a type of product has extra information. So, you could have one products table and an InHouseProductDetails table that has a productId foreign key back to the products table that stores the company specific information. Then when you run your queries you can join the products table to the details table.
The benefit is that you don't have to have NULLable columns in the products table, so your data is safer from corruption and you don't have to store the products themselves in two separate tables.
Oooo go with 3! 3 is the best!
To be honest, I think the choice between #1 and #2 is completely dependent upon some other factors (I can only think of 2 at the moment):
How much data is expected (affecting speed of queries)
Is scalability going to be a concern anywhere in the near future (I'd guess within 5 years)
If you did go with a single table for all inventory, then later decided to split them, you can. You suggested a supplier identifier of some sort. List suppliers in a table (your company included) with keys to your inventory. Then it really won't matter.
As far as UNION goes, it's been a while since I've written raw SQL, so I'm not sure if UNION is the correct syntax. However, I do know that you can pull data from multiple tables. Actually just found this: Retrieving Data from Multiple Tables with Sql Joins
I agree with RibaldEddie. Just one thing to add: put a unique constraint on that foreign key in your InHouseProductDetails table. That'll enforce a one-to-one relationship between the two tables, so you don't accidentally end up with two InHouseProductDetails records for one product (maybe from some data load gone awry or something).
Constraints are like defensive driving; they help prevent the unexpected...
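A small sketch of the products / InHouseProductDetails split with that unique constraint, using Python's sqlite3 module as a stand-in for whatever database the application uses (an assumption, as are the sample rows). The column names follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked to
conn.executescript("""
CREATE TABLE products (
    productId    INTEGER PRIMARY KEY,
    productcatId INTEGER,
    supplierId   INTEGER,
    name         VARCHAR(20) NOT NULL,
    description  TEXT
);
-- Extra columns for company products only. UNIQUE on the foreign key
-- makes the relationship one-to-one.
CREATE TABLE InHouseProductDetails (
    productId       INTEGER NOT NULL UNIQUE REFERENCES products(productId),
    instock         INTEGER NOT NULL,
    laststockupdate DATETIME
);
INSERT INTO products VALUES
    (1, 1, 1, 'Own widget', NULL),
    (2, 1, 7, 'Third-party widget', NULL);
INSERT INTO InHouseProductDetails VALUES (1, 25, '2012-01-01 00:00:00');
""")

# LEFT JOIN keeps third-party products, with NULL for the in-house columns.
rows = conn.execute("""
    SELECT p.name, d.instock
    FROM products p
    LEFT JOIN InHouseProductDetails d ON d.productId = p.productId
    ORDER BY p.productId
""").fetchall()
print(rows)

# A second details row for the same product violates the unique constraint.
try:
    conn.execute("INSERT INTO InHouseProductDetails VALUES (1, 99, NULL)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
```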
I would advise using point #1. What happens when another supplier comes along? It's also easier to extend one product table/product class.
Take the testing of your application into account as well. Having all the data in one table raises the possible requirement of testing both the 3rd party and company elements of your app for any change to either.
If you're happy that your unit tests would cover this, it's not so much of a worry. If you're relying on a human tester, then it becomes more of an issue when sizing the impact of changes.
Personally I'd go for the one products table with common details and separate tables for the 3rd party & Company specifics.
Use one table for products, with a foreign key to the Vendor table; include your own company in the Vendor table.
The Stock table can then be used to store information about stock levels for any product, not just yours.
Note that you need the Stock table anyway; this just makes the DB model more company-agnostic, so if you ever need to store stock level information about third-party products, no DB change is required.