I have a transaction table and an inventory table that I would like to JOIN together. The tables need to JOIN on three key columns (a composite key).
My question is: should I create a unique key (a concatenation of the three fields) and create an INDEX on that key, or should I just create a non-clustered INDEX on all three fields?
I'm currently using SQL Server 2014
I'm guessing the Transaction table is the bigger one and the Inventory table is the smaller. A lot depends on what proportion of the data you expect the join to return: if it's most of it, a table scan will probably occur, so an index won't help much. If you're going to retrieve a small subset of the data, then create an index on the 3 columns on both tables and create a foreign key from Transaction to Inventory on the 3 columns (SQL Server needs an index as well as the FK).
Pick the most selective (granular) column as the first one in your index, as this will encourage SQL Server's optimizer to use the index. A sketch of what this could look like is below.
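For illustration only, a minimal T-SQL sketch with made-up table and column names (TransactionDetail, Inventory, keyed on ItemCode, WarehouseCode, BatchCode); it assumes the three columns form the Inventory table's primary key:
-- Assumes the three columns are Inventory's primary key (the FK below needs a PK or unique key to reference).
ALTER TABLE dbo.Inventory
    ADD CONSTRAINT PK_Inventory
    PRIMARY KEY (ItemCode, WarehouseCode, BatchCode);

-- Non-clustered index on the joining columns of the (larger) transaction table.
CREATE NONCLUSTERED INDEX IX_TransactionDetail_InventoryKey
    ON dbo.TransactionDetail (ItemCode, WarehouseCode, BatchCode);

-- Foreign key from the transaction table to the inventory table on the 3 columns.
ALTER TABLE dbo.TransactionDetail
    ADD CONSTRAINT FK_TransactionDetail_Inventory
    FOREIGN KEY (ItemCode, WarehouseCode, BatchCode)
    REFERENCES dbo.Inventory (ItemCode, WarehouseCode, BatchCode);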
Related
I have a unique index on (id, name) columns. I have a date column that I want to add to the index since I want the uniqueness to be based on (id, name, date) columns. The date column contains a lot of null values. How would it affect the index?
If you are using SQL Server: NULL values are stored in the index just like any other value, so the many NULLs mainly take up space. SQL Server also has filtered indexes; if a column contains many NULL values that you rarely query, you can create an additional filtered index with a WHERE <column> IS NOT NULL condition so those rows are left out of that index. Also note that a unique index treats NULLs as equal for uniqueness, so two rows with the same (id, name) and a NULL date would count as duplicates.
For more information about filtered indexes, visit this link
Bottom line: you can add the column to the index; in most databases the NULL values by themselves don't hurt index performance.
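For illustration (hypothetical table and column names), a filtered index that leaves the NULL dates out of the index could look like this:
-- Filtered unique index: the index, and its uniqueness check, cover only rows that have a date.
CREATE UNIQUE NONCLUSTERED INDEX UX_Items_Id_Name_Date
    ON dbo.Items (id, name, [date])
    WHERE [date] IS NOT NULL;
Keep in mind that a filtered unique index enforces uniqueness only over the rows that match the filter.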
I have a data set with three dimensions that I would like to store for use with a website:
A list of companies (about 1000)
Information about the company (about 15 things)
Time (monthly)
Essentially, I want to track this information over time and keep it up to date.
When I start, the data will be 1000x15x1, after a year it will be 1000x15x12, and after 10 years it will be 1000x15x120.
The main queries I would make are:
Get all information for one company over all times
Get all information for one particular time
What would be a good database configuration for doing this? I'm open to either SQL or noSQL solutions.
In case it matters, the website is on Google App Engine.
From the relational database schema design perspective:
If the goal is analytics / ad-hoc querying / OLAP only, then you can use a star schema, which is well suited for this type of analytics. But beware: OLAP databases are de-normalized and not suitable for operational transaction storage (OLTP), so keep that in mind if you are planning to do both on this database.
The beauty of the Star schema:
The fact tables are usually all numeric, which keeps the rows very narrow even when there are many records. A narrow table is very fast to read (less I/O).
All joins from the fact table to dimension tables are based on foreign keys (single column, numeric, indexable foreign keys)
All dimension tables have a surrogate key, which is a single-column primary key. A single-column primary key is easier to JOIN on than a multi-column primary key and also easier to index.
There are no NULLs in the fact table's foreign keys. This makes JOIN operations straightforward, i.e. you always JOIN the fact table to all of its dimension tables. If you need a NULL case, you add it as a special row in the dimension table. For example: if a company is not listed on the stock market and one of the things you track is stock price, then you enter 0 or NULL into the stock-price fact table (depending on how you want SUM(), AVG(), etc. to behave later), add a special row such as 'Private company' to your StockSymbols dimension table, and use that row's key as the foreign key in the fact table.
Almost all filtering is done through the dimension tables, which are much, much smaller than the fact tables. This is also why you need a Date dimension to be able to do date-based queries.
If you can stay in a pure star schema, then all your JOINs are a single hop (i.e. no join between two tables through another table).
All of this makes JOIN operations very fast, simple and straightforward. That's why the star schema is at the heart of data-warehousing designs.
https://en.wikipedia.org/wiki/Star_schema
https://en.wikipedia.org/wiki/Data_warehouse
One level up from this is OLAP (for example SSAS, SQL Server Analysis Services), which pre-processes the data to make it fast to query, but it involves more learning than a pure star schema and it's overkill in your case.
For your example
In a star schema:
Companies will be a dimension table
You will need a Month dimension table. It's a simplified version of a Date dimension, just for month info. An example of a Date dimension is here:
https://www.codeproject.com/Articles/647950/Create-and-Populate-Date-Dimension-for-Data-Wareho
The information about the company (the 15 things you mention) will be fact tables. The facts should be numeric (because ideally all non-numeric values are stored in dimension tables). This means moving the non-numeric part of a fact into a dimension table. For example: if you are keeping revenue and would like to keep the currency type too, then you will need a Currency dimension, and in the fact table you save only the amount plus a foreign key to the Currency dimension table.
If you have any non-numeric facts, you store the distinct list of values in a dimension table and add a foreign key to that dimension table inside your fact table (a fact table that holds only such foreign keys and no numeric measures is called a factless fact table). The only exception is when the cardinality of the dimension and the fact table is very similar; then you can store the non-numeric value inside the fact table directly, as there is no benefit in having a dimension table (in fact, there is a disadvantage).
Also, facts can be grouped by their granularity. For example you could have a company_monthly_summary fact table and keep more than one fact (measure) in that table, all joining to the Company dimension and the Month dimension. How you group the fact tables is up to you, but facts of different granularity should not be grouped together, as that leads to sparse fact tables that are harder to query.
You will use foreign keys in Fact tables to join to your Dimension tables
Add indexes on your dimension tables' most-used columns
Add a numeric surrogate key to each dimension. It is usually an auto-increment number, but that's up to you. One exception people prefer for the surrogate key of the Date dimension is the format YYYYMMDD (as an integer). This makes the WHERE clause easier: instead of filtering on the Date column (a DATETIME value), which requires a lookup to find the surrogate keys, you just provide the surrogate keys directly because you know the format. Depending on your business domain, you may have other similarly useful surrogate-key patterns to consider. Just know that if the business domain changes, you will have to update all fact records; a simple auto-increment surrogate key does not have that problem. In your case, the surrogate key for the month can be the actual month number (1 for January), or a YYYYMM integer if each row represents a specific calendar month.
That being said, roughly 1 million rows over 5 years is easy to query even without a star-schema design (with proper indexing and database maintenance). But if this is part of a larger analytics system, then go with the star schema; a minimal sketch is below.
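To make the shape concrete, here is a minimal sketch using assumed names (DimCompany, DimMonth, FactCompanyMonthly) and two made-up measures (Revenue, Headcount) in generic SQL; adapt the names, types and syntax to your 15 attributes and your database:
-- Dimension: companies (surrogate key + descriptive attributes).
CREATE TABLE DimCompany (
    CompanyKey  INT          NOT NULL PRIMARY KEY,
    CompanyName VARCHAR(255) NOT NULL
);

-- Dimension: months (surrogate key in YYYYMM form, per the note above).
CREATE TABLE DimMonth (
    MonthKey     INT NOT NULL PRIMARY KEY,  -- e.g. 202401
    CalendarYear INT NOT NULL,
    MonthOfYear  INT NOT NULL
);

-- Fact: one row per company per month, numeric measures only.
CREATE TABLE FactCompanyMonthly (
    CompanyKey INT NOT NULL REFERENCES DimCompany (CompanyKey),
    MonthKey   INT NOT NULL REFERENCES DimMonth (MonthKey),
    Revenue    DECIMAL(18, 2),
    Headcount  INT,
    PRIMARY KEY (CompanyKey, MonthKey)
);

-- "All information for one company over all times":
SELECT m.CalendarYear, m.MonthOfYear, f.Revenue, f.Headcount
FROM FactCompanyMonthly f
JOIN DimCompany c ON c.CompanyKey = f.CompanyKey
JOIN DimMonth   m ON m.MonthKey   = f.MonthKey
WHERE c.CompanyName = 'Acme Corp'
ORDER BY m.MonthKey;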
The simplest way.
Create one table: company name + the info you need to store + a column for the year-month.
Ex:
CREATE TABLE tablename (
  id INT(11) NOT NULL AUTO_INCREMENT,
  companyname VARCHAR(255),
  info1 INT(11) NOT NULL,
  info2 DATETIME,
  info3 VARCHAR(255),
  info4 BOOL,
  yearmonth DATETIME,
  PRIMARY KEY (id)
);
# queries
SELECT * FROM tablename WHERE companyname = 'nameofthecompany';
SELECT * FROM tablename WHERE yearmonth = 'year-month';  # can use BETWEEN here
I have a SQL Server database that imports data from an old VMS database.
The data has many many tables that need to be joined for reporting.
The common Id for all table Joins goes like this
D-100344-1
,D-100344-2
,D-100345-3
,D-100346-1
,N-100346000-1
,N-100344001-1
,N-100344001-2
,N-100345001-3
,N-100346000-1
About 1.2 million of these lines come in each day, across 827 tables.
Many times a line will come in with updated data, and I will insert it and remove the earlier line, as the table doesn't need duplicates.
To better facilitate joins between the tables, I looked at adding a non-clustered index on this ID.
It became 20% fragmented after one day's inserts (because of course it would).
What are my options here?
FYI: I use an incrementing TableID column as the clustered index on the table, so my inserts aren't so terrible, but that ID has no relation to the other tables for joining.
I created a non-clustered index on the "last_name" column in the table "Persons":
Select * From Persons
Where last_name = 'Hogg'
So why is the index incapable of returning all the columns simultaneously and instead does a RID lookup?
How does indexing work here?
The index only covers the column last_name, and only contains data about that column. You can conceptually think about the index that you've described as a series of pairs: (last_name,row), where row is a reference to a particular row in the actual table. The index stores the pairs sorted by last_name, but stores no additional information about the table.
Your query requests all of the columns of Persons. The index is used to locate the row or rows where last_name is "Hogg", but the database has to reference the table to retrieve the additional columns.
What you appear to want is a covering index for the columns of interest. The term "RID lookup" implies SQL Server. Perhaps the question What are Covering Indexes and Covered Queries in SQL Server? and the page it points to: Using Covering Indexes to Improve Query Performance will help.
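For example (using the Persons table from the question, with a few hypothetical extra columns), a covering index in SQL Server can use INCLUDE to carry those columns in the index leaf level so the query no longer needs the RID lookup:
-- Hypothetical column list; INCLUDE stores these columns in the index leaf pages.
CREATE NONCLUSTERED INDEX IX_Persons_LastName_Covering
    ON dbo.Persons (last_name)
    INCLUDE (first_name, birth_date, city);

-- This query is now answered from the index alone (no RID lookup),
-- because it references only last_name and the INCLUDEd columns.
SELECT last_name, first_name, birth_date, city
FROM dbo.Persons
WHERE last_name = 'Hogg';
Note that SELECT * still requires the lookup unless every column of the table is in the index.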
I'm drawing up plans for a few new features on my site, and one could be "solved" using a join table.
Example schema:
Person table
PK PersonId
Name
Age ...
PersonCheckin table
PK FK PersonId
PK FK CheckinId
Date ...
Checkin table
PK CheckinId
CheckedInto ...
A join would be run to get the check in data for a person (connected by the PersonCheckin table). Since every person could check in an unlimited number of times, the PersonCheckin table could become very large.
I'd imagine this would cause some performance issues. What are typical ways this is handled to keep performance high?
A join is considered the best-performing means of connecting related tables.
But it really depends on the query, because it might not need to be a JOIN: JOINing can inflate the record set on the parent table's side if more than one child record is related, which means you may need a GROUP BY or DISTINCT. EXISTS or IN is a better choice in such situations; see the sketch below.
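As an illustration, using the schema from the question, a query for "people who have checked in somewhere" written with EXISTS returns each person at most once without needing DISTINCT or GROUP BY:
-- Returns each person at most once, even if they have many check-ins.
SELECT p.PersonId, p.Name
FROM Person p
WHERE EXISTS (
    SELECT 1
    FROM PersonCheckin pc
    WHERE pc.PersonId = p.PersonId
);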
Indexes can help on the column(s) used in the JOIN criteria, on both sides of the relationship. In this example both sides are primary keys, which typically have the best index automatically created when the primary key is defined...
If you are going to execute this query very often and you want better performance, you can create a view in the database that contains the join query, as sketched below.
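A minimal sketch of such a view over the schema from the question (a plain view mainly saves you from rewriting the join each time; if you actually need the result materialized, SQL Server offers indexed views):
-- View that pre-writes the join between Person, PersonCheckin and Checkin.
CREATE VIEW PersonCheckins AS
SELECT p.PersonId, p.Name, pc.Date, c.CheckinId, c.CheckedInto
FROM Person p
JOIN PersonCheckin pc ON pc.PersonId = p.PersonId
JOIN Checkin c ON c.CheckinId = pc.CheckinId;

-- Usage: all check-in data for one person.
SELECT * FROM PersonCheckins WHERE PersonId = 42;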