I have a table
create table Objects (
ObjectID bigint not null primary key,
ObjectRef1 varchar(50) not null,
ObjectRef2 varchar(50) not null,
ObjectRef3 varchar(250) not null
)
All columns contain unique values. The table has approximately 100 million rows. Every column has a unique index and is queried frequently.
Which is faster: normalizing each of the varchar fields into separate tables, or keeping them as they are? If normalized, the table would hold only the ObjectID column plus IDs pointing to the normalized tables, and I would use inner joins to get the values of ObjectRefX.
Should I consider other databases like Hadoop for this amount of data?
Performance is one thing you cannot predict until the query actually runs, but I would suggest keeping the table as it is. Normalizing this data into separate tables would increase the dependencies, since you would be connecting the tables with foreign keys, and since all the columns are already unique there is no redundancy to be reduced. Place indexes on the columns and try to optimize the queries rather than the schema here.
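A minimal sketch of those indexes in T-SQL (the index names are placeholders; the question already implies these exist):

create unique nonclustered index IX_Objects_ObjectRef1 on Objects (ObjectRef1);
create unique nonclustered index IX_Objects_ObjectRef2 on Objects (ObjectRef2);
create unique nonclustered index IX_Objects_ObjectRef3 on Objects (ObjectRef3);

Each index lets a lookup on a single ObjectRefX column seek directly to the matching row instead of scanning 100 million rows.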
Any correction to the above answer is welcome.
Hope I could be of some help.
Thanks,
Ashutosh Arya
Related
I have a table with 100 columns, 80% of which are nvarchar(max); there is no way to change this data type because the data comes from a MySQL text column. The table contains almost 3 million (30 lakh) records, so selecting all the columns takes far too long to return the recordset. I wanted to convert it to a columnstore table, but columnstore does not support the nvarchar(max) data type, so now I am looking for a way to design this table that will make the queries fast.
Note: I have also tried non-clustered indexes on different columns, which did not make the queries any faster either.
Any help will be appreciated
Why not just use two tables? If your original table has a primary key, define a new table as:
create table t_text (
original_id int primary key,
value nvarchar(max),
foreign key (original_id) references original_table(original_id)
);
You would then join in this table when you want to use the column.
For inserting or updating the table, you can define a view that includes the text value. With a trigger on the view you can direct updates to the correct table.
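As a rough illustration of the view-plus-trigger idea (this is only a sketch; original_table, original_id, and other_col are assumed names, and you would extend the trigger to cover inserts as well):

create view v_original_with_text as
select o.original_id, o.other_col, t.value
from original_table o
left join t_text t on t.original_id = o.original_id;
go

create trigger trg_v_original_with_text_update
on v_original_with_text
instead of update
as
begin
    -- route the regular columns to the original table
    update o set other_col = i.other_col
    from original_table o
    join inserted i on i.original_id = o.original_id;

    -- route the big text column to the side table
    update t set value = i.value
    from t_text t
    join inserted i on i.original_id = t.original_id;
end;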
What you really want is vertical partitioning -- the ability to store columns in separate partitions. This is a method for implementing this manually.
Which is the best index and distribution design for relatively small fact tables (on average 30 million rows per table)? The structure of each table is similar to the following:
CREATE TABLE FactTable (
TimeDimensionID INT NOT NULL,
DimensionID1 VARCHAR (10) NOT NULL,
DimensionID2 VARCHAR (10) NOT NULL,
DimensionID3 VARCHAR (10) NOT NULL,
DimensionID4 VARCHAR (10) NOT NULL,
Measure1 INT,
Measure2 FLOAT,
Measure3 DECIMAL (10,2),
Measure4 DECIMAL (10,2)
)
The combination of TimeDimensionID, DimensionID1, DimensionID2, DimensionID3, and DimensionID4 is unique in the fact table. Currently we have a unique clustered primary key on those 5 fields.
What is the best indexing and distribution design for migrating those tables to Azure SQL Data Warehouse? We are thinking about using CLUSTERED INDEX (DimensionID1, DimensionID2, DimensionID3, DimensionID4) for the index and hash distribution on the TimeDimensionID field.
Must the CLUSTERED INDEX include the TimeDimensionID field even though the hash distribution is on that field?
Is this design correct, or should we use a COLUMNSTORE INDEX even though the tables actually have fewer than 100 million rows?
Should we consider using replicated tables for the fact tables?
Some recommendations (a rough DDL sketch follows the list):
If possible, please move your DimensionIDs from varchar to int/bigint. You'll get better performance, less storage, and lower costs.
Forget about clustered indexes for now.
Create your table hash-distributed, but not on the date column; that will hotspot your data.
Create your table as a CLUSTERED COLUMNSTORE INDEX
Don't replicate your FACT table, but replicate your DIMENSIONS instead.
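Putting those recommendations together, the DDL might look roughly like this (the choice of DimensionID1 as the distribution column, the integer key types, and the DimensionName column are assumptions to adapt to your own data):

CREATE TABLE FactTable (
    TimeDimensionID INT NOT NULL,
    DimensionID1 INT NOT NULL,
    DimensionID2 INT NOT NULL,
    DimensionID3 INT NOT NULL,
    DimensionID4 INT NOT NULL,
    Measure1 INT,
    Measure2 FLOAT,
    Measure3 DECIMAL (10,2),
    Measure4 DECIMAL (10,2)
)
WITH (
    DISTRIBUTION = HASH (DimensionID1),
    CLUSTERED COLUMNSTORE INDEX
);

CREATE TABLE DimTable1 (
    DimensionID1 INT NOT NULL,
    DimensionName VARCHAR (50) NOT NULL
)
WITH (
    DISTRIBUTION = REPLICATE,
    CLUSTERED INDEX (DimensionID1)
);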
I have a basic reverse lookup table in which the ids are already sorted in ascending numerical order:
id INT NOT NULL,
value INT NOT NULL
The ids are not unique; each id has from 5 to 25,000 associated values. Each id is independent, i.e., no relationships between the ids.
The table is static: read-only, no inserts or updates ever. The table has 100-200 million records. The database itself will be around 7-12 GB. SQLite.
I will do frequent lookups in this table and want the fastest response time for each query. Lookups are one-direction only, unordered, and always of the form:
SELECT value WHERE id IN (x,y,z)
What advantages does the pre-sorted order give me in terms of database efficiency? What should I do differently than I would with typical unordered tables? How do I tell sql that it's an ordered list?
What about indices: is it necessary or even helpful to create an index on id?
[Updated for clustered comment thanks to Gordon Linoff]. As far as I can tell, sqlite doesn't support clustered indices directly. The wiki says: "Are [clustered indices] supported? No, but if you use INTEGER PRIMARY KEY it acts as a clustered index." In my situation, the column id is not unique...
Assuming that space is not an issue, you should create an index on (id, value). This should be sufficient for your purposes.
However, if the table is static, then I would recommend that you create a clustered index when you create the table. The index would have the same keys, (id, value).
If the table happens to be sorted, the database does not know about this, so you'd still need an index.
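A minimal sketch of that index, assuming the table is named lookup (the table and index names are placeholders):

CREATE INDEX idx_lookup_id_value ON lookup (id, value);

Because the index contains both columns, SQLite can answer the query from the index alone (a covering index) without touching the table rows.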
It is a better idea to use a WITHOUT ROWID table (what other DBs call a clustered index):
CREATE TABLE MyLittleLookupTable (
id INTEGER,
value INTEGER,
PRIMARY KEY (id, value)
) WITHOUT ROWID;
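Usage then looks the same as any other table, and EXPLAIN QUERY PLAN should confirm that the lookup uses the primary key (the id values here are placeholders):

SELECT value FROM MyLittleLookupTable WHERE id IN (1, 2, 3);

EXPLAIN QUERY PLAN
SELECT value FROM MyLittleLookupTable WHERE id IN (1, 2, 3);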
I'm designing a database and have come across a performance related problem. Please note that we are still in the phase of designing not implementing so I can't test anything yet.
I have the following table structure
Events
----------------------
EventID INT PK
SourceID INT FK
TypeID INT FK
Date DATETIME
The table is expected to contain tens of millions of entries. SourceID and TypeID both reference tables with at most hundreds of entries.
What I want is to have the tuple (SourceID, TypeID, Date) unique across the table. The question is: can I somehow specify which of the three columns will be used first to determine uniqueness when I insert a new item into the table?
Because if the index compared Date first, insertion would be much faster than if it used, for example, TypeID first, right? Or is this the wrong question altogether, and should I trust SQL Server to optimize this itself?
Any feedback is appreciated.
The underlying index created to support the unique constraint will have the same column order as defined by the constraint, so you control which column is compared first simply by the order in which you list the columns when declaring the constraint.
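For example, a sketch that puts Date first (the constraint name is just a placeholder):

ALTER TABLE Events
    ADD CONSTRAINT UQ_Events_Date_Source_Type
    UNIQUE (Date, SourceID, TypeID);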
I have a table and am debating between 2 different ways to store information. It has a structure like so
int id
int FK_id
varchar(50) info1
varchar(50) info2
varchar(50) info3
int forTable or char(3) forTable
The FK_id can be a foreign key to one of 6 tables so I need another field to determine which table it's for.
I see two solutions:
An integer that is a FK to a settings table which has its actual value.
A char(3) field with an abbreviated version of the table name.
I am wondering if anyone knows whether one will be more beneficial speed-wise than the other, or whether there will be any major problems using the char(3).
Note: I will be creating an indexed view for each of the 6 different values of this field. This table will contain ~30k rows and will need to be joined with much larger tables.
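For reference, one of those indexed views could be sketched like this; the table name dbo.ChildInfo and the value 'ABC' are made up for illustration:

CREATE VIEW dbo.vChildInfo_ABC
WITH SCHEMABINDING
AS
SELECT id, FK_id, info1, info2, info3
FROM dbo.ChildInfo
WHERE forTable = 'ABC';
GO

CREATE UNIQUE CLUSTERED INDEX IX_vChildInfo_ABC ON dbo.vChildInfo_ABC (id);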
In this case, it probably doesn't matter except for the collation overhead (A vs a vs ä vs à).
I'd use char(3), say for a currency code like CHF, GBP, etc. But if my natural key was "Swiss Franc", "British Pound", etc., I'd take the numeric.
3 bytes + collation vs 4 bytes numeric? You'd need a zillion rows or be running a medium-sized country before it mattered...
Have you considered using a tinyint? It takes only one byte to store its value; tinyint has a range of 0 to 255.
Is the reason you need a single table that you want to ensure that, when the six parent tables reference a given child row, it is guaranteed to be the same instance? This is the classic "multi-parent" problem. An example of where you might run into this is with addresses or phone numbers used by multiple person/contact tables.
I can think of a couple of options:
Choice 1: A link table for each parent table. This would be the Hoyle architecture. So, something like:
Create Table MyTable(
id int not null Primary Key Clustered
, info1 varchar(50) null
, info2 varchar(50) null
, info3 varchar(50) null
)
Create Table LinkTable1(
MyTableId int not null
, ParentTable1Id int not null
, Constraint PK_LinkTable1 Primary Key Clustered( MyTableId, ParentTable1Id )
, Constraint FK_LinkTable1_MyTable
Foreign Key ( MyTableId )
References MyTable ( Id )
, Constraint FK_LinkTable1_ParentTable1
Foreign Key ( ParentTable1Id )
References ParentTable1 ( Id )
)
...
Create Table LinkTable2...LinkTable3
Choice 2. If you knew that you would never have more than say six tables and were willing to accept some denormalization and a fugly design, you could add six foreign keys to your main table. That avoids the problem of populating a bunch of link tables and ensures proper referential integrity. However, that design can quickly get out of hand if the number of parents grows.
If you are content with your existing design, then with respect to the field size, I would use the full table name. Frankly, the difference in performance between a char(3) and a varchar(50) or even varchar(128) will be negligible for the amount of data you are likely to put in the table. If you really thought you were going to have millions of rows, then I would strongly consider the option of linking tables.
If you wanted to stay with your design and wanted the maximum performance, then I would use a tinyint with a foreign key to a table that contained the list of the six tables with a tinyint primary key. That prevents the number from being "magic" and ensures that you narrow down the list of parent tables. Of course, it still does not prevent orphaned records. In this design, you have to use triggers to do that.
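A small sketch of that tinyint-plus-lookup-table variant (table and column names are placeholders):

Create Table ParentTableType(
      ParentTableTypeId tinyint not null Primary Key Clustered
    , TableName varchar(128) not null Unique
)

Create Table MyTable(
      id int not null Primary Key Clustered
    , FK_id int not null
    , ParentTableTypeId tinyint not null
    , info1 varchar(50) null
    , info2 varchar(50) null
    , info3 varchar(50) null
    , Constraint FK_MyTable_ParentTableType
        Foreign Key ( ParentTableTypeId )
        References ParentTableType ( ParentTableTypeId )
)

As noted above, this narrows the type column down to the six known tables but still does not stop orphaned FK_id values; triggers would be needed for that.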
Because your FK cannot be enforced by a database constraint (since it varies depending on the type), I would strongly consider re-evaluating your design to use link tables, where each link table includes two FK columns: one to the PK of the entity and one to the PK of one of the 6 tables.
While this might seem to be overkill, it makes a lot of things simpler and adding new link tables is no more complex than accommodating new FK-types. In addition, it is more easily expandable to the case where an entity needs more than a 1-1 relationship to a single table, or needs multiple 1-1 relationships to the 6 other entities.
In a varying-FK scenario, you can lose database consistency, you can join to the wrong entity by neglecting to filter on type code, etc.
I should add that another huge benefit of link tables is that you can link to tables which have keys of varying data types (ints, natural keys, etc.) without having to add surrogate keys or store the key in a varchar or similar workarounds that are prone to problems.
I think a small integer (tinyint) is called for here. An "abbreviated version" looks too much like a magic number.
I also think performance wise the integer should beat the char(3).
First off, a 50 character Id that is not globally unique sounds a little scary. Do the IDs have some meaning? If not, you can easily get a GUID in less space. Personally, I am a big fan of making things human readable whenever possible. I would, and have, put the full name in graphs until I needed to do otherwise. My preference would be to have linking tables for each possible related table though.
Unless you are talking about really large scale, you are much better off decreasing the size of the IDs and taking a few more characters for the name of the table. For really large scale, I would decrease the size of the IDs and use an integer.
Jacob