I have a table on an Azure SQL database that is growing much faster than expected.
The table has the following columns:
id PK INT
fk_id1 FK INT
fk_id2 FK INT
fk_id3 FK INT
vc_field1 VARCHAR(100)
vc_field2 VARCHAR(15)
bit_field1 BIT
dt_field1 DATETIME
The table has just over 20,000 rows and the size of the table is about 1 GB, with two foreign key indexes in the 300 MB range. Based on the size of tables that are similar in field size and row count, I would expect a table size of just a couple of MB (less than 50 MB). Any thoughts or help in this regard would be much appreciated. The nature of this table is that I only ever insert into it; there is no need for updates or deletes.
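For scale, a rough per-row estimate here is 4 x 4 bytes for the int columns, 1 byte for the bit, 8 bytes for the datetime, and at most about 120 bytes for the two varchars, so 20,000 rows should indeed come to only a few MB. The following standard diagnostics (dbo.MyTable is a placeholder for the real table name) show how the 1 GB is actually allocated; very low page density on an index is a common cause of this kind of bloat:

-- How much space the table and its indexes consume
EXEC sp_spaceused 'dbo.MyTable';

-- Page count and page density per index; a very low avg_page_space_used_in_percent
-- (lots of nearly empty pages) usually explains an oversized table or index
SELECT i.name AS index_name,
       ps.index_type_desc,
       ps.page_count,
       ps.avg_page_space_used_in_percent,
       ps.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.MyTable'), NULL, NULL, 'DETAILED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id = ps.index_id;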
Related
I have a table with 100 columns, where 80% of the columns are nvarchar(max); there is no way to change this data type because the data comes from a MySQL text-type column. The table contains almost 3 million records, so selecting all the columns takes too much time to return the recordset. Because of this, I wanted to convert the table to a columnstore table, but columnstore does not support the nvarchar(max) data type, so now I am looking for a way to design this table so that queries are fast.
Note: I have also tried nonclustered indexes on different columns, which did not make the queries any faster either.
Any help will be appreciated.
Why not just use two tables? If your original table has a primary key, define a new table as:
create table t_text (
    original_id int primary key,
    value nvarchar(max),
    -- original_id mirrors the primary key of the matching row in original_table
    foreign key (original_id) references original_table(original_id)
);
You would then join in this table when you want to use the column.
For inserting or updating the table, you can define a view that includes the text value. With a trigger on the view you can direct updates to the correct table.
What you really want is vertical partitioning -- the ability to store columns in separate partitions. This is a method for implementing this manually.
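A minimal sketch of that view-plus-trigger approach follows; the names original_table, original_id, and some_column are assumptions standing in for the real schema, not taken from the question:

-- View that stitches the base table and the text table back together
CREATE VIEW v_original_with_text
AS
SELECT o.original_id,
       o.some_column,          -- placeholder for the original table's other columns
       t.value
FROM original_table AS o
LEFT JOIN t_text AS t
  ON t.original_id = o.original_id;
GO

-- INSTEAD OF trigger that routes inserts through the view to both base tables
CREATE TRIGGER trg_v_original_with_text_ins
ON v_original_with_text
INSTEAD OF INSERT
AS
BEGIN
    INSERT INTO original_table (original_id, some_column)
    SELECT original_id, some_column
    FROM inserted;

    INSERT INTO t_text (original_id, value)
    SELECT original_id, value
    FROM inserted
    WHERE value IS NOT NULL;
END;

A similar INSTEAD OF UPDATE trigger would direct updates to whichever table owns the changed column.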
What is the best index and distribution design for relatively small fact tables (on average 30 million rows per table)? The structure of each table is similar to the following:
CREATE TABLE FactTable (
    TimeDimensionID INT NOT NULL,
    DimensionID1 VARCHAR(10) NOT NULL,
    DimensionID2 VARCHAR(10) NOT NULL,
    DimensionID3 VARCHAR(10) NOT NULL,
    DimensionID4 VARCHAR(10) NOT NULL,
    Measure1 INT,
    Measure2 FLOAT,
    Measure3 DECIMAL(10,2),
    Measure4 DECIMAL(10,2)
)
The combination of TimeDimensionID, DimensionID1, DimensionID2, DimensionID3, and DimensionID4 is unique in the fact table. Currently we have a unique clustered primary key on those 5 fields.
What is the best indexing and distribution strategy to migrate those tables to Azure SQL Data Warehouse? We are thinking about using a CLUSTERED INDEX (DimensionID1, DimensionID2, DimensionID3, DimensionID4) for the index and hash distribution on the TimeDimensionID field.
Must the CLUSTERED INDEX include the TimeDimensionID field even though the hash distribution is on that field?
Is this design correct, or should we use a COLUMNSTORE INDEX even though the tables actually have fewer than 100 million rows?
Should we consider using replicated tables for the fact tables?
Some recommendations:
If possible, please move your DimensionIDs from varchar to int/bigint. You'll get better performance, less storage, and lower costs.
Forget about clustered indexes for now.
Create your table hash-distributed, but not on the date column, since that will hot-spot your data.
Create your table as a CLUSTERED COLUMNSTORE INDEX (see the sketch after this list).
Don't replicate your FACT table, but replicate your DIMENSIONS instead.
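Putting those recommendations together, a minimal DDL sketch might look like the following. It assumes the dimension keys have been converted to INT as suggested, and the choice of DimensionID1 as the hash column is only an example of a reasonably high-cardinality, evenly distributed key:

CREATE TABLE FactTable
(
    TimeDimensionID INT NOT NULL,
    DimensionID1    INT NOT NULL,
    DimensionID2    INT NOT NULL,
    DimensionID3    INT NOT NULL,
    DimensionID4    INT NOT NULL,
    Measure1        INT,
    Measure2        FLOAT,
    Measure3        DECIMAL(10,2),
    Measure4        DECIMAL(10,2)
)
WITH
(
    DISTRIBUTION = HASH(DimensionID1),   -- hash on a high-cardinality key, not the date
    CLUSTERED COLUMNSTORE INDEX          -- columnstore storage for the fact table
);

The small dimension tables, by contrast, would typically be created with DISTRIBUTION = REPLICATE.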
I have a table structure like the one below:
FeatureList
ID - BIGINT - Primary Key - Clustered Index
VIN - VARCHAR(50)
Text - VARCHAR(50)
Value - VARCHAR(50)
Most of the queries I execute on this table look like:
SELECT * FROM FeatureList WHERE VIN = 'ABCD' --- Will give multiple records
OR
DELETE FROM FeatureList WHERE VIN = 'ABCD'
I want to know: is the VIN column a good candidate for a nonclustered index, or might that degrade performance?
Not declaring an index on VIN will drastically degrade read performance. You take a small performance hit on each insert, update, or delete involving VIN, but reads (especially once you get into millions of records) will run orders of magnitude faster with the index.
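As a concrete sketch (the index name below is illustrative), the nonclustered index on VIN would simply be:

CREATE NONCLUSTERED INDEX IX_FeatureList_VIN
    ON dbo.FeatureList (VIN);

Because the queries use SELECT * and DELETE by VIN, each match still has to be looked up in the clustered index, but that is normally far cheaper than scanning the whole table.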
As for BIGINT versus INT, I generally go for BIGINT. Yes, it takes up a bit more disk space. Yes, it takes up a bit more memory. The plus side for me, though, is that I never, ever have to worry about migrating the table (and every other table that takes ID as a foreign key) to BIGINT. Been there. Done that. The extra space is worth it.
I have a table
create table Objects (
    ObjectID bigint not null primary key,
    ObjectRef1 varchar(50) not null,
    ObjectRef2 varchar(50) not null,
    ObjectRef3 varchar(250) not null
)
All fields are unique. The table has approximately 100 million rows. All columns have unique indexes, and are used frequently for queries.
What is faster: normalizing each of the varchar fields into separate tables, or keeping them as they are? If normalized, the table would only have the ObjectID column plus IDs pointing to the normalized tables, and I would use inner joins to get the values of ObjectRefX.
Should I consider other databases like Hadoop for this amount of data?
Performance cannot really be predicted until the query actually runs, but I would suggest keeping the table as it is. Normalizing this data into separate tables would increase dependencies, since you would be connecting the tables with foreign keys, and moreover all the columns are unique, so there is no redundancy to reduce. Keep the indexes in place and try to optimize the queries rather than the schema here.
Any correction to the above answer is welcome. I hope this is of some help.
Thanks,
Ashutosh Arya
On our SQL Server 2008 R2 database we have a COUNTRIES reference table that contains countries. The PRIMARY KEY is an nvarchar column:
create table COUNTRIES(
    COUNTRY_ID nvarchar(50) PRIMARY KEY,
    ... other columns
)
The primary key contains values like 'FR', 'GER', 'US', 'UK', etc. This table contains max. 20 rows.
We also have a SALES table containing sales data:
create table SALES(
    ID int PRIMARY KEY,
    COUNTRY_ID nvarchar(50),
    PRODUCT_ID int,
    DATE datetime,
    UNITS decimal(18,2),
    ... other columns
)
This sales table contains a column named COUNTRY_ID, also of type nvarchar (not a primary key). This table is much larger, containing around 20 million rows.
Inside our app, when querying the SALES table, we filter almost every time on COUNTRY_ID. Even so, most aggregation queries take too long (even with the proper indexes in place).
We're in a development phase to improve the query performance on the SALES table. My question is:
Is it worth switching the COUNTRY_ID type from nvarchar(50) to int? If the COUNTRY_ID column is converted to int in both tables, can I expect better performance when joining the two tables?
I would personally recommend changing COUNTRY_ID from nvarchar(50) to an INT. An int uses 4 bytes of storage and is usually quicker to JOIN on than an nvarchar.
You can also check whether the space used is reduced by using the stored procedure sp_spaceused:
EXEC sp_spaceused 'TableName'
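If you do decide to switch, one possible migration sketch looks like this; the COUNTRY_KEY column and the index names are illustrative, not part of the original schema:

-- Add an integer surrogate key to the small COUNTRIES table
ALTER TABLE COUNTRIES ADD COUNTRY_KEY INT IDENTITY(1,1) NOT NULL;
CREATE UNIQUE INDEX UX_COUNTRIES_COUNTRY_KEY ON COUNTRIES (COUNTRY_KEY);

-- Add the matching integer column to SALES and backfill it from the existing codes
ALTER TABLE SALES ADD COUNTRY_KEY INT NULL;

UPDATE s
SET    s.COUNTRY_KEY = c.COUNTRY_KEY
FROM   SALES AS s
JOIN   COUNTRIES AS c
  ON   c.COUNTRY_ID = s.COUNTRY_ID;

-- Index the new column to support the usual country filters and joins
CREATE INDEX IX_SALES_COUNTRY_KEY_DATE ON SALES (COUNTRY_KEY, DATE);

Once everything reads from the new integer column, the old nvarchar column can be dropped and the foreign key pointed at the new key.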