Changing int to uniqueidentifier in a Database Project - sql

I would like to change a primary key column from int to uniqueidentifier in a database project. I tried to change the type, but got a predictable error because SQL Server can't convert int to uniqueidentifier. The table currently looks like this:
+--------------+----------------+-------------+---------+
| Name         | Data Type      | Allow Nulls | Default |
+--------------+----------------+-------------+---------+
| OrderImageId | int            | No          |         |
| OrderId      | int            | No          |         |
| Image        | varbinary(MAX) | Yes         |         |
| FileName     | nvarchar(MAX)  | Yes         |         |
+--------------+----------------+-------------+---------+
create table [dbo].[OrderImages]
(
    [OrderImageId] int not null primary key identity,
    [OrderId] int not null
        constraint [FK_OrderImages_Orders] references [Orders]([OrderId]),
    [Image] varbinary(MAX) null,
    [FileName] nvarchar(MAX) null
)
I know how to do it in SQL Server Management Studio (create a separate guid column, fill it, drop the existing PK, make the guid column the new PK, etc.), but is it possible to do it in a database project? And what if my PK column is referenced by foreign keys?

To do this you will need to create a new table and migrate the data, substituting the guids that you want to use to replace the existing integer data.
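A rough sketch of what that migration could look like, for example in a pre-deployment script (the _New table name and the sequential-GUID default are assumptions for illustration, not something the project already contains):
create table [dbo].[OrderImages_New]
(
    [OrderImageId] uniqueidentifier not null
        constraint [DF_OrderImagesNew_OrderImageId] default newsequentialid()
        constraint [PK_OrderImages_New] primary key,
    [OrderId] int not null
        constraint [FK_OrderImagesNew_Orders] references [Orders]([OrderId]),
    [Image] varbinary(MAX) null,
    [FileName] nvarchar(MAX) null
);

-- copy the rows, generating a new guid for each existing integer key
insert into [dbo].[OrderImages_New] ([OrderImageId], [OrderId], [Image], [FileName])
select newid(), [OrderId], [Image], [FileName]
from [dbo].[OrderImages];

-- then drop the old table, rename the new one, and repoint any FKs that referenced OrderImageId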
That being said, using a uniqueidentifier here is not a good idea. It is 128 bits of essentially random data that will cause fragmentation. If you expect to have more than 4 billion images with possibly multiple images per order, you can use a bigint.
If there will be no more than one image per order, you can use the OrderID as the primary key (without the identity constraint) and avoid needing to add a nonclustered index on OrderID.
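If you go the one-image-per-order route, a minimal sketch (assuming the same columns as the question) would be:
create table [dbo].[OrderImages]
(
    [OrderId] int not null
        constraint [PK_OrderImages] primary key
        constraint [FK_OrderImages_Orders] references [Orders]([OrderId]),
    [Image] varbinary(MAX) null,
    [FileName] nvarchar(MAX) null
)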

Related

Oracle - fast insert and fast latest records lookup

I have a table with logs which has grown in size (~100M records) to the point where querying even the latest entries takes a considerable amount of time.
I am wondering: is there a smart way to make access to the latest records (largest PK values) fast while also keeping inserts (appends) fast? I do not want to delete any data if possible; there is already a mechanism that deletes logs older than N days every month.
Ideally what I mean is have the query
select * from t_logs order by log_id desc fetch first 50 rows only
to run in a split second (up to a reasonable row count, say 500, if that matters).
The table is defined as follows:
CREATE TABLE t_logs (
    log_id       NUMBER NOT NULL,
    method_name  VARCHAR2(128 CHAR) NOT NULL,
    msg          VARCHAR2(4000 CHAR) NOT NULL,
    type         VARCHAR2(1 CHAR) NOT NULL,
    time_stamp   TIMESTAMP(6) NOT NULL,
    user_created VARCHAR2(50 CHAR) DEFAULT user NOT NULL
);
CREATE UNIQUE INDEX logs_pk ON t_logs ( log_id ) REVERSE;
ALTER TABLE t_logs ADD (
    CONSTRAINT logs_pk PRIMARY KEY ( log_id )
);
I am not really a DBA, so I am not familiar with all the performance tuning methods. I just use the logs a lot and was wondering if I could do something non-invasive to the data to ease my pain. To the best of my knowledge, what I did: tried re-computing statistics / re-analyzing the table (no effect), and looked into the query plan:
---------------------------------------------
| Id | Operation                 | Name     |
---------------------------------------------
|  0 | SELECT STATEMENT          |          |
|  1 |  VIEW                     |          |
|  2 |   WINDOW SORT PUSHED RANK |          |
|  3 |    TABLE ACCESS FULL      | T_LOGS   |
---------------------------------------------
I would expect the query to leverage the index to perform the lookup; why doesn't it? Maybe this is the reason it takes so long to find the results?
Version: Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Mr Cave, in the accepted answer, seems to be right
alter table t_logs drop constraint logs_pk;
drop index logs_pk;
create unique index logs_pk on t_logs ( log_id );
alter table t_logs add (
    constraint logs_pk primary key ( log_id )
);
Queries run super fast now, plan looks as expected:
--------------------------------------------------
| Id | Operation                      | Name     |
--------------------------------------------------
|  0 | SELECT STATEMENT               |          |
|  1 |  VIEW                          |          |
|  2 |   WINDOW NOSORT STOPKEY        |          |
|  3 |    TABLE ACCESS BY INDEX ROWID | T_LOGS   |
|  4 |     INDEX FULL SCAN DESCENDING | LOGS_PK  |
--------------------------------------------------
100 million rows isn't that large.
Why are you creating a reverse-key index for your primary key? Sure, that has the potential to reduce contention on inserts but were you really constrained by contention? That would be pretty unusual. Maybe you have an unusual environment. But my guess is that someone was trying to prematurely optimize the design for inserts without considering what that did to queries.
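If you are not sure whether an existing index is reverse-keyed, a quick data dictionary check (assuming you own the table) would be something like:
select index_name, index_type
from   user_indexes
where  table_name = 'T_LOGS';
-- a reverse key index should report INDEX_TYPE = 'NORMAL/REV'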
My wager would be that a nice, basic design would be more than sufficient for your needs
CREATE TABLE t_logs (
    log_id       NUMBER NOT NULL,
    method_name  VARCHAR2(128 CHAR) NOT NULL,
    msg          VARCHAR2(4000 CHAR) NOT NULL,
    type         VARCHAR2(1 CHAR) NOT NULL,
    time_stamp   TIMESTAMP(6) NOT NULL,
    user_created VARCHAR2(50 CHAR) DEFAULT user NOT NULL
);
CREATE UNIQUE INDEX logs_pk ON t_logs ( log_id );
ALTER TABLE t_logs ADD (
    CONSTRAINT logs_pk PRIMARY KEY ( log_id )
);
If you can't recreate the primary key for some reason, create an index on time_stamp and change your queries to use that
CREATE INDEX log_ts ON t_logs( time_stamp );
SELECT *
FROM t_logs
ORDER BY time_stamp DESC
FETCH FIRST 100 ROWS ONLY;

How to create project_id incrementally in SQL based on an identity column

I have a project table:
CREATE TABLE DOC.BRAND
(
    ID int PRIMARY KEY IDENTITY (1, 1),
    project_id varchar(150),
    project_name varchar(250)
)
For example, project_id should be PRJ001, PRJ002 based on identity column value as shown here:
+----+-------------+---------------+
| ID | project_id | project_name |
+----+-------------+---------------+
| 1 | PRJ001 | PROJECT1 |
| 2 | PRJ002 | PROJECT2 |
+----+-------------+---------------+
How can we achieve that using a stored procedure, or is there any table-level setting?
If you are using SQL Server (which seems likely based on the syntax), you can use a computed column:
CREATE TABLE DOC.BRAND (
    ID int PRIMARY KEY IDENTITY (1, 1),
    project_id as ('PRJ' + format(id, '000')),
    project_name varchar(250)
);
Here is a db<>fiddle.
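A quick usage sketch (the project names are just sample values):
insert into DOC.BRAND (project_name) values ('PROJECT1'), ('PROJECT2');

select ID, project_id, project_name
from DOC.BRAND;
-- ID = 1 -> project_id = 'PRJ001', ID = 2 -> project_id = 'PRJ002'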

Snowflake: create a default field value that auto increments for each primary key, resets per primary key

I would like to create a table to house the following type of data
+--------+-----+----------+
| pk | ctr | name |
+--------+-----+----------+
| fish | 1 | herring |
| mammal | 1 | dog |
| mammal | 2 | cat |
| mammal | 3 | whale |
| bird | 1 | penguin |
| bird | 2 | ostrich |
+--------+-----+----------+
pk is the primary key, string(100) not null.
ctr is a field I want to auto-increment by 1 for each row within a given pk.
I have tried the following:
create or replace table schema.animals (
    pk string(100) not null primary key,
    ctr integer not null default (select NVL(max(ctr), 0) + 1 from schema.animals),
    name string(1000) not null);
This produced the following error
SQL compilation error: error line 6 at position 52: aggregate functions are not allowed as part of the specification of a default value clause.
So I would have used the AUTOINCREMENT / IDENTITY property, like so:
AUTOINCREMENT | IDENTITY [ ( start_num , step_num ) | START num INCREMENT num ]
but it doesn't seem to be able to support resetting per unique pk.
I'm looking for any suggestions on how to solve this; thanks for any help in advance.
You cannot do this with an IDENTITY column. The suggested solution is to use an INSTEAD OF trigger that calculates the ctr value for every row in the inserted pseudo-table. For example:
CREATE TABLE dbo.animals (
    pk nvarchar(100) NOT NULL,
    ctr integer NOT NULL,
    name nvarchar(1000) NOT NULL,
    CONSTRAINT PK_animals PRIMARY KEY (pk, ctr)
)
GO
CREATE TRIGGER dbo.animals_before_insert ON dbo.animals INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- assign ctr = existing max per pk + running number within the inserted batch
    INSERT INTO animals (pk, ctr, name)
    SELECT
        i.pk,
        (ROW_NUMBER() OVER (PARTITION BY i.pk ORDER BY i.name) + ISNULL(a.max_ctr, 0)) AS ctr,
        i.name
    FROM inserted i
    LEFT JOIN (SELECT pk, MAX(ctr) AS max_ctr FROM dbo.animals GROUP BY pk) a
        ON i.pk = a.pk;
END
GO
INSERT INTO dbo.animals (pk, name) VALUES
    ('fish'  , 'herring'),
    ('mammal', 'dog'),
    ('mammal', 'cat'),
    ('mammal', 'whale'),
    ('bird'  , 'penguin'),
    ('bird'  , 'ostrich');
SELECT * FROM dbo.animals;
Result:
pk      ctr  name
------  ---  --------
bird    1    ostrich
bird    2    penguin
fish    1    herring
mammal  1    cat
mammal  2    dog
mammal  3    whale
Another method is to use a scalar user-defined function as the DEFAULT value, but it is slower: the trigger fires once for all rows, whereas the function is called for every row.
I have no idea why you would have a column called pk that is not the primary key. You cannot (easily) do what you want. I would recommend doing this as:
create or replace table schema.animals (
    animal_id int identity primary key,
    name string(100) not null
);
create view schema.v_animals as
    select a.*, row_number() over (partition by name order by animal_id) as ctr
    from schema.animals a;
That is, calculate ctr when you need to use it, rather than storing it in the table.
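A small usage sketch of that approach (sample values only; note that name here plays the role of the question's pk column):
insert into schema.animals (name) values
    ('fish'), ('mammal'), ('mammal'), ('mammal'), ('bird'), ('bird');

select name, ctr from schema.v_animals order by name, ctr;
-- each name gets ctr = 1, 2, 3, ... in animal_id order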

Creating new table vs adding new field

I have the following data:
Fabric Cost
time  | No fabric | BangloreSilk | Chanderi | ...   <- fabric types
-------------------------------------------------
01/15 | 40        | 25           | ...
02/15 | 45        | 30           | ...
..... | ...       | ...          | ...
Dyeing Cost
time  | No fabric | BangloreSilk | Chanderi | ...   <- fabric types
-------------------------------------------------
01/15 | 40        | 25           | ...
02/15 | 45        | 30           | ...
..... | ...       | ...          | ...
The list of fabric types will be the same for both datasets.
Now, to store this data, I created the following table:
fabric_type
id int
fabric_type_name varchar
And then I have two approaches.
Approach 1 :
fabric_cost
id int
fabric_type_id int (foreign key to fabric_type)
cost int
deying_cost
id int
fabric_type_id int (foreign key to fabric_type)
cost int
Approach 2 :
fabric_overall_cost
id int
fabric_type_id int (foreign key to fabric_type)
cost int
fabric_or_dyeing bit (to represent 0 for fabric cost and 1 for dyeing cost)
Now the question is: which approach is better?
Maybe you can create another table - cost_subjects
cost_subjects
id byte
subject varchar
costs
id int
fabric_type_id int (foreign key to fabric_type)
cost int
cost_subject byte (foreign key to cost_subjects table)
And then you can extend the table with more subjects to include in the fabric costs.
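A sketch of that design as DDL (types and names are assumptions; adjust to your RDBMS):
create table cost_subjects (
    id tinyint primary key,
    subject varchar(50) not null      -- e.g. 'fabric', 'dyeing'
);

create table costs (
    id int primary key,
    fabric_type_id int not null,
    cost_subject tinyint not null,
    cost int not null,
    foreign key (fabric_type_id) references fabric_type(id),
    foreign key (cost_subject) references cost_subjects(id)
);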
It really depends on your requirements. Are there other columns that are unique only for the fabric_cost table? Are there other columns that are unique only for the dyeing_cost table? Meaning will your 2 tables grow independently?
If yes, approach 1 is better. Otherwise, approach 2 is better because you won't need to do CRUD on 2 separate tables (for easier maintenance).
Another approach would be:
id int
fabric_type_id int (foreign key to fabric_type)
fabric_cost float/double/decimal
dyeing_cost float/double/decimal
This third approach is if you always have both costs. You might not want to use int for cost. Again, it depends on your requirements.
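As a sketch, that third approach could look like this (decimal rather than int for the costs, per the note above):
create table fabric_overall_cost (
    id int primary key,
    fabric_type_id int not null,
    fabric_cost decimal(10,2) not null,
    dyeing_cost decimal(10,2) not null,
    foreign key (fabric_type_id) references fabric_type(id)
);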

MySQL GROUP BY optimization

This question is a more specific version of a previous question I asked
Table
CREATE TABLE Test4_ClusterMatches
(
    `match_index` INT UNSIGNED,
    `cluster_index` INT UNSIGNED,
    `id` INT NOT NULL AUTO_INCREMENT,
    `tfidf` FLOAT,
    PRIMARY KEY (`cluster_index`, `match_index`, `id`)
);
The query I want to run
mysql> explain SELECT `match_index`, SUM(`tfidf`) AS total
FROM Test4_ClusterMatches WHERE `cluster_index` IN (1,2,3 ... 3000)
GROUP BY `match_index`;
The Problem with the query
It uses temporary and filesort, so it's too slow:
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
| 1 | SIMPLE | Test4_ClusterMatches | range | PRIMARY | PRIMARY | 4 | NULL | 51540 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
With the current indexing the query would need to sort by cluster_index first to eliminate the use of temporary and filesort, but doing so gives the wrong results for sum(tfidf).
Changing the primary key to
PRIMARY KEY (`match_index`,`cluster_index`,`id`)
It doesn't use filesort or temporary tables, but it examines 14,932,441 rows, so it is also too slow:
+----+-------------+----------------------+-------+---------------+---------+---------+------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------+---------+------+----------+--------------------------+
| 1 | SIMPLE | Test5_ClusterMatches | index | NULL | PRIMARY | 16 | NULL | 14932441 | Using where; Using index |
+----+-------------+----------------------+-------+---------------+---------+---------+------+----------+--------------------------+
Tight Index Scan
Running the search for just one cluster_index value uses a tight index scan:
mysql> explain SELECT match_index, SUM(tfidf) AS total
FROM Test4_ClusterMatches WHERE cluster_index = 3000
GROUP BY match_index;
This eliminates the temporary table and the filesort:
+----+-------------+----------------------+------+---------------+---------+---------+-------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+------+---------------+---------+---------+-------+------+--------------------------+
| 1 | SIMPLE | Test4_ClusterMatches | ref | PRIMARY | PRIMARY | 4 | const | 27 | Using where; Using index |
+----+-------------+----------------------+------+---------------+---------+---------+-------+------+--------------------------+
I'm not sure if this can be exploited with some magic sql-fu that I haven't come across yet?
Question
How can I change my query so that it uses 3,000 cluster_index values and avoids using temporary and filesort, without needing to examine 14,932,441 rows?
Update
Using the table
CREATE TABLE Test6_ClusterMatches
(
    match_index INT UNSIGNED,
    cluster_index INT UNSIGNED,
    id INT NOT NULL AUTO_INCREMENT,
    tfidf FLOAT,
    PRIMARY KEY (id),
    UNIQUE KEY (cluster_index, match_index)
);
The query below then gives 10 rows in set (0.41 sec) :)
SELECT `match_index`, SUM(`tfidf`) AS total FROM Test6_ClusterMatches WHERE
`cluster_index` IN (.....)
GROUP BY `match_index` ORDER BY total DESC LIMIT 0,10;
but it's using temporary and filesort:
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+----------------------------------------------+
| 1 | SIMPLE | Test6_ClusterMatches | range | cluster_index | cluster_index | 5 | NULL | 78663 | Using where; Using temporary; Using filesort |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+----------------------------------------------+
I'm wondering if there's any way to make it faster by eliminating the Using temporary and Using filesort?
I had a quick look and this is what I came up with - hope it helps...
SQL Table
drop table if exists cluster_matches;
create table cluster_matches
(
    cluster_id int unsigned not null,
    match_id int unsigned not null,
    ...
    tfidf float not null default 0,
    primary key (cluster_id, match_id) -- if this isn't unique, add id to the end !!
)
engine=innodb;
Test Data
select count(*) from cluster_matches
count(*)
========
17974591
select count(distinct(cluster_id)) from cluster_matches;
count(distinct(cluster_id))
===========================
1000000
select count(distinct(match_id)) from cluster_matches;
count(distinct(match_id))
=========================
6000
explain select
cm.match_id,
sum(tfidf) as sum_tfidf,
count(*) as count_tfidf
from
cluster_matches cm
where
cm.cluster_id between 5000 and 10000
group by
cm.match_id
order by
sum_tfidf desc limit 10;
id select_type table type possible_keys key key_len ref rows Extra
== =========== ===== ==== ============= === ======= === ==== =====
1 SIMPLE cm range PRIMARY PRIMARY 4 290016 Using where; Using temporary; Using filesort
runtime - 0.067 seconds.
Pretty respectable runtime of 0.067 seconds but I think we can make it better.
Stored Procedure
You will have to forgive me for not wanting to type/pass in a list of 5000+ random cluster_ids !
call sum_cluster_matches(null,1); -- for testing
call sum_cluster_matches('1,2,3,4,....5000',1);
The bulk of the sproc isn't very elegant, but all it does is split a CSV string into individual cluster_ids and populate a temp table.
drop procedure if exists sum_cluster_matches;
delimiter #
create procedure sum_cluster_matches
(
in p_cluster_id_csv varchar(65535),
in p_show_explain tinyint unsigned
)
proc_main:begin
declare v_id varchar(10);
declare v_done tinyint unsigned default 0;
declare v_idx int unsigned default 1;
create temporary table tmp(cluster_id int unsigned not null primary key);
-- not very elegant - split the string into tokens and put them into a temp table...
if p_cluster_id_csv is not null then
while not v_done do
set v_id = trim(substring(p_cluster_id_csv, v_idx,
if(locate(',', p_cluster_id_csv, v_idx) > 0,
locate(',', p_cluster_id_csv, v_idx) - v_idx, length(p_cluster_id_csv))));
if length(v_id) > 0 then
set v_idx = v_idx + length(v_id) + 1;
insert ignore into tmp values(v_id);
else
set v_done = 1;
end if;
end while;
else
-- instead of passing in a huge comma separated list of cluster_ids im cheating here to save typing
insert into tmp select cluster_id from clusters where cluster_id between 5000 and 10000;
-- end cheat
end if;
if p_show_explain then
select count(*) as count_of_tmp from tmp;
explain
select
cm.match_id,
sum(tfidf) as sum_tfidf,
count(*) as count_tfidf
from
cluster_matches cm
inner join tmp on tmp.cluster_id = cm.cluster_id
group by
cm.match_id
order by
sum_tfidf desc limit 10;
end if;
select
cm.match_id,
sum(tfidf) as sum_tfidf,
count(*) as count_tfidf
from
cluster_matches cm
inner join tmp on tmp.cluster_id = cm.cluster_id
group by
cm.match_id
order by
sum_tfidf desc limit 10;
drop temporary table if exists tmp;
end proc_main #
delimiter ;
Results
call sum_cluster_matches(null,1);
count_of_tmp
============
5001
id select_type table type possible_keys key key_len ref rows Extra
== =========== ===== ==== ============= === ======= === ==== =====
1 SIMPLE tmp index PRIMARY PRIMARY 4 5001 Using index; Using temporary; Using filesort
1 SIMPLE cm ref PRIMARY PRIMARY 4 vldb_db.tmp.cluster_id 8
match_id sum_tfidf count_tfidf
======== ========= ===========
1618 387 64
1473 387 64
3307 382 64
2495 373 64
1135 373 64
3832 372 57
3203 362 58
5464 358 67
2100 355 60
1634 354 52
runtime 0.028 seconds.
Explain plan and runtime much improved.
If the cluster_index values in the WHERE condition are continuous, then instead of IN use:
WHERE (cluster_index >= 1) and (cluster_index <= 3000)
If the values are not continuous then you can create a temporary table to hold the cluster_index values with an index and use an INNER JOIN to the temporary table.
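A minimal sketch of that temporary-table variant against the question's schema (the id list shown here is obviously truncated):
create temporary table tmp_clusters (
    cluster_index int unsigned not null primary key
) engine=memory;

insert into tmp_clusters values (1), (2), (3);   -- ... up to 3000

select cm.match_index, sum(cm.tfidf) as total
from Test4_ClusterMatches cm
inner join tmp_clusters t on t.cluster_index = cm.cluster_index
group by cm.match_index;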