How to find the columns that need to be indexed? - sql

I'm starting to learn SQL and relational databases. Below is the table that I have, and it has around 10 million records. My composite key is (reltype, from_product_id, to_product_id).
What strategy should I follow when selecting the columns that need to be indexed? I have also documented the operations that will be performed on the table. Please help me determine which columns, or combinations of columns, need to be indexed.
Table DDL is shown below.
Table name: prod_rel.
Database schema name: public
CREATE TABLE public.prod_rel (
    reltype varchar NULL,
    assocsequence float4 NULL,
    action varchar NULL,
    from_product_id varchar NOT NULL,
    to_product_id varchar NOT NULL,
    status varchar NULL,
    starttime varchar NULL,
    endtime varchar NULL,
    PRIMARY KEY (reltype, from_product_id, to_product_id)
);
Operations performed on table:
select distinct reltype
from public.prod_rel;
update public.prod_rel
set status = ? , starttime = ?
where from_product_id = ?;
update public.prod_rel
set status = ? , endtime = ?
where from_product_id = ?;
select *
from public.prod_rel
where from_product_id in (select distinct from_product_id
from public.prod_rel
where status = ?
and action in ('A', 'E', 'C', 'P')
and reltype = ?
fetch first 1000 rows only);
Note: I'm not performing any JOIN operations. Also, please ignore the casing of table and column names. I'm just getting started.

Ideal would be two indexes:
CREATE INDEX ON prod_rel (from_product_id);
CREATE INDEX ON prod_rel (status, reltype)
WHERE action IN ('A', 'E', 'C', 'P');
Your primary key (which is also implemented using an index) cannot support queries 2 and 3, because from_product_id is not at the beginning of the key. If you redefine the primary key as (from_product_id, to_product_id, reltype), you don't need the first index I suggested.
Why does order matter? Imagine you are looking for a book in a library where the books are ordered by “last name, first name”. You can use this ordering to find all books by “Dickens” quickly, but not all books by any “Charles”.
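If you go that route, the redefinition could look roughly like this (a sketch; prod_rel_pkey is PostgreSQL's default constraint name and is an assumption here):
ALTER TABLE public.prod_rel DROP CONSTRAINT prod_rel_pkey; -- assumed default name
ALTER TABLE public.prod_rel ADD PRIMARY KEY (from_product_id, to_product_id, reltype);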
But let me also comment on your queries.
The first one will perform badly if there are lots of different reltype values; try raising work_mem in that case. It is always a sequential scan of the whole table, and no index can help.

I have changed the order of the primary key columns as shown below, as per @a_horse_with_no_name's suggestion, and created only one index, on the (from_product_id, reltype, status, action) columns.
CREATE TABLE public.prod_rel (
    reltype varchar NULL,
    assocsequence float4 NULL,
    action varchar NULL,
    from_product_id varchar NOT NULL,
    to_product_id varchar NOT NULL,
    status varchar NULL,
    starttime varchar NULL,
    endtime varchar NULL,
    PRIMARY KEY (from_product_id, to_product_id, reltype)
);
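For reference, that single index can be created like this (the index name is illustrative):
CREATE INDEX prod_rel_from_reltype_idx
    ON public.prod_rel (from_product_id, reltype, status, action);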
Also, I have gone through the site suggested by @a_horse_with_no_name. It was amazing; I came to know a lot of new things about indexing.
https://use-the-index-luke.com/

Related

How to use RODBC to save dataframe to table with primary key generated at database

I would like to enter a data frame into an existing table in a database using an R script, and I want the table in the database to have a sequential primary key. My problem is that RODBC doesn't seem to allow the primary key constraint.
Here's the SQL for creating the table I want:
CREATE TABLE [dbo].[results] (
[ID] INT IDENTITY (1, 1) NOT NULL,
[FirstName] VARCHAR (255) NULL,
[LastName] VARCHAR (255) NULL,
[Birthday] DATETIME NULL,
[CreateDate] DATETIME NULL,
CONSTRAINT [PK_dbo.results] PRIMARY KEY CLUSTERED ([ID] ASC)
);
And a test with some R code:
ConnectionString1="Driver=ODBC Driver 11 for SQL Server;Server=myserver; Database=TestDb; trusted_connection=yes"
ConnectionString2="Driver=ODBC Driver 11 for SQL Server;Server=notmyserver; Database=TestDb; trusted_connection=yes"
db1=odbcDriverConnect(ConnectionString1)
query="SELECT a.[firstname] as FirstName
, a.[lastname] as LastName
, Cast(a.[dob] as datetime) as Birthday
, cast(a.createDate as datetime) as CreateDate
FROM [dbo].[People] a"
results=NULL
results=sqlQuery(db1,query,stringsAsFactors=FALSE)
close(db1)
db2=odbcDriverConnect(ConnectionString2)
sqlSave(db2,
results,
append = TRUE,
varTypes=c(Birthday="datetime", CreateDate="datetime"),
colnames = FALSE,
rownames = FALSE,fast=FALSE)
close(db2)
The first part of the R code just gets some test data into a data frame; it works fine and is not part of my question here (I'm including it so you can see what format the test data is in). When I run the sqlSave function I get an error message:
Error in dimnames(x) <- dn :
length of 'dimnames' [2] not equal to array extent
However, if I remove the primary key from the database, everything works fine with this table:
CREATE TABLE [dbo].[results] (
[FirstName] VARCHAR (255) NULL,
[LastName] VARCHAR (255) NULL,
[Birthday] DATETIME NULL,
[CreateDate] DATETIME NULL
);
Clearly the primary key is the issue. Normally, with Entity Framework or the like (as I understand it), the primary key is generated at the database when you insert data.
I'd like a way to append data to a table with a primary key using only an R script. Is that possible? There could already be data in the table I'm adding to, so I don't really see a way to create keys in R before trying to append to the table.
The problem is line 361 in http://github.com/cran/RODBC/blob/master/R/sql.R: the data.frame and the DB table must have exactly the same number of columns, otherwise you get this error with this stack trace:
Error in dimnames(x) <- dn :
length of 'dimnames' [2] not equal to array extent
3. `colnames<-`(`*tmp*`, value = c("ID", "FirstName", "LastName",
"Birthday", "CreateDate")) at sql.R#361
2. sqlwrite(channel, tablename, dat, verbose = verbose, fast = fast,
test = test, nastring = nastring) at sql.R#211
1. sqlSave(db2, results, append = TRUE, varTypes = c(Birthday = "datetime",
CreateDate = "datetime"), colnames = FALSE, rownames = FALSE,
fast = FALSE, verbose = TRUE)
If you add the ID column to your data.frame, you can no longer use the auto-incrementing ID column, so this is no solution (or workaround).
A "simple" workaround to the "same columns" limitation of RODBC::sqlSave is:
Use sqlSave to save the new rows into another table name
Send an insert into ... select from ... via RODBC::sqlQuery to append the new rows to your original table that includes the autoinc ID
column
Delete the table with the new rows again (drop table...)
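In SQL Server terms, steps 2 and 3 would look roughly like this (results_staging is a hypothetical name for the table written by sqlSave in step 1):
-- append the staged rows, letting the database generate the IDENTITY values
INSERT INTO [dbo].[results] ([FirstName], [LastName], [Birthday], [CreateDate])
SELECT [FirstName], [LastName], [Birthday], [CreateDate]
FROM [dbo].[results_staging];
-- clean up the staging table
DROP TABLE [dbo].[results_staging];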
A better option would be to use the newer odbc package, which also offers better performance through bulk-style inserts instead of sending single insert statements the way RODBC does:
https://github.com/r-dbi/odbc
Look for the function dbWriteTable (which is an implementation of the interface DBI::dbWriteTable).

PL/SQL function that returns a value from a table after a check

I am new to PHP and SQL, and I am building a little game to learn a bit more about the latter.
This is my simple database of three tables:
-- *********** SIMPLE MONSTERS DATABASE
CREATE TABLE monsters (
monster_id VARCHAR(20),
haunt_spawn_point VARCHAR(5) NOT NULL,
monster_name VARCHAR(30) NOT NULL,
level_str VARCHAR(10) NOT NULL,
creation_date DATE NOT NULL,
CONSTRAINT monster_id_pk PRIMARY KEY (monster_id)
);
-- ****************************************
CREATE TABLE spawntypes (
spawn_point VARCHAR(5),
special_tresures VARCHAR (5) NOT NULL,
maximum_monsters NUMBER NOT NULL,
unitary_experience NUMBER NOT NULL,
CONSTRAINT spawn_point_pk PRIMARY KEY (spawn_point)
);
-- ****************************************
CREATE TABLE fights (
fight_id NUMBER,
my_monster_id VARCHAR(20),
foe_spawn_point VARCHAR(5),
foe_monster_id VARCHAR(20) NOT NULL,
fight_start TIMESTAMP NOT NULL,
fight_end TIMESTAMP NOT NULL,
total_experience NUMBER NOT NULL,
loot_type NUMBER NOT NULL,
CONSTRAINT my_monster_id_fk FOREIGN KEY (my_monster_id)
REFERENCES monsters (monster_id),
CONSTRAINT foe_spawn_point_fk FOREIGN KEY (foe_spawn_point)
REFERENCES spawntypes (spawn_point),
CONSTRAINT fight_id_pk PRIMARY KEY (fight_id)
);
Given this data, how can I easily carry out these two tasks?
1) I would like to create a PL/SQL function that takes only a fight_id as a parameter and, using the foe_spawn_point (inside the fights table), returns the unitary_experience related to that spawn point by referencing the spawntypes table. How can I do it? [f(x)]
In order to calculate the total experience earned from a fight (unitary_experience * fight_length), I have created a function that, given a particular fight, subtracts fight_start from fight_end, so now I know how long the fight lasted. [f(y)]
2) Is it possible to use these two functions (multiplying the results they return) during the database population task?
INSERT INTO fights VALUES(.... , f(x) * f(y), 'loot A');
in order to populate all the total_experience entries inside the fights table?
thank you for your help
In SQL, you don't generally talk about building functions to do things. The building blocks of SQL are queries, views, and stored procedures (most SQL dialects do have functions, but that is not the place to start).
So, given a variable $FIGHTID, you would fetch the unitary experience with a simple query that uses a join:
select st.unitary_experience
from fights f
join spawntypes st
  on st.spawn_point = f.foe_spawn_point
where f.fight_id = $FIGHTID;
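If you do want the PL/SQL function asked for in task 1, a minimal sketch could look like this (the function name is illustrative):
CREATE OR REPLACE FUNCTION get_unitary_experience(p_fight_id IN NUMBER)
  RETURN NUMBER
IS
  v_unitary_experience spawntypes.unitary_experience%TYPE;
BEGIN
  -- look up the spawn point of the fight's foe, then its per-unit experience
  SELECT st.unitary_experience
    INTO v_unitary_experience
    FROM fights f
    JOIN spawntypes st ON st.spawn_point = f.foe_spawn_point
   WHERE f.fight_id = p_fight_id;
  RETURN v_unitary_experience;
END;
/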
If you have a series of values to insert, along with a function, I would recommend using the select form of insert:
insert into fights (<list of columns>, total_experience)
select <list of values>,
       ($FIGHT_END - $FIGHT_START) * (select unitary_experience
                                      from spawntypes
                                      where spawn_point = '$SPAWN_POINT');
One comment about the tables: it is a good idea for all the ids in a table to be auto-incremented integers. In Oracle you do this by creating a sequence (it is simpler in most other databases).
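For example (the sequence name is illustrative):
CREATE SEQUENCE fight_id_seq;
-- then supply fight_id_seq.NEXTVAL as the fight_id value on each insert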

PL/SQL - Only one value for a person

I have the following table.
CLASS_HAS_STUDENTS (
PER_SSN INTEGER NOT NULL,
PER_YEAR INTEGER NOT NULL, /*These two are PKs for a student*/
SCHOOL_CODE INTEGER NOT NULL, /*PK for a school*/
CLASS_YEAR INTEGER NOT NULL,
CLASS_NUMBER INTEGER NOT NULL,
CLASS_TEACHTYPE CHAR(3) NOT NULL, /*These three are PKs for a class*/
STUDCLASS_STATUS CHAR(1) NOT NULL
constraint CKC_STUDCLASS_STATUS_CLASS_TI check (StudClass_Status IN ('E', 'Y', 'T', 'P', 'F')),
STUDCLASS_LISTNUMBER INTEGER NOT NULL,
STUDCLASS_ROLLNUMBER INTEGER NOT NULL
);
(This code lacks some minor constraints)
Now, I need a way to ensure that one PER_SSN/PER_YEAR combination (a person's PK) can only have one 'E' ("Enrolled") status. I can't do this with a trigger (since I'd be selecting from the same table), and I don't know whether I can do it with a check constraint (can I use COUNT() there?). Any help is appreciated.
You can create a function-based unique index to enforce this sort of thing. You can't create a constraint as such.
This takes advantage of the fact that Oracle b-tree indexes do not index NULL data so the index will only have entries for the rows where studclass_status is E.
CREATE UNIQUE INDEX idx_one_enrolled
ON class_has_students( CASE WHEN studclass_status = 'E'
THEN per_ssn
ELSE null
END,
CASE WHEN studclass_status = 'E'
THEN per_year
ELSE null
END );
I'm a little confused by your question. I'm guessing you want to either:
1) Prevent inserting more than one 'E' status per student (in which case a trigger would be appropriate), or
2) Use a SELECT statement to find students who already have more than one 'E' row in the table, in which case you want to do something like:
SELECT PER_SSN, PER_YEAR, STUDCLASS_STATUS, COUNT(*)
FROM CLASS_HAS_STUDENTS
WHERE STUDCLASS_STATUS = 'E'
GROUP BY PER_SSN, PER_YEAR, STUDCLASS_STATUS
HAVING COUNT(*) > 1;
You should be able to do this with a partial unique index. To make sure each person (per_ssn, per_year) has only one enrolled class, this should work:
CREATE UNIQUE INDEX ssn_enrollments ON class_has_students(per_ssn, per_year)
WHERE studclass_status='E';
Note that this feature is not supported in all SQL implementations, but PostgreSQL has supported it since at least version 8.

Index Guidance for SQL Query

Anyone have guidance on how to approach building indexes for the following query? The query works as expected, but I can't seem to get around full table scans. Working with Oracle 11g.
SELECT v.volume_id
FROM ( SELECT MIN (usv.volume_id) volume_id
FROM user_stage_volume usv
WHERE usv.status = 'NEW'
AND NOT EXISTS
(SELECT 1
FROM user_stage_volume kusv
WHERE kusv.deal_num = usv.deal_num
AND kusv.locked = 'Y')
GROUP BY usv.deal_num, usv.volume_type
ORDER BY MAX (usv.priority) DESC, MIN (usv.last_update) ASC) v
WHERE ROWNUM = 1;
Please request any more info you may need in comments and I'll edit.
Here is the create script for the table. The PK is VOLUME_ID. DEAL_NUM is not unique.
CREATE TABLE ENDUR.USER_STAGE_VOLUME
(
DEAL_NUM NUMBER(38) NOT NULL,
EXTERNAL_ID NUMBER(38) NOT NULL,
VOLUME_TYPE NUMBER(38) NOT NULL,
EXTERNAL_TYPE VARCHAR2(100 BYTE) NOT NULL,
GMT_START DATE NOT NULL,
GMT_END DATE NOT NULL,
VALUE FLOAT(126) NOT NULL,
VOLUME_ID NUMBER(38) NOT NULL,
PRIORITY INTEGER NOT NULL,
STATUS VARCHAR2(100 BYTE) NOT NULL,
LAST_UPDATE DATE NOT NULL,
LOCKED CHAR(1 BYTE) NOT NULL,
RETRY_COUNT INTEGER DEFAULT 0 NOT NULL,
INS_DATE DATE NOT NULL
);
ALTER TABLE ENDUR.USER_STAGE_VOLUME ADD (
    PRIMARY KEY (VOLUME_ID));
An index on (deal_num) would help the subquery greatly. In fact, an index on (deal_num, locked) would allow the subquery to avoid the table itself altogether.
You should expect a full table scan on the main query, as it filters on status which is not indexed (and most likely would not benefit from being indexed, unless 'NEW' is a fairly rare value for status).
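Concretely, the suggested indexes would be (the names are illustrative):
CREATE INDEX idx_usv_deal_num ON user_stage_volume (deal_num);
-- or, so the subquery can be answered from the index alone:
CREATE INDEX idx_usv_deal_locked ON user_stage_volume (deal_num, locked);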
I think it's running your inner subquery (inside NOT EXISTS (...)) once for every row considered by the outer query.
That will be where performance takes a hit: it will run through all of user_stage_volume for each row in user_stage_volume, which is O(n^2), n being the number of rows in usv.
An alternative would be to create a view for the inner subquery and use that view, or to define it inline as a named subquery using WITH.
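A sketch of the WITH rewrite (deal_num is NOT NULL in the DDL above, so NOT IN is safe here):
WITH locked_deals AS (
    SELECT deal_num
    FROM user_stage_volume
    WHERE locked = 'Y'
)
SELECT v.volume_id
FROM (SELECT MIN(usv.volume_id) volume_id
      FROM user_stage_volume usv
      WHERE usv.status = 'NEW'
        AND usv.deal_num NOT IN (SELECT deal_num FROM locked_deals)
      GROUP BY usv.deal_num, usv.volume_type
      ORDER BY MAX(usv.priority) DESC, MIN(usv.last_update) ASC) v
WHERE ROWNUM = 1;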

MySQL query slow when selecting VARCHAR

I have this table:
CREATE TABLE `search_engine_rankings` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`keyword_id` int(11) DEFAULT NULL,
`search_engine_id` int(11) DEFAULT NULL,
`total_results` int(11) DEFAULT NULL,
`rank` int(11) DEFAULT NULL,
`url` varchar(255) DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
`updated_at` datetime DEFAULT NULL,
`indexed_at` date DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `unique_ranking` (`keyword_id`,`search_engine_id`,`rank`,`indexed_at`),
KEY `search_engine_rankings_search_engine_id_fk` (`search_engine_id`),
CONSTRAINT `search_engine_rankings_keyword_id_fk` FOREIGN KEY (`keyword_id`) REFERENCES `keywords` (`id`) ON DELETE CASCADE,
CONSTRAINT `search_engine_rankings_search_engine_id_fk` FOREIGN KEY (`search_engine_id`) REFERENCES `search_engines` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=244454637 DEFAULT CHARSET=utf8
It has about 250M rows in production.
When I do:
select id,
rank
from search_engine_rankings
where keyword_id = 19
and search_engine_id = 11
and indexed_at = "2010-12-03";
...it runs very quickly.
When I add the url column (VARCHAR):
select id,
rank,
url
from search_engine_rankings
where keyword_id = 19
and search_engine_id = 11
and indexed_at = "2010-12-03";
...it runs very slowly.
Any ideas?
The first query can be satisfied by the index alone -- no need to read the base table to obtain the values in the Select clause. The second statement requires reads of the base table because the URL column is not part of the index.
UNIQUE KEY `unique_ranking` (`keyword_id`,`search_engine_id`,`rank`,`indexed_at`),
The rows in the base table are not in the same physical order as the rows in the index, so reading the base table can involve considerable disk thrashing.
You can think of the first query as a kind of proof of this optimization: the disk thrashing is avoided because the engine is smart enough to get the values requested in the SELECT clause from the index, which it has already read into RAM to evaluate the WHERE clause.
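One way to make the second query index-only as well is to extend the index to cover url (a sketch; on a table this size the extra index space is a real cost to weigh):
ALTER TABLE search_engine_rankings
  ADD INDEX idx_ranking_covering (keyword_id, search_engine_id, indexed_at, `rank`, url);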
To add to Tim's answer: an index in MySQL can only be used left to right, which means the columns of your index can serve your WHERE clause only up to the first indexed column the clause does not constrain.
Currently, your UNIQUE index is (keyword_id, search_engine_id, rank, indexed_at). Since rank is not in your WHERE clause, the index can filter on keyword_id and search_engine_id, but still has to scan the remaining rows to filter on indexed_at.
But if you change it to (keyword_id, search_engine_id, indexed_at, rank) (just the order), it will be able to filter on keyword_id, search_engine_id, and indexed_at.
I believe it will then be able to fully use that index to read just the appropriate part of your table.
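The reorder could be done like this (a sketch; rebuilding a unique index on a 250M-row table will take a while):
ALTER TABLE search_engine_rankings
  DROP INDEX unique_ranking,
  ADD UNIQUE INDEX unique_ranking (keyword_id, search_engine_id, indexed_at, `rank`);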
I know it's an old post, but I was experiencing the same situation and didn't find an answer.
This really happens in MySQL: when you select varchar columns, the query can take far longer. My query took about 20 seconds to process 1.7M rows, and now it takes about 1.9 seconds.
OK, first of all, create a view from this query:
CREATE VIEW view_one AS
select id,rank
from search_engine_rankings
where keyword_id = 19000
and search_engine_id = 11
and indexed_at = "2010-12-03";
Second, run the same query but with an inner join:
select v.*, s.url
from view_one AS v
inner join search_engine_rankings s ON s.id=v.id;
TL;DR: I solved this by running OPTIMIZE TABLE on the table.
I experienced the same just now. Even lookups on the primary key selecting just a few rows were slow. Testing a bit, I found it was not limited to the varchar column; selecting an int also took a considerable amount of time.
A query roughly like this took around 3 seconds:
select someint from mytable where id in (1234, 12345, 123456);
While a query roughly like this took <10 ms:
select count(*) from mytable where id in (1234, 12345, 123456);
The approved answer here is to just make an index spanning someint as well, and it will be fast, as MySQL can fetch all the information it needs from the index and won't have to touch the table. That probably works in some settings, but I think it's a silly workaround: something is clearly wrong; it should not take three seconds to fetch three rows from a table! Besides, most applications just do a select * from mytable, and making changes on the application side is not always trivial.
After OPTIMIZE TABLE, both queries take <10 ms.
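For reference, the fix was simply (mytable is the poster's placeholder table name):
OPTIMIZE TABLE mytable;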