PostgreSQL - Right Index choice for a status field (varchar) - sql

I have a table with lots of entries and a varchar field of length 8 that represents different statuses. There are only about 5 different statuses, let's say 'STATUS1', 'STATUS2', ..., and most of the time the field is NULL.
When I index the field, it doesn't help much: there are a lot of equal values, so Postgres doesn't use the index.
My question is: Is there a way to index such a field and make queries faster? Most of the time I query for status IS NULL, and I think I can't make that faster. But what if I check for status = 'STATUS1'?

You can use partial indexes in some cases. Let's say you have lots of queries similar to
SELECT *
FROM the_table
WHERE color in ('green', 'blue') AND status = 'STATUS1' ;
This query would most probably run (much) faster if you create a partial index:
CREATE TABLE the_table
(
color text,
status character varying(8)
/* and anything you need */
) ;
CREATE INDEX
ON public.the_table (color)
WHERE status = 'STATUS1' ;
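For the exact case in the question (a column that is mostly NULL, with occasional lookups such as status = 'STATUS1'), a partial index that skips the NULL rows may also help. This is only a minimal sketch, assuming the the_table definition above; the index name the_table_status_not_null_idx is made up for illustration.

```sql
-- Index only the rows that actually carry a status; NULL rows are left out,
-- so the index stays small even when the table is large.
CREATE INDEX the_table_status_not_null_idx
    ON public.the_table (status)
 WHERE status IS NOT NULL ;

-- A lookup by a concrete status implies "status IS NOT NULL",
-- so the planner is able to consider the partial index here.
EXPLAIN
SELECT *
  FROM public.the_table
 WHERE status = 'STATUS1' ;
```

Whether the planner actually picks the index depends on how selective 'STATUS1' is; the EXPLAIN output shows the chosen plan. This index does nothing for status IS NULL queries, which match most of the table anyway.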
If using PostgreSQL (or any other database which allows it), I'd probably create an enumerated type as well, instead of varchar. This has two advantages: only the enumerated values are allowed (so you get "autochecking"), and the space needed to store the value (and index it) is less than for varchar(8):
CREATE TYPE status_type AS ENUM
('STATUS1',
'STATUS2',
'STATUS3');
and then create the table with it:
CREATE TABLE the_table
(
color text,
status status_type
/* and anything you need */
) ;
If you need to know (programmatically) which are the allowed values for the enumeration (for instance, to create a menu), check here.
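In PostgreSQL, one way to list the allowed values is the built-in enum_range function. A minimal sketch, assuming the status_type created above:

```sql
-- All labels of the enum, in declaration order, as a single array
SELECT enum_range(NULL::status_type) ;

-- One label per row, e.g. to populate a menu
SELECT unnest(enum_range(NULL::status_type)) AS status ;
```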
If the database didn't support enums, I'd normalize to a small[ish] table of (anonymous_id_PK, status_value) pairs.

Related

UPDATE two columns with new value under large size table

We have table like :
mytable (pid, string_value, int_value)
This table has more than 20M rows in total. We now have a feature that needs to mark all the rows in this table as invalid. To do that, we need to set string_value = NULL and int_value = 0 on every row, which indicates an invalid row (we still want to keep the pid, as it is important to us).
So what is the best way?
I use the following SQL:
UPDATE Mytable
SET string_value = NULL,
int_value = 0;
but this query takes more than 4 minutes in my test env. Is there any better way we can improve it?
Updating all the rows can be quite expensive. Often, it is faster to empty the table and reload it.
In generic SQL this looks like:
create table mytable_temp as
select pid
from mytable;
truncate table mytable; -- back it up first!
insert into mytable (pid, string_value, int_value)
select pid, null, 0
from mytable_temp;
The creation of the temporary table may use different syntax, depending on your database.
Updates can take time to complete. Another way of achieving this is to follow the following steps:
Add new columns with the values you need set as the default value
Drop the original columns
Rename the new columns with the names of the original columns.
You can then drop the default values on the new columns.
This needs to be tested, as different DBMSs allow different levels of table alteration (i.e. not all DBMSs allow dropping a default or dropping a column).
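The add/drop/rename steps above could look roughly like this. It is a sketch in PostgreSQL syntax, which is an assumption since the question does not name its database. The column types (text, integer) and the temporary *_new names are also assumed for illustration.

```sql
BEGIN;

-- 1. Add replacement columns whose defaults are the "invalid" values.
--    (text / integer are assumed types; use the real ones.)
ALTER TABLE mytable
    ADD COLUMN string_value_new text    DEFAULT NULL,
    ADD COLUMN int_value_new    integer DEFAULT 0;

-- 2. Drop the original columns.
ALTER TABLE mytable
    DROP COLUMN string_value,
    DROP COLUMN int_value;

-- 3. Rename the new columns to the original names.
ALTER TABLE mytable RENAME COLUMN string_value_new TO string_value;
ALTER TABLE mytable RENAME COLUMN int_value_new    TO int_value;

-- 4. Remove the default so later inserts are not silently marked invalid.
ALTER TABLE mytable ALTER COLUMN int_value DROP DEFAULT;

COMMIT;
```

In PostgreSQL 11 and later, adding a column with a constant default does not rewrite the table, which is what makes this cheaper than a full UPDATE. Indexes and constraints on the dropped columns are dropped with them, so they would need to be recreated afterwards.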

Postgres: SELECT or INSERT in high concurrent write load DB

We have a DB for which we need a "selsert" (not upsert) function.
The function should take a text value and return the id column of the existing row (SELECT), or insert the value and return the id of the new row (INSERT).
There are multiple processes that will need to perform this functionality (selsert)
I have been experimenting with pg_advisory_lock and ON CONFLICT clause for INSERT but am still not sure what approach would work best (even when looking at some of the other answers).
So far I have come up with following
WITH
selected AS (
SELECT id FROM test.body_parts WHERE (lower(trim(part))) = lower(trim('finger')) LIMIT 1
),
inserted AS (
INSERT INTO test.body_parts (part)
SELECT trim('finger')
WHERE NOT EXISTS ( SELECT * FROM selected )
-- ON CONFLICT (lower(trim(part))) DO NOTHING -- not sure if this is needed
RETURNING id
)
SELECT id, 'inserted' FROM inserted
UNION
SELECT id, 'selected' FROM selected
Will the above query (within a function) ensure consistency under highly concurrent write workloads?
Are there any other issues I must consider (locking, etc.)?
BTW, I can ensure that there are no duplicate values of (part) by creating a unique index. That is not an issue. What I am after is that the SELECT returns the existing value if another process does the INSERT (I hope I am explaining this right).
Unique index would have following definition
CREATE UNIQUE INDEX body_parts_part_ux
ON test.body_parts
USING btree
(lower(trim(part)));

Making a column unique with one exception

We have an application whose work flow involves submitting information to an outside group and then inputting the user's id number into the system.
For that reason we allow a set default value "00000000" to be put into the id field as a tentative value before the entry is approved and a permanent one is put in.
What I'm looking for is essentially a way to ensure that the column remains unique except for that one value.
What I'm basically looking for is a UNIQUE constraint, but with "00000000" as the exempt "blank" value instead of NULL. I've considered doing it as part of a CHECK constraint, but that seems like it would be a big performance hit (on the assumption that UNIQUE does some kind of indexing).
Use a filtered index, as follows:
CREATE UNIQUE NONCLUSTERED INDEX idx_yourcolumn_notspecificvalue
ON YourTable(yourcolumn)
WHERE yourcolumn != '00000000';
Example:
-- Create Table
Create table Test (id int identity, code varchar (100))
-- Create Unique Filtered Index
CREATE UNIQUE NONCLUSTERED INDEX idx_MyCol_Filtered
ON Test(code)
WHERE code != '00000000';
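-- Insert dummy data: '00000000' is repeated and '0101' appears once
insert into Test (code)
Values ('00000000'),
('00000000'),
('00000000'),
('0101')
select * from Test
The result: all four rows are inserted, including the three duplicate '00000000' values.
-- Now try inserting '0101' again
insert into Test (code) Values ('0101')
The result: the insert fails with a duplicate-key error, because '0101' already exists in the unique filtered index.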
For more details:
Create Filtered Indexes
Approving the user entry through a workflow sounds like crucial business logic. I would suggest generating a random but unique number (a timestamp, for example) and inserting it with the new user entry. Keep an additional flag column that differentiates approved entries from unapproved ones. Once the user gets approval from the workflow, update the id and the flag.

What's the best way to self-document "codes" in a SQL based application?

Q: Is there any way to implement self-documenting enumerations in "standard SQL"?
EXAMPLE:
Column: PlayMode
Legal values: 0=Quiet, 1=League Practice, 2=League Play, 3=Open Play, 4=Cross Play
What I've always done is just define the field as "char(1)" or "int", and define the mnemonic ("league practice") as a comment in the code.
Any BETTER suggestions?
I'd definitely prefer using standard SQL, so the database type (MySQL, MSSQL, Oracle, etc.) shouldn't matter. I'd also prefer to be able to use any application language (C, C#, Java, etc.), so the programming language shouldn't matter, either.
Thank you VERY much in advance!
PS:
It's my understanding that using a second table - to map a code to a description, for example "table playmodes (char(1) id, varchar(10) name)" - is very expensive. Is this necessarily correct?
The normal way is to use a static lookup table, sometimes called a "domain table" (because its purpose is to restrict the domain of a column variable.)
It's up to you to keep the underlying values of any enums or the like in sync with the values in the database (you might write a code generator that generates the enum from the domain table and gets invoked whenever something in the domain table changes).
Here's an example:
--
-- the domain table
--
create table dbo.play_mode
(
id int not null primary key clustered ,
description varchar(32) not null unique nonclustered
)
insert dbo.play_mode values ( 0 , 'Quiet' )
insert dbo.play_mode values ( 1 , 'LeaguePractice' )
insert dbo.play_mode values ( 2 , 'LeaguePlay' )
insert dbo.play_mode values ( 3 , 'OpenPlay' )
insert dbo.play_mode values ( 4 , 'CrossPlay' )
--
-- A table referencing the domain table. The column playmode_id is constrained to
-- one of the values contained in the domain table play_mode.
-- (dbo.team is assumed to exist already.)
--
create table dbo.game
(
id int not null primary key clustered ,
team1_id int not null foreign key references dbo.team( id ) ,
team2_id int not null foreign key references dbo.team( id ) ,
playmode_id int not null foreign key references dbo.play_mode( id )
)
go
Some people for reasons of "economy" might suggest using a single catch-all table for all such code, but in my experience, that ultimately leads to confusion. Best practice is a single small table for each set of discrete values.
Add a foreign key to a "codes" table.
The codes table would have the code value as its PK; add a string description column where you enter the description of the value.
table: PlayModes
Columns: PlayMode number --primary key
Description string
I can't see this being very expensive; databases are built around joining tables like this.
That information should be in the database somewhere, not in comments.
So you should have a table containing those codes, and probably a FK from your table to it.
I agree with @Nicholas Carey (+1): Static data table with two columns, say “Key” or “ID” and “Description”, with foreign key constraints on all tables using the codes. Often the ID columns are simple surrogate keys (1, 2, 3, etc., with no significance attached to the value), but when reasonable I go a step further and use “special” codes. Following are a few examples.
If the values are a sequence (say, Ordered, Paid, Processed, Shipped), I might use 1, 2, 3, 4 to indicate the sequence. This can make things easier if you want to find everything “up through” a given stage, such as all orders that have not yet been shipped (ID < 4). If you are into planning ahead, make them 10, 20, 30, 40; this will allow you to add values “in between” existing values, if/when new codes or statuses come along. (Yes, you cannot and should not try to anticipate everything and anything that might have to be done some day, but a bit of pre-planning like this can make some changes that much simpler.)
Keys/IDs are often integers (1 byte, 2 byte, 4 byte, whatever). There’s little cost to making them character values (1 char, 2 char, 3 char, 4 char). That’s character, not variable character. Done this way, you can have mnemonics for your codes, such as
O, P, R, S
Or, Pd, Pr, Sh
Ordr, Paid, Proc, Ship
…or whatever floats your boat. Done this way, I have found that it can save a lot of time when analyzing or debugging. You still want the lookup table, for relational integrity as well as a reminder for the more obscure codes.
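The sequence-number idea above can be sketched in plain SQL roughly as follows. The table and column names (order_status, orders, status_id) are hypothetical and chosen only for illustration.

```sql
-- Lookup ("domain") table; ids are spaced by 10 so new stages can be slotted in.
CREATE TABLE order_status (
    id          INTEGER     NOT NULL PRIMARY KEY,
    description VARCHAR(32) NOT NULL UNIQUE
);

INSERT INTO order_status (id, description) VALUES (10, 'Ordered');
INSERT INTO order_status (id, description) VALUES (20, 'Paid');
INSERT INTO order_status (id, description) VALUES (30, 'Processed');
INSERT INTO order_status (id, description) VALUES (40, 'Shipped');

-- Referencing table: the foreign key restricts status_id to known codes.
CREATE TABLE orders (
    id        INTEGER NOT NULL PRIMARY KEY,
    status_id INTEGER NOT NULL REFERENCES order_status (id)
);

-- All orders that have not been shipped yet
SELECT o.id, s.description
  FROM orders o
  JOIN order_status s ON s.id = o.status_id
 WHERE o.status_id < 40;
```

A later stage such as "Packed" could be added as id 35 without renumbering anything, and the "not yet shipped" query would keep working unchanged.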

Practical limitations of expression indexes in PostgreSQL

I have a need to store data using the HSTORE type and index by key.
CREATE INDEX ix_product_size ON product(((data->'Size')::INT))
CREATE INDEX ix_product_color ON product(((data->'Color')))
etc.
What are the practical limitations of using expression indexes? In my case, there could be several hundred different types of data, hence several hundred expression indexes. Every insert, update, and select query will have to process against these indexes in order to pick the correct one.
I've never played with hstore, but I do something similar when I need an EAV column, e.g.:
create index on product_eav (eav_value) where (eav_type = 'int');
The limitation in doing so is that you need to be explicit in your query to make use of it, i.e. this query would not make use of the above index:
select product_id
from product_eav
where eav_name = 'size'
and eav_value = :size;
But this one would:
select product_id
from product_eav
where eav_name = 'size'
and eav_value = :size
and eav_type = 'int';
In your example it should likely be more like:
create index on product (((data->'size')::int)) where ((data->'size') is not null);
This should avoid adding an index entry for rows that have no size key. Depending on the PG version you're using, the query may need to be modified like so:
select product_id
from product
where (data->'size') is not null
and (data->'size')::int = :size;
Another big difference between regular and partial indexes is that the latter cannot enforce a unique constraint in a table definition. This will succeed:
create unique index foo_bar_key on foo (bar) where (cond);
The following won't:
alter table foo add constraint foo_bar_key unique (bar) where (cond);
But this will:
alter table foo add constraint foo_bar_excl exclude (bar with =) where (cond);