Practical limitations of expression indexes in PostgreSQL - sql

I have a need to store data using the HSTORE type and index by key.
CREATE INDEX ix_product_size ON product(((data->'Size')::INT))
CREATE INDEX ix_product_color ON product(((data->'Color')))
etc.
What are the practical limitations of using expression indexes? In my case, there could be several hundred different types of data, hence several hundred expression indexes. Every insert, update, and select query will have to process against these indexes in order to pick the correct one.

I've never played with hstore, but I do something similar when I need an EAV column, e.g.:
create index on product_eav (eav_value) where (eav_type = 'int');
The limitation in doing so is that you need to be explicit in your query to make use of it, i.e. this query would not make use of the above index:
select product_id
from product_eav
where eav_name = 'size'
and eav_value = :size;
But this one would:
select product_id
from product_eav
where eav_name = 'size'
and eav_value = :size
and eav_type = 'int';
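To see the effect of repeating the index predicate in the query, here is a minimal sketch that runs without a PostgreSQL server. It uses Python's built-in sqlite3 module as a stand-in: SQLite applies a similar rule for partial indexes, but its planner is not PostgreSQL's, so take it as an illustration only. The index name ix_eav_int and the helper plan() are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table product_eav (
        product_id integer,
        eav_name   text,
        eav_type   text,
        eav_value  text
    );
    -- Partial index: only rows with eav_type = 'int' are indexed.
    create index ix_eav_int on product_eav (eav_value) where eav_type = 'int';
""")

def plan(sql, params=()):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("explain query plan " + sql, params).fetchall()
    return " ".join(row[3] for row in rows)

base = "select product_id from product_eav where eav_name = 'size' and eav_value = ?"

without_predicate = plan(base, ("42",))
with_predicate = plan(base + " and eav_type = 'int'", ("42",))

print(without_predicate)  # a table scan; the partial index is not eligible
print(with_predicate)     # a search that names ix_eav_int
```

On PostgreSQL the equivalent check would be EXPLAIN on both queries.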
In your example it should likely be more like:
create index on product (((data->'size')::int)) where ((data->'size') is not null);
This should avoid adding an index entry for rows that have no size entry. Depending on the PG version you're using, the query may need to repeat the index predicate, like so:
select product_id
from product
where (data->'size') is not null
and (data->'size')::int = :size;
Another big difference between regular and partial indexes is that the latter cannot be declared as a unique constraint in a table definition. This will succeed:
create unique index foo_bar_key on foo (bar) where (cond);
The following won't:
alter table foo add constraint foo_bar_key unique (bar) where (cond);
But this will:
alter table foo add constraint foo_bar_excl exclude (bar with =) where (cond);
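What a partial unique index actually enforces can be shown with a small sketch. It again uses Python's sqlite3 module as a stand-in (SQLite also supports partial unique indexes); the column active and the condition active = 1 are hypothetical placeholders for (cond) above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table foo (bar text, active integer);
    -- Uniqueness is enforced only for rows where active = 1.
    create unique index foo_bar_key on foo (bar) where active = 1;
""")

def try_insert(bar, active):
    """Insert a row; return True on success, False on a uniqueness violation."""
    try:
        conn.execute("insert into foo (bar, active) values (?, ?)", (bar, active))
        return True
    except sqlite3.IntegrityError:
        return False

results = [
    try_insert("a", 1),  # first active 'a'
    try_insert("a", 0),  # inactive rows are outside the index
    try_insert("a", 0),  # so inactive duplicates are allowed
    try_insert("a", 1),  # second active 'a' violates the partial unique index
]
print(results)  # [True, True, True, False]
```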

Related

How to create a primary key column and fill it with integer values on HANA SQL

I searched but could only find partial answers to this question.
The goal here is to create a new ID column on an existing table.
This new column would be the primary key for the table and I simply want it to be filled with integer values from 1 to number of rows.
What would be the query for that?
I know I have to first alter table to create the new column :
ALTER TABLE <MYTABLE> ADD (ID INTEGER);
Then I could use the series generator :
INSERT INTO <MYTABLE.ID> SELECT SERIES_GENERATE_INTEGER(1,1,(number of rows));
Once the column is filled I could use this line:
ALTER TABLE <MYTABLE> ADD PRIMARY KEY ("ID");
I am sure there is an easier way to do this.
You wrote that you want to add a "new ID column to an existing table" and fill it with unique values.
That's not a "standard" operation in any DBMS, as the usual assumption is that records are created with a primary key and not retrofitted with one.
Thus, "ease" of operation for this is relative to what else you want to do.
For example, if you want to continue using this ID as a primary key for further operations, then using a once-off generator function like the SERIES_GENERATE_INTEGER or a query won't be very helpful since you have to avoid duplicates of already existing values.
Two, relatively easy, options come to mind:
Using a sequence:
create sequence myid;
update <table> set ID = myid.nextval;
And for succeeding inserts:
insert into <table> (id, ..., ...) VALUES (myid.nextval, ..., ...) ;
Note that this generates a value for every existing record and not a predefined set of size X.
Using a GUID
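By using a GUID you generate a unique value every time you call the SYSUUID function in SAP HANA; see the SAP HANA documentation for details. Note that SYSUUID returns a 16-byte binary value, so the ID column would have to be declared as VARBINARY(16) rather than INTEGER.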
Something like
update <table> set ID = SYSUUID;
should do the trick here.
Subsequent inserts would simply call the function for values of ID.

PostgreSQL - Right Index choice for a status field (varchar)

I have a table with lots of entries and a varchar field with length 8 that represents different statuses. There are only about 5 different statuses, let's say 'STATUS1', 'STATUS2', ... and most of the time it is NULL.
When I index the field, it doesn't do much because there are a lot of equal values and then postgres doesn't use the index.
My question is: Is there a way to index such a field and make it faster? Most of the time I query over status IS NULL and I think I can't make that faster. But what if I check for status = 'STATUS1'?
You can use partial indexes in some cases. Let's say you have lots of queries similar to
SELECT *
FROM the_table
WHERE color in ('green', 'blue') AND status = 'STATUS1' ;
This query would most probably run (much) faster if you create a partial index:
CREATE TABLE the_table
(
color text,
status character varying(8)
/* and anything you need */
) ;
CREATE INDEX
ON public.the_table (color)
WHERE status = 'STATUS1' ;
If using PostgreSQL (or any other database which allows it), I'd probably create an enumerated type as well, instead of varchar. You get two advantages: only the enumerated values will be allowed (so "autochecking"), and the space needed to store the info (and index it) is less than for varchar(8):
CREATE TYPE status_type AS ENUM
('STATUS1',
'STATUS2',
'STATUS3');
and then create the table with it:
CREATE TABLE the_table
(
color text,
status status_type
/* and anything you need */
) ;
If you need to know (programmatically) which values the enumeration allows (for instance, to build a menu), PostgreSQL's enum support functions such as enum_range(null::status_type) return them.
If the database didn't support enums, I'd normalize to a small[ish] table of (anonymous_id_PK, status_value) pairs.
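The normalized alternative could look like the following sketch. It uses Python's sqlite3 module (SQLite has no enum type, so it is a natural fit for this fallback); the table name status, its columns, and the helper status_id() are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    create table status (
        status_id    integer primary key,
        status_value text not null unique
    );
    insert into status (status_value) values ('STATUS1'), ('STATUS2'), ('STATUS3');

    create table the_table (
        color     text,
        status_id integer references status (status_id)  -- null means "no status"
    );
""")

def status_id(value):
    """Look up the surrogate key for a status name."""
    row = conn.execute(
        "select status_id from status where status_value = ?", (value,)
    ).fetchone()
    return row[0]

conn.execute("insert into the_table values ('green', ?)", (status_id("STATUS1"),))
conn.execute("insert into the_table values ('blue', null)")

try:
    conn.execute("insert into the_table values ('red', 99)")  # 99 is not a known status
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# The allowed values are ordinary rows, so listing them is a plain query.
allowed = [r[0] for r in conn.execute("select status_value from status order by status_id")]
print(rejected, allowed)
```

The foreign key gives the same "autochecking" as the enum, and the list of allowed values can be read with an ordinary select.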

How can I copy a Redshift table but add a sortkey to a column?

I'm currently working on a project that uses a Redshift table with 51 columns. However, the person who made the table forgot to add a sortkey to our time column which will hurt performance for our use case if we don't add it.
How can I make a version of the table with our time column as the sortkey? I'm aware that you can't make a column a sortkey if it's a member of an existing table, but I was hoping there's a way to do it that doesn't involve writing out the CREATE TABLE syntax by hand; for example, something like this would be nice:
timecube=# CREATE TABLE foo (like bar) sortkey(time);
ERROR: CREATE TABLE LIKE is not supported with DISTSTYLE, DISTKEY(), or SORTKEY() clauses
but as you can see it's not supported. Is there another way? As we're still developing, we don't need any of the existing data.
Using traditional tools like pg_dump didn't work well because they don't include any of the Redshift extras like encoding.
Redshift supports specifying the DIST and SORT keys as part of CREATE TABLE AS statements, as per the docs.
CREATE TABLE table_name
DISTSTYLE KEY
DISTKEY ( column )
SORTKEY ( column )
AS
(SELECT *
FROM source_table)
;
The first step is to get the CREATE TABLE statement for the existing table. Then create the new table, this time adding the sort key.
Check the encodings of the old table (when you load data using the COPY command, it automatically adds compression encodings):
select "column", type, encoding
from pg_table_def where tablename = 'old_table';
When creating the new table, add the encoding type for each column and create the table with the sort key.
Once the new table is created, use the command below:
insert into new_table (select * from old_table order by time asc);

Including multiple columns in a single index in Postgres

I have a 'users' table with two columns, 'email' and 'new_email'. I need:
A case-insensitive uniqueness constraint covering both columns - i.e., if "Bob#Example.com" appears in one row's 'email' column, then inserting "bob#example.com" into another row's (or even the same row's) 'new_email' column should fail.
Fast case-insensitive searching for a given email address in either the 'email' or 'new_email' fields - i.e. find the row where the new_email OR email is "Bob#example.com", case-insensitive.
I know that I could do this more easily by creating a related 'emails' table, but I'm expecting to be looking up users in this table (by primary key) from several applications, and I'd like to avoid duplicating the join logic in various places to also retrieve their emails. So I think some kind of expression index would be best, if that's possible.
If this isn't possible, I suppose my next best option would be to create a view that the other applications could use to easily fetch a user's emails along with their other information, but I'm not sure how to do that either.
I'm using Postgres 8.4. Thank you!
I think you'll have to use a trigger to enforce your cross-column uniqueness constraint. If you add unique indexes on each column and then a trigger something like this (untested, off-the-top-of-my-head code):
CREATE FUNCTION no_dups_allowed() RETURNS trigger AS $$
DECLARE
    r RECORD;
BEGIN
    -- Look for a row whose other column holds the same address, ignoring case.
    -- (The question calls the second column new_email; adjust the name to match.)
    SELECT 1 INTO r
    FROM users
    WHERE LOWER(email) = LOWER(NEW.email_new)
    OR LOWER(email_new) = LOWER(NEW.email);
    IF FOUND THEN
        -- Found a duplicate so it is time for a hissy fit!
        RAISE 'Duplicate email address found' USING ERRCODE = 'unique_violation';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
You'd want something like that as a BEFORE INSERT and BEFORE UPDATE trigger. That trigger would take care of catching cross-column duplicates and the unique indexes would take care of in-column duplicates.
Some useful references: the PostgreSQL documentation on FOUND, RAISE, Triggers, and Trigger Procedures.
You'll want the individual indexes for your queries anyway and using the uniqueness half of the indexes simplifies your trigger by leaving it to only deal with the cross-column part; if you try to do it all in the trigger, then you'll have to watch out for updating a row without really changing the email or email_new columns.
For the querying half, you could create a view that uses a UNION to combine the two columns. You could also create a function to merge the user's email addresses into one list. It's hard to say which would be best without knowing more details of these other queries, but I suspect that fixing all the other queries to know about email and email_new would be the best approach; you'll have to update all the other queries to use the view or function anyway, so why build a view or function at all?
No need for triggers. Try this:
create table et (email text, email2 text);
create unique index et_u on et (coalesce(lower(email),lower(email2)));
insert into et (email,email2) values ('scott#gmail.com',NULL);
insert into et (email,email2) values ('scott#gmail.com',NULL);
ERROR: duplicate key value violates unique constraint "et_u"
insert into et (email,email2) values (NULL,'scott#gmail.com');
ERROR: duplicate key value violates unique constraint "et_u"
insert into et (email,email2) values (NULL,'Scott#gmail.com');
ERROR: duplicate key value violates unique constraint "et_u"
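The session above only covers rows where one of the two columns is NULL. The sketch below replays it and adds a row with both columns set, to show what the coalesce() expression actually indexes. It uses Python's sqlite3 module as a stand-in (SQLite supports unique indexes on expressions too); the addresses bob#gmail.com and alice#gmail.com are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table et (email text, email2 text);
    create unique index et_u on et (coalesce(lower(email), lower(email2)));
""")

def try_insert(email, email2):
    """Insert a row; return True on success, False on a uniqueness violation."""
    try:
        conn.execute("insert into et (email, email2) values (?, ?)", (email, email2))
        return True
    except sqlite3.IntegrityError:
        return False

results = [
    try_insert("scott#gmail.com", None),             # first row
    try_insert(None, "Scott#gmail.com"),             # rejected: same address, other column
    try_insert("bob#gmail.com", "alice#gmail.com"),  # both set: only 'bob#...' is indexed
    try_insert(None, "alice#gmail.com"),             # accepted, although 'alice#...' already appears
]
print(results)  # [True, False, True, True]
```

So the single expression index catches cross-column duplicates only while each row fills at most one of the two columns; when both are set, the second one is not checked.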

Copy a table (including indexes) in postgres

I have a postgres table. I need to delete some data from it. I was going to create a temporary table, copy the data in, recreate the indexes and then delete the rows I need. I can't delete data from the original table, because this original table is the source of data. In one case I need to get some results that depend on deleting X; in another case, I'll need to delete Y. So I need all the original data to always be around and available.
However, it seems a bit silly to recreate the table and copy it again and recreate the indexes. Is there any way in postgres to tell it "I want a complete separate copy of this table, including structure, data and indexes"?
Older PostgreSQL versions do not have "CREATE TABLE .. LIKE X INCLUDING INDEXES".
Newer PostgreSQL (since 8.3, according to the docs) can use "INCLUDING INDEXES":
# select version();
version
-------------------------------------------------------------------------------------------------
PostgreSQL 8.3.7 on x86_64-pc-linux-gnu, compiled by GCC cc (GCC) 4.2.4 (Ubuntu 4.2.4-1ubuntu3)
(1 row)
As you can see I'm testing on 8.3.
Now, let's create table:
# create table x1 (id serial primary key, x text unique);
NOTICE: CREATE TABLE will create implicit sequence "x1_id_seq" for serial column "x1.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "x1_pkey" for table "x1"
NOTICE: CREATE TABLE / UNIQUE will create implicit index "x1_x_key" for table "x1"
CREATE TABLE
And see how it looks:
# \d x1
Table "public.x1"
Column | Type | Modifiers
--------+---------+-------------------------------------------------
id | integer | not null default nextval('x1_id_seq'::regclass)
x | text |
Indexes:
"x1_pkey" PRIMARY KEY, btree (id)
"x1_x_key" UNIQUE, btree (x)
Now we can copy the structure:
# create table x2 ( like x1 INCLUDING DEFAULTS INCLUDING CONSTRAINTS INCLUDING INDEXES );
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "x2_pkey" for table "x2"
NOTICE: CREATE TABLE / UNIQUE will create implicit index "x2_x_key" for table "x2"
CREATE TABLE
And check the structure:
# \d x2
Table "public.x2"
Column | Type | Modifiers
--------+---------+-------------------------------------------------
id | integer | not null default nextval('x1_id_seq'::regclass)
x | text |
Indexes:
"x2_pkey" PRIMARY KEY, btree (id)
"x2_x_key" UNIQUE, btree (x)
If you are using PostgreSQL pre-8.3, you can simply use pg_dump with option "-t" to specify 1 table, change table name in dump, and load it again:
=> pg_dump -t x2 | sed 's/x2/x3/g' | psql
SET
SET
SET
SET
SET
SET
SET
SET
CREATE TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
And now the table is:
# \d x3
Table "public.x3"
Column | Type | Modifiers
--------+---------+-------------------------------------------------
id | integer | not null default nextval('x1_id_seq'::regclass)
x | text |
Indexes:
"x3_pkey" PRIMARY KEY, btree (id)
"x3_x_key" UNIQUE, btree (x)
CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } ] TABLE table_name
[ (column_name [, ...] ) ]
[ WITH ( storage_parameter [= value] [, ... ] ) | WITH OIDS | WITHOUT OIDS ]
[ ON COMMIT { PRESERVE ROWS | DELETE ROWS | DROP } ]
[ TABLESPACE tablespace ]
AS query
Here is an example
CREATE TABLE films_recent AS
SELECT * FROM films WHERE date_prod >= '2002-01-01';
The other way to create a new table from the first is to use
CREATE TABLE films_recent (LIKE films INCLUDING INDEXES);
INSERT INTO films_recent
SELECT *
FROM films
WHERE date_prod >= '2002-01-01';
Note that PostgreSQL has a patch out to fix tablespace issues if the second method is used.
There are many answers on the web, one of them can be found here.
I ended up doing something like this:
create table NEW ( like ORIGINAL including all);
insert into NEW select * from ORIGINAL;
This will copy the schema and the data, including indexes, but not triggers or foreign-key constraints.
Note that a sequence behind a serial column default is shared with the original table, so adding a new row to either table increments the same counter.
"I have a postgres table. I need to delete some data from it."
I presume that ...
delete from yourtable
where <condition(s)>
... won't work for some reason. (Care to share that reason?)
"I was going to create a temporary table, copy the data in, recreate the indexes and the delete the rows I need."
Look into pg_dump and pg_restore. Using pg_dump with some clever options and perhaps editing the output before pg_restoring might do the trick.
Since you are doing "what if"-type analysis on the data, I wonder if you might be better off using views.
You could define a view for each scenario you want to test based on the negation of what you want to exclude. I.e., define a view based on what you want to INclude. E.g., if you want a "window" on the data where you "deleted" the rows where X=Y, then you would create a view as rows where (X != Y).
Views are stored in the database (in the System Catalog) as their defining query. Every time you query the view the database server looks up the underlying query that defines it and executes that (ANDed with any other conditions you used). There are several benefits to this approach:
You never duplicate any portion of your data.
The indexes already in use for the base table (your original, "real" table) will be used (as seen fit by the query optimizer) when you query each view/scenario. There is no need to redefine or copy them.
Since a view is a "window" (NOT a snapshot) on the "real" data in the base table, you can add/update/delete on your base table and simply re-query the view scenarios with no need to recreate anything as the data changes over time.
There is a trade-off, of course. Since a view is a virtual table and not a "real" (base) table, you're actually executing a (perhaps complex) query every time you access it. This may slow things down a bit. But it may not. It depends on many issues (size and nature of the data, quality of the statistics in the System Catalog, speed of the hardware, usage load, and much more). You won't know until you try it. If (and only if) you actually find that the performance is unacceptably slow, then you might look at other options. (Materialized views, copies of tables, ... anything that trades space for time.)
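A minimal sketch of the view-per-scenario idea, using Python's sqlite3 module as a stand-in for PostgreSQL (plain views behave the same way for this purpose). The table measurements, its columns, and the view names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table measurements (id integer primary key, kind text, value integer);
    insert into measurements (kind, value) values ('X', 1), ('Y', 2), ('X', 3), ('Z', 4);

    -- Scenario "as if all X rows were deleted": keep everything that is not X.
    create view without_x as select * from measurements where kind <> 'X';
    -- Scenario "as if all Y rows were deleted".
    create view without_y as select * from measurements where kind <> 'Y';
""")

def count(name):
    """Count the rows visible through a table or view."""
    return conn.execute(f"select count(*) from {name}").fetchone()[0]

before = (count("measurements"), count("without_x"), count("without_y"))

# A change to the base table shows up in every view without re-creating anything.
conn.execute("insert into measurements (kind, value) values ('Y', 5)")
after = (count("measurements"), count("without_x"), count("without_y"))

print(before)  # (4, 2, 3)
print(after)   # (5, 3, 3)
```

The base table is never modified by a scenario, so all the original data stays available for the next one.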
A simple way is to use INCLUDING ALL:
CREATE TABLE new_table (LIKE original_table INCLUDING ALL);
Create a new table using a select to grab the data you want. Then swap the old table with the new one.
create table mynewone as select * from myoldone where ...
Re-create the indexes after the table swap.
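The reason the indexes have to be re-created is that CREATE TABLE ... AS copies rows only. The following sketch shows the sequence with Python's sqlite3 module as a stand-in; the columns, the index name myoldone_x_idx, and the filter x = 'keep' are assumptions standing in for the elided condition above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table myoldone (id integer, x text);
    create index myoldone_x_idx on myoldone (x);
    insert into myoldone values (1, 'keep'), (2, 'drop'), (3, 'keep');

    -- Copy only the wanted rows; the new table starts with no indexes.
    create table mynewone as select * from myoldone where x = 'keep';
""")

def index_names(table):
    """List the names of the indexes defined on a table."""
    return [row[1] for row in conn.execute(f"pragma index_list({table})")]

indexes_after_copy = index_names("mynewone")

conn.executescript("""
    -- Swap the tables, then re-create the index on the new one.
    drop table myoldone;
    alter table mynewone rename to myoldone;
    create index myoldone_x_idx on myoldone (x);
""")

print(indexes_after_copy)       # no indexes on the fresh copy
print(index_names("myoldone"))  # the re-created index
```

Unlike the question's requirement, this variant discards the original table, so it only fits when the old data is no longer needed.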