Index Integer Substring of Varchar ID PostgreSQL

I am going to be creating a very large table (320k+ rows) that I will be running many complicated operations on, so performance is very important. Each row will be a reference to a page/entity from an external site that already has unique IDs. To keep the data easy to read, and for consistency, I would rather use those external IDs as my own row IDs. The problem is that the IDs are in the format XXX########, where the XXX part is an identical string prefix on every ID and the ######## part is a completely unique number. From what I know, varchar IDs are measurably slower than numeric ones, and looking only at the numerical part would give the same results.
What is the best way to do this? I still want to be able to run queries like WHERE ID = 'XXX########' and have the actual, untrimmed IDs displayed in result sets. Is there a way to define getters and setters for a column? Or is there a way to create an index on a function of just the numerical part of the ID?

Since your ID column (with format XXX########) is a primary key, there will already be an index on that column. If you wish to create an index based on the "completely unique number" portion of the ID, it is possible to create an expression index in Postgres:
CREATE INDEX pk_substr_idx ON mytable (substring(id,4));
This will create an index on the ######## portion of your column. Bear in mind, however, that the values stored in the index will be text, not numbers, so range comparisons (>, <, >=, <=) would use text ordering rather than numeric ordering; effectively you could only rely on equality (=) checks. You might therefore not see any real benefit to having this index around.
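If you want the index to store actual numbers, a variation (just a sketch, assuming the part after the three-character prefix is always a valid integer) is to cast inside the expression index. Note the extra parentheses, which Postgres requires around an expression that is not a bare function call:
CREATE INDEX pk_substr_int_idx ON mytable ((substring(id, 4)::int));
-- A query must repeat the same expression to use this index:
SELECT * FROM mytable WHERE substring(id, 4)::int >= 12345678;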
The other drawback of this approach is that for every row you insert, you'll be updating two indexes (the one for the PK, and the one for the substring).
Therefore, if at all possible, I would recommend splitting your ID into separate prefix (the XXX portion) and id_num (the ######## portion) columns. Since you stated that "the XXX part is always the same identical string prefix", you stand to reap a performance benefit by either 1) splitting the string into two columns, or 2) hard-coding the XXX portion in your app and storing only the numeric portion in the database.
Another approach (if you are willing to split the string into separate prefix and id_num columns) is to create a composite index. The table definition would then look something like:
CREATE TABLE mytable (
    prefix text,
    id_num int,
    <other columns>,
    PRIMARY KEY (prefix, id_num)
);
This creates a primary key on the two columns, and your queries will be able to use the index as long as you write your application with the two columns in mind. Again, you would need to split the ID up into its text and number portions. I believe this is the only way to get the best performance out of your queries; any value that mixes text and numbers will ultimately be stored and compared as text.
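To still display the original external IDs, one option (a sketch, assuming the numeric part is always rendered as 8 zero-padded digits; "payload" stands in for whatever other columns the table has) is a view that reassembles the ID:
CREATE VIEW mytable_v AS
SELECT prefix || lpad(id_num::text, 8, '0') AS id, payload
FROM mytable;
-- Point lookups still use the composite primary key:
SELECT payload FROM mytable WHERE prefix = 'XXX' AND id_num = 12345678;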
Disclosure: I work for EnterpriseDB (EDB)

Use an IDENTITY column for the primary key and load the external IDs into a separate column.

Related


Is json suitable for holding values coming from dynamic columns?
I want to store data about documents where the structure is known in advance, e.g. Name, Year, Place, etc. For now, I have defined these fields as columns in a table in MS SQL.
However, I would also like to store data about "dynamic" documents, where the user selects fields from a certain pool and inserts values. For now, I store this data as JSON in one column. For example, one record might be: '{"Name":"John Doe","Place":"Chicago"}'
and another might be just: '{"Name": "John Doe"}'
I wonder how to unify this data. I thought that data from the first type of document (with a known set of columns) could also be stored as JSON, but I don't know if this is a good approach with larger amounts of data, e.g. 100,000 records.
First, 100,000 rows is not a large number of rows. It is doubtful that you are even talking about a gigabyte of data.
Both XML and JSON incur overhead for storing field names in the data. If you have lots of repeated field names, then lots of redundant field names are being stored. And bigger rows slow down queries.
JSON and XML also have challenges in verifying the field names. This can be handled through the application or constraints. That said, they can be quite useful when the attributes are rarely used.
Your sample data simply suggests NULLable columns. You can have columns for name and place; if there is no place, the value is NULL. Given that you have a fixed pool of fields, there is a good chance that this structure is the simplest and most efficient. The only downside is that adding a new field requires adding a column to the table, and that can be an expensive operation.
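For illustration, a minimal sketch of that shape (SQL Server syntax; table name, column names, and sizes are assumptions based on your examples):
create table documents (
    documentId int identity(1, 1) primary key,
    name varchar(255) not null,
    place varchar(255) null,  -- NULL simply means the field was not supplied
    docYear int null
);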
An alternative is an EAV model, which is mentioned in the comments. This solves the problem of repeating the names, because you can use ids instead. So, you could have:
create table optionalFields (
    optionalFieldId int identity(1, 1) primary key,
    name varchar(255)
);

create table userOptionalFields (
    userOptionalFieldId int identity(1, 1) primary key,
    userId int references users(userId),
    optionalFieldId int references optionalFields(optionalFieldId),
    value varchar(255)
);
The downside to an EAV model is that it is simplest when all the values are strings, and that can be a little tricky if some of the values are numbers or dates. On the positive side, the database ensures that the fields are valid.
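For example, reading one user's optional fields back out would look like this (a sketch using the definitions above; @userId is a parameter):
select f.name, u.value
from userOptionalFields u
join optionalFields f on f.optionalFieldId = u.optionalFieldId
where u.userId = @userId;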
The choice between the different data models depends on factors such as:
The data type of the values.
The total number of fields.
How often new fields are added.
Whether the fields (for a given user) are updated and if so, if they are updated one-by-one or all-at-once.
How common fields are for a given user.
How familiar you are with XML and JSON.
Whether field names have synonyms (for instance "FullName" for "Name").
Whether field names ever change. For instance, might "Name" suddenly become "FullName"?
And no doubt other issues as well.

Database: Should ids be sequential?

I want to use an id as the primary key for my table. In each record, I am also storing an id from another source, but these ids are in no way sequential.
Should I add an (auto-incremented) column with a "new" id? It is very important that queries by id are as fast as possible.
Some info:
The content of my table is only stored temporarily; the table often gets cleared (TRUNCATE) and then filled with new content.
It's SQL Server 2008.
After writing content to the table, I create an index for the id column
Thanks!
As long as you are sure the supplied ids are unique, there's no need to create another (surrogate) id to use as the primary key.
Under most circumstances, an index on the existing id should be sufficient. You can make it slightly faster by declaring it as a primary key.
From what you describe a new id is not necessary for performance. If you do add one, the table will be slightly larger, which has a (very small) negative effect on performance.
If the existing id is not numeric (or not an integer), then there might be a small gain from using a more efficient type for the index. But, your best bet is to make the existing id a primary key (although this might affect load performance).
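Given your truncate-and-reload pattern, the cycle might look like this (a T-SQL sketch; table and column names are assumptions):
alter table temp_data drop constraint PK_temp_data; -- skip on the first run
truncate table temp_data;
-- ... bulk load the new content here ...
alter table temp_data
    add constraint PK_temp_data primary key clustered (externalId);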
Note: I usually prefer synthetic primary keys, so this answer is very specific to your question.
If you are after speed, I would join the two IDs together (either in the application or in a stored proc) and then put them in one column.

Using VARCHAR as PRIMARY KEY for an 'ORPHAN' table

I need to create an orphan table (no relationships with any other table whatsoever) that contains 3 columns.
Col1 - String field - VARCHAR(32) - Contains unique data not more than 32 characters
Col2 - String field - TEXT - Contains larger non-unique data of characters
Col3 - Numeric (Bool) - INT(1) - 0/1 for Flagging
I'm thinking of using Col1 as my PRIMARY KEY. I have done some research and see people argue that using a meaningless INT column as a PRIMARY KEY to avoid Foreign Key/Storage issues is the way to go.
However, IMO, since this is an orphan table, it should not matter. Besides, I would require to place an INDEX on Col1 anyway.
As a side note, I'm not expecting more than ~1000 rows in this table.
Thoughts please.
I'd still just use an INT PK and put an index on COL1. I suppose you could use COL1 as the key if you can ensure that nothing will ever be joined to that table, but if nothing else the INT id will give you an idea of the order in which rows were added and deleted. I also like to add an IsActive boolean so that you never delete anything, and a DateCreated datetime, to almost every table.
If col1 is your real primary key, there is no reason not to use it. Especially if the table is that tiny.
You would need to maintain a unique index on that column anyway, so by adding an artificial primary key you just add more overhead for insert and delete operations (as two indexes must be maintained).
Unless you are referencing that PK from really, really many other rows (and other tables), you should just go with what is the natural primary key for your business rules.
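In other words (a sketch in MySQL syntax, with an assumed table name):
create table orphan (
    col1 varchar(32) not null primary key,  -- the unique natural value
    col2 text,                              -- larger, non-unique data
    col3 int(1) not null default 0          -- 0/1 flag
);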
I see where you are coming from, but it just makes sense to index the first column anyway. It may be because I am used to Excel, but an initial primary-key column gives you an ordering along with readability while debugging or capturing data. With a more randomly generated key you would still be searching through a few hundred rows looking for a hard-to-distinguish value. In the end I highly recommend the extra column of ints; it is well worth it.
Whenever I do any database tables, I keep my INT column.
I believe it's faster to compare numbers than strings.
So it all depends on how often you will query the database for info and compare strings there.
I'm still unclear what the question is. Judging from the answers, I've narrowed it down to two plausible questions:
Is it okay to use a VARCHAR instead of an INTEGER as a primary key?
Yes it is okay to use a VARCHAR instead.
In many cases it is preferred, especially if your table is expected to grow beyond 2,147,483,647 records (yes, this happens). Performance-wise, even if INTs had a minimal speed advantage, you would not see it on a ~1000-record table. Designated PKs are indexed by default. The one thing you lose is the auto-generated sequence that the database can maintain for you.
Is it okay to use your unique COL1 field as a primary key, instead of some other unique ID field?
Yes it is okay.
The whole notion of having a primary key is to establish a unique field. What you're losing, though, may be some intrinsic comprehension. When other users want to join on that table, it's far easier to understand that id is a unique field, whereas col1 (some varchar) may or may not be unique.
In your given scenario, it should be okay. If the scope does grow, you can always introduce an auto_increment PK column later. Just make sure that your field is both indexed and unique.
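If that day comes, the retrofit is straightforward (a MySQL sketch; table and column names are assumptions):
alter table orphan
    drop primary key,
    add id int auto_increment primary key,
    add unique key uq_orphan_col1 (col1);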

How to store array or multiple values in one column

Running Postgres 7.4 (Yeah we are in the midst of upgrading)
I need to store from 1 to 100 selected items into one field in a database. 98% of the time it's just going to be 1 item entered, and 2% of the time (if that) there will be multiple items.
The items are nothing more than text descriptions, (as of now) no more than 30 characters long. They are static values the user selects.
I wanted to know the optimal column data type for storing this data. I was thinking BLOB, but didn't know if that is overkill. Maybe JSON?
Also, I did think of ENUM, but as of now I can't really use it since we are running Postgres 7.4.
I also want to be able to easily identify the item(s) entered, so no mapping or referencing tables.
You have a couple of questions here, so I'll address them separately:
I need to store a number of selected items in one field in a database
My general rule is: don't. This is something which all but requires a second table (or third) with a foreign key. Sure, it may seem easier now, but what if the use case comes along where you need to actually query for those items individually? It also means that you have more options for lazy instantiation and you have a more consistent experience across multiple frameworks/languages. Further, you are less likely to have connection timeout issues (30,000 characters is a lot).
You mentioned that you were thinking about using ENUM. Are these values fixed? Do you know them ahead of time? If so this would be my structure:
Base table (what you have now):
| id primary_key sequence
| -- other columns here.
Items table:
| id primary_key sequence
| descript VARCHAR(30) UNIQUE
Map table:
| base_id bigint
| items_id bigint
Map table would have foreign keys so base_id maps to Base table, and items_id would map to the items table.
And if you'd like an easy way to retrieve this from a DB, then create a view which does the joins. You can even create insert and update rules so that you're practically only dealing with one table.
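For instance, the view might look like this (a sketch; base, items, and map are assumed names following the layout above):
create view base_with_items as
select b.id as base_id, i.descript
from base b
join map m on m.base_id = b.id
join items i on i.id = m.items_id;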
What format should I use to store the data?
If you have to do something like this, why not just use a character-delimited string? It will take less processing power than CSV, XML, or JSON, and it will be shorter.
What column type should I use to store the data?
Personally, I would use TEXT. It does not sound like you'd gain much by making this a BLOB, and TEXT, in my experience, is easier to read if you're using some form of IDE.
Well, there is an array type in recent Postgres versions (not 100% sure about PG 7.4). You can even index them, using a GIN or GiST index. The syntax is:
create table foo (
    bar int[] default '{}'
);

select * from foo where bar && array[1]; -- equivalent to bar && '{1}'::int[]

create index on foo using gin (bar); -- allows the above query to use an index
But as the prior answer suggests, it will be better to normalize properly.

How to use Oracle Indexes

I am a PHP developer with little Oracle experience who is tasked to work with an Oracle database.
The first thing I have noticed is that the tables don't seem to have an auto number index as I am used to seeing in MySQL. Instead they seem to create an index out of two fields.
For example I noticed that one of the indexes is a combination of a Date Field and foreign key ID field. The Date field seems to store the entire date and timestamp so the combination is fairly unique.
If the index name was PLAYER_TABLE_IDX how would I go about using this index in my PHP code?
I want to reference a unique record by this index (rather than using two AND clauses in the WHERE portion of my SQL query)
Any advice Oracle/PHP gurus?
I want to reference a unique record by this index (rather than using two AND clauses in the WHERE portion of my SQL query)
There's no way around it: you have to reference all the columns of a composite primary key to get a unique row.
You can't use an index directly in a SQL query.
In Oracle, you use hint syntax to suggest an index that should be used, but the only way to make a query eligible to use an index is to reference the indexed column(s) in the SELECT, JOIN, WHERE, and ORDER BY clauses.
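In practice, that means writing the query against the columns, with the hint as an optional nudge (a sketch; table and column names are assumptions):
SELECT /*+ INDEX(p PLAYER_TABLE_IDX) */ *
FROM player_table p
WHERE p.created_date = :some_date
  AND p.player_id = :some_id;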
The first thing I have noticed is that the tables don't seem to have an auto number index as I am used to seeing in MySQL.
Oracle (and PostgreSQL) have what are called "sequences". They are objects separate from the table, but are used for functionality similar to MySQL's auto_increment. Unlike MySQL's auto_increment, you can have more than one sequence in use per table (a sequence is never tied to a table), and you can control each one individually.
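For example (a sketch; the sequence, table, and column names are assumptions):
CREATE SEQUENCE player_seq;

INSERT INTO player_table (player_id, created_date)
VALUES (player_seq.NEXTVAL, SYSDATE);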
Instead they seem to create an index out of two fields.
That's what the table design was, nothing specifically Oracle about it.
But I think it's time to address the fact that "index" has a different meaning in a database than the way you are using the term. An index is an additional structure that makes SELECTing data out of a table faster (but makes INSERT/UPDATE/DELETE slower, because the indexes have to be maintained).
What you're talking about is actually called a primary key, and in this example it would be called a composite key because it involves more than one column. Either of the columns, the DATE (consider it a DATETIME) or the foreign key, can have duplicates on its own. But because the key is based on both columns, it's the combination of the two values that identifies a unique record in the table.
http://use-the-index-luke.com/ is my web book that explains how to use indexes in Oracle.
It's overkill for your question, but it is probably worth reading if you want to understand how things work.