Configuring indexes in postgres - sql

this is my first time dealing with indexes and would like to understand few things.
I have the tables of the following schemas:
Table1: Customer details
id
name
createdOn
username
phone
address
1
xyz
some date
xyz12
12345678
abc
The id in the above table is unique. The id is not defined as PK in the table though. Would id + createdOn be a good complex index?
Table2: Tracked customer actions
customer id
name
timestamp
action type
cart value
address
1
xyz
some date
click
.
abc
The above table does not have any column with unique values and there can be a lot of sparse data. The above actions table is a sample and can have almost 18 columns, with new data being added frequently. Is having all columns as a index a good one?
The queries on these tables could be both simple and complex as below:
select * from customerDetails
OR
with target_customers as (
select
id as customer_id
from customerDetails
where customer_date > {some date}
)
select avg(cart_value) from actions a
where action_type = 'cart updated'
inner join target_customers b on a.customer_id = b.customer_id
These are sample queries and I believe I will be having even more complex queries using different aggregations and joins with other tables as well to gain insights while performing analytics in the future.
I want to understand the best columns for indexes on the above tables.

The id is not defined as PK in the table though."
That's unusual. Why is that?
Would id + createdOn be a good complex index?
No, you'd reverse it: createdOn, id. An index can use the first column alone. This allows you to use the index to order by createdOn and also createdOn between X and Y.
But you probably wouldn't include id in there at all. Make id a primary key and it is indexed.
In general, if you want to cover all possibilities for two keys, make two indexes...
columnA, columnB
columnB
columnA, columnB can cover queries which only reference columnA and can also order by columnA. It can also cover queries which reference both columnA and columnB. But it can't cover a query which only references columnB, so we need an single-column index for columnB.
Is having all columns as a index a good one?
Maybe, it depends on your queries, but probably not.
You want to index foreign keys, they should be indexed automatically, because that will speed up all joins.
You probably want to index timestamps that you're going to search or order by.
Any flags you often query by, such as where action_type = 'cart updated' you may want to index. Or you may want to partition the table by the action type.
The above actions table is a sample and can have almost 18 columns, with new data being added frequently.
This may be a good use of a single jsonb column to store all the miscellaneous attributes. This allows you to use a single index for the jsonb column. However, jsonb is not a panacea and you will have to choose what to put in jsonb and what to make columns.
For example, a timestamp such as createdOn should probably be a column. Also any foreign keys. And status flags such as action_type.

Related

Postgres data base need suggestion creating an index for table

I a have a table structure as below. For fetching the data from table I am having search criteria as mentioned below. I am writing a singe sql query as per requirement(sample query I mentioned below). I need to create an index for the table to cover all the search criteria. It will be helpful somebody advice me.
Table structure(columns):
applicationid varchar(15),
trans_tms timestamp,
SSN varchar,
firstname varchar,
lastname varchar,
DOB date,
Zipcode smallint,
adddetais json
Search criteria will be from API will be fall under 4 categories. All 4 categories are mandatory. At any cost I will receive 4 categories of values for against single applicant.
Search criteria:
ssn&last name (last name need to use function I.e. soundex(lastname)=soundex('inputvalue').
ssn & DOB
ssn&zipcode
firstname&lastname&DOB.
Query:
I am trying to write.
Sample query is:
Select *
from table
where ((ssn='aaa' and soundex(lastname)=soundex('xxx')
or ((ssn='aaa' and dob=xxx)
or (ssn='aaa' and zipcode = 'xxx')
or (firstname='xxx' and lastname='xxx' and dob= xxxx));
For considering performance I need to create an index for the table. Might be composite. Any suggestion will be helpful.
Some Approaches I would follow:
Yes, you are correct composite index/multicolumn index will give benefit in AND conditions of two columns, however, indexes would overlap on columns for given conditions.
Documentation : https://www.postgresql.org/docs/10/indexes-multicolumn.html
You can use a UNION instead of OR.
Reference : https://www.cybertec-postgresql.com/en/avoid-or-for-better-performance/
If multiple conditions could be combined for e.g: ssn should be 'aaa' with any combination, then modifying the where clause with lesser OR is preferable.

PostgreSQL Sequence Ascending Out of Order

I'm having an issue with Sequences when inserting data into a Postgres table through SQL Alchemy.
All of the data is inserted fine, the id BIGSERIAL PRIMARY KEY column has all unique values which is great.
However when I query the first 10/20 rows etc. of the table, the id values are not ascending in numeric order. There are gaps in the sequence, fine, that's to be expected, I mean rows will go through values randomly not ascending like:
id
15
22
16
833
30
etc...
I've gone through plenty of SO and Postgres forum posts around this and have only found people talking about having huge serial gaps in their sequences, not about incorrect ascending order when being created
Screenshots of examples:
The table itself has being created through standard DDL statement like so:
CREATE TABLE IF NOT EXISTS schema.table_name (
id BIGSERIAL NOT NULL,
col1 text NOT NULL,
col2 JSONB[] NOT NULL,
etc....
PRIMARY KEY (id)
);
However when I query the first 10/20 rows etc. of the table
Your query has no order by clause, so you are not selecting the first rows of the table, just an undefined set of rows.
Use order by - you will find out that sequence number are indeed assigned in ascending order (potentially with gaps):
select id from ht_data order by id limit 30
In order to actually check the ordering of the sequence, you would actually need another column that stores the timestamp when each row was created. You could then do:
select id from ht_data order by ts limit 30
In general, there is no defined "order" within a SQL table. If you want to view your data in a certain order, you need an ORDER BY clause:
SELECT *
FROM table_name
ORDER BY id;
As for gaps in the sequence, the contract of an auto increment column generally only guarantees that each newly generated id value with be unique and, most of the time (but not necessarily always), will be increasing.
How could you possibly know if the values are "out of order"? SQL tables represent unordered sets. The only indication of ordering in your table is the serial value.
The query that you are running has no ORDER BY. The results are not guaranteed to be in any particular ordering. Period. That is a very simply fact about SQL. That you want the results of a SELECT to be ordered by the primary key or by insertion order is nice, but not how databases work.
The only way you could determine if something were out of order would be if you had a column that separate specified the insert order -- you could have a creation timestamp for instance.
All you have discovered is that SQL lives up to its promise of not guaranteeing ordering unless the query specifically asks for it.

Generating a primary key unique across multiple databases

Operational databases of identical structure work in several countries.
country A has table Users with column user_id
country B has table Users with column user_id
country C has table Users with column user_id
When data from all three databases is brought to the staging area for the further data warehousing purposes all three operational tables are integrated into a single table Users with dwh_user_id.
The logic looks like following:
if record comes from A then dwh_user_id = 1000000 + user_id
if record comes from B then dwh_user_id = 4000000 + user_id
if record comes from c then dwh_user_id = 8000000 + user_id
I have a strong feeling that it is a very bad approach. What would be a better approach?
(user_id + country_iso_code maybe?)
In general, it's a terrible idea to inject logic into your primary key in this way. It really sets you up for failure - what if country A gets more than 4000000 user records?
There are a variety of solutions.
Ideally, you include the column "country" in all tables, and use that together with the ID as the primary key. This keeps the logic identical between master and country records.
If you're working with a legacy system, and cannot modify the country tables, but can modify the master table, add the key there, populate it during load, and use the combination of country and ID as the primary key.
The way we handle this scenario in Ajilius is to add metadata columns to the load. Values like SERVER_NAME or DATABASE_NAME might provide enough unique information to make a compound key unique.
An alternative scenario is to generate a GUID for each row at extract or load time, which would then uniquely identify each row.
The data vault guys like to use a hash across the row, but in this case it would only work if no row was ever a complete duplicate.
This is why they made the Uniqueidentifier data type. See here.
If you can't change to that, I would put each one in a different table and then union them in a view. Something like:
create view vWorld
as
select 1 as CountryId, user_id
from SpainUsers
UNION ALL
select 2 as CountryId, user_id
from USUsers
Most efficient way to do this would be :-
If record from Country A, then user * 0 = Hence dwh_user_id = 0.
If record from Country B, then (user * 0)- 1 = Hence dwh_user_id = -1.
If record from Country C, then (user * 0)+ 1 = Hence dwh_user_id = 1.
Suggesting this logic assuming the dwh_user_id is supposed to be a number field.

clustered index on "join" column or on "where" column

I have two tables in SQL Server.
Table A has two columns: ID and Name. They are both unique.
Table B is joined to table A on the Name column.
Assuming most of the queries to these tables will be:
Select A.Name, B.xyz From A Inner Join B on A.Name = B.Name Where ID = 1
Which column in table A do I want to create a clustered index on? ID to speed up the WHERE or Name to speed up the JOIN?
I don't think you want a clustered index at all. You want just a regular index. (Unless you expect to read all the data in sequential order.)
What you do want is an index on both fields together, with ID first and Name second.
So three indexes:
ID Primary
Name Unique
ID, Name Unique
The only reason to create the last index is if your database supports index merges. Otherwise you want just the index on ID. A.Name isn't used from the index, A.Name is data, not a search key. You want an index on B.Name.
The NAME column since it's unique and you are more likely to perform searches/joins on this one.

Indexes and multi column primary keys

In a MySQL database I have a table with the following primary key
PRIMARY KEY id (invoice, item)
In my application I will also frequently be selecting on item by itself and less frequently on only invoice. I'm assuming I would benefit from indexes on these columns.
MySQL does not complain when I define the following:
INDEX (invoice),
INDEX (item),
PRIMARY KEY id (invoice, item)
But I don't see any evidence (using DESCRIBE -- the only way I know how to look) that separate indexes have been established for these two columns.
Are the columns that make up a primary key automatically indexed individually? Is there a better way than DESCRIBE to explore the structure of my table?
I'm not intimately familiar with the internals of indices on mySql, but on the two database vendor products that I am familiar with (MsSQL, Oracle) indices are balanced-Tree structures, whose nodes are organized as a sequenced tuple of the columns the index is defined on (In the Sequence Defined)
So, unless mySql does it very differently, (probably not), any composite index (on more than one column) can be useable by any query that needs to filter or sort by a subset of the columns in the index, as long as the list of columns is compatible, i.e., if the columns, when sequenced the same as the sequenced list of columns in the complete index, is an ordered subset of the complete set of index columns, which starts at the beginning of the actual index sequence, with no gaps except at the end...
In other words, this means that if you have an index on (a,b,c,d) a query that filters on (a), (a,b), or (a,b,c) can also use the index, but a query that needs to filter on (b), or (c) or (b,c) will not be able to use the index...
So in your case, if you often need to filter or sort on column item alone, you need to add another index on that column by itself...
I personally use phpMyAdmin to view and edit the structure of MySQL databases. It is a web application but it runs well enough on a local web server (I run an instance of apache on my machine for this and phpPgAdmin).
As for the composite key of (invoice, item), it acts like an index for (invoice, item) and for invoice. If you want to index by just item you have to add that index yourself. Your PK will be sorted by invoice and then by item where invoice is the same in multiple records. While the order in a composite PK does not matter for uniqueness enforcement, it does matter for access.
On your table I would use:
PRIMARY KEY id (invoice, item), INDEX (item)
I'm not that familiar with MySQL, but generally an multiple-column index is equally useful on the first column in the index as an index on that column alone. The multiple-column index becomes less useful for querying against a single column the further the column appears into the index.
This makes some sense if you think of the multi-column index as a hierarchy. The first column in the index is the root of the hierarchy, so searching it is just a matter of scanning that first level. However, in order to scan the second column, the database has to look up the tree for each unique value found in the first column. This can be costly enough that most optimizers won't bother to look deeply into a multi-column index, instead opting to full-table-scan.
For example, if you have a table as follows:
Col1 |Col2 |Col3
----------------
A | 1 | Z
A | 2 | Y
A | 2 | X
B | 1 | Z
B | 2 | X
Assuming you have an index on all three columns, in order, the tree will look something like this:
A
+-1
+-Z
+-2
+-X
+-Y
B
+-1
+-Z
+-2
+-X
Looking for Col1='A' is easy: you only have to look at 2 ordered values. However, to resolve col3='X', you have to look at all of the values in the 4 bigger buckets, each of which is ordered individually.
To return table index information, you can use:
SHOW INDEX FROM <table>;
See: http://dev.mysql.com/doc/refman/5.0/en/show-index.html
To view table information:
SHOW CREATE TABLE <table>;
See: http://dev.mysql.com/doc/refman/5.0/en/show-create-table.html
Primary keys are indexes, so there's no need to create additional indexes. You can find out more information about them under the CREATE TABLE syntax (there's too much to insert here):
http://dev.mysql.com/doc/refman/5.0/en/create-table.html
There is a difference between composite index and composite primary key.
If you have defined a composite index like below
INDEX idx(invoice,item)
the index wont work if you query based on item and you need to add a separate index
INDEX itemidx(item)
But, if you have defined a composite primary key like below
PRIMARY KEY(invoice, item)
the index would work if you query based on item and no separate index is required.
Working example:
mysql>create table test ( col1 int(20), col2 int(20) ) primary key(col1,col2);
mysql>explain select * from test where col2 = 1;
+----+-------------+-------+-------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+--------------------------+
| 1 | SIMPLE | test | index | NULL | PRIMARY | 8 | NULL | 10 | Using where; Using index |
+----+-------------+-------+-------+---------------+---------+---------+------+------+--------------------------+
Mysql auto create an index for composite keys. Depending on your queries, you may have to create separate index for individual column in the composite key.
If you are using mysql workbench, you can manually right click the schema and click on edit to see everything about the table
If your query is using both columns in where clause then you don't need to create a separate index in a composite primary key.
EXPLAIN SELECT * FROM `table` WHERE invoice = 1 and item = 1
You are also fine if you want to query with first column only
EXPLAIN SELECT * FROM `table` WHERE invoice = 1
But if you want to query with subsequent columns col2, col3 in composite PK then you would need to create separate indexes on those columns. The following explain query shows the second column does not have a possible key detected by MySQL
EXPLAIN SELECT * FROM `table` WHERE item = 1