MariaDB Indexing - indexing

Let's say I have a table of 200,000,000 users. For each user I have saved a certain attribute; let it be their lastname.
I am unsure which index type to use with MariaDB. The only queries made to the database will be of the form SELECT lastname FROM table WHERE username='MYUSERNAME'.
Is it therefore best to just define the column username as the primary key, or do I need to do anything else? Also, how long will it take until the index is built?
Sorry for this question, but this is my first database with more than 200,000 rows.

I would go with:
CREATE INDEX userindex on `table`(username);
This indexes the username column, which is what your query searches on, so the results should come back faster.
Try it, and if it reduces performance just drop the index; nothing lost (although make sure you do have backups! :))
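A minimal sketch of that try-and-check loop, in MariaDB syntax. The table name `table` and the index name userindex come from the statement above; EXPLAIN is only used here to see whether the optimizer picks the index, and the exact plan output depends on the server version and the data.

```sql
-- Build the secondary index on the column the query filters on.
CREATE INDEX userindex ON `table` (username);

-- Check the plan: the "key" column of the output should name userindex.
EXPLAIN SELECT lastname FROM `table` WHERE username = 'MYUSERNAME';

-- If the index turns out not to help, remove it again.
DROP INDEX userindex ON `table`;
```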
This article will help you out: https://mariadb.com/kb/en/getting-started-with-indexes/
It says primary keys are best set at table creation. As I guess your table already exists, that would mean either copying it and creating a primary key, or just using an index.
I recently indexed a table that uses non-unique strings as an ID. Although it took a few minutes to build the index, the speed improvement was great; that table had 57m rows.
-EDIT- Just re-read: I thought it was 200,000 as mentioned at the end, but I see it is 200,000,000 in the title. That's a hella lotta rows.

username sounds like something that is "unique" and not null. So, make it NOT NULL and have PRIMARY KEY(username), without an AUTO_INCREMENT surrogate PK.
If it is not unique, or cannot be NOT NULL, then INDEX(username) is very likely to be useful.
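The two options above could look roughly like this in MariaDB. The table names, column sizes and the InnoDB engine are assumptions for illustration only; nothing here comes from the asker's actual schema.

```sql
-- Option 1: username is unique and never NULL, so it can be the primary key.
CREATE TABLE users (
    username VARCHAR(64)  NOT NULL,
    lastname VARCHAR(100) NOT NULL,
    PRIMARY KEY (username)
) ENGINE=InnoDB;

-- Option 2: username may repeat or be NULL, so use a plain secondary index.
CREATE TABLE users_nonunique (
    username VARCHAR(64)  NULL,
    lastname VARCHAR(100) NOT NULL,
    INDEX idx_username (username)
) ENGINE=InnoDB;

-- The lookup from the question is the same in both cases.
SELECT lastname FROM users WHERE username = 'MYUSERNAME';
```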
To design indexes, you must first know what queries you will be performing. (If you had called it simply "col1", I would not have been able to guess at the above advice.)
There are 3 index types:
BTree (actually B+Tree; see Wikipedia). This is the default and the most commonly used index type. It is efficient at finding a row given a specific value (WHERE user_name = 'joe'). It is also useful for a range of values (WHERE user_name LIKE 'Smith%').
FULLTEXT is useful for a TEXT column where you want to search for "words" inside it.
SPATIAL is useful for 2-dimensional data, such as geographical points on a map or other type of grid.
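As a rough illustration of the three types, here is a hypothetical MariaDB table that uses each of them. The table and column names are made up, and it assumes a MariaDB version where InnoDB supports FULLTEXT and SPATIAL indexes; older versions may need a different storage engine such as MyISAM.

```sql
CREATE TABLE places (
    user_name   VARCHAR(64) NOT NULL,
    description TEXT        NOT NULL,
    location    POINT       NOT NULL,   -- SPATIAL indexes require NOT NULL columns
    PRIMARY KEY (user_name),                       -- BTree
    FULLTEXT INDEX ft_description (description),   -- word search
    SPATIAL INDEX sp_location (location)           -- 2-dimensional search
) ENGINE=InnoDB;

-- BTree: exact match and prefix range.
SELECT description FROM places WHERE user_name = 'joe';
SELECT description FROM places WHERE user_name LIKE 'Smith%';

-- FULLTEXT: find rows whose description contains a word.
SELECT user_name FROM places WHERE MATCH(description) AGAINST('harbor');

-- SPATIAL: find points inside a bounding rectangle.
SELECT user_name FROM places
WHERE MBRContains(
    ST_GeomFromText('POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))'),
    location
);
```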

Related

Database: Should ids be sequential?

I want to use an id as a primary key for my table. In each record, I am also storing an id from an other source, but these ids are in no way sequential.
Should I add an (auto-incremented) column with a "new" id? It is very important that queries by the id are as fast as possible.
Some info:
The content of my table is only stored temporarily. The table often gets cleared (TRUNCATE) and then filled with new content.
It's SQL Server 2008.
After writing content to the table, I create an index for the id column
Thanks!

As long as you are sure the supplied ids are unique, there's no need to create another (surrogate) id to use as primary key.

Under most circumstances, an index on the existing id should be sufficient. You can make it slightly faster by declaring it as a primary key.
From what you describe a new id is not necessary for performance. If you do add one, the table will be slightly larger, which has a (very small) negative effect on performance.
If the existing id is not numeric (or not an integer), then there might be a small gain from using a more efficient type for the index. But, your best bet is to make the existing id a primary key (although this might affect load performance).
Note: I usually prefer synthetic primary keys, so this answer is very specific to your question.
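A sketch of making the existing id the primary key, in T-SQL. The table name ImportedItems and its columns are hypothetical; the point is only that the externally supplied, non-sequential id can serve as the key directly, and that TRUNCATE leaves the key definition in place for the next load.

```sql
CREATE TABLE dbo.ImportedItems (
    SourceId INT           NOT NULL,   -- id from the other source, not sequential
    Payload  NVARCHAR(200) NULL,
    CONSTRAINT PK_ImportedItems PRIMARY KEY CLUSTERED (SourceId)
);

-- Each refresh cycle: empty the table, then reload it.
TRUNCATE TABLE dbo.ImportedItems;

INSERT INTO dbo.ImportedItems (SourceId, Payload)
VALUES (90417, N'first'), (1203, N'second');

-- Lookup by the existing id uses the primary key index.
SELECT Payload FROM dbo.ImportedItems WHERE SourceId = 1203;
```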

If you are after speed, I would join the two IDs together (either in the application or in a stored proc) and then put them in one column.

SQL Server: How to allow duplicate records on small table

I have a small table "ImgViews" that only contains two columns, an ID column called "imgID" + a count column called "viewed", both set up as int.
The idea is to use this table only as a counter so that I can track how often an image with a certain ID is viewed / clicked.
The table has no primary or foreign keys and no relationships.
However, when I enter some data for testing and try entering the same imgID multiple times it always appears greyed out and with a red error icon.
Usually this makes sense, as you don't want duplicate records, but as the purpose is different here, duplicates do make sense for me.
Can someone tell me how I can achieve this or work around it ? What would be a common way to do this ?
Many thanks in advance, Tim.

To address your requirement to store non-unique values, simply remove primary keys, unique constraints, and unique indexes. I expect you may still want a non-unique clustered index on ImgID to improve performance of aggregate queries that would otherwise require a scan of the entire table and a sort. I suggest you store an insert timestamp, not to provide uniqueness, but to facilitate purging data by date, should the need arise in the future.
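A sketch of that setup in T-SQL, assuming the table lives in the dbo schema. The column name insertedAt, the constraint and index names, and the six-month retention window are all made up for illustration.

```sql
-- Timestamp for future purging by date, not for uniqueness.
ALTER TABLE dbo.ImgViews
    ADD insertedAt DATETIME NOT NULL
        CONSTRAINT DF_ImgViews_insertedAt DEFAULT (GETDATE());

-- Non-unique clustered index: duplicate imgID values remain allowed.
CREATE CLUSTERED INDEX IX_ImgViews_imgID ON dbo.ImgViews (imgID);

-- Aggregate per image.
SELECT imgID, SUM(viewed) AS totalViews
FROM dbo.ImgViews
GROUP BY imgID;

-- Purge rows older than six months.
DELETE FROM dbo.ImgViews
WHERE insertedAt < DATEADD(MONTH, -6, GETDATE());
```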

You must have some unique index on that table. Make sure there is no unique index and no unique or primary key constraint.
Or, SSMS simply doesn't know how to identify the row that was just inserted because it has no key.
It is generally not best practice to have a table without a (logical) primary key. In your case, I'd make the image id the primary key and increment the counter. The MERGE statement is well-suited for performing an insert or update at the same time. Alternatives exist.
If you don't like that, create a surrogate primary key (an identity column set as the primary key).
At the moment you have no way of addressing a specific row. That makes the table a little unwieldy.
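A sketch of the MERGE approach mentioned above, in T-SQL. It assumes the table is dbo.ImgViews, that any existing duplicate imgID rows have already been removed, and that @imgID is supplied by the caller; the constraint name and the HOLDLOCK hint (one common way to guard against concurrent upsert races) are my additions.

```sql
-- One row per image: make imgID the primary key.
ALTER TABLE dbo.ImgViews ALTER COLUMN imgID INT NOT NULL;
ALTER TABLE dbo.ImgViews ADD CONSTRAINT PK_ImgViews PRIMARY KEY (imgID);

-- Hypothetical id of the image that was just viewed.
DECLARE @imgID INT = 42;

-- Insert the row on first view, otherwise bump the counter.
MERGE dbo.ImgViews WITH (HOLDLOCK) AS target
USING (SELECT @imgID AS imgID) AS source
    ON target.imgID = source.imgID
WHEN MATCHED THEN
    UPDATE SET viewed = target.viewed + 1
WHEN NOT MATCHED THEN
    INSERT (imgID, viewed) VALUES (source.imgID, 1);
```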

If you allow multiple rows to be absolutely identical, how would you update/delete one of those rows?
How would you expect the database to "know" which row you referred to?
At the very least, add a separate identity column (preferably as the clustered index, too).
As a side note: It's weird that you "like to avoid unneeded data" but at the same time insert duplicates over and over again instead of simply adding up the click count per single image...

Use SQL statements, not the GUI, if the table has no primary key or unique constraint.

How to use Oracle Indexes

I am a PHP developer with little Oracle experience who is tasked to work with an Oracle database.
The first thing I have noticed is that the tables don't seem to have an auto number index as I am used to seeing in MySQL. Instead they seem to create an index out of two fields.
For example I noticed that one of the indexes is a combination of a Date Field and foreign key ID field. The Date field seems to store the entire date and timestamp so the combination is fairly unique.
If the index name was PLAYER_TABLE_IDX how would I go about using this index in my PHP code?
I want to reference a unique record by this index (rather than using two AND clauses in the WHERE portion of my SQL query)
Any advice Oracle/PHP gurus?

"I want to reference a unique record by this index (rather than using two AND clauses in the WHERE portion of my SQL query)"
There's no way around it: you have to reference all the columns in a composite primary key to get a unique row.
You can't use an index directly in a SQL query.
In Oracle, you can use the hint syntax to suggest an index that should be used, but the only means of getting an index used is by specifying the column(s) associated with it in the SELECT, JOIN, WHERE and ORDER BY clauses.
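For illustration, a query against a hypothetical player_table with columns game_date and team_id (only the index name PLAYER_TABLE_IDX comes from the question). The hint is optional; the WHERE clause on both indexed columns is what actually lets Oracle use the index.

```sql
-- Both indexed columns appear in the WHERE clause; the hint merely
-- suggests PLAYER_TABLE_IDX and can normally be left out.
SELECT /*+ INDEX(p PLAYER_TABLE_IDX) */ p.*
FROM   player_table p
WHERE  p.game_date = TO_DATE('2010-06-01 14:30:00', 'YYYY-MM-DD HH24:MI:SS')
AND    p.team_id   = 7;
```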
"The first thing I have noticed is that the tables don't seem to have an auto number index as I am used to seeing in MySQL."
Oracle (and PostgreSQL) have what are called "sequences". They're separate objects from the table, but are used for functionality similar to MySQL's auto_increment. Unlike MySQL's auto_increment, you can have more than one sequence used per table (they're never associated), and can control each one individually.
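A small sketch of a sequence standing in for auto_increment in Oracle. The table players and the sequence player_seq are hypothetical names.

```sql
CREATE TABLE players (
    player_id   NUMBER        PRIMARY KEY,
    player_name VARCHAR2(100) NOT NULL
);

-- The sequence is its own object; nothing ties it to the table.
CREATE SEQUENCE player_seq START WITH 1 INCREMENT BY 1;

-- Draw the next value explicitly when inserting.
INSERT INTO players (player_id, player_name)
VALUES (player_seq.NEXTVAL, 'Alice');

-- Value most recently drawn in this session.
SELECT player_seq.CURRVAL FROM dual;
```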
"Instead they seem to create an index out of two fields."
That's what the table design was, nothing specifically Oracle about it.
But I think it's time to address that an index has a different meaning in a database than how you are using the term. An index is an additional structure that makes SELECTing data out of a table faster (but makes INSERT/UPDATE/DELETE slower, because the index has to be maintained).
What you're talking about is actually called a primary key, and in this example it'd be called a composite key because it involves more than one column. One of the columns, either the DATE (consider it DATETIME) or the foreign key, can have duplicates in this case. But because of the key being based on both columns, it's the combination of the two values that makes them the key to a unique record in the table.
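To make the composite key idea concrete, here is a sketch in Oracle syntax with a made-up table player_stats. Either column may repeat on its own; only the pair has to be unique.

```sql
CREATE TABLE player_stats (
    played_at DATE   NOT NULL,
    player_id NUMBER NOT NULL,
    score     NUMBER,
    CONSTRAINT player_stats_pk PRIMARY KEY (played_at, player_id)
);

-- Same timestamp, different player: allowed.
INSERT INTO player_stats
VALUES (TO_DATE('2010-06-01 14:30:00', 'YYYY-MM-DD HH24:MI:SS'), 1, 10);
INSERT INTO player_stats
VALUES (TO_DATE('2010-06-01 14:30:00', 'YYYY-MM-DD HH24:MI:SS'), 2, 12);

-- Same timestamp and same player again: rejected with ORA-00001
-- (unique constraint violated).
INSERT INTO player_stats
VALUES (TO_DATE('2010-06-01 14:30:00', 'YYYY-MM-DD HH24:MI:SS'), 1, 99);
```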

http://use-the-index-luke.com/ is my Web-Book that explains how to use indexes in Oracle.
It's overkill for your question; however, it is probably worth reading if you want to understand how things work.

Should i have a primary ID? i am indexing another field

Using SQLite, I need a table to hold a blob to store an MD5 hash and a 4-byte int. I plan to index the int, but this value will not be unique.
Do I need a primary key for this table? And is there an issue with indexing a non-unique value? (I assume there is no issue, or no reason for one.)

Personally, I like to have a unique primary id on all tables. It makes finding unique records for updating/deleting easier.

How are you going to reference a row in a SELECT * FROM Table WHERE ... or an UPDATE ... WHERE ...? Are you sure you want each one?

You already have one.
SQLite automatically creates an integer ROWID column for every row of every table (unless the table is declared WITHOUT ROWID). This can function as a primary key if you don't declare your own.
In general it's a good idea to declare your own primary key column. In the particular instance you mentioned, ROWID will probably be fine for you.
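A quick sketch in SQLite of what that looks like for the table described in the question. The table, column and index names are made up, and the blobs are just example 16-byte values.

```sql
-- No declared primary key; SQLite supplies the implicit rowid.
CREATE TABLE hashes (
    md5 BLOB    NOT NULL,
    tag INTEGER NOT NULL
);

-- A non-unique index is fine.
CREATE INDEX hashes_tag_idx ON hashes (tag);

INSERT INTO hashes (md5, tag) VALUES (x'd41d8cd98f00b204e9800998ecf8427e', 7);
INSERT INTO hashes (md5, tag) VALUES (x'900150983cd24fb0d6963f7d28e17f72', 7);

-- rowid is not part of SELECT *, but can be selected by name.
SELECT rowid, hex(md5), tag FROM hashes WHERE tag = 7;

-- And it can address exactly one row.
DELETE FROM hashes WHERE rowid = 2;
```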

My advice is to go with a primary key if you want to have referential integrity. However, there is no issue with indexing a non-unique value. The only thing is that your performance will degrade a little.

What are the consequences of letting two identical rows somehow get into this table?
One consequence is, of course, wasted space. But I'm talking about something more fundamental, here. There are times when duplicate rows in data give you wrong results. For example, if you grouped by the int column (field), and listed the count of rows in each group, a duplicate row (record) might throw you off, depending on what you are really looking for.
Relational databases work better if they are based on relations. Relations are always in first normal form. The primary reason for declaring a primary key is to prevent the table from getting out of first normal form, and thus not representing a relation.

Where to place a primary key

To my knowledge SQL Server 2008 will only allow one clustered index per table. For the sake of this question let's say I have a list of user-submitted stories that contains the following columns.
ID (int, primary key)
Title (nvarchar)
Url (nvarchar)
UniqueName (nvarchar) This is the url slug (blah-blah-blah)
CategoryID (int, FK to Category table)
Most of the time stories will never be queried by ID. Most of the queries will be done either by the CategoryID or by the UniqueName.
I'm new to indexing, so I assumed that it would be best to place 2 nonclustered indexes on this table: one on UniqueName and one on CategoryID. After doing some reading about indexes, it seems like having a clustered index on UniqueName would be very beneficial. Considering UniqueName is... unique, would it be advantageous to place the primary key on UniqueName and get rid of the ID field? As for CategoryID, I assume a nonclustered index will do just fine.
Thanks.

In the first place, you can put the clustered index on the unique name; it doesn't have to be on the id field. If you do little or no joining to this table, you could get rid of the id. In any event, I would put a unique index on the unique name field (you may find in doing so that it isn't as unique as you thought it would be!).
If you do a lot of joining though, I would keep the id field, it is smaller and more efficient to join on.
Since you say you are new at indexing, I will point out that while primary keys have an index created automatically when they are defined, foreign keys do not. You almost always want to index your foreign key fields.

Just out of habit, I always create an Identity field "ID" like you have as the PK. It makes things consistent. If all "master" tables have a field named "ID" that is INT Identity, then it's always obvious what the PK is. Additionally, if I need to make a bridge entity, I'll be storing two (or more) columns of type INT instead of type nvarchar(). So in your example, I would keep ID as the PK and create a unique index on UniqueName.
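Put together, the suggestions so far (keep ID as the PK, unique index on UniqueName, index the foreign key) might look like this in T-SQL. The column lengths, constraint and index names are assumptions, and a dbo.Category table with an ID primary key is assumed to exist already.

```sql
CREATE TABLE dbo.Stories (
    ID         INT IDENTITY(1,1) NOT NULL,
    Title      NVARCHAR(200)     NOT NULL,
    Url        NVARCHAR(400)     NOT NULL,
    UniqueName NVARCHAR(200)     NOT NULL,   -- the url slug
    CategoryID INT               NOT NULL,
    CONSTRAINT PK_Stories PRIMARY KEY CLUSTERED (ID),
    CONSTRAINT FK_Stories_Category
        FOREIGN KEY (CategoryID) REFERENCES dbo.Category (ID)
);

-- Enforces uniqueness of the slug and serves lookups by UniqueName.
CREATE UNIQUE NONCLUSTERED INDEX UX_Stories_UniqueName
    ON dbo.Stories (UniqueName);

-- Foreign keys are not indexed automatically.
CREATE NONCLUSTERED INDEX IX_Stories_CategoryID
    ON dbo.Stories (CategoryID);
```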

Data is stored in order of the clustered key; if you are going to retrieve data mainly by one of those fields, it would be advantageous to cluster on it, provided the values don't arrive so far out of order that the index becomes significantly fragmented, which can slow down insert performance.
On the other hand, if this table is joined to a lot on the ID, it probably makes more sense to keep the clustered key on the PK.

Generally it's best to index a table on an identity key and use this as the clustered index. There's a simple rule of thumb here:
Don't use a meaningful column as primary index
The reason for this is that putting a PK on a meaningful column tends to give rise to maintenance issues. It's a rule of thumb, so it can be overridden should circumstances dictate, but usually it's best to work from the assumed default position of each table being indexed by a (clustered) meaningless identity column. This tends to be more efficient for joins, and as it's the default design that most DBAs will adopt, it won't raise any eyebrows or cause any issues because the system isn't as the next DBA might assume. Meaningless PKs are invariably more flexible and can be adapted more easily to changing circumstances than otherwise.
When to override the rule? Only if you do envisage performance issues. For most databases with reasonable loads on modern hardware, suitably indexed, you will not have any trouble if you're not squeezing the last millisecond of performance out of them by clustering the optimal index. DBA and programmer cycles are much more expensive than CPU cycles, and if you'll only shave the odd millisecond or so off your queries by adopting a different strategy then it's just not worth it. However, should you be looking at a table approaching a million rows, that's a different matter. It depends very much on circumstances, but generally if I'm designing a database with tables of less than 100,000 rows I will lean heavily towards designing for flexibility, ease of writing stable queries, and along the principles any other designer would expect to see. Over a million rows, I design for performance. Between 100k and a million it's a matter of judgement.

There is no requirement or necessity to have a clustered index at all, primary key or otherwise. It's a performance optimisation tool, like all indexing strategies, and should be applied when an improvement can be gained by using it.
As already mentioned, because the table is physically sorted according to the clustered index key, it's a Highlander situation: there can only be one!
Clustered indexes are mostly useful for situations such as:
you regularly need to retrieve a set of rows whose values for a given column are in a range, so columns that are often the subject of a BETWEEN clause are interesting; or
most of your single-row hits in the table occur in an area that can be described by a subset of the values of a key.
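As an illustration of the range case, here is a hypothetical Orders table in T-SQL (not from the question) that keeps a nonclustered primary key and clusters on the date column that is usually queried with BETWEEN.

```sql
CREATE TABLE dbo.Orders (
    OrderID   INT      NOT NULL,
    OrderDate DATETIME NOT NULL,
    Amount    MONEY    NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED (OrderID)
);

-- Rows are stored in OrderDate order, so a date range is one contiguous read.
CREATE CLUSTERED INDEX CIX_Orders_OrderDate ON dbo.Orders (OrderDate);

-- Note: the upper bound is midnight at the start of 31 January.
SELECT OrderID, Amount
FROM dbo.Orders
WHERE OrderDate BETWEEN '20240101' AND '20240131';
```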
I thought that they were particularly un-useful for situations like high-volume transaction systems with very frequent inserts where a sequential key is the clustered column: you'd get a gang of processes all trying to insert at the same physical location (a "hot-spot"). Turns out, as was commented here before this edit, that I'm sadly out-of-date and showing my age. See this post on the topic by Kimberly Tripp, which says it all much better.
Sequential numeric "ID" columns are generally not good candidate columns. Names can be good, dates likewise - if carefully considered.
