What's a MySQL index table? - sql

I need to speed up a query. Is an index table what I'm looking for? If so, how do I make one? Do I have to update it each insert?
Here are the table schemas:
--table1-- | --tableA-- | --table2--
id | id | id
attrib1 | t1id | attrib1
attrib2 | t2id | attrib2
| attrib1 |
And the query:
SELECT
table1.attrib1,
table1.attrib2,
tableA.attrib1
FROM
table1,
tableA
WHERE
table1.id = tableA.t1id
AND (tableA.t2id = x or ... or tableA.t2id = z)
GROUP BY
table1.id

You need to create a composite index on tableA:
CREATE INDEX ix_tablea_t1id_t2id ON table_A (t1id, t2id)
Indexes in MySQL are considered a part of a table: they are updated automatically, and used automatically whenever the optimizer decides it's a good move to use them.
MySQL does not use the term index table.
This term is used by Oracle to refer to what other databases call CLUSTERED INDEX: a kind of table where the records themselves are arranged according to the value of a column (or a set of columns).
In MySQL:
When you use MyISAM storage, an index is created as a separate file that has .MYI extension.
The contents of this file represent a B-Tree, each leaf containing the index key and a pointer to the offset in .MYD file which contains the data.
The size of the pointer is determined by the server setting called myisam_data_pointer_size, which can vary from 2 to 7 bytes, and defaults to 6 since MySQL 5.0.6.
This allows creating MyISAM tables up to 2 ^ (8 * 6) bytes = 256 TB
In InnoDB, all tables are inherently ordered by the PRIMARY KEY, it does not support heap-organized tables.
Each index, therefore, in fact is just a plain InnoDB table consisting of a single PRIMARY KEY of N+M records: N records being an indexed value, and M records being a PRIMARY KEY of the main table record which holds the indexed data.

Related

SQL: String column to unique integer?

I have a table place2022 which has a very long CHAR column
timestamp | user_id | pixel_color | coordinate
-----------------+------------------------------------------------------------------------------------------+-------------+------------
17:38:20.021+00 | p0sXpmkcmg1KLiCdK5e4xKdudb1f8cjscGs35082sKpGBfQIw92nZ7yGvWbQ/ggB1+kkRBaYu1zy6n16yL/yjA== | #FF4500 | 371,488
17:38:20.024+00 | Ctar52ln5JEpXT+tVVc8BtQwm1tPjRwPZmPvuamzsZDlFDkeo3+ItUW89J1rXDDeho6A4zCob1MKmJrzYAjipg== | #51E9F4 | 457,493
17:38:20.025+00 | rNMF5wpFYT2RAItySLf9IcFZwOhczQhkRhmTD4gv0K78DpieXrVUw8T/MBAZjj2BIS8h5exPISQ4vlyzLzad5w== | #000000 | 65,986
17:38:20.025+00 | u0a7l8hHVvncqYmav27EARAE6ciLtpUTPXMI33lDrUmtj5Ei3ixlfRuG28KUvs7r5LpeiE/iOKPALVjkILhrYg== | #3690EA | 73,961
The user_ids are already hashes, so all I really care about here is having some sort of id column which is 1-1 with the user_id.
I've counted the number of unique user_ids, which is 10381163, which fits into 24 bits. Therefore, I can compress the id field down to a 32-bit integer using the obvious scheme of "Assign 1 to the first new user_id you see, 2 to the second new user_id you see", etc. I don't even care that the user_id's are mapped in the order that they're seen: I just need them to be mapped in an invertible manner to 32-bit ints somehow. I'd also like to persist this mapping somewhere so that, if I want to, I can go backwards.
What would be the best way to achieve this? I imagine that we could create a new table (create table place2022_user_ids as select distinct(user_id) from place2022;?) and then reverse-lookup the user_id column in that table, but I don't know quite how to formulate the queries and also make sure that I'm not doing something ridiculously slow.
I am using postgresql, if it matters.
If you have a recent (>8) version of Postgres you can add an auto increment id column to an existing table.
ALTER TABLE place2022
ADD COLUMN id SERIAL PRIMARY KEY;
NB If the existing column is a PRIMARY KEY you will need to drop it first.
See drop primary key constraint in postgresql by knowing schema and table name only

How to shard from existing data in a table in Postgresql

I have a large table inter, which contains 50 billion rows. Each row consists of two columns, both of them are actually foreign keys of IDs of the other two tables(just the relation, foreign key constraints were not set in the database).
My table structure is like:
create table test_1(
id integer primary key,
content varchar(300),
content_len integer
);
create index test_1_id_len on test_1(id, content_len);
--this has 1.5 billion rows.
-- example row1: 1, 'alskfnla', 8
-- example row2: 1, 'asdgaagder', 10
-- example row3: 1, 'dsafnlakdsvn', 12
create table test_2(
id integer primary key,
split_str char(3)
);
--this has 60,000 rows.
-- example row1: 1, 'abc'
-- example row2: 2, 'abb'
create table inter(
id_1 integer, -- id of test_1
id_2 integer -- id of test_2
);
create index test_index_1 on inter(id_1);
create index test_index_2 on inter(id_2);
create index test_index_1_2 on inter(id_1, id_2);
--this has 50 billion rows.
-- example row1: 1, 2
-- example row2: 1, 3
-- example row3: 1, 4
Further, I need to do some queries like
select *
from inter
inner join test_1 on(test_1.id = inter.id_1)
where id_2 in (1,2,3,4,5,67,8,9,10)
and test_1.content_len = 30
order by id_2;
The reason why I want to shard the table is that I could not create indices on the two columns( the transactions did not end for one week, and it occupied full virtual memory).
SO I am considering to shard the table by one of the columns. This column has around 60,000 values, from 1 to 60,000. I would like to shard the table to 60,000 subtables. I do some searches, but most of the articles do it by a trigger, which could not be applied in my case since the data are already in the table. Does anyone know how to do that, thanks a lot!
ENV: redhat, RAM 180GB, postgresql 11.0
You don't want to shard the table, but partition it.
60000 partitions is too many. Use list partitioning to split the table in something like at most 600 partitions. Make sure to upgrade to PostgreSQL v12 so that you can benefit from the latest performance improvements.
The hard part will be moving the data without eexcessive downtime. Perhaps you can use triggers to capture changes while you INSERT INTO ... SELECT and catch up later.

Updating / deleting arbitrary rows in column without primary key

I am building a tool that will display all the tables in a given PostgreSQL database (client's legacy app), then the user would dig in and can see all the data in given table. It is essentially a database viewer.
Next step will be to allow user to update each row, in a similar manner to how one updates data in Airtable.
While for most columns I will have the primary keys so I can use to build appropriate Update ... where ID=? statements, I realized that may not be the case always. For some join tables, for example, I do not have the ID or any other primary key.
I still would like to have the functionality where the user looks at the grid of data displayed from such columns, selects a row with click of mouse and provides new values.
PostgreSQL used to use OIDs to uniquelly identify rows for such cases, but this is no longer the case even for the legacy database I am dealing with.
The only solution I can think of is using the offset/sort order to figure out which row is to be updated, but this leads to race conditions if sort changes in the meantime or the user deletes/adds some rows.
Any ideas how I can update such "anonymous" rows?
Each table in Postgres has a system column ctid which unambiguously identifies a row. Example:
drop table if exists my_table;
create table my_table(id int, str text);
insert into my_table values
(1, 'one'),
(1, 'two'),
(2, 'one');
select ctid, *
from my_table;
ctid | id | str
-------+----+-----
(0,1) | 1 | one
(0,2) | 1 | two
(0,3) | 2 | one
(3 rows)
You can use the column in delete or update:
delete from my_table
where ctid = '(0,2)'
returning *
id | str
----+-----
1 | two
(1 row)
DELETE 1
Note however, that there is no guarantee that a row has always the same ctid, per the documentation:
ctid
The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. The OID, or even better a user-defined serial number, should be used to identify logical rows.

How to get rid of gaps in rowid numbering after deleting rows?

Table tmp :
CREATE TABLE if not exists tmp (
id INTEGER PRIMARY KEY,
name TEXT NOT NULL);
I inserted 5 rows. select rowid,id,name from tmp; :
rowid
id
name
1
1
a
2
2
b
3
3
c
4
4
d
5
5
e
Now I delete rows with id 3 and 4 and run above query again:
rowid
id
name
1
1
a
2
2
b
5
5
e
rowid is not getting reset and leaves holes. Even after vacuum it doesn't reset rowid.
I want :
rowid
id
name
1
1
a
2
2
b
3
5
e
How to achieve above output?
I assume you already know a little about rowid, since you're asking about its interaction with the VACUUM command, but this may be useful information for future readers:
rowid is a special column available in all tables (unless you use WITHOUT ROWID), used internally by sqlite. A VACUUM is supposed to rebuild the table, aiming to reduce fragmentation in the database file, and may change the values of the rowid column. Moving on.
Here's the answer to your question: rowid is really special. So special that if you have an INTEGER PRIMARY KEY, it becomes an alias for the rowid column. From the docs on rowid:
With one exception noted below, if a rowid table has a primary key that consists of a single column and the declared type of that column is "INTEGER" in any mixture of upper and lower case, then the column becomes an alias for the rowid. Such a column is usually referred to as an "integer primary key". A PRIMARY KEY column only becomes an integer primary key if the declared type name is exactly "INTEGER". Other integer type names like "INT" or "BIGINT" or "SHORT INTEGER" or "UNSIGNED INTEGER" causes the primary key column to behave as an ordinary table column with integer affinity and a unique index, not as an alias for the rowid.
This makes your primary key faster than it would've been otherwise (presumably because there's no lookup from your primary key to rowid):
The data for rowid tables is stored as a B-Tree structure containing one entry for each table row, using the rowid value as the key. This means that retrieving or sorting records by rowid is fast. Searching for a record with a specific rowid, or for all records with rowids within a specified range is around twice as fast as a similar search made by specifying any other PRIMARY KEY or indexed value.
Of course, when your primary key is an alias for rowid, it would be terribly inconvenient if this could change. Since rowid is now aliased to your application data, it would not be acceptable for sqlite to change it.
Hence, this little note in the VACUUM docs:
The VACUUM command may change the ROWIDs of entries in any tables that do not have an explicit INTEGER PRIMARY KEY.
If you really really really absolutely need the rowid to change on a VACUUM (I don't see why -- feel free to discuss your reasons in the comments, I may have some suggestions), you can avoid this aliasing behavior. Note that it will decrease the performance of any table lookups using your primary key.
To avoid the aliasing, and degrade your performance, you can use INT instead of INTEGER when defining your key:
A PRIMARY KEY column only becomes an integer primary key if the declared type name is exactly "INTEGER". Other integer type names like "INT" or "BIGINT" or "SHORT INTEGER" or "UNSIGNED INTEGER" causes the primary key column to behave as an ordinary table column with integer affinity and a unique index, not as an alias for the rowid.
I found a solution for some case. I don't know why, but this worked.
1.Rename column "id" to any other name (not PRIMARY KEY) or delete this column because you have already "rowid".
CREATE TABLE if not exists tmp (
my_i INTEGER NOT NULL,
name TEXT NOT NULL);
2.Insert 5 rows in it.
select rowid,* from tmp;
rowid my_i name
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
3.Delete rows with rowid 3 and 4 and run above query again.
DELETE FROM tmp WHERE rowid = 3;
DELETE FROM tmp WHERE rowid = 4;
select rowid,* from tmp;
rowid my_i name
1 1 a
2 2 b
5 5 e
4.Run SQL
VACUUM;
5.Run SQL
select rowid,* from tmp;
The output:
rowid my_i name
1 1 a
2 2 b
3 5 e
You must define all data from database to new array / list.After that you must delete table and rewrite all data from array / list to database .
Check it ;
https://stackoverflow.com/a/57862686/8363647
I don't get why there is so much hesitance in illustrating the answer here.
If there are any tips or specific examples y'all could provide on how or why or when to be weary of usage, we'd all appreciate it.
Here is how I solved my problem, similar to OP.
c.execute(f"DELETE FROM customers WHERE rowid = ({id})")
print(f"deleted {id}")
conn.commit()
c.execute("VACUUM")
conn.close()

Indexes and multi column primary keys

In a MySQL database I have a table with the following primary key
PRIMARY KEY id (invoice, item)
In my application I will also frequently be selecting on item by itself and less frequently on only invoice. I'm assuming I would benefit from indexes on these columns.
MySQL does not complain when I define the following:
INDEX (invoice),
INDEX (item),
PRIMARY KEY id (invoice, item)
But I don't see any evidence (using DESCRIBE -- the only way I know how to look) that separate indexes have been established for these two columns.
Are the columns that make up a primary key automatically indexed individually? Is there a better way than DESCRIBE to explore the structure of my table?
I'm not intimately familiar with the internals of indices on mySql, but on the two database vendor products that I am familiar with (MsSQL, Oracle) indices are balanced-Tree structures, whose nodes are organized as a sequenced tuple of the columns the index is defined on (In the Sequence Defined)
So, unless mySql does it very differently, (probably not), any composite index (on more than one column) can be useable by any query that needs to filter or sort by a subset of the columns in the index, as long as the list of columns is compatible, i.e., if the columns, when sequenced the same as the sequenced list of columns in the complete index, is an ordered subset of the complete set of index columns, which starts at the beginning of the actual index sequence, with no gaps except at the end...
In other words, this means that if you have an index on (a,b,c,d) a query that filters on (a), (a,b), or (a,b,c) can also use the index, but a query that needs to filter on (b), or (c) or (b,c) will not be able to use the index...
So in your case, if you often need to filter or sort on column item alone, you need to add another index on that column by itself...
I personally use phpMyAdmin to view and edit the structure of MySQL databases. It is a web application but it runs well enough on a local web server (I run an instance of apache on my machine for this and phpPgAdmin).
As for the composite key of (invoice, item), it acts like an index for (invoice, item) and for invoice. If you want to index by just item you have to add that index yourself. Your PK will be sorted by invoice and then by item where invoice is the same in multiple records. While the order in a composite PK does not matter for uniqueness enforcement, it does matter for access.
On your table I would use:
PRIMARY KEY id (invoice, item), INDEX (item)
I'm not that familiar with MySQL, but generally an multiple-column index is equally useful on the first column in the index as an index on that column alone. The multiple-column index becomes less useful for querying against a single column the further the column appears into the index.
This makes some sense if you think of the multi-column index as a hierarchy. The first column in the index is the root of the hierarchy, so searching it is just a matter of scanning that first level. However, in order to scan the second column, the database has to look up the tree for each unique value found in the first column. This can be costly enough that most optimizers won't bother to look deeply into a multi-column index, instead opting to full-table-scan.
For example, if you have a table as follows:
Col1 |Col2 |Col3
----------------
A | 1 | Z
A | 2 | Y
A | 2 | X
B | 1 | Z
B | 2 | X
Assuming you have an index on all three columns, in order, the tree will look something like this:
A
+-1
+-Z
+-2
+-X
+-Y
B
+-1
+-Z
+-2
+-X
Looking for Col1='A' is easy: you only have to look at 2 ordered values. However, to resolve col3='X', you have to look at all of the values in the 4 bigger buckets, each of which is ordered individually.
To return table index information, you can use:
SHOW INDEX FROM <table>;
See: http://dev.mysql.com/doc/refman/5.0/en/show-index.html
To view table information:
SHOW CREATE TABLE <table>;
See: http://dev.mysql.com/doc/refman/5.0/en/show-create-table.html
Primary keys are indexes, so there's no need to create additional indexes. You can find out more information about them under the CREATE TABLE syntax (there's too much to insert here):
http://dev.mysql.com/doc/refman/5.0/en/create-table.html
There is a difference between composite index and composite primary key.
If you have defined a composite index like below
INDEX idx(invoice,item)
the index wont work if you query based on item and you need to add a separate index
INDEX itemidx(item)
But, if you have defined a composite primary key like below
PRIMARY KEY(invoice, item)
the index would work if you query based on item and no separate index is required.
Working example:
mysql>create table test ( col1 int(20), col2 int(20) ) primary key(col1,col2);
mysql>explain select * from test where col2 = 1;
+----+-------------+-------+-------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+--------------------------+
| 1 | SIMPLE | test | index | NULL | PRIMARY | 8 | NULL | 10 | Using where; Using index |
+----+-------------+-------+-------+---------------+---------+---------+------+------+--------------------------+
Mysql auto create an index for composite keys. Depending on your queries, you may have to create separate index for individual column in the composite key.
If you are using mysql workbench, you can manually right click the schema and click on edit to see everything about the table
If your query is using both columns in where clause then you don't need to create a separate index in a composite primary key.
EXPLAIN SELECT * FROM `table` WHERE invoice = 1 and item = 1
You are also fine if you want to query with first column only
EXPLAIN SELECT * FROM `table` WHERE invoice = 1
But if you want to query with subsequent columns col2, col3 in composite PK then you would need to create separate indexes on those columns. The following explain query shows the second column does not have a possible key detected by MySQL
EXPLAIN SELECT * FROM `table` WHERE item = 1