I have a large table with more than 10 indexes, and I have a fragmentation problem on one specific index. During the day, thousands of rows are inserted into this table, and the fragmentation of this one index grows very quickly. The other indexes are fine (maybe 0.01% per hour), but this specific index grows by 3-4% per hour, so it will probably reach 50-60% by the end of the day.
Can you help me figure out why this index fragments so quickly?
----- Fill factor
This specific index: 0%
Other indexes (which do not have this problem): 90%
----- Index details
non-clustered
2 index key columns: (bit and nvarchar(100) type columns)
1 included column: (int) FK_OrderID (foreign key for another table)
number of rows in the table: 6.5 million
size of the table: 6.2 GB
and DBCC SHOWCONTIG details for the table:
Pages Scanned................................: 805566
Extents Scanned..............................: 100877
Extent Switches..............................: 108951
Avg. Pages per Extent........................: 8.0
Scan Density [Best Count:Actual Count].......: 92.42% [100696:108952]
Logical Scan Fragmentation ..................: 1.43%
Extent Scan Fragmentation ...................: 19.82%
Avg. Bytes Free per Page.....................: 983.4
Avg. Page Density (full).....................: 87.85%
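For reference, the same per-index fragmentation numbers can also be read from sys.dm_db_index_physical_stats; a minimal sketch, with dbo.MyTable as a placeholder for the real table name:
SELECT index_id, avg_fragmentation_in_percent, avg_page_space_used_in_percent, page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.MyTable'), NULL, NULL, 'SAMPLED');
-- 'SAMPLED' reads a sample of pages; 'DETAILED' gives exact numbers at a higher cost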
Thanks!
I have resolved this issue by setting the fill factor to 80. Thanks for the replies.
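For anyone else hitting this, a minimal sketch of applying that fill factor (the index and table names are placeholders):
ALTER INDEX IX_ProblemIndex ON dbo.MyTable
REBUILD WITH (FILLFACTOR = 80);
-- leaves 20% free space on each leaf page so daytime inserts cause fewer page splits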
I am trying without success to calculate building heights in my city using the LIDAR satellite dataset.
System specs
CPU: Core i7 6700k 4200MHz, 4 cores, 8 threads
RAM: 32GB DDR4 3200MHz
SSD: 1TB Samsung 970 EVO
OS: Ubuntu 18.04
Postgres setup
I am using the latest version of Postgres (v12.1) with PostGIS, with the following tweaks recommended in various sources:
shared_buffers = 256MB
maintenance_work_mem = 4GB
max_parallel_maintenance_workers = 7
max_parallel_workers = 7
max_wal_size = 60GB
min_wal_size = 90MB
random_page_cost = 1.0
Database setup
In the lidar table I have more than 3,000 million rows, and in the buildings table more than 150,000 rows.
In the lidar table the GiST index was created: CREATE INDEX lidar_idx ON lidar USING GIST (geom);
building table: | gid | geom |
lidar table: | z | geom |
Height calculation
Currently, in order to calculate the height of a building, it is necessary to check whether each of the 3,000 million points (rows) is inside the area of each building, and then calculate the average of all the points found inside that building's area.
The queries I have tried take forever (probably 5 days or more), and I would like to simplify them so that I can estimate the height of each building from far fewer points, without having to compare against all 3,000 million records for every building.
For example:
For the building with id1, I would like to get only the first 100 records found inside the building geometry (ST_Within(l.geom, e.geom)), and once those 100 records are found, move on to the next building.
For the building with id2, I would like the same: only the first 100 records found inside that building's area.
And so on..
My main query is
SELECT e.gid, AVG(l.z) AS height
FROM lidar l,
     buildings e
WHERE ST_Within(l.geom, e.geom)
GROUP BY e.gid
I have tried another query, but I cannot get it to work.
SELECT e.gid, AVG(l.z), COUNT(1) FILTER (WHERE ST_Within(l.geom, e.geom)) AS gidc
FROM lidar l, buildings e
WHERE gidc < 100
GROUP BY e.gid
I don't think you really want to do this at all. You should first try to make the correct query faster rather than compromising correctness by working with an arbitrary (but not random) subset of the data.
But if you do want it, then you can use a lateral join, limiting the points inside the subquery before taking the average.
SELECT e.gid, h.height
FROM buildings e
CROSS JOIN LATERAL (
  SELECT AVG(sub.z) AS height
  FROM (SELECT l.z FROM lidar l WHERE ST_Within(l.geom, e.geom) LIMIT 100) sub
) h;
it is necessary to check if each one of the 3,000 million points (rows) is inside the area of each building and calculate the average of all the points found inside a building area.
This is exactly what a geometry index is for. You don't need to look at every point to get just the ones inside a building area. If you don't have the right index, such as on lidar using gist (geom), then the lateral join query will also be awful.
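One way to confirm the index is used before running the full job is to run the lateral query for a single building and inspect the plan; a sketch, assuming a building with gid = 1 exists:
EXPLAIN (ANALYZE, BUFFERS)
SELECT e.gid, h.height
FROM buildings e
CROSS JOIN LATERAL (
  SELECT AVG(sub.z) AS height
  FROM (SELECT l.z FROM lidar l WHERE ST_Within(l.geom, e.geom) LIMIT 100) sub
) h
WHERE e.gid = 1;
-- the plan should show an index scan on lidar_idx inside the lateral subquery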
I have the following PostgreSQL table with about 67 million rows, which stores the EOD prices for all the US stocks starting in 1985:
Table "public.eods"
Column | Type | Collation | Nullable | Default
--------+-----------------------+-----------+----------+---------
stk | character varying(16) | | not null |
dt | date | | not null |
o | integer | | not null |
hi | integer | | not null |
lo | integer | | not null |
c | integer | | not null |
v | integer | | |
Indexes:
"eods_pkey" PRIMARY KEY, btree (stk, dt)
"eods_dt_idx" btree (dt)
I would like to efficiently query the table above based on either the stock name or the date. The primary key of the table is stock name and date. I have also defined an index on the date column, hoping to improve performance for queries that retrieve all the records for a specific date.
Unfortunately, I see a big difference in performance for the queries below. While getting all the records for a specific stock takes a decent amount of time to complete (2 seconds), getting all the records for a specific date takes much longer (about 56 seconds). I have tried to analyze these queries using explain analyze, and I have got the results below:
explain analyze select * from eods where stk='MSFT';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on eods (cost=169.53..17899.61 rows=4770 width=36) (actual time=207.218..2142.215 rows=8364 loops=1)
Recheck Cond: ((stk)::text = 'MSFT'::text)
Heap Blocks: exact=367
-> Bitmap Index Scan on eods_pkey (cost=0.00..168.34 rows=4770 width=0) (actual time=187.844..187.844 rows=8364 loops=1)
Index Cond: ((stk)::text = 'MSFT'::text)
Planning Time: 577.906 ms
Execution Time: 2143.101 ms
(7 rows)
explain analyze select * from eods where dt='2010-02-22';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Index Scan using eods_dt_idx on eods (cost=0.56..25886.45 rows=7556 width=36) (actual time=40.047..56963.769 rows=8143 loops=1)
Index Cond: (dt = '2010-02-22'::date)
Planning Time: 67.876 ms
Execution Time: 56970.499 ms
(4 rows)
I really cannot understand why the second query runs 28 times slower than the first. They retrieve a similar number of records, and both seem to be using an index. Could somebody please explain this difference in performance, and can I do something to improve the queries that retrieve all the records for a specific date?
I would guess that this has to do with the data layout. I am guessing that you are loading the data by stk, so the rows for a given stk are on a handful of pages that pretty much only contain that stk.
So, the execution engine is only reading about 25 pages.
On the other hand, no single page contains two records for the same date. When you read by date, you have to read about 7,556 pages. That is, about 300 times the number of pages.
The scaling must also take into account the work for loading and reading the index. This should be about the same for the two queries, so the ratio is less than a factor of 300.
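One way to check that guess is to look at the planner's correlation statistics for the table from the question; a value near 1.0 for stk and near 0 for dt would match this layout:
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'eods' AND attname IN ('stk', 'dt');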
There can be multiple issues, so it is hard to say where the problem is. An index scan should usually be faster than a bitmap heap scan; if it is not, the following problems are possible:
unhealthy index - try to run REINDEX INDEX indexname
bad statistics - try to run ANALYZE tablename
suboptimal state of the table - try to run VACUUM tablename
too low or too high a setting of effective_cache_size
I/O issues - some systems have problems with heavy random I/O; try increasing random_page_cost
Investigating the cause is a little bit of alchemy, but it is possible - there is only a closed set of likely issues. A good start is:
VACUUM ANALYZE tablename
benchmark your I/O if possible (e.g. with bonnie++)
To find the difference, you'll probably have to run EXPLAIN (ANALYZE, BUFFERS) on the query so that you see how many blocks are touched and where they come from.
I can think of two reasons:
Bad statistics that make PostgreSQL believe that dt has a high correlation while it does not. If the correlation is low, a bitmap index scan is often more efficient.
To see if that is the problem, run
ANALYZE eods;
and see if that changes the execution plans chosen.
Caching effects: perhaps the first query finds all required blocks already cached, while the second doesn't.
At any rate, it might be worth experimenting to see if a bitmap index scan would be cheaper for the second query:
SET enable_indexscan = off;
Then repeat the query.
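A minimal sketch of that experiment, using the date query from the question:
ANALYZE eods;                -- refresh statistics first
SET enable_indexscan = off;  -- steer the planner toward a bitmap scan
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM eods WHERE dt = '2010-02-22';
RESET enable_indexscan;      -- restore the default afterwards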
We have two tables, Dispense and DispenseDetail. On Dispense there is a foreign key, PatientID.
Most processes and queries are in the context of a patient.
Currently the dispense table has a non-clustered index (DispenseID, PatientID).
The DispenseDetail has DispenseDetailID as primary key and a non-clustered index (DispenseID).
We are noticing some slowness caused by PAGEIOLATCH_SH waits as SQL Server has to bring data from disk into memory.
I am thinking about a clustered index (DispenseID, DispenseDetailID), which could help retrieve the dispense details of a particular patient, but it may slow down dispense inserts. Dispense inserts are more important, as without them there won't be data to query.
Will a non-clustered index (DispenseID,DispenseDetailID) help any?
Any comments or thoughts will be much appreciated.
Thanks!
Information for sqlonly
The database is on a VM with 4 physical CPUs, each with 6 cores (24 cores total), and 32 virtual CPUs.
The Dispense table has 4,000,000+ rows. The DispenseDetail table has 11,000,000+ rows.
I don't know how to compute or get the average page io latch wait time. Querying sys.dm_os_latch_stats and ordering by wait time, this is the result set:
latch_class waiting_requests_count wait_time_ms max_wait_time_ms
BUFFER 62658377 97584783 12051
ACCESS_METHODS_DATASET_PARENT 950195 7870081 19652
ACCESS_METHODS_HOBT_VIRTUAL_ROOT 799403 5071290 5692
BACKUP_OPERATION 785245 372930 206
LOG_MANAGER 7 40403 11235
ACCESS_METHODS_HOBT_COUNT 7959 19728 1587
NESTING_TRANSACTION_FULL 122342 7969 59
ACCESS_METHODS_ACCESSOR_CACHE 67877 5143 65
ACCESS_METHODS_BULK_ALLOC 1644 734 49
ACCESS_METHODS_HOBT 15 76 15
SPACEMGR_ALLOCEXTENT_CACHE 169 71 10
SPACEMGR_IAM_PAGE_RANGE_CACHE 68 49 4
NESTING_TRANSACTION_READONLY 1942 11 1
SERVICE_BROKER_WAITFOR_MANAGER 31 9 4
TRACE_CONTROLLER 1 1 1
APPEND_ONLY_STORAGE_FIRST_ALLOC 11 1 1
In dev I used the current indexes to get just DispenseID and DispenseDetailID for the patient; the outcome is an index seek. However, that result set must be inserted into a temp table to get the other fields, and the insert into the temp table is costly, so there is no net improvement.
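If the expensive part is fetching those other fields, one option to sketch is a covering non-clustered index; the INCLUDE columns below are hypothetical placeholders for whatever fields the temp table was needed for, not names from the post:
CREATE NONCLUSTERED INDEX IX_DispenseDetail_DispenseID_Covering
ON dbo.DispenseDetail (DispenseID)
INCLUDE (DispenseDetailID, Quantity, DrugID);  -- placeholder column names for the extra fields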
Thanks!
Hi, I'm curious why the index isn't used when the number of rows grows larger, even to just 100.
Here's the select with 10 rows in the table:
mydb> explain select * from data where user_id=1;
+-----------------------------------------------------------------------------------+
| QUERY PLAN |
|-----------------------------------------------------------------------------------|
| Index Scan using ix_data_user_id on data (cost=0.14..8.15 rows=1 width=2043) |
| Index Cond: (user_id = 1) |
+-----------------------------------------------------------------------------------+
EXPLAIN
Here's the select with 100 rows in the table:
mydb> explain select * from data where user_id=1;
+------------------------------------------------------------+
| QUERY PLAN |
|------------------------------------------------------------|
| Seq Scan on data (cost=0.00..44.67 rows=1414 width=945) |
| Filter: (user_id = 1) |
+------------------------------------------------------------+
EXPLAIN
How can I get the index to be used when there are 100 rows?
100 is not a large amount of data. Think 10,000 or 100,000 rows for a respectable amount.
To put it simply, records in a table are stored on data pages. A data page typically has about 8k bytes (it depends on the database and on settings). A major purpose of indexes is to reduce the number of data pages that need to be read.
If all the records in a table fit on one page, there is no need to reduce the number of pages being read. The one page will be read. Hence, the index may not be particularly useful.
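To see how many pages the table actually occupies (using the table name data from the question), pg_class can be queried:
SELECT relname, relpages, reltuples
FROM pg_class
WHERE relname = 'data';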
First of all, I am new to optimizing MySQL. In my web application (around 400 queries per second), I have a query that uses a GROUP BY that I can't avoid, and it is the cause of the temporary tables being created. My configuration was:
max_heap_table_size = 16M
tmp_table_size = 32M
The result: about 12.5% of temp tables were created on disk.
Then I changed my settings, according to this post
max_heap_table_size = 128M
tmp_table_size = 128M
The result: about 18% of temp tables were created on disk.
This was not the result I expected, and I do not understand why.
Is it wrong to set tmp_table_size = max_heap_table_size?
Should I not increase the size?
Query
SELECT images, id
FROM classifieds_ads
WHERE parent_category = '1' AND published='1' AND outdated='0'
GROUP BY aux_order
ORDER BY date_lastmodified DESC
LIMIT 0, 100;
EXPLAIN
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | classifieds_ads | ref | parent_category, published, combined_parent_oudated_published, oudated | combined_parent_oudated_published | 7 | const,const,const | 67552 | Using where; Using temporary; Using filesort |
"Using temporary" in the EXPLAIN report does not tell us that the temp table was on disk. It only tells us that the query expects to create a temp table.
The temp table will stay in memory if its size is less than tmp_table_size and less than max_heap_table_size.
Max_heap_table_size is the largest a table can be in the MEMORY storage engine, whether that table is a temp table or non-temp table.
Tmp_table_size is the largest a table can be in memory when it is created automatically by a query. But this can't be larger than max_heap_table_size anyway. So there's no benefit to setting tmp_table_size greater than max_heap_table_size. It's common to set these two config variables to the same value.
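A minimal sketch of setting both to the same value (64M is an arbitrary example, not a recommendation):
SET GLOBAL tmp_table_size      = 64 * 1024 * 1024;
SET GLOBAL max_heap_table_size = 64 * 1024 * 1024;
-- or equivalently in my.cnf under [mysqld]:
--   tmp_table_size      = 64M
--   max_heap_table_size = 64M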
You can monitor how many temp tables were created, and how many on disk like this:
mysql> show global status like 'Created%';
+-------------------------+-------+
| Variable_name | Value |
+-------------------------+-------+
| Created_tmp_disk_tables | 20 |
| Created_tmp_files | 6 |
| Created_tmp_tables | 43 |
+-------------------------+-------+
Note in this example, 43 temp tables were created, but only 20 of those were on disk.
When you increase the limits of tmp_table_size and max_heap_table_size, you allow larger temp tables to exist in memory.
You may ask, how large do you need to make it? You don't necessarily need to make it large enough for every single temp table to fit in memory. You might want 95% of your temp tables to fit in memory and only the remaining rare tables go on disk. Those last 5% might be very large -- a lot larger than the amount of memory you want to use for that.
So my practice is to increase tmp_table_size and max_heap_table_size conservatively. Then watch the ratio of Created_tmp_disk_tables to Created_tmp_tables to see if I have met my goal of making 95% of them stay in memory (or whatever ratio I want to see).
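On MySQL 5.7+ with performance_schema enabled, that ratio can be computed directly; a sketch:
SELECT (SELECT variable_value FROM performance_schema.global_status
        WHERE variable_name = 'Created_tmp_disk_tables')
     / (SELECT variable_value FROM performance_schema.global_status
        WHERE variable_name = 'Created_tmp_tables') AS disk_tmp_table_ratio;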
Unfortunately, MySQL doesn't have a good way to tell you exactly how large the temp tables were. That will vary per query, so the status variables can't show that, they can only show you a count of how many times it has occurred. And EXPLAIN doesn't actually execute the query so it can't predict exactly how much data it will match.
An alternative is Percona Server, which is a distribution of MySQL with improvements. One of these is to log extra information in the slow-query log. Included in the extra fields is the size of any temp tables created by a given query.