MariaDB Server Optimization - reducing I/O operations - sql

I need help with MariaDB server optimization. I have a lot of I/O operations (created tmp disk tables) and I want to reduce them.
Hardware: CPU 20 x 2197 MHz, RAM 50 GB, SSD disks in RAID 10
Software: 10.1.26-MariaDB-0+deb9u1 - Debian 9.1
The server handles WordPress databases (~1500).
[Stats and Warnings screenshots omitted]
Config:
key_buffer_size = 384M
max_allowed_packet = 5096M
thread_stack = 192K
thread_cache_size = 16
myisam_recover_options = BACKUP
max_connections = 200
table_cache = 12000
max_connect_errors = 20
open_files_limit = 30000
wait_timeout = 3600
interactive_timeout = 3600
query_cache_type = 0
query_cache_size = 0
query_cache_limit = 0
join_buffer_size = 2M
tmp_table_size = 1G
max_heap_table_size = 1G
table_open_cache = 15000
innodb_buffer_pool_size = 35G
innodb_buffer_pool_instances = 40

Are you using MyISAM? If you are, you shouldn't be. Convert any MyISAM tables outside the mysql schema to InnoDB and set key_buffer_size to 1M.
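For reference, a query against information_schema will list the offending tables so they can be converted one at a time (a sketch; the database and table names in the ALTER are placeholders):
-- List MyISAM tables outside the system schemas
SELECT table_schema, table_name
FROM information_schema.tables
WHERE engine = 'MyISAM'
  AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema');
-- Convert a table to InnoDB (this rebuilds the table, so do it during a quiet window)
ALTER TABLE some_database.some_table ENGINE = InnoDB;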
max_allowed_packet = 5G is absurdly high.
thread_cache_size = 0 is the recommended default. Unless you really know what you are doing and have measurements to back it up, you should leave it alone.
join_buffer_size is another setting you should almost certainly not be touching.
max_heap_table_size = 1G is almost certainly too large. If temporary tables that big are being created in memory with any regularity, your server will run out of memory, grind to a halt, and get OOM-killed anyway.
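As a side note on the original question, the spill-to-disk rate can be measured before and after any change with the standard status counters; a minimal check:
-- Compare on-disk temporary tables against all temporary tables created since startup
SHOW GLOBAL STATUS LIKE 'Created_tmp%tables';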

Yes, follow Gordan's advice.
Improve WP's indexing by following the advice here: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#speeding_up_wp_postmeta
Turn on the slowlog to identify what queries are the worst: http://mysql.rjweb.org/doc.php/mysql_analysis#slow_queries_and_slowlog
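If you want to enable the slowlog without a restart, something like this works on MariaDB 10.1 (a sketch; the log file path is only an example, and the same settings belong in my.cnf to survive restarts):
SET GLOBAL slow_query_log = ON;
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';
SET GLOBAL long_query_time = 1;  -- seconds; lower it once the worst offenders are fixed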

Related

RabbitMQ slow when opening a new connection

I have a rather busy RabbitMQ setup which at peak times becomes extremely slow at accepting new connections (RabbitMQ 3.9.14).
I've tried fine-tuning /etc/sysctl.conf following a guide found on the RabbitMQ website:
fs.file-max = 10000000
fs.nr_open = 10000000
fs.inotify.max_user_watches=524288
net.core.somaxconn = 4096
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_keepalive_time=30
net.ipv4.tcp_keepalive_intvl=10
net.ipv4.tcp_keepalive_probes=4
net.ipv4.ip_local_port_range = 10000 64000
net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1
net.netfilter.nf_conntrack_max=1048576
I also played around with the rabbitmq.conf options to see if anything would have an impact; unfortunately, that was not the case:
num_acceptors.tcp = 32
channel_max = 4096
tcp_listen_options.backlog = 512
tcp_listen_options.nodelay = true
tcp_listen_options.linger.on = true
tcp_listen_options.linger.timeout = 0
tcp_listen_options.sndbuf = 196608
tcp_listen_options.recbuf = 196608
collect_statistics_interval = 60000
Due to the nature of my setup (PHP), a new connection is created every time messages are published to RabbitMQ. I wish I could use long-lived connections, but that is beyond what PHP is designed for.
During peak activity, some connections take up to 7 seconds to open; once the connection is established, however, message publishing performance is just fine.
I feel like I've exhausted all the logical options that I'm aware of. Are there any other tweaks I can attempt in order to improve the connection performance of the node? The server load is low-ish, sitting at 15% at peak. Disabling the management interface had negligible impact.
Update: At first, when I updated to RabbitMQ 3.10.5, I thought the issue was solved; however, that was not the case, it just gave us a bit more headroom.
The real cause was our high churn rate (200+/s). During a conversation in the RabbitMQ Slack channel it became apparent that a high churn rate would block the event loop and cause the spikes seen above.
The solution for us was to use a proxy to re-use connections instead of opening a new one every time we publish something:
https://github.com/cloudamqp/amqproxy
This has effectively resolved our issue.

Using multiple threads for DB updates results in higher write time per update

So I have a script that is supposed to update a giant table (Postgres). Since the table has about 150M rows and I want to complete this as fast as possible, using multiple threads seemed like a perfect answer. However, I'm seeing something very weird.
When I use a single thread, the write time for an update is much, much lower than when I use multiple threads.
require 'sequel'
.....
DB = Sequel.connect(DB_CREDS)
queue = Queue.new
read_query = DB["
SELECT id, extra_fields
FROM objects
WHERE XYZ IS FALSE
"]
read_query.use_cursor(:rows_per_fetch => 1000).each do |row|
queue.push(row)
end
Up until this point, IMO it shouldn't matter, because we're just reading from the DB; it has nothing to do with writing. From here, I tried two approaches: single-threaded and multi-threaded.
NOTE - This is not the actual UPDATE query that I want to execute; it's just a pseudo one for demonstration purposes. The actual query is a lot longer and plays with JSON and stuff, so I can't really update the entire table using a single query.
Single-threaded
until queue.empty?
photo = queue.shift
id = photo[:id]
update_query = DB["
UPDATE objects
SET XYZ = TRUE
WHERE id = #{id}
"]
result = update_query.update
end
If I execute this, I see in my DB logs that each update query takes less than 0.01 seconds:
I, [2016-08-15T10:45:48.095324 #54495] INFO -- : (0.001441s) UPDATE
objects SET XYZ = TRUE WHERE id = 84395179
I, [2016-08-15T10:45:48.103818 #54495] INFO -- : (0.008331s) UPDATE
objects SET XYZ = TRUE WHERE id = 84395181
I, [2016-08-15T10:45:48.106741 #54495] INFO -- : (0.002743s) UPDATE
objects SET XYZ = TRUE WHERE id = 84395182
Multi-threaded
MAX_THREADS = 5
num_threads = 0
all_threads = []
until queue.empty?
if num_threads < MAX_THREADS
photo = queue.shift
num_threads += 1
all_threads << Thread.new {
id = photo[:id]
update_query = DB["
UPDATE objects
SET XYZ = TRUE
WHERE id = #{id}
"]
result = update_query.update
num_threads -= 1
Thread.exit
}
end
end
all_threads.each do |thread|
thread.join
end
Now, in theory it should be faster, right? But each update takes about 0.5 seconds. I'm surprised that this is the case.
I, [2016-08-15T11:02:10.992156 #54583] INFO -- : (0.414288s)
UPDATE objects
SET XYZ = TRUE
WHERE id = 119498834
I, [2016-08-15T11:02:11.097004 #54583] INFO -- : (0.622775s)
UPDATE objects
SET XYZ = TRUE
WHERE id = 119498641
I, [2016-08-15T11:02:11.097074 #54583] INFO -- : (0.415521s)
UPDATE objects
SET XYZ = TRUE
WHERE id = 119498826
Any ideas on:
Why this is happening?
How can I increase the update speed for the multi-threaded approach?
Have you configured Sequel so that it has a connection pool of 5 connections?
Have you considered doing multiple updates per call via an IN clause?
If you haven't done 1, you have N threads fighting over fewer than N connections, which equates to resource starvation, a classic concurrency issue.
Your example can be reduced to: DB[:objects].where(:XYZ=>false).update(:XYZ=>true)
I'm guessing your actual need is not that simple. But the same approach may still work. Instead of issuing a query per row, use a single query to update all related rows.
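For illustration, the batched-update idea from the comments expressed as plain SQL (the ids are placeholders standing in for one fetched batch):
-- One statement per batch of ids instead of one statement per row
UPDATE objects
SET XYZ = TRUE
WHERE id IN (84395179, 84395181, 84395182);  -- extend the list to ~1000 ids per batch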
I went through something similar on a project ("import all history from a legacy database into a new one with a completely different structure and organization"). Unless you managed to shoot yourself in the foot somewhere else, you have two basic bottlenecks to look for:
the database's disk IO
the ruby process' CPU
Some suggestions:
database IO: use DB transactions and update 1000 records per transaction (you can tweak the exact number, but 1000 is usually good). A huge DB table usually means a lot of indexes too, and every update triggers index maintenance and autovacuum work within the DB, which results in a significant drop in update speed. A transaction basically allows you to push 1000 updated records through without paying that cost row by row and then perform the work once at the end; the result is MUCH faster (something like an order of magnitude). See the SQL sketch after this list.
database IO: change indexes. Drop every index you can live without during the update process; ideally you will have only one very streamlined index that allows unique row lookups for update purposes.
ruby CPU: unless you are using JRuby or Rubinius, or REALLY paying the price of network latency to your DB, threads will not give you much benefit; use fork/processes instead (see GIL). You did a great job choosing Sequel over AR for this.
ruby CPU: if you decide to go with threads + JRuby, don't forget to plug in JProfiler. It's amazing at tracing bottlenecks in Java, and the author of Sidekiq swears it is amazing for JRuby too. Unfortunately, AFAIK, there is no equivalent of JProfiler for C Ruby (there are profiling tools, but nowhere near as useful).
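A minimal sketch of the transaction batching suggested above, assuming plain single-row updates are kept (the ids and the batch size are illustrative):
BEGIN;
UPDATE objects SET XYZ = TRUE WHERE id = 84395179;
UPDATE objects SET XYZ = TRUE WHERE id = 84395181;
-- ... roughly 1000 single-row updates per batch ...
COMMIT;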
After you implement these suggestions, you'll know you did all you could when:
all of the CPUs on the Ruby box are at 100% load
the hard disk IO of the DB is at 100% throughput
Find this sweet spot and don't add additional Ruby update threads/processes beyond it (add more hardware instead), and that's that.
PS: check out https://github.com/ruby-concurrency/concurrent-ruby - it's a great parallelization lib.

Wrong balance between Aerospike instances in cluster

I have an application with a high load of batch read operations. My Aerospike cluster (v3.7.2) has 14 servers in Google Cloud, each with 7 GB RAM and 2 CPUs.
Looking at the Google Cloud Monitoring graphs, I noticed a very unbalanced load between servers: some servers have almost 100% CPU load, while others have less than 50%. Even after hours of operation, the unbalanced pattern doesn't change.
Is there any configuration I could change to make this cluster more homogeneous? How can I optimize node balancing?
Edit 1
All servers in the cluster have the same aerospike.conf file:
# Aerospike database configuration file.
service {
user root
group root
paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
paxos-recovery-policy auto-reset-master
pidfile /var/run/aerospike/asd.pid
service-threads 32
transaction-queues 32
transaction-threads-per-queue 32
batch-index-threads 32
proto-fd-max 15000
batch-max-requests 200000
}
logging {
# Log file must be an absolute path.
file /var/log/aerospike/aerospike.log {
context any info
}
}
network {
service {
#address any
port 3000
}
heartbeat {
mode mesh
mesh-seed-address-port 10.240.0.6 3002
mesh-seed-address-port 10.240.0.5 3002
port 3002
interval 150
timeout 20
}
fabric {
port 3001
}
info {
port 3003
}
}
namespace test {
replication-factor 3
memory-size 5G
default-ttl 0 # 30 days, use 0 to never expire/evict.
ldt-enabled true
storage-engine device {
file /data/aerospike.dat
write-block-size 1M
filesize 180G
}
}
Edit 2:
$ asinfo
1 : node
BB90600F00A0142
2 : statistics
cluster_size=14;cluster_key=E3C3672DCDD7F51;cluster_integrity=true;objects=3739898;sub-records=0;total-bytes-disk=193273528320;used-bytes-disk=26018492544;free-pct-disk=86;total-bytes-memory=5368709120;used-bytes-memory=239353472;data-used-bytes-memory=0;index-used-bytes-memory=239353472;sindex-used-bytes-memory=0;free-pct-memory=95;stat_read_reqs=2881465329;stat_read_reqs_xdr=0;stat_read_success=2878457632;stat_read_errs_notfound=3007093;stat_read_errs_other=0;stat_write_reqs=551398;stat_write_reqs_xdr=0;stat_write_success=549522;stat_write_errs=90;stat_xdr_pipe_writes=0;stat_xdr_pipe_miss=0;stat_delete_success=4;stat_rw_timeout=1862;udf_read_reqs=0;udf_read_success=0;udf_read_errs_other=0;udf_write_reqs=0;udf_write_success=0;udf_write_err_others=0;udf_delete_reqs=0;udf_delete_success=0;udf_delete_err_others=0;udf_lua_errs=0;udf_scan_rec_reqs=0;udf_query_rec_reqs=0;udf_replica_writes=0;stat_proxy_reqs=7021;stat_proxy_reqs_xdr=0;stat_proxy_success=2121;stat_proxy_errs=4739;stat_ldt_proxy=0;stat_cluster_key_err_ack_dup_trans_reenqueue=607;stat_expired_objects=0;stat_evicted_objects=0;stat_deleted_set_objects=0;stat_evicted_objects_time=0;stat_zero_bin_records=0;stat_nsup_deletes_not_shipped=0;stat_compressed_pkts_received=0;err_tsvc_requests=110;err_tsvc_requests_timeout=0;err_out_of_space=0;err_duplicate_proxy_request=0;err_rw_request_not_found=17;err_rw_pending_limit=19;err_rw_cant_put_unique=0;geo_region_query_count=0;geo_region_query_cells=0;geo_region_query_points=0;geo_region_query_falsepos=0;fabric_msgs_sent=58002818;fabric_msgs_rcvd=57998870;paxos_principal=BB92B00F00A0142;migrate_msgs_sent=55749290;migrate_msgs_recv=55759692;migrate_progress_send=0;migrate_progress_recv=0;migrate_num_incoming_accepted=7228;migrate_num_incoming_refused=0;queue=0;transactions=101978550;reaped_fds=6;scans_active=0;basic_scans_succeeded=0;basic_scans_failed=0;aggr_scans_succeeded=0;aggr_scans_failed=0;udf_bg_scans_succeeded=0;udf_bg_scans_failed=0;batch_index_initiate=40457778;batch_index_queue=0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0;batch_index_complete=40456708;batch_index_timeout=1037;batch_index_errors=33;batch_index_unused_buffers=256;batch_index_huge_buffers=217168717;batch_index_created_buffers=217583519;batch_index_destroyed_buffers=217583263;batch_initiate=0;batch_queue=0;batch_tree_count=0;batch_timeout=0;batch_errors=0;info_queue=0;delete_queue=0;proxy_in_progress=0;proxy_initiate=7021;proxy_action=5519;proxy_retry=0;proxy_retry_q_full=0;proxy_unproxy=0;proxy_retry_same_dest=0;proxy_retry_new_dest=0;write_master=551089;write_prole=1055431;read_dup_prole=14232;rw_err_dup_internal=0;rw_err_dup_cluster_key=1814;rw_err_dup_send=0;rw_err_write_internal=0;rw_err_write_cluster_key=0;rw_err_write_send=0;rw_err_ack_internal=0;rw_err_ack_nomatch=1767;rw_err_ack_badnode=0;client_connections=366;waiting_transactions=0;tree_count=0;record_refs=3739898;record_locks=0;migrate_tx_objs=0;migrate_rx_objs=0;ongoing_write_reqs=0;err_storage_queue_full=0;partition_actual=296;partition_replica=572;partition_desync=0;partition_absent=3228;partition_zombie=0;partition_object_count=3739898;partition_ref_count=4096;system_free_mem_pct=61;sindex_ucgarbage_found=0;sindex_gc_locktimedout=0;sindex_gc_inactivity_dur=0;sindex_gc_activity_dur=0;sindex_gc_list_creation_time=0;sindex_gc_list_deletion_time=0;sindex_gc_objects_validated=0;sindex_gc_garbage_found=0;sindex_gc_garbage_cleaned=0;system_swapping=false;err_replica_null_node=0;err_r
eplica_non_null_node=0;err_sync_copy_null_master=0;storage_defrag_corrupt_record=0;err_write_fail_prole_unknown=0;err_write_fail_prole_generation=0;err_write_fail_unknown=0;err_write_fail_key_exists=0;err_write_fail_generation=0;err_write_fail_generation_xdr=0;err_write_fail_bin_exists=0;err_write_fail_parameter=0;err_write_fail_incompatible_type=0;err_write_fail_noxdr=0;err_write_fail_prole_delete=0;err_write_fail_not_found=0;err_write_fail_key_mismatch=0;err_write_fail_record_too_big=90;err_write_fail_bin_name=0;err_write_fail_bin_not_found=0;err_write_fail_forbidden=0;stat_duplicate_operation=53184;uptime=1001388;stat_write_errs_notfound=0;stat_write_errs_other=90;heartbeat_received_self=0;heartbeat_received_foreign=145137042;query_reqs=0;query_success=0;query_fail=0;query_abort=0;query_avg_rec_count=0;query_short_running=0;query_long_running=0;query_short_queue_full=0;query_long_queue_full=0;query_short_reqs=0;query_long_reqs=0;query_agg=0;query_agg_success=0;query_agg_err=0;query_agg_abort=0;query_agg_avg_rec_count=0;query_lookups=0;query_lookup_success=0;query_lookup_err=0;query_lookup_abort=0;query_lookup_avg_rec_count=0
3 : features
cdt-list;pipelining;geo;float;batch-index;replicas-all;replicas-master;replicas-prole;udf
4 : cluster-generation
61
5 : partition-generation
11811
6 : edition
Aerospike Community Edition
7 : version
Aerospike Community Edition build 3.7.2
8 : build
3.7.2
9 : services
10.0.3.1:3000;10.240.0.14:3000;10.0.3.1:3000;10.240.0.27:3000;10.0.3.1:3000;10.240.0.5:3000;10.0.3.1:3000;10.240.0.43:3000;10.0.3.1:3000;10.240.0.30:3000;10.0.3.1:3000;10.240.0.18:3000;10.0.3.1:3000;10.240.0.42:3000;10.0.3.1:3000;10.240.0.33:3000;10.0.3.1:3000;10.240.0.24:3000;10.0.3.1:3000;10.240.0.37:3000;10.0.3.1:3000;10.240.0.41:3000;10.0.3.1:3000;10.240.0.13:3000;10.0.3.1:3000;10.240.0.23:3000
10 : services-alumni
10.0.3.1:3000;10.240.0.42:3000;10.0.3.1:3000;10.240.0.5:3000;10.0.3.1:3000;10.240.0.13:3000;10.0.3.1:3000;10.240.0.14:3000;10.0.3.1:3000;10.240.0.18:3000;10.0.3.1:3000;10.240.0.23:3000;10.0.3.1:3000;10.240.0.24:3000;10.0.3.1:3000;10.240.0.27:3000;10.0.3.1:3000;10.240.0.30:3000;10.0.3.1:3000;10.240.0.37:3000;10.0.3.1:3000;10.240.0.43:3000;10.0.3.1:3000;10.240.0.33:3000;10.0.3.1:3000;10.240.0.41:3000
I have a few comments about your configuration. First, transaction-threads-per-queue should be set to 3 or 4 (don't set it to the number of cores).
The second has to do with your batch-read tuning. You're using the (default) batch-index protocol, and the config params you'll need to tune for batch-read performance are:
You have batch-max-requests set very high, which is probably affecting both your CPU load and your memory consumption. A slight imbalance in the number of keys you're accessing per node is enough to show up in the graphs you've shown, so this is possibly the issue. It's better to iterate over smaller batches than to try to fetch 200K records per node at a time.
batch-index-threads – by default its value is 4, and you set it to 32 (of a max of 64). You should adjust this incrementally, running the same test and benchmarking the performance on each iteration: adjust it higher, then back down if performance decreases. For example: test with 32, +8 = 40, +8 = 48, -4 = 44. There's no easy rule of thumb for this setting; you'll need to tune it through iterations on the hardware you'll be using and monitor the performance.
batch-max-buffer-per-queue – this is more directly linked to the number of concurrent batch-read operations the node can support. Each batch-read request will consume at least one buffer (more if the data cannot fit in 128K). If you do not have enough of these allocated to support the number of concurrent batch-read requests you will get exceptions with error code 152 BATCH_QUEUES_FULL . Track and log such events clearly, because it means you need to raise this value. Note that this is the number of buffers per-queue. Each batch response worker thread has its own queue, so you'll have batch-index-threads x batch-max-buffer-per-queue buffers, each taking 128K of RAM. The batch-max-unused-buffers caps the memory usage of all these buffers combined, destroying unused buffers until their number is reduced. There's an overhead to allocating and destroying these buffers, so you do not want to set it too low compared to the total. Your current cost is 32 x 256 x 128KB = 1GB.
Finally, you're storing your data on a filesystem. That's fine for development instances, but not recommended for production. In GCE you can provision either a SATA SSD or an NVMe SSD for your data storage, and those should be initialized, and used as block devices. Take a look at the GCE recommendations for more details. I suspect you have warnings in your log about the device not keeping up.
It's likely that one of your nodes is an outlier with regard to the number of partitions it has (and therefore the number of objects). You can confirm it with asadm -e 'asinfo -v "objects"'. If that's the case, you can terminate that node and bring up a new one. This will force the partitions to be redistributed. It does trigger a migration, which takes quite a bit longer on the CE server than on the EE one.
For anyone interested, Aerospike Enterprise 4.3 introduced 'uniform-balance', which balances data partitions homogeneously. Read more here: https://www.aerospike.com/blog/aerospike-4-3-all-flash-uniform-balance/

SQL Server memory usage more than 3 GB on 64 bit machine

My SQL Server process memory usage shows more than 3 GB on a 64-bit machine.
I tried to find the problem with:
SELECT
[object_name], [counter_name],
[instance_name], [cntr_value]
FROM sys.[dm_os_performance_counters]
WHERE [object_name] = 'SQLServer:Buffer Manager'
and result was:
Buffer cache hit ratio 6368
Buffer cache hit ratio base 6376
Page lookups/sec 438640376
Free list stalls/sec 182
Free pages 215468
Total pages 442368
Target pages 442368
Database pages 196000
Reserved pages 0
Stolen pages 30900
Lazy writes/sec 1510
Readahead pages/sec 1204816
Page reads/sec 1384292
Page writes/sec 765586
Checkpoint pages/sec 129207
AWE lookup maps/sec 0
AWE stolen maps/sec 0
AWE write maps/sec 0
AWE unmap calls/sec 0
AWE unmap pages/sec 0
Page life expectancy 119777
As this seemed like too many page lookups, I then executed this query:
SELECT 1.0*cntr_value /
(SELECT 1.0*cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec')
AS [PageLookupPct]
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page lookups/sec'
The result for this query was more than 350.
One more metric I tried:
SELECT (1.0*cntr_value/128) /
(SELECT 1.0*cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name like '%Buffer Manager%'
AND lower(counter_name) = 'Page life expectancy')
AS [BufferPoolRate]
FROM sys.dm_os_performance_counters
WHERE object_name like '%Buffer Manager%'
AND counter_name = 'total pages'
The result was 0.029.
I picked these queries up from Tim Ford's blogs.
What could be the possible cause of the memory usage? Is it a bad query plan, or should I look at some other area?
[Update]
MEMORYCLERK_SQLBUFFERPOOL is using 3573824 (VM reserved) 3573824 (VM committed)
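The per-clerk figure above can be pulled from sys.dm_os_memory_clerks (values are reported in KB); a sketch of the kind of query involved:
-- Top memory clerks by committed virtual memory
SELECT TOP (10)
    [type],
    SUM(virtual_memory_reserved_kb) AS vm_reserved_kb,
    SUM(virtual_memory_committed_kb) AS vm_committed_kb
FROM sys.dm_os_memory_clerks
GROUP BY [type]
ORDER BY vm_committed_kb DESC;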
Check your setting under Properties \ Memory \ Maximum Server Memory.
See the answer on this post: Seeing High Memory Usage In SQL Server 2012 which outlines (roughly) the fact that SQL Server will claim what it can based on the maximum available memory. Unless you need this memory for something else, do not worry about it.
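If you do need to cap it (for example, to leave RAM for other processes on the box), the same setting can be read and changed with T-SQL; a sketch, where the 4096 MB value is only an example:
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
-- Show the current cap (the default of 2147483647 MB means effectively unlimited)
EXEC sp_configure 'max server memory (MB)';
-- Example only: cap the buffer pool at 4 GB
EXEC sp_configure 'max server memory (MB)', 4096;
RECONFIGURE;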

Simple postgresql Insert/Update queries taking *seconds* to respond every 5-7 minutes

I have a Ruby on Rails app that is using a Postgresql database. I've noticed that my database performance has huge spikes every 5-7 minutes.
I'm seeing 1+ second response times for simple queries like:
UPDATE users SET last_seen_at = ? where id = ?
or
INSERT INTO emails (email, created_at, updated_at) VALUES (?, ?, ?)
The VPS is an AWS EC2 instance (m2.2xlarge) with a 4-core Xeon at 2.4 GHz and 34 GB of memory.
Here is my postgresql.conf
I made the following changes to the conf to try to figure it out (like reducing the frequency of checkpoints), to no avail.
root:/etc/postgresql/9.2/main# diff postgresql.conf.bck postgresql.conf
176,178c176,178
< #checkpoint_segments = 3 # in logfile segments, min 1, 16MB each
< #checkpoint_timeout = 5min # range 30s-1h
< #checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
---
> checkpoint_segments = 10 # in logfile segments, min 1, 16MB each
> checkpoint_timeout = 30min # range 30s-1h
> checkpoint_completion_target = 0.9 # checkpoint target duration, 0.0 - 1.0
361c361
< #log_min_duration_statement = -1 # -1 is disabled, 0 logs all statements
---
> log_min_duration_statement = 2s # -1 is disabled, 0 logs all statements
370,371c370,371
< #debug_print_rewritten = off
< #debug_print_plan = off
---
> #debug_print_rewritten = on
> #debug_print_plan = on
376c376
< #log_duration = off
---
> #log_duration = on
378c378
< #log_hostname = off
---
> #log_hostname = on
399c399
< #log_lock_waits = off # log lock waits >= deadlock_timeout
---
> log_lock_waits = on # log lock waits >= deadlock_timeout
You have a serious IO problem at the end of the checkpoints. Note that the slow queries are mostly COMMIT, which should do almost nothing but flush the WAL log, and that it took 41.604 s to sync the files (including 11 s to sync one file!).
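One way to confirm this from within PostgreSQL 9.2 is the checkpoint timing columns in pg_stat_bgwriter (times are cumulative milliseconds since the last statistics reset):
-- How much time checkpoints spend writing vs. syncing files
SELECT checkpoints_timed,
       checkpoints_req,
       checkpoint_write_time,
       checkpoint_sync_time,
       buffers_checkpoint,
       buffers_backend_fsync
FROM pg_stat_bgwriter;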
There is probably nothing much you can do from within PostgreSQL to improve this. I've heard rumors that lowering shared_buffers might help, but I have not seen that first hand.
You probably need to make changes on the operating system, like lowering /proc/sys/vm/dirty_ratio so that it doesn't allow so much dirty data to build up between checkpoints. Also, if you can separate your WAL logs to separate disks from the main data, that can help.
What filesystem are you using? What kernel/distro?
There is also the possibility that your workload simply can't be accommodated by the IO system you are using, and you need to move to more capable hardware.