apscheduler loses a job after reboot - redis

I've run into a problem with Python's APScheduler.
I've made a simple script:
from apscheduler.schedulers.background import BackgroundScheduler
from time import sleep
from datetime import datetime

scheduler = BackgroundScheduler({
    'apscheduler.jobstores.default': {
        'type': 'redis',
        'host': "127.0.0.1",
        'port': 6379,
        'db': 0,
        'encoding': "utf-8",
        'encoding_errors': "strict",
        'decode_responses': False,
        'retry_on_timeout': False,
        'ssl': False,
        'ssl_cert_reqs': "required"
    },
    'apscheduler.executors.default': {
        'class': 'apscheduler.executors.pool:ThreadPoolExecutor',
        'max_workers': '20'
    },
    'apscheduler.job_defaults.coalesce': 'false',
    'apscheduler.job_defaults.max_instances': '3',
    'apscheduler.timezone': 'UTC',
}, daemon=False)

scheduler.start()

def testfunc():
    with open('./data.log', 'a') as f:
        f.write(f'{datetime.now()}\n')

scheduler.add_job(testfunc, 'interval', minutes=1, id='my_job_id')
scheduler.add_job(testfunc, 'date', run_date="2120-1-1 11:12:13", id='my_job_id_2')

while True:
    scheduler.print_jobs()
    sleep(10)
So, I'm using BackgroundScheduler with a ThreadPoolExecutor and a Redis jobstore, quite simple, as in the documentation.
It works fine: the tasks are added, and I can see the data in redis-cli as well.
Then I reboot the server and check the data in Redis again. All I see is the my_job_id_2 task; the one with the interval trigger has disappeared completely.
Redis is set to save data to RDB every minute. The same thing happens if I execute the SAVE command in redis-cli before the reboot.
Why is this happening?

Redis is an in-memory store; unless you configure persistence, data will be dropped between reboots.
https://redis.io/topics/persistence details the options available to you.

Are you sure it's saving?
Check the configuration:
redis-cli config get save
1) "save"
2) "3600 1 300 100 60 10000"
Check the Redis log to ensure the save has occurred:
1:M 16 Sep 2021 16:06:31.375 * DB saved on disk
Example with an error:
547130:M 16 Sep 2021 09:12:58.100 # Error moving temp DB file temp-547130.rdb on the final destination dump.rdb (in server root dir /home/namizaru/redis/src): Operation not permitted
Make sure that on restart it's loading the file, again in the Redis log:
549156:M 16 Sep 2021 09:14:31.522 * DB loaded from disk: 0.000 seconds
Consider using AOF, as it logs all of the commands, so you'll at least be able to figure out which commands are getting missed.
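If you prefer to do the same checks from Python, here is a minimal sketch using the redis-py package (which the Redis jobstore already depends on); it only reads the persistence settings and forces a snapshot, the same as SAVE in redis-cli:

import redis

r = redis.Redis(host="127.0.0.1", port=6379, db=0)

# Current RDB snapshot rules, e.g. {'save': '3600 1 300 100 60 10000'}
print(r.config_get("save"))

# Whether the append-only file is enabled ('yes'/'no')
print(r.config_get("appendonly"))

# Timestamp of the last successful RDB save
print(r.lastsave())

# Force a synchronous snapshot before rebooting (same as SAVE in redis-cli)
r.save()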

Related

How to use subprocess.run() to run Hive query?

So I'm trying to execute a Hive query using the subprocess module and save the output into a file data.txt, as well as the logs (into log.txt), but I seem to be having a bit of trouble. I've looked at this gist as well as this SO question, but neither seems to give me what I need.
Here's what I'm running:
import subprocess

query = "select user, sum(revenue) as revenue from my_table where user = 'dave' group by user;"
outfile = "data.txt"
logfile = "log.txt"

log_buff = open("log.txt", "a")
data_buff = open("data.txt", "w")

# note - "hive -e [query]" would normally just print all the results
# to the console after finishing
proc = subprocess.run(["hive" , "-e" '"{}"'.format(query)],
                      stdin=subprocess.PIPE,
                      stdout=data_buff,
                      stderr=log_buff,
                      shell=True)

log_buff.close()
data_buff.close()
I've also looked into this SO question regarding subprocess.run() vs subprocess.Popen, and I believe I want .run() because I'd like the process to block until finished.
The final output should be a file data.txt with the tab-delimited results of the query, and log.txt with all of the logging produced by the hive job. Any help would be wonderful.
Update:
With the above way of doing things I'm currently getting the following output:
log.txt
[ralston#tpsci-gw01-vm tmp]$ cat log.txt
Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector with the Serial old collector is deprecated and will likely be removed in a future release
Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector with the Serial old collector is deprecated and will likely be removed in a future release
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/y/share/hadoop-2.8.3.0.1802131730/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/y/libexec/tez/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Logging initialized using configuration in file:/home/y/libexec/hive/conf/hive-log4j.properties
data.txt
[ralston#tpsci-gw01-vm tmp]$ cat data.txt
hive> [ralston#tpsci-gw01-vm tmp]$
And I can verify the java/hive process did run:
[ralston#tpsci-gw01-vm tmp]$ ps -u ralston
PID TTY TIME CMD
14096 pts/0 00:00:00 hive
14141 pts/0 00:00:07 java
14259 pts/0 00:00:00 ps
16275 ? 00:00:00 sshd
16276 pts/0 00:00:00 bash
But it looks like it's not finishing and not logging everything that I'd like.
So I managed to get this working with the following setup:
import subprocess
import time

query = "select user, sum(revenue) as revenue from my_table where user = 'dave' group by user;"
outfile = "data.txt"
logfile = "log.txt"

log_buff = open("log.txt", "a")
data_buff = open("data.txt", "w")

# Remove shell=True from proc, and add "> outfile.txt" to the command
proc = subprocess.Popen(["hive", "-e", '"{}"'.format(query), ">", "{}".format(outfile)],
                        stdin=subprocess.PIPE,
                        stdout=data_buff,
                        stderr=log_buff)

# keep track of job runtime and set limit
start, elapsed, finished, limit = time.time(), 0, False, 60
while not finished:
    try:
        outs, errs = proc.communicate(timeout=10)
        print("job finished")
        finished = True
    except subprocess.TimeoutExpired:
        elapsed = abs(time.time() - start) / 60.
        if elapsed >= limit:
            print("Job took over 60 mins")
            break
        print("Comm timed out. Continuing")
        continue

print("done")

log_buff.close()
data_buff.close()
Which produced the output as needed. I knew about process.communicate(), but it previously didn't work. I believe the issue was related to not adding an output file with > ${outfile} to the hive query.
Feel free to add any details. I've never seen anyone have to loop over proc.communicate(), so I suspect I might be doing something wrong.
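For reference, the subprocess documentation notes that with shell=True a list's first item is taken as the command string and the remaining items become arguments to the shell itself, not to hive, which would explain the bare hive> prompt captured in data.txt above. A minimal sketch of the list form without shell=True (assuming the hive CLI is on PATH) looks like this:

import subprocess

query = "select user, sum(revenue) as revenue from my_table where user = 'dave' group by user;"

with open("log.txt", "a") as log_buff, open("data.txt", "w") as data_buff:
    proc = subprocess.run(
        ["hive", "-e", query],   # list form: no shell, no extra quoting needed
        stdout=data_buff,        # query results (tab-delimited) go to data.txt
        stderr=log_buff,         # hive/hadoop logging goes to log.txt
    )

print("hive exited with", proc.returncode)

Since subprocess.run() blocks until the process exits, no communicate() loop is needed in this sketch.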

Webalizer database error

Webalizer stopped generating statistics.
When I try to check the database I see:
# webalizer --db-info
Stone Steps Webalizer v3.10.2.5 (Linux 4.6.4-grsec-zfs+)
Using database /home/www/1/statystyka/webalizer.db
Reading history file... /home/www/1/statystyka/webalizer.hist
Cannot find the last URL (ID: 752154) of an active visit (ID: 3)
Saving history information...
When I do it on another site I see:
# webalizer --db-info
Stone Steps Webalizer v3.10.2.5 (Linux 4.6.4-grsec-zfs+)
Using database /home/www/2/statystyka/webalizer.db
Reading history file... /home/www/2/statystyka/webalizer.hist
Creating output in /home/www/2/statystyka
Database : /home/www/2/statystyka/webalizer.db
Created by : 3.10.2.5
Last updated by : 3.10.2.5
First day : 2017/12/01
Log time : 2017/12/27 01:18:15
Active visits : 2
Active downloads: 0
Incremental : yes
Batch : no
Maintenance time is 0.00 seconds
Total run time is 0.00 seconds
Saving history information...
I tried to run webalizer --end-month but it failed.
How can I fix this problem?
My fix:
Restore the Webalizer database from a backup taken before the damage and rebuild it from the access log:
/usr/local/sbin/webalizer -c /etc/webalizer/webalizer-1.conf -q -T /var/log/apache/1.access.log

Ceph s3 bucket space not freeing up

I've been testing Ceph with S3.
My test environment is 3 nodes, each with a 10 GB data disk, so 30 GB in total.
It's set to replicate 3 times, so I have "15290 MB" of space available.
I got the S3 bucket working and have been uploading files until I filled up the storage. I then tried to remove those files, but the disks still show as full:
    cluster 4ab8d087-1802-4c10-8c8c-23339cbeded8
     health HEALTH_ERR
            3 full osd(s)
            full flag(s) set
     monmap e1: 3 mons at {ceph-1=xxx.xxx.xxx.3:6789/0,ceph-2=xxx.xxx.xxx.4:6789/0,ceph-3=xxx.xxx.xxx.5:6789/0}
            election epoch 30, quorum 0,1,2 ceph-1,ceph-2,ceph-3
     osdmap e119: 3 osds: 3 up, 3 in
            flags full,sortbitwise,require_jewel_osds
      pgmap v2224: 164 pgs, 13 pools, 4860 MB data, 1483 objects
            14715 MB used, 575 MB / 15290 MB avail
                 164 active+clean
I am not sure how to get the disk space back.
Can anyone advise on what I have done wrong or have missed?
I'm beginning with Ceph and had the same problem.
Try running the garbage collector.
List what will be deleted:
radosgw-admin gc list --include-all
Then run it:
radosgw-admin gc process
If that didn't work (as it didn't for me with most of my data), find the pool holding your data:
ceph df
Usually your S3 data goes in the default pool default.rgw.buckets.data.
Purge every object from it /!\ you will lose all your data /!\:
rados purge default.rgw.buckets.data --yes-i-really-really-mean-it
I don't know why Ceph is not purging this data itself for now (still learning...).
Thanks to Julien for this info.
You're right about steps 1 and 2.
When you run
radosgw-admin gc list --include-all
you see something like this:
[
    {
        "tag": "17925483-8ff6-4aaf-9db2-1eafeccd0454.94098.295\u0000",
        "time": "2017-10-27 13:51:58.0.493358s",
        "objs": [
            {
                "pool": "default.rgw.buckets.data",
                "oid": "17925483-8ff6-4aaf-9db2-1eafeccd0454.24248.3__multipart_certs/boot2docker.iso.2~UQ4MH7uZgQyEd3nDZ9hFJr8TkvldwTp.1",
                "key": "",
                "instance": ""
            },
            {
                "pool": "default.rgw.buckets.data",
                "oid": "17925483-8ff6-4aaf-9db2-1eafeccd0454.24248.3__shadow_certs/boot2docker.iso.2~UQ4MH7uZgQyEd3nDZ9hFJr8TkvldwTp.1_1",
                "key": "",
                "instance": ""
            }, ....
If you notice the time
2017-10-27 13:51:58.0.493358s
when running
radosgw-admin gc process
it will only clear/remove parts that are older than the time field.
E.g. I can run "radosgw-admin gc process" over and over again, but the files won't be removed until after "2017-10-27 13:51:58.0.493358s".
But you are also right:
rados purge default.rgw.buckets.data --yes-i-really-really-mean-it
works as well.
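If you want to check programmatically which entries are still waiting on their expiration time, here is a rough sketch (an assumption on my part: radosgw-admin is on PATH and its JSON output has the same shape as the example above):

import json
import subprocess

# List pending RGW garbage-collection entries and their expiration times.
out = subprocess.run(
    ["radosgw-admin", "gc", "list", "--include-all"],
    capture_output=True, text=True, check=True,
).stdout

for entry in json.loads(out):
    # Objects in each entry are only removed by "gc process" after this time.
    print(entry["time"], len(entry["objs"]), "object(s) pending")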
You can list all buckets to be processed by GC (garbage collection) with:
radosgw-admin gc list --include-all
Then you can check that GC will run after the specified time. Manually, you can run:
radosgw-admin gc process --include-all
It will start the garbage collection process, and with "--include-all" it will process all entries, including unexpired ones.
Then you can check the progress of clean-up with:
watch -c "ceph -s"
or simply check with "ceph -s" that all buckets that were supposed to be deleted are gone. Documentation regarding GC settings can be found here:
https://docs.ceph.com/en/quincy/radosgw/config-ref/#garbage-collection-settings

Benchmark Redis under Twemproxy with redis-benchmark

I am trying to test a very simple setup with Redis and Twemproxy but I can't find a way to make it faster.
I have 2 redis servers that I run with bare minimum configuration:
./redis-server --port 6370
./redis-server --port 6371
Both are compiled from source and running on a single machine with all the appropriate memory and CPUs.
If I run redis-benchmark against one of the instances I get the following:
./redis-benchmark --csv -q -p 6371 -t set,get,incr,lpush,lpop,sadd,spop -r 100000000
"SET","161290.33"
"GET","176366.86"
"INCR","170940.17"
"LPUSH","178571.42"
"LPOP","168350.17"
"SADD","176991.16"
"SPOP","168918.92"
Now I would like to use Twemproxy in front of the two instances to distribute the requests and get a higher throughput (at least this is what I expected!).
I used the following configuration for Twemproxy:
my_cluster:
  listen: 127.0.0.1:6379
  hash: fnv1a_64
  distribution: ketama
  auto_eject_hosts: false
  redis: true
  servers:
   - 127.0.0.1:6371:1 server1
   - 127.0.0.1:6372:1 server2
And I run nutcracker as:
./nutcracker -c twemproxy_redis.yml -i 5
The results are very disappointing:
./redis-benchmark -r 1000000 --csv -q -p 6379 -t set,get,incr,lpush,lpop,sadd,spop
"SET","112485.94"
"GET","113895.21"
"INCR","110987.79"
"LPUSH","145560.41"
"LPOP","149700.61"
"SADD","122100.12"
I tried to understand what is going on by looking at Twemproxy's statistics, like this:
telnet 127.0.0.1 22222
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
{
  "service": "nutcracker",
  "source": "localhost.localdomain",
  "version": "0.4.1",
  "uptime": 10,
  "timestamp": 1452545028,
  "total_connections": 303,
  "curr_connections": 3,
  "my_cluster": {
    "client_eof": 300,
    "client_err": 0,
    "client_connections": 0,
    "server_ejects": 0,
    "forward_error": 0,
    "fragments": 0,
    "server1": {
      "server_eof": 0,
      "server_err": 0,
      "server_timedout": 0,
      "server_connections": 1,
      "server_ejected_at": 0,
      "requests": 246791,
      "request_bytes": 11169484,
      "responses": 246791,
      "response_bytes": 1104215,
      "in_queue": 0,
      "in_queue_bytes": 0,
      "out_queue": 0,
      "out_queue_bytes": 0
    },
    "server2": {
      "server_eof": 0,
      "server_err": 0,
      "server_timedout": 0,
      "server_connections": 1,
      "server_ejected_at": 0,
      "requests": 353209,
      "request_bytes": 12430516,
      "responses": 353209,
      "response_bytes": 2422648,
      "in_queue": 0,
      "in_queue_bytes": 0,
      "out_queue": 0,
      "out_queue_bytes": 0
    }
  }
}
Connection closed by foreign host.
Is there any other benchmark around that works properly? Or should redis-benchmark have worked?
I forgot to mention that I am using Redis 3.0.6 and Twemproxy 0.4.1.
It might seem counter-intuitive, but putting two instances of redis with a proxy in front of them will certainly reduce performance!
In a single instance scenario, redis-benchmark connects directly to the redis server, and thus has minimal latency per request.
Once you put two instances and a single twemproxy in front of them, think what happens - you connect to twemproxy, which analyzes the request, chooses the right instance, and connects to it.
So, first of all, each request now has two network hops to travel instead of one. Added latency means less throughput of course.
Also, you are using just one twemproxy instance. So, assuming that twemproxy itself performs more or less like a single redis instance, you can never beat a single instance with a single proxy in front of it.
Twemproxy facilitates scaling out, not scaling up. It allows you to grow your cluster to sizes that a single instance could never achieve. But there's a latency price to pay, and as long as you're using a single proxy, it's also a throughput price.
The proxy imposes a small tax on each request. Measure throughput using the proxy with one server. Impose a load until the throughput stops growing and the response times slow to a crawl. Add another server and note the response times are restored to normal, while capacity just doubled. Of course, you'll want to add servers well before response times start to crawl.
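To see the per-request cost of the extra hop directly, you can time the same workload against one Redis instance and against the proxy. A rough, unpipelined sketch with the redis-py package (an assumed dependency; the ports follow the setup above):

import time
import redis

def sets_per_second(port, n=10000):
    # One client, one connection: measures per-request round-trip cost.
    r = redis.Redis(host="127.0.0.1", port=port)
    start = time.perf_counter()
    for i in range(n):
        r.set(f"bench:{i}", "x")
    return n / (time.perf_counter() - start)

print("direct to redis  :", round(sets_per_second(6371)), "ops/s")
print("through twemproxy:", round(sets_per_second(6379)), "ops/s")

The direct numbers should come out higher on a single client, which is exactly the latency tax described above; the proxy only pays off once you add servers and concurrent clients.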

ElasticSearch: Accidentally starting 2 instances in same server

So I accidentally started 2 ElasticSearch instances on the same machine, one on port 9200 and the other on port 9201. This means there are 2 cluster nodes, each with the same name, and each holding half of the total shards for each index.
If I kill one of the instances, I end up with 1 instance holding half the shards.
How do I fix this? I want to have just 1 instance with all the shards in it (like it used to be).
So... there is a clean way to resolve this, although I must say the ElasticSearch documentation is very, very confusing (all these buzzwords like cluster and zen discovery boggle my mind!).
1)
Say you have 2 instances, one on port 9200 and the other on 9201, and you want ALL the shards to be on 9200.
Run this command to disable shard allocation on the 9201 instance. You can change persistent to transient if you don't want the change to be permanent; I'd keep it persistent so this doesn't ever happen again.
curl -XPUT localhost:9201/_cluster/settings -d '{
    "persistent" : {
        "cluster.routing.allocation.disable_allocation" : true
    }
}'
2) Now, run the command to MOVE all the shards in the 9201 instance to 9200.
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
    "commands" : [ {
        "move" : {
            "index" : "<NAME OF INDEX HERE>", "shard" : <SHARD NUMBER HERE>,
            "from_node" : "<ID OF 9201 node>", "to_node" : "<ID of 9200 node>"
        }
    } ]
}'
You need to run this command for every shard on the 9201 instance (the one you want to get rid of); if there are many shards, you can script the loop, as sketched below.
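Here is a rough sketch of that loop using the requests package (an assumed dependency); the node names are placeholders, _cat/shards reports node names rather than IDs, and the exact reroute body can vary between ElasticSearch versions:

import requests

ES = "http://localhost:9200"
FROM_NODE = "<NAME OF 9201 node>"   # node to drain (placeholder)
TO_NODE = "<NAME OF 9200 node>"     # node that should hold every shard (placeholder)

# _cat/shards lists every shard together with the node it currently lives on.
for s in requests.get(f"{ES}/_cat/shards?format=json").json():
    if s["node"] != FROM_NODE:
        continue
    body = {"commands": [{"move": {
        "index": s["index"],
        "shard": int(s["shard"]),
        "from_node": FROM_NODE,
        "to_node": TO_NODE,
    }}]}
    requests.post(f"{ES}/_cluster/reroute", json=body).raise_for_status()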
If you have ElasticSearch head, that shard will be purple and will have "RELOCATING" status. If you have lots of data, say > 1 GB, it will take a while for the shard to move, perhaps up to an hour or even more, so be patient. Don't shut down the instance/node until everything is done moving.
That's it!