I can't find a straight answer: is it possible to retry a job given just the job's Redis ID?
I want to make an endpoint in Django that can manually retry a specific job, because I don't always need failed jobs to retry themselves.
An example might make what I want to do clearer:
urlpatterns = [
    path('workorder/<str:order_id>/retry', views.workorder_retry, name='workorderretry'),
]
def workorder_retry(request, order_id):
    work_order = models.WorkOrder.objects.get(order_id=order_id)
    print("work order retry: ")
    q = util.rq_setup()
    redis_conn = rq.get_connection('default')
    old_job = q.fetch_job(str(work_order.job_id))
    retried_job = q.enqueue(old_job)
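For illustration, here is a minimal sketch of what a retry by ID could look like. It relies on RQ's `Queue.fetch_job`, which returns `None` when the ID is unknown, and on `Job.requeue()`, which puts a failed job back on its queue. The helper name `retry_job_by_id` is hypothetical, and `requeue()` is only expected to succeed for jobs that are currently in the failed job registry.

```python
def retry_job_by_id(queue, job_id):
    """Requeue a failed job given only its ID (hypothetical helper).

    `queue` is expected to be an rq.Queue; any object with a compatible
    `fetch_job` method works. Returns True if the job was requeued and
    False if no job with that ID was found on this queue.
    """
    job = queue.fetch_job(str(job_id))  # None when the ID is unknown
    if job is None:
        return False
    # Job.requeue() moves a failed job back onto its original queue.
    job.requeue()
    return True
```

In the view above this would be called as `retry_job_by_id(q, work_order.job_id)` instead of passing the old job object to `q.enqueue`.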
I'm testing python-bigquery-storage to insert multiple items into a table using the _default stream.
I used the example shown in the official docs as a basis, and modified it to use the default stream.
Here is a minimal example that's similar to what I'm trying to do:
syntax = "proto2";
message CustomerRecord {
  optional string customer_name = 1;
  optional int64 row_num = 2;
}
from itertools import islice
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types
from google.cloud.bigquery_storage_v1 import writer
from google.protobuf import descriptor_pb2
import customer_record_pb2
import logging
CHUNK_SIZE = 2 # Maximum number of rows to use in each AppendRowsRequest.
def chunks(l, n):
    """Yield successive `n`-sized chunks from `l`."""
    _it = iter(l)
    while True:
        chunk = [*islice(_it, 0, n)]
        if chunk:
            yield chunk
        else:
            # The iterator is exhausted; stop instead of looping forever.
            break
def create_stream_manager(project_id, dataset_id, table_id, write_client):
    # Use the default stream
    # The stream name is:
    # projects/{project}/datasets/{dataset}/tables/{table}/_default
    parent = write_client.table_path(project_id, dataset_id, table_id)
    stream_name = f'{parent}/_default'
    # Create a template with fields needed for the first request.
    request_template = types.AppendRowsRequest()
    # The initial request must contain the stream name.
    request_template.write_stream = stream_name
    # So that BigQuery knows how to parse the serialized_rows, generate a
    # protocol buffer representation of our message descriptor.
    proto_schema = types.ProtoSchema()
    proto_descriptor = descriptor_pb2.DescriptorProto()
    customer_record_pb2.CustomerRecord.DESCRIPTOR.CopyToProto(proto_descriptor)
    proto_schema.proto_descriptor = proto_descriptor
    proto_data = types.AppendRowsRequest.ProtoData()
    proto_data.writer_schema = proto_schema
    request_template.proto_rows = proto_data
    # Create an AppendRowsStream using the request template created above.
    append_rows_stream = writer.AppendRowsStream(write_client, request_template)
    return append_rows_stream
def send_rows_to_bq(project_id, dataset_id, table_id, write_client, rows):
    append_rows_stream = create_stream_manager(project_id, dataset_id, table_id, write_client)
    response_futures = []
    row_count = 0
    # Send the rows in chunks, to limit memory usage.
    for chunk in chunks(rows, CHUNK_SIZE):
        proto_rows = types.ProtoRows()
        for row in chunk:
            row_count += 1
            proto_rows.serialized_rows.append(row.SerializeToString())
        # Create an append row request containing the rows
        request = types.AppendRowsRequest()
        proto_data = types.AppendRowsRequest.ProtoData()
        proto_data.rows = proto_rows
        request.proto_rows = proto_data
        future = append_rows_stream.send(request)
        response_futures.append(future)
    # Wait for all the append row requests to finish.
    for f in response_futures:
        f.result()
    # Shutdown background threads and close the streaming connection.
    append_rows_stream.close()
    return row_count
def create_row(row_num: int, name: str):
    row = customer_record_pb2.CustomerRecord()
    row.row_num = row_num
    row.customer_name = name
    return row
def main():
    write_client = bigquery_storage_v1.BigQueryWriteClient()
    rows = [create_row(i, f"Test{i}") for i in range(0, 20)]
    send_rows_to_bq("PROJECT_NAME", "DATASET_NAME", "TABLE_NAME", write_client, rows)

if __name__ == '__main__':
    main()
In the above, CHUNK_SIZE is 2 just for this minimal example, but, in a real situation, I used a chunk size of 5000.
In real usage, I have several separate streams of data that need to be processed in parallel, so I make several calls to send_rows_to_bq, one for each stream of data, using a thread pool (one thread per stream of data). (I'm assuming here that AppendRowsStream is not meant to be shared by multiple threads, but I might be wrong).
It mostly works, but I often get a mix of intermittent errors in the call to append_rows_stream's send method:
google.cloud.bigquery_storage_v1.exceptions.StreamClosedError: This manager has been closed and can not be used.
google.api_core.exceptions.Unknown: None There was a problem opening the stream. Try turning on DEBUG level logs to see the error.
I think I just need to retry on these errors, but I'm not sure how to best implement a retry strategy here. My impression is that I need to use the following strategy to retry errors when calling send:
If the error is a StreamClosedError, the append_rows_stream stream manager can't be used anymore, and so I need to call close on it and then call my create_stream_manager again to create a new one, then try to call send on the new stream manager.
Otherwise, on any google.api_core.exceptions.ServerError error, retry the call to send on the same stream manager.
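The strategy described above can be sketched roughly as follows. This is only an illustration under assumptions: `StreamClosedError` and `ServerError` are local stand-ins for the real exception classes so the sketch runs on its own, `send_with_retry` and `make_manager` are hypothetical names, the exponential backoff is my own addition, and it assumes that calling `close()` on an already closed manager is harmless.

```python
import time


class StreamClosedError(Exception):
    """Stand-in for google.cloud.bigquery_storage_v1.exceptions.StreamClosedError."""


class ServerError(Exception):
    """Stand-in for google.api_core.exceptions.ServerError."""


def send_with_retry(manager, request, make_manager,
                    max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Send `request`, replacing the manager when it has been closed.

    Returns (future, manager). The returned manager may be a new object,
    so the caller should keep using it for later sends.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return manager.send(request), manager
        except StreamClosedError:
            if attempt == max_attempts:
                raise
            # The old manager is unusable: close it and build a new one.
            manager.close()
            manager = make_manager()
        except ServerError:
            if attempt == max_attempts:
                raise
            # Transient server-side error: retry on the same manager.
        # Exponential backoff between attempts: 1x, 2x, 4x, ... base_delay.
        sleep(base_delay * 2 ** (attempt - 1))
```

Here `make_manager` would be something like `lambda: create_stream_manager(project_id, dataset_id, table_id, write_client)`.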
Am I approaching this correctly?
Thank you.
The best solution is to upgrade to the latest release of the library.
Older versions had this problem because the Write API connection hangs once it reaches 10 MB.
If upgrading does not help, you can try these options:
Limit the connection to < 10MB.
Disconnect and connect again to the API.
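As a rough sketch of the first option, the serialized rows can be grouped so that each group stays under a byte budget, and a fresh stream can then be opened per group. The function name `batches_under_limit` and the 9 MiB default are assumptions chosen to leave some headroom below 10 MB; only the row payload is counted, not any request overhead.

```python
def batches_under_limit(serialized_rows, max_bytes=9 * 1024 * 1024):
    """Yield lists of byte strings whose combined length is <= max_bytes.

    A single row larger than max_bytes is still yielded, alone in its
    own batch, so the caller can decide how to handle it.
    """
    batch, size = [], 0
    for row in serialized_rows:
        # Start a new batch when adding this row would exceed the budget.
        if batch and size + len(row) > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(row)
        size += len(row)
    if batch:
        yield batch
```

Each yielded batch could be sent through its own stream manager, which is then closed before the next one is created.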
I trigger an Airflow DAG and pass parameters through the REST API. Based on a list passed in one of those parameters, I want to repeat some of the tasks in this DAG. After several attempts I am stuck, and I am not sure whether this is possible.
Here is one attempt:
def determine_rest_params(**kwargs):
    values_comma_sep = kwargs["dag_run"].conf["myparam"]
    values = []
    if values_comma_sep:
        values = values_comma_sep.split(",")
    return values
def create_task_for_param(p, **kwargs):
    # create an operator instance
with airflow.DAG("get_prediction2", default_args=default_args, schedule_interval=None) as dag:
    start = DummyOperator(
    params = determine_rest_params()
    for cur_p in params:
        cur_task = create_task_for_param(cur_p)
        start >> cur_task
I only see the start task and no other operator. Is it possible in general?
You can try this (I am not sure it will work).
Change the for loop to:
for i in range(0, len(params)):
    params[i] = create_task_for_param(params[i])
    if i == 0:
I am also not sure whether you are getting the params the right way. Print them and check. If you are getting them, the for loop above should work; if not, you can try reading the params inside a start task.
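One thing to keep in mind is that the DAG file is parsed before any run exists, so `dag_run.conf` is only available while a task is executing. A small sketch of reading the list inside a task could look like this. The helper name `parse_param_list` is hypothetical, and `myparam` is the key from the question.

```python
def parse_param_list(conf, key="myparam"):
    """Return the comma-separated values stored under `key` in a run's conf.

    `conf` is the dict passed when the DAG run was triggered; a missing
    conf, a missing key, or an empty string all give an empty list.
    """
    raw = (conf or {}).get(key) or ""
    # Drop surrounding whitespace and skip empty entries such as "a,,b".
    return [value.strip() for value in raw.split(",") if value.strip()]
```

Inside a task callable this would be called as `parse_param_list(kwargs["dag_run"].conf)`.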
I have a very slow scraper. I know the bottleneck is not the pipeline (i.e. bi_pipeline), because other scrapers that use it but don't use XMLFeedSpider are very fast. Here is my code:
class MySpider(XMLFeedSpider):
    custom_settings = {
        'ITEM_PIPELINES': {
            'my.pipelines.bi_pipeline': 400
        }
    }
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'DEALER'

    def parse_node(self, response, node):
        my_item = Dealer()
        my_item['title'] = node.xpath('TITLE/text()').get()
        # send to pipeline to get stored in database
        yield my_item
        # get the sales for each dealer
        yield Request("https://some.domain.com/od/dealers.json?id=" + node.xpath('ID/text()').get(), callback=self.each_sale)
I don't know why, but this is very slow: about 35 items per minute. Where should I look to optimize?
Solved. An update script was being called by a trigger in the database. It was a cleanup script, and the target I was running it against needed a lot of cleaning.
I have an application which uses Sidekiq. The web server process will sometimes put a job on Sidekiq, but I won't necessarily have the worker running. Is there a utility which I could call from the Rails console which would pull one job off the Redis queue and run the appropriate Sidekiq worker?
Here's a way that you'll likely need to modify to get the job you want (maybe like g8M suggests above), but it should do the trick:
> job = Sidekiq::Queue.new("your_queue").first
> job.klass.constantize.new.perform(*job.args)
If you want to delete the job:
> job.delete
Tested on sidekiq 5.2.3.
I wouldn't try to hack Sidekiq's API to run the jobs manually, since it could leave some unwanted internal state, but I believe the following code would work:
# Fetch the Queue
queue = Sidekiq::Queue.new # default queue
# OR
# queue = Sidekiq::Queue.new(:my_queue_name)
# Fetch the job
# job = queue.first
# OR
job = queue.find do |job|
  meta = job.args.first
  # => {"job_class" => "MyJob", "job_id"=>"1afe424a-f878-44f2-af1e-e299faee7e7f", "queue_name"=>"my_queue_name", "arguments"=>["Arg1", "Arg2", ...]}
  meta['job_class'] == 'MyJob' && meta['arguments'].first == 'Arg1'
end
# Removes from queue so it doesn't get processed twice
job.delete
meta = job.args.first
klass = meta['job_class'].constantize
# => MyJob
# Performs the job without using Sidekiq's API, does not count as performed job and so on.
# OR
# Perform the job using Sidekiq's API so it counts as performed job and so on.
# klass.new(*meta['arguments']).perform_now
Please let me know if this doesn't work or if someone knows a better way to do this.
I have a periodic celery task running once per minute, like so:
@periodic_task(run_every=(crontab(hour="*", minute="*", day_of_week="*")))
def scraping_task():
    result = pollAPI()
Here pollAPI(), as you might have guessed from the name, polls an API. The catch is that the API has an undisclosed rate limit and sometimes returns an error response when that limit is hit. When that happens, I'd like to make the periodic task run less often by changing its interval dynamically (or even pause the task for a while). Is this possible?
I read in the docs about overwriting the is_due method of schedules, but I am lost on exactly what to do to give the behaviour I'm looking for here. Could anyone help?
You could try using celery.conf.update to update your CELERYBEAT_SCHEDULE.
You can add a model to the database that records whether the rate limit has been reached. Before polling the API, check that record. If the limit has not been reached, send the API request; otherwise skip this run.
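The check-before-polling idea can be sketched like this. To keep the sketch self-contained, the state lives in memory in a hypothetical `RateLimitGate` class; with several Celery worker processes the same state would have to live in the database (or another shared store), as described above. The `is_rate_limited` callback and the 300 second pause are also assumptions.

```python
import time


class RateLimitGate:
    """Remembers until when polling should be paused (in memory only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._paused_until = 0.0

    def pause(self, seconds):
        # Block polling for the next `seconds` seconds.
        self._paused_until = self._clock() + seconds

    def is_open(self):
        return self._clock() >= self._paused_until


def poll_if_allowed(gate, poll, is_rate_limited, pause_seconds=300):
    """Call `poll()` unless the gate is paused.

    Returns the poll result, or None when the call was skipped. If the
    result looks rate limited, later calls are skipped for a while.
    """
    if not gate.is_open():
        return None
    result = poll()
    if is_rate_limited(result):
        gate.pause(pause_seconds)
    return result
```

The periodic task keeps its one minute schedule and simply does nothing while the gate is closed.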
The other approach is to use PeriodicTask from django-celery-beat. You can update the interval dynamically. I created an example project and wrote an article showing how to use dynamic periodic tasks in Celery and Django.
Example code that updates the task when the limit is reached:
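def scraping_task(special_object_id, larger_interval=1000):
    try:
        result = pollAPI()
    except Exception as e:
        # limit reached
        special_object = ModelWithTask.objects.get(pk=special_object_id)
        task = PeriodicTask.objects.get(pk=special_object.task.id)
        # Assumption: the original arguments were lost; the unit may differ.
        new_schedule, created = IntervalSchedule.objects.get_or_create(
            every=larger_interval,
            period=IntervalSchedule.SECONDS,
        )
        task.interval = new_schedule
        task.save()  # persist the change so the new interval is stored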
You can pass the parameters to the scraping_task when creating a PeriodicTask object. You will need to have an additional model in the database to have access to the task:
from django.db import models
from django_celery_beat.models import PeriodicTask

class ModelWithTask(models.Model):
    task = models.OneToOneField(
        PeriodicTask, null=True, blank=True, on_delete=models.SET_NULL
    )
# create periodic task
special_object = ModelWithTask.objects.create()
schedule, created = IntervalSchedule.objects.get_or_create(
task = PeriodicTask.objects.create(
name="Task 1",
"special_object_id": special_object.id,
special_object.task = task