Python BigQuery Storage Write retry strategy when writing to default stream - google-bigquery

I'm testing python-bigquery-storage to insert multiple items into a table using the _default stream.
I used the example shown in the official docs as a basis, and modified it to use the default stream.
Here is a minimal example that's similar to what I'm trying to do:
customer_record.proto
syntax = "proto2";
message CustomerRecord {
optional string customer_name = 1;
optional int64 row_num = 2;
}
append_rows_default.py
from itertools import islice
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types
from google.cloud.bigquery_storage_v1 import writer
from google.protobuf import descriptor_pb2
import customer_record_pb2
import logging
logging.basicConfig(level=logging.DEBUG)
CHUNK_SIZE = 2 # Maximum number of rows to use in each AppendRowsRequest.
def chunks(l, n):
"""Yield successive `n`-sized chunks from `l`."""
_it = iter(l)
while True:
chunk = [*islice(_it, 0, n)]
if chunk:
yield chunk
else:
break
def create_stream_manager(project_id, dataset_id, table_id, write_client):
# Use the default stream
# The stream name is:
# projects/{project}/datasets/{dataset}/tables/{table}/_default
parent = write_client.table_path(project_id, dataset_id, table_id)
stream_name = f'{parent}/_default'
# Create a template with fields needed for the first request.
request_template = types.AppendRowsRequest()
# The initial request must contain the stream name.
request_template.write_stream = stream_name
# So that BigQuery knows how to parse the serialized_rows, generate a
# protocol buffer representation of our message descriptor.
proto_schema = types.ProtoSchema()
proto_descriptor = descriptor_pb2.DescriptorProto()
customer_record_pb2.CustomerRecord.DESCRIPTOR.CopyToProto(proto_descriptor)
proto_schema.proto_descriptor = proto_descriptor
proto_data = types.AppendRowsRequest.ProtoData()
proto_data.writer_schema = proto_schema
request_template.proto_rows = proto_data
# Create an AppendRowsStream using the request template created above.
append_rows_stream = writer.AppendRowsStream(write_client, request_template)
return append_rows_stream
def send_rows_to_bq(project_id, dataset_id, table_id, write_client, rows):
append_rows_stream = create_stream_manager(project_id, dataset_id, table_id, write_client)
response_futures = []
row_count = 0
# Send the rows in chunks, to limit memory usage.
for chunk in chunks(rows, CHUNK_SIZE):
proto_rows = types.ProtoRows()
for row in chunk:
row_count += 1
proto_rows.serialized_rows.append(row.SerializeToString())
# Create an append row request containing the rows
request = types.AppendRowsRequest()
proto_data = types.AppendRowsRequest.ProtoData()
proto_data.rows = proto_rows
request.proto_rows = proto_data
future = append_rows_stream.send(request)
response_futures.append(future)
# Wait for all the append row requests to finish.
for f in response_futures:
f.result()
# Shutdown background threads and close the streaming connection.
append_rows_stream.close()
return row_count
def create_row(row_num: int, name: str):
row = customer_record_pb2.CustomerRecord()
row.row_num = row_num
row.customer_name = name
return row
def main():
write_client = bigquery_storage_v1.BigQueryWriteClient()
rows = [ create_row(i, f"Test{i}") for i in range(0,20) ]
send_rows_to_bq("PROJECT_NAME", "DATASET_NAME", "TABLE_NAME", write_client, rows)
if __name__ == '__main__':
main()
Note:
In the above, CHUNK_SIZE is 2 just for this minimal example, but, in a real situation, I used a chunk size of 5000.
In real usage, I have several separate streams of data that need to be processed in parallel, so I make several calls to send_rows_to_bq, one for each stream of data, using a thread pool (one thread per stream of data). (I'm assuming here that AppendRowsStream is not meant to be shared by multiple threads, but I might be wrong).
It mostly works, but I often get a mix of intermittent errors in the call to append_rows_stream's send method:
google.cloud.bigquery_storage_v1.exceptions.StreamClosedError: This manager has been closed and can not be used.
google.api_core.exceptions.Unknown: None There was a problem opening the stream. Try turning on DEBUG level logs to see the error.
I think I just need to retry on these errors, but I'm not sure how to best implement a retry strategy here. My impression is that I need to use the following strategy to retry errors when calling send:
If the error is a StreamClosedError, the append_rows_stream stream manager can't be used anymore, and so I need to call close on it and then call my create_stream_manager again to create a new one, then try to call send on the new stream manager.
Otherwise, on any google.api_core.exceptions.ServerError error, retry the call to send on the same stream manager.
Am I approaching this correctly?
Thank you.

The best solution to this problem is to update to the newer lib release.
This problem happens or was happening in the older versions because once the connection write API reaches 10MB, it hangs.
If the update to the newer lib does not work you can try these options:
Limit the connection to < 10MB.
Disconnect and connect again to the API.

Related

GtkTreeView stops updating unless I change the focus of the window

I have a GtkTreeView object that uses a GtkListStore model that is constantly being updated as follows:
Get new transaction
Feed data into numpy array
Convert numbers to formatted strings, store in pandas dataframe
Add updated token info to GtkListStore via GtkListStore.set(titer, liststore_cols, liststore_data), where liststore_data is the updated info, liststore_cols is the name of the columns (both are lists).
Here's the function that updates the ListStore:
# update ListStore
titer = ls_full.get_iter(row)
liststore_data = []
[liststore_data.append(df.at[row, col])
for col in my_vars['ls_full'][3:]]
# check for NaN value, add a (space) placeholder is necessary
for i in range(3, len(liststore_data)):
if liststore_data[i] != liststore_data[i]:
liststore_data[i] = " "
liststore_cols = []
[liststore_cols.append(my_vars['ls_full'].index(col) + 1)
for col in my_vars['ls_full'][3:]]
ls_full.set(titer, liststore_cols, liststore_data)
Class that gets the messages from the websocket:
class MyWebsocketClient(cbpro.WebsocketClient):
# class exceptions to WebsocketClient
def on_open(self):
# sets up ticker Symbol, subscriptions for socket feed
self.url = "wss://ws-feed.pro.coinbase.com/"
self.channels = ['ticker']
self.products = list(cbp_symbols.keys())
def on_message(self, msg):
# gets latest message from socket, sends off to be processed
if "best_ask" and "time" in msg:
# checks to see if token price has changed before updating
update_needed = parse_data(msg)
if update_needed:
update_ListStore(msg)
else:
print(f'Bad message: {msg}')
When the program first starts, the updates are consistent. Each time a new transaction comes in, the screen reflects it, updating the proper token. However, after a random amount of time - seen it anywhere from 5 minutes to over an hour - the screen will stop updating, unless I change the focus of the window (either activate or inactive). This does not last long, though (only enough to update the screen once). No other errors are being reported, memory usage is not spiking (constant at 140 MB).
How can I troubleshoot this? I'm not even sure where to begin. The data back-ends seem to be OK (data is never corrupted nor lags behind).
As you've said in the comments that it is running in a separate thread then i'd suggest wrapping your "update liststore" function with GLib.idle_add.
from gi.repository import GLib
GLib.idle_add(update_liststore)
I've had similar issues in the past and this fixed things. Sometimes updating liststore is fine, sometimes it will randomly spew errors.
Basically only one thread should update the GUI at a time. So by wrapping in GLib.idle_add() you make sure your background thread does not intefer with the main thread updating the GUI.

how to structure beginning and ending synchronous calls using trio?

My ask is for structured trio pseudo-code (actual trio function-calls, but dummy worker-does-work-here fill-in) so I can understand and try out good flow-control practices for switching between synchronous and asynchronous processes.
I want to do the following...
load a file of json-data into a data-dict
aside: the data-dict looks like { 'key_a': {(info_dict_a)}, 'key_b': {info_dict_b} }
have each of n-workers...
access that data-dict to find the next record-to-process info-dict
prepare some data from the record-being-processed and post the data to a url
process the post-response to update a 'response' key in the record-being-processed info-dict
update the data-dict with the key's info-dict
overwrite the original file of json-data with the updated data-dict
Aside: I know there are other ways I could achieve my overall goal than the clunky repeated rewrite of a json file -- but I'm not asking for that input; I really would like to understand trio well enough to be able to use it for this flow.
So, the processes that I want to be synchronous:
the get next record-to-process info-dict
the updating of the data-dict
the overwriting of the original file of json-data with the updated data-dict
New to trio, I have working code here ...which I believe is getting the next record-to-process synchronously (via using a trio.Semaphore() technique). But I'm pretty sure I'm not saving the file synchronously.
Learning Go a few years ago, I felt I grokked the approaches to interweaving synchronous and asynchronous calls -- but am not there yet with trio. Thanks in advance.
Here is how I would write the (pseudo-)code:
async def process_file(input_file):
# load the file synchronously
with open(input_file) as fd:
data = json.load(fd)
# iterate over your dict asynchronously
async with trio.open_nursery() as nursery:
for key, sub in data.items():
if sub['updated'] is None:
sub['updated'] = 'in_progress'
nursery.start_soon(post_update, {key: sub})
# save your result json synchronously
save_file(data, input_file)
trio guarantees you that once you exit the async with block every task you spawned is complete so you can safely save your file because no more update will occur.
I also removed the grab_next_entry function because it seems to me that this function will iterate over the same keys (incrementally) at each call (giving a O(n!)) complexity while you could simplify it by just iterating over your dict once (dropping the complexity to O(n))
You don't need the Semaphore either, except if you want to limit the number of parallel post_update calls. But trio offers a builtin mechanism for this as well thanks to its CapacityLimiter that you would use like this:
limit = trio.CapacityLimiter(10)
async with trio.open_nursery() as nursery:
async with limit:
for x in z:
nursery.start_soon(func, x)
UPDATE thanks to #njsmith's comment
So, in order to limit the amount of concurrent post_update you'll rewrite it like this:
async def post_update(data, limit):
async with limit:
...
And then you can rewrite the previous loop like that:
limit = trio.CapacityLimiter(10)
# iterate over your dict asynchronously
async with trio.open_nursery() as nursery:
for key, sub in data.items():
if sub['updated'] is None:
sub['updated'] = 'in_progress'
nursery.start_soon(post_update, {key: sub}, limit)
This way, we spawn n tasks for the n entries in your data-dict, but if there are more than 10 tasks running concurrently, then the extra ones will have to wait for the limit to be released (at the end of the async with limit block).
This code uses channels to multiplex requests to and from a pool of workers. I found the additional requirement (in your code comments) that the post-response rate is throttled, so read_entries sleeps after each send.
from random import random
import time, asks, trio
snd_input, rcv_input = trio.open_memory_channel(0)
snd_output, rcv_output = trio.open_memory_channel(0)
async def read_entries():
async with snd_input:
for key_entry in range(10):
print("reading", key_entry)
await snd_input.send(key_entry)
await trio.sleep(1)
async def work(n):
async for key_entry in rcv_input:
print(f"w{n} {time.monotonic()} posting", key_entry)
r = await asks.post(f"https://httpbin.org/delay/{5 * random()}")
await snd_output.send((r.status_code, key_entry))
async def save_entries():
async for entry in rcv_output:
print("saving", entry)
async def main():
async with trio.open_nursery() as nursery:
nursery.start_soon(read_entries)
nursery.start_soon(save_entries)
async with snd_output:
async with trio.open_nursery() as workers:
for n in range(3):
workers.start_soon(work, n)
trio.run(main)

python-ldap: Retrieve only a few entries from LDAP search

I wish to mimic the ldapsearch -z flag behavior of retrieving only a specific amount of entries from LDAP using python-ldap.
However, it keeps failing with the exception SIZELIMIT_EXCEEDED.
There are multiple links where the problem is reported, but the suggested solution doesn't seem to work
Python-ldap search: Size Limit Exceeded
LDAP: ldap.SIZELIMIT_EXCEEDED
I am using search_ext_s() with sizelimit parameter set to 1, which I am sure is not more than the server limit
On Wireshark, I see that 1 entry is returned and the server raises SIZELIMIT_EXCEEDED. This is the same as ldapsearch -z behavior
But the following line raises an exception and I don't know how to retrieve the returned entry
conn.search_ext_s(<base>,ldap.SCOPE_SUBTREE,'(cn=demo_user*)',['dn'],sizelimit=1)
Based upon the discussion in the comments, this is how I achieved it:
import ldap
# These are not mandatory, I just have a habit
# of setting against Microsoft Active Directory
ldap.set_option(ldap.OPT_REFERRALS, 0)
ldap.set_option(ldap.OPT_PROTOCOL_VERSION, 3)
conn = ldap.initialize('ldap://<SERVER-IP>')
conn.simple_bind(<username>, <password>)
# Using async search version
ldap_result_id = conn.search_ext(<base-dn>, ldap.SCOPE_SUBTREE,
<filter>, [desired-attrs],
sizelimit=<your-desired-sizelimit>)
result_set = []
try:
while 1:
result_type, result_data = conn.result(ldap_result_id, 0)
if (result_data == []):
break
else:
# Handle the singular entry anyway you wish.
# I am appending here
if result_type == ldap.RES_SEARCH_ENTRY:
result_set.append(result_data)
except ldap.SIZELIMIT_EXCEEDED:
print 'Hitting sizelimit'
print result_set
Sample Output:
# My server has about 500 entries for 'demo_user' - 1,2,3 etc.
# My filter is '(cn=demo_user*)', attrs = ['cn'] with sizelimit of 5
$ python ldap_sizelimit.py
Hitting sizelimit
[[('CN=demo_user0,OU=DemoUsers,DC=ad,DC=local', {'cn': ['demo_user0']})],
[('CN=demo_user1,OU=DemoUsers,DC=ad,DC=local', {'cn': ['demo_user1']})],
[('CN=demo_user10,OU=DemoUsers,DC=ad,DC=local', {'cn': ['demo_user10']})],
[('CN=demo_user100,OU=DemoUsers,DC=ad,DC=local', {'cn': ['demo_user100']})],
[('CN=demo_user101,OU=DemoUsers,DC=ad,DC=local', {'cn': ['demo_user101']})]]
You may use play around with more srv controls to sort these etc. but I think the basic idea is conveyed ;)
You have to use the async search method LDAPObject.search_ext() and separate collect the results with LDAPObject.result() until the exception ldap.SIZELIMIT_EXCEEDED is raised.
The accepted answer works if you are searching for less users than specified by the server's sizelimit, but will fail if you wish to gather more than that (the default for AD is 1000 users).
Here's a Python3 implementation that I came up with after heavily editing what I found here and in the official documentation. At the time of writing this it works with the pip3 package python-ldap version 3.2.0.
def get_list_of_ldap_users():
hostname = "google.com"
username = "username_here"
password = "password_here"
base = "dc=google,dc=com"
print(f"Connecting to the LDAP server at '{hostname}'...")
connect = ldap.initialize(f"ldap://{hostname}")
connect.set_option(ldap.OPT_REFERRALS, 0)
connect.simple_bind_s(username, password)
connect=ldap_server
search_flt = "(cn=demo_user*)" # get all users with a specific cn
page_size = 1 # how many users to search for in each page, this depends on the server maximum setting (default is 1000)
searchreq_attrlist=["cn", "sn", "name", "userPrincipalName"] # change these to the attributes you care about
req_ctrl = SimplePagedResultsControl(criticality=True, size=page_size, cookie='')
msgid = connect.search_ext_s(base=base, scope=ldap.SCOPE_SUBTREE, filterstr=search_flt, attrlist=searchreq_attrlist, serverctrls=[req_ctrl])
total_results = []
pages = 0
while True: # loop over all of the pages using the same cookie, otherwise the search will fail
pages += 1
rtype, rdata, rmsgid, serverctrls = connect.result3(msgid)
for user in rdata:
total_results.append(user)
pctrls = [c for c in serverctrls if c.controlType == SimplePagedResultsControl.controlType]
if pctrls:
if pctrls[0].cookie: # Copy cookie from response control to request control
req_ctrl.cookie = pctrls[0].cookie
msgid = connect.search_ext_s(base=base, scope=ldap.SCOPE_SUBTREE, filterstr=search_flt, attrlist=searchreq_attrlist, serverctrls=[req_ctrl])
else:
break
else:
break
return total_results

Lego-EV3: How to fix EOFError when catching user-input via multiprocessing?

Currently, I am working with a EV3 lego robot that is controlled by several neurons. Now I want to modify the code (running on
python3) in such a way that one can change certain parameter values on the run via the shell (Ubuntu) in order to manipulate the robot's dynamics at any time (and for multiple times). Here is a schema of what I have achieved so far based on a short example code:
from multiprocessing import Process
from multiprocessing import SimpleQueue
import ev3dev.ev3 as ev3
class Neuron:
(definitions of class variables and update functions)
def check_input(queue):
while (True):
try:
new_para = str(input("Type 'parameter=value': "))
float(new_para[2:0]) # checking for float in input
var = new_para[0:2]
if (var == "k="): # change parameter k
queue.put(new_para)
elif (var == "g="): # change parameter g
queue.put(new_para)
else:
print("Error". Type 'k=...' or 'g=...')
queue.put(0) # put anything in queue
except (ValueError, EOFError):
print("New value is not a number. Try again!")
(some neuron-specific initializations)
queue = SimpleQueue()
check = Process(target=check_input, args=(queue,))
check.start()
while (True):
if (not queue.empty()):
cmd = queue.get()
var = cmd[0]
val = float(cmd[2:])
if (var == "k"):
Neuron.K = val
elif (var == "g"):
Neuron.g = val
(updating procedure for neurons, writing data to file)
Since I am new to multiprocessing there are certainly some mistakes concerning taking care of locking, efficiency and so on but the robot moves and input fields occur in the shell. However, the current problem is that it's actually impossible to make an input:
> python3 controller_multiprocess.py
> Type 'parameter=value': New value is not a number. Try again!
> Type 'parameter=value': New value is not a number. Try again!
> Type 'parameter=value': New value is not a number. Try again!
> ... (and so on)
I know that this behaviour is caused by putting the exception of EOFError due to the fact that this error occurs when the exception is removed (and the process crashes). Hence, the program just rushes through the try-loop here and assumes that no input (-> empty string) was made over and over again. Why does this happen? - when not called as a threaded procedure the program patiently waits for an input as expected. And how can one fix or bypass this issue so that changing parameters gets possible as wanted?
Thanks in advance!

Dynamically change the periodic interval of celery task at runtime

I have a periodic celery task running once per minute, like so:
#tasks.py
#periodic_task(run_every=(crontab(hour="*", minute="*", day_of_week="*")))
def scraping_task():
result = pollAPI()
Where the function pollAPI(), as you might have guessed from the name, polls an API. The catch is that the API has a rate limit that is undisclosed, and sometimes gives an error response, if that limit is hit. I'd like to be able to take that response, and if the limit is hit, decrease the periodic task interval dynamically (or even put the task on pause for a while). Is this possible?
I read in the docs about overwriting the is_due method of schedules, but I am lost on exactly what to do to give the behaviour I'm looking for here. Could anyone help?
You could try using celery.conf.update to update your CELERYBEAT_SCHEDULE.
You can add a model in the database that will store the information if the rate limit is reached. Before doing an API poll, you can check the information in the database. If there is no limit, then just send an API request.
The other approach is to use PeriodicTask from django-celery-beat. You can update the interval dynamically. I created an example project and wrote an article showing how to use dynamic periodic tasks in Celery and Django.
The example code that updates the task when the limit reached:
def scraping_task(special_object_id, larger_interval=1000):
try:
result = pollAPI()
except Exception as e:
# limit reached
special_object = ModelWithTask.objects.get(pk=special_object_id)
task = PeriodicTask.objects.get(pk=special_object.task.id)
new_schedule, created = IntervalSchedule.objects.get_or_create(
every=larger_inerval,
period=IntervalSchedule.SECONDS,
)
task.interval = new_schedule
task.save()
You can pass the parameters to the scraping_task when creating a PeriodicTask object. You will need to have an additional model in the database to have access to the task:
from django.db import models
from django_celery_beat.models import PeriodicTask
class ModelWithTask(models.Model):
task = models.OneToOneField(
PeriodicTask, null=True, blank=True, on_delete=models.SET_NULL
)
# create periodic task
special_object = ModelWithTask.objects.create_or_get()
schedule, created = IntervalSchedule.objects.get_or_create(
every=10,
period=IntervalSchedule.SECONDS,
)
task = PeriodicTask.objects.create(
interval=schedule,
name="Task 1",
task="scraping_task",
kwargs=json.dumps(
{
"special_obejct_id": special_object.id,
}
),
)
special_object.task = task
special_object.save()