TensorFlow Federated (TFF) TypeError in tff.templates.IterativeProcess.next() when clients_per_round exceeds 99 - tensorflow

I implemented a custom federated learning GAN training loop with TFF similar to this code by Google Research.
The client data for a particular training round is found using the following code snippet:
def client_dataset_fn():
    # Sample clients and data
    sampled_clients = np.random.choice(
        train_data.client_ids, size=cfg.clients_per_round, replace=False)
    datasets = [(next(client_gen_inputs_iterator),
                 train_data.create_tf_dataset_for_client(client_id).take(cfg.n_critic))
                for client_id in sampled_clients]
    return datasets

client_noise_inputs, client_real_data = zip(*client_dataset_fn())
This works perfectly for values of cfg.clients_per_round up to and including 99. When it is set to 100 or a larger value (with the total number of clients being larger, of course), I receive the following error:
Traceback (most recent call last):
File "main.py", line 109, in main
metrics = run_single_trial(train_data, test_data, cfg)
File "/mnt/workspace/tff/GAN/federated/fedgan_main.py", line 73, in run_single_trial
metrics = train_loop(iterative_process, server_dataset_fn, client_dataset_fn, model, eval_hook_fn, cfg)
File "/mnt/workspace/tff/GAN/federated/fedgan_main.py", line 124, in train_loop
client_real_data)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/core/impl/computation/function_utils.py", line 525, in __call__
return context.invoke(self, arg)
File "/usr/local/lib/python3.6/dist-packages/retrying.py", line 49, in wrapped_f
return Retrying(*dargs, **dkw).call(f, *args, **kw)
File "/usr/local/lib/python3.6/dist-packages/retrying.py", line 206, in call
return attempt.get(self._wrap_exception)
File "/usr/local/lib/python3.6/dist-packages/retrying.py", line 247, in get
six.reraise(self.value[0], self.value[1], self.value[2])
File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/retrying.py", line 200, in call
attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/core/impl/executors/execution_context.py", line 226, in invoke
_ingest(executor, unwrapped_arg, arg.type_signature)))
File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
return future.result()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/common_libs/tracing.py", line 396, in _wrapped
return await coro
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/core/impl/executors/execution_context.py", line 111, in _ingest
ingested = await asyncio.gather(*ingested)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/core/impl/executors/execution_context.py", line 116, in _ingest
return await executor.create_value(val, type_spec)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
result = await fn(*fn_args, **fn_kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 294, in create_value
value, type_spec))
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
result = await fn(*fn_args, **fn_kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/core/impl/executors/thread_delegating_executor.py", line 111, in create_value
self._target_executor.create_value(value, type_spec))
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/core/impl/executors/thread_delegating_executor.py", line 105, in _delegate
result_value = await _delegate_with_trace_ctx(coro, self._event_loop)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/common_libs/tracing.py", line 396, in _wrapped
return await coro
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
result = await fn(*fn_args, **fn_kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/core/impl/executors/federating_executor.py", line 394, in create_value
return await self._strategy.compute_federated_value(value, type_spec)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/core/impl/executors/federated_composing_strategy.py", line 279, in compute_federated_value
py_typecheck.check_type(value, list)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/common_libs/py_typecheck.py", line 41, in check_type
type_string(type_spec), type_string(type(target))))
TypeError: Expected list, found tuple.
During debugging, I looked at the target variable in the final line of the traceback and found it to be the above-mentioned client_real_data and client_noise_inputs. Their types are in fact tuples, not lists; however, this does not change with different values of cfg.clients_per_round. The only usage of cfg.clients_per_round is shown above in the random choice.
I really cannot explain why this is happening. Maybe somebody out there has experienced something similar and can help me out.
My used package versions are as follows:
Python 3.6.9 or 3.8.10 (checked both)
tensorflow 2.5.1
tensorflow-federated 0.19.0
retrying 1.3.3
six 1.15.0
As a workaround, I now manually convert client_noise_inputs and client_real_data to lists with list(tuple_var), but I am still curious why a list is required in the first place.
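For reference, a minimal sketch of that workaround; the exact shape of the next() call is an assumption based on the training loop described above:

# zip(*...) returns tuples; coerce them to lists before passing them into
# the federated computation (see the answer below for why a list is expected)
client_noise_inputs, client_real_data = zip(*client_dataset_fn())
client_noise_inputs = list(client_noise_inputs)
client_real_data = list(client_real_data)

# hypothetical call shape -- the real train_loop passes these into
# iterative_process.next() together with the server state
server_state, metrics = iterative_process.next(
    server_state, client_noise_inputs, client_real_data)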

(Copying and pasting from original on GitHub)
This seems to me to be an implementation distinction between the federated_composing_strategy and the federated_resolving_strategy. IIRC, by default we don't inject a composing executor into your stack until you hit 100 clients--which would be the source of this exciting mystery.
In particular, the composing strategy is programmed against the assumption that the incoming clients-placed value is represented as a list, whereas the resolving strategy codes against a much more flexible set of containers.
It's not wild to coerce your clients-placed value to a list--we could also extend the permitted representation of clients-placed values in the composing executor to match that in the resolving one, possibly pulling the appropriate logic to a shared place like here. I think it's a contribution we'd be very happy to accept if you're up for it!

Related

Multiple Callbacks and 'TypeError'?

I am trying to run a Python program that generates visuals from an audio file. I'm a bit of a beginner here, so I've just been reverse-engineering issues and incompatibilities that have come up along the way.
Now, I am faced with an error. The program runs successfully, but when it attempts to write/save the output video file, it gives me multiple tracebacks and a 'TypeError' at the very end:
Traceback (most recent call last):
File "visualize.py", line 400, in <module>
clip.write_videofile(outname,audio_codec='aac')
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/moviepy/decorators.py", line 54, in requires_duration
return f(clip, *a, **k)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/moviepy/decorators.py", line 135, in use_clip_fps_by_default
return f(clip, *new_a, **new_kw)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/moviepy/decorators.py", line 22, in convert_masks_to_RGB
return f(clip, *a, **k)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/moviepy/video/VideoClip.py", line 307, in write_videofile
logger=logger)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/moviepy/video/io/ffmpeg_writer.py", line 216, in ffmpeg_write_video
ffmpeg_params=ffmpeg_params) as writer:
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/moviepy/video/io/ffmpeg_writer.py", line 88, in __init__
'-r', '%0.02f' % fps,
TypeError: must be real number, not NoneType
My understanding is that each of the listed files and lines within them is giving the same error? Or does the TypeError at the end only apply to the most recent (lowest) file reference?
Would love to figure out how to resolve this. Somewhere in these file(s) a number is being referenced incorrectly, it seems.
Thanks so much for any help!!
I tried diving into the most recent (lowest) file, 'ffmpeg_writer.py', and went to the referenced line. To me, '%.02f' looked like it might have been formatted incorrectly, so I tried adding a 0 before the decimal point, but I got the same error.
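For what it's worth, the format string itself is valid; the TypeError can be reproduced in isolation whenever the value being formatted is None, which suggests the fps reaching ffmpeg_writer.py is None rather than the string being malformed:

# '%0.02f' is a legal format string; it only fails on a None value
fps = None
'%0.02f' % fps   # TypeError: must be real number, not NoneType
'%0.02f' % 30    # fine: '30.00'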
Not sure where to look...

Data Length Error when Merging PDFs with PyPDF2

I am starting a project that will take specific pages out of each PDF in a folder and merge those pages into a single file. When running the code quoted below, I get the error shown beneath it, about the length of the encrypted data, and I don't know where I would need to address that.
from PyPDF2 import PdfFileMerger
import glob

files = glob.glob('C:/Users/Jake/Documents/UPLOAD/test_merge/*.pdf')
merger = PdfFileMerger()

for file in files:
    merger.append(file)

merger.write("merged.pdf")
merger.close()
ERROR
Traceback (most recent call last):
File "C:\Users\Jake\Documents\Work Projects\Python\Contract Merger\Merger .02", line 10, in <module>
merger.write("merged.pdf")
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_merger.py", line 312, in write
my_file, ret_fileobj = self.output.write(fileobj)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_writer.py", line 838, in write
self.write_stream(stream)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_writer.py", line 811, in write_stream
self._sweep_indirect_references(self._root)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_writer.py", line 960, in _sweep_indirect_references
data = self._resolve_indirect_object(data)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_writer.py", line 1005, in _resolve_indirect_object
real_obj = data.pdf.get_object(data)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_reader.py", line 1187, in get_object
retval = self._encryption.decrypt_object(
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_encryption.py", line 747, in decrypt_object
return cf.decrypt_object(obj)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_encryption.py", line 185, in decrypt_object
obj[dictkey] = self.decrypt_object(value)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_encryption.py", line 179, in decrypt_object
data = self.strCrypt.decrypt(obj.original_bytes)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_encryption.py", line 87, in decrypt
d = aes.decrypt(data)
File "C:\Users\Jake\Anaconda3\lib\site-packages\Crypto\Cipher\_mode_cbc.py", line 246, in decrypt
raise ValueError("Data must be padded to %d byte boundary in CBC mode" % self.block_size)
ValueError: Data must be padded to 16 byte boundary in CBC mode
[Finished in 393ms]
I wrote a basic program from a YouTube video and tried to run it, but I got an error saying that PyCryptodome was a dependency of PyPDF2. After installing that, I am getting an error about the data length for encryption when writing the PDF. Googling that error led me to this solution. I am a bit of a novice, and I don't really understand why any kind of encryption is being applied in the first place, other than what I assume is necessary for the PDF reader/writer to operate, so I don't know where I would need to apply that solution in this code.
After writing up this question, I was led to this solution and tried the code below, but I received the same error.
from PyPDF2 import PdfFileMerger, PdfFileReader
import glob

merger = PdfFileMerger()
files = glob.glob('C:/Users/Jake/Documents/UPLOAD/test_merge/*.pdf')

for filename in files:
    with open(filename, 'rb') as source:
        tmp = PdfFileReader(source)
        merger.append(tmp)

merger.write('Result.pdf')
ERROR
Traceback (most recent call last):
File "C:\Users\Jake\Documents\Work Projects\Python\Contract Merger\Merger .03.py", line 13, in <module>
merger.write('Result.pdf')
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_merger.py", line 312, in write
my_file, ret_fileobj = self.output.write(fileobj)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_writer.py", line 838, in write
self.write_stream(stream)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_writer.py", line 811, in write_stream
self._sweep_indirect_references(self._root)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_writer.py", line 960, in _sweep_indirect_references
data = self._resolve_indirect_object(data)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_writer.py", line 1005, in _resolve_indirect_object
real_obj = data.pdf.get_object(data)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_reader.py", line 1187, in get_object
retval = self._encryption.decrypt_object(
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_encryption.py", line 747, in decrypt_object
return cf.decrypt_object(obj)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_encryption.py", line 185, in decrypt_object
obj[dictkey] = self.decrypt_object(value)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_encryption.py", line 179, in decrypt_object
data = self.strCrypt.decrypt(obj.original_bytes)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_encryption.py", line 87, in decrypt
d = aes.decrypt(data)
File "C:\Users\Jake\Anaconda3\lib\site-packages\Crypto\Cipher\_mode_cbc.py", line 246, in decrypt
raise ValueError("Data must be padded to %d byte boundary in CBC mode" % self.block_size)
ValueError: Data must be padded to 16 byte boundary in CBC mode
[Finished in 268ms]
My thinking is that something else has gone wrong, but I am at a loss as to what that could be.
What have I done wrong with this build to get this error, and how can I correct it?
It turns out this is an issue with PyPDF2 itself. There is a three-line fix that can be injected into the library to correct the error if you run into this before it is patched upstream.
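I don't have the exact upstream patch at hand, but judging from the traceback the failing call is aes.decrypt(data) in PyPDF2/_encryption.py, so the fix presumably pads the ciphertext to the AES block size first. A sketch of that idea (treat it as an assumption, not the official patch):

# in PyPDF2/_encryption.py, just before the failing aes.decrypt(data):
if len(data) % 16 != 0:                      # CBC mode works on 16-byte blocks
    data += b"\x00" * (16 - len(data) % 16)  # zero-pad to the boundary
d = aes.decrypt(data)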

External ID not found

Odoo Server Error
Traceback (most recent call last):
File "/home/odoo/src/odoo/14.0/odoo/tools/cache.py", line 85, in lookup
r = d[key]
File "/home/odoo/src/odoo/14.0/odoo/tools/func.py", line 71, in wrapper
return func(self, *args, **kwargs)
File "/home/odoo/src/odoo/14.0/odoo/tools/lru.py", line 34, in getitem
a = self.d[obj]
KeyError: ('ir.model.data', <function IrModelData.xmlid_lookup at 0x7f5794273a60>, 'account.account_invoices_without_payment')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/odoo/src/odoo/14.0/addons/web/controllers/main.py", line 2121, in report_download
response = self.report_routes(reportname, docids=docids, converter=converter, context=context)
File "/home/odoo/src/odoo/14.0/odoo/http.py", line 532, in response_wrap
response = f(*args, **kw)
File "/home/odoo/src/odoo/14.0/addons/web/controllers/main.py", line 2056, in report_routes
pdf = report.with_context(context)._render_qweb_pdf(docids, data=data)[0]
File "/home/odoo/src/odoo/14.0/addons/account/models/ir_actions_report.py", line 41, in _render_qweb_pdf
invoice_reports = (self.env.ref('account.account_invoices_without_payment'), self.env.ref('account.account_invoices'))
File "/home/odoo/src/odoo/14.0/odoo/api.py", line 511, in ref
return self['ir.model.data'].xmlid_to_object(xml_id, raise_if_not_found=raise_if_not_found)
File "/home/odoo/src/odoo/14.0/odoo/addons/base/models/ir_model.py", line 1944, in xmlid_to_object
t = self.xmlid_to_res_model_res_id(xmlid, raise_if_not_found)
File "/home/odoo/src/odoo/14.0/odoo/addons/base/models/ir_model.py", line 1928, in xmlid_to_res_model_res_id
return self.xmlid_lookup(xmlid)[1:3]
File "", line 2, in xmlid_lookup
File "/home/odoo/src/odoo/14.0/odoo/tools/cache.py", line 90, in lookup
value = d[key] = self.method(*args, **kwargs)
File "/home/odoo/src/odoo/14.0/odoo/addons/base/models/ir_model.py", line 1921, in xmlid_lookup
raise ValueError('External ID not found in the system: %s' % xmlid)
ValueError: External ID not found in the system:
account.account_invoices_without_payment
The error occurs when I try to print an invoice. It happens even if I choose an empty print template. Any help? Thanks.
In my opinion, you should check the table ir_model_data for a record with name = account_invoices_without_payment in module account. If you can find it, you should update the account module. If you can't find it, you can insert a new record into ir_model_data with that name and with res_id set to the id of the account_invoices_without_payment view in ir_ui_view.
Maybe that helps you.
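A minimal sketch of that check from the Odoo shell (assuming you have shell access; the search domain mirrors the missing external ID):

# odoo shell: look up the external ID record directly
rec = env['ir.model.data'].search([
    ('module', '=', 'account'),
    ('name', '=', 'account_invoices_without_payment'),
])
print(rec)  # an empty recordset means the external ID really is missing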
Please upgrade the account module and make sure that the correct database is being used. You can set the dbfilter option so that the correct database is chosen.
After checking in Settings -> External IDs, I found that this external ID had somehow been deleted for an unknown reason. I opened a new database and compared the two, found that this was indeed the case, and then created a new external ID according to the new database's value.

Running into uvloop issues with Database queries from Rasa-X?

I'm trying to make a simple query to my Amazon Neptune database from Rasa-X.
Here is the code from my actions.py:
from typing import Any, Dict, List, Text

from rasa_sdk import Action, Tracker
from rasa_sdk.executor import CollectingDispatcher

import knowledgebase as kb  # assumed import for the kb.* helpers below


class ActionQueryDietary(Action):
    def name(self) -> Text:
        return "action_query_dietary"

    def run(self, dispatcher: CollectingDispatcher, tracker: Tracker,
            domain: Dict[Text, Any]) -> List[Dict[Text, Any]]:
        available = False
        restaurant = "XXXX"
        dietaryQuestion = tracker.get_slot('dietaryQuestion')
        g, remoteConn = kb.openConnection()
        dietary = kb.getDietary(g, restaurant, dietaryQuestion)
        if dietary == "Yes":
            available = True
        if available:
            # note: .format() must be applied to the string itself,
            # not to the return value of utter_message()
            dispatcher.utter_message(
                text="According to our knowledge base, {} is on the menu".format(dietaryQuestion))
        else:
            dispatcher.utter_message(
                text="Sorry, according to our knowledge base, we don't have this option on the menu")
        kb.closeConnection(remoteConn)
        return []
and here is the code from knowledgebase.py:
def getDietary(g, restaurant, dietary):
    properties = queryRestaurantProperties(g, restaurant)
    result = properties[dietary]
    print(result)
    return result
but any query to the knowledgebase results in this error:
2020-10-22T18:01:22.347345241Z File "/opt/venv/lib/python3.7/site-packages/sanic/app.py", line 973, in handle_request
2020-10-22T18:01:22.347351643Z response = await response
2020-10-22T18:01:22.347357446Z File "/app/rasa_sdk/endpoint.py", line 102, in webhook
2020-10-22T18:01:22.347363522Z result = await executor.run(action_call)
2020-10-22T18:01:22.347369398Z File "/app/rasa_sdk/executor.py", line 392, in run
2020-10-22T18:01:22.347375473Z events = action(dispatcher, tracker, domain)
2020-10-22T18:01:22.347381348Z File "/app/actions/actions.py", line 33, in run
2020-10-22T18:01:22.347387063Z dietary = kb.getDietary(g, restaurant, dietaryQuestion)
2020-10-22T18:01:22.347392835Z File "/app/actions/knowledgebase.py", line 117, in getDietary
2020-10-22T18:01:22.347399112Z properties = queryRestaurantProperties(g, restaurant)
2020-10-22T18:01:22.347405111Z File "/app/actions/knowledgebase.py", line 86, in queryRestaurantProperties
2020-10-22T18:01:22.347411869Z result = g.V().has('name', restaurant).valueMap().next()
2020-10-22T18:01:22.347418048Z File "/opt/venv/lib/python3.7/site-packages/gremlin_python/process/traversal.py", line 89, in next
2020-10-22T18:01:22.347424293Z return self.__next__()
2020-10-22T18:01:22.347430069Z File "/opt/venv/lib/python3.7/site-packages/gremlin_python/process/traversal.py", line 48, in __next__
2020-10-22T18:01:22.347436333Z self.traversal_strategies.apply_strategies(self)
2020-10-22T18:01:22.347441940Z File "/opt/venv/lib/python3.7/site-packages/gremlin_python/process/traversal.py", line 573, in apply_strategies
2020-10-22T18:01:22.347447667Z traversal_strategy.apply(traversal)
2020-10-22T18:01:22.347453031Z File "/opt/venv/lib/python3.7/site-packages/gremlin_python/driver/remote_connection.py", line 149, in apply
2020-10-22T18:01:22.347459352Z remote_traversal = self.remote_connection.submit(traversal.bytecode)
2020-10-22T18:01:22.347465069Z File "/opt/venv/lib/python3.7/site-packages/gremlin_python/driver/driver_remote_connection.py", line 55, in submit
2020-10-22T18:01:22.347486749Z result_set = self._client.submit(bytecode)
2020-10-22T18:01:22.347493788Z File "/opt/venv/lib/python3.7/site-packages/gremlin_python/driver/client.py", line 127, in submit
2020-10-22T18:01:22.347499424Z return self.submitAsync(message, bindings=bindings).result()
2020-10-22T18:01:22.347505093Z File "/opt/venv/lib/python3.7/site-packages/gremlin_python/driver/client.py", line 146, in submitAsync
2020-10-22T18:01:22.347511092Z return conn.write(message)
2020-10-22T18:01:22.347516610Z File "/opt/venv/lib/python3.7/site-packages/gremlin_python/driver/connection.py", line 55, in write
2020-10-22T18:01:22.347522673Z self.connect()
2020-10-22T18:01:22.347529987Z File "/opt/venv/lib/python3.7/site-packages/gremlin_python/driver/connection.py", line 45, in connect
2020-10-22T18:01:22.347536097Z self._transport.connect(self._url, self._headers)
2020-10-22T18:01:22.347542222Z File "/opt/venv/lib/python3.7/site-packages/gremlin_python/driver/tornado/transport.py", line 36, in connect
2020-10-22T18:01:22.347547822Z lambda: websocket.websocket_connect(url))
2020-10-22T18:01:22.347553477Z File "/opt/venv/lib/python3.7/site-packages/tornado/ioloop.py", line 571, in run_sync
2020-10-22T18:01:22.347559295Z self.start()
2020-10-22T18:01:22.347564864Z File "/opt/venv/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 132, in start
2020-10-22T18:01:22.347570978Z self.asyncio_loop.run_forever()
2020-10-22T18:01:22.347576693Z File "uvloop/loop.pyx", line 1351, in uvloop.loop.Loop.run_forever
2020-10-22T18:01:22.347582342Z File "uvloop/loop.pyx", line 484, in uvloop.loop.Loop._run
2020-10-22T18:01:22.347588222Z RuntimeError: Cannot run the event loop while another loop is running
I tried using nest_asyncio.apply, but that resulted in this error:
ValueError: Can't patch loop of type <class 'uvloop.Loop'>
which, according to the nest_asyncio docs, is simply not supported for uvloop.
So I don't really know how to proceed?
Adding my comment above as an answer. In some cases it is necessary to downlevel the version of Tornado being used. There are some issues that you can run into if the event loop is already running when someone else tries to create one. For now, downleveling to Tornado 4.5.1 (for example, pip install tornado==4.5.1) with Gremlin Python should resolve any issues.
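A quick sanity check after pinning the dependency (assuming you can run Python inside the action server's environment) is to confirm which Tornado version is actually being imported:

import tornado
print(tornado.version)  # expect '4.5.1' after downleveling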

Writing to BigQuery from within a ParDo function

I would like to call a beam.io.Write(beam.io.BigQuerySink(..)) operation from within a ParDo function to generate a separate BigQuery table for each key in the PCollection (I'm using the Python SDK). Here are two similar threads, which unfortunately didn't help:
1) https://stackoverflow.com/questions/31156774/about-key-grouping-with-groupbykey
2) Dynamic table name when writing to BQ from dataflow pipelines
When I execute the following code, the rows for the first key get inserted into BigQuery and then the pipeline fails with the error below. I would really appreciate any suggestions on what I'm doing wrong or on how to fix it.
Pipeline code:
rows = p | 'read_bq_table' >> beam.io.Read(beam.io.BigQuerySource(query=query))

class par_upload(beam.DoFn):
    def process(self, context):
        key, value = context.element
        ### This block causes issues ###
        value | 'write_to_bq' >> beam.io.Write(
            beam.io.BigQuerySink(
                'PROJECT-NAME:analytics.first_table',  # will be replaced by a dynamic name based on key
                schema=schema,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
            )
        )
        ### End block ###
        return [value]

### Following part works fine ###
filtered = (rows
            | 'filter_rows' >> beam.Filter(lambda row: row['topic'] == 'analytics')
            | 'apply_projection' >> beam.Map(apply_projection, projection_fields)
            | 'group_by_key' >> beam.GroupByKey()
            | 'par_upload_to_bigquery' >> beam.ParDo(par_upload())
            | 'flat_map' >> beam.FlatMap(lambda l: l)  # this step is just for testing
            )

### This part works fine if I comment out the 'write_to_bq' block above ###
filtered | 'write_to_bq' >> beam.io.Write(
    beam.io.BigQuerySink(
        'PROJECT-NAME:analytics.another_table',
        schema=schema,
        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
)
Error message:
INFO:oauth2client.client:Attempting refresh to obtain initial access_token
INFO:oauth2client.client:Attempting refresh to obtain initial access_token
INFO:root:Writing 1 rows to PROJECT-NAME:analytics.first_table table.
INFO:root:Final: Debug counters: {'element_counts': Counter({'CreatePInput0': 1, 'write_to_bq/native_write': 1})}
ERROR:root:Error while visiting par_upload_to_bigquery
Traceback (most recent call last):
File "split_events.py", line 137, in <module>
run()
File "split_events.py", line 132, in run
p.run()
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/pipeline.py", line 159, in run
return self.runner.run(self)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/runners/direct_runner.py", line 102, in run
super(DirectPipelineRunner, self).run(pipeline)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 98, in run
pipeline.visit(RunVisitor(self))
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/pipeline.py", line 182, in visit
self._root_transform().visit(visitor, self, visited)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/pipeline.py", line 419, in visit
part.visit(visitor, pipeline, visited)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/pipeline.py", line 422, in visit
visitor.visit_transform(self)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 93, in visit_transform
self.runner.run_transform(transform_node)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 168, in run_transform
return m(transform_node)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/runners/direct_runner.py", line 98, in func_wrapper
func(self, pvalue, *args, **kwargs)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/runners/direct_runner.py", line 180, in run_ParDo
runner.process(v)
File "apache_beam/runners/common.py", line 133, in apache_beam.runners.common.DoFnRunner.process (apache_beam/runners/common.c:4483)
File "apache_beam/runners/common.py", line 139, in apache_beam.runners.common.DoFnRunner.process (apache_beam/runners/common.c:4311)
File "apache_beam/runners/common.py", line 150, in apache_beam.runners.common.DoFnRunner.reraise_augmented (apache_beam/runners/common.c:4677)
File "apache_beam/runners/common.py", line 137, in apache_beam.runners.common.DoFnRunner.process (apache_beam/runners/common.c:4245)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/typehints/typecheck.py", line 149, in process
return self.run(self.dofn.process, context, args, kwargs)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/typehints/typecheck.py", line 134, in run
result = method(context, *args, **kwargs)
File "split_events.py", line 73, in process
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 724, in __ror__
return self.transform.__ror__(pvalueish, self.label)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 445, in __ror__
return _MaterializePValues(cache).visit(result)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 105, in visit
return self._pvalue_cache.get_unwindowed_pvalue(node)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 262, in get_unwindowed_pvalue
return [v.value for v in self.get_pvalue(pvalue)]
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 244, in get_pvalue
value_with_refcount = self._cache[self.key(pvalue)]
KeyError: "(4384177040, None) [while running 'par_upload_to_bigquery']"
Edit (after the first answer):
I didn't realise my value needs to be a PCollection.
I've changed my code to this now (which is probably very inefficient):
key_pipe = p | 'pipe_' + key >> beam.Create(value)
key_pipe | 'write_' + key >> beam.io.Write(beam.io.BigQuerySink(..))
This now works fine locally, but not with BlockingDataflowPipelineRunner :-(
The pipeline fails with the following error:
JOB_MESSAGE_ERROR: (979394c29490e588): Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 474, in do_work
work_executor.execute()
File "dataflow_worker/executor.py", line 901, in dataflow_worker.executor.MapTaskExecutor.execute (dataflow_worker/executor.c:24331)
op.start()
File "dataflow_worker/executor.py", line 465, in dataflow_worker.executor.DoOperation.start (dataflow_worker/executor.c:14193)
def start(self):
File "dataflow_worker/executor.py", line 469, in dataflow_worker.executor.DoOperation.start (dataflow_worker/executor.c:13499)
fn, args, kwargs, tags_and_types, window_fn = (
ValueError: too many values to unpack (expected 5)
In the similar threads, the only suggestion for doing BigQuery write operations in a ParDo was to use the BigQuery API directly, or to use a client library.
The code that you wrote puts a Beam transform, beam.io.Write(beam.io.BigQuerySink(...)), inside a DoFn. A transform expects to be applied to a PCollection, like filtered in the working part of your code; that is not the case in the failing part, where value is a plain Python object, not a PCollection.
I think the easiest option would be to take a look at the gcloud-python BigQuery function insert_data() and call that from inside your ParDo, along the lines of the sketch below.
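A minimal sketch of that idea, assuming the old gcloud-python API that insert_data() belongs to (exact signatures depend on your installed version; the per-key table naming and the pass-through return value are illustrative assumptions):

import apache_beam as beam
from gcloud import bigquery

class ParUploadDirect(beam.DoFn):
    def process(self, context):
        # context.element is a (key, rows) pair coming out of GroupByKey
        key, rows = context.element
        client = bigquery.Client(project='PROJECT-NAME')
        dataset = client.dataset('analytics')
        table = dataset.table(str(key))  # one BigQuery table per key
        # ... create the table with `schema` here if it does not exist ...
        table.insert_data(rows)  # streams rows directly, bypassing Beam sinks
        return [context.element]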