I am trying to convert a Jupyter notebook to PDF via LaTeX using nbconvert, in order to automatically include citations to articles contained in a separate .bib file. I followed the tutorial/example here, which is the one suggested in the nbconvert documentation, here.
I have the following files in the same directory in which I am running the Jupyter notebook:
citations.tplx (the template to be used to include the bibliography)
references.bib (a .bib file containing the citations, taken from Google Scholar)
Inside the markdown cells, I use the following syntax to cite a work:
<cite data-cite="cortez2009modeling">(Cortez, 2009)</cite>
where the corresponding entry in the .bib file is as follows:
@article{cortez2009modeling,
title={Modeling wine preferences by data mining from physicochemical properties},
author={Cortez, Paulo and Cerdeira, Ant{\'o}nio and Almeida, Fernando and Matos, Telmo and Reis, Jos{\'e}},
journal={Decision support systems},
volume={47},
number={4},
pages={547--553},
year={2009},
publisher={Elsevier}
}
In a new notebook, saved in the same location, I run the following command, also taken from the tutorial mentioned above:
%%bash
jupyter nbconvert --to latex --template citations.tplx --post pdf my_notebook.ipynb
I get a very long output, full of warnings, but basically, the error is:
ModuleNotFoundError: No module named 'pdf'
I also tried to do this following other tutorials on the web, but even when the PDF file was indeed generated (using a slightly different nbconvert command), my citations were not picked up in the text (a question mark would appear instead), and there was no bibliography at the end of the document. A warning would say there were 'problems' with BibTeX, but nothing more.
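For reference, my understanding is that the --post pdf step is only meant to automate the usual LaTeX/BibTeX compile cycle on the exported .tex file. Below is a minimal sketch of that cycle driven from Python; the file names, the choice of pdflatex (nbconvert itself defaults to xelatex), and the assumption that the template emits a \bibliography command are all mine, not taken from the tutorial:
import subprocess

# Assumed manual equivalent of the PDF post-processing step: export the
# notebook to LaTeX with the citation template, then run LaTeX + BibTeX so
# that \cite commands and the bibliography are resolved.
subprocess.run(["jupyter", "nbconvert", "--to", "latex",
                "--template", "citations.tplx", "my_notebook.ipynb"], check=True)
subprocess.run(["pdflatex", "my_notebook.tex"], check=True)  # first pass, writes the .aux file
subprocess.run(["bibtex", "my_notebook"], check=True)        # resolves keys from references.bib
subprocess.run(["pdflatex", "my_notebook.tex"], check=True)  # pulls in the generated .bbl
subprocess.run(["pdflatex", "my_notebook.tex"], check=True)  # fixes remaining cross-references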
In the following, I report the complete output of the command I wrote above:
Traceback (most recent call last):
File "/opt/anaconda3/bin/jupyter-nbconvert", line 11, in <module>
sys.exit(main())
File "/opt/anaconda3/lib/python3.8/site-packages/jupyter_core/application.py", line 254, in launch_instance
return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
File "/opt/anaconda3/lib/python3.8/site-packages/traitlets/config/application.py", line 844, in launch_instance
app.initialize(argv)
File "/opt/anaconda3/lib/python3.8/site-packages/traitlets/config/application.py", line 87, in inner
return method(app, *args, **kwargs)
File "/opt/anaconda3/lib/python3.8/site-packages/nbconvert/nbconvertapp.py", line 290, in initialize
super().initialize(argv)
File "/opt/anaconda3/lib/python3.8/site-packages/traitlets/config/application.py", line 87, in inner
return method(app, *args, **kwargs)
File "/opt/anaconda3/lib/python3.8/site-packages/jupyter_core/application.py", line 225, in initialize
self.parse_command_line(argv)
File "/opt/anaconda3/lib/python3.8/site-packages/traitlets/config/application.py", line 87, in inner
return method(app, *args, **kwargs)
File "/opt/anaconda3/lib/python3.8/site-packages/traitlets/config/application.py", line 713, in parse_command_line
self.update_config(self.cli_config)
File "/opt/anaconda3/lib/python3.8/site-packages/traitlets/config/configurable.py", line 220, in update_config
self._load_config(config)
File "/opt/anaconda3/lib/python3.8/site-packages/traitlets/config/configurable.py", line 190, in _load_config
warn(msg)
File "/opt/anaconda3/lib/python3.8/contextlib.py", line 120, in __exit__
next(self.gen)
File "/opt/anaconda3/lib/python3.8/site-packages/traitlets/traitlets.py", line 1214, in hold_trait_notifications
self.notify_change(change)
File "/opt/anaconda3/lib/python3.8/site-packages/traitlets/traitlets.py", line 1227, in notify_change
return self._notify_observers(change)
File "/opt/anaconda3/lib/python3.8/site-packages/traitlets/traitlets.py", line 1264, in _notify_observers
c(event)
File "/opt/anaconda3/lib/python3.8/site-packages/nbconvert/nbconvertapp.py", line 265, in _postprocessor_class_changed
self.postprocessor_factory = import_item(new)
File "/opt/anaconda3/lib/python3.8/site-packages/traitlets/utils/importstring.py", line 38, in import_item
return __import__(parts[0])
ModuleNotFoundError: No module named 'pdf'
---------------------------------------------------------------------------
CalledProcessError Traceback (most recent call last)
<ipython-input-22-c5829f9d50d0> in <module>
----> 1 get_ipython().run_cell_magic('bash', '', 'jupyter nbconvert --to latex --template citations.tplx --post pdf Orlando_Taddeo_CW.ipynb\n')
/opt/anaconda3/envs/tf/lib/python3.7/site-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
2397 with self.builtin_trap:
2398 args = (magic_arg_s, cell)
-> 2399 result = fn(*args, **kwargs)
2400 return result
2401
/opt/anaconda3/envs/tf/lib/python3.7/site-packages/IPython/core/magics/script.py in named_script_magic(line, cell)
140 else:
141 line = script
--> 142 return self.shebang(line, cell)
143
144 # write a basic docstring:
/opt/anaconda3/envs/tf/lib/python3.7/site-packages/decorator.py in fun(*args, **kw)
230 if not kwsyntax:
231 args, kw = fix(args, kw, sig)
--> 232 return caller(func, *(extras + args), **kw)
233 fun.__name__ = func.__name__
234 fun.__doc__ = func.__doc__
/opt/anaconda3/envs/tf/lib/python3.7/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
185 # but it's overkill for just that one bit of state.
186 def magic_deco(arg):
--> 187 call = lambda f, *a, **k: f(*a, **k)
188
189 if callable(arg):
/opt/anaconda3/envs/tf/lib/python3.7/site-packages/IPython/core/magics/script.py in shebang(self, line, cell)
243 sys.stderr.flush()
244 if args.raise_error and p.returncode!=0:
--> 245 raise CalledProcessError(p.returncode, cell, output=out, stderr=err)
246
247 def _run_script(self, p, cell, to_close):
CalledProcessError: Command 'b'jupyter nbconvert --to latex --template citations.tplx --post pdf Orlando_Taddeo_CW.ipynb\n'' returned non-zero exit status 1.
Could anyone please shed some light on this? I really can't figure out why it does not work.
Thank you very much in advance.
Problem with connecting the Storage Domain (Host host2 cannot access the Storage Domain(s))
Hello, everyone! I need specialist help, because I'm already desperate. My company has four hosts that are connected to the storage. Each host has its own IPs to access the storage: host 1 has 10.42.0.10 and 10.42.1.10, and host 2 has 10.42.0.20 and 10.42.1.20. Host 1 cannot ping the address 10.42.0.20. I have tried to describe the hardware in more detail below.
Host 1 has oVirt Node 4.3.9 installed and the hosted engine deployed.
When I try to add host 2 to the cluster, it is installed but not activated. There is an error in the oVirt manager, "Host **host2** cannot access the Storage Domain(s) <UNKNOWN>", and host 2 goes into "Non Operational" status. On host 2, the log shows "connect to 10.42.1.10:3260 failed (No route to host)", repeating indefinitely. I manually connected host 2 to the storage using iscsiadm to the IP 10.42.0.20, but the error did not go away. While the host is trying to activate, I can run virtual machines on it, until the error message appears. VMs that were started on host 2 keep running even when the host is in Non Operational status.
I assume that when host 2 is added to the cluster, oVirt tries to connect it to the same storage target that host 1 uses, via the IP 10.42.1.10. Is there a way to make oVirt connect through a different IP address instead of the one used for the first host? I'm attaching logs:
/var/log/ovirt-engine/engine.log
2020-03-31 09:13:03,866+03 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-90) [7fa128f4] EVENT_ID: VDS_SET_NONOPERATIONAL_DOMAIN(522), Host host2.school34.local cannot access the Storage Domain(s) <UNKNOWN> attached to the Data Center DataCenter. Setting Host state to Non-Operational.
2020-03-31 10:40:04,883+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-12) [7a48ebb7] START, ConnectStorageServerVDSCommand(HostName = host2.school34.local, StorageServerConnectionManagementVDSParameters:{hostId='d82c3a76-e417-4fe4-8b08-a29414e3a9c1', storagePoolId='6052cc0a-71b9-11ea-ba5a-00163e10c7e7', storageType='ISCSI', connectionList='[StorageServerConnections:{id='c8a05dc2-f8a2-4354-96ed-907762c29761', connection='10.42.0.10', iqn='iqn.2002-09.com.lenovo:01.array.00c0ff3bfcb0', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='0ec6f34e-01c8-4ecc-9bd4-7e2a250d589d', connection='10.42.1.10', iqn='iqn.2002-09.com.lenovo:01.array.00c0ff3bfcb0', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]', sendNetworkEventOnFailure='true'}), log id: 2c1a22b5
2020-03-31 10:43:05,061+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-12) [7a48ebb7] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host2.school34.local command ConnectStorageServerVDS failed: Message timeout which can be caused by communication issues
vdsm.log
2020-03-31 09:34:07,264+0300 ERROR (jsonrpc/5) [storage.HSM] Could not connect to storageServer (hsm:2420)
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2417, in connectStorageServer
conObj.connect()
File "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 488, in connect
iscsi.addIscsiNode(self._iface, self._target, self._cred)
File "/usr/lib/python2.7/site-packages/vdsm/storage/iscsi.py", line 217, in addIscsiNode
iscsiadm.node_login(iface.name, target.address, target.iqn)
File "/usr/lib/python2.7/site-packages/vdsm/storage/iscsiadm.py", line 337, in node_login
raise IscsiNodeError(rc, out, err)
IscsiNodeError: (8, ['Logging in to [iface: default, target: iqn.2002-09.com.lenovo:01.array.00c0ff3bfcb0, portal: 10.42.1.10,3260] (multiple)'], ['iscsiadm: Could not login to [iface: default, target: iqn.2002-09.com.lenovo:01.array.00c0ff3bfcb0, portal: 10.42.1.10,3260].', 'iscsiadm: initiator reported error (8 - connection timed out)', 'iscsiadm: Could not log into all portals'])
2020-03-31 09:36:01,583+0300 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {u'connectionParams': [{u'port': u'3260', u'connection': u'10.42.0.10', u'iqn': u'iqn.2002-09.com.lenovo:01.array.00c0ff3bfcb0', u'user': u'', u'tpgt': u'2', u'ipv6_enabled': u'false', u'password': '********', u'id': u'c8a05dc2-f8a2-4354-96ed-907762c29761'}, {u'port': u'3260', u'connection': u'10.42.1.10', u'iqn': u'iqn.2002-09.com.lenovo:01.array.00c0ff3bfcb0', u'user': u'', u'tpgt': u'1', u'ipv6_enabled': u'false', u'password': '********', u'id': u'0ec6f34e-01c8-4ecc-9bd4-7e2a250d589d'}], u'storagepoolID': u'6052cc0a-71b9-11ea-ba5a-00163e10c7e7', u'domainType': 3}, 'jsonrpc': '2.0', 'method': u'StoragePool.connectStorageServer', 'id': u'64cc0385-3a11-474b-98f0-b0ecaa6c67c8'} at 0x7fe1ac1ff510> timeout=60, duration=60.00 at 0x7fe1ac1ffb10> task#=316 at 0x7fe1f0041ad0>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 260, in run
ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 262, in __call__
self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 305, in _serveRequest
response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 345, in _handle_request
res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 194, in _dynamicMethod
result = fn(*methodArgs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1102, in connectStorageServer
connectionParams)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 74, in wrapper
result = ctask.prepare(func, *args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper
return m(self, *a, **kw)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1179, in prepare
result = self._run(func, *args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
return fn(*args, **kargs)
File: "<string>", line 2, in connectStorageServer
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2417, in connectStorageServer
conObj.connect()
File: "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 488, in connect
iscsi.addIscsiNode(self._iface, self._target, self._cred)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/iscsi.py", line 217, in addIscsiNode
iscsiadm.node_login(iface.name, target.address, target.iqn)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/iscsiadm.py", line 327, in node_login
portal, "-l"])
File: "/usr/lib/python2.7/site-packages/vdsm/storage/iscsiadm.py", line 122, in _runCmd
return misc.execCmd(cmd, printable=printCmd, sudo=True, sync=sync)
File: "/usr/lib/python2.7/site-packages/vdsm/common/commands.py", line 213, in execCmd
(out, err) = p.communicate(data)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 924, in communicate
stdout, stderr = self._communicate(input, endtime, timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1706, in _communicate
orig_timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1779, in _communicate_with_poll
ready = poller.poll(self._remaining_time(endtime)) (executor:363)
Thanks a lot!
I'm struggling to figure out your architecture. It seems like you've configured your Storage Domain to point to 10.42.1.10:3260 as the iSCSI portal.
If I've understood correctly, your cluster should be something like:
+-------------+
|HostedEngine |
+------+------+
|
| Management network
+-------+------+ 10.42.0.0/24
| |
+ +
10.42.0.10 10.42.0.20
+--------+ +--------+
| host1 | | host2 |
+--------+ +--------+
10.42.1.10 10.42.1.20
+ +
| |
+-------+------+
|
| Storage network
+---+---+ 10.42.1.0/24
|Storage|
+-------+
Provided my guess is correct, it seems you've configured your iSCSI target on host1 instead of on a proper, external storage device. Otherwise, you've messed up the IP addressing.
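One quick way to verify this is to check, from host2, whether each portal the engine hands out is actually reachable on the iSCSI port. This is just a sketch of my own (not from the thread); the portal list is taken from the ConnectStorageServerVDSCommand entry in engine.log above:
import socket

# Portals the engine passes to every host (from the engine.log entry above).
portals = ["10.42.0.10", "10.42.1.10"]

for ip in portals:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(5)
    try:
        sock.connect((ip, 3260))              # 3260 is the standard iSCSI port
        print("%s:3260 reachable" % ip)
    except socket.error as exc:               # covers timeouts and 'no route to host'
        print("%s:3260 NOT reachable: %s" % (ip, exc))
    finally:
        sock.close()

If 10.42.1.10:3260 is unreachable from host2 while 10.42.0.10:3260 is fine, the problem is routing between the storage network segments rather than anything in oVirt itself.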
I have a dataframe with 20 different sheets. It ran normally for the first 16 sheets, but on the 17th sheet it raised an error. Here is my code:
A=A.sort_values(by=['timing','id'])
The error was:
Traceback (most recent call last):
File "<ipython-input-24-11bf4f35bb1b>", line 1, in <module>
SessionNumber(5)
File "filepath", line 160
DepthBuyA=DepthBuyA.sort_values(by=['timing','id'])
File "C:\Anaconda\lib\site-packages\pandas\core\frame.py", line 4411, in sort_values
stacklevel=stacklevel)
File "C:\Anaconda\lib\site-packages\pandas\core\generic.py", line 1382, in _get_label_or_level_values
raise KeyError(key)
KeyError: 'id'
So I thought there must be some problem with the column 'id' on that particular sheet, because the other sheets also had 'id' and none of them raised an error like that. So I tried:
print(A['id'])
And it successfully printed the column 'id' for sheet 17; however, right after printing it, it raised this error:
File "C:\Anaconda\lib\site-packages\pandas\core\frame.py", line 2688, in __getitem__
return self._getitem_column(key)
File "C:\Anaconda\lib\site-packages\pandas\core\frame.py", line 2695, in _getitem_column
return self._get_item_cache(key)
File "C:\Anaconda\lib\site-packages\pandas\core\generic.py", line 2489, in _get_item_cache
values = self._data.get(item)
File "C:\Anaconda\lib\site-packages\pandas\core\internals.py", line 4115, in get
loc = self.items.get_loc(item)
File "C:\Anaconda\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas\_libs\index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'id'
So after that I tried running the code directly in the console, and now there is no error.
A=A.sort_values(by=['timing','id'])
So what is the problem, and what can I do to fix it?
Thank you!
I used the column index instead of the name, and it is fine now.
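Here is a hypothetical sketch of what using the column index instead of the name could look like; the header with a stray space is only my guess at why a single sheet failed:
import pandas as pd

# Hypothetical reproduction: one sheet has a header that is not exactly 'id'
# (for example 'id ' with a trailing space), so sorting by the label raises KeyError.
A = pd.DataFrame([[2, 20], [1, 10]], columns=["timing", "id "])  # note the stray space

# Look the columns up by position instead of hard-coding the labels.
timing_col, id_col = A.columns[0], A.columns[1]
A = A.sort_values(by=[timing_col, id_col])
print(A)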
Trying to slice a pandas DataFrame using minutes and getting the error message "pandas.tslib.OutOfBoundsDatetime".
My understanding is that I can slice pandas data with the format x['HH:MM':'HH:MM'], but I do not understand what is wrong:
min_index
Out[237]:
0 04:00
1 04:01
2 04:04
3 04:05
4 04:07
dtype: object
df=pd.DataFrame(np.arange(5))
x=df.set_index(pd.to_datetime(min_index))
x
Out[241]:
                     0
TimeBarStart
2017-02-07 04:00:00  0
2017-02-07 04:01:00  1
2017-02-07 04:04:00  2
2017-02-07 04:05:00  3
2017-02-07 04:07:00  4
x['04:01' : '04:05']
Traceback (most recent call last):
File "C:\Users\Chris\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2881, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-242-e650c256178d>", line 1, in <module>
x['04:01' : '04:05']
File "C:\Users\Chris\Anaconda3\lib\site-packages\pandas\core\frame.py", line 1985, in __getitem__
indexer = convert_to_index_sliceable(self, key)
File "C:\Users\Chris\Anaconda3\lib\site-packages\pandas\core\indexing.py", line 1758, in convert_to_index_sliceable
return idx._convert_slice_indexer(key, kind='getitem')
File "C:\Users\Chris\Anaconda3\lib\site-packages\pandas\indexes\base.py", line 1079, in _convert_slice_indexer
indexer = self.slice_indexer(start, stop, step, kind=kind)
File "C:\Users\Chris\Anaconda3\lib\site-packages\pandas\tseries\index.py", line 1511, in slice_indexer
return Index.slice_indexer(self, start, end, step, kind=kind)
File "C:\Users\Chris\Anaconda3\lib\site-packages\pandas\indexes\base.py", line 2785, in slice_indexer
kind=kind)
File "C:\Users\Chris\Anaconda3\lib\site-packages\pandas\indexes\base.py", line 2964, in slice_locs
start_slice = self.get_slice_bound(start, 'left', kind)
File "C:\Users\Chris\Anaconda3\lib\site-packages\pandas\indexes\base.py", line 2903, in get_slice_bound
label = self._maybe_cast_slice_bound(label, side, kind)
File "C:\Users\Chris\Anaconda3\lib\site-packages\pandas\tseries\index.py", line 1472, in _maybe_cast_slice_bound
bounds = self._parsed_string_to_bounds(reso, parsed)
File "C:\Users\Chris\Anaconda3\lib\site-packages\pandas\tseries\index.py", line 1297, in _parsed_string_to_bounds
return (Timestamp(st, tz=self.tz),
File "pandas\tslib.pyx", line 295, in pandas.tslib.Timestamp.__new__ (pandas\tslib.c:9203)
File "pandas\tslib.pyx", line 1334, in pandas.tslib.convert_to_tsobject (pandas\tslib.c:25690)
File "pandas\tslib.pyx", line 1562, in pandas.tslib._check_dts_bounds (pandas\tslib.c:29245)
pandas.tslib.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-01-01 04:01:00
Try using the between_time() method:
In [75]: x.between_time('04:01', '04:05')
Out[75]:
                     0
TimeBarStart
2017-02-07 04:01:00  1
2017-02-07 04:04:00  2
2017-02-07 04:05:00  3
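As a follow-up, the reason the original slice fails is that the bare '04:01' bound has to be parsed into a full timestamp, and the default date it receives (year 1, as in the "1-01-01 04:01:00" from the traceback) is outside the nanosecond-timestamp range. Here is a small sketch under assumed data, with the date pinned explicitly so it is reproducible, showing both between_time() and a date-qualified slice:
import numpy as np
import pandas as pd

# Rebuild the example with an explicit date (in the question, to_datetime
# silently filled in the current date for the time-only strings).
times = pd.Series(['04:00', '04:01', '04:04', '04:05', '04:07'])
idx = pd.to_datetime('2017-02-07 ' + times)
x = pd.DataFrame(np.arange(5)).set_index(idx)

# Select rows by time of day, independent of the date part of the index.
print(x.between_time('04:01', '04:05'))

# Slicing also works once the bounds include the date, so pandas does not
# have to guess an (out-of-bounds) default date for '04:01'.
print(x['2017-02-07 04:01':'2017-02-07 04:05'])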
I would like to call a beam.io.Write(beam.io.BigQuerySink(..)) operation from within a ParDo function to generate a separate BigQuery table for each key in the PCollection (I'm using the Python SDK). Here are two similar threads, which unfortunately didn't help:
1) https://stackoverflow.com/questions/31156774/about-key-grouping-with-groupbykey
2) Dynamic table name when writing to BQ from dataflow pipelines
When I execute the following code, the rows for the first key get inserted into BigQuery and then the pipeline fails with the error below. I would really appreciate any suggestions on what I'm doing wrong or how to fix it.
Pipeline code:
rows = p | 'read_bq_table' >> beam.io.Read(beam.io.BigQuerySource(query=query))
class par_upload(beam.DoFn):
    def process(self, context):
        key, value = context.element
        ### This block causes issues ###
        value | 'write_to_bq' >> beam.io.Write(
            beam.io.BigQuerySink(
                'PROJECT-NAME:analytics.first_table',  # will be replaced by a dynamic name based on key
                schema=schema,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
            )
        )
        ### End block ######
        return [value]

### Following part works fine ###
filtered = (rows | 'filter_rows' >> beam.Filter(lambda row: row['topic'] == 'analytics')
                 | 'apply_projection' >> beam.Map(apply_projection, projection_fields)
                 | 'group_by_key' >> beam.GroupByKey()
                 | 'par_upload_to_bigquery' >> beam.ParDo(par_upload())
                 | 'flat_map' >> beam.FlatMap(lambda l: l)  # this step is just for testing
           )

### This part works fine if I comment out the 'write_to_bq' block above
filtered | 'write_to_bq' >> beam.io.Write(
    beam.io.BigQuerySink(
        'PROJECT-NAME:analytics.another_table',
        schema=schema,
        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
)
Error message:
INFO:oauth2client.client:Attempting refresh to obtain initial access_token
INFO:oauth2client.client:Attempting refresh to obtain initial access_token
INFO:root:Writing 1 rows to PROJECT-NAME:analytics.first_table table.
INFO:root:Final: Debug counters: {'element_counts': Counter({'CreatePInput0': 1, 'write_to_bq/native_write': 1})}
ERROR:root:Error while visiting par_upload_to_bigquery
Traceback (most recent call last):
File "split_events.py", line 137, in <module>
run()
File "split_events.py", line 132, in run
p.run()
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/pipeline.py", line 159, in run
return self.runner.run(self)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/runners/direct_runner.py", line 102, in run
super(DirectPipelineRunner, self).run(pipeline)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 98, in run
pipeline.visit(RunVisitor(self))
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/pipeline.py", line 182, in visit
self._root_transform().visit(visitor, self, visited)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/pipeline.py", line 419, in visit
part.visit(visitor, pipeline, visited)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/pipeline.py", line 422, in visit
visitor.visit_transform(self)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 93, in visit_transform
self.runner.run_transform(transform_node)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 168, in run_transform
return m(transform_node)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/runners/direct_runner.py", line 98, in func_wrapper
func(self, pvalue, *args, **kwargs)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/runners/direct_runner.py", line 180, in run_ParDo
runner.process(v)
File "apache_beam/runners/common.py", line 133, in apache_beam.runners.common.DoFnRunner.process (apache_beam/runners/common.c:4483)
File "apache_beam/runners/common.py", line 139, in apache_beam.runners.common.DoFnRunner.process (apache_beam/runners/common.c:4311)
File "apache_beam/runners/common.py", line 150, in apache_beam.runners.common.DoFnRunner.reraise_augmented (apache_beam/runners/common.c:4677)
File "apache_beam/runners/common.py", line 137, in apache_beam.runners.common.DoFnRunner.process (apache_beam/runners/common.c:4245)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/typehints/typecheck.py", line 149, in process
return self.run(self.dofn.process, context, args, kwargs)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/typehints/typecheck.py", line 134, in run
result = method(context, *args, **kwargs)
File "split_events.py", line 73, in process
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 724, in __ror__
return self.transform.__ror__(pvalueish, self.label)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 445, in __ror__
return _MaterializePValues(cache).visit(result)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 105, in visit
return self._pvalue_cache.get_unwindowed_pvalue(node)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 262, in get_unwindowed_pvalue
return [v.value for v in self.get_pvalue(pvalue)]
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 244, in get_pvalue
value_with_refcount = self._cache[self.key(pvalue)]
KeyError: "(4384177040, None) [while running 'par_upload_to_bigquery']"
Edit (after the first answer):
I didn't realise my value needs to be a PCollection.
I've changed my code to this now (which is probably very inefficient):
key_pipe = p | 'pipe_' + key >> beam.Create(value)
key_pipe | 'write_' + key >> beam.io.Write(beam.io.BigQuerySink(..))
This now works fine locally, but not with BlockingDataflowPipelineRunner :-(
The pipeline fails with the following error:
JOB_MESSAGE_ERROR: (979394c29490e588): Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 474, in do_work
work_executor.execute()
File "dataflow_worker/executor.py", line 901, in dataflow_worker.executor.MapTaskExecutor.execute (dataflow_worker/executor.c:24331)
op.start()
File "dataflow_worker/executor.py", line 465, in dataflow_worker.executor.DoOperation.start (dataflow_worker/executor.c:14193)
def start(self):
File "dataflow_worker/executor.py", line 469, in dataflow_worker.executor.DoOperation.start (dataflow_worker/executor.c:13499)
fn, args, kwargs, tags_and_types, window_fn = (
ValueError: too many values to unpack (expected 5)
In the similar threads, the only suggestion for doing BigQuery write operations in a ParDo was to use the BigQuery API directly, or to use a client library.
The code you wrote puts a Dataflow transform, beam.io.Write(beam.io.BigQuerySink(...)), inside a DoFn. A transform like that is meant to be applied to a PCollection, such as filtered in the working part of your code, not to the plain Python value you are working with inside process.
I think the easiest option would be to take a look at the gcloud-python BigQuery function insert_data() and call it from inside your ParDo.
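A rough sketch of that idea is below. It is hypothetical: it uses the current google-cloud-bigquery client and insert_rows_json() rather than the older gcloud-python insert_data(), and the project, dataset, and table names are invented placeholders:
import apache_beam as beam
from google.cloud import bigquery  # assumes the google-cloud-bigquery package is installed


class WritePerKeyToBQ(beam.DoFn):
    """Hypothetical DoFn: stream each (key, rows) group into its own table
    by calling the BigQuery client directly instead of nesting a Write."""

    def start_bundle(self):
        # Create one client per bundle rather than per element.
        self.client = bigquery.Client(project='PROJECT-NAME')

    def process(self, element):
        key, rows = element
        # Table chosen from the key; the 'analytics' dataset is assumed, and the
        # tables are assumed to already exist (streaming inserts do not create them).
        table_id = 'PROJECT-NAME.analytics.table_%s' % key
        errors = self.client.insert_rows_json(table_id, list(rows))
        if errors:
            raise RuntimeError('BigQuery insert errors: %s' % errors)
        yield element

The GroupByKey output would then go through beam.ParDo(WritePerKeyToBQ()) in place of the par_upload step, and the final beam.io.Write at the end of the pipeline can stay as it is.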