iTextSharp font width definitions not correctly loaded - pdf

I have a peculiar issue with a PDF I got from a customer, who used Ghostscript 9.19 to create a PDF/A compatible PDF (1.4). I want to extract text from it using the LocationTextExtractionStrategy, but I get most of the text in reversed order and with additional spaces:
expecting: abc
getting: c b a
in the content stream this might look like [(a) 1 (b) 1 (c)] TJ
This issue has been posted to Stack Overflow before; however, I tracked the issue down further rather than just hacking my own extraction strategy. I realized that the font widths for the font used are all zeroes, and neither the metrics nor the hMetrics maps are filled. I was using iTextSharp 5.5.4, but 5.5.9 still has the same issue.
This is the font:
/Subtype /Type0
/BaseFont /VGOQEA+SegoeUI
/Type /Font
/Encoding 33 0 R
/ToUnicode 48 0 R
/DescendantFonts [35 0 R]
object 33 is a CMap:
/CIDSystemInfo 32 0 R
/Filter /FlateDecode
/Length 266
/CMapName /OneByteIdentityH
/Type /CMap
object 48 is a Stream and incidentally also a CMap
object 35 finally has some width definitions
/DW 539
/CIDSystemInfo 32 0 R
/Subtype /CIDFontType2
/BaseFont /VGOQEA+SegoeUI
/FontDescriptor 26 0 R
/Type /Font
/W [0, [646], 32, [274], 40, [302, 302], 44, [217, 400, 217], 48, [539, 539, 539, 539], 53, [539, 539, 539, 539], 58, [217], 64, [955, 645, 573], 68, [701, 506], 72, [710, 266], 75, [580, 471, 898], 80, [560], 82, [598, 531], 85, [687], 87, [934], 90, [570], 97, [509, 588, 462, 589, 523, 313, 589, 566, 242], 107, [497, 242, 861, 566, 586, 588], 114, [348, 424, 339, 566], 119, [723], 122, [452], 220, [687], 228, [509], 246, [586], 252, [566]]
/CIDToGIDMap 47 0 R
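As an aside, the /W array above maps CID ranges to glyph widths. The following minimal sketch (Python, purely illustrative, not iTextSharp code; the function name is my own) shows how such an array expands into a per-CID width table:
# Illustration only: expand a CIDFont /W array into a per-CID width map.
# Entries of the form "c [w1 ... wn]" assign widths to CIDs c, c+1, ..., c+n-1;
# the "cfirst clast w" form is handled for completeness.
def expand_w_array(w):
    widths = {}
    i = 0
    while i < len(w):
        first = w[i]
        if isinstance(w[i + 1], list):
            for offset, width in enumerate(w[i + 1]):
                widths[first + offset] = width
            i += 2
        else:
            last, width = w[i + 1], w[i + 2]
            for cid in range(first, last + 1):
                widths[cid] = width
            i += 3
    return widths
# e.g. for the array above, widths[32] == 274 and widths[97] == 509,
# while any CID not listed falls back to /DW (539 here).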
When iTextSharp initializes the CMapAwareDocumentFont, the base DocumentFont is constructed first. Its constructor then checks:
if (PdfName.TYPE1.Equals(subType) || PdfName.TRUETYPE.Equals(subType))
it's neither.
else if (PdfName.TYPE3.Equals(subType))
it's not. Then the next else branch is entered:
PdfName encodingName = font.GetAsName(PdfName.ENCODING);
if (encodingName == null)
but since /Encoding here is an indirect reference to a CMap stream rather than a name, encodingName is null. Thus this Type0 font is never handled as a Type0 font.
I changed this bit in the library code, but even when the method ProcessType0(font) is called, the width values never make it into the font definition, i.e. they all remain zero. This is probably to be expected. I finally got the metrics hash map filled, but it didn't cover all the characters used, i.e. the situation got a bit better, but is still not right:
instead of: abc -> c b a
I now got: abc -> acb
My only working but hacky solution to the current issue was to patch processor.DisplayPdfString((PdfString)entryObj) to adjust the textMatrix by a fixed amount. However, this is no general solution, and I would rather that the font load correctly. Any other suggestions what I should try?
EDIT:
I revisited my issue using iText 7 for .NET, and now I get a NullReferenceException when trying to read text from my test file.
var reader = new iText.Kernel.Pdf.PdfReader(@"itextsharp_sample_locationtext_extraction_reversing_characters(1).pdf");
var extractionStrategy = new LocationTextExtractionStrategy();
var doc = new PdfDocument(reader);
var page = doc.GetFirstPage();
var test = PdfTextExtractor.GetTextFromPage(page, extractionStrategy);
Console.WriteLine(test);
This is the file I used: https://drive.google.com/file/d/0B1RdIg0_Pbd_aTlOT2VmbnFlaTQ/view?usp=sharing
The code works for other PDF files but not for this one.
UPDATE
Thanks to mkl, I have known for a while now that the underlying issue is font encoding streams. Sadly, I just got a message via the iText sales team that this is not on the roadmap. They claim this issue is a weird outlier and that hardly anyone else needs support for font encoding streams. So if any of you reading this do have an issue with the missing support, kindly let them know by email.

Related

Data Length Error when Merging PDFs with PyPDF2

I am starting a project that will take specific pages out of each PDF in a folder and merge those pages into a single file. When I run the code below, I get the error shown beneath it about the length of the encrypted data, and I don't know where I would need to address that.
from PyPDF2 import PdfFileMerger
import glob

files = glob.glob('C:/Users/Jake/Documents/UPLOAD/test_merge/*.pdf')
merger = PdfFileMerger()
for file in files:
    merger.append(file)
merger.write("merged.pdf")
merger.close()
ERROR
Traceback (most recent call last):
File "C:\Users\Jake\Documents\Work Projects\Python\Contract Merger\Merger .02", line 10, in <module>
merger.write("merged.pdf")
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_merger.py", line 312, in write
my_file, ret_fileobj = self.output.write(fileobj)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_writer.py", line 838, in write
self.write_stream(stream)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_writer.py", line 811, in write_stream
self._sweep_indirect_references(self._root)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_writer.py", line 960, in _sweep_indirect_references
data = self._resolve_indirect_object(data)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_writer.py", line 1005, in _resolve_indirect_object
real_obj = data.pdf.get_object(data)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_reader.py", line 1187, in get_object
retval = self._encryption.decrypt_object(
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_encryption.py", line 747, in decrypt_object
return cf.decrypt_object(obj)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_encryption.py", line 185, in decrypt_object
obj[dictkey] = self.decrypt_object(value)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_encryption.py", line 179, in decrypt_object
data = self.strCrypt.decrypt(obj.original_bytes)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_encryption.py", line 87, in decrypt
d = aes.decrypt(data)
File "C:\Users\Jake\Anaconda3\lib\site-packages\Crypto\Cipher\_mode_cbc.py", line 246, in decrypt
raise ValueError("Data must be padded to %d byte boundary in CBC mode" % self.block_size)
ValueError: Data must be padded to 16 byte boundary in CBC mode
[Finished in 393ms]
I wrote a basic program from a YouTube video and tried to run it, but I got an error that PyCryptodome was a dependency of PyPDF2. After installing that, I am getting an error about the data length during decryption when writing the PDF. Googling that error led me to this solution. I am a bit of a novice, and I don't really understand why any kind of encryption is being applied in the first place, other than what I assume is necessary for the PDF reader/writer to operate, so I don't know where I would need to apply that solution in this code.
After writing up this question, I was led to this solution; I tried the code below, but received the same error.
from PyPDF2 import PdfFileMerger, PdfFileReader
import glob

merger = PdfFileMerger()
files = glob.glob('C:/Users/Jake/Documents/UPLOAD/test_merge/*.pdf')
for filename in files:
    with open(filename, 'rb') as source:
        tmp = PdfFileReader(source)
        merger.append(tmp)
merger.write('Result.pdf')
ERROR
Traceback (most recent call last):
File "C:\Users\Jake\Documents\Work Projects\Python\Contract Merger\Merger .03.py", line 13, in <module>
merger.write('Result.pdf')
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_merger.py", line 312, in write
my_file, ret_fileobj = self.output.write(fileobj)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_writer.py", line 838, in write
self.write_stream(stream)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_writer.py", line 811, in write_stream
self._sweep_indirect_references(self._root)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_writer.py", line 960, in _sweep_indirect_references
data = self._resolve_indirect_object(data)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_writer.py", line 1005, in _resolve_indirect_object
real_obj = data.pdf.get_object(data)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_reader.py", line 1187, in get_object
retval = self._encryption.decrypt_object(
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_encryption.py", line 747, in decrypt_object
return cf.decrypt_object(obj)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_encryption.py", line 185, in decrypt_object
obj[dictkey] = self.decrypt_object(value)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_encryption.py", line 179, in decrypt_object
data = self.strCrypt.decrypt(obj.original_bytes)
File "C:\Users\Jake\Anaconda3\lib\site-packages\PyPDF2\_encryption.py", line 87, in decrypt
d = aes.decrypt(data)
File "C:\Users\Jake\Anaconda3\lib\site-packages\Crypto\Cipher\_mode_cbc.py", line 246, in decrypt
raise ValueError("Data must be padded to %d byte boundary in CBC mode" % self.block_size)
ValueError: Data must be padded to 16 byte boundary in CBC mode
[Finished in 268ms]
My thinking is that something else has gone wrong, but I am at a loss as to what that could be.
What have I done wrong with this build to get this error, and how can I correct it?
Turns out this is an issue with PyPDF2. There is a 3-line fix that can be injected to correct the error if you run into this before it is patched.
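For reference, here is a minimal sketch of the kind of monkey-patch meant above, based only on the "must be padded to 16 byte boundary" message in the traceback; the exact fix merged into PyPDF2 may differ, and the location of CryptAES.decrypt in PyPDF2._encryption is inferred from the stack trace.
# Hedged sketch only: pad AES-CBC ciphertext to a 16-byte boundary before
# PyPDF2 decrypts it. Apply this before calling merger.write(...).
from PyPDF2 import _encryption

_original_decrypt = _encryption.CryptAES.decrypt

def _padded_decrypt(self, data):
    # Some malformed PDFs contain encrypted strings whose length is not a
    # multiple of the AES block size; pad with zero bytes so CBC accepts them.
    if len(data) % 16 != 0:
        data = data + b"\x00" * (16 - len(data) % 16)
    return _original_decrypt(self, data)

_encryption.CryptAES.decrypt = _padded_decrypt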

TensorFlow Federated (TFF) TypeError in tff.templates.IterativeProcess.next() when clients_per_round exceeds 99

I implemented a custom federated learning GAN training loop with TFF similar to this code by Google Research.
The client data for a particular training round is found using the following code snippet:
def client_dataset_fn():
    # Sample clients and data
    sampled_clients = np.random.choice(train_data.client_ids, size=cfg.clients_per_round, replace=False)
    datasets = [(next(client_gen_inputs_iterator),
                 train_data.create_tf_dataset_for_client(client_id).take(cfg.n_critic))
                for client_id in sampled_clients]
    return datasets

client_noise_inputs, client_real_data = zip(*client_dataset_fn())
This works perfectly as long as cfg.clients_per_round is at most 99. When it is set to 100 or a larger value (with the total number of clients being larger, of course), I receive the following error:
Traceback (most recent call last):
File "main.py", line 109, in main
metrics = run_single_trial(train_data, test_data, cfg)
File "/mnt/workspace/tff/GAN/federated/fedgan_main.py", line 73, in run_single_trial
metrics = train_loop(iterative_process, server_dataset_fn, client_dataset_fn, model, eval_hook_fn, cfg)
File "/mnt/workspace/tff/GAN/federated/fedgan_main.py", line 124, in train_loop
client_real_data)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/core/impl/computation/function_utils.py", line 525, in __call__
return context.invoke(self, arg)
File "/usr/local/lib/python3.6/dist-packages/retrying.py", line 49, in wrapped_f
return Retrying(*dargs, **dkw).call(f, *args, **kw)
File "/usr/local/lib/python3.6/dist-packages/retrying.py", line 206, in call
return attempt.get(self._wrap_exception)
File "/usr/local/lib/python3.6/dist-packages/retrying.py", line 247, in get
six.reraise(self.value[0], self.value[1], self.value[2])
File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/retrying.py", line 200, in call
attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/core/impl/executors/execution_context.py", line 226, in invoke
_ingest(executor, unwrapped_arg, arg.type_signature)))
File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
return future.result()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/common_libs/tracing.py", line 396, in _wrapped
return await coro
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/core/impl/executors/execution_context.py", line 111, in _ingest
ingested = await asyncio.gather(*ingested)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/core/impl/executors/execution_context.py", line 116, in _ingest
return await executor.create_value(val, type_spec)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
result = await fn(*fn_args, **fn_kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 294, in create_value
value, type_spec))
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
result = await fn(*fn_args, **fn_kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/core/impl/executors/thread_delegating_executor.py", line 111, in create_value
self._target_executor.create_value(value, type_spec))
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/core/impl/executors/thread_delegating_executor.py", line 105, in _delegate
result_value = await _delegate_with_trace_ctx(coro, self._event_loop)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/common_libs/tracing.py", line 396, in _wrapped
return await coro
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
result = await fn(*fn_args, **fn_kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/core/impl/executors/federating_executor.py", line 394, in create_value
return await self._strategy.compute_federated_value(value, type_spec)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/core/impl/executors/federated_composing_strategy.py", line 279, in compute_federated_value
py_typecheck.check_type(value, list)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/common_libs/py_typecheck.py", line 41, in check_type
type_string(type_spec), type_string(type(target))))
TypeError: Expected list, found tuple.
During debugging, I looked at the target variable in the final line of the traceback and found it to be the above-mentioned client_real_data and client_noise_inputs. Their types are in fact tuples, not lists; however, this does not change with different values of cfg.clients_per_round. The only usage of cfg.clients_per_round is shown above in the random choice.
I really cannot explain why this is happening; maybe somebody out there has experienced something similar and can help me out.
My used package versions are as follows:
Python 3.6.9 or 3.8.10 (checked both)
tensorflow 2.5.1
tensorflow-federated 0.19.0
retrying 1.3.3
six 1.15.0
As a workaround, I now manually convert client_noise_inputs and client_real_data using list(tuple_var), but I am still curious as to why a list is required.
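For completeness, a minimal sketch of that workaround (the variable names follow the snippet above; nothing here is TFF-specific):
# Coerce the zipped tuples to lists before passing them into
# iterative_process.next(...), which is what resolves the TypeError.
client_noise_inputs, client_real_data = zip(*client_dataset_fn())
client_noise_inputs = list(client_noise_inputs)
client_real_data = list(client_real_data)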
(Copying and pasting from original on GitHub)
This seems to me to be an implementation distinction between the federated_composing_strategy and the federated_resolving_strategy. IIRC, by default we don't inject a composing executor into your stack until you hit 100 clients--which would be the source of this exciting mystery.
In particular, the composing strategy is programmed against the assumption that the incoming clients-placed value is represented as a list, whereas the resolving strategy codes against a much more flexible set of containers.
It's not wild to coerce your clients-placed value to a list--we also could extend the permitted representation of clients-placed values in the composing executor to match that in the resolving one, possibly pulling the appropriate logic to a shared place like here. I think it's a contribution we'd be very happy to accept if you're up for it!

How to convert Jupyter notebook to PDF via Latex?

I am trying to convert a Jupyter notebook to pdf via Latex using nbconvert, in order to automatically include citations to articles contained in a separate .bib file. I worked according to the tutorial/example here. Such a tutorial was suggested in the nbconvert documentation, here.
I have the following files in the same directory in which I am running the Jupyter notebook:
citations.tplx (the template to be used to include the bibliography)
references.bib (a .bib file containing the citations, taken from Google Scholar)
Inside the markdown cells, I use the following syntax to cite a work:
<cite data-cite="cortez2009modeling">(Cortez, 2009)</cite>
where such a work in the .bib file is reported as follows:
@article{cortez2009modeling,
  title={Modeling wine preferences by data mining from physicochemical properties},
  author={Cortez, Paulo and Cerdeira, Ant{\'o}nio and Almeida, Fernando and Matos, Telmo and Reis, Jos{\'e}},
  journal={Decision support systems},
  volume={47},
  number={4},
  pages={547--553},
  year={2009},
  publisher={Elsevier}
}
In a new notebook, also saved in the same location, I run the following command, also taken from the tutorial mentioned above:
%%bash
jupyter nbconvert --to latex --template citations.tplx --post pdf my_notebook.ipynb
I get a very long output, full of warnings, but basically, the error is:
ModuleNotFoundError: No module named 'pdf'
I also tried to do this according to other tutorials on the web, but even when the PDF file was indeed generated (using a slightly different nbconvert command), my citations were not captured in the text (a question mark appeared instead), and there was no bibliography at the end of the document. A warning would say there were 'problems' with BibTeX, but nothing more.
In the following, I report the complete output of the command I wrote above:
Traceback (most recent call last):
File "/opt/anaconda3/bin/jupyter-nbconvert", line 11, in <module>
sys.exit(main())
File "/opt/anaconda3/lib/python3.8/site-packages/jupyter_core/application.py", line 254, in launch_instance
return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
File "/opt/anaconda3/lib/python3.8/site-packages/traitlets/config/application.py", line 844, in launch_instance
app.initialize(argv)
File "/opt/anaconda3/lib/python3.8/site-packages/traitlets/config/application.py", line 87, in inner
return method(app, *args, **kwargs)
File "/opt/anaconda3/lib/python3.8/site-packages/nbconvert/nbconvertapp.py", line 290, in initialize
super().initialize(argv)
File "/opt/anaconda3/lib/python3.8/site-packages/traitlets/config/application.py", line 87, in inner
return method(app, *args, **kwargs)
File "/opt/anaconda3/lib/python3.8/site-packages/jupyter_core/application.py", line 225, in initialize
self.parse_command_line(argv)
File "/opt/anaconda3/lib/python3.8/site-packages/traitlets/config/application.py", line 87, in inner
return method(app, *args, **kwargs)
File "/opt/anaconda3/lib/python3.8/site-packages/traitlets/config/application.py", line 713, in parse_command_line
self.update_config(self.cli_config)
File "/opt/anaconda3/lib/python3.8/site-packages/traitlets/config/configurable.py", line 220, in update_config
self._load_config(config)
File "/opt/anaconda3/lib/python3.8/site-packages/traitlets/config/configurable.py", line 190, in _load_config
warn(msg)
File "/opt/anaconda3/lib/python3.8/contextlib.py", line 120, in __exit__
next(self.gen)
File "/opt/anaconda3/lib/python3.8/site-packages/traitlets/traitlets.py", line 1214, in hold_trait_notifications
self.notify_change(change)
File "/opt/anaconda3/lib/python3.8/site-packages/traitlets/traitlets.py", line 1227, in notify_change
return self._notify_observers(change)
File "/opt/anaconda3/lib/python3.8/site-packages/traitlets/traitlets.py", line 1264, in _notify_observers
c(event)
File "/opt/anaconda3/lib/python3.8/site-packages/nbconvert/nbconvertapp.py", line 265, in _postprocessor_class_changed
self.postprocessor_factory = import_item(new)
File "/opt/anaconda3/lib/python3.8/site-packages/traitlets/utils/importstring.py", line 38, in import_item
return __import__(parts[0])
ModuleNotFoundError: No module named 'pdf'
---------------------------------------------------------------------------
CalledProcessError Traceback (most recent call last)
<ipython-input-22-c5829f9d50d0> in <module>
----> 1 get_ipython().run_cell_magic('bash', '', 'jupyter nbconvert --to latex --template citations.tplx --post pdf Orlando_Taddeo_CW.ipynb\n')
/opt/anaconda3/envs/tf/lib/python3.7/site-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
2397 with self.builtin_trap:
2398 args = (magic_arg_s, cell)
-> 2399 result = fn(*args, **kwargs)
2400 return result
2401
/opt/anaconda3/envs/tf/lib/python3.7/site-packages/IPython/core/magics/script.py in named_script_magic(line, cell)
140 else:
141 line = script
--> 142 return self.shebang(line, cell)
143
144 # write a basic docstring:
/opt/anaconda3/envs/tf/lib/python3.7/site-packages/decorator.py in fun(*args, **kw)
230 if not kwsyntax:
231 args, kw = fix(args, kw, sig)
--> 232 return caller(func, *(extras + args), **kw)
233 fun.__name__ = func.__name__
234 fun.__doc__ = func.__doc__
/opt/anaconda3/envs/tf/lib/python3.7/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
185 # but it's overkill for just that one bit of state.
186 def magic_deco(arg):
--> 187 call = lambda f, *a, **k: f(*a, **k)
188
189 if callable(arg):
/opt/anaconda3/envs/tf/lib/python3.7/site-packages/IPython/core/magics/script.py in shebang(self, line, cell)
243 sys.stderr.flush()
244 if args.raise_error and p.returncode!=0:
--> 245 raise CalledProcessError(p.returncode, cell, output=out, stderr=err)
246
247 def _run_script(self, p, cell, to_close):
CalledProcessError: Command 'b'jupyter nbconvert --to latex --template citations.tplx --post pdf Orlando_Taddeo_CW.ipynb\n'' returned non-zero exit status 1.
Could anyone please shed some light on this? I really can't figure out why it does not work.
Thank you very much in advance.

tensorflow 2.0 keras save model to hdfs: Can't decrement id ref count

I have mounted an HDFS drive via hdfs-fuse, so I can access HDFS under the path /hdfs/xxx.
After training a model with Keras, I want to save it to /hdfs/model.h5 via model.save("/hdfs/model.h5").
I get the following error:
2020-02-26T10:06:51.83869705Z File "h5py/_objects.pyx", line 193, in h5py._objects.ObjectID.__dealloc__
2020-02-26T10:06:51.838791107Z RuntimeError: Can't decrement id ref count (file write failed: time = Wed Feb 26 10:06:51 2020
2020-02-26T10:06:51.838796288Z , filename = '/hdfs/model.h5', file descriptor = 3, errno = 95, error message = 'Operation not supported', buf = 0x7f20d000ddc8, total write size = 512, bytes this sub-write = 512, bytes actually written = 18446744073709551615, offset = 298264)
2020-02-26T10:06:51.838802442Z Exception ignored in: 'h5py._objects.ObjectID.__dealloc__'
2020-02-26T10:06:51.838807122Z Traceback (most recent call last):
2020-02-26T10:06:51.838811833Z File "h5py/_objects.pyx", line 193, in h5py._objects.ObjectID.__dealloc__
2020-02-26T10:06:51.838816793Z RuntimeError: Can't decrement id ref count (file write failed: time = Wed Feb 26 10:06:51 2020
2020-02-26T10:06:51.838821942Z , filename = '/hdfs/model.h5', file descriptor = 3, errno = 95, error message = 'Operation not supported', buf = 0x7f20d000ddc8, total write size = 512, bytes this sub-write = 512, bytes actually written = 18446744073709551615, offset = 298264)
2020-02-26T10:06:51.838827917Z Traceback (most recent call last):
2020-02-26T10:06:51.838832755Z File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/saving/hdf5_format.py", line 117, in save_model_to_hdf5
2020-02-26T10:06:51.838838098Z f.flush()
2020-02-26T10:06:51.83885453Z File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py", line 452, in flush
2020-02-26T10:06:51.838859816Z h5f.flush(self.id)
2020-02-26T10:06:51.838864401Z File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
2020-02-26T10:06:51.838869302Z File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
2020-02-26T10:06:51.838874126Z File "h5py/h5f.pyx", line 146, in h5py.h5f.flush
2020-02-26T10:06:51.838879016Z RuntimeError: Can't flush cache (file write failed: time = Wed Feb 26 10:06:51 2020
2020-02-26T10:06:51.838885827Z , filename = '/hdfs/model.h5', file descriptor = 3, errno = 95, error message = 'Operation not supported', buf = 0x4e5b018, total write size = 4, bytes this sub-write = 4, bytes actually written = 18446744073709551615, offset = 34552)
But I can directly write a file to the same path with:
with open("/hdfs/a.txt", "w") as f:
    f.write("1")
Also, I've figured out a tricky workaround, and it worked...
from shutil import move

model.save("temp.h5")
move("temp.h5", "/hdfs/model.h5")
So maybe the problem is with the Keras API? It can save the model locally but cannot save it to an HDFS path.
Any idea how to fix the problem?
I don't think TensorFlow makes any promises about being able to save to hdfs-fuse. Your (final) error is "Can't flush cache", not "Can't decrement id ref count", which basically means "can't save straight to hdfs-fuse". But, to be honest, this seems solved to me: your workaround is fine.
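Since the answer endorses the save-locally-then-move approach, here is a minimal sketch of it as a reusable helper; the function name and the use of tempfile are my own additions, not part of the original workaround.
import os
import tempfile
from shutil import move

def save_model_to_hdfs(model, hdfs_path):
    # Save to a plain local file first (which h5py can flush normally),
    # then move the finished file onto the hdfs-fuse mount in one step.
    fd, tmp_path = tempfile.mkstemp(suffix=".h5")
    os.close(fd)
    try:
        model.save(tmp_path)
        move(tmp_path, hdfs_path)
    finally:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)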

Writing to BigQuery from within a ParDo function

I would like to call a beam.io.Write(beam.io.BigQuerySink(..)) operation from within a ParDo function to generate a separate BigQuery table for each key in the PCollection (I'm using the Python SDK). Here are two similar threads, which unfortunately didn't help:
1) https://stackoverflow.com/questions/31156774/about-key-grouping-with-groupbykey
2) Dynamic table name when writing to BQ from dataflow pipelines
When I execute the following code, the rows for the first key get inserted into BigQuery, and then the pipeline fails with the error below. I would really appreciate any suggestions on what I'm doing wrong or how to fix it.
Pipeline code:
rows = p | 'read_bq_table' >> beam.io.Read(beam.io.BigQuerySource(query=query))
class par_upload(beam.DoFn):
    def process(self, context):
        key, value = context.element
        ### This block causes issues ###
        value | 'write_to_bq' >> beam.io.Write(
            beam.io.BigQuerySink(
                'PROJECT-NAME:analytics.first_table',  # will be replaced by a dynamic name based on key
                schema=schema,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
            )
        )
        ### End block ######
        return [value]
### Following part works fine ###
filtered = (rows | 'filter_rows' >> beam.Filter(lambda row: row['topic'] == 'analytics')
                 | 'apply_projection' >> beam.Map(apply_projection, projection_fields)
                 | 'group_by_key' >> beam.GroupByKey()
                 | 'par_upload_to_bigquery' >> beam.ParDo(par_upload())
                 | 'flat_map' >> beam.FlatMap(lambda l: l)  # this step is just for testing
           )
### This part works fine if I comment out the 'write_to_bq' block above
filtered | 'write_to_bq' >> beam.io.Write(
    beam.io.BigQuerySink(
        'PROJECT-NAME:analytics.another_table',
        schema=schema,
        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
)
Error message:
INFO:oauth2client.client:Attempting refresh to obtain initial access_token
INFO:oauth2client.client:Attempting refresh to obtain initial access_token
INFO:root:Writing 1 rows to PROJECT-NAME:analytics.first_table table.
INFO:root:Final: Debug counters: {'element_counts': Counter({'CreatePInput0': 1, 'write_to_bq/native_write': 1})}
ERROR:root:Error while visiting par_upload_to_bigquery
Traceback (most recent call last):
File "split_events.py", line 137, in <module>
run()
File "split_events.py", line 132, in run
p.run()
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/pipeline.py", line 159, in run
return self.runner.run(self)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/runners/direct_runner.py", line 102, in run
super(DirectPipelineRunner, self).run(pipeline)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 98, in run
pipeline.visit(RunVisitor(self))
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/pipeline.py", line 182, in visit
self._root_transform().visit(visitor, self, visited)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/pipeline.py", line 419, in visit
part.visit(visitor, pipeline, visited)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/pipeline.py", line 422, in visit
visitor.visit_transform(self)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 93, in visit_transform
self.runner.run_transform(transform_node)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 168, in run_transform
return m(transform_node)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/runners/direct_runner.py", line 98, in func_wrapper
func(self, pvalue, *args, **kwargs)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/runners/direct_runner.py", line 180, in run_ParDo
runner.process(v)
File "apache_beam/runners/common.py", line 133, in apache_beam.runners.common.DoFnRunner.process (apache_beam/runners/common.c:4483)
File "apache_beam/runners/common.py", line 139, in apache_beam.runners.common.DoFnRunner.process (apache_beam/runners/common.c:4311)
File "apache_beam/runners/common.py", line 150, in apache_beam.runners.common.DoFnRunner.reraise_augmented (apache_beam/runners/common.c:4677)
File "apache_beam/runners/common.py", line 137, in apache_beam.runners.common.DoFnRunner.process (apache_beam/runners/common.c:4245)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/typehints/typecheck.py", line 149, in process
return self.run(self.dofn.process, context, args, kwargs)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/typehints/typecheck.py", line 134, in run
result = method(context, *args, **kwargs)
File "split_events.py", line 73, in process
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 724, in __ror__
return self.transform.__ror__(pvalueish, self.label)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 445, in __ror__
return _MaterializePValues(cache).visit(result)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 105, in visit
return self._pvalue_cache.get_unwindowed_pvalue(node)
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 262, in get_unwindowed_pvalue
return [v.value for v in self.get_pvalue(pvalue)]
File "/Users/dimitri/anaconda/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 244, in get_pvalue
value_with_refcount = self._cache[self.key(pvalue)]
KeyError: "(4384177040, None) [while running 'par_upload_to_bigquery']"
Edit (after the first answer):
I didn't realise my value needs to be a PCollection.
I've changed my code to this now (which is probably very inefficient):
key_pipe = p | 'pipe_' + key >> beam.Create(value)
key_pipe | 'write_' + key >> beam.io.Write(beam.io.BigQuerySink(..))
This now works fine locally, but not with BlockingDataflowPipelineRunner :-(
The pipeline fails with the following error:
JOB_MESSAGE_ERROR: (979394c29490e588): Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 474, in do_work
work_executor.execute()
File "dataflow_worker/executor.py", line 901, in dataflow_worker.executor.MapTaskExecutor.execute (dataflow_worker/executor.c:24331)
op.start()
File "dataflow_worker/executor.py", line 465, in dataflow_worker.executor.DoOperation.start (dataflow_worker/executor.c:14193)
def start(self):
File "dataflow_worker/executor.py", line 469, in dataflow_worker.executor.DoOperation.start (dataflow_worker/executor.c:13499)
fn, args, kwargs, tags_and_types, window_fn = (
ValueError: too many values to unpack (expected 5)
In the similar threads, the only suggestion for doing BigQuery write operations in a ParDo was to use the BigQuery API directly, or to use a client library.
The code that you wrote puts a Beam write transform (beam.io.Write(beam.io.BigQuerySink(...))) inside a DoFn. Such a transform expects to be applied to a PCollection, like filtered in the working code example, which is not the case for value inside the non-functioning code.
I think the easiest option would be to take a look at the gcloud-python BigQuery function insert_data() and put this inside your ParDo.
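A rough sketch of that suggestion follows. It uses the old gcloud-python client that exposed insert_data(); the project/dataset names follow the snippets above, while the per-key table naming and the exact client calls are assumptions on my part, so treat this as an outline under those assumptions rather than the definitive implementation.
import apache_beam as beam
from gcloud import bigquery  # old gcloud-python client that provided insert_data()

class par_upload(beam.DoFn):
    def process(self, context):
        key, rows = context.element
        # Call BigQuery directly instead of nesting a Beam write transform in the DoFn.
        client = bigquery.Client(project='PROJECT-NAME')
        table = client.dataset('analytics').table('table_%s' % key)  # one table per key (assumed naming)
        table.reload()                    # load the existing schema; the table must already exist
        errors = table.insert_data(rows)  # rows must match the table schema
        if errors:
            raise RuntimeError('BigQuery insert failed: %s' % errors)
        return [rows]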