How can I get pdfplumber to recognise paragraphs in a cell?

I am using pdfplumber for table extraction. When extracting a cell that contains multiple paragraphs, pdfplumber treats each paragraph as a separate row, even though they belong to one cell and should be returned as a single row. I tried some table_settings in the extract_table function, but could not change the output.
The code I used:
import pdfplumber

with pdfplumber.open(PDFPfad) as pdf:
    Seite = pdf.pages[4]
    Tabelle = Seite.extract_table()
    print(Tabelle)
current output:
Tabelle = [
    ['Inhaltsstoff', 'CAS-Nr.', 'Wert', '', 'Zu', '', 'Grundlage'],
    [None, None, None, None, 'überwachende', None, None],
    [None, None, None, None, 'Parameter', None, None],
    ...
]
desired output:
Tabelle = [
    ['Inhaltsstoff', 'CAS-Nr.', 'Wert', '', 'Zu \nüberwachende \nParameter', '', 'Grundlage'],
    ...
]
I don't know which settings in extract_table(table_settings={...}) can lead to my desired output. I would be happy if you could help me.
Table example: https://i.stack.imgur.com/9oERz.png
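One setting combination worth trying (a sketch only; it assumes the table's cell borders are actually drawn as ruled lines in the PDF): with horizontal_strategy set to "lines", row boundaries come only from ruled lines, so the line breaks inside a cell no longer produce extra rows.

import pdfplumber

with pdfplumber.open(PDFPfad) as pdf:
    Seite = pdf.pages[4]
    # split rows and columns only at ruled lines, not at gaps in the text
    Tabelle = Seite.extract_table(table_settings={
        "vertical_strategy": "lines",
        "horizontal_strategy": "lines",
    })
    print(Tabelle)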

Related

How to print a FlatMapDataset?

For debugging, I would like to print a tensorflow FlatMapDataset.
When trying to use the print method of a tf.data.Dataset, I get the error: AttributeError: 'FlatMapDataset' object has no attribute 'print'.
What I expected was some kind of printout that would let me assess whether the content of the dataset is what I intended.
Apparently a FlatMapDataset does not have that method; its attributes are:
['_GeneratorState', '__abstractmethods__', '__bool__', '__class__', '__class_getitem__', '__debug_string__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__tf_tracing_type__', '__weakref__', '_abc_impl', '_add_trackable_child', '_add_variable_with_custom_getter', '_apply_debug_options', '_as_serialized_graph', '_checkpoint_dependencies', '_common_args', '_consumers', '_convert_variables_to_tensors', '_deferred_dependencies', '_deserialization_dependencies', '_deserialize_from_proto', '_export_to_saved_model_graph', '_flat_shapes', '_flat_structure', '_flat_types', '_functions', '_gather_saveables_for_checkpoint', '_graph', '_graph_attr', '_handle_deferred_dependencies', '_input_dataset', '_inputs', '_lookup_dependency', '_map_func', '_map_resources', '_maybe_initialize_trackable', '_maybe_track_assets', '_metadata', '_name', '_name_based_attribute_restore', '_name_based_restores', '_no_dependency', '_object_identifier', '_options', '_options_attr', '_options_tensor_to_options', '_preload_simple_restoration', '_restore_from_tensors', '_serialize_to_proto', '_serialize_to_tensors', '_setattr_tracking', '_shape_invariant_to_type_spec', '_structure', '_tf_api_names', '_tf_api_names_v1', '_trace_variant_creation', '_track_trackable', '_trackable_children', '_transformation_name', '_type_spec', '_unconditional_checkpoint_dependencies', '_unconditional_dependency_names', '_update_uid', '_variant_tensor', '_variant_tensor_attr', 'apply', 'as_numpy_iterator', 'batch', 'bucket_by_sequence_length', 'cache', 'cardinality', 'choose_from_datasets', 'concatenate', 'element_spec', 'enumerate', 'filter', 'flat_map', 'from_generator', 'from_tensor_slices', 'from_tensors', 'get_single_element', 'group_by_window', 'interleave', 'list_files', 'load', 'map', 'options', 'padded_batch', 'prefetch', 'random', 'range', 'reduce', 'rejection_resample', 'repeat', 'sample_from_datasets', 'save', 'scan', 'shard', 'shuffle', 'skip', 'snapshot', 'take', 'take_while', 'unbatch', 'unique', 'window', 'with_options', 'zip']
How can I print a FlatMapDataset in some convenient way to review its content?
You can print the FlatMapDataset by iterating over the dataset.
For example:
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices([[[1, 2, 3], [3, 4, 5]]])
dataset = dataset.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(x**2))

for i in dataset:
    print(i)
Output of the above code:
tf.Tensor([1 4 9], shape=(3,), dtype=int32)
tf.Tensor([ 9 16 25], shape=(3,), dtype=int32)
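If you prefer plain NumPy values instead of tf.Tensor objects, here is a minimal sketch using as_numpy_iterator (available in TF 2.x):

# prints NumPy arrays instead of tf.Tensor reprs
for value in dataset.as_numpy_iterator():
    print(value)

# or materialise everything at once (fine for small datasets)
print(list(dataset.as_numpy_iterator()))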
Thank You.

Read csv with multiline text columns by dask

I have to read a CSV that contains full-text data, which can span multiple lines. I am able to read this CSV with pure pandas (tested on versions 0.25.3 and 1.0.3) without any problems, but when I try to read it with dask I receive ParserError: Error tokenizing data. C error: EOF inside string starting at row 28. The row number depends on the file I try to read.
I prepared an artificial dataframe to reproduce this error. Can I tune something in the dask parameters or preprocess the input file, or is this a dask implementation issue?
multiplication_factor = 71  # 70 works fine, 71 fails
number_of_columns = 100

import pandas as pd
import dask.dataframe as dd
import textwrap

pandas_default_kwargs = {
    'cache_dates': True,
    # 'chunksize': None,  # not supported by dask
    'comment': None,
    # 'compression': 'infer',  # not supported by dask
    'converters': None,
    'date_parser': None,
    'dayfirst': False,
    'decimal': b'.',
    'delim_whitespace': False,
    'delimiter': None,
    'dialect': None,
    'doublequote': True,
    'dtype': object,
    'encoding': None,
    'engine': None,
    'error_bad_lines': True,
    'escapechar': None,
    'false_values': None,
    'float_precision': None,
    'header': 'infer',
    # 'index_col': None,  # not supported by dask
    'infer_datetime_format': False,
    # 'iterator': False,  # not supported by dask
    'keep_date_col': False,
    'keep_default_na': True,
    'lineterminator': None,
    'low_memory': True,
    'mangle_dupe_cols': True,
    'memory_map': False,
    'na_filter': True,
    'na_values': None,
    'names': None,
    'nrows': None,
    'parse_dates': False,
    'prefix': None,
    'quotechar': '"',
    'quoting': 0,
    'sep': ',',
    'skip_blank_lines': True,
    'skipfooter': 0,
    'skipinitialspace': False,
    'skiprows': None,
    'squeeze': False,
    'thousands': None,
    'true_values': None,
    'usecols': None,
    'verbose': False,
    'warn_bad_lines': True,
}

artificial_df_1_row = pd.DataFrame(
    data=[
        (
            textwrap.dedent(
                f"""
                some_data_for
                column_number_{i}
                """
            )
            for i
            in range(number_of_columns)
        )
    ],
    columns=[f'column_name_number_{i}' for i in range(number_of_columns)]
)

path_to_single_line_csv = './single_line.csv'
path_to_multi_line_csv = './multi_line.csv'

# prepare data to save
single_line_df = artificial_df_1_row
multi_line_df = pd.concat(
    [single_line_df] * multiplication_factor,
)

# save data
single_line_df.to_csv(path_to_single_line_csv, index=False)
multi_line_df.to_csv(path_to_multi_line_csv, index=False)

# read 1-row csv with dask - works
dask_single_line_df = dd.read_csv(
    path_to_single_line_csv,
    blocksize=None,  # read as single block
    **pandas_default_kwargs
)
dask_single_line_df_count = dask_single_line_df.shape[0].compute()
print('[DASK] single line count', dask_single_line_df_count)

# read multiline csv with pandas - works
pandas_multi_line_df = pd.read_csv(
    path_to_multi_line_csv,
    **pandas_default_kwargs
)
pandas_multi_line_df_shape_0 = pandas_multi_line_df.shape[0]
print('[PANDAS] multi line count', pandas_multi_line_df_shape_0)

# read multiline csv with dask - fails or not depending on the number of rows
dask_multi_line_df = dd.read_csv(
    path_to_multi_line_csv,
    blocksize=None,  # read as single block
    **pandas_default_kwargs
)
dask_multi_line_df_shape_0 = dask_multi_line_df.shape[0].compute()
print('[DASK] multi line count', dask_multi_line_df_shape_0)
The only way you can read such a file is to ensure that the chunk boundaries do not fall inside a quoted string, which, unless you know a lot about the data layout, means not chunking the file at all (but you can still parallelise between files).
This is because the only way to know whether you are inside a quoted string is to parse the file from the start, whereas dask achieves parallelism by making each chunk-reading task completely independent, needing only a file offset. In practice, dask reads from the offset and treats the first newline marker it finds as the point to start parsing from.
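A minimal sketch of the per-file approach, assuming the data is split across several files (the glob pattern below is hypothetical): with blocksize=None each file becomes exactly one partition, so a quoted newline can never straddle a chunk boundary, while dask still parallelises across the files.

import dask.dataframe as dd

# hypothetical glob: one partition per file, no intra-file chunking,
# so multiline quoted fields are always parsed by a single task
df = dd.read_csv(
    './data/part-*.csv',
    blocksize=None,
    dtype=object,
)
print(df.shape[0].compute())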

Performing inference with a BERT (TF 1.x) saved model

I'm stuck on one line of code and have been stalled on a project all weekend as a result.
I am working on a project that uses BERT for sentence classification. I have successfully trained the model, and I can test the results using the example code from run_classifier.py.
I can export the model using this example code (which has been reposted repeatedly, so I believe that it's right for this model):
def export(self):
    def serving_input_fn():
        label_ids = tf.placeholder(tf.int32, [None], name='label_ids')
        input_ids = tf.placeholder(tf.int32, [None, self.max_seq_length], name='input_ids')
        input_mask = tf.placeholder(tf.int32, [None, self.max_seq_length], name='input_mask')
        segment_ids = tf.placeholder(tf.int32, [None, self.max_seq_length], name='segment_ids')
        input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn({
            'label_ids': label_ids, 'input_ids': input_ids,
            'input_mask': input_mask, 'segment_ids': segment_ids})()
        return input_fn
    self.estimator._export_to_tpu = False
    self.estimator.export_savedmodel(self.output_dir, serving_input_fn)
I can also load the exported estimator (where the export function saves the exported model into a subdirectory labeled with a timestamp):
predict_fn = predictor.from_saved_model(self.output_dir + timestamp_number)
However, for the life of me, I cannot figure out what to provide to predict_fn as input for inference. Here is my best code at the moment:
def predict(self):
    input = 'Test input'
    guid = 'predict-0'
    text_a = tokenization.convert_to_unicode(input)
    label = self.label_list[0]
    examples = [InputExample(guid=guid, text_a=text_a, text_b=None, label=label)]
    features = convert_examples_to_features(examples, self.label_list,
                                            self.max_seq_length, self.tokenizer)
    predict_input_fn = input_fn_builder(features, self.max_seq_length, False)
    predict_fn = predictor.from_saved_model(self.output_dir + timestamp_number)
    result = predict_fn(predict_input_fn)  # this generates an error
    print(result)
It doesn't seem to matter what I provide to predict_fn: the examples array, the features array, or the predict_input_fn function. Clearly predict_fn wants a dictionary of some kind, but every single thing that I've tried raises an exception due to a tensor mismatch or other errors that generally mean: bad input.
I presumed that the from_saved_model function wants the same sort of input as the model test function; apparently, that's not the case.
It seems that lots of people have asked this very question - "how do I use an exported BERT TensorFlow model for inference?" - and have gotten no answers:
Thread #1
Thread #2
Thread #3
Thread #4
Any help? Thanks in advance.
Thank you for this post. Your serving_input_fn was the piece I was missing! Your predict function needs to be changed to feed the features dict directly, rather than using the predict_input_fn:
def predict(sentences):
    labels = [0, 1]
    input_examples = [
        run_classifier.InputExample(
            guid="",      # "" is just a dummy guid
            text_a=x,
            text_b=None,
            label=0       # 0 is just a dummy label
        ) for x in sentences]
    input_features = run_classifier.convert_examples_to_features(
        input_examples, labels, MAX_SEQ_LEN, tokenizer
    )
    # this is where pred_input_fn is replaced
    all_input_ids = []
    all_input_mask = []
    all_segment_ids = []
    all_label_ids = []
    for feature in input_features:
        all_input_ids.append(feature.input_ids)
        all_input_mask.append(feature.input_mask)
        all_segment_ids.append(feature.segment_ids)
        all_label_ids.append(feature.label_id)
    pred_dict = {
        'input_ids': all_input_ids,
        'input_mask': all_input_mask,
        'segment_ids': all_segment_ids,
        'label_ids': all_label_ids
    }
    predict_fn = predictor.from_saved_model('../testing/1589418540')
    result = predict_fn(pred_dict)
    print(result)

pred_sentences = [
    "That movie was absolutely awful",
    "The acting was a bit lacking",
    "The film was creative and surprising",
    "Absolutely fantastic!",
]

predict(pred_sentences)
{'probabilities': array([[-0.3579178 , -1.2010787 ],
[-0.36648935, -1.1814401 ],
[-0.30407643, -1.3386648 ],
[-0.45970002, -0.9982413 ],
[-0.36113673, -1.1936386 ],
[-0.36672896, -1.1808994 ]], dtype=float32), 'labels': array([0, 0, 0, 0, 0, 0])}
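As a side note (an inference from the negative values, not something the SavedModel guarantees): the array under 'probabilities' looks like log-softmax output, in which case np.exp recovers ordinary probabilities.

import numpy as np

# exponentiate log-softmax scores; each row should then sum to ~1.0
log_probs = np.array([[-0.3579178, -1.2010787],
                      [-0.36648935, -1.1814401]])
probs = np.exp(log_probs)
print(probs)
print(probs.sum(axis=1))  # ~[1.0, 1.0]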
However, the probabilities returned for sentences in pred_sentences do not match the probabilities I get using estimator.predict(predict_input_fn), where estimator is the fine-tuned model used within the same (Python) session. For example, [-0.27276006, -1.4324446] using estimator vs. [-0.26713806, -1.4505868] using predictor.

What is a good way to check whether a value exists in a tensor in TensorFlow (batch version)?

Values tensor: [[1,2,3,4,5], [6,7,8,9,10], [11,12,13,14,15]]
Query tensor:  [[1,2,8], [0,0,6], [11,12,13]]
Result tensor: [[True, True, False], [False, False, True], [True, True, True]]
Given a values tensor and a query tensor, I want to check element by element whether each query value exists in the corresponding row of the values tensor, and return the result tensor above.
Is there a vector-based way to do this (rather than using tf.while_loop)?
Update: I think tf.sets.set_intersection may be useful, as follows.
import tensorflow as tf

a = tf.constant([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]])
b = tf.constant([[1, 2, 8], [0, 0, 6], [11, 12, 13]])

res = tf.sets.set_intersection(a, b)
res2 = tf.sparse_tensor_to_dense(res, default_value=-1)

with tf.Session() as sess:
    print(sess.run(res2))
[[ 1 2 -1]
[ 6 -1 -1]
[11 12 13]]
You can achieve this by subtracting every element of b from every element of a and then finding the indices of the zeros:
find_match = tf.reduce_prod(tf.transpose(a)[..., None] - tf.abs(b[None, ...]), 0)
find_idx = tf.equal(find_match, tf.zeros_like(find_match))

with tf.Session() as sess:
    print(sess.run(find_idx))
#[[ True True False]
# [False False True]
# [ True True True]]
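An alternative sketch, reusing a and b from above, that avoids the product trick (which can overflow for large values) by broadcasting an equality comparison and reducing with any:

# shapes: b[:, :, None] is (3, 3, 1), a[:, None, :] is (3, 1, 5);
# equality broadcasts to (3, 3, 5), then reduce over the last axis
result = tf.reduce_any(tf.equal(b[:, :, None], a[:, None, :]), axis=-1)

with tf.Session() as sess:
    print(sess.run(result))
# [[ True  True False]
#  [False False  True]
#  [ True  True  True]]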

Trouble setting up the SimpleVector encoder

I'm using the commits from breznak for the encoders (I wasn't able to figure out "git checkout ..." with GitHub, so I just carefully copied over the three files: base.py, multi.py, and multi_test.py).
I ran multi_test.py without any problems.
Then I adjusted my model parameters (MODEL_PARAMS), so that the encoders portion of 'sensorParams' looks like this:
'encoders': {
    'frequency': {
        'fieldname': u'frequency',
        'type': 'SimpleVector',
        'length': 5,
        'minVal': 0,
        'maxVal': 210
    }
},
I also adjusted the modelInput portion of my code, so it looked like this:
model = ModelFactory.create(model_params.MODEL_PARAMS)
model.enableInference({'predictedField': 'frequency'})
y = [1,2,3,4,5]
modelInput = {"frequency": y}
result = model.run(modelInput)
But I get the following error, regardless of whether I instantiate 'y' as a list or a numpy.ndarray:
File "nta/eng/lib/python2.7/site-packages/nupic/encoders/base.py", line 183, in _getInputValue
return getattr(obj, fieldname)
AttributeError: 'list' object has no attribute 'idx0'
I also tried initializing a SimpleVector encoder inline with my modelInput, directly encoding my array and then passing it through modelInput. That violated the input parameters of my SimpleVector, because I was then double-encoding. So I removed the encoders portion of my model parameters dictionary, which caused another error, because some part of my model was looking for that portion of the dictionary.
Any suggestions on what I should do next?
Edit: Here are the files I'm using with the OPF.
sendAnArray.py
import numpy
from nupic.frameworks.opf.modelfactory import ModelFactory
import model_params

class sendAnArray():
    def __init__(self):
        self.model = ModelFactory.create(model_params.MODEL_PARAMS)
        self.model.enableInference({'predictedField': 'frequency'})
        for i in range(100):
            self.run()

    def run(self):
        y = [1, 2, 3, 4, 5]
        modelInput = {"frequency": y}
        result = self.model.run(modelInput)
        anomalyScore = result.inferences['anomalyScore']
        print y, anomalyScore

sAA = sendAnArray()
model_params.py
MODEL_PARAMS = {
    'model': "CLA",
    'version': 1,
    'predictAheadTime': None,
    'modelParams': {
        'inferenceType': 'TemporalAnomaly',
        'sensorParams': {
            'verbosity': 0,
            'encoders': {
                'frequency': {
                    'fieldname': u'frequency',
                    'type': 'SimpleVector',
                    'length': 5,
                    'minVal': 0,
                    'maxVal': 210
                }
            },
            'sensorAutoReset': None,
        },
        'spEnable': True,
        'spParams': {
            'spVerbosity': 0,
            'globalInhibition': 1,
            'columnCount': 2048,
            'inputWidth': 5,
            'numActivePerInhArea': 60,
            'seed': 1956,
            'coincInputPoolPct': 0.5,
            'synPermConnected': 0.1,
            'synPermActiveInc': 0.1,
            'synPermInactiveDec': 0.01,
        },
        'tpEnable': True,
        'tpParams': {
            'verbosity': 0,
            'columnCount': 2048,
            'cellsPerColumn': 32,
            'inputWidth': 2048,
            'seed': 1960,
            'temporalImp': 'cpp',
            'newSynapseCount': 20,
            'maxSynapsesPerSegment': 32,
            'maxSegmentsPerCell': 128,
            'initialPerm': 0.21,
            'permanenceInc': 0.1,
            'permanenceDec': 0.1,
            'globalDecay': 0.0,
            'maxAge': 0,
            'minThreshold': 12,
            'activationThreshold': 16,
            'outputType': 'normal',
            'pamLength': 1,
        },
        'clParams': {
            'regionName': 'CLAClassifierRegion',
            'clVerbosity': 0,
            'alpha': 0.0001,
            'steps': '5',
        },
        'anomalyParams': {
            u'anomalyCacheRecords': None,
            u'autoDetectThreshold': None,
            u'autoDetectWaitRecords': 2184
        },
        'trainSPNetOnlyIfRequested': False,
    },
}
The problem seems to be that the SimpleVector class accepts an array instead of a dict as its input, and then reconstructs it internally as {'list': {'idx0': 1, 'idx1': 2, ...}} (i.e. as if that dict had been the input). This is fine if it is done consistently, but your error shows that it breaks down somewhere. Have a word with breznak about this.
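A minimal illustration of that mismatch (purely illustrative; these names are not NuPIC's actual internals):

# the encoder looks up vector components by name on the reconstructed dict
wrapped = {'list': {'idx0': 1, 'idx1': 2, 'idx2': 3, 'idx3': 4, 'idx4': 5}}
print(wrapped['list']['idx0'])  # 1: the lookup works here

try:
    getattr([1, 2, 3, 4, 5], 'idx0')  # but a raw list reaches the same path
except AttributeError as e:
    print(e)  # 'list' object has no attribute 'idx0', as in the traceback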
Working through the OPF was difficult. I wanted to input an array of indices into the temporal pooler, so I opted to interface directly with the algorithms (I relied heavily on hello_tp.py). I ignored SimpleVector altogether and instead worked through the BitmapArray encoder.
Subutai has a useful email on the nupic-discuss listserv where he breaks down the three main areas of the NuPIC API: algorithms, networks/regions, and the OPF. That helped me understand my options better.