Should I trust the product of 500 probabilities? - tensorflow

I thought special numerical techniques (such as summing logarithms) were needed when multiplying many probabilities, but my experiments show little difference between the approaches:
import numpy as np
import tensorflow as tf

p = np.random.rand(500)  # 500 probabilities drawn uniformly from [0, 1)
print(f'prod : {np.prod(p)}')                    # direct product
print(f'exp-sum-log: {np.exp(sum(np.log(p)))}')  # product computed via log space
e = tf.constant(p)
print(f'tensorflow : {tf.math.reduce_prod(e)}')  # TensorFlow's product reduction
Output from three runs:

prod : 1.564231010023949e-224
exp-sum-log: 1.5642310100240046e-224
tensorflow : 1.5642310100239522e-224

prod : 7.854750422663386e-232
exp-sum-log: 7.854750422664323e-232
tensorflow : 7.854750422663366e-232

prod : 3.635104367139144e-211
exp-sum-log: 3.635104367137875e-211
tensorflow : 3.63510436713914e-211
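For context, the three approaches only diverge in behaviour once the true product drops below the smallest representable double (roughly 1e-308). A minimal sketch of that failure mode, using more factors than the 500 above (my own illustration, not part of the original experiment):

import numpy as np

rng = np.random.default_rng(0)
p = rng.random(5000)  # ten times as many factors; the true product is around e**-5000

print(np.prod(p))                 # underflows to 0.0 -- below the double-precision range
print(np.sum(np.log(p)))          # stays finite (roughly -5000), so log space is still usable
print(np.exp(np.sum(np.log(p))))  # exponentiating back underflows to 0.0 again

So with 500 factors the plain product happens to stay representable (around 1e-224 in the runs above), but the log-sum form is the one that keeps working once the result itself leaves the representable range.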

Related

tflite_model_maker if obj['difficult'] == 'Unspecified': KeyError: 'difficult'

I am trying to train a TFLite model on just the 'person' class of the COCO dataset.
I am using tflite-model-maker for training and FiftyOne to process the dataset.
When I run the training file (.py), I get the error below.
root@85ac26b47f92:/external# python demofie.py
2022-11-01 21:02:01.059188: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: UNKNOWN ERROR (34)
2022-11-01 21:02:01.059234: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: 85ac26b47f92
2022-11-01 21:02:01.059242: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: 85ac26b47f92
2022-11-01 21:02:01.059324: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: NOT_FOUND: was unable to find libcuda.so DSO loaded into this program
2022-11-01 21:02:01.059381: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 470.141.3
2022-11-01 21:02:01.059821: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
File "demofie.py", line 20, in <module>
train_data = object_detector.DataLoader.from_pascal_voc(images_dir='/external/train/data',annotations_dir='/external/train/labels', label_map=['person'],ignore_difficult_instances= False,num_shards = 100)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_examples/lite/model_maker/core/data_util/object_detector_dataloader.py", line 217, in from_pascal_voc
cache_writer.write_files(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_examples/lite/model_maker/core/data_util/object_detector_dataloader_util.py", line 252, in write_files
tf_example = create_pascal_tfrecord.dict_to_tf_example(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_examples/lite/model_maker/third_party/efficientdet/dataset/create_pascal_tfrecord.py", line 162, in dict_to_tf_example
if obj['difficult'] == 'Unspecified':
KeyError: 'difficult'
This is the code that causes the error. Can anyone with better coding knowledge than me shed some light on any mistakes I may have made?
I have added the FiftyOne code below it (that part runs without error).
import numpy as np
import os
from tflite_model_maker.config import QuantizationConfig
from tflite_model_maker.config import ExportFormat
from tflite_model_maker import model_spec
from tflite_model_maker import object_detector
import tensorflow as tf
assert tf.__version__.startswith('2')
tf.get_logger().setLevel('ERROR')
from absl import logging
logging.set_verbosity(logging.ERROR)
spec = model_spec.get('efficientdet_lite1')
train_data = object_detector.DataLoader.from_pascal_voc(images_dir='/external/train/data', annotations_dir='/external/train/labels', label_map=['person'], ignore_difficult_instances=False, num_shards=100)
validation_data = object_detector.DataLoader.from_pascal_voc(images_dir='/external/val/data', annotations_dir='/external/val/labels', label_map=['person'], ignore_difficult_instances=False, num_shards=100)
test_data = object_detector.DataLoader.from_pascal_voc(images_dir='/external/test/data', annotations_dir='/external/test/labels', label_map=['person'], ignore_difficult_instances=False, num_shards=100)
model = object_detector.create(train_data, model_spec=spec, batch_size=8, epochs=2000, train_whole_model=True, validation_data=validation_data)
model.evaluate(test_data)
model.export(export_dir='/external/')
Dataset generation code:
import fiftyone.zoo as foz
import fiftyone as fo
from fiftyone import ViewField as F
cocodataset_test = foz.load_zoo_dataset(
    "coco-2017",
    splits="test",
    label_types=["detections"],
    classes=["person"],
    only_matching=True,
    # max_samples=50,
)
cocodataset_validation = foz.load_zoo_dataset(
    "coco-2017",
    splits="validation",
    label_types=["detections"],
    classes=["person"],
    only_matching=True,
    # max_samples=50,
)
cocodataset_train = foz.load_zoo_dataset(
    "coco-2017",
    splits="train",
    label_types=["detections"],
    classes=["person"],
    only_matching=True,
    # max_samples=50,
)
cocodataset_validation.export(
    '/external/val',
    fo.types.VOCDetectionDataset,
)
cocodataset_train.export(
    '/external/train/',
    fo.types.VOCDetectionDataset,
)
cocodataset_test.export(
    '/external/test/',
    fo.types.VOCDetectionDataset,
)

Python Panel dashboard causing BufferError and RuntimeErrors

I have struggled for some time to create a data streaming interface using Panel.
Essentially, I have approximately 20 named Python objects that I monitor, reading the spectral output from each.
I want a dashboard displaying this as 20 plots that continuously overwrite themselves, since the spectral output must be shown over the same x-range (channels).
The dashboard runs fine for some time, and then I get either:
a) RuntimeError: _pending_writes should be non-None when we have a document lock, and we should have the lock when the document changes
or
b) BufferError: Existing exports of data: object cannot be re-sized
{PYTHON_ENV_PATH}/lib/python3.6/site-packages/bokeh/document/document.py:500: RuntimeWarning: coroutine 'WSHandler.send_message' was never awaited
  gc.collect()
I've drafted up an MRE as follows:
import numpy as np
import pandas as pd
import hvplot.streamz
import panel as pn
from streamz.dataframe import PeriodicDataFrame

pn.extension()

# Object from which data is collected:
class data_gen:
    def __init__(self, name, size=1024, sets=4):
        self.name = name
        self.size = size
        self.sets = sets

    def get_data(self):
        return np.random.randn(self.sets, self.size)

# Have a dictionary of items with name:
data_dict = {
    "a": data_gen("a"),
    "b": data_gen("b"),
    "c": data_gen("c"),
    "d": data_gen("d"),
    "e": data_gen("e"),
    "f": data_gen("f"),
}

# Generate dataframe
def name_dataFrame(**kwargs):
    dct = {}
    for name, dg in data_dict.items():
        d = dg.get_data()
        sets, size = d.shape
        t_dict = {}
        for i in range(sets):
            t_dict[i] = {c: d[i, c] for c in range(size)}
        t_df = pd.DataFrame(t_dict).transpose()
        dct[name] = t_df
    df = pd.concat(dct).transpose()
    return df

# Have it be streamed
df = PeriodicDataFrame(name_dataFrame, interval='10s')

# Compose panel layout
pn_realtime = pn.Column("# Data Dashboard")
for name in data_dict:
    pn_realtime.append(pn.Row(f"""##Name: {name}"""))
    pn_realtime.append(pn.Row(
        df[name].hvplot.line(backlog=1024, width=600, height=500,
                             xlabel="n", ylabel="f(n)", grid=True)
    ))

pn_realtime.servable()
My set up is:
# Name Version Build Channel
panel 0.12.1 pyhd3eb1b0_0
hvplot 0.7.3 pyhd3eb1b0_1
pandas 1.1.5 py36ha9443f7_0
streamz 0.6.3 pyhd3eb1b0_0
Python 3.6.13 :: Anaconda, Inc.
Ubuntu 20.04.3 LTS (Focal Fossa)
I'm pretty new to dashboard design (and pandas, for that matter), so I wouldn't be surprised if there were a vastly simpler way to do what I am attempting.
My suspicion is that appending Panel objects is causing memory buffers to overfill and garbage collection cannot keep up. If so, what can I do?
Running this MRE on my beefier Windows machine with Python 3.9.7 did not seem to crash, but perhaps that is simply because I have not run it for long enough.
I've also set ylims on the hvplot and that seemed to stop crashes from occurring (again maybe I did not run it for long enough), but due to the nature of my application, I cannot have static ylims.
I appreciate your time and input.
Cheers.

Python: Logistic Regression p-values

I am able to print the p-values of my regression, but I would like the output to pair each column name of X2 with its p-value.
I want the output to look like this:
attr1_1: 3.73178531e-01
sinc1_1: 4.97942222e-06
the code:
from sklearn.linear_model import LogisticRegression
from scipy import stats
X2 = dating[['attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1', 'attr_o','sinc_o','intel_o','fun_o','amb_o','shar_o','age', 'race',]]
y = dating['match']
dating_log_model = LogisticRegression(solver='liblinear')
dating_log_model.fit(X2,y)
dating_log_model.score(X2,y)
# getting the p-values
from sklearn.feature_selection import chi2
scores, pvalues = chi2(X2, y)
print(pvalues)
# current output
[3.73178531e-01 4.97942222e-06 3.49411284e-02 1.14925100e-11
6.40544454e-02 7.46131800e-10 3.52640714e-58 1.31669842e-17
5.15620104e-15 1.42543106e-62 6.60005884e-15 1.52260795e-81
7.41356400e-02 8.19087227e-01]
Try this instead of printing the p-values directly:
import pandas as pd
com_dic = {'X2': X2.columns, 'pvalues': pvalues}
result = pd.DataFrame(com_dic)
print(result)
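If you want exactly the name: value lines from the question rather than a DataFrame, a minimal sketch (assuming the same X2 and pvalues variables as above) is to zip the column names with the chi2 p-values:

for name, p in zip(X2.columns, pvalues):
    print(f'{name}: {p:.8e}')  # e.g. attr1_1: 3.73178531e-01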

Apache Beam job (Python) using Tensorflow Transform is killed by Cloud Dataflow

I'm trying to run an Apache Beam job based on TensorFlow Transform on Dataflow, but it gets killed. Has anyone experienced this behaviour? This is a simple example with DirectRunner that runs fine locally but fails on Dataflow (I change the runner accordingly):
import os
import csv
import datetime
import numpy as np
import tensorflow as tf
import tensorflow_transform as tft
from apache_beam.io import textio
from apache_beam.io import tfrecordio
from tensorflow_transform.beam import impl as beam_impl
from tensorflow_transform.beam import tft_beam_io
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import dataset_schema
import apache_beam as beam
NUMERIC_FEATURE_KEYS = ['feature_'+str(i) for i in range(2000)]
def _create_raw_metadata():
    column_schemas = {}
    for key in NUMERIC_FEATURE_KEYS:
        column_schemas[key] = dataset_schema.ColumnSchema(tf.float32, [], dataset_schema.FixedColumnRepresentation())
    raw_data_metadata = dataset_metadata.DatasetMetadata(dataset_schema.Schema(column_schemas))
    return raw_data_metadata

def preprocessing_fn(inputs):
    outputs = {}
    for key in NUMERIC_FEATURE_KEYS:
        outputs[key] = tft.scale_to_0_1(inputs[key])
    return outputs

def main():
    output_dir = '/tmp/tmp-folder-{}'.format(datetime.datetime.now().strftime('%Y%m%d%H%M%S'))
    RUNNER = 'DirectRunner'
    with beam.Pipeline(RUNNER) as p:
        with beam_impl.Context(temp_dir=output_dir):
            raw_data_metadata = _create_raw_metadata()
            _ = (raw_data_metadata | 'WriteInputMetadata' >> tft_beam_io.WriteMetadata(os.path.join(output_dir, 'rawdata_metadata'), pipeline=p))
            m = numpy_dataset = np.random.rand(100, 2000) * 100
            raw_data = (p
                        | 'CreateTestDataset' >> beam.Create([dict(zip(NUMERIC_FEATURE_KEYS, m[i, :])) for i in range(m.shape[0])]))
            raw_dataset = (raw_data, raw_data_metadata)
            transform_fn = (raw_dataset | 'Analyze' >> beam_impl.AnalyzeDataset(preprocessing_fn))
            _ = (transform_fn | 'WriteTransformFn' >> tft_beam_io.WriteTransformFn(output_dir))
            (transformed_data, transformed_metadata) = ((raw_dataset, transform_fn) | 'Transform' >> beam_impl.TransformDataset())
            transformed_data_coder = tft.coders.ExampleProtoCoder(transformed_metadata.schema)
            _ = transformed_data | 'WriteTrainData' >> tfrecordio.WriteToTFRecord(os.path.join(output_dir, 'train'), file_name_suffix='.gz', coder=transformed_data_coder)

if __name__ == '__main__':
    main()
Also, my production code (not shown) fails with the message: The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs.
Any hints?
The restriction on the pipeline description size is documented here:
https://cloud.google.com/dataflow/quotas#limits
There is a way around that: instead of creating stages for each tensor that goes into tft.scale_to_0_1, we can fuse them by first stacking them together and then passing the result to tft.scale_to_0_1 with elementwise=True.
The result will be the same, because the min and max are computed per 'column' instead of across the whole tensor.
This would look something like this:
stacked = tf.stack([inputs[key] for key in NUMERIC_FEATURE_KEYS], axis=1)
scaled_stacked = tft.scale_to_0_1(stacked, elementwise=True)
for key, tensor in zip(NUMERIC_FEATURE_KEYS, tf.unstack(scaled_stacked, axis=1)):
    outputs[key] = tensor
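Put together, a minimal sketch of the fused preprocessing_fn (reusing NUMERIC_FEATURE_KEYS from the question; exact behaviour may depend on your tensorflow_transform version) would be:

import tensorflow as tf
import tensorflow_transform as tft

NUMERIC_FEATURE_KEYS = ['feature_' + str(i) for i in range(2000)]

def preprocessing_fn(inputs):
    # Stack all numeric features into one [batch, 2000] tensor so TFT builds a
    # single analyze stage instead of one per feature, keeping the job graph small.
    stacked = tf.stack([inputs[key] for key in NUMERIC_FEATURE_KEYS], axis=1)
    # elementwise=True computes min/max per column, so each feature is scaled
    # exactly as it would be by calling tft.scale_to_0_1 on it individually.
    scaled_stacked = tft.scale_to_0_1(stacked, elementwise=True)
    outputs = {}
    for key, tensor in zip(NUMERIC_FEATURE_KEYS, tf.unstack(scaled_stacked, axis=1)):
        outputs[key] = tensor
    return outputs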

Numpy slogdet computation error

There appears to be a major difference between numpy's slogdet and the exact result when computing the log-determinant of a Vandermonde matrix.
I compare against the exact log-determinant (see e.g. here for a proof).
The minimal code to see this is:
import numpy as np

A = np.power.outer(np.linspace(0, 1, 50), range(50))
print(np.linalg.slogdet(A)[1])

# Exact log-determinant: sum of log(v1 - v2) over all pairs with v1 > v2
s = 0
for v1 in np.linspace(0, 1, 50):
    for v2 in np.linspace(0, 1, 50):
        if v1 > v2:
            s += np.log(v1 - v2)
print(s)
Which yields:
-1191.88408998
-1706.99560647
I was wondering if there is a more accurate log-determinant implementation that I could use, both in this situation and for general (non-Vandermonde) matrices.
You can use mpmath (the arbitrary-precision library that sympy uses under the hood) like this:
import numpy as np
import mpmath as mp

mp.mp.dps = 50  # work with 50 significant decimal digits

linspace1 = list(map(mp.mpf, np.linspace(0, 1, 50)))
A = np.power.outer(list(map(float, linspace1)), range(50))
first_print = mp.mpf(np.linalg.slogdet(A)[1])
print(first_print)

s = 0
linspace2 = list(map(mp.mpf, np.linspace(0, 1, 50)))
for v1 in linspace1:
    for v2 in linspace2:
        if v1 > v2:
            s += mp.log(v1 - v2)
print(s)
RESULTS
first_print = -1178.272517342130186079884879291057586669921875
s = -1706.9956064674289001970168329846189154212781094939
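If you need an accurate log-determinant for general matrices as well, one option (a sketch assuming mpmath is available; it is slow for large matrices because the LU runs in software arbitrary precision) is to build the matrix in high precision and take the determinant with mpmath, whose exponent range is essentially unlimited:

import numpy as np
import mpmath as mp

mp.mp.dps = 50  # 50 significant decimal digits throughout

# Build the Vandermonde matrix directly in arbitrary precision so that neither
# the entries nor the LU factorisation suffer double-precision rounding/underflow.
points = [mp.mpf(x) for x in np.linspace(0, 1, 50)]
A_mp = mp.matrix([[x**j for j in range(50)] for x in points])

log_det = mp.log(mp.fabs(mp.det(A_mp)))
print(log_det)  # should agree with the pairwise-difference formula (about -1706.9956)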