Python Panel dashboard causing BufferError and RuntimeErrors - pandas

I have struggled for some time to create a data streaming interface using Panel.
Essentially I have approximately 20 named Python objects that I monitor, reading the spectral output from each.
I want a dashboard displaying this as 20 plots that continuously overwrite themselves, since the spectral output must always be displayed over the same x-range (channels).
The dashboard runs fine for some time and then I either get:
a) RuntimeError: _pending_writes should be non-None when we have a document lock, and we should have the lock when the document changes
or
b) BufferError: Existing exports of data: object cannot be re-sized
{PYTHON_ENV_PATH}/lib/python3.6/site-packages/bokeh/document/document.py:500: RuntimeWarning: coroutine 'WSHandler.send_message' was never awaited
  gc.collect()
I've drafted up an MRE as follows:
import numpy as np
import pandas as pd
import hvplot.streamz
import panel as pn
from streamz.dataframe import PeriodicDataFrame

pn.extension()

# Object from which data is collected:
class data_gen:
    def __init__(self, name, size=1024, sets=4):
        self.name = name
        self.size = size
        self.sets = sets

    def get_data(self):
        return np.random.randn(self.sets, self.size)

# Have a dictionary of items with name:
data_dict = {
    "a": data_gen("a"),
    "b": data_gen("b"),
    "c": data_gen("c"),
    "d": data_gen("d"),
    "e": data_gen("e"),
    "f": data_gen("f"),
}

# Generate dataframe
def name_dataFrame(**kwargs):
    dct = {}
    for name, dg in data_dict.items():
        d = dg.get_data()
        sets, size = d.shape
        t_dict = {}
        for i in range(sets):
            t_dict[i] = {c: d[i, c] for c in range(size)}
        t_df = pd.DataFrame(t_dict).transpose()
        dct[name] = t_df
    df = pd.concat(dct).transpose()
    return df

# Have it be streamed
df = PeriodicDataFrame(name_dataFrame, interval='10s')

# Compose panel layout
pn_realtime = pn.Column("# Data Dashboard")
for name in data_dict:
    pn_realtime.append(pn.Row(f"""## Name: {name}"""))
    pn_realtime.append(pn.Row(
        df[name].hvplot.line(backlog=1024, width=600, height=500,
                             xlabel="n", ylabel="f(n)", grid=True)
    ))

pn_realtime.servable()
My setup is:
# Name Version Build Channel
panel 0.12.1 pyhd3eb1b0_0
hvplot 0.7.3 pyhd3eb1b0_1
pandas 1.1.5 py36ha9443f7_0
streamz 0.6.3 pyhd3eb1b0_0
Python 3.6.13 :: Anaconda, Inc.
Ubuntu 20.04.3 LTS (Focal Fossa)
I'm pretty new to dashboard design (and pandas, for that matter), so I wouldn't be surprised if there were a vastly simpler way to do what I am attempting.
My suspicion is that appending Panel objects is causing memory buffers to overfill, and garbage collection cannot keep up. If so, what can I do?
Running this MRE on my beefier Windows machine with Python 3.9.7 did not seem to crash, but perhaps that is simply because I have not run it for long enough.
I've also set ylims on the hvplot (roughly as sketched below), and that seemed to stop the crashes (again, maybe I did not run it for long enough), but due to the nature of my application I cannot have static ylims.
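For reference, the ylim experiment amounted to adding a fixed ylim to the same hvplot call from the MRE; the limits below are placeholders, not values from my actual data:
# Same plot call as in the MRE, but with hard-coded y-limits (placeholder values).
df[name].hvplot.line(backlog=1024, width=600, height=500,
                     xlabel="n", ylabel="f(n)", grid=True,
                     ylim=(-5, 5))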
I appreciate your time and input.
Cheers.

Related

What exactly to test (unittest) in a larger function containing several dataframe manipulations

Perhaps this is a constraint of my understanding of unittest, but I get quite confused as to what should be tested, patched, etc. in a method that has several pandas dataframe manipulations. Many of the unittest examples out there focus on classes and methods that are typically small. For larger methods, I get a bit lost on the typical unittest paradigm. For example:
myscript.py
import pandas as pd

class Pivot:
    def prepare_dfs(self):
        df = pd.read_csv(self.file, sep=self.delimiter)
        g = df.groupby("Other_Location")
        df1 = g.apply(lambda x: x[x["PRN"] == "Free"].count())
        locations = ["O12-03-01", "O12-03-02"]
        cp = df1["PRN"]
        cp = cp[locations].tolist()
        data = [locations, cp]
        new_df = pd.DataFrame({"Other_Location": data[0], "Free": data[1]})
        return new_df, df
test_myscript.py
class TestPivot(unittest.TestCase):
    def setUp(self):
        args = parse_args(["-f", "test1", "-d", ","])
        self.pivot = Pivot(args)
        self.pivot.path = "Pivot/path"

    @mock.patch("myscript.cp[locations].tolist()", return_value=None)
    @mock.patch("myscript.pd.read_csv", return_value=df)
    def test_prepare_dfs_1(self, mock_read_csv, mock_cp):
        new_df, df = self.pivot.prepare_dfs()
        # Here I get a bit lost
For example, here I try to circumvent the following error message:
ModuleNotFoundError: No module named 'myscript.cp[locations]'; 'myscript' is not a package
I managed to correctly mock pd.read_csv in my method; however, further down in the code there are groupby, apply, tolist, etc. The error message is thrown at the following line:
cp = cp[locations].tolist()
What is the best way to approach unit testing when your method involves several manipulations on a dataframe? Is refactoring the code (into smaller chunks) always advised? In this case, how can I correctly mock the tolist?
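For illustration, the refactoring idea raised above could look roughly like the sketch below. The helper names (_load, _count_free_per_location) are invented for this example, and only pd.read_csv is patched, so the real groupby/apply/tolist run against a small in-memory frame:
# myscript.py (sketch) -- split prepare_dfs into small, separately testable pieces
import pandas as pd

class Pivot:
    def _load(self):
        # I/O boundary: the only thing worth patching in tests
        return pd.read_csv(self.file, sep=self.delimiter)

    def _count_free_per_location(self, df, locations):
        g = df.groupby("Other_Location")
        df1 = g.apply(lambda x: x[x["PRN"] == "Free"].count())
        return df1["PRN"][locations].tolist()

    def prepare_dfs(self):
        df = self._load()
        locations = ["O12-03-01", "O12-03-02"]
        free = self._count_free_per_location(df, locations)
        return pd.DataFrame({"Other_Location": locations, "Free": free}), df

# test_myscript.py (sketch) -- inside TestPivot, patch only the I/O and
# assert on the real pandas results
@mock.patch("myscript.pd.read_csv")
def test_prepare_dfs_1(self, mock_read_csv):
    mock_read_csv.return_value = pd.DataFrame({
        "Other_Location": ["O12-03-01", "O12-03-01", "O12-03-02"],
        "PRN": ["Free", "Used", "Free"],
    })
    new_df, df = self.pivot.prepare_dfs()
    self.assertEqual(new_df["Free"].tolist(), [1, 1])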

Correct way of passing dataframe to ray

I am trying to do the simplest thing with Ray, but no matter what I do it just never releases memory and fails.
The use case is simply:
read parquet files to DF -> pass to pool of actors -> make changes to DF -> return DF
@ray.remote  # needed for Main_func.remote() below
class Main_func:
    def calculate(self, data):
        # do some things with the DF
        return df.copy(deep=True)  # <- one of many attempts to fix the problem, but didn't work

cpus = 24
actors = []
for _ in range(cpus):
    actors.append(Main_func.remote())

from ray.util import ActorPool
pool = ActorPool(actors)

import os
arr = os.listdir("/some/files")

def to_ray():
    try:
        filename = arr.pop(0)
        pf = ParquetFile("/some/files/" + filename)
        df = pf.to_pandas()
        pool.submit(lambda a, v: a.calculate.remote(v), df.copy(deep=True))
    except Exception as e:
        print(e)

for _ in range(cpus):
    to_ray()

while True:
    res = pool.get_next_unordered()
    write('./temp/' + random_filename, res, compression='GZIP')
    del res
    to_ray()
I have tried other ways of doing the same thing, manually submitting rather than using the map command, but whatever I do it always locks up memory and fails after a few hundred dataframes.
Does each task need to preserve state across different files? If not, Ray's task abstraction should simplify things:
import os

import pandas as pd
import ray

ray.init()

@ray.remote
def read_and_write(path):
    df = pd.read_parquet(path)
    # ... do things with df
    df.to_parquet("./temp/...")

arr = os.listdir("/some/files")
results = ray.get([read_and_write.remote(path) for path in arr])
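As a usage note (not from the original answer): if the files don't all fit in the object store at once, the number of in-flight tasks can be throttled with ray.wait, roughly like this sketch:
# Sketch: keep at most max_in_flight read_and_write tasks running at a time.
# read_and_write is the @ray.remote task defined just above; the path handling
# assumes the files live under /some/files as in the question.
import os

max_in_flight = 24
pending = []
for filename in os.listdir("/some/files"):
    if len(pending) >= max_in_flight:
        # Block until at least one task finishes before submitting more.
        done, pending = ray.wait(pending, num_returns=1)
        ray.get(done)  # surface any exceptions from completed tasks
    pending.append(read_and_write.remote(os.path.join("/some/files", filename)))

ray.get(pending)  # wait for whatever is still running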

Apache Beam job (Python) using Tensorflow Transform is killed by Cloud Dataflow

I'm trying to run an Apache Beam job based on Tensorflow Transform on Dataflow, but it gets killed. Has anyone experienced that behaviour? This is a simple example with DirectRunner that runs OK locally but fails on Dataflow (I change the runner properly):
import os
import csv
import datetime
import numpy as np
import tensorflow as tf
import tensorflow_transform as tft
from apache_beam.io import textio
from apache_beam.io import tfrecordio
from tensorflow_transform.beam import impl as beam_impl
from tensorflow_transform.beam import tft_beam_io
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import dataset_schema
import apache_beam as beam

NUMERIC_FEATURE_KEYS = ['feature_' + str(i) for i in range(2000)]

def _create_raw_metadata():
    column_schemas = {}
    for key in NUMERIC_FEATURE_KEYS:
        column_schemas[key] = dataset_schema.ColumnSchema(tf.float32, [], dataset_schema.FixedColumnRepresentation())
    raw_data_metadata = dataset_metadata.DatasetMetadata(dataset_schema.Schema(column_schemas))
    return raw_data_metadata

def preprocessing_fn(inputs):
    outputs = {}
    for key in NUMERIC_FEATURE_KEYS:
        outputs[key] = tft.scale_to_0_1(inputs[key])
    return outputs

def main():
    output_dir = '/tmp/tmp-folder-{}'.format(datetime.datetime.now().strftime('%Y%m%d%H%M%S'))
    RUNNER = 'DirectRunner'
    with beam.Pipeline(RUNNER) as p:
        with beam_impl.Context(temp_dir=output_dir):
            raw_data_metadata = _create_raw_metadata()
            _ = (raw_data_metadata | 'WriteInputMetadata' >> tft_beam_io.WriteMetadata(os.path.join(output_dir, 'rawdata_metadata'), pipeline=p))
            m = np.random.rand(100, 2000) * 100
            raw_data = (p
                | 'CreateTestDataset' >> beam.Create([dict(zip(NUMERIC_FEATURE_KEYS, m[i, :])) for i in range(m.shape[0])]))
            raw_dataset = (raw_data, raw_data_metadata)
            transform_fn = (raw_dataset | 'Analyze' >> beam_impl.AnalyzeDataset(preprocessing_fn))
            _ = (transform_fn | 'WriteTransformFn' >> tft_beam_io.WriteTransformFn(output_dir))
            (transformed_data, transformed_metadata) = ((raw_dataset, transform_fn) | 'Transform' >> beam_impl.TransformDataset())
            transformed_data_coder = tft.coders.ExampleProtoCoder(transformed_metadata.schema)
            _ = transformed_data | 'WriteTrainData' >> tfrecordio.WriteToTFRecord(os.path.join(output_dir, 'train'), file_name_suffix='.gz', coder=transformed_data_coder)

if __name__ == '__main__':
    main()
Also, my production code (not shown) fails with the message: The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs.
Any hint?
The restriction on the pipeline description size is documented here:
https://cloud.google.com/dataflow/quotas#limits
There is a way around that: instead of creating stages for each tensor that goes into tft.scale_to_0_1, we could fuse them by first stacking them together and then passing them into tft.scale_to_0_1 with elementwise=True.
The result will be the same, because the min and max are computed per 'column' instead of across the whole tensor.
This would look something like this:
stacked = tf.stack([inputs[key] for key in NUMERIC_FEATURE_KEYS], axis=1)
scaled_stacked = tft.scale_to_0_1(stacked, elementwise=True)
for key, tensor in zip(NUMERIC_FEATURE_KEYS, tf.unstack(scaled_stacked, axis=1)):
    outputs[key] = tensor
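Folded back into the original script, the fused preprocessing_fn might look like the following sketch (it assumes the same NUMERIC_FEATURE_KEYS list as above):
import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # Fuse the 2000 per-column stages into one: stack the columns into a single
    # [batch, num_features] tensor, scale element-wise (min/max are still
    # computed per column), then split the result back into named outputs.
    stacked = tf.stack([inputs[key] for key in NUMERIC_FEATURE_KEYS], axis=1)
    scaled_stacked = tft.scale_to_0_1(stacked, elementwise=True)
    outputs = {}
    for key, tensor in zip(NUMERIC_FEATURE_KEYS, tf.unstack(scaled_stacked, axis=1)):
        outputs[key] = tensor
    return outputs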

MemoryError when querying database from Process

I am trying to create a program with 3 processes that read from the same database. The code was working before I started introducing processes.
I am getting a MemoryError when performing a select() with PeeWee, and I suspect there is something wrong with the sharing of resources. Minimal example:
models.py
from peewee import Model
from playhouse.pool import PooledSqliteExtDatabase

file_scanner_database = PooledSqliteExtDatabase(
    None,
    max_connections=32,
)

class FileModel(Model):
    class Meta:
        database = file_scanner_database
main.py
from multiprocessing import Process

from file_scanner import FileScanner
from models import file_scanner_database
from models import FileModel

def create_scanner_agent(data):
    scanner = FileScanner(data)
    scanner.start_scanner()

shared_info = {'db_location': '/absolute/path/to/database'}

file_scanner_database.init(shared_info['db_location'])
file_scanner_database.connect()
file_scanner_database.create_tables([FileModel], safe=True)

new_process = Process(
    target=create_scanner_agent,
    args=(shared_info,)
)
new_process.daemon = True
new_process.start()

try:
    new_process.join()
except KeyboardInterrupt:
    pass

new_process.terminate()
file_scanner.py
from models import file_scanner_database
from models import FileModel

class FileScanner:
    def __init__(self, data):
        for k, v in data.items():
            setattr(self, k, v)
        file_scanner_database.init(self.db_location)
        file_scanner_database.connect()

    def start_scanner(self):
        while True:
            # THIS IS WHERE THE PROGRAM CRASHES
            for row in FileModel.select():
                ...
It looks like you're trying to access memory across a fork? Or some such craziness? I think the answer is that you're doing it wrong homie. Try opening your DB connection after the fork.
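A minimal sketch of that suggestion, assuming the same models.py and file_scanner.py as above: nothing touches the database in the parent before the fork, and the child process opens its own connection (FileScanner.__init__ already does the init/connect).
# main.py (sketch) -- no connection is opened before the fork
from multiprocessing import Process

from file_scanner import FileScanner
from models import file_scanner_database, FileModel

def create_scanner_agent(data):
    # Runs in the child process: FileScanner.__init__ calls init() and connect(),
    # so create the tables here too and the parent never needs a connection.
    scanner = FileScanner(data)
    file_scanner_database.create_tables([FileModel], safe=True)
    scanner.start_scanner()

if __name__ == '__main__':
    shared_info = {'db_location': '/absolute/path/to/database'}

    new_process = Process(target=create_scanner_agent, args=(shared_info,))
    new_process.daemon = True
    new_process.start()

    try:
        new_process.join()
    except KeyboardInterrupt:
        pass
    new_process.terminate()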

reindexObject fails during FileField to BlobField migration in Plone 4.0.7

I'm trying to migrate from Plone 3.3.5 to Plone 4.0.7 and I'm stuck on a step that converts all the FileFields to BlobFields.
The Plone upgrade script successfully converts all native FileFields, but I have several custom AT-based classes which have to be converted manually. I've tried two ways of doing the conversion, both of which lead me to the same error:
1. Using schemaextender as outlined in the Plone migration guide and a source code example.
2. Renaming all FileFields to blob fields and then running this script:
from AccessControl.SecurityManagement import newSecurityManager
from AccessControl import getSecurityManager
from Products.CMFCore.utils import getToolByName
from zope.app.component.hooks import setSite
from Products.contentmigration.migrator import BaseInlineMigrator
from Products.contentmigration.walker import CustomQueryWalker
from plone.app.blob.field import BlobField

admin = app.acl_users.getUserById("admin")
newSecurityManager(None, admin)
portal = app.plone
setSite(portal)

def find_all_types_fields(portal_catalog, type_instance_to_search):
    output = {}
    searched = []
    for k in catalog():
        kobj = k.getObject()
        if kobj.__class__.__name__ in searched:
            continue
        searched.append(kobj.__class__.__name__)
        for field in kobj.schema.fields():
            if isinstance(field, type_instance_to_search):
                if kobj.__class__.__name__ in output:
                    output[kobj.__class__.__name__].append(field.__name__)
                else:
                    output[kobj.__class__.__name__] = [field.__name__]
    return output

def produce_migrator(field_map):
    source_class = field_map.keys()[0]
    fields = {}
    for x in field_map.values()[0]:
        fields[x] = None

    class FileBlobMigrator(BaseInlineMigrator):
        '''Migrating ExtensionBlobField (which is still a FileField) to BlobField'''
        src_portal_type = source_class
        src_meta_type = source_class
        fields_map = fields

        def migrate_data(self):
            '''Unfinished'''
            for k in self.fields_map.keys():
                #print "examining attributes"
                #import pdb; pdb.set_trace()
                #if hasattr(self.obj, k):
                if k in self.obj.schema.keys():
                    print("***converting attribute:", k)
                    field = self.obj.getField(k).get(self.obj)
                    mutator = self.obj.getField(k).getMutator(self.obj)
                    mutator(field)

        def last_migrate_reindex(self):
            '''Unfinished'''
            self.obj.reindexObject()

    return FileBlobMigrator

def consume_migrator(portal_catalog, migrator):
    walker = CustomQueryWalker(portal_catalog, migrator, full_transaction=True)
    transaction.savepoint(optimistic=True)
    walker_status = walker.go()
    return walker.getOutput()

def migrate_blobs(catalog, migrate_type):
    all_fields = find_all_types_fields(catalog, migrate_type)
    import pdb; pdb.set_trace()
    for k in [{k: all_fields[k]} for k in all_fields]:
        migrator = produce_migrator(k)
        print consume_migrator(catalog, migrator)

catalog = getToolByName(portal, 'portal_catalog')
migrate_blobs(catalog, BlobField)
The problem occurs on self.obj.reindexObject() line where I receive the following traceback:
2011-08-09 17:21:12 ERROR Zope.UnIndex KeywordIndex: unindex_object could not remove documentId -1945041983 from index object_provides. This should not happen.
Traceback (most recent call last):
File "/home/alex/projects/plone4/eggs/Zope2-2.12.18-py2.6-linux-x86_64.egg/Products/PluginIndexes/common/UnIndex.py", line 166, in removeForwardIndexEntry indexRow.remove(documentId)
KeyError: -1945041983
> /home/alex/projects/plone4/eggs/Zope2-2.12.18-py2.6-linux-x86_64.egg/Products/PluginIndexes/common/UnIndex.py(192)removeForwardIndexEntry()
191 str(documentId), str(self.id)),
--> 192 exc_info=sys.exc_info())
193 else:
If I remove the line that triggers reindexing, the conversion completes successfully, but if I try to manually reindex the catalog later, every object that's been converted can no longer be found, and I'm a bit at a loss as to what to do now.
The site has LinguaPlone installed; maybe that has something to do with this?
One option would be to run the migration without the reindexObject() call and do a "Clear and Rebuild" in the catalog ZMI Advanced tab after migrating.