Dask process hangs after warning about "full garbage collection took x% cpu time recently (threshold: y%)" - pandas

I'm using Dask to process a massive dataset and eventually build a model for a classification task, and I'm running into problems. I hope I can get some help.
Main Task
I'm working with clinical notes. Each clinical note has a note type associated with it. However, over 60% of the notes are of type *Missing*. I'm trying to train a classifier on the notes that are labeled and run inference on the notes that have the missing type.
Data
I'm working with 3 years' worth of clinical notes. The total data size is ~1.3TB. The notes were pulled from a database using PySpark (I have no control over this process) and are organized as year/month/partitions.parquet, with raw_data as the root directory. The number of partitions within each month varies (e.g., one month has 2620 partitions), and the total number of partitions is over 50,000.
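For context, the single_month variable used in the code below refers to one of these month directories; a sketch with illustrative paths (the real directory names differ):

single_month = 'raw_data/2019/01'    # one month's worth of partitions
# The full tree could also be read with a glob, e.g.:
# notes_all = dd.read_parquet('raw_data/*/*/*.parquet')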
Machine
Cores: 64
Memory: 1TB
The machine is shared with others, so I won't be able to use all of its hardware resources at any given time.
Code
As a first step towards building the model, I want to preprocess the data and do some EDA. I'm using the textdescriptives package, which uses spaCy to get some basic information about the text.
import re
import numpy as np
import pandas as pd
import pyarrow as pa
import spacy
import textdescriptives as td
import dask.dataframe as dd
from dask.distributed import Client

def replace_empty(text, replace=np.nan):
    """
    Replace empty notes with nan's which can be removed later
    """
    if pd.isnull(text):
        return text
    elif text.isspace() or text == '':
        return replace
    return text

def fix_ws(text):
    """
    Replace multiple carriage returns with a single newline
    and multiple newlines with a single newline
    """
    text = re.sub('\r', '\n', text)
    text = re.sub('\n+', '\n', text)
    return text

def replace_empty_part(df, **kwargs):
    return df.apply(replace_empty)

def fix_ws_part(df, **kwargs):
    return df.apply(fix_ws)

def fix_missing_part(df, **kwargs):
    return df.apply(lambda t: 'Missing' if t == 'Unknown at this time' else t)

def extract_td_metrics(text, spacy_model):
    try:
        doc = spacy_model(text)
        metrics_df = td.extract_df(doc)[cols]
        return metrics_df.squeeze()
    except Exception:
        return pd.Series([np.nan for _ in range(len(cols))], index=cols)

def extract_metrics_part(df, **kwargs):
    spacy_model = spacy.load('en_core_web_sm', disable=['tok2vec', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
    spacy_model.add_pipe('textdescriptives')
    return df.apply(extract_td_metrics, spacy_model=spacy_model)

client = Client(n_workers=32)

notes_df = dd.read_parquet(single_month)
notes_df['Text'] = notes_df['Text'].map_partitions(replace_empty_part, meta='string')
notes_df = notes_df.dropna()
notes_df['Text'] = notes_df['Text'].map_partitions(fix_ws_part, meta='string')
notes_df['NoteType'] = notes_df['NoteType'].map_partitions(fix_missing_part, meta='string')
metrics_df = notes_df['Text'].map_partitions(extract_metrics_part)
notes_df = dd.concat([notes_df, metrics_df], axis=1)
notes_df = notes_df.dropna()
notes_df = notes_df.repartition(npartitions=4)
notes_df.to_parquet(processed_notes, schema={'NoteType': pa.string(), 'Text': pa.string()}, write_index=False)
All of this code was tested on a small sample with Pandas to make sure it works, and with Dask (on the same sample) to make sure the results matched. When I run this code on only a single month's worth of data, after running for a few seconds the process just hangs, outputting a stream of warnings of this type:
timestamp - distributed.utils_perf - WARNING - full garbage collections took 35% CPU time recently (threshold: 10%)
The machine is in a secure enclave so I don't have copy/paste, so I'm typing everything out here. After some research I came across two threads here and here. While there wasn't a direct solution in either of them, suggestions included disabling Python garbage collection with gc.disable() and starting a clean environment with dask freshly installed. Neither of these helped. I'm wondering if I can modify my code so that this problem doesn't happen; there is no way to load all this data into memory and use Pandas directly.
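One modification I'm considering (an untested sketch, assuming td.extract_df behaves per doc as it does above) is to batch the spaCy calls inside each partition with nlp.pipe instead of calling the model once per note:

def extract_metrics_part_batched(series, **kwargs):
    # Same model setup as extract_metrics_part above, loaded once per partition.
    spacy_model = spacy.load('en_core_web_sm', disable=['tok2vec', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
    spacy_model.add_pipe('textdescriptives')
    rows = []
    # nlp.pipe streams the partition's notes through the pipeline in batches
    # instead of making one model call per note.
    for doc in spacy_model.pipe(series.astype(str).tolist(), batch_size=64):
        try:
            rows.append(td.extract_df(doc)[cols].squeeze())
        except Exception:
            rows.append(pd.Series([np.nan] * len(cols), index=cols))
    return pd.DataFrame(rows, index=series.index)

metrics_df = notes_df['Text'].map_partitions(extract_metrics_part_batched)

I don't know yet whether this would remove the GC warnings, but it should at least cut the per-note overhead.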
Thanks.

Related

TensorFlow example, MemoryError while run text_classification_character_cnn.py

I'm trying to run https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/text_classification_character_cnn.py for learning, but I get an error message:
File "C:\Users\natlun\AppData\Local\Continuum\Anaconda3\lib\site-packages\tensorflow\contrib\learn\python\learn\datasets\base.py", line 72, in load_csv_without_header
data = np.array(data)
MemoryError
I'm using a CPU installation of TensorFlow and Python 3.5. Any ideas how to solve the problem? Other scripts using a CSV file for input work fine.
I was having the same issue. After many hours of reading and googling (and seeing your unanswered question), and comparing the example with other examples that do run, I noticed that
dbpedia = tf.contrib.learn.datasets.load_dataset(
    'dbpedia', test_with_fake_data=FLAGS.test_with_fake_data, size='large')
should just be
dbpedia = tf.contrib.learn.datasets.load_dataset(
    'dbpedia', test_with_fake_data=FLAGS.test_with_fake_data)
Based on what I've read about numpy, I'd bet the "size='large'" parameter causes an over-allocation of a numpy array (which throws the memory error).
Or, when you don't set that parameter perhaps the input data is truncated.
Or some other thing. Anyway, I hope this helps others attempting to run this useful example!
--- Update ---
Without "size='large'" the load_dataset functions appears to create smaller training and test data sets (like 1/1000 the size).
After playing around with the example I realized I could manually load and use the whole data set without getting the memory error (assume it is saving the whole data set as it appears).
# Prepare training and testing data
## This was the provided method for setting up the data.
# dbpedia = tf.contrib.learn.datasets.load_dataset(
#     'dbpedia', test_with_fake_data=FLAGS.test_with_fake_data)
# x_trainz = pandas.DataFrame(dbpedia.train.data)[1]
# y_trainz = pandas.Series(dbpedia.train.target)
# x_testz = pandas.DataFrame(dbpedia.test.data)[1]
# y_testz = pandas.Series(dbpedia.test.target)

## And this is my replacement.
x_train = []
y_train = []
x_test = []
y_test = []
with open("dbpedia_data/dbpedia_csv/train.csv", encoding='utf-8') as filex:
    reader = csv.reader(filex)
    for row in reader:
        x_train.append(row[2])
        y_train.append(int(row[0]))
with open("dbpedia_data/dbpedia_csv/test.csv", encoding='utf-8') as filex:
    reader = csv.reader(filex)
    for row in reader:
        x_test.append(row[2])
        y_test.append(int(row[0]))
x_train = pandas.Series(x_train)
y_train = pandas.Series(y_train)
x_test = pandas.Series(x_test)
y_test = pandas.Series(y_test)
The example now seems to evaluate the whole training data set. But the original code will probably need to be run once to get the data into the correct sub-folders. Also, even while evaluating the whole data set, little memory is used (just a few hundred MB), which makes me think that the load_dataset function is broken in some way.

How to receive a finite number of samples at a future time using UHD/GNURadio?

I'm using the GNURadio python interface to UHD, and I'm trying to set a specific time to start collecting samples and either collect a specific number of samples or stop the collection of samples at a specific time. Essentially, creating a timed snapshot of samples. This is something similar to the C++ Ettus UHD example 'rx_timed_sample'.
I can get a flowgraph to start at a specific time, but I can't seem to get it to stop at a specific time (at least without causing overflows). I've also tried doing a finite acquisition, which works, but I can't get it to start at a specific time. So I'm kind of lost as to what to do next.
Here is my try at the finite acquisition (it seems to just ignore the start time and collects 0 samples):
num_samples = 1000
usrp = uhd.usrp_source(
    ",".join(("", "")),
    uhd.stream_args(
        cpu_format="fc32",
        channels=range(1),
    ),
)
...
usrp.set_start_time(absolute_start_time)
samples = usrp.finite_acquisition(num_samples)
I've also tried some combinations of the following without success (TypeError: in method 'usrp_source_sptr_issue_stream_cmd', argument 2 of type '::uhd::stream_cmd_t const &'):
usrp.set_command_time(absolute_start_time)
usrp.issue_stream_cmd(uhd.stream_cmd.STREAM_MODE_NUM_SAMPS_AND_DONE)
I also tried the following in a flowgraph:
...
usrp = flowgrah.uhd_usrp_source_0
absolute_start_time = uhd.uhd_swig.time_spec_t(start_time)
usrp.set_start_time(absolute_start_time)
flowgrah.start()
stop_cmd = uhd.stream_cmd(uhd.stream_cmd.STREAM_MODE_STOP_CONTINUOUS)
absolute_stop_time = absolute_start_time + uhd.uhd_swig.time_spec_t(collection_time)
usrp.set_command_time(absolute_stop_time)
usrp.issue_stream_cmd(stop_cmd)
For whatever reason, the flowgraph version consistently generated overflows for anything greater than a 0.02 s collection time.
I was running into a similar issue and solved it by using the head block.
Here's a simple example which saves 10,000 samples from a sine wave source then exits.
#!/usr/bin/env python
# Evan Widloski - 2017-09-03
# Logging test in gnuradio

from gnuradio import gr
from gnuradio import blocks
from gnuradio import analog

class top_block(gr.top_block):
    def __init__(self, output):
        gr.top_block.__init__(self)

        sample_rate = 32e3
        num_samples = 10000
        ampl = 1

        source = analog.sig_source_f(sample_rate, analog.GR_SIN_WAVE, 100, ampl)
        head = blocks.head(4, num_samples)
        sink = blocks.file_sink(4, output)

        self.connect(source, head)
        self.connect(head, sink)

if __name__ == '__main__':
    try:
        top_block('/tmp/out').run()
    except KeyboardInterrupt:
        pass
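For the timed-start part of the question, the head block could in principle be combined with set_start_time from the question. A rough, untested sketch (absolute_start_time as defined in the question; the output path and sample count are placeholders):

from gnuradio import gr, blocks, uhd

num_samples = 1000
usrp = uhd.usrp_source(
    ",".join(("", "")),
    uhd.stream_args(cpu_format="fc32", channels=range(1)),
)
head = blocks.head(gr.sizeof_gr_complex, num_samples)   # fc32 items are complex floats
sink = blocks.file_sink(gr.sizeof_gr_complex, '/tmp/timed_capture')

tb = gr.top_block()
tb.connect(usrp, head)
tb.connect(head, sink)

usrp.set_start_time(absolute_start_time)   # timed start, as in the question
tb.run()                                   # returns once head has passed num_samples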

newbie: holoviews curves from pandas follow up: issues with stream

The pandas dataframe rows correspond to successive time samples of a Kalman filter. I want to display the trajectory (truth, measurements and filter estimates) in a stream.
def show_tracker(index, data=run_tracker()):
    i = int(index)
    sleep(0.1)
    p = \
        hv.Scatter(data[0:i], kdims=['x'], vdims=['y'])(style=dict(color='r')) *\
        hv.Curve  (data[0:i], kdims=['x.true'], vdims=['y.true']) *\
        hv.Scatter(data[0:i], kdims=['x.est'], vdims=['y.est'])(style=dict(color='darkgreen')) *\
        hv.Curve  (data[0:i], kdims=['x.est'], vdims=['y.est'])(style=dict(color='lightgreen'))
    return p

%%opts Scatter [width=600,height=280]
ndx = TimeIndex()
hv.DynamicMap(show_tracker, kdims=[], streams=[ndx])

for i in range(N):
    ndx.update(index=i)
Issue 1: Axes are automatically set to the bounds of the data.
Consequently, trajectory updates occur at the very edge of the plot boundaries.
Is there a setting to allow some slop,
or do I have to compute appropriate bounds in the show_tracker function?
Issue 2: Bokeh backend;
I can zoom and pan, but
"Reset" causes the data set to be lost. How do I fix that?
Issue 3: The default data argument to show_tracker
requires the function to be reexecuted to generate a new dataframe.
Is there an easy way to address that?
Issue 1
This is one of the last outstanding issues for the 1.7 release coming next week; track this issue for updates. However, we also just changed how the ranges are updated on a DynamicMap: if you want to update the ranges, make sure to set %%opts Scatter {+framewise} or norm=dict(framewise=True) on one of the displayed objects, as you're already doing for the style options.
Issue 2
This is an unfortunate shortcoming of the reset tool in Bokeh; you can track this issue for updates.
Issue 3:
That depends on what exactly you're doing: has the data already been generated, or are you updating it on the fly? If you only have to generate the data once, you can create it outside the function, which means it will be in scope:
data = run_tracker()

def show_tracker(index):
    i = int(index)
    sleep(0.1)
    ...
    return p
If you actually want to generate new data dynamically the easiest thing to do is write a little class to keep track of the state. You can even make that class a Stream so you don't have to define it separately. Here's what that might look like:
class KalmanTracker(hv.streams.Stream):

    index = param.Integer(default=1)

    def __init__(self, **params):
        # Initializes empty data and parameters
        self.data = None
        super(KalmanTracker, self).__init__(**params)

    def update_data(self, index):
        # Update self.data here
        pass

    def get_view(self, index):
        # Update the data if the index exceeds its length and
        # create a holoviews view of the data
        if self.data is None or len(self.data) < index:
            self.update_data(index)
        data = self.data[:index]
        ...
        return hv_obj

    def show(self):
        # Create DynamicMap to display and
        # pass in self as the Stream
        return hv.DynamicMap(self.get_view, kdims=[],
                             streams=[self])

tracker = KalmanTracker()
tracker.show()

# Should update data and plot
tracker.update(index=10)
Once you've done that you can also use the paramnb library to generate widgets from this class. You'd simply do this:
tracker = KalmanTracker()
paramnb.Widgets(tracker, callback=tracker.update)
tracker.show()

Parallelizing apply function in pandas taking longer than expected

I have a simple cleaner function which removes special characters from a dataframe (and other preprocessing stuff). My dataset is huge and I want to make use of multiprocessing to improve performance. My idea was to break the dataset into chunks and run this cleaner function in parallel on each of them.
I used the dask library and also the multiprocessing module from Python. However, the application seems to be stuck and takes longer than running on a single core.
This is my code:
from multiprocessing import Pool

def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

def process_columns(data):
    for i in data.columns:
        data[i] = data[i].apply(cleaner_func)
    return data

mydf2 = parallelize_dataframe(mydf, process_columns)
I can see from the resource monitor that all cores are being used, but as I said before, the application is stuck.
P.S.
I ran this on Windows Server 2012 (where the issue happens). Running this code in a Unix environment, I was actually able to see some benefit from the multiprocessing library.
Thanks in advance.
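One Windows-specific thing worth checking (a hedged note, not a confirmed diagnosis): on Windows, multiprocessing spawns fresh interpreter processes that re-import the main module, so the Pool must only be created under an if __name__ == '__main__': guard. A sketch of the question's code restructured that way (mydf, num_partitions, num_cores and cleaner_func as in the question):

from multiprocessing import Pool
import numpy as np
import pandas as pd

def process_columns(data):
    for col in data.columns:
        data[col] = data[col].apply(cleaner_func)
    return data

def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    # The context manager closes and joins the pool for us.
    with Pool(num_cores) as pool:
        return pd.concat(pool.map(func, df_split))

if __name__ == '__main__':
    mydf2 = parallelize_dataframe(mydf, process_columns)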

Writing a netcdf4 file is 6-times slower than writing a netcdf3_classic file and the file is 8-times as big?

I am using the netCDF4 library in python and just came across the issue stated in the title. At first I was blaming groups for this, but it turns out that it is a difference between the NETCDF4 and NETCDF3_CLASSIC formats (edit: and it appears related to our Linux installation of the netcdf libraries).
In the program below, I am creating a simple time series netcdf file of the same data in 2 different ways: 1) as a NETCDF3_CLASSIC file, 2) as a NETCDF4 flat file (creating groups in the netcdf4 file doesn't make much of a difference). What I find with simple timing and the ls command is:
1) NETCDF3 1.3483 seconds 1922704 bytes
2) NETCDF4 flat 8.5920 seconds 15178689 bytes
It's exactly the same routine that creates 1) and 2); the only difference is the format argument in the netCDF4.Dataset method. Is this a bug or a feature?
Thanks, Martin
Edit: I have now found that this must have something to do with our local installation of the netcdf library on a Linux computer. When I use the program version below (trimmed down to the essentials) on my Windows laptop, I get similar file sizes, and netcdf4 is actually almost twice as fast as netcdf3! When I run the same program on our Linux system, I can reproduce the old results. Thus, this question is apparently not related to Python.
Sorry for the confusion.
New code:
import datetime as dt
import numpy as np
import netCDF4 as nc

def write_to_netcdf_single(filename, data, series_info, format='NETCDF4'):
    vname = 'testvar'
    t0 = dt.datetime.now()
    with nc.Dataset(filename, "w", format=format) as f:
        # define dimensions and variables
        dim = f.createDimension('time', None)
        time = f.createVariable('time', 'f8', ('time',))
        time.units = "days since 1900-01-01 00:00:00"
        time.calendar = "gregorian"
        param = f.createVariable(vname, 'f4', ('time',))
        param.units = "kg"
        # define global attributes
        for k, v in sorted(series_info.items()):
            setattr(f, k, v)
        # store data values
        time[:] = nc.date2num(data.time, units=time.units, calendar=time.calendar)
        param[:] = data.value
    t1 = dt.datetime.now()
    print "Writing file %s took %10.4f seconds." % (filename, (t1-t0).total_seconds())

if __name__ == "__main__":
    # create an array with 1 mio values and datetime instances
    time = np.array([dt.datetime(2000,1,1) + dt.timedelta(hours=v) for v in range(1000000)])
    values = np.arange(0., 1000000.)
    data = np.array(zip(time, values), dtype=[('time', dt.datetime), ('value', 'f4')])
    data = data.view(np.recarray)
    series_info = {'attr1': 'dummy', 'attr2': 'dummy2'}
    filename = "testnc4.nc"
    write_to_netcdf_single(filename, data, series_info)
    filename = "testnc3.nc"
    write_to_netcdf_single(filename, data, series_info, format='NETCDF3_CLASSIC')
[old code deleted because it had too much unnecessary stuff]
The two file formats do have different characteristics. The classic file format was dead simple (well, simpler than the new format: http://www.unidata.ucar.edu/software/netcdf/docs/netcdf/Classic-Format-Spec.html#Classic-Format-Spec ): a small header described all the data, and then (since you have 3 record variables) the 3 record variables get interleaved.
Nice and simple, but you only get one UNLIMITED dimension, there's no facility for parallel I/O, and there's no way to organize data into groups.
Enter the new HDF5-based back-end, introduced in NetCDF-4.
In exchange for new features, more flexibility, and fewer restrictions on file and variable size, you have to pay a bit of a price. For large datasets, the costs are amortized, but your variables are (relatively speaking) kind of small.
I think the file size discrepancy is exacerbated by your use of record variables. In order to support arrays that are growable in N dimensions, there is more metadata associated with each record entry in the NetCDF-4 format.
HDF5 uses the "reader makes right" convention, too. Classic NetCDF says "all data will be big-endian", but HDF5 encodes a bit of information about how the data was stored. If the reader process has the same architecture as the writer process (which is common, as it would be on your laptop or when restarting from a simulation checkpoint), then no conversion needs to be done.
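If the per-record metadata and default HDF5 chunking are indeed what inflates the file, one thing to experiment with (a hedged suggestion, not something benchmarked here) is giving the record variables explicit chunk sizes, and optionally compression, when creating them in the NETCDF4 case:

# Same variable definitions as in the question, but with explicit HDF5 chunking
# (and optional zlib compression) along the unlimited time dimension.
time = f.createVariable('time', 'f8', ('time',), chunksizes=(65536,))
param = f.createVariable(vname, 'f4', ('time',), chunksizes=(65536,), zlib=True)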
This question is unlikely to help others as it appears to be a site-specific problem related to the interplay between netcdf libraries and the python netCDF4 module.