How to receive a finite number of samples at a future time using UHD/GNURadio? - gnuradio

I'm using the GNURadio Python interface to UHD, and I'm trying to set a specific time to start collecting samples and then either collect a specific number of samples or stop the collection at a specific time. Essentially, I want a timed snapshot of samples, similar to the C++ Ettus UHD example 'rx_timed_samples'.
I can get a flowgraph to start at a specific time, but I can't seem to get it to stop at a specific time (at least not without causing overflows). I've also tried doing a finite acquisition, which works, but I can't get it to start at a specific time. So I'm a bit lost as to what to do next.
Here is my attempt at the finite acquisition (it seems to just ignore the start time and collects 0 samples):
num_samples = 1000
usrp = uhd.usrp_source(
    ",".join(("", "")),
    uhd.stream_args(
        cpu_format="fc32",
        channels=range(1),
    ),
)
...
usrp.set_start_time(absolute_start_time)
samples = usrp.finite_acquisition(num_samples)
I've also tried some combinations of the following, without success (TypeError: in method 'usrp_source_sptr_issue_stream_cmd', argument 2 of type '::uhd::stream_cmd_t const &'):
usrp.set_command_time(absolute_start_time)
usrp.issue_stream_cmd(uhd.stream_cmd.STREAM_MODE_NUM_SAMPS_AND_DONE)
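My guess is that the TypeError just means the bare enum is being passed where a constructed stream command object is expected; a rough, untested sketch of what that construction might look like (using the same uhd.stream_cmd wrapper as in the flowgraph attempt below) is:

# Untested sketch: build a NUM_SAMPS_AND_DONE command and schedule it,
# instead of passing the bare enum to issue_stream_cmd().
cmd = uhd.stream_cmd(uhd.stream_cmd.STREAM_MODE_NUM_SAMPS_AND_DONE)
cmd.num_samps = num_samples          # capture exactly this many samples
cmd.stream_now = False               # don't start immediately...
cmd.time_spec = absolute_start_time  # ...start at this absolute device time
usrp.issue_stream_cmd(cmd)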
I also tried the following in a flowgraph:
...
usrp = flowgraph.uhd_usrp_source_0
absolute_start_time = uhd.uhd_swig.time_spec_t(start_time)
usrp.set_start_time(absolute_start_time)
flowgraph.start()
stop_cmd = uhd.stream_cmd(uhd.stream_cmd.STREAM_MODE_STOP_CONTINUOUS)
absolute_stop_time = absolute_start_time + uhd.uhd_swig.time_spec_t(collection_time)
usrp.set_command_time(absolute_stop_time)
usrp.issue_stream_cmd(stop_cmd)
For whatever reason, the flowgraph approach consistently generated overflows for anything longer than a 0.02 s collection time.

I was running into a similar issue and solved it by using the head block.
Here's a simple example which saves 10,000 samples from a sine wave source then exits.
#!/usr/bin/env python
# Evan Widloski - 2017-09-03
# Logging test in gnuradio
from gnuradio import gr
from gnuradio import blocks
from gnuradio import analog

class top_block(gr.top_block):
    def __init__(self, output):
        gr.top_block.__init__(self)

        sample_rate = 32e3
        num_samples = 10000
        ampl = 1

        source = analog.sig_source_f(sample_rate, analog.GR_SIN_WAVE, 100, ampl)
        head = blocks.head(4, num_samples)
        sink = blocks.file_sink(4, output)

        self.connect(source, head)
        self.connect(head, sink)

if __name__ == '__main__':
    try:
        top_block('/tmp/out').run()
    except KeyboardInterrupt:
        pass
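To tie this back to the original question: the same head-block idea should combine with uhd.usrp_source and set_start_time to get a timed, finite capture. A minimal, untested sketch (the sample rate and device address are placeholders, and start_time must be an absolute uhd time spec like the one built in the question):

# Untested sketch: timed, finite USRP capture by pairing set_start_time
# with a head block. Rate and device address are placeholders.
from gnuradio import gr, blocks, uhd

class timed_snapshot(gr.top_block):
    def __init__(self, output, num_samples, start_time):
        gr.top_block.__init__(self)
        usrp = uhd.usrp_source(
            ",".join(("", "")),
            uhd.stream_args(cpu_format="fc32", channels=range(1)),
        )
        usrp.set_samp_rate(1e6)
        usrp.set_start_time(start_time)   # absolute uhd time_spec_t
        head = blocks.head(gr.sizeof_gr_complex, num_samples)  # stops after N samples
        sink = blocks.file_sink(gr.sizeof_gr_complex, output)
        self.connect(usrp, head, sink)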

Related

Dask process hangs after warning about "full garbage collection took x% cpu time recently (threshold: y%)"

I'm using Dask to process a massive dataset and eventually build a model for a classification task, and I'm running into problems. I hope I can get some help.
Main Task
I'm working with clinical notes. Each clinical note has a note type associated with it. However, over 60% of the notes are of type *Missing*. I'm trying to train a classifier on the notes that are labeled and run inference on the notes that have the missing type.
Data
I'm working with 3 years' worth of clinical notes. The total data size is ~1.3TB. These were pulled from a database using PySpark (I have no control over this process) and are organized as year/month/partitions.parquet. The root directory is raw_data. The number of partitions within each month varies (e.g., one of the months has 2620 partitions). The total number of partitions is over 50,000.
Machine
Cores: 64
Memory: 1TB
The machine is shared with others, so I can't use all of its hardware resources at any given time.
Code
As a first step towards building the model, I want to preprocess the data and do some EDA. I'm using the TextDescriptives package, which uses spaCy, to get some basic information about the text.
import re

import numpy as np
import pandas as pd
import spacy
import textdescriptives as td
import dask.dataframe as dd
import pyarrow as pa
from dask.distributed import Client

def replace_empty(text, replace=np.nan):
    """
    Replace empty notes with nan's which can be removed later
    """
    if pd.isnull(text):
        return text
    elif text.isspace() or text == '':
        return replace
    return text

def fix_ws(text):
    """
    Replace multiple carriage returns with a single newline
    and multiple new lines with a single new line
    """
    text = re.sub('\r', '\n', text)
    text = re.sub('\n+', '\n', text)
    return text

def replace_empty_part(df, **kwargs):
    return df.apply(replace_empty)

def fix_ws_part(df, **kwargs):
    return df.apply(fix_ws)

def fix_missing_part(df, **kwargs):
    return df.apply(lambda t: '*Missing*' if t == 'Unknown at this time' else t)

def extract_td_metrics(text, spacy_model):
    # `cols` (the TextDescriptives columns of interest) is defined elsewhere
    try:
        doc = spacy_model(text)
        metrics_df = td.extract_df(doc)[cols]
        return metrics_df.squeeze()
    except Exception:
        return pd.Series([np.nan for _ in range(len(cols))], index=cols)

def extract_metrics_part(df, **kwargs):
    spacy_model = spacy.load('en_core_web_sm', disable=['tok2vec', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
    spacy_model.add_pipe('textdescriptives')
    return df.apply(extract_td_metrics, spacy_model=spacy_model)

# `single_month` and `processed_notes` are paths defined elsewhere
client = Client(n_workers=32)
notes_df = dd.read_parquet(single_month)
notes_df['Text'] = notes_df['Text'].map_partitions(replace_empty_part, meta='string')
notes_df = notes_df.dropna()
notes_df['Text'] = notes_df['Text'].map_partitions(fix_ws_part, meta='string')
notes_df['NoteType'] = notes_df['NoteType'].map_partitions(fix_missing_part, meta='string')
metrics_df = notes_df['Text'].map_partitions(extract_metrics_part)
notes_df = dd.concat([notes_df, metrics_df], axis=1)
notes_df = notes_df.dropna()
notes_df = notes_df.repartition(npartitions=4)
notes_df.to_parquet(processed_notes, schema={'NoteType': pa.string(), 'Text': pa.string()}, write_index=False)
All of this code was tested on a small sample with Pandas to make sure it works, and on Dask (with the same sample) to make sure the results matched. When I run this code on just a single month's worth of data, after a few seconds the process simply hangs, outputting a stream of warnings of this type:
timestamp - distributed.utils_perf - WARNING - full garbage collections took 35% CPU time recently (threshold: 10%)
The machine is in a secure enclave, so I don't have copy/paste available and am typing everything out here. After some research I came across two threads here and here. While neither contained a direct solution, the suggestions included disabling Python garbage collection with gc.disable() and starting a clean environment with Dask freshly installed. Neither of these helped me. I'm wondering if I can modify my code so that this problem doesn't happen; there is no way to load all of this data into memory and use Pandas directly.
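One variation I haven't tried yet (just a sketch, assuming Client.run works as documented) is disabling garbage collection inside the worker processes rather than only in my own process:

import gc
from dask.distributed import Client

client = Client(n_workers=32)

# Client.run executes a function on every worker process, so this would
# turn GC off where the warnings are actually produced. Untested here.
client.run(gc.disable)

# ... same map_partitions pipeline as above ...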
Thanks.

Tf-agent Actor/Learner: TFUniform ReplayBuffer dimensionality issue - invalid shape of Replay Buffer vs. Actor update

I'm trying to adapt this tf-agents actor<->learner DQN Atari Pong example to my Windows machine using a TFUniformReplayBuffer instead of the ReverbReplayBuffer (which only works on Linux machines), but I'm running into a dimensionality issue.
[...]
---> 67 init_buffer_actor.run()
[...]
InvalidArgumentError: {{function_node __wrapped__ResourceScatterUpdate_device_/job:localhost/replica:0/task:0/device:CPU:0}} Must have updates.shape = indices.shape + params.shape[1:] or updates.shape = [], got updates.shape [84,84,4], indices.shape [1], params.shape [1000,84,84,4] [Op:ResourceScatterUpdate]
The problem is as follows: the TF actor tries to access the replay buffer and initialize it with a certain number of random samples of shape (84,84,4), following this DeepMind paper, but the replay buffer requires samples of shape (1,84,84,4).
My code is as follows:
import tensorflow as tf

from tf_agents.agents.dqn import dqn_agent
from tf_agents.environments import suite_atari
from tf_agents.policies import random_py_policy
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.train import actor
from tf_agents.train.utils import spec_utils, train_utils

def train_pong(
        env_name='ALE/Pong-v5',
        initial_collect_steps=50000,
        max_episode_frames_collect=50000,
        batch_size=32,
        learning_rate=0.00025,
        replay_capacity=1000):

    # load atari environment
    collect_env = suite_atari.load(
        env_name,
        max_episode_steps=max_episode_frames_collect,
        gym_env_wrappers=suite_atari.DEFAULT_ATARI_GYM_WRAPPERS_WITH_STACKING)

    # create tensor specs
    observation_tensor_spec, action_tensor_spec, time_step_tensor_spec = (
        spec_utils.get_tensor_specs(collect_env))

    # create training util
    train_step = train_utils.create_train_step()

    # calculate no. of actions
    num_actions = action_tensor_spec.maximum - action_tensor_spec.minimum + 1

    # create agent
    agent = dqn_agent.DqnAgent(
        time_step_tensor_spec,
        action_tensor_spec,
        q_network=create_DL_q_network(num_actions),
        optimizer=tf.compat.v1.train.RMSPropOptimizer(learning_rate=learning_rate))

    # create uniform replay buffer
    replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
        data_spec=agent.collect_data_spec,
        batch_size=1,
        max_length=replay_capacity)

    # observer of replay buffer
    rb_observer = replay_buffer.add_batch

    # create batch dataset
    dataset = replay_buffer.as_dataset(
        sample_batch_size=batch_size,
        num_steps=2,
        single_deterministic_pass=False).prefetch(3)

    # create callable function for actor
    experience_dataset_fn = lambda: dataset

    # create random policy for buffer init
    random_policy = random_py_policy.RandomPyPolicy(collect_env.time_step_spec(),
                                                    collect_env.action_spec())

    # create initializer
    init_buffer_actor = actor.Actor(
        collect_env,
        random_policy,
        train_step,
        steps_per_run=initial_collect_steps,
        observers=[replay_buffer.add_batch])

    # initialize buffer with random samples
    init_buffer_actor.run()
(The approach uses the OpenAI Gym env as well as the corresponding wrapper functions.)
I have worked with keras-rl2 and with tf-agents without the actor<->learner split for other Atari games, and both worked quite well after some adaptations. I guess my current code would also work after a few adaptations in the tf-agents library functions, but that would defeat the purpose of the library.
My current assumption: the actor<->learner methods are not able to work with the TFUniformReplayBuffer (the way I expect them to) due to the missing support for the TFPyEnvironment, or I still have some gaps in my knowledge of this tf-agents approach.
Previous (successful) attempt:
from tf_agents.environments.tf_py_environment import TFPyEnvironment

tf_collect_env = TFPyEnvironment(collect_env)
init_driver = DynamicStepDriver(
    tf_collect_env,
    random_policy,
    observers=[replay_buffer.add_batch],
    num_steps=200)
init_driver.run()
I would be very grateful if someone could explain what I'm overlooking here.
I fixed it...partly, but the next error is (in my opinion) an architectural problem.
The problem is that the Actor/Learner setup is built on a PyEnvironment, whereas the TFUniformReplayBuffer expects the TFPyEnvironment, which ends up in the failure above...
Using the PyUniformReplayBuffer with a converted py-spec solved this problem.
from tf_agents.replay_buffers import py_uniform_replay_buffer
from tf_agents.specs import tensor_spec

# convert agent spec to py-data-spec
py_collect_data_spec = tensor_spec.to_array_spec(agent.collect_data_spec)

# create replay buffer based on the py-data-spec
replay_buffer = py_uniform_replay_buffer.PyUniformReplayBuffer(
    data_spec=py_collect_data_spec,
    capacity=replay_capacity * batch_size
)
This snippet solved the issue of having an incompatible buffer in the background, but it leads to another issue:
--> the add_batch function does not work
I found this approach, which advises using either a batched environment or making the following adaptation to the replay observer (the add_batch method).
from tf_agents.utils.nest_utils import batch_nested_array

#********* Adaptations to the add_batch method - START *********#
rb_observer = lambda x: replay_buffer.add_batch(batch_nested_array(x))
#********* Adaptations to the add_batch method - END *********#

# create batch dataset
dataset = replay_buffer.as_dataset(
    sample_batch_size=32,
    single_deterministic_pass=False)
experience_dataset_fn = lambda: dataset
This helped me to solve the issue from this post, but now I run into another problem where I need to ask someone from the tf-agents team...
--> It seems that the Learner/Actor structure is not able to work with any buffer other than the ReverbBuffer, because the data-spec processed by the PyUniformReplayBuffer sets up the wrong buffer structure...
For anyone who has the same problem: I just created this GitHub issue report to get further answers and/or fix my lack of knowledge.
The full fix is shown below...
--> The dimensionality issue was valid and indicates that the (uploaded) batched samples are not in the correct shape.
--> The issue happens because the add_batch method loads values with the wrong shape.
rb_observer = replay_buffer.add_batch
Long story short, this line should be replaced by
rb_observer = lambda x: replay_buffer.add_batch(batch_nested_array(x))
--> Afterwards the (replay buffer) inputs have the correct shape and the Learner/Actor setup starts training.
The full replay buffer setup is shown below:
# create buffer for storing experience
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    agent.collect_data_spec,
    1,
    max_length=1000000)

# create batch dataset
dataset = replay_buffer.as_dataset(
    sample_batch_size=32,
    num_steps=2,
    single_deterministic_pass=False).prefetch(4)

# create batched nested array input for rb_observer
rb_observer = lambda x: replay_buffer.add_batch(batch_nested_array(x))

# create batched readout of dataset
experience_dataset_fn = lambda: dataset
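For completeness, a hedged sketch (not part of the original fix) of how the wrapped observer plugs into the collect actor, reusing the names from the question's snippets:

# Untested sketch: wiring the batching observer into an Actor, reusing
# collect_env, random_policy and train_step from the code above.
collect_actor = actor.Actor(
    collect_env,
    random_policy,                  # or the agent's collect policy in a real run
    train_step,
    steps_per_run=initial_collect_steps,
    observers=[rb_observer])        # the lambda that adds the batch dimension
collect_actor.run()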

Pyinstaller, Multiprocessing, and Pandas - No such file/directory [duplicate]

Python v3.5, Windows 10
I'm using multiple processes and trying to capture user input. Searching around, I see there are odd things that happen when using input() with multiple processes. After 8+ hours of trying, nothing I implemented worked; I'm sure I'm doing something wrong, but I can't for the life of me figure it out.
The following is a very stripped-down program that demonstrates the issue. It works fine when I run it within PyCharm, but when I use PyInstaller to create a single executable it fails: the program gets stuck in a loop constantly asking the user to enter something, as shown below.
I'm pretty sure it has to do with how Windows handles standard input, from what I've read. I've also tried passing the user input variables as Queue() items to the functions, but I hit the same issue. I read that you should put input() in the main Python process, so I did that under if __name__ == '__main__':
from multiprocessing import Process
import time

def func_1(duration_1):
    while duration_1 >= 0:
        time.sleep(1)
        print('Duration_1: %d %s' % (duration_1, 's'))
        duration_1 -= 1

def func_2(duration_2):
    while duration_2 >= 0:
        time.sleep(1)
        print('Duration_2: %d %s' % (duration_2, 's'))
        duration_2 -= 1

if __name__ == '__main__':
    # func_1 user input
    while True:
        duration_1 = input('Enter a positive integer.')
        if duration_1.isdigit():
            duration_1 = int(duration_1)
            break
        else:
            print('**Only positive integers accepted**')
            continue

    # func_2 user input
    while True:
        duration_2 = input('Enter a positive integer.')
        if duration_2.isdigit():
            duration_2 = int(duration_2)
            break
        else:
            print('**Only positive integers accepted**')
            continue

    p1 = Process(target=func_1, args=(duration_1,))
    p2 = Process(target=func_2, args=(duration_2,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
You need to call multiprocessing.freeze_support() when you produce a Windows executable with PyInstaller.
Straight from the docs:
multiprocessing.freeze_support()
Add support for when a program which uses multiprocessing has been frozen to produce a Windows executable. (Has been tested with py2exe, PyInstaller and cx_Freeze.)
One needs to call this function straight after the if __name__ == '__main__' line of the main module. For example:
from multiprocessing import Process, freeze_support

def f():
    print('hello world!')

if __name__ == '__main__':
    freeze_support()
    Process(target=f).start()
If the freeze_support() line is omitted then trying to run the frozen executable will raise RuntimeError.
Calling freeze_support() has no effect when invoked on any operating system other than Windows. In addition, if the module is being run normally by the Python interpreter on Windows (the program has not been frozen), then freeze_support() has no effect.
In your example you also have unnecessary code duplication that you should tackle.
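For illustration only, here is a sketch of how that deduplication plus the freeze_support() call might look; the helper names are made up, not from the original code:

from multiprocessing import Process, freeze_support
import time

def countdown(label, duration):
    # one countdown worker replacing func_1/func_2
    while duration >= 0:
        time.sleep(1)
        print('%s: %d s' % (label, duration))
        duration -= 1

def ask_positive_int(prompt='Enter a positive integer.'):
    # one input loop replacing the two duplicated while-loops
    while True:
        value = input(prompt)
        if value.isdigit():
            return int(value)
        print('**Only positive integers accepted**')

if __name__ == '__main__':
    freeze_support()  # required for frozen Windows executables
    duration_1 = ask_positive_int()
    duration_2 = ask_positive_int()
    processes = [Process(target=countdown, args=('Duration_1', duration_1)),
                 Process(target=countdown, args=('Duration_2', duration_2))]
    for p in processes:
        p.start()
    for p in processes:
        p.join()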

PyMC3 sample() function does not accept the "start" value to generate a trace

I am new to PyMC3 and Bayesian inference methods. I have a simple piece of code that tries to infer the value of a decay constant (=1) from artificial data generated using a truncated exponential distribution:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import pymc3 as pm
import arviz as az

T = stats.truncexpon(b=10.)
t = T.rvs(1000)

# Bayesian Inference
with pm.Model() as model:

    # Define Priors
    lam = pm.Gamma('$\lambda$', alpha=1, beta=1)

    # Define Likelihood
    time = pm.Exponential('time', lam=lam, observed=t)

    # Inference
    trace = pm.sample(20, start={'lam': 10.},
                      step=pm.Metropolis(), chains=1, cores=1,
                      progressbar=True)

az.plot_trace(trace)
plt.show()
This code produces a trace like the one below.
I am really confused as to why the starting value of 10 is not accepted by the sampler; the trace above should start at 10. I am using Python 3.7 to run the code.
Thank you.
A few things are going on here:
- when the sampler first starts it has a tuning phase; samples from this phase are discarded by default, but this can be controlled with the discard_tuned_samples argument
- the keys in the start dictionary need to correspond to the name given to the random variable ('$\lambda$'), not the Python variable
Incorporating those two, one can try
trace = pm.sample(20, start={'$\lambda$': 10.},
                  step=pm.Metropolis(), chains=1, cores=1,
                  discard_tuned_samples=False)
However, the other possible issue is that the starting value isn't guaranteed to be emitted in the first draw; that only happens if the first proposal sample is rejected, which is down to chance.
Fixing the game (setting a random seed), though, we can get a glimpse:
trace = pm.sample(20, start={'$\lambda$': 10.},
                  step=pm.Metropolis(), chains=1, cores=1,
                  discard_tuned_samples=False, random_seed=1)
...
trace.get_values(varname='$\lambda$')[:10]
# array([10.        , 5.42397358, 3.19841997, 1.09383329, 1.09383329,
#         1.09383329, 1.09383329, 1.09383329, 1.09383329, 1.09383329])
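As a side note (not part of the original answer): the length of the tuning phase itself is controlled by pm.sample's tune argument, so a hedged variant that skips tuning entirely would be:

# Untested variant: tune=0 skips the tuning phase, so the very first
# stored draw is the starting point supplied via `start`.
trace = pm.sample(20, tune=0, start={'$\lambda$': 10.},
                  step=pm.Metropolis(), chains=1, cores=1)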

Writing a netcdf4 file is 6-times slower than writing a netcdf3_classic file and the file is 8-times as big?

I am using the netCDF4 library in python and just came across the issue stated in the title. At first I was blaming groups for this, but it turns out that it is a difference between the NETCDF4 and NETCDF3_CLASSIC formats (edit: and it appears related to our Linux installation of the netcdf libraries).
In the program below, I am creating a simple time-series netcdf file of the same data in two different ways: 1) as a NETCDF3_CLASSIC file, 2) as a NETCDF4 flat file (creating groups in the netcdf4 file doesn't make much of a difference). What I find with simple timing and the ls command is:
1) NETCDF3        1.3483 seconds     1922704 bytes
2) NETCDF4 flat   8.5920 seconds    15178689 bytes
It's exactly the same routine that creates 1) and 2); the only difference is the format argument to netCDF4.Dataset. Is this a bug or a feature?
Thanks, Martin
Edit: I have now found that this must have something to do with our local installation of the netcdf library on a Linux computer. When I use the program version below (trimmed down to the essentials) on my Windows laptop, I get similar file sizes, and netcdf4 is actually almost twice as fast as netcdf3! When I run the same program on our Linux system, I can reproduce the old results. So this question is apparently not related to Python.
Sorry for the confusion.
New code:
import datetime as dt
import numpy as np
import netCDF4 as nc

def write_to_netcdf_single(filename, data, series_info, format='NETCDF4'):
    vname = 'testvar'
    t0 = dt.datetime.now()
    with nc.Dataset(filename, "w", format=format) as f:
        # define dimensions and variables
        dim = f.createDimension('time', None)
        time = f.createVariable('time', 'f8', ('time',))
        time.units = "days since 1900-01-01 00:00:00"
        time.calendar = "gregorian"
        param = f.createVariable(vname, 'f4', ('time',))
        param.units = "kg"
        # define global attributes
        for k, v in sorted(series_info.items()):
            setattr(f, k, v)
        # store data values
        time[:] = nc.date2num(data.time, units=time.units, calendar=time.calendar)
        param[:] = data.value
    t1 = dt.datetime.now()
    print "Writing file %s took %10.4f seconds." % (filename, (t1-t0).total_seconds())

if __name__ == "__main__":
    # create an array with 1 mio values and datetime instances
    time = np.array([dt.datetime(2000,1,1) + dt.timedelta(hours=v) for v in range(1000000)])
    values = np.arange(0., 1000000.)
    data = np.array(zip(time, values), dtype=[('time', dt.datetime), ('value', 'f4')])
    data = data.view(np.recarray)
    series_info = {'attr1': 'dummy', 'attr2': 'dummy2'}
    filename = "testnc4.nc"
    write_to_netcdf_single(filename, data, series_info)
    filename = "testnc3.nc"
    write_to_netcdf_single(filename, data, series_info, format='NETCDF3_CLASSIC')
[old code deleted because it had too much unnecessary stuff]
The two file formats do have different characteristics. The classic file format was dead simple (well, simpler than the new format: http://www.unidata.ucar.edu/software/netcdf/docs/netcdf/Classic-Format-Spec.html#Classic-Format-Spec): a small header describes all the data, and then (since you have 3 record variables) the 3 record variables get interleaved.
Nice and simple, but you only get one UNLIMITED dimension, there is no facility for parallel I/O, and no way to organize data into groups.
Enter the new HDF5-based back end, introduced in NetCDF-4.
In exchange for new features, more flexibility, and fewer restrictions on file and variable size, you have to pay a bit of a price. For large datasets the costs are amortized, but your variables are (relatively speaking) rather small.
I think the file size discrepancy is exacerbated by your use of record variables. In order to support arrays that can grow in N dimensions, there is more metadata associated with each record entry in the NetCDF-4 format.
HDF5 also uses the "reader makes right" convention: classic NetCDF says "all data will be big-endian", whereas HDF5 encodes a bit of information about how the data was stored. If the reader process runs on the same architecture as the writer process (which is common, as it would be on your laptop or when restarting from a simulation checkpoint), then no conversion needs to be performed.
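As a purely speculative follow-up (not part of the original answer): netCDF4-python's createVariable call accepts chunking and compression options, and explicit chunk sizes plus zlib compression are the usual knobs for taming HDF5 record-variable overhead. A sketch of that variant for the NETCDF4 case only, reusing f and vname from the question's code:

# Hypothetical tweak for the NETCDF4 case only: explicit chunk sizes and
# zlib compression (both real createVariable options). Whether this closes
# the gap on the Linux installation in question is untested.
time = f.createVariable('time', 'f8', ('time',),
                        zlib=True, complevel=4, chunksizes=(4096,))
param = f.createVariable(vname, 'f4', ('time',),
                         zlib=True, complevel=4, chunksizes=(4096,))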
This question is unlikely to help others as it appears to be a site-specific problem related to the interplay between netcdf libraries and the python netCDF4 module.