How to select row value from given columns based on comparison of other column values in Pandas data frame? - pandas

I have the following Pandas DataFrame:
true_y m1_labels m1_probs_0 m1_probs_1 m2_labels m2_probs_0 m2_probs_1
0 0 0.628205 0.371795 1 0.491648 0.508352
0 0 0.564113 0.435887 1 0.474973 0.525027
0 1 0.463897 0.536103 0 0.660307 0.339693
0 1 0.454559 0.545441 0 0.512349 0.487651
0 0 0.608345 0.391655 1 0.499531 0.500469
0 0 0.816127 0.183873 1 0.456669 0.543331
0 1 0.442693 0.557307 0 0.573354 0.426646
1 0 0.653497 0.346503 1 0.487212 0.512788
0 1 0.392380 0.607620 0 0.627419 0.372581
0 1 0.375816 0.624184 0 0.631532 0.368468
This is a collection of disagreeing ML model predictions with labels and label probabilities of two models (m1, m2) and the actual label (true_y).
For each row, I would like to pick whichever hard label prediction (m1_labels or m2_labels) comes from the model that assigns the higher probability to its own predicted class. So for the first row I expect 0, because the m1 model gives its prediction 0 a higher probability (0.628205) than the m2 model gives its prediction 1 (0.508352). Basically, this is intended to be a manual voting ensemble of the two models.
How can I get this vector with a Pandas query?

You can use the apply function for this:
df.apply(lambda x: x["m1_labels"] if max(x["m1_probs_0"], x["m1_probs_1"]) > max(x["m2_probs_0"], x["m2_probs_1"]) else x["m2_labels"], axis=1)
This selects the first model's label if the probability of its predicted class is higher than the probability of the second model's predicted class. Otherwise, it selects the label from the second model.

You can use:
# get max probability for m1
p1 = df.filter(like='m1_probs').max(axis=1)
# get max probability for m2
p2 = df.filter(like='m2_probs').max(axis=1)
# m1_label if it has a greater probability, else m2_label
df['best'] = df['m1_labels'].where(p1.gt(p2), df['m2_labels'])
output:
true_y m1_labels m1_probs_0 m1_probs_1 m2_labels m2_probs_0 m2_probs_1 best
0 0 0 0.628205 0.371795 1 0.491648 0.508352 0
1 0 0 0.564113 0.435887 1 0.474973 0.525027 0
2 0 1 0.463897 0.536103 0 0.660307 0.339693 0
3 0 1 0.454559 0.545441 0 0.512349 0.487651 1
4 0 0 0.608345 0.391655 1 0.499531 0.500469 0
5 0 0 0.816127 0.183873 1 0.456669 0.543331 0
6 0 1 0.442693 0.557307 0 0.573354 0.426646 0
7 1 0 0.653497 0.346503 1 0.487212 0.512788 0
8 0 1 0.392380 0.607620 0 0.627419 0.372581 0
9 0 1 0.375816 0.624184 0 0.631532 0.368468 0

Related

One hot encoding a multi-valued categorical column where not all categories are represented

I have a column in a dataset called 'Crop', that represents the crops grown in a field over a period of time. The column might have a single string, like Cotton, or it may have multiple strings, like Cotton, Soy. And, depending on the dataset, there may be crops that are categories, but not represented in the particular dataset I'm training with at the time.
I've tried this:
possible_categories = list(['Corn', 'Sorghum', 'Hemp', 'Cotton', 'Soy'])
#df = (X.Crop).str.split(', ', expand=True)
#ohe_crop = pd.get_dummies(df, columns=possible_categories, sparse=True)
#print(ohe_crop)
X.Crop = (X.Crop).astype(pd.CategoricalDtype(categories=possible_categories))
ohe_crop = pd.get_dummies(X.Crop, columns=possible_categories, sparse=True)
which yields this:
Corn Sorghum Hemp Cotton Soy
0 0 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
... ... ... ... ...
35512 0 0 0 0 1
35513 0 0 0 0 1
35514 0 0 0 0 1
35515 0 0 0 0 1
35516 0 0 0 0 1
[35517 rows x 5 columns]
In reality, the 1st row of Crop was Cotton, Soy, Sorghum so I expected:
Corn Sorghum Hemp Cotton Soy
0 0 1 0 1 1
I think what happened here is that get_dummies() creates dummy columns for all possible permutations of the crop data:
Corn, Cotton, Soy
Corn, Soy Cotton ...
Hemp, Cotton,
Soy Hemp,
Soy Soy
so unless the field had crops that fit into these patterns, the row gets a 0.
I'd like to specify the possible categories, split Crop into multiple columns delimited by the commas that are in the rows, and then be able to populate multiple columns if there were multiple crops grown, but I can't figure out how to make it happen. Any advice?
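One possible approach, sketched under the assumption that the crops in each row are separated by exactly ', ': Series.str.get_dummies splits every string on the separator and builds one indicator column per crop that actually occurs, and reindex then forces the full list of possible categories, filling the ones that never occur with 0. The small frame X below is made-up sample data, not the real dataset.

```python
import pandas as pd

possible_categories = ['Corn', 'Sorghum', 'Hemp', 'Cotton', 'Soy']

# Hypothetical sample data standing in for the real X
X = pd.DataFrame({'Crop': ['Cotton, Soy, Sorghum', 'Cotton', 'Soy']})

# One indicator column per crop that actually occurs in this dataset
seen = X['Crop'].str.get_dummies(sep=', ')

# Add the categories that are missing from this dataset and fix the column order
ohe_crop = seen.reindex(columns=possible_categories, fill_value=0)
print(ohe_crop)
```

Note that reindex also drops any crop that is not listed in possible_categories, so the list has to be complete for this to be safe.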

Undersampling for multilabel imbalanced datasets in pandas

I'm working on a roll-your-own undersampling function, since imblearn does not work neatly with multi-label classification (e.g. it only accepts one dimensional y).
I want to iterate through X and y, removing a row every 2 or 3 rows that are part of the majority class. The goal is a quick and dirty way to reduce the number of rows in the majority class.
def undersample(X, y):
    counter = 0
    for index, row in y.itertuples():
        if row['rectangle_here'] == 0:
            counter += 1
            if counter > 3:
                counter = 0
                X.drop(index, inplace=True)
                y.drop(index, inplace=True)
    return X, y
But it crashes my kernel on even a small amount of rows (~30,000).
y is something like this, where anytime f2 or f3 is present, f1 is present
So, let's count the number of times 0 happens in f1 and then delete a 0 row every 3rd time:
f1 f2 f3
0 0 0 0
1 0 0 0
2 0 0 0
3 1 0 1
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
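A loop-free variant might look like the sketch below. It assumes that X and y share the same index and that the majority class is identified by f1 == 0, as in the sample above; the function signature, the every parameter and the feature column are illustrative choices, not part of the original code. Instead of dropping rows one by one, it builds a boolean mask once and selects the surviving rows in a single step.

```python
import pandas as pd

def undersample(X, y, label_col='f1', every=3):
    """Drop every `every`-th row whose label_col is 0 (the majority class)."""
    majority = y[label_col].eq(0)
    # 1-based running count of majority rows seen so far
    rank = majority.cumsum()
    # A row is dropped if it is a majority row and its count is a multiple of `every`
    drop = majority & rank.mod(every).eq(0)
    return X.loc[~drop], y.loc[~drop]

y = pd.DataFrame({'f1': [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
                  'f2': [0] * 10,
                  'f3': [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]})
X = pd.DataFrame({'feature': range(10)})  # hypothetical feature matrix

X_small, y_small = undersample(X, y)
print(y_small.index.tolist())
```

Because the inputs are not modified in place, the original X and y remain available for comparison.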

Pandas iterate max value of a variable length slice in a series

Let's assume I have a Pandas DataFrame as follows:
import pandas as pd
idx = ['2003-01-02', '2003-01-03', '2003-01-06', '2003-01-07',
'2003-01-08', '2003-01-09', '2003-01-10', '2003-01-13',
'2003-01-14', '2003-01-15', '2003-01-16', '2003-01-17',
'2003-01-21', '2003-01-22', '2003-01-23', '2003-01-24',
'2003-01-27']
a = pd.DataFrame([1,2,0,0,1,2,3,0,0,0,1,2,3,4,5,0,1],
columns = ['original'], index = pd.to_datetime(idx))
I am trying to get the max of each slice of that DataFrame between two zeros.
In that example I would get:
a['result'] = [0,2,0,0,0,0,3,0,0,0,0,0,0,0,5,0,1]
that is:
original result
2003-01-02 1 0
2003-01-03 2 2
2003-01-06 0 0
2003-01-07 0 0
2003-01-08 1 0
2003-01-09 2 0
2003-01-10 3 3
2003-01-13 0 0
2003-01-14 0 0
2003-01-15 0 0
2003-01-16 1 0
2003-01-17 2 0
2003-01-21 3 0
2003-01-22 4 0
2003-01-23 5 5
2003-01-24 0 0
2003-01-27 1 1
find zeros
cumsum to make groups
mask the zeros into their own group -1
find the max location in each group idxmax
get rid of the one for group -1, that was for zeros anyway
get a.original for found max locations, reindex and fill with zeros
m = a.original.eq(0)
g = a.original.groupby(m.cumsum().mask(m, -1))
i = g.idxmax().drop(-1)
a.assign(result=a.loc[i, 'original'].reindex(a.index, fill_value=0))
original result
2003-01-02 1 0
2003-01-03 2 2
2003-01-06 0 0
2003-01-07 0 0
2003-01-08 1 0
2003-01-09 2 0
2003-01-10 3 3
2003-01-13 0 0
2003-01-14 0 0
2003-01-15 0 0
2003-01-16 1 0
2003-01-17 2 0
2003-01-21 3 0
2003-01-22 4 0
2003-01-23 5 5
2003-01-24 0 0
2003-01-27 1 1

Python particles simulator: out-of-core processing

Problem description
I'm writing a Monte Carlo particle simulator (Brownian motion and photon emission) in python/numpy. I need to save the simulation output (>>10GB) to a file and process the data in a second step. Compatibility with both Windows and Linux is important.
The number of particles (n_particles) is 10-100. The number of time-steps (time_size) is ~10^9.
The simulation has 3 steps (the code below is for an all-in-RAM version):
Simulate (and store) an emission rate array (contains many almost-0 elements):
shape (n_particles x time_size), float32, size 80GB
Compute counts array, (random values from a Poisson process with previously computed rates):
shape (n_particles x time_size), uint8, size 20GB
counts = np.random.poisson(lam=emission).astype(np.uint8)
Find timestamps (or index) of counts. Counts are almost always 0, so the timestamp arrays will fit in RAM.
# Loop across the particles
timestamps = [np.nonzero(c) for c in counts]
I do step 1 once, then repeat step 2-3 many (~100) times. In the future I may need to pre-process emission (apply cumsum or other functions) before computing counts.
Question
I have a working in-memory implementation and I'm trying to understand what is the best approach to implement an out-of-core version that can scale to (much) longer simulations.
What I would like to exist
I need to save arrays to a file, and I would like to use a single file for a simulation. I also need a "simple" way to store and recall a dictionary of simulation parameters (scalars).
Ideally I would like a file-backed numpy array that I can preallocate and fill in chunks. Then, I would like the numpy array methods (max, cumsum, ...) to work transparently, requiring only a chunksize keyword to specify how much of the array to load at each iteration.
Even better, I would like a Numexpr that operates not between cache and RAM but between RAM and hard drive.
What are the practical options
As a first option
I started experimenting with pyTables, but I'm not happy with its complexity and abstractions (so different from numpy). Moreover my current solution (read below) is UGLY and not very efficient.
So my options for which I seek an answer are
implement a numpy array with required functionality (how?)
use pytable in a smarter way (different data-structures/methods)
use another library: h5py, blaze, pandas... (I haven't tried any of them so far).
Tentative solution (pyTables)
I save the simulation parameters in '/parameters' group: each parameter is converted to a numpy array scalar. Verbose solution but it works.
I save emission as an Extensible array (EArray), because I generate the data in chunks and I need to append each new chunk (I know the final size though). Saving counts is more problematic. If I save it as a PyTables array, it's difficult to perform queries like "counts >= 2". Therefore I saved counts as multiple tables (one per particle) [UGLY] and I query with .get_where_list('counts >= 2'). I'm not sure this is space-efficient, and generating all these tables instead of using a single array significantly clutters the HDF5 file. Moreover, strangely enough, creating those tables requires creating a custom dtype (even for standard numpy dtypes):
dt = np.dtype([('counts', 'u1')])
for ip in xrange(n_particles):
    name = "particle_%d" % ip
    data_file.create_table(
        group, name, description=dt, chunkshape=chunksize,
        expectedrows=time_size,
        title='Binned timetrace of emitted ph (bin = t_step)'
              ' - particle_%d' % particle)
Each particle-counts "table" has a different name (name = "particle_%d" % ip), and I need to put them in a Python list for easy iteration.
EDIT: The result of this question is a Brownian Motion simulator called PyBroMo.
Dask.array can perform chunked operations like max, cumsum, etc. on an on-disk array like PyTables or h5py.
import h5py
d = h5py.File('myfile.hdf5')['/data']
import dask.array as da
x = da.from_array(d, chunks=(1000, 1000))
x looks and feels like a numpy array and copies much of the API. Operations on x create a DAG of in-memory operations that execute efficiently using multiple cores, streaming from disk as necessary.
da.exp(x).mean(axis=0).compute()
http://dask.pydata.org/en/latest/
conda install dask
or
pip install dask
See here for how to store your parameters in the HDF5 file (it pickles, so you can store them how you have them; there is a 64kb limit on the size of the pickle).
import pandas as pd
import numpy as np
n_particles = 10
chunk_size = 1000
# 1) create a new emission file, compressing as we go
emission = pd.HDFStore('emission.hdf',mode='w',complib='blosc')
# generate simulated data
for i in range(10):
    df = pd.DataFrame(np.abs(np.random.randn(chunk_size,n_particles)),dtype='float32')
    # create a globally unique index (time)
    # http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of-data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397
    try:
        nrows = emission.get_storer('df').nrows
    except:
        nrows = 0
    df.index = pd.Series(df.index) + nrows
    emission.append('df',df)
emission.close()
# 2) create counts
cs = pd.HDFStore('counts.hdf',mode='w',complib='blosc')
# this is an iterator, can be any size
for df in pd.read_hdf('emission.hdf','df',chunksize=200):
    counts = pd.DataFrame(np.random.poisson(lam=df).astype(np.uint8))
    # set the index as the same
    counts.index = df.index
    # store the sum across all particles (as most are zero this will be a
    # nice sub-selector
    # better maybe to have multiple of these sums that divide the particle space
    # you don't have to do this but prob more efficient
    # you can do this in another file if you want/need
    counts['particles_0_4'] = counts.iloc[:,0:4].sum(1)
    counts['particles_5_9'] = counts.iloc[:,5:9].sum(1)
    # make the non_zero column indexable
    cs.append('df',counts,data_columns=['particles_0_4','particles_5_9'])
cs.close()
# 3) find interesting counts
print(pd.read_hdf('counts.hdf','df',where='particles_0_4>0'))
print(pd.read_hdf('counts.hdf','df',where='particles_5_9>0'))
Alternatively, you can make each particle a data_column and select on them individually.
and some output (pretty active emission in this case :)
0 1 2 3 4 5 6 7 8 9 particles_0_4 particles_5_9
0 2 2 2 3 2 1 0 2 1 0 9 4
1 1 0 0 0 1 0 1 0 3 0 1 4
2 0 2 0 0 2 0 0 1 2 0 2 3
3 0 0 0 1 1 0 0 2 0 3 1 2
4 3 1 0 2 1 0 0 1 0 0 6 1
5 1 0 0 1 0 0 0 3 0 0 2 3
6 0 0 0 1 1 0 1 0 0 0 1 1
7 0 2 0 2 0 0 0 0 2 0 4 2
8 0 0 0 1 3 0 0 0 0 1 1 0
10 1 0 0 0 0 0 0 0 0 1 1 0
11 0 0 1 1 0 2 0 1 2 1 2 5
12 0 2 2 4 0 0 1 1 0 1 8 2
13 0 2 1 0 0 0 0 1 1 0 3 2
14 1 0 0 0 0 3 0 0 0 0 1 3
15 0 0 0 1 1 0 0 0 0 0 1 0
16 0 0 0 4 3 0 3 0 1 0 4 4
17 0 2 2 3 0 0 2 2 0 2 7 4
18 0 1 2 1 0 0 3 2 1 2 4 6
19 1 1 0 0 0 0 1 2 1 1 2 4
20 0 0 2 1 2 2 1 0 0 1 3 3
22 0 1 2 2 0 0 0 0 1 0 5 1
23 0 2 4 1 0 1 2 0 0 2 7 3
24 1 1 1 0 1 0 0 1 2 0 3 3
26 1 3 0 4 1 0 0 0 2 1 8 2
27 0 1 1 4 0 1 2 0 0 0 6 3
28 0 1 0 0 0 0 0 0 0 0 1 0
29 0 2 0 0 1 0 1 0 0 0 2 1
30 0 1 0 2 1 2 0 2 1 1 3 5
31 0 0 1 1 1 1 1 0 1 1 2 3
32 3 0 2 1 0 0 1 0 1 0 6 2
33 1 3 1 0 4 1 1 0 1 4 5 3
34 1 1 0 0 0 0 0 3 0 1 2 3
35 0 1 0 0 1 1 2 0 1 0 1 4
36 1 0 1 0 1 2 1 2 0 1 2 5
37 0 0 0 1 0 0 0 0 3 0 1 3
38 2 5 0 0 0 3 0 1 0 0 7 4
39 1 0 0 2 1 1 3 0 0 1 3 4
40 0 1 0 0 1 0 0 4 2 2 1 6
41 0 3 3 1 1 2 0 0 2 0 7 4
42 0 1 0 2 0 0 0 0 0 1 3 0
43 0 0 2 0 5 0 3 2 1 1 2 6
44 0 2 0 1 0 0 1 0 0 0 3 1
45 1 0 0 2 0 0 0 1 4 0 3 5
46 0 2 0 0 0 0 0 1 1 0 2 2
48 3 0 0 0 0 1 1 0 0 0 3 2
50 0 1 0 1 0 1 0 0 2 1 2 3
51 0 0 2 0 0 0 2 3 1 1 2 6
52 0 0 2 3 2 3 1 0 1 5 5 5
53 0 0 0 2 1 1 0 0 1 1 2 2
54 0 1 2 2 2 0 1 0 2 0 5 3
55 0 2 1 0 0 0 0 0 3 2 3 3
56 0 1 0 0 0 2 2 0 1 1 1 5
57 0 0 0 1 1 0 0 1 0 0 1 1
58 6 1 2 0 2 2 0 0 0 0 9 2
59 0 1 1 0 0 0 0 0 2 0 2 2
60 2 0 0 0 1 0 0 1 0 1 2 1
61 0 0 3 1 1 2 0 0 1 0 4 3
62 2 0 1 0 0 0 0 1 2 1 3 3
63 2 0 1 0 1 0 1 0 0 0 3 1
65 0 0 1 0 0 0 1 5 0 1 1 6
.. .. .. .. .. .. .. .. .. .. ... ...
[9269 rows x 12 columns]
PyTable Solution
Since functionality provided by Pandas is not needed, and the processing is much slower (see notebook below), the best approach is using PyTables or h5py directly. I've tried only the pytables approach so far.
All tests were performed in this notebook:
Python particles simulator: numpy out-of-core processing
Introduction to pytables data-structures
Reference: Official PyTables Docs
PyTables can store data in HDF5 files in two kinds of format: arrays and tables.
Arrays
There are 3 types of arrays: Array, CArray and EArray. They all allow storing and retrieving (multidimensional) slices with a notation similar to numpy slicing.
# Write data to store (broadcasting works)
array1[:] = 3
# Read data from store
in_ram_array = array1[:]
For optimization in some use cases, CArray is saved in "chunks", whose size can be chosen with chunkshape at creation time.
Array and CArray size is fixed at creation time, although you can fill/write the array chunk-by-chunk after creation. Conversely, EArray can be extended with the .append() method.
Tables
The table is quite a different beast. It's basically a "table": you have only a 1D index and each element is a row. Inside each row there are the "column" data types, and each column can have a different type. If you are familiar with numpy record arrays, a table is basically a 1D record array, with each element having as many fields as there are columns.
1D or 2D numpy arrays can be stored in tables but it's a bit more tricky: we need to create a row data type. For example, to store a 1D uint8 numpy array we need to do:
table_uint8 = np.dtype([('field1', 'u1')])
table_1d = data_file.create_table('/', 'array_1d', description=table_uint8)
So why use tables? Because, unlike arrays, tables can be efficiently queried. For example, if we want to search for elements > 3 in a huge disk-based table we can do:
index = table_1d.get_where_list('field1 > 3')
Not only is it simple (compared with arrays, where we need to scan the whole file in chunks and build the index in a loop) but it is also extremely fast.
How to store simulation parameters
The best way to store simulation parameters is to use a group (i.e. /parameters), convert each scalar to a numpy array and store it as a CArray.
Array for "emission"
emission is the biggest array, and it is generated and read sequentially. For this usage pattern a good data structure is EArray. On "simulated" data with ~50% zero elements, blosc compression (level=5) achieves a 2.2x compression ratio. I found that a chunk size of 2^18 (256k) gives the minimum processing time.
Storing "counts"
Also storing "counts" will increase the file size by 10% and will take 40% more time to compute timestamps. Having counts stored is not an advantage per se, because only the timestamps are needed in the end.
The advantage is that reconstructing the index (timestamps) is simpler because we query the full time axis in a single command (.get_where_list('counts >= 1')). Conversely, with chunked processing, we need to perform some index arithmetic that is a bit tricky, and maybe a burden to maintain.
However, the code complexity may be small compared to all the other operations (sorting and merging) that are needed in both cases.
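The index arithmetic needed for chunked processing can be sketched as follows. This is only an illustration: an in-memory NumPy array stands in for the on-disk counts of one particle, and the names timestamps_chunked and chunk_size are made up for the example. Positions found inside a chunk are local to that chunk, so they have to be shifted by the chunk's start offset to become global time indices.

```python
import numpy as np

def timestamps_chunked(counts, chunk_size):
    """Indices of the nonzero counts of one particle, found chunk by chunk."""
    found = []
    for start in range(0, counts.shape[0], chunk_size):
        # With a disk-backed array this slice would be a single read
        chunk = counts[start:start + chunk_size]
        local = np.nonzero(chunk)[0]   # positions inside this chunk
        found.append(local + start)    # shift to the global time axis
    if not found:
        return np.array([], dtype=np.intp)
    return np.concatenate(found)

rng = np.random.default_rng(0)
counts = rng.poisson(lam=0.01, size=10_000).astype(np.uint8)

chunked = timestamps_chunked(counts, chunk_size=1_024)
# Same result as scanning the whole array at once
assert np.array_equal(chunked, np.nonzero(counts)[0])
```

The chunk size does not have to divide the array length; the last slice is simply shorter.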
Storing "timestamps"
Timestamps can be accumulated in RAM. However, we don't know the array sizes before starting, and a final hstack() call is needed to "merge" the different chunks stored in a list. This doubles the memory requirements, so the RAM may be insufficient.
We can store timestamps as we go to a table using .append(). At the end we can load the table in memory with .read(). This is only 10% slower than the all-in-memory computation but avoids the "double-RAM" requirement. Moreover, we can avoid the final full load and have minimal RAM usage.
H5Py
H5py is a much simpler library than pytables. For this use case of (mainly) sequential processing it seems a better fit than pytables. The only missing feature is 'blosc' compression. Whether this results in a big performance penalty remains to be tested.
Use OpenMM to simulate particles (https://github.com/SimTk/openmm) and MDTraj (https://github.com/rmcgibbo/mdtraj) to handle trajectory IO.
The pytables vs pandas.HDFStore tests in the accepted answer are completely misleading:
The first critical error is that the timing did not apply os.fsync after flush, which makes the speed test unstable. So sometimes the pytables function is accidentally much faster.
The 2nd problem is that the pytables and pandas versions have completely different shapes, due to a misunderstanding of the pytables.EArray shape arg. The author tries to append columns in the pytables version but appends rows in the pandas version.
The 3rd problem is that the author used different chunkshapes during the comparison.
The author also forgot to disable the table index generation during store.append(), which is a time-consuming process.
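To illustrate the first point, here is a minimal sketch that uses only the standard library (it is not the benchmark code). flush() only hands the buffered data to the operating system, while os.fsync() asks the OS to commit it to the storage device, so a timing taken without fsync can stop the clock before the data is actually on disk. The helper name timed_write and the payload size are arbitrary choices for the example.

```python
import os
import tempfile
from time import time

def timed_write(path, payload, sync):
    """Write payload to path and return the elapsed time in seconds."""
    t0 = time()
    with open(path, 'wb') as f:
        f.write(payload)
        f.flush()                   # push Python's buffer to the OS
        if sync:
            os.fsync(f.fileno())    # ask the OS to commit the data to disk
    return time() - t0

payload = os.urandom(8 * 1024 * 1024)  # 8 MiB of arbitrary data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data.bin')
    dt_nosync = timed_write(path, payload, sync=False)
    dt_sync = timed_write(path, payload, sync=True)
    size = os.path.getsize(path)

print('no fsync: %.4f s, fsync: %.4f s' % (dt_nosync, dt_sync))
```

The actual numbers depend heavily on the OS, the file system and the disk, which is why an unsynced measurement can fluctuate from run to run.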
The following table shows the performance results from his version and my fixes.
tbold is his pytables version, pdold is his pandas version. tbsync and pdsync are his versions with fsync() after flush() and with the table index generation disabled during append. tbopt and pdopt are my optimized versions, with blosc:lz4 and complevel 9.
| name | dt | data size [MB] | comp ratio % | chunkshape | shape | clib | indexed |
|:-------|-----:|-----------------:|---------------:|:-------------|:--------------|:----------------|:----------|
| tbold | 5.11 | 300.00 | 84.63 | (15, 262144) | (15, 5242880) | blosc[5][1] | False |
| pdold | 8.39 | 340.00 | 39.26 | (1927,) | (5242880,) | blosc[5][1] | True |
| tbsync | 7.47 | 300.00 | 84.63 | (15, 262144) | (15, 5242880) | blosc[5][1] | False |
| pdsync | 6.97 | 340.00 | 39.27 | (1927,) | (5242880,) | blosc[5][1] | False |
| tbopt | 4.78 | 300.00 | 43.07 | (4369, 15) | (5242880, 15) | blosc:lz4[9][1] | False |
| pdopt | 5.73 | 340.00 | 38.53 | (3855,) | (5242880,) | blosc:lz4[9][1] | False |
pandas.HDFStore uses pytables under the hood. Thus, if we use them correctly, there should be no difference at all.
We can see the pandas version has a larger data size. This is because pandas uses pytables.Table instead of EArray, and a pandas.DataFrame always has an index column. The first column of the Table object is this DataFrame index, which requires some extra space to save. This only affects IO performance a little but provides more features, such as out-of-core queries. So I still recommend pandas here. @MRocklin also mentioned dask, a nicer out-of-core package if most of the features you use are just array operations instead of table-like queries. But the IO performance won't be noticeably different.
h5f.root.emission._v_attrs
Out[82]:
/emission._v_attrs (AttributeSet), 15 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := [],
encoding := 'UTF-8',
index_cols := [(0, 'index')],
info := {1: {'names': [None], 'type': 'RangeIndex'}, 'index': {}},
levels := 1,
metadata := [],
nan_rep := 'nan',
non_index_axes := [(1, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])],
pandas_type := 'frame_table',
pandas_version := '0.15.2',
table_type := 'appendable_frame',
values_cols := ['values_block_0']]
Here are the functions:
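import os
from time import time
import numpy as np
import pandas as pd
import tables as pytb  # assumed alias: PyTables is referred to as pytb below
def generate_emission(shape):
    """Generate fake emission."""
    emission = np.random.randn(*shape).astype('float32') - 1
    emission.clip(0, 1e6, out=emission)
    assert (emission >=0).all()
    return emission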
def test_puretb_earray(outpath,
                       n_particles = 15,
                       time_chunk_size = 2**18,
                       n_iter = 20,
                       sync = True,
                       clib = 'blosc',
                       clevel = 5,
                       ):
    time_size = n_iter * time_chunk_size
    data_file = pytb.open_file(outpath, mode="w")
    comp_filter = pytb.Filters(complib=clib, complevel=clevel)
    emission = data_file.create_earray('/', 'emission', atom=pytb.Float32Atom(),
                                       shape=(n_particles, 0),
                                       chunkshape=(n_particles, time_chunk_size),
                                       expectedrows=n_iter * time_chunk_size,
                                       filters=comp_filter)
    # generate simulated emission data
    t0 =time()
    for i in range(n_iter):
        emission_chunk = generate_emission((n_particles, time_chunk_size))
        emission.append(emission_chunk)
    emission.flush()
    if sync:
        os.fsync(data_file.fileno())
    data_file.close()
    t1 = time()
    return t1-t0
def test_puretb_earray2(outpath,
                        n_particles = 15,
                        time_chunk_size = 2**18,
                        n_iter = 20,
                        sync = True,
                        clib = 'blosc',
                        clevel = 5,
                        ):
    time_size = n_iter * time_chunk_size
    data_file = pytb.open_file(outpath, mode="w")
    comp_filter = pytb.Filters(complib=clib, complevel=clevel)
    emission = data_file.create_earray('/', 'emission', atom=pytb.Float32Atom(),
                                       shape=(0, n_particles),
                                       expectedrows=time_size,
                                       filters=comp_filter)
    # generate simulated emission data
    t0 =time()
    for i in range(n_iter):
        emission_chunk = generate_emission((time_chunk_size, n_particles))
        emission.append(emission_chunk)
    emission.flush()
    if sync:
        os.fsync(data_file.fileno())
    data_file.close()
    t1 = time()
    return t1-t0
def test_purepd_df(outpath,
                   n_particles = 15,
                   time_chunk_size = 2**18,
                   n_iter = 20,
                   sync = True,
                   clib='blosc',
                   clevel=5,
                   autocshape=False,
                   oldversion=False,
                   ):
    time_size = n_iter * time_chunk_size
    emission = pd.HDFStore(outpath, mode='w', complib=clib, complevel=clevel)
    # generate simulated data
    t0 =time()
    for i in range(n_iter):
        # Generate fake emission
        emission_chunk = generate_emission((time_chunk_size, n_particles))
        df = pd.DataFrame(emission_chunk, dtype='float32')
        # create a globally unique index (time)
        # http://stackoverflow.com/questions/16997048/how-does-one-append-large-
        # amounts-of-data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397
        try:
            nrows = emission.get_storer('emission').nrows
        except:
            nrows = 0
        df.index = pd.Series(df.index) + nrows
        if autocshape:
            emission.append('emission', df, index=False,
                            expectedrows=time_size
                            )
        else:
            if oldversion:
                emission.append('emission', df)
            else:
                emission.append('emission', df, index=False)
    emission.flush(fsync=sync)
    emission.close()
    t1 = time()
    return t1-t0
def _test_puretb_earray_nosync(outpath):
    return test_puretb_earray(outpath, sync=False)
def _test_purepd_df_nosync(outpath):
    return test_purepd_df(outpath, sync=False,
                          oldversion=True
                          )
def _test_puretb_earray_opt(outpath):
    return test_puretb_earray2(outpath,
                               sync=False,
                               clib='blosc:lz4',
                               clevel=9
                               )
def _test_purepd_df_opt(outpath):
    return test_purepd_df(outpath,
                          sync=False,
                          clib='blosc:lz4',
                          clevel=9,
                          autocshape=True
                          )
testfns = {
    'tbold':_test_puretb_earray_nosync,
    'pdold':_test_purepd_df_nosync,
    'tbsync':test_puretb_earray,
    'pdsync':test_purepd_df,
    'tbopt': _test_puretb_earray_opt,
    'pdopt': _test_purepd_df_opt,
}

Setting values in a matrix in bulk

The question is about bulk-changing values in a matrix based on data contained in a vector.
Suppose I have a 5x4 matrix of zeroes.
octave> Z = zeros(5,4)
Z =
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
And a column vector of length equal to the number of rows in Z, that is, 5. The rows in the vector y correspond to rows in the matrix Z.
octave> y = [1; 3; 2; 1; 3]
y =
1
3
2
1
3
What I want is to set 1's in the matrix Z in the columns whose indices are contained as values in the corresponding row of the vector y. Namely, I'd like to have Z matrix like this:
Z = # y =
1 0 0 0 # <-- 1 st column
0 0 1 0 # <-- 3 rd column
0 1 0 0 # <-- 2 nd column
1 0 0 0 # <-- 1 st column
0 0 1 0 # <-- 3 rd column
Is there a concise way of doing it? I know I can implement it using a loop over y, but I have a feeling Octave could have a more laconic way. I am new to Octave.
Since Octave has automatic broadcasting (you'll need Octave 3.6.0 or later), the easiest way I can think of is to use it with a comparison. Here's how:
octave> 1:5 == [1 3 2 1 3]'
ans =
1 0 0 0 0
0 0 1 0 0
0 1 0 0 0
1 0 0 0 0
0 0 1 0 0
Broadcasting is explained in the Octave manual, but Scipy also has a good explanation of it, with nice pictures.
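For readers who know broadcasting from NumPy, the same comparison can be sketched in Python. This is a translation for illustration only, not Octave code: it assumes y holds 1-based column numbers as in the question, and n_cols = 4 is chosen to match the 5x4 matrix Z. A row of column numbers is compared against y reshaped into a column, and broadcasting expands both to a 5x4 grid.

```python
import numpy as np

y = np.array([1, 3, 2, 1, 3])   # 1-based column numbers, one per row
n_cols = 4                      # number of columns of Z in the question

# Shape (4,) compared with shape (5, 1) broadcasts to shape (5, 4)
Z = (np.arange(1, n_cols + 1) == y[:, None]).astype(int)
print(Z)
```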
Found another solution that does not use broadcasting. It does not need a matrix of zeroes either.
octave> y = [1; 3; 2; 1; 3]
octave> eye(5)(y,:)
ans =
1 0 0 0 0
0 0 1 0 0
0 1 0 0 0
1 0 0 0 0
0 0 1 0 0
Relevant reading here:
http://www.gnu.org/software/octave/doc/interpreter/Creating-Permutation-Matrices.html