Index pandas series by an hour

In my code I am currently doing the following kind of operation with Pandas:
ser = oldser.dropna().copy()
for i in range(24):
    ind = ser.groupby(ser.index.hour).get_group(i).index
    ser[ind] = something
This code copies a series, then takes each hour separately and does something to it. This seems very messy though - is there a way to clean it up nicely?
What I really want is something analogous to
series['2011']
which gets all data from 2011, but instead
series['2pm']
getting all data at 2pm.
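There is no '2pm' string indexing for hours, but the closest built-in idioms are a boolean mask on the DatetimeIndex or between_time. A minimal sketch with made-up data:
import pandas as pd
idx = pd.date_range("2011-01-01", periods=100, freq="H")
ser = pd.Series(range(100), index=idx)
two_pm = ser[ser.index.hour == 14]                # everything in the 2pm hour
also_two_pm = ser.between_time("14:00", "14:59")  # same selection via between_time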

Certainly you want to do the groupby operation only once; a slight refactor:
g = ser.groupby(ser.index.hour)
for i, ind in g.indices.items():  # g.indices maps group label -> positional indices
    ser.iloc[ind] = something
But most likely you can do a transform or apply (depending on what something is):
g.transform(something)
g.apply(something)
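For instance, if something is a per-hour normalisation, the whole loop collapses to one line (a sketch; the demeaning function is made up):
g = ser.groupby(ser.index.hour)
ser = g.transform(lambda x: x - x.mean())  # demean each hour-of-day group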

Related

Best way to parallelize multi-table function in Python (using Pandas)

I have this function below that iterates through every row of a data frame (using pandas apply) and determines which values are valid in a prediction-probability matrix (L2) by referencing another data frame (GST) to obtain the valid values for a given row. The function just returns the row with the maximum valid probability assigned to the previously blank value for that row (Predicted Level 2) in the data frame passed to the function (test_x2).
It is not a terribly complex function and it works fine on smaller datasets, but when I scale to 3-5 million records it starts to take far too long. I tried the multiprocessing module as well as dask/numba, but nothing improved the runtime (not sure whether that is simply because the function is not vectorized).
My question is twofold:
1) Is there a better way to write this? (I'm guessing there is)
2) If not, what parallel computing strategies could work with this type of function? I've already tried a number of different python options but I'm just leaning more towards running the larger datasets on totally separate machines at this point. Feel free to provide any suggested code to parallelize something like this. Thanks in advance for any guidance provided.
l2 = MNB.predict_proba(test_x)
l2_classes = MNB.classes_
L2 = pd.DataFrame(l2, columns=MNB.classes_)
test_x2["Predicted Level 2"] = ""
def predict_2(row):
    s = row["Predicted Level 1"]
    s = GST.loc[s, :].reset_index()
    Valid_Level2s = s["GST Level 2"].tolist()
    p2 = L2.loc[row.name, Valid_Level2s]  # .ix is deprecated; row.name and the columns are labels
    row["Predicted Level 2"] = p2.idxmax()  # p2 is a Series, so idxmax takes no axis
    return row
test_x2 = test_x2.apply(predict_2, axis=1)
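A likely rewrite is to drop the row-wise apply and mask the probability matrix once. A sketch, assuming GST is indexed by the Level 1 label (as the .loc lookup above implies) and that L2's rows line up with test_x2's rows:
import numpy as np
# For each Level 1 label, precompute a boolean row over L2's columns marking its valid Level 2s.
valid_map = GST.groupby(level=0)["GST Level 2"].apply(set)
valid_rows = {l1: L2.columns.isin(vals) for l1, vals in valid_map.items()}
# Stack one boolean row per record, mask invalid probabilities to -inf, take the argmax per row.
mask = np.vstack(test_x2["Predicted Level 1"].map(valid_rows).tolist())
masked = np.where(mask, L2.values, -np.inf)
test_x2["Predicted Level 2"] = L2.columns[masked.argmax(axis=1)]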

Custom equations with bygroup and apply in pandas - MemoryError

All,
I am running code to calculate a new variable (newvar) for each constituent (group) in a panel using the apply function:
df['newvar'] = df.groupby('group')['var1'].apply(lambda x : x - x.shift() + df['var2'] - df['var3'])
The code returns a memory error (MemoryError). I think what's happening is that the code generates a large number of separate dataframes, which causes the system to run out of memory because df itself is quite large. I can probably do this with a for-loop, but is there a less verbose and more computationally efficient way of doing this?
Many thanks,
Andres
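Only the shift in that expression actually needs the groupby; the rest is plain column arithmetic, so the intermediate per-group frames can be avoided entirely. A sketch, assuming the intent is a per-group first difference of var1 plus the row-level var2 - var3:
df['newvar'] = (df['var1'] - df.groupby('group')['var1'].shift()
                + df['var2'] - df['var3'])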

Accessing intermediate results from a tensorflow graph

If I have a complex calculation of the form
tmp1 = tf.fun1(placeholder1, placeholder2)
tmp2 = tf.fun2(tmp1, placeholder3)
tmp3 = tf.fun3(tmp2)
ret = tf.fun4(tmp3)
and I calculate
ret_vals = sess.run(ret, feed_dict={placeholder1: vals1, placeholder2: vals2, placeholder3: vals3})
fun1, fun2 etc are possibly costly operations on a lot of data.
If I run to get ret_vals as above, is it possible to access the intermediate values as well, either later or in the same run, without re-computing everything up to that value? For example, to get tmp2 I could re-run everything using
tmp2_vals = sess.run(tmp2, feed_dict={placeholder1: vals1, placeholder2: vals2, placeholder3: vals3})
But this seems like a complete waste of computation. Is there a way to access several of the intermediate results in a graph after performing a single run?
The reason I want to do this is debugging, testing, or logging of progress while ret_vals is calculated, e.g. in an optimization loop. Every step where I run the ret_vals calculation is costly, but I want to see some of the intermediate results that were calculated.
If I do something like
tmp2_vals, ret_vals = sess.run([tmp2, ret], ...)
does this guarantee that the graph will only get run once (instead of once for tmp2 and once for ret), as I want?
Have you looked at tf.Print? It is an identity op with a printing function. You can insert it into your graph right after tmp2 to get its value. Note that by default it only prints the first few entries of each tensor; you can change that with the summarize attribute (first_n instead limits how many times the op logs).
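A minimal sketch of both options, in the TF 1.x style of the question:
# Fetching several tensors in one call runs the graph once; tmp2 falls out
# as a by-product of computing ret.
tmp2_vals, ret_vals = sess.run(
    [tmp2, ret],
    feed_dict={placeholder1: vals1, placeholder2: vals2, placeholder3: vals3})
# Alternatively, splice tf.Print into the graph: it acts as an identity op on
# tmp2 but logs a few values each time it is evaluated.
tmp2 = tf.Print(tmp2, [tmp2], message="tmp2 = ", summarize=5, first_n=10)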

Arrays with attributes in Julia

I am making my first steps in Julia, and I would like to reproduce something I achieved with numpy.
I would like to write a new array-like type which is essentially a vector of elements of arbitrary type and, to keep the example simple, a scalar attribute such as the sampling frequency fs.
I started with something like
type TimeSeries{T} <: DenseVector{T}
    data::Vector{T}
    fs::Float64
end
Ideally, I would like:
1) all methods that take a Vector{T} as argument to also work on a TimeSeries{T},
e.g.:
ts = TimeSeries([1,2,3,1,543,1,24,5], 12.01)
median(ts)
2) that indexing a TimeSeries always returns a TimeSeries:
ts[1:3]
3) built-in functions that return a Vector to return a TimeSeries:
ts * 2
ts + [1,2,3,1,543,1,24,5]
I have started by implementing size, getindex and so on, but I do not see how points 2 and 3 could be made to work.
numpy has a quite comprehensive way of doing this: http://docs.scipy.org/doc/numpy/user/basics.subclassing.html. R also seems to allow attaching attributes to arrays via attr()<-.
Do you have any ideas about the best strategy to implement this sort of "array with attributes"?
Maybe I'm not understanding, but for, say, point 3, why is it not sufficient to do
import Base: *, +  # extend Base's operators for the new type
(*)(ts::TimeSeries, n) = TimeSeries(ts.data * n, ts.fs)
(+)(ts::TimeSeries, n) = TimeSeries(ts.data + n, ts.fs)
As for point 2:
Base.getindex(ts::TimeSeries, r::Range) = TimeSeries(ts.data[r], ts.fs)
Or are you asking for some easier way where you delegate all these operations to the internal vector? You can do clever things like
for op in (:(+), :(*))
    @eval $(op)(ts::TimeSeries, x) = TimeSeries($(op)(ts.data, x), ts.fs)
end

h5py selective read in

I have a problem regarding a selective read-in routine while using h5py.
f = h5py.File('file.hdf5','r')
data = f['Data']
I have several positive values in the 'Data' dataset and also some placeholder values of -9999.
How can I get only the positive values for calculations like np.min?
np.ma.masked_array creates a full copy of the array, and all the benefits of using h5py (regarding memory usage) are lost. The problem is that I get errors if I try to read datasets that exceed 100 million values per dataset using data = f['Data'][:,0].
Or, if that is not possible, is something like this possible?
np.place(data[...], data[...] <= -9999, float('nan'))
Thanks in advance
You could use:
mask = f['Data'] >= 0
data = f['Data'][mask]
although I am not sure how much memory the mask calculation itself uses.
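If even the mask is too big to hold, a chunked pass keeps memory flat. A sketch, assuming 'Data' is one-dimensional here; the chunk size is arbitrary:
import numpy as np
import h5py
with h5py.File('file.hdf5', 'r') as f:
    dset = f['Data']
    current_min = np.inf
    step = 10**6
    for start in range(0, dset.shape[0], step):
        block = dset[start:start + step]  # only this slice is read into memory
        valid = block[block >= 0]         # drop the -9999 placeholders
        if valid.size:
            current_min = min(current_min, valid.min())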