Adjusting intervals in Pandas

I created intervals in pandas for a frequency table. The first interval looks like this: (22, 29]
and is open on the left. I want just this first interval to be closed on both sides, like this: [22, 29]. I tried intervals[0].closed = "both", but it did not work.
intervals = pd.interval_range(start=22, end=64, freq=7)
c_freq_table = pd.Series([0, 0, 0, 0, 0, 0], index=intervals)
for x in df.loc[df.loc[:, "c"].notnull(), "c"]:
    for y in c_freq_table.index:
        if int(x) in y:
            c_freq_table.loc[y] += 1
            break

You have to construct your own interval index with a list comprehension (or loop):
intervals = [pd.Interval(i.left, i.right)
             if no != 0 else pd.Interval(i.left, i.right, closed='both')
             for (no, i) in enumerate(intervals)]
intervals
Output:
[Interval(22, 29, closed='both'),
Interval(29, 36, closed='right'),
Interval(36, 43, closed='right'),
Interval(43, 50, closed='right'),
Interval(50, 57, closed='right'),
Interval(57, 64, closed='right')]
Note: A simpler solution might seem just to change the first element like:
new_first_elem = pd.Interval(intervals[0].left, intervals[0].right, closed='both')
intervals[0] = new_first_elem
However, when intervals is still the original IntervalIndex, this code raises a TypeError, because pandas index objects are immutable:
TypeError: Index does not support mutable operations
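As a minimal sketch of how the mixed intervals could then be used for counting, assuming a plain Python list is acceptable in place of an IntervalIndex. The sample ages and the names mixed and counts are made up for illustration and do not come from the question.

```python
import pandas as pd

# Same bins as in the question: (22, 29], (29, 36], ..., (57, 64]
intervals = pd.interval_range(start=22, end=64, freq=7)

# Copy into a list (lists are mutable) and close the first interval on both sides
mixed = list(intervals)
mixed[0] = pd.Interval(mixed[0].left, mixed[0].right, closed="both")

# Hypothetical sample data; 100 falls outside every interval
ages = [22, 29, 30, 64, 100]

# `value in interval` respects the closed side(s) of each interval
counts = [sum(1 for age in ages if age in iv) for iv in mixed]
print(counts)  # [2, 1, 0, 0, 0, 1]
```

The value 22 is counted only because the first interval is now closed on the left; with the original (22, 29] it would fall outside every bin.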

Related

Find the index of the nearest array element greater than a value

I have a sorted array.
x = [1, 10, 12, 16, 19, 20, 21, ....]
For any given number y in the range [x[0], x[-1]], I want to find the index of the nearest element greater than y. For example, if y = 0 it returns 0, and if y = 18 it returns 4.
Is there a function available?
Without any external library, you can use bisect:
import bisect
i = bisect.bisect_right(x, y)
i will be the index of the element you want.
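A short sketch of the bisect approach on the sample data, assuming the list is sorted in ascending order. The helper name first_greater and the choice to return None when no element is greater are illustrative assumptions, not part of the question.

```python
import bisect

x = [1, 10, 12, 16, 19, 20, 21]

def first_greater(sorted_values, y):
    """Index of the first element strictly greater than y, or None if there is none."""
    i = bisect.bisect_right(sorted_values, y)
    return i if i < len(sorted_values) else None

print(first_greater(x, 0))   # 0
print(first_greater(x, 18))  # 4
print(first_greater(x, 16))  # 4, because 16 itself is not greater than 16
print(first_greater(x, 21))  # None, nothing is greater than the last element
```

Without the length check, bisect_right returns len(x) for y >= x[-1], which is not a valid index into x.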
Given the sorted nature, we can use np.searchsorted -
idx = np.searchsorted(x,y,'right')
You can use numpy.argmin on the absolute value of the difference:
import numpy as np
x = np.array([1, 10, 12, 16, 19, 20, 21])
def find_closest(x, y):
    return (np.abs(x - y)).argmin()
for y in [0, 18]:
    print(find_closest(x, y))
0
4
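Note that the argmin approach returns the closest element in either direction, which is not always the nearest greater one. A small check, assuming the same x as above and a hypothetical y = 17 that sits closer to 16 than to 19:

```python
import numpy as np

x = np.array([1, 10, 12, 16, 19, 20, 21])
y = 17  # hypothetical value, closer to 16 (index 3) than to 19 (index 4)

closest = int(np.abs(x - y).argmin())                    # nearest in either direction
next_greater = int(np.searchsorted(x, y, side='right'))  # first element greater than y

print(closest)       # 3
print(next_greater)  # 4
```

For y = 0 and y = 18 the two approaches happen to agree, which is why the sample run above gives the expected indices.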

Finding those elements in an array which are "close"

I have a 1-dimensional sorted array and would like to find all pairs of elements whose difference is no larger than 5.
A naive approach would be to make N^2 comparisons, doing something like
diffs = np.tile(x, (x.size,1) ) - x[:, np.newaxis]
D = np.logical_and(diffs>0, diffs<5)
indicies = np.argwhere(D)
Note here that the output of my example are indices of x. If I wanted the values of x which satisfy the criteria, I could do x[indicies].
This works for smaller arrays, but not arrays of the size with which I work.
An idea I had was to find where there are gaps larger than 5 between consecutive elements. I would split the array into two pieces, and compare all the elements in each piece.
Is this a more efficient way of finding elements which satisfy my criteria? How could I go about writing this?
Here is a small example:
x = np.array([ 9, 12,
21,
36, 39, 44, 46, 47,
58,
64, 65,])
the result should look like
array([[ 0, 1],
[ 3, 4],
[ 5, 6],
[ 5, 7],
[ 6, 7],
[ 9, 10]], dtype=int64)
Here is a solution that iterates over offsets while shrinking the set of candidates until there are none left:
import numpy as np
def f_pp(A, maxgap):
    d0 = np.diff(A)
    d = d0.copy()
    IDX = []
    k = 1
    idx, = np.where(d <= maxgap)
    vidx = idx[d[idx] > 0]
    while vidx.size:
        IDX.append(vidx[:, None] + (0, k))
        if idx[-1] + k + 1 == A.size:
            idx = idx[:-1]
        d[idx] = d[idx] + d0[idx+k]
        k += 1
        idx = idx[d[idx] <= maxgap]
        vidx = idx[d[idx] > 0]
    return np.concatenate(IDX, axis=0)
data = np.cumsum(np.random.exponential(size=10000)).repeat(np.random.randint(1, 20, (10000,)))
pairs = f_pp(data, 1)
#pairs = set(map(tuple, pairs))
from timeit import timeit
kwds = dict(globals=globals(), number=100)
print(data.size, 'points', pairs.shape[0], 'close pairs')
print('pp', timeit("f_pp(data, 1)", **kwds)*10, 'ms')
Sample run:
99963 points 1020651 close pairs
pp 43.00256529124454 ms
Your idea of slicing the array is a very efficient approach. Since your data are sorted, you can just calculate the differences and split the array at the large gaps:
d = np.diff(x)
ind = np.where(d > 5)[0]
pieces = np.split(x, ind + 1)  # d[i] is the gap between x[i] and x[i+1], so split after index i
Here pieces is a list of sub-arrays; you can loop over it and run your own code on each piece.
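To illustrate the splitting idea on the small example from the question, here is a sketch that assumes the split points are taken one position after each large gap (ind + 1), since the i-th difference sits between x[i] and x[i+1]:

```python
import numpy as np

x = np.array([9, 12, 21, 36, 39, 44, 46, 47, 58, 64, 65])

d = np.diff(x)                  # gaps between consecutive elements
ind = np.where(d > 5)[0]        # positions of gaps larger than 5
pieces = np.split(x, ind + 1)   # start a new piece right after each large gap

for piece in pieces:
    print(piece.tolist())
# [9, 12]
# [21]
# [36, 39, 44, 46, 47]
# [58]
# [64, 65]
```

Pairs only need to be searched for within a piece, because any two elements from different pieces are separated by a gap larger than 5.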
The best algorithm depends heavily on the nature of your data, which I'm unaware of. For example, another possibility is to write a nested loop:
pairs = []
for i in range(x.size):
    j = i + 1
    # check the bound first so x[j] is never read past the end of the array
    while j < x.size and x[j] - x[i] <= 5:
        pairs.append([i, j])
        j += 1
If you want it to be more clever, you can change the outer loop so that it jumps ahead when j hits a gap.
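A vectorized alternative to the nested loop can be sketched with np.searchsorted. This is not taken from either answer above. It assumes x is sorted and strictly increasing (no duplicates), and it uses a strict "less than maxgap" comparison to reproduce the expected output in the question; the function name close_pairs is made up for illustration.

```python
import numpy as np

def close_pairs(x, maxgap):
    """Index pairs (i, j), i < j, with x[j] - x[i] < maxgap, for sorted, strictly increasing x."""
    x = np.asarray(x)
    n = x.size
    # hi[i] is the first index whose value is >= x[i] + maxgap
    hi = np.searchsorted(x, x + maxgap, side='left')
    # number of partners j > i for each i
    counts = hi - np.arange(n) - 1
    i = np.repeat(np.arange(n), counts)
    # offset of each pair inside its own group: 0, 1, ..., counts[i] - 1
    starts = np.cumsum(counts) - counts
    offset = np.arange(counts.sum()) - np.repeat(starts, counts)
    j = i + 1 + offset
    return np.column_stack([i, j])

x = np.array([9, 12, 21, 36, 39, 44, 46, 47, 58, 64, 65])
pairs = close_pairs(x, 5)
print(pairs.tolist())
# [[0, 1], [3, 4], [5, 6], [5, 7], [6, 7], [9, 10]]
```

The work is proportional to the number of pairs found plus one binary search per element, so the N^2 difference matrix is never built. Switching to side='right' would also include pairs whose difference is exactly maxgap.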

Group numpy into multiple sub-arrays using an array of values

I have an array of points along a line:
a = np.array([18, 56, 32, 75, 55, 55])
I have another array that corresponds to the indices I want to use to access the information in a (they will always have equal lengths). Neither array a nor array b is sorted.
b = np.array([0, 2, 3, 2, 2, 2])
I want to group a into multiple sub-arrays such that the following would be possible:
c[0] -> array([18])
c[2] -> array([56, 75, 55, 55])
c[3] -> array([32])
Although the above example is simple, I will be dealing with millions of points, so efficient methods are preferred. It is also essential that any sub-array of points can be accessed in this fashion later in the program by automated methods.
Here's one approach -
def groupby(a, b):
    # Get argsort indices, to be used to sort a and b in the next steps
    sidx = b.argsort(kind='mergesort')
    a_sorted = a[sidx]
    b_sorted = b[sidx]
    # Get the group limit indices (start, stop of groups)
    cut_idx = np.flatnonzero(np.r_[True,b_sorted[1:] != b_sorted[:-1],True])
    # Split input array with those start, stop ones
    out = [a_sorted[i:j] for i,j in zip(cut_idx[:-1],cut_idx[1:])]
    return out
A simpler, but less efficient, approach would be to use np.split to replace the last few lines and get the output, like so -
out = np.split(a_sorted, np.flatnonzero(b_sorted[1:] != b_sorted[:-1])+1 )
Sample run -
In [38]: a
Out[38]: array([18, 56, 32, 75, 55, 55])
In [39]: b
Out[39]: array([0, 2, 3, 2, 2, 2])
In [40]: groupby(a, b)
Out[40]: [array([18]), array([56, 75, 55, 55]), array([32])]
To get sub-arrays covering the entire range of IDs in b -
def groupby_perID(a, b):
    # Get argsort indices, to be used to sort a and b in the next steps
    sidx = b.argsort(kind='mergesort')
    a_sorted = a[sidx]
    b_sorted = b[sidx]
    # Get the group limit indices (start, stop of groups)
    cut_idx = np.flatnonzero(np.r_[True,b_sorted[1:] != b_sorted[:-1],True])
    # Create cut indices for all unique IDs in b
    n = b_sorted[-1]+2
    cut_idxe = np.full(n, cut_idx[-1], dtype=int)
    insert_idx = b_sorted[cut_idx[:-1]]
    cut_idxe[insert_idx] = cut_idx[:-1]
    cut_idxe = np.minimum.accumulate(cut_idxe[::-1])[::-1]
    # Split input array with those start, stop ones
    out = [a_sorted[i:j] for i,j in zip(cut_idxe[:-1],cut_idxe[1:])]
    return out
Sample run -
In [241]: a
Out[241]: array([18, 56, 32, 75, 55, 55])
In [242]: b
Out[242]: array([0, 2, 3, 2, 2, 2])
In [243]: groupby_perID(a, b)
Out[243]: [array([18]), array([], dtype=int64),
array([56, 75, 55, 55]), array([32])]
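If lookups by ID such as c[2] are the main goal, one possible variation is to return a dictionary keyed by the values in b. This is only a sketch under that assumption; the name group_to_dict is made up here, and it uses the same stable-sort idea as the functions above.

```python
import numpy as np

def group_to_dict(a, b):
    """Map each distinct value in b to the sub-array of a found at those positions."""
    # Stable sort keeps the original order of a within each group
    order = np.argsort(b, kind='stable')
    a_sorted, b_sorted = a[order], b[order]
    # Distinct IDs and the position where each group starts in the sorted arrays
    ids, starts = np.unique(b_sorted, return_index=True)
    groups = np.split(a_sorted, starts[1:])
    return dict(zip(ids.tolist(), groups))

a = np.array([18, 56, 32, 75, 55, 55])
b = np.array([0, 2, 3, 2, 2, 2])
c = group_to_dict(a, b)

print(c[0])  # [18]
print(c[2])  # [56 75 55 55]
print(c[3])  # [32]
```

IDs that never occur in b (1 in this example) are simply absent from the dictionary, so c.get(1) can be used when a missing group should not raise an error.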

Pandas select data in q quantile

I have a pandas time series ts = pd.TimeSeries(np.random.normal(0, 1, 100)) and I want to select only the samples in the first q-1 quantiles.
I am able to get the quantile intervals with pd.qcut(ts, 10), but how can I select only the samples in the first 9 quantiles?
Use the labels=False option in the qcut() function.
ts = pd.DataFrame(pd.TimeSeries(np.random.normal(0, 1, 100)))
ts[1] = pd.qcut(ts[0], 10, labels=False)
ts.loc[ts[1] < 9]
You could label your quantiles with integers, join it to the dataframe and write a boolean expression to select:
quantiles = pd.qcut(ts, 10, labels=range(10))
quantiles.name = 'quantiles'
df = pd.DataFrame(ts).join(quantiles)
df[df['quantiles'] < 9]
pd.TimeSeries is deprecated; just use pd.Series:
ts = pd.Series(np.random.normal(0, 1, 100))
ts[pd.qcut(ts, 10, labels=False) < 9]
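A quick way to check the selection is to count what remains. The sketch below assumes a seeded random generator (the seed 0 is arbitrary) so the run is repeatable, and relies on the 100 sampled values being distinct, so that each of the 10 quantile bins holds exactly 10 samples.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is repeatable; the seed is an arbitrary choice
rng = np.random.default_rng(0)
ts = pd.Series(rng.normal(0, 1, 100))

# Integer bin codes 0..9 instead of interval labels
bins = pd.qcut(ts, 10, labels=False)

# Keep everything except the top decile (bin code 9)
kept = ts[bins < 9]

print(len(kept))               # 90
print(kept.max() < ts.max())   # True: the largest values were dropped
```

For data without ties the same rows can be selected with ts[ts <= ts.quantile(0.9)], which avoids building the bin codes.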

TensorFlow: How to get Intermediate value of a variable in tf.while_loop()?

I need to fetch the intermediate value of a tensor in tf.while_loop(), however, it only gives me the last returned value.
For example, I have a variable x, which has 3 pages and whose dimensions are 3*2*4. Now I want to fetch each page in turn and calculate the total sum, the page sum, and the mean, max and min value of each page. I then define the condition and body functions and want to use tf.while_loop() to calculate the needed results. The source code is below.
import tensorflow as tf
x = tf.constant([[[41, 8, 48, 82],
                  [9, 56, 67, 23]],
                 [[95, 89, 44, 54],
                  [11, 33, 29, 1]],
                 [[34, 9, 5, 70],
                  [14, 35, 18, 17]]], dtype=tf.int32)
def cond(out, count, x):
    return count < 3
def body(out, count, x):
    outTemp = tf.slice(x, [count, 0, 0], [1, -1, -1])
    count += 1
    outPack = tf.unpack(out)
    outPack[0] += tf.reduce_sum(outTemp)
    outPack[1] = tf.reduce_sum(outTemp)
    outPack[2] = tf.reduce_mean(outTemp)
    outPack[3] = tf.reduce_max(outTemp)
    outPack[4] = tf.reduce_min(outTemp)
    out = tf.pack(outPack)
    return out, count, x
out = tf.Variable(tf.constant([0, 0, 0, 0, 0])) # total sum, page sum, mean, max, min
count = tf.Variable(tf.constant(0))
result = tf.while_loop(cond, body, [out, count, x])
init = tf.initialize_all_variables()
with tf.Session() as sess:
    sess.run(init)
    print(sess.run(x))
    print(sess.run(result)[0])
When I run the program, it only gives me the value returned by the last iteration, so I can only get the results for the last page.
So the question is: how can I get the results for each page, and how can I get the intermediate values from tf.while_loop()?
Thank you.
To get the "intermediate value" of any variable, you can simply make use of the tf.Print op which really is an identity operation with the side effect of printing a relevant message when evaluating the aforementioned variable.
As an example,
x = tf.Print(x, [x], "Value of x is: ")
This can be placed on any line where you want the value to be reported.