I have a pandas time series ts = pd.TimeSeries(np.random.normal(0, 1, 100)) and I want to select only the samples in the first q-1 quantiles.
I am able to get the quantile intervals with pd.qcut(ts, 10), but how can I select only the samples in the first 9 quantiles?
Use the labels=False option in the qcut() function.
ts = pd.DataFrame(pd.Series(np.random.normal(0, 1, 100)))
ts[1] = pd.qcut(ts[0], 10, labels=False)
ts.loc[ts[1] < 9]
You could label your quantiles with integers, join them to the dataframe, and write a boolean expression to select:
quantiles = pd.qcut(ts, 10, labels=range(10))
quantiles.name = 'quantiles'
df = pd.DataFrame(ts).join(quantiles)
df[df['quantiles'] < 9]
pd.TimeSeries is deprecated. Just use pd.Series
ts = pd.Series(np.random.normal(0, 1, 100))
ts[pd.qcut(ts, 10, labels=False) < 9]
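A quick sanity check (assuming the 100 draws are distinct, which is almost surely true for continuous data, so qcut puts exactly 10 samples in each decile):
import numpy as np
import pandas as pd

ts = pd.Series(np.random.normal(0, 1, 100))
kept = ts[pd.qcut(ts, 10, labels=False) < 9]
print(len(kept))  # 90: the top decile is dropped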
I have a MultiIndex DataFrame as follows:
header = pd.MultiIndex.from_product([['#'],
                                     ['TE', 'SS', 'M', 'MR']])
dat = [[100, 20, 21, 35], [100, 12, 5, 15]]
df = pd.DataFrame(dat, index=['JC', 'TTo'], columns=header)
df = df.stack()
df = df.sort_values('#', ascending=False).sort_index(level=0, sort_remaining=False)
And I want to get the following rows, indexing by position rather than by name, that is, the third row of every level 0 index:
JC M 21
TTo SS 12
Of all that I have tried, what is closest to what I am looking for is:
df.loc[pd.IndexSlice[:, df.index[2]], '#']
But this doesn't work as intended either.
You can do the following:
df["idx"] = df.groupby(level=0).cumcount()
df.loc[df.idx == 2]
One line solution from Quang Hoang:
df[df.groupby(level=0).cumcount() == 2]
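As a quick check with the setup from the question, the one-liner returns exactly the rows asked for:
import pandas as pd

header = pd.MultiIndex.from_product([['#'], ['TE', 'SS', 'M', 'MR']])
df = pd.DataFrame([[100, 20, 21, 35], [100, 12, 5, 15]],
                  index=['JC', 'TTo'], columns=header).stack()
df = df.sort_values('#', ascending=False).sort_index(level=0, sort_remaining=False)
print(df[df.groupby(level=0).cumcount() == 2])  # JC/M 21 and TTo/SS 12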
Another way using df.xs:
df.set_index(df.groupby(level=0).cumcount() + 1, append=True).xs(3, level=2)
#
JC M 21
TTo SS 12
Try with groupby, then:
out = df.groupby(level=0).apply(lambda x: x.iloc[[2]])
Out[141]:
#
JC JC M 21
TTo TTo SS 12
I have a sorted array.
x = [1, 10, 12, 16, 19, 20, 21, ....]
For any given number y in [x[0], x[-1]], I want to find the index of the nearest element greater than y. For example, if y = 0 it returns 0; if y = 18 it returns 4.
Is there a function available?
Without any external library, you can use bisect:
import bisect

i = bisect.bisect_right(x, y)
i will be the index of the element you wanted.
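For example, with the sample data from the question:
import bisect

x = [1, 10, 12, 16, 19, 20, 21]
print(bisect.bisect_right(x, 0))   # 0
print(bisect.bisect_right(x, 18))  # 4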
Given the sorted nature, we can use np.searchsorted:
idx = np.searchsorted(x, y, 'right')
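searchsorted also accepts a whole array of queries at once, e.g. with the question's sample data:
import numpy as np

x = np.array([1, 10, 12, 16, 19, 20, 21])
print(np.searchsorted(x, [0, 18], 'right'))  # [0 4]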
You can use numpy.argmin on the absolute value of the difference:
import numpy as np
x = np.array([1, 10, 12, 16, 19, 20, 21])
def find_closest(x, y):
    return np.abs(x - y).argmin()

for y in [0, 18]:
    print(find_closest(x, y))
0
4
I created intervals in pandas for a frequency table. The first interval looks like this: (22, 29]
and is open from the left - I want just this first interval to be closed from both sides, like this: [22, 29]. I tried intervals[0].closed = "both" but that did not work.
intervals = pd.interval_range(start=22, end=64, freq=7)
c_freq_table = pd.Series([0, 0, 0, 0, 0, 0], index=intervals)
for x in df.loc[df.loc[:, "c"].notnull(), "c"]:
    for y in c_freq_table.index:
        if int(x) in y:
            c_freq_table.loc[y] += 1
            break
You have to construct your own interval index with a list comprehension (or loop):
intervals = [pd.Interval(i.left, i.right)
             if no != 0 else pd.Interval(i.left, i.right, closed='both')
             for (no, i) in enumerate(intervals)]
intervals
Output:
[Interval(22, 29, closed='both'),
Interval(29, 36, closed='right'),
Interval(36, 43, closed='right'),
Interval(43, 50, closed='right'),
Interval(50, 57, closed='right'),
Interval(57, 64, closed='right')]
Note: A simpler solution might seem to be just changing the first element, like:
new_first_elem = pd.Interval(intervals[0].left, intervals[0].right, closed='both')
intervals[0] = new_first_elem
However, this code throws a TypeError:
TypeError: Index does not support mutable operations
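A quick check that the rebuilt first interval really contains both endpoints (pd.Interval supports the in operator):
import pandas as pd

intervals = pd.interval_range(start=22, end=64, freq=7)
first = pd.Interval(intervals[0].left, intervals[0].right, closed='both')
print(22 in first)         # True: left endpoint now included
print(22 in intervals[0])  # False: the original is open on the left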
I'm trying to create a function to filter a dataframe from a list of tuples. I've created the function below, but it doesn't seem to be working.
Each tuple holds a dataframe column name, a min value, and a max value to filter on.
eg:
eg_tuple = [('colname1', 10, 20), ('colname2', 30, 40), ('colname3', 50, 60)]
My attempted function is below:
def col_cut(df, cutoffs):
    for c in cutoffs:
        df_filter = df[(df[c[0]] >= c[1]) & (df[c[0]] <= c[2])]
    return df_filter
Note that the function should not filter out rows where the value is equal to the max or min. Appreciate the help.
The problem is that each iteration filters the original df, not the result of the previous filter, so only the last cutoff takes effect. You should filter with:
def col_cut(df, cutoffs):
    df_filter = df
    for col, mn, mx in cutoffs:
        dfcol = df_filter[col]
        df_filter = df_filter[(dfcol >= mn) & (dfcol <= mx)]
    return df_filter
Note that you can use .between(..) here:
def col_cut(df, cutoffs):
    df_filter = df
    for col, mn, mx in cutoffs:
        df_filter = df_filter[df_filter[col].between(mn, mx)]
    return df_filter
Use np.logical_and.reduce on all the masks created by a list comprehension with Series.between:
def col_cut(df, cutoffs):
    mask = np.logical_and.reduce([df[col].between(min1, max1) for col, min1, max1 in cutoffs])
    return df[mask]
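A quick usage sketch with made-up data (any of the three variants above behaves the same):
import numpy as np
import pandas as pd

df = pd.DataFrame({'colname1': [5, 10, 15, 25],
                   'colname2': [30, 35, 40, 45],
                   'colname3': [50, 55, 60, 65]})

cutoffs = [('colname1', 10, 20), ('colname2', 30, 40), ('colname3', 50, 60)]
print(col_cut(df, cutoffs))  # keeps only the middle two rows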
I have a 1-dimensional sorted array and would like to find all pairs of elements whose difference is smaller than 5.
A naive approach would be to make N^2 comparisons, doing something like
diffs = np.tile(x, (x.size, 1)) - x[:, np.newaxis]
D = np.logical_and(diffs > 0, diffs < 5)
indices = np.argwhere(D)
Note here that the output of my example is indices into x. If I wanted the values of x which satisfy the criteria, I could do x[indices].
This works for smaller arrays, but not arrays of the size with which I work.
An idea I had was to find where there are gaps larger than 5 between consecutive elements. I would split the array at those gaps and compare all the elements within each piece.
Is this a more efficient way of finding elements which satisfy my criteria? How could I go about writing this?
Here is a small example:
x = np.array([ 9, 12,
21,
36, 39, 44, 46, 47,
58,
64, 65,])
the result should look like
array([[ 0, 1],
[ 3, 4],
[ 5, 6],
[ 5, 7],
[ 6, 7],
[ 9, 10]], dtype=int64)
Here is a solution that iterates over offsets while shrinking the set of candidates until there are none left:
import numpy as np
def f_pp(A, maxgap):
    d0 = np.diff(A)               # gaps between consecutive elements
    d = d0.copy()                 # d[i] holds A[i+k] - A[i] for the current k
    IDX = []
    k = 1
    idx, = np.where(d <= maxgap)  # candidates i for a pair (i, i+k)
    vidx = idx[d[idx] > 0]        # drop zero-distance (duplicate) pairs
    while vidx.size:
        IDX.append(vidx[:, None] + (0, k))
        if idx[-1] + k + 1 == A.size:  # last candidate has no element at i+k+1
            idx = idx[:-1]
        d[idx] = d[idx] + d0[idx + k]  # extend window: d[i] = A[i+k+1] - A[i]
        k += 1
        idx = idx[d[idx] <= maxgap]    # shrink the candidate set
        vidx = idx[d[idx] > 0]
    return np.concatenate(IDX, axis=0)
data = np.cumsum(np.random.exponential(size=10000)).repeat(np.random.randint(1, 20, (10000,)))
pairs = f_pp(data, 1)
#pairs = set(map(tuple, pairs))
from timeit import timeit
kwds = dict(globals=globals(), number=100)
print(data.size, 'points', pairs.shape[0], 'close pairs')
print('pp', timeit("f_pp(data, 1)", **kwds)*10, 'ms')
Sample run:
99963 points 1020651 close pairs
pp 43.00256529124454 ms
Your idea of slicing the array is a very efficient approach. Since your data are sorted you can just calculate the difference and split it:
d = np.diff(x)
ind = np.where(d > 5)[0] + 1  # the gap sits between positions ind-1 and ind
pieces = np.split(x, ind)
Here pieces is a list of subarrays, which you can then process in a loop with your own code.
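For completeness, a minimal sketch of that per-piece loop, reusing the naive pairwise comparison from the question and shifting the local indices back to positions in the full array:
import numpy as np

x = np.array([9, 12, 21, 36, 39, 44, 46, 47, 58, 64, 65])
d = np.diff(x)
ind = np.where(d > 5)[0] + 1            # first index of each new piece
pieces = np.split(x, ind)
starts = np.concatenate(([0], ind))     # global offset of each piece

out = []
for start, p in zip(starts, pieces):
    diffs = np.tile(p, (p.size, 1)) - p[:, np.newaxis]
    D = np.logical_and(diffs > 0, diffs < 5)
    out.append(np.argwhere(D) + start)  # map local indices back to global
pairs = np.concatenate(out)             # matches the expected output above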
The best algorithm depends heavily on the nature of your data, which I don't know. For example, another possibility is to write a nested loop:
pairs = []
for i in range(x.size):
    j = i + 1
    while j < x.size and x[j] - x[i] <= 5:
        pairs.append([i, j])
        j += 1
If you want it to be more clever, you can edit the outer loop so that it jumps ahead when j hits a gap.
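A minimal two-pointer sketch of that idea: since the array is sorted, j never needs to move backwards, so the inner scan is never repeated (this keeps the <= 5 convention of the loop above):
pairs = []
j = 1
for i in range(x.size):
    j = max(j, i + 1)                 # j only ever moves forward
    while j < x.size and x[j] - x[i] <= 5:
        j += 1
    # every index strictly between i and j pairs with i
    pairs.extend([i, k] for k in range(i + 1, j))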