Explaining the result of pipeline execution of multiple hincrby commands in redis

This rudimentary one has me stumped. I've been tinkering around with redis-py, trying to learn the ropes. One thing I'm trying is:
pipeline1 = my_server.pipeline()
for hash_obj in hash_objs:
    num = pipeline1.hincrby(hash_obj,"num",amount=-1)
result1 = pipeline1.execute()
print result1
>>> [0L,0L]
There were two redis hashes in the list hash_objs. What I see printed on the screen is [0L,0L]. Can someone help me decipher what this output means? What's L? I was hoping to get the resulting int values of num for each hash_obj (e.g. [2,0]).
My objective is to decrement num in each hash_obj by 1, and wherever num ends up as 0, delete the hash_obj.
I'm trying to accomplish that in two separate pipelines; the code above attempts to decrement all num values in all hash_objs. After that, I would delete the relevant hash_objs where warranted. I'm still developing my understanding of how to use pipelining in redis effectively.

There is nothing wrong with the code above: the L suffix just marks a Python 2 long integer, and the printed result is consistent with both hashes having num set to 1 before the run. If you instead set the hashes beforehand to 3 and 1 (steps 3 and 4 below), respectively, you'll get the expected result in step 9:
In [1]: import redis
In [2]: r = redis.StrictRedis()
In [3]: r.hset('h1', 'num', 3)
Out[3]: 1L
In [4]: r.hset('h2', 'num', 1)
Out[4]: 1L
In [5]: hashes = ['h1', 'h2']
In [6]: p = r.pipeline()
In [7]: for h in hashes:
...:     p.hincrby(h, 'num', -1)
...:
In [8]: res = p.execute()
In [9]: res
Out[9]: [2L, 0L]
Note: the 1L returned in steps 3 and 4 means that a new field was created.
Now you can iterate over the result and continue processing. In your case, however, it would make more sense to use a single pipeline and, instead of HINCRBY, call a Lua script that decrements the field and deletes the key if the result drops to 0 or below, such as the one below (which returns 1 if the key was deleted):
In [1]: import redis
In [2]: r = redis.StrictRedis()
In [3]: r.hset('h1', 'num', 3)
Out[3]: 0L
In [4]: r.hset('h2', 'num', 1)
Out[4]: 0L
In [5]: s = r.script_load('if redis.call("HINCRBY", KEYS[1], ARGV[1], ARGV[2]) <= 0 then redis.call("DEL", KEYS[1]) return 1 end return 0')
In [6]: p = r.pipeline()
In [7]: for h in ['h1', 'h2']:
...:     p.evalsha(s, 1, h, 'num', -1)
...:
In [8]: p.execute()
Out[8]: [0L, 1L]
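For completeness, if you prefer to keep the two-pipeline flow described in the question (decrement everything first, then delete whatever reached zero), a minimal sketch could look like this; it assumes redis-py 2.x with StrictRedis as above, reuses the example keys h1 and h2, and presumes no other client touches these hashes between the two pipelines:
import redis
r = redis.StrictRedis()
hash_objs = ['h1', 'h2']      # example keys, as in the session above
p1 = r.pipeline()
for h in hash_objs:
    p1.hincrby(h, 'num', -1)  # queue one decrement per hash
new_values = p1.execute()     # one new counter value per hash, e.g. [2L, 0L]
p2 = r.pipeline()
for h, n in zip(hash_objs, new_values):
    if n <= 0:
        p2.delete(h)          # queue deletion of hashes that reached zero
p2.execute()
The window between the two execute() calls is exactly why the Lua script above is the safer choice when other clients may modify the same hashes.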

Related

Pandas create new column based on groupby and apply lambda if statement

I have an issue with groupby and apply:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': ['a', 'a', 'a', 'b', 'b', 'b', 'b'], 'B': np.r_[1:8]})
I want to create a column C that, within each group, takes the value 1 where the absolute z-score of B exceeds 2 and 0 otherwise. The code:
from scipy import stats
df['C'] = df.groupby('A').apply(lambda x: 1 if np.abs(stats.zscore(x['B'], nan_policy='omit')) > 2 else 0, axis=1)
However, I am unsuccessful with this code and cannot figure out the issue.
Use GroupBy.transform with a lambda function, then compare against the threshold and convert the True/False values to 1/0 by casting to integers:
from scipy import stats
s = df.groupby('A')['B'].transform(lambda x: np.abs(stats.zscore(x, nan_policy='omit')))
df['C'] = (s > 2).astype(int)
Or use numpy.where:
df['C'] = np.where(s > 2, 1, 0)
The error in your solution, when the if-else is evaluated per group, is:
from scipy import stats
df = df.groupby('A')['B'].apply(lambda x: 1 if np.abs(stats.zscore(x, nan_policy='omit')) > 2 else 0)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
See this gotcha in the pandas docs:
pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if-statement or when using the boolean operations: and, or, and not.
So if you use one of these solutions instead of the if-else:
from scipy import stats
df = df.groupby('A')['B'].apply(lambda x: (np.abs(stats.zscore(x, nan_policy='omit')) > 2).astype(int))
print (df)
A
a [0, 0, 0]
b [0, 0, 0, 0]
Name: B, dtype: object
but then you would still need to convert those per-group arrays back into a column; to avoid this problem, groupby.transform is used.
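For illustration, a rough sketch of that extra conversion step (it uses a temporary per_group variable and assumes the per-group blocks come back in the same order as the original rows, which holds here because the frame is already sorted by A); groupby.transform spares you all of this:
from scipy import stats
# flatten the per-group arrays returned by apply back into one array aligned with df's rows
per_group = df.groupby('A')['B'].apply(lambda x: (np.abs(stats.zscore(x, nan_policy='omit')) > 2).astype(int))
df['C'] = np.concatenate(per_group.tolist())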
You can use groupby + apply with a function that finds the z-scores of each item in each group, explode the resulting lists, use gt to create a boolean Series, and convert it to dtype int:
df['C'] = df.groupby('A')['B'].apply(lambda x: stats.zscore(x, nan_policy='omit')).explode(ignore_index=True).abs().gt(2).astype(int)
Output:
A B C
0 a 1 0
1 a 2 0
2 a 3 0
3 b 4 0
4 b 5 0
5 b 6 0
6 b 7 0

Add to items, with multiple occurrences [duplicate]

I have an unsorted array of indexes:
import numpy as np
i = np.array([1,5,2,6,4,3,6,7,4,3,2])
I also have an array of values of the same length:
v = np.array([2,5,2,3,4,1,2,1,6,4,2])
I have an array of zeros of the desired size:
d = np.zeros(10)
Now I want to add the values of v to the elements of d at the positions given by i.
If I do it in plain python I would do it like this:
for index,value in enumerate(v):
    idx = i[index]
    d[idx] += v[index]
It is ugly and inefficient. How can I change it?
np.add.at(d, i, v)
You'd think d[i] += v would work, but if you try to do multiple additions to the same cell that way, one of them overrides the others. The ufunc.at method avoids those problems.
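To see that overriding behaviour concretely, here is a small demo with the arrays from the question (which of the competing writes survives in the fancy-indexing version is an implementation detail, so treat that output as illustrative only):
import numpy as np
i = np.array([1,5,2,6,4,3,6,7,4,3,2])
v = np.array([2,5,2,3,4,1,2,1,6,4,2])
d_fancy = np.zeros(10)
d_fancy[i] += v          # buffered: each duplicated index keeps only one of its additions
d_at = np.zeros(10)
np.add.at(d_at, i, v)    # unbuffered: duplicated indices accumulate all additions
print(d_fancy[3], d_at[3])  # index 3 occurs twice (values 1 and 4): a single addition vs. 5.0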
We can use np.bincount, which is supposedly pretty efficient for such accumulative weighted counting, so here's one approach with it:
counts = np.bincount(i,v)
d[:counts.size] = counts
Alternatively, use the minlength argument for the generic case where d could be any existing array that we want to add into:
d += np.bincount(i,v,minlength=d.size).astype(d.dtype, copy=False)
Runtime tests
This section compares the np.add.at-based approach listed in the other post with the np.bincount-based one listed earlier in this post.
In [61]: def bincount_based(d,i,v):
...:     counts = np.bincount(i,v)
...:     d[:counts.size] = counts
...:
...: def add_at_based(d,i,v):
...:     np.add.at(d, i, v)
...:
In [62]: # Inputs (random numbers)
...: N = 10000
...: i = np.random.randint(0,1000,(N))
...: v = np.random.randint(0,1000,(N))
...:
...: # Setup output arrays for two approaches
...: M = 12000
...: d1 = np.zeros(M)
...: d2 = np.zeros(M)
...:
In [63]: bincount_based(d1,i,v) # Run approaches
...: add_at_based(d2,i,v)
...:
In [64]: np.allclose(d1,d2) # Verify outputs
Out[64]: True
In [67]: # Setup output arrays for two approaches again for timing
...: M = 12000
...: d1 = np.zeros(M)
...: d2 = np.zeros(M)
...:
In [68]: %timeit add_at_based(d2,i,v)
1000 loops, best of 3: 1.83 ms per loop
In [69]: %timeit bincount_based(d1,i,v)
10000 loops, best of 3: 52.7 µs per loop

How do you strip out only the integers of a column in pandas?

I am trying to strip out only the numeric values, which are the first 1 or 2 digits. Some values in the column are pure strings and others contain special characters. See the value counts:
[screenshot of the value counts omitted]
I have tried multiple methods:
breaks['_Size'] = breaks['Size'].fillna(0)
breaks[breaks['_Size'].astype(str).str.isdigit()]
breaks['_Size'] = breaks['_Size'].replace('\*','',regex=True).astype(float)
breaks['_Size'] = breaks['_Size'].str.extract('(\d+)').astype(int)
breaks['_Size'].map(lambda x: x.rstrip('aAbBcC'))
None of them are working. The dtype is object. To be clear, I am attempting to make a new column with only the digits (as an int/float), and if I could also convert the fractions to decimals that would be a bonus.
This handles fractions (dividing them out) and also tolerates extra numbers in the string (it returns just the first sequence of digits):
In [60]: import pandas as pd
In [61]: import re
In [62]: df = pd.DataFrame([0, "6''", '7"', '8in', 'text', '3/4"', '1a3'], columns=['_Size'])
In [63]: df
Out[63]:
_Size
0 0
1 6''
2 7"
3 8in
4 text
5 3/4"
6 1a3
In [64]: def cleaning_function(row):
...:     row = str(row)
...:     fractions = re.findall(r'(\d+)/(\d+)', row)
...:     if fractions:
...:         return float(int(fractions[0][0])/int(fractions[0][1]))
...:     numbers = re.findall(r'[0-9]+', str(row))
...:     if numbers:
...:         return numbers[0]
...:     return 0
...:
In [65]: df._Size.apply(cleaning_function)
Out[65]:
0 0
1 6
2 7
3 8
4 0
5 0.75
6 1
Name: _Size, dtype: object
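A vectorized alternative (just a sketch, separate from the answer above): let str.extract capture an optional numerator/denominator pair, do the division in one pass, and fall back to 0 for rows with no digits at all; the _Size_clean column name is only for illustration:
import pandas as pd
df = pd.DataFrame([0, "6''", '7"', '8in', 'text', '3/4"', '1a3'], columns=['_Size'])
# capture the first run of digits and, if present, a denominator after a slash
parts = df['_Size'].astype(str).str.extract(r'(?P<num>\d+)(?:/(?P<den>\d+))?')
num = pd.to_numeric(parts['num'])
den = pd.to_numeric(parts['den']).fillna(1)
df['_Size_clean'] = (num / den).fillna(0)  # 0.0, 6.0, 7.0, 8.0, 0.0, 0.75, 1.0
Unlike the apply version, this yields a proper float column rather than a mixed-type object column.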

apply generic function in a vectorized fashion using numpy/pandas

I am trying to vectorize my code and, thanks in large part to some users (https://stackoverflow.com/users/3293881/divakar, https://stackoverflow.com/users/625914/behzad-nouri), I was able to make huge progress. Essentially, I am trying to apply a generic function (in this case max_dd_array_ret) to each of the bins I found (see vectorize complex slicing with pandas dataframe for details on the date vectorization, and Start, End and Duration of Maximum Drawdown in Python for the rationale behind max_dd_array_ret). The problem is the following: I should be able to obtain the result df_2, and ranged_DD(asd_1.values, starts, ends+1) is, to some degree, what I am looking for, except for the unfortunate effect that it behaves as if the first two bins were merged and the last one were missing, as can be gauged by looking at the results.
Any explanation and fix is very welcome.
import pandas as pd
import numpy as np
from time import time
from scipy.stats import binned_statistic
def max_dd_array_ret(xs):
    xs = (xs+1).cumprod()
    i = np.argmax(np.maximum.accumulate(xs) - xs) # end of the period
    j = np.argmax(xs[:i])
    max_dd = abs(xs[j]/xs[i] -1)
    return max_dd if max_dd is not None else 0
def get_ranges_arr(starts,ends):
    # Taken from https://stackoverflow.com/a/37626057/3293881
    counts = ends - starts
    counts_csum = counts.cumsum()
    id_arr = np.ones(counts_csum[-1],dtype=int)
    id_arr[0] = starts[0]
    id_arr[counts_csum[:-1]] = starts[1:] - ends[:-1] + 1
    return id_arr.cumsum()
def ranged_DD(arr,starts,ends):
    # Get all indices and the IDs corresponding to same groups
    idx = get_ranges_arr(starts,ends)
    id_arr = np.repeat(np.arange(starts.size),ends-starts)
    slice_arr = arr[idx]
    return binned_statistic(id_arr, slice_arr, statistic=max_dd_array_ret)[0]
asd_1 = pd.Series(0.01 * np.random.randn(500), index=pd.date_range('2011-1-1', periods=500)).pct_change()
index_1 = pd.to_datetime(['2011-2-2', '2011-4-3', '2011-5-1','2011-7-2', '2011-8-3', '2011-9-1','2011-10-2', '2011-11-3', '2011-12-1','2012-1-2', '2012-2-3', '2012-3-1',])
index_2 = pd.to_datetime(['2011-2-15', '2011-4-16', '2011-5-17','2011-7-17', '2011-8-17', '2011-9-17','2011-10-17', '2011-11-17', '2011-12-17','2012-1-17', '2012-2-17', '2012-3-17',])
starts = asd_1.index.searchsorted(index_1)
ends = asd_1.index.searchsorted(index_2)
df_2 = pd.DataFrame([max_dd_array_ret(asd_1.loc[i:j]) for i, j in zip(index_1, index_2)], index=index_1)
print(df_2[0].values)
print(ranged_DD(asd_1.values, starts, ends+1))
results:
df_2
[ 1.75893509 6.08002911 2.60131797 1.55631781 1.8770067 2.50709085
1.43863472 1.85322338 1.84767224 1.32605754 1.48688414 5.44786663]
ranged_DD(asd_1.values, starts, ends+1)
[ 6.08002911 2.60131797 1.55631781 1.8770067 2.50709085 1.43863472
1.85322338 1.84767224 1.32605754 1.48688414]
which are identical except for the first two:
[ 1.75893509 6.08002911 vs [ 6.08002911
and the last two
1.48688414 5.44786663] vs 1.48688414]
P.S.: while looking in more detail at the docs (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binned_statistic.html) I found that this might be the problem:
"All but the last (righthand-most) bin is half-open. In other words,
if bins is [1, 2, 3, 4], then the first bin is [1, 2) (including 1,
but excluding 2) and the second [2, 3). The last bin, however, is [3,
4], which includes 4. New in version 0.11.0."
The problem is I don't know how to change that.
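To make that quoted behaviour concrete (a sketch, not a verified fix for ranged_DD): binned_statistic defaults to bins=10, so 12 integer group IDs built with np.repeat get squeezed into 10 equal-width bins, which merges the outermost groups; passing one explicit bin edge per group keeps them apart:
import numpy as np
from scipy.stats import binned_statistic
ids = np.repeat(np.arange(12), 3)       # 12 integer group IDs, 3 samples each
vals = np.ones_like(ids, dtype=float)
print(binned_statistic(ids, vals, statistic='sum')[0])
# default bins=10 -> [ 6. 3. 3. 3. 3. 3. 3. 3. 3. 6.]  (first two and last two IDs merged)
print(binned_statistic(ids, vals, statistic='sum', bins=np.arange(13))[0])
# one bin per ID  -> [ 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.]
If that is indeed the cause, passing bins=np.arange(starts.size + 1) into the binned_statistic call inside ranged_DD should give one bin per date range.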

Is there a Julia analogue to numpy.argmax?

In Python, there is numpy.argmax:
In [7]: a = np.random.rand(5,3)
In [8]: a
Out[8]:
array([[ 0.00108039, 0.16885304, 0.18129883],
[ 0.42661574, 0.78217538, 0.43942868],
[ 0.34321459, 0.53835544, 0.72364813],
[ 0.97914267, 0.40773394, 0.36358753],
[ 0.59639274, 0.67640815, 0.28126232]])
In [10]: np.argmax(a,axis=1)
Out[10]: array([2, 1, 2, 0, 1])
Is there a Julia analogue to NumPy's argmax? I only found indmax, which only accepts a vector, not a two-dimensional array like np.argmax.
The fastest implementation will usually be findmax (which allows you to reduce over multiple dimensions at once, if you wish):
julia> a = rand(5, 3)
5×3 Array{Float64,2}:
0.867952 0.815068 0.324292
0.44118 0.977383 0.564194
0.63132 0.0351254 0.444277
0.597816 0.555836 0.32167
0.468644 0.336954 0.893425
julia> mxval, mxindx = findmax(a; dims=2)
([0.8679518267243425; 0.9773828942695064; … ; 0.5978162823947759; 0.8934254589671011], CartesianIndex{2}[CartesianIndex(1, 1); CartesianIndex(2, 2); … ; CartesianIndex(4, 1); CartesianIndex(5, 3)])
julia> mxindx
5×1 Array{CartesianIndex{2},2}:
CartesianIndex(1, 1)
CartesianIndex(2, 2)
CartesianIndex(3, 1)
CartesianIndex(4, 1)
CartesianIndex(5, 3)
According to the Numpy documentation, argmax provides the following functionality:
numpy.argmax(a, axis=None, out=None)
Returns the indices of the maximum values along an axis.
I doubt a single Julia function does that, but combining mapslices and argmax is just the ticket:
julia> a = [ 0.00108039 0.16885304 0.18129883;
0.42661574 0.78217538 0.43942868;
0.34321459 0.53835544 0.72364813;
0.97914267 0.40773394 0.36358753;
0.59639274 0.67640815 0.28126232] :: Array{Float64,2}
julia> mapslices(argmax,a,dims=2)
5x1 Array{Int64,2}:
3
2
3
1
2
Of course, because Julia's array indexing is 1-based (whereas Numpy's array indexing is 0-based), each element of the resulting Julia array is offset by 1 compared to the corresponding element in the resulting Numpy array. You may or may not want to adjust that.
If you want to get a vector rather than a 2D array, you can simply tack [:] at the end of the expression:
julia> b = mapslices(argmax,a,dims=2)[:]
5-element Array{Int64,1}:
3
2
3
1
2
To add to jub0bs's answer, argmax in Julia 1+ mirrors the behavior of np.argmax, replacing the axis argument with the dims keyword and returning a CartesianIndex instead of an index along the given dimension:
julia> a = [ 0.00108039 0.16885304 0.18129883;
0.42661574 0.78217538 0.43942868;
0.34321459 0.53835544 0.72364813;
0.97914267 0.40773394 0.36358753;
0.59639274 0.67640815 0.28126232] :: Array{Float64,2}
julia> argmax(a, dims=2)
5×1 Array{CartesianIndex{2},2}:
CartesianIndex(1, 3)
CartesianIndex(2, 2)
CartesianIndex(3, 3)
CartesianIndex(4, 1)
CartesianIndex(5, 2)