I'm trying to create a set of new columns with growth rates within my df in a more efficient way than multiply imputing them one by one.
My df has +100 variables, but for simplicity, assume the following:
consumption = [5, 10, 15, 20, 25, 30, 35, 40]
wage = [10, 20, 30, 40, 50, 60, 70, 80]
period = [1, 2, 3, 4, 5, 6, 7, 8]
id = [1, 1, 1, 1, 1, 1, 1, 1]
tup= list(zip(id , period, wage))
df = pd.DataFrame(tup,
columns=['id ', 'period', 'wage'])
With two variables I could simply do this:
df['wage_chg']= df.sort_values(by=['id', 'period']).groupby(['id'])['wage'].apply(lambda x: (x/x.shift(4)-1)).fillna(0)
df['consumption_chg']= df.sort_values(by=['id', 'period']).groupby(['id'])['consumption'].apply(lambda x: (x/x.shift(4)-1)).fillna(0)
But maybe by using a for loop or something I could iterate over my column names creating new growth rate columns with the name columnname_chg as in the example above.
Any ideas?
Thanks
You can try DataFrame operation rather than Series operation in groupby.apply
cols = ['wage', 'columnname']
out = df.join(df.sort_values(by=['id', 'period'])
.groupby(['id'])[cols]
.apply(lambda g: (g/g.shift(4)-1)).fillna(0)
.add_suffix('_chg'))
Related
I want to create a subset of my data by applying tf.data.Dataset filter operation. I have this data:
data = tf.convert_to_tensor([[1, 2, 1, 1, 5, 5, 9, 12], [1, 2, 3, 8, 4, 5, 9, 12]])
dataset = tf.data.Dataset.from_tensor_slices(data)
I want to retrieve a subset of 'dataset' which corresponds to all elements whose first column is equal to 1. So, result should be:
[[1, 1, 1], [1, 3, 8]] # dtype : dataset
I tried this:
subset = dataset.filter(lambda x: tf.equal(x[0], 1))
But I don't get the correct result, since it sends me back x[0]
Someone to help me ?
I finally resolved it:
a = tf.convert_to_tensor([1, 2, 1, 1, 5, 5, 9, 12])
b = tf.convert_to_tensor([1, 2, 3, 8, 4, 5, 9, 12])
data_set = tf.data.Dataset.from_tensor_slices((a, b))
subset = data_set.filter(lambda x, y: tf.equal(x, 1))
I am a newbie in numpy. I have an array A of size [x,] of values and an array B of size [y,] (y > x). I want as result an array C of size [x,] filled with indices of B.
Here is an example of inputs and outputs:
>>> A = [10, 20, 30, 10, 40, 50, 10, 50, 20]
>>> B = [10, 20, 30, 40, 50]
>>> C = #Some operations
>>> C
[0, 1, 2, 0, 3, 4, 0, 4, 1]
I didn't find the way how to do this. Please advice me. Thank you.
I think you are looking for searchsorted, assuming that B is sorted increasingly:
C = np.searchsorted(B,A)
Output:
array([0, 1, 2, 0, 3, 4, 0, 4, 1])
Update for general situation where B is not sorted. We can do an argsort:
# let's swap 40 and 50 in B
# expect the output to have 3 and 4 swapped
B = [10, 20, 30, 50, 40]
BB = np.sort(B)
C = np.argsort(B)[np.searchsorted(BB,A)]
Output:
array([0, 1, 2, 0, 4, 3, 0, 3, 1], dtype=int64)
You can double check:
(np.array(B)[C] == A).all()
# True
For general python lists
A = [10, 20, 30, 10, 40, 50, 10, 50, 20]
B = [10, 20, 30, 40, 50]
C = [A.index(e) for e in A if e in B]
print(C)
You can try this code
A = np.array([10, 20, 30, 10, 40, 50, 10, 50, 20])
B = np.array([10, 20, 30, 40, 50])
np.argmax(B==A[:,None],axis=1)
I have a dataframe of the following form.
df = {'X': [0, 3, 6, 7, 8, 11],
'Y1': [8, 5, 4, 3, 2, 1.5],
'Y2': [1, 2, 4, 5, 5, 5]}
I would like to create a new dataframe where I use interpolate where 'X' is stepping in fixed steps [0, 2, 4, 6, 8, 10].
To find the new 'Y' values I need to find f(x)=Y1 and then I can evaluate for each step in X. But since I have many Y's I think there must be a more clever way to do this.
The solution I found was the following:
step_size = 0.25
no_steps = int(np.floor(max(b['X'])/step_size))
for i in range(0,no_steps+1):
b = b.append({'X' : 0.25*i, 'StepNo' : 10, 'PointNo' : 23+i}, ignore_index=True)
b = b.sort_values(['X'])
b = b.set_index(['X'])
c = b.interpolate('index')
c = c.reset_index()
c = c.sort_values(['PointNo'])
So first I define step size. Then I calculate number of steps. Then I append the steps into the dataframe. Sort the dataframe and reindex so I can use interpolate using 'index' as values.
I have two numpy arrays. One of them is 2D while the other is 1D.
>>> a = np.arange(0,20).reshape(2,10)
>>> a
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]])
>>> b = np.full( a.shape[1], 10 )
>>> b
array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10])
I want to compare them column-wise:
If the columns elements in a is identical to the column element of b, then store row number(s) of a.
Else, find the closest matching of a to b and store the row number(s).
In my example, the output from the comparison should be:
[ 1, [0,1], [0,1], [0,1], [0,1], [0,1], [0,1], [0,1], [0,1], [0,1] ]
How do I do this in NumPy?
I was thinking of using np.where( a==b, run a function to get row(s) if same, run another function to get row(s) of diff )? Is this the way?
The following script computes R-squared value between two numpy arrays(x and y).
The R-squared value is very low due to outliers in the data. How can I extract the indices of those outliers?
import numpy as np, matplotlib.pyplot as plt, scipy.stats as stats
x = np.random.random_integers(1,50,50)
y = np.random.random_integers(1,50,50)
r2 = stats.linregress(x, y) [3]**2
print r2
plt.scatter(x, y)
plt.show()
An outlier is defined as: value-mean > 2*standard deviation.
You can do this with the line
[i for i in range(len(x)) if (abs(x[i] - np.mean(x)) > 2*np.std(x))]
What is does:
A list is constructed from the indices of x, where the element at that index satisfies the condition described above.
A quick test:
x = np.random.random_integers(1,50,50)
this gives me the array:
array([16, 6, 13, 18, 21, 37, 31, 8, 1, 48, 4, 40, 9, 14, 6, 45, 20,
15, 14, 32, 30, 8, 19, 8, 34, 22, 49, 5, 22, 23, 39, 29, 37, 24,
45, 47, 21, 5, 4, 27, 48, 2, 22, 8, 12, 8, 49, 12, 15, 18])
Now I add some outliers manually as there are none initially:
x[4] = 200
x[15] = 178
lets test:
[i for i in range(len(x)) if (abs(x[i] - np.mean(x)) > 2*np.std(x))]
result:
[4, 15]
Is this what you was looking for?
EDIT:
I added the abs() function in the line above, because when you are working with negative numbers this might end bad. The abs() function takes the absolute value.
I think Sander's approach is the correct one, but if you must see R2 without those outliers before making a decision here is a way to do it.
Setup data and introduce outlier:
In [1]:
import numpy as np, scipy.stats as stats
np.random.seed(123)
x = np.random.random_integers(1,50,50)
y = np.random.random_integers(1,50,50)
y[5] = 100
Calculate R2 taking out one y value at a time (along with matching x value):
m = np.eye(y.shape[0])
r2 = np.apply_along_axis(lambda a: stats.linregress(np.delete(x, a.argmax()), np.delete(y, a.argmax()))[3]**2, 0, m)
Get index of the biggest outlier:
r2.argmax()
Out[1]:
5
Get R2 when this outlier is taken out:
In [2]:
r2[r2.argmax()]
Out[2]:
0.85892084723588935
Get the value of the outlier:
In [3]:
y[r2.argmax()]
Out[3]:
100
To get top n outliers:
In [4]:
n = 5
sorted_index = r2.argsort()[::-1]
sorted_index[:n]
Out [4]:
array([ 5, 27, 34, 0, 17], dtype=int64)