Appending rows to a pandas DataFrame results in duplicate rows - pandas

Here's an MWE that illustrates a problem I'm having: incrementally saving values to a DataFrame over a series of loops results in what looks like previous rows being overwritten.
import pandas as pd
import numpy as np

saved = pd.DataFrame(columns=['value1', 'value2'])
m = np.zeros(2)

for t in range(5):
    for i in range(2):
        m[i] = m[i] + i + 1
    print(t)
    print(m)
    saved.loc[t] = m

print(saved)
The output I get is:
0
[1. 2.]
1
[2. 4.]
2
[3. 6.]
3
[4. 8.]
4
[5. 10.]
   value1  value2
0     2.0     4.0
1     2.0     4.0
2     3.0     6.0
3     4.0     8.0
4     5.0    10.0
Why is the first row of the saved dataframe not 1.0, 2.0?
Edit:
Here's another articulation of the problem, this time saving to a list and building the DataFrame at the end. The following code in a .py script
import numpy as np
import pandas as pd

saved_list = []
m = np.zeros(2)

for t in range(5):
    for i in range(2):
        m[i] = m[i] + i + 1
    print(t)
    print(m)
    saved_list.append(m)

saved = pd.DataFrame(saved_list, columns=['value1', 'value2'])
print(saved)
gives this output from the command line:
0
[1. 2.]
1
[2. 4.]
2
[3. 6.]
3
[4. 8.]
4
[ 5. 10.]
   value1  value2
0     5.0    10.0
1     5.0    10.0
2     5.0    10.0
3     5.0    10.0
4     5.0    10.0
Why are the previous saved_list items being overwritten?

It works as expected for me without any change (the screenshot from Google Colab is not reproduced here).

Well, it seems that making a copy of the array within the loop for saving solves both scenarios.
For the first, I used
saved.loc[t] = m.copy() and for the second I used saved_list.append(m.copy()).
It may be obvious to some, but when the array is defined outside the loop, the items saved to either the list or the DataFrame are references to that original array, so everything saved within the loop ends up pointing at its final state.
Now I know.
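For anyone who wants to see the aliasing in isolation, here is a minimal sketch (toy variables invented for illustration, not part of the original question) contrasting appending a reference with appending a copy:

import numpy as np

m = np.zeros(2)
by_reference = []   # every entry is the same array object
by_copy = []        # every entry is a snapshot taken at append time

for t in range(3):
    m += 1
    by_reference.append(m)        # stores a reference to m
    by_copy.append(m.copy())      # stores an independent copy

print(by_reference)  # [array([3., 3.]), array([3., 3.]), array([3., 3.])]
print(by_copy)       # [array([1., 1.]), array([2., 2.]), array([3., 3.])]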

Related

Find the longest notnull segment in an ndarray using numpy

I have an array ab of shape (2, 12):
ab = np.array([[0, 3, 6, 3, np.nan, 3, 7, 3, 5, 4, 3, np.nan],
               [5, 9, np.nan, 3, 7, 5, 3, 6, 4, np.nan, np.nan, np.nan]])
I am trying to get the longest segment of consecutive notnull values between the two rows. From the example above, the output should be:
[[3. 7. 3. 5.]
 [5. 3. 6. 4.]]
I used the solution proposed for a similar question here: Find longest subsequence without NaN values in set of series, after converting my array into a dataframe:
df = pd.DataFrame(ab.T)
seq = np.array(df.dropna(how='any').index)
longest_seq = max(np.split(seq, np.where(np.diff(seq)!=1)[0]+1), key=len)
print(df.iloc[longest_seq])
     0    1
5  3.0  5.0
6  7.0  3.0
7  3.0  6.0
8  5.0  4.0
However, is it possible to find a solution using numpy only?
Thanks
I am not sure your code handles the case where the length of such sequences differs from one row to the other. Instead, I would proceed row-by-row:
res = []
for array in ab:
    # First, let's prepend a nan for regularity:
    arr = np.append(np.nan, array)
    nanindexes = np.nonzero(np.isnan(arr))[0]
    longest = max(np.split(arr, nanindexes), key=len)  # select the biggest slice; they all start with nan
    longest = longest[1:]  # remove the nan we added, or the starting one
    res.append(longest)
print(res)
[array([3., 7., 3., 5., 4., 3.]), array([3., 7., 5., 3., 6., 4.])]
I am not too familiar with numpy, so I took your question as an exercise. There are probably many ways to improve that code.
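For what it's worth, here is a sketch of one way to vectorize the per-row work with np.diff instead of splitting on the NaN positions (the helper name is mine, and it assumes every row contains at least one non-NaN value):

import numpy as np

ab = np.array([[0, 3, 6, 3, np.nan, 3, 7, 3, 5, 4, 3, np.nan],
               [5, 9, np.nan, 3, 7, 5, 3, 6, 4, np.nan, np.nan, np.nan]])

def longest_valid_run(row):
    """Longest stretch of consecutive non-NaN values in a 1-D array."""
    mask = ~np.isnan(row)
    # Pad with False so runs touching either end still produce start/stop edges.
    padded = np.concatenate(([False], mask, [False]))
    edges = np.flatnonzero(np.diff(padded.astype(int)))
    starts, stops = edges[::2], edges[1::2]
    k = np.argmax(stops - starts)
    return row[starts[k]:stops[k]]

res = [longest_valid_run(row) for row in ab]
print(res)
# [array([3., 7., 3., 5., 4., 3.]), array([3., 7., 5., 3., 6., 4.])]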

How to transform a DataFrame column to a zero mean and one standard deviation

I am trying to adjust some columns to have a mean of zero and an SD of one, but I am not sure how to do that.
E.g. given the following dataframe, how do you create a new column with mean 0 and sd 1?
df = pd.DataFrame([8.2,18,15,9], columns=['temp'])
Here is something I have tried with StandardScaler:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame([[8.2, 57], [18, 60], [15, 45], [9, 30]], columns=['temp', 'rh'])
print(df)

scaler = StandardScaler(copy=False, with_mean=True, with_std=True)
scaler.fit(df)
print(f"Means: {scaler.mean_}")

df2 = scaler.transform(df)
print(f"Transformed Data Frame:\n{df2}")

m = np.mean(df2, axis=0)
s = np.std(df2, axis=0)
print(f"Column means:\n{m}")
print(f"Column SD:\n{s}")
But the results are not a mean of zero or sd=1 at all.
   temp  rh
0   8.2  57
1  18.0  60
2  15.0  45
3   9.0  30
Means: [12.55 48. ]
Transformed Data Frame:
[[-1.06105451 0.76200076]
[ 1.32936715 1.01600102]
[ 0.59760542 -0.25400025]
[-0.86591805 -1.52400152]]
Column means:
[-2.49800181e-16 0.00000000e+00]
Column SD:
[1. 1.]
from sklearn.preprocessing import StandardScaler
df1 = StandardScaler().fit_transform(df)
Will do the trick.
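If you prefer to stay in pandas, here is a minimal sketch of the same standardization without sklearn (assuming the two-column df from the question; ddof=0 gives the population standard deviation that StandardScaler uses). Note also that fit_transform returns a NumPy array, so wrap it back in a DataFrame if you need the column labels.

# Pure-pandas standardization; ddof=0 matches StandardScaler's population SD.
df_std = (df - df.mean()) / df.std(ddof=0)
print(df_std.mean())        # ~0 for each column
print(df_std.std(ddof=0))   # 1.0 for each column

# Keeping the labels when using sklearn:
# df1 = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns, index=df.index)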

Context expansion for speech frames in tensorflow or keras

Assume I have a tensor of shape [batch_size, T, d], where T is the number of frames of a speech file and d is the MFCC dimension. Now I would like to expand the context with the left and right frames, like this function in numpy:
def make_context(feature, left, right):
    '''
    Takes a 2-D numpy feature array, and pads each frame with a specified
    number of frames on either side.
    '''
    feature = [feature]
    for i in range(left):
        feature.append(numpy.vstack((feature[-1][0], feature[-1][:-1])))
    feature.reverse()
    for i in range(right):
        feature.append(numpy.vstack((feature[-1][1:], feature[-1][-1])))
    return numpy.hstack(feature)
How to implement this function in tensorflow or keras?
You can use tf.map_fn and tf.py_func to implement this function in tensorflow: tf.map_fn handles every element in the batch, and tf.py_func applies the numpy function to each element. For example:
import tensorflow as tf
import numpy as np

def make_context(feature, left, right):
    feature = [feature]
    for i in range(left):
        feature.append(np.vstack((feature[-1][0], feature[-1][:-1])))
    feature.reverse()
    for i in range(right):
        feature.append(np.vstack((feature[-1][1:], feature[-1][-1])))
    return np.hstack(feature)

# numpy usage
feature = np.array([[1, 2], [3, 4], [5, 6]])
print(make_context(feature, 2, 3))

# tensorflow usage
feature_tf = tf.placeholder(shape=(None, None, None), dtype=tf.float32)
result = tf.map_fn(
    lambda element: tf.py_func(
        lambda feature, left, right: make_context(feature, left, right),
        [element, 2, 3],
        tf.float32),
    feature_tf, tf.float32)

with tf.Session() as sess:
    print(sess.run(result, feed_dict={feature_tf: np.array([feature, feature])}))
# print
[[1 2 1 2 1 2 3 4 5 6 5 6]
[1 2 1 2 3 4 5 6 5 6 5 6]
[1 2 3 4 5 6 5 6 5 6 5 6]]
[[[1. 2. 1. 2. 1. 2. 3. 4. 5. 6. 5. 6.]
[1. 2. 1. 2. 3. 4. 5. 6. 5. 6. 5. 6.]
[1. 2. 3. 4. 5. 6. 5. 6. 5. 6. 5. 6.]]
[[1. 2. 1. 2. 1. 2. 3. 4. 5. 6. 5. 6.]
[1. 2. 1. 2. 3. 4. 5. 6. 5. 6. 5. 6.]
[1. 2. 3. 4. 5. 6. 5. 6. 5. 6. 5. 6.]]]
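As a quick sanity check without building a graph, the same batched numbers can be reproduced in plain NumPy by mapping make_context over the batch (this only mirrors the output above; it is not a TensorFlow replacement):

# Reference check in plain NumPy, reusing make_context and feature from above.
batch = np.stack([feature, feature]).astype(float)
expanded = np.stack([make_context(f, 2, 3) for f in batch])
print(expanded.shape)   # (2, 3, 12)
print(expanded[0])      # matches the first element of the session output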

Pandas groupby in combination with sklearn preprocessing continued

Continuing from this post:
Pandas groupby in combination with sklearn preprocessing
I need to preprocess by scaling data grouped by two columns, but I somehow get an error with the second method:
import pandas as pd
import numpy as np
from sklearn.preprocessing import robust_scale, minmax_scale

df = pd.DataFrame(dict(id=list('AAAAABBBBB'),
                       loc=(10, 20, 10, 20, 10, 20, 10, 20, 10, 20),
                       value=(0, 10, 10, 20, 100, 100, 200, 30, 40, 100)))

df['new'] = df.groupby(['id', 'loc']).value.transform(lambda x: minmax_scale(x.astype(float)))
df['new'] = df.groupby(['id', 'loc']).value.transform(lambda x: robust_scale(x))
The second one gives me an error like this:
ValueError: Expected 2D array, got 1D array instead: array=[  0.  10. 100.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
If I use reshape, I get an error like this:
Exception: Data must be 1-dimensional
If I print out the grouped data, g['value'] is a pandas Series:
for n, g in df.groupby(['id', 'loc']):
    print(type(g['value']))
Do you know what might cause it?
Thanks.
Based on the error message, you should add reshape and concatenate:
df.groupby(['id','loc']).value.transform(lambda x:np.concatenate(robust_scale(x.values.reshape(-1,1))))
Out[606]:
0 -0.2
1 -1.0
2 0.0
3 1.0
4 1.8
5 0.0
6 1.0
7 -2.0
8 -1.0
9 0.0
Name: value, dtype: float64
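An equivalent way to write the same fix, using .ravel() instead of np.concatenate to flatten the 2-D output of robust_scale back to one dimension (assuming the same df as above):

df['new'] = df.groupby(['id', 'loc']).value.transform(
    lambda x: robust_scale(x.values.reshape(-1, 1)).ravel())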

How to invert a numpy histogram back to intensities

I'm wondering if there is a numpythonic way of inverting a histogram back to an intensity signal.
For example:
>>> A = np.array([7, 2, 1, 4, 0, 7, 8, 10])
>>> H, edge = np.histogram(A, bins=10, range=(0,10))
>>> np.sort(A)
[ 0 1 2 4 7 7 8 10]
>>> H
[1 1 1 0 1 0 0 2 1 1]
>>> edge
[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.]
Is there a way to reconstruct the original A intensities using the H and edge? Of course, positional information will have been lost, but I'd just like to recover the intensities and relative number of occurrences.
I have this loopy way of doing it:
>>> reco = []
>>> for i, h in enumerate(H):
...     for _ in range(h):
...         reco.append(edge[i])
...
>>> reco
[0.0, 1.0, 2.0, 4.0, 7.0, 7.0, 8.0, 9.0]
# I've done something wrong with the right-most histogram bin, but we can ignore that for now
For large histograms, the loopy way is inefficient. Is there a vectorized equivalent of what I did in the loop? (my gut says that numpy.digitize will be involved..)
Sure, you can use np.repeat for this:
import numpy as np
A = np.array([7, 2, 1, 4, 0, 7, 8, 10])
counts, edges = np.histogram(A, bins=10, range=(0,10))
print(np.repeat(edges[:-1], counts))
# [ 0. 1. 2. 4. 7. 7. 8. 9.]
Obviously it's impossible to recover the exact position of a value within a bin, since you lose that information in the process of generating the histogram. You could either use the lower or upper bin edge (as in the example above), or you could use the center value, e.g.:
print(np.repeat((edges[:-1] + edges[1:]) / 2., counts))
# [ 0.5 1.5 2.5 4.5 7.5 7.5 8.5 9.5]
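A quick way to convince yourself the reconstruction is consistent (a check of mine, not part of the original answer): histogramming the repeated left edges with the same bins reproduces the original counts.

# Round-trip check, reusing counts and edges from above.
reco = np.repeat(edges[:-1], counts)
counts2, _ = np.histogram(reco, bins=10, range=(0, 10))
print(np.array_equal(counts, counts2))   # True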