How to shuffle groups of values but not within groups in numpy?

How to shuffle groups of values but not within groups in numpy? - numpy

I need to shuffle by rows knowing that the first value in each row is a day number. Rows of the same day number should be kept together. Groups may contain 1, 2, 3 or 4 rows. Each row has the same number of values. Hope the examples below will tell you more.
I have this:
a = np.array([
[0, 0.02, 0.03, 0.04],
[0, 0.02, 0.03, 0.04],
[0, 0.02, 0.03, 0.04],
[1, 0.12, 0.13, 0.14],
[2, 0.22, 0.23, 0.24],
[2, 0.22, 0.23, 0.24],
[3, 0.32, 0.33, 0.34],
[3, 0.32, 0.33, 0.34],
[3, 0.32, 0.33, 0.34],
[3, 0.32, 0.33, 0.34]
])
I need to have this:
a = np.array([
[3, 0.32, 0.33, 0.34],
[3, 0.32, 0.33, 0.34],
[3, 0.32, 0.33, 0.34],
[3, 0.32, 0.33, 0.34],
[0, 0.02, 0.03, 0.04],
[0, 0.02, 0.03, 0.04],
[0, 0.02, 0.03, 0.04],
[2, 0.22, 0.23, 0.24],
[2, 0.22, 0.23, 0.24],
[1, 0.12, 0.13, 0.14]
])

This approach using python's random.sample to permute a list of arrays is not fast, but easier to follow. This only works if groups are sorted in blocks beforehand.
import random
random.seed(25) # used for reproducibility only
groups = a[:,0].astype('int')
idx = (groups[1:] ^ groups[:-1]).nonzero()[0] + 1
np.vstack(random.sample(np.split(a, idx), len(idx)+1))
Output
array([[3. , 0.32, 0.33, 0.34],
[3. , 0.32, 0.33, 0.34],
[3. , 0.32, 0.33, 0.34],
[3. , 0.32, 0.33, 0.34],
[0. , 0.02, 0.03, 0.04],
[0. , 0.02, 0.03, 0.04],
[0. , 0.02, 0.03, 0.04],
[2. , 0.22, 0.23, 0.24],
[2. , 0.22, 0.23, 0.24],
[1. , 0.12, 0.13, 0.14]])

Assuming groups are not split in the input array, you can apply the following strategy:
# Find the number of groups and the number of item in each group
unique, srcCounts = np.unique(a[:,0], return_counts=True)
shuffledGroupPos = np.random.permutation(np.arange(len(unique)))
# Compute the source start/end group indices
srcEnd = np.cumsum(srcCounts)
srcStart = srcEnd - srcCounts
# Find the destination start/end group indices
dstCounts = srcCounts[shuffledGroupPos]
dstEnd = np.cumsum(dstCounts)
dstStart = dstEnd - dstCounts
# Remap the source start/end group indices regarding the destination indices
srcStart = srcStart[shuffledGroupPos]
srcEnd = srcEnd[shuffledGroupPos]
# Output array
result = np.empty_like(a)
# Loop iterating over the groups.
# While this loop can be avoided, the code far much simpler with it.
for i in range(unique.size):
result[dstStart[i]:dstEnd[i]] = a[srcStart[i]:srcEnd[i]]

Related

create a lineplot from a few variables

I have a dataframe with 3 variables, each one is representing different time point for the same outcome (e.g. weight):
df = pd.DataFrame({"Time_1": [-4.5, -0.8, -3.0, 0.2, -2.5], \
"Time_2": [-3, -0.2, -2.5, 0.3, 1], "TIme_3": [-2, 0, -1, 0.5, 1]})
I want to plot a trajectory for this variable identical to this graph:
Where I have a first point of (0,0) for the basline and three additional points on X axis with the correspondign values.

You could just use df.shift().fillna(0).cumsum().plot(marker='D') to get a plot of the 3 variables together. Shift and fillna are used so that the first line can be 0 for all the variables.
df = pd.DataFrame({"Time_1": [-4.5, -0.8, -3.0, 0.2, -2.5], \
"Time_2": [-3, -0.2, -2.5, 0.3, 1], "Time_3": [-2, 0, -1, 0.5, 1]})
df.shift().fillna(0).cumsum().plot(marker='D')

Partitioning datasets and get the dynamic averages of rows with the same ID(objects in this case)

I have a large dataset with thousands of rows though fewer columns, i have ordered them by row values so that each of the 'objects' are grouped together, just like the dataset in Table1 below:
#Table1 :
data = [['ALFA', 351740.00, 0.31, 0.22, 0.44, 0.19, 0.05],
['ALFA', 401740.00, 0.43, 0.26, 0.23, 0.16, 0.09],
['ALFA', 892350.00, 0.58, 0.24, 0.05, 0.07, 0.4],
['Bravo', 511830.00, 0.52, 0.16, 0.08, 0.26, 0],
['Charlie', 590030.00, 0.75, 0.2, 0.14, 0.37, 0.06],
['Charlie', 590030.00, 0.75, 0.2, 0.27, 0.2, 0.01],
['Charlie', 590030.00, 0.75, 0.2, 0.29, 0.11, 0.04],
['Charlie', 590030.00, 0.75, 0.2, 0.27, 0.2, 0.01],
['Charlie', 401740.00, 0.43, 0.26, 0.14, 0.37, 0.06],
['Charlie', 511830.00, 0.52, 0.16, 0.13, 0.22, 0.01],
['Delta', 590030.00, 0.75, 0.2, 0.34, 0.3, 0],
['Delta', 590030.00, 0.75, 0.2, 0, 0.28, 0],
['Delta', 351740.00, 0.31, 0.22, 0.44, 0.19, 0.05],
['Echo', 892350.00, 0.58, 0.24, 0.23, 0.16, 0.09],
['Echo', 590030.00, 0.75, 0.2, 0.05, 0.07, 0.4],
['Echo', 590030.00, 0.75, 0.2, 0.08, 0.26, 0],
['Echo', 590030.00, 0.75, 0.2, 0.14, 0.37, 0.06],
['Foxtrot', 401740.00, 0.43, 0.26, 0.27, 0.2, 0.01],
['Foxtrot', 511830.00, 0.52, 0.16, 0.29, 0.11, 0.04],
['Golf', 590030.00, 0.75, 0.2, 0.27, 0.2, 0.01],
['Golf', 590030.00, 0.75, 0.2, 0.14, 0.37, 0.06],
['Golf', 351740.00, 0.31, 0.22, 0.13, 0.22, 0.01],
['Hotel', 892350.00, 0.58, 0.24, 0.34, 0.3, 0],
['Hotel', 590030.00, 0.75, 0.2, 0, 0.28, 0],
['Hotel', 590030.00, 0.75, 0.2, 0.29, 0.11, 0.04]]
df = pd.DataFrame(data, columns= ['Objects', 'Column1', 'Column2', 'Column3', 'Column4', 'Column5', 'Column6'])
df
However i would like to write a query to go through the dataset, partition the data by these objects and get only the averages for all the columns (for each object) in a separate table much like the Table2 below:
#Table2:
data2 = [['ALFA', 548610.00, 0.44, 0.24, 0.24, 0.14, 0.18],
['Bravo', 511830.00, 0.52, 0.16, 0.08, 0.26, 0],
['Charlie', 545615.00, 0.66, 0.20, 0.21, 0.25, 0.03],
['Delta', 510600.00, 0.60, 0.21, 0.26, 0.26, 0.02],
['Echo', 665610.00, 0.71, 0.21, 0.13, 0.22, 0.14],
['Foxtrot', 456785.00, 0.48, 0.21, 0.28, 0.16, 0.03],
['Golf', 510600.00, 0.60, 0.21, 0.18, 0.26, 0.03],
['Hotel', 690803.33, 0.69, 0.21, 0.21, 0.23, 0.01]]
df2 = pd.DataFrame(data, columns= ['Objects', 'Column1', 'Column2', 'Column3', 'Column4', 'Column5', 'Column6'])
df2
Please note that the number of the objects vary across the dataset so the query should be able to count the number of objects and use that number to get the average of all the columns for each object and then present all these values in a new table (much Like what partition windows function does).
For instance note that the '548610.00' alues in Table2 for ALFA(column1) is merely an addition of Column1 values of ALFA in Table1 (351740.00 + 401740.00 + 401740.00) and divide by the count of ALFA being '3'

I believe a simple avg() function should answer your question
SELECT Objects,
AVG(Column1),
AVG(Column2),
AVG(Column3),
AVG(Column4),
AVG(Column5),
AVG(Column6)
FROM tableA
GROUP BY Objects
db fiddle link

Random valid data items in numpy array

Suppose I have a numpy array as follows:
data = np.array([[1, 3, 8, np.nan], [np.nan, 6, 7, 9], [np.nan, 0, 1, 2], [5, np.nan, np.nan, 2]])
I would like to randomly select n-valid items from the array, including their indices.
Does numpy provide an efficient way of doing this?

Example
data = np.array([[1, 3, 8, np.nan], [np.nan, 6, 7, 9], [np.nan, 0, 1, 2], [5, np.nan, np.nan, 2]])
n = 5
Get valid indices
y_val, x_val = np.where(~np.isnan(data))
n_val = y_val.size
Pick random subset of size n by index
pick = np.random.choice(n_val, n)
Apply index to valid coordinates
y_pick, x_pick = y_val[pick], x_val[pick]
Get corresponding data
data_pick = data[y_pick, x_pick]
Admire
data_pick
# array([2., 8., 1., 1., 2.])
y_pick
# array([3, 0, 0, 2, 3])
x_pick
# array([3, 2, 0, 2, 3])

Find nonzeros by :
In [37]: a = np.array(np.nonzero(data)).reshape(-1,2)
In [38]: a
Out[38]:
array([[0, 0],
[0, 0],
[1, 1],
[1, 1],
[2, 2],
[2, 3],
[3, 3],
[3, 0],
[1, 2],
[3, 0],
[1, 2],
[3, 0],
[2, 3],
[0, 1],
[2, 3]])
Now pick a random choice :
In [44]: idx = np.random.choice(np.arange(len(a)))
In [45]: data[a[idx][0],a[idx][1]]
Out[45]: 2.0

tensorflow how do one get the output the same size as input tensor after segment sum

I'm using the tf.unsorted_segment_sum method of TensorFlow and it works.
For example:
tf.unsorted_segment_sum(tf.constant([0.2, 0.1, 0.5, 0.7, 0.8]),
tf.constant([0, 0, 1, 2, 2]), 3)
Gives the right result:
array([ 0.3, 0.5 , 1.5 ], dtype=float32)
I want to get:
array([0.3, 0.3, 0.5, 1.5, 1.5], dtype=float32)

I've solved it.
data = tf.constant([0.2, 0.1, 0.5, 0.7, 0.8])
gr_idx = tf.constant([0, 0, 1, 2, 2])
y, idx, count = tf.unique_with_count(gr_idx)
group_sum = tf.segment_sum(data, gr_idx)
group_sup = tf.gather(group_sum, idx)
answer:
array([0.3, 0.3, 0.5, 1.5, 1.5], dtype=float32)

Add an extra column to ndarray in python

I have a ndarray as follows.
feature_matrix = [[0.1, 0.3], [0.7, 0.8], [0.8, 0.8]]
I have a position ndarray as follows.
position = [10, 20, 30]
Now I want to add the position value at the beginning of the feature_matrix as follows.
[[10, 0.1, 0.3], [20, 0.7, 0.8], [30, 0.8, 0.8]]
I tried the answers in this: How to add an extra column to an numpy array
E.g.,
feature_matrix = np.concatenate((feature_matrix, position), axis=1)
However, I get the error saying that;
ValueError: all the input arrays must have same number of dimensions
Please help me to resolve this prblem.

This solved my problem. I used np.column_stack.
feature_matrix = [[0.1, 0.3], [0.7, 0.8], [0.8, 0.8]]
position = [10, 20, 30]
feature_matrix = np.column_stack((position, feature_matrix))

It is the shape of the position array which is incorrect regarding the shape of the feature_matrix.
>>> feature_matrix
array([[ 0.1, 0.3],
[ 0.7, 0.8],
[ 0.8, 0.8]])
>>> position
array([10, 20, 30])
>>> position.reshape((3,1))
array([[10],
[20],
[30]])
The solution is (with np.concatenate):
>>> np.concatenate((position.reshape((3,1)), feature_matrix), axis=1)
array([[ 10. , 0.1, 0.3],
[ 20. , 0.7, 0.8],
[ 30. , 0.8, 0.8]])
But np.column_stack is clearly great in your case !

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

How to shuffle groups of values but not within groups in numpy? - numpy

Related

create a lineplot from a few variables

Partitioning datasets and get the dynamic averages of rows with the same ID(objects in this case)

Random valid data items in numpy array

tensorflow how do one get the output the same size as input tensor after segment sum

Add an extra column to ndarray in python

Categories

Resources