Randomness of np.random.shuffle - numpy

I have two arrays (i and j) that are exactly the same. I shuffle them with a specified random seed.
import numpy as np
np.random.seed(42)
i = np.array([0, 1, 2, 3, 4, 5, 6, 7])
j = np.array([0, 1, 2, 3, 4, 5, 6, 7])
np.random.shuffle(i)
np.random.shuffle(j)
print(i, j)
# [1 5 0 7 2 4 3 6] [3 7 0 4 5 2 1 6]
I expected them to be the same after shuffling, but that is not the case.
Do you have any ideas about how to get the same results (like the example below) after shuffling?
# [1 5 0 7 2 4 3 6] [1 5 0 7 2 4 3 6]
Many thanks in advance!

Calling seed() sets the state of a global random number generator. Each call of shuffle continues with the same global random number generator, so the results are different, as they should be. If you want them to be the same, reset the seed before each call of shuffle.
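For example, reseeding the global generator immediately before each shuffle makes the two results identical:
import numpy as np

i = np.array([0, 1, 2, 3, 4, 5, 6, 7])
j = np.array([0, 1, 2, 3, 4, 5, 6, 7])

np.random.seed(42)      # reset the global generator state
np.random.shuffle(i)
np.random.seed(42)      # reset it again, so shuffle draws the same numbers
np.random.shuffle(j)

print(i, j)
# [1 5 0 7 2 4 3 6] [1 5 0 7 2 4 3 6]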

Related

Find pattern in pandas dataframe, reorder it row-wise, and reset index

This is a multipart problem. I have found solutions for each separate part, but when I try to combine these solutions, I don't get the outcome I want.
Let's say this is my dataframe:
import pandas as pd
df = pd.DataFrame(list(zip([1, 3, 6, 7, 7, 8, 4], [6, 7, 7, 9, 5, 3, 1])), columns=['Values', 'Vals'])
df
   Values  Vals
0       1     6
1       3     7
2       6     7
3       7     9
4       7     5
5       8     3
6       4     1
Let's say I want to find the pattern [6, 7, 7] in the 'Values' column.
I can use a modified version of the second solution given here:
Pandas: How to find a particular pattern in a dataframe column?
pattern = [6, 7, 7]
pat_i = [df[i-len(pattern):i]                                # get the slice
         for i in range(len(pattern), len(df))               # for each window of 3 consecutive elements
         if all(df['Values'][i-len(pattern):i] == pattern)]  # if the pattern matched
pat_i
[   Values  Vals
 2       6     7
 3       7     9
 4       7     5]
The only way I've found to narrow this down to just index values is the following:
pat_i = [df.index[i-len(pattern):i]                          # get the index
         for i in range(len(pattern), len(df))               # for each window of 3 consecutive elements
         if all(df['Values'][i-len(pattern):i] == pattern)]  # if the pattern matched
pat_i
[RangeIndex(start=2, stop=5, step=1)]
Once I've found the pattern, what I want to do, within the original dataframe, is reorder the pattern to [7, 7, 6], moving the entire associated rows as I do this. In other words, going by the index, I want to get output that looks like this:
df.reindex([0, 1, 3, 4, 2, 5, 6])
   Values  Vals
0       1     6
1       3     7
3       7     9
4       7     5
2       6     7
5       8     3
6       4     1
Then, finally, I want to reset the index so that the values in all the columns stay in their new, re-ordered places:
   Values  Vals
0       1     6
1       3     7
2       7     9
3       7     5
4       6     7
5       8     3
6       4     1
In order to use pat_i as a basis for re-ordering, I've tried to modify the second solution given here:
Python Pandas: How to move one row to the first row of a Dataframe?
target_row = 2
# Move target row to first element of list.
idx = [target_row] + [i for i in range(len(df)) if i != target_row]
However, I can't figure out how to exploit the pat_i RangeIndex object to use it with this code. The solution, when I find it, will be applied to hundreds of dataframes, each one of which will contain the [6, 7, 7] pattern that needs to be re-ordered in one place, but not the same place in each dataframe.
Any help appreciated...and I'm sure there must be an elegant, pythonic way of doing this, as it seems like it should be a common enough challenge. Thank you.
I just sort of rewrote your code. I held the first and last indexes to the side, reordered the indexes of interest, and put everything together in a new index. Then I just use the new index to reorder the data.
import pandas as pd

df = pd.DataFrame(list(zip([1, 3, 6, 7, 7, 8, 4], [6, 7, 7, 9, 5, 3, 1])), columns=['Values', 'Vals'])
pattern = [6, 7, 7]
new_order = [1, 2, 0]  # new order of pattern

for i in list(df[df['Values'] == pattern[0]].index):
    if all(df['Values'][i:i+len(pattern)] == pattern):
        pat_i = df[i:i+len(pattern)]                         # the matched rows
        front_ind = list(range(0, pat_i.index[0]))           # indexes before the match
        back_ind = list(range(pat_i.index[-1]+1, len(df)))   # indexes after the match
        pat_ind = [pat_i.index[j] for j in new_order]        # reordered match indexes
        new_ind = front_ind + pat_ind + back_ind
        df = df.loc[new_ind].reset_index(drop=True)
df
Out[82]:
   Values  Vals
0       1     6
1       3     7
2       7     9
3       7     5
4       6     7
5       8     3
6       4     1
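Since the same fix will be applied to hundreds of dataframes, it may help to wrap the logic in a small helper; a minimal sketch (the name reorder_pattern is hypothetical, and it assumes a default RangeIndex and at most one match per dataframe):
def reorder_pattern(df, pattern=(6, 7, 7), new_order=(1, 2, 0)):
    # Return df with the first occurrence of `pattern` in 'Values' reordered.
    for i in df.index[df['Values'] == pattern[0]]:
        if list(df['Values'][i:i+len(pattern)]) == list(pattern):
            pat_ind = [i + k for k in new_order]
            new_ind = list(range(i)) + pat_ind + list(range(i + len(pattern), len(df)))
            return df.loc[new_ind].reset_index(drop=True)
    return df  # pattern not found: return unchanged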

Numpy vs Pandas axis

Why does axis differ between Numpy and Pandas?
Example:
If I want to get rid of a column in Pandas, I could do this:
df.drop("column", axis = 1, inplace = True)
Here, we are using axis = 1 to drop a column (vertically in a DF).
In Numpy, if I want to sum a matrix A vertically I would use:
A.sum(axis = 0)
Here I use axis = 0.
axis isn't used that often in pandas. A dataframe has 2 dimensions, which are often treated quite differently. In drop, the axis definition is well documented and actually corresponds to the numpy usage.
Make a simple array and data frame:
In [180]: x = np.arange(9).reshape(3,3)
In [181]: df = pd.DataFrame(x)
In [182]: df
Out[182]:
   0  1  2
0  0  1  2
1  3  4  5
2  6  7  8
Delete a row from the array, or a column:
In [183]: np.delete(x, 1, 0)
Out[183]:
array([[0, 1, 2],
       [6, 7, 8]])
In [184]: np.delete(x, 1, 1)
Out[184]:
array([[0, 2],
       [3, 5],
       [6, 8]])
Drop does the same thing for the same axis:
In [185]: df.drop(1, axis=0)
Out[185]:
   0  1  2
0  0  1  2
2  6  7  8
In [186]: df.drop(1, axis=1)
Out[186]:
   0  2
0  0  2
1  3  5
2  6  8
For sum, the definitions are the same as well:
In [188]: x.sum(axis=0)
Out[188]: array([ 9, 12, 15])
In [189]: df.sum(axis=0)
Out[189]:
0     9
1    12
2    15
dtype: int64
In [190]: x.sum(axis=1)
Out[190]: array([ 3, 12, 21])
In [191]: df.sum(axis=1)
Out[191]:
0     3
1    12
2    21
dtype: int64
The pandas sums are Series, which are the pandas equivalent of a 1d array.
Visualizing what axis does with reduction operations like sum is a bit tricky - especially with 2d arrays. Is the axis kept or removed? It can help to think about axis for 1d arrays (the only axis is removed), or 3d arrays, where one axis is removed leaving two.
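For instance, checking the result shapes shows which axis disappears (a small illustration):
import numpy as np

a1 = np.arange(3)
print(a1.sum(axis=0))         # 3 -- a scalar; the only axis is removed

a3 = np.arange(24).reshape(2, 3, 4)
print(a3.sum(axis=0).shape)   # (3, 4) -- axis 0 removed
print(a3.sum(axis=1).shape)   # (2, 4) -- axis 1 removed
print(a3.sum(axis=2).shape)   # (2, 3) -- axis 2 removed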
When you drop a column, its label is looked up along axis 1, the horizontal axis. When you sum along axis 0, you sum vertically, down the rows.

How to get the index of each increment in pandas series?

How do I get the index of a pandas Series at each position where the value increments by one?
Ex. The input is
   A
0  0
1  1
2  1
3  1
4  2
5  2
6  3
7  4
8  4
the output should be: [0, 1, 4, 6, 7]
You can use Series.duplicated and access the index, which should be slightly faster.
df.index[~df.A.duplicated()]
# Int64Index([0, 1, 4, 6, 7], dtype='int64')
If you really want a list, you can do this,
df.index[~df.A.duplicated()].tolist()
# [0, 1, 4, 6, 7]
Note that duplicated (and drop_duplicates) will only work if your Series does not have any decrements.
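To illustrate that caveat with a hypothetical series containing a decrement:
import pandas as pd

s = pd.Series([0, 1, 1, 0, 1])           # drops back to 0, then increments again
s.index[~s.duplicated()].tolist()
# [0, 1] -- the increment at position 4 is missed, since 1 was already seen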
Alternatively, you can use diff here, and index into df.index, similar to the previous solution:
np.insert(df.index[df.A.diff().gt(0)], 0, 0)
# Int64Index([0, 1, 4, 6, 7], dtype='int64')
You can use drop_duplicates:
df.drop_duplicates('A').index.tolist()
[0, 1, 4, 6, 7]
This makes sure the value in the next row is incremented by exactly one (not by two or anything else!):
df[(df.A.shift(-1) - df.A) == 1.0].index.values
The output is a numpy array:
array([2, 5])
Example:
import pandas as pd

#         *        *      <- here the value increases by 1
#  0  1  2  3  4  5  6  7
df = pd.DataFrame({'A': [1, 1, 1, 2, 8, 3, 4, 4]})
df[(df.A.shift(-1) - df.A) == 1.0].index.values
array([2, 5])

Saving with numpy savetxt. Array elements as columns

I am pretty new to Python and trying to kick my Matlab addiction. I am converting a lot of my lab's machine vision code over to Python, but I am stuck on one aspect of the saving. On each pass we save 6 variables in an array. I'd like these to be entered as the 6 columns of a txt file with numpy.savetxt. Each iteration of the tracking loop would then add the same variables for that frame as the next row in the txt file.
But I keep getting a single column that just grows with every loop. I've attached simple code to show my problem. As it loops, it generates a variable called output. I would like this to be the three columns of the txt file, with each iteration of the loop becoming a new row. Is there an easy way to do this?
import numpy as np

dataFile_Path = "dataFile.txt"
dataFile_id = open(dataFile_Path, 'w+')

for x in range(0, 9):
    variable = np.array([2, 3, 4])
    output = x*variable + 1
    output.astype(float)
    print(output)
    np.savetxt(dataFile_id, output, fmt="%d")

dataFile_id.close()
In [160]: for x in range(0, 9):
     ...:     variable = np.array([2,3,4])
     ...:     output = x*variable+1
     ...:     output.astype(float)
     ...:     print(output)
     ...:
[1 1 1]
[3 4 5]
[5 7 9]
[ 7 10 13]
[ 9 13 17]
[11 16 21]
[13 19 25]
[15 22 29]
[17 25 33]
So you are writing one row at a time. savetxt normally is used to write a 2d array.
Notice that the print still shows integers - astype returns a new array; it does not change things in place.
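To actually keep the float version, the result has to be reassigned:
output = output.astype(float)   # astype returns a new array; rebind the name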
But because you are giving it 1d arrays it writes those as columns:
In [177]: f = open('txt','bw+')
In [178]: for x in range(0, 9):
     ...:     variable = np.array([2,3,4])
     ...:     output = x*variable+1
     ...:     np.savetxt(f, output, fmt='%d')
     ...:
In [179]: f.close()
In [180]: cat txt
1
1
1
3
4
5
5
7
9
If instead I give savetxt a 2d array (shape (1,3)), it writes one row per call:
In [181]: f = open('txt','bw+')
In [182]: for x in range(0, 9):
     ...:     variable = np.array([2,3,4])
     ...:     output = x*variable+1
     ...:     np.savetxt(f, [output], fmt='%d')
     ...:
In [183]: f.close()
In [184]: cat txt
1 1 1
3 4 5
5 7 9
7 10 13
9 13 17
11 16 21
13 19 25
15 22 29
17 25 33
But a better approach is to construct the 2d array, and write that with one savetxt call:
In [185]: output = np.array([2,3,4])*np.arange(9)[:,None]+1
In [186]: output
Out[186]:
array([[ 1,  1,  1],
       [ 3,  4,  5],
       [ 5,  7,  9],
       [ 7, 10, 13],
       [ 9, 13, 17],
       [11, 16, 21],
       [13, 19, 25],
       [15, 22, 29],
       [17, 25, 33]])
In [187]: np.savetxt('txt', output, fmt='%10d')
In [188]: cat txt
         1          1          1
         3          4          5
         5          7          9
         7         10         13
         9         13         17
        11         16         21
        13         19         25
        15         22         29
        17         25         33
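If the rows really are produced one at a time inside the tracking loop, another option is to collect them in a list and stack once at the end; a minimal sketch:
import numpy as np

rows = []
for x in range(9):
    rows.append(x * np.array([2, 3, 4]) + 1)   # one 3-element row per iteration

np.savetxt('txt', np.vstack(rows), fmt='%d')   # stack to 2d and write once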

Aggregating a time series in Pandas given a window size

Let's say I have this data:
import pandas
a = pandas.Series([1,2,3,4,5,6,7,8])
a
Out[313]:
0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
dtype: int64
I would like to aggregate the data by grouping n rows at a time and summing them. So if n=2 the new series would look like {3, 7, 11, 15}.
try this:
In [39]: a.groupby(a.index//2).sum()
Out[39]:
0     3
1     7
2    11
3    15
dtype: int64
In [41]: a.index//2
Out[41]: Int64Index([0, 0, 1, 1, 2, 2, 3, 3], dtype='int64')
n=3
In [42]: n=3
In [43]: a.groupby(a.index//n).sum()
Out[43]:
0     6
1    15
2    15
dtype: int64
In [44]: a.index//n
Out[44]: Int64Index([0, 0, 0, 1, 1, 1, 2, 2], dtype='int64')
You can use pandas' rolling sum and get it like the following:
if n is your interval:
sums = list(a.rolling(n).sum()[n-1::n])
# Optional !!!
rem = len(a) % n
if rem != 0:
    sums.append(a[-rem:].sum())
The first line handles everything when the data divides evenly into groups; otherwise, we can also append the sum of the remaining rows (depending on your preference).
For example, in the above case with n=3, you may want either {6, 15, 15} or just {6, 15}. The code above gives the former; skipping the optional part gives just {6, 15}.
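As a quick check of both variants with the series from the question:
import pandas as pd

a = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
n = 3

sums = list(a.rolling(n).sum()[n-1::n])   # [6.0, 15.0]
rem = len(a) % n
if rem != 0:
    sums.append(a[-rem:].sum())           # appends 15, the sum of the last two rows

print(sums)                               # [6.0, 15.0, 15]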