Efficiently assigning values to multidimensional array based on indices list on one dimension - optimization

I have a matrix M of size [S1, S2, S3].
I have another matrix K of size [1, S2, S3] that holds, for each position, the first-dimension index at which I want to assign.
And V is a [1, S2, S3] matrix which contains the values to be assigned correspondingly.
With for loops, this is how I did it:
for x2 = 1:S2
  for x3 = 1:S3
    M(K(1,x2,x3), x2, x3) = V(1, x2, x3);
  endfor % x3
endfor % x2
Is there a more efficient way to do this?
Visualization for 2D case:
M =
1 4 7 10
2 5 8 11
3 6 9 12
K =
2 1 3 2
V =
50 80 70 60
Desired =
1 80 7 10
50 5 8 60
3 6 70 12
Test case:
M = reshape(1:24, [3,4,2])
K = reshape([2,1,3,2,3,3,1,2], [1,4,2])
V = reshape(10:10:80, [1,4,2])
s = size(M)
M = assign_values(M, K, V)
M =
ans(:,:,1) =
1 20 7 10
10 5 8 40
3 6 30 12
ans(:,:,2) =
13 16 70 22
14 17 20 80
50 60 21 24
I'm looking for an efficient way to implement assign_values there.
Running Gelliant's answer somehow gives me this:
key = sub2ind(s, K, [1:s(2)])
error: sub2ind: all subscripts must be of the same size

You can use sub2ind to convert your individual subscripts to linear indices. These can then be used to replace the corresponding elements of M with the values in V.
M = [1 4 7 10 ;...
2 5 8 11 ;...
3 6 9 12];
s=size(M);
K = [2 1 3 2];
K = sub2ind(s,K,[1:s(2)])
V = [50 80 70 60];
M(K)=V;
You don't need reshape and M=M(:) for it to work in Matlab.

I found that this works:
K = K(:)'+(S1*(0:numel(K)-1));
M(K) = V;
Perhaps this is supposed to work the same way as Gelliant's answer, but I couldn't make his answer work, somehow =/
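For comparison only (this is not part of the original Octave question), the same assignment needs no linear-index arithmetic in NumPy, because advanced indexing accepts one index array per dimension. A sketch of the 2D example, keeping in mind that NumPy is 0-indexed:

```python
import numpy as np

# 2D example from the question; order='F' reproduces the column-major layout
M = np.arange(1, 13).reshape(3, 4, order='F')
K = np.array([2, 1, 3, 2]) - 1   # convert the 1-based row indices to 0-based
V = np.array([50, 80, 70, 60])

# one row index and one column index per assigned element
M[K, np.arange(M.shape[1])] = V
```

After the assignment, M matches the "Desired" matrix shown above.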


Find Max Gradient by Row in For Loop Pandas

I have a df of 15 x 4 and I'm trying to compute the maximum gradient in the North (N) minus South (S) direction for each row, using the "S" and "N" values for each min or max in the rows below. I'm not sure this is the most pythonic way to do it. My df "ms" looks like this:
minSlats minNlats maxSlats maxNlats
0 57839.4 54917.0 57962.6 56979.9
0 57763.2 55656.7 58120.0 57766.0
0 57905.2 54968.6 58014.3 57031.6
0 57796.0 54810.2 57969.0 56848.2
0 57820.5 55156.4 58019.5 57273.2
0 57542.7 54330.6 58057.6 56145.1
0 57829.8 54755.4 57978.8 56777.5
0 57796.0 54810.2 57969.0 56848.2
0 57639.4 54286.6 58087.6 56140.1
0 57653.3 56182.7 57996.5 57975.8
0 57665.1 56048.3 58069.7 58031.4
0 57559.9 57121.3 57890.8 58043.0
0 57689.7 55155.5 57959.4 56440.8
0 57649.4 56076.5 58043.0 58037.4
0 57603.9 56290.0 57959.8 57993.9
My loop structure looks like this:
J = len(ms)
grad = pd.DataFrame()
for i in range(J):
    if ms.maxSlats.iloc[i] > ms.maxNlats.iloc[i]:
        gr = (ms.maxSlats.iloc[i] - ms.minNlats.iloc[i]) * -1
        grad[gr] = [i+1, i]
    elif ms.maxNlats.iloc[i] > ms.maxSlats.iloc[i]:
        gr = ms.maxNlats.iloc[i] - ms.minSlats.iloc[i]
        grad[gr] = [i+1, i]
grad = grad.T  # need to transpose
print(grad)
I obtain the correct answer but I'm wondering if there is a cleaner way to do this to obtain the same answer below:
grad.T
Out[317]:
0 1
-3045.6 1 0
-2463.3 2 1
-3045.7 3 2
-3158.8 8 7
-2863.1 5 4
-3727.0 6 5
-3223.4 7 6
-3801.0 9 8
-1813.8 10 9
-2021.4 11 10
483.1 12 11
-2803.9 13 12
-1966.5 14 13
390.0 15 14
thank you,
Use np.where to compute the gradient, then keep only the last occurrence of each duplicated index.
import numpy as np

grad = np.where(ms.maxSlats > ms.maxNlats,
                (ms.maxSlats - ms.minNlats) * -1,
                ms.maxNlats - ms.minSlats)
df = pd.DataFrame({'A': pd.RangeIndex(1, len(ms) + 1),
                   'B': pd.RangeIndex(0, len(ms))},
                  index=grad)
df = df[~df.index.duplicated(keep='last')]
>>> df
A B
-3045.6 1 0
-2463.3 2 1
-3045.7 3 2
-2863.1 5 4
-3727.0 6 5
-3223.4 7 6
-3158.8 8 7
-3801.0 9 8
-1813.8 10 9
-2021.4 11 10
483.1 12 11
-2803.9 13 12
-1966.5 14 13
390.0 15 14
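To see the np.where step in isolation, here is a minimal self-contained check against the first three rows of the question's data (values copied from the table above):

```python
import numpy as np
import pandas as pd

# first three rows of the question's "ms" dataframe
ms = pd.DataFrame({'minSlats': [57839.4, 57763.2, 57905.2],
                   'minNlats': [54917.0, 55656.7, 54968.6],
                   'maxSlats': [57962.6, 58120.0, 58014.3],
                   'maxNlats': [56979.9, 57766.0, 57031.6]})

# vectorized version of the if/elif in the loop
grad = np.where(ms.maxSlats > ms.maxNlats,
                (ms.maxSlats - ms.minNlats) * -1,
                ms.maxNlats - ms.minSlats)
# grad matches the first three values of the loop output,
# -3045.6, -2463.3, -3045.7, up to float rounding
```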

Add one item each time into a cell of dataframe pandas

I need to add/append the results of various functions into a dataframe, each result in one cell. In the example below I'm putting only 3 functions. I could add to lists first and then drop the lists into each column, but there are too many functions to do this individually. Any help is most welcome!
import pandas as pd
J = pd.DataFrame()
J['f1'] = []
J['f2'] = []
J['f3'] = []
for i in range(1000):
    f1x = i + 1
    f2x = i ** 2
    f3x = 3 * i
    J['f1'].append(f1x)
    J['f2'].append(f2x)
    J['f3'].append(f3x)
print(J)
J.to_csv(r'00/G/graficos/Resultado.csv')
A list comprehension to create the DataFrame directly is likely the most performant:
# Create DataFrame from computations
J = pd.DataFrame([[i + 1, i ** 2, 3 * i] for i in range(1_000)])
# Rename Columns
J.columns = 'f' + (J.columns + 1).astype(str)
This can also be used with def functions:
def f1(i):
    return i + 1

def f2(i):
    return i ** 2

def f3(i):
    return 3 * i
# Create DataFrame from functions
J = pd.DataFrame([[f1(i), f2(i), f3(i)]
for i in range(1_000)])
# Rename Columns
J.columns = 'f' + (J.columns + 1).astype(str)
df:
f1 f2 f3
0 1 0 0
1 2 1 3
2 3 4 6
3 4 9 9
4 5 16 12
.. ... ... ...
995 996 990025 2985
996 997 992016 2988
997 998 994009 2991
998 999 996004 2994
999 1000 998001 2997
[1000 rows x 3 columns]
In my opinion, the best approach is to take advantage of vector operations:
import numpy as np

def f1(i):
    return i + 1

def f2(i):
    return i ** 2

def f3(i):
    return 3 * i

xs = np.arange(1000)
pd.DataFrame({f.__name__: f(xs) for f in (f1, f2, f3)})
output:
f1 f2 f3
0 1 0 0
1 2 1 3
2 3 4 6
3 4 9 9
4 5 16 12
.. ... ... ...
995 996 990025 2985
996 997 992016 2988
997 998 994009 2991
998 999 996004 2994
999 1000 998001 2997
Here is a comparison of the vector and loop approaches:
NB: the scales are log! This means the loop gets even worse as n grows.
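As a quick sanity check that the two answers above agree, here is a sketch building both the list-comprehension DataFrame and the vectorized one and comparing them element-wise (function names as defined above):

```python
import numpy as np
import pandas as pd

def f1(i):
    return i + 1

def f2(i):
    return i ** 2

def f3(i):
    return 3 * i

# loop/list-comprehension version
loop_df = pd.DataFrame([[f1(i), f2(i), f3(i)] for i in range(1_000)],
                       columns=['f1', 'f2', 'f3'])

# vectorized version
xs = np.arange(1_000)
vec_df = pd.DataFrame({f.__name__: f(xs) for f in (f1, f2, f3)})

# same columns, same values
assert list(loop_df.columns) == list(vec_df.columns)
assert (loop_df.to_numpy() == vec_df.to_numpy()).all()
```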

'Series' objects are mutable, thus they cannot be hashed trying to sum columns and datatype is float

I am trying to sum all values in a range of columns, from the third to the last of several thousand columns, using:
day3prep['D3counts'] = day3prep.sum(day3prep.iloc[:, 2:].sum(axis=1))
dataframe is formated as:
ID G1 Z1 Z2 ...ZN
0 50 13 12 ...62
1 51 62 23 ...19
dataframe with summed column:
ID G1 Z1 Z2 ...ZN D3counts
0 50 13 12 ...62 sum(Z1:ZN in row 0)
1 51 62 23 ...19 sum(Z1:ZN in row 1)
I've changed the NaNs to 0's. The datatype is float but I am getting the error:
'Series' objects are mutable, thus they cannot be hashed
You only need this part:
day3prep['D3counts'] = day3prep.iloc[:, 2:].sum(axis=1)
With some random numbers:
import pandas as pd
import random
random.seed(42)
day3prep = pd.DataFrame({'ID': random.sample(range(10), 5), 'G1': random.sample(range(10), 5),
'Z1': random.sample(range(10), 5), 'Z2': random.sample(range(10), 5), 'Z3': random.sample(range(10), 5)})
day3prep['D3counts'] = day3prep.iloc[:, 2:].sum(axis=1)
Output:
> day3prep
ID G1 Z1 Z2 Z3 D3counts
0 1 2 0 8 8 16
1 0 1 9 0 6 15
2 4 8 1 3 3 7
3 9 4 7 5 7 19
4 6 3 6 6 4 16
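For reference, the error in the original line comes from passing the row sums (a Series) as the first positional argument of DataFrame.sum, which pandas tries to interpret as an axis and hash. A minimal sketch of the fix (column names as in the question):

```python
import pandas as pd

day3prep = pd.DataFrame({'ID': [0, 1], 'G1': [50, 51],
                         'Z1': [13, 62], 'Z2': [12, 23]})

# day3prep.sum(day3prep.iloc[:, 2:].sum(axis=1)) raises
# "'Series' objects are mutable, thus they cannot be hashed".
# Calling sum directly on the column slice avoids it:
day3prep['D3counts'] = day3prep.iloc[:, 2:].sum(axis=1)
```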

Setting values to MultiIndex DataFrame is getting slow while running

I have a 11729 rows × 8 columns DataFrame that I'd like to convert to an 11729 × 30 × 8 MultiIndex DataFrame, where each outer index holds a sliding window of 30 consecutive rows, from row 0 up to row 11728 - 30.
for a shorter example:
the origin 2d DataFrame looks like:
col0 col1
0 1 2
1 3 4
2 5 6
3 7 8
4 9 10
the 3d MultiIndex DataFrame which I want to get would looks like:
col0 col1
0 c0 1 2
c1 3 4
c2 5 6
1 c0 3 4
c1 5 6
c2 7 8
2 c0 5 6
c1 7 8
c2 9 10
which means (0,c0)~(0,c2) from 0~2 rows in origin DataFrame, (1,c0)~(1,c2) from 1~3 rows in origin DataFrame, (2,c0)~(2,c2) from 2~4 rows in origin DataFrame.
I'm using the following code to convert the origin 2d DataFrame to MultiIndex 3d DataFrame:
multi_index = pd.MultiIndex(levels=[[], []],
                            labels=[[], []],
                            names=['', ''])
df = pd.DataFrame(index=multi_index, columns=origin_df.columns)
for i in range(n):
    for j in range(i, len(origin_df) - (n - i)):
        print("i{}/n{},j{}".format(i, n, j))  # print progress
        df.loc[(j, 'c%d' % i), :] = origin_df.loc[origin_df.index[j]].tolist()
for i in range(n, len(origin_df)):
    df.loc[(i, 'y'), :] = origin_df.loc[origin_df.index[i]].tolist()
return df
My problem is the insertion speed is getting slow while running.
At first the progress output is fast, but getting slower and slower.
How could I optimize this operation?
You should not be adding one by one. Here's what I would do:
import numpy as np
import pandas as pd

# toy data:
df = pd.DataFrame(np.arange(11729 * 8).reshape(-1, 8))
window = 30
new_len = len(df) - window + 1
# create the new dataframe, ignoring the index
new_df = pd.concat(df.iloc[i:i + window] for i in range(new_len))
# modify the index
new_df.index = pd.MultiIndex.from_product([np.arange(new_len),
                                           [f'c{i}' for i in range(window)]])
That took about 1 second on a 6600k. With your sample data, the output is:
col0 col1
0 c0 1 2
c1 3 4
c2 5 6
1 c0 3 4
c1 5 6
c2 7 8
2 c0 5 6
c1 7 8
c2 9 10
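An alternative sketch (not from the original answers) that avoids Python-level concatenation entirely is NumPy's sliding_window_view, available in NumPy >= 1.20, shown here on the question's 5-row example:

```python
import numpy as np
import pandas as pd

origin_df = pd.DataFrame({'col0': [1, 3, 5, 7, 9],
                          'col1': [2, 4, 6, 8, 10]})
window = 3

# shape (n_windows, n_cols, window); move the window axis to the middle
windows = np.lib.stride_tricks.sliding_window_view(
    origin_df.to_numpy(), window, axis=0).transpose(0, 2, 1)

n_windows = windows.shape[0]
new_df = pd.DataFrame(windows.reshape(-1, origin_df.shape[1]),
                      columns=origin_df.columns,
                      index=pd.MultiIndex.from_product(
                          [range(n_windows),
                           [f'c{i}' for i in range(window)]]))
```

This yields the same (outer, c0..c2) layout as the pd.concat approach, without copying window by window in Python.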

How to subtract one dataframe from another?

First, let me set the stage.
I start with a pandas dataframe klmn, that looks like this:
In [15]: klmn
Out[15]:
K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97
Next I split klmn into two dataframes, klmn0 and klmn1, according to the value in the 'K' column:
In [16]: k0 = klmn.groupby(klmn['K'] == 0)
In [17]: klmn0, klmn1 = [klmn.ix[k0.indices[tf]] for tf in (True, False)]
In [18]: klmn0, klmn1
Out[18]:
( K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84,
K L M N
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97)
Finally, I compute the mean of the M column in klmn0, grouped by the value in the L column:
In [19]: m0 = klmn0.groupby('L')['M'].mean(); m0
Out[19]:
L
a -0.307671
b 0.451144
Name: M
Now, my question is, how can I subtract m0 from the M column of the klmn1 sub-dataframe, respecting the value in the L column? (By this I mean that m0['a'] gets subtracted from the M column of each row in klmn1 that has 'a' in the L column, and likewise for m0['b'].)
One could imagine doing this in a way that replaces the values in the M column of klmn1 with the new values (after subtracting the value from m0). Alternatively, one could imagine doing it in a way that leaves klmn1 unchanged and instead produces a new dataframe klmn11 with an updated M column. I'm interested in both approaches.
If you set the index of your klmn1 dataframe to the L column, then the dataframe will automatically align the indices with any series you subtract from it:
In [1]: klmn1.set_index('L')['M'] - m0
Out[1]:
L
a 0.777595
a -0.671791
b 0.779920
b -0.128690
Name: M
Option #1:
df1.subtract(df2, fill_value=0)
Option #2:
df1.subtract(df2, fill_value=None)
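For the "new dataframe klmn11" variant that keeps the original index and leaves klmn1 untouched, one sketch is to map the group means onto the rows via the L column and subtract (names follow the question; the m0 values are copied from its output):

```python
import pandas as pd

m0 = pd.Series({'a': -0.307671, 'b': 0.451144})
klmn1 = pd.DataFrame({'K': [1, 1, 2, 2],
                      'L': ['a', 'b', 'a', 'b'],
                      'M': [0.469924, 1.231064, -0.979462, 0.322454],
                      'N': [44, 68, 73, 97]})

# map L -> group mean, aligned row by row, then subtract
klmn11 = klmn1.copy()
klmn11['M'] = klmn11['M'] - klmn11['L'].map(m0)  # klmn1 stays unchanged
```

The in-place variant is the same assignment applied to klmn1 itself instead of a copy.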