Index array with Tuple of Tuples - indexing

I have a vector of tuples, where each tuple represents a position in a 2d array.
I also have a 2d array of values.
For example:
# create a vector of tuples
tupl1 = ((1,1), (2,3), (1,2), (3,1))
# create a 2d array of values
m1 = zeros(Int, (3,3))
m1[1:4] .= 1
I want to get all the values in the 2d array at each of the tuple positions. I thought the following might work:
m1[tupl1]
But this gives an invalid index error. The expected output would be:
4-element Vector{Int64}:
1
0
1
1
Any advice would be much appreciated.

One way to do this could be:
julia> [m1[t...] for t in tupl1]
4-element Vector{Int64}:
1
0
1
1
More verbose, but faster and with fewer allocations, is to go via CartesianIndex (note that broadcasting over a tuple of tuples returns a tuple rather than a vector):
julia> getindex.(Ref(m1), CartesianIndex.(tupl1))
(1, 0, 1, 1)
A benchmark:
julia> @btime [$m1[t...] for t in $tupl1];
24.900 ns (1 allocation: 96 bytes)
julia> @btime getindex.(Ref($m1), CartesianIndex.($tupl1));
9.319 ns (1 allocation: 16 bytes)

If in your original question you had a vector of tuples (you have a tuple of tuples) like this:
julia> tupl1 = [(1,1), (2,3), (1,2), (3,1)]
4-element Vector{Tuple{Int64, Int64}}:
(1, 1)
(2, 3)
(1, 2)
(3, 1)
then you can do just:
julia> m1[CartesianIndex.(tupl1)]
4-element Vector{Int64}:
1
0
1
1

Related

Reshape a DataFrame based on column value, and pad missing slices with zeros

I have a Pandas DataFrame which looks like:
ID  order  other_column_1  other_column_x
A   0      10              20
A   1      11              21
A   2      12              22
B   0      31              41
B   2      33              43
I want to reshape it to a 3D matrix with shape (#IDs, #order, #other columns). For the example above, it should be of shape (2, 3, 2).
The order column holds the order of the 2nd dimension, so slice ['A', 0, :] should be [10, 20], ['A', 1, :] should be [11, 21], etc. The values of order are identical for all IDs (0, 1, 2 in this case).
Trouble is, sometimes a slice is missing, e.g. for 'B' the slice with order 1 is missing, and I want to replace it with a slice of all 0's to keep the shape consistent.
I thought of pre-sorting the whole DataFrame by ID and order, looping over each ID, inserting the missing slices, and stacking them together. However, the DataFrame is huge, so I would like to avoid a global sort and loop if possible.
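For reference, the example frame can be reconstructed from the table above (a small sketch; column names are taken directly from the table):
import pandas as pd

# Example frame from the question; note that the ('B', 1) slice is absent.
df = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'B', 'B'],
    'order': [0, 1, 2, 0, 2],
    'other_column_1': [10, 11, 12, 31, 33],
    'other_column_x': [20, 21, 22, 41, 43],
})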
I came up with a way to do it (if you have enough memory to allocate) where you don't have to loop over the whole dataframe, although I couldn't test it with 10M rows because of memory allocation. I tested it with 5M rows by 300 columns, and I show the results at the end of the answer.
The idea is to get all the combinations of the unique values of the first 2 columns as an index to build the first 2 dimensions of the 3D array.
After that you can merge the original dataframe with the dataframe containing the index combinations, and then fill all the missing values with 0.
Once the data is complete you can pass it to numpy and reshape it in 3 dimensions.
Code without comments:
# df = original dataframe
d1 = df.ID.unique()
d2 = df.order.unique()
df3 = pd.MultiIndex.from_product((d1, d2), names=['ID', 'order'])\
.to_frame().reset_index(drop=True)\
.merge(df, on=['ID', 'order'], how='left')\
.fillna(0)
np_3d_array = df3[df3.columns[2:]].to_numpy().reshape(d1.shape[0], d2.shape[0], df.columns[2:].shape[0])
Code with comments:
# df = original dataframe
# Get unique ID for 1st dimension
d1 = df.ID.unique()
# Get unique order for 2nd dimension
d2 = df.order.unique()
# Get complete DF (parentheses instead of backslashes, so the inline comments are valid Python)
df3 = (pd.MultiIndex.from_product((d1, d2), names=['ID', 'order'])  # all combinations of the 1st and 2nd dimensions as an index
       .to_frame().reset_index(drop=True)                           # DataFrame from the MultiIndex, with the index reset
       .merge(df, on=['ID', 'order'], how='left')                   # merge the complete index with the original values
       .fillna(0))                                                  # fill missing values with 0
# get complete data as 2D array and reshape as 3D array
np_3d_array = df3[df3.columns[2:]].to_numpy().reshape(d1.shape[0], d2.shape[0], df.columns[2:].shape[0])
Test:
First I tried to test with 10M rows but I could not allocate the memory needed for that.
To test the code I created a dataframe with 6M rows x 300 columns (random float numbers) and dropped 1M rows to simulate the missing values.
Here is the code I used to test and the results.
Test code:
import random
import time
import pandas as pd
import numpy as np
# 100000 diff. ID and 60 diff. order
df_test = (pd.MultiIndex.from_product((range(100000), range(60)), names=['ID', 'order'])
           .to_frame().reset_index(drop=True)
           .drop(random.sample(range(6_000_000), k=1_000_000))  # drop 1M rows to simulate missing rows
           .reset_index(drop=True))
# 5M rows random data by 298 columns
df_test2 = pd.DataFrame(np.random.random(size=(5_000_000, 298)))
df = df_test.merge(df_test2, left_index=True, right_index=True)
start = time.time()
d1 = df.ID.unique()
print(f'time 1st Dimension: {round(time.time()-start, 3)}')
d2 = df.order.unique()
print(f'time 2nd Dimension: {round(time.time()-start, 3)}')
df3 = pd.MultiIndex.from_product((d1, d2), names=['ID', 'order'])\
.to_frame().reset_index(drop=True)\
.merge(df, on=['ID', 'order'], how='left').fillna(0)
print(f'time merge: {round(time.time()-start, 3)}')
np_3d_array = df3[df3.columns[2:]].to_numpy().reshape(d1.shape[0], d2.shape[0], df.columns[2:].shape[0])
print(f'time ndarray: {round(time.time()-start, 3)}')
print(f'array shape: {np_3d_array.shape}')
print(f'array type: {type(np_3d_array)}')
Test Results:
time 1st Dimension: 0.035
time 2nd Dimension: 0.063
time merge: 47.202
time ndarray: 49.441
array shape: (100000, 60, 298)
array type: <class 'numpy.ndarray'>
An alternative that avoids the merge is to set ID and order as the index and reindex against the full product of their unique values:
ids = df.ID.unique()
orders = df.order.unique()
ar = (df.set_index(['ID', 'order'])
        .reindex(pd.MultiIndex.from_product((ids, orders)))
        .fillna(0)
        .to_numpy()
        .reshape(len(ids), len(orders), len(df.columns[2:])))
print(ar)
print(ar.shape)
Output:
[[[10. 20.]
[11. 21.]
[12. 22.]]
[[31. 41.]
[ 0. 0.]
[33. 43.]]]
(2, 3, 2)

Julia: sort two arrays (like lexsort in numpy)

Python example
In Numpy there is lexsort to sort one array within another:
Given multiple sorting keys, which can be interpreted as columns in a
spreadsheet, lexsort returns an array of integer indices that
describes the sort order by multiple columns.
So taking the following example:
import numpy as np
a = np.array([1,1,1,2,2,2])
b = np.array([10,8,11,4,8,0])
sorted_idx = np.lexsort((b,a))
print(b[sorted_idx])
# [ 8 10 11 0 4 8]
So this sorts b within a as we can see like:
1 1 1 2 2 2
8 10 11 0 4 8
I can't find anything similar in Julia, so I wonder how this can be achieved? In my case two columns are sufficient.
Julia
So lets take the same data to figure that out:
a = Vector([1,1,1,2,2,2])
b = Vector([10,8,11,4,8,0])
A little benchmark:
using StructArrays
using DataFrames
using BenchmarkTools
a = rand(100000)
b = rand(100000)
function f1(a, b)
    return sortperm(StructArray((a, b)))
end
function f2(a, b)
    return sortperm(DataFrame(a=a, b=b, copycols=false))
end
function f3(a, b)
    return sortperm(collect(zip(a, b)))
end
@btime f1(a, b)
@btime f2(a, b)
@btime f3(a, b)
Giving:
6.075 ms (8 allocations: 781.50 KiB)
13.808 ms (8291 allocations: 5.93 MiB)
15.892 ms (7 allocations: 2.29 MiB)
So StructArray is more than twice as fast as the other two and uses the least memory.
Similar to the DataFrames solution, but a bit more lightweight in terms of dependencies, a nice option is the StructArrays package. It lets you treat a pair of arrays as if it were an array of tuples, without copying the data, which you can then sort (tuples sort lexicographically):
using StructArrays
i = sortperm(StructArray((a, b)))
Instead of finding the permutation array i and doing b[i], you can also do:
sort!(StructArray((a, b)))
which sorts both a and b in-place lexicographically by (a[j], b[j]).
Use the sort and sortperm functions with a vector of tuples:
julia> a = [1, 1, 1, 2, 2, 2];
julia> b = [10, 8, 11, 4, 8, 0];
julia> x = collect(zip(a, b))
6-element Vector{Tuple{Int64, Int64}}:
(1, 10)
(1, 8)
(1, 11)
(2, 4)
(2, 8)
(2, 0)
julia> sort(x)
6-element Vector{Tuple{Int64, Int64}}:
(1, 8)
(1, 10)
(1, 11)
(2, 0)
(2, 4)
(2, 8)
julia> sortperm(x) #indices
6-element Vector{Int64}:
2
1
3
6
4
5
With DataFrames.jl it can be a bit shorter to write:
using DataFrames
sortperm(DataFrame(a=a,b=b, copycols=false))
copycols=false avoids an unnecessary copy of the vectors when creating the data frame. If you do not care about performance and want short code, you can even write:
sortperm(DataFrame(; a, b))

Large Sampling with Replacement by index layer of a Pandas multiindexed Dataframe

Imagine a dataframe with the structure below:
>>> print(pair_df)
                                     0         1
centre param h pair_ind
0      x1    1 (0, 1)         2.244282  2.343915
               (1, 2)         2.343915  2.442202
               (2, 3)         2.442202  2.538162
               (3, 4)         2.538162  2.630836
               (4, 5)         2.630836  2.719298
...                                ...       ...
9      x3    7 (1, 8)         1.407902  1.417398
               (2, 9)         1.407953  1.422860
             8 (0, 8)         1.407896  1.417398
               (1, 9)         1.407902  1.422860
             9 (0, 9)         1.407896  1.422860

[1350 rows x 2 columns]
What is the most efficient way to sample this dataframe by the index level centre (10 values here), with replacement, many times (e.g., 1000), and put the samples together?
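For experimentation, a small stand-in with the same four-level index structure can be built like this (my own sketch, not from the post; the values are arbitrary):
import numpy as np
import pandas as pd

# Toy stand-in for pair_df: MultiIndex levels (centre, param, h, pair_ind), two value columns.
toy_index = pd.MultiIndex.from_tuples(
    [(c, p, h, (h, h + 1))
     for c in range(10) for p in ['x1', 'x2', 'x3'] for h in range(3)],
    names=['centre', 'param', 'h', 'pair_ind'])
pair_df = pd.DataFrame(np.random.randn(len(toy_index), 2), index=toy_index)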
I have found two solutions:
1)
import numpy as np
import pandas as pd

idx = pd.IndexSlice  # needed for the label-based slicing below; not defined in the original snippet
bootstrap_rand = np.random.choice(list(range(0, 10)), size=10 * 1000, replace=True).tolist()
sampled_df = pd.concat([pair_df.loc[idx[i, :, :, :], :] for i in bootstrap_rand])
2)
sampled_df = (pair_df.unstack(['param', 'h', 'pair_ind'])
              .sample(10 * 1000, replace=True)
              .stack(['param', 'h', 'pair_ind']))
Any more efficient ideas?
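One variant that may be faster (my own sketch, untested against the original data): precompute the integer row positions for each centre once, then gather everything with a single iloc call instead of one .loc lookup per draw:
import numpy as np

# Sample centre labels with replacement, then gather the precomputed row positions.
rng = np.random.default_rng(0)
centres = pair_df.index.get_level_values('centre')
uniq = centres.unique()
positions = {c: np.flatnonzero(centres == c) for c in uniq}
sampled = rng.choice(uniq, size=1000, replace=True)
sampled_df = pair_df.iloc[np.concatenate([positions[c] for c in sampled])]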

Sum of data entry with the given index in pandas dataframe

I am trying to get the sums of all possible combinations of the rows in a pandas dataframe. To do this, I use itertools.combinations to get all the combinations, then sum each one in a loop.
Is there any way to do this without the loop?
Please check the following script, which shows what I want.
import pandas as pd
import itertools as it

A = pd.Series([50, 20, 75], index=list(range(1, 4)))
df = pd.DataFrame({'A': A})
listNew = []
for i in range(1, len(df.A) + 1):
    Temp = it.combinations(df.index.values, i)
    for data in Temp:
        listNew.append(data)
print(listNew)
for data in listNew:
    print(df.A[list(data)].sum())
The output of this script is:
[(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
50
20
75
70
125
95
145
Thank you in advance.
IIUC, using reindex:
# convert your list of tuples to a data frame and use stack to flatten it
s = pd.DataFrame([(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]).stack().to_frame('index')
# then reindex df by those flattened indices to pull in the matching values of A
s['Value'] = df.reindex(s['index']).A.values
# you could use groupby here, but since the combination id is the outer index level, sum with level works
s = s.Value.sum(level=0)
s
Out[796]:
0 50
1 20
2 75
3 70
4 125
5 95
6 145
Name: Value, dtype: int64
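Note that Series.sum(level=...) has been deprecated and removed in recent pandas versions; there, the same aggregation step would be written with groupby (replacing the sum(level=0) line above):
s = s.Value.groupby(level=0).sum()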

Python, Numpy: all UNIQUE combinations of a numpy.array() vector

I want to get all unique combinations of a numpy.array vector (or a pandas.Series). I used itertools.combinations, but it's very slow: for an array of size (1000,) it takes many hours. Here is my code using itertools (actually I use the differences of the combinations):
def a(array):
    temp = pd.Series([])
    for i in itertools.combinations(array, 2):
        temp = temp.append(pd.Series(np.abs(i[0] - i[1])))
    temp.index = range(len(temp))
    return temp
As you can see, there is no repetition!
sklearn.utils.extmath.cartesian is really fast, but it produces repetitions, which I do not want. I need help rewriting the above function without itertools and with much more speed for large vectors.
You could take the upper triangular part of the matrix formed by applying the binary operation (here subtraction, as in your example) over the Cartesian product:
import numpy as np
n = 3
a = np.random.randn(n)
print(a)
print(a - a[:, np.newaxis])
print((a - a[:, np.newaxis])[np.triu_indices(n, 1)])
gives
[ 0.04248369 -0.80162228 -0.44504522]
[[ 0. -0.84410597 -0.48752891]
[ 0.84410597 0. 0.35657707]
[ 0.48752891 -0.35657707 0. ]]
[-0.84410597 -0.48752891 0.35657707]
with n=1000 (and output piped to /dev/null) this runs in 0.131s on my relatively modest laptop.
For a random array of ints:
import numpy as np
import pandas as pd
import itertools as it
b = np.random.randint(0, 8, ((6,)))
# array([7, 0, 6, 7, 1, 5])
pd.Series(list(it.combinations(np.unique(b), 2)))
it returns:
0 (0, 1)
1 (0, 5)
2 (0, 6)
3 (0, 7)
4 (1, 5)
5 (1, 6)
6 (1, 7)
7 (5, 6)
8 (5, 7)
9 (6, 7)
dtype: object
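If, as in the question, you actually want the differences rather than the pairs themselves, the upper-triangular trick from the first answer applies to the unique values as well (a short sketch combining the two answers):
u = np.unique(b)
# pairwise absolute differences of the unique values, upper triangle only (no repetition)
diffs = np.abs(u - u[:, np.newaxis])[np.triu_indices(len(u), 1)]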