Julia: sort two arrays (like lexsort in numpy)

Python example
In Numpy there is lexsort to sort one array within another:
Given multiple sorting keys, which can be interpreted as columns in a
spreadsheet, lexsort returns an array of integer indices that
describes the sort order by multiple columns.
So taking the following example:
import numpy as np
a = np.array([1,1,1,2,2,2])
b = np.array([10,8,11,4,8,0])
sorted_idx = np.lexsort((b,a))
print(b[sorted_idx])
# [ 8 10 11 0 4 8]
So this sorts b within a, as we can see:
1 1 1 2 2 2
8 10 11 0 4 8
I can't find anything similar in Julia, so I wonder how this can be achieved. In my case two columns are sufficient.
Julia
So let's take the same data to figure that out:
a = [1, 1, 1, 2, 2, 2]
b = [10, 8, 11, 4, 8, 0]
Little benchmark
using StructArrays
using DataFrames
using BenchmarkTools
a = rand(100000)
b = rand(100000)
function f1(a, b)
    return sortperm(StructArray((a, b)))
end
function f2(a, b)
    return sortperm(DataFrame(a=a, b=b, copycols=false))
end
function f3(a, b)
    return sortperm(collect(zip(a, b)))
end
@btime f1(a, b)
@btime f2(a, b)
@btime f3(a, b)
Giving:
6.075 ms (8 allocations: 781.50 KiB)
13.808 ms (8291 allocations: 5.93 MiB)
15.892 ms (7 allocations: 2.29 MiB)
So StructArray is almost twice as fast as the other two and uses the least memory.

Similar to the DataFrames solution, but a bit more lightweight in terms of dependencies, a nice option is the StructArrays package. It lets you treat a pair of arrays as if it were an array of tuples, without making a copy of the data, and you can then sort that (tuples sort lexicographically):
using StructArrays
i = sortperm(StructArray((a, b)))
Instead of finding the permutation array i and doing b[i], you can also do:
sort!(StructArray((a, b)))
which sorts both a and b in-place lexicographically by (a[j], b[j]).

Use the sort and sortperm functions with a vector of tuples:
julia> a = [1, 1, 1, 2, 2, 2];
julia> b = [10, 8, 11, 4, 8, 0];
julia> x = collect(zip(a, b))
6-element Vector{Tuple{Int64, Int64}}:
(1, 10)
(1, 8)
(1, 11)
(2, 4)
(2, 8)
(2, 0)
julia> sort(x)
6-element Vector{Tuple{Int64, Int64}}:
(1, 8)
(1, 10)
(1, 11)
(2, 0)
(2, 4)
(2, 8)
julia> sortperm(x) #indices
6-element Vector{Int64}:
2
1
3
6
4
5

With DataFrames.jl it can be a bit shorter to write:
using DataFrames
sortperm(DataFrame(a=a, b=b, copycols=false))
copycols=false avoids an unnecessary copy of the vectors when creating a data frame. If you do not care about performance and want short code, you can even write:
sortperm(DataFrame(; a, b))

Related

Index array with Tuple of Tuples

I have a vector of tuples, where each tuple represents a position in a 2d array.
I also have a 2d array of values
For example:
# create a vector of tuples
tupl1 = ((1,1), (2,3), (1,2), (3,1))
# create a 2d array of values
m1 = zeros(Int, (3,3))
m1[1:4] .= 1
I want to get all the values in the 2d array at each of the tuple positions. I thought the following might work:
m1[tupl1]
But this gives an invalid index error. Expected output would be:
4-element Vector{Int64}:
1
0
1
1
Any advice would be much appreciated.
One way to do this could be:
julia> [m1[t...] for t in tupl1]
4-element Vector{Int64}:
1
0
1
1
More verbose, but faster and with fewer allocations, is to go via CartesianIndex:
julia> getindex.(Ref(m1), CartesianIndex.(tupl1))
(1, 0, 1, 1)
A benchmark:
julia> @btime [$m1[t...] for t in $tupl1];
24.900 ns (1 allocation: 96 bytes)
julia> @btime getindex.(Ref($m1), CartesianIndex.($tupl1));
9.319 ns (1 allocation: 16 bytes)
If in your original question you had a vector of tuples (you have a tuple of tuples) like this:
julia> tupl1 = [(1,1), (2,3), (1,2), (3,1)]
4-element Vector{Tuple{Int64, Int64}}:
(1, 1)
(2, 3)
(1, 2)
(3, 1)
then you can do just:
julia> m1[CartesianIndex.(tupl1)]
4-element Vector{Int64}:
1
0
1
1

Large Sampling with Replacement by index layer of a Pandas multiindexed Dataframe

Imagine a dataframe with the structure below:
>>> print(pair_df)
                                  0         1
centre param h pair_ind
0      x1    1 (0, 1)      2.244282  2.343915
               (1, 2)      2.343915  2.442202
               (2, 3)      2.442202  2.538162
               (3, 4)      2.538162  2.630836
               (4, 5)      2.630836  2.719298
...                             ...       ...
9      x3    7 (1, 8)      1.407902  1.417398
               (2, 9)      1.407953  1.422860
             8 (0, 8)      1.407896  1.417398
               (1, 9)      1.407902  1.422860
             9 (0, 9)      1.407896  1.422860

[1350 rows x 2 columns]
What is the most efficient way to sample (with replacement) this dataframe many times (e.g., 1000) by the index level centre (10 values here) and put the samples all together?
I have found two solutions:
1)
import numpy as np
import pandas as pd

idx = pd.IndexSlice  # the usual alias for slicing a MultiIndex
bootstrap_rand = np.random.choice(list(range(0, 10)), size=10 * 1000, replace=True).tolist()
sampled_df = pd.concat([pair_df.loc[idx[i, :, :, :], :] for i in bootstrap_rand])
2)
sampled_df = pair_df.unstack(['param', 'h', 'pair_ind']).\
    sample(10 * 1000, replace=True).\
    stack(['param', 'h', 'pair_ind'])
Any more efficient ideas?
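One idea that might be faster (a sketch, not from the original post; the helper name sample_centres is ours) is to resolve each centre label to its integer row positions once via groupby(level='centre').indices, then gather all sampled rows with a single iloc lookup:
import numpy as np
import pandas as pd

def sample_centres(df, n_samples, seed=None):
    # Map each 'centre' label to the integer positions of its rows (computed once).
    positions = df.groupby(level="centre").indices
    labels = list(positions)
    rng = np.random.default_rng(seed)
    # Draw n_samples centre labels with replacement and gather their rows.
    picks = rng.choice(len(labels), size=n_samples, replace=True)
    rows = np.concatenate([positions[labels[i]] for i in picks])
    return df.iloc[rows]

# e.g. sampled_df = sample_centres(pair_df, 10 * 1000)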

Pandas | How to effectively filter a column

I'm looking for a way to quickly and effectively filter through a dataframe column and remove values that don't meet a condition.
Say, I have a column with the numbers 4, 5 and 10. I want to filter the column and replace any numbers above 7 with 0. How would I go about this?
You're talking about two separate things - filtering and value replacement. They both have uses and end up being similar in nature but for filtering I'll point to this great answer.
Let's say our data frame is called df and looks like
    A   B
1   4  10
2   4   2
3  10   1
4   5   9
5  10   3
Column A fits your statement of a column only having values 4, 5, 10. If you wanted to replace numbers above 7 with 0, this would do it:
df["A"] = [0 if x > 7 else x for x in df["A"]]
If you read through the right-hand side it cleanly explains what it is doing. It helps to include parentheses to separate the "what to do" from the "what you're doing it over":
df["A"] = [(0 if x > 7 else x) for x in df["A"]]
If you want to do a manipulation over multiple columns, then utilizing zip allows you to do it easily. For example, if you want the sum of columns A and B then:
df["sum"] = [x[0] + x[1] for x in zip(df["A"], df["B"])]
Take care when you overwrite data - this removes information. It's a good practice to have the transformed data in other columns so you can trace back when something inevitably goes wonky.
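To make that concrete, here is a small sketch of both ideas (plain boolean-mask filtering, and replacement written into a new column so the original values survive); the sample data mirrors the example above and the column name A_capped is just an illustration:
import pandas as pd

df = pd.DataFrame({"A": [4, 4, 10, 5, 10], "B": [10, 2, 1, 9, 3]})

# Filtering: keep only the rows whose A value satisfies the condition.
filtered = df[df["A"] <= 7]

# Replacement without overwriting: store the capped values in a new column.
# Series.where keeps values where the condition holds and substitutes 0 elsewhere.
df["A_capped"] = df["A"].where(df["A"] <= 7, 0)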
There are many options. One possibility for if/then logic is np.where:
import pandas as pd
import numpy as np
df = pd.DataFrame({'x': [1, 200, 4, 5, 6, 11],
                   'y': [4, 5, 10, 24, 4, 3]})
df['y'] = np.where(df['y'] > 7, 0, df['y'])
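If you prefer to stay within pandas, the same replacement can be written with Series.mask, which substitutes values where the condition is True (a sketch, not part of the original answer):
df['y'] = df['y'].mask(df['y'] > 7, 0)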

Python, Numpy: all UNIQUE combinations of a numpy.array() vector

I want to get all unique combinations of a numpy.array vector (or a pandas.Series). I used itertools.combinations but it's very slow. For an array of size (1000,) it takes many hours. Here is my code using itertools (actually I use combination differences):
import itertools
import numpy as np
import pandas as pd

def a(array):
    temp = pd.Series([])
    # absolute difference of every unique pair (no repetition)
    for i in itertools.combinations(array, 2):
        temp = temp.append(pd.Series(np.abs(i[0] - i[1])))
    temp.index = range(len(temp))
    return temp
As you can see, there is no repetition!
sklearn.utils.extmath.cartesian is really fast, but it produces repetitions, which I do not want. I need help rewriting the above function without itertools and with much better speed for large vectors.
You could take the upper triangular part of a matrix formed on the Cartesian product with the binary operation (here subtraction, as in your example):
import numpy as np
n = 3
a = np.random.randn(n)
print(a)
print(a - a[:, np.newaxis])
print((a - a[:, np.newaxis])[np.triu_indices(n, 1)])
gives
[ 0.04248369 -0.80162228 -0.44504522]
[[ 0. -0.84410597 -0.48752891]
[ 0.84410597 0. 0.35657707]
[ 0.48752891 -0.35657707 0. ]]
[-0.84410597 -0.48752891 0.35657707]
With n=1000 (and output piped to /dev/null) this runs in 0.131 s
on my relatively modest laptop.
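To match the question's actual use case (absolute differences of all unique pairs), the same idea can be wrapped in a small function; this is just a sketch and the name pairwise_abs_diffs is ours:
import numpy as np

def pairwise_abs_diffs(a):
    a = np.asarray(a)
    d = np.abs(a - a[:, np.newaxis])       # full n x n matrix of |a[j] - a[i]|
    return d[np.triu_indices(len(a), 1)]   # strict upper triangle: each pair once

# pairwise_abs_diffs([1.0, 3.0, 6.0]) -> array([2., 5., 3.])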
For a random array of ints:
import numpy as np
import pandas as pd
import itertools as it
b = np.random.randint(0, 8, (6,))
# array([7, 0, 6, 7, 1, 5])
pd.Series(list(it.combinations(np.unique(b), 2)))
it returns:
0 (0, 1)
1 (0, 5)
2 (0, 6)
3 (0, 7)
4 (1, 5)
5 (1, 6)
6 (1, 7)
7 (5, 6)
8 (5, 7)
9 (6, 7)
dtype: object
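The same pairs of unique values can also be built without itertools by reusing the triu_indices idea from the first answer (a sketch, assuming b is the array defined above):
import numpy as np

u = np.unique(b)                     # sorted unique values, e.g. [0, 1, 5, 6, 7]
r, c = np.triu_indices(len(u), 1)    # all index pairs with r < c
pairs = np.stack([u[r], u[c]], axis=1)
# each row of pairs is one unique combination, e.g. [0, 1], [0, 5], ...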

numpy, sums of subsets with no iterations [duplicate]

I have a massive data array (500k rows) that looks like:
id  value  score
 1     20     20
 1     10     30
 1     15      0
 2     12      4
 2      3      8
 2     56      9
 3      6     18
...
As you can see, there is a non-unique ID column to the left, and various scores in the 3rd column.
I'm looking to quickly add up all of the scores, grouped by IDs. In SQL this would look like SELECT sum(score) FROM table GROUP BY id
With NumPy I've tried iterating through each ID, truncating the table by each ID, and then summing the score up for that table.
table_trunc = table[(table == id).any(1)]
score = sum(table_trunc[:,2])
Unfortunately I'm finding the first command to be dog-slow. Is there any more efficient way to do this?
You can use bincount():
import numpy as np
ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]
print(np.bincount(ids, weights=data))
The output is [ 0. 50. 21. 18.], which means the sum for id==0 is 0, the sum for id==1 is 50, and so on.
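Note that bincount expects small non-negative integer ids (it allocates one bin per value up to the maximum). For arbitrary or sparse labels, a common pattern (a sketch, not part of the original answer) is to map them to 0..k-1 first with np.unique:
import numpy as np

ids = np.array([10, 10, 10, 42, 42, 42, 7])
data = np.array([20, 30, 0, 4, 8, 9, 18])
labels, inv = np.unique(ids, return_inverse=True)   # labels sorted, inv in 0..k-1
sums = np.bincount(inv, weights=data)
# labels -> [ 7 10 42], sums -> [18. 50. 21.]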
I noticed the numpy tag, but in case you don't mind using pandas (or if you read in these data using this module), this task becomes a one-liner:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,2,2,3], 'score': [20,30,0,4,8,9,18]})
So your dataframe would look like this:
   id  score
0   1     20
1   1     30
2   1      0
3   2      4
4   2      8
5   2      9
6   3     18
Now you can use the functions groupby() and sum():
df.groupby(['id'], sort=False).sum()
which gives you the desired output:
    score
id
1      50
2      21
3      18
By default the groups would be sorted, therefore I use the flag sort=False, which might improve speed for huge dataframes.
You can try using boolean operations:
import numpy as np
ids = np.array([1, 1, 1, 2, 2, 2, 3])
data = np.array([20, 30, 0, 4, 8, 9, 18])
[((ids == i) * data).sum() for i in np.unique(ids)]
This may be a bit more effective than using np.any, but will clearly have trouble if you have a very large number of unique ids to go along with large overall size of the data table.
If you're looking only for sum, you probably want to go with bincount. If you also need other grouping operations like product, mean, std, etc., have a look at https://github.com/ml31415/numpy-groupies. It's the fastest set of python/numpy grouping operations around; see the speed comparison there.
Your sum operation there would look like:
from numpy_groupies import aggregate
res = aggregate(id, score)
The numpy_indexed package has vectorized functionality to perform this operation efficiently, in addition to many related operations of this kind:
import numpy_indexed as npi
npi.group_by(id).sum(score)
You can use a for loop and numba:
import numpy as np
from numba import njit

@njit
def wbcnt(b, w, k):
    # weighted bincount: bins[id] accumulates the weights of that id
    bins = np.arange(k)
    bins = bins * 0
    for i in range(len(b)):
        bins[b[i]] += w[i]
    return bins
Using @HYRY's variables:
ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]
Then:
wbcnt(ids, data, 4)
array([ 0, 50, 21, 18])
Timing
%timeit wbcnt(ids, data, 4)
%timeit np.bincount(ids, weights=data)
1000000 loops, best of 3: 1.99 µs per loop
100000 loops, best of 3: 2.57 µs per loop
Maybe using itertools.groupby, you can group on the ID and then iterate over the grouped data.
(The data must be sorted by the grouping key, in this case the ID.)
>>> import itertools
>>> data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
>>> groups = itertools.groupby(data, lambda x: x[0])
>>> for i in groups:
...     for y in i:
...         if isinstance(y, int):
...             print(y)
...         else:
...             for p in y:
...                 print('-', p)
Output:
1
- (1, 20, 20)
- (1, 10, 30)
- (1, 15, 0)
2
- (2, 12, 4)
- (2, 3, 0)
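Tying this back to the original question, each group can then be reduced directly, for example summing the score column (third element) per id; a small sketch using the same sample data:
import itertools

data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
# Sum the score (third element) for each id; data must already be sorted by id.
sums = {k: sum(row[2] for row in g) for k, g in itertools.groupby(data, lambda x: x[0])}
# {1: 50, 2: 4}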