How to apply a function to all vector pairs in a matrix - numpy

I want to calculate the sum of distances of a row vector with respect to all other row vectors in a matrix. Thus the result has to be a square matrix.
For a matrix M:
    | a b c |   | v1 |
M = |       | = |    |
    | c d e |   | v2 |
I'd like to calculate:
    | (a-a)+(b-b)+(c-c)  (a-c)+(b-d)+(c-e) |   | v1-v1  v1-v2 |
M = |                                       | = |              |
    | (c-a)+(d-b)+(e-c)  (c-c)+(d-d)+(e-e) |   | v2-v1  v2-v2 |
I am aware that I could do this in a nested for loop, but is there a more elegant way to apply this, or any other operation like this, to a matrix with NumPy?

Use broadcasting -
(M[:,None,:]- M).sum(2)
Sample run -
In [41]: M
Out[41]:
array([[8, 3, 2],
       [6, 1, 2]])
In [42]: (M[:,None,:]- M).sum(2)
Out[42]:
array([[ 0,  4],
       [-4,  0]])
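For clarity on the broadcasting: M[:,None,:] has shape (2, 1, 3) and broadcasts against M of shape (2, 3) to a (2, 2, 3) array of all pairwise row differences; summing over the last axis collapses it to the (2, 2) result. A minimal sketch of the intermediate shapes, using the sample M above:
import numpy as np

M = np.array([[8, 3, 2],
              [6, 1, 2]])

pairwise = M[:, None, :] - M   # shape (2, 2, 3); pairwise[i, j] == M[i] - M[j]
print(pairwise.shape)          # (2, 2, 3)
print(pairwise.sum(axis=2))    # the (2, 2) matrix of summed differences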
If M is a NumPy matrix, get an array view into it with np.asarray() and then use it, like so -
M_arr = np.asarray(M)
out = np.asmatrix((M_arr[:,None,:]- M_arr).sum(2))
Sample run -
In [69]: M = np.asmatrix(np.random.randint(0,9,(2,3)))
In [70]: M
Out[70]:
matrix([[3, 8, 8],
        [0, 5, 0]])
In [71]: M_arr = np.asarray(M)
In [72]: np.asmatrix((M_arr[:,None,:]- M_arr).sum(2))
Out[72]:
matrix([[  0,  14],
        [-14,   0]])
Let's also verify that we are indeed working with a view there with np.asarray() -
In [73]: np.may_share_memory(M, M_arr)
Out[73]: True
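If the "sum of distances" in the question is meant as absolute differences (the signed differences above partially cancel within each row pair), the same broadcasting pattern works with np.abs applied before the sum. This is a variation on the answer above, shown only as a sketch:
import numpy as np

M = np.array([[8, 3, 2],
              [6, 1, 2]])

# Sum of element-wise absolute differences between every pair of rows
dist = np.abs(M[:, None, :] - M).sum(axis=2)
print(dist)
# [[0 4]
#  [4 0]]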

Related

Average of array column of two dataframes and find the maximum index in pyspark

I want to combine the column values of two dataframes, after performing some operations, to create a new dataframe in PySpark. The columns of each dataframe are vectors of integer values. The operations are: take the element-wise average of the vectors and find the index of the maximum element of the resulting vectors.
Dataframe1:
| id | value1  |
|----|---------|
| 0  | [0,1,2] |
| 1  | [3,4,5] |

Dataframe2:
| id | value2  |
|----|---------|
| 0  | [1,2,3] |
| 1  | [4,5,6] |

Dataframe3:
| value3        |
|---------------|
| [0.5,1.5,2.5] |
| [3.5,4.5,5.5] |

Dataframe4:
| value4 |
|--------|
| 2      |
| 2      |
Dataframe3 is obtained by taking the element-wise average of the vectors in Dataframe1 and Dataframe2, i.e. the first vector of Dataframe3, [0.5,1.5,2.5], is obtained as [(0+1)/2, (1+2)/2, (2+3)/2]. Dataframe4 is obtained by taking the index of the maximum value of each vector, i.e. for the first vector of Dataframe3, [0.5,1.5,2.5], the maximum value is 2.5 and it occurs at index 2, so the first element of Dataframe4 is 2. How can we implement this in PySpark?
V1:
+--------------------------------------+---+
|p1 |id |
+--------------------------------------+---+
|[0.01426862, 0.010903089, 0.9748283] |0 |
|[0.068229124, 0.89613986, 0.035630997]|1 |
+--------------------------------------+---+
V2:
+-------------------------+---+
|p2 |id |
+-------------------------+---+
|[0.0, 0.0, 1.0] |0 |
|[2.8160464E-27, 1.0, 0.0]|1 |
+-------------------------+---+
When df3 = v1.join(v2, on="id") is used, this is what I get for df3:
+-------------------------------------+---------------+
|p1 |p2 |
+-------------------------------------+---------------+
|[0.02203844, 0.010056663, 0.9679049] |[0.0, 0.0, 1.0]|
|[0.039553806, 0.015186918, 0.9452593]|[0.0, 0.0, 1.0]|
+-------------------------------------+---------------+
and when
df3 = df3.withColumn("p3", F.expr("transform(arrays_zip(p1, p2), x -> (x.p1 + x.p2) / 2)"))
df4 = df3.withColumn("p4", F.expr("array_position(p3, array_max(p3))"))
where p3 is the average value, I get all values of df4 as zero.
First, I recreate your test data:

a = [
    [0, [0, 1, 2]],
    [1, [3, 4, 5]],
]
b = ["id", "value1"]
df1 = spark.createDataFrame(a, b)

c = [
    [0, [1, 2, 3]],
    [1, [4, 5, 6]],
]
d = ["id", "value2"]
df2 = spark.createDataFrame(c, d)
Then, I process the data:
join
df3 = df1.join(df2, on="id")
df3.show()
+---+---------+---------+
| id| value1| value2|
+---+---------+---------+
| 0|[0, 1, 2]|[1, 2, 3]|
| 1|[3, 4, 5]|[4, 5, 6]|
+---+---------+---------+
create the average array
from pyspark.sql import functions as F, types as T

@F.udf(T.ArrayType(T.FloatType()))
def avg_array(array1, array2):
    return list(map(lambda x: (x[0] + x[1]) / 2, zip(array1, array2)))

df3 = df3.withColumn("value3", avg_array(F.col("value1"), F.col("value2")))

# OR without UDF
df3 = df3.withColumn(
    "value3",
    F.expr("transform(arrays_zip(value1, value2), x -> (x.value1 + x.value2) / 2)"),
)
df3.show()
+---+---------+---------+---------------+
| id| value1| value2| value3|
+---+---------+---------+---------------+
| 0|[0, 1, 2]|[1, 2, 3]|[0.5, 1.5, 2.5]|
| 1|[3, 4, 5]|[4, 5, 6]|[3.5, 4.5, 5.5]|
+---+---------+---------+---------------+
get the index (array_position starts at 1, you can do a -1 if necessary)
df4 = df3.withColumn("value4",F.expr("array_position(value3, array_max(value3))"))
df4.show()
+---+---------+---------+---------------+------+
| id| value1| value2| value3|value4|
+---+---------+---------+---------------+------+
| 0|[0, 1, 2]|[1, 2, 3]|[0.5, 1.5, 2.5]| 3|
| 1|[3, 4, 5]|[4, 5, 6]|[3.5, 4.5, 5.5]| 3|
+---+---------+---------+---------------+------+
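If a 0-based index is needed, to match the Dataframe4 in the original question where the expected value is 2, subtracting 1 inside the expression is enough. A minimal sketch, reusing df3 and the value3 column from above:
df4 = df3.withColumn(
    "value4",
    F.expr("array_position(value3, array_max(value3)) - 1"),
)
df4.show()
# value4 is now 2 for both rows, matching the expected Dataframe4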

Filtering based on value and creating list in spark dataframe

I am new to spark and I am trying to do the following, using Pyspark:
I have a dataframe with 3 columns, "id", "number1", "number2".
For each value of "id" I have multiple rows and what I want to do is create a list of tuples with all the rows that correspond to each id.
Eg, for the following dataframe
id | number1 | number2
a  | 1       | 1
a  | 2       | 2
b  | 3       | 3
b  | 4       | 4
the desired outcome would be 2 lists as such:
[(1, 1), (2, 2)]
and
[(3, 3), (4, 4)]
I'm not sure how to approach this, since I'm a newbie. I have managed to get a list of the distinct ids doing the following
distinct_ids = [x for x in df.select('id').distinct().collect()]
In pandas that I'm more familiar with, now I would loop through the dataframe for each distinct id and gather all the rows for it, but I'm sure this is far from optimal.
Can you give me any ideas? Groupby comes to mind, but I'm not sure how to approach it.
You can use groupby and aggregate using collect_list and array:
import pyspark.sql.functions as F
df2 = df.groupBy('id').agg(F.collect_list(F.array('number1', 'number2')).alias('number'))
df2.show()
+---+----------------+
| id| number|
+---+----------------+
| b|[[3, 3], [4, 4]]|
| a|[[1, 1], [2, 2]]|
+---+----------------+
And if you want to get back a list of tuples,
result = [[tuple(j) for j in i] for i in [r[0] for r in df2.select('number').orderBy('number').collect()]]
which gives result as [[(1, 1), (2, 2)], [(3, 3), (4, 4)]]
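The nested comprehension is fairly dense; the same logic written as an explicit loop (a sketch of the identical steps, not a different API) is:
result = []
for row in df2.select('number').orderBy('number').collect():
    # row[0] is the collected list of [number1, number2] pairs for one id
    result.append([tuple(pair) for pair in row[0]])
# result == [[(1, 1), (2, 2)], [(3, 3), (4, 4)]]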
If you want a numpy array, you can do
result = np.array([r[0] for r in df2.select('number').collect()])
which gives
array([[[3, 3],
        [4, 4]],

       [[1, 1],
        [2, 2]]])

Pandas column pairwise difference for each possible pair [duplicate]

This question already has answers here:
Pandas - Creating Difference Matrix from Data Frame
(3 answers)
Closed 4 years ago.
I have the following dataframe.
df = pd.DataFrame([['a', 4], ['b', 1], ['c', 2], ['d', 0], ], columns=['item', 'value'])
df
item | value
a | 4
b | 1
c | 2
d | 0
I want to calculate the pairwise absolute difference between each possible pair of item to give the following output.
item| a | b | c | d
a | 0.0 | 3.0 | 2.0 | 4.0
b | 3.0 | 0.0 | 1.0 | 1.0
c | 2.0 | 1.0 | 0.0 | 2.0
d | 4.0 | 1.0 | 2.0 | 0.0
After a lot of search, I could find answer only to direct element by element difference, which results in a single column output.
So far, I've tried
pd.pivot_table(df, values='value', index='item', columns='item', aggfunc=np.diff)
but this doesn't work.
This question has been answered here. The only difference is that you would need to add abs:
abs(df['value'].values - df['value'].values[:, None])
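To reproduce the labeled square layout from the question, with items as both index and columns, the broadcasted result can be wrapped in a DataFrame. A minimal sketch (the out name is just for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame([['a', 4], ['b', 1], ['c', 2], ['d', 0]], columns=['item', 'value'])

# Pairwise absolute differences via broadcasting, labeled with the items
diff = np.abs(df['value'].values - df['value'].values[:, None])
out = pd.DataFrame(diff, index=df['item'], columns=df['item'])
print(out)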
Not exactly the same output but taking a cue from here: https://stackoverflow.com/a/9704775/2064141
You can try this:
np.abs(np.array(df['value'])[:,np.newaxis] - np.array(df['value']))
Which gives:
array([[0, 3, 2, 4],
       [3, 0, 1, 1],
       [2, 1, 0, 2],
       [4, 1, 2, 0]])
Although I just saw the link from Harm te Molder and it seems to be more relevant for your use.

Julia: converting column type from Integer to Float64 in a DataFrame

I am trying to change the type of the numbers in a column of a DataFrame from integer to floating point. It should be straightforward to do this, but it's not working. The data type remains integer. What am I missing?
In [2]: using DataFrames
df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
Out [2]: 4x2 DataFrame
| Row | A | B |
|-----|---|-----|
| 1 | 1 | "M" |
| 2 | 2 | "F" |
| 3 | 3 | "F" |
| 4 | 4 | "M" |
In [3]: df[:,:A] = float64(df[:,:A])
Out [3]: 4-element DataArray{Float64,1}:
1.0
2.0
3.0
4.0
In [4]: df
Out [4]: 4x2 DataFrame
| Row | A | B |
|-----|---|-----|
| 1 | 1 | "M" |
| 2 | 2 | "F" |
| 3 | 3 | "F" |
| 4 | 4 | "M" |
In [5]: typeof(df[:,:A])
Out [5]: DataArray{Int64,1} (constructor with 1 method)
The reason this happens is mutation and conversion.
If you have two vectors
a = [1:3]
b = [4:6]
you can make x refer to one of them with assignment.
x = a
Now x and a refer to the same vector [1, 2, 3]. If you then assign b to x
x = b
you have now changed x to refer to the same vector as b refers to.
You can also mutate vectors by copying over the values in one vector to the other. If you do
x[:] = a
you copy the values of vector a into the vector x refers to (which is b), so now you have two vectors containing [1, 2, 3].
Then there is also conversion. If you copy a value of one type into a vector with a different element type, Julia will attempt to convert the value to the element type of the vector.
x[1] = 5.0
This gives you the vector [5, 2, 3] because Julia converted the Float64 value 5.0 to the Int value 5. If you tried
x[1] = 5.5
Julia would throw an InexactError() because 5.5 can't be losslessly converted to an integer.
When it comes to DataFrames things work the same as long as you realize a DataFrame is a collection of named references to vectors. So what you are doing when constructing the DataFrame in this call
df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
is that you create the vector [1, 2, 3, 4], and the vector ["M", "F", "F", "M"]. You then construct a DataFrame with references to these two new vectors.
Later when you do
df[:,:A] = float64(df[:,:A])
you first create a new vector by converting the values in the vector [1, 2, 3, 4] into Float64. You then mutate the vector referred to with df[:A] by copying over the values in the Float64 vector back into the Int vector, which causes Julia to convert the values back to Int.
What Colin T Bower's answer
df[:A] = float64(df[:A])
does is that rather than mutating the vector referred to by the DataFrame, he changes the reference to refer to the new vector with the Float64 values.
I hope this makes sense.
Try this:
df[:A] = float64(df[:A])
This works for me on Julia v0.3.5 with DataFrames v0.6.1.
This is quite interesting though. Notice that:
df[:, :A] = [2.0, 2.0, 3.0, 4.0]
will change the contents of the column to [2,2,3,4], but leaves the type as Int64, while
df[:A] = [2.0, 2.0, 3.0, 4.0]
will also change the type.
I just had quick look at the manual and couldn't see any reference to this behaviour (admittedly it was a very quick look). But I find this unintuitive enough that perhaps it is worth filing an issue.

Julia best way to reshape multi-dim array

I have a multi-dimensional array:
julia> sim1.value[1:5,:,:]
5x3x3 Array{Float64,3}:
[:, :, 1] =
0.201974 0.881742 0.497407
0.0751914 0.921308 0.732588
-0.109084 1.06304 1.15962
-0.0149133 0.896267 1.22897
0.717094 0.72558 0.456043
[:, :, 2] =
1.28742 0.760712 1.61112
2.21436 0.229947 1.87528
-1.66456 1.46374 1.94794
-2.4864 1.84093 2.34668
-2.79278 1.61191 2.22896
[:, :, 3] =
0.649675 0.899028 0.628103
0.718837 0.665043 0.153844
0.914646 0.807048 0.207743
0.612839 0.790611 0.293676
0.759457 0.758115 0.280334
I have names for the 2nd dimension in
julia> sim1.names
3-element Array{String,1}:
"beta[1]"
"beta[2]"
"s2"
Whats best way to reshape this multi-dim array so that I have a data frame like:
beta[1] | beta[2] | s2 | chain
0.201974 | 0.881742 | 0.497407 | 1
0.0751914| 0.921308 | 0.732588 | 1
-0.109084 | 1.06304 | 1.15962 | 1
-0.0149133| 0.896267 | 1.22897 | 1
... | ... | ... | ...
1.28742 | 0.760712 | 1.61112 | 2
2.21436 | 0.229947 | 1.87528 | 2
-1.66456 | 1.46374 | 1.94794 | 2
-2.4864 | 1.84093 | 2.34668 | 2
-2.79278 | 1.61191 | 2.22896 | 2
... | ... | ... | ...
At the moment, I think the best way to do this would be a mixture of loops and calls to reshape:
using DataFrames
A = randn(5, 3, 3)
df = DataFrame()
for j in 1:3
    df[j] = reshape(A[:, :, j], 5 * 3)
end
names!(df, [:beta1, :beta2, :s2])
Looking at your data, it seems you basically wanted to stack the three matrices output by sim1.value[1:5,:,:] on top of each other vertically, plus add another column with the index of the matrix. The accepted answer of the brilliant and venerable John Myles White seems to put the entire contents of each of those matrices into its own column.
The below matches your desired output, using vcat for the stacking and hcat and fill to add the extra column. I'm sure JMW will know if there's a better way :)
using DataFrames
A = randn(5, 3, 3)
names = ["beta[1]","beta[2]","s2"]
push!(names, "chain")
newA = vcat([hcat(A[:,:,i],fill(i,size(A,1))) for i in 1:size(A,3)]...)
df = DataFrame(newA, Symbol[names...])
Note also that you can do this slightly more concisely without the explicit calls to hcat and vcat:
newA = [[[A[:,:,i] fill(i,size(A,1))] for i in 1:size(A,3)]...]