Spark SQL: Transforming some rows into columns

Is there a way to transform a org.apache.spark.sql.DataFrame like this
Predictor  icaoCode  num1  num2
P1         OTHH      1.1   1.2
P1         ZGGG      2.1   2.2
P2         OTHH      3.1   3.2
P2         ZGGG      4.1   4.2
P3         OTHH      5.1   5.2
P3         ZGGG      6.1   6.2
...        ...       ...   ...
into a DataFrame like this?
icaoCode  P1.num1  P1.num2  P2.num1  P2.num2  P3.num1  P3.num2  ...
OTHH      1.1      1.2      3.1      3.2      5.1      5.2      ...
ZGGG      2.1      2.2      4.1      4.2      6.1      6.2      ...
...       ...      ...      ...      ...      ...      ...      ...
There can be an arbitrary number of values for Predictor and for icaoCode.

With Spark 1.6.0 there is a pivot function you can use to transform/transpose your data. In your case it requires some preprocessing to get the data ready for the pivot. Here is an example of how I'd do it:
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{col, lit, udf}

def doPivot(): Unit = {
  // assumes an existing SparkContext `sc` (e.g. in spark-shell)
  val sqlContext: SQLContext = new org.apache.spark.sql.SQLContext(sc)

  // dummy data
  val r1 = Input("P1", "OTHH", 1.1, 1.2)
  val r2 = Input("P1", "ZGGG", 2.1, 2.2)
  val r3 = Input("P2", "OTHH", 3.1, 3.2)
  val records = Seq(r1, r2, r3)
  val df = sqlContext.createDataFrame(records)

  // prepare data for pivot: build column labels such as "P1.num1"
  val fullName: ((String, String) => String) = (predictor: String, num: String) => {
    predictor + "." + num
  }
  val udfFullName = udf(fullName)

  val dfFullName = df.withColumn("num1-complete", udfFullName(col("predictor"), lit("num1")))
    .withColumn("num2-complete", udfFullName(col("predictor"), lit("num2")))

  // one row per (icaoCode, label, value)
  val dfPrepared = dfFullName.select(col("icaoCode"), col("num1") as "num", col("num1-complete") as "value")
    .unionAll(dfFullName.select(col("icaoCode"), col("num2") as "num", col("num2-complete") as "value"))

  // transpose/pivot dataframe
  val dfPivoted = dfPrepared.groupBy(col("icaoCode")).pivot("value").mean("num")

  dfPivoted.show()
}

case class Input(predictor: String, icaoCode: String, num1: Double, num2: Double)
The final dataframe should work for you:
+--------+-------+-------+-------+-------+
|icaoCode|P1.num1|P1.num2|P2.num1|P2.num2|
+--------+-------+-------+-------+-------+
| OTHH| 1.1| 1.2| 3.1| 3.2|
| ZGGG| 2.1| 2.2| null| null|
+--------+-------+-------+-------+-------+
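For reference, roughly the same preparation and pivot can be written in PySpark (1.6+). This is only a sketch, assuming a DataFrame df with the columns predictor, icaoCode, num1 and num2:
from pyspark.sql import functions as F

df_named = (df
    .withColumn("num1-complete", F.concat_ws(".", F.col("predictor"), F.lit("num1")))
    .withColumn("num2-complete", F.concat_ws(".", F.col("predictor"), F.lit("num2"))))

df_prepared = (df_named
    .select(F.col("icaoCode"), F.col("num1").alias("num"), F.col("num1-complete").alias("value"))
    .unionAll(df_named
        .select(F.col("icaoCode"), F.col("num2").alias("num"), F.col("num2-complete").alias("value"))))

# If you already know the expected labels, you can pass them to pivot()
# (e.g. .pivot("value", ["P1.num1", "P1.num2", ...])) to skip the extra pass
# that computes the distinct values.
df_pivoted = df_prepared.groupBy("icaoCode").pivot("value").mean("num")
df_pivoted.show()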

Related

Python chess: Check for passed pawns

In a chess position, I wish to check whether any passed pawn exists for white.
Is it possible to do so using the python-chess library? If not, how can I implement it?
def checkForPassedPawn(position: chess.Board, side_to_move: chess.Color):
    # ... check for passed pawn
    # return a boolean value
I could not find any built-in method that detects passed pawns.
You'll have to look at the pawn positions yourself. There are many ways to do that. For instance, you could take the board's string representation as a starting point:
r n b q k b n r
p p . . . p p p
. . . . . . . .
. . p P p . . .
. . . . . P . .
. . . . . . . .
P P P P . . P P
R N B Q K B N R
This is the kind of string you get with str(position).
Then you could put each column in a separate list:
lines = str(position).replace(" ", "").splitlines()
columns = list(zip(*lines))
This gives you:
[
('r', 'p', '.', '.', '.', '.', 'P', 'R'),
('n', 'p', '.', '.', '.', '.', 'P', 'N'),
('b', '.', '.', 'p', '.', '.', 'P', 'B'),
('q', '.', '.', 'P', '.', '.', 'P', 'Q'),
('k', '.', '.', 'p', '.', '.', '.', 'K'),
('b', 'p', '.', '.', 'P', '.', '.', 'B'),
('n', 'p', '.', '.', '.', '.', 'P', 'N'),
('r', 'p', '.', '.', '.', '.', 'P', 'R')
]
If the current player is white, you can then take the leftmost "P" in each tuple and check that there is no "p" to its left, either in the current tuple, the previous one, or the next one (those are the squares in front of the pawn on the same and adjacent files).
For the black player you would use similar logic; in that case it is useful to reverse the rows first.
Here is an implementation of that idea:
import chess

def checkForPassedPawn(position: chess.Board, side_to_move: chess.Color):
    selfpawn = "pP"[side_to_move]   # own pawn symbol: "P" for white, "p" for black
    otherpawn = "Pp"[side_to_move]  # opposing pawn symbol
    lines = str(position).replace(" ", "").splitlines()  # ranks 8 down to 1
    if side_to_move == chess.BLACK:
        lines.reverse()  # for black, put rank 1 first so "in front" is again to the left
    # turn rows into columns and vice versa
    columns = list(zip(*lines))
    for colnum, col in enumerate(columns):
        if selfpawn in col:
            rownum = col.index(selfpawn)  # most advanced own pawn in this file
            if (otherpawn not in col[:rownum]
                    and (colnum == 0 or otherpawn not in columns[colnum - 1][:rownum])
                    and (colnum == 7 or otherpawn not in columns[colnum + 1][:rownum])):
                # translate (colnum, rownum) back into a square name
                rank = rownum + 1 if side_to_move == chess.BLACK else 8 - rownum
                return f"{'abcdefgh'[colnum]}{rank}"
position = chess.Board()
position.push_san("e4")
position.push_san("d5")
position.push_san("f4")
position.push_san("e5")
position.push_san("exd5")
position.push_san("c5") # Now white pawn at d5 is a passed pawn
print(position)
passedpawn = checkForPassedPawn(position, chess.WHITE)
print("passed white pawn:", passedpawn)
position.push_san("d4")
position.push_san("e4") # Now black pawn at e4 is a passed pawn
print(position)
passedpawn = checkForPassedPawn(position, chess.BLACK)
print("passed black pawn:", passedpawn)
Output:
r n b q k b n r
p p . . . p p p
. . . . . . . .
. . p P p . . .
. . . . . P . .
. . . . . . . .
P P P P . . P P
R N B Q K B N R
passed white pawn: d5
r n b q k b n r
p p . . . p p p
. . . . . . . .
. . p P . . . .
. . . P p P . .
. . . . . . . .
P P P . . . P P
R N B Q K B N R
passed black pawn: e4
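As a side note, here is a rough sketch (not part of the answer above) that skips the string representation and works directly with python-chess squares via pieces(), square_file() and square_rank(); the function name has_passed_pawn is just an illustration, and it only reports whether a passed pawn exists:
import chess

def has_passed_pawn(position: chess.Board, color: chess.Color) -> bool:
    own = position.pieces(chess.PAWN, color)
    enemy = position.pieces(chess.PAWN, not color)
    for sq in own:
        f, r = chess.square_file(sq), chess.square_rank(sq)
        # "in front of" depends on the pawn's direction of travel
        in_front = (lambda er: er > r) if color == chess.WHITE else (lambda er: er < r)
        blocked = any(
            abs(chess.square_file(e) - f) <= 1 and in_front(chess.square_rank(e))
            for e in enemy
        )
        if not blocked:
            return True
    return False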

outliers with groupby in pandas

I have data that look like this (toy data):
import pandas as pd
import numpy as np

N = 5
dfi = pd.DataFrame()
for i in range(5):
    df = pd.DataFrame(index=pd.date_range("20100101", periods=N, freq='M'))
    df['price'] = np.random.randint(0, N, size=(len(df)))
    df['quantity'] = np.random.randint(0, N, size=(len(df)))
    df['type'] = 'P' + str(i)
    dfi = pd.concat([df, dfi], axis=0)
dfi
From this I would like to calculate a new price per type, i.e. something like:
new_price(t) = (1 + perf(t)) * new_price(t-1)
with:
new_price(0) = price(0)
and
perf(t) = price(t)/price(t-1) - 1 if abs(price(t)/price(t-1) - 1) < s else 0
I tried:
dfi['prix_corr'] = (dfi
    .sort_index()
    .groupby('type').price
    .apply(lambda x: x.pct_change() if x.pct_change().abs() <= 0.5 else 0)
)
but I get an error message:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I would like to correct for outliers in each group's time series. Any suggestions?
Given your input, you could use a custom function in place of the lambda expression, such as:
def compute_price_change(x):
    # flag percentage changes larger than 50% in absolute value, then zero them out
    mask = x.pct_change().abs() > 0.5
    x = x.pct_change()
    x[mask] = 0
    return x

dfi['prix_corr'] = (dfi
    .groupby('type').price
    .apply(compute_price_change)
)
Output:
price quantity type prix_corr
2010-01-31 3 0 P4 NaN
2010-02-28 3 2 P4 0.0
2010-03-31 0 2 P4 -0.5
2010-04-30 2 4 P4 0.5
2010-05-31 2 2 P4 0.0
2010-01-31 1 2 P3 NaN
2010-02-28 4 3 P3 0.0
2010-03-31 0 0 P3 0.0
2010-04-30 4 0 P3 0.0
2010-05-31 2 2 P3 0.0
. . . . .
. . . . .
. . . . .
Since .pct_change() returns NaN for the first entry, you might want to handle that in some way as well.
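If you also need the compounded new_price from your formula, here is a rough sketch (not part of the answer above; it assumes prix_corr holds the corrected returns computed here and that the rows within each type are already in chronological order):
# first observed price per type, broadcast back to every row
first_price = dfi.groupby('type')['price'].transform('first')

# cumulative product of (1 + corrected return); the leading NaN becomes 0
growth = (1 + dfi['prix_corr'].fillna(0)).groupby(dfi['type']).cumprod()

# new_price(0) = price(0), new_price(t) = (1 + perf(t)) * new_price(t-1)
dfi['new_price'] = first_price * growth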

SAS Proc IML Optimization

proc iml;
start f_prob(beta) global(one_m_one, pone_m_one);
   p = nrow(one_m_one);
   td = j(p, 3, 0.);
   a = 1;
   do i = 1 to p;
      td[i,1] = exp((one_m_one[i,1])*(beta[1]) + (one_m_one[i,2])*(beta[2]) + (one_m_one[i,3])*(beta[3]) + (one_m_one[i,4])*(beta[4]) + (one_m_one[i,5])*(beta[5]) + (one_m_one[i,6])*(beta[6]) + (one_m_one[i,7])*(beta[7]) + (one_m_one[i,8])*(beta[8]) + (one_m_one[i,9])*(beta[9]) + (one_m_one[i,10])*(beta[10]));
      do j = a to 11+a;
         td[i,2] = td[i,2] + exp((pone_m_one[j,1])*(beta[1]) + (pone_m_one[j,2])*(beta[2]) + (pone_m_one[j,3])*(beta[3]) + (pone_m_one[j,4])*(beta[4]) + (pone_m_one[j,5])*(beta[5]) + (pone_m_one[j,6])*(beta[6]) + (pone_m_one[j,7])*(beta[7]) + (pone_m_one[j,8])*(beta[8]) + (pone_m_one[j,9])*(beta[9]) + (pone_m_one[j,10])*(beta[10]));
      end;
      a = a + 12;
   end;
   td[,3] = td[,1]/td[,2];
   f = 1;
   do i = 1 to p;
      f = f*td[i,3];
   end;
   return(f);
finish f_prob;

/* Set up the constraints: sum(x)=0 */
/*     x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 SIGN VALUE */
con = {.  .  .  .  .  .  .  .  .  .   .    .,    /* specify lower bounds */
       .  .  .  .  .  .  .  .  .  .   .    .,    /* specify upper bounds */
       1  1  1  1  1  1  1  1  1  1   0    0};   /* constraints */

beta0 = j(1, 10, 0);
optn = {1, 4};
call nlpnra(rc, result, "f_prob", beta0, optn) blc=con;
Hi, I am trying to optimise the function f, which has 10 parameters, subject to the constraint that all 10 parameters sum to zero.
Can anyone suggest how I can write the code for the last part so that I can optimise f and get the results I want? Thanks in advance.
The documentation provides an example of how to specify a linear constraint matrix. For your example, use a 3 x 12 matrix.
On the first row (columns 1:10) put any lower-bound constraints for the parameters.
On the second row (columns 1:10) put any upper-bound constraints for the parameters.
On the third row, put all ones in columns 1:10. Put a 0 in column 11 to indicate the EQUAL sign. Put 0 in the 12th column to indicate the value of the constraint.
The code looks like this:
/* Set up the constraints: sum(x)=0 */
/*     x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 SIGN VALUE */
con = {.  .  .  .  .  .  .  .  .  .   .    .,    /* specify lower bounds */
       .  .  .  .  .  .  .  .  .  .   .    .,    /* specify upper bounds */
       1  1  1  1  1  1  1  1  1  1   0    0};   /* constraints */

call nlpnra(rc, result, "f_prob", beta, optn) blc=con;
The third row specifies the coefficients of the linear expression c*x = 0, where c = {1 1 ... 1} is taken from columns 1:10 of that row.
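Purely as an aside for readers more at home in Python: the same sum-to-zero equality constraint could be expressed for scipy.optimize like this. This is not SAS/IML, the objective f below is only a placeholder for f_prob, and since the nlpnra call above maximizes, a real port would minimize the negated objective:
import numpy as np
from scipy.optimize import LinearConstraint, minimize

def f(beta):
    # placeholder objective standing in for the IML function f_prob
    return np.sum(beta ** 2)

beta0 = np.zeros(10)

# one linear equality constraint: 1*b1 + 1*b2 + ... + 1*b10 = 0
sum_to_zero = LinearConstraint(np.ones((1, 10)), lb=0, ub=0)

res = minimize(f, beta0, method="trust-constr", constraints=[sum_to_zero])
print(res.x)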

How to create triangle mesh from list of points in Elm

Let's say I have a list of points
[p1,p2,p3,p4,p5,p6, ...] or [[p1,p2,p3,...],[...]...]
where p1, p2, p3 are one strip and p4, p5, p6 the other.
p1 - p4 - p7 ...
| / | / |
p2 - p5 - p8 ...
| / | / |
p3 - p6 - p9 ...
. . .
. . .
. . .
How can I transform this into a list of
[(p1,p2,p4), (p4,p5,p2), (p2,p3,p5), (p5,p6,p3), ...]
Is there a way to do this without converting the list into an Array and then using get and handling all the Maybes?
First let's define how to split a square into two triangles:
squareToTriangles : a -> a -> a -> a -> List (a, a, a)
squareToTriangles topLeft botLeft topRight botRight =
    [ (topLeft, botLeft, topRight)
    , (topRight, botRight, botLeft)
    ]
Since each square is built from two neighbouring columns, let's assume for now that you can use a list of (left, right) pairs as input. You can then make triangles out of lists of left/right points:
triangles : List (a, a) -> List (a, a, a)
triangles list =
    case list of
        (tl, tr) :: ((bl, br) :: _ as rest) ->
            List.append
                (squareToTriangles tl bl tr br)
                (triangles rest)

        _ ->
            []
Of course, your input doesn't involve tuples, so let's define something that takes a list of lists as input:
triangleMesh : List (List a) -> List (a, a, a)
triangleMesh list =
    case list of
        left :: (right :: _ as rest) ->
            List.append
                (triangles <| List.map2 (,) left right)
                (triangleMesh rest)

        _ ->
            []
Now you can pass in your list of lists, such that:
triangleMesh [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
-- yields...
[(1,2,4),(4,5,2),(2,3,5),(5,6,3),(4,5,7),(7,8,5),(5,6,8),(8,9,6)]
Note that this can probably be optimized by using a better method than List.append, but the general algorithm holds.
You can simply pattern match on your list as follows:
toMesh : List Float -> List (Float, Float, Float)
toMesh list =
    case list of
        [ p1, p2, p3, p4, p5, p6 ] ->
            [ (p1, p2, p4), (p4, p5, p2), (p2, p3, p5), (p5, p6, p3) ]

        _ ->
            []

Why doesn't this use of Elem typecheck?

I am quite confused.
module Experiment
import Data.Vect
p1: Elem 5 [3,4,5,6]
p1 = There (There Here)
v : Vect 4 Int
v = 3 :: 4 :: 5 :: 6 :: Nil
p2: Elem 5 v
p2 = There (There Here)
The definition of p2 does not typecheck, while the definition of p1 does. I am using Idris 0.10.2. Is there something I am missing?
Lowercase names in type declarations are interpreted as implicit arguments (like a in length : List a -> Nat, which is actually length : {a : Type} -> List a -> Nat). To refer to the defined Vect you can either use an uppercase name or qualify the name with its namespace:
module Experiment
import Data.Vect
A : Vect 4 Int
A = 3 :: 4 :: 5 :: 6 :: Nil
p2: Elem 5 A
p2 = There (There Here)
a : Vect 4 Int
a = 3 :: 4 :: 5 :: 6 :: Nil
p3: Elem 5 Experiment.a
p3 = There (There Here)