Filtering based on value and creating a list in a Spark dataframe

I am new to spark and I am trying to do the following, using Pyspark:
I have a dataframe with 3 columns, "id", "number1", "number2".
For each value of "id" I have multiple rows and what I want to do is create a list of tuples with all the rows that correspond to each id.
E.g., for the following dataframe:
id | number1 | number2
a  | 1       | 1
a  | 2       | 2
b  | 3       | 3
b  | 4       | 4
the desired outcome would be 2 lists as such:
[(1, 1), (2, 2)]
and
[(3, 3), (4, 4)]
I'm not sure how to approach this, since I'm a newbie. I have managed to get a list of the distinct ids by doing the following:
distinct_ids = [x for x in df.select('id').distinct().collect()]
In pandas, which I'm more familiar with, I would now loop through the dataframe for each distinct id and gather all the rows for it, but I'm sure this is far from optimal.
Can you give me any ideas? Groupby comes to mind, but I'm not sure how to approach it.

You can use groupby and aggregate using collect_list and array:
import pyspark.sql.functions as F
df2 = df.groupBy('id').agg(F.collect_list(F.array('number1', 'number2')).alias('number'))
df2.show()
+---+----------------+
| id| number|
+---+----------------+
| b|[[3, 3], [4, 4]]|
| a|[[1, 1], [2, 2]]|
+---+----------------+
And if you want to get back a list of tuples,
result = [[tuple(j) for j in i] for i in [r[0] for r in df2.select('number').orderBy('number').collect()]]
which gives result as [[(1, 1), (2, 2)], [(3, 3), (4, 4)]]
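If you also need to know which id each list belongs to, a minimal sketch (reusing the df2 above) is to collect the id column alongside the aggregated arrays and build a dict:
rows = df2.select('id', 'number').collect()
lists_by_id = {r['id']: [tuple(x) for x in r['number']] for r in rows}
# lists_by_id == {'a': [(1, 1), (2, 2)], 'b': [(3, 3), (4, 4)]}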
If you want a numpy array, you can do
result = np.array([r[0] for r in df2.select('number').collect()])
which gives
array([[[3, 3],
        [4, 4]],

       [[1, 1],
        [2, 2]]])
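Note that without an explicit orderBy the row order after groupBy is not guaranteed, which is why b appears first above. If you want the numpy array in id order, a small sketch (again assuming the df2 above):
import numpy as np
result = np.array([r[0] for r in df2.orderBy('id').select('number').collect()])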

Related

Is there a PySpark function that will merge data from a column for rows with the same id?

I have the following dataframe:
+---+---+
| A | B |
+---+---+
| 1 | a |
| 1 | b |
| 1 | c |
| 2 | f |
| 2 | g |
| 3 | j |
+---+---+
I need it to be in a df/rdd format
(1, [a, b, c])
(2, [f, g])
(3, [j])
I'm new to Spark and was wondering if this operation can be performed by a single function.
I tried using flatMap, but I don't think I'm using it correctly.
You can group by "A" and then aggregate using a function such as collect_set or collect_list:
import pyspark.sql.functions as F

df = [
    {"A": 1, "B": "a"},
    {"A": 1, "B": "b"},
    {"A": 1, "B": "c"},
    {"A": 2, "B": "f"},
    {"A": 2, "B": "g"},
    {"A": 3, "B": "j"}
]
df = spark.createDataFrame(df)
df.groupBy("A").agg(F.collect_set(F.col("B"))).show()
Output
+---+--------------+
| A|collect_set(B)|
+---+--------------+
| 1| [c, b, a]|
| 2| [g, f]|
| 3| [j]|
+---+--------------+
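If you need to keep duplicate values within each group (collect_set removes them), collect_list is the drop-in alternative; a minimal sketch on the same df:
df.groupBy("A").agg(F.collect_list(F.col("B")).alias("B_list")).show()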
First step, create sample data.
#
# 1 - Create sample dataframe + view
#
# array of tuples - data
dat1 = [
    (1, "a"),
    (1, "b"),
    (1, "c"),
    (2, "f"),
    (2, "g"),
    (3, "j")
]
# array of names - columns
col1 = ["A", "B"]
# make data frame
df1 = spark.createDataFrame(data=dat1, schema=col1)
# make temp hive view
df1.createOrReplaceTempView("sample_data")
Second step, play around with the temporary view.
%sql
select * from sample_data
%sql
select A, collect_list(B) as B_LIST from sample_data group by A
Last step, run the Spark SQL from Python to create the dataframe you want.
df2 = spark.sql("select A, collect_list(B) as B_LIST from sample_data group by A")
display(df2)
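If you specifically need an RDD of (id, list) pairs rather than a dataframe, a small sketch using the df2 above would be:
rdd_out = df2.rdd.map(lambda r: (r["A"], r["B_LIST"]))
rdd_out.collect()  # e.g. [(1, ['a', 'b', 'c']), (2, ['f', 'g']), (3, ['j'])]; row and element order are not guaranteed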
In summary, you can use the dataframe methods to create the same output. However, the Spark SQL version looks cleaner and is easier to follow.

pyspark pandas UDF merging date ranges in multiple rows

I am modifying the function described here to work with pyspark.
Input
from pyspark.sql import functions as F

data_in = spark.createDataFrame([
    [1, "2017-1-1", "2017-6-30"], [1, "2017-1-1", "2017-1-3"], [1, "2017-5-1", "2017-9-30"],
    [1, "2018-5-1", "2018-9-30"], [1, "2018-5-2", "2018-10-31"], [1, "2017-4-1", "2017-5-30"],
    [1, "2017-10-3", "2017-10-3"], [1, "2016-12-5", "2016-12-31"], [1, "2016-12-1", "2016-12-2"],
    [2, "2016-12-1", "2016-12-2"], [2, "2016-12-3", "2016-12-25"]
], schema=["id", "start_dt", "end_dt"])

data_in = data_in.select("id",
                         F.to_date("start_dt", "yyyy-M-d").alias("start_dt"),
                         F.to_date("end_dt", "yyyy-M-d").alias("end_dt")
                        ).sort(["id", "start_dt", "end_dt"])
Aggregate function to apply
from datetime import datetime

mydt = datetime(1970, 1, 1).date()

def merge_dates(grp):
    dt_groups = ((grp["start_dt"] - grp["end_dt"].shift(fill_value=mydt)).dt.days > 1).cumsum()
    grouped = grp.groupby(dt_groups).agg({"start_dt": "min", "end_dt": "max"})
    return grouped if len(grp) == len(grouped) else merge_dates(grouped)
Testing using Pandas
df = data_in.toPandas()
df.groupby("id").apply(merge_dates).reset_index().drop('level_1', axis=1)
Output
   id    start_dt      end_dt
0   1  2016-12-01  2016-12-02
1   1  2016-12-05  2017-09-30
2   1  2017-10-03  2017-10-03
3   1  2018-05-01  2018-10-31
4   2  2016-12-01  2016-12-25
When I try to run this using Spark
data_out = data_in.groupby("id").applyInPandas(merge_dates, schema=data_in.schema)
display(data_out)
I get the following error
PythonException: 'RuntimeError: Number of columns of the returned pandas.DataFrame doesn't match specified schema. Expected: 3 Actual: 2'. Full traceback below:
When I change the schema to data_in.schema[1:], I get back only the date columns, which are computed correctly (they match the Pandas output), but the result does not include the id field, which is obviously required. How can I fix this so that the final output includes the id as well?
With spark only, if we replicate what you have in pandas, it would look like below:
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

w = W.partitionBy("id").orderBy(F.monotonically_increasing_id())
w1 = w.rangeBetween(W.unboundedPreceding, 0)

out = (data_in
       .withColumn("helper", F.datediff(F.col("start_dt"),
                                        F.lag("end_dt").over(w)) > 1)
       .fillna({"helper": True})
       .withColumn("helper2", F.sum(F.col("helper").cast("int")).over(w1))
       .groupBy("id", "helper2").agg(F.min("start_dt").alias("start_dt"),
                                     F.max("end_dt").alias("end_dt"))
       .drop("helper2"))
out.show()
+---+----------+----------+
| id| start_dt| end_dt|
+---+----------+----------+
| 1|2016-12-01|2016-12-02|
| 1|2016-12-05|2017-09-30|
| 1|2017-10-03|2017-10-03|
| 1|2018-05-01|2018-10-31|
| 2|2016-12-01|2016-12-25|
+---+----------+----------+
Note that this assumes mydt = datetime(1970,1,1).date() is just a placeholder for the nulls produced when shifting the values; I have used fillna with True for the same purpose. If that is not the case, you can fillna right after the lag, which is the equivalent of shift.
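As a side note on the original applyInPandas error: one common fix for the schema mismatch (sketched here, not part of the answer above) is to let applyInPandas pass the grouping key to the UDF and re-attach it before returning:
def merge_dates_with_id(key, grp):
    out = merge_dates(grp)       # returns only start_dt / end_dt
    out.insert(0, "id", key[0])  # re-attach the group's id as the first column
    return out

data_out = data_in.groupby("id").applyInPandas(merge_dates_with_id, schema=data_in.schema)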

Conditional merge 'on' in pandas

My question is simple: I am using pd.merge to merge two dfs.
Here's the line of code:
pivoted = pd.merge(pivoted, concerned_data, on='A')
and I want on='B' to be used whenever a row has a null value in column A. Is there a way to do this?
Edit:
As an example, if
df1:  A   | B | randomval
      1   | 1 | ty
      NaN | 2 | asd
df2:  A | B   | randomval2
      1 | NaN | tyrte
      3 | 2   | asde
So if on='A' and the value is NaN in either of the dfs (for a single row), I want on='B' for that row only.
Thank you!
You could create a third column in your pandas.DataFrame which incorporates this logic and merge on this one.
For example, create dummy data
import pandas as pd

df1 = pd.DataFrame({"A": [1, None], "B": [1, 2], "Val1": ["a", "b"]})
df2 = pd.DataFrame({"A": [1, 2], "B": [None, 2], "Val2": ["c", "d"]})
Create a column C that encodes this logic:
df1["C"] = pd.concat([df1.loc[~df1.A.isna(), "A"], df1.loc[df1.A.isna(), "B"]],ignore_index=False)
df2["C"] = pd.concat([df2.loc[~df2.A.isna(), "A"], df2.loc[df2.A.isna(), "B"]],ignore_index=False)
Finally, merge on this common column and include only your value columns
df3 = pd.merge(df1[["Val1","C"]], df2[["Val2","C"]], on='C')
In [27]: df3
Out[27]:
  Val1    C Val2
0    a  1.0    c
1    b  2.0    d
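As an aside, the same C column can be built more concisely with fillna, since filling the NaNs of A with the values of B expresses exactly the fallback logic (a sketch that gives the same result on the dummy data above):
df1["C"] = df1["A"].fillna(df1["B"])
df2["C"] = df2["A"].fillna(df2["B"])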

How to apply a function to all vector pairs in a matrix

I want to calculate the sum of distances of a row vector with respect to all other row vectors in a matrix. Thus the result has to be a square matrix.
For a matrix M:
    | a b c |   | v1 |
M = |       | = |    |
    | c d e |   | v2 |
I'd like to calculate:
    | (a-a)+(b-b)+(c-c)  (a-c)+(b-d)+(c-e) |   | v1-v1  v1-v2 |
M = |                                      | = |              |
    | (c-a)+(d-b)+(e-c)  (c-c)+(d-d)+(e-e) |   | v2-v1  v2-v2 |
I am aware that I could do this in a nested for loop but is there a more elegant way to apply this, or any other operation like this, to a matrix with numpy?
Use broadcasting -
(M[:,None,:]- M).sum(2)
Sample run -
In [41]: M
Out[41]:
array([[8, 3, 2],
       [6, 1, 2]])

In [42]: (M[:,None,:]- M).sum(2)
Out[42]:
array([[ 0,  4],
       [-4,  0]])
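The same broadcasting pattern extends to other pairwise row operations; for example, a short sketch computing the squared Euclidean distances between all pairs of rows:
import numpy as np

M = np.array([[8, 3, 2],
              [6, 1, 2]])
diff = M[:, None, :] - M            # shape (2, 2, 3): every pairwise row difference
sq_dists = (diff ** 2).sum(axis=2)  # shape (2, 2): squared distance for each row pair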
If M is a NumPy matrix, get an array view into it with np.asarray() and then use it, like so -
M_arr = np.asarray(M)
out = np.asmatrix((M_arr[:,None,:]- M_arr).sum(2))
Sample run -
In [69]: M = np.asmatrix(np.random.randint(0,9,(2,3)))

In [70]: M
Out[70]:
matrix([[3, 8, 8],
        [0, 5, 0]])

In [71]: M_arr = np.asarray(M)

In [72]: np.asmatrix((M_arr[:,None,:]- M_arr).sum(2))
Out[72]:
matrix([[  0,  14],
        [-14,   0]])
Let's also verify that we are indeed working with a view there with np.asarray() -
In [73]: np.may_share_memory(M, M_arr)
Out[73]: True

Julia: converting column type from Integer to Float64 in a DataFrame

I am trying to change the type of the numbers in a column of a DataFrame from integer to floating point. It should be straightforward to do this, but it's not working; the data type remains integer. What am I missing?
In [2]: using DataFrames
df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
Out [2]: 4x2 DataFrame
| Row | A | B |
|-----|---|-----|
| 1 | 1 | "M" |
| 2 | 2 | "F" |
| 3 | 3 | "F" |
| 4 | 4 | "M" |
In [3]: df[:,:A] = float64(df[:,:A])
Out [3]: 4-element DataArray{Float64,1}:
1.0
2.0
3.0
4.0
In [4]: df
Out [4]: 4x2 DataFrame
| Row | A | B |
|-----|---|-----|
| 1 | 1 | "M" |
| 2 | 2 | "F" |
| 3 | 3 | "F" |
| 4 | 4 | "M" |
In [5]: typeof(df[:,:A])
Out [5]: DataArray{Int64,1} (constructor with 1 method)
The reason this happens is mutation and conversion.
If you have two vectors
a = [1:3]
b = [4:6]
you can make x refer to one of them with assignment.
x = a
Now x and a refer to the same vector [1, 2, 3]. If you then assign b to x
x = b
you have now changed x to refer to the same vector as b refers to.
You can also mutate vectors by copying over the values in one vector to the other. If you do
x[:] = a
you copy the values from vector a into the vector that x refers to (that is, b), so now you have two vectors containing [1, 2, 3].
Then there is also conversion. If you copy a value of one type into a vector with a different element type, Julia will attempt to convert the value to the vector's element type.
x[1] = 5.0
This gives you the vector [5, 2, 3] because Julia converted the Float64 value 5.0 to the Int value 5. If you tried
x[1] = 5.5
Julia would throw an InexactError() because 5.5 can't be losslessly converted to an integer.
When it comes to DataFrames, things work the same way, as long as you realize a DataFrame is a collection of named references to vectors. So what you are doing when constructing the DataFrame in this call
df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
is that you create the vector [1, 2, 3, 4], and the vector ["M", "F", "F", "M"]. You then construct a DataFrame with references to these two new vectors.
Later when you do
df[:,:A] = float64(df[:,:A])
you first create a new vector by converting the values in the vector [1, 2, 3, 4] into Float64. You then mutate the vector referred to with df[:A] by copying over the values in the Float64 vector back into the Int vector, which causes Julia to convert the values back to Int.
What Colin T Bower's answer
df[:A] = float64(df[:A])
does is that, rather than mutating the vector referred to by the DataFrame, it changes the reference to point to the new vector of Float64 values.
I hope this makes sense.
Try this:
df[:A] = float64(df[:A])
This works for me on Julia v0.3.5 with DataFrames v0.6.1.
This is quite interesting though. Notice that:
df[:, :A] = [2.0, 2.0, 3.0, 4.0]
will change the contents of the column to [2, 2, 3, 4] but leave the type as Int64, while
df[:A] = [2.0, 2.0, 3.0, 4.0]
will also change the type.
I just had a quick look at the manual and couldn't see any reference to this behaviour (admittedly it was a very quick look). But I find this unintuitive enough that perhaps it is worth filing an issue.