Is there a PySpark function that will merge data from a column for rows with the same id? - dataframe

I have the following dataframe:
+---+---+
| A | B |
+---+---+
| 1 | a |
| 1 | b |
| 1 | c |
| 2 | f |
| 2 | g |
| 3 | j |
+---+---+
I need it in a DataFrame/RDD format like:
(1, [a, b, c])
(2, [f, g])
(3, [j])
I'm new to Spark and was wondering whether this operation can be performed by a single function.
I tried using flatMap, but I don't think I'm using it correctly.

You can group by "A" and then use an aggregate function, for example collect_set or collect_list.
import pyspark.sql.functions as F
df = [
    {"A": 1, "B": "a"},
    {"A": 1, "B": "b"},
    {"A": 1, "B": "c"},
    {"A": 2, "B": "f"},
    {"A": 2, "B": "g"},
    {"A": 3, "B": "j"},
]
df = spark.createDataFrame(df)
df.groupBy("A").agg(F.collect_set(F.col("B"))).show()
Output
+---+--------------+
|  A|collect_set(B)|
+---+--------------+
|  1|     [c, b, a]|
|  2|        [g, f]|
|  3|           [j]|
+---+--------------+
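Note that collect_set removes duplicates and does not guarantee any ordering. If you want to keep duplicates and get output closer to the lists in the question, collect_list is the closer match; a small sketch, wrapping it in sort_array only to make the display deterministic:
import pyspark.sql.functions as F
# collect_list keeps duplicates; sort_array makes the output deterministic
df.groupBy("A").agg(F.sort_array(F.collect_list("B")).alias("B_list")).show()
# +---+---------+
# |  A|   B_list|
# +---+---------+
# |  1|[a, b, c]|
# |  2|   [f, g]|
# |  3|      [j]|
# +---+---------+
# (the order of the groups themselves may vary)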

First step, create sample data.
#
# 1 - Create sample dataframe + view
#
# array of tuples - data
dat1 = [
    (1, "a"),
    (1, "b"),
    (1, "c"),
    (2, "f"),
    (2, "g"),
    (3, "j"),
]
# array of names - columns
col1 = ["A", "B"]
# make data frame
df1 = spark.createDataFrame(data=dat1, schema=col1)
# make temp hive view
df1.createOrReplaceTempView("sample_data")
Second step, play around with the temporary view.
%sql
select * from sample_data
%sql
select A, collect_list(B) as B_LIST from sample_data group by A
Last step, write code that executes the Spark SQL to create the dataframe you want.
df2 = spark.sql("select A, collect_list(B) as B_LIST from sample_data group by A")
display(df2)
In summary, you can use the DataFrame methods to create the same output. However, the Spark SQL version looks cleaner and reads more naturally.
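For completeness, the equivalent DataFrame API call (a sketch that should produce the same result as the SQL above):
import pyspark.sql.functions as F
# same aggregation as the Spark SQL query, expressed with DataFrame methods
df2 = df1.groupBy("A").agg(F.collect_list("B").alias("B_LIST"))
display(df2)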

Related

How to convert 1 row 4 columns dataframe to 4 rows 2 columns dataframe in pyspark or sql

I have a dataframe with one row and four columns (Machines, Books, Vehicles, Plants), and I would like to transpose it into a two-column dataframe (Fields, Count) with one row per original column.
Can someone help me understand how to prepare the PySpark code to achieve this result dynamically? I have tried unpivot in SQL, but with no luck.
df = spark.createDataFrame(
    [(78, 20, 19, 90)],
    ('Machines', 'Books', 'Vehicles', 'Plants')
)
Create a new array-of-struct column that combines the column names and their values, then use the inline SQL function to explode the structs into rows. Code below:
df.withColumn('tab', F.array(*[F.struct(F.lit(x).alias('Fields'), F.col(x).alias('Count')).alias(x) for x in df.columns])).selectExpr('inline(tab)').show()
+--------+-----+
|  Fields|Count|
+--------+-----+
|Machines|   78|
|   Books|   20|
|Vehicles|   19|
|  Plants|   90|
+--------+-----+
As mentioned in the unpivot-dataframe tutorial, use:
df = df.selectExpr("""stack(4, "Machines", Machines, "Books", Books, "Vehicles", Vehicles, "Plants", Plants) as (Fields, Count)""")
Or to generalise:
cols = [f'"{c}", {c}' for c in df.columns]
exprs = f"stack({len(cols)}, {', '.join(str(c) for c in cols)}) as (Fields, Count)"
df = df.selectExpr(exprs)
Full example:
df = spark.createDataFrame(data=[[78,20,19,90]], schema=['Machines','Books','Vehicles','Plants'])
# Hard coded
# df = df.selectExpr("""stack(4, "Machines", Machines, "Books", Books, "Vehicles", Vehicles, "Plants", Plants) as (Fields, Count)""")
# Generalised
cols = [f'"{c}", {c}' for c in df.columns]
exprs = f"stack({len(cols)}, {', '.join(str(c) for c in cols)}) as (Fields, Count)"
df = df.selectExpr(exprs)
[Out]:
+--------+-----+
|Fields  |Count|
+--------+-----+
|Machines|78   |
|Books   |20   |
|Vehicles|19   |
|Plants  |90   |
+--------+-----+
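If you are on Spark 3.4 or later, there is also a built-in DataFrame.unpivot (aliased as melt) that avoids hand-building the stack expression; a sketch under that version assumption, applied to the original wide df (i.e. before the selectExpr above overwrites it):
# every column becomes a (Fields, Count) pair, so no id columns are kept
df_long = df.unpivot(
    ids=[],
    values=df.columns,
    variableColumnName="Fields",
    valueColumnName="Count",
)
df_long.show()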

Average of array column of two dataframes and find the maximum index in pyspark

I want to combine the column values of two dataframes, after performing some operations, to create a new dataframe in PySpark. The columns of each dataframe are vectors with integer values. The operations are: take the element-wise average of the vectors of the two dataframes, then find the index of the maximum element of each new vector.
Dataframe1:
| id | value1  |
|----|---------|
| 0  | [0,1,2] |
| 1  | [3,4,5] |
Dataframe2:
| id | value2  |
|----|---------|
| 0  | [1,2,3] |
| 1  | [4,5,6] |
Dataframe3:
| value3        |
|---------------|
| [0.5,1.5,2.5] |
| [3.5,4.5,5.5] |
Dataframe4:
| value4 |
|--------|
| 2      |
| 2      |
Dataframe3 is obtained by taking the element-wise average of the vectors in dataframes 1 and 2, i.e. the first vector of dataframe3, [0.5, 1.5, 2.5], is obtained as [(0+1)/2, (1+2)/2, (2+3)/2]. Dataframe4 is obtained by taking the index of the maximum value of each vector, i.e. for the first vector of dataframe3, [0.5, 1.5, 2.5], the maximum value is 2.5 and it occurs at index 2, so the first element in Dataframe4 is 2. How can we implement this in PySpark?
V1:
+--------------------------------------+---+
|p1 |id |
+--------------------------------------+---+
|[0.01426862, 0.010903089, 0.9748283] |0 |
|[0.068229124, 0.89613986, 0.035630997]|1 |
+--------------------------------------+---+
V2:
+-------------------------+---+
|p2 |id |
+-------------------------+---+
|[0.0, 0.0, 1.0] |0 |
|[2.8160464E-27, 1.0, 0.0]|1 |
+-------------------------+---+
When df3 = v1.join(v2, on="id") is used, this is what I get for df3:
+-------------------------------------+---------------+
|p1 |p2 |
+-------------------------------------+---------------+
|[0.02203844, 0.010056663, 0.9679049] |[0.0, 0.0, 1.0]|
|[0.039553806, 0.015186918, 0.9452593]|[0.0, 0.0, 1.0]|
+-------------------------------------+---------------+
and when
df3 = df3.withColumn( "p3", F.expr("transform(arrays_zip(p1, p2), x -> (x.p1 + x.p2) / 2)"),)
df4 = df3.withColumn("p4",F.expr("array_position(p3, array_max(p3))"))
where p3 is the average value, I get all values of df4 as zero.
First, I recreate your test data:
a = [
    [0, [0, 1, 2]],
    [1, [3, 4, 5]],
]
b = ["id", "value1"]
df1 = spark.createDataFrame(a, b)
c = [
    [0, [1, 2, 3]],
    [1, [4, 5, 6]],
]
d = ["id", "value2"]
df2 = spark.createDataFrame(c, d)
Then, I process the data:
join
df3 = df1.join(df2, on="id")
df3.show()
+---+---------+---------+
| id| value1| value2|
+---+---------+---------+
| 0|[0, 1, 2]|[1, 2, 3]|
| 1|[3, 4, 5]|[4, 5, 6]|
+---+---------+---------+
create the average array
from pyspark.sql import functions as F, types as T
@F.udf(T.ArrayType(T.FloatType()))
def avg_array(array1, array2):
    return list(map(lambda x: (x[0] + x[1]) / 2, zip(array1, array2)))
df3 = df3.withColumn("value3", avg_array(F.col("value1"), F.col("value2")))
# OR without UDF
df3 = df3.withColumn(
    "value3",
    F.expr("transform(arrays_zip(value1, value2), x -> (x.value1 + x.value2) / 2)"),
)
df3.show()
+---+---------+---------+---------------+
| id| value1| value2| value3|
+---+---------+---------+---------------+
| 0|[0, 1, 2]|[1, 2, 3]|[0.5, 1.5, 2.5]|
| 1|[3, 4, 5]|[4, 5, 6]|[3.5, 4.5, 5.5]|
+---+---------+---------+---------------+
get the index (array_position starts at 1; subtract 1 if you need a 0-based index)
df4 = df3.withColumn("value4",F.expr("array_position(value3, array_max(value3))"))
df4.show()
+---+---------+---------+---------------+------+
| id| value1| value2| value3|value4|
+---+---------+---------+---------------+------+
| 0|[0, 1, 2]|[1, 2, 3]|[0.5, 1.5, 2.5]| 3|
| 1|[3, 4, 5]|[4, 5, 6]|[3.5, 4.5, 5.5]| 3|
+---+---------+---------+---------------+------+
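If you want the 0-based index from the question (i.e. value4 equal to 2), a minimal sketch that just subtracts 1, since array_position is 1-based:
# array_position is 1-based, so subtract 1 to get a 0-based index
df4 = df3.withColumn("value4", F.expr("array_position(value3, array_max(value3)) - 1"))
df4.select("id", "value4").show()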

Filtering based on value and creating list in spark dataframe

I am new to Spark and I am trying to do the following, using PySpark:
I have a dataframe with 3 columns, "id", "number1", "number2".
For each value of "id" I have multiple rows and what I want to do is create a list of tuples with all the rows that correspond to each id.
Eg, for the following dataframe
id | number1 | number2 |
a | 1 | 1 |
a | 2 | 2 |
b | 3 | 3 |
b | 4 | 4 |
the desired outcome would be 2 lists as such:
[(1, 1), (2, 2)]
and
[(3, 3), (4, 4)]
I'm not sure how to approach this, since I'm a newbie. I have managed to get a list of the distinct ids doing the following
distinct_ids = [x for x in df.select('id').distinct().collect()]
In pandas, which I'm more familiar with, I would now loop through the dataframe for each distinct id and gather all the rows for it, but I'm sure this is far from optimal.
Can you give me any ideas? groupBy comes to mind, but I'm not sure how to approach it.
You can use groupby and aggregate using collect_list and array:
import pyspark.sql.functions as F
df2 = df.groupBy('id').agg(F.collect_list(F.array('number1', 'number2')).alias('number'))
df2.show()
+---+----------------+
| id|          number|
+---+----------------+
|  b|[[3, 3], [4, 4]]|
|  a|[[1, 1], [2, 2]]|
+---+----------------+
And if you want to get back a list of tuples,
result = [[tuple(j) for j in i] for i in [r[0] for r in df2.select('number').orderBy('number').collect()]]
which gives result as [[(1, 1), (2, 2)], [(3, 3), (4, 4)]]
If you want a NumPy array (assuming import numpy as np), you can do
result = np.array([r[0] for r in df2.select('number').collect()])
which gives
array([[[3, 3],
        [4, 4]],

       [[1, 1],
        [2, 2]]])
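If you would rather have the lists keyed by their id instead of relying on row order, a small sketch (this collects to the driver, so it is only suitable for small results):
# build {id: [(number1, number2), ...]} from the grouped dataframe
result_by_id = {r['id']: [tuple(x) for x in r['number']] for r in df2.collect()}
# e.g. {'a': [(1, 1), (2, 2)], 'b': [(3, 3), (4, 4)]}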

Update row index when all columns of the next row are NaN in a Pandas DataFrame

I have a Pandas DataFrame extracted from a PDF with tabula-py.
The PDF is like this:
+--------------+--------+-------+
| name | letter | value |
+--------------+--------+-------+
| A short name | a | 1 |
+-------------------------------+
| Another | b | 2 |
+-------------------------------+
| A very large | c | 3 |
| name | | |
+-------------------------------+
| other one | d | 4 |
+-------------------------------+
| My name is | e | 5 |
| big | | |
+--------------+--------+-------+
As you can see, A very large name has a line break and, as the original PDF does not have borders, a row with ['name', NaN, NaN] and another with ['A very large', 'c', 3] are created in the DataFrame, when I want only a single one with content: ['A very large name', 'c', 3].
The same happens with My name is big.
As this happens for several rows, what I'm trying to achieve is to concatenate the content of the name cell with the previous one when the rest of the cells in the row are NaN, and then delete the NaN rows.
But any other strategy that obtain the same result is welcome.
import pandas as pd
import numpy as np
data = {
    "name": ["A short name", "Another", "A very large", "name", "other one", "My name is", "big"],
    "letter": ["a", "b", "c", np.NaN, "d", "e", np.NaN],
    "value": [1, 2, 3, np.NaN, 4, 5, np.NaN],
}
df = pd.DataFrame(data)
data_expected = {
    "name": ["A short name", "Another", "A very large name", "other one", "My name is big"],
    "letter": ["a", "b", "c", "d", "e"],
    "value": [1, 2, 3, 4, 5],
}
df_expected = pd.DataFrame(data_expected)
I'm trying code like this, but it is not working:
# Does not work and is not very `pandastonic`
nan_indexes = df[df.iloc[:, 1:].isna().all(axis='columns')].index
df.loc[nan_indexes - 1, "name"] = df.loc[nan_indexes - 1, "name"].str.cat(df.loc[nan_indexes, "name"], ' ')
# remove NaN rows
You can try groupby.agg, using ' '.join or 'first' depending on the column. The groups are created by checking where the letter and value columns are notna, then taking a cumsum.
print(df.groupby(df[['letter', 'value']].notna().any(axis=1).cumsum())
        .agg({'name': ' '.join, 'letter': 'first', 'value': 'first'}))
                name letter  value
1       A short name      a    1.0
2            Another      b    2.0
3  A very large name      c    3.0
4          other one      d    4.0
5     My name is big      e    5.0
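If you need the result to match df_expected exactly, a small follow-up sketch that resets the index and restores the integer dtype of value:
# same aggregation, then tidy up index and dtypes to match df_expected
out = (df.groupby(df[['letter', 'value']].notna().any(axis=1).cumsum())
         .agg({'name': ' '.join, 'letter': 'first', 'value': 'first'})
         .reset_index(drop=True)
         .astype({'value': int}))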

Conditional merge on in pandas

My question is simple: I am using pd.merge to merge two dfs.
Here's the line of code:
pivoted = pd.merge(pivoted, concerned_data, on='A')
and I want to use on='B' whenever a row has a null value in column A. Is there a possible way to do this?
Edit:
As an example, if
df1:  A   | B   | randomval
      1   | 1   | ty
      NaN | 2   | asd
df2:  A   | B   | randomval2
      1   | NaN | tyrte
      3   | 2   | asde
So if on='A' and the value is NaN in either df (for a single row), I want on='B' for that row only.
Thank you!
You could create a third column in your pandas.DataFrame which incorporates this logic and merge on this one.
For example, create dummy data
df1 = pd.DataFrame({"A" : [1, None], "B" : [1, 2], "Val1" : ["a", "b"]})
df2 = pd.DataFrame({"A" : [1, 2], "B" : [None, 2], "Val2" : ["c", "d"]})
Create a column C which has this logic
df1["C"] = pd.concat([df1.loc[~df1.A.isna(), "A"], df1.loc[df1.A.isna(), "B"]],ignore_index=False)
df2["C"] = pd.concat([df2.loc[~df2.A.isna(), "A"], df2.loc[df2.A.isna(), "B"]],ignore_index=False)
Finally, merge on this common column and include only your value columns
df3 = pd.merge(df1[["Val1","C"]], df2[["Val2","C"]], on='C')
In [27]: df3
Out[27]:
  Val1    C Val2
0    a  1.0    c
1    b  2.0    d
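An equivalent, slightly more compact way to build the helper column is with fillna, under the same assumption that B is populated whenever A is null:
# use A where it exists, otherwise fall back to B
df1["C"] = df1["A"].fillna(df1["B"])
df2["C"] = df2["A"].fillna(df2["B"])
df3 = pd.merge(df1[["Val1", "C"]], df2[["Val2", "C"]], on="C")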