SQL grouped running sum

I have some data like this:
data = [("1","1"), ("1","1"), ("1","1"), ("2","1"), ("2","1"), ("3","1"), ("3","1"), ("4","1")]
df = spark.createDataFrame(data=data, schema=["id","imp"])
df.createOrReplaceTempView("df")
+---+---+
| id|imp|
+---+---+
| 1| 1|
| 1| 1|
| 1| 1|
| 2| 1|
| 2| 1|
| 3| 1|
| 3| 1|
| 4| 1|
+---+---+
I want the count of IDs grouped by ID, its running sum, and the total sum. This is the code I'm using:
query = """
select id,
count(id) as count,
sum(count(id)) over (order by count(id) desc) as running_sum,
sum(count(id)) over () as total_sum
from df
group by id
order by count desc
"""
spark.sql(query).show()
+---+-----+-----------+---------+
| id|count|running_sum|total_sum|
+---+-----+-----------+---------+
| 1| 3| 3| 8|
| 2| 2| 7| 8|
| 3| 2| 7| 8|
| 4| 1| 8| 8|
+---+-----+-----------+---------+
The problem is with the running_sum column. For some reason it groups the rows whose count is 2 while summing and shows 7 for both ID 2 and 3.
This is the result I'm expecting:
+---+-----+-----------+---------+
| id|count|running_sum|total_sum|
+---+-----+-----------+---------+
| 1| 3| 3| 8|
| 2| 2| 5| 8|
| 3| 2| 7| 8|
| 4| 1| 8| 8|
+---+-----+-----------+---------+

You should compute the counts in a subquery and do the running sum in an outer query. Your running_sum window orders by count(id) with the default RANGE frame, so all peer rows (ties on the count) are summed together, which is why both rows with count 2 show 7. An explicit ROWS frame avoids this:
spark.sql('''
    select *,
           sum(cnt) over (order by id rows between unbounded preceding and current row) as run_sum,
           sum(cnt) over (partition by '1') as tot_sum
    from (
        select id, count(id) as cnt
        from df
        group by id)
'''). \
    show()
# +---+---+-------+-------+
# | id|cnt|run_sum|tot_sum|
# +---+---+-------+-------+
# | 1| 3| 3| 8|
# | 2| 2| 5| 8|
# | 3| 2| 7| 8|
# | 4| 1| 8| 8|
# +---+---+-------+-------+
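If you prefer to keep everything in a single query, the same fix works inline; the essential parts are the explicit ROWS frame and a tie-breaker (id here) in the window ORDER BY so ties are not summed together. A sketch against the df view above:
spark.sql("""
select id,
       count(id) as count,
       sum(count(id)) over (order by count(id) desc, id
                            rows between unbounded preceding and current row) as running_sum,
       sum(count(id)) over () as total_sum
from df
group by id
order by count desc
""").show()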
Using the DataFrame API:
import sys
import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd

df. \
    groupBy('id'). \
    agg(func.count('id').alias('cnt')). \
    withColumn('run_sum',
               func.sum('cnt').over(wd.partitionBy().orderBy('id').rowsBetween(-sys.maxsize, 0))
               ). \
    withColumn('tot_sum', func.sum('cnt').over(wd.partitionBy())). \
    show()
# +---+---+-------+-------+
# | id|cnt|run_sum|tot_sum|
# +---+---+-------+-------+
# | 1| 3| 3| 8|
# | 2| 2| 5| 8|
# | 3| 2| 7| 8|
# | 4| 1| 8| 8|
# +---+---+-------+-------+
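Side note: the -sys.maxsize / 0 frame bounds can also be written with the named constants on the Window class; an equivalent window spec would be:
wd.partitionBy().orderBy('id').rowsBetween(wd.unboundedPreceding, wd.currentRow)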

Related

PySpark Create Relationship between DataFrame Columns

I am trying to implement some logic to get a relationship between ID and Link based on the logic below.
Logic:
If id 1 has a link with 2 and 2 has a link with 3, then the relations are 1 -> 2, 1 -> 3, 2 -> 1, 2 -> 3, 3 -> 1, 3 -> 2.
Similarly, if 1 is linked with 4, 4 with 7 and 7 with 5, then the relations are 1 -> 4, 1 -> 5, 1 -> 7, 4 -> 1, 4 -> 5, 4 -> 7, 5 -> 1, 5 -> 4, 5 -> 7.
Input DataFrame -
+---+----+
| id|link|
+---+----+
| 1| 2|
| 3| 1|
| 4| 2|
| 6| 5|
| 9| 7|
| 9| 10|
+---+----+
I am trying to achieve the output below:
+---+----+
| Id|Link|
+---+----+
| 1| 2|
| 1| 3|
| 1| 4|
| 2| 1|
| 2| 3|
| 2| 4|
| 3| 1|
| 3| 2|
| 3| 4|
| 4| 1|
| 4| 2|
| 4| 3|
| 5| 6|
| 6| 5|
| 7| 9|
| 7| 10|
| 9| 7|
| 9| 10|
| 10| 7|
| 10| 9|
+---+----+
I have tried many ways, but it's not working at all. I have tried the following code as well:
df = spark.createDataFrame([(1, 2), (3, 1), (4, 2), (6, 5), (9, 7), (9, 10)], ["id", "link"])
ids = df.select("Id").distinct().rdd.flatMap(lambda x: x).collect()
links = df.select("Link").distinct().rdd.flatMap(lambda x: x).collect()
combinations = [(id, link) for id in ids for link in links]
df_combinations = spark.createDataFrame(combinations, ["Id", "Link"])
result = df_combinations.join(df, ["Id", "Link"], "left_anti").union(df).dropDuplicates()
result = result.sort(asc("Id"), asc("Link"))
and
df = spark.createDataFrame([(1, 2), (3, 1), (4, 2), (6, 5), (9, 7), (9, 10)], ["id", "link"])
combinations = df.alias("a").crossJoin(df.alias("b")) \
.filter(F.col("a.id") != F.col("b.id"))\
.select(col("a.id").alias("a_id"), col("b.id").alias("b_id"), col("a.link").alias("a_link"), col("b.link").alias("b_link"))
window = Window.partitionBy("a_id").orderBy("a_id", "b_link")
paths = combinations.groupBy("a_id", "b_link") \
.agg(F.first("b_id").over(window).alias("id")) \
.groupBy("id").agg(F.collect_list("b_link").alias("links"))
result = paths.select("id", F.explode("links").alias("link"))
result = result.union(df.selectExpr("id as id_", "link as link_"))
Any help would be much appreciated.
This is not a general approach, but you can use the graphframes package. It can be a struggle to set up, but once it is working the result is simple.
import os
sc.addPyFile(os.path.expanduser('graphframes-0.8.1-spark3.0-s_2.12.jar'))
from graphframes import *
e = df.select('id', 'link').toDF('src', 'dst')
v = e.select('src').toDF('id') \
.union(e.select('dst')) \
.distinct()
g = GraphFrame(v, e)
sc.setCheckpointDir("/tmp/graphframes")
df = g.connectedComponents()
df.join(df.withColumnRenamed('id', 'link'), ['component'], 'inner') \
.drop('component') \
.filter('id != link') \
.show()
+---+----+
| id|link|
+---+----+
| 7| 10|
| 7| 9|
| 3| 2|
| 3| 4|
| 3| 1|
| 5| 6|
| 6| 5|
| 9| 10|
| 9| 7|
| 1| 2|
| 1| 4|
| 1| 3|
| 10| 9|
| 10| 7|
| 4| 2|
| 4| 1|
| 4| 3|
| 2| 4|
| 2| 1|
| 2| 3|
+---+----+
The connectedComponents method returns a component id for each vertex, which is unique per group of vertices that are connected by edges (and separated from other groups when there is no edge to the other component). So you can take the Cartesian product within each component, excluding the vertex itself.
Added answer
Inspired by the approach above, I looked around and found the networkx package.
import networkx as nx
import pyspark.sql.functions as f

df = df.toPandas()
G = nx.from_pandas_edgelist(df, 'id', 'link')
components = [[list(c)] for c in nx.connected_components(G)]
df2 = spark.createDataFrame(components, ['array']) \
    .withColumn('component', f.monotonically_increasing_id()) \
    .select('component', f.explode('array').alias('id'))
df2.join(df2.withColumnRenamed('id', 'link'), ['component'], 'inner') \
.drop('component') \
.filter('id != link') \
.show()
+---+----+
| id|link|
+---+----+
| 1| 2|
| 1| 3|
| 1| 4|
| 2| 1|
| 2| 3|
| 2| 4|
| 3| 1|
| 3| 2|
| 3| 4|
| 4| 1|
| 4| 2|
| 4| 3|
| 5| 6|
| 6| 5|
| 9| 10|
| 9| 7|
| 10| 9|
| 10| 7|
| 7| 9|
| 7| 10|
+---+----+

Spark SQL: how to understand ('*, 'id % 3 as "bucket") in spark.range(7).select('*, 'id % 3 as "bucket").show

spark.range(7).select('*,'id % 3 as "bucket").show
// result:
+---+------+
| id|bucket|
+---+------+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 0|
| 4| 1|
| 5| 2|
| 6| 0|
+---+------+
spark.range(7).withColumn("bucket", $"id" % 3).show
// result:
+---+------+
| id|bucket|
+---+------+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 0|
| 4| 1|
| 5| 2|
| 6| 0|
+---+------+
I want to know how to read '* and the whole select statement.
Also, are the two ways above (the select version and the withColumn version) equivalent?
spark.range(7).select('*,'id % 3 as "bucket").show
spark.range(7).select($"*",$"id" % 3 as "bucket").show
spark.range(7).select(col("*"),col("id") % 3 as "bucket").show
val df = spark.range(7)
df.select(df("*"),df("id") % 3 as "bucket").show
These four ways are equivalent. 'id is Scala Symbol syntax: with import spark.implicits._ in scope (the spark-shell imports it automatically), symbols and $"..."-interpolated strings are implicitly converted to Column, so '*, $"*", col("*") and df("*") all refer to all of the columns, and as "bucket" names the derived column. The withColumn version produces the same result as well.
// https://spark.apache.org/docs/2.4.4/api/scala/index.html#org.apache.spark.sql.Column
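For reference (not part of the original answer), a PySpark analogue of the same select, assuming a SparkSession named spark:
from pyspark.sql import functions as F

# keep all columns and add a derived "bucket" column, like the Scala version
spark.range(7).select(F.col("*"), (F.col("id") % 3).alias("bucket")).show()

# or with a SQL expression string
spark.range(7).selectExpr("*", "id % 3 as bucket").show()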

PySpark: How to find and convert top 5 row values to 1 and rest all to 0?

I have a dataframe and I need to find the maximum 5 values in each row, convert only those values to 1 and all the rest to 0, while maintaining the dataframe structure, i.e. the column names should remain the same.
I tried using toLocalIterator, converting each row to a list, and then converting the top 5 values to 1.
But it gives me a java.lang.OutOfMemoryError when I run the code on a large dataset.
While looking at the logs I found that a task of very large size (around 25000 KB) is submitted, while the max recommended size is 100 KB.
Is there a better way to find and convert the top 5 values to a certain value (1 in this case) and all the rest to 0 that would use less memory?
EDIT 1:
For example, if I have these 10 columns and 5 rows as the input
+----+----+----+----+----+----+----+----+----+----+
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10|
+----+----+----+----+----+----+----+----+----+----+
|0.74| 0.9|0.52|0.85|0.18|0.23| 0.3| 0.0| 0.1|0.07|
|0.11|0.57|0.81|0.81|0.45|0.48|0.86|0.38|0.41|0.45|
|0.03|0.84|0.17|0.96|0.09|0.73|0.25|0.05|0.57|0.66|
| 0.8|0.94|0.06|0.44| 0.2|0.89| 0.9| 1.0|0.48|0.14|
|0.73|0.86|0.68| 1.0|0.78|0.17|0.11|0.19|0.18|0.83|
+----+----+----+----+----+----+----+----+----+----+
This is what I want as the output:
+---+---+---+---+---+---+---+---+---+---+
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10|
+---+---+---+---+---+---+---+---+---+---+
| 1| 1| 1| 1| 0| 0| 1| 0| 0| 0|
| 0| 1| 1| 1| 0| 1| 1| 0| 0| 0|
| 0| 1| 0| 1| 0| 1| 0| 0| 1| 1|
| 1| 1| 0| 0| 0| 1| 1| 1| 0| 0|
| 1| 1| 0| 1| 1| 0| 0| 0| 0| 1|
+---+---+---+---+---+---+---+---+---+---+
As you can see, I want to find the top (max) 5 values in each row, convert them to 1 and the rest of the values to 0, while maintaining the structure, i.e. rows and columns.
This is what I am using (which gives me the OutOfMemoryError):
for row in prob_df.rdd.toLocalIterator():
    rowPredDict = {}
    for cat in categories:
        rowPredDict[cat] = row[cat]
    sorted_row = sorted(rowPredDict.items(), key=lambda kv: kv[1], reverse=True)
    #print(rowPredDict)
    rowPredDict = rowPredDict.fromkeys(rowPredDict, 0)
    rowPredDict[sorted_row[0:5][0][0]] = 1
    rowPredDict[sorted_row[0:5][1][0]] = 1
    rowPredDict[sorted_row[0:5][2][0]] = 1
    rowPredDict[sorted_row[0:5][3][0]] = 1
    rowPredDict[sorted_row[0:5][4][0]] = 1
    #print(count,sorted_row[0:2][0][0],",",sorted_row[0:2][1][0])
    rowPredList.append(rowPredDict)
    #count=count+1
I don't have enough data volume for performance testing, but could you try the approach below using Spark's array function APIs?
1. Prepare the dataset:
import pyspark.sql.functions as f
l1 = [(0.74,0.9,0.52,0.85,0.18,0.23,0.3,0.0,0.1,0.07),
(0.11,0.57,0.81,0.81,0.45,0.48,0.86,0.38,0.41,0.45),
(0.03,0.84,0.17,0.96,0.09,0.73,0.25,0.05,0.57,0.66),
(0.8,0.94,0.06,0.44,0.2,0.89,0.9,1.0,0.48,0.14),
(0.73,0.86,0.68,1.0,0.78,0.17,0.11,0.19,0.18,0.83)]
df = spark.createDataFrame(l1).toDF('col_1','col_2','col_3','col_4','col_5','col_6','col_7','col_8','col_9','col_10')
df.show()
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|col_10|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| 0.74| 0.9| 0.52| 0.85| 0.18| 0.23| 0.3| 0.0| 0.1| 0.07|
| 0.11| 0.57| 0.81| 0.81| 0.45| 0.48| 0.86| 0.38| 0.41| 0.45|
| 0.03| 0.84| 0.17| 0.96| 0.09| 0.73| 0.25| 0.05| 0.57| 0.66|
| 0.8| 0.94| 0.06| 0.44| 0.2| 0.89| 0.9| 1.0| 0.48| 0.14|
| 0.73| 0.86| 0.68| 1.0| 0.78| 0.17| 0.11| 0.19| 0.18| 0.83|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
2. Get the top 5 for each row
Apply the following steps on df: build an array from all the columns, sort its elements in descending order, and take the first 5 elements into a new column called all.
UDF to get the top 5 elements from the sorted array:
Note: Spark >= 2.4.0 has a slice function which can do a similar task. I am currently using 2.2, so I am creating a UDF, but if you have 2.4 or a higher version you can give slice a try (a sketch is shown after step 3).
import pyspark.sql.types as t

def get_n_elements_(arr, n):
    return arr[:n]

get_n_elements = f.udf(get_n_elements_, t.ArrayType(t.DoubleType()))

df_all = df.withColumn('all', get_n_elements(f.sort_array(f.array(df.columns), False), f.lit(5)))
df_all.show(truncate=False)
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------------------------------+
|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|col_10|all |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------------------------------+
|0.74 |0.9 |0.52 |0.85 |0.18 |0.23 |0.3 |0.0 |0.1 |0.07 |[0.9, 0.85, 0.74, 0.52, 0.3] |
|0.11 |0.57 |0.81 |0.81 |0.45 |0.48 |0.86 |0.38 |0.41 |0.45 |[0.86, 0.81, 0.81, 0.57, 0.48]|
|0.03 |0.84 |0.17 |0.96 |0.09 |0.73 |0.25 |0.05 |0.57 |0.66 |[0.96, 0.84, 0.73, 0.66, 0.57]|
|0.8 |0.94 |0.06 |0.44 |0.2 |0.89 |0.9 |1.0 |0.48 |0.14 |[1.0, 0.94, 0.9, 0.89, 0.8] |
|0.73 |0.86 |0.68 |1.0 |0.78 |0.17 |0.11 |0.19 |0.18 |0.83 |[1.0, 0.86, 0.83, 0.78, 0.73] |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------------------------------+
3. Create dynamic SQL and execute it with selectExpr
sql_stmt = ''' case when array_contains(all, {0}) then 1 else 0 end AS `{0}` '''
df_all.selectExpr(*[sql_stmt.format(c) for c in df.columns]).show()
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|col_10|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| 1| 1| 1| 1| 0| 0| 1| 0| 0| 0|
| 0| 1| 1| 1| 0| 1| 1| 0| 0| 0|
| 0| 1| 0| 1| 0| 1| 0| 0| 1| 1|
| 1| 1| 0| 0| 0| 1| 1| 1| 0| 0|
| 1| 1| 0| 1| 1| 0| 0| 0| 0| 1|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
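As mentioned in the note under step 2, on Spark 2.4+ the UDF can be replaced with the built-in slice function. A minimal sketch (same f alias as above), kept here only as an alternative way to build the all column:
# Spark 2.4+ only: take the 5 largest values per row without a UDF
df_all = df.withColumn('all', f.slice(f.sort_array(f.array(*df.columns), asc=False), 1, 5))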
You can do it easily like this.
For example, say we want to do the task for the value column: first sort by the value column, take the 5th value, and then change the values using a when condition.
df2 = sc.parallelize([("fo", 100,20),("rogerg", 110,56),("franre", 1080,297),("f11", 10100,217),("franci", 10,227),("fran", 1002,5),("fran231cis", 10007,271),("franc3is", 1030,2)]).toDF(["name", "salary","value"])
df2 = df2.orderBy("value",ascending=False)
+----------+------+-----+
| name|salary|value|
+----------+------+-----+
| franre| 1080| 297|
|fran231cis| 10007| 271|
| franci| 10| 227|
| f11| 10100| 217|
| rogerg| 110| 56|
| fo| 100| 20|
| fran| 1002| 5|
| franc3is| 1030| 2|
+----------+------+-----+
from pyspark.sql.functions import when
maxx = df2.take(5)[4]["value"]   # value of the 5th row after sorting descending
dff = df2.select(when(df2['value'] >= maxx, 1).otherwise(0).alias("value"), "name", "salary")
dff.show()
+---+----------+------+
|value| name|salary|
+---+----------+------+
| 1| franre| 1080|
| 1|fran231cis| 10007|
| 1| franci| 10|
| 1| f11| 10100|
| 1| rogerg| 110|
| 0| fo| 100|
| 0| fran| 1002|
| 0| franc3is| 1030|
+---+----------+------+

Apache Spark window, choose previous last item based on some condition

I have input data with the columns id, pid, pname, ppid: id (you can think of it as time), pid (process id), pname (process name), and ppid (the parent process id that created pid).
+---+---+-----+----+
| id|pid|pname|ppid|
+---+---+-----+----+
| 1| 1| 5| -1|
| 2| 1| 7| -1|
| 3| 2| 9| 1|
| 4| 2| 11| 1|
| 5| 3| 5| 1|
| 6| 4| 7| 2|
| 7| 1| 9| 3|
+---+---+-----+----+
Now I need to find ppname (parent process name), which is the pname of the last previous row satisfying the condition previous.pid == current.ppid.
Expected result for the previous example:
+---+---+-----+----+------+
| id|pid|pname|ppid|ppname|
+---+---+-----+----+------+
| 1| 1| 5| -1| -1|
| 2| 1| 7| -1| -1| no item found above with pid=-1
| 3| 2| 9| 1| 7| last pid = 1(ppid) above, pname=7
| 4| 2| 11| 1| 7|
| 5| 3| 5| 1| 7|
| 6| 4| 7| 2| 11| last pid = 2(ppid) above, pname=11
| 7| 1| 9| 3| 5| last pid = 3(ppid) above, pname=5
+---+---+-----+----+------+
I could self-join on pid == ppid, then take the difference between ids and pick the row with the minimum positive difference, and maybe join back again for the cases where we didn't find any positive diff (the -1 case).
But I am thinking that is almost like a cross join, which I might not be able to afford since I have 100M rows.
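For what it's worth, a minimal sketch of that self-join idea (assuming the frame above is called df and that pname/ppid are integers; not a verified answer). Because of the equality condition prev.pid == cur.ppid, Spark can use an equi-join here rather than a full cross join, with the id inequality applied as a join filter:
import pyspark.sql.functions as F
from pyspark.sql import Window

cur = df.alias("cur")
prev = df.select("id", "pid", "pname").alias("prev")

# keep only previous rows whose pid matches the current row's ppid
joined = cur.join(
    prev,
    (F.col("prev.pid") == F.col("cur.ppid")) & (F.col("prev.id") < F.col("cur.id")),
    "left"
)

# among the matches, keep the most recent one (largest prev.id); -1 when there is no match
w = Window.partitionBy(F.col("cur.id")).orderBy(F.col("prev.id").desc())
result = (joined
          .withColumn("rn", F.row_number().over(w))
          .filter("rn = 1")
          .select(F.col("cur.id"), F.col("cur.pid"), F.col("cur.pname"), F.col("cur.ppid"),
                  F.coalesce(F.col("prev.pname"), F.lit(-1)).alias("ppname"))
          .orderBy("id"))
result.show()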

Counting number of nulls in pyspark dataframe by row

So I want to count the number of nulls in a dataframe by row.
Please note there are 50+ columns. I know I could do a case/when statement to do this, but I would prefer a neater solution.
For example, a subset:
columns = ['id', 'item1', 'item2', 'item3']
vals = [(1, 2, 0, None),(2, None, 1, None),(3,None,9, 1)]
df=spark.createDataFrame(vals,columns)
df.show()
+---+-----+-----+-----+
| id|item1|item2|item3|
+---+-----+-----+-----+
| 1| 2| 'A'| null|
| 2| null| 1| null|
| 3| null| 9| 'C'|
+---+-----+-----+-----+
After running the code, the desired output is:
+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
| 1| 2| 'A'| null| 1|
| 2| null| 1| null| 2|
| 3| null| 9| 'C'| 1|
+---+-----+-----+-----+--------+
EDIT: Not all non-null values are ints.
Convert nulls to 1 and everything else to 0, then sum across the columns:
df.withColumn('numNulls', sum(df[col].isNull().cast('int') for col in df.columns)).show()
+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
| 1| 2| 0| null| 1|
| 2| null| 1| null| 2|
| 3| null| 9| 1| 1|
+---+-----+-----+-----+--------+
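If the id column should not contribute to the count, the same pattern can be restricted to the item columns (a small variation on the answer above):
# sum the null indicators over every column except id
item_cols = [c for c in df.columns if c != 'id']
df.withColumn('numNulls', sum(df[c].isNull().cast('int') for c in item_cols)).show()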