I am new to Spark and I am trying to clean a relatively large dataset. The problem I have is that the feature values seem to be mismatched in the original dataset. When I take a summary of the dataset, the first line looks something like this:
|summary A B |
---------------------
|count 5 10 |
I am trying to find a way to filter based on the row with the lowest count across all features and maintain the ordering.
I would like to have:
|summary A B |
---------------------
|count 5 5 |
How could I achieve this? Thanks!
Here are two approaches for you to consider:
Simple approach
# Set up the example df
df = spark.createDataFrame([('count',5,10)],['summary','A','B'])
# +-------+---+---+
# |summary| A| B|
# +-------+---+---+
# | count| 5| 10|
# +-------+---+---+
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType
@udf(returnType=IntegerType())
def get_row_min(A, B):
    return min([A, B])
df.withColumn('new_A', get_row_min(col('A'), col('B')))\
  .withColumn('new_B', col('new_A'))\
  .drop('A')\
  .drop('B')\
  .withColumnRenamed('new_A', 'A')\
  .withColumnRenamed('new_B', 'B')\
  .show()
# +-------+---+---+
# |summary| A| B|
# +-------+---+---+
# | count| 5| 5|
# +-------+---+---+
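As a side note, Spark's built-in least function can compute a row-wise minimum without a UDF; a minimal sketch for the two-column case:
from pyspark.sql.functions import least, col

# least() returns the row-wise minimum of its arguments (nulls are skipped)
df.withColumn('A', least(col('A'), col('B')))\
  .withColumn('B', col('A'))\
  .show()
# +-------+---+---+
# |summary|  A|  B|
# +-------+---+---+
# |  count|  5|  5|
# +-------+---+---+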
Generic approach for indirectly specified columns
# Set up df with an extra column (and an extra row to show it works)
df2 = spark.createDataFrame([('count', 5, 10, 15),
                             ('count', 3, 2, 1)],
                            ['summary', 'A', 'B', 'C'])
# +-------+---+---+---+
# |summary| A| B| C|
# +-------+---+---+---+
# | count| 5| 10| 15|
# | count| 3| 2| 1|
# +-------+---+---+---+
@udf(returnType=IntegerType())
def get_row_min_generic(*cols):
    return min(cols)
exclude = ['summary']
df3 = df2.withColumn('min_val', get_row_min_generic(*[col(col_name) for col_name in df2.columns
                                                      if col_name not in exclude]))
exclude.append('min_val')  # this could just be specified in the list
                           # from the beginning instead of appending
new_cols = [col('min_val').alias(c) for c in df2.columns if c not in exclude]
df_out = df3.select(['summary']+new_cols)
df_out.show()
# +-------+---+---+---+
# |summary| A| B| C|
# +-------+---+---+---+
# | count| 5| 5| 5|
# | count| 1| 1| 1|
# +-------+---+---+---+
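The built-in least function also works for the generic case; a sketch that keeps the column-exclusion idea from above:
from pyspark.sql.functions import least, col

exclude = ['summary']
value_cols = [c for c in df2.columns if c not in exclude]
# compute the row-wise minimum once, then project it under every original column name
min_col = least(*[col(c) for c in value_cols])
df2.select('summary', *[min_col.alias(c) for c in value_cols]).show()
# +-------+---+---+---+
# |summary|  A|  B|  C|
# +-------+---+---+---+
# |  count|  5|  5|  5|
# |  count|  1|  1|  1|
# +-------+---+---+---+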
I have multiple dataframes that I need to concatenate together, row-wise. In pandas, we would typically write: pd.concat([df1, df2]).
This thread, How to concatenate/append multiple Spark dataframes column wise in Pyspark?, appears close, but its answer:
df1_schema = StructType([StructField("id",IntegerType()),StructField("name",StringType())])
df1 = spark.sparkContext.parallelize([(1, "sammy"),(2, "jill"),(3, "john")])
df1 = spark.createDataFrame(df1, schema=df1_schema)
df2_schema = StructType([StructField("secNo",IntegerType()),StructField("city",StringType())])
df2 = spark.sparkContext.parallelize([(101, "LA"),(102, "CA"),(103,"DC")])
df2 = spark.createDataFrame(df2, schema=df2_schema)
schema = StructType(df1.schema.fields + df2.schema.fields)
df1df2 = df1.rdd.zip(df2.rdd).map(lambda x: x[0]+x[1])
spark.createDataFrame(df1df2, schema).show()
raises the following error when run on my data at scale: Can only zip RDDs with same number of elements in each partition
How can I join 2 or more data frames that are identical in row length but are otherwise independent of content (they share a similar repeating structure/order but contain no shared data)?
Example expected data looks like:
+---+-----+ +-----+----+ +---+-----+-----+----+
| id| name| |secNo|city| | id| name|secNo|city|
+---+-----+ +-----+----+ +---+-----+-----+----+
| 1|sammy| + | 101| LA| => | 1|sammy| 101| LA|
| 2| jill| | 102| CA| | 2| jill| 102| CA|
| 3| john| | 103| DC| | 3| john| 103| DC|
+---+-----+ +-----+----+ +---+-----+-----+----+
You can create unique IDs with
from pyspark.sql.functions import expr

df1 = df1.withColumn("unique_id", expr("row_number() over (order by (select null))"))
df2 = df2.withColumn("unique_id", expr("row_number() over (order by (select null))"))
then you can left join them
df1.join(df2, ["unique_id"], "left").drop("unique_id")
Final output looks like
+---+-----+-----+----+
| id| name|secNo|city|
+---+-----+-----+----+
|  1|sammy|  101|  LA|
|  2| jill|  102|  CA|
|  3| john|  103|  DC|
+---+-----+-----+----+
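As a side note, the row_number-over-nothing trick does not strictly guarantee that both frames get numbered in the same order at scale. One alternative sketch, assuming both frames have exactly the same row count and that spark is the active SparkSession, indexes each frame with zipWithIndex and joins on that index (with_row_index is just an illustrative helper):
from pyspark.sql.types import LongType, StructField, StructType

def with_row_index(df, col_name="row_idx"):
    # zipWithIndex assigns a stable, contiguous 0-based index to every row
    schema = StructType(df.schema.fields + [StructField(col_name, LongType(), False)])
    indexed = df.rdd.zipWithIndex().map(lambda pair: pair[0] + (pair[1],))
    return spark.createDataFrame(indexed, schema)

df1df2 = (with_row_index(df1)
          .join(with_row_index(df2), on="row_idx")
          .drop("row_idx"))
df1df2.show()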
df = spark.createDataFrame(
    [[100, 'a_', 1],
     [150, 'a_', 6],
     [200, 'a_', 6],
     [120, 'b_', 2],
     [220, 'c_', 3],
     [230, 'd_', 4],
     [500, 'e_', 5],
     [110, 'a_', 6],
     [130, 'b_', 6],
     [140, 'b_', 12]], ['id', 'type', 'cnt'])
As is:
df.withColumn(
    "rank", row_number().over(Window.partitionBy("type").orderBy(col("cnt").desc(), col("id").desc()))
).head(10)
To be: I want to make a method
def rank(df, order):
    df.withColumn(
        "rank", row_number().over(Window.partitionBy("type").orderBy(order))
    ).head(10)
I want to pass multiple columns for ordering (col("cnt").desc(), col("id").desc()).
If there's just one column, that's simple, but I need the method to scale to any number of ordering columns. How can I do that?
Additionally, if I want to add another dynamic parameter, how do I handle it?
def rank(df, ?, *order):
    df = df.withColumn(
        "rank", row_number().over(Window.partitionBy(?).orderBy(*order))
    )
    return df
Instead of order, try using *order.
I'm not sure exactly what you want to do, but the following modified version of your rank function seems to work with a varying number of ordering columns.
from pyspark.sql.functions import asc, col, row_number
from pyspark.sql.window import Window

def rank(df, *order):
    df = df.withColumn(
        "rank", row_number().over(Window.partitionBy("type").orderBy(*order))
    )
    return df
rank(df, asc("id")).show()
# +---+----+---+----+
# | id|type|cnt|rank|
# +---+----+---+----+
# |100| a_| 1| 1|
# |110| a_| 6| 2|
# |150| a_| 6| 3|
# |200| a_| 6| 4|
# |120| b_| 2| 1|
# |130| b_| 6| 2|
# |140| b_| 12| 3|
# |220| c_| 3| 1|
# |230| d_| 4| 1|
# |500| e_| 5| 1|
# +---+----+---+----+
rank(df, col("cnt").desc(), col("id").desc()).show()
# +---+----+---+----+
# | id|type|cnt|rank|
# +---+----+---+----+
# |200| a_| 6| 1|
# |150| a_| 6| 2|
# |110| a_| 6| 3|
# |100| a_| 1| 4|
# |140| b_| 12| 1|
# |130| b_| 6| 2|
# |120| b_| 2| 3|
# |220| c_| 3| 1|
# |230| d_| 4| 1|
# |500| e_| 5| 1|
# +---+----+---+----+
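For the follow-up about also making the partition key a dynamic parameter, a minimal sketch (the partition_cols name is just illustrative) could look like this:
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

def rank(df, partition_cols, *order):
    # partition_cols can be a single column name or a list of names;
    # Window.partitionBy accepts either form
    return df.withColumn(
        "rank", row_number().over(Window.partitionBy(partition_cols).orderBy(*order))
    )

rank(df, "type", col("cnt").desc(), col("id").desc()).show()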
I'm a newbie with PySpark and don't know what the problem with my code is.
I have 2 dataframes
df1=
+---+--------------+
| id|No_of_Question|
+---+--------------+
| 1| Q1|
| 2| Q4|
| 3| Q23|
|...| ...|
+---+--------------+
df2 =
+---+---+---+---+---+---+---+---+---+---+
| Q1| Q2| Q3| Q4| Q5|...|Q22|Q23|Q24|Q25|
+---+---+---+---+---+---+---+---+---+---+
|  1|  0|  1|  0|  0|...|  1|  1|  1|  1|
+---+---+---+---+---+---+---+---+---+---+
I'd like to create a new dataframe containing only the columns of df2 whose names appear in df1.No_of_Question.
Expected result
df2 =
+---+---+---+
| Q1| Q4|Q24|
+---+---+---+
|  1|  0|  1|
+---+---+---+
I've already tried
df2 = df2.select(*F.collect_list(df1.No_of_Question)) #Error: Column is not iterable
or
df2 = df2.select(F.collect_list(df1.No_of_Question)) #Error: Resolved attribute(s) No_of_Question#1791 missing from Q1, Q2...
or
df2 = df2.select(*df1.No_of_Question)
or
df2= df2.select([col for col in df2.columns if col in df1.No_of_Question])
But none of these solutions worked.
Could you help me please?
You can collect the values of No_of_Question into a Python list, then pass it to df2.select().
Try this:
from pyspark.sql import functions as F

questions = [
    F.col(r.No_of_Question).alias(r.No_of_Question)
    for r in df1.select("No_of_Question").collect()
]
df2 = df2.select(*questions)
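If df1 could reference questions that are not actual columns of df2, a defensive variant (an assumption on my part, not part of the original question) can filter the collected names first:
existing = set(df2.columns)
questions = [
    F.col(r.No_of_Question)
    for r in df1.select("No_of_Question").distinct().collect()
    if r.No_of_Question in existing  # skip names that do not exist in df2
]
df2 = df2.select(*questions)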
I have the following time-series data in a DataFrame in pyspark:
(id, timestamp, type)
The id column can be any integer value, and many rows with the same id can exist in the table.
The timestamp column is a timestamp represented by an integer (for simplification).
The type column is a string variable where each distinct string represents one category. One special category out of all is 'A'.
My question is the following:
Is there any way to compute (with SQL or pyspark DataFrame operations) the counts of every type, for every time difference from the timestamps of the type='A' rows, within a time range (e.g. [-5,+5]) at a granularity of 1 second?
For example, for the following DataFrame:
ts_df = sc.parallelize([
(1,'A',100),(2,'A',1000),(3,'A',10000),
(1,'b',99),(1,'b',99),(1,'b',99),
(2,'b',999),(2,'b',999),(2,'c',999),(2,'c',999),(1,'d',999),
(3,'c',9999),(3,'c',9999),(3,'d',9999),
(1,'b',98),(1,'b',98),
(2,'b',998),(2,'c',998),
(3,'c',9998)
]).toDF(["id","type","ts"])
ts_df.show()
+---+----+-----+
| id|type| ts|
+---+----+-----+
| 1| A| 100|
| 2| A| 1000|
| 3| A|10000|
| 1| b| 99|
| 1| b| 99|
| 1| b| 99|
| 2| b| 999|
| 2| b| 999|
| 2| c| 999|
| 2| c| 999|
| 1| d| 999|
| 3| c| 9999|
| 3| c| 9999|
| 3| d| 9999|
| 1| b| 98|
| 1| b| 98|
| 2| b| 998|
| 2| c| 998|
| 3| c| 9998|
+---+----+-----+
for a time difference of -1 second the result should be:
# result for time difference = -1 sec
# b: 5
# c: 4
# d: 2
while for a time difference of -2 seconds the result should be:
# result for time difference = -2 sec
# b: 3
# c: 2
# d: 0
and so forth for any time difference within the time range, at a granularity of 1 second.
I tried many different ways, mostly using groupBy, but nothing seems to work.
I am mostly having difficulty expressing the time difference from each row of type='A', even when doing it for one specific time difference.
Any suggestions would be greatly appreciated!
EDIT:
If I only have to do it for one specific time difference time_difference, then I could do it the following way:
from pyspark.sql import functions as F

time_difference = -1
df_type_A = ts_df.where(F.col("type")=='A').selectExpr("ts as fts")
res = df_type_A.join(ts_df, on=df_type_A.fts+time_difference==ts_df.ts)\
    .drop("ts","fts").groupBy(F.col("type")).count()
The returned res DataFrame then gives me exactly what I want for one specific time difference. I could create a loop and solve the problem by repeating the same query over and over again.
However, is there any more efficient way than that?
EDIT2 (solution)
So this is how I did it in the end:
df1 = sc.parallelize([
(1,'b',99),(1,'b',99),(1,'b',99),
(2,'b',999),(2,'b',999),(2,'c',999),(2,'c',999),(2,'d',999),
(3,'c',9999),(3,'c',9999),(3,'d',9999),
(1,'b',98),(1,'b',98),
(2,'b',998),(2,'c',998),
(3,'c',9998)
]).toDF(["id","type","ts"])
df1.show()
df2 = sc.parallelize([
(1,'A',100),(2,'A',1000),(3,'A',10000),
]).toDF(["id","type","ts"]).selectExpr("id as fid","ts as fts","type as ftype")
df2.show()
df3 = df2.join(df1, on=df1.id==df2.fid).withColumn("td", F.col("ts")-F.col("fts"))
df3.show()
df4 = df3.groupBy([F.col("type"),F.col("td")]).count()
df4.show()
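If only a window of differences is needed (for example [-5, +5]), the grouped result above can be filtered on td and, optionally, pivoted so that each time difference becomes its own column; a small sketch built on df4:
# keep only time differences inside the requested window
df5 = df4.where(F.col("td").between(-5, 5))
# one row per type, one column per time difference within the window
df5.groupBy("type").pivot("td").sum("count").show()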
I will update with performance details as soon as I have any.
Thanks!
Another way to solve this problem would be:
Split the existing dataframe into two dataframes: one with type 'A' and one without.
Add a new column to the without-'A' dataframe that is the sum of "ts" and time_difference.
Join both dataframes, group by type, and count.
Here is the code:
from pyspark.sql.functions import lit
time_difference = 1
ts_df_A = (
    ts_df
    .filter(ts_df["type"] == "A")
    .drop("id")
    .drop("type")
)
ts_df_td = (
    ts_df
    .withColumn("ts_plus_td", lit(ts_df['ts'] + time_difference))
    .filter(ts_df["type"] != "A")
    .drop("ts")
)
joined_df = ts_df_A.join(ts_df_td, ts_df_A["ts"] == ts_df_td["ts_plus_td"])
agg_df = joined_df.groupBy("type").count()
>>> agg_df.show()
+----+-----+
|type|count|
+----+-----+
| d| 2|
| c| 4|
| b| 5|
+----+-----+
>>>
Let me know if this is what you are looking for.
Thanks,
Hussain Bohra
I have a spark dataframe with 3 columns storing 3 different predictions. I want to know the count of each output value so as to pick the value that was obtained max number of times as the final output.
I can do this easily in pandas by calling value_counts on each row, as shown below. I have converted my Spark df to a pandas df here, but I need to be able to perform a similar operation on the Spark df directly.
r=[Row(run_1=1, run_2=2, run_3=1, name='test run', id=1)]
df1=spark.createDataFrame(r)
df1.show()
df2=df1.toPandas()
r=df2.iloc[0]
val_counts=r[['run_1','run_2','run_3']].value_counts()
print(val_counts)
top_val=val_counts.index[0]
top_val_cnt=val_counts.values[0]
print('Majority output = %s, occurred %s out of 3 times' % (top_val, top_val_cnt))
The output tells me that the value 1 occurred the most, twice in this case:
+---+--------+-----+-----+-----+
| id| name|run_1|run_2|run_3|
+---+--------+-----+-----+-----+
| 1|test run| 1| 2| 1|
+---+--------+-----+-----+-----+
1 2
2 1
Name: 0, dtype: int64
Majority output = 1, occurred 2 out of 3 times
I am trying to write a udf function which can take each of the df1 rows and get the top_val and top_val_cnt. Is there a way to achieve this using spark df?
This is Scala, but the Python code should be similar; maybe it will help you:
val df1 = Seq((1, 1, 1, 2), (1, 2, 3, 3), (2, 2, 2, 2)).toDF()
df1.show()
df1.select(array('*)).map(s => {
  val list = s.getList(0)
  (list.toString(), list.toArray.groupBy(i => i).mapValues(_.size).toList.toString())
}).show(false)
output:
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| 1| 1| 1| 2|
| 1| 2| 3| 3|
| 2| 2| 2| 2|
+---+---+---+---+
+------------+-------------------------+
|_1 |_2 |
+------------+-------------------------+
|[1, 1, 1, 2]|List((2,1), (1,3)) |
|[1, 2, 3, 3]|List((2,1), (1,1), (3,2))|
|[2, 2, 2, 2]|List((2,4)) |
+------------+-------------------------+
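A Python analogue of the same idea, sketched with a struct-returning UDF (the majority helper and output column names are just illustrative), could look like this for the original df1:
from collections import Counter
from pyspark.sql.functions import array, col, udf
from pyspark.sql.types import IntegerType, StructField, StructType

result_schema = StructType([
    StructField("top_val", IntegerType()),
    StructField("top_val_cnt", IntegerType()),
])

@udf(returnType=result_schema)
def majority(values):
    # most_common(1) returns [(value, count)] for the most frequent value
    val, cnt = Counter(values).most_common(1)[0]
    return val, cnt

df1.withColumn("maj", majority(array("run_1", "run_2", "run_3"))) \
   .select("id", "name", "run_1", "run_2", "run_3",
           col("maj.top_val"), col("maj.top_val_cnt")) \
   .show()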
Let's create a test dataframe similar to yours.
list = [(1,'test run',1,2,1),(2,'test run',3,2,3),(3,'test run',4,4,4)]
df=spark.createDataFrame(list, ['id', 'name','run_1','run_2','run_3'])
newdf = df.rdd.map(lambda x: (x[0], x[1], x[2:])) \
    .map(lambda x: (x[0], x[1], x[2][0], x[2][1], x[2][2], [max(set(x[2]), key=x[2].count)])) \
    .toDF(['id', 'test', 'run_1', 'run_2', 'run_3', 'most_frequent'])
>>> newdf.show()
+---+--------+-----+-----+-----+-------------+
| id| test|run_1|run_2|run_3|most_frequent|
+---+--------+-----+-----+-----+-------------+
| 1|test run| 1| 2| 1| [1]|
| 2|test run| 3| 2| 3| [3]|
| 3|test run| 4| 4| 4| [4]|
+---+--------+-----+-----+-----+-------------+
You may also need to handle the case where every item in the list is different, i.e. return a null.
list = [(1,'test run',1,2,1),(2,'test run',3,2,3),(3,'test run',4,4,4),(4,'test run',1,2,3)]
df=spark.createDataFrame(list, ['id', 'name','run_1','run_2','run_3'])
from pyspark.sql.functions import udf

@udf
def most_frequent(*mylist):
    counter = 1
    num = mylist[0]
    for i in mylist:
        curr_frequency = mylist.count(i)
        if curr_frequency > counter:
            counter = curr_frequency
            num = i
            return num
    else:
        return None
The counter is initialized to 1, so a value is returned only if it occurs more than once; otherwise None is returned.
df.withColumn('most_frequent', most_frequent('run_1', 'run_2', 'run_3')).show()
+---+--------+-----+-----+-----+-------------+
| id| name|run_1|run_2|run_3|most_frequent|
+---+--------+-----+-----+-----+-------------+
| 1|test run| 1| 2| 1| 1|
| 2|test run| 3| 2| 3| 3|
| 3|test run| 4| 4| 4| 4|
| 4|test run| 1| 2| 3| null|
+---+--------+-----+-----+-----+-------------+