Creating multiple columns for a grouped pyspark dataframe - dataframe

I'm trying to add several new columns to my dataframe (preferably in a for loop), with each new column being the count of certain instances of col B, after grouping by column A.
What doesn't work:
from pyspark.sql import functions as f
#the first one will be fine
df_grouped=df.select('A','B').filter(df.B=='a').groupBy('A').count()
df_grouped.show()
+---+-----+
|  A|count|
+---+-----+
|859|    4|
|947|    2|
|282|    6|
|699|   24|
|153|   12|
+---+-----+
# create the second column:
df_g2=df.select('A','B').filter(df.B=='b').groupBy('A').count()
df_g2.show()
+---+-----+
|  A|count|
+---+-----+
|174|   18|
|153|   20|
|630|    6|
|147|   16|
+---+-----+
#I get an error on adding the new column:
df_grouped=df_grouped.withColumn('2nd_count',f.col(df_g2.select('count')))
The error:
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
I also tried it without using f.col, and with just df_g2.count, but I get an error saying "col should be column".
Something that DOES work:
df_g1=df.select('A','B').filter(df.B=='a').groupBy('A').count()
df_g2=df.select('A','B').filter(df.B=='b').groupBy('A').count()
df_grouped=df_g1.join(df_g2,['A'])
However, I'm going to add up to around 1000 new columns, and that many joins seems costly. I wonder whether joins are inevitable, given that every time I group by column A its order changes in the grouped result (e.g. compare the order of column A in df_grouped with its order in df_g2 above), or whether there is a better way to do this.

What you probably need is groupby and pivot.
Try this:
df.groupby('A').pivot('B').agg(F.count('B')).show()
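For reference, a minimal self-contained sketch of that approach (the sample rows below are made up, and F is pyspark.sql.functions):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

#hypothetical data with the question's column names A and B
df = spark.createDataFrame(
    [(859, 'a'), (859, 'a'), (859, 'b'), (153, 'a'), (153, 'b'), (153, 'b')],
    ['A', 'B'])

#one output column per distinct value of B, each holding that value's count
df.groupby('A').pivot('B').agg(F.count('B')).show()
#+---+---+---+
#|  A|  a|  b|
#+---+---+---+
#|859|  2|  1|
#|153|  1|  2|
#+---+---+---+
#(row order may differ)
Values of B that never occur for a given A come out as null rather than 0; chain .na.fill(0) if you want zeros instead.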


How to find the uncommon rows between two Pyspark DataFrames? [duplicate]

I have to compare two dataframes to find the column differences based on one or more key fields, using pyspark, in the most performance-efficient way possible, since I have to deal with huge dataframes.
I have already built a solution for comparing two dataframes using a hash match without key-field matching, like data_compare.df_subtract(self.df_db1_hash,self.df_db2_hash), but the scenario is different if I want to use a key-field match.
Note: I have provided a sample expected dataframe. The actual requirement is that any difference from DataFrame 2, in any column, should be retrieved in the output/expected dataframe.
DataFrame 1:
+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
| 3| Chennai| rahman|9848022330| 45000|SanRamon|
| 1|Hyderabad| ram|9848022338| 50000| SF|
| 2|Hyderabad| robin|9848022339| 40000| LA|
| 4| sanjose| romin|9848022331| 45123|SanRamon|
+------+---------+--------+----------+-------+--------+
DataFrame 2:
+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
| 3| Chennai| rahman|9848022330| 45000|SanRamon|
| 1|Hyderabad| ram|9848022338| 50000| SF|
| 2|Hyderabad| robin|9848022339| 40000| LA|
| 4| sandiego| romino|9848022331| 45123|SanRamon|
+------+---------+--------+----------+-------+--------+
Expected dataframe after comparing dataframe 1 and 2
+------+---------+--------+----------+
|emp_id| emp_city|emp_name| emp_phone|
+------+---------+--------+----------+
| 4| sandiego| romino|9848022331|
+------+---------+--------+----------+
The subtract function is what you are looking for: it compares every column's value for each row and gives you a dataframe containing the rows that differ from the other dataframe.
df2.subtract(df1).select("emp_id","emp_city","emp_name","emp_phone")
As the API documentation says:
Return a new DataFrame containing rows in this frame but not in another frame.
This is equivalent to EXCEPT in SQL.
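A minimal sketch of that call on cut-down versions of the frames above (only three of the columns, re-typed from the example):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

cols = ["emp_id", "emp_city", "emp_name"]
df1 = spark.createDataFrame([(3, "Chennai", "rahman"), (4, "sanjose", "romin")], cols)
df2 = spark.createDataFrame([(3, "Chennai", "rahman"), (4, "sandiego", "romino")], cols)

#rows present in df2 but not in df1 (whole-row comparison, like SQL EXCEPT)
df2.subtract(df1).show()
#+------+--------+--------+
#|emp_id|emp_city|emp_name|
#+------+--------+--------+
#|     4|sandiego|  romino|
#+------+--------+--------+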

Equivalent of `takeWhile` for Spark dataframe

I have a dataframe looking like this:
scala> val df = Seq((1,.5), (2,.3), (3,.9), (4,.0), (5,.6), (6,.0)).toDF("id", "x")
scala> df.show()
+---+---+
| id| x|
+---+---+
| 1|0.5|
| 2|0.3|
| 3|0.9|
| 4|0.0|
| 5|0.6|
| 6|0.0|
+---+---+
I would like to take the first rows of the data as long as the x column is nonzero (note that the dataframe is sorted by id, so talking about the first rows makes sense). For this given dataframe, it would give something like this:
+---+---+
| id| x|
+---+---+
| 1|0.5|
| 2|0.3|
| 3|0.9|
+---+---+
I only kept the first 3 rows, as the 4th row was zero.
For a simple Seq, I can do something like Seq(0.5, 0.3, 0.9, 0.0, 0.6, 0.0).takeWhile(_ != 0.0). So for my dataframe I thought of something like this:
df.takeWhile('x =!= 0.0)
But unfortunately, the takeWhile method is not available for dataframes.
I know that I can transform my dataframe to a Seq to solve my problem, but I would like to avoid gathering all the data to the driver as it will likely crash it.
The take and limit methods let me get the first n rows of a dataframe, but I can't specify a predicate. Is there a simple way to do this?
Can you guarantee that IDs will be in ascending order? New data is not necessarily added in a specific order. If you can guarantee the order, then you can use this query to achieve what you want. It's not going to perform well on large data sets, but it may be the only way to achieve what you are interested in.
We'll mark every 0 as '1' and everything else as '0'. We'll then compute a rolling total over the entire data set. Since the running total only increases at a zero, it partitions the dataset into the sections that lie between zeros.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, sum, when}

val windowSpec = Window.partitionBy().orderBy("id")

df.select(
    col("id"),
    col("x"),
    sum( // running total that stays 0 for all rows before the first 0
      when(col("x") === lit(0), lit(1)).otherwise(lit(0)) // mark 0's to help partition the data set
    ).over(windowSpec).as("partition")
  ).where(col("partition") === lit(0))
  .show()
+---+---+---------+
| id|  x|partition|
+---+---+---------+
|  1|0.5|        0|
|  2|0.3|        0|
|  3|0.9|        0|
+---+---+---------+
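Since most of the other questions here are in PySpark, here is a rough Python sketch of the same running-sum trick (same caveat: the single unpartitioned window pulls all the data into one partition):
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 0.5), (2, 0.3), (3, 0.9), (4, 0.0), (5, 0.6), (6, 0.0)], ["id", "x"])

#running count of zeros seen so far; rows before the first zero get 0
w = Window.orderBy("id")
marked = df.withColumn(
    "zeros_so_far",
    F.sum(F.when(F.col("x") == 0.0, 1).otherwise(0)).over(w))
marked.where(F.col("zeros_so_far") == 0).drop("zeros_so_far").show()
#+---+---+
#| id|  x|
#+---+---+
#|  1|0.5|
#|  2|0.3|
#|  3|0.9|
#+---+---+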

Shift pyspark column value to left by one

I have a pyspark dataframe that looks like this:
+----+----+------+------+
|name| age|height|weight|
+----+----+------+------+
|    |Mike|    20|   6-7|
+----+----+------+------+
As you can see the values and the column names are not aligned. For example, "Mike" should be under the column of "name", instead of age.
How can I shift the values to left by one so it can match the column name?
The ideal dataframe looks like:
+----+---+------+------+
|name|age|height|weight|
+----+---+------+------+
|Mike| 20|   6-0|   160|
+----+---+------+------+
Please note that the above data is just an example. In reality I have more than 200 columns and more than 1M rows of data.
Try .toDF with the new column names, after dropping the name column from the dataframe.
Example:
#five values but only four column names, so Spark adds a fifth column "_5" (this reproduces the misaligned data)
df=spark.createDataFrame([('','Mike',20,'6-7',160)],['name','age','height','weight'])
df.show()
#+----+----+------+------+---+
#|name| age|height|weight| _5|
#+----+----+------+------+---+
#| |Mike| 20| 6-7|160|
#+----+----+------+------+---+
#select all columns except name
df1=df.select(*[i for i in df.columns if i != 'name'])
#take the last column name ('_5') and keep the original four names
drop_col=df.columns.pop()
req_cols=[i for i in df.columns if i != drop_col]
#rename the remaining columns so the values line up with the right names
df1.toDF(*req_cols).show()
#+----+---+------+------+
#|name|age|height|weight|
#+----+---+------+------+
#|Mike| 20| 6-7| 160|
#+----+---+------+------+
Using spark.createDataFrame():
cols=['name','age','height','weight']
spark.createDataFrame(df.select(*[i for i in df.columns if i != 'name']).rdd,cols).show()
#+----+---+------+------+
#|name|age|height|weight|
#+----+---+------+------+
#|Mike| 20| 6-7| 160|
#+----+---+------+------+
If you are creating the dataframe while reading a file, define a schema whose first column is a dummy; then, once you have read the data, drop that column with the .drop() function.
spark.read.schema(<struct_type schema>).csv(<path>).drop('<dummy_column_name>')
spark.read.option("header","true").csv(<path>).toDF(<columns_list_with dummy_column>).drop('<dummy_column_name>')
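A concrete version of the first template, assuming a CSV whose rows all start with a stray empty field (the schema and the "people.csv" path are hypothetical):
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

#'dummy' absorbs the stray leading field; the real columns follow it
schema = StructType([
    StructField("dummy", StringType()),
    StructField("name", StringType()),
    StructField("age", IntegerType()),
    StructField("height", StringType()),
    StructField("weight", IntegerType())])

df = spark.read.schema(schema).csv("people.csv").drop("dummy")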

pyspark withColumn, how to vary column name

Is there any way to create/fill columns with pyspark 2.1.0 where the name of the column is the value of a different column?
I tried the following
def createNewColumnsFromValues(dataFrame, colName, targetColName):
    """
    Set value of column colName to targetColName's value
    """
    cols = dataFrame.columns
    #df = dataFrame.withColumn(f.col(colName), f.col(targetColName))
    df = dataFrame.withColumn('x', f.col(targetColName))
    return df
The commented-out line does not work; when calling the method I get the error
TypeError: 'Column' object is not callable
whereas a fixed name (as a string) is no problem. Any idea how to make the name of the new column also come from another column, not just its value? I also tried a UDF definition as a workaround, with the same lack of success.
Thanks for the help!
Edit:
from pyspark.sql import functions as f
I figured out a solution that scales nicely when there are only a few distinct values I need columns for, which is necessarily the case here, otherwise the number of columns would explode.
def createNewColumnsFromValues(dataFrame, colName, targetCol):
    distinctValues = dataFrame.select(colName).distinct().collect()
    for value in distinctValues:
        dataFrame = dataFrame.withColumn(str(value[0]), f.when(f.col(colName) == value[0], f.col(targetCol)).otherwise(f.lit(None)))
    return dataFrame
You might want to try the following code:
from pyspark.sql.functions import col, when

test_df = spark.createDataFrame([
    (1, "2", 5, 1), (3, "4", 7, 8),
], ("col1", "col2", "col3", "col4"))

def createNewColumnsFromValues(dataFrame, sourceCol, colName, targetCol):
    """
    For each value in sourceCol, add a column named after it, filled from targetCol
    """
    for value in sourceCol:
        dataFrame = dataFrame.withColumn(str(value[0]), when(col(colName) == value[0], targetCol).otherwise(None))
    return dataFrame

createNewColumnsFromValues(test_df, test_df.select("col4").collect(), "col4", test_df["col3"]).show()
The trick here is to do select("COLUMNNAME").collect() to get a list of the values in the column. sourceCol then holds this list, which is a list of rows where each row has a single element, so you can iterate through it directly and access the element at position 0. The cast to string is needed to ensure the name of the new column is a string. The target column supplies the values for each of the individual new columns. The result looks like this:
+----+----+----+----+----+----+
|col1|col2|col3|col4|   1|   8|
+----+----+----+----+----+----+
|   1|   2|   5|   1|   5|null|
|   3|   4|   7|   8|null|   7|
+----+----+----+----+----+----+

convert pyspark groupedData object to spark Dataframe

I have to do a two-level grouping on a pyspark dataframe.
My attempt:
grouped_df=df.groupby(["A","B","C"])
grouped_df.groupby(["C"]).count()
But I get the following error:
'GroupedData' object has no attribute 'groupby'
I guess I should first convert the grouped object into a pySpark DF. But I cannot do that.
Any suggestion?
I had the same issue. The way I got around it was by first doing a "count()" after the first groupby, because that returns a Spark DataFrame, rather than the GroupedData object. Then you can do another groupby on that returned DataFrame.
So try:
grouped_df=df.groupby(["A","B","C"]).count()
grouped_df.groupby(["C"]).count()
The function DataFrame.groupBy(cols) returns a GroupedData object. In order to convert a GroupedData object back to a DataFrame, you need to use one of the GroupedData functions such as mean(cols), avg(cols) or count(). An example using your data is:
df = sqlContext.createDataFrame([['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']], schema=['A', 'B', 'C'])
df.show()
+---+---+---+
| A| B| C|
+---+---+---+
| a| b| c|
| a| b| c|
| a| b| c|
+---+---+---+
gdf = df.groupBy('C').count()
gdf.show()
+---+-----+
| C|count|
+---+-----+
| c| 3|
+---+-----+
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.GroupedData
pyspark.sql.GroupedData: Aggregation methods, returned by DataFrame.groupBy().
A set of methods for aggregations on a DataFrame, created by DataFrame.groupBy().
You may use an aggregation function such as agg, avg, count, max, mean, min, pivot, sum, collect_list, collect_set, first, grouping, etc.
Be careful with first: it can make your script slower if you misuse it.
If you have a numeric column you can use aggregation functions such as min, max, mean, etc., but if you have a string column you may want to use:
df.groupBy("ID").pivot("VAR").agg(concat_ws('', collect_list(col("VAL"))))
or
df.groupBy("ID").pivot("VAR").agg(collect_list(collect_list("VAL")[0]))
or
df.groupBy("ID").pivot("VAR").agg(first("VAL"))