pyspark withColumn, how to vary column name - dataframe

Is there any way to create/fill columns with PySpark 2.1.0 where the name of the column is the value of a different column?
I tried the following:
def createNewColumnsFromValues(dataFrame, colName, targetColName):
    """
    Set value of column colName to targetColName's value
    """
    cols = dataFrame.columns
    #df = dataFrame.withColumn(f.col(colName), f.col(targetColName))
    df = dataFrame.withColumn('x', f.col(targetColName))
    return df
The commented-out line does not work; when calling the method I get the error
TypeError: 'Column' object is not callable
whereas the fixed name (as a string) is no problem. Any idea how to make the name of the column come from another column as well, not just the value? I also tried a UDF as a workaround, with no success either.
Thanks for help!
Edit:
from pyspark.sql import functions as f

I figured out a solution which works well as long as there are only a few distinct values that need columns. That is necessarily the case here, or the number of columns would explode.
def createNewColumnsFromValues(dataFrame, colName, targetCol):
    distinctValues = dataFrame.select(colName).distinct().collect()
    for value in distinctValues:
        dataFrame = dataFrame.withColumn(str(value[0]), f.when(f.col(colName) == value[0], f.col(targetCol)).otherwise(f.lit(None)))
    return dataFrame

You might want to try the following code:
from pyspark.sql.functions import col, when
test_df = spark.createDataFrame([
    (1,"2",5,1),(3,"4",7,8),
], ("col1","col2","col3","col4"))
def createNewColumnsFromValues(dataFrame, sourceCol, colName, targetCol):
    """
    Set value column colName to targetCol
    """
    # sourceCol is a list of single-element rows; each value becomes a new column name
    for value in sourceCol:
        dataFrame = dataFrame.withColumn(str(value[0]), when(col(colName)==value[0], targetCol).otherwise(None))
    return dataFrame
createNewColumnsFromValues(test_df, test_df.select("col4").collect(), "col4", test_df["col3"]).show()
The trick here is to call select("COLUMNNAME").collect() to get a list of the values in the column. sourceCol then contains this list, which is a list of rows where each row has a single element, so you can iterate through the list directly and access the element at position 0. A cast to string is necessary so that the name of the new column is a string. The target column supplies the values for each of the new columns. The result looks like:
+----+----+----+----+----+----+
|col1|col2|col3|col4|   1|   8|
+----+----+----+----+----+----+
|   1|   2|   5|   1|   5|null|
|   3|   4|   7|   8|null|   7|
+----+----+----+----+----+----+
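To make the reshaping easier to follow, here is a minimal plain-Python sketch of the same idea, with no Spark session involved. The function name spread_values and the list-of-dicts representation are assumptions made for illustration only; the sample rows mirror test_df above.

```python
# Plain-Python illustration of "one new column per distinct value".
# spread_values is a hypothetical helper, not part of PySpark.
def spread_values(rows, name_col, target_col):
    """Add one key per distinct value of name_col, filled with target_col or None."""
    distinct = sorted({row[name_col] for row in rows})
    out = []
    for row in rows:
        new_row = dict(row)
        for value in distinct:
            # Same rule as when(col(colName) == value, targetCol).otherwise(None)
            new_row[str(value)] = row[target_col] if row[name_col] == value else None
        out.append(new_row)
    return out

rows = [
    {"col1": 1, "col2": "2", "col3": 5, "col4": 1},
    {"col1": 3, "col2": "4", "col3": 7, "col4": 8},
]
result = spread_values(rows, "col4", "col3")
for r in result:
    print(r)
```

As in the Spark version, the number of new keys grows with the number of distinct values, so this only makes sense when that number is small.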

Related

How to properly import CSV files with PySpark

I know that one can load files with PySpark into RDDs using the following commands:
sc = spark.sparkContext
someRDD = sc.textFile("some.csv")
or for dataframes:
spark.read.options(delimiter=',') \
    .csv("some.csv")
My file is a .csv with 10 columns, separated by ",". However, the very last column contains some text that itself has a lot of ",". Splitting by "," results in a different number of columns for each row, and moreover I do not get the whole text in one column.
I am just looking for a good way to load a .csv file into a dataframe when the very last column contains multiple ",".
Maybe there is a way to only split on the first n columns? It is guaranteed that all columns before the text column are separated by exactly one ",". Interestingly, using pd.read_csv does not cause this issue! So far my workaround has been to load the file with
csv = pd.read_csv("some.csv", delimiter=",")
csv_to_array = csv.values.tolist()
df = spark.createDataFrame(csv_to_array)
which is not a pretty solution. Moreover, it did not allow me to use a schema on my dataframe.
If you can't correct the input file, then you can try to load it as text then split the values to get the desired columns. Here's an example:
input file
1,2,3,4,5,6,7,8,9,10,0,12,121
1,2,3,4,5,6,7,8,9,10,0,12,121
read and parse
from pyspark.sql import functions as F
nb_cols = 5
df = spark.read.text("file.csv")
df = df.withColumn(
    "values",
    F.split("value", ",")
).select(
    *[F.col("values")[i].alias(f"col_{i}") for i in range(nb_cols)],
    F.array_join(F.expr(f"slice(values, {nb_cols + 1}, size(values))"), ",").alias(f"col_{nb_cols}")
)
df.show()
#+-----+-----+-----+-----+-----+-------------------+
#|col_0|col_1|col_2|col_3|col_4|              col_5|
#+-----+-----+-----+-----+-----+-------------------+
#|    1|    2|    3|    4|    5|6,7,8,9,10,0,12,121|
#|    1|    2|    3|    4|    5|6,7,8,9,10,0,12,121|
#+-----+-----+-----+-----+-----+-------------------+
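The "split only on the first n separators" idea from the question can also be sketched in plain Python with the maxsplit argument of str.split. The helper name split_first_n and the sample line are made up for illustration; a function like this could presumably be mapped over the lines of an RDD loaded with sc.textFile, but that part is not shown or tested here.

```python
# split_first_n is a hypothetical helper, not a PySpark function.
def split_first_n(line, n, sep=","):
    """Return n leading fields plus one final field holding the rest of the line."""
    # maxsplit=n performs at most n splits, so separators after that stay in the last field
    return line.split(sep, n)

line = "1,2,3,free text, with, commas"
fields = split_first_n(line, 3)
print(fields)
```

Here the first three fields are split off and everything after the third comma is kept together as the fourth field.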

How to add multiple column dynamically based on filter condition

I am trying to create multiple columns dynamically, based on a filter condition, after comparing two dataframes with the code below.
source_df
+---+-----+-----+------+
|key|val11|val12|date  |
+---+-----+-----+------+
|abc|  1.1| john|2-3-21|
|def|  3.0| dani|2-2-21|
+---+-----+-----+------+
dest_df
+---+-----+-----+------+
|key|val11|val12|date  |
+---+-----+-----+------+
|abc|  2.1| jack|2-3-21|
|def|  3.0| dani|2-2-21|
+---+-----+-----+------+
columns = source_df.columns[1:]
joined_df = source_df\
    .join(dest_df, 'key', 'full')
for column in columns:
    column_name = "difference_in_" + str(column)
    report = joined_df\
        .filter((source_df[column] != dest_df[column]))\
        .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column], F.lit(',dst:'), dest_df[column], F.lit(']')))
The output I expect is
#Expected
+---+-------------------+-------------------+
|key|difference_in_val11|difference_in_val12|
+---+-------------------+-------------------+
|abc|  [src:1.1,dst:2.1]|[src:john,dst:jack]|
+---+-------------------+-------------------+
I get only one result column
#Actual
+---+-------------------+
|key|difference_in_val12|
+---+-------------------+
|abc|[src:john,dst:jack]|
+---+-------------------+
How to generate multiple columns based on filter condition dynamically?
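Dataframes are immutable objects, so withColumn returns a new dataframe each time. In your loop every iteration starts again from joined_df, which is why only the last column survives. You need to build the next dataframe from the one generated in the previous iteration. Something like below: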
from pyspark.sql import functions as F
columns = source_df.columns[1:]
joined_df = source_df\
    .join(dest_df, 'key', 'full')
for column in columns:
    if column != columns[-1]:
        column_name = "difference_in_" + str(column)
        report = joined_df\
            .filter((source_df[column] != dest_df[column]))\
            .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column], F.lit(',dst:'), dest_df[column], F.lit(']')))
    else:
        column_name = "difference_in_" + str(column)
        report1 = report.filter((source_df[column] != dest_df[column]))\
            .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column], F.lit(',dst:'), dest_df[column], F.lit(']')))
report1.show()
#report.show()
Output -
+---+-----+-----+-----+-----+-------------------+-------------------+
|key|val11|val12|val11|val12|difference_in_val11|difference_in_val12|
+---+-----+-----+-----+-----+-------------------+-------------------+
|abc|  1.1| john|  2.1| jack|  [src:1.1,dst:2.1]|[src:john,dst:jack]|
+---+-----+-----+-----+-----+-------------------+-------------------+
You could also do this with a union of both dataframes, collecting the list of values only where the collect_set size is greater than 1; this avoids joining the dataframes:
from pyspark.sql import functions as F
cols = source_df.drop("key").columns
output = (source_df.withColumn("ref", F.lit("src:"))
          .unionByName(dest_df.withColumn("ref", F.lit("dst:"))).groupBy("key")
          .agg(*[F.when(F.size(F.collect_set(i)) > 1, F.collect_list(F.concat("ref", i))).alias(i)
                 for i in cols]).dropna(subset=cols, how='all')
          )
output.show()
+---+------------------+--------------------+
|key|             val11|               val12|
+---+------------------+--------------------+
|abc|[src:1.1, dst:2.1]|[src:john, dst:jack]|
+---+------------------+--------------------+
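For reference, the comparison these answers aim for can be written out in plain Python, which makes the "accumulate across columns instead of overwriting" point explicit. This is only a sketch: diff_report is a hypothetical helper, the nested dicts stand in for the two dataframes, and the date column is left out for brevity.

```python
# diff_report is a hypothetical helper used only to illustrate the logic.
def diff_report(source, dest, columns):
    """For each key present in both inputs, collect one entry per differing column."""
    report = {}
    for key in sorted(source.keys() & dest.keys()):
        diffs = {}  # accumulates across columns instead of being rebuilt per column
        for column in columns:
            src, dst = source[key][column], dest[key][column]
            if src != dst:
                diffs["difference_in_" + column] = f"[src:{src},dst:{dst}]"
        if diffs:
            report[key] = diffs
    return report

source = {"abc": {"val11": 1.1, "val12": "john"}, "def": {"val11": 3.0, "val12": "dani"}}
dest = {"abc": {"val11": 2.1, "val12": "jack"}, "def": {"val11": 3.0, "val12": "dani"}}
report = diff_report(source, dest, ["val11", "val12"])
print(report)
```

Note that, unlike the chained filters in the first answer, this keeps a key as soon as any one column differs.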

Shift pyspark column value to left by one

I have a pyspark dataframe that looks like this:
|name|age |height|weight|
+----+----+------+------+
|    |Mike|20    |6-7   |
As you can see, the values and the column names are not aligned. For example, "Mike" should be under the "name" column instead of "age".
How can I shift the values left by one so that they match the column names?
The ideal dataframe looks like:
|name|age|height|weight|
+----+---+------+------+
|Mike|20 |6-0   |160   |
Please note that the above data is just an example. In reality I have more than 200 columns and more than 1M rows of data.
Try .toDF with the new column names after dropping the name column from the dataframe.
Example:
df=spark.createDataFrame([('','Mike',20,'6-7',160)],['name','age','height','weight'])
df.show()
#+----+----+------+------+---+
#|name| age|height|weight| _5|
#+----+----+------+------+---+
#|    |Mike|    20|   6-7|160|
#+----+----+------+------+---+
#select all columns except name
df1=df.select(*[i for i in df.columns if i != 'name'])
drop_col=df.columns.pop()
req_cols=[i for i in df.columns if i != drop_col]
df1.toDF(*req_cols).show()
#+----+---+------+------+
#|name|age|height|weight|
#+----+---+------+------+
#|Mike| 20|   6-7|   160|
#+----+---+------+------+
Using spark.createDataFrame():
cols=['name','age','height','weight']
spark.createDataFrame(df.select(*[i for i in df.columns if i != 'name']).rdd,cols).show()
#+----+---+------+------+
#|name|age|height|weight|
#+----+---+------+------+
#|Mike| 20|   6-7|   160|
#+----+---+------+------+
If you are creating the dataframe while reading a file, define a schema whose first column has a dummy name, then drop that column with the .drop() function once the data is read.
spark.read.schema(<struct_type schema>).csv(<path>).drop('<dummy_column_name>')
spark.read.option("header","true").csv(<path>).toDF(<columns_list_with dummy_column>).drop('<dummy_column_name>')
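The renaming trick above amounts to dropping the empty leading value and pairing the remaining values with the intended names. A tiny plain-Python sketch of that pairing, using the single example row from the answer (the variable names are made up for illustration):

```python
# The intended column names and one misaligned row (empty first value).
names = ["name", "age", "height", "weight"]
row = ("", "Mike", 20, "6-7", 160)

# Skip the first (empty) value and pair what remains with the names in order.
shifted = dict(zip(names, row[1:]))
print(shifted)
```

Only the labels are reassigned; no value is moved or recomputed, which is why .toDF is a cheap way to do this in Spark even with many columns.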

Creating multiple columns for a grouped pyspark dataframe

I'm trying to add several new columns to my dataframe (preferably in a for loop), with each new column being the count of certain instances of col B, after grouping by column A.
What doesn't work:
from pyspark.sql import functions as f
#the first one will be fine
df_grouped=df.select('A','B').filter(df.B=='a').groupBy('A').count()
df_grouped.show()
+---+-----+
|  A|count|
+---+-----+
|859|    4|
|947|    2|
|282|    6|
|699|   24|
|153|   12|
+---+-----+
# create the second column:
df_g2=df.select('A','B').filter(df.B=='b').groupBy('A').count()
df_g2.show()
+---+-----+
|  A|count|
+---+-----+
|174|   18|
|153|   20|
|630|    6|
|147|   16|
+---+-----+
#I get an error on adding the new column:
df_grouped=df_grouped.withColumn('2nd_count',f.col(df_g2.select('count')))
The error:
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
I also tried it without using f.col, and with just df_g2.count, but I get an error saying "col should be column".
Something that DOES work:
df_g1=df.select('A','B').filter(df.B=='a').groupBy('A').count()
df_g2=df.select('A','B').filter(df.B=='b').groupBy('A').count()
df_grouped=df_g1.join(df_g2,['A'])
However, I'm going to add up to around 1000 new columns, and having that many joins seems costly. I wonder whether joins are unavoidable, given that every time I group by col A its order changes in the grouped object (e.g. compare the order of column A in df_grouped with its order in df_g2 above), or whether there is a better way to do this.
What you probably need is groupby and pivot.
Try this:
from pyspark.sql import functions as F
df.groupby('A').pivot('B').agg(F.count('B')).show()
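To show what the groupby-plus-pivot count computes, here is a small plain-Python sketch using collections.Counter. The helper pivot_counts and the sample (A, B) pairs are assumptions made for illustration and are not taken from the question's data.

```python
from collections import Counter, defaultdict

# pivot_counts is a hypothetical helper illustrating the pivot logic.
def pivot_counts(rows):
    """Count how often each B value occurs for each A value."""
    counts = defaultdict(Counter)
    for a, b in rows:
        counts[a][b] += 1
    # One entry per A value, one inner key per B value seen with it
    return {a: dict(c) for a, c in counts.items()}

rows = [(153, "a"), (153, "a"), (153, "b"), (174, "b")]
table = pivot_counts(rows)
print(table)
```

All counts come out of a single pass over the data, which is the reason the pivot avoids the roughly 1000 joins. In the Spark result, combinations that never occur show up as null rather than as a missing key.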

Count the number of missing values in a dataframe Spark

I have a dataset with missing values, and I would like to get the number of missing values for each column. The following is what I did; it gives the number of non-missing values. How can I use it to get the number of missing values?
df.describe().filter($"summary" === "count").show
+-------+---+---+---+
|summary|  x|  y|  z|
+-------+---+---+---+
|  count|  1|  2|  3|
+-------+---+---+---+
Any help, please, to get a dataframe listing the columns and the number of missing values for each one?
You could count the missing values by summing the boolean output of the isNull() method, after converting it to type integer:
In Scala:
import org.apache.spark.sql.functions.{sum, col}
df.select(df.columns.map(c => sum(col(c).isNull.cast("int")).alias(c)): _*).show
In Python:
from pyspark.sql.functions import col,sum
df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns)).show()
Alternatively, you could use the output of df.describe().filter($"summary" === "count") and subtract the number in each cell from the number of rows in the data:
In Scala:
import org.apache.spark.sql.functions.{lit, col}
val rows = df.count()
val summary = df.describe().filter($"summary" === "count")
summary.select(df.columns.map(c =>(lit(rows) - col(c)).alias(c)): _*).show
In Python:
from pyspark.sql.functions import col, lit
rows = df.count()
summary = df.describe().filter(col("summary") == "count")
summary.select(*((lit(rows)-col(c)).alias(c) for c in df.columns)).show()
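Another option is to count, per column, the rows where the value is null and collect the result to pandas:
from pyspark.sql.functions import isnull, when, count, col
nacounts = df.select([count(when(isnull(c), c)).alias(c) for c in df.columns]).toPandas()
nacounts
Or, column by column, subtract the number of rows that remain after dropping nulls from the total row count:
for i in df.columns:
    print(i,df.count()-(df.na.drop(subset=i).count()))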
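The two strategies used in these answers (count the nulls directly, or subtract the non-null count from the row count) can be checked against each other in plain Python. The sample rows below are invented so that the non-null counts match the describe() output in the question (x: 1, y: 2, z: 3); count_missing and count_present are hypothetical helpers.

```python
# Hypothetical helpers; None plays the role of a Spark null.
def count_missing(rows, columns):
    """Count None values per column directly."""
    return {c: sum(1 for row in rows if row.get(c) is None) for c in columns}

def count_present(rows, columns):
    """Count non-None values per column, like the 'count' row of describe()."""
    return {c: sum(1 for row in rows if row.get(c) is not None) for c in columns}

rows = [
    {"x": 1, "y": None, "z": 3},
    {"x": None, "y": 4, "z": 5},
    {"x": None, "y": 6, "z": 7},
]
columns = ["x", "y", "z"]
missing = count_missing(rows, columns)
present = count_present(rows, columns)
# Second strategy: total rows minus non-missing values
missing_by_subtraction = {c: len(rows) - present[c] for c in columns}
print(missing, missing_by_subtraction)
```

Both routes give the same per-column result here; in Spark, the direct count needs a single pass, while the per-column loop in the last answer launches a separate job for every column.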