Creating rows of a table using the columns of another table in pyspark - dataframe

I have a table (df) which has multiple columns: col1, col2, col3, and so on.
col1 | col2 | col3 | ... | coln
-----|------|------|-----|-----
1    | abc  | 1    | ... | qwe
1    | xyz  |      | ... |
2    |      | 3    | ... |
3    | abc  | 6    | ... | qwe
I want my final table (df) to have the following columns:
attribute_name: contains the name of columns from previous table
count: contains total count of the table
distinct_count: contains distinct count of each column from previous table
null_count: contains count of null values of each column from previous table
The final table should look like this:
attribute_name | count | distinct_count | null_count
---------------|-------|----------------|-----------
col1           | 4     | 3              | 0
col2           | 4     | 2              | 1
col3           | 4     | 3              | 1
coln           | 4     | 1              | 2
Could someone help me with how I can implement this in PySpark?

I didn't test it or check if it is correct, but something like this should work:
from functools import reduce

attr_df_list = []
for column_name in df.columns:
    attr_df_list.append(
        df.selectExpr(
            f"'{column_name}' AS attribute_name",  # the column name as a literal string
            "COUNT(*) AS count",
            f"COUNT(DISTINCT {column_name}) AS distinct_count",
            f"COUNT_IF({column_name} IS NULL) AS null_count"  # COUNT_IF needs Spark 3.0+
        )
    )
result_df = reduce(lambda df1, df2: df1.union(df2), attr_df_list)
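
If COUNT_IF is not available in your Spark version, a minimal DataFrame-API sketch of the same per-column aggregation (untested, assuming the same df and the standard pyspark.sql.functions) could look like this:
from functools import reduce
from pyspark.sql import functions as F

# Build one single-row aggregate per column, then union them all.
attr_df_list = [
    df.agg(
        F.count(F.lit(1)).alias("count"),                           # total row count
        F.countDistinct(F.col(c)).alias("distinct_count"),          # distinct non-null values
        F.count(F.when(F.col(c).isNull(), 1)).alias("null_count")   # rows where the column is null
    ).select(F.lit(c).alias("attribute_name"), "count", "distinct_count", "null_count")
    for c in df.columns
]
result_df = reduce(lambda a, b: a.unionByName(b), attr_df_list)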

Here's a solution:
df = spark.createDataFrame([("apple",1,1),("mango",2,2),("apple",None,3),("mango",None,4)], ["col1","col2","col3"])
df.show()
# Out:
# +-----+----+----+
# | col1|col2|col3|
# +-----+----+----+
# |apple|   1|   1|
# |mango|   2|   2|
# |apple|null|   3|
# |mango|null|   4|
# +-----+----+----+

from pyspark.sql.functions import col

data = [(c,
         df.filter(col(c).isNotNull()).count(),
         df[[c]].distinct().count(),
         df.filter(col(c).isNull()).count()
        ) for c in df.columns]
cols = ['attribute_name', 'count', 'distinct_count', 'null_count']
spark.createDataFrame(data, cols).show()
# Out:
# +--------------+-----+--------------+----------+
# |attribute_name|count|distinct_count|null_count|
# +--------------+-----+--------------+----------+
# |          col1|    4|             2|         0|
# |          col2|    2|             3|         2|
# |          col3|    4|             4|         0|
# +--------------+-----+--------------+----------+
The idea is to loop through the columns of the original dataframe and for each column create a new row with the aggregated data.

Related

How to efficiently split a dataframe in Spark based on a condition?

I have a situation like this with the following Spark dataframe:
id | value
---|------
1  | 0
1  | 3
2  | 4
1  | 0
2  | 2
3  | 0
4  | 1
Now, what I want is to efficiently split this single dataframe into 3 different ones, such that each dataframe extracted from the original spans the rows between two 0s in the "value" column (with each 0 marking the beginning of a new dataframe), using Apache Spark. I would obtain this as the result:
Dataframe 1 (rows from the first 0 value to the last value before the next 0):
id | value
---|------
1  | 0
1  | 3
2  | 4
Dataframe 2 (rows from the second zero value to the last value before the 3rd zero):
id | value
---|------
1  | 0
2  | 2
Dataframe 3:
id | value
---|------
3  | 0
4  | 1
As samkart said, it is not efficient/easy to break up data based on the order of rows; still, if you are using Spark v3.2+ you can leverage pandas on PySpark to do it the Spark way, like below:
import pyspark.pandas as ps
from pyspark.sql import functions as F
from pyspark.sql import Window

pdf = ps.read_csv("/FileStore/tmp4/pand.txt")
sdf = pdf.to_spark(index_col='index')
sdf = sdf.withColumn("run", F.sum(F.when(F.col("value") == 0, 1).otherwise(0)).over(Window.orderBy("index")))
toval = sdf.agg(F.max(F.col("run"))).collect()[0][0]
for x in range(1, toval + 1):
    globals()[f"sdf{x}"] = sdf.filter(F.col("run") == x).drop("index", "run")
For the above data it will create 3 dataframes (sdf1, sdf2, sdf3), as shown below:
sdf1.show()
sdf2.show()
sdf3.show()
#output
+---+-----+
| id|value|
+---+-----+
| 1| 0|
| 1| 3|
| 2| 4|
+---+-----+
+---+-----+
| id|value|
+---+-----+
| 1| 0|
| 2| 2|
+---+-----+
+---+-----+
| id|value|
+---+-----+
| 3| 0|
| 4| 1|
+---+-----+
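
If your data already carries an explicit ordering column, the same running-sum idea can be sketched without pandas-on-Spark (untested; order_col is a hypothetical name for whatever column defines the row order):
from pyspark.sql import functions as F
from pyspark.sql import Window

# Each 0 in "value" starts a new group: a running sum of the 0-markers gives
# every row a group number ("run"), and the data is then split on that number.
w = Window.orderBy("order_col")
sdf = df.withColumn("run", F.sum(F.when(F.col("value") == 0, 1).otherwise(0)).over(w))
runs = [r["run"] for r in sdf.select("run").distinct().collect()]
split_dfs = [sdf.filter(F.col("run") == r).drop("order_col", "run") for r in runs]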

How to increment the value by one if the value of the column is not there in Pyspark

I have the below pyspark dataframe:
a = [["1","fawef"],["","esd"],["","rdf"],["2","ddbf"]]
columns = ["id","name"]
df = spark.createDataFrame(data = a, schema = columns)
id   name
1    fawef
     esd
     rdf
2    ddbf
Now my requirement is: if the id column is empty, then I need to get the max of the id, increment that value by 1, and place the result in a new column for that particular row.
Example:
In the above dataframe, the second row has an empty id column, so I need to take the max of the id column (which is 2) and add 1 to it, giving 3. I then need to place 3 in the second row of the new column.
The output I am expecting:
id   name    new_col
1    fawef   1
     esd     3
     rdf     4
2    ddbf    2
Is there any way to achieve the above output? That would be great.
Incrementing it will be easy if we have the max of the id field and a row_number() wherever the id field is blank.
I used the following data
# +----+----+
# | id|name|
# +----+----+
# | 1|blah|
# |null| yes|
# |null| no|
# | 2|bleh|
# |null|ohno|
# +----+----+
and did the following transformations (func here is pyspark.sql.functions and wd is the Window class):
from pyspark.sql import functions as func
from pyspark.sql.window import Window as wd

data_sdf. \
    withColumn('id_rn', func.row_number().over(wd.partitionBy('id').orderBy(func.lit('1')))). \
    withColumn('new_id',
               func.when(func.col('id').isNull(),
                         func.max('id').over(wd.partitionBy(func.lit('1'))) + func.col('id_rn')).
               otherwise(func.col('id'))
               ). \
    show()
# +----+----+-----+------+
# | id|name|id_rn|new_id|
# +----+----+-----+------+
# |null| yes| 1| 3|
# |null| no| 2| 4|
# |null|ohno| 3| 5|
# | 1|blah| 1| 1|
# | 2|bleh| 1| 2|
# +----+----+-----+------+
I created a new field (id_rn) to assign row numbers to the blank values using row_number().
I then added that row number to the max of the id field whenever id is blank.
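
Note that the asker's id column holds empty strings ("") rather than nulls, while the logic above checks isNull(). A small, untested pre-step (a sketch; column names follow the question) can normalise the blanks first:
from pyspark.sql import functions as func

# Turn empty-string ids into real nulls and cast to int, so that
# max('id') + row_number() above works arithmetically.
data_sdf = df.withColumn(
    'id',
    func.when(func.col('id') == '', None).otherwise(func.col('id')).cast('int')
)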

PySpark calculate percentage that every column is 'missing'

I am using PySpark and trying to calculate, for every column, the percentage of records with missing ('null') values.
The dataframe we are going to work with is df (and there are many more columns):
id | fb     | linkedin | snapchat | ...
---|--------|----------|----------|----
1  | aa     | (null)   | (null)   | ...
2  | (null) | aaa      | (null)   | ...
3  | (null) | (null)   | a        | ...
4  | (null) | (null)   | (null)   | ...
With the following script I am able to get the null rate for every column:
from pyspark.sql.functions import col, count, isnan, lit, round, when
df.select([round((count(when(isnan(c) | col(c).isNull(), c))/count(lit(1))), 6).alias(c) for c in df.columns])
Just wondering how we can calculate the percentage of 'null' values for every column? (Assuming there are many columns, and we don't want to specify every column name.)
Thanks!
Another way would be to create a custom function, calc_null_percent, utilising the best of both worlds from Spark and Pandas.
The custom function returns the total_count and null_count for each column.
Data Preparation
from pyspark import SparkContext
from pyspark.sql import SQLContext
from functools import reduce
import pyspark.sql.functions as F
import pandas as pd
import numpy as np
from io import StringIO
sc = SparkContext.getOrCreate()
sql = SQLContext(sc)
input_str = """
1,0,null,
1,null,0,
null,1,0,
1,0,0,
1,0,0,
null,0,1,
1,1,0,
1,1,null,
null,1,0
""".split(',')
input_values = list(map(lambda x: x.strip() if x.strip() != 'null' else None, input_str))
cols = list(map(lambda x: x.strip() if x.strip() != 'null' else None, "col1,col2,col3".split(',')))
n = len(input_values)
n_col = 3
input_list = [tuple(input_values[i:i+n_col]) for i in range(0,n,n_col)]
sparkDF = sql.createDataFrame(input_list, cols)
sparkDF.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 0|null|
| 1|null| 0|
|null| 1| 0|
| 1| 0| 0|
| 1| 0| 0|
|null| 0| 1|
| 1| 1| 0|
| 1| 1|null|
|null| 1| 0|
+----+----+----+
Custom Func
def calc_null_percent(spark_df, sort=True):
    pd_col_count = spark_df.select(
        [F.count(F.col(c)).alias(c) for (c, c_type) in spark_df.dtypes]
    ).toPandas().T.reset_index().rename(columns={0: 'total_count', 'index': 'column'})

    pd_col_null_count = spark_df.select(
        [F.sum(F.when(F.isnan(c) | F.isnull(c), 1).otherwise(0)).alias(c) for (c, c_type) in spark_df.dtypes]
    ).toPandas().T.reset_index().rename(columns={0: 'null_count', 'index': 'column'})

    final_df = pd.merge(pd_col_count, pd_col_null_count, on=['column'])
    final_df['null_percentage'] = final_df['null_count'] * 100 / final_df['total_count']

    if len(final_df) == 0:
        print("There are no missing values!")
        return None

    return final_df
nullStatsDF = sql.createDataFrame(calc_null_percent(sparkDF))
nullStatsDF.show()
+------+-----------+----------+------------------+
|column|total_count|null_count| null_percentage|
+------+-----------+----------+------------------+
| col1| 6| 3| 50.0|
| col2| 8| 1| 12.5|
| col3| 7| 2|28.571428571428573|
+------+-----------+----------+------------------+
Assuming you do not want to consider a few of the columns for the count of missing values (here I assumed that your column id should not contain missings), you can use the following code:
import pyspark.sql.functions as F

# select columns in which you want to check for missing values
relevant_columns = [c for c in df.columns if c != 'id']

# number of total records
n_records = df.count()

# percentage of rows with all missings in relevant_columns
my_perc = df \
    .select((F.lit(len(relevant_columns)) - (sum(df[c].isNull().cast('int') for c in relevant_columns))).alias('n')) \
    .filter(F.col('n') == 0) \
    .count() / n_records * 100

print(my_perc)
# 25.0

How do I transpose a dataframe with only one row and multiple columns in pyspark?

I have a dataframe with one row:
A B C D E
4 1 7 2 3
I would like to convert this to a dataframe with the following format:
Letter Number
A 4
B 1
C 7
D 2
E 3
I did not find any built-in pyspark function for this in the docs, so I created a very simple function that does the job. Given that your dataframe df has only one row, you can use the following solution.
def my_transpose(df):
    # get values
    letter = df.columns
    number = list(df.take(1)[0].asDict().values())

    # combine values for a new Spark dataframe
    data = [[a, b] for a, b in zip(letter, number)]
    res = spark.createDataFrame(data, ['Letter', 'Number'])
    return res
my_transpose(df).show()
+------+------+
|Letter|Number|
+------+------+
| A| 4|
| B| 1|
| C| 7|
| D| 2|
| E| 3|
+------+------+
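
For completeness, Spark's SQL stack expression can also unpivot a single row without collecting it to the driver first; an untested sketch along those lines (assuming the same one-row df with a common value type across its columns):
expr = "stack({n}, {pairs}) as (Letter, Number)".format(
    n=len(df.columns),
    pairs=", ".join(f"'{c}', {c}" for c in df.columns),
)
# e.g. stack(5, 'A', A, 'B', B, ...) emits one (Letter, Number) row per column
df.selectExpr(expr).show()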

How to create multiple flag columns based on list values found in the dataframe column?

The table looks like this :
ID |CITY
----------------------------------
1 |London|Paris|Tokyo
2 |Tokyo|Barcelona|Mumbai|London
3 |Vienna|Paris|Seattle
The city column contains around 1000+ values which are | delimited
I want to create a flag column to indicate if a person visited only the city of interest.
city_of_interest=['Paris','Seattle','Tokyo']
There are 20 such values in the list.
Output should look like this:
ID |Paris | Seattle | Tokyo
-------------------------------------------
1 |1 |0 |1
2 |0 |0 |1
3 |1 |1 |0
The solution can either be in pandas or pyspark.
For pyspark, use split + array_contains:
from pyspark.sql.functions import split, array_contains
df.withColumn('cities', split('CITY', r'\|')) \
  .select('ID', *[array_contains('cities', c).astype('int').alias(c) for c in city_of_interest]) \
  .show()
+---+-----+-------+-----+
| ID|Paris|Seattle|Tokyo|
+---+-----+-------+-----+
| 1| 1| 0| 1|
| 2| 0| 0| 1|
| 3| 1| 1| 0|
+---+-----+-------+-----+
For Pandas, use Series.str.get_dummies (it splits on '|' by default):
df[city_of_interest] = df.CITY.str.get_dummies()[city_of_interest]
df = df.drop('CITY', axis=1)
Pandas Solution
First transform to list to use DataFrame.explode:
new_df = df.copy()
new_df['CITY'] = new_df['CITY'].str.lstrip('|').str.split('|')
#print(new_df)
#   ID                                CITY
#0   1              [London, Paris, Tokyo]
#1   2  [Tokyo, Barcelona, Mumbai, London]
#2   3            [Vienna, Paris, Seattle]
Then we can use:
Method 1: DataFrame.pivot_table
new_df = (new_df.explode('CITY')
                .pivot_table(columns='CITY', index='ID', aggfunc='size', fill_value=0)
                [city_of_interest]
                .reset_index()
                .rename_axis(columns=None)
          )
print(new_df)
Method 2: DataFrame.groupby + DataFrame.unstack
new_df = (new_df.explode('CITY')
                .groupby(['ID'])
                .CITY
                .value_counts()
                .unstack('CITY', fill_value=0)[city_of_interest]
                .reset_index()
                .rename_axis(columns=None)
          )
print(new_df)
Output new_df:
   ID  Paris  Seattle  Tokyo
0   1      1        0      1
1   2      0        0      1
2   3      1        1      0
Using a UDF to check if the city of interest value is in the delimited column.
from pyspark.sql.functions import udf, array, lit
from pyspark.sql.types import IntegerType

# Input list
city_of_interest = ['Paris', 'Seattle', 'Tokyo']

# UDF definition
def city_present(city_name, city_list):
    return len(set([city_name]) & set(city_list.split('|')))

city_present_udf = udf(city_present, IntegerType())

# Converting the cities list to a column of array type for adding columns to the dataframe
city_array = array(*[lit(city) for city in city_of_interest])
l = len(city_of_interest)
col_names = df.columns + [city for city in city_of_interest]
result = df.select(df.columns + [city_present_udf(city_array[i], df.city) for i in range(l)])
result = result.toDF(*col_names)
result.show()