The sample name.csv data:
Name, ,Age, ,Class,
Diwakar,, ,25,, ,12,
, , , , ,
Prabhat, ,27, ,15,
Zyan, ,30, ,17,
Jack, ,35, ,21,
Reading the CSV file:
names = spark.read.csv("name.csv", header="true", inferSchema="true")
names.show()
I am getting this as output, and we are losing some data:
+-------+----+---+---+-----+----+
| Name| 1|Age| 3|Class| _c5|
+-------+----+---+---+-----+----+
|Diwakar|null| | 25| null| |
| | | | | |null|
|Prabhat| | 27| | 15|null|
| Zyan| | 30| | 17|null|
| Jack| | 35| | 21|null|
+-------+----+---+---+-----+----+
I want to have an output like the one given below:
+-------+---+---+---+-----+----+
| Name| 1|Age| 3|Class| _c5|
+-------+---+---+---+-----+----+
|Diwakar| | 25| | 12|null|
| | | | | |null|
|Prabhat| | 27| | 15|null|
| Zyan| | 30| | 17|null|
| Jack| | 35| | 21|null|
+-------+---+---+---+-----+----+
We can read all the fields by defining a schema and using that schema while reading the CSV file; then, with when/otherwise, we can recover the data for the Age and Class columns.
Example:
from pyspark.sql.functions import col, when, length, trim, lower, lit
from pyspark.sql.types import StructType, StructField, StringType

# Define a schema with the same number of columns as the CSV file
sch = StructType([
    StructField("Name", StringType(), True),
    StructField("1", StringType(), True),
    StructField("Age", StringType(), True),
    StructField("3", StringType(), True),
    StructField("Class", StringType(), True),
    StructField("_c5", StringType(), True),
    StructField("_c6", StringType(), True)
])

# Read the CSV file with the explicit schema
df = spark.read.schema(sch).option("header", True).csv("name.csv")

# Shift the misaligned values back into Age and Class, blank out the filler
# columns, and drop the trailing helper column
(df
 .withColumn('Age', when(length(trim(col('Age'))) == 0, col('3')).otherwise(col('Age')))
 .withColumn('1', lit(""))
 .withColumn('3', lit(""))
 .withColumn('Class', when((col('Class').isNull()) | (lower(col('Class')) == 'null'), col('_c6'))
                      .when(length(trim(col('Class'))) == 0, lit("null"))
                      .otherwise(col('Class')))
 .withColumn('_c5', lit("null"))
 .drop("_c6")
 .show())
#+-------+---+---+---+-----+----+
#| Name| 1|Age| 3|Class| _c5|
#+-------+---+---+---+-----+----+
#|Diwakar| | 25| | 12|null|
#| | | | | null|null|
#|Prabhat| | 27| | 15|null|
#| Zyan| | 30| | 17|null|
#| Jack| | 35| | 21|null|
#+-------+---+---+---+-----+----+
I have a raw data DataFrame like this:
+-----------+--------------------+------+
|device | timestamp | value|
+-----------+--------------------+------+
| device_A|2022-01-01 18:00:01 | 100|
| device_A|2022-01-01 18:00:02 | 99|
| device_A|2022-01-01 18:00:03 | 100|
| device_A|2022-01-01 18:00:04 | 102|
| device_A|2022-01-01 18:00:05 | 100|
| device_A|2022-01-01 18:00:06 | 99|
| device_A|2022-01-01 18:00:11 | 98|
| device_A|2022-01-01 18:00:12 | 100|
| device_A|2022-01-01 18:00:13 | 100|
| device_A|2022-01-01 18:00:15 | 101|
| device_A|2022-01-01 18:00:17 | 101|
I'd like to aggregate them into 10-second windows, listing the values and their counts, like this:
+-----------+--------------------+------------+-------+
|device | windowtime | values| counts|
+-----------+--------------------+------------+-------+
| device_A|2022-01-01 18:00:00 |[99,100,102]|[1,3,1]|
| device_A|2022-01-01 18:00:10 |[98,100,101]|[1,2,2]|
The goal is to plot a heat-map of the values later.
I have succeeded in getting the values column, but it is not clear how to calculate the corresponding counts:
.withColumn("values",collect_list(col("value")).over(Window.partitionBy($"device").orderBy($"timestamp".desc)))
How can I do the weighted list aggregation in Apache Spark?
Group by a 10-second time window (using the window function) together with device and value to get the count per value, then group by device + window_time and collect a list of structs:
val result = (
df.groupBy(
$"device",
window($"timestamp", "10 second")("start").as("window_time"),
$"value"
)
.count()
.groupBy("device", "window_time")
.agg(collect_list(struct($"value", $"count")).as("values"))
.withColumn("count", col("values.count"))
.withColumn("values", col("values.value"))
)
result.show()
//+--------+-------------------+--------------+---------+
//| device| window_time| values| count|
//+--------+-------------------+--------------+---------+
//|device_A|2022-01-01 18:00:00|[102, 99, 100]|[1, 2, 3]|
//|device_A|2022-01-01 18:00:10|[100, 101, 98]|[2, 2, 1]|
//+--------+-------------------+--------------+---------+
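If you are working in PySpark rather than Scala, a rough equivalent of the answer above might look like the sketch below (the names win and pairs are just placeholders I introduced):
from pyspark.sql import functions as F

result = (
    df.groupBy("device", F.window("timestamp", "10 seconds").alias("win"), "value")
      .count()                                            # count per (device, 10s window, value)
      .groupBy("device", F.col("win.start").alias("window_time"))
      .agg(F.collect_list(F.struct("value", "count")).alias("pairs"))
      .withColumn("values", F.col("pairs.value"))         # list of distinct values in the window
      .withColumn("counts", F.col("pairs.count"))         # matching list of counts
      .drop("pairs")
)
result.show(truncate=False)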
I have a simple PySpark DataFrame, df1:
df1 = spark.createDataFrame([
("u1", 1),
("u1", 2),
("u2", 3),
("u3", 4),
],
['user_id', 'var1'])
print(df1.printSchema())
df1.show(truncate=False)
Output-
root
|-- user_id: string (nullable = true)
|-- var1: long (nullable = true)
None
+-------+----+
|user_id|var1|
+-------+----+
|u1 |1 |
|u1 |2 |
|u2 |3 |
|u3 |4 |
+-------+----+
I have another PySpark DataFrame, df2:
df2 = spark.createDataFrame([
(1, 'f1'),
(2, 'f2'),
],
['var1', 'var2'])
print(df2.printSchema())
df2.show(truncate=False)
Output-
root
|-- var1: long (nullable = true)
|-- var2: string (nullable = true)
None
+----+----+
|var1|var2|
+----+----+
|1 |f1 |
|2 |f2 |
+----+----+
I have to join the two DataFrames mentioned above using a left-join operation:
df1.join(df2, df1.var1==df2.var1, 'left').show()
Output-
+-------+----+----+----+
|user_id|var1|var1|var2|
+-------+----+----+----+
| u1| 1| 1| f1|
| u1| 2| 2| f2|
| u2| 3|null|null|
| u3| 4|null|null|
+-------+----+----+----+
But as you can see, I am getting null values in the rows for which the two tables don't have a match.
How can I replace all the null values with 0?
You can use fillna. Two fillna calls are needed to account for the integer and string columns.
df1.join(df2, df1.var1==df2.var1, 'left').fillna(0).fillna("0")
You can rename columns after join (otherwise you get columns with the same name) and use a dictionary to specify how you want to fill missing values:
df1.join(df2, df1.var1 == df2.var1, 'left').select(
*[df1['user_id'], df1['var1'], df2['var1'].alias('df2_var1'), df2['var2'].alias('df2_var2')]
).fillna({'df2_var1': 0, 'df2_var2': '0'}).show()
Output:
+-------+----+--------+--------+
|user_id|var1|df2_var1|df2_var2|
+-------+----+--------+--------+
| u1| 1| 1| f1|
| u2| 3| 0| 0|
| u1| 2| 2| f2|
| u3| 4| 0| 0|
+-------+----+--------+--------+
I have data like below. I have to display the year_month column values column-wise. How should I do this? I am new to Spark.
scala> spark.sql("""select sum(actual_calls_count),year_month from ph_com_b_gbl_dice.dm_rep_customer_call group by year_month""")
res0: org.apache.spark.sql.DataFrame = [sum(actual_calls_count): bigint, year_month: string]
scala> res0.show
+-----------------------+----------+
|sum(actual_calls_count)|year_month|
+-----------------------+----------+
| 1| 2019-10|
| 3693| 2018-10|
| 7| 2019-11|
| 32| 2017-10|
| 94| 2019-03|
| 10527| 2018-06|
| 4774| 2017-05|
| 1279| 2017-11|
| 331982| 2018-03|
| 315767| 2018-02|
| 7097| 2017-03|
| 8| 2017-08|
| 3| 2019-07|
| 3136| 2017-06|
| 6088| 2017-02|
| 6344| 2017-04|
| 223426| 2018-05|
| 9819| 2018-08|
| 1| 2017-07|
| 68| 2019-05|
+-----------------------+----------+
only showing top 20 rows
My output should be like this:
sum(actual_calls_count)|year_month1 | year_month2 | year_month3 and so on..
scala> val df = res0.toDF("sum", "year_month")   // rename so the sum column has a simple name
scala> df.groupBy(lit(1)).pivot(col("year_month")).agg(concat_ws("",collect_list(col("sum")))).drop("1").show(false)
+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|2017-02|2017-03|2017-04|2017-05|2017-06|2017-07|2017-08|2017-10|2017-11|2018-02|2018-03|2018-05|2018-06|2018-08|2018-10|2019-03|2019-05|2019-07|2019-10|2019-11|
+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|6088 |7097 |6344 |4774 |3136 |1 |8 |32 |1279 |315767 |331982 |223426 |10527 |9819 |3693 |94 |68 |3 |1 |7 |
+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
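If you are using PySpark instead, a sketch of the same pivot could look like this (it assumes the sum is aliased to total first; the names agg, pivoted, and dummy are mine):
from pyspark.sql import functions as F

# Aggregate first, giving the sum an explicit alias so the pivot can refer to it
agg = spark.sql("""
    SELECT year_month, SUM(actual_calls_count) AS total
    FROM ph_com_b_gbl_dice.dm_rep_customer_call
    GROUP BY year_month
""")

pivoted = (
    agg.groupBy(F.lit(1).alias("dummy"))   # single group, so all months land in one row
       .pivot("year_month")                # one output column per year_month value
       .agg(F.first("total"))
       .drop("dummy")
)
pivoted.show(truncate=False)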
I have recently started working with PySpark, so I don't know many of the details regarding this.
I am trying to create a BinaryType column in a DataFrame, but I'm struggling to do it.
For example, let's take a simple df:
df.show(2)
+----+----+
|col1|col2|
+----+----+
| "1"|null|
| "2"|"20"|
+----+----+
Now I want to have a third column "col3" with BinaryType, like this:
+----+----+--------+
|col1|col2|    col3|
+----+----+--------+
| "1"|null|[1 null]|
| "2"|"20"|  [2 20]|
+----+----+--------+
How should I do it?
Try this:
from pyspark.sql import functions as F

a = [('1', None), ('2', '20')]
df = spark.createDataFrame(a, ['col1', 'col2'])
df.show()
+----+----+
|col1|col2|
+----+----+
| 1|null|
| 2| 20|
+----+----+
df = df.withColumn('col3', F.array(['col1', 'col2']))
df.show()
+----+----+-------+
|col1|col2| col3|
+----+----+-------+
| 1|null| [1,]|
| 2| 20|[2, 20]|
+----+----+-------+
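Note that F.array gives you an ArrayType column, not BinaryType. If you literally need a BinaryType column, one option (just a sketch; the "|" separator and the to_bytes name are my own choices, and the byte encoding is up to you) is a small UDF that packs the strings into bytes:
from pyspark.sql import functions as F
from pyspark.sql.types import BinaryType

# Pack the two string columns into a single bytes value; None becomes an empty string here
to_bytes = F.udf(
    lambda c1, c2: bytearray(((c1 or "") + "|" + (c2 or "")).encode("utf-8")),
    BinaryType(),
)

df = df.withColumn("col3", to_bytes("col1", "col2"))
df.printSchema()   # col3 shows up as binary (nullable = true)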
Considering the following data:
Name | Flag
A | 0
A | 1
A | 0
B | 0
B | 1
B | 1
I'd like to transform it to:
Name | Total | With Flag | Percentage
A | 3 | 1 | 33%
B | 3 | 2 | 66%
Preferably, in Spark SQL.
For example, like this:
val df = sc.parallelize(Seq(
("A", 0), ("A", 1), ("A", 0),
("B", 0), ("B", 1), ("B", 1)
)).toDF("Name", "Flag")
df.groupBy($"Name").agg(
count("*").alias("total"),
sum($"flag").alias("with_flag"),
  // Do you really want to truncate rather than, for example, round?
mean($"flag").multiply(100).cast("integer").alias("percentage"))
// +----+-----+---------+----------+
// |name|total|with_flag|percentage|
// +----+-----+---------+----------+
// | A| 3| 1| 33|
// | B| 3| 2| 66|
// +----+-----+---------+----------+
or:
df.registerTempTable("df")
sqlContext.sql("""
SELECT name, COUNT(*) total, SUM(flag) with_flag,
CAST(AVG(flag) * 100 AS INT) percentage
FROM df
GROUP BY name""")
// +----+-----+---------+----------+
// |name|total|with_flag|percentage|
// +----+-----+---------+----------+
// | A| 3| 1| 33|
// | B| 3| 2| 66|
// +----+-----+---------+----------+
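As a side note, registerTempTable and sqlContext belong to the older API; on Spark 2.x and later the temp-view route looks like this (PySpark shown, the Scala method names are the same):
df.createOrReplaceTempView("df")
spark.sql("""
    SELECT Name, COUNT(*) AS total, SUM(Flag) AS with_flag,
           CAST(AVG(Flag) * 100 AS INT) AS percentage
    FROM df
    GROUP BY Name
""").show()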