Spark-scala code error while applying unix_timestamp - apache-spark-sql

I am trying some basic steps in spark-shell with a DataFrame.
I am getting an error for the second statement. Can someone explain why I am getting this result?
scala> stagingDF.select("a_ingestion_dtm").show(2,false)
+--------------------------+
|a_ingestion_dtm           |
+--------------------------+
|2019-07-08 16:10:02.836005|
|2019-07-08 16:10:02.866005|
+--------------------------+
only showing top 2 rows
scala> stagingDF.select("a_ingestion_dtm",unix_timestamp(col("a_ingestion_dtm"))).show(10,false)
<console>:47: error: overloaded method value select with alternatives:
(col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
(cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
cannot be applied to (String, org.apache.spark.sql.Column)
stagingDF.select("a_ingestion_dtm",unix_timestamp(col ("a_ingestion_dtm"))).show(10,false)

Please find the answer below.
scala> res11.show
+--------------------+
|                time|
+--------------------+
|2019-07-08 16:10:...|
|2019-07-08 16:10:...|
+--------------------+
scala> res11.select('time, unix_timestamp($"time")).show
+--------------------+-----------------------------------------+
|                time|unix_timestamp(time, yyyy-MM-dd HH:mm:ss)|
+--------------------+-----------------------------------------+
|2019-07-08 16:10:...|                               1562582402|
|2019-07-08 16:10:...|                               1562582402|
+--------------------+-----------------------------------------+
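As for the error itself: select has two overloads, select(col: String, cols: String*) and select(cols: Column*), so a single call cannot mix a String with a Column. A minimal sketch of the fix, assuming the same stagingDF as in the question and passing every argument as a Column:
import org.apache.spark.sql.functions.{col, unix_timestamp}
// With every argument a Column, the select(cols: Column*) overload applies
stagingDF
  .select(col("a_ingestion_dtm"), unix_timestamp(col("a_ingestion_dtm")))
  .show(10, false)
Note that unix_timestamp returns whole seconds, which is why the two sample rows above, differing only in their fractional seconds, both map to 1562582402.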

Related

Spark scala create dataframe from text with columns split by delimiter | [duplicate]

This question already has answers here:
Java String split not returning the right values
(4 answers)
I am trying to create a Spark DataFrame from a text file in which the data is delimited by the | symbol.
I have to use Spark with Scala.
The text file has the data below:
John|1234|$2500|giggle
Ross|1344|$5500|Micsoft
Jennifer|5432|$2100|healthcare
val schemaString = "name,employeeid,salary,company"
val fields = schemaString.split(",").map(fieldName => StructField(fieldName,StringType, nullable=true))
val schema = StructType(fields)
val rddView = sc.textFile("/dev/path/*").map(_.split("|")).map(x => org.apache.spark.sql.Row(x: _*))
val rddViewDf = sqlContext.createDataFrame(rddView,schema)
rddViewDf.show()
I was expecting the values to be mapped to the corresponding columns, but the output is not as expected.
Can someone provide the correct solution in Spark using the Scala language?
Output I am getting:
+----+----------+------+-------+
|name|employeeid|salary|company|
+----+----------+------+-------+
|   J|         o|     h|      n|
|   R|         o|     s|      s|
|   J|         e|     n|      n|
+----+----------+------+-------+
Expected Output
+---------+----------+------+----------+
|     name|employeeid|salary|   company|
+---------+----------+------+----------+
|     John|      1234| $2500|    giggle|
|     Ross|      1344| $5500|   Micsoft|
| Jennifer|      5432| $2100|healthcare|
+---------+----------+------+----------+
As pointed out in the comments, your split delimiter is incorrect.
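The underlying cause (and the subject of the duplicate linked above) is that split takes a regular expression, and a bare | is the regex alternation operator matching the empty string, so the line is split after every single character, which is exactly why J, o, h, n land in separate columns. If you did want to keep the RDD approach, escaping the pipe is enough; a minimal sketch reusing the path and schema from the question:
import org.apache.spark.sql.Row
// "\\|" (or Pattern.quote("|")) treats the pipe as a literal character rather than regex alternation
val rddView = sc.textFile("/dev/path/*")
  .map(_.split("\\|"))
  .map(x => Row(x: _*))
val rddViewDf = sqlContext.createDataFrame(rddView, schema)
rddViewDf.show()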
However, you should not be using RDDs anyway
scala> spark.read.option("delimiter", "|").csv("data.txt").show()
+--------+----+-----+----------+
|     _c0| _c1|  _c2|       _c3|
+--------+----+-----+----------+
|    John|1234|$2500|    giggle|
|    Ross|1344|$5500|   Micsoft|
|Jennifer|5432|$2100|healthcare|
+--------+----+-----+----------+
https://spark.apache.org/docs/latest/sql-data-sources-csv.html
To rename the columns, see How to read csv without header and name them with names while reading in pyspark? and translate it to Scala; a sketch is shown below.
Note: ideally, your employeeid column would be defined as a LongType, not a StringType.
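If you would rather not translate the linked PySpark answer, here is a minimal Scala sketch, assuming the same data.txt as above, that names the columns directly with toDF and casts employeeid per the note:
// Read the pipe-delimited file and name the columns in one step
val named = spark.read
  .option("delimiter", "|")
  .csv("data.txt")
  .toDF("name", "employeeid", "salary", "company")
// Optional: cast employeeid to a numeric type instead of leaving it as a string
val typed = named.withColumn("employeeid", named("employeeid").cast("long"))
typed.show()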

Pyspark create a summary table with calculated values

I have a data frame that looks like this:
+--------------------+---------------------+-------------+------------+-----+
|tpep_pickup_datetime|tpep_dropoff_datetime|trip_distance|total_amount|isDay|
+--------------------+---------------------+-------------+------------+-----+
| 2019-01-01 09:01:00|  2019-01-01 08:53:20|          1.5|        2.00| true|
| 2019-01-01 21:59:59|  2019-01-01 21:18:59|          2.6|        5.00|false|
| 2019-01-01 10:01:00|  2019-01-01 08:53:20|          1.5|        2.00| true|
| 2019-01-01 22:59:59|  2019-01-01 21:18:59|          2.6|        5.00|false|
+--------------------+---------------------+-------------+------------+-----+
and I want to create a summary table which calculates the trip_rate for all the night trips and all the day trips (total_amount column divided by trip_distance). So the end result should look like this:
+---------+---------+
|day_night|trip_rate|
+---------+---------+
|      Day|     1.33|
|    Night|     1.92|
+---------+---------+
Here is what I'm trying to do:
df2 = spark.createDataFrame(
    [
        ('2019-01-01 09:01:00', '2019-01-01 08:53:20', '1.5', '2.00', 'true'),   # day
        ('2019-01-01 21:59:59', '2019-01-01 21:18:59', '2.6', '5.00', 'false'),  # night
        ('2019-01-01 10:01:00', '2019-01-01 08:53:20', '1.5', '2.00', 'true'),   # day
        ('2019-01-01 22:59:59', '2019-01-01 21:18:59', '2.6', '5.00', 'false'),  # night
    ],
    ['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_distance', 'total_amount', 'day_night']  # add your column labels here
)
day_trip_rate = df2.where(df2.day_night == 'Day').withColumn("trip_rate",F.sum("total_amount")/F.sum("trip_distance"))
night_trip_rate = df2.where(df2.day_night == 'Night').withColumn("trip_rate",F.sum("total_amount")/F.sum("trip_distance"))
I don't believe I'm even approaching this the right way, and I'm getting this error:
raise AnalysisException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.AnalysisException: "grouping expressions sequence is empty, and 'tpep_pickup_datetime' is not an aggregate function.
Can someone help me know how to approach this to get that summary table?
from pyspark.sql import functions as F
from pyspark.sql.functions import *
df2.groupBy("day_night").agg(F.round(F.sum("total_amount")/F.sum("trip_distance"),2).alias('trip_rate'))\
.withColumn("day_night", F.when(col("day_night")=="true", "Day").otherwise("Night")).show()
+---------+---------+
|day_night|trip_rate|
+---------+---------+
|      Day|     1.33|
|    Night|     1.92|
+---------+---------+
Without rounding off:
df2.groupBy("day_night").agg(F.sum("total_amount")/F.sum("trip_distance")).alias('trip_rate')\
.withColumn("day_night", F.when(col("day_night")=="true", "Day").otherwise("Night")).show()
(You have day_night in df2 construction code, but isDay in the display table. I'm considering the field name as day_night here.)

Pyspark dataframe - Illegal values appearing in the column?

So I have a table (sample)
I'm using the PySpark DataFrame API to filter out the 'NOC's that have never won a gold medal, and here's the code I wrote.
First part of my code:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
spark = SQLContext(sc)
df1 = spark.read.format("csv").options(header = 'true').load("D:\\datasets\\athlete_events.csv")
df = df1.na.replace('NA', '-')
countgdf = gdf.groupBy('NOC').agg(count('Medal').alias('No of Gold medals')).select('NOC').show()
It generates this output:
+---+
|NOC|
+---+
|POL|
|JAM|
|BRA|
|ARM|
|MOZ|
|JOR|
|CUB|
|FRA|
|ALG|
|BRN|
+---+
only showing top 10 rows
The next part of the code is something like
allgdf = df.select('NOC').distinct()
This displays the output:
+-----------+
|        NOC|
+-----------+
|   DeRuyter|
|        POL|
|     Russia|
|        JAM|
|        BUR|
|        BRA|
|        ARM|
|        MOZ|
|        CUB|
|        JOR|
|     Sweden|
|        FRA|
|        ALG|
|        SOM|
|        IVB|
|Philippines|
|        BRN|
|        MAL|
|        COD|
|        FSM|
+-----------+
Notice the values that are more than 3 characters? Those are supposed to be values of the 'Team' column, but I'm not sure why they are showing up in the 'NOC' column. It's hard to figure out why this is happening, i.e. why there are illegal values in the column.
When I write the final code
final = allgdf.subtract(countgdf).show()
The same thing happens: illegal values appear in the final dataframe column.
Any help would be appreciated. Thanks.
You should specify a delimiter for your CSV file. By default, Spark uses the comma (,) as the separator.
This can be done, for example, with :
.option("delimiter",";")

Using LIKE operator for multiple words in PySpark

I have a DataFrame df in PySpark, like the one shown below:
+-----+--------------------+-------+
|ID   |customers           |country|
+-----+--------------------+-------+
|56   |xyz Limited         |U.K.   |
|66   |ABC Limited         |U.K.   |
|16   |Sons & Sons         |U.K.   |
|51   |TÜV GmbH            |Germany|
|23   |Mueller GmbH        |Germany|
|97   |Schneider AG        |Germany|
|69   |Sahm UG             |Austria|
+-----+--------------------+-------+
I would like to keep only those rows where ID starts with either 5 or 6. So, I want my final dataframe to look like this:
+-----+--------------------+-------+
|ID   |customers           |country|
+-----+--------------------+-------+
|56   |xyz Limited         |U.K.   |
|66   |ABC Limited         |U.K.   |
|51   |TÜV GmbH            |Germany|
|69   |Sahm UG             |Austria|
+-----+--------------------+-------+
This can be achieved in many ways and it's not a problem, but I am interested in learning how it can be done using a LIKE statement.
Had I only been interested in rows where ID starts with 5, it could have been done easily like this:
df=df.where("ID like ('5%')")
My question: how can I add the second condition, "ID like ('6%')", with an OR (|) boolean inside the where clause? I want to do something like the code shown below, but it gives an error. So, in a nutshell, how can I combine multiple boolean conditions using LIKE and .where here?
df=df.where("(ID like ('5%')) | (ID like ('6%'))")
This works for me
from pyspark.sql import functions as F
df.where(F.col("ID").like('5%') | F.col("ID").like('6%'))
You can try
df = df.where('ID like "5%" or ID like "6%"')
In PySpark, the Spark SQL syntax:
where column_n like 'xyz%' OR column_n like 'abc%'
might not work.
Use:
where column_n RLIKE '^(xyz|abc)'
Explanation: it filters all rows where column_n starts with either xyz or abc.
This works perfectly fine.
For me this worked:
from pyspark.sql.functions import col
df.filter((col("ID").like("5%")) | (col("ID").like("6%")))

Split a column in multiple columns using Spark SQL

I have a column col1 that represents a GPS coordinate format:
25 4.1866N 55 8.3824E
I would like to split it into multiple columns using whitespace as the separator, as in the example output table_example below:
| 1st_split | 2nd_split | 3rd_split | 4th_split |
|-----------|-----------|-----------|-----------|
| 25        | 4.1866N   | 55        | 8.3824E   |
Considering that there is a split() function, I have tried it this way:
SELECT explode(split(`col1`, ' ')) AS `col` FROM table_example;
But instead of splitting into multiple columns, it splits into multiple rows (one row per element of the array).
Can someone clarify which would be the right approach to get the expected result?
If you have a dataframe as
+---------------------+
|col                  |
+---------------------+
|25 4.1866N 55 8.3824E|
+---------------------+
Using the Scala API
You can simply use the built-in split function and select the elements appropriately:
import org.apache.spark.sql.functions._
df.withColumn("split", split(col("col"), " "))
.select(col("split")(0).as("1st_split"), col("split")(1).as("2nd_split"),col("split")(2).as("3rd_split"),col("split")(3).as("4th_split"))
.show(false)
which would give you
+---------+---------+---------+---------+
|1st_split|2nd_split|3rd_split|4th_split|
+---------+---------+---------+---------+
|25       |4.1866N  |55       |8.3824E  |
+---------+---------+---------+---------+
Using the SQL way
SQL is arguably even easier and similar to the API way:
df.createOrReplaceTempView("table_example")
val splitted = sqlContext.sql("SELECT split(`col`, ' ') AS `col` FROM table_example")
splitted.createOrReplaceTempView("splitted_table")
val result = sqlContext.sql("SELECT `col`[0] AS `1st_split`, `col`[1] AS `2nd_split`, `col`[2] AS `3rd_split`, `col`[3] AS `4th_split` FROM splitted_table")
result.show(false)
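As a side note, the two statements can be collapsed into one, since Spark SQL lets you index the array returned by split directly; a sketch against the same table_example view:
// Split once per column expression and index the resulting array in a single SELECT
val singleQuery = sqlContext.sql("""
  SELECT split(`col`, ' ')[0] AS `1st_split`,
         split(`col`, ' ')[1] AS `2nd_split`,
         split(`col`, ' ')[2] AS `3rd_split`,
         split(`col`, ' ')[3] AS `4th_split`
  FROM table_example""")
singleQuery.show(false)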
I hope the answer is helpful