How to convert the first row into columns of an existing dataframe - apache-spark-sql

I have a dataframe like below. I want to convert the first row into the columns of this dataframe.
How could I do this? Is there any way to convert it directly (without using df.first)?
usdata.show()
+----------+---------+--------------------+--------------------+-----------+
|        _1|       _2|                  _3|                  _4|         _5|
+----------+---------+--------------------+--------------------+-----------+
|first_name|last_name|        company_name|             address|       city|
|     James|     Butt| "Benton, John B Jr"|  6649 N Blue Gum St|New Orleans|
| Josephine|  Darakjy|"Chanay, Jeffrey ...| 4 B Blue Ridge Blvd|   Brighton|
|       Art|   Venere|"Chemel, James L ...|8 W Cerritos Ave #54| Bridgeport|
|     Lenna| Paprocki|Feltz Printing Se...|         639 Main St|  Anchorage|
+----------+---------+--------------------+--------------------+-----------+
Regards,
Dinesh
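One possible approach, sketched below (not part of the original thread): if usdata was loaded from a CSV file, simply re-reading it with .option("header", "true") avoids the problem entirely and never touches df.first. Otherwise, a common workaround is to take the first row as the new column names and filter it out, which does read that row once. The file path in the comment is only illustrative (borrowed from the CSV question further down this page), and the filter assumes no data row repeats the header value in _1.
from pyspark.sql import functions as F
# Promote the first row of usdata (columns _1.._5 as shown above) to column names,
# then drop that row from the data.
header = usdata.first()                # Row(_1='first_name', _2='last_name', ...)
usdata = usdata.filter(F.col("_1") != header["_1"]).toDF(*header)
usdata.show()
# Or, if re-reading the source file is an option:
# usdata = spark.read.option("header", "true").csv("file:///E://data//csvdata.csv")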

Related

Consolidating address columns from multiple tables into one column (3 million rows)

I have a table that looks like this:

| common_id | table1_address      | table2_address      | table3_address      | table4_address    |
|-----------|---------------------|---------------------|---------------------|-------------------|
| 123       | null                | null                | stack building12    | null              |
| 157       | 123road street12    | 123road street 33   | 123road street 44   | 123road street 45 |
| 158       | wolf building 451-2 | 451-2 building wolf | wolf building 451-2 | null              |
| 163       | null                | sweet rd. 254-11    | null                | --                |
I have about 3 million rows that contain address information from different tables sharing a common_id. I joined the 4 tables into one table, and I want to combine the address columns into a single address column that looks like this:
| common_id | collaborated_address |
|-----------|----------------------|
| 123       | stack building12     |
| 157       | 123road street12     |
| 158       | wolf building 451-2  |
| 163       | sweet rd. 254-11     |
I tried to do this using pandas, but it takes too long, so I want to do it with Spark SQL or PySpark functions.
Conditions:
When consolidating, collect only the values that are not null and not "--".
As in row common_id 158, it should collect addresses that are mostly the same; in this case, "wolf building 451-2" appears in both the table1_address and table3_address columns.
If all columns contain addresses but they differ slightly, as in row common_id 157, then it should pick any one of them.
There are a few approaches:
Using rdd with the map function:
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.map.html
from pyspark.sql import Row

data = [('James','Smith','M',30),
        ('Anna','Rose','F',41),
        ('Robert','Williams','M',62)]
columns = ["firstname","lastname","gender","salary"]
df = spark.createDataFrame(data=data, schema=columns)
df.show()
# Output:
+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
|    James|   Smith|     M|    30|
|     Anna|    Rose|     F|    41|
|   Robert|Williams|     M|    62|
+---------+--------+------+------+

def isMale(row):
    # Basic Function, replace your address matching logic here.
    if row['gender'] == "M":
        return True
    return False

rdd = df.rdd.map(lambda x: isMale(x))
actual_df = rdd.map(lambda x: Row(x)).toDF()
actual_df
DataFrame[_1: boolean]
actual_df.show()
+-----+
|   _1|
+-----+
| true|
|false|
| true|
+-----+
Using map with dataframes: https://stackoverflow.com/a/45404691/2986344
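For this particular problem, though, the rdd/map detour may not be needed at all: the conditions above amount to "take the first value that is neither null nor '--'", which coalesce can express directly. A minimal sketch (not from the original answer), assuming the joined table is called joined_df and has the column names from the question:
from pyspark.sql import functions as F

# joined_df is assumed to be the result of joining the four tables on common_id.
addr_cols = ["table1_address", "table2_address", "table3_address", "table4_address"]

# Treat "--" like null, then keep the first remaining value per row.
cleaned = [F.when(F.col(c) != "--", F.col(c)) for c in addr_cols]
result = joined_df.select("common_id",
                          F.coalesce(*cleaned).alias("collaborated_address"))
result.show(truncate=False)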

Split Spark Dataframe name column into three columns

I have a dataframe in Spark. The column is name, a string delimited by spaces; the tricky part is that some names have a middle name and others don't. How can I split the column into firstname, middlename and lastname? I am using F.split, but I don't know how to differentiate the middle name from the last name, and I understand I cannot use a negative index in Spark. Take a look at my sample df:
from pyspark.sql import functions as F
cols = ['id', 'name']
vals = [('l03', 'Bob K Barry'), ('S20', 'Cindy Winston'), ('l10', 'Jerry Kyle Moore'), ('j31', 'Dora Larson')]
df = spark.createDataFrame(vals, cols)
df.show()
+---+----------------+
| id|            name|
+---+----------------+
|l03|     Bob K Barry|
|S20|   Cindy Winston|
|l10|Jerry Kyle Moore|
|j31|     Dora Larson|
+---+----------------+
split_col = F.split(df['name'], ' ')
df = df.withColumn('firstname', split_col.getItem(0))
df.show()
+---+----------------+---------+
| id|            name|firstname|
+---+----------------+---------+
|l03|     Bob K Barry|      Bob|
|S20|   Cindy Winston|    Cindy|
|l10|Jerry Kyle Moore|    Jerry|
|j31|     Dora Larson|     Dora|
+---+----------------+---------+
How do I continue to split? Appreciated.
Take the first element of the array as the firstname and the last element as the lastname (using size). If there cannot be more than one middle name, you can do:
from pyspark.sql import functions as F
from pyspark.sql.functions import *
df.withColumn("split_list", F.split(F.col("name"), " ")).withColumn("fn", col("split_list")[0])\
.withColumn("ln", col("split_list")[F.size("split_list") - 1])\
.withColumn("mn", when(F.size("split_list")==2, None)\
.otherwise(col("split_list")[1])).drop("split_list").show()
+---+----------------+-----+-------+----+
| id|            name|   fn|     ln|  mn|
+---+----------------+-----+-------+----+
|l03|     Bob K Barry|  Bob|  Barry|   K|
|S20|   Cindy Winston|Cindy|Winston|null|
|l10|Jerry Kyle Moore|Jerry|  Moore|Kyle|
|j31|     Dora Larson| Dora| Larson|null|
+---+----------------+-----+-------+----+
If there can be more than 1 middle name, then you can use substring on name for middlename column:
df.withColumn("split_list", F.split(F.col("name"), " ")).withColumn("fn", col("split_list")[0])\
.withColumn("ln", col("split_list")[F.size("split_list") - 1])\
.withColumn("mn", when(F.size("split_list")==2, None)\
.otherwise(col('name').substr(F.length("fn")+2, \
F.length("name")-F.length("fn")-F.length("ln")-2))).drop("split_list").show()
+---+----------------+-----+-------+-----+
| id|            name|   fn|     ln|   mn|
+---+----------------+-----+-------+-----+
|l03|     Bob K Barry|  Bob|  Barry|    K|
|S20|   Cindy Winston|Cindy|Winston| null|
|l10|Jerry Kyle Moore|Jerry|  Moore| Kyle|
|j31|     Dora Larson| Dora| Larson| null|
|A12|     Fn A B C Ln|   Fn|     Ln|A B C|
+---+----------------+-----+-------+-----+
I'm assuming that the FN is the first element, and the LN is the last element, and anything in between is the MN. This is not always true as people can have multiple FN/LN.
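A related sketch, not part of the original answer: element_at accepts negative indices (Spark 2.4+), which avoids the size arithmetic for the last name, and the middle name(s) can be re-joined from the split array instead of computed with substr. The slice/array_join call is wrapped in F.expr so it also works on Spark versions where the Python slice helper only accepts integer arguments.
from pyspark.sql import functions as F

parts = F.split(F.col("name"), " ")
df.withColumn("fn", parts.getItem(0)) \
  .withColumn("ln", F.element_at(parts, -1)) \
  .withColumn("mn", F.when(
      # element_at(-1) picks the last element; slice/array_join re-joins everything in between.
      F.size(parts) > 2,
      F.expr("array_join(slice(split(name, ' '), 2, size(split(name, ' ')) - 2), ' ')"))) \
  .show()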

Pyspark create a summary table with calculated values

I have a data frame that looks like this:
+--------------------+---------------------+-------------+------------+-----+
|tpep_pickup_datetime|tpep_dropoff_datetime|trip_distance|total_amount|isDay|
+--------------------+---------------------+-------------+------------+-----+
| 2019-01-01 09:01:00|  2019-01-01 08:53:20|          1.5|        2.00| true|
| 2019-01-01 21:59:59|  2019-01-01 21:18:59|          2.6|        5.00|false|
| 2019-01-01 10:01:00|  2019-01-01 08:53:20|          1.5|        2.00| true|
| 2019-01-01 22:59:59|  2019-01-01 21:18:59|          2.6|        5.00|false|
+--------------------+---------------------+-------------+------------+-----+
and I want to create a summary table which calculates the trip_rate for all the night trips and all the day trips (total_amount column divided by trip_distance). So the end result should look like this:
+-----------+-----------+
| day_night | trip_rate |
+-----------+-----------+
| Day       | 1.33      |
| Night     | 1.92      |
+-----------+-----------+
Here is what I'm trying to do:
df2 = spark.createDataFrame(
    [
        ('2019-01-01 09:01:00', '2019-01-01 08:53:20', '1.5', '2.00', 'true'),   # day
        ('2019-01-01 21:59:59', '2019-01-01 21:18:59', '2.6', '5.00', 'false'),  # night
        ('2019-01-01 10:01:00', '2019-01-01 08:53:20', '1.5', '2.00', 'true'),   # day
        ('2019-01-01 22:59:59', '2019-01-01 21:18:59', '2.6', '5.00', 'false'),  # night
    ],
    ['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_distance', 'total_amount', 'day_night']  # add your column labels here
)
day_trip_rate = df2.where(df2.day_night == 'Day').withColumn("trip_rate",F.sum("total_amount")/F.sum("trip_distance"))
night_trip_rate = df2.where(df2.day_night == 'Night').withColumn("trip_rate",F.sum("total_amount")/F.sum("trip_distance"))
I don't believe I'm even approaching it the right way, and I'm getting this error :(
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "grouping expressions sequence is empty, and 'tpep_pickup_datetime' is not an aggregate function.
Can someone help me know how to approach this to get that summary table?
from pyspark.sql import functions as F
from pyspark.sql.functions import *
df2.groupBy("day_night").agg(F.round(F.sum("total_amount")/F.sum("trip_distance"),2).alias('trip_rate'))\
.withColumn("day_night", F.when(col("day_night")=="true", "Day").otherwise("Night")).show()
+---------+---------+
|day_night|trip_rate|
+---------+---------+
|      Day|     1.33|
|    Night|     1.92|
+---------+---------+
Without rounding off:
df2.groupBy("day_night").agg(F.sum("total_amount")/F.sum("trip_distance")).alias('trip_rate')\
.withColumn("day_night", F.when(col("day_night")=="true", "Day").otherwise("Night")).show()
(You have day_night in df2 construction code, but isDay in the display table. I'm considering the field name as day_night here.)
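Since the question also mentions Spark SQL, here is a sketch of the same aggregation in SQL (assuming the df2 built above; the temp-view name trips is arbitrary):
df2.createOrReplaceTempView("trips")
spark.sql("""
    SELECT CASE WHEN day_night = 'true' THEN 'Day' ELSE 'Night' END AS day_night,
           ROUND(SUM(total_amount) / SUM(trip_distance), 2) AS trip_rate
    FROM trips
    GROUP BY CASE WHEN day_night = 'true' THEN 'Day' ELSE 'Night' END
""").show()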

Spark Read CSV doesn't preserve the double quotes while reading

I am trying to read a CSV file in which one column has double quotes, like below.
James,Butt,"Benton, John B Jr",6649 N Blue Gum St
Josephine,Darakjy,"Chanay, Jeffrey A Esq",4 B Blue Ridge Blvd
Art,Venere,"Chemel, James L Cpa",8 W Cerritos Ave #54
Lenna,Paprocki,Feltz Printing Service,639 Main St,Anchorage
Donette,Foller,Printing Dimensions,34 Center St,Hamilton
Simona,Morasca,"Chapman, Ross E Esq",3 Mcauley Dr
I am using the below code to keep the double quotes as they are in the CSV file (a few rows have double quotes and a few don't).
val df_usdata = spark.read.format("com.databricks.spark.csv")//
.option("header","true")//
.option("quote","\"")//
.load("file:///E://data//csvdata.csv")
df_usdata.show(false)
But it didn't preserve the double quotes inside the dataframe, though it should.
The .option("quote","\"") is not working. I am using Spark version 2.3.1.
The output should be like below.
+----------+---------+-------------------------+---------------------+
|first_name|last_name|company_name |address |
+----------+---------+-------------------------+---------------------+
|James |Butt |"Benton, John B Jr" |6649 N Blue Gum St |
|Josephine |Darakjy |"Chanay, Jeffrey A Esq" |4 B Blue Ridge Blvd |
|Art |Venere |"Chemel, James L Cpa" |8 W Cerritos Ave #54 |
|Lenna |Paprocki |Feltz Printing Service |639 Main St |
|Donette |Foller |Printing Dimensions |34 Center St |
|Simona |Morasca |"Chapman, Ross E Esq" |3 Mcauley Dr |
+----------+---------+-------------------------+---------------------+
Try an empty quote character, .option("quote",""), instead.
val df_usdata = spark.read.format("com.databricks.spark.csv")//
.option("header","true")//
.option("quote","")//
.load("file:///E://data//csvdata.csv")
df_usdata.show(false)
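For completeness, a PySpark sketch of the same read; "csv" is the built-in short name for the com.databricks.spark.csv format on Spark 2.x, and the path is just the asker's example path.
df_usdata = (spark.read.format("csv")
             .option("header", "true")
             .option("quote", "")
             .load("file:///E://data//csvdata.csv"))
df_usdata.show(truncate=False)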

How to do Multi Index Pivot when index and values are in the same column?

I have this frame:
import pandas as pd

regions = pd.read_html('http://www.mapsofworld.com/usa/usa-maps/united-states-regional-maps.html')
messy_regions = regions[8]
Which yields something like this:
|    | 0                               | 1                               |
|----|---------------------------------|---------------------------------|
| 0  | Region 1 (The Northeast)        | nan                             |
| 1  | Division 1 (New England)        | Division 2 (Middle Atlantic)    |
| 2  | Maine                           | New York                        |
| 3  | New Hampshire                   | Pennsylvania                    |
| 4  | Vermont                         | New Jersey                      |
| 5  | Massachusetts                   | nan                             |
| 6  | Rhode Island                    | nan                             |
| 7  | Connecticut                     | nan                             |
| 8  | Region 2 (The Midwest)          | nan                             |
| 9  | Division 3 (East North Central) | Division 4 (West North Central) |
| 10 | Wisconsin                       | North Dakota                    |
| 11 | Michigan                        | South Dakota                    |
| 12 | Illinois                        | Nebraska                        |
The goal is to make this a tidy dataframe. I think I need to pivot in order to get the regions and divisions as columns, with the states as values under the correct region/division; once it's in that shape I can just melt to the desired form. I can't figure out, though, how to extract what would be the column headers out of this. Any help is appreciated, and at the very least a good point in the right direction.
You can use:
import pandas as pd

url = 'http://www.mapsofworld.com/usa/usa-maps/united-states-regional-maps.html'
#input dataframe with columns a, b
df = pd.read_html(url)[8]
df.columns = ['a','b']

#extract Region data to new column
df['Region'] = df['a'].where(df['a'].str.contains('Region', na=False)).ffill()

#reshaping, remove rows with NaNs, remove column variable
df = pd.melt(df, id_vars='Region', value_name='Names') \
       .sort_values(['Region', 'variable']) \
       .dropna() \
       .drop('variable', axis=1)

#extract Division data to new column
df['Division'] = df['Names'].where(df['Names'].str.contains('Division', na=False)).ffill()

#remove duplicates from column Names, change order of columns
df = df[(df.Division != df.Names) & (df.Region != df.Names)] \
       .reset_index(drop=False) \
       .reindex_axis(['Region','Division','Names'], axis=1)

#temporarily display all columns
with pd.option_context('display.expand_frame_repr', False):
    print (df)
                      Region                         Division          Names
0   Region 1 (The Northeast)         Division 1 (New England)          Maine
1   Region 1 (The Northeast)         Division 1 (New England)  New Hampshire
2   Region 1 (The Northeast)         Division 1 (New England)        Vermont
3   Region 1 (The Northeast)         Division 1 (New England)  Massachusetts
4   Region 1 (The Northeast)         Division 1 (New England)   Rhode Island
5   Region 1 (The Northeast)         Division 1 (New England)    Connecticut
6   Region 1 (The Northeast)     Division 2 (Middle Atlantic)       New York
7   Region 1 (The Northeast)     Division 2 (Middle Atlantic)   Pennsylvania
8   Region 1 (The Northeast)     Division 2 (Middle Atlantic)     New Jersey
9     Region 2 (The Midwest)  Division 3 (East North Central)      Wisconsin
10    Region 2 (The Midwest)  Division 3 (East North Central)       Michigan
11    Region 2 (The Midwest)  Division 3 (East North Central)       Illinois
12    Region 2 (The Midwest)  Division 3 (East North Central)        Indiana
13    Region 2 (The Midwest)  Division 3 (East North Central)           Ohio
...
...
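A small compatibility note on the answer above: reindex_axis was deprecated and later removed from pandas, so on current pandas versions the last reshaping step would be written with reindex instead (a sketch, same behaviour assumed):
#same selection/reordering step for pandas versions where reindex_axis has been removed
df = df[(df.Division != df.Names) & (df.Region != df.Names)] \
       .reset_index(drop=False) \
       .reindex(columns=['Region','Division','Names'])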