I have a data frame that has an existing column with airport names, and I want to create another column with their abbreviations.
For example, I have an existing column with the following values:
SEATTLE TACOMA AIRPORT, WA US
MIAMI INTERNATIONAL AIRPORT, FL US
SAN FRANCISCO INTERNATIONAL AIRPORT, CA US
MIAMI INTERNATIONAL AIRPORT, FL US
MIAMI INTERNATIONAL AIRPORT, FL US
SAN FRANCISCO INTERNATIONAL AIRPORT, CA US
SEATTLE TACOMA AIRPORT, WA US
I would like to create a new column with their associated abbreviations, e.g. SEA, MIA, and SFO. I was thinking I could use a for loop to achieve that, but I am not sure how to code it exactly.
Here are two sample approaches:
using a dict and a UDF
using a second DataFrame to join with
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
s = """\
SEATTLE TACOMA AIRPORT, WA US
MIAMI INTERNATIONAL AIRPORT, FL US
SAN FRANCISCO INTERNATIONAL AIRPORT, CA US
MIAMI INTERNATIONAL AIRPORT, FL US
MIAMI INTERNATIONAL AIRPORT, FL US
SAN FRANCISCO INTERNATIONAL AIRPORT, CA US
SEATTLE TACOMA AIRPORT, WA US"""
abbr = {
    "SEATTLE TACOMA AIRPORT": "SEA",
    "MIAMI INTERNATIONAL AIRPORT": "MIA",
    "SAN FRANCISCO INTERNATIONAL AIRPORT": "SFO",
}
df = spark.read.csv(sc.parallelize(s.splitlines()))
print("=== df ===")
df.show()
# =================================
# 1. using a UDF
# =================================
print("=== using a UDF ===")
udf_airport_to_abbr = udf(lambda airport: abbr[airport], StringType())
df.withColumn("abbr", udf_airport_to_abbr("_c0")).show()
# =================================
# 2. using a join
# =================================
# you may want to create this df in some different way ;)
df_abbrs = spark.read.csv(sc.parallelize(["%s,%s" % x for x in abbr.items()]))
print("=== df_abbrs ===")
df_abbrs.show()
print("=== using a join ===")
df.join(df_abbrs, on="_c0").show()
Output:
=== df ===
+--------------------+------+
| _c0| _c1|
+--------------------+------+
|SEATTLE TACOMA AI...| WA US|
|MIAMI INTERNATION...| FL US|
|SAN FRANCISCO INT...| CA US|
|MIAMI INTERNATION...| FL US|
|MIAMI INTERNATION...| FL US|
|SAN FRANCISCO INT...| CA US|
|SEATTLE TACOMA AI...| WA US|
+--------------------+------+
=== using a UDF ===
+--------------------+------+----+
| _c0| _c1|abbr|
+--------------------+------+----+
|SEATTLE TACOMA AI...| WA US| SEA|
|MIAMI INTERNATION...| FL US| MIA|
|SAN FRANCISCO INT...| CA US| SFO|
|MIAMI INTERNATION...| FL US| MIA|
|MIAMI INTERNATION...| FL US| MIA|
|SAN FRANCISCO INT...| CA US| SFO|
|SEATTLE TACOMA AI...| WA US| SEA|
+--------------------+------+----+
=== df_abbrs ===
+--------------------+---+
| _c0|_c1|
+--------------------+---+
|SEATTLE TACOMA AI...|SEA|
|MIAMI INTERNATION...|MIA|
|SAN FRANCISCO INT...|SFO|
+--------------------+---+
=== using a join ===
+--------------------+------+---+
| _c0| _c1|_c1|
+--------------------+------+---+
|SEATTLE TACOMA AI...| WA US|SEA|
|SEATTLE TACOMA AI...| WA US|SEA|
|SAN FRANCISCO INT...| CA US|SFO|
|SAN FRANCISCO INT...| CA US|SFO|
|MIAMI INTERNATION...| FL US|MIA|
|MIAMI INTERNATION...| FL US|MIA|
|MIAMI INTERNATION...| FL US|MIA|
+--------------------+------+---+
You can add a new column to a DataFrame, and it will return a new DataFrame (DataFrames are immutable).
You can use dataframe.withColumn(newcolumnname, case expression decoding the name to its abbreviation)
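A plain-Python sketch of that decode logic (the fallback value "UNK" for unknown airports is an assumption; in Spark the same lookup would be a chain of when(...) conditions or a CASE expression):

```python
# Name-to-abbreviation decode; dict.get avoids a KeyError (and therefore
# a failed Spark task) when a name is missing from the mapping.
abbr = {
    "SEATTLE TACOMA AIRPORT": "SEA",
    "MIAMI INTERNATIONAL AIRPORT": "MIA",
    "SAN FRANCISCO INTERNATIONAL AIRPORT": "SFO",
}

def decode(airport):
    # "UNK" is an assumed placeholder for unmapped airports
    return abbr.get(airport, "UNK")
```

Wrapping decode in a udf instead of the bare lambda above would make the UDF approach safe against airports that are not in the dict.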
For example, I have the following table in SQL Server:
name | from     | to        | traveling date
Mike | london   | Paris     | 5/jan/2022
Mike | Paris    | Barcelona | 5/jan/2022
Sam  | Cairo    | Riyadh    | 6/Mar/2022
Sam  | Riyadh   | Dubai     | 6/Mar/2022
Sam  | Dubai    | Maldives  | 7/Mar/2022
Sam  | Maldives | Riyadh    | 13/Mar/2022
Sam  | Riyadh   | Cairo     | 13/Mar/2022
And the result I want must have new columns based on other columns and rows, like below:
name | from     | to        | traveling date | Route            | Date
Mike | London   | Paris     | 5/Jan/2022     | London-Barcelona | 5/Jan/2022
Mike | Paris    | Barcelona | 5/Jan/2022     |                  |
Sam  | Cairo    | Riyadh    | 6/Mar/2022     | Cairo-Maldives   | 6/Mar/2022
Sam  | Riyadh   | Dubai     | 6/Mar/2022     |                  |
Sam  | Dubai    | Maldives  | 7/Mar/2022     |                  |
Sam  | Maldives | Riyadh    | 13/Mar/2022    | Maldives-Cairo   | 13/Mar/2022
Sam  | Riyadh   | Cairo     | 13/Mar/2022    |                  |
As you can see, Sam had a round-trip ticket: he arrived the day after traveling and stayed a couple of days in the Maldives.
All I care about is treating each Route as one ticket and taking the traveling date from 'Date'.
Help please.
I would like to get a cartesian product of several tables in SQL (which are actually only one column, so no common key). For example:
TABLE A
Robert
Pierre
Samuel
TABLE B
Montreal
Chicago
TABLE C
KLM
AIR FRANCE
FINAL TABLE (CROSS PRODUCT)
Robert | Montreal | KLM
Pierre | Montreal | KLM
Samuel | Montreal | KLM
Robert | Chicago | KLM
Pierre | Chicago | KLM
Samuel | Chicago | KLM
Robert | Montreal | AIR FRANCE
Pierre | Montreal | AIR FRANCE
Samuel | Montreal | AIR FRANCE
Robert | Chicago | AIR FRANCE
Pierre | Chicago | AIR FRANCE
Samuel | Chicago | AIR FRANCE
I tried CROSS JOIN, but I couldn't find an example with multiple tables. Is nesting the only way to do it? What if we have 15 tables to join that way... it creates very long code.
Thank you!
You would simply use:
select *
from a cross join b cross join c;
Do note that if any of the tables are empty (i.e. no rows), you will get no results.
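The same cross product can be sketched in plain Python with itertools.product, which also shows why an empty table yields no rows (the values are taken from the question's tables):

```python
from itertools import product

# The three one-column "tables" from the question
a = ["Robert", "Pierre", "Samuel"]
b = ["Montreal", "Chicago"]
c = ["KLM", "AIR FRANCE"]

# Cartesian product of the three tables: 3 * 2 * 2 = 12 rows,
# exactly what "a cross join b cross join c" produces
rows = list(product(a, b, c))

# An empty table makes the whole product empty, just like CROSS JOIN
empty = list(product(a, [], c))
```

Chaining 15 tables is just product(t1, t2, ..., t15); the SQL equivalent stays flat as well, since cross join needs no ON clause or nesting.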
I have these sample values
prm_2020 P02 United Kingdom London 2 for 2
prm_2020 P2 United Kingdom London 2 for 2
prm_2020 P10 United Kingdom London 2 for 2
prm_2020 P11 United Kingdom London 2 for 2
I need to find P2, P02, P11, p06, p05, and so on; I am trying to use the regexp_extract function in Databricks but struggling to find the correct expression. Once I find P10 or p6 in the string, I need to put the numbers in a new column called ID.
select distinct
promo_name
,regexp_extract(promo_name, '(?<=p\d+\s+)P\d+') as regexp_id
from stock
where promo_name is not null
select distinct
promo_name
,regexp_extract(promo_name, 'P[0-9]+') as regexp_id
from stock
where promo_name is not null
Both are generating errors.
The expression would be:
select regexp_extract(col, 'P[0-9]+')
One regex could be (?<=prm_\d+\s+)P\d+
Besides searching for strings in the form of P* where * is a digit, it also checks that such strings are preceded by strings in the form prm_* where * is a digit.
Keep in mind case sensitivity. The solution above IS case sensitive (if your input comes as PRM, the match will be discarded). I am not familiar with apache-spark, but I assume it supports a case-insensitivity flag such as (?i), as other platforms do.
Regexr.com demo
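The idea can be checked in plain Python (note that Python's re module only allows fixed-width lookbehinds, so a capture group stands in for the lookbehind here; the sample rows are assumptions based on the question):

```python
import re

rows = [
    "prm_2020 P02 United Kingdom London 2 for 2",
    "prm_2020 P2 United Kingdom London 2 for 2",
    "PRM_2020 p10 United Kingdom London 2 for 2",  # mixed case
]

# Capture group instead of a lookbehind: keep only P-codes that are
# preceded by prm_<digits>; re.I makes the match case insensitive.
ids = [m.group(1) for r in rows if (m := re.search(r"prm_\d+\s+(P\d+)", r, re.I))]
```

In Spark SQL the equivalent case-insensitive pattern would use the inline flag, e.g. '(?i)prm_\\d+\\s+(P\\d+)' with group index 1.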
Just select group 0:
regexp_extract(promo_name, 'P[0-9]+',0)
The regexp_extract function takes 3 parameters:
Column value
Regex Pattern
Group Index
def regexp_extract(e: org.apache.spark.sql.Column,exp: String,groupIdx: Int): org.apache.spark.sql.Column
You are missing the last parameter in the regexp_extract function.
Check below code.
scala> df.show(truncate=false)
+------------------------------------------+
|data |
+------------------------------------------+
|prm_2020 P02 United Kingdom London 2 for 2|
|prm_2020 P2 United Kingdom London 2 for 2 |
|prm_2020 P10 United Kingdom London 2 for 2|
|prm_2020 P11 United Kingdom London 2 for 2|
+------------------------------------------+
df
.withColumn("parsed_data", regexp_extract(col("data"), "(P[0-9]*)", 0))
.show(truncate=false)
+------------------------------------------+-----------+
|data |parsed_data|
+------------------------------------------+-----------+
|prm_2020 P02 United Kingdom London 2 for 2|P02 |
|prm_2020 P2 United Kingdom London 2 for 2 |P2 |
|prm_2020 P10 United Kingdom London 2 for 2|P10 |
|prm_2020 P11 United Kingdom London 2 for 2|P11 |
+------------------------------------------+-----------+
df.createTempView("tbl")
spark
.sql("select data, regexp_extract(data, '(P[0-9]*)', 0) as parsed_data from tbl")
.show(truncate=false)
+------------------------------------------+-----------+
|data |parsed_data|
+------------------------------------------+-----------+
|prm_2020 P02 United Kingdom London 2 for 2|P02 |
|prm_2020 P2 United Kingdom London 2 for 2 |P2 |
|prm_2020 P10 United Kingdom London 2 for 2|P10 |
|prm_2020 P11 United Kingdom London 2 for 2|P11 |
+------------------------------------------+-----------+
I want to split the below string (present in a single column), separated by spaces, starting from the end. For the 3 rows below, I want the following output:
OUTPUT:
Country | STATE | STREET | UNIT
AU      | NSW   | 2      | 12
AU      | NSW   |        | 51
AU      | NSW   |        | 12
INPUT:
12 2 NOELA PLACE ST MARYS NSW 2760 AU
51 MALABAR ROAD SOUTH COOGEE NSW 2034 AU
12 LISTER STREET WINSTON HILLS NSW 2153 AU
Of course, such conditional parsing is not reliable:
t=# with v(a) as( values('12 2 NOELA PLACE ST MARYS NSW 2760 AU')
,('51 MALABAR ROAD SOUTH COOGEE NSW 2034 AU')
,('12 LISTER STREET WINSTON HILLS NSW 2153 AU')
)
select reverse(split_part(reverse(a),' ',1))
     , reverse(split_part(reverse(a),' ',3))
     , case when split_part(a,' ',2) ~ '\d' then split_part(a,' ',2) end st
     , split_part(a,' ',1) un
from v;
reverse | reverse | st | un
---------+---------+----+----
AU | NSW | 2 | 12
AU | NSW | | 51
AU | NSW | | 12
(3 rows)
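The same end-anchored parsing sketched in plain Python, with fields indexed from the right as in the SQL above (the all-digits test is a slight simplification of the SQL's ~ '\d' contains-a-digit check):

```python
def parse(addr):
    # Split on spaces and index from the end: the last token is the
    # country, the third-from-last is the state. The unit is the first
    # token; the street number is the second token only when numeric.
    toks = addr.split()
    street = toks[1] if toks[1].isdigit() else None
    return {"country": toks[-1], "state": toks[-3], "street": street, "unit": toks[0]}
```

Like the SQL version, this is positional guesswork: it breaks as soon as an address deviates from the "unit [street] ... state postcode country" shape.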
I have a Hive table with a column state:
**state**
taxes, TX
Washington, WA
New York, NY
New Jersey, NJ
Now I want to separate the state column and write it into new columns as
**state** **code**
taxes TX
Washington WA
New York NY
New Jersey NJ
select split(state,',')[0] as state
,ltrim(split(state,',')[1]) as code
from mytable
+------------+------+
| state | code |
+------------+------+
| taxes | TX |
| Washington | WA |
| New York | NY |
| New Jersey | NJ |
+------------+------+
select substr(name, 0, instr(name,',')-1), substr(name, instr(name,',')+1, 10) from aa
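For comparison, the split-and-trim from the queries above in plain Python (str.split stands in for Hive's split, and lstrip for ltrim):

```python
def split_state(value):
    # split(state, ',')[0] and ltrim(split(state, ',')[1]) from the Hive query
    state, code = value.split(",", 1)
    return state, code.lstrip()
```

The maxsplit=1 argument keeps any commas inside the state name intact, matching the behavior of indexing only elements [0] and [1] of Hive's split result.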