Spark Read CSV doesn't preserve the double quotes while reading - apache-spark-sql

I am trying to read a CSV file in which one column contains double quotes, like below.
James,Butt,"Benton, John B Jr",6649 N Blue Gum St
Josephine,Darakjy,"Chanay, Jeffrey A Esq",4 B Blue Ridge Blvd
Art,Venere,"Chemel, James L Cpa",8 W Cerritos Ave #54
Lenna,Paprocki,Feltz Printing Service,639 Main St,Anchorage
Donette,Foller,Printing Dimensions,34 Center St,Hamilton
Simona,Morasca,"Chapman, Ross E Esq",3 Mcauley Dr
I am using the code below to keep the double quotes exactly as they appear in the CSV file (a few rows have double quotes and a few don't).
val df_usdata = spark.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("quote", "\"")
  .load("file:///E://data//csvdata.csv")
df_usdata.show(false)
But it didn't preserve the double quotes inside the DataFrame, which it should. The .option("quote", "\"") setting is not working. I am using Spark 2.3.1.
The output should be like below.
+----------+---------+-------------------------+---------------------+
|first_name|last_name|company_name |address |
+----------+---------+-------------------------+---------------------+
|James |Butt |"Benton, John B Jr" |6649 N Blue Gum St |
|Josephine |Darakjy |"Chanay, Jeffrey A Esq" |4 B Blue Ridge Blvd |
|Art |Venere |"Chemel, James L Cpa" |8 W Cerritos Ave #54 |
|Lenna |Paprocki |Feltz Printing Service |639 Main St |
|Donette |Foller |Printing Dimensions |34 Center St |
|Simona |Morasca |"Chapman, Ross E Esq" |3 Mcauley Dr |
+----------+---------+-------------------------+---------------------+

Try an empty quote character, .option("quote", ""), instead.
val df_usdata = spark.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("quote", "")
  .load("file:///E://data//csvdata.csv")
df_usdata.show(false)

Related

collaborating address columns from multiple tables into one column (3 million rows)

I have a table that looks like this:
+---------+-------------------+-------------------+-------------------+-----------------+
|common_id|table1_address     |table2_address     |table3_address     |table4_address   |
+---------+-------------------+-------------------+-------------------+-----------------+
|123      |null               |null               |stack building12   |null             |
|157      |123road street12   |123road street 33  |123road street 44  |123road street 45|
|158      |wolf building 451-2|451-2 building wolf|wolf building 451-2|null             |
|163      |null               |sweet rd. 254-11   |null               |--               |
+---------+-------------------+-------------------+-------------------+-----------------+
I have about 3 million rows containing address information from different tables sharing a common_id. I joined the 4 tables into one table. I want to collapse the address columns into a single address column that looks like this:
+---------+--------------------+
|common_id|collaborated_address|
+---------+--------------------+
|123      |stack building12    |
|157      |123road street12    |
|158      |wolf building 451-2 |
|163      |sweet rd. 254-11    |
+---------+--------------------+
I tried to do this using pandas, but it takes too long, so I want to do it with Spark SQL or PySpark functions.
Conditions:
when collapsing, it should collect only values that are not null and not "--"
for a row like common_id 158, the addresses are mostly the same; in this case, "wolf building 451-2" appears in both the table1_address and table3_address columns
if all columns contain addresses that differ slightly, as in row common_id 157, then any one of the addresses may be collected
There are a few approaches:
Using an RDD with the map function:
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.map.html
from pyspark.sql import Row

data = [('James', 'Smith', 'M', 30),
        ('Anna', 'Rose', 'F', 41),
        ('Robert', 'Williams', 'M', 62)]
columns = ["firstname", "lastname", "gender", "salary"]
df = spark.createDataFrame(data=data, schema=columns)
df.show()
# Output:
# +---------+--------+------+------+
# |firstname|lastname|gender|salary|
# +---------+--------+------+------+
# |    James|   Smith|     M|    30|
# |     Anna|    Rose|     F|    41|
# |   Robert|Williams|     M|    62|
# +---------+--------+------+------+

def isMale(row):
    # Basic function; replace with your address-matching logic.
    if row['gender'] == "M":
        return True
    return False

rdd = df.rdd.map(lambda x: isMale(x))
actual_df = rdd.map(lambda x: Row(x)).toDF()
actual_df
# DataFrame[_1: boolean]
actual_df.show()
# +-----+
# |   _1|
# +-----+
# | true|
# |false|
# | true|
# +-----+
Using map with dataframes: https://stackoverflow.com/a/45404691/2986344

How to merge two variables in a Dataset using SAS

I have a dataset from an imported file.
Now there are two variables that need to be merged into one variable because the data is identical.
arr and arr_nbr should be merged into arr_nbr.
How can I get that done?
Original:
|name |db |arr |arr_nbr|
+-----+--------+----+-------+
|john |10121960|0456| |
|jane |04071988| |8543 |
|mia |01121955|9583| |
|liam |23091973| |7844 |
Desired output:
|name |db |arr_nbr|
+-----+--------+-------+
|john |10121960|0456 |
|jane |04071988|8543 |
|mia |01121955|9583 |
|liam |23091973|7844 |
Given that there are leading 0's in your desired output, I assume they are all character variables. In that case, use the COALESCEC function. It returns the first non-null or nonmissing value.
data want;
    set have;
    arr_nbr = coalescec(arr, arr_nbr);
    drop arr;
run;
name db arr_nbr
john 10121960 0456
jane 04071988 8543
mia 01121955 9583
liam 23091973 7844
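The first-non-missing pick that COALESCEC performs can be mimicked in plain Python, for illustration only (SAS treats a blank character value as missing, modeled here as None or a whitespace-only string):

```python
def coalescec(*values):
    """Return the first argument that is not missing.

    Mimics SAS COALESCEC: None and blank strings count as missing;
    returns an empty string when every argument is missing.
    """
    for v in values:
        if v is not None and v.strip() != "":
            return v
    return ""

print(coalescec("0456", ""))   # arr filled, arr_nbr blank -> 0456
print(coalescec("", "8543"))   # arr blank, arr_nbr filled -> 8543
```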

Split Spark Dataframe name column into three columns

I have a DataFrame in Spark with a name column, a string delimited by spaces. The tricky part is that some names have a middle name and others don't. How can I split the column into firstname, middlename, and lastname? I am using F.split, but I don't know how to differentiate the middle name from the last name. I understand I cannot use a negative index in Spark. Take a look at my sample df:
from pyspark.sql import functions as F
cols = ['id', 'name']
vals = [('l03', 'Bob K Barry'), ('S20', 'Cindy Winston'), ('l10', 'Jerry Kyle Moore'), ('j31', 'Dora Larson')]
df = spark.createDataFrame(vals, cols)
df.show()
+---+----------------+
| id| name|
+---+----------------+
|l03| Bob K Barry|
|S20| Cindy Winston|
|l10|Jerry Kyle Moore|
|j31| Dora Larson|
+---+----------------+
split_col = F.split(df['name'], ' ')
df = df.withColumn('firstname', split_col.getItem(0))
df.show()
+---+----------------+---------+
| id| name|firstname|
+---+----------------+---------+
|l03| Bob K Barry| Bob|
|S20| Cindy Winston| Cindy|
|l10|Jerry Kyle Moore| Jerry|
|j31| Dora Larson| Dora|
+---+----------------+---------+
How do I continue the split? Any help is appreciated.
Take the first element of the array as the firstname and the last element (found via size) as the lastname. If there cannot be more than one middle name, you can do:
from pyspark.sql import functions as F
from pyspark.sql.functions import col, when

df.withColumn("split_list", F.split(F.col("name"), " "))\
  .withColumn("fn", col("split_list")[0])\
  .withColumn("ln", col("split_list")[F.size("split_list") - 1])\
  .withColumn("mn", when(F.size("split_list") == 2, None)
                    .otherwise(col("split_list")[1]))\
  .drop("split_list").show()
+---+----------------+-----+-------+----+
| id| name| fn| ln| mn|
+---+----------------+-----+-------+----+
|l03| Bob K Barry| Bob| Barry| K|
|S20| Cindy Winston|Cindy|Winston|null|
|l10|Jerry Kyle Moore|Jerry| Moore|Kyle|
|j31| Dora Larson| Dora| Larson|null|
+---+----------------+-----+-------+----+
If there can be more than 1 middle name, then you can use substring on name for middlename column:
df.withColumn("split_list", F.split(F.col("name"), " "))\
  .withColumn("fn", col("split_list")[0])\
  .withColumn("ln", col("split_list")[F.size("split_list") - 1])\
  .withColumn("mn", when(F.size("split_list") == 2, None)
                    .otherwise(col("name").substr(
                        F.length("fn") + 2,
                        F.length("name") - F.length("fn") - F.length("ln") - 2)))\
  .drop("split_list").show()
+---+----------------+-----+-------+-----+
| id| name| fn| ln| mn|
+---+----------------+-----+-------+-----+
|l03| Bob K Barry| Bob| Barry| K|
|S20| Cindy Winston|Cindy|Winston| null|
|l10|Jerry Kyle Moore|Jerry| Moore| Kyle|
|j31| Dora Larson| Dora| Larson| null|
|A12| Fn A B C Ln| Fn| Ln|A B C|
+---+----------------+-----+-------+-----+
I'm assuming that the first element is the FN, the last element is the LN, and anything in between is the MN. This is not always true, as people can have multiple first or last names.

How to convert the first row as column from an existing dataframe

I have a dataframe like below. I want to convert the first row as columns for this dataframe.
How can I do this? Is there any way to convert it directly (without using df.first)?
usdata.show()
+----------+---------+--------------------+--------------------+-------------+
|        _1|       _2|                  _3|                  _4|           _5|
+----------+---------+--------------------+--------------------+-------------+
|first_name|last_name| company_name| address| city|
| James| Butt| "Benton, John B Jr"| 6649 N Blue Gum St| New Orleans|
| Josephine| Darakjy|"Chanay, Jeffrey ...| 4 B Blue Ridge Blvd| Brighton|
| Art| Venere|"Chemel, James L ...|8 W Cerritos Ave #54| Bridgeport|
| Lenna| Paprocki|Feltz Printing Se...| 639 Main St| Anchorage|
+----------+---------+--------------------+--------------------+-------------+
Regards,
Dinesh

Left-Linear and Right-Linear Grammars

I need help constructing a left-linear and a right-linear grammar for each of the languages below.
a) (0+1)*00(0+1)*
b) 0*(1(0+1))*
c) (((01+10)*11)*00)*
For a) I have the following:
Left-linear
S --> B00 | S11
B --> B0|B1|011
Right-linear
S --> 00B | 11S
B --> 0B|1B|0|1
Is this correct? I need help with b & c.
Constructing an equivalent Regular Grammar from a Regular Expression
First, I start with some simple rules for constructing a Regular Grammar (RG) from a Regular Expression (RE).
I am writing rules for a Right Linear Grammar (writing the analogous rules for a Left Linear Grammar is left as an exercise).
NOTE: Capital letters are used for variables and small letters for terminals in a grammar. The NULL symbol is ^. The term 'any number' means zero or more times, that is, the * star closure.
[BASIC IDEA]
SINGLE TERMINAL: If the RE is simply e (e being any terminal), an equivalent RG is G with the single production rule S --> e (where S is the start symbol).
UNION OPERATION: If the RE is of the form e + f, where both e and f are terminals, an equivalent RG has the production rules S --> e | f.
CONCATENATION: If the RE is of the form ef, where both e and f are terminals, an equivalent RG has the production rules S --> eA, A --> f.
STAR CLOSURE: If the RE is of the form e*, where e is a terminal and * is the Kleene star closure operation, an equivalent RG has the production rules S --> eS | ^.
PLUS CLOSURE: If the RE is of the form e+, where e is a terminal and + is the Kleene plus closure operation, an equivalent RG has the production rules S --> eS | e.
STAR CLOSURE ON UNION: If the RE is of the form (e + f)*, where both e and f are terminals, an equivalent RG has the production rules S --> eS | fS | ^.
PLUS CLOSURE ON UNION: If the RE is of the form (e + f)+, where both e and f are terminals, an equivalent RG has the production rules S --> eS | fS | e | f.
STAR CLOSURE ON CONCATENATION: If the RE is of the form (ef)*, where both e and f are terminals, an equivalent RG has the production rules S --> eA | ^, A --> fS.
PLUS CLOSURE ON CONCATENATION: If the RE is of the form (ef)+, where both e and f are terminals, an equivalent RG has the production rules S --> eA, A --> fS | f.
Be sure that you understand all the rules above; here is the summary table:
+-------------------------------+--------------------+------------------------+
| TYPE | REGULAR-EXPRESSION | RIGHT-LINEAR-GRAMMAR |
+-------------------------------+--------------------+------------------------+
| SINGLE TERMINAL | e | S --> e |
| UNION OPERATION | e + f | S --> e | f |
| CONCATENATION | ef | S --> eA, A --> f |
| STAR CLOSURE | e* | S --> eS | ^ |
| PLUS CLOSURE | e+ | S --> eS | e |
| STAR CLOSURE ON UNION | (e + f)* | S --> eS | fS | ^ |
| PLUS CLOSURE ON UNION | (e + f)+ | S --> eS | fS | e | f |
| STAR CLOSURE ON CONCATENATION | (ef)* | S --> eA | ^, A --> fS |
| PLUS CLOSURE ON CONCATENATION | (ef)+ | S --> eA, A --> fS | f |
+-------------------------------+--------------------+------------------------+
note: symbol e and f are terminals, ^ is NULL symbol, and S is the start variable
[ANSWER]
Now, we can come to your problem.
a) (0+1)*00(0+1)*
Language description: all strings of 0s and 1s containing at least one occurrence of 00.
Right Linear Grammar:
S --> 0S | 1S | 00A
A --> 0A | 1A | ^
A string can start with any sequence of 0s and 1s, which is why the rules S --> 0S | 1S are included. Because there must be at least one occurrence of 00, S has no null alternative. S --> 00A is included because 0s and 1s can follow the 00; the variable A takes care of the 0s and 1s after the 00.
Left Linear Grammar:
S --> S0 | S1 | A00
A --> A0 | A1 | ^
b) 0*(1(0+1))*
Language description: any number of 0s, followed by any number of 10s and 11s
{ because 1(0 + 1) = 10 + 11 }
Right Linear Grammar:
S --> 0S | A | ^
A --> 1B
B --> 0A | 1A | 0 | 1
The string starts with any number of 0s, so the rules S --> 0S | ^ are included; then 10 and 11 are generated any number of times via A --> 1B and B --> 0A | 1A | 0 | 1.
An alternative right-linear grammar can be
S --> 0S | A | ^
A --> 10A | 11A | 10 | 11
Left Linear Grammar:
S --> A | ^
A --> A10 | A11 | B
B --> B0 | 0
An alternative form can be
S --> S10 | S11 | B | ^
B --> B0 | 0
c) (((01+10)*11)*00)*
Language description: first, the language contains the null (^) string, because there is a * (star) outside everything inside the parentheses. Also, if a string in the language is not null, it definitely ends with 00. One can think of this regular expression as having the form ( ( (A)* B )* C )*, where (A)* is (01 + 10)*, that is, any number of repetitions of 01 and 10.
If an instance of A occurs in a string, a B definitely follows, because of (A)*B, and B is 11.
Some example strings: { ^, 00, 0000, 000000, 1100, 111100, 1100111100, 011100, 101100, 01110000, 01101100, 0101011010101100, 101001110001101100, ... }
Left Linear Grammar:
S --> A00 | ^
A --> B11 | S
B --> B01 | B10 | A
S --> A00 | ^ because any string is either null or, if it is not null, it ends with 00. When the string ends with 00, the variable A matches the pattern ((01 + 10)*11)*. Again, this pattern can either be null or must end with 11. If it is null, then A matches it with S again, i.e. the string ends with a pattern like (00)*. If the pattern is not null, B matches (01 + 10)*. When B has matched all it can, A starts matching the string again. This closes the outermost * in ((01 + 10)*11)*.
Right Linear Grammar:
S --> A | 00S | ^
A --> 01A | 10A | 11S
Second part of your question:
For a) I have the following:
Left-linear
S --> B00 | S11
B --> B0|B1|011
Right-linear
S --> 00B | 11S
B --> 0B|1B|0|1
(answer)
Your solutions are wrong for the following reasons:
The left-linear grammar is wrong because the string 0010 cannot be generated.
The right-linear grammar is wrong because the string 1000 cannot be generated. Both strings are in the language generated by the regular expression of question (a).
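Gaps like these can be checked mechanically. A small plain-Python sketch (the dict encoding of the grammar is my own) enumerates every string the proposed right-linear grammar can derive up to a length bound and compares against the regular expression:

```python
import re
from collections import deque

# The proposed (incorrect) right-linear grammar for (0+1)*00(0+1)*:
#   S --> 00B | 11S
#   B --> 0B | 1B | 0 | 1
GRAMMAR = {"S": ["00B", "11S"], "B": ["0B", "1B", "0", "1"]}

def derivable(grammar, start="S", maxlen=6):
    """Return all terminal strings of length <= maxlen the grammar derives."""
    results, queue, seen = set(), deque([start]), {start}
    while queue:
        form = queue.popleft()
        if not any(c in grammar for c in form):   # no nonterminals left
            if len(form) <= maxlen:
                results.add(form)
            continue
        # Right-linear: the single nonterminal is the last symbol.
        head, nt = form[:-1], form[-1]
        if len(head) > maxlen:                    # prune: forms only grow
            continue
        for rhs in grammar[nt]:
            new = head + rhs
            if new not in seen:
                seen.add(new)
                queue.append(new)
    return results

strings = derivable(GRAMMAR, maxlen=4)
print("1000" in strings)                           # False: grammar misses it
print(bool(re.fullmatch("[01]*00[01]*", "1000")))  # True: the RE accepts it
```

Any mismatch between the two checks is a counterexample; the same harness works for the left-linear case if the nonterminal is taken from the front of the sentential form instead.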
EDIT
Adding DFAs for each regular expression, in case they are helpful.
a) (0+1)*00(0+1)*
b) 0*(1(0+1))*
c) (((01+10)*11)*00)*
Drawing a DFA for this regular expression is tricky and complex, so I wanted to add DFAs for it.
To simplify the task, we should think about the form of the RE. To me, the RE (((01+10)*11)*00)* looks like (a*b)*:
(((01+10)*11)* 00 )*
(     a*       b  )*
Actually, in the above expression, a itself has the form (a*b)*, namely ((01+10)*11)*.
The RE (a*b)* is equal to (a + b)*b + ^. The DFA for (a*b)* is as below:
DFA for ((01+10)*11)* is:
DFA for (((01+10)*11)* 00 )* is:
Try to find the similarity in the construction of the above three DFAs. Don't move ahead until you understand the first one.