How to remove spaces and dots and convert to lowercase - dataframe

I have a pyspark dataframe with names like
N. Plainfield
North Plainfield
West Home Land
NEWYORK
newyork
So. Plainfield
S. Plainfield
Some of them contain dots and spaces between initials, and some do not. How can they be converted to:
n plainfield
north plainfield
west homeland
newyork
newyork
so plainfield
s plainfield
(with no dots, no spaces between initials, and a single space between the initials and the name)
I tried using the following, but it only replaces dots and doesn't remove spaces between initials:
names_modified = names.withColumn("name_clean", regexp_replace("name", r"\.",""))
After removing the whitespace and dots, is there any way to get the distinct values? Like this:
north plainfield
west homeland
newyork
so plainfield

I think you should split this into steps:
convert from uppercase to lowercase
replace the dots using the regexp_replace function
from pyspark.sql.functions import *
# convert from uppercase to lowercase
names_modified = names_modified.withColumn('name', lower('name'))
# remove the dots (escape the dot: a bare '.' matches any character in a regex)
names_modified = names_modified.withColumn('name_clean', regexp_replace('name', r'\.', ''))
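The two steps above still leave the extra spaces between initials and don't cover the distinct part of the question. A fuller sketch (assuming the input column is called name, as in the question) could collapse runs of whitespace and then call distinct():
from pyspark.sql.functions import lower, regexp_replace, trim, col

names_clean = (names
    .withColumn("name_clean", lower(col("name")))                          # lowercase
    .withColumn("name_clean", regexp_replace("name_clean", r"\.", ""))     # drop dots
    .withColumn("name_clean", regexp_replace("name_clean", r"\s+", " "))   # collapse runs of spaces
    .withColumn("name_clean", trim(col("name_clean"))))                    # strip outer spaces

# distinct cleaned values
names_clean.select("name_clean").distinct().show()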

Related

Pop the first element in a pandas column

I have a pandas column like below:
import pandas as pd
data = {'id': ['001', '002', '003'],
        'address': [['William J. Clare', '290 Valley Dr.', 'Casper, WY 82604', 'USA, United States'],
                    ['1180 Shelard Tower', 'Minneapolis, MN 55426', 'USA, United States'],
                    ['William N. Barnard', '145 S. Durbin', 'Casper, WY 82601', 'USA, United States']]}
df = pd.DataFrame(data)
I want to pop the first element of each address list if it is a name, i.e. if it doesn't contain any number.
output:
[['290 Valley Dr.', 'Casper, WY 82604','USA, United States'], ['1180 Shelard Tower', 'Minneapolis, MN 55426', 'USA, United States'], ['145 S. Durbin', 'Casper, WY 82601', 'USA, United States']]
This is a continuation of my previous post. I am learning Python and this is my second project; I have been struggling with this since morning, so please help me.
Assuming you define an address as a string starting with a number (you can change the logic):
for l in df['address']:
    if not l[0][0].isdigit():
        l.pop(0)
print(df)
updated df:
id address
0 001 [290 Valley Dr., Casper, WY 82604, USA, United...
1 002 [1180 Shelard Tower, Minneapolis, MN 55426, US...
2 003 [145 S. Durbin, Casper, WY 82601, USA, United ...
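Note that the check above only looks at the first character of the first element. If you want the question's literal condition (pop the first element when it contains no digit anywhere), a small variant could be:
# pop the first element when it contains no digit anywhere in the string
for l in df['address']:
    if not any(ch.isdigit() for ch in l[0]):
        l.pop(0)
print(df)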

Delete abbreviations (combination of Letter+dot) from Pandas column

I'd like to delete specific parts of strings in a pandas column, such as any letter followed by a dot. For example, having a column with names:
John W. Man
Betty J. Rule
C.S. Stuart
What should remain is
John Man
Betty Rule
Stuart
So any letter followed by a dot, i.e. an abbreviation, should go.
I can't think of a way with str.replace or anything like that.
Use Series.str.replace with a regex that matches one letter followed by a . and a space after it, if one exists:
df['col'] = df['col'].str.replace(r'([a-zA-Z]\.\s*)', '', regex=True)
print (df)
col
0 John Man
1 Betty Rule
2 Stuart
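For a self-contained run, the example column from the question can be reconstructed like this (the column name col is taken from the answer's output):
import pandas as pd

df = pd.DataFrame({'col': ['John W. Man', 'Betty J. Rule', 'C.S. Stuart']})
# remove each single letter followed by a dot, plus any trailing spaces
df['col'] = df['col'].str.replace(r'([a-zA-Z]\.\s*)', '', regex=True)
print(df)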

regex: match everything, but not a certain string including white space (regular expression, in spite of, anything but, VBA Visual Basic)

Folks, there are already billions of questions on "regex: match everything, but not ...", but none seems to fit my simple question.
A simple string: "1 Rome, 2 London, 3 Wembley Stadium". I want to match only the names and not the ranks, in order to extract "Rome", "London", "Wembley Stadium".
Using a regex tester (https://extendsclass.com/regex-tester.html), I can easily match the opposite:
([0-9]+\s*)
matches the ranks ("1 ", "2 ", "3 ").
But how do I invert it? I tried something like:
[^0-9 |;]+[^0-9 |;], but it also excludes the whitespace I want to keep (e.g. after the comma and between "Wembley" and "Stadium"). I guess the "0-9 " needs to be treated somehow as one continuous string. I tried various brackets, quotation marks, and \s*, but nothing has worked yet.
Note: I'm working in a Visual Basic environment that does not allow lookbehinds!
You can use
\d+\s*(.*?)(?=,\s*\d+\s|$)
See the regex demo; get the values from match.SubMatches(0). Details:
\d+ - one or more digits
\s* - zero or more whitespaces
(.*?) - Group 1: zero or more chars other than line break chars as few as possible
(?=,\s*\d+\s|$) - a positive lookahead that requires a comma, zero or more whitespaces, one or more digits and then a whitespace, OR the end of the string, immediately to the right of the current location.
Here is a demo of how to get all matches:
Sub TestRegEx()
    ' Requires a reference to "Microsoft VBScript Regular Expressions 5.5"
    Dim regex As RegExp
    Dim matches As Object, match As Object
    Dim str As String
    str = "1 Rome, 2 London, 3 Wembley Stadium"
    Set regex = New RegExp
    regex.Pattern = "\d+\s*(.*?)(?=,\s*\d+\s|$)"
    regex.Global = True
    Set matches = regex.Execute(str)
    For Each match In matches
        Debug.Print match.SubMatches(0)
    Next
End Sub
Output:
Rome
London
Wembley Stadium
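For anyone who wants to verify the pattern outside VBA, a quick Python check with the re module (just an illustration, not part of the VBA solution) gives the same submatches:
import re

s = "1 Rome, 2 London, 3 Wembley Stadium"
print(re.findall(r'\d+\s*(.*?)(?=,\s*\d+\s|$)', s))
# ['Rome', 'London', 'Wembley Stadium']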

Removing double space and single space in data frame simultaneously

I have a column where name parts are separated by a single space or a double space (there can be more), and I want to split the names into First Name and Last Name.
df = pd.DataFrame({'Name': ['Steve Smith', 'Joe Nadal', 'Roger Federer'],
                   'Age': [32, 34, 36]})
df['Name'] = df['Name'].str.strip()
df[['First_Name', 'Last_Name']] = df['Name'].str.split(" ",expand = True,)
This should do it:
df[['First_Name', 'Last_Name']] = df.Name.apply(lambda x: pd.Series(list(filter(None, x.split(' ')))))
Use \s+ as your split pattern. This is the regex pattern meaning "one or more whitespace characters".
Also, limit the number of splits with n=1. This means the string will only be split once (at the first occurrence of whitespace from left to right), restricting the output to 2 columns.
df[['First_Name', 'Last_Name']] = df.Name.str.split(r'\s+', expand=True, n=1)
[out]
Name Age First_Name Last_Name
0 Steve Smith 32 Steve Smith
1 Joe Nadal 34 Joe Nadal
2 Roger Federer 36 Roger Federer
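Alternatively, pandas Series.str.split splits on runs of whitespace by default when no pattern is given, so the same result should be achievable without an explicit regex:
df[['First_Name', 'Last_Name']] = df['Name'].str.split(n=1, expand=True)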

Pandas DataFrame: remove � (unknown-character) from strings in rows

I have read a csv file into Python 2.7 (on a Windows machine). The Sales Price column seems to be a mixture of string and float, and some rows contain a euro symbol €, which Python sees as �.
df = pd.read_csv('sales.csv', thousands=',')
print df
Gender Size Color Category Sales Price
Female 36-38 Blue Socks 25
Female 44-46 Pink Socks 13.2
Unisex 36-38 Black Socks � 19.00
Unisex 40-42 Pink Socks � 18.50
Female 38 Yellow Pants � 89,00
Female 43 Black Pants � 89,00
I was under the assumption that a simple replace would solve it:
df=df.replace('\�','',regex=True).astype(float)
But I got an encoding error:
SyntaxError: Non-ASCII character
I would appreciate hearing your thoughts on this.
I faced a similar problem where one of the columns in my dataframe had lots of currency symbols: euro, dollar, yen, pound, etc. I tried multiple solutions, but the easiest was to use the unicodedata module.
import unicodedata

df['Sales Price'] = df['Sales Price'].str.replace(unicodedata.lookup('EURO SIGN'), 'Euro')
The above will replace € with Euro in Sales Price column.
I think @jezrael's comment is valid. First you need to read the file with an encoding (see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html under the encoding section):
df=pd.read_csv('sales.csv', thousands=',', encoding='utf-8')
but for replacing the euro sign, try this:
df=df.replace('\u20AC','',regex=True).astype(float)
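Putting both steps together, here is a minimal sketch. It assumes the file is UTF-8 encoded and that the comma in values like 89,00 is a decimal separator (which is why thousands=',' is dropped here):
import pandas as pd

df = pd.read_csv('sales.csv', encoding='utf-8')

# strip the euro sign, normalise the decimal comma, then convert to float
df['Sales Price'] = (df['Sales Price']
                     .astype(str)
                     .str.replace(u'\u20AC', '', regex=False)  # remove the euro sign
                     .str.replace(',', '.', regex=False)       # decimal comma -> dot
                     .str.strip()
                     .astype(float))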