I'd like to delete specific parts of strings in a pandas column, such as any letter followed by a dot. For example, having a column with names:
John W. Man
Betty J. Rule
C.S. Stuart
What should remain is
John Man
Betty Rule
Stuart
So, any letter followed by a dot (an abbreviation) should go.
I can't think of a way with str.replace or anything like that.
Use Series.str.replace with a regex that matches a single letter followed by a dot and any trailing whitespace:
df['col'] = df['col'].str.replace(r'[a-zA-Z]\.\s*', '', regex=True)
print(df)
          col
0    John Man
1  Betty Rule
2      Stuart
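For a self-contained check, here is a minimal sketch (the column name `col` and the sample data are assumptions taken from the question):

```python
import pandas as pd

# Sample names from the question.
df = pd.DataFrame({'col': ['John W. Man', 'Betty J. Rule', 'C.S. Stuart']})

# Drop every single letter that is followed by a dot, plus any trailing spaces.
df['col'] = df['col'].str.replace(r'[a-zA-Z]\.\s*', '', regex=True)

print(df['col'].tolist())  # ['John Man', 'Betty Rule', 'Stuart']
```

Note the `\s*` also consumes the space after the dot, so "C.S. Stuart" collapses cleanly to "Stuart".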
As described in the title, I have the following problem:
Data is prepared as a pandas dataframe incoming as follows:
Article    Title
A0         A00183
BB2        BB2725
C2C3       C2C3945
As you can see, the "Title" column repeats the string value of the Article column.
I want this prefix deleted, so that the table looks as follows:
Article    Title
A0         0183
BB2        725
C2C3       945
I want to do this with Pandas.
I already found out how to read the length of the string in the Article column, so I know the number of characters to remove:
df1['Length of Article string:'] = df1['Article:'].apply(len)
But now I can't figure out how to delete the prefix, whose length changes for every row, from the Title column.
Thanks for your help!
Kind regards
Tried Pandas Documentation, found some hints regarding split and strip, but I do not have enough know-how to implement...
You can replace using the list of values derived from the Article column:
df["Title"] = df["Title"].replace(df["Article"].tolist(), "", regex=True)
print(df)
  Article Title
0      AA  0123
1     BBB   234
2    CCCC   345
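One caveat with the list-based replace: each Article value is interpreted as a regular expression and tried against every row. A sketch that escapes metacharacters and anchors each pattern at the start (using the question's sample data):

```python
import re
import pandas as pd

df = pd.DataFrame({'Article': ['A0', 'BB2', 'C2C3'],
                   'Title': ['A00183', 'BB2725', 'C2C3945']})

# re.escape makes regex metacharacters in Article literal; the ^ anchor
# ensures only a leading prefix is stripped, not a match mid-string.
patterns = [f'^{re.escape(a)}' for a in df['Article']]
df['Title'] = df['Title'].replace(patterns, '', regex=True)

print(df['Title'].tolist())  # ['0183', '725', '945']
```

Even with anchoring, a pattern built from one row can still match another row's Title if one Article value is a prefix of another, so a row-wise approach is safer when that can happen.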
You can also use replace() with a lambda function, applied row-wise:
dfx = df[['Article', 'Title']].apply(lambda x: x['Title'].replace(x['Article'], ''), axis=1)
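If you are on Python 3.9+, a row-wise sketch with str.removeprefix avoids regex entirely (sample data assumed from the question):

```python
import pandas as pd

df = pd.DataFrame({'Article': ['A0', 'BB2', 'C2C3'],
                   'Title': ['A00183', 'BB2725', 'C2C3945']})

# removeprefix only strips the Article value when it appears at the start,
# and each row only sees its own Article value.
df['Title'] = [t.removeprefix(a) for a, t in zip(df['Article'], df['Title'])]

print(df['Title'].tolist())  # ['0183', '725', '945']
```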
I have a CSV file like this (comma separated)
ID, Name,Context, Location
123,"John","{\"Organization\":{\"Id\":12345,\"IsDefault\":false},\"VersionNumber\":-1,\"NewVersionId\":\"88229ef9-e97b-4b88-8eba-31740d48fd15\",\"ApiIntegrationType\":0,\"PortalIntegrationType\":0}","Road 1"
234,"Mike","{\"Organization\":{\"Id\":23456,\"IsDefault\":false},\"VersionNumber\":-1,\"NewVersionId\":\"88229ef9-e97b-4b88-8eba-31740d48fd15\",\"ApiIntegrationType\":0,\"PortalIntegrationType\":0}","Road 2"
I want to create DataFrame like this:
ID | Name |Context |Location
123| John |{\"Organization\":{\"Id\":12345,\"IsDefault\":false},\"VersionNumber\":-1,\"NewVersionId\":\"88229ef9-e97b-4b88-8eba-31740d48fd15\",\"ApiIntegrationType\":0,\"PortalIntegrationType\":0}|Road 1
234| Mike |{\"Organization\":{\"Id\":23456,\"IsDefault\":false},\"VersionNumber\":-1,\"NewVersionId\":\"88229ef9-e97b-4b88-8eba-31740d48fd15\",\"ApiIntegrationType\":0,\"PortalIntegrationType\":0}|Road 2
Could you help and show me how to use pandas read_csv doing it?
An answer - if you are willing to accept that the \ char gets stripped:
pd.read_csv(your_filepath, escapechar='\\')
ID Name Context Location
0 123 John {"Organization":{"Id":12345,"IsDefault":false}... Road 1
1 234 Mike {"Organization":{"Id":23456,"IsDefault":false}... Road 2
An answer if you actually want the backslashes in - using a custom converter:
def backslash_it(x):
    return x.replace('"', '\\"')
pd.read_csv(your_filepath, escapechar='\\', converters={'Context': backslash_it})
ID Name Context Location
0 123 John {\"Organization\":{\"Id\":12345,\"IsDefault\":... Road 1
1 234 Mike {\"Organization\":{\"Id\":23456,\"IsDefault\":... Road 2
escapechar on read_csv is used to actually read the CSV; the custom converter then puts the backslashes back in.
Note that I tweaked the header row to make the column name match easier:
ID,Name,Context,Location
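Putting the two pieces together, here is a minimal round-trip sketch (the sample row is shortened from the question, and kept as a raw string so the backslashes survive):

```python
import io
import pandas as pd

# One shortened data row in the question's format.
raw = r'''ID,Name,Context,Location
123,"John","{\"Organization\":{\"Id\":12345,\"IsDefault\":false}}","Road 1"
'''

def backslash_it(x):
    # Re-escape the quotes that escapechar stripped while reading.
    return x.replace('"', '\\"')

df = pd.read_csv(io.StringIO(raw), escapechar='\\',
                 converters={'Context': backslash_it})

print(df.loc[0, 'Context'])  # {\"Organization\":{\"Id\":12345,\"IsDefault\":false}}
```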
How can I split the full name into different columns in PySpark?
input CSV:
Name,Marks
Sam Kumar Timberlake,83
Theo Kumar Biber,82
Tom Kumar Perry,86
Xavier Kumar Cruse,87
Output CSV should be:
FirstName,MiddleName,LastName,Marks
Sam,Kumar,Timberlake,83
Theo,Kumar,Biber,82
Tom,Kumar,Perry,86
Xavier,Kumar,Cruse,87
I am sure there is a better way, but the longer way is to do the work by hand. I created a duplicate of the names and manually cleaned the data into first/middle names and last names. I don't think there is any algorithm that can tell you a person has two first names and one middle name, unless the person used a dash for two first names or two last names (born and married-into last names); use common sense for last names and be ready for mistakes. You have to do it manually unless, again, you are certain because you called them up and know for sure.
The mathematical way would be to separate the last name from the rest. It is like calling someone by their first name John when they go by their middle name Gary: mistakes are inevitable as long as the person you address understands it is legally them.
This should work in your specific case:
import pyspark.sql.functions as F

df = df.withColumn("arr", F.split(F.col("Name"), " "))
df = (
    df
    .withColumn('FirstName', F.col('arr').getItem(0))
    .withColumn('MiddleName', F.col('arr').getItem(1))
    .withColumn('LastName', F.col('arr').getItem(2))
)
If you want to include the case when someone has several middle names:
df = (
    df
    .withColumn('FirstName', df.arr.getItem(0))
    .withColumn('LastName', df.arr[F.size(df.arr) - 1])
)
df = df.withColumn(
    'MiddleName',
    F.trim(F.expr(
        "substring(Name, length(FirstName)+1, length(Name)-length(LastName)-length(FirstName))"
    ))
)
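The first/middle/last logic can be checked in plain Python before wiring it into Spark; this sketch mirrors what the expressions above compute:

```python
def split_name(name: str):
    # First token -> first name, last token -> last name,
    # everything in between joins into the middle name.
    parts = name.split()
    first, last = parts[0], parts[-1]
    middle = ' '.join(parts[1:-1])
    return first, middle, last

print(split_name('Sam Kumar Timberlake'))  # ('Sam', 'Kumar', 'Timberlake')
print(split_name('Xavier A B Cruse'))      # ('Xavier', 'A B', 'Cruse')
```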
I have a column where the names are separated by a single space, a double space (there can be more), and I want to split the names into First Name and Last Name.
df = pd.DataFrame({'Name': ['Steve Smith', 'Joe Nadal', 'Roger Federer'],
                   'Age': [32, 34, 36]})
df['Name'] = df['Name'].str.strip()
df[['First_Name', 'Last_Name']] = df['Name'].str.split(' ', expand=True)
this should do it
df[['First_Name', 'Last_Name']] = df.Name.apply(lambda x: pd.Series(list(filter(None, x.split(' ')))))
Use \s+ as your split pattern. This is the regex pattern meaning "one or more whitespace characters".
Also, limit the number of splits with n=1. This means the string will only be split once (at the first occurrence of whitespace from left to right), restricting the output to 2 columns.
df[['First_Name', 'Last_Name']] = df.Name.str.split(r'\s+', expand=True, n=1)
[out]
            Name  Age First_Name Last_Name
0    Steve Smith   32      Steve     Smith
1      Joe Nadal   34        Joe     Nadal
2  Roger Federer   36      Roger   Federer
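As a self-contained check (the double space in 'Joe  Nadal' is deliberate, to show why \s+ is needed):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Steve Smith', 'Joe  Nadal', 'Roger Federer'],
                   'Age': [32, 34, 36]})

# \s+ treats any run of whitespace as one split point; n=1 limits the split
# to the first occurrence, so exactly two columns come out.
df[['First_Name', 'Last_Name']] = df['Name'].str.split(r'\s+', expand=True, n=1)

print(df['Last_Name'].tolist())  # ['Smith', 'Nadal', 'Federer']
```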
I have a csv file that looks like this:
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
"1,0,3,""Braund, Mr. Owen Harris"",male,22,1,0,A/5 21171,7.25,,S"
Can I use pandas to read the csv such that it gets read in the obvious way?
In other words, I want the csv file to be read as if it looked like this:
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
Any suggestions?
pd.read_csv(data)
is the answer to your problem.
Here is the code I used for this Kaggle dataset:
training_set = pd.read_csv('train.csv')
Output (just the first row):
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
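If a plain read_csv does not unwrap the extra layer of quoting for you, one approach (a sketch, using the standard csv module to peel off the outer layer before handing the data to pandas) is:

```python
import csv
import io
import pandas as pd

# Sample mirroring the question: each data row is wrapped in an extra layer
# of quotes, with the inner quotes doubled.
raw = '''PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
"1,0,3,""Braund, Mr. Owen Harris"",male,22,1,0,A/5 21171,7.25,,S"
'''

lines = raw.splitlines()
# The header is plain; each data line parses as a single quoted CSV field
# whose content is the real row, so unwrap it with csv.reader.
unwrapped = [lines[0]] + [row[0] for row in csv.reader(lines[1:])]

df = pd.read_csv(io.StringIO('\n'.join(unwrapped)))
print(df.loc[0, 'Name'])  # Braund, Mr. Owen Harris
```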