pandas string manipulation with regular expressions - pandas

I have a pandas DataFrame containing a column of strings.
I want to get rid of the whitespace at the beginning; the "1."; the 1st and 2nd numbers in a string; also the / in the middle of the word and the \ between words.
I could not remove the 2nd number, though.
How can I do it all in one go and also remove the second number in a string?
I could do it step by step (not sure if it is correct, but it works):
import pandas as pd
st = {'string': ['155555 11111 hhhh 15-0850tcx cord\\with plastic end / light mustard -82cm шнур нужд вес 07 кг',
                 ' 1. 06900000027899 non woven 12 grid socks']}
s = pd.DataFrame(st)
s['string'] = s['string'].str.replace(r'\d\.', '', regex=True)  # removes "1."
s['string'] = s['string'].str.replace(r"\\", " ", regex=True)  # replaces each backslash with a space
s['string'] = s['string'].str.replace(r"\/", "", regex=True)  # removes forward slashes
s['string'] = s['string'].str.replace(r"^\d*", "", regex=True)  # removes digits at the beginning of the string
s['string'] = s['string'].str.strip()  # removes leading and trailing whitespace

An equivalent of your commands could be:
s['string'] = s['string'].str.replace(r'(^\s*\d+\.?\s*|\\|/)', '', regex=True)
Output:
string
0 11111 hhhh 15-0850tcx cordwith plastic end light mustard -82cm шнур нужд вес 07 кг
1 06900000027899 non woven 12 grid socks
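The output above still keeps the second leading number (11111 in the first row), which the question also asked to remove. A minimal sketch of one way to drop it in the same pass, assuming the unwanted numbers are always whitespace-separated tokens at the very start of the string; the pattern and the column name `clean` are illustrative and not part of the answer above:

```python
import pandas as pd

s = pd.DataFrame({'string': [
    '155555 11111 hhhh 15-0850tcx cord\\with plastic end / light mustard -82cm',
    ' 1. 06900000027899 non woven 12 grid socks',
]})

# Assumed rule: drop up to two leading number tokens (each optionally ending in "."),
# plus every backslash and forward slash anywhere in the string.
pattern = r'^\s*(?:\d+\.?\s+){1,2}|\\|/'
s['clean'] = s['string'].str.replace(pattern, '', regex=True).str.strip()

print(s['clean'].tolist())
```

Because each number token must be followed by whitespace, a token such as 15-0850tcx is left alone.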

Related

pandas extract latin words from multi language string to a separate column

I would like to extract and store all Latin words from a multilingual string in a separate column.
Desired output:
'hhhh tcx cord\with plastic end / light mustard cm non woven grid socks'
I tried to use a basic expression, but it did not work:
st = {'string': ['hhhh 15-0850tcx cord\\with plastic end / light mustard -82cm шнур нужд вес 07 кг', '1. 06900000027899 non woven 12 grid socks']}
s = pd.DataFrame(st)
re.findall("[^a-zA-Z]", s)
TypeError: expected string or bytes-like object
Use Series.str.findall:
df = pd.DataFrame(st)
df['new'] = df['string'].str.findall(r"[a-zA-Z]+")
print (df)
string \
0 hhhh 15-0850tcx cord\with plastic end / light ...
1 1. 06900000027899 non woven 12 grid socks
new
0 [hhhh, tcx, cord, with, plastic, end, light, m...
1 [non, woven, grid, socks]
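The question asks for a string rather than a list. A minimal sketch, assuming one space-separated string per row is acceptable, chains Series.str.join onto the findall result (the column name `latin` is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'string': [
    'hhhh 15-0850tcx cord\\with plastic end / light mustard -82cm шнур нужд вес 07 кг',
    '1. 06900000027899 non woven 12 grid socks',
]})

# findall returns a list of ASCII-letter runs per row; str.join glues each list back together.
df['latin'] = df['string'].str.findall(r'[a-zA-Z]+').str.join(' ')

print(df['latin'].tolist())
```

Note that the backslash in cord\with is not a letter, so "cord" and "with" come out as two separate words.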

regex: match everything, but not a certain string including whitespace (regular expression, in spite of, anything but, VBA Visual Basic)

Folks, there are already billions of questions on "regex: match everything, but not ...", but none seems to fit my simple question.
A simple string: "1 Rome, 2 London, 3 Wembley Stadium", and I want to match only the names in it, i.e. extract the names but not the ranks ("Rome, London, Wembley Stadium").
Using a regex tester (https://extendsclass.com/regex-tester.html), I can simply match the opposite by:
([0-9]+\s*), which matches only the ranks ("1 ", "2 " and "3 ") in
"1 Rome, 2 London, 3 Wembley Stadium".
But how to reverse it? I tried something like:
[^0-9 |;]+[^0-9 |;], but it also excludes the whitespace that I want to keep (e.g. after the comma and between Wembley and Stadium in "1 Rome, 2 London, 3 Wembley Stadium"). I guess the "0-9 " needs to be treated somehow as one continuous string. I tried various brackets, quotation marks, \s*, but nothing has worked yet.
Note: I'm working in a Visual Basic environment, which does not allow lookbehinds!
You can use
\d+\s*(.*?)(?=,\s*\d+\s|$)
See the regex demo, get the values from match.Submatches(0). Details:
\d+ - one or more digits
\s* - zero or more whitespaces
(.*?) - Group 1: zero or more chars other than line break chars as few as possible
(?=,\s*\d+\s|$) - a positive lookahead that requires ,, zero or more whitespaces, one or more digits and then a whitespace OR end of string immediately to the right of the current location.
Here is a demo of how to get all matches:
Sub TestRegEx()
    ' Early binding: needs a reference to "Microsoft VBScript Regular Expressions 5.5"
    Dim regex As RegExp
    Dim matches As Object, match As Object
    Dim str As String
    str = "1 Rome, 2 London, 3 Wembley Stadium"
    Set regex = New RegExp
    regex.Pattern = "\d+\s*(.*?)(?=,\s*\d+\s|$)"
    regex.Global = True
    Set matches = regex.Execute(str)
    For Each match In matches
        Debug.Print match.SubMatches(0)
    Next
End Sub
Output:
Rome
London
Wembley Stadium
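The same pattern can be checked outside the VBA editor. A minimal Python sketch, assuming only the standard re module, whose findall returns the contents of the single capture group for every match:

```python
import re

text = "1 Rome, 2 London, 3 Wembley Stadium"

# Same pattern as in the VBA demo: skip the rank, capture lazily up to
# the next ", <digits> " or the end of the string (lookahead only, no lookbehind).
pattern = r"\d+\s*(.*?)(?=,\s*\d+\s|$)"
names = re.findall(pattern, text)

print(names)  # ['Rome', 'London', 'Wembley Stadium']
```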

Pandas, adding prefix to values if the original value is less than 3 characters

I have a pandas dataframe.
In column cxt, all string values should be 4 characters long.
Some of them have 3 or fewer characters.
In those cases, '0's should be prepended to the string value.
Currently I'm using code like: gdf.loc[gdf.cxt == '0', 'cxt'] = '0000' to catch certain strings.
Also, as a cheeky follow-on, gdf.loc[gdf.cxt.string.contains '.', 'cxt'] = '0000' overwrites all my values, rather than replacing only the strings that contain a '.' character.
Do you mean zfill?
df = pd.DataFrame({'ctx':['0','.x','001', 'abcd']})
df['ctx'].str.zfill(4)
Output:
0 0000
1 00.x
2 0001
3 abcd
Name: ctx, dtype: object
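For the follow-up: str.contains treats its pattern as a regular expression by default, so an unescaped '.' matches any character. Escape it and try gdf.loc[gdf.cxt.str.contains(r'\.'), 'cxt'] = '0000'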
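Both steps can be combined. A minimal sketch, assuming the column is named cxt as in the question and that values containing a literal dot should become '0000'; it uses regex=False instead of escaping the dot, which is an alternative to the escaped pattern above:

```python
import pandas as pd

gdf = pd.DataFrame({'cxt': ['0', '.x', '001', 'abcd']})

# regex=False makes '.' a literal character instead of "any character".
has_dot = gdf['cxt'].str.contains('.', regex=False)
gdf.loc[has_dot, 'cxt'] = '0000'

# Left-pad everything shorter than 4 characters with zeros.
gdf['cxt'] = gdf['cxt'].str.zfill(4)

print(gdf['cxt'].tolist())  # ['0000', '0000', '0001', 'abcd']
```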

How to replace space with a digit in middle of the mobile number

I have a data frame with a column named "Mobile No". A few of the entries have a space in the 6th position, which leaves them as 9-digit numbers.
I would like to replace the space in the 6th position with a digit (8) to make a 10-digit number. Please suggest how.
Before running the code below, I made sure there were no NaNs in the df["Mobile No"] column. After running it, df["Mobile No"] is filled with NaNs.
It looks like something isn't working.
df["Mobile No"] = df["Mobile No"].str.replace(' ', ' ')
Sample number with a space: '88888 8888'
If you need to replace all whitespace characters, use \s:
# replace each whitespace character by the value
df["Mobile No1"] = df["Mobile No"].str.replace(r'\s', '8', regex=True)
# replace each run of consecutive whitespace by one value
df["Mobile No2"] = df["Mobile No"].str.replace(r'\s+', '8', regex=True)
If you need to replace a space at a specific position:
def replace_by_index(text, pos, replacement):
    # pos is 1-based; replace only if that position holds a space
    return text[:pos-1] + replacement + text[pos:] if text[pos-1] == ' ' else text
df['Mobile No3'] = df['Mobile No'].apply(lambda x: replace_by_index(x, 6, '8'))
print (df)
     Mobile No   Mobile No1  Mobile No2   Mobile No3
0   08881 2889   0888182889  0888182889   0888182889
1   0881 28889   0881828889  0881828889   0881 28889
2  0888881  29  08888818829  0888881829  0888881  29
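The positional replacement can also be written as a single vectorized regex instead of apply. A minimal sketch, assuming the space to fix is always exactly the 6th character; the sample values and the column name `Mobile No4` are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Mobile No': ['88888 8888', '0881 28889', '8888888888']})

# Anchor at the start, capture five non-space characters, then match one
# whitespace character. \g<1> re-inserts the captured prefix before the "8".
df['Mobile No4'] = df['Mobile No'].str.replace(r'^(\S{5})\s', r'\g<1>8', regex=True)

print(df['Mobile No4'].tolist())  # ['8888888888', '0881 28889', '8888888888']
```

Rows whose 6th character is not whitespace do not match the pattern and are returned unchanged.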

Find Each Occurrence of X and Insert a Carriage Return

A colleague has some data he is putting into a flat file (.txt) and needs to insert a carriage return before EACH occurrence of 'POL01', 'SUB01','VEH01','MCO01'.
I did use:
For Each line1 As String In System.IO.File.ReadAllLines(BodyFileLoc)
    If line1.Contains("POL01") Or line1.Contains("SUB01") Or line1.Contains("VEH01") Or line1.Contains("MCO01") Then
        Writer.WriteLine(Environment.NewLine & line1)
    Else
        Writer.WriteLine(line1)
    End If
Next
But unfortunately it turns out that the file is not formatted in 'lines' by SSIS but as one whole string.
How can I insert a carriage return before every occurrence of the above?
Test Text
POL01CALT302276F 332 NBPM 00101 20151113201511130001201611132359 2015111300010020151113000100SUB01CALT302276F 332 NBPMP01 Akl Abi-Khalil 19670131 M U33 Stoford Close SW19 6TJ 2015111300010020151113000100VEH01CALT302276F 332 NBPM001LV56 LEJ N 2006VAUXHALL CA 2015111300010020151113000100MCO01CALT302276F 332 NBPM0101 0 2015111300010020151113000100POL01CALT742569N
You can use regular expressions for this, specifically by using Regex.Replace to find and replace each occurrence of the strings you're looking for with a newline followed by the matching text:
' Requires: Imports System.Text.RegularExpressions
Dim str As String = "xxxPOL01xxxSUB01xxxVEH01xxxMCO01xxx"
Dim output As String = Regex.Replace(str, "((?:POL|SUB|VEH|MCO)01)", Environment.NewLine + "$1")
'output contains:
'xxx
'POL01xxx
'SUB01xxx
'VEH01xxx
'MCO01xxx
There may be a better way to construct this regular expression, but this is a simple alternation on the different letters, followed by 01. This matched text is represented by the $1 in the replacement string.
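For readers who want to try the pattern outside .NET, the same alternation and backreference can be sketched in Python. This assumes the standard re module, where the captured text is written \1 rather than $1 and a plain "\n" stands in for Environment.NewLine:

```python
import re

text = "xxxPOL01xxxSUB01xxxVEH01xxxMCO01xxx"

# Same pattern as the VB.NET call: one of four prefixes followed by 01,
# captured as group 1 and re-emitted after a newline.
output = re.sub(r"((?:POL|SUB|VEH|MCO)01)", r"\n\1", text)

print(output)
```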
If you're new to regular expressions, there are a number of tools that help you understand them - for example, regex101.com will show you an explanation of the one I have used here: