Data cleaning: regex to replace numbers - pandas

I have this dataframe:
p=pd.DataFrame({'text':[2,'string']})
and trying to replace digit 2 by an 'a' using this code:
p['text']=p['text'].str.replace('\d+', 'a')
But instead of letter a and get NaN?
What am I doing wrong here?

In your dataframe, the first value of the text column is actually a number, not a string, thus the NaN error when you try to call .str. Just convert it to a string first:
p['text'] = p['text'].astype(str).str.replace('\d+', 'a')
Output:
>>> p
text
0 a
1 string
(Note that .str.replace is soon going to change the default value of regex from True to False, so you won't be able to use regular expressions without passing regex=True, e.g. .str.replace('\d+', 'a', regex=True))

Related

Converting zip+4 to zip python

I am looking to convert zip+4 codes into zip codes in a pandas dataframe. I want it to identify that a zip 4 code exists and keep just the first 5 digits. I effectively want to do the below code (although this doesn't work in this format):
df.replace('^(\d{5}-?\d{4})', group(1), regex=True)
The following code does the same procedure for a list, I'm looking to do the same thing in the dataframe.
my_input = ['01234-5678', '012345678', '01234', 'A1A 1A1', 'A1A1A1']
expression = re.compile(r'^(\d{5})-?(\d{4})?$')
my_output = []
for string in my_input:
if m := re.match(expression, string):
my_output.append(re.match(expression, string).group(1))
else:
my_output.append(string)
You can use
df = df.replace(r'^(\d{5})-?\d{4}$', r'\1', regex=True)
See the regex demo.
Details:
^ - start of string
(\d{5}) - Group 1 (\1): five digits
-? - an optional -
\d{4} - any four digits
$ - end of string.

pandas contains regex

I would like to match all cells that beginns with 978 number. But following code matches 397854 or nan too.
an_transaction_product["kniha"] = np.where(an_transaction_product["zbozi_ean"].str.contains('^978', regex=True) , 1, 0)
What do I do wrong please?
This doesn't work because .str.contains will check if the regex occurs anywhere in the string.
If you insist on using regex, .str.match does what you want.
But for this simple case .str.startswith("978") is clearer.
Apart from regex, you can use .loc to find cells that start with '978'. The code below will assign 1 to such cells in column 'A', just as an example:
df.loc[df['A'].astype(str).str[:3]=='978', 'A'] = 1
note: astype(str) converts the number to string and then str[:3] gets the first 3 characters, and then compares it to '978'.

Replacing substrings based on lists

I am trying to replace substrings in a data frame by the lists "name" and "lemma". As long as I enter the lists manually, the code delivers the result in the dataframe m.
name=['Charge','charge','Prepaid']
lemma=['Hallo','hallo','Hi']
m=sdf.replace(regex= name, value =lemma)
As soon as I am reading in both lists from an excel file, my code is not replacing the substrings anymore. I need to use an excel file, since the lists are in one table that is very large.
sdf= pd.read_excel('training_data.xlsx')
synonyms= pd.read_excel('synonyms.xlsx')
lemma=synonyms['lemma'].tolist()
name=synonyms['name'].tolist()
m=sdf.replace(regex= name, value =lemma)
Thanks for your help!
df.replace()
Replace values given in to_replace with value.
Values of the DataFrame are replaced with other values dynamically. This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.
in short, this method won't make change on the series level, only on values.
This may achieve what you want:
sdf.regex = synonyms.name
sdf.value = synonyms.lemma
If you are just trying to replace 'Charge' with 'Hallo' and 'charge' with 'hallo' and 'Prepaid' with 'Hi' then you can use repalce() and pass the list of words to finds as the first argument and the list of words to replace with as the second keyword argument value.
Try this:
df=df.replace(name, value=lemma)
Example:
name=['Charge','charge','Prepaid']
lemma=['Hallo','hallo','Hi']
df = pd.DataFrame([['Bob', 'Charge', 'E333', 'B442'],
['Karen', 'V434', 'Prepaid', 'B442'],
['Jill', 'V434', 'E333', 'charge'],
['Hank', 'Charge', 'E333', 'B442']],
columns=['Name', 'ID_First', 'ID_Second', 'ID_Third'])
df=df.replace(name, value=lemma)
print(df)
Output:
Name ID_First ID_Second ID_Third
0 Bob Hallo E333 B442
1 Karen V434 Hi B442
2 Jill V434 E333 hallo
3 Hank Hallo E333 B442

Octave keyboard input function to filter concatenated string and integer?

if we write 12wkd3, how to choose/filter 123 as integer in octave?
example in octave:
A = input("A?\n")
A?
12wkd3
A = 123
while 12wkd3 is user keyboard input and A = 123 is the expected answer.
assuming that the general form you're looking for is taking an arbitrary string from the user input, removing anything non-numeric, and storing the result it as an integer:
A = input("A? /n",'s');
A = int32(str2num(A(isdigit(A))));
example output:
A?
324bhtk.p89u34
A = 3248934
to clarify what's written above:
in the input statement, the 's' argument causes the answer to get stored as a string, otherwise it's evaluated by Octave first. most inputs would produce errors, others may be interpreted as functions or variables.
isdigit(A) produces a logical array of values for A with a 1 for any character that is a 0-9 number, and 0 otherwise.
isdigit('a1 3 b.') = [0 1 0 1 0 0 0]
A(isdigit(A)) will produce a substring from A using only those values corresponding to a 1 in the logical array above.
A(isdigit(A)) = 13
that still returns a string, so you need to convert it into a number using str2num(). that, however, outputs a double precision number. so finally to get it to be an integer you can use int32()

Substring function does not work in Vb?

I am trying to mask SSn and want show it on label caption.
lblSPTINTo.Caption = rsMM("SPTIN")
lblCPTINTo.Caption = rsMM("CPTIN")
i am trying to use substring function to get last 4 characters but i not am to able to use it as it throws compile error .
lblSPTINTo.Caption = rsMM("SPTIN").sutbstring(4,4)
Replace sutbstring with Substring.
But it won't work that way because the first parameter is the index and the second parameter in Substring is the length, if you want the last 4 characters:
Dim last4 As String = rsMM("SPTIN")
If last4.Length > 4 Then last4 = last4.Substring(last4.Length - 4)