pandas contains regex

I would like to match all cells that begin with the number 978, but the following code matches 397854 and NaN too.
an_transaction_product["kniha"] = np.where(an_transaction_product["zbozi_ean"].str.contains('^978', regex=True) , 1, 0)
What am I doing wrong, please?

This doesn't work because .str.contains will check if the regex occurs anywhere in the string.
If you insist on using regex, .str.match does what you want.
But for this simple case .str.startswith("978") is clearer.
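For illustration, here is a minimal sketch on made-up EAN values (the column name zbozi_ean follows the question). The na=False argument is what keeps NaN cells from counting as matches:

import numpy as np
import pandas as pd

# Made-up sample: one real 978 prefix, one 978 in the middle, one missing value.
an_transaction_product = pd.DataFrame({"zbozi_ean": ["9788075050", "3978540000", np.nan]})
s = an_transaction_product["zbozi_ean"]

# startswith/match only look at the beginning of the string; na=False treats NaN as no match.
an_transaction_product["kniha"] = np.where(s.str.startswith("978", na=False), 1, 0)
# regex equivalent:
# an_transaction_product["kniha"] = np.where(s.str.match("978", na=False), 1, 0)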

Apart from regex, you can use .loc to find cells that start with '978'. The code below will assign 1 to such cells in column 'A', just as an example:
df.loc[df['A'].astype(str).str[:3]=='978', 'A'] = 1
Note: astype(str) converts the number to a string, str[:3] takes the first 3 characters, and the result is compared to '978'.
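As a quick sketch of how that behaves on made-up values in a hypothetical column 'A':

import pandas as pd

df = pd.DataFrame({'A': [9785001, 397854, 978123]})
# Only the values whose first three characters are '978' are overwritten with 1.
df.loc[df['A'].astype(str).str[:3] == '978', 'A'] = 1
print(df)
#         A
# 0       1
# 1  397854
# 2       1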

Related

Pandas Replace column values

Hello,
I am analyzing the following dataset with this information.
The column ['program_number'] is an object but I want to change it to an integer column.
I have tried to replace some values but it doesn't work.
As you can see, some values like 6 are duplicated, e.g. '6 ' and 6.
How can I resolve it? Many thanks
UPDATE
Didn't see 1X and 3X at first.
If you need those numbers and just want to remove the X then:
df["Program"] = df["Program"].str.strip(" X").astype(int)
If there is data in the column that isn't numeric or that shouldn't be converted, you can use pd.to_numeric with errors='coerce'. Cells that can't be converted become NaN. Be aware that the column will then hold floating-point numbers.
df["Program"] = pd.to_numeric(df["Program"], errors="coerce")
Old answer:
You want to use str.strip() here, rather than replace.
Try this:
df1['program_number'] = df1['program_number'].str.strip().astype(int)

Data cleaning: regex to replace numbers

I have this dataframe:
p=pd.DataFrame({'text':[2,'string']})
and trying to replace digit 2 by an 'a' using this code:
p['text']=p['text'].str.replace('\d+', 'a')
But instead of the letter 'a' I get NaN.
What am I doing wrong here?
In your dataframe, the first value of the text column is actually a number, not a string, hence the NaN when you try to use .str methods. Just convert it to a string first:
p['text'] = p['text'].astype(str).str.replace('\d+', 'a')
Output:
>>> p
     text
0       a
1  string
(Note that .str.replace is soon going to change the default value of regex from True to False, so you won't be able to use regular expressions without passing regex=True, e.g. .str.replace('\d+', 'a', regex=True))

Select rows which contain numeric substrings in Pandas

I need to delete rows from a dataframe in which a particular column contains strings with numeric substrings. See the shaded column of my dataframe.
Rows with values like 0E as a prefix, or 21 (any two-digit number) as a suffix, or 24A (any two-digit number followed by a letter) as a suffix should be deleted.
Any suggestions?
Thanks in advance.
You can use boolean indexing with a str.contains() regex:
^0E - starts with 0E
\d{2}$ - ends with 2 digits
\d{2}[A-Z]$ - ends with 2 digits and 1 capital letter
col = ... # target column
mask = df[col].str.contains(r'^0E|\d{2}$|\d{2}[A-Z]$')
df = df.loc[~mask]
@tdy gave a good answer, but only one place needs to be modified, if I understand it correctly.
For values ending with two digits, or two digits and a capital letter, the regex should be:
.*\d{2}[A-Z]?$
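On a made-up sample (the real column name isn't given in the question), either form of the pattern filters out the same rows:

import pandas as pd

df = pd.DataFrame({"code": ["0E123", "AB21", "XY24A", "keep_me", "also-keep"]})

# Keep only the rows that do NOT match any of the unwanted patterns.
mask = df["code"].str.contains(r'^0E|\d{2}$|\d{2}[A-Z]$')
df = df.loc[~mask]
print(df)
#         code
# 3    keep_me
# 4  also-keep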

Replacing substrings based on lists

I am trying to replace substrings in a dataframe using the lists "name" and "lemma". As long as I enter the lists manually, the code delivers the result in the dataframe m.
name=['Charge','charge','Prepaid']
lemma=['Hallo','hallo','Hi']
m=sdf.replace(regex= name, value =lemma)
As soon as I read both lists from an Excel file, my code no longer replaces the substrings. I need to use an Excel file, since the lists come from one table that is very large.
sdf= pd.read_excel('training_data.xlsx')
synonyms= pd.read_excel('synonyms.xlsx')
lemma=synonyms['lemma'].tolist()
name=synonyms['name'].tolist()
m=sdf.replace(regex= name, value =lemma)
Thanks for your help!
df.replace()
Replace values given in to_replace with value.
Values of the DataFrame are replaced with other values dynamically. This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.
In short, this method doesn't make changes at the Series level, only on values.
This may achieve what you want (convert the two columns to lists and pass them straight to replace):
m = sdf.replace(regex=synonyms['name'].tolist(), value=synonyms['lemma'].tolist())
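A self-contained sketch of that flow, using in-memory frames as stand-ins for the two Excel files and the column names from the question:

import pandas as pd

# Stand-ins for pd.read_excel('training_data.xlsx') and pd.read_excel('synonyms.xlsx').
sdf = pd.DataFrame({"col": ["Charge now", "Prepaid plan", "other"]})
synonyms = pd.DataFrame({"name": ["Charge", "charge", "Prepaid"],
                         "lemma": ["Hallo", "hallo", "Hi"]})

# Build plain Python lists from the columns, then hand them to DataFrame.replace
# as regex patterns and replacement values.
name = synonyms["name"].tolist()
lemma = synonyms["lemma"].tolist()
m = sdf.replace(regex=name, value=lemma)
print(m)
#          col
# 0  Hallo now
# 1    Hi plan
# 2      other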
If you are just trying to replace 'Charge' with 'Hallo', 'charge' with 'hallo', and 'Prepaid' with 'Hi', then you can use replace() and pass the list of words to find as the first argument and the list of replacement words as the keyword argument value.
Try this:
df=df.replace(name, value=lemma)
Example:
import pandas as pd

name=['Charge','charge','Prepaid']
lemma=['Hallo','hallo','Hi']
df = pd.DataFrame([['Bob', 'Charge', 'E333', 'B442'],
                   ['Karen', 'V434', 'Prepaid', 'B442'],
                   ['Jill', 'V434', 'E333', 'charge'],
                   ['Hank', 'Charge', 'E333', 'B442']],
                  columns=['Name', 'ID_First', 'ID_Second', 'ID_Third'])
df=df.replace(name, value=lemma)
print(df)
Output:
    Name ID_First ID_Second ID_Third
0    Bob    Hallo      E333     B442
1  Karen     V434        Hi     B442
2   Jill     V434      E333    hallo
3   Hank    Hallo      E333     B442

Finding the count of a set of substrings in pandas dataframe

I am given a set of substrings. I need to find the count of occurrences of all those substrings in a particular column of a dataframe. The relevant dataframe would look like this:
training['concat']
0 svAxu$paxArWAn
1 xvAxaSa$varRANi
2 AxAna$xurbale
3 go$BakwAH
4 viXi$Bexena
5 nIwi$kuSalaM
6 lafkA$upamam
7 yaSas$lipsoH
8 kaSa$AGAwam
9 hewumaw$uwwaram
10 varRa$pUgAn
My set of substrings is a dictionary, where the keys are the substrings and values are the probabilities with which they occur
reg = {'anuBavAn':0.35, 'a$piwra':0.2 ...... 'piwra':0.7, 'pa':0.03, 'a':0.0005}
# The length of the dictionary is 2000
In particular, I need to find those substrings which occur more than twice.
I have written the following code that performs the task. Is there a more elegant pythonic way or panda specific way to achieve the same as the current implementation is taking quite some time to execute.
elites = dict()
for reg_pat in reg_:
    count = 0
    eliter = len(training[training['concat'].str.contains(reg_pat)]['concat'])
    if eliter >= 3:
        elites[reg_pat] = reg_[reg_pat]
You can use apply instead of str.contains; it is faster:
reg_ = {'anuBavAn':0.35, 'a$piwra':0.2, 'piwra':0.7, 'pa':0.03, 'a':0.0005}
elites = dict()
for reg_pat in reg_:
    if training['concat'].apply(lambda x: reg_pat in x).sum() >= 3:
        elites[reg_pat] = reg_[reg_pat]
print (elites)
{'a': 0.0005}
Hopefully I have interpreted your question correctly. I'm inclined to stay away from regex here (in fact, I've never used it in conjunction with pandas), but it's not wrong, strictly speaking. In any case, I find it hard to believe that any regex operations are faster than a simple in check, but I could be wrong on that.
for substr in reg:
    totalStringAppearances = training['concat'].apply(lambda string: substr in string)
    totalStringAppearances = totalStringAppearances.sum()
    if totalStringAppearances > 2:
        reg[substr] = totalStringAppearances / len(training)
    else:
        pass  # do what you want to with the very rare substrings
Some gotchas:
If you wanted something like a substring 'a' in 'abcdefa' to return 2, then this will not work. It merely checks for existence of the substring in each string.
Inside the apply(), I am using a potentially unreliable exploitation of booleans. See this question for more details.
Post-edit: Jezrael's answer is more complete, as it uses the same variable names. But in a simple case, comparing regex via str.contains against apply with in, I validated his claim, and my presumption.
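A rough way to run such a comparison yourself (made-up data standing in for training['concat'], timeit from the standard library, and no timings asserted here, since they depend entirely on the data):

import timeit
import pandas as pd

s = pd.Series(['svAxu$paxArWAn', 'xvAxaSa$varRANi', 'AxAna$xurbale'] * 10000)
pat = 'pa'

# Time the regex-based count against the plain substring check.
t_contains = timeit.timeit(lambda: s.str.contains(pat, regex=True).sum(), number=20)
t_apply = timeit.timeit(lambda: s.apply(lambda x: pat in x).sum(), number=20)
print(f"str.contains: {t_contains:.3f}s   apply + in: {t_apply:.3f}s")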