Replacing part of string in a column with exceptions - pandas

I am cleaning data in a Jupyter Notebook with pandas and I am trying to keep just the first part of a string. But there is a catch: I can easily delete the rest of the expression, but some fields are actually valid. For example, in the column:
SHIPMENT_PROVIDER
Usps
Usps International
Uspsxy3
Usps7kju
Usps0by
I want to keep Usps and Usps International. So far I have used the following code for simpler cases:
orders.loc[:, 'SHIPMENT_PROVIDER'] = orders.loc[:, 'SHIPMENT_PROVIDER'].replace(to_replace='(?:Usps)([a-zA-Z0-9]+)$', value = 'Usps', regex = True)
But this won't work with two alternative expressions. The idea is that Usps followed by random characters (e.g. Uspsxyz) should be replaced by Usps, but Usps International followed by random characters (e.g. Usps Internationalxyz) should be replaced by Usps International.

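Others have posted regex solutions. How about a non-regex solution:
import numpy as np

s = orders["SHIPMENT_PROVIDER"]
# Check the longer prefix first, because "Usps International" also starts with "Usps".
orders["SHIPMENT_PROVIDER"] = np.select(
    [s.str.startswith("Usps International"), s.str.startswith("Usps")],
    ["Usps International", "Usps"],
    s,  # default: leave any other value unchanged
)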

As a pattern, you could use a capture group for the first part, with the optional International part inside the group:
^(Usps(?: International)?)[a-zA-Z0-9]+$
In the replacement use group 1. Values that are exactly Usps or Usps International do not match, because the pattern requires at least one trailing character, so they are left unchanged.
import pandas as pd

pattern = r"^(Usps(?: International)?)[a-zA-Z0-9]+$"
items = [
    "Usps",
    "Usps International",
    "Uspsxy3",
    "Usps7kju",
    "Usps0by",
    "Usps Internationalxyz"
]
orders = pd.DataFrame(items, columns=["SHIPMENT_PROVIDER"])
orders.loc[:, 'SHIPMENT_PROVIDER'] = orders.loc[:, 'SHIPMENT_PROVIDER'].replace(pattern, r"\1", regex=True)
print(orders)
Output
    SHIPMENT_PROVIDER
0                Usps
1  Usps International
2                Usps
3                Usps
4                Usps
5  Usps International
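A pattern like this can also be checked outside pandas with re.sub. The sketch below assumes the goal stated in the question (Usps Internationalxyz should become Usps International), so the optional International part sits inside the capture group; clean is a hypothetical helper name, not part of any library.

```python
import re

# Group 1 is the part to keep. At least one trailing alphanumeric
# character is required, so already-clean values do not match.
PATTERN = re.compile(r"^(Usps(?: International)?)[a-zA-Z0-9]+$")

def clean(value):
    """Hypothetical helper: drop trailing junk, keep the provider name."""
    return PATTERN.sub(r"\1", value)

samples = [
    "Usps",
    "Usps International",
    "Uspsxy3",
    "Usps7kju",
    "Usps0by",
    "Usps Internationalxyz",
]
cleaned = [clean(v) for v in samples]
print(cleaned)
```

Values that do not match the pattern at all are returned unchanged, which is why the two valid names survive.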

Related

get prefix out a size range with different size formats

I have a column in a df with a size range in different size formats.
      artikelkleurnummer  size
6725          0161810ZWA  B080
6726          0161810ZWA  B085
6727          0161810ZWA  B090
6728          0161810ZWA  B095
6729          0161810ZWA  B100
The size range also contains other size formats, like XS - XXL, 36-50, 36/38 - 52/54, ONE, XS/S - XL/XXL, 363-545.
I have tried to remove the prefix '0' from all sizes that start with a letter in the range A to K. For example: I want to change B080 into B80. B100 stays B100.
Steps:
1 look for items in column ['size'] whose first letter is in the range A to K,
2 if True, change the second position in the string into ''
For the range I use:
from string import ascii_letters
def range_alpha(start_letter, end_letter):
    return ascii_letters[ascii_letters.index(start_letter):ascii_letters.index(end_letter) + 1]
then I've tried a for loop
for items in df['size']:
    if df.loc[df['size'].str[0] in range_alpha('A','K'):
        df.loc[df['size'].str[1] == ''
message
SyntaxError: unexpected EOF while parsing
what's wrong?
You can do it with a regex and pd.Series.str.replace:
import pandas as pd

df = pd.DataFrame([['0161810ZWA']*5, ['B080', 'B085', 'B090', 'B095', 'B100']]).T
df.columns = "artikelkleurnummer size".split()
# Rebuild the match from every group except the second one (index 1), which holds the leading zeros.
replacement = lambda mpat: ''.join(g for i, g in enumerate(mpat.groups()) if i != 1)
# A callable replacement requires regex=True (pandas 2.0 and later default to regex=False).
df['size_cleaned'] = df['size'].str.replace(r'([a-kA-K])(0*)(\d+)', replacement, regex=True)
Output
   artikelkleurnummer  size  size_cleaned
0          0161810ZWA  B080           B80
1          0161810ZWA  B085           B85
2          0161810ZWA  B090           B90
3          0161810ZWA  B095           B95
4          0161810ZWA  B100          B100
TL;DR
Find the pattern "LetterZeroDigits" and change it to "LetterDigits" using a regular expression.
Slightly longer explanation
Regexes are very handy but also hard. In the solution above, we find the pattern of interest and then replace it. In our case, the pattern of interest is made of 3 parts:
A letter from A to K (upper or lower case)
Zero or more 0s
One or more further digits
In regex terms, this can be written as r'([a-kA-K])(0*)(\d+)'. The 3 pairs of parentheses mark the 3 parts; they are called groups. If you have not worked with regexes much, any introduction to regexes will cover them.
Once we have the parts, we want to keep everything except part 2, which holds the leading 0s.
The pd.Series.str.replace documentation has the details on the replacement argument. In essence, replacement is a function that receives the match object for each match and returns the string to substitute.
The three groups identified above are accessed with mpat.groups(), which returns a tuple containing the text matched by each group. We want to reconstruct the string with the middle part excluded, which is what the replacement function does.
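To make the three groups concrete, here is a small sketch using only the standard re module with the same pattern. The helper name drop_leading_zeros and the extra sample values are assumptions for illustration.

```python
import re

pattern = re.compile(r'([a-kA-K])(0*)(\d+)')

# groups() returns one string per pair of parentheses.
match = pattern.match("B080")
print(match.groups())

def drop_leading_zeros(m):
    # Keep group 1 (the letter) and group 3 (the digits); skip group 2 (the zeros).
    return m.group(1) + m.group(3)

samples = ["B080", "B100", "XS", "36/38"]
results = {s: pattern.sub(drop_leading_zeros, s) for s in samples}
print(results)
```

Sizes that do not contain a letter from A to K followed by digits never match, so they pass through untouched.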
sizes = [{"size": "B080"},{"size": "B085"},{"size": "B090"},{"size": "B095"},{"size": "B100"}]
def range_char(start, stop):
    return (chr(n) for n in range(ord(start), ord(stop) + 1))
for s in sizes:
    if s['size'][0].upper() in range_char("A", "K"):
        s['size'] = s['size'][0]+s['size'][1:].lstrip('0')
print(sizes)
Using a List/Dict here for example.
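As for the SyntaxError in the question's loop: both df.loc[ brackets are opened but never closed, and a whole Series cannot be tested with in inside an if. The two steps can instead be applied to the column at once. The following is a minimal sketch, assuming a column named size as in the question; the sample values and the name starts_a_to_k are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"size": ["B080", "B085", "B100", "XS", "36/38", "363-545"]})

# Step 1: flag rows whose first character is a letter from A to K.
starts_a_to_k = df["size"].str[0].str.upper().isin(list("ABCDEFGHIJK"))

# Step 2: keep the first character and strip leading zeros from the rest.
stripped = df["size"].str[0] + df["size"].str[1:].str.lstrip("0")

# where() keeps the original value when the condition is True, else takes `stripped`.
df["size_cleaned"] = df["size"].where(~starts_a_to_k, stripped)
print(df)
```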

Web scraping - get tag through text in "brother" tag - beautiful soup

I'm trying to get the text inside a table on Wikipedia, and I will do it for many pages (books in this case). I want to get the book genres.
I need to extract the td containing the genre, i.e. the one in the row whose header text is Genre.
I did this:
import urllib.request
from bs4 import BeautifulSoup

page2 = urllib.request.urlopen(url2)
soup2 = BeautifulSoup(page2, 'html.parser')
for table in soup2.find_all('table', class_='infobox vcard'):
    for tr in table.findAll('tr')[5:6]:
        for td in tr.findAll('td'):
            print(td.getText(separator="\n"))
This gets me the genre, but only on some pages, because the row count differs between pages.
Example of page where this does not work
https://en.wikipedia.org/wiki/The_Catcher_in_the_Rye (table on the right side)
Does anyone know how to find the row by searching for the string "Genre"? Thank you
In this particular case, you don't need to bother with all that. Just try:
import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/The_Catcher_in_the_Rye')
print(tables[0])
Output:
                      0                                       1
0   First edition cover                     First edition cover
1                Author                          J. D. Salinger
2          Cover artist               E. Michael Mitchell[1][2]
3               Country                           United States
4              Language                                 English
5                 Genre  Realistic fictionComing-of-age fiction
6             Published                           July 16, 1951
7             Publisher               Little, Brown and Company
8            Media type                                   Print
9                 Pages                          234 (may vary)
10                 OCLC                                  287628
11        Dewey Decimal                                  813.54
From here you can use standard pandas methods to extract whatever you need.
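For example, the Genre value can be looked up by its label rather than by row position, which is what the question needs when the row count differs between pages. This is a minimal sketch on a hand-built stand-in for tables[0] (a few rows copied from the output above), so it runs without network access; the names infobox and genre are assumptions.

```python
import pandas as pd

# Stand-in for tables[0]: label in column 0, value in column 1.
infobox = pd.DataFrame(
    [
        ["Author", "J. D. Salinger"],
        ["Country", "United States"],
        ["Genre", "Realistic fictionComing-of-age fiction"],
        ["Publisher", "Little, Brown and Company"],
    ]
)

# Select the value column on the rows whose label equals "Genre".
genre_rows = infobox.loc[infobox[0] == "Genre", 1]

# Pages without a Genre row give None instead of raising an IndexError.
genre = genre_rows.iloc[0] if not genre_rows.empty else None
print(genre)
```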

spaCy nlp - positions of entities in string, extracting nearby words

Let's say I have a string and want to mark some entities, such as organizations.
string = "I was working as a marketing executive for Bank of India, a 4 months.."
string_tagged = "I was working as a marketing executive for [Bank of India], a 4 months.."
I want to identify the words beside the tagged entity.
How can I locate the position of the tagged entity and extract the words beside it?
My code:
import spacy
nlp = spacy.load('en')
doc = nlp(string)
company = doc.text
for ent in doc.ents:
    if ent.label_ == 'ORG':
        company = company[:ent.start_char] + company[:ent.start_char -1] +company[:ent.end_char +1]
print(company)
As I understood from your question, you want the words beside the token tagged ORG:
import spacy
nlp = spacy.load('en')
#string = "blah blah"
doc = nlp(string)
company = ""
for i in range(1, len(doc) - 1):
    if doc[i].ent_type_ == 'ORG':
        company = doc[i - 1:i + 2].text  # previous word, tagged word and next one
print(company)
Be aware of the boundary case: the loop skips the first and last token so that doc[i-1] and doc[i+1] always exist.
The following code works for me:
doc = nlp(str_to_be_tokenized)
company = []
for ent in doc.ents:
    if ent.label_ == 'ORG' and ent.text not in company:
        company.append(ent.text)
print(company)
The second condition in the if keeps only unique company names from my block of text. If you remove it, every occurrence of an 'ORG' entity is added to the company list. Hope this works for you as well.
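The question's own code already uses ent.start_char and ent.end_char, which are the character offsets of an entity in the text. Given such offsets, the neighbouring words can be read from the text on either side. The sketch below is plain Python so that it runs without a spaCy model: it finds the offsets with str.find as a stand-in for a recognised entity, and neighbours is a hypothetical helper.

```python
import re

def neighbours(text, start_char, end_char):
    """Hypothetical helper: (previous word, next word) around text[start_char:end_char]."""
    before = re.findall(r"\w+", text[:start_char])
    after = re.findall(r"\w+", text[end_char:])
    prev_word = before[-1] if before else None
    next_word = after[0] if after else None
    return prev_word, next_word

text = "I was working as a marketing executive for Bank of India, a 4 months.."
entity = "Bank of India"
start = text.find(entity)   # stand-in for ent.start_char
end = start + len(entity)   # stand-in for ent.end_char

result = neighbours(text, start, end)
print(result)
```

With a real spaCy doc, the same call could be made inside the for ent in doc.ents loop, passing the offsets of each ORG entity.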

need to extract all the content between two string in pandas dataframe

I have data in a pandas dataframe. I need to extract all the content between the string that starts with "Impact Factor:" and ends with "&#". If the content doesn't contain "Impact Factor:", I want null in that row of the dataframe.
This is sample data from a single row:
Save to EndNote online &# Add to Marked List &# Impact Factor: Journal 2 and Citation Reports 500 &# Other Information &# IDS Number: EW5UR &#
I want content like the below in a dataframe:
Journal 2 and Citation Reports 500
Journal 6 and Citation Reports 120
Journal 50 and Citation Reports 360
Journal 30 and Citation Reports 120
Hi, you can just use a regular expression here:
import re
result = your_df.your_col.apply(lambda x: re.findall(r'Impact Factor:(.*?)&#', x))
You may want to strip whitespace too, in which case you could use:
result = your_df.your_col.apply(lambda x: re.findall(r'Impact Factor:\s*(.*?)\s*&#', x))
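Note that re.findall returns a list for every row, which is empty when there is no match. If a null is wanted for rows without "Impact Factor:", as the question asks, one option is pandas' own Series.str.extract, which yields NaN for non-matching rows. This is a sketch with a hypothetical two-row frame; the second row and the impact_factor column name are made up for illustration.

```python
import pandas as pd

# First row is the sample from the question; the second has no "Impact Factor:".
your_df = pd.DataFrame({
    "your_col": [
        "Save to EndNote online &# Add to Marked List &# Impact Factor: "
        "Journal 2 and Citation Reports 500 &# Other Information &# IDS Number: EW5UR &#",
        "Save to EndNote online &# Other Information &#",
    ]
})

# expand=False returns a Series for a single capture group;
# rows where the pattern does not match become NaN.
your_df["impact_factor"] = your_df["your_col"].str.extract(
    r"Impact Factor:\s*(.*?)\s*&#", expand=False
)
print(your_df["impact_factor"])
```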

Matching a string which includes -,.$\/ with a regex

I am trying to match a string which includes -,.$/ (and might include other special characters which I don't know yet) with a regex. I have to match the first 28 characters in the string.
The string is:
Received - Data Migration 1. Units, of UNITED STATES $ CXXX CORPORATION COMMON SHARE STOCK CERTIFICATE NO. 323248 987,837 SHARES PAR VAL $1.00 NOT ADMINISTERED XX XX, XXXSFHIGSKF/XXXX PURPOSES ONLY
The regex I am using is ((([\w-,.$\/]+)\s){28}).*
Is there a better way to match special characters?
Also, I get an error if the string is shorter than 28. What can I do to include a range, so that the regex works even if the string has fewer than 28 characters?
The code looks something like this:
Select regexp_extract(Txn_Desc,'((([\w-,.$;!#\/%)^#<>&*(]+)\s){1,28}).*',1) as Transaction_Short_Desc,Txn_Desc
from Table x
It seems you are looking for 28 tokens.
Try
(\S+\s+){0,28}
or
([^ ]+ +){0,28}
This is the result for 8 tokens:
Received - Data Migration 1. Units, of UNITED
|        | |    |         |  |      |  |
1        2 3    4         5  6      7  8
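The token pattern can be tried out in Python's re module, used here only as a stand-in for the SQL regexp_extract call in the question. The sketch uses a shortened copy of the question's string, and first_tokens is a hypothetical helper.

```python
import re

text = ("Received - Data Migration 1. Units, of UNITED STATES $ CXXX CORPORATION "
        "COMMON SHARE STOCK CERTIFICATE NO. 323248 987,837 SHARES PAR VAL $1.00")

def first_tokens(s, n):
    """Hypothetical helper: up to n leading whitespace-delimited tokens of s."""
    # {0,n} lets the match succeed even when s has fewer than n tokens.
    return re.match(r"(?:\S+\s+){0,%d}" % n, s).group(0).rstrip()

first_eight = first_tokens(text, 8)
short = first_tokens("only three tokens ", 28)
print(first_eight)
print(short)
```

One caveat: \S+\s+ only counts a token when whitespace follows it, so a final token at the very end of a short string (with nothing after it) is not included in the match.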