Python function print output redirect - variables

It's my first attempt at coding and I've hit a bit of a snag.
I was wondering if one of you would be able to point me in the right direction for this case.
What I'm trying to do is convert a string from hexadecimal format into readable text.
Following this, I want to run the converted string/text against some regex commands in order to pick out things like email addresses and domains.
I know each piece of code works individually when tested, but when I try to run them together I am unable to properly assign the printed output of the converted string to a variable so that I can run the regex commands against it.
Any tips or suggestions on how I can properly assign the converted string to a variable and work on it?
This is only part of the code; the rest offers options to convert other format types into readable text:
#!/usr/bin/env python3
import base64
import codecs
import re
import sys
print(" If the data is Hexadecimal and looks similar to this: 48656C6C6F20686F77206172, enter: 2 ")
print("")
decision = int(input("enter the number here: "))
print("")
message = input("Enter the data you wish to have decoded: ")
def decode_hex1(encoded_text):
    information = ''
    for i in range(len(encoded_text)//2):
        decoded = codecs.decode(encoded_text[i*2:i*2+2], "hex").decode('utf-8')
        # print() returns None, so the decoded value is accumulated
        # separately rather than assigned from the print call.
        print(decoded, end="")
        information = information + decoded
    return information
if decision == 2:
    output = decode_hex1(message)
    match_emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', output)
    print("The following emails may be of interest to you: ", match_emails)
    print("")
    domain_regex = r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
    match_domains = re.findall(domain_regex, output)
    print("The following domains may be of interest to you: ", match_domains)
else:
    print("Invalid Choice - please check the number and try again")
    print("")
print("Done")

Recognize newline (\n) in text as end of sentence in Spacy

I'd like to recognize a newline in text as the end of a sentence. I've tried inputting it into the nlp object like this:
import spacy

text = 'Guest Blogging\nGuest Blogging allows the user to collect backlinks'
nlp = spacy.load("en_core_web_lg")
config = {"punct_chars": ['\n']}
nlp.add_pipe("sentencizer", config=config)
for sent in nlp(text).sents:
    print('next sentence:')
    print(sent)
The output of this is:
next sentence:
Guest Blogging
Guest Blogging allows the user to collect backlinks
I don't understand why spaCy isn't recognizing the newline as a sentence end. My desired output is:
next sentence:
Guest Blogging
next sentence:
Guest Blogging allows the user to collect backlinks
Does anyone know how to achieve this?
The reason the sentencizer isn't doing anything here is that the parser has run first and already set all the sentence boundaries, and then the sentencizer doesn't modify any existing sentence boundaries.
The sentencizer with \n is only the right option if you know you have exactly one sentence per line in your input text. Otherwise a custom component that adds sentence starts after newlines (but doesn't set all sentence boundaries) is probably what you want.
If you want to set some custom sentence boundaries before running the parser, you need to be sure to add your custom component before the parser in the pipeline:
nlp.add_pipe("my_component", before="parser")
Your custom component would set token.is_sent_start = True for the tokens right after newlines and leave all other tokens unmodified.
Check out the second example here: https://spacy.io/usage/processing-pipelines#custom-components-simple
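For illustration, a minimal sketch of such a component, following the pattern in the linked docs (the component name newline_sentence_start is mine):

import spacy
from spacy.language import Language

@Language.component("newline_sentence_start")
def newline_sentence_start(doc):
    # Mark the token right after each newline as a sentence start and
    # leave every other token alone, so the parser still decides the rest.
    for i, token in enumerate(doc[:-1]):
        if "\n" in token.text:
            doc[i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("newline_sentence_start", before="parser")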
You can do this by excluding the parser when loading the model:
import spacy

nlp = spacy.load('en_core_web_sm', exclude=["parser"])
text = 'Guest Blogging\nGuest Blogging allows the user to collect backlinks'
config = {"punct_chars": ['\n']}
nlp.add_pipe("sentencizer", config=config)
for sent in nlp(text).sents:
    print("next sentence")
    print(sent)
Output:
next sentence
Guest Blogging
next sentence
Guest Blogging allows the user to collect backlinks
You could also break the text into sentences on \n yourself before feeding it to spaCy.
from spacy.lang.en import English

def get_sentences(_str):
    chunks = _str.split('\n')
    sentences = []
    nlp = English()
    nlp.add_pipe("sentencizer")
    for chunk in chunks:
        doc = nlp(chunk)
        sentences += [sent.text.strip() for sent in doc.sents]
    return sentences
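For example, with the text from the question (a quick check, not from the original answer):

print(get_sentences('Guest Blogging\nGuest Blogging allows the user to collect backlinks'))
# ['Guest Blogging', 'Guest Blogging allows the user to collect backlinks']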

I need to replace non-ASCII characters in pandas data frame column in python 2.7

This question has been asked many times, but none of the solutions worked for me.
The data frame was pulled from a third party excel file with 'UTF-8' encoding:
pd.read_excel(file, encoding = 'UTF-8', sheet_name = worksheet)
But I still have characters like " ’ " instead of " ' " in some lines.
At the top of the code I have the following:
# -*- encoding: utf-8 -*-
The following line does not throw errors, but does not change anything in the data:
df['text'] = df['text'].str.replace("’","'")
I tried with a dictionary (which has the same core), like
repl_dict = {"’": "'"}
for k, v in repl_dict.items():
    df.loc[df.text.str.contains(k), 'text'] = df.text.str.replace(pat=k, repl=v)
and tried many other approaches including regex, but nothing worked.
When I tried:
def replace_apostrophy(text):
    return text.replace("’", "'")

df['text'] = df['text'].apply(lambda x: replace_apostrophy(x))
I received the following error -
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
When I tried:
df["text"] = df["text"].apply(lambda text: unicodedata.normalize('NFKD', text))
I got the following error -
TypeError: normalize() argument 2 must be unicode, not float
The text also contains emojis that I will afterwards need to count somehow.
Can someone give me a good advice?
Thank you very much!
I have found a solution myself. It might look clumsy, but works perfectly in my case:
df["text"] = df["text"].apply(lambda text: unicodedata.normalize('NFKD', text).encode('ascii','backslashreplace'))
I had to replace nan values prior to running that code.
That operation leaves me with ASCII-only text in which the garbled sequences can easily be replaced:
def replace_apostrophy(text):
    return text.replace("a\u0302\u20acTM", "'")
Hope this would help someone.
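For what it's worth, "’" is the classic mojibake for "'" (U+2019): UTF-8 bytes mis-decoded as cp1252. If that is what happened here, a round trip repairs the text itself instead of escaping it (a sketch, assuming Python 2.7 unicode values; fix_mojibake is a hypothetical helper):

# -*- coding: utf-8 -*-
def fix_mojibake(text):
    # Re-encode the mis-decoded text as cp1252 to recover the raw bytes,
    # then decode those bytes as the UTF-8 they originally were.
    try:
        return text.encode('cp1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError, AttributeError):
        return text  # leave clean text and non-string values untouched

print repr(fix_mojibake(u'don’t'))  # u'don\u2019t'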

How to scrape NON-ASCII characters alone

I am using BeautifulSoup to scrape data. The text I want to scrape is "€ 48,50", which contains a non-ASCII character. I would like to replace the euro sign with nothing so that the final output is "48,50". I have been getting errors because the console cannot print it. I am using Python 2.7 on Windows. I would appreciate a solution.
I was basically getting errors and do not know how to go about this. Or is there a way I can just extract the non-ASCII characters alone?
w = item.find_all("div", {"class": "product-price"}).find("strong",
    {"class": "product-price__money"}).text.replace("\\u20ac", " ")
print w
You need to decode the string and pass the replace function a unicode string.
text = "€ 48,50"
w = text.decode("utf-8").replace(u"\u20ac"," ")
print w
See How to replace unicode characters in string with something else python? for more details.
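Applied to the scraping line from the question, that might look like this (a sketch; note it uses find() rather than find_all(), since find_all() returns a list that has no .text attribute):

# item is the BeautifulSoup element from the question
w = item.find("div", {"class": "product-price"}) \
        .find("strong", {"class": "product-price__money"}) \
        .text.replace(u"\u20ac", "").strip()
print w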

How to remove illegal characters so a dataframe can write to Excel

I am trying to write a dataframe to an Excel spreadsheet using ExcelWriter, but it keeps returning an error:
openpyxl.utils.exceptions.IllegalCharacterError
I'm guessing there's some character in the dataframe that ExcelWriter doesn't like. It seems odd, because the dataframe is formed from three Excel spreadsheets, so I can't see how there could be a character that Excel doesn't like!
Is there any way to iterate through a dataframe and replace characters that ExcelWriter doesn't like? I don't even mind if it simply deletes them.
What's the best way of removing or replacing illegal characters in a dataframe?
Based on Haipeng Su's answer, I added a function that does this:
dataframe = dataframe.applymap(
    lambda x: x.encode('unicode_escape').decode('utf-8') if isinstance(x, str) else x)
Basically, it escapes the unicode characters if they exist. It worked and I can now write to Excel spreadsheets again!
The same problem happened to me. I solved it as follows:
Install the Python package xlsxwriter:
pip install xlsxwriter
Replace the default engine 'openpyxl' with 'xlsxwriter':
dataframe.to_excel("file.xlsx", engine='xlsxwriter')
Trying a different Excel writer engine solved my problem.
writer = pd.ExcelWriter('file.xlsx', engine='xlsxwriter')
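A complete, runnable version of that idea might look like this (the sample dataframe is hypothetical; the point is that the xlsxwriter engine does not raise IllegalCharacterError on such values):

import pandas as pd

# Hypothetical dataframe containing a control character openpyxl rejects.
dataframe = pd.DataFrame({'text': ['bad\x01value']})

with pd.ExcelWriter('file.xlsx', engine='xlsxwriter') as writer:
    dataframe.to_excel(writer, index=False)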
If you don't want to install another Excel writer engine (e.g. xlsxwriter), you may try to remove these illegal characters by looking for the pattern which causes the IllegalCharacterError to be raised.
Open cell.py, which is found at /path/to/your/python/site-packages/openpyxl/cell/, and look for the check_string function; you'll see it uses a regular expression pattern, ILLEGAL_CHARACTERS_RE, to find those illegal characters. Locating its definition, you'll see this line:
ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')
This line is what you need to remove those characters. Copy this line to your program and execute the below code before your dataframe is written to Excel:
dataframe = dataframe.applymap(lambda x: ILLEGAL_CHARACTERS_RE.sub(r'', x) if isinstance(x, str) else x)
The above line will remove those characters in every cell.
But the origin of these characters may be a problem. As you say, the dataframe comes from three Excel spreadsheets. If the source spreadsheets contain those characters, you will still face this problem, so if you can control how the source spreadsheets are generated, try to remove these characters there to begin with.
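Putting that together, a self-contained version might look like this (the sample dataframe is hypothetical; the pattern is copied verbatim from openpyxl's cell.py as described above):

import re
import pandas as pd

# The pattern from openpyxl/cell/cell.py quoted above.
ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')

# Hypothetical dataframe with an illegal control character (\x01).
dataframe = pd.DataFrame({'text': ['clean value', 'dirty\x01value']})
dataframe = dataframe.applymap(
    lambda x: ILLEGAL_CHARACTERS_RE.sub('', x) if isinstance(x, str) else x)
dataframe.to_excel('file.xlsx')  # now succeeds with the default openpyxl engine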
I was also struggling with some weird characters in a data frame when writing it to HTML or CSV. For example, I couldn't write accented characters to an HTML file, so I needed to convert them to characters without the accent.
My method may not be the best, but it helped me convert unicode strings into ASCII-compatible ones.
# install unidecode first
from unidecode import unidecode

def FormatString(s):
    if isinstance(s, unicode):
        try:
            s.encode('ascii')
            return s
        except:
            return unidecode(s)
    else:
        return s

df2 = df1.applymap(FormatString)
In your situation, if you just want to get rid of the illegal characters, change return unidecode(s) to return whatever replacement string you want.
Hope this gives you some ideas for dealing with your problem.
You can use the built-in strip() method for Python strings; note that it only removes leading and trailing whitespace (including a few control characters), so it helps only when the offending characters sit at the ends of the value.
for each cell:
text = str(illegal_text).strip()
for entire data frame:
dataframe = dataframe.applymap(lambda t: str(t).strip())
If you're still struggling to clean up the characters, this worked well for me:
import xlwings as xw
import pandas as pd
df = pd.read_pickle('C:\\Users\\User1\\picked_DataFrame_notWriting.df')
topath = 'C:\\Users\\User1\\tryAgain.xlsx'
wb = xw.Book(topath)
ws = wb.sheets['Data']
ws.range('A1').options(index=False).value = df
wb.save()
wb.close()

how to find a word in ASCII file using python

I want to find a word and its index, but the problem is that I am only getting its first position while the word appears more than once in the file. The file's content is:
[MAKE DATA:STUDENT1=AENIE:AGE14,STUDENT2=JOHN:AGE15,STUDENT3=KELLY:AGE14,STUDENT4=JACK:AGE16,STUDENT5=SNOW:AGE16;SET RECORD:STUDENT1=GOOD,STUDENT2=,STUDENT3=BAD,STTUDENT4=,STUDENT5=GOOD]
Following is my code:
import sys,os,csv
x = str(raw_input("Enter file name :")) + '.ASCII'
fp = open(x,'r')
data = fp.read()
fp.close()
found = data.find("STUDENT1")
print found
Here the word "STUDENT1" appears twice, while my code gives only its first index position. I want its second index position too. Similarly, a word may appear several times in a file, so how can I find all of its index positions?
Use the optional start parameter to str.find() to search the string again starting after the previous match:
found = data.find("STUDENT1")
while found != -1:
    print found
    found = data.find("STUDENT1", found+1)
It would be slightly more efficient (but less concise) to use found+len("STUDENT1") instead of found+1.
Alternatively, you could use re.finditer():
import re

for match in re.finditer("STUDENT1", data):
    print match.start()
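If the search term may itself contain regex metacharacters, a small helper (hypothetical, not from the original answer) keeps the finditer approach safe:

import re

def find_all_positions(word, text):
    # re.escape neutralizes any regex metacharacters in the search word.
    return [m.start() for m in re.finditer(re.escape(word), text)]

# data is the file contents read in the question's code
print find_all_positions("STUDENT1", data)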