SyntaxError: EOL while scanning string literal in document word count code

I've been handed some code by my lecturer which is supposed to work out the word count for my document (markdown cells only). When I used it, it worked out the digit count, not the word count. I believe the problem is located in the penultimate line's .split(); the code initially had '' in the brackets, which I removed to make use of the default (split on whitespace), but now I get an error.
Any help greatly appreciated, novice coder problems.
import io
from nbformat import read
filepath = 'DSRM Report 2.ipynb'
with io.open(filepath, 'r', encoding='utf-8') as f:
    nb = read(f, 4)
word_count = 0
for cell in nb['cells']:
    if cell.cell_type == 'markdown':
        word_count += len(cell['source'].replace('#',').lstrip().split())
print("Submission length is {}".format(word_count))

Cannot move a text object (variable) outside a function

I am trying to first convert PDF credit card statements to text and then use regex to extract dates, amounts, and vendors from the individual lines. I can extract all the lines of text as they appear on the statement, but when I call the variable with the text file, it only returns the last line.
I set the directory and read in the PDF credit card statement as "dfpdf".
I run this code:
with plumb.open(dfpdf) as pdf:
    pages = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        global line
        for line in text.split('\n'):
            print(line)
This returns all the lines in the statement, which is what I want. But if I later call or try to print "line", all I get is the last line of the statement. In addition to what is probably a really simple answer, I would also love a suggestion for a really good tutorial or class on using Python to convert PDFs and then using regex to build pandas data frames. Thanks to all of you out there who know what you're doing and take the time to help amateurs like me. Mark
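A note on why this happens: a for-loop variable is rebound on every iteration, so once the loop finishes it holds only the last value assigned to it, and global does not change that. A minimal sketch that collects every line into a list instead (the plumb alias and dfpdf path are taken from the question):

import pdfplumber as plumb

all_lines = []                      # accumulate every line instead of overwriting one name
with plumb.open(dfpdf) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        if text:                    # extract_text() can return None on image-only pages
            all_lines.extend(text.split('\n'))

# all_lines now holds every line from every page, ready for regex parsing
for line in all_lines:
    print(line)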

I need to replace non-ASCII characters in a pandas data frame column in Python 2.7

This question has been asked many times, but none of the solutions worked for me.
The data frame was pulled from a third party excel file with 'UTF-8' encoding:
pd.read_excel(file, encoding = 'UTF-8', sheet_name = worksheet)
But I still have characters like " ’ " instead of " ' " in some lines.
On the top of the code I have the following
# -*- encoding: utf-8 -*-
The following line does not throw errors, but it does not change anything in the data:
df['text'] = df['text'].str.replace("’","'")
I tried with a dictionary (which has the same core), like:
repl_dict = {"’": "'"}
for k, v in repl_dict.items():
    df.loc[df.text.str.contains(k), 'text'] = df.text.str.replace(pat=k, repl=v)
and tried many other approaches including regex, but nothing worked.
When I tried:
def replace_apostrophy(text):
    return text.replace("’","'")
df['text'] = df['text'].apply(lambda x: replace_apostrophy(x))
I received the following error -
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
When I tried:
df["text"] = df["text"].apply(lambda text: unicodedata.normalize('NFKD', text))
I got the following error -
TypeError: normalize() argument 2 must be unicode, not float
The text also has emojis that I will afterwards need to count somehow.
Can someone give me a good advice?
Thank you very much!
I have found a solution myself. It might look clumsy, but it works perfectly in my case:
df["text"] = df["text"].apply(lambda text: unicodedata.normalize('NFKD', text).encode('ascii','backslashreplace'))
I had to replace NaN values prior to running that code.
That operation gives me ASCII-only symbols that can be easily replaced:
def replace_apostrophy(text):
    return text.replace("a\u0302\u20acTM","'")
Hope this helps someone.
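For the record, "’" is the classic mojibake you get when the UTF-8 bytes of a right single quote (U+2019) are decoded as cp1252/latin-1, which also explains the UnicodeDecodeError above: the column mixes byte strings and unicode, and Python 2 falls back to ASCII when combining them. A hedged sketch of the round-trip repair (fix_mojibake is a hypothetical helper, not part of the answers above, and cp1252 is an assumption about the source encoding):

# -*- encoding: utf-8 -*-
def fix_mojibake(text):
    if not isinstance(text, basestring):   # skip NaN and other non-strings
        return text
    try:
        if isinstance(text, str):          # Python 2 byte string: decode explicitly first
            text = text.decode('utf-8')
        # Re-encode the mis-decoded characters back to their original bytes,
        # then decode those bytes as UTF-8, the encoding they really were.
        return text.encode('cp1252').decode('utf-8')
    except (UnicodeDecodeError, UnicodeEncodeError):
        return text                        # leave anything that doesn't round-trip alone

df['text'] = df['text'].apply(fix_mojibake)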

How to remove illegal characters so a dataframe can write to Excel

I am trying to write a dataframe to an Excel spreadsheet using ExcelWriter, but it keeps returning an error:
openpyxl.utils.exceptions.IllegalCharacterError
I'm guessing there's some character in the dataframe that ExcelWriter doesn't like. It seems odd, because the dataframe is formed from three Excel spreadsheets, so I can't see how there could be a character that Excel doesn't like!
Is there any way to iterate through a dataframe and replace characters that ExcelWriter doesn't like? I don't even mind if it simply deletes them.
What's the best way of removing or replacing illegal characters in a dataframe?
Based on Haipeng Su's answer, I added a function that does this:
dataframe = dataframe.applymap(lambda x: x.encode('unicode_escape').decode('utf-8') if isinstance(x, str) else x)
Basically, it escapes the unicode characters if they exist. It worked and I can now write to Excel spreadsheets again!
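One caveat worth adding (my note, not from the original answer): unicode_escape rewrites every non-ASCII character as a literal backslash escape, so the visible cell text itself changes. A quick demonstration:

s = 'café'
print(s.encode('unicode_escape').decode('utf-8'))  # prints caf\xe9, not café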
The same problem happened to me. I solved it as follows:
Install the Python package xlsxwriter:
pip install xlsxwriter
Replace the default engine 'openpyxl' with 'xlsxwriter':
dataframe.to_excel("file.xlsx", engine='xlsxwriter')
Trying a different Excel writer engine solved my problem:
writer = pd.ExcelWriter('file.xlsx', engine='xlsxwriter')
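For completeness, a short sketch of using that writer object (the context manager saves and closes the file; the sheet name is a placeholder):

import pandas as pd

with pd.ExcelWriter('file.xlsx', engine='xlsxwriter') as writer:
    dataframe.to_excel(writer, sheet_name='Sheet1', index=False)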
If you don't want to install another Excel writer engine (e.g. xlsxwriter), you may try to remove these illegal characters by looking at the pattern that causes the IllegalCharacterError to be raised.
Open cell.py, found at /path/to/your/python/site-packages/openpyxl/cell/, and look for the check_string function; you'll see it uses a defined regular expression pattern, ILLEGAL_CHARACTERS_RE, to find those illegal characters. Locating its definition, you'll see this line:
ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')
This line is what you need to remove those characters. Copy it into your program and execute the code below before your dataframe is written to Excel:
dataframe = dataframe.applymap(lambda x: ILLEGAL_CHARACTERS_RE.sub(r'', x) if isinstance(x, str) else x)
The above line will remove those characters in every cell.
But the origin of these characters may be a problem. As you say, the dataframe comes from three Excel spreadsheets. If the source spreadsheets contain those characters, you will still face this problem. So if you can control the generation process of the source spreadsheets, try to remove these characters there to begin with.
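Putting that answer together, a self-contained sketch (assuming Python 3 strings; the dataframe and file names are placeholders, and the regex is copied from openpyxl's cell.py as described above):

import re

# Pattern copied from openpyxl/cell/cell.py, as described above
ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')

def strip_illegal(value):
    # Drop the control characters openpyxl refuses to write; leave non-strings alone
    return ILLEGAL_CHARACTERS_RE.sub('', value) if isinstance(value, str) else value

dataframe = dataframe.applymap(strip_illegal)
dataframe.to_excel('file.xlsx', engine='openpyxl')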
I was also struggling with some weird characters in a data frame when writing it to html or csv. For example, for characters with accents, I couldn't write to an html file, so I needed to convert them into characters without the accent.
My method may not be the best, but it helps me convert unicode strings into ASCII-compatible ones.
# install unidecode first
from unidecode import unidecode

def FormatString(s):
    if isinstance(s, unicode):
        try:
            s.encode('ascii')
            return s
        except:
            return unidecode(s)
    else:
        return s

df2 = df1.applymap(FormatString)
In your situation, if you just want to get rid of the illegal characters, change return unidecode(s) to return whatever replacement string you want.
Hope this can give you some ideas to deal with your problem.
You can use the built-in strip() method for Python strings (note that strip() only trims leading and trailing whitespace, so it will not remove control characters embedded in the middle of a cell).
For each cell:
text = str(illegal_text).strip()
For the entire data frame:
dataframe = dataframe.applymap(lambda t: str(t).strip())
If you're still struggling to clean up the characters, this worked well for me:
import xlwings as xw
import pandas as pd
df = pd.read_pickle('C:\\Users\\User1\\picked_DataFrame_notWriting.df')
topath = 'C:\\Users\\User1\\tryAgain.xlsx'
wb = xw.Book(topath)
ws = wb.sheets['Data']
ws.range('A1').options(index=False).value = df
wb.save()
wb.close()

Making a list of traditional Chinese characters from a string

I am currently trying to estimate the number of times each character is used in a large sample of traditional Chinese characters. I am interested in characters, not words. The file also includes punctuation and Western characters.
I am reading in an example file of traditional Chinese characters. Here is a small subset:
首映鼓掌10分鐘 評語指不及《花樣年華》
該片在柏林首映,完場後獲全場鼓掌10分鐘。王家衛特別為該片剪輯「柏林版本
增減20處 趙本山香港戲分被刪
在柏林影展放映的《一代宗師》版本
教李小龍武功 葉問決戰散打王
另一增加的戲分是開場時葉問(梁朝偉飾)
My strategy is to read each line, split each line into a list, and go through and check each character to see if it already exists in a list or a dictionary of characters. If the character does not yet exist in my list or dictionary, I will add it; if it does exist, I will increase the counter for that specific character. I will probably use two lists: a list of characters and a parallel list containing the counts. This will be more processing, but should also be much easier to code.
I have not gotten anywhere near this point yet.
I am able to read in the example file successfully. Then I am able to make a list for each line of my file. I am able to print out those individual lines into my output file and sort of reconstitute the original file, and the traditional Chinese comes out intact.
However, I run into trouble when I try to make a list of each character on a particular line.
I've read through the following article. I understood many of the comments, but unfortunately, was unable to understand enough of it to solve my problem.
How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?
My code looks like the following:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
wordfile = open('Chinese_example.txt', 'r')
output = open('Chinese_output_python.txt', 'w')
LINES = wordfile.readlines()
Through various tests I am sure the following line is not splitting the string LINES[0] into its component Chinese characters.
A_LINE = list(LINES[0])
output.write(A_LINE[0])
It turns out you want to use this, from answerer 'flow' at How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?:
from re import compile as _Re
_unicode_chr_splitter = _Re( '(?s)((?:[\ud800-\udbff][\udc00-\udfff])|.)' ).split
def split_unicode_chrs( text ):
    return [ chr for chr in _unicode_chr_splitter( text ) if chr ]
to successfully split a line of traditional Chinese characters. I just had to know the proper syntax to handle encoded characters; pretty basic:
my_new_list = list(unicode(LINES[0].decode('utf8')))
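As a closing note, on Python 3 this whole exercise becomes much shorter: strings iterate character by character (no surrogate-pair regex needed for this text), and collections.Counter implements exactly the add-or-increment bookkeeping described in the question. A minimal sketch, assuming the file is UTF-8:

from collections import Counter

counts = Counter()
with open('Chinese_example.txt', encoding='utf-8') as wordfile:
    for line in wordfile:
        # Count every non-whitespace character, punctuation included
        counts.update(ch for ch in line if not ch.isspace())

for char, n in counts.most_common(20):
    print(char, n)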

How to find a word in an ASCII file using Python

I want to find a word and its index, but the problem is that I am only getting its first position while the word appears more than once in the file. The file's content is:
[MAKE DATA:STUDENT1=AENIE:AGE14,STUDENT2=JOHN:AGE15,STUDENT3=KELLY:AGE14,STUDENT4=JACK:AGE16,STUDENT5=SNOW:AGE16;SET RECORD:STUDENT1=GOOD,STUDENT2=,STUDENT3=BAD,STTUDENT4=,STUDENT5=GOOD]
Following is my code:
import sys,os,csv
x = str(raw_input("Enter file name :")) + '.ASCII'
fp = open(x,'r')
data = fp.read()
fp.close()
found = data.find("STUDENT1")
print found
Here the word "STUDENT1" appears twice, while my code gives only its first index position. I want its second index position too. Similarly, a word may appear several times in a file, so how can I find all of its index positions?
Use the optional start parameter to str.find() to search the string again starting after the previous match:
found = data.find("STUDENT1")
while found != -1:
    print found
    found = data.find("STUDENT1", found+1)
It would be slightly more efficient (but less concise) to use found+len("STUDENT1") instead of found+1.
Alternatively, you could use re.finditer():
import re
for match in re.finditer("STUDENT1", data):
    print match.start()
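If you need all of the positions at once, the while-loop answer collapses naturally into a small helper; a minimal sketch (find_all is a hypothetical name, not from the answers above, written with print() so it runs on Python 2 and 3):

def find_all(haystack, needle):
    # Collect every index at which needle occurs in haystack
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        # Skip past the whole match to avoid overlapping hits
        start = haystack.find(needle, start + len(needle))
    return positions

print(find_all(data, "STUDENT1"))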