BeautifulSoup prettify() method displays entire output in one line only

This is my first post on stackoverflow.com, and I have a question about the output displayed in the PyDev-for-Eclipse console for my Python 3 program.
I am using:
Python 3.4
PyDev for Eclipse
Python modules: requests, bs4, pprint
Whenever I run the code below,
from bs4 import BeautifulSoup as BS

html_content = response.content  # response comes from an earlier requests call
bs = BS(html_content, 'html.parser')
page_html = bs.prettify(encoding='utf-8')
print(page_html)
The entire output is displayed on one line, as shown below, instead of in a pretty-printed format:
b'<!DOCTYPE doctype html>\n<html class="no-js" lang="en-US">\n <head>\n <meta charset="utf-8"/> ...<entire output>...
I also tried the pprint() method from the pprint module, but I get the same result, i.e. the entire output displayed on one line.
How do I get the output to be displayed in a pretty-printed format?
Thanks,
skambl

When you specify the encoding argument, you are asking prettify() to encode the output. This gives you a bytes object, recognizable by the leading b before the printed string: b'some value' printed to the console means that you printed a bytes object (in Python 3).
Option 1
print(page_html.decode('utf-8'))
Since you asked for it to be encoded as utf-8, that is what you should decode it as.
Option 2
It seems you actually wanted a string (not a bytes object), so just do:
page_html = bs.prettify() # no encoding parameter
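The one-line display comes down to bytes vs. str: printing a bytes object shows escaped \n characters instead of real line breaks. A quick illustration of that difference, without BeautifulSoup:

```python
text = "<html>\n <head>\n </head>\n</html>"  # a str with real newlines
data = text.encode("utf-8")  # the kind of object prettify(encoding='utf-8') returns

print(data)                  # one line: b'<html>\n <head>\n </head>\n</html>'
print(data.decode("utf-8"))  # decoding restores real line breaks
print(text)                  # a str prints across multiple lines
```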
Additionally, you may want to read the section on Output formatters for more things you can do with the output.
I know you asked this a long time ago, but hopefully the answer is still helpful (in particular, knowing that the leading b'...' marks a bytes object, which you need to decode)! I was searching for something related to bs4, stumbled upon this, and thought I would explain why you saw this behavior. :)

Related

How can I resolve a Unicode error from read_csv?

This is my first time working on a python project outside of school, so bear with me.
When I run the code below, I get the error
"(unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape"
and the IDLE editor highlights the '(' before the argument of pd.read_csv.
I googled the error but got a lot of stuff that went way over my head.
The CSV file in question is an Excel file I saved as CSV. Should I save it some other way?
import pandas as pd
field = pd.read_csv("C:\Users\Glen\Documents\Feild.csv")
I just want to convert my Excel data into a DataFrame, and I don't understand why it was so easy in class and is now so difficult on my home PC.
The problem is with the path: in a normal Python string literal, the \U in "C:\Users" starts an eight-digit Unicode escape, so the string cannot even be parsed. There are two ways to write the path while reading a CSV file:
1- Use double backslashes,
pd.read_csv("C:\\Users\\Glen\\Documents\\Feild.csv")
2- Use forward slashes,
pd.read_csv("C:/Users/Glen/Documents/Feild.csv")
If these do not work, try this one,
pd.read_csv("C:\\Users\\Glen\\Documents\\Feild.csv", encoding='utf-8')
OR
pd.read_csv("C:/Users/Glen/Documents/Feild.csv", encoding='utf-8')
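For completeness, a raw string literal (not shown above) avoids the escape problem as well. A small sketch comparing the three safe spellings, with no file access needed:

```python
# Three equivalent ways to spell a Windows path in Python source:
path_double  = "C:\\Users\\Glen\\Documents\\Feild.csv"   # escaped backslashes
path_forward = "C:/Users/Glen/Documents/Feild.csv"       # forward slashes
path_raw     = r"C:\Users\Glen\Documents\Feild.csv"      # raw string literal

print(path_double)
print(path_raw)
# The two backslash spellings produce the identical string object content.
```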

'UnicodeEncodeError' when using 'set_dataframe'

When using set_dataframe to update my Google Sheets via pygsheets and pandas, I get error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 13: ordinal not in range(128)
This is due to non-ASCII characters in some of the text, e.g. the ñ in "señor".
This happens on executing:
wks.set_dataframe(df, start='A10')
Pandas' to_csv accepts an encoding parameter, e.g. encoding="utf-8"; may I suggest set_dataframe do the same?
wks.set_dataframe(df, start='A10', encoding="utf-8")
I see there's a ticket opened 10 days ago here but is there a workaround?
Solution:
I ran into the same issue, and I think that, more than a bug in the pygsheets module, it would be a limitation as you clearly point out.
What I did to solve this issue was:
def encodeDataFrame(df, encoding='UTF-8'):
    if df is not None and not df.empty:
        for column, series in df.items():
            if type(series.at[0]) == unicode:
                try:
                    encodedSeries = series.str.encode(encoding)
                    df[column] = encodedSeries
                except Exception as e:
                    print 'Could not encode column %s' % column
                    raise
And you can call the function this way:
encodeDataFrame(df)
wks.set_dataframe(df, start='A10')
This might no longer be a good solution because of a change that was made in pygsheets to avoid this issue. See EDIT section below
Explanation:
You solve the issue by encoding the unicode values yourself, before sending them to the set_dataframe function.
This problem comes up whenever you try to use the Worksheet.set_dataframe function, using a dataframe that contains unicode characters that cannot be encoded in ascii (like accents, and many other).
The exception is thrown because the set_dataframe function attempts to cast the unicode values into str values (using the default encoding). For Python 2, the default encoding is ascii and when a character out of the range of ascii is found, the exception is thrown.
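The cast described above can be reproduced directly (Python 3 syntax shown here; in Python 2 the same failure happens implicitly inside the str() cast):

```python
text = u"se\xf1or"  # "señor" -- contains the non-ASCII ñ (U+00F1)

# Encoding with the ascii codec fails exactly as set_dataframe did:
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print("ascii failed:", exc)

# Encoding with UTF-8 succeeds:
print(text.encode("utf-8"))  # b'se\xc3\xb1or'
```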
Some people have suggested reloading the sys module to circumvent this problem, but here is an explanation of why you should not do that.
The other solution I would think of would be to use the pygsheets module in Python 3, where this should no longer be a problem because the default encoding for Python 3 is UTF-8 (see docs)
Bonus:
Ask yourself:
1) Is Unicode an encoding?
2) What is an encoding?
If you hesitated with any of those questions, you should read this article, which gave me the knowledge needed to think of this solution. The time invested was completely worth it.
For more information, you can try this article which links to the previous one at the end.
EDIT:
A change was made 2 days ago (07/26/19) to pygsheets that is supposed to fix this. It looks like the intention is to avoid encoding into the str type, but I would think that this change might try decoding strings into the unicode type from the default ascii encoding, which could also lead to trouble. When/If this change is released it is probably going to be better not to encode anything and pass values as unicode to the set_dataframe function.
EDIT 2:
This change has now been released in version 2.0.2. If my predictions are right, using the encodeDataFrame function I suggested will result in a UnicodeDecodeError, because the Worksheet.set_dataframe function will attempt to decode the str values with the default ascii encoding. So the best way to use the function is not to have any str values in your dataframe; if you have them, decode them to unicode before calling set_dataframe. You could have a mirror version of the function I suggested. It would look something like this:
def decodeDataFrame(df, encoding='UTF-8'):
    if df is not None and not df.empty:
        for column, series in df.items():
            if type(series.at[0]) == str:
                try:
                    decodedSeries = series.str.decode(encoding)
                    df[column] = decodedSeries
                except Exception as e:
                    print 'Could not decode column %s' % column
                    raise

Ipython kernel dies unexpectedly reading large file

I'm reading in a ~3Gb csv using pandas in an ipython notebook. While reading the file, the notebook unexpectedly gives me an error message saying the kernel appears to have died and will restart.
As per several "big data" workflows in python/pandas, I'm reading the files in as follows:
import pandas as pd
tp = pd.read_csv(file_name_cleaned, chunksize=100000, iterator=True, low_memory=False)  # chunksize value illustrative
df = pd.concat(tp,ignore_index=True)
My workflow has involved some preprocessing to remove all but alphanumeric characters and a few pieces of punctuation as follows:
import re

with open(file_name, 'r') as file1:
    with open(file_name_cleaned, 'w') as file2:
        for line in file1:
            if len(line.split(sep_string)) == num_columns:
                line = re.sub(r'[^A-Za-z0-9|._]+', '', line)
                file2.write(line + '\n')
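The cleaning loop can be exercised on an in-memory sample to check what it keeps and drops (sep_string and num_columns here are stand-ins for the real values):

```python
import io
import re

sep_string = "|"
num_columns = 3

sample = io.StringIO("a|b|c\nshort|row\nx!|y@|z#\n")
cleaned = []
for line in sample:
    if len(line.split(sep_string)) == num_columns:
        # keep only alphanumerics, '|', '.', and '_'
        cleaned.append(re.sub(r'[^A-Za-z0-9|._]+', '', line))

print(cleaned)  # ['a|b|c', 'x|y|z'] -- the two-column row is dropped
```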
The strange thing is that if I remove the line containing re.sub(), I get a different error: "Expected 209 fields in line 22236, saw 329", even though I've explicitly checked for the exact number of delimiters. Visual inspection of the line and the surrounding lines doesn't show me much either.
This process has worked fine for several other files, including larger ones, so I don't think the size of the file is the issue, although I suppose that may be an oversimplification.
I included the preprocessing because I know from experience that the data sometimes contains strange special characters. I've also gone back and forth between encoding='utf-8' and encoding='utf-8-sig' in the read_csv() and open() calls, to no real avail.
I have several questions: does including the encoding keyword argument cause Python to ignore characters outside that character set, or does it invoke some kind of conversion for those characters? I'm not very familiar with these types of issues. Is it possible that some unexpected character slipped through my preprocessing and caused this? Is there another type of issue that I haven't found that could cause this? (I have done research, but nothing has been quite right.)
Any help would be much appreciated.
Also, I'm using Anaconda 2.4, with Python 3.5.1, Ipython 4.0.0, and pandas 0.17.0
I'm not sure that this totally answers my questions, but I did solve the issue: while it is slower, using engine='python' in pd.read_csv() did the trick.

lxml clean breaks href attribute

import lxml.html.clean as clean
cleaner = clean.Cleaner(style=True, remove_tags=['div','span',], safe_attrs_only=['href',])
text = cleaner.clean_html('link')
print text
prints
link
how to get:
link
i.e href in normal encoding?
clean does the right thing: the string in the href should be properly encoded, and the seemingly garbled value is that proper encoding.
You might not know it, but Cyrillic domain names don't exist at the DNS level; there is a mapping system (IDNA) that converts them to "allowed" ASCII characters.
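The mapping in question is IDNA ("punycode"), and Python's standard library can show it directly; the domain below is just an illustrative Cyrillic name:

```python
domain = "пример.рф"  # an example Cyrillic domain name

# DNS only carries ASCII labels, so internationalized names
# are mapped to an ASCII form before resolution:
ascii_form = domain.encode("idna").decode("ascii")
print(ascii_form)  # xn--e1afmkfd.xn--p1ai
```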

Encoding issue in I/O with Jena

I'm generating some RDF files with Jena. The whole application works with UTF-8 text, and the source code is stored in UTF-8 as well.
When I print a string containing non-English characters to the console, I get the right output, e.g. Est un lieu généralement officielle assis....
Then, I use the RDF writer to output the file:
Model m = loadMyModelWithMultipleLanguages();
log.info( getSomeStringFromModel(m) );       // log4j, correct output
RDFWriter w = m.getWriter( "RDF/XML" );      // default enc: utf-8
w.setProperty("showXmlDeclaration", "true"); // optional
OutputStream out = new FileOutputStream(pathToFile);
w.write( m, out, "http://someurl.org/base/" );
// file contains garbled text
The RDF file starts with: <?xml version="1.0"?>. If I add encoding="utf-8" to the declaration, nothing changes.
By default the text should be encoded to utf-8.
The resulting RDF file validates ok, but when I open it with any editor/visualiser (vim, Firefox, etc.), non-English text is all messed up: Est un lieu généralement officielle assis ... or Est un lieu g\u221A\u00A9n\u221A\u00A9ralement officielle assis....
(Either way, this is obviously not acceptable from the user's viewpoint).
The same issue happens with any output format supported by Jena (RDF, NT, etc.).
I can't really find a logical explanation to this.
The official documentation doesn't seem to address this issue.
Any hint or tests I can run to figure it out?
My guess would be that your strings are messed up, and your printStringFromModel() method just happens to output them in a way that accidentally makes them display correctly, but it's rather hard to say without more information.
You're instructing Jena to include an XML declaration in the RDF/XML file, but don't say what encoding (if any) Jena declares in the XML declaration. This would be helpful to know.
You're also not showing how you're printing the strings in the printStringFromModel() method.
Also, in Firefox, go to the View menu and then to Character Encoding. What encoding is selected? If it's not UTF-8, then what happens when you select UTF-8? Do you get it to show things correctly when selecting some other encoding?
Edit: The snippet you show in your post looks fine and should work. My best guess is that the code that reads your source strings into a Jena model is broken, and reads the UTF-8 source as ISO-8859-1 or something similar. You should be able to confirm or disconfirm that by checking the length() of one of the offending strings: If each of the troublesome characters like é are counted as two, then the error is on reading; if it's correctly counted as one, then it's on writing.
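The length() diagnostic above can be illustrated quickly (in Python rather than Java, purely for brevity):

```python
s = "généralement"  # correct source text

# Writing UTF-8 bytes but reading them back as ISO-8859-1 produces
# exactly the kind of mojibake seen in the RDF file:
garbled = s.encode("utf-8").decode("iso-8859-1")
print(garbled)               # généralement
print(len("é"), len("Ã©"))   # 1 vs 2 -- the tell-tale doubled length
```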
My hint/answer would be to inspect the byte sequence in three places:
1. The data source. Using a hex editor, confirm that the é character in your source data is represented by the expected UTF-8 byte sequence 0xC3 0xA9.
2. In memory. Right after your call to printStringFromModel, set a breakpoint and inspect the bytes in the string (or convert them to hex and print them out).
3. The output file. Again, use a hex editor to confirm that the byte sequence is 0xC3 0xA9.
This will tell you exactly what happens to the bytes as they travel through your program, and where they deviate from the expected 0xC3 0xA9.
The best way to address this would be to package up the smallest unit of your code that you can that demonstrates the issue, and submit a complete, runnable test case as a ticket on the Jena Jira.