unexpected EOF while parsing Webscraping (BeautifulSoup) - beautifulsoup

I am trying to run this block of code, which is part of a web-scraping project for real estate data (https://github.com/arturlunardi/webscraping_vivareal/blob/main/scrap_vivareal.ipynb), but I am getting an error in the card loop and cannot think of a solution (I am a beginner at scraping).
# Web-Scraping
for line in soup.findAll(class_="js-card-selector"):
    try:
        full_address=line.find(class_="property-card__address").text.strip()
        address.append(full_address.replace('\n', '')) #Get all address
        if full_address[:3]=='Rua' or full_address[:7]=='Avenida' or full_address[:8]=='Travessa' or full_address[:7]=='Alameda':
            neighbor_first=full_address.strip().find('-')
            neighbor_second=full_address.strip().find(',', neighbor_first)
            if neighbor_second!=-1:
                neighbor_text=full_address.strip()[neighbor_first+2:neighbor_second]
                neighbor.append(neighbor_text)
            else: # Neighborhood not found
                neighbor_text='-'
                neighbor.append(neighbor_text)
        else:
            get_comma=full_address.find(',')
            if get_comma!=-1:
                neighbor_text=full_address[:get_comma]
                neighbor.append(neighbor_text)
            else:
                get_hif=full_address.find('-')
                neighbor_text=full_address[:get_hif]
                neighbor.append(neighbor_text)
File "/tmp/ipykernel_70908/2775873616.py", line 26
^
SyntaxError: unexpected EOF while parsing
Does anyone have any idea what might be going on?

You're not running the full block of code. Every try: must be followed by an except: (or finally:) clause. The code in the GitHub link has one; the code you show yourself running does not, so the parser reaches the end of the input while still expecting it. That's why you get the error.
for line in soup.findAll(class_="js-card-selector"):
    try:
        full_address=line.find(class_="property-card__address").text.strip()
        address.append(full_address.replace('\n', '')) #Get all address
        if full_address[:3]=='Rua' or full_address[:7]=='Avenida' or full_address[:8]=='Travessa' or full_address[:7]=='Alameda':
            neighbor_first=full_address.strip().find('-')
            neighbor_second=full_address.strip().find(',', neighbor_first)
            if neighbor_second!=-1:
                neighbor_text=full_address.strip()[neighbor_first+2:neighbor_second]
                neighbor.append(neighbor_text)
            else: # Neighborhood not found
                neighbor_text='-'
                neighbor.append(neighbor_text)
        else:
            get_comma=full_address.find(',')
            if get_comma!=-1:
                neighbor_text=full_address[:get_comma]
                neighbor.append(neighbor_text)
            else:
                get_hif=full_address.find('-')
                neighbor_text=full_address[:get_hif]
                neighbor.append(neighbor_text)
    except:
        continue
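The loop becomes easier to debug if the string handling is moved out of the try block into a small helper, so that the except clause only covers the lookup that can actually fail. The following is a rough sketch under assumptions, not the notebook author's code: the name extract_neighborhood and the sample addresses are hypothetical, and it deliberately differs from the original in one place (when no hyphen is found it keeps the whole string instead of slicing with -1).

```python
STREET_PREFIXES = ("Rua", "Avenida", "Travessa", "Alameda")

def extract_neighborhood(full_address):
    """Hypothetical helper mirroring the branch logic of the scraping loop."""
    full_address = full_address.strip()
    if full_address.startswith(STREET_PREFIXES):
        # Street-style address: take the text between the first hyphen
        # and the next comma after it.
        first_hyphen = full_address.find('-')
        next_comma = full_address.find(',', first_hyphen)
        if next_comma != -1:
            return full_address[first_hyphen + 2:next_comma]
        return '-'  # neighborhood not found
    comma = full_address.find(',')
    if comma != -1:
        return full_address[:comma]
    hyphen = full_address.find('-')
    # Assumption: with no hyphen either, return the whole string rather than
    # slicing with -1, which would silently drop the last character.
    return full_address[:hyphen].strip() if hyphen != -1 else full_address
```

With a helper like this, the try block in the loop would only need to wrap line.find(class_="property-card__address").text, and the handler could be narrowed to except AttributeError:, since find() returns None when the element is missing.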

Related

Can I get just the raised error from a subprocess?

I am using this to execute a command:
try:
    subprocess.check_output(COMMAND)
except subprocess.CalledProcessError as e:
    print(e.output)
COMMAND raises an exception after writing a lot of standard output. What comes back in e.output is ALL of that output plus the raised error.
Is there a way to return only the raised error?
Inside COMMAND, the message is raised like this:
try:
    for i in range(3):
        print("i is {0}.".format(i))
    x = 1/0
    print("x is {0}".format(x))
except Exception as e:
    print("Message")
    raise
I only want "Message" returned.
Thanks in advance.
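One possible approach, sketched here under assumptions rather than taken from an answer in this thread, is to capture stdout and stderr through separate pipes and have the child write its message to stderr. The CHILD script below is a hypothetical stand-in for COMMAND (modified so that "Message" goes to stderr), and run_and_get_error is an illustrative name; subprocess.run requires Python 3.5 or later.

```python
import subprocess
import sys

# Hypothetical stand-in for COMMAND: progress goes to stdout,
# the message is written to stderr before the exception propagates.
CHILD = r'''
import sys
try:
    for i in range(3):
        print("i is {0}.".format(i))
    x = 1/0
except Exception:
    print("Message", file=sys.stderr)
    raise
'''

def run_and_get_error(cmd):
    """Run cmd and return (returncode, stderr text); stdout is captured separately."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, universal_newlines=True)
    return result.returncode, result.stderr

returncode, err = run_and_get_error([sys.executable, "-c", CHILD])
# The first stderr line is the message; the traceback follows it.
first_line = err.splitlines()[0]
```

Because the traceback is also written to stderr, the sketch picks out the first line; the progress lines printed to stdout never appear in err.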

Snakemake exit a rule during execution

Is there a way to print a helpful message and allow Snakemake to exit the workflow without giving an error? I have this example workflow:
def readFile(file):
    with open(file) as f:
        line = f.readline()
        return(line.strip())

# Note: despite its name, this returns True when the first line is NOT empty.
def isFileEmpty(file):
    with open(file) as f:
        line = f.readline()
        if line.strip() != '':
            return True
        else:
            return False

rule all:
    input: "output/final.txt"

rule step1:
    input: "input.txt"
    output: "out.txt"
    run:
        if readFile(input[0]) == 'a':
            shell("echo 'a' > out.txt")
        else:
            shell("echo '' > out.txt")

rule step2:
    input: "out.txt"
    output: dynamic("output/{files}")
    run:
        i = isFileEmpty(input[0])
        if i:
            shell("echo 'out2' > output/out2.txt")
        else:
            print("Out.txt is empty, workflow ended")

rule step3:
    input: "output/out2.txt"
    output: "output/final.txt"
    run: shell("echo 'final' > output/final.txt")
In step 1, I read the contents of input.txt; if it doesn't contain the letter 'a', an empty out.txt is produced. Step 2 checks whether out.txt is empty. If it's not empty, steps 2 and 3 run and produce final.txt at the end. If it's empty, I want Snakemake to print the message "Out.txt is empty, workflow ended" and exit immediately, without running step 3 and without giving an error message. Right now my code prints the message at step 2 if input.txt is empty, but it still tries to run step 3 and raises MissingOutputException because final.txt is not generated. I understand that this happens because final.txt is one of the input files of rule all, but I'm having trouble writing this workflow because final.txt may or may not be produced.

EOF error using input Python 3

I keep getting an EOF error but am unsure why. I have tried with and without int(), but it makes no difference. I'm using PyCharm 3.4 and Python 3.
Thanks,
Chris
while True:
    try:
        number = int(input("what's your favourite number?"))
        print (number)
        break
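You must complete a try statement with an except (or finally) clause: try declares that an error might occur, and the except clause says how you want to handle it. Without one, the parser reaches the end of the file while still expecting that clause, hence the EOF error.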
while True:
    try:
        number = int(input("what's your favourite number?"))
        print(number)
        break
    except ValueError as e:
        print("Woah, there is an error: {0}".format(e))
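To see that this is a parse-time problem rather than something input() does at run time, the two versions can be handed to Python's own parser. This is a small illustrative sketch; the names INCOMPLETE, COMPLETE and parses are made up for the example. The exact wording of the SyntaxError varies between Python versions, so the sketch only checks whether one is raised.

```python
import ast

# A try block with no except/finally clause, and the same block completed.
INCOMPLETE = "try:\n    number = int('42')\n"
COMPLETE = INCOMPLETE + "except ValueError as e:\n    print(e)\n"

def parses(source):
    """Return True if source is syntactically valid Python, False otherwise."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

incomplete_ok = parses(INCOMPLETE)  # the dangling try is rejected by the parser
complete_ok = parses(COMPLETE)      # adding the except clause makes it valid
```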

Puzzling Python I/O error: [Errno 2] No such file or directory

I'm trying to grab an XML file from a server (using Python 3.2.3), but I keep getting this error that there's "no such file or directory". I'm sure the URL is correct, since it outputs the URL in the error message, and I can copy-n-paste it and load it in my browser. So I'm very puzzled how this could be happening. Here's my code:
import xml.etree.ElementTree as etree

class Blah(object):
    def getXML(self,xmlurl):
        tree = etree.parse(xmlurl)
        return tree.getroot()
    def pregameData(self,url):
        try:
            x = self.getXML('{0}linescore.xml'.format(url))
        except IOError as err:
            x = "I/O error: {0}".format(err)
        return x

if __name__ == '__main__':
    x = Blah()
    l = ['http://gd2.mlb.com/components/game/mlb/year_2013/month_04/day_15/gid_2013_04_15_anamlb_minmlb_1/',
         'http://gd2.mlb.com/components/game/mlb/year_2013/month_04/day_15/gid_2013_04_15_phimlb_cinmlb_1/',
         'http://gd2.mlb.com/components/game/mlb/year_2013/month_04/day_15/gid_2013_04_15_slnmlb_pitmlb_1/'
        ]
    for url in l:
        pre = x.pregameData(url)
        print(pre)
And it always returns this error:
I/O error: [Errno 2] No such file or directory: 'http://gd2.mlb.com/components/game/mlb/year_2013/month_04/day_15/gid_2013_04_15_anamlb_minmlb_1/linescore.xml'
I/O error: [Errno 2] No such file or directory: 'http://gd2.mlb.com/components/game/mlb/year_2013/month_04/day_15/gid_2013_04_15_phimlb_cinmlb_1/linescore.xml'
I/O error: [Errno 2] No such file or directory: 'http://gd2.mlb.com/components/game/mlb/year_2013/month_04/day_15/gid_2013_04_15_slnmlb_pitmlb_1/linescore.xml'
You can copy and paste those URLs and see that the files do exist at those locations. I even copied the files and directories to localhost and tried it there, in case the remote server had some kind of block. It gave me the same errors, so that's not the issue. I wondered whether ElementTree's parse() can't handle HTTP, but the documentation doesn't say anything about that, so I'm guessing that's not the issue either.
UPDATE: As suggested in the comments, I went with using open(), but it still returned the error. Importing & trying urllib.request.urlopen(url) returns an error that AttributeError: 'module' object has no attribute 'request'.
You're correct: xml.etree doesn't automatically download and parse URLs. If you want to do that, you'll need to download the document yourself first (using urllib or requests...).
The documentation explicitly states that parse() takes a filename or file object; if it supported a URL, I'm sure it would say so explicitly. lxml.etree.parse(), for example, does.
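A minimal sketch of the "download first, then parse" approach with the standard library follows. The function name fetch_root and the sample XML are assumptions made for illustration. Note the explicit import urllib.request: a bare import urllib typically does not load the request submodule, which would explain the AttributeError mentioned in the update.

```python
import io
import urllib.request  # import the submodule explicitly
import xml.etree.ElementTree as etree

def fetch_root(url):
    """Download url and return the root element of the parsed XML."""
    with urllib.request.urlopen(url) as response:
        # parse() accepts a file object, and the HTTP response is one.
        return etree.parse(response).getroot()

# Offline illustration that parse() works on any file object (made-up XML).
sample = io.BytesIO(b"<game><linescore inning='1'/></game>")
sample_root = etree.parse(sample).getroot()
```

In the question's code, getXML could call something like fetch_root(xmlurl) instead of passing the URL string directly to etree.parse().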

Python 2.7.3 process.Popen() failures

I'm using python 2.7.3 on a Windows 7 64-bit machine currently (also developing in Linux 32 bit, Ubuntu 12.04) and am having odd difficulties getting python to communicate with the command prompt/terminal successfully. My code looks like this:
import subprocess, logging, platform, ctypes

class someClass (object):
    def runTerminalCommand(self):
        try:
            terminalOrCmdLineCommand = self.toString()
            process = subprocess.Popen(terminalOrCmdLineCommand, shell=True,stdout=subprocess.PIPE,stderr=subprocess.PIPE)
            output = process.stdout.readlines()
            if len(output) < 1:
                logging.exception("{} error : {}".format(self.cmd,process.stderr.readlines()))
                raise ConfigError("cmd issue : {}".format(self.cmd))
            return output
        except ValueError as err:
            raise err
        except Exception as e:
            logging.exception("Unexpected error : " + e.message)
            raise ConfigError("unexpected error")
Now, I know that the self.toString() returned value will process correctly if I enter it manually, so I'm limiting this to an issue with how I'm sending it to the command line via the subprocess. I've read the documentation, and found that the subprocess.check_call() doesn't return anything if it encounters an error, so I'm using .Popen()
The exception I get is,
[date & time] ERROR: Unexpected error :
Traceback (most recent call last):
File "C:\[...]"
raise ConfigError("cmd issue : {}".format(self.cmd))
"ConfigError: cmd issue : [the list self.cmd printed out]"
What I am TRYING to do, is run a command, and read the input back. But I seem to be unable to automate the call I want to run. :(
Any thoughts? (please let me know if there are any needed details I left out)
Much appreciated, in advance.
The docs say:
Use communicate() rather than .stdin.write, .stdout.read or
.stderr.read to avoid deadlocks due to any of the other OS pipe
buffers filling up and blocking the child process.
You could use .communicate() as follows:
from subprocess import Popen, PIPE

p = Popen(cmd, stdout=PIPE, stderr=PIPE)
stdout_data, stderr_data = p.communicate()
if p.returncode != 0:
    raise RuntimeError("%r failed, status code %s stdout %r stderr %r" % (
        cmd, p.returncode, stdout_data, stderr_data))
output_lines = stdout_data.splitlines() # you could also use `keepends=True`
See other methods to get subprocess output in Python.
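For reference, here is the same communicate() pattern wrapped in a function and exercised end to end. It is a sketch under assumptions: it is written for Python 3 rather than the 2.7.3 used in the question, run_command is an illustrative name, and the child commands are trivial stand-ins launched through sys.executable.

```python
import subprocess
import sys

def run_command(cmd):
    """Run cmd, return its stdout as a list of lines, raise on a non-zero exit."""
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE, universal_newlines=True)
    # communicate() drains both pipes, so neither buffer can fill up and block.
    stdout_data, stderr_data = process.communicate()
    if process.returncode != 0:
        raise RuntimeError("%r failed, status code %s, stderr %r" % (
            cmd, process.returncode, stderr_data))
    return stdout_data.splitlines()

lines = run_command([sys.executable, "-c", "print('first'); print('second')"])
```

Checking returncode instead of the length of the output also avoids treating a successful command that prints nothing as a failure, which is what the len(output) < 1 test in the question does.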