I am using MailChimp's inline-css form at: It does a great job of preparing an HTML file for sending as an email.
I have an API key. I would prefer not to run their PHP app for a single API call. Is it possible to use curl to access their inlineCss API? If so, what is the syntax?
Here is the doc page:
See also line: 2096 of this gist:
My key looks something like:
Here is a start of what I would like to achieve:
curl post -d #input.html apiKey=xxxxxxxx ""
This is what I hacked up, for anyone looking for a similar solution. Comments and other options are welcome:
import os
import re
import urllib
import mechanize
import xml.sax.saxutils as saxutils
from xml.sax.saxutils import unescape
try:
    issueRoot = os.environ['newslettersroot'] + os.environ['currYear'] + '/' + os.environ['issueRoot'] + '/'
except KeyError:
    exit("Please run init.bat")
srcEmailFilename = 'email.html'
dstEmailFilename = 'email_inline_css.html'
# retrieve <body> section only
html = open(issueRoot + srcEmailFilename, 'rb').read()
html = re.findall("(?si)<body.*?</body>", html)[0]
# use mailchimp inlineCss site to inject class rules into html tags
response = mechanize.urlopen("")
# retrieve form
form = mechanize.ParseResponse(response, backwards_compat=False)[0]
form["html"] = html
# form["strip"] = "checked"
# submit form and retrieve result
html = mechanize.urlopen(form.click()).read()
match = re.search('<textarea name="text" cols="100" rows="12">(.*?)</textarea>', html, re.DOTALL | re.IGNORECASE | re.MULTILINE)
if not match:
    print html
    exit("Expected to find output from mailchimp.")
# clean up output
html = match.group(1)
html = saxutils.unescape(html)
html = urllib.unquote_plus(html)
html = unescape(html, {"&apos;": "'", "&quot;": '"'})
html = html.replace('&amp;', '&').replace('%2F', '/').replace('%3A', ':')
# #sed -r 's/ class="[a-zA-Z0-9-]+"//g' %newslettersroot%%currYear%\%issueRoot%\email_inlinedcss.html > %newslettersroot%%currYear%\%issueRoot%\email_removedstyle.html
#replace class tags
html = re.sub(r'(?sim)\s*class="[a-zA-Z0-9-]+"', "", html)
fh = open(issueRoot + dstEmailFilename, 'wb')
fh.write(html)
fh.close()
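For readers on Python 3, where unquote_plus lives in urllib.parse, the clean-up stage of the script above could be sketched roughly as follows. This assumes, as the script implies, that the service returns the inlined HTML both entity-escaped and percent-encoded; the function name clean_inlined_html and the sample input are made up for illustration.

```python
import html
import re
from urllib.parse import unquote_plus


def clean_inlined_html(raw):
    """Undo the escaping applied to the inlined HTML and drop class attributes.

    Hypothetical helper that mirrors the clean-up steps of the script above.
    """
    text = html.unescape(raw)   # &lt; &gt; &quot; &amp; ... -> literal characters
    text = unquote_plus(text)   # %3A -> ':', %2F -> '/', and '+' -> ' '
    # Once the CSS rules are inlined, the class attributes are no longer needed.
    return re.sub(r'\s*class="[a-zA-Z0-9-]+"', "", text)


if __name__ == "__main__":
    sample = "&lt;p class=&quot;lead&quot; style=&quot;color%3Ared&quot;&gt;Hi&lt;/p&gt;"
    print(clean_inlined_html(sample))
```

Note that unquote_plus also turns every literal "+" into a space, just as in the original script, so it is worth checking the result if the newsletter text contains plus signs.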


web scrape does not find the correct tags

I am trying to extract the text of this page: using bs4 and pandas
I start with:
soup = BeautifulSoup(src,'xml')
and see that the text I am interested in is wrapped in p tags,
but when I run soup.find_all('p'), the only return I get is the closing paragraph.
How can I extract the paragraph text within? What am I missing?
These are the paragraphs I am trying to extract:
I tried also with selenium using:
chrome_options = webdriver.ChromeOptions()
chrome_driver = os.getcwd() + "\\chromedriver.exe"
driver = webdriver.Chrome(options = chrome_options, executable_path = chrome_driver)
page = driver.page_source
page_soup = BeautifulSoup(page,'xml')
[a.text for a in div]
I figured it out.
The body of the page comes from a <script> tag that holds JSON, but in an unusual encoding.
That tag has an id of "ng-lseg-state", which indicates this is Angular's custom HTML encoding.
You can target the <script> tag with BeautifulSoup and parse it with the json module.
After that, however, you need to deal with Angular's encoding. One way, a bit crude though, is to chain a series of .replace() calls.
Here's how:
import json
import requests
from bs4 import BeautifulSoup
url = ""
script = BeautifulSoup(requests.get(url).text, "lxml").find("script", {"id": "ng-lseg-state"})
article = json.loads(script.string.replace("&q;", '"'))
main_key = "G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14850033&a;path=news-article"
article_body = article[main_key]["body"]["components"][1]["content"]["newsArticle"]["value"]
decoded_body = (
    article_body
    .replace('&l;', '<')
    .replace('&g;', '>')
    .replace('&q;', '"')
)
print(BeautifulSoup(decoded_body, "lxml").find_all("p")[22].getText())
This outputs:
Essentra plc is a FTSE 250 company and a leading global provider of essential components and solutions.&a;#160; Organised into three global divisions, Essentra focuses on the light manufacture and distribution of high volume, enabling components which serve customers in a wide variety of end-markets and geographies.
However, as I've said, this is not the best approach, as I'm not entirely sure how to deal with a bunch of other characters, namely:
just to name a few. But I've already asked about this.
Here's fully working code based on the answer to my question mentioned above.
import html
import json
import requests
from bs4 import BeautifulSoup
def unescape(decoded_html):
    char_mapping = {
        '&a;': '&',
        '&q;': '"',
        '&s;': '\'',
        '&l;': '<',
        '&g;': '>',
    }
    for key, value in char_mapping.items():
        decoded_html = decoded_html.replace(key, value)
    return html.unescape(decoded_html)
url = ""
script = BeautifulSoup(requests.get(url).text, "lxml").find("script", {"id": "ng-lseg-state"})
payload = json.loads(unescape(script.string))
main_key = "G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14850033&path=news-article"
article_body = payload[main_key]["body"]["components"][1]["content"]["newsArticle"]["value"]
print(BeautifulSoup(article_body, "lxml").find_all("p")[22].getText())
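To see why the earlier output still contained "&a;#160;", it may help to run the two decoding layers on a small string without any network access. The sketch below is a variation rather than the code above: it swaps the chained .replace() calls for a single regex pass, so that a decoded "&" can never be re-read as the start of another escape. The helper name decode_transfer_state and the sample string are made up for illustration.

```python
import html
import re

# Angular's state escapes, as listed in char_mapping above.
TRANSFER_STATE_ESCAPES = {"&a;": "&", "&q;": '"', "&s;": "'", "&l;": "<", "&g;": ">"}
_ESCAPE_RE = re.compile(r"&[aqslg];")


def decode_transfer_state(text):
    """Replace every escape in one left-to-right pass (hypothetical helper)."""
    return _ESCAPE_RE.sub(lambda m: TRANSFER_STATE_ESCAPES[m.group(0)], text)


if __name__ == "__main__":
    sample = "&l;p&g;Tom &a;amp; Jerry&a;#160;&l;/p&g;"
    step1 = decode_transfer_state(sample)
    print(step1)                 # ordinary HTML entities are still present here
    print(html.unescape(step1))  # the second pass resolves &amp; and &#160;
```

The first pass only undoes Angular's layer; entities such as &#160; that were already in the article body need the second pass through html.unescape.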

HTTP Basic Authentication not working with Python 3

I am trying to access an intranet site with HTTP Basic Authentication enabled.
Here's the code I'm using:
from bs4 import BeautifulSoup
import urllib.request, base64, urllib.error
request = urllib.request.Request(url)
string = '%s:%s' % ('username','password')
base64string = base64.standard_b64encode(string.encode('utf-8'))
request.add_header("Authorization", "Basic %s" % base64string)
try:
    u = urllib.request.urlopen(request)
except urllib.error.HTTPError as e:
    print(e)
    raise
soup = BeautifulSoup(u.read(), 'html.parser')
But it doesn't work and fails with 401 Authorization required. I can't figure out why it's not working.
The solution given here works without any modifications.
from bs4 import BeautifulSoup
import urllib.request
# create a password manager
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = ""
password_mgr.add_password(None, top_level_url, username, password)
handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
# create "opener" (OpenerDirector instance)
opener = urllib.request.build_opener(handler)
# use the opener to fetch a URL
u = opener.open(top_level_url)
soup = BeautifulSoup(u.read(), 'html.parser')
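The previous code works as well. In Python 3, standard_b64encode returns bytes, so you have to decode the result to a str first; otherwise the header contains the printed form of a byte sequence (b'...') instead of the encoded credentials.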
from bs4 import BeautifulSoup
import urllib.request, base64, urllib.error
request = urllib.request.Request(url)
string = '%s:%s' % ('username','password')
base64string = base64.standard_b64encode(string.encode('utf-8'))
request.add_header("Authorization", "Basic %s" % base64string.decode('utf-8'))
try:
    u = urllib.request.urlopen(request)
except urllib.error.HTTPError as e:
    print(e)
    raise
soup = BeautifulSoup(u.read(), 'html.parser')
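The difference between the failing and the working header can be checked without contacting any server. The following is a small sketch using the same placeholder credentials as the code above; the variable names broken_header and fixed_header are only for illustration.

```python
import base64

credentials = "username:password"
token = base64.standard_b64encode(credentials.encode("utf-8"))  # bytes in Python 3

broken_header = "Basic %s" % token                 # formats the bytes repr, b'...'
fixed_header = "Basic %s" % token.decode("utf-8")  # formats the plain text

print(broken_header)
print(fixed_header)
```

The first value is what the original code sent, which the server cannot parse as Basic credentials, hence the 401.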
UTF-8 encoding might not work. You can try to use ASCII or ISO-8859-1 encoding instead.
Also, try to access the intranet site with a web browser and check how the Authorization header is different from the one you are generating.
Encode using "ascii". This worked for me.
import base64
import urllib.request
url = "http://someurl/path"
username = "someuser"
token = "239487svksjdf08234"
request = urllib.request.Request(url)
base64string = base64.b64encode((username + ":" + token).encode("ascii"))
request.add_header("Authorization", "Basic {}".format(base64string.decode("ascii")))
response = urllib.request.urlopen(request) # response object; call response.read() for the body

Replace occurrences on html file

I have to replace certain occurrences in thousands of HTML files, and I intend to use a Linux script for this.
Here is an example of the replacements I have to make:
From: <a class="wiki_link" href="/WebSphere+Application+Server">
To: <a class="wiki_link" href="/confluence/display/WIKIHAB1/WebSphere%20Application%20Server">
That is, add /confluence/display/WIKIHAB1 as a prefix and replace "+" with "%20".
I'll do the same for other tags, like img, iframe, and so on.
First, which tool should I use for this? sed? awk? Something else?
If anybody has an example, I would really appreciate it.
After some research I found Beautiful Soup. It's a Python library for parsing HTML files, really easy to use and very well documented.
I had no experience with Python and could write the code without problems.
Here is an example of Python code that does the replacement I mentioned in the question.
import os
from bs4 import BeautifulSoup
#Replaces plus sign(+) by %20 and add /confluence... prefix to each
#href parameter at anchor(a) tag that has wiki_link in class parameter
def fixAnchorTags(soup):
    tags = soup.find_all('a')
    for tag in tags:
        newhref = tag.get("href")
        if newhref is not None:
            if tag.get("class") is not None and "wiki_link" in tag.get("class"):
                newhref = newhref.replace("+", "%20")
                newhref = "/confluence/display/WIKIHAB1" + newhref
                tag['href'] = newhref
#Creates a folder to save the converted files
def setup():
    if not os.path.exists("converted"):
        os.makedirs("converted")
#Run all methods for each html file in the current folder
def run():
    for file in os.listdir("."):
        if file.endswith(".html"):
            print "Converting " + file
            htmlfile = open(file, "r")
            converted = open("converted/"+file, "w")
            soup = BeautifulSoup(htmlfile, "html.parser")
            fixAnchorTags(soup)
            converted.write(str(soup))
            converted.close()
            htmlfile.close()
setup()
run()
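The rewrite rule itself does not depend on the parser, so it can be pulled out and tested on its own before running it over thousands of files. Below is a minimal sketch; the helper name rewrite_wiki_href is hypothetical, and the check that skips already-prefixed links is an addition that is not in the code above.

```python
PREFIX = "/confluence/display/WIKIHAB1"


def rewrite_wiki_href(href, prefix=PREFIX):
    """Return the new href for a wiki_link anchor (hypothetical helper)."""
    if href.startswith(prefix):
        return href  # already converted, so running the script twice is harmless
    return prefix + href.replace("+", "%20")


if __name__ == "__main__":
    print(rewrite_wiki_href("/WebSphere+Application+Server"))
```

Inside fixAnchorTags, the two newhref assignments could then be replaced by a single call to this function.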

Scrapy can't figure out an sql query in ajax call

I am trying to scrape data from this link using scrapy.
I want to automate the ajax call using scrapy.
When I click the "Full SP" button and inspect the request in Firebug, the POST parameters contain an SQL string that looks "strange".
What dialect is this?
My code :
import scrapy
import urllib
class FlatStat(scrapy.Spider):
    name = "flatstat"
    allowed_domains = [""]
    start_urls = [""]
    def parse(self, response):
        query_lst = response.xpath('//table[@id="system"]//tr/td[last()]/text()').extract()
        query_str = ' '.join(query_lst)
        url = ''
        body_dict = {'a_e_max': '9.99',
                     'a_e_min': '0',
                     'arch_min': '0',
                     'exp_min': '0',
                     # copied from the Post parameters by inspecting. Actually I tried everything.
                     'sqlFullString' : u'''Type%20(Rider)%7C%3D%7COrdinary%20(Exclude%20Amatr%2C%20App%2C%20Lady%20Races
                     #I tried copying this from the post parameters as well but no success.
                     #I also tried sql from the table //td text() which is "normal" sql but no success
                     'sqlString': query_str}
        #here i tried everything FormRequest as well though there is no form.
        return scrapy.Request(url, method="POST", body=urllib.urlencode(body_dict), callback=self.parse_page)
    def parse_page(self, response):
        with open("response.html", "w") as f:
            f.write(response.body)
So the questions are:
What is this SQL?
Why isn't it returning the required page? How can I run the right query?
I tried Selenium as well, to click the button and let it do the work itself, but that is another unsuccessful story. :(
It's not easy to say what the website creator is doing with the submitted sqlString. It probably means something very specific to how the data is processed by their backend.
This is an extract of the JavaScript embedded in the page's HTML:
function system_report(type) {
sqlString = '', sqlFullString = '', rowcount = 0;
$('#system tr').each(function() {
if(rowcount > 0) {
var editdata = this.cells[6].innerHTML.split("|");
sqlString += editdata[0] + '|' + editdata[1] + '|' + editdata[7] + '|' + editdata[3] + '|' + editdata[4] + '|' + editdata[5] + '^';
sqlFullString += this.cells[0].innerHTML + '|' + encodeURIComponent(this.cells[1].innerHTML) + '|' + this.cells[2].innerHTML + '|' + this.cells[3].innerHTML + '|' + this.cells[6].innerHTML + '^';
sqlString = sqlString.slice(0, -1)
It looks non-trivial to reverse-engineer.
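As a rough aid to reading the JavaScript extract, here is a Python sketch of the string assembly it appears to perform. It is not a way to obtain the report, only a restatement of the extract under assumptions: the first table row is taken to be a header that is skipped, the function names are made up, and urllib.parse.quote with the given safe characters only approximates encodeURIComponent.

```python
from urllib.parse import quote


def encode_uri_component(text):
    # Approximates JavaScript's encodeURIComponent, which leaves these characters unescaped.
    return quote(text, safe="-_.!~*'()")


def build_sql_strings(rows):
    """rows: table rows as lists of cell strings; rows[0] is assumed to be the header."""
    sql_parts, sql_full = [], ""
    for cells in rows[1:]:
        editdata = cells[6].split("|")
        # Fields 0, 1, 7, 3, 4, 5 of the hidden cell, in that order, as in the extract.
        sql_parts.append("|".join(editdata[i] for i in (0, 1, 7, 3, 4, 5)))
        sql_full += "|".join(
            [cells[0], encode_uri_component(cells[1]), cells[2], cells[3], cells[6]]
        ) + "^"
    # The extract strips the trailing '^' from sqlString only.
    return "^".join(sql_parts), sql_full
```

Seen this way, the "SQL" is a caret-separated list of pipe-delimited filter rows rather than a dialect of SQL, which is consistent with the Type%20(Rider)%7C%3D%7C... value captured in the question.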
Although it's not a solution to your "sql" question above, I suggest that you try Splash (an alternative to Selenium in some cases).
You can launch it with docker (the easiest way):
$ sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash
With the following script:
function main(splash)
  local url = splash.args.url
  -- this clicks the "Full SP" button
  -- loading the report takes some time
  return {
    html = splash:html()
  }
end
you can get the page HTML with the popup of the report.
You can integrate Splash with Scrapy using scrapyjs (a.k.a scrapy-splash)
See with an example how to do so with a custom script.

Sending form data with an HTTP PUT request using Grinder API

I'm trying to replicate the following successful cURL operation with Grinder.
curl -X PUT -d "title=Here%27s+the+title&content=Here%27s+the+content&signature=myusername%3A3ad1117dab0ade17bdbd47cc8efd5b08"
Here's my script:
from net.grinder.script import Test
from net.grinder.script.Grinder import grinder
from net.grinder.plugin.http import HTTPRequest
from HTTPClient import NVPair
import hashlib
test1 = Test(1, "Request resource")
request1 = HTTPRequest(url="")
log = grinder.logger.output
m = hashlib.md5()
class TestRunner:
    def __call__(self):
        params = [NVPair("title","Here's the title"),NVPair("content", "Here's the content")]
        params.sort(key=lambda param: param.getName())
        ps = ""
        for param in params:
            ps = ps + param.getValue() + ":"
        ps = ps + "myapikey"
        m.update(ps)
        params.append(NVPair("signature", ("myusername:" + m.hexdigest())))
        result = request1.PUT()
The test runs okay, but it seems that my script doesn't actually send any of the params data to the API, and I can't work out why. There are no errors generated, but I get a 401 Unauthorized response from the API, indicating that a successful PUT request reached it, but obviously without a signature the request was rejected.
This isn't exactly an answer, more of a workaround that I came up with, that I've decided to post since this question hasn't yet received any responses, and it may help anyone else trying to achieve the same thing.
The workaround is basically to use the httplib and urllib modules to build and make the PUT request instead of the HTTPClient module.
import hashlib
import httplib, urllib
params = [("title", "Here's the title"),("content", "Here's the content")]
params.sort(key=lambda param: param[0])
ps = ""
for param in params:
    ps = ps + param[1] + ":"
ps = ps + "myapikey"
m = hashlib.md5()
m.update(ps)
params.append(("signature", "myusername:" + m.hexdigest()))
params = urllib.urlencode(params)
print params
headers = {"Content-type": "application/x-www-form-urlencoded"}
conn = httplib.HTTPConnection("")
conn.request("PUT", "/api", params, headers)
response = conn.getresponse()
print response.status, response.reason
(Based on the example at the bottom of this documentation page.)
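The signing step can be checked separately from the HTTP call. Below is a Python 3 sketch of the scheme the scripts above seem to follow (sort the parameters by name, join the values with ":", append the API key, take the MD5 hex digest). The function name sign_params is made up, and encoding the payload as UTF-8 is an assumption, since the Python 2 code hashes the byte string directly.

```python
import hashlib
from urllib.parse import urlencode


def sign_params(params, username, api_key):
    """Return the sorted params plus a signature entry (hypothetical helper)."""
    ordered = sorted(params, key=lambda pair: pair[0])
    payload = "".join(value + ":" for _, value in ordered) + api_key
    digest = hashlib.md5(payload.encode("utf-8")).hexdigest()  # UTF-8 is an assumption
    return ordered + [("signature", username + ":" + digest)]


if __name__ == "__main__":
    signed = sign_params(
        [("title", "Here's the title"), ("content", "Here's the content")],
        "myusername",
        "myapikey",
    )
    print(urlencode(signed))
```

The printed body has the same title=...&content=...&signature=myusername%3A... shape as the data passed to curl with -d, apart from the parameter order, which makes it easy to compare against a request that is known to work.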
You can adapt the multipart form posting example in the Grinder script gallery, changing the POST to a PUT. It works for me.
from HTTPClient import Codecs, NVPair
from jarray import zeros
files = ( NVPair("self", ""), )
parameters = ( NVPair("run number", str(grinder.runNumber)), )
# This is the Jython way of creating an NVPair[] Java array
# with one element.
headers = zeros(1, NVPair)
# Create a multi-part form encoded byte array.
data = Codecs.mpFormDataEncode(parameters, files, headers)
grinder.logger.output("Content type set to %s" % headers[0].value)
# Call the version of PUT that takes a byte array.
result = request1.PUT("/upload", data, headers)