Passing nested values to class methods in scrapy - scrapy

I'm new to web scraping, please pardon the possible vagueness in my terminology :|
A snippet of an HTML page that I'm trying to write a spider for:
<h3>2019 General Meetings</h3>
<p><strong>Group 20:</strong> <br />Wednesday, June 5, 9 a.m. <br /> Bank & Trust, 10000 E. Western Ave.</p>
<p>Wednesday, July 11, 9 a.m. <br />Bank & Trust, 10000 E. Western Ave.</p>
<p><strong>Group 20:</strong> <br />Monday, July 8, 9 a.m.<br />Hubbard, 1740 W. 199th St.</p>
<p> </p></div>
The logic I'm trying to follow is:
I have the <h3> which is the "top level" (or at least I consider it to be), there are other h3's on the page, so I need to make sure only this <h3> gets passed to the following parsers.
For the above, I'm using
response_items = response.xpath("//h3[contains(#h3, 'General Meetings')]")
And I think I have it working. (But needs more testing to make certain.)
I need to pass each of the <p> to a respective parser within the class, and each should return a required piece of information about the meeting, e.g
_parser_date will return the date, _parser_address will return the address, and so on.
I'm coming short on finding the correct scrapy/xpath syntax for this. Following https://docs.scrapy.org/en/latest/topics/selectors.html I can't get this to work quite well.
I'm particularly interested in each parser to "pick up" on a pattern within the <p>'s it's going to parse, and if it's a date pattern then format it, and return.
If it's a location pattern.. and so on.
I'm trying to avoid using re.(), unless you'd advise it's the right thing to do here.
Any insights would be most welcome,
Thank you.

This should work:
for p_node in response.xpath("//h3[contains(., 'General Meetings')]/following-sibling::p[position() < last()]"):
    # the address is the last text node, the date is the one before it
    address = p_node.xpath('./text()[last()]').get()
    date = p_node.xpath('./text()[last() - 1]').get()
I used position() < last() to skip the last, empty <p>. I also index the text nodes from the end, so the optional "Group 20:" label at the start of a <p> doesn't shift the positions.
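For the "pick up on a pattern" part, one option that avoids re is to let datetime.strptime decide whether a text node is a date. The sketch below is only an illustration: parse_date is a hypothetical helper name, the year is assumed to come from the <h3> heading, the two formats only cover strings shaped like the ones in the snippet, and it assumes an English locale for the day and month names.

```python
from datetime import datetime

# Formats are an assumption based on the sample HTML ("Wednesday, June 5, 9 a.m.").
DATE_FORMATS = ("%A, %B %d, %I %p %Y", "%A, %B %d, %I:%M %p %Y")


def parse_date(text, year):
    """Return a datetime if text looks like a meeting date, otherwise None."""
    if not text:
        return None
    # Collapse whitespace and turn "a.m."/"p.m." into the AM/PM that %p expects.
    cleaned = " ".join(text.split()).replace("a.m.", "AM").replace("p.m.", "PM")
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(f"{cleaned} {year}", fmt)
        except ValueError:
            continue  # not this format, try the next one
    return None  # not a date; probably an address or a label
```

A text node that comes back as None can then be handed to the address parser instead, which keeps each parser responsible for a single pattern.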


Replace part of serialized data "resets" data to standard settings

I'm managing a couple of hundreds of websites and need to change part of a serialized data.
It's a wordpress child theme and inside of the theme's "Options" settings.
Using this script
UPDATE wp_e4e5_options
SET option_value = REPLACE(option_value, 'Copyright | by <a href="https://company.com"', ' ')
a:1:{s:3:"copyright";s:17:"Copyright | by <a href="https://company.com"";}
I was certain, it would just find that part of the serialized data and replace. But it doesn't.
It reset the setting to the theme's standard setting. Even when I edit it manually by using Adminer.php in the table, it resets.
I'm aware, that this might be in the wrong forum, since it's Wordpress related, but I believe it's SQL that's the issue here.
So my question is:
If I edit it manually using Adminer.php (a simple version of phpMyAdmin), it resets all the settings back to standard. How can I edit only part of the serialized data, and only the part shown above?
What makes it "reset" to standard settings?
UPDATE:
Thanks to @Kaperto I got this working code now, which gave me a new issue.
UPDATE wp_e4e51a4870_options
SET option_value = REPLACE(option_value, 's:173:"© Copyright - Company name [nolink] | by <a href="https://company-name.com" target="_blank" rel="nofollow">Company Name</a>";', 's:40:"© Copyright - Company name [nolink]";')
The problem is, it's gonna be used as a code snippet with ManageWP looping through several hundreds of websites which all have different company names. So the first part of the string is unique but the rest is the same on all sites, after the pipe |.
So I somehow need to do this:
Find the whole series that includes this string: | by <a href="https://company-name.com" target="_blank" rel="nofollow">Company Name</a>"
Get the whole series: s:173:"© Copyright - Company name [nolink] | by <a href="https://company-name.com" target="_blank" rel="nofollow">Company Name</a>
Replace it with a new series with an updated character count, since the company name is different
Is this even achievable with pure SQL commands?
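The three steps above can be sketched outside the database. PHP's serialized strings carry a byte count (the 173 in s:173:), so the count has to be rewritten together with the text; a value whose count no longer matches cannot be unserialized, which is the likely reason the theme falls back to its defaults. This is a Python illustration of the idea, not a pure-SQL answer: strip_suffix is a hypothetical helper, and it assumes the option value is UTF-8 and that the suffix is the same on every site.

```python
import re

# The shared tail after the company-specific part, taken from the question.
SUFFIX = (' | by <a href="https://company-name.com" target="_blank" '
          'rel="nofollow">Company Name</a>')


def strip_suffix(serialized, suffix=SUFFIX):
    """Remove suffix from every serialized string ending with it and fix the count."""
    data = serialized.encode("utf-8")  # PHP counts bytes, not characters
    tail = suffix.encode("utf-8")
    out = bytearray()
    pos = 0  # everything before pos has already been copied to out
    for header in re.finditer(rb's:(\d+):"', data):
        if header.start() < pos:
            continue  # looks like a header but sits inside a value already handled
        start = header.end()
        end = start + int(header.group(1))
        if data[end:end + 2] != b'";':
            continue  # declared length does not line up; leave it untouched
        value = data[start:end]
        if tail and value.endswith(tail):
            value = value[:-len(tail)]
        out += data[pos:header.start()]
        out += b's:%d:"%s";' % (len(value), value)
        pos = end + 2
    out += data[pos:]
    return out.decode("utf-8")
```

Because the declared length is used to find where each string ends, the company name can differ per site without the pattern having to know it.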

How to create function for count number of tags in SQL Server

In SQL Server 2014, I need a function that counts the tags in a string and, if the number of opening and closing tags is not equal, adds the missing tag in the appropriate place.
For example, given this page source:
<div>
<ul>
<li>John</li>
<li>sara</li>
<li>mack</li>
<li>jane<li>
</div>
count (<) = count (>)
count (<tag>) = count (</tag>)
if count(<) < count(>) --> add < before element of tag
if count(<tag>) > count(</tag>) then add </tag> in correct position or delete.
The fact, that you tag your question with xml (and the same with your other question you placed shortly) shows clearly, that you have a deep misconception of XML...
XML is not HTML!!!
Many people think, that they are related, almost the same, but - despite the fact, that both use markups in <> brackets - they aren't...
We might discuss XHTML, but your examples (here and in your other question) are not XML-safe... XHTML is a hybrid of HTML and XML. Every element must be closed correctly: no <br> is allowed, only <br/>, and no unclosed tags like <ul> in your example... XML is absolutely strict with character escaping, namespaces and nesting hierarchies.
SQL Server offers great help to deal with valid XML, but this cannot help you out. You have to analyse this on string base. But SQL Server is rather poor with string operations and is - for sure! - not the right tool to analyse poorly designed web pages.
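To illustrate what analysing this "on string base" might look like outside SQL Server, here is a minimal Python sketch using the standard library's html.parser, which tolerates broken markup. TagCounter and unbalanced_tags are hypothetical names; the sketch only counts opening versus closing tags per element and does not try to decide where a missing tag belongs. Void elements written as <br> would be reported as unclosed.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count opening and closing tags per element name."""

    def __init__(self):
        super().__init__()
        self.opened = Counter()
        self.closed = Counter()

    def handle_starttag(self, tag, attrs):
        self.opened[tag] += 1

    def handle_endtag(self, tag):
        self.closed[tag] += 1


def unbalanced_tags(html):
    """Return {tag: opened - closed} for every tag whose counts differ."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    tags = set(parser.opened) | set(parser.closed)
    return {t: parser.opened[t] - parser.closed[t]
            for t in tags if parser.opened[t] != parser.closed[t]}
```

For the sample in the question this reports one unclosed <ul> and two surplus <li> openings, because the last line opens <li> twice and never closes it.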

How to output a plaintext for each loop with Jade view engine

I'm trying to render out a plaintext string for emailing using the Jade view engine. I'm having trouble getting the right syntax for a plaintext output using a for each loop. Works fine with regular HTML, just not the plain text version:
| Bill to:
| #{customer.active_card.name}
|
- each lineitem in invoice.lines
  = lineitem.description
Outputs
Bill to:
Freddy Mac
<p>Line item 1 description</p><p>Line item 2 description</p>
I can't figure out how to format the lineitem.description line so that I get a simple plaintext output so that it would look like this:
Bill to:
Freddy Mac
Line item 1 description
Line item 2 description
Any suggestions on how to tackle this ridiculously obscure edge case for Jade?
Many thanks!
Are you sure the <p> tag isn't really in the lineitem.description variable value itself? I tried your example and didn't get an unexpected <p> tag.
Second, note that if you want plain text, you probably don't want jade's default HTML escaping, so use != instead of =.
For what it's worth, jade is really heavily focused on HTML specifically and using it for plain text is probably going to be annoying. Have you considered an alternate templating language like underscore templates?

What is the maximum number of url parameters that can be added to the exclusion list for google analytics

I set up a profile for Google Analytics. I have several dozen url parameters that various pages use and I want to exclude. Luckily, google has a field you can modify under the general profile settings [Exclude URL Query Parameters:]. Of the several dozen items I have they are all working, and not being considered part of the URL. Except for the parameter propid
I added propid to the comma-separated list on Monday. But every day when I check GA, sure enough, pages are still coming through with that parameter attached.
So, am I trying to exclude too many parameters? I couldn't find any documentation on GA's site to say there was a limit.
here is the exact content of the exclude URL Query parameter field
The reason there are so many is that the bh before me didn't know the difference between GET and POST.
propid,account,pp,kw1,kw2,kw3,sortby,page,msg,sd,ed,ea,ec,sc,subname,subcode,sa,qc,type,code,propid,acct,minbr,maxbr,minfb,maxfb,minhb,maxhb,minrm,maxrm,minst,maxst,minun,maxun,minyb,maxyb,minla,maxla,minba,maxba,minuc,maxuc,card,print,year,type
update
I thought that after more time had passed the "bad data" would fall out of GA. But as of yesterday it is still reporting on the propid querystring value despite my adding that, as well as other variables, to the exclude list.
update2
I found this post on google https://www.google.com/support/forum/p/Google+Analytics/thread?tid=72de4afc7b734c4e&hl=en
It says that the field only allows 255 characters. OK, problem solved. Except my field of values is only 247 characters... ARGGGHH!
Update 3
So here is the code I've added to the googleAnalytics.asp include page that goes at the top of every one of my classic ASP pages. Can anyone see a flaw in the design? I don't care about ANY query string info. (It could have been named *.inc, but I like having IntelliSense working.)
<script type="text/javascript">
<% GAPageDisplayName = REQUEST.ServerVariables("PATH_INFO") %>
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-20842347-1']);
_gaq.push(['_setDomainName', '.sc-pa.com']);
<% if GAPageDisplayName <> "" then %>
_gaq.push(['_trackPageview','<%=GAPageDisplayName %>']);
<% else %>
_gaq.push(['_trackPageview']);
<% end if %>
(function () {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
</script>
Update 4
I'll only accept an answer if you will include something talking to the original question. My question was very specific, I wanted to know exactly the number of characters google allows. Everything I included in my original question body was simply to backfill the question to put everything in context.
Might I suggest an alternate solution to the reliance on manually excluding all of these (and feasibly any string ever used)?
I'd suggest passing a parameter to the _trackPageview call to 'force' the recording of a manually/programmatically set 'page name' value.
Whereas by default, GA records/defines a page based on a unique URL, the inclusion of a pagename parameter would associate all pageviews of a page with that parameter as pageviews to a single page.
For example, standard GA pageview code looks like this: _gaq.push(['_trackPageview']);, whereas the inclusion of a specific page name looks like this: _gaq.push(['_trackPageview', 'Homepage']);. With the latter, presuming that the homepage is at www.site.com, regardless of how that page is accessed GA will always consolidate all pageview stats for it as 'Homepage'. So, www.site.com/index.php, www.site.com/?a=b and www.site.com/?1=2&x=y will always report as 'Homepage' as if it was one page.
The only drawback here is that you need to be incredibly careful around any occurrences of pagination, nested pages, content swapping, site search, or any functionality which may in fact rely on the use of query strings; you may need to consider some logic for how the page name values are output, rather than attempting to define them on a per-page basis, depending on the size of your site(s).
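As a rough illustration of "some logic for how the page name values are output", the page name can be derived from the URL by dropping the query string. This is a Python sketch of the idea only (the question's pages are classic ASP); page_name is a hypothetical helper, and it keeps /index.php and / as separate names unless you map them yourself.

```python
from urllib.parse import urlsplit


def page_name(url):
    """Return the path of url without its query string or fragment."""
    path = urlsplit(url).path
    return path or "/"  # a bare host has an empty path; treat it as the root
```

With this, http://www.site.com/?a=b and http://www.site.com/?1=2&x=y both come out as "/", so any page that really depends on a query parameter (search, pagination) would need its own rule.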
Hope that's helpful!
Do you realize that you have propid listed twice in the exclusion field? Once at the beginning and then again about one-third of the way through. That's the only thing that stands out to me. See what happens if you remove either of these.
You also have type duplicated, so if the above fixes the problem for propid, also consider removing the second type.
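A quick way to check a long exclusion list for repeats is to count the names. The sketch below is a small Python illustration; duplicate_params is a hypothetical helper, and EXCLUDED is the field content copied from the question.

```python
from collections import Counter

# Field content as posted in the question.
EXCLUDED = (
    "propid,account,pp,kw1,kw2,kw3,sortby,page,msg,sd,ed,ea,ec,sc,subname,"
    "subcode,sa,qc,type,code,propid,acct,minbr,maxbr,minfb,maxfb,minhb,maxhb,"
    "minrm,maxrm,minst,maxst,minun,maxun,minyb,maxyb,minla,maxla,minba,maxba,"
    "minuc,maxuc,card,print,year,type"
)


def duplicate_params(field):
    """Return the parameter names that appear more than once, in first-seen order."""
    names = [name.strip() for name in field.split(",") if name.strip()]
    counts = Counter(names)
    return [name for name in counts if counts[name] > 1]


print(duplicate_params(EXCLUDED))  # prints ['propid', 'type']
```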
Google limits the characters in the "Exclude Url Query" field (2048 characters max), not the number of queries. I had the same issue you're having and what I discovered was that I had populated my query string parameter list based on the pagenames in my pages report. Well those pagenames first pass through a view-level lowercase filter that I have set up. And since the "Exclude URL Query" field is case sensitive, some of the parameters were getting through. Hopefully this helps.

ASP Search and Results in a single page

I have a single Classic ASP page that I wish to display a search form and the associated results.
When a user first comes to this page, I want to display a search form and the 10 latest properties. If a user decides to use the search form to retrieve more relevant properties, then I want the default 10 latest properties to be replaced with the user's paged search results.
I was wondering if this is possible/practical within the confines of one page and, if so, whether anyone has any hints on how I could best achieve this.
This is my preliminary code for such a page;
http://gist.github.com/188770
Once again, I'm currently having to patch an existing ASP site until I can redevelop it in something more modern like PHP.
Thank you for any help offered.
Neil.
It's certainly very possible and practical. Typically the solution is to post back to the same page and have code in the page that detects whether you arrived there via a POST or a GET. GET means show the 10 latest properties; POST means do a search and show the results.
if (Request.ServerVariables("REQUEST_METHOD") = "POST") then
    ' arrived via post, get form values and do search
else
    ' arrived via get, show last 10 results
end if
You probably want to display what the user searched for in the form when you display the result:
<label>Street: <input type="text" name="searchStreet" value="<%=Server.HtmlEncode(Request("searchStreet") & "") %>" /></label>
Appending an empty string casts the value to a string, so no error is raised when the key isn't found, e.g. on the first visit.
If you want to you can make the loop prettier:
do until myRecordSet.EOF
%>
<div class='result'>
    <dl>
        <dt><%=myRecordSet("ContentTitle")%></dt>
        <dd><%=myRecordSet("ContentStreet")%></dd>
        <dd><%=myRecordSet("ContentTown")%></dd>
        <dd><%=myRecordSet("ContentPostcode")%></dd>
    </dl>
</div><%
    myRecordSet.MoveNext
loop
You probably want to Server.HtmlEncode there as well...
(ps ASP is actually one year younger than PHP... if you want something modern you might want to look at python, ruby or asp.net mvc before PHP, as it's easier to write bad code in PHP than in any of those. ds)