Apache Pig: How to use the ignoreBadFiles tag with the LOAD function?

I see there's a tag called ignoreBadFiles for the LOAD function of Apache Pig. I am wondering if someone can show me an example of how to use it.
Here's the link to the JIRA ticket:
https://issues.apache.org/jira/browse/PIG-3404
It discusses the use cases for this tag but does not have an example.
For something like:
LOAD '$inpath' USING AvroStorage();
It would be great if someone could show me how to use this tag with the LOAD function. Thanks a lot for your help!

In addition to getting your AvroStorage('ignore_bad_files') working, you may want to look at setting mapreduce.map.failures.maxpercent. This would give similar results by allowing the job to continue with a certain percentage of mappers (readers) failing.
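For example, a sketch combining both ideas (the 'ignore_bad_files' option string comes from the answer above; whether your AvroStorage build accepts it depends on your Pig version):
-- Let the job survive up to 10% of map tasks (the readers) failing
set mapreduce.map.failures.maxpercent 10;
-- Ask AvroStorage to skip unreadable input files instead of failing the job
data = LOAD '$inpath' USING AvroStorage('ignore_bad_files');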

Related

Splash issue with https://sailing-channels.com/by-subscribers

I'm attempting a scrapy-with-splash project to get a few fields off the website "https://sailing-channels.com/by-subscribers". This site uses JavaScript to retrieve and delete listings as you scroll.
I've not had any luck getting the Splash server to give me the whole set of data, or any of the detailed listings for that matter.
My first question is: can Splash even do this?
I really don't care how I get this data. I would prefer doing it with a program, but any tool that can get me fields from this site into a .csv file would do the job. Anyone have any suggestions?
Thanks for any advice
Why do you want to render it? They have a pretty good API; check https://sailing-channels.com/api/channels/get?sort=subscribers&skip=0&take=5&_=1548520116425. So you can iterate, increasing the skip argument and parsing the JSON each time.
Looks like a very promising way.
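A sketch of that iteration in Python with the requests library (the shape of the returned JSON is an assumption here; inspect one response first):
import requests

base = "https://sailing-channels.com/api/channels/get"
skip, take = 0, 25

while True:
    resp = requests.get(base, params={"sort": "subscribers", "skip": skip, "take": take})
    channels = resp.json()  # assumed to be a JSON array of channel objects
    if not channels:
        break  # no more listings
    for channel in channels:
        print(channel)  # or write the fields you need to a .csv here
    skip += take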

Umbraco: A/B testing, links in structure

I'm having a problem when trying to A/B test certain nodes in my node-tree in Umbraco.
What I want to do is to copy a node in the node-tree to a specific spot and use that B-structure to see which of the structures works best, using Google analytics.
For example we have two node structures, let's call them "Private" and "Sweden".
Their structure, with child nodes and properties, is exactly the same. The only difference between them is the property values (content). The "Private" URL is www.mysite.com/Private and the "Sweden" URL is www.mysite.com/Sweden.
What I would like to do is to change every link on the B-structure, so that it points to its match at the A-structure. The problem is that since it's two different structures, it will have two different alternative links.
In other words, it should be by chance that a visitor enters the B-structure; they are then moved back to the A-structure on the next click.
We control which page loads (either the A-node or the B-node) with scripts, so that each node has a 50% chance; if a visitor lands on the B-node, Google Analytics will save data. What we can't manage is making every link on that page point to the A-node.
I'd appreciate any help I can get.
Regards,
David
There are a couple of ways that seem likely to give you a start at least.
The /config/urlrewriting.config file allows you to set up multiple redirect rules within Umbraco, so a section like the following might work in sending all requests (whether /sweden/pagename/ or /private/pagename/) back to the private structure. Not sure how GA will handle it:
<!-- the name and virtualUrl attributes here are illustrative; adjust the pattern to your site's paths -->
<add name="swedenToPrivate" virtualUrl="^~/sweden/(.*)" rewriteUrlParameter="ExcludeFromClientQueryString" destinationUrl="http://www.mysite.com/private/$1" redirect="Domain" redirectMode="Permanent" ignoreCase="true" />
Secondly, a simple HttpModule (http://support.microsoft.com/kb/307996) can process all page requests and redirect as required - you could do a _gaq.push here directly or indirectly.
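A minimal sketch of that HttpModule idea (the class name, path prefixes, and choice of permanent redirect are illustrative, not from the KB article); register it in web.config under system.webServer/modules:
using System;
using System.Web;

public class AbRedirectModule : IHttpModule
{
    public void Init(HttpApplication app)
    {
        app.BeginRequest += (sender, e) =>
        {
            var context = ((HttpApplication)sender).Context;
            string path = context.Request.Path;
            // Send every B-structure request back to its A-structure twin
            if (path.StartsWith("/sweden/", StringComparison.OrdinalIgnoreCase))
            {
                context.Response.RedirectPermanent("/private/" + path.Substring("/sweden/".Length));
            }
        };
    }

    public void Dispose() { }
}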
I'd be interested to know how you get on - it seems a good area for extension to Umbraco.
I'm not sure I have understood perfectly what you need to do, so please excuse any assumptions that may prove mistaken. Here's what I think:
Since the A and B nodes should share the same HTML content (besides the links, of course), why don't you make the link href attribute dynamic by using a bit of Razor in the template or macro:
@{ var isBNode = CurrentPage.Parent.Name == "Sweden"; }
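A navigation link could then pick its target from that flag, along these lines (a sketch; the Url property is standard Umbraco, but the path replacement is an assumption about your setup):
@foreach (var child in CurrentPage.Children)
{
    <a href="@(isBNode ? child.Url.Replace("/sweden/", "/private/") : child.Url)">@child.Name</a>
}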
A similar approach would work if you are using web forms.
We finally came to the decision to use the alternative-template solution. Since there seems to be no generic solution for my case of this problem, we had to create an alternative template with specific macros to render the different information for every document type we're using.
Creating dynamic links for every page is a hell of a job at this stage in the project, since there are so many pages and links. Also, some links are made in JavaScript, so there's another problem.
I copied the A-structure to another node solely to be able to change property values. There might be a problem logging and tracking the information with Google Analytics, though, so that's the next step for us in this project. In our alternative templates we're getting the property values from the B-structure.
Still, if anyone has a better solution I'd highly appreciate it!
Regards,
David

CasperJS: Disable remote page's javascript but still use casper.evaluate?

Thanks for reading my topic. I'd be really grateful if anyone could suggest any other avenues I should explore to achieve the below.
Using CasperJS or PhantomJS, I need to prevent all JavaScript that belongs to the pages I navigate from being executed, while still being able to run my own using casper.evaluate.
Does anyone know a way I can do this?
Is it possible to modify the HTTP headers or bodies using onResourceRequested or onResourceReceived? Or cancel a request conditionally? Or are they read-only?
Can you modify the raw HTML source before it's offered for parsing?
I've tried hacking a window.stop() into an early casper.evaluate, but this works inconsistently between pages.
Is the PhantomJS WebServer module used for this kind of thing? Could/should I route requests/responses through that and modify them as they pass through?
Thanks for any help - I appreciate this is a weird use case.
As stated here, it is possible, but not with the current PhantomJS master branch; there is a specific dev branch (https://github.com/Vitallium/phantomjs/tree/allow-to-disable-js) you should build from - look for the latest commit for the disable-javascript option.
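Short of building that branch, a partial workaround on stock CasperJS/PhantomJS is to abort every request for an external script file, so the page's own JavaScript never loads (inline script blocks in the HTML will still run, so this is only a partial answer); casper.evaluate keeps working. A sketch:
var casper = require('casper').create();

// Abort any request whose URL looks like an external JavaScript file
casper.on('page.resource.requested', function (requestData, networkRequest) {
    if (/\.js(\?|$)/.test(requestData.url)) {
        networkRequest.abort();
    }
});

casper.start('http://example.com/', function () {
    // Our own injected code still runs in the page context
    this.echo(this.evaluate(function () { return document.title; }));
});

casper.run();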

Get last successful build on Hudson

I was wondering if anyone knows of a way, or a plug-in, to get the last build version with a result of success from a particular Hudson job, using the CLI somehow.
I can see this result is held in the [DateTime]\build.xml file, so I could write something to grab the result, but I was wondering if anyone has done this already or knows of a way to use the CLI to grab this information?
I have tried to find the information in the documentation but was unable to find the answer. If you need any more detail then let me know.
I'm a bit late to the party here, but you can also just use the URL http://localhost:8081/job/jobname/lastSuccessfulBuild to get the last successful build. If you want to extract specific data from that page, you can use http://localhost:8081/job/jobname/lastSuccessfulBuild/api
You can do it with XPATH:
http://localhost:8081/api/xml?depth=2&xpath=/hudson/job/name[text()="JReport2"]/../build/result[text()="SUCCESS"]/../../build[1]/number/text()
In the above example I'm getting the last successful build number of the build named JReport2. You can query your Hudson server via wget or curl, sending it an HTTP GET that is equivalent to that URI.
The XPath expression can be shortened, but in the long form it is easier to understand what's going on.
In general, it is instructive to enter http://<hudson-server>/api/xml in your browser and examine the output.
A correct-looking XPath is:
...&xpath=/hudson/job/name[text()="...name of project..."]/../build/result[text()='SUCCESS']/../number/text()
but it does not work.
A working XPath is:
http://HudsonServer:Port/job/..nameOfProject../lastSuccessfulBuild/api/xml?xpath=//number/text()
The XPath described above:
...&xpath=/hudson/job/name[text()="JReport2"]/../build/result[text()="SUCCESS"]/../../build[1]/number/text()
is not correct, because /../../build[1]/number/text() always gives the first build.
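Putting that together, from a script you can grab the last successful build number with curl (or wget), using the working URL above:
curl "http://HudsonServer:Port/job/..nameOfProject../lastSuccessfulBuild/api/xml?xpath=//number/text()"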

How can I get the full change history for an article on Wikipedia?

I'd like a way to download the content of every revision in the history of a popular article on Wikipedia. In other words, I want to get the full contents of every edit for a single article. How would I go about doing this?
Is there a simple way to do this using the Wikipedia API? I looked and didn't find anything that popped out as a simple solution. I've also looked into the scripts on the PyWikipedia Bot page (http://botwiki.sno.cc/w/index.php?title=Template:Script&oldid=3813) and didn't find anything useful. Some simple way to do it in Python or Java would be best, but I'm open to any simple solution that will get me the data.
There are multiple options for this. You can use the Special:Export special page to fetch an XML stream of the page history. Or you can use the API, found under /w/api.php. Use action=query&titles=$TITLE&prop=revisions&rvprop=timestamp|user|content etc. to fetch the history.
Pywikipedia provides an interface to this, but I do not know by heart how to call it. An alternative library for Python, mwclient, also provides this, via site.pages[page_title].revisions()
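A sketch of the API route in Python with the requests library, written against the current MediaWiki API (the article title is just an example, and rvslots/formatversion are current-API parameters):
import requests

api = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "titles": "Albert Einstein",        # example article
    "prop": "revisions",
    "rvprop": "timestamp|user|content",
    "rvslots": "main",
    "rvlimit": 50,                      # content queries are capped at 50 per request
    "format": "json",
    "formatversion": 2,
}

while True:
    data = requests.get(api, params=params).json()
    for rev in data["query"]["pages"][0].get("revisions", []):
        text = rev["slots"]["main"]["content"]
        print(rev["timestamp"], rev["user"], len(text))
    if "continue" not in data:
        break                           # full history fetched
    params.update(data["continue"])     # follow the continuation cursor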
Well, one solution is to parse the Wikipedia XML dump.
Just thought I'd put that out there.
If you're only getting one page, that's overkill. But if you don't need the very latest information, using the XML dump would have the advantage of being a one-time download instead of repeated network hits.