How to design a "web spider" with state in Haskell?

I am learning Haskell after years of OOP.
I am writing a simple web spider with a few functions and some state.
I am not sure how to do it right in the FP world.
In OOP world this spider could be designed like this (by usage):
Browser b = new Browser()
b.goto("http://www.google.com/")
String firstLink = b.getLinks()[0]
b.goto(firstLink)
print(b.getHtml())
This code loads http://www.google.com/, then "clicks" the first link, loads the content of the second page, and prints that content.
class Browser {
  goto(url: String) : void // loads HTML from given URL, blocking
  getUrl() : String        // returns current URL
  getHtml() : String       // returns current HTML
  getLinks(): [String]     // parses current HTML and returns a list of available links (URLs)

  private _currentUrl: String
  private _currentHtml: String
}
It's possible to have 2 or more "browsers" at once, each with its own separate state:
Browser b1 = new Browser()
Browser b2 = new Browser()
b1.goto("http://www.google.com/")
b2.goto("http://www.stackoverflow.com/")
print(b1.getHtml())
print(b2.getHtml())
QUESTION: how would you design such a thing in Haskell from scratch (a Browser-like API with the possibility of having several independent instances)? Please give a code snippet.
NOTE: For simplicity, skip the details of the getLinks() function (it's trivial and not interesting). Also, let's assume there is an API function
getUrlContents :: String -> IO String
that opens HTTP connection and returns an HTML for given URL.
UPDATE: why have state (or maybe not)?
The API can have more functions, not just single "load-and-parse results".
I didn't add them to avoid complexity.
Also it could care about HTTP Referer header and cookies by sending them with each request in order to emulate real browser behavior.
Consider the following scenario:
Open http://www.google.com/
Type "haskell" into first input area
Click button "Google Search"
Click link "2"
Click link "3"
Print HTML of current page (google results page 3 for "haskell")
Having a scenario like this in hand, I as a developer would like to translate it into code as closely as possible:
Browser b = new Browser()
b.goto("http://www.google.com/")
b.typeIntoInput(0, "haskell")
b.clickButton("Google Search") // b.goto(b.findButton("Google Search"))
b.clickLink("2") // b.goto(b.findLink("2"))
b.clickLink("3")
print(b.getHtml())
The goal of this scenario is to get HTML of the last page after a set of operations.
Another less visible goal is to keep code compact.
If Browser has a state, it can send HTTP Referer header and cookies while hiding all mechanics inside itself and giving nice API.
If Browser has no state, the developer is likely to pass around all current URL/HTML/Cookies -- and this adds noise to scenario code.
NOTE: I guess there are libraries out there for scraping HTML in Haskell, but my intention was not to scrape HTML; I wanted to learn how these "black-boxed" things can be designed properly in Haskell.

As you describe the problem, there is no need for state at all:
data Browser = Browser { getUrl :: String, getHtml :: String, getLinks :: [String] }

getLinksFromHtml :: String -> [String] -- use Text.HTML.TagSoup; it should be lazy

goto :: String -> IO Browser
goto url = do
  -- assume getUrlContents is lazy, like hGetContents
  html <- getUrlContents url
  let links = getLinksFromHtml html
  return (Browser url html links)
It's possible to have 2 or more "browsers" at once, each with its own separate state:
You obviously can have as many as you want, and they can't interfere with each other.
Now the equivalent of your snippets. First:
htmlFromGooglesFirstLink = do
  b <- goto "http://www.google.com"
  let firstLink = head (getLinks b)
  b2 <- goto firstLink -- note that a new browser is returned
  putStr (getHtml b2)
And second:
twoBrowsers = do
  b1 <- goto "http://www.google.com"
  b2 <- goto "http://www.stackoverflow.com/"
  putStr (getHtml b1)
  putStr (getHtml b2)
UPDATE (reply to your update):
If Browser has a state, it can send HTTP Referer header and cookies while hiding all mechanics inside itself and giving nice API.
Still no need for state; goto can just take a Browser argument. First, we'll need to extend the type:
import Data.Map (Map, empty, findWithDefault)

data Browser = Browser { getUrl :: String, getHtml :: String, getLinks :: [String],
                         getCookies :: Map String String } -- keys are URLs, values are cookie strings

getUrlContents :: String -> String -> String -> IO String
getUrlContents url referrer cookies = ...

goto :: String -> Browser -> IO Browser
goto url browser = do
  let referrer = getUrl browser
      cookies  = findWithDefault "" url (getCookies browser) -- total lookup, unlike (!)
  html <- getUrlContents url referrer cookies
  let links = getLinksFromHtml html
  return (Browser url html links (getCookies browser)) -- don't forget the cookies field

newBrowser :: Browser
newBrowser = Browser "" "" [] empty
If Browser has no state, the developer is likely to pass around all current URL/HTML/Cookies -- and this adds noise to scenario code.
No, you just pass values of type Browser around. For your example,
useGoogle :: IO ()
useGoogle = do
  b <- goto "http://www.google.com/" newBrowser
  let b2 = typeIntoInput 0 "haskell" b
  b3 <- clickButton "Google Search" b2
  ...
Or you can get rid of those variables:
(>>~) = flip fmap -- use for binding pure functions

useGoogle = goto "http://www.google.com/" newBrowser >>~
            typeIntoInput 0 "haskell" >>=
            clickButton "Google Search" >>=
            clickLink "2" >>=
            clickLink "3" >>~
            getHtml >>=
            putStr
Does this look good enough? Note that Browser is still immutable.
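If even passing the Browser value around explicitly still feels noisy, the same immutable record can be threaded by StateT from the transformers package, so the scenario reads almost like the OOP version. Below is a minimal self-contained sketch, not the answer's actual code: fetch is a stand-in for getUrlContents (no real HTTP), and the record is simplified to two fields.

```haskell
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.State

-- simplified record; still completely immutable
data Browser = Browser { curUrl :: String, curHtml :: String }

-- stand-in for the real getUrlContents (no network access here)
fetch :: String -> IO String
fetch url = return ("<html>" ++ url ++ "</html>")

-- StateT hides the plumbing of passing each new Browser along
type Session a = StateT Browser IO a

goto :: String -> Session ()
goto url = do
  html <- lift (fetch url)
  put (Browser url html)

currentHtml :: Session String
currentHtml = gets curHtml

demo :: IO String
demo = evalStateT (goto "http://example.com/" >> currentHtml) (Browser "" "")
```

Nothing here is mutable; StateT merely threads each successive Browser value to the next step for you.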

Don't try to replicate too much object orientation.
Just define a simple Browser type that holds the current URL (in an IORef, for the sake of mutability) and some IO functions that provide access and modification functionality.
A sample program would look like this:
import Control.Monad

main = do
  b1 <- makeBrowser "google.com"
  b2 <- makeBrowser "stackoverflow.com"
  links <- getLinks b1
  b1 `navigateTo` (head links)
  print =<< getHtml b1
  print =<< getHtml b2
Note that if you define a helper function like o # f = f o, you'll have a more object-like syntax (e.g. b1#getLinks).
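That # helper is just reverse function application (the same idea as (&) from Data.Function in newer versions of base). For illustration:

```haskell
-- Reverse application: "object # method" instead of "method object".
infixl 1 #
(#) :: a -> (a -> b) -> b
o # f = f o

-- With the Browser API this would read b1 # getLinks;
-- here is the same idea on a plain value:
example :: Int
example = [3, 1, 2] # maximum
```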
Complete type definitions:
data Browser = Browser { currentUrl :: IORef String }
makeBrowser :: String -> IO Browser
navigateTo :: Browser -> String -> IO ()
getUrl :: Browser -> IO String
getHtml :: Browser -> IO String
getLinks :: Browser -> IO [String]
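One way those signatures might be filled in, as a hedged sketch rather than the answer's own implementation: fetchUrl below is a made-up stand-in for the real getUrlContents, and getLinks is stubbed out since link parsing is out of scope for the question.

```haskell
import Data.IORef

-- only the current URL is mutable; everything else is derived from it
data Browser = Browser { currentUrl :: IORef String }

-- stand-in for getUrlContents (no real HTTP)
fetchUrl :: String -> IO String
fetchUrl url = return ("<html>page at " ++ url ++ "</html>")

makeBrowser :: String -> IO Browser
makeBrowser url = Browser <$> newIORef url

navigateTo :: Browser -> String -> IO ()
navigateTo b url = writeIORef (currentUrl b) url

getUrl :: Browser -> IO String
getUrl b = readIORef (currentUrl b)

getHtml :: Browser -> IO String
getHtml b = getUrl b >>= fetchUrl

getLinks :: Browser -> IO [String]
getLinks b = return [] -- stubbed: a real version would parse getHtml's output
```

Each makeBrowser call allocates its own IORef, so b1 and b2 in the sample program cannot interfere with each other.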

The getUrlContents function already does what goto() and getHtml() would do; the only thing missing is a function that extracts links from the downloaded page. It could take a string (the HTML of a page) and a URL (to resolve relative links) and extract all links from that page:
getLinks :: String -> String -> [String]
From these two functions you can easily build other functions that do the spidering. For example the "get the first linked page" example could look like this:
getFirstLinked :: String -> IO String
getFirstLinked url = do
  page <- getUrlContents url
  getUrlContents (head (getLinks page url))
A simple function to download everything linked from a URL could be:
allPages :: String -> IO [String]
allPages url = do
  page <- getUrlContents url
  otherpages <- mapM getUrlContents (getLinks page url)
  return (page : otherpages)
(Note that this will, for example, follow cycles in the links endlessly; a function for real use should take care of such cases.)
The only "state" used by these functions is the URL, and it is just given to the relevant functions as a parameter.
If there were more information that all the browsing functions needed, you could create a new type to group it all together:
data BrowseInfo = BrowseInfo
  { getUrl     :: String
  , getProxy   :: ProxyInfo
  , getMaxSize :: Int
  }
Functions that use this information could then simply take a parameter of this type and use the contained information. There is no problem in having many instances of these objects and using them simultaneously, every function will just use the object that it is given as a parameter.
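For instance, a fetching function could take the whole BrowseInfo as its single context parameter. In this sketch (repeating the record for self-containment), ProxyInfo, fetchWith, and the truncate-to-getMaxSize policy are all made up for illustration:

```haskell
data ProxyInfo = NoProxy | Proxy String Int

data BrowseInfo = BrowseInfo
  { getUrl     :: String
  , getProxy   :: ProxyInfo
  , getMaxSize :: Int
  }

-- stand-in for an HTTP client call that would honor the proxy setting
fetchWith :: ProxyInfo -> String -> IO String
fetchWith _ url = return ("<html>" ++ url ++ "</html>")

-- all the "state" the function needs arrives in one argument
fetchPage :: BrowseInfo -> IO String
fetchPage info = do
  page <- fetchWith (getProxy info) (getUrl info)
  return (take (getMaxSize info) page) -- respect the size limit
```

Two BrowseInfo values are just two independent records, so simultaneous "instances" come for free.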

how would you design such a thing in Haskell from scratch (a Browser-like API with the possibility of having several independent instances)? Please give a code snippet.
I would use one (Haskell) thread at each point, have all threads running in the State monad with a record type of whatever resources they need, and have results communicated back to the main thread over a channel.
Add more concurrency! That's the FP way.
If I recall correctly, there's a design here for gangs of link checking threads communicating over channels:
http://hackage.haskell.org/package/urlcheck
Also, make sure not to use Strings, but Text or ByteStrings -- they'll be much faster.

Related

Godot/Gdscript serialization of instances

If I want to serialize an array in Godot, I can do this:
var a1 = [1 ,2 ,3]
# save
var file = File.new()
file.open("a.sav", File.WRITE)
file.store_var(a1, true)
file.close()
# load
file.open("a.sav", File.READ)
var a2 = file.get_var(true)
file.close()
print(a1)
print(a2)
output (it works as expected):
[1, 2, 3]
[1, 2, 3]
But if I want to serialize an object, like this class in A.gd:
class_name A
var v = 0
Same test, with an instance of A:
# instance
var a1 = A.new()
a1.v = 10
# save
var file = File.new()
file.open("a.sav", File.WRITE)
file.store_var(a1, true)
file.close()
# load
file.open("a.sav", File.READ)
var a2 = file.get_var(true)
file.close()
print(a1.v)
print(a2.v)
output:
10
error (on line print(a2.v)):
Invalid get index 'v' (on base: 'previously freed instance').
From the online docs:
void store_var(value: Variant, full_objects: bool = false)
Stores any Variant value in the file. If full_objects is true, encoding objects is allowed (and can potentially include code).
Variant get_var(allow_objects: bool = false) const
Returns the next Variant value from the file. If allow_objects is true, decoding objects is allowed.
Warning: Deserialized objects can contain code which gets executed. Do not use this option if the serialized object comes from untrusted sources to avoid potential security threats such as remote code execution.
Isn't it supposed to work with full_objects=true? Otherwise, what's the purpose of this parameter?
My classes contain many arrays of arrays and other stuff. I guess Godot handles this kind of basic serialization functionality (of course, devs will often have to save complex data at some point), so maybe I'm just not doing what I'm supposed to do.
Any idea?
For full_objects to work, your custom type must extend Object (if you don't specify what your class extends, it extends Reference). The serialization will then be based on exported variables (or whatever you declare in _get_property_list). By the way, this can, and in your case likely does, serialize the whole script of your custom type. You can verify this by looking at the saved file.
Thus, full_objects is not useful for serializing a type that extends Resource. Resource serialization instead works with ResourceSaver and ResourceLoader, and also with load and preload. And yes, this is how you would store or load scenes and scripts (and textures, and meshes, and so on…).
I believe the simpler solution for your code is to use the functions str2var and var2str. These will save you a lot of headache:
# save
var file = File.new()
file.open("a.sav", File.WRITE)
file.store_pascal_string(var2str(a1))
file.close()
# load
file.open("a.sav", File.READ)
var a2 = str2var(file.get_pascal_string())
file.close()
print(a1.v)
print(a2.v)
That solution will work regardless of what it is you are storing.
Perhaps this is a solution (I haven't tested it):
# load
file.open("a.sav", File.READ)
var a2 = A.new()
a2 = file.get_var(true)
file.close()
print(a1.v)
print(a2.v)

Elm type confusion

I started on my first simple web app in Elm. Most of my code is currently adapted from https://github.com/rtfeldman/elm-spa-example. I am working against an API that will give me an authToken in the response header. I have an AuthToken type that is supposed to represent that token. Taking the value out of the header and converting it to a result that's either an error String or an AuthToken is causing trouble. I expected that I could just say I am returning an AuthToken, return a String, and it would be fine, because my AuthTokens right now are just Strings. There is clearly something about Elm types I am not understanding.
Here is the definition of AuthToken:
type AuthToken
    = AuthToken String
and my way too complicated function that for now just tries to do some type changes (later I want to also do work on the body in here):
authTokenFromHeader : String -> Http.Response String -> Result String AuthToken
authTokenFromHeader name resp =
    let
        header = extractHeader name resp
    in
        case header of
            Ok header ->
                let
                    token : Result String AuthToken
                    token = Ok (AuthToken header)
                in
                    token

            Err error -> Err error
I expected the happy case would return an Ok result with the string from the response header converted to an AuthToken as its value. Instead I am getting Cannot find variable 'AuthToken'. From the documentation I expected to get a constructor with the same name as the type. If I just use Ok header, the compiler is unhappy because I am returning a Result String String instead of the promised Result String AuthToken.
What's the right approach here?
The code looks fine as is. The error message indicates that the type AuthToken has been defined in a different module and not imported completely into the module that defines authTokenFromHeader. You can read about Elm's module system in the Elm guide: Modules.
A possible fix, assuming that type AuthToken is defined in module Types, and authTokenFromHeader is defined in module Net, is:
Types.elm:
module Types exposing (AuthToken(..))
type AuthToken = AuthToken String
Net.elm:
module Net exposing (authTokenFromHeader)
import Types exposing (AuthToken(..))
authTokenFromHeader : String -> Http.Response String -> Result String AuthToken
authTokenFromHeader name resp =
...
Note the use of AuthToken(..) instead of just AuthToken, which ensures that the type as well as the type constructors are imported/exported.
Or just move the definition of type AuthToken into the same file as the definition of authTokenFromHeader.

Get the uploaded file name in play framework 2.5

I'm creating an image upload API that takes files with POST requests. Here's the code:
def upload = Action(parse.temporaryFile) { request =>
  val file = request.body.file
  Ok(file.getName + " is uploaded!")
}
The file.getName returns something like: requestBody4386210151720036351asTemporaryFile
The question is how I could get the original filename instead of this temporary name. I checked the headers; there is nothing in them. I guess I could ask the client to pass the filename in a header, but shouldn't the original filename be included somewhere in the request?
All the parse.temporaryFile body parser does is store the raw bytes from the body as a local temporary file on the server. This has no semantics in terms of "file upload" as it's normally understood. For that, you need to either ensure that all the other info is sent as query params, or (more typically) handle a multipart/form-data request, which is the standard way browsers send files (along with other form data).
For this, you can use the parse.multipartFormData body parser like so, assuming the form was submitted with a file field with name "image":
def upload = Action(parse.multipartFormData) { request =>
  request.body.file("image").map { file =>
    Ok(s"File uploaded: ${file.filename}")
  }.getOrElse {
    BadRequest("File is missing")
  }
}
Relevant documentation.
It is not sent by default. You will need to send it specifically from the browser. For example, for an input tag, the files property will contain an array of the selected files, with files[0].name containing the name of the first (or only) file. (I see there are possibly other properties besides name, but they may differ per browser and I haven't played with them.) Use a change event to store the filename somewhere so that your controller can retrieve it. For example, I have some jQuery CoffeeScript like:
$("#imageFile").change ->
  fileName = $("#imageFile").val()
  $("#imageName").val(fileName)
The value property also contains a version of the file name, but including the path (which is supposed to be something like "C:\fakepath" for security reasons, unless the site is a "trusted" site afaik.)
(More info and examples abound, W3 Schools, SO: Get Filename with JQuery, SO: Resolve path name and SO: Pass filename for example.)
As an example, this will print the original filename to the console and return it in the view.
def upload = Action(parse.multipartFormData(handleFilePartAsFile)) { implicit request =>
  val fileOption = request.body.file("filename").map {
    case FilePart(key, filename, contentType, file) =>
      print(filename)
      filename
  }
  Ok(s"filename = ${fileOption}")
}

/**
 * Type of multipart file handler to be used by body parser
 */
type FilePartHandler[A] = FileInfo => Accumulator[ByteString, FilePart[A]]

/**
 * A FilePartHandler which returns a File, rather than Play's TemporaryFile class.
 */
private def handleFilePartAsFile: FilePartHandler[File] = {
  case FileInfo(partName, filename, contentType) =>
    val attr = PosixFilePermissions.asFileAttribute(util.EnumSet.of(OWNER_READ, OWNER_WRITE))
    val path: Path = Files.createTempFile("multipartBody", "tempFile", attr)
    val file = path.toFile
    val fileSink: Sink[ByteString, Future[IOResult]] = FileIO.toPath(file.toPath())
    val accumulator: Accumulator[ByteString, IOResult] = Accumulator(fileSink)
    accumulator.map {
      case IOResult(count, status) =>
        FilePart(partName, filename, contentType, file)
    } (play.api.libs.concurrent.Execution.defaultContext)
}

How to use ports inside StartApp in Elm

In my app, which is based on the StartApp package, I have a port to communicate from inside the app to JS. At the moment I call this port using a mailbox:
requestPalette :
  { address : Signal.Address String
  , signal : Signal String
  }
requestPalette = Signal.mailbox ""

requestPaletteFilter : Signal String
requestPaletteFilter =
  Signal.filter (String.isEmpty >> not) "" requestPalette.signal
    |> settledAfter (300 * Time.millisecond)

port request : Signal String
port request = requestPaletteFilter
and using it like this:
[ on "input" targetValue (\str -> Signal.message requestPalette.address str) ]
I wonder if there is a way to do this inside of the update function instead of sending the message from the view.
This applies to Elm 0.16 (and before); in Elm 0.17, subscriptions have changed into ports.
In order to send a signal to a mailbox from an update, you'll need to use StartApp as opposed to StartApp.Simple, since the former allows for Effects in the update function.
At a bare minimum, you're going to probably have an Action like this, which defines a No-Op and an action for sending the string request:
type Action
  = NoOp
  | SendRequest String
Your update function will now include something like the following case for the new SendRequest action. Since you're using StartApp, which deals in Effects, you must call Effects.task, and the Task you're mapping to an Effect must be of type Action, which is why we have the Task.succeed NoOp return value.
update action model =
  case action of
    NoOp ->
      (model, Effects.none)

    SendRequest str ->
      let
        sendTask =
          Signal.send requestPalette.address str
            `Task.andThen` (\_ -> Task.succeed NoOp)
      in
        (model, sendTask |> Effects.task)
Now your click event handler in the view can go back to using the address passed into the view:
[ on "input" targetValue (Signal.message address << SendRequest) ]
I've got a working example of the above in this gist. You'll just need to subscribe to the request port in javascript to see it in action.

Selenium build list of 404s

Is it possible to have Selenium crawl a TLD and incrementally export a list of any 404s found?
I'm stuck on a Windows machine for a few hours and want to run some tests before getting back to the comfort of *nix...
I don't know Python very well, nor any of its commonly used libraries, but I'd probably do something like this (using C# code for the example, but the concept should apply):
// WARNING! Untested code here. May not completely work, and
// is not guaranteed to even compile.
// Assume "driver" is a validly instantiated WebDriver instance
// (browser used is irrelevant). This API is driver.get in Python,
// I think.
driver.Url = "http://my.top.level.domain/";
// Get all the links on the page and loop through them,
// grabbing the href attribute of each link along the way.
// (Python would be driver.find_elements_by_tag_name)
List<string> linkUrls = new List<string>();
ReadOnlyCollection<IWebElement> links = driver.FindElements(By.TagName("a"));
foreach(IWebElement link in links)
{
// Nice side effect of getting the href attribute using GetAttribute()
// is that it returns the full URL, not relative ones.
linkUrls.Add(link.GetAttribute("href"));
}
// Now that we have all of the link hrefs, we can test to
// see if they're valid.
List<string> validUrls = new List<string>();
List<string> invalidUrls = new List<string>();
foreach(string linkUrl in linkUrls)
{
HttpWebRequest request = WebRequest.Create(linkUrl) as HttpWebRequest;
request.Method = "GET";
// For actual .NET code, you'd probably want to wrap this in a
// try-catch, and use a null check, in case GetResponse() throws,
// or returns a type other than HttpWebResponse. For Python, you
// would use whatever HTTP request library is common.
// Note also that this is an extremely naive algorithm for determining
// validity. You could just as easily check for the NotFound (404)
// status code.
HttpWebResponse response = request.GetResponse() as HttpWebResponse;
if (response.StatusCode == HttpStatusCode.OK)
{
validUrls.Add(linkUrl);
}
else
{
invalidUrls.Add(linkUrl);
}
}
foreach(string invalidUrl in invalidUrls)
{
// Here is where you'd log out your invalid URLs
}
At this point, you have a list of valid and invalid URLs. You could wrap this all up into a method that you could pass your TLD URL into, and call it recursively with each of the valid URLs. The key bit here is that you're not using Selenium to actually determine the validity of the links. And you wouldn't want to "click" on the links to navigate to the next page, if you're truly doing a recursive crawl. Rather, you'd want to navigate directly to the links found on the page.
There are other approaches you might take, like running everything through a proxy, and capturing the response codes that way. It depends a little on how you expect to structure your solution.