Crawler4J seed URL gets encoded and an error page is crawled instead of the actual page - urlencode

I am using crawler4j to crawl user profiles on GitHub. For instance, I want to crawl this URL: https://github.com/search?q=java+location:India&p=1
For now I am adding this hard-coded URL in my crawler controller like this:
String url = "https://github.com/search?q=java+location:India&p=1";
controller.addSeed(url);
When crawler4j starts, the URL it crawls is:
https://github.com/search?q=java%2Blocation%3AIndia&p=1
which gives me an error page.
What should I do? I have tried passing an already-encoded URL, but that doesn't work either.

I eventually had to make a small change to the crawler4j source code:
File name: URLCanonicalizer.java
Method: percentEncodeRfc3986
I commented out the first line of this method and was then able to crawl and fetch my results:
//string = string.replace("+", "%2B");
My URL contained a + character, which was being replaced by %2B, so I was getting an error page. I wonder why they specifically replace the + character before encoding the entire URL.
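One possible explanation for the different result, sketched in Python rather than Java: in a query string a bare + is conventionally decoded as a space, while %2B decodes to a literal plus sign, so the two URLs ask the server for different search terms. The snippet uses only the standard urllib.parse module. It illustrates the encoding rule and is not crawler4j's actual code.

```python
from urllib.parse import quote, unquote_plus

original_query = "java+location:India"

# Percent-encoding every reserved character turns "+" into "%2B" and ":" into
# "%3A", which matches the query the crawler ended up requesting.
encoded_query = quote(original_query, safe="")

# Form-style decoding, as commonly applied to query strings, treats a bare "+"
# as a space but "%2B" as a literal plus sign.
decoded_original = unquote_plus(original_query)
decoded_encoded = unquote_plus(encoded_query)

print(encoded_query)     # java%2Blocation%3AIndia
print(decoded_original)  # java location:India
print(decoded_encoded)   # java+location:India
```

Under that convention the seed URL searches for "java location:India", whereas the canonicalized URL searches for the literal string "java+location:India".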

Related

Github Enterprise Raw URL Gist Unable to Download

I'm able to get a list of gists and their files https://api.git.mygithub.net/users/myuser/gists?per_page=100&page=1 which I found using the docs here: https://docs.github.com/en/free-pro-team#latest/rest/reference/gists#get-a-gist
The files on the gist object have a raw_url. If I fetch the raw_url with the same token, it fails, asking me to authenticate. If I add the header Accept: application/vnd.github.v3.raw, it returns a 406 Not Acceptable. I've seen references to that header around.
I'm not sure what the scope should be on the token. It seems like it should be the same one I used to access the API. In the UI, if you click the raw file, a token gets appended to the URL. That token doesn't look like one of the private tokens mentioned here: https://docs.github.com/en/free-pro-team#latest/github/authenticating-to-github/creating-a-personal-access-token
So what is the format of the HTTP request to download the raw gist?
In the raw URL, the gist. part of the hostname needs to be changed to raw., and the URL path needs to start with /gist/.
Example code in Go fixing it:
url := gistFile.RawUrl
url = strings.Replace(url, "gist.", "raw.", 1)
url = strings.Replace(url, ".net/", ".net/gist/", 1)

Getting the main URL on which an error occurred in Yii 1

We have implemented an error handler for Yii 1, along with mail functionality so that an email is sent to us whenever an error occurs. The problem is that we are not getting the URL of the page on which the error is generated. A single controller/action page can contain many images, favicons, etc., so if an image is missing, we get the URL of the image that returned the 404 from:
$url = Yii::app()->createAbsoluteUrl(Yii::app()->request->url);
But we are not getting the current page URL, not even in $error = Yii::app()->errorHandler->error.
So we cannot tell which page the missing image belongs to. Please let me know if there is any way to get the current page URL. I have tried many approaches, but they all return the missing image's URL instead of the URL of the main page the image is missing from.
createAbsoluteUrl() expects a route as its first argument; it may return unexpected results if you provide a URL instead of a route (as in your code snippet).
If you want the absolute URL of the current request, you can combine getHostInfo() and getUrl():
$url = Yii::app()->request->getHostInfo() . Yii::app()->request->getUrl();
In case of an error, you can get the current page URL using Yii::app()->request->requestUri in Yii 1.

Downloading a publicly-shared file from OneDrive

When I create a share link in the UI with the "Anyone with this link can view this item" option, I get a URL that looks like https://onedrive.live.com/redir?resid=XXX!YYYY&authkey=!ZZZZZ&ithint=<contentType>. What I can't figure out is how to use this URL from code to download the content of the file. Hitting the link gives HTML for a page to show the file.
How can I construct a call to download the file? Also, is there a way to construct a call to get some (XML/JSON) metadata about the file, and maybe even a preview or something? I want to be able to do this all without prompting a user for credentials, and all the API docs are about how to make authenticated calls. I want to make anonymous calls to get publicly-shared files.
Have a read over https://dev.onedrive.com - it documents how you can make a query to our service to get the metadata for an item, along with URLs that can be used to directly download the content.
Update with more details
Sorry, the documentation you need for your specific scenario is still in progress (along with the associated SDK changes), so I'll give you an overview of how to do it.
There's a sibling to the /drives path called /shares which accepts a sharing URL (such as the one you have above) in an encoded format and allows you to get metadata for the item it represents. This does not require authentication provided the sharing URL has a valid authkey.
The encoding scheme for the id is u!<UrlSafeBase64EncodedUrl>, where <UrlSafeBase64EncodedUrl> follows the guidelines outlined here (trim the = characters from the end).
Here's a snippet that should give you an idea of the whole process:
string originalUrl = "https://onedrive.live.com/redir?resid=XXX!YYYY&authkey=!foo";
byte[] urlAsUtf8Bytes = Encoding.UTF8.GetBytes(originalUrl);
string utf8BytesAsBase64String = Convert.ToBase64String(urlAsUtf8Bytes);
string encodedUrl = "u!" + utf8BytesAsBase64String.TrimEnd('=').Replace('/', '_').Replace('+', '-');
string metadataUrl = "https://api.onedrive.com/v1.0/shares/" + encodedUrl + "/root";
From there you can append /content if you want to get the contents of the file, or you can start navigating through it if the URL represents a folder (e.g. /children/childfile.txt).
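The same encoding steps can be sketched in Python. This is a minimal illustration using only the standard library. The function names encode_sharing_url and shares_metadata_url are hypothetical, and the endpoint is the one quoted in the snippet above, which may have changed since the answer was written.

```python
import base64


def encode_sharing_url(sharing_url: str) -> str:
    """Return the u!<UrlSafeBase64EncodedUrl> form of a sharing URL."""
    # urlsafe_b64encode already maps "+" to "-" and "/" to "_".
    encoded = base64.urlsafe_b64encode(sharing_url.encode("utf-8")).decode("ascii")
    # Trim the "=" padding from the end, as the scheme requires.
    return "u!" + encoded.rstrip("=")


def shares_metadata_url(sharing_url: str) -> str:
    """Build the metadata URL; append /content to it to fetch the file body."""
    return (
        "https://api.onedrive.com/v1.0/shares/"
        + encode_sharing_url(sharing_url)
        + "/root"
    )


original_url = "https://onedrive.live.com/redir?resid=XXX!YYYY&authkey=!foo"
print(shares_metadata_url(original_url))
```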

Use URL as an API method for Slackbot in Express js

I am still new to JavaScript and am trying to write a Slackbot in Express.js. I want to use the method defined in https://api.slack.com/methods/channels.history. What should this look like syntactically, and how do I use it, since the method is simply a URL?
You need to make an HTTP request to the URL. The response is an object containing the status (ok:true|false), whether there are more messages (has_more:true|false), and an array of the actual messages (messages:array).
The response should look something like this:
{
has_more:true
messages:Array[100]
ok:true
}
The URL that you make the GET request to should look something like:
https://slack.com/api/channels.history?token=BOT_TOKEN&channel=CHANNEL_ID&pretty=1
Where BOT_TOKEN is the token attached to the bot you created, and CHANNEL_ID is the ID (not the name) of the channel whose history you want to get (9 uppercase alphanumeric characters, starting with a "C").
There are also a few other parameters you can include in the url. For example, "latest=", "oldest=", "inclusive=", "count=", and "unreads=". Details about those parameters can be found on the page you linked to (https://api.slack.com/methods/channels.history).
If you want to test it out in your browser's console, find a page where jQuery is loaded, open your dev tools and head into the console, and enter the following (with your bot token and channel id swapped in):
$.get('https://slack.com/api/channels.history?token=BOT_TOKEN&channel=CHANNEL_ID&pretty=1', function(response){console.log(response)});
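For building the request URL programmatically, here is a minimal sketch in Python (standard library only) rather than JavaScript. The helper name build_history_url is hypothetical. It only assembles the query string from the parameters described above and does not send the request.

```python
from urllib.parse import urlencode

SLACK_HISTORY_ENDPOINT = "https://slack.com/api/channels.history"


def build_history_url(token, channel, **optional_params):
    """Assemble the channels.history URL from the required and optional parameters."""
    params = {"token": token, "channel": channel}
    # Optional parameters such as latest, oldest, inclusive, count, unreads.
    params.update(optional_params)
    # urlencode percent-encodes the values and joins the pairs with "&".
    return SLACK_HISTORY_ENDPOINT + "?" + urlencode(params)


print(build_history_url("BOT_TOKEN", "CHANNEL_ID", count=50))
# https://slack.com/api/channels.history?token=BOT_TOKEN&channel=CHANNEL_ID&count=50
```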

Change redirected URL in Scrapy

Is it possible to change a redirected URL in Scrapy?
For example, I crawl a URL:
http://someurl.com/A
which has a redirect to
http://redirectedurl.com:8080/A
This url fails because of the port number. The good URL needs to be without a port number, so I would like to change it to
http://redirectedurl.com/A
I tried to update request.meta, setting redirect_urls to the new URL without a port.
The docs say that MetaRefreshMiddleware obeys redirect_urls, but with no success:
meta.update({'redirect_urls': ['http://redirectedurl.com/A '] })
r = Request(url=url, callback=callback, meta=meta)
Any ideas?
No need to go deep and try to fix things "under the hood". You can just check if the request was redirected, and then create a new request with the modified URL:
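import re
# Inside the spider callback: redirect_urls is present only if a redirect happened.
if 'redirect_urls' in response.meta:
    # Strip the port (a colon followed by digits) from the final URL.
    new_url = re.sub(r":\d+", "", response.url)
    yield Request(new_url)
Of course, you would add additional checks there; this is just a minimal example.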
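One such additional check could be to remove the port only from the host part of the URL, since the regular expression above would also strip a colon followed by digits that appears in the path or query. Below is a minimal sketch using the standard urllib.parse module. The helper name strip_port is hypothetical, and the sketch assumes URLs without user credentials or IPv6 hosts.

```python
from urllib.parse import urlsplit


def strip_port(url):
    """Return url with any explicit port removed from the host part only."""
    parts = urlsplit(url)
    if parts.port is None:
        # No explicit port: return the URL unchanged.
        return url
    # hostname is the lowercased host without the port (and without credentials).
    return parts._replace(netloc=parts.hostname).geturl()


print(strip_port("http://redirectedurl.com:8080/A"))
# http://redirectedurl.com/A
```

In the callback, new_url = strip_port(response.url) could then replace the re.sub call.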