http://dbpedia.org/sparql endpoint not reliable? - sparql

Sometimes a query works and sometimes it doesn't. I sometimes get "Virtuoso S1T00 Error SR171: Transaction timed out" (no timeout is set, or a large timeout is set, so that is not the cause; there must be another problem behind it that I am not aware of), or simply a browser HTTP 500 error page.
Sometimes it works from a new browser window in IE, and sometimes it doesn't work from Firefox.
What is going on with the DBpedia SPARQL endpoint? Is there some caching or something else that I am not aware of?

The DBpedia query service is kindly provided for free, and does tend to get (ab)used by many users. If you need something that you can rely on, I'd suggest setting up your own instance (IIRC there are EC2 instances for that purpose).
It's a shame that the error messages tend to be so random.

Because of the large amount of data and the heavy load, DBpedia's public endpoint can be very slow and won't always produce proper results. If you need better results, try setting up ARQ to run your SPARQL queries on your local machine; it will give a better outcome.
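For what it's worth, here is a minimal sketch of what querying a local copy with Jena ARQ can look like. The file name "dbpedia-subset.nt" is only a placeholder for whichever DBpedia dump you download, and the package names follow the Jena 2.x API that appears elsewhere on this page; treat it as an illustration rather than a ready-made setup.

import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.query.ResultSetFormatter;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.util.FileManager;

public class LocalDbpediaQuery {
    public static void main(String[] args) {
        // Load a downloaded dump into an in-memory model ("dbpedia-subset.nt" is a placeholder;
        // for a full dump you would use a persistent store such as TDB instead).
        Model model = FileManager.get().loadModel("dbpedia-subset.nt");

        Query query = QueryFactory.create(
                "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10");
        QueryExecution qexec = QueryExecutionFactory.create(query, model);
        try {
            ResultSet results = qexec.execSelect();
            ResultSetFormatter.out(System.out, results, query);
        } finally {
            qexec.close();
        }
    }
}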

Related

GET vs POST API calls and cache issues

I know that GET is used to retrieve data from the server without modifying anything, whereas POST is used to add data. I won't get into PUT/PATCH, and will assume that POST is always used to update and replace data.
The theory is nice, but in practice I have encountered many situations where my GET calls needed to be replaced with POST calls, because the response often gets incorrectly cached. Where I work there are proxy servers for security, caching, load balancing, etc., and often the response to a GET call is cached to speed up the call, whereas POST calls never get fully cached.
Now for my question: suppose I have an API call /api/get_orders/month. Theoretically this should be a GET call; however, the number of orders might change at any second. If I call this API at one moment it may return, for example, 1000, and calling it just two seconds later should return 1001. However, even though adding a version flag such as ?v=<date_as_int> should ensure that the updated value is returned, there seem to be some caches in the proxy servers that ignore it.
Basically, I don't feel safe enough using GET unless I know for certain that the data will not be modified or if I know for a fact that the response is always the updated data.
So, would you recommend using POST or GET for retrieving the daily/monthly number of orders? And if GET, with all the different and complex layers and server set-ups, how can one be certain that the data is always up to date?
If you're doing multiple GET requests, something in between is caching the data, and you have no idea what it is or how to change its behavior, then POST is a valid workaround.
In any normal situation you would take the time to look at what sits between your browser and your server, and if something there behaves in a way that doesn't make sense, I would try to investigate and fix it.
You work at a place where some of that infrastructure exists, so maybe talk to the people who maintain it? But if that's not an option and you just want the 'ignore every convention and make my request work' workaround, then you can use POST.
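If you control the client but not the proxies, one standards-based thing to try before switching to POST is an explicit Cache-Control header on the GET. Below is a minimal sketch in Java, assuming a hypothetical https://example.com host in front of the /api/get_orders/month path from the question; note that a misconfigured intermediary may still ignore the header, which is exactly why the POST workaround exists.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FreshOrderCount {
    public static void main(String[] args) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/api/get_orders/month")) // hypothetical host + path from the question
                .header("Cache-Control", "no-cache")   // ask caches to revalidate with the origin server
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}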

404 Exception while querying dbpedia

I have the following code, which makes requests to the DBpedia server.
HttpQuery.urlLimit = 3000;
Query query = QueryFactory.create(queryString, Syntax.syntaxARQ);
QueryExecution qexec = QueryExecutionFactory.sparqlService(this.endPoint,query);
ResultSet resultSet = qexec.execSelect();
The code runs fine, but sometimes raises the following exception.
HttpException: 404
at com.hp.hpl.jena.sparql.engine.http.HttpQuery.execGet(HttpQuery.java:349)
at com.hp.hpl.jena.sparql.engine.http.HttpQuery.exec(HttpQuery.java:295)
at com.hp.hpl.jena.sparql.engine.http.QueryEngineHTTP.execResultSetInner(QueryEngineHTTP.java:346)
at com.hp.hpl.jena.sparql.engine.http.QueryEngineHTTP.execSelect(QueryEngineHTTP.java:338)
What is the reason for such exception?
An HTTP 404 is a standard HTTP error that means the requested resource was not found, i.e. the server could not find the service you asked for.
As a public service open to everyone, DBpedia is heavily used and often experiences outages for various reasons, e.g. maintenance, hardware/software failures, or DoS attacks (whether intentional or from unintentionally bad queries).
According to SPARQL Endpoint Status for DBpedia, the endpoint has around 99% availability, which means that it will sometimes be unavailable.
There are many possible reasons. We don't have enough information to say with certainty which applies here.
As @RobV says, HTTP 404 is a standard HTTP error which indicates that the server (which was operational) could not find the resource you asked for -- but we don't know which resources you asked for, nor when you did and didn't get this error, so we cannot analyze further.
The 404 does not indicate the server is down, nor that it is refusing to serve you. These conditions (and many others) would result in different error codes.
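Since these outages are transient, one pragmatic option, offered here only as a sketch rather than as part of either answer, is to catch the HTTP exception and retry a few times with a delay. The package names follow the Jena 2.x API shown in the stack trace above.

import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.query.ResultSetFactory;
import com.hp.hpl.jena.query.Syntax;
import com.hp.hpl.jena.sparql.engine.http.QueryExceptionHTTP;

public class RetryingQuery {
    static ResultSet queryWithRetry(String endpoint, String queryString, int attempts)
            throws InterruptedException {
        Query query = QueryFactory.create(queryString, Syntax.syntaxARQ);
        for (int i = 1; i <= attempts; i++) {
            QueryExecution qexec = QueryExecutionFactory.sparqlService(endpoint, query);
            try {
                // Copy the results so the execution can be closed before returning.
                return ResultSetFactory.copyResults(qexec.execSelect());
            } catch (QueryExceptionHTTP e) {
                System.err.println("Attempt " + i + " failed: " + e.getMessage());
                Thread.sleep(2000L * i);   // simple linear backoff before retrying
            } finally {
                qexec.close();
            }
        }
        throw new RuntimeException("Endpoint still failing after " + attempts + " attempts");
    }
}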

python - HTTP Error 503 Service Unavailable

I am trying to scrape data from Google and LinkedIn. Somehow it gives me this error:
*** httperror_seek_wrapper: HTTP Error 503: Service Unavailable
Can someone advise how I can solve this?
Google is simply detecting your query as automated. You would need a captcha solver to get unlimited results. The following link might be helpful.
https://support.google.com/websearch/answer/86640?hl=en
Bypassing Captcha using an OCR Engine:
http://www.debasish.in/2012/01/bypass-captcha-using-python-and.html
Simple Approach:
An even simpler approach is to call sleep() a few times and to generate randomized queries. This way Google will not spot that you are using an automated system, but it is far slower...
Error Handling:
To simply suppress the error message, use try and except.
I encountered the same situation and tried using the sleep() function before every request to space the requests out a little. It looked like it was working fine, but it failed soon enough even with a delay of 2 seconds. What finally solved it was using:
import contextlib
import urllib
with contextlib.closing(urllib.urlopen(urlToOpen)) as x:
    pass  # do stuff with x
I did this because I thought opening too many requests left connections open that had to be closed. Nevertheless, it worked quite consistently with as little as 0.5 s of delay.

How to avoid timeout, i.e., do time-unlimited query, on Virtuoso SPARQL endpoint?

Every time I run a query against the http://dbpedia.org/sparql endpoint or my local Virtuoso store, I get a timeout error after a considerably long time (around 30 minutes, in my experience).
For querying DBpedia's online SPARQL endpoint, I use the following statements:
Query query = QueryFactory.create(q); //q - query string
QueryExecution qexec = QueryExecutionFactory.sparqlService("http://dbpedia.org/sparql/", query);
qexec.setTimeout(-100);
I read that a timeout value less than zero (i.e., negative) will never allow a timeout to happen, which is why I set qexec.setTimeout(-100). But I still get a timeout.
How do I solve this problem? Is it also true that http://dbpedia.org/sparql blocks your IP address after a certain number of large queries? Can I not run continuous, unlimited (i.e., very many, on the order of 10^6) queries? Thanks.
Questions specifically regarding Virtuoso are generally best raised on the public OpenLink Discussion Forums, the Virtuoso Users mailing list, or through a confidential Support Case.
That said, regarding your specific questions -- the server-side timeout setting trumps that requested by any query -- i.e., the query setting only has effect when it's shorter than that set on the server. You can adjust the server-side setting (MaxQueryExecutionTime), among many other things, on your own instance.
DBpedia-specific questions, discussion, submissions, etc., are usually best directed to the DBpedia discussion list. The public DBpedia endpoint does indeed have various usage limitations, which are part of what make it viable as a generously provided public service.
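If you cannot raise the limit (for example, on the public endpoint), one common workaround, offered here only as a sketch rather than as part of the answer above, is to split a large SELECT into pages with LIMIT/OFFSET so that each individual request finishes well inside the server's MaxQueryExecutionTime. The query and page size below are only illustrative.

import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.ResultSet;

public class PagedQuery {
    public static void main(String[] args) {
        // Illustrative query: fetch results one page at a time instead of all at once.
        String base = "SELECT ?s WHERE { ?s a <http://dbpedia.org/ontology/City> } ";
        int pageSize = 10000;
        for (int offset = 0; ; offset += pageSize) {
            Query query = QueryFactory.create(base + "LIMIT " + pageSize + " OFFSET " + offset);
            QueryExecution qexec = QueryExecutionFactory.sparqlService(
                    "http://dbpedia.org/sparql", query);
            try {
                ResultSet rs = qexec.execSelect();
                int rows = 0;
                while (rs.hasNext()) {
                    rs.next();   // process each row here
                    rows++;
                }
                if (rows < pageSize) {
                    break;       // last (short) page reached
                }
            } finally {
                qexec.close();
            }
        }
    }
}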

Querying WCF Services with OData Params in POST instead of GET

We call WCF services (not ours), and we're using GETs for searching a product database.
Example:
http://foo.com/SearchProducts.svc?$skip=0$take=10$includeTotalCount=true
We were passing the OData parameters to page the results of the SearchProducts service. The service has been changed to a POST because one of our filters, "skus", is sometimes huge (hundreds of SKUs), which causes the GET to break because the URI is too long. The easiest solution, we thought, was to just change the call to a POST, but now the OData params don't seem to be used.
Do these params need to be sent in a different manner when doing a POST?
A compliant OData service will not support the POST verb for queries (unless you use POST tunneling, but then you're going to hit the URL limit anyway), so I wonder how it works for you at all.
The URL size limit can be overcome using several approaches:
Simplify the query expression. Obviously this can only go so far, but it's usually the best solution as it will likely speed up the query execution as well.
Use batch instead. You can send the GET request inside a batch; the length of the URL is not an issue in this case, since the query URL is sent in the payload of the batch (see the sketch after this list).
Define a service operation for the complex query you're using (but since you don't own the service this is probably not a good solution for you).
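To illustrate the batch option, here is a rough sketch of sending the long query as a GET inside the body of a POST to $batch. The foo.com URLs, the entity set, and the filter are only hypothetical placeholders based on the question, and the exact payload format depends on the OData version the service supports.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ODataBatchSketch {
    public static void main(String[] args) throws Exception {
        String boundary = "batch_example";   // any token that does not appear in the payload
        // Hypothetical inner query; substitute the real (long) filter here.
        String innerUrl = "http://foo.com/SearchProducts.svc/Products?$skip=0&$top=10&$filter=...";
        String body =
                "--" + boundary + "\r\n"
                + "Content-Type: application/http\r\n"
                + "Content-Transfer-Encoding: binary\r\n"
                + "\r\n"
                + "GET " + innerUrl + " HTTP/1.1\r\n"
                + "\r\n"
                + "--" + boundary + "--\r\n";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://foo.com/SearchProducts.svc/$batch"))
                .header("Content-Type", "multipart/mixed; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}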