How to request a page title in a foreign language using wikipedia API - wikipedia-api

I am trying to use a simple GET request:
"https://en.wikipedia.org/w/api.php?action=query&titles=音乐&prop=langlinks&lllimit=500"
But with Chinese characters in the title, Wikipedia can't find the page, even though it exists: https://zh.wikipedia.org/wiki/%E9%9F%B3%E4%B9%90
I have tried URL-encoding the Chinese word, in both simplified and traditional characters. I also tried passing the Unicode code points as ASCII escapes, like "\u97f3\u4e50".
Does anyone know how to do this?

I solved it.
There are a few things to remember when doing this:
You need to use the Wikipedia of your target language, so in this case
zh.wikipedia.org
Chinese Wikipedia externally displays the character set of the user's region (simplified for mainland China, traditional for Taiwan). Internally, though, it depends on who wrote the article: the title in your API query must be in the character set used by the person who created the page. So for Music, the traditional 音樂 will not work and you must use the simplified 音乐. But for Notebook computer, the simplified 笔记本 will not work and you must use the traditional 筆記本. You have no choice but to try both. .NET includes a set of methods for converting between the two character sets.
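A minimal sketch of the "try both" approach in Python, assuming the standard MediaWiki action=query endpoint on zh.wikipedia.org. The helper name langlinks_url and the format=json parameter are illustrative additions, not part of the original post. The sketch only builds the request URLs; fetching them and checking which one returns a real page is left out.

```python
from urllib.parse import urlencode


def langlinks_url(title, lang="zh"):
    """Build a MediaWiki API URL asking for the language links of one page.

    `lang` selects the Wikipedia edition (zh.wikipedia.org by default).
    """
    params = {
        "action": "query",
        "titles": title,       # urlencode percent-encodes this as UTF-8
        "prop": "langlinks",
        "lllimit": 500,
        "format": "json",      # assumption: JSON output is wanted
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)


# Try the simplified spelling first, then the traditional one.
candidates = ["音乐", "音樂"]
urls = [langlinks_url(t) for t in candidates]
```

Each URL in urls can then be requested in turn until one of them returns an existing page.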

Related

unable to perform search on custom_field(JIRA-Python)

I'm getting the error below when I search on a custom field.
{"errorMessages":["Field \'customfield_10029\' does not exist or you do not have permission to view it."],"warningMessages":[]}
But I have sufficient permissions (Admin) to access that field, and I have also made the field visible.
URL = 'https://xyz.atlassian.net/rest/api/2/search?jql=status="In+Progress"+and+customfield_10029=125&fields=id,key,status'
Custom fields in JQL searches are referenced using the abbreviation 'cf' followed by their ID inside square brackets '[id]', so your URL would be:
URL = 'https://xyz.atlassian.net/rest/api/2/search?jql=status="In+Progress"+and+cf[10029]=125&fields=id,key,status'
Make sure you percent-encode the square brackets (%5B and %5D) with your language's URL-encoding function.
PS. Generally speaking, it's much easier to reference custom fields in JQL searches by their names, not their IDs. It makes the search URL easier to read and understand what is being searched for.
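As a rough illustration of the encoding step, here is a Python sketch that builds the search URL with the standard library. The host is the placeholder from the question, and the helper name search_url is made up for this example.

```python
from urllib.parse import urlencode

# Placeholder host taken from the question; replace with your own site.
BASE = "https://xyz.atlassian.net/rest/api/2/search"


def search_url(jql, fields):
    """Return a search URL with the JQL and field list percent-encoded."""
    query = urlencode({"jql": jql, "fields": ",".join(fields)})
    return f"{BASE}?{query}"


# cf[10029] is the custom field from the question; the brackets get encoded.
url = search_url('status = "In Progress" AND cf[10029] = 125',
                 ["id", "key", "status"])
```

urlencode turns the square brackets into %5B and %5D and the quotes into %22, so the JQL never has to be escaped by hand.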
I get a 400 response code with customized field syntax:
https://domain/rest/api/2/search?maxResults=500&jql=cf[10025]='xxxxxxxxxd'&fields=id,key,issuetype,status,customfield_10025

Scrapy: how to solve the "empty" item in html due to a foreign language symbol?

One of the scraped items seems to contain no content in the HTML. In the MySQL database, it does have content, including a non-regular - (dash) that is slightly longer. It could be a dash symbol from Chinese input, or something similar. I am copying it below, though I am not sure whether it will keep its original form. The web link is here and this non-regular dash is in the title and at the beginning of the description.
**Hospitalist – Chattanooga**
To further prove it, the CSV file exported from MySQL converts this weird dash to ?€?. Most likely this weird symbol causes the display problem.
I want to either delete this weird symbol or replace it with a , or a regular dash. Where can this be done? During Scrapy? Or in MySQL? Sorry, this is not a specific coding question. I need some guidance before writing any code for this problem.
The long dash is called an em dash (reference: fileformat - EM dash).
The reason you are seeing it is likely due to the chosen encoding.
Try setting a different encoding or replacing the em dash with the , sign as you mentioned in your question.
In PHP you can do so with the following code (chr(151) is the em dash byte in Windows-1252, so this assumes the input uses that single-byte encoding):
str_replace(chr(151), ',', $input);
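Since Scrapy is Python, the same replacement could also be done on the scraping side before the item is stored. The following is only a sketch: the function name normalize_dashes is hypothetical, and it assumes the scraped text is already a decoded Unicode string containing an en dash (U+2013) or em dash (U+2014).

```python
# Map the en dash (U+2013) and em dash (U+2014) to a plain hyphen.
DASH_TABLE = {0x2013: "-", 0x2014: "-"}


def normalize_dashes(text):
    """Return `text` with long dashes replaced by an ASCII hyphen."""
    return text.translate(DASH_TABLE)


# Example with the title from the question (the dash written as an escape).
title = normalize_dashes("Hospitalist \u2013 Chattanooga")
```

A function like this could be called from an item pipeline or on each field as it is extracted; changing the value in DASH_TABLE to "," gives the comma variant instead.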

System.Web.HttpUtility.UrlEncode method gives wrong result with different language value

I am using the System.Web.HttpUtility.UrlEncode method in my project. When I encode a name in English, I get the correct result. For example,
string temp = System.Web.HttpUtility.UrlEncode("Jewelry");
then I get the exact string back in the temp variable. But if I write the name in Russian, I get a different result.
string temp = System.Web.HttpUtility.UrlEncode("ювелирные изделия");
then the temp variable holds a value like "%d1%8e%d0%b2%d0%b5%d0%bb%d0%b8%d1%80%d0%bd%d1%8b%d0%b5+%d0%b8%d0%b7%d0%b4%d0%b5%d0%bb%d0%b8%d1%8f"
Can anyone help me how to achieve exact name as per language?
Thank you!
Actually, the method has "done the right thing" for you!
It percent-encodes non-ASCII characters so that the value is valid in every context and can be transmitted over the Internet. If you put your temp variable into a URL as a parameter, the server side will decode it back to the correct original text. That is what UrlEncode is for, so what you are seeing is not a problem at all.
For further reading on URL encoding, have a look at this link: http://www.w3schools.com/tags/ref_urlencode.asp
If you enter that Russian phrase into the "URL Encoding Functions" part of that page, it will return the same result as the Web.HttpUtility.UrlEncode method does.
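To illustrate the round trip, here is a small Python sketch using the standard library's equivalents (quote_plus and unquote_plus stand in for the .NET method here; they are not what the question uses). It shows that decoding the percent-encoded value gives back the original Russian text.

```python
from urllib.parse import quote_plus, unquote_plus

original = "ювелирные изделия"

# Encode as UTF-8 bytes, percent-encode them, and turn the space into "+".
encoded = quote_plus(original)

# This is what the receiving side does with the query parameter.
decoded = unquote_plus(encoded)

# The value reported in the question decodes to the same text.
from_question = unquote_plus(
    "%d1%8e%d0%b2%d0%b5%d0%bb%d0%b8%d1%80%d0%bd%d1%8b%d0%b5"
    "+%d0%b8%d0%b7%d0%b4%d0%b5%d0%bb%d0%b8%d1%8f"
)
```

Python writes the hex digits in upper case while the output in the question uses lower case; both forms decode to the same bytes.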
Can anyone help me how to achieve exact name as per language?
In short: not with that method, but it might depend on what is your exact goal.
In details:
In general, URIs as defined by RFC 3986 (see Section 2: Characters) may contain any of the following characters: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=. Any other character needs to be encoded with percent-encoding (%hh).
This is why UrlEncode produces
UrlEncode("Jewelry") -> "Jewelry"
UrlEncode("ювелирные изделия") -> "%d1%8e%d0%b2%d0%b5%d0%bb%d0%b8%d1%80%d0%bd%d1%8b%d0%b5+%d0%b8%d0%b7%d0%b4%d0%b5%d0%bb%d0%b8%d1%8f"
The string of "ювелирные изделия" contains characters that are not allowed in a URL as per RFC 3986.
Today, modern browsers can work with UTF-8 characters in a URL, so depending on your goal it might not be necessary to call UrlEncode() yourself. See example: http://jsfiddle.net/ybgt96ms/

Special characters in the URL

I am working on an ASP.NET MVC app.
This app displays detailed information about a product.
The product name can contain special characters such as a single quote, the percent sign, the registered symbol (the circle with an 'R' inside), the trademark symbol, etc.
Currently all of these are replaced with a '-'.
If the name is like this:
Super - Men's 100% Polyester Knit Shirts
It appears like this in the URL:
8080/super---men-s-100-polyester-knit-shirts/maverick
- men-s-100-polyester-knit-shirts
This is done in Js like so:
Name.replace(/([~!@#$%^&*()_+=`{}\[\]\|\\:;'"<>,.\/? ])+/g, '-').replace(/^(-)+|(-)+$/g, '');
So the question is, should the name be displayed as-is in the URL?
If yes, some pointers please.
If no, please provide some valid reasons, such as the standards followed today, that will help me make my case.
Regards.
The short answer is not to fiddle with it. It's as good as it gets out of the box.
A URL can only contain a small set of characters unescaped, which basically means you can only have A-Z, a-z, 0-9 and - _ . ~.
All other characters need to be encoded. Now that you can have Arabic URLs too, it has gotten a little more complicated.
But assuming your website is in an Indo-European language, this is it. So you will never be able to have full product names in your URL.
And renaming them to something cooler, like replacing % with "percent" in the URL, can bring disaster upon your URLs: in some cases the "fake" names may not end up unique, and you therefore end up with unreliable routing.
See the Wikipedia article on URI characters.
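The uniqueness concern can be seen with a small Python port of the replacement from the question. This is only a sketch: the function name slugify, the lowercasing step, and the two example product names are assumptions made for illustration.

```python
import re

# Same character class as the JavaScript in the question: any run of these
# characters (including spaces) collapses into a single hyphen.
SPECIALS = re.compile(r"""[~!@#$%^&*()_+=`{}\[\]|\\:;'"<>,./? ]+""")


def slugify(name):
    """Replace special characters with '-' and trim hyphens at both ends."""
    slug = SPECIALS.sub("-", name)
    # Lowercasing is an assumption; the URL in the question appears lowercased.
    return slug.strip("-").lower()


slug = slugify("Super - Men's 100% Polyester Knit Shirts")

# Two different (made-up) product names end up with the same slug.
collision = slugify("100% Cotton") == slugify("100 Cotton")
```

Because distinct names can map to the same slug, routing that relies on the slug alone can be ambiguous; a common way around this is to carry a unique identifier in the URL next to the readable name.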

TSearch2 - dots explosion

Following conversion
SELECT to_tsvector('english', 'Google.com');
returns this:
'google.com':1
Why doesn't the TSearch2 engine return something like this?
'google':2, 'com':1
Or how can I make the engine return the exploded string as I wrote above?
I just need "Google.com" to be findable by "google".
Unfortunately, there is no quick and easy solution.
Denis is correct in that the parser is recognizing it as a hostname, which is why it doesn't break it up.
There are 3 other things you can do, off the top of my head.
1. You can disable the host parsing in the database. See the PostgreSQL documentation for details. E.g. something like
ALTER TEXT SEARCH CONFIGURATION your_parser_config
DROP MAPPING FOR host, url, url_path;
Be aware that dropping a mapping makes tokens of those types be ignored rather than split up.
2. You can write your own custom dictionary.
3. You can pre-parse your data before it's inserted into the database in some manner (maybe splitting all domains before going into the database).
I had a similar issue to you last year and opted for solution (2), above.
My solution was to write a custom dictionary that splits words up on non-word characters. A custom dictionary is a lot easier and quicker to write than a new parser. You still have to write C though :)
The dictionary I wrote would return something like 'www.facebook.com':4, 'com':3, 'facebook':2, 'www':1 for the 'www.facebook.com' domain (we had a unique-ish scenario, hence the 4 results instead of 3).
The trouble with a custom dictionary is that you will no longer get stemming (i.e. www.books.com will come out as www, books and com). I believe there is some work (which may have been completed) to allow chaining of dictionaries, which would solve this problem.
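For the pre-parsing option, a rough Python sketch might look like the following. The function name expand_hosts and the regular expression are illustrative assumptions; the pattern is deliberately simple and will also match other dotted tokens such as decimal numbers, so it would need tightening for real data.

```python
import re

# Very loose "host-like" pattern: word characters joined by dots.
HOST_RE = re.compile(r"\b[\w-]+(?:\.[\w-]+)+\b")


def expand_hosts(text):
    """Append the dot-separated parts after every host-like token."""
    def repl(match):
        host = match.group(0)
        return host + " " + " ".join(host.split("."))
    return HOST_RE.sub(repl, text)


expanded = expand_hosts("Google.com")
```

The expanded text keeps the full host name and adds its parts as separate words, so the default parser sees them as ordinary words when the text is passed to to_tsvector.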
First off in case you're not aware, tsearch2 is deprecated in favor of the built-in functionality:
http://www.postgresql.org/docs/9/static/textsearch.html
As for your actual question, google.com gets recognized as a host by the parser:
http://www.postgresql.org/docs/9.0/static/textsearch-parsers.html
If you don't want this to occur, you'll need to pre-process your text accordingly (or use a custom parser).