Scrapy: how to solve the "empty" item in html due to a foreign language symbol? - scrapy

One of the scraped items seems to contain no content when rendered in HTML. In the MySQL database it does have content, including a non-standard - (dash) that is slightly longer than usual. It could be a dash from Chinese input, or something similar. I am copying it below, though I'm not sure whether it will keep its original form. The web link is here, and this non-standard dash appears in the title and at the beginning of the description.
**Hospitalist – Chattanooga**
As further evidence, the CSV file exported from MySQL converts this odd dash to ?€?. Most likely this symbol causes the display problem.
I want to either delete this symbol or replace it with a , or a regular dash. Where can that be done? In Scrapy? Or in MySQL? Sorry, this is not a specific coding question; I need some guidance before working out any code for this problem.

The long dash is called an em dash (reference: fileformat - EM dash).
The reason you are seeing it garbled is likely the chosen encoding.
Try setting a different encoding, or replace the em dash with the , sign as you mentioned in your question.
In PHP you can do so with the following code (chr(151) is the em dash in Windows-1252, so this assumes $input is in that single-byte encoding):
$output = str_replace(chr(151), ',', $input);
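Since Scrapy itself is Python, the same replacement could also happen before the item reaches MySQL. What follows is only a minimal sketch under assumptions: the pipeline class name and the field names "title" and "description" are hypothetical, and both the en dash (U+2013) and the em dash (U+2014) are mapped to a plain hyphen because it is not certain which of the two the page uses.

```python
# Map typographic dashes to a plain ASCII hyphen.
# U+2013 is the en dash, U+2014 is the em dash.
DASH_MAP = str.maketrans({"\u2013": "-", "\u2014": "-"})


def normalize_dashes(text: str) -> str:
    """Return text with en/em dashes replaced by '-'."""
    return text.translate(DASH_MAP)


class NormalizeDashesPipeline:
    """Hypothetical Scrapy item pipeline; the field names are assumptions."""

    fields = ("title", "description")

    def process_item(self, item, spider):
        # Only touch string fields that are actually present on the item.
        for field in self.fields:
            value = item.get(field)
            if isinstance(value, str):
                item[field] = normalize_dashes(value)
        return item


print(normalize_dashes("Hospitalist \u2013 Chattanooga"))  # Hospitalist - Chattanooga
```

A pipeline like this would have to be enabled through the project's ITEM_PIPELINES setting. Doing the cleanup in Scrapy rather than in MySQL keeps the stored data plain ASCII, at the cost of losing the original punctuation.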

Related

NSString containing apostrophe (') is not set properly to SOAP service

I have what seems to me a strange problem. I have a UITextField which contains a username. When it contains an apostrophe ('), the service cannot read the username properly. I suppose it is connected with Unicode. I tried to see what the character codes are and I get:
L'TEST - contains code 8217
' - is 39
` - is 96
Can anyone explain to me why this happens so I can fix this issue?
L'TEST - contains code 8217
That would be L’TEST with U+2019 ‹’› \N{RIGHT SINGLE QUOTATION MARK}. If you look closely, in most fonts this character is displayed with a slight curl. It's not an apostrophe, but misused as one.
can anyone explains to me why this happens
Common causes:
"The Fool" Some mischievous input system silently replaced the apostrophe with a quotation mark. Word processors and mobile OS on-screen keyboards do that. It's well-meaning, but sometimes produces the wrong result.
"The Clueless" The user does not know how to type an apostrophe correctly and picked the similar-looking quotation mark.
"The Angry" Your UI text field (or something else in the chain originating from the user) forbids entry of apostrophes for some misguided reason. The user absolutely refuses to write something orthographically incorrect and manually replaces the apostrophe with a quotation mark in order to work around the defective software.
so I can fix this issue
This is a social problem, not a software problem.
Since a SOAP service is based on XML and single quotes are a delimiter in XML, you may need to escape single quotes in all text fields. XML has no backslash escapes, so the replacement for "'" is the entity "&apos;" rather than "\'". It's a very common issue.
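To make the two answers concrete, here is a small Python sketch (not the asker's Objective-C) that inspects the code points and then normalizes and escapes the value. The helper names are made up for illustration, and whether the typographic quote should be folded to an ASCII apostrophe at all is an assumption about what the service expects.

```python
from xml.sax.saxutils import escape

# Fold the typographic single quotes U+2018/U+2019 to the ASCII apostrophe U+0027.
TYPOGRAPHIC_TO_ASCII = str.maketrans({"\u2018": "'", "\u2019": "'"})


def to_ascii_apostrophes(text: str) -> str:
    """Replace curly single quotes with a plain apostrophe."""
    return text.translate(TYPOGRAPHIC_TO_ASCII)


def xml_text(text: str) -> str:
    """Escape &, <, > and both quote characters using XML entities."""
    return escape(text, {"'": "&apos;", '"': "&quot;"})


username = "L\u2019TEST"
print([ord(ch) for ch in username[:2]])          # [76, 8217]
print(xml_text(to_ascii_apostrophes(username)))  # L&apos;TEST
```

If the service is supposed to accept U+2019 as it is, the alternative would be to skip the folding step and make sure the request is actually sent in the encoding declared in the XML.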

Special unicode question mark characters in database table

Firstly, to anyone who reads this and responds: thanks for your assistance.
I'm having a problem with a site (primarily in English) that has many translations for different languages. I have a database which stores these translations. Unfortunately, one of the languages seems to be populated with question mark characters between each regular character. Because of this, any text which contains these characters won't show up in IE.
Are there any SQL statements that will seek these characters out and remove them? There's a find/replace option, but I can't seem to find a rule that applies.
Thanks for any help you can give.
As an example, this is how text shows in a table:
�i�O�N� �k�i�t� �d�e� �s�u�p�p�o�r�t� �V�é�l�o� - which stops it showing in IE.
Removing these as below will show it in IE:
iON kit de support Vélo
Any idea how I go about this?
Thanks :)
Your translation database contains mangled data that has come from misinterpreting UTF-16-encoded input as ISO-8859-1 (or the closely related Windows code page 1252; you can't tell the difference from the example data).
You could attempt to undo the damage by extracting the data, encoding it back to what is hopefully the original set of bytes, and re-decoding it, then inserting it back into the database. For example in PHP:
$mangled = "i\0O\0N\0 \0k\0i\0t\0 \0d\0e\0 \0s\0u\0p\0p\0o\0r\0t\0 \0V\0\xE9\0l\0o\0";
$fixed = iconv('utf-16le', 'utf-8', $mangled);
# "iON kit de support V\xC3\xA9lo"
but it would be best to go back to the original input data and re-import it properly.
Just removing zero bytes from a UTF-16-encoded byte string (str_replace("\0", '', $mangled)) isn't really a fix. It would work for the ASCII characters (U+0000–U+007F), but you would end up with ISO-8859-1 bytes for characters U+0080–U+00FF (more usually you would want UTF-8), and any characters outside that range would remain unreadable nonsense.
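The encode-back-then-re-decode idea can be sketched in Python as well. This assumes the damage really was UTF-16LE read as ISO-8859-1 (not Windows-1252, which leaves a few byte values undefined); the function names are invented for the example, and simulate_damage exists only to produce test data.

```python
def repair_utf16_read_as_latin1(mangled: str) -> str:
    """Undo 'UTF-16LE bytes decoded as ISO-8859-1'."""
    # Step 1: recover what are hopefully the original bytes.
    raw = mangled.encode("iso-8859-1")
    # Step 2: decode those bytes the way they should have been decoded.
    return raw.decode("utf-16-le")


def simulate_damage(original: str) -> str:
    """Reproduce the faulty import, for demonstration only."""
    return original.encode("utf-16-le").decode("iso-8859-1")


original = "iON kit de support V\u00e9lo \u20ac"
mangled = simulate_damage(original)

print(repair_utf16_read_as_latin1(mangled) == original)  # True
# Stripping the NUL characters alone does not bring back the euro sign,
# because U+20AC has no zero byte in UTF-16.
print(mangled.replace("\x00", "") == original)           # False
```

The second check illustrates the point about characters outside U+0000–U+00FF: they are stored as two non-zero bytes, so removing zeros leaves them as two unrelated characters.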

Special characters in the URL

I am working on an ASP.NET MVC app.
This app displays detailed information about a product.
The product name can contain special characters such as a single quote, the percent sign, the Registered symbol (an 'R' inside a circle), the Trademark symbol, etc.
Currently all these are replaced with a '-'.
If the name is like this:
Super - Men's 100% Polyester Knit Shirts
It appears like this in the URL:
8080/super---men-s-100-polyester-knit-shirts/maverick
- men-s-100-polyester-knit-shirts
This is done in Js like so:
Name.replace(/([~!##$%^&*()_+=`{}\[\]\|\\:;'"<>,.\/? ])+/g, '-').replace(/^(-)+|(-)+$/g, '');
So the question is, should the name be displayed as-is in the URL?
If yes, some pointers please.
If no, please provide some valid reasons like standards as followed today that will help me put the point across the table.
Regards.
The short answer is not to fiddle with it. It's as good as it gets out of the box.
A URL can contain only a small set of characters unescaped: the unreserved characters A-Z, a-z, 0-9 and - _ . ~.
All other characters need to be percent-encoded. Now that internationalized (for example Arabic) URLs exist too, it has gotten a little more complicated.
But assuming your website uses a Latin-script language, this is it. So you will never be able to have full product names verbatim in your URL.
And renaming characters to something cooler, like replacing % with "percent" in the URL, can bring disaster upon your URLs: in some cases the "fake" names may not end up unique, which makes routing unreliable.
See the Wikipedia article on URI characters.
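For comparison, this Python sketch (not the asker's JavaScript) shows what the product name looks like when it is percent-encoded as-is, next to a slug. The slug rule here is a variation that also collapses repeated hyphens, and product_path with its numeric id is a hypothetical way to keep routes unique; neither comes from the question.

```python
import re
from urllib.parse import quote


def slugify(name: str) -> str:
    """Lowercase, turn every run of non [a-z0-9] characters into one '-'."""
    # Note: this also drops non-ASCII letters such as 'é'.
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def product_path(product_id: int, name: str) -> str:
    """Hypothetical route: the id carries identity, the slug is cosmetic."""
    return f"/{product_id}/{slugify(name)}"


name = "Super - Men's 100% Polyester Knit Shirts"
print(quote(name, safe=""))  # Super%20-%20Men%27s%20100%25%20Polyester%20Knit%20Shirts
print(slugify(name))         # super-men-s-100-polyester-knit-shirts
print(product_path(42, name))  # /42/super-men-s-100-polyester-knit-shirts
```

Routing on the id rather than on the slug would sidestep the uniqueness problem mentioned above, since two products whose names slugify identically still get different paths.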

SQL Strip the Font Format(Colour or other)

I have a problem stripping the formatting out of a notes table.
Here is an example:
";\red31\green73\blue125;
\viewkind4\uc1\ltrpar\f0\fs20 USEFUL TEXT BODY \cf1\f3
\ltrpar\f0\fs17
"
How can I get rid of this stuff? I want to play it safe and not simply replace everything after a '\'.
Many thanks,
Rick
You're making it quite difficult for yourself by not replacing on '\'.
If you look at http://other9.tripod.com/Refs/easy-rtf.html you will see that there are many different RTF codes and that the codes have no fixed length.
Also, unlike HTML, RTF codes do not require a "closing" tag, which makes this harder still.
The only thing I can think of is to record all possible RTF codes (or use an RTF parser library) so that you can recognize whether a \ starts an RTF code or not.
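As a rough illustration of why this is hard, here is a Python sketch (not SQL) that strips anything shaped like an RTF control word: a backslash, letters, an optional number, and one optional delimiter space. It is deliberately naive and the function name is made up. It does not handle groups, color or font tables, or hex escapes, and it will also eat literal text such as a Windows path, which is exactly the risk the question worries about.

```python
import re

# Backslash + letters + optional (possibly negative) number + optional single space.
_CONTROL_WORD = re.compile(r"\\[a-zA-Z]+-?\d* ?")


def strip_rtf_control_words(fragment: str) -> str:
    """Remove control-word-shaped tokens and collapse leftover whitespace."""
    text = _CONTROL_WORD.sub("", fragment)
    return " ".join(text.split())


sample = r"\viewkind4\uc1\ltrpar\f0\fs20 USEFUL TEXT BODY \cf1\f3"
print(strip_rtf_control_words(sample))  # USEFUL TEXT BODY
```

For anything beyond simple fragments like the one above, a real RTF parser is the safer route, as the answer suggests.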

Rails ActiveRecord: Inserting text containing unprintable/weird characters

I am inserting some text scraped from the web into my database. Some of the fields in the string have unprintable/weird characters. For example,
if text is "C__O__?__P__L__E__T__E",
then the text in the database is stored only as "C__O__"
I know about h(), strip_tags(), sanitize, etc. But I do not want to sanitize this SQL. ActiveRecord logs the SQL correctly, and when run in phpMySQL, the query is executed correctly. Something happens between the SQL query being generated and it being executed.
Help is much appreciated.
Just replace the question mark in the string with a string containing a question mark (passed as a bind parameter); I haven't found any other way either:
["C__O__?__P__L__E__T__E", '?']
works perfectly.
Can you escape the question mark using "\?"?
Hmm, using CGI escape, I found out that the character coming into the system is not what I expected it to be. It is not a real question mark (%3F) but a different byte (%D5) that is merely displayed as a question mark.
C__%D5__M__P__L__%80___T__%80__
C__%3F__M__P__L__%3F___T__%3F__
Eventually I gsubbed out the non-printable characters before saving.
gsub(/[^[:print:]]/, '')
Only after removing the invalid characters in my string, was I able to save the item properly.
None of the other solutions worked, partially because the problem was not understood clearly upfront.
I know this is way late, but I ran into the same problem when we were trying to process a file as UTF-8 that actually used the ISO-8859-1 character encoding. I suspect you had a similar issue in your scraping where you assumed the wrong encoding and it ended up causing things to fail.
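The wrong-encoding explanation and the gsub workaround can both be sketched in Python (not the asker's Ruby). The function names and the ISO-8859-1 fallback are assumptions for illustration; the real source encoding of the scraped page is not known from the thread.

```python
def decode_scraped(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode as UTF-8, falling back to a single-byte encoding on failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: the page was really in the fallback encoding.
        return raw.decode(fallback)


def drop_unprintable(text: str) -> str:
    """Rough counterpart of gsub(/[^[:print:]]/, ''): keep printable characters."""
    return "".join(ch for ch in text if ch.isprintable())


print(decode_scraped(b"C__\xd5__M"))   # C__Õ__M
print(drop_unprintable("C__\x80__T"))  # C____T
```

Decoding with the right encoding keeps the character, while dropping unprintable characters loses it, so the first approach is preferable whenever the source encoding can be determined.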