Why do most (all?) websites only support usernames in ASCII? Are there any security considerations if an admin decides to start accepting Unicode usernames?
Homoglyph attacks. User 'cat' and 'сat' are different Unicode strings although they look the same. The first letter of the second 'сat' is the Russian 'с' (CYRILLIC SMALL LETTER ES, to be exact). The system can't easily tell that you're spoofing another user's name; to the computer, the nicks are simply different strings.
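To see this concretely, here's a minimal Python sketch (standard library only) showing that the two nicks differ at the code-point level even though they render identically:

    import unicodedata

    latin = "cat"
    spoof = "\u0441at"  # first letter is CYRILLIC SMALL LETTER ES, not Latin 'c'

    print(latin == spoof)  # False: the strings differ at the code-point level
    for ch in spoof:
        # unicodedata.name() reveals which script each character belongs to
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

Note that NFKC normalization does not unify the two; detecting this class of spoofing needs a confusables ("skeleton") check along the lines of Unicode TS #39 (Security Mechanisms).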
Edit: Preventing mixed scripts does not solve the problem. For example, 'сосо' is pure Cyrillic and can be used to spoof the ASCII 'coco'.
Also, directional formatting characters such as LEFT-TO-RIGHT OVERRIDE and RIGHT-TO-LEFT OVERRIDE. Leave them unsanitized and they'll mess up your whole page.
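A minimal sanitization sketch in Python (an assumption about policy, not a complete one: stripping the whole "format" category also removes joiners that some scripts legitimately use):

    import unicodedata

    def strip_format_controls(s: str) -> str:
        # Category "Cf" includes the bidi overrides (U+202A..U+202E) and
        # the bidi isolates (U+2066..U+2069), among other invisible controls.
        return "".join(ch for ch in s if unicodedata.category(ch) != "Cf")

    print(strip_format_controls("user\u202Egnp.exe"))  # 'usergnp.exe'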
HTTP authentication?
There could be some problems with sending a Unicode username (and/or password) over existing protocols. One case I have run into before is HTTP Basic authentication: there is no well-defined way to indicate the character encoding of the username/password sent in the Basic auth header.
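For illustration, here's roughly what goes on the wire (a sketch; the encoding chosen below is exactly the part the protocol historically left unspecified, until RFC 7617 added an optional charset="UTF-8" parameter):

    import base64

    username, password = "żółć", "secret"
    # The choice of UTF-8 here is the unstandardized part: older clients
    # often assumed Latin-1 instead.
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    print("Authorization: Basic " + token)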
While it is questionable why there should be a username at all, rather than just a password to identify a user, I think there's no reason to disallow Unicode usernames.
What's more important is that the password be validated language-agnostically: it should be treated as keystrokes regardless of the user's keyboard layout. That would mean "שלום" and "akuo" are the same password. This is important because the user often doesn't see the password characters they're typing, and they get severely pissed when CAPSLOCK is on.
While you can go ahead and allow unicode, understand that some usernames will not work as expected thanks to different cultures applying different rules to the same characters.
Consider the basic case where case insensitivity breaks: in Turkish, the usernames "Id1" and "id1" are different (Turkish has two distinct letters I, one with a dot and one without, giving two capital and two small letters that do not follow the same capitalization rules as English). So while any Turkish person can enter their name in their own language, the program will not treat their name as they expect; instead it undergoes a strange transformation into mutant English.
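A small Python illustration; the commented-out locale-aware variant is an assumption, since it requires the third-party PyICU package:

    # Default lowercasing is language-neutral:
    print("Id1".lower())  # 'id1', the English expectation
    # In Turkish, 'I' lowercases to dotless 'ı' (U+0131), so "Id1" and
    # "id1" are different names there. Locale-aware folding needs ICU:
    # import icu  # assumption: PyICU installed
    # print(str(icu.UnicodeString("Id1").toLower(icu.Locale("tr"))))  # 'ıd1'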
Special Latin characters in European languages have similar overlaps, making it seemingly random which language they are treated as being written in. Other regions of the world share characters whose rules of use differ; in some cases, national and cultural hatreds could produce some very angry people when the characters making up their username are treated as if they were written in the language of a hated enemy (because that happens to be the operating system's default setting for those foreign characters).
Your observation is not always true, and the choice of ASCII is largely a matter of human factors rather than technical or security issues.
In most cases it is simply ease of programming. A programmer never knows whether all the software, libraries, and utilities behind a website will break with certain characters. Why risk the website's development when ASCII works well? Some packaged web software also hinders the use of Unicode in usernames. Both contribute to the fact that many websites only support usernames in ASCII.
Theoretically, all current software can handle 8-bit data well; there is no problem in storage or transmission nowadays. Even where some protocol cannot, the data can be translated to UTF-7 or another transformation scheme.
There are some issues with Unicode, but they are more on the data-processing side: display, fonts, readiness of software and libraries for non-BMP characters, collation, comparison, input methods, writing direction. Administrators might not be knowledgeable enough to handle them. Depending on the nature of the website this could be a problem, though mostly it is not.
For administration purposes, some exotic characters are not easy to type, which makes it hard for an admin to search for users. It is also hard for an admin to keep offensive usernames in foreign languages off the website.
However, it is not uncommon for Chinese usernames to be used on Chinese websites, which are not always in ASCII, and the same goes for other languages and cultures. Some global projects accept nearly all kinds of Unicode characters; Wikipedia is an example.
Plain ASCII is rare, I'd say. Often it's just that nobody thinks about it, since Latin-1 suffices for Western Europe and for the US as well. Some databases distinguish between text in legacy character sets and Unicode (varchar vs. nvarchar); for other databases a special character set has to be configured.
Especially in the US, many people never even notice that ASCII won't be enough. Some try to find excuses along the lines of "users would have to be able to type it", but those are mostly bogus.
To answer your question: I doubt there are security considerations, except maybe spoofing other people's names using different scripts (a and а look identical, but one is Latin and one is Cyrillic; this has been done with URLs before). Generally I see it as an oversight by developers who probably should know better.
I would say a big reason is the lack of Unicode support in most PHP installations. It isn't easy to work with, so why allow it when the possibilities of ASCII are sufficient to cover your entire user base?
Or, we could just stop giving a crap about what a username looks like and whether WE can pronounce or remember it. That should be the USER'S concern. If no one remembers you, that's your loss. As for name spoofing, it is almost unavoidable in any case, and yet you rarely ever hear of username spoofs.
Imagine a forum, and imagine someone posts with an account that LOOKS identical to yours. You get in trouble, say you didn't do it, and post a link to your history showing the post isn't there. Click the profile of the guy who ACTUALLY posted it, and bam, you have his profile. He's now bannable.
Having the same name doesn't mean you have the same user data. Any application that doesn't make it easy for you to differentiate two similar users is piss-poor anyway and needs to be rewritten.
For i18n testing, I'm looking for a test string that has a good representation of all commonly used languages (as supported by UTF-8) and contains the special characters of those languages that typically cause display issues.
I would use this test string to make sure our system processes these languages correctly and has fonts that can display them all correctly.
E.g. the sample text should have characters from Latin-script languages, Far East languages, right-to-left languages...
There is no clear answer to your question, as it is full of ambiguous terms, for instance "commonly used languages" or "normally have issues in display". This is highly dependent on the OS, OS version, the text engine used to display the text, and the fonts installed. Pretty much the whole tech stack.
Sprinkling "all" through the question (all the special chars, all ... languages) makes any answer useless.
You would be looking at a string of tens of thousands of characters. Then there are lots of combining marks and ligatures; do you want to check all of those combinations too? They might also have "issues in display".
If all you want to do is check that your application works in (most) languages, try taking some (not all) characters from each Unicode block. You might also want to avoid historical scripts (cuneiform, Egyptian hieroglyphs, etc.) that are not covered by common fonts.
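Here's a rough sketch of that approach in Python; the block ranges and sample counts are arbitrary choices for illustration, not an official test corpus:

    import unicodedata

    # Hypothetical selection of ranges from widely used scripts.
    RANGES = [
        ("Latin-1 Supplement", 0x00C0, 0x00FF),
        ("Greek", 0x0391, 0x03C9),
        ("Cyrillic", 0x0410, 0x044F),
        ("Hebrew (RTL)", 0x05D0, 0x05EA),
        ("Arabic (RTL)", 0x0627, 0x063A),
        ("Devanagari", 0x0905, 0x0939),
        ("CJK Unified Ideographs", 0x4E00, 0x4E2F),
        ("Hangul Syllables", 0xAC00, 0xAC2F),
    ]

    def sample(start, end, n=4):
        # Keep only assigned letters; skip unassigned code points and symbols.
        letters = [chr(c) for c in range(start, end + 1)
                   if unicodedata.category(chr(c)).startswith("L")]
        step = max(1, len(letters) // n)
        return "".join(letters[::step][:n])

    test_string = " ".join(sample(a, b) for _, a, b in RANGES)
    print(test_string)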
In general, if your application does not corrupt the string somehow, it will render properly. And if it does not render, then it is not your app at fault but some limitation in the underlying technology (e.g. the Windows console).
If you explain what you are trying to do, you might get a better answer.
Or you can just search for internationalization testing.
In order to implement a CAPTCHA for my login page, I would like to understand how a translation test can be considered secure compared to popular image recognition patterns.
All customers will be bilingual speakers of an orally learnt and used Polynesian language, i.e. one with no formal spelling conventions (hence the translation into English rather than the reverse). So instead of asking them to read distorted letters, I would like to ask them to translate a simple sentence into English, to be validated on the PHP server side.
Is this secure/accurate?
The basic reason this kind of CAPTCHA ("Completely Automated Public Turing test to tell Computers and Humans Apart") is insecure is that, while the OP states that Google Translate doesn't "currently" support the Polynesian language in question, it cannot be excluded that it will do so in the future.
More generally, translation is not a valid CAPTCHA test because of the following considerations:
Comparing a random sentence against its automated translation from a public translator (e.g. a future version of Google or Bing) can be matched by an attacker simply submitting the same phrase to the same translation engine
Using a whitelist of sentences and their translations will eventually be defeated by the growing accuracy of public automated translators
By that I mean that modern public machine translators keep improving. If you assume a public translator cannot do an accurate job today and challenge the user with a phrase the translator cannot handle, technology will tend to fix that translation eventually, and your challenge sentence will be easily answered by robots.
That is the main principle behind reCAPTCHA being used as an OCR, just seen from the opposite side. I suggest reading this paper; briefly, the researchers state that reCAPTCHA is destined to improve its accuracy far beyond automated OCRs because of user input.
Since Google and Bing Translate make wide use of user-submitted data to improve their translation process, they will benefit from human-aided machine learning that eventually breaks the Turing test for this kind of challenge (reCAPTCHA will read like a human, Translate will translate like a human).
After reading the comments, it seems the only danger I face is a vague future Google Translate one, which is unlikely to eventuate. So I'm going to stick my neck out and say that this is indeed a good security measure which could conceivably be useful to many businesses or organisations with such a customer base. Thanks for the assist.
A major point in its favor is ease of use for the customers, all of whom so far prefer it to trying to read a CAPTCHA. I put it on a live system, so I had 80+ people use it today.
I presume they all speak English too then? It's unusual to require your users to be bilingual. Even if this is the case today, is it possible that with future growth you might be excluding certain users? What if someone moves into the area who wants to sign up but only speaks English?
Language is a funny, imprecise thing. You could take a sentence and probably translate it a number of different ways. Computers deal in precision, so you need a question that has only one possible answer.
Also, the whole idea of a CAPTCHA is to make sure there's a real person on the other end, but it may not be too hard to write a program that uses Google Translate or something similar. It may not always get it right, but it would probably get through some of the time.
Are there any known problems to take into consideration when programming a website where login names may include spaces? If there are none, why do most websites avoid them?
Most well-known websites, or at least those with even an ounce of programming experience behind them, should not have problems with spaces in usernames, as values are escaped when inserted into the database. Some websites may avoid spaces in usernames simply as a matter of policy. Other than that, I cannot think of any programming-based reasons.
Decent website software should not have any problem (read: it will not break) with whitespace in usernames. Forms, databases etc all support arbitrary text sufficiently.
However:
Using usernames with whitespace will prevent their reuse if a website decides to e.g. provide an e-mail redirection service later, or, more generally, reuse with any "legacy" service where whitespace is not supported.
Whitespace is vulnerable to various diseases. The split-at-end-of-line disease, the who-trimmed-my-spaces virus, the damn-how-many-spaces-are-there syndrome etc. It inserts unnecessary ambiguity while communicating something (usernames) which should never be ambiguous.
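One common mitigation (a sketch of one possible policy, not the only one) is to canonicalize whitespace before storing or comparing usernames:

    import re

    def normalize_username(raw: str) -> str:
        # Trim the ends and collapse inner runs, so "  Ada  Lovelace "
        # and "Ada Lovelace" name the same account.
        return re.sub(r"\s+", " ", raw).strip()

    print(normalize_username("  Ada \t Lovelace ") == "Ada Lovelace")  # True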
Most sites at least employ server access log checking and banning along with some kind of bot prevention measure like a CAPTCHA (those messed-up text images).
The problem with CAPTCHAs is that they pose a threat to the user experience. Luckily they now come with user-friendly features like refresh buttons and audio versions.
Anyway, much as with Linux vs. Windows, it isn't worth a spammer's time to customize and/or build a script to handle a custom CAPTCHA scheme that only pertains to one site. Therefore, I was wondering if there might be better ways to handle the whole CAPTCHA thing.
In A Better CAPTCHA, Peter Bromberg mentions that one way would be to convert the image to HTML and display it embedded in the page. On http://shiflett.org/, Chris simply asks users to type his name into an input. Examples like these are ways of simplifying the CAPTCHA experience while decreasing its value to spammers. Does anyone know of more good examples I could use, or see any problem with the embedded-image idea?
An image presented as an HTML table is just a technical speed bump. There's no difficulty in extracting the pixels from such a document.
IMHO CAPTCHA puts the focus on the wrong thing: you're not actually interested in whether there's a human on the other side, since you wouldn't want a human to spam you either. So take a step back and focus on the spam itself:
Analyze text (look for spammy keywords, use Bayesian filtering)
Analyze links (blacklist spammy domains – SURBL, LinkSleeve)
Look at traffic patterns and block floods
There's no single perfectly accurate method, but you can use a few of them and weight the results to get pretty close.
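A minimal sketch of that weighting idea in Python; the keywords, weights, and threshold below are invented purely for illustration:

    SPAM_KEYWORDS = {"viagra", "casino", "free money"}
    BLACKLISTED_DOMAINS = {"spam.example"}  # e.g. fed from SURBL lookups

    def spam_score(text, links, posts_last_minute):
        score = 0.0
        lowered = text.lower()
        score += 0.4 * sum(kw in lowered for kw in SPAM_KEYWORDS)      # keywords
        score += 0.8 * sum(any(d in url for d in BLACKLISTED_DOMAINS)  # links
                           for url in links)
        if posts_last_minute > 5:                                      # flood
            score += 1.0
        return score

    post = ("FREE MONEY now!!!", ["http://spam.example/offer"], 9)
    print(spam_score(*post) > 1.0)  # True: reject or quarantine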
Have a look at the source code of Sblam! (a completely transparent server-side comment spam filter).
Alternatives to CAPTCHAs come from considering the problem from other angles. The reason is that CAPTCHAs are built around the idea that a human and a computer actor can be distinguished; as artificial intelligence progresses, that will only become harder as the gap between computer and human users shrinks.
The technique used on Slashdot is for other users of the site to act as gatekeepers, marking abuse and removing offending posts before they become noticeable to a wide audience.
Another technique is to detect spam-like posts directly, using the same technology used to filter spam from email. Obviously it isn't 100% effective for email, and won't be for other uses either, but if you can filter out 75% of the spam with very few false positives, then other techniques only have to deal with the remaining 25%.
Keep a log of spam-related activity, so that you can track trends in offending IP addresses, content of posts, claimed user agents, and so forth, and block abusive users at the routing level.
In nearly all cases, your users would rather put up with the slight inconvenience of abuse prevention, than the huge inconvenience of a major spam problem.
Ultimately, the arms race between you and spammers is one of cost-benefit. Initially, it will cost spammers close to nothing to spam your site, but you can change that to make it very difficult. Even if they continue to spam your site, the benefit they receive will never grow beyond a few innocent users falling for their schemes. Once the cost of spamming rises sharply above the benefit, the spammers will go away.
Another way to benefit from that is to allow advertising on your site. Make it inexpensive (but not free, of course) and easy for legitimate advertisers to post responsible marketing material for your users to see. Would-be spammers may find that it is a better deal to just pay you a few dollars and get their offering seen than to pursue clandestine methods.
Obviously most spammers won't fit in this category, since spam is often more about getting your users to fall victim to malware exploits. You can do your part there by encouraging users to use modern, up-to-date browsers or plugins, so that they become less vulnerable to those same exploits.
This article describes a technique based on hashed field names (changing with each page view) with some of them being honeypot fields (i.e. the request is rejected if they're filled) that are hidden from human users via various techniques.
Basically, it relies on spam scripts not being sophisticated enough to determine which form fields are actually visible. In a way, that is a CAPTCHA, since in order to solve it reliably, not only would they have to implement HTML, CSS and JavaScript fully, they'd also have to recognize when a field is too small to see, colored the same as the background, hidden behind another field, placed outside the browser's viewport, etc.
It's the same basic problem that makes Web Standards a farce: there is no algorithm to determine whether a webpage "looks right" - only a human can decide that.
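A minimal honeypot check as a sketch (the field name is hypothetical, and a real implementation would also rotate or hash the names per page view, as the article describes):

    HONEYPOT_FIELD = "website"  # rendered hidden via CSS, e.g. display:none

    def looks_like_bot(form_data):
        # Humans never see the field, so any non-empty value means a
        # script filled the form blindly.
        return bool(form_data.get(HONEYPOT_FIELD))

    print(looks_like_bot({"comment": "hi", "website": "http://x.example"}))  # True
    print(looks_like_bot({"comment": "hi", "website": ""}))                  # False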
Seen this?
It's a system with cute pictures instead of a CAPTCHA ;)
But I still think honeypots are a better solution: they're so cheap & easy & invisible.
I really think that Dinah hit the nail on the head. The fact seems to be that the beauty of the whole CAPTCHA setup is that there is no standard. Standardizing would only help the market to be more profitable.
Therefore it seems that the best way to handle the CAPTCHA problem is to come up with a system that is fairly hard for bots to crack and is NOT used by anyone else on the planet. It could be a question system, a very custom image creator, or even a mix of JS calls that only browsers respect.
By the time your site is big enough for spammers to care, you should have the budget to rethink your CAPTCHA setup and optimize it much more. In the meantime we should be monitoring our server logs and banning bad agents, referrers, and IPs.
In my case I created a CAPTCHA image that I believe is very different from any other CAPTCHA I have seen. This should do fine for now alongside my Apache logs + htaccess banning and Akismet checking. Maybe I should spend time on a reporting feature as well.
Although not a true image CAPTCHA, a good Turing test is asking users a random question. Common options are: "Is ice hot or cold?", "5 + 2 = ?", etc.
I have to write up a technical document in Framemaker that explains various programming source-code.
So my document consists of a bunch of text, followed by a bunch of source code (Java, XML) and then followed by more text, etc.
This question is not about whether I should or should not use FrameMaker; that is the software I have to use...
What I'm confused about is how to format source code as part of my document. Has anyone done this for a technical document and come across any instructions or tips? So far my Googling hasn't produced anything relevant to what I need to do.
At the very least, create a paragraph style for code samples, use a good monospaced font, and don't forget to turn off hyphenation.
When I used to do this, I would create a table style and paste the code in there, so I had a nice title header above it, and it stood out a bit. The only gotcha there is that Frame table cells won't break across a page break, so if your code is longer than a page or threatens to go below the bottom of a page, you'll need to create multiple rows in your table and break up the code across the rows.
From a paper I wrote on this some years ago, which will be available online again next week.
Typographers are primarily concerned with legibility, and have tools, practices, and traditions dating back hundreds and indeed thousands of years on which to rely when setting texts in natural languages. However, computer programs are not written in natural languages. They are written in ‘programming languages’: artificial languages, which have their own rules of syntax, their own conventions of presentation, and their own criteria of legibility. Computer code is therefore a special domain for typesetting, just as are music, mathematics, and chemistry. These domains have their own rules, which are not the rules used when setting natural languages.
Computer programming itself is of very recent origin, and the practice of setting it in type doesn’t go back more than about 45 years: significant volumes of computer code have only been published in the last 20 years or less. The associated typographical discipline is immature or indeed practically non-existent, and the typographical expectations of the practitioners in the field are also low, as you can see by inspecting many trade books. There's no reason why you can't try to do better.
Use a sans-serif font. In one of my books I used the same font family: FF Scala for the text and FF Scala Sans for the code. I think it looks great, but there are contrary opinions, and these may force you to use a monospaced font, although personally I think that is very outdated. Avoid Courier; it doesn't blend with anything.
Indentation is part of the notation. You must respect the existing left indents. The source code will already be tabbed. Reduce each tab to one or two spaces at most, otherwise you will run out of horizontal room.
Try to save as much vertical space as possible, e.g. suppress blank lines. Try to get the entire sample onto one page; let it float if necessary to accomplish that.
Line breaks are part of the notation. Don't add line breaks without consulting the author.
Quotation marks are part of the notation. Don't change single to double or vice versa.
Justification: Computer programs are always written, viewed, and set left-justified, right-ragged.
Page breaks. When setting computer code in a book, page breaks can’t just follow the simple orphan/widow principles used when typesetting natural languages. Instead, the logical ‘blocks’ of the code must be kept together if possible. It is not usually possible for the typographer to determine the block boundaries in code, although a blank line is generally an acceptable point for a page break. ‘Block comments’ should be kept with the following block of code. If you don’t know what these are, ask the author.
Hyphenation. Programming languages are not natural languages and do not observe the usual hyphenation conventions. Consult the author if you need to hyphenate, or just don't. Words in program text must never be hyphenated or line-broken except in accordance with the author’s instructions.
Upper and lower case. Case in program code is usually significant to the computer, and practically always to writers and their readers. Pairs of words are often used which differ only in case, representing different things: e.g. BufferedOutputStream and bufferedOutputStream. Programmers, especially author-programmers, are usually highly systematic about case, in ways which may not necessarily make sense to the typographer (or other programmers!).
Practical recommendations
Indent in em units. The solution to many of the issues in typesetting computer programs is the em. The author’s tabs will most likely be to the next multiple of 8 spaces (1, 9, 17, …); typographic tabs for program code should be in multiples of 1 or 2 ems. Adopting the em as the unit of indentation may at first ‘look funny’ to the author, as the indents may be much narrower than seen on screens or printouts. However, as long as the vertical alignment of tab stops is preserved, the author’s intention is fully preserved.
Line breaks must be as per MS.
Page breaks: If page breaks may occur in the middle of program code, the author must be consulted as to preferred page break points. Usually this is to be avoided altogether in short examples; in longer programs, the author should indicate all possible page breaks in the MS.
Quotes: Conventionally, ‘straight’ quotes are used, not typographic quotes. This is historically determined by the use of fonts without typographic quotes (e.g. Courier, Helvetica) in typeset computer code; it is not required by the properties of the notation. I see no reason against using typographic quotes when setting computer programs, as long as single quotes stay single and double quotes stay double, i.e. as long as the author’s quotes are preserved rather than ‘corrected’ to standard typographic practice.
Numerals: Conventionally, lining numerals have always been used in program code. If you can be bothered to use old-style numerals in program code, or if the font is built that way, I can see no reason against it. You must choose a font in which 1, I, and l (lower-case L) are distinct, as well as 0 (zero) and O.