8

I was just thinking about how reCAPTCHA is getting harder when I thought of another possible solution. Images won't last forever, so we will need something else some day - like human logic or emotion. Google and others are trying image grouping by category (find the image that doesn't belong), but that requires a large number of images and doesn't work for the blind.

Anyway, what if a massive collection of text was gathered (public-domain books from each language) and a sentence was shown to the user with one (or two) of its words replaced by a select box of choices? Only computers that knew correct English/Spanish/German grammar would be able to tell which of the words belonged in the sentence.

Would there be any problems with this approach? I would assume that anyone who knew the language the sentence was displayed in could figure out the answer more easily than they could read the reCAPTCHA text. Plus, storing an insane number of sentences would only take a couple of gigabytes of space and wouldn't take anywhere near the CPU time that creating images/audio takes. In other words, anyone could host their own captcha system with minimal impact on system performance.

Is there a problem with this approach? More specifically I'm looking for the main problem with this approach.

migrated from stackoverflow

Xeoncross
  • 4
    I tried doing one of the listening ones the other day as I was struggling with the visual ones on some site. The sound was completely incomprehensible. All getting a bit silly, isn't it? – Armand Mar 08 '11 at 19:46
  • "Would there be any problems with this approach?" Have you read about the Turing Test? http://en.wikipedia.org/wiki/Turing_test Yes. There are problems with **all** approaches. The question is vague. Can you be more specific in what you'd like to know? – S.Lott Mar 08 '11 at 19:46
  • 2
    Or you could try to administer a Voight-Kampff test as a captcha. It seems to work on replicants... – FrustratedWithFormsDesigner Mar 08 '11 at 19:58
  • 4
    This sounds like an xkcd comic.... http://xkcd.com/810/ – Tyanna Mar 08 '11 at 20:10
  • @S.Lott Some of the people that have answered below have answered the question. There doesn't need to be a final answer - just any problems that can be thought of that would defeat this setup. I'll mark the answer as the one with the most votes after everyone has had a chance for some input. – Xeoncross Mar 08 '11 at 20:10
  • @Xeoncross: "Some of the people that have answered below have answered the question" Since the question is vague, it's hard to say what would constitute an "answer". It would help if you could actually **update** the question to clarify what kind of answer you're expecting. The "would there be any problems" is trivially answered by "yes", so that cannot possibly be what you're looking for. Since some of us can't read minds, could you provide us a clue as to what you're looking for? – S.Lott Mar 08 '11 at 20:13
  • @S.Lott, there isn't anything complex about this. The question is just a call for any input that can pose a problem to the idea. I will mark the answer with the most upvotes as "correct" even though I'm sure there are many "correct" answers to this question. – Xeoncross Mar 08 '11 at 20:15
  • @Xeoncross: "there isn't anything complex about this" Correct. It's not "complex". It's vague. The answer is "yes" there are problems. Since that answer is silly, please **update** your question to define what you want to know above and beyond the word "yes". – S.Lott Mar 08 '11 at 20:40
  • 1
    Well, firstly, not everybody may know a language perfectly, and many users may not be native speakers (e.g. me). But, most importantly, I wouldn't rely too much on the assumption that all humans _are endowed with logic_. Experience taught me that in certain cases human logic is but a myth. So chances are that you would end up filtering out humans and letting bots through. – Lucius Mar 08 '11 at 20:45
  • 2
    I don't know, but I've seriously considered buying captcha-breaking software just to legitimately log in to some sites. :) – davidhaskins Mar 08 '11 at 20:48
  • @Tyanna: I was actually wondering if this sort of "new" captcha would stir the research for better spell correctors :) – Matthieu M. Mar 09 '11 at 18:12

9 Answers

7

First, I give you IBM's Watson. I think computing has far exceeded the simple fill in the blank problems of language.

Next, I give you all the Spelling/Grammar checkers implemented in software. Determining if a word is grammatically correct in a sentence is solved in > 90% of cases. I'll even stick my neck out and say they are better at literacy than most humans I know.

I don't think your CAPTCHA idea will work as well as you are expecting...

Dan McGrath
  • Spelling and grammar obey rules which, even if obscure, do work. On the other hand, it would definitely stir research in that direction too... – Matthieu M. Mar 09 '11 at 18:10
  • Assuming the high rate of correct usage by many spell-checkers, and the probability of guessing the remaining words as Jeff mentioned, I believe this would greatly reduce the effectiveness of the CAPTCHA. I think it would work fine for many smaller sites - but not one that spammers would mind spending a little time trying to break. Computers already speak better language than humans - spell-checkers prove that. – Xeoncross Mar 10 '11 at 15:49
6

Let's see, how long would it take to always select the first choice and eventually get it right?

JeffO
  • 3
    Ha, the base case for my own overengineered response. But, yes, this is why multiple-choice CAPTCHAs are silly. – Meredith L. Patterson Mar 08 '11 at 20:02
  • Certainly a problem that needs to be addressed. However, for now, my sites employ IP rate limiting and failure logging. Unless the spammer has a large-unrelated network (same as a distributed DoS) I would catch him. – Xeoncross Mar 08 '11 at 20:05
  • You wouldn't even give someone 3 chances? – JeffO Mar 08 '11 at 20:07
  • 3? I'm thinking more like 40 hits in 30mins before I ban them. – Xeoncross Mar 08 '11 at 20:17
  • 2
    Suppose wrong answer gives you 10 minute penalty and good answer reduces penalty by 5 minutes. If your penalty is greater than zero, you enter penalty mode. The penalty mode means application requires you to pass captcha before you reach any functionality. A brute force bot will reach a few years of penalty in no time, while human can (given clear instructions) recover from initial mistake in two steps. – Jacek Prucia Mar 08 '11 at 21:07
  • @Jacek That's a very interesting idea! Could go along with the idea that "remember-me" cookies only allow access to certain parts of the site - have to re-login to access the rest. – Xeoncross Mar 08 '11 at 21:58
  • @Jacek Prucia - you want to penalize a user for over a lengthy history of captcha mistakes? Just don't use multiple choice answers. It's really that easy. – JeffO Mar 10 '11 at 03:18
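The penalty scheme Jacek Prucia describes in the comments above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `PenaltyTracker`, the helper `expected_drift_per_guess`, and the idea of keeping one tracker per client are hypothetical, and the 10 and 5 minute figures are taken from the comment.

```python
from dataclasses import dataclass

WRONG_PENALTY_MIN = 10  # minutes added per wrong answer (figure from the comment)
RIGHT_CREDIT_MIN = 5    # minutes removed per correct answer (figure from the comment)


@dataclass
class PenaltyTracker:
    """Hypothetical per-client penalty counter, in minutes."""
    penalty_minutes: int = 0

    def record(self, correct: bool) -> None:
        """Update the penalty after one CAPTCHA attempt."""
        if correct:
            # The penalty never drops below zero.
            self.penalty_minutes = max(0, self.penalty_minutes - RIGHT_CREDIT_MIN)
        else:
            self.penalty_minutes += WRONG_PENALTY_MIN

    @property
    def in_penalty_mode(self) -> bool:
        """While True, the client must pass a CAPTCHA before doing anything else."""
        return self.penalty_minutes > 0


def expected_drift_per_guess(n_choices: int) -> float:
    """Average penalty change per uniformly random guess among n_choices.

    Assumes the client is already in penalty mode, so the zero floor
    does not come into play.
    """
    p_right = 1.0 / n_choices
    return (1.0 - p_right) * WRONG_PENALTY_MIN - p_right * RIGHT_CREDIT_MIN
```

A human who slips once is back to zero after two correct answers, while a bot guessing at random among four choices gains penalty on average with every guess, because `expected_drift_per_guess(4)` is positive. Note that this does not address guesses spread across many clients.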
4

If you are pulling sentences from public domain books, a bot wouldn't need to know anything about grammar. It would merely need to index those same sentences and do a search to find which word the actual sentence used. And that assumes that you reasonably solve the problem Jeff O suggested where you can circumvent the problem by guessing the first option every time.
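A rough sketch of that lookup attack, assuming the bot has the same public-domain corpus as the CAPTCHA and that the missing word is shown as a placeholder. The `____` marker and the function names `build_index` and `solve` are made up for illustration:

```python
import re
from typing import Dict, List, Optional

BLANK = "____"  # hypothetical placeholder for the removed word


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def build_index(corpus_sentences: List[str]) -> Dict[str, str]:
    """Map each 'sentence with one word blanked out' to the missing word."""
    index: Dict[str, str] = {}
    for sentence in corpus_sentences:
        words = normalize(sentence).split()
        for i, word in enumerate(words):
            key = " ".join(words[:i] + [BLANK] + words[i + 1:])
            index[key] = word
    return index


def solve(index: Dict[str, str], challenge: str, choices: List[str]) -> Optional[str]:
    """Return the choice that the indexed sentence actually used, if known."""
    answer = index.get(normalize(challenge))
    if answer is None:
        return None  # sentence not in the bot's corpus
    for choice in choices:
        if normalize(choice) == answer:
            return choice
    return None


index = build_index(["Call me Ishmael."])
guess = solve(index, "Call me ____.", ["Ahab", "Ishmael", "Queequeg"])
```

No grammar knowledge is involved: the index is a plain dictionary, and its size grows with the number of words in the corpus, which is the same order of storage the question already considers cheap.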

Plus, many of the sentences in the universe of public domain books would be inappropriate for this sort of endeavor. Many would be ambiguous without context. Many would contain objectionable content (imagine presenting a random sentence from Huckleberry Finn). So you'd have to invest a decent amount of effort to get to a set of sentences that won't be offensive and won't be ambiguous. If you accept that some sentences will be ambiguous, you lose much of the ability to punish bots for incorrect guessing.

Justin Cave
3

A more challenging problem for bots would be to remove a word from a sentence, then present a choice between four different words of the same part of speech. (E.g., remove a noun; which of these four nouns fits best here?)

Tagging and parsing algorithms aren't perfect, but corpus-based approaches have gotten to the point where you can train up a parser well enough to help you beat the odds on a CAPTCHA with commodity or open-source software. (When you're spamming in volume, it's okay if some messages don't get through, as long as enough of them do to increase your overall success rate.)

Computers aren't as good with semantics yet though.

  • Yes, I was thinking about the word being a certain part of speech with matching PoS replacements, since I have worked in automated text parsing and can see a computer beating this system if the choice of correct words was obvious. Then again, the less obvious - the harder for non-[language] speakers to answer. – Xeoncross Mar 08 '11 at 20:08
  • Of course, there are times when people aren't so great with semantics... – FrustratedWithFormsDesigner Mar 08 '11 at 20:59
3

Most of the spam I get these days is actually not bot-generated. I get a lot of spam coming from third-world countries where people are hired for a few cents an hour to post messages on forums and blogs and such.

No system which differentiates between humans and computers will stop this.

For that reason, I've totally done away with CAPTCHA on my sites. Instead, I have a fairly simple JavaScript-based solution (basically, JavaScript running on the client rearranges fields so that if you post with JavaScript turned off, it fails). This stops 95% of bot spam, but obviously has no effect on the human spam - but then, neither would a CAPTCHA.

Dean Harding
  • I'm starting to see more and more sites that simply require a checkbox with a random id be clicked else a post isn't added, or a text printed somewhere on the page to be entered into a randomly named input field. Works far better than captcha, and is more userfriendly. – jwenting Mar 09 '11 at 07:20
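The "text printed on the page, entered into a randomly named input field" variant from the comment above might look roughly like this on the server side. This is a sketch only: the `session` dictionary stands in for whatever per-visitor storage a site uses, and the function names are hypothetical.

```python
import hmac
import secrets
from typing import Dict, Tuple


def issue_challenge(session: Dict[str, str]) -> Tuple[str, str]:
    """Pick a random field name and a short code to print on the page.

    Both are remembered in the (hypothetical) per-visitor session.
    """
    field_name = "f_" + secrets.token_hex(8)
    expected = secrets.token_hex(3)
    session["challenge_field"] = field_name
    session["challenge_value"] = expected
    return field_name, expected


def verify_challenge(session: Dict[str, str], form: Dict[str, str]) -> bool:
    """Accept the post only if the random field carries the printed code.

    The challenge is removed from the session, so it works once.
    """
    field_name = session.pop("challenge_field", None)
    expected = session.pop("challenge_value", None)
    if field_name is None or expected is None:
        return False
    submitted = form.get(field_name, "").strip()
    # Constant-time comparison on bytes, so non-ASCII input cannot raise.
    return hmac.compare_digest(submitted.encode("utf-8"), expected.encode("utf-8"))
```

Like the JavaScript trick in the answer, this only filters generic bots that replay a fixed form; a bot written for the specific site could read the code off the page.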
2

"Only computers that knew correct English/Spanish/German grammar would be able to tell which of the words belonged in the sentence."

The answer could become subjective (it is not in reality subjective but lack of language concepts spans all societies) and difficult for those who do not speak the language natively.

If there is a finite list of grammatical rules (which every language has) being presented then it simply becomes an algorithm; approachable now by any machine willing to implement the algorithm.

Aaron McIver
  • And what about people who don't fully understand the language that the sentence is in? Will they not be able to pass the captcha? – FrustratedWithFormsDesigner Mar 08 '11 at 19:51
  • 1
    Good point, but it's the same answer as *What about people who don't fully understand the text in the reCAPTCHA image?* – Xeoncross Mar 08 '11 at 20:03
  • 2
    @Xeoncross There is a difference. The text in a typical CAPTCHA is abstracted. It's not that a user doesn't know that A == A; it's the presentation. The problem exists in the fact that the abstraction unfortunately removes legibility in the hope of creating a gap between what a human versus a machine can understand. It is basic in nature. An A will always equal an A. When you bring grammar into the mix you change the approach. You are now assuming each and every individual is at a given grade level of comprehension with regard to the language being used. – Aaron McIver Mar 08 '11 at 20:46
  • 1
    I must admit that I was rather excited by your analyses. Not only does it stop spam bots - but you have to be semi-literate to leave a comment! That would be so awesome! ;P – Xeoncross Mar 08 '11 at 22:01
  • 1
    @Frustrated: Some of us would consider that a feature, not a bug. ;) – Mason Wheeler Mar 08 '11 at 23:57
2
  1. All captchas are susceptible to captcha farming.
  2. Multiple choice is too easy to solve by random trying. (As pointed out by others.)

But ignoring these serious gotchas, there is the problem of languages.

Agglutinating languages like Hungarian or Finnish lend themselves easily to this kind of captcha, because words can have many suffixes and each one serves a different purpose in the sentence (e.g. the same noun has a different suffix when used as an object or a subject). However, the rules are only complicated for humans; a machine will find the correct one in a few tries.

Isolating languages (English being an approximate example, Mandarin Chinese a much cleaner one) are even worse, as grammar is mostly dictated by position in the sentence and not word form.

Fusional languages like Russian or Greek probably pose yet another set of problems and so on.

To sum it up, linguistic riddles that translate well and are hard to guess randomly are notoriously hard to find. It's probably much easier to concentrate on semantics than on syntax. For example, "Continue the following sequence: Thursday, Wednesday, Tuesday..." or "bake, fry, roast..." and so on.

biziclop
1

The usual idea behind a captcha is that it should stop bots almost all of the time. A multiple choice between N answers stops the bot only (N - 1)/N of the time, and so the bot will get through in an average of N tries.
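To put numbers on this, here is a small sketch that assumes each guess is independent and uniformly random. The function names are made up, and the `blanks` parameter models the question's "1 (or 2) words" with the same number of choices per blank.

```python
def success_probability(n_choices: int, blanks: int = 1) -> float:
    """Chance that one uniformly random guess fills every blank correctly."""
    return (1.0 / n_choices) ** blanks


def pass_probability(n_choices: int, attempts: int, blanks: int = 1) -> float:
    """Chance that at least one of `attempts` independent guesses gets through."""
    p = success_probability(n_choices, blanks)
    return 1.0 - (1.0 - p) ** attempts


def expected_attempts(n_choices: int, blanks: int = 1) -> float:
    """Mean number of guesses until the first success (geometric distribution)."""
    return 1.0 / success_probability(n_choices, blanks)
```

With four choices and one blank, a single guess passes a quarter of the time and the expected number of tries is four. Adding a second blank raises the expected number of tries to sixteen, which is still trivial for a bot.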

You can implement time-outs for wrong captcha answers, but you can't be too stringent about this without seriously inconveniencing people who aren't good English (or whatever) speakers or have problems with select boxes (shaky hands, bad mice, other handicaps). What's more, time-outs are not going to stop a botnet, since the guesses can come from different IPs.

Moreover, how do you make sure there's only one legit answer? A randomly selected sentence from Project Gutenberg may make sense with several randomly selected nouns, but only one is the right answer.

David Thornley
0

All you're doing is making it harder for humans to use your site, while for the bots you're not adding any obstacles at all.

What you should rather focus on is creating a mechanism that automagically detects whether something that's posted is spam, and blocks the post if it is (deferring it to human moderation for example, and giving the poster a message to that effect).

CAPTCHAs have got to the point where they're so annoying that I increasingly avoid sites using them, as do many others. This is especially so because they are widely known to have no effect on spambots whatsoever.

jwenting