Hide, obfuscate or otherwise prevent the harvesting of email addresses

Question

I am developing a public repository webapp for my organization.

It will be a public webapp, exposed to the internet. All people and organisational units can be queried, and their contact data will be displayed. It is developed as a single page app against a REST back-end. There will also probably be a mobile front-end in the future.

One requirement is that people's emails are visible, and are clickable links with the mailto:email@so.com href attribute, so users can click on the address to quickly start writing an email.

On the other hand, I want to make email harvesting difficult for spammers (I know that with the above requirement, it will always be possible to ultimately get the email addresses, but I don't want it to be extra easy). So I don't want to expose the emails in clear text in my API.

The previous version of this app used server-generated text-to-image to show the address, and then the onclick handler used an AJAX call to get the actual address from the server (based on the ID of the person), then activate the "mailto" link.

It does not seem so good to generate one or two extra server calls for each person displayed, especially when displaying a search results list. I am thinking I can probably do better. For example, I could just include the email field in my API, but obfuscate/encrypt it. The app (or any future client made by us such as a mobile app) would know how to decode the email address.

Is there a better way to do this?

Can't you generate the image when the email is set (or changed) and store that in the DB as a blob of data? Then you don't need the extra roundtrips, you just send it down instead of the stored email text. — gbjbaanb, Nov 26 '15 at 10:30
Yes, that could be a solution. In this case I have limited control on the database side though. And I would still need a round-trip when the user clicks on it to get the clear text. — Pierre Henry, Nov 26 '15 at 10:35
Sure, but that is a round trip only once, when the user wants to do something. If you cannot store the info in the DB (which would be the sensible place to put it), store it in a folder and name it appropriately, either sanitised email address or user id. — gbjbaanb, Nov 26 '15 at 10:40
Actually, whether I store it or not (just generate it on the fly, maybe cache it in memory), the idea of sending it inlined in the JSON might be the key. — Pierre Henry, Nov 26 '15 at 10:49
Can you persuade the decision makers to use a contact form instead? There really is no good way to show email addresses to users but not to bots. — Philipp, Nov 26 '15 at 13:55
I am not sure a contact form will work well in that context. I will think about it though. — Pierre Henry, Nov 26 '15 at 15:25

score 5 · Answer 1 · answered Nov 26 '15 at 10:06

Don't overthink things: the obvious way (render an image instead of text) is exactly the right thing to do here.

Nowadays, any time delay or processing cost involved in an extra server call will be negligible compared to the kind of time it takes a user to move a mouse and perform a click in the first place. (From the viewpoint of a computer, people move in ultra-ultra-slow motion.)

Kilian Foth

Yes, for a single person view, but when you have a search result page with maybe 20 emails on it, it's 20 server calls (one for each image). Might still be fast enough for the user not to notice, but just doesn't sound very nice. – Pierre Henry Nov 26 '15 at 10:13
@PierreHenry Huh? Are you ever revealing *multiple* email addresses in response to *one* user click? Doesn't a "mailto" action involve a click on *one* recipient? – Kilian Foth Nov 26 '15 at 10:23
Well in the current app, there is a search form, then the search results are displayed in a table with paging, and one of the columns in the result is the email, which is displayed as an image. So if there are 20 people displayed, there are 20 "email images", so 20 server calls. In turn each image is individually clickable. – Pierre Henry Nov 26 '15 at 10:25
@PierreHenry: Why isn't the server sending the "email images" as part of the search result? That is what your API should be providing as the representation for an email address of a person. – Bart van Ingen Schenau Nov 26 '15 at 11:39
Yes, see comments on the question. I had not thought of inlining the images in the JSON, and that seems like a good idea! – Pierre Henry Nov 26 '15 at 13:09
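
A minimal sketch of the inlining idea, assuming the back-end already holds the rendered PNG bytes for each person. The field names (`id`, `name`, `emailImage`) and both function names are hypothetical and not taken from the actual API.

```python
import base64
import json


def to_data_uri(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as a data: URI usable directly as an <img> src."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{encoded}"


def build_search_response(people, email_images):
    """Build the JSON body for one page of search results.

    people       -- list of dicts with the hypothetical keys "id" and "name"
    email_images -- dict mapping a person id to its pre-rendered PNG bytes
    """
    results = [
        {
            "id": person["id"],
            "name": person["name"],
            # The image travels with the result, so no extra GET per row.
            "emailImage": to_data_uri(email_images[person["id"]]),
        }
        for person in people
    ]
    return json.dumps({"results": results})
```

The client can assign `emailImage` straight to the `src` of an image element. The trade-off is size: base64 makes each image about a third larger than the raw bytes, and inlined images are not cached separately by the browser.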

score 2 · Answer 2 · edited Nov 26 '15 at 15:48

Why not use an image as stated above, and include a (lightly) encrypted email address for each image, along with a local JavaScript function to resolve the obfuscated email upon click? That way everything stays single-trip, and most spam harvesters aren't going to hook into the event and trace the code path to a result; they're just going to read the tag and hope it's valid. Simple and effective, no? Not a cure-all, but it ought to do.
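
As a rough sketch of what "lightly encrypted" could mean, the Python below XORs the address with a fixed key and base64-encodes the result. This is only one possible scheme, chosen for illustration; the key and all function names are hypothetical. The server would call `obfuscate_email`, and the client would implement the equivalent of `reveal_email` in JavaScript and run it only inside the click handler.

```python
import base64

# Hypothetical shared key. It ships with the client code, so it is not a
# secret: this is obfuscation against naive scrapers, not real encryption.
OBFUSCATION_KEY = b"not-a-secret"


def _xor(data: bytes, key: bytes) -> bytes:
    """XOR every byte with the key, repeating the key as needed."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def obfuscate_email(email: str) -> str:
    """Server side: turn a plain address into an opaque token for the API."""
    masked = _xor(email.encode("utf-8"), OBFUSCATION_KEY)
    return base64.urlsafe_b64encode(masked).decode("ascii")


def reveal_email(token: str) -> str:
    """Client side (mirrored in JavaScript): recover the address on click."""
    masked = base64.urlsafe_b64decode(token.encode("ascii"))
    return _xor(masked, OBFUSCATION_KEY).decode("utf-8")


def mailto_href(token: str) -> str:
    """Build the link target only at the moment the user clicks."""
    return "mailto:" + reveal_email(token)
```

Because the token contains neither `@` nor a dot, a harvester that pattern-matches for addresses in the API response or the markup finds nothing; one that runs the click handler still gets the address.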

edited Nov 26 '15 at 15:48

Robert Harvey

answered Nov 26 '15 at 15:25

jleach

Actually, I was thinking that the encrypted address might suffice, given that it is a single page app with the rendering done client-side. Or are the harvesting robots smart enough to execute the Angular code and analyse the rendered markup? – Pierre Henry Nov 26 '15 at 15:27
You might have a few that are intelligent enough, but for most? I'd say no. Most are cheap crap that doesn't have that kind of foresight, so that simple method ought to take care of the greater majority of cases. Note that I specifically wouldn't resolve this until onClick(). – jleach Nov 26 '15 at 15:29
So you think that resolving it for display (without using an image) would be vulnerable? – Pierre Henry Nov 26 '15 at 15:38
No, an image to display would still be recommended. If it were to display in plaintext on the webpage, it wouldn't matter how it was generated: the bots will be able to get it easily. By obfuscating an underlying value until it's clicked on, you never have the actual value in "display mode" on the page except in the form of an image, so no direct readable text for bots. – jleach Nov 26 '15 at 15:57
To further clarify, this obfuscation of the email address would be a technique used in conjunction with displaying images, specifically so that no further request to the server is needed when the user actually wants that email address. It lets you provide the email data along with the rest of the page, just in a format that makes it more difficult to harvest. – jleach Nov 26 '15 at 16:03
Thanks, but if no image is used, and since this is a client-side rendered single page app, the harvester would have to execute the JavaScript that renders the page in order to get its "display mode" (the HTML containing the address in plain text). And it would be too dumb to do that, no? The only thing available directly in the API would be the encrypted email. – Pierre Henry Nov 26 '15 at 16:19
I don't have a lot of experience with bots and SPAs, and most SPAs tend to be behind the screen of authentication (e.g., requiring a login), so bots aren't that much of an issue, but I'd say that in the context of "pages" on an SPA, they still follow a general "navigation" of some style which bots are quite used to working with. In this case, it seems like loading the page itself is generic enough to expect that bots will get to it, and I'd take the quick obfuscation step just for good measure. It's easy to do and you can probably implement it in less time than this discussion has taken :) – jleach Nov 26 '15 at 16:24

Mike Nakis · Accepted Answer · 2015-11-26T16:52:36.480

First, let me say that the image is a good solution, and it does not require any extra roundtrips to the server while displaying search results: once generated, the image can be saved in an image file on the filesystem of the server, and served as a plain <img src="user337567.png"/>. This means that the server will essentially be caching the images, recomputing them only in the event that an email address has changed. Extra roundtrips to the server will only be required when the user clicks on an email address image, but clicking is an operation performed in human time, and therefore represents negligible overhead.
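
A minimal sketch of that caching behaviour, assuming the actual text-to-image step is supplied by the caller as `render_png` (a hypothetical callable; any image library could sit behind it). A small sidecar file holding a hash of the address is one way to detect that the email has changed; it is an illustrative choice, not something the approach requires.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_email_image(
    user_id: int,
    email: str,
    cache_dir: Path,
    render_png: Callable[[str], bytes],
) -> Path:
    """Return the path of the cached image, re-rendering only on change."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    image_path = cache_dir / f"user{user_id}.png"
    # Sidecar file remembering which address the image was rendered from.
    # In a real deployment it should live outside the publicly served folder.
    digest_path = cache_dir / f"user{user_id}.sha256"

    digest = hashlib.sha256(email.encode("utf-8")).hexdigest()
    is_fresh = (
        image_path.exists()
        and digest_path.exists()
        and digest_path.read_text() == digest
    )
    if not is_fresh:
        image_path.write_bytes(render_png(email))
        digest_path.write_text(digest)
    return image_path
```

Called on every request, this costs two small file reads when nothing has changed and one render when the address is new or different.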

One slight problem with this approach is that spammers may be using optical character recognition technology. One way to account for this possibility would be to make the rendered email addresses difficult to read, sort of like a captcha, but you will never have any metric telling you how successful you were in this.

Other approaches:

Require authentication.

Make the webpages of your public repository webapp visible to all visitors, but hide the email addresses. When a visitor clicks on (or hovers over) a hidden email address, inform them that they have to register in order to view that information. In the registration process, require a captcha. The logic behind this:

It is only fair that you can see our email addresses if you first let us know yours, right?

Note that you can even use this approach in addition to serving email addresses as clickable images, for added security.

Also note that you can use this approach to protect a lot more information than just email addresses. (What if there is a need to also protect phone numbers later?)

Use additional anti-harvesting measures.

One approach commonly used is to require that clients use cookies, so as to be able to identify each client, and then keep track of requests received by the server from a specific client, and if the client sends too many requests too fast, then blacklist them. Normally, blacklisting means denying any service whatsoever, but in your case, blacklisting could simply mean that from that moment on you don't show them any more email addresses, or that from that moment on you start showing them images instead of email addresses.
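
The tracking described above could be sketched roughly as follows, assuming the client identifier comes from a cookie and that state is kept in memory in a single process. The class name, the method name, and the default thresholds are all hypothetical; a real deployment would likely need shared storage and tuned limits.

```python
import time
from collections import defaultdict, deque


class HarvestGuard:
    """Track request times per client and blacklist clients that go too fast."""

    def __init__(self, max_requests=30, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self._history = defaultdict(deque)
        self._blacklist = set()

    def may_see_emails(self, client_id: str) -> bool:
        """Record one request and report whether emails may still be shown."""
        if client_id in self._blacklist:
            return False
        now = self.clock()
        history = self._history[client_id]
        history.append(now)
        # Forget requests that have fallen out of the time window.
        while history and now - history[0] > self.window_seconds:
            history.popleft()
        if len(history) > self.max_requests:
            self._blacklist.add(client_id)
            return False
        return True
```

The request handler would call `may_see_emails(cookie_value)` once per request and, on `False`, leave out the addresses or fall back to images, which matches the softer form of blacklisting suggested here.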

Note that this is a generally useful thing to have, which may prevent various different kinds of abuse, and you might want to implement it regardless of what you end up doing specifically for the email addresses.

Implement "we'll call you back"

If you really want to avoid authentication, then instead of displaying email addresses, you can have a "contact" field, which, when clicked, pops up a dialog which asks the visitor to enter their email address, (probably along with a captcha,) and sends the visitor an email message to which the visitor may reply.

Thanks, but this cannot work in this particular case. Nobody will want to register just to be able to see the email of somebody they want to contact. And we certainly don't want to have to manage (potentially thousands of) accounts, authentication, etc., just for this. — Pierre Henry, Nov 26 '15 at 16:22
About the server round-trip: even if the image is on disk on the server, it still requires a round-trip (from the client browser) if you serve it with an img tag as you said. The solution might be to inline the images in the JSON as base64 encoded strings. — Pierre Henry, Nov 26 '15 at 16:25
I would think that this round-trip happens within the still-open connection of the HTTP request for the web page, (you are using "connection keep-alive", right?) so I thought that it would not be that big of a deal. But of course you are in a position of knowing better. — Mike Nakis, Nov 26 '15 at 16:28
I wouldn't care about just one GET for one image but in search results it might be like 20. Anyway, still no big deal, just trying to avoid unnecessary requests. — Pierre Henry, Nov 26 '15 at 16:30
@PierreHenry So, you really want everyone but the spammers to be able to view those email addresses, eh? That's a tough one. I amended my answer with a couple of more suggestions. — Mike Nakis, Nov 26 '15 at 16:41
Good suggestions. I'll talk with the others and see what we do :) — Pierre Henry, Nov 26 '15 at 16:44
I think a base point to make here is this: if you're going to display the emails without requiring authorization, you're going to have some spam bots getting it no matter what you do. You can easily cull 80% or so of it, but if you're particularly worried about that last 20%, you can spend 20 times longer trying to take care of it, yet will never get it to 100% secure. Either pick an "ok point", based on effort required, or put in some sort of authorization. — jleach, Nov 26 '15 at 16:49
I accepted this answer because of the additional suggestions. I ended up implementing a pseudo authentication for the app against the API, using a challenge based on some hashing. Not really hard for a hacker to crack if motivated, given that he can have access to the JS source code of the client, but at least the API is not "free-service" for the whole internet. — Pierre Henry, Feb 02 '16 at 14:47
