31

Say I have a request payload

PUT /user
{
  email: "invalid"
  ...
}

In the backend there is an email regex, which I cannot modify. Currently the behavior is to output:

{
  "error": "'email' fails to pass regex '<some_regex_here>'"
}

Should I go with existing behavior or change the output response to

{
  "error": "'email' is invalid"
}
夢のの夢
  • 101
    So you are asking: if Sally from accounting types her email wrong, should she get a screen full of gobbledygook? – user253751 Mar 15 '21 at 17:00
  • 5
    There's nothing inherently bad about showing a regex to the enduser. But unless the end user is very technical, it's also not likely to be very helpful to them. We're not really in a position to answer this question tho, what are the requirements given to you by the product owner or end user? – Rik D Mar 15 '21 at 17:02
  • According to the architecture, which side has the responsibility to maintain an internal log of error messages? – rwong Mar 15 '21 at 21:08
  • 17
    Is your end user truly reading your API responses? Isn't there a frontend between your backend and your end user? Is the end user you talk about a dev building that frontend? Depending on who your end user is, i.e. to whom the response will be exposed and what the surrounding architecture is, the answer may change drastically. – Frank Hopkins Mar 16 '21 at 05:33
  • Give a simple understandable message to the user, and *Log* the details. If the problem needs digging into, the log can be checked by the troubleshooter. – JDługosz Mar 16 '21 at 15:07
  • 12
    Oh, so I need to enclose my mail address in slashes, and it should have a price in dollars attached at the end ...? – Hagen von Eitzen Mar 16 '21 at 15:30
  • 1
    "Is ... regex ... bad practice?" Yes. There, simplified it for you. For very simple matches, writing it by hand will be more readable and run faster. For non-trivial matching, using a parser generator will be more readable, *probably* run faster, and be less likely to result in bugs. Regex is virtually never the answer to any technical problem. – Mason Wheeler Mar 16 '21 at 23:41
  • If you really need the regex, why not put it into an internal log? The user doesn't need to know the technicalities of the error, just that something went wrong. Ideally, I'd say you want to inform the user that they entered something incorrectly and that the specifics are recorded in an internal log for devs to check. – The Betpet Mar 17 '21 at 11:04
  • If you put a regex in an error message and rely on it as the only source of info for a user then you must provide your direct contact information, phone number and email, and encourage the user to contact you since you are the only one that can explain it in lay-man terms. – MonkeyZeus Mar 17 '21 at 14:36
  • 12
    Even before you consider showing a regex to the user, **you should not validate e-mail addresses with regular expressions**. Either the expression will be imprecise or it will be [too long to be of any help to the user](https://code.iamcal.com/php/rfc822/full_regexp.txt) (and I’m not even sure that one doesn’t miss anything). – user3840170 Mar 18 '21 at 07:36
  • 1
    The best generic regex for e-mail verification is mostly: `.+@.+\..+` :D or at least `([^ ]+|".+")@.+\..+`. Otherwise you need to do this very long thing what is posted above ↑ – morsik Apr 22 '21 at 07:30
  • It's perfectly OK to show the regex to your API consumer. To you as the API producer, they are the end user, and they are (expected to be) technical enough to understand regex. In fact, it will be helpful for them, if they are also the front-end developer, to build a first level of validation based on your regex. – matrix Apr 28 '21 at 12:51

6 Answers

150

For any error message (and mostly for any message at all), you need to ask yourself:

  • Who is the audience of the message?
  • What can they do about the problem?
  • What information do they need to solve the problem?

I would argue that knowing the regex is pretty much useless to the end user, because even if they know what a regex is, it doesn't help them fix the problem:

  • They made a typo; the fact that the email is wrong is enough information for them to take a second look at it.
  • The email is correct; that means the regex is probably wrong. Doesn't help them (the end user) to fix the problem, because they don't have a problem. It is you (the developer) that has the problem.

Knowing the regex would allow me to tweak the email address so that it passes the regex, but that makes no sense; if I tweak the email address just so that it passes the regex, it will no longer work for the intended purpose.
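
The audience split can be sketched in code. This is a minimal illustration rather than the asker's backend: `EMAIL_PATTERN` is a hypothetical stand-in for the real regex (which the question does not show), and `validate_email` is an invented helper. The end user gets a message they can act on, while the pattern goes to a log that only developers read.

```python
import logging
import re
from typing import Optional

logger = logging.getLogger("validation")

# Hypothetical stand-in for the backend's real pattern, which is not shown in the question.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+")


def validate_email(value: str) -> Optional[dict]:
    """Return an error payload for the end user, or None if the value passes."""
    if EMAIL_PATTERN.fullmatch(value):
        return None
    # Developers are the audience for the pattern, so it goes to the log.
    logger.warning("'email' failed pattern %r", EMAIL_PATTERN.pattern)
    # The end user only needs to know which field to look at again.
    return {"error": "'email' is invalid"}
```

If the rejected address turns out to be correct, the log entry tells the developer which pattern to fix without the user ever having to read it.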

Mike P
Jörg W Mittag
  • 13
    I think this is the current best answer, because it's the only one to mention the audience. If another developer has to write some code to consume this API, the actual regex in the output of the error message might be very helpful, especially if the data type is not as well known as an email address. – Rik D Mar 15 '21 at 19:25
  • 7
    While it doesn't apply to an email address specifically, there are other cases where you can make tweaks to pass the regex check, e.g. pick a new username or password when creating an account or add the country code to your phone number. Not that the typical end user would be expected to understand a regex (even many programmers don't understand regex that well, if at all). – NotThatGuy Mar 16 '21 at 09:42
  • 4
    While the end user may not be able to solve the problem, knowing the erroneous regexp might allow them to provide a better error report. "Your regexp doesn't accept an email with a 3 in it." – Barmar Mar 16 '21 at 14:25
  • 3
    In some cases, it's possible that a regex could disallow a valid e-mail address which the user could alter to pass. For instance if the regex disallows some characters, the user could change x.y+z@a.com to xy@a.com, which is equivalent. – dbkk Mar 16 '21 at 18:39
  • 17
    @dbkk: No, they are not equivalent. Those are two different email addresses. The fact that one provider (namely Google) happens to route those two addresses to the same mailbox does not make those email addresses equivalent, nor does it mean that other providers do the same. – Jörg W Mittag Mar 16 '21 at 19:15
  • While this is a good analysis, one thing missing is, "how quickly and reliably does the developer know which regex was being used to match the email address?" Even if the developer does know exactly what version of the software was in use at the time, going and finding that particular code in that particular version may take time and be subject to human error. (Don't take this as an argument for any particular case, but as something to consider in the overall context of your system and its error reporting.) – cjs Mar 17 '21 at 06:59
  • 1
    @cjs: I understood the question asking about reporting errors *to the user*. Logging errors *for developers* is a different thing, hence my first bullet point. – Jörg W Mittag Mar 17 '21 at 07:14
  • 1
    Yes, but reporting errors to the user may _also_ be part of the process that gets error information to the developer. In this case, one must balance what's best for each, along with the costs of the various options. (At one extreme, errors that I believe should very rarely or never happen I might report in ways completely useless to the user, such as `Error code A3FF291C` because, though I could improve that, I feel spending time on other areas of the product would provide more overall user benefit. I might change it later if experience shows my initial analysis was wrong.) – cjs Mar 17 '21 at 07:22
  • @cjs Yes, this is especially helpful with internal systems and smaller niche products, where error reports are often coming from a limited population of users and going to a single developer or small team. Also, error messages are useful for helping tech support with troubleshooting, not just for developers. Nothing is more worthless to tech support than a generic error message like "An error occurred." – barbecue Mar 17 '21 at 13:10
  • @Jörg Sure they're technically according to the RFC not the same email, but for the user they are the same email. And receiving a sensible error message makes a practical difference: In one case I know that removing that + specifier will allow me to pass the email check, in the other I'm left with a "an error occurred" or whatever which gives me no hint what went wrong. – Voo Mar 17 '21 at 15:17
  • @Voo. Whether or not they are the same email *depends on the provider*. It's pretty common (not just Gmail) for me@example.com and me+myself@example.com to be routed to the same inbox, but it's not necessarily the case. – TRiG Mar 18 '21 at 14:51
  • 1
    @TRiG Jörg's point I think is that technically they're *always* different email addresses according to the RFC, just when you sign up for say gmail you get a group of email addresses that are conveniently mapped to a single inbox for you. But yes in practice that's distinguishable from "single email" for every actual user, so it's not a practical distinction to make. – Voo Mar 18 '21 at 14:57
  • @Voo: The important point is that it is *only* Google that is doing this. For my company email, that is *not* true. On the other hand, for my personal email *any* address is routed to the same inbox. On yet another of my emails, joerg@provider.com and joerg@provider.de are routed to the same inbox. If you are trying to validate *mailboxes* instead of *addresses*, you are in for a world of hurt. – Jörg W Mittag Mar 18 '21 at 18:44
  • @Jörg My company-provided self-hosted email thingie (no idea what it is, I assume they probably cribbed the idea from gmail) has the exact same feature. So no, this is not just Google doing it (a quick search shows that outlook.com is doing it too). But the point is: it's very hard for a developer to consider all circumstances, and withholding useful information from error messages because you think it can't possibly help regularly turns out to be a mistake, just as this example shows. – Voo Mar 18 '21 at 18:48
  • @JörgWMittag I think you meant that it's _not only_ Google doing this general sort of thing, right? (As you demonstrate in the rest of your comment.). Many systems map multiple e-mail addresses to a single inbox, both automatically (removing any trailing `+...` on the user part of an address to determine the inbox is common) and via arbitrary configuration ("aliases" from `/etc/alias` on Unix systems or via various other means for many other e-mail providers). – cjs Mar 19 '21 at 06:52
  • I first saw the description of "good" error messages from [David Pogue](https://www.macworld.com/article/158186/desktopcritic.html) back in 2000 (which I referenced [here](https://antipaucity.com/2011/09/26/effective-error-messages/#.YImzV-spBAc)) – warren Apr 28 '21 at 19:12
31

Yes, this is bad for various reasons.

A normal end user is not going to gain anything from reading the validation regex over just reading an error message.

An attacker may or may not be able to use the exact regex to craft an attack string that causes denial of service or compromise of security. This is not likely, but it's certainly more likely with the regex than without it.

Requirements on the format of user-selectable values should always be expressible in a single, simple sentence. Anything more complex will cause more confusion than it resolves. Note: simply saying that your email must satisfy RFC XXXX is not simple enough - the official spec for email addresses is already surprisingly (or perhaps staggeringly) complex.

Kilian Foth
  • 42
    "Note: simply saying that your email must satisfy RFC XXXX is not simple enough" – It can also be slightly annoying because a lot of the regexes you find online are actually wrong, so there is a possibility that you are telling a user "you must comply with RFC" when in fact that user *is* complying with the RFC and *you* are the one that is not. – Jörg W Mittag Mar 15 '21 at 17:41
  • 23
    In practice, pretty much every email validating regex is useless. It's far more useful, and with no false negatives, to simply check there's an `@` in it then send a verification email. – OrangeDog Mar 16 '21 at 09:53
  • 7
    @OrangeDog: Indeed. In fact, whether or not an email address is spec conformant is pretty much irrelevant in pretty much every case I can think of. In almost every use case, the reason you ask for an email address is because you want to send emails to that address. The only way to verify that you can successfully send emails to that address is to successfully send an email to that address. There are plenty of email providers that accept non-spec-compliant email addresses. And there are an infinite number of syntactically spec-compliant email addresses that don't have an associated mailbox. – Jörg W Mittag Mar 16 '21 at 18:03
  • 10
    @JörgWMittag The only way there could be an infinite number of such addresses is if you allow them to be infinitely long, which definitely [is *not* spec-compliant](https://stackoverflow.com/a/574698/864696). The number of spec-compliant addresses that are not serviceable is definitely finite, even if it is absurdly large. – Ross Presser Mar 16 '21 at 18:12
  • 13
    @RossPresser: The second I hit Enter, I knew someone would call me out on that. – Jörg W Mittag Mar 16 '21 at 18:28
  • 3
    @JörgWMittag I'm glad I could meet your expectation, then! :) – Ross Presser Mar 16 '21 at 18:46
  • @RossPresser If we take 256 as being the maximum length, and an extremely limited character set as the available contents (e.g. downcased a-z and certain other characters, something like [-a-z.+]), we're looking at a number of available email addresses in the range of 30^256. While it's true that 30^256 is as far away from infinity as 1 is, trying to wrap your head around that number is as good a place as any to start trying to figure out how infinity works. As a quantity it's _effectively_ infinite, and the central point remains: Restricting to some regex does very little for you. – Williham Totland Mar 17 '21 at 03:51
  • @WillihamTotland That is _not_ a good way of figuring out how infinity, as a mathematical concept, works. Distinguishing "infinite" from "computationally infeasible" is. – cjs Mar 17 '21 at 07:07
  • 3
    Infinite isn't just a mathematical concept, it's also an English word with a widely accepted meaning of "unimaginably large." – barbecue Mar 17 '21 at 12:34
  • 1
    @WillihamTotland I already stipulated that it's absurdly large. "Infinity" has a definite mathematical meaning and there is nothing lost by avoiding this term. "Effectively infinite" is an almost meaningless term that doesn't tell you anything more than "unreachably large" (or "computationally infeasible", thanks @cjs), which conveys the exact meaning without incorrect connotations. I said nothing about what regexes do or don't do. Finally, if you're going to invoke a character set, remember that digits are perfectly legal and in wide use in email addresses. – Ross Presser Mar 17 '21 at 13:41
  • 1
    @barbecue First, I dispute that such meaning is widely accepted, or, especially in the computer world, that the word is widely used as such. Second, if I allow your two meanings in use, why use a term that can be misunderstood when better terms are easily at hand? – Ross Presser Mar 17 '21 at 13:43
  • 1
    @RossPresser I have no reason to lie. I've got lots of dictionaries on my side, including [the OED](https://oed.com/view/Entry/95411). Feel free to consult any of them for yourself. People with no understanding of advanced mathematics whatsoever use the word infinite in conversation. It's pretty self-evident that the strict mathematical definition cannot be the most commonly used one. – barbecue Mar 17 '21 at 14:23
  • I think I've used this regex for email address validation before. I dare you to show it to a user: `(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])` – Flydog57 Mar 17 '21 at 22:27
  • IMHO, in the case of a web app, it could be of help to have a simple regex to validate an email on the UI to avoid typos and invalid requests, but then you will have the proper validation on the backend (typically with email-specific code and not a regex). – Andrew Apr 29 '21 at 13:51
16

As someone who previously used email addresses that too many sites thought were invalid, I appreciated at least knowing that you used a regex for validation, because unless all it does is check for an @ with at least one character on each side, I almost guarantee you got it wrong. In the worst case I saw, it accepted my email during registration, but later rejected it during login.

Even a non-technical user can post a question somewhere that says, "Site X won't accept my email address. It keeps saying it doesn't match the regex, whatever that is." And someone can tell them it's most likely the site's fault for only accepting a subset of valid email addresses, and they'll know to look out for the "regex" word, even if they don't know what it means.
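
The minimal check described above (an `@` with at least one character on each side) is small enough to write without a regex. The sketch below is an illustration under that assumption only; `looks_like_email` is an invented name, and the check says nothing about whether mail can actually be delivered to the address.

```python
def looks_like_email(value: str) -> bool:
    """True if there is an '@' with at least one character on each side."""
    # rpartition splits on the last '@', so a quoted local part that
    # itself contains '@' is not rejected.
    local, at, domain = value.rpartition("@")
    return bool(at) and bool(local) and bool(domain)
```

Addresses with a `+topic` suffix pass this check unchanged, for example `looks_like_email("me+topic@example.com")`.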

Karl Bielefeldt
  • 2
    Having also been bitten by this many times, I'd even go so far as to give examples in the error page of valid email addresses that the system nevertheless rejects (but then I reserve a circle of hell for systems which reject valid email addresses on syntactic grounds). – Patrick Stevens Mar 16 '21 at 14:52
  • 1
    @PatrickStevens, I think poorly written regex's are the primary cause of many bugs, and I would add that they can't be proven correct, and in some cases they might never terminate on some inputs. You should never use regex to process untrusted user inputs, and until you've validated their credentials, generally email & password, you can't trust them, so it's a really bad idea to use a regex to validate that sort of user input. – jwdonahue Mar 17 '21 at 04:52
  • Curious what your email is (you can post it obfuscated obviously). – Luke Vo Apr 22 '21 at 09:46
  • @LukeVo, some providers support adding a `+topic` to the left side of your email, making it easier to filter out emails from a certain place. I tried to use that for a while and had a ton of issues. – Karl Bielefeldt Apr 22 '21 at 12:51
  • Agreed, my experience is that a regex is less helpful than checking for an `@`. If a working email is important then you need to send verification anyway. – Rafi Apr 28 '21 at 09:16
10

From a general security perspective, the "best practice" principle is to avoid exposing internal details of the system to a user when an error occurs, to prevent a hacker from using that information to breach the system.

That's why IIS operates in two modes: a "User Mode," where a faulty page displays, at most, an HTTP response code like 404 or 500, and an authenticated "Administrative Mode," which will also supply detailed error information like stack traces.

In some cases, pages will actually display incomplete or outright wrong information. For example, in login pages it is common to respond to an incorrect password with something like "Authentication Failed," without identifying whether the login name or password is the problem. If a user tries to open a web page for which they don't have adequate permissions, the web server may simply respond with 500 instead of telling the user they don't have permission.
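
The login example can be sketched as follows. Everything here is hypothetical: the in-memory `USERS` store, the `login` helper, and the choice of status code are assumptions for illustration, and an unsalted SHA-256 digest is used only to keep the sketch short (a real system would use a salted, slow password hash). The point is that an unknown login name and a wrong password produce the same response.

```python
import hashlib
import hmac


def _digest(password: str) -> str:
    # Illustration only; not a suitable password hash for production.
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


# Hypothetical in-memory user store.
USERS = {"sally": _digest("correct horse")}
_DUMMY = "0" * 64  # compared against when the user does not exist


def login(username: str, password: str) -> dict:
    stored = USERS.get(username)
    supplied = _digest(password)
    # Always run the comparison so both failure paths do similar work.
    matches = hmac.compare_digest(stored if stored is not None else _DUMMY, supplied)
    if stored is not None and matches:
        return {"status": 200, "body": {}}
    # Same response whether the name or the password was wrong.
    return {"status": 401, "body": {"error": "Authentication failed"}}
```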

Robert Harvey
  • 6
    This is known as the “computer says no” approach. I don't agree that IIS should be taken as a model of “best practice”. Here “Administrative” should really be read as “debug” mode. Supply enough information that would allow someone with enough expertise (be it the end-user, or someone trying to help the end-user) to get _some_ handle on the problem. Simply displaying “computer says no” is unhelpful to everyone concerned. – liyang Mar 16 '21 at 12:07
  • 2
    This is called security through obscurity, and it's a bad practice. – Ben Crowell Mar 16 '21 at 14:43
  • 5
    @BenCrowell: If it's the **only** way that you provide security, then yes, it's security through obscurity and therefore a bad practice. You can still have doors without windows. – Robert Harvey Mar 16 '21 at 14:45
  • Just thinking here - you could log the stacktrace, assign a simple ID to the logs, and communicate that simple ID to the user "An error has occurred. If you contact the helpdesk, please refer to ticket `#WEB-57`". The ID itself does not leak information. Bonus: you can more easily deduplicate identical stacktraces. – MSalters Mar 18 '21 at 08:54
  • I would not specify the details of an authentication error to prevent guessing used email addresses which in turn could lead to password guessing. And instead of HTTP 500 I would use a 401. – Thomas Junk Apr 29 '21 at 18:05
  • @ThomasJunk: Sounds reasonable. Any error code will do, really, so long as it doesn't expose details. – Robert Harvey Apr 29 '21 at 18:09
2

You have an HTTP API. Probably a RESTful one, but there's no need to jump to conclusions.

There are three points of view in play:

  • An API is usually consumed by other code. This means that the API is consumed by whoever wrote that code: a programmer, or a tech-savvy user. Providing as detailed an error message as possible is a good user experience for them. If you are worried about the end user, you needn't be. Just change the message on the FRONTEND to something your END USER will understand.
  • This being an HTTP API, and the e-mail in question being user input, this particular behavior should be implemented as 400 Bad Request. Again, at this point we are dealing with a 4xx client error, the client being the frontend or another API consuming your API. It is good practice to include enough information in 4xx error messages for the consumer to fix things on their side. Let them (the developers of the frontend) deal with end users and transform the error messages. IMHO it's too soon to draw conclusions about end users at the API level.
  • Finally, security. I don't see any security problem with displaying the regex used to validate an e-mail. Security by obscurity is a discouraged practice and does not achieve any real security. Implement proper security instead.

With that being said, definitely include the regex in the error message. As a client-side developer, I want to know why our users cannot register with the app that's using your API without DMing you or looking into your backend code.
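
One way this division of labor might look, as a rough sketch: the API returns a 400 with a machine-readable body that includes the pattern, and the frontend maps the error code to wording for its own users. The field names (`field`, `code`, `pattern`), the `pattern_mismatch` code, and `EMAIL_PATTERN` are all assumptions made for the example, not part of the API in the question.

```python
import re
from typing import Tuple

# Hypothetical stand-in for the backend's real pattern.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+")


def validate_user_payload(payload: dict) -> Tuple[int, dict]:
    """API side: return (HTTP status, response body) for a PUT /user payload."""
    email = payload.get("email", "")
    if isinstance(email, str) and EMAIL_PATTERN.fullmatch(email):
        return 200, {}
    # Detailed, machine-readable error aimed at the developer consuming the API.
    return 400, {
        "error": {
            "field": "email",
            "code": "pattern_mismatch",
            "pattern": EMAIL_PATTERN.pattern,
        }
    }


# Frontend side: translate error codes into wording for the end user.
FRIENDLY_MESSAGES = {"pattern_mismatch": "Please check your email address."}


def message_for_end_user(body: dict) -> str:
    code = body.get("error", {}).get("code")
    return FRIENDLY_MESSAGES.get(code, "Something went wrong.")
```

The consumer can log or inspect `pattern` while debugging, and the end user never sees it.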

netchkin
0

Things aren't black and white here. There are multiple questions to answer.

  1. What is the character of the API?
    • Public API: If the REST API is public, you will want to document it well and add samples of both valid and invalid input. This is good practice in general. I like the informational flavor of an error that shows which regex was used for the validation, which might be helpful for developers. However, it is possible that the email is correct and the regex fails anyway (see point 3).
    • Private API: The REST API is for communication between internal systems. You don't need to be as verbose as in the previous case, although it is still good practice in general.
  2. What happens with the error message?
    • Caught and logged: It might be useful to see the regex the validation failed on, at least in the logs.
    • Propagated to the front-end: The question here is who the consumer is, in other words, how qualified the reader of the error message is. If it is anybody who is not technically qualified, OR for whom knowing the regex has no benefit, it makes no sense to add it.
  3. What if the email address is correct but still fails the regex?
    • Null validation: Be careful. If you check for null and then throw this same error message, it is misleading. The same applies to any validation that happens before or after the regex validation.
    • Regex correctness: There are various email regexes, and each one behaves a bit differently and follows different standards. Read more here and here. A scenario in which a valid email doesn't pass is possible. Again, whether such a verbose error message belongs there depends on who has to deal with it. If it is a public API and the user is a tech-savvy person, they might file an issue saying that the email is valid but the regex doesn't match it, with a link to a Regex101.com sample. For anybody else, such information has little or no value.
  4. How about security?
    • Safety first: Is there even a minimal risk that knowledge of internal system details could be abused and cause harm in the future? If so, don't expose such information. It is also good practice to merge such messages into a generic one:

      The combination of the email, birth date, password and security code is invalid.

Answering these would give you a general idea of whether it is better to include a Regex or not.

Disclaimer: If you do decide to include the regex in the error message, remember to emit the actual one and not hard-coded text. There is nothing worse than displaying a regex different from the one that is actually used.
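
A small sketch of that disclaimer, assuming a Python backend: the message is built from the compiled pattern object itself, so the text cannot drift away from the regex that actually ran. `EMAIL_PATTERN` and `regex_error` are hypothetical names, and the pattern is a stand-in for the real one.

```python
import re

# Hypothetical stand-in for the backend's real pattern.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+")


def regex_error(field: str, pattern: re.Pattern) -> dict:
    """Build the error from the pattern object that was used for matching."""
    # pattern.pattern is the source string the regex was compiled from.
    return {"error": f"'{field}' fails to pass regex '{pattern.pattern}'"}


def check_email(value: str) -> dict:
    if EMAIL_PATTERN.fullmatch(value):
        return {}
    return regex_error("email", EMAIL_PATTERN)
```

If the pattern is later changed, the error message changes with it and no second copy needs updating.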