64

I'd like to use Youtube as an example: they use IDs in the form of PEckzwggd78.

Why don't they use simple integers?

Or imgur.com - they also use IDs such as 9b6tMZS for images and galleries. Not sequential integers.

  • Why don't they use integers (particularly sequential ones)?

  • In what cases is it a wise decision to use such string IDs instead of integers?

Rakori
  • 787
  • 1
  • 5
  • 7
  • 48
    What makes you believe the IDs are not just simple integers? I know a lot of web services which use integers in the DB but display them in some base64 encoding so URLs look nicer. [Interestingly, the youtube IDs almost map to 64bit integers.](https://stackoverflow.com/questions/6180138/whats-the-maximum-length-of-a-youtube-video-id#comment50080773_6250619) – Josef Nov 28 '17 at 10:29
  • 1
    @Josef From OP's comment, it seems OP does not recognize the need or the importance of URL shortening. – rwong Nov 28 '17 at 10:34
  • 3
    @rwong But OPs question is why don't they use numeric IDs and the answer might be: They use numeric IDs, they just display them in base64 instead of base10 or base2. I don't know that for sure, though, so I am asking OP what specifically makes them think the IDs are not simple 64bit integers in base64. – Josef Nov 28 '17 at 10:45
  • 9
    https://www.youtube.com/watch?v=gocwRvLhDf8 – Roger Lipscombe Nov 28 '17 at 11:00
  • 3
    Isn't that the same as [this](https://softwareengineering.stackexchange.com/questions/301620/why-do-some-prominent-web-sites-use-alphanumeric-strings-for-resource-ids-instea). – the_lotus Nov 28 '17 at 13:47
  • ~6 bits per character instead of ~3 bits per character for decimal presentation of integers. And integers do not come in all possible lengths. Otherwise they are exactly the same. Otherwise it doesn't matter at all. – NoDataDumpNoContribution Nov 28 '17 at 18:02
  • Just to further clarify other people's answers and comment, your example (`PEckzwggd78`) decodes to `0x3C4724CF082077BF` in hexadecimal or `4343480837235308479` in decimal. Compare the base64 encoded string (12 chars) with the hexadecimal (16 chars) or even the decimal (19 chars); why should you use more chars for the exactly same value? – frarugi87 Nov 29 '17 at 16:20
  • 3
    Possible duplicate of [Why do some prominent web sites use alphanumeric strings for resource IDs instead of numbers?](https://softwareengineering.stackexchange.com/questions/301620/why-do-some-prominent-web-sites-use-alphanumeric-strings-for-resource-ids-instea) – Brad Werth Nov 29 '17 at 18:28

10 Answers10

106

Youtube can't use sequentional IDs for two reasons:

  1. Its databases are almost certainly distributed, making sequential numbering complicated.

  2. It has a privacy option "Unlisted videos": those that don't show up in the search results, but are available if you know the ID.

Therefore, the video IDs should be reasonably random and unpredictable. Whether the ID is represented by digits only, or by a combination of letters and digits, is irrelevant: there is a trivial mapping from one representation to another.

IMil
  • 849
  • 1
  • 5
  • 7
  • 13
    Numeric ids dont have to be sequential – Sopel Nov 28 '17 at 10:02
  • 28
    @Sopel I think IMil's point is that Youtube needs to generate IDs that are sparse. In other words, if it is estimated that you will only ever need to store `2^40` items, in some architectures there are legitimate reasons for choosing a space of `2^80` or `2^120` bits. Examples of reasons are: reducing collision without technically checking for collision; using the sparseness of keys as part of making secrets hard to find (the "unlisted video"), etc. – rwong Nov 28 '17 at 10:33
  • 13
    @Sopel the question was "Why don't they use integers (particularly sequential ones)?" I explain that: 1) sequential IDs are undesired; 2) integers and strings are basically the same thing – IMil Nov 28 '17 at 11:01
  • 3
    The "therefore" clause doesn't logically follow but the two numbered points are correct. As an example of why randomness is not a necessary consequent: sequential numbering with uniform gaps will work to provide unique ids in multiple independent databases such that the results can be combined in a datawarehouse - this is a form of sharding. That is, suppose you anticipate no more than 10000 regional databases (perhaps you have only 10 right now so 10000 is sufficient). Then each db can have an identity column counting by 10000 with unique last 4 digits, there will be no collision on merge. – davidbak Nov 29 '17 at 02:09
  • 2
    @davidbak the requirement for randomness follows from (2). Uniqueness may indeed be obtained by assigning non-overlapping ranges to different database instances, but this would leave the IDs predictable. – IMil Nov 29 '17 at 12:24
  • 1
    @IMil - there are other ways to achieve that as well, e.g., do not use an identity column but instead assign from a given pseudorandom number generator where each db uses a different sequence/seed. They're deterministic (not random) from YouTube's POV but unpredictable from attacker's POV. And if you use specific sharding values at the end (my example above) for the "private" dbs then they're easy to exclude from search results as well. – davidbak Nov 29 '17 at 18:34
  • 2
    @davidbak I didn't mean random numbers in a strong cryptographical meaning. They may be deterministic from internal point of view, but should be unpredictable to outsiders, that's the whole point. I might try to edit the answer to make this clearer. – IMil Nov 29 '17 at 18:56
81
  • On the form of the IDs: They're using Base64 (using the characters a-z, A-Z, 0-9, -, and _). This allows them to have 6 bits of information per character. YouTube uses 11-character video IDs, which means they can generate 26*11, or more than 7*1019 IDs. As Tom Scott put it, that's "enough for every single human on planet Earth to upload a video every minute for around 18,000 years." Base64 is also easy to work with, because 64 is a power of 2, which means every character represents an exact number of bits. We use hexadecimal (base 16) for the same reason.

  • On the non-sequential nature of the IDs: it means they don't need a synchronized counter between all the servers that assign IDs to videos. They can just generate a random number, check if it's already in use, and go from there. They could even assign each server a block of IDs to pick from and eliminate the duplication checking. I don't know if they're doing that, but they could.

  • Another reason for the non-sequential IDs is that it is what makes "unlisted" videos work. These are videos that won't show up in search results or as suggestions, but that are accessible if you've got the link. If you're using sequential counting, you can just go to a video, increase the ID by one, and the idea of unlisted videos is now broken.

  • Non-sequential IDs also help hide information from competitors, such as the total amount of videos, or the number of videos uploaded per timeframe.

I can highly recommend Tom Scott's video. His information is almost always both interesting and accurate.

rchard2scout
  • 754
  • 4
  • 10
  • 6
    Let's also point out that 11 characters of a base64 encoding store 66 bits of information, which means that they can easily map a 64bit integer into such a string. I.e. internally, they could use a 64bit int anyway (but need not do so). – Bernhard Hiller Nov 28 '17 at 11:29
  • 1
    For comparison, the conventional decimal representation could require as many as 20 characters, “wasting” up to 9 characters compared with Base64. – dan04 Nov 28 '17 at 22:30
  • The Tom Scott video explains this perfectly. – AGB Nov 29 '17 at 10:46
14
  • Integers do not scale that well, a "normal" 32-bit unsigned integer will max out just over 4 billion.

  • They may not want you to know how many items they have on line or keep track of the rate they are growing. If your order numbers just count up it will be easy for a stranger to deduct how many orders you got in a certain period.

  • Letters can hold more information than digits, you need fewer letters to express the same "number". Not in binary format but in the presented format on a screen or on paper. For instance, you could have the year or an entire date as part of your id, or the type of customer.

Martin Maat
  • 18,218
  • 3
  • 30
  • 57
  • 8
    1) one can use int 64 – Rakori Nov 28 '17 at 07:13
  • 4
    2) why? ........... they're all public anyway. those that aren't public -- aren't accessible. that's it – Rakori Nov 28 '17 at 07:14
  • 4
    3) can you elaborate? express what information ? – Rakori Nov 28 '17 at 07:14
  • 1
    also, they reveal info. how many videos are on you tube, how many users, whats the growth this quarter etc – Ewan Nov 28 '17 at 08:45
  • 1
    Makes the "guess attacks" harder. – Laiv Nov 28 '17 at 09:43
  • Letters do store more information than digits, but a string of letters doesnt store more information tham a number (occupying the same space) – Sopel Nov 28 '17 at 09:55
  • 2
    For 1: same goes for int32 and int64. While int64 is potentially way larger, it could be not large enough. – Nepho Nov 28 '17 at 10:28
  • 3
    In the database you would store a number as a number. So a 32 bit int would take 32 bits. Text would have less density (how much poorer text is would depend on encoding) – Taemyr Nov 28 '17 at 11:27
  • 1
    The first point doesn't really make sense because integer is a very broad term that comes from mathematics. The number of bits in it is totally language and maybe even architecture dependent. Eg, Python's `int` type is in fact an arbitrary precision integer. And, eg, in Sqlite, `INTEGER` is a type that may be up to 64 bits (unless it's a primary key). Postgres also has integer types ranging from 16 to 64 bits as well as the `NUMERIC` type allowing for arbitrary precision integers. – Kat Nov 29 '17 at 16:04
  • The first point is indeed not that pressing considering we have 64 and even 128 bit integers (guids/uuids). The numbers would appear impractically long though. If you were to use guids you would also not use digits to express them but do it like Windows with hexadecimal characters which amounts to 32 characters per guid. – Martin Maat Nov 30 '17 at 06:13
9

1) Why do some websites use letters in their IDs? Are they strings?

We don't know if those websites store IDs in their database as strings. Numbers and strings are really the same to computers. A string is just a number, just shown with a different base. 'A' = 0x41 = 65 = 0b1000001, to the computer it's all the same. But if you display it, the larger the base, the shorter the representation, and shorter URLs are easier to read and share for humans. Sites like YouTube and Imgur use base 62 (letters, upper and lower case, plus digits) or larger (add a dash or other valid URL characters), which is relatively short for big numbers. What would you prefer to use, youtu.be/23489234892348234933 or youtu.be/B9k6KMrv8vh?

2) Why are non-sequential IDs used?

The answer by IMil explains it well:

Youtube can't use sequentional IDs for two reasons:

  • Its databases are almost certainly distributed, making sequential numbering complicated.

  • It has a privacy option "Unlisted videos": those that don't show up in the search results, but are available if you know the ID.

These also explain why the IDs are so large: (YouTube doesn't host 23,489,234,892,348,234,933 different videos, obviously)

  • When generating IDs, it's a problem if you accidentally generate the same ID twice, so you need a big ID space to prevent the birthday problem

  • People can just guess the URL of unlisted videos if the chance of any given valid ID being used for a video isn't very, very small.

Jasmijn
  • 1,561
  • 9
  • 14
  • 3
    > "YouTube doesn't host 23,489,234,892,348,234,933 different videos, obviously" I'm not so sure if this is obvious or not ;) – mike3996 Nov 28 '17 at 11:42
  • `People can just guess the URL of unlisted videos if the chance of any given valid ID being used for a video isn't very, very small.` -- how do you know if an unlisted video isn't accessible for everyone except its author? even if someone else has guessed its ID – Rakori Nov 28 '17 at 21:26
  • 1
    @Rakori [because that is how Youtube works.](https://support.google.com/youtube/answer/157177?co=GENIE.Platform%3DDesktop&hl=en-GB) – Josef Nov 29 '17 at 11:35
  • 2
    @progo I mean if every single person in the world has uploaded 3.3 billion videos to YouTube on average... ;) – Jasmijn Nov 29 '17 at 18:03
6

why not just integers, particularly sequential ones? And when, in what cases is it a wise decision to such string ID instead of integers?

  • Better UTF-8 space - when you turn a number into a string you get at most 10 combinations per character (0-9), but when you allow any alpha numberic characters you get 62 combinations per character (a-z, A-Z, 0-9), so by using alphanumeric strings you can produce shorter urls than if you used numeric strings. This is important for sites where users are sharing urls -- like Youtube and Imgur.
  • Sequential integers are more difficult to produce. To produce a sequential increasing integer you must either have a single thread produce the numbers, or coordinate many hosts in a distributed system, and when you run a high volume application like Youtube or Imgur that doesn't scale as nicely as a randomly generated string (not to say that they are randomly generating)

As an aside, it's not necessarily the case that the internal representation is a string. They could very likely be encoding a numeric identifier as a alphanumeric string for the shorter url.

Samuel
  • 9,137
  • 1
  • 25
  • 42
  • 1
    2) in case of a string ID, but you'll need to verify that a string ID has been generated already before inserting a new record into a db. what's the difference with an int ID then? – Rakori Nov 28 '17 at 07:16
  • @Rakorin Even when using something as simple as UUIDv4 the chance of collison is miniscule. Use enough randomness and the chance is pretty non-existent, so that the duplicity does not really need to be validated. – Andy Nov 28 '17 at 07:48
  • 1
    @davidpacker and how is that different from generating a longer integer? – Sopel Nov 28 '17 at 09:57
  • @Sopel As Samuel has pointed out, the integers would take up more space, i.e. be longer, than the strings. Otherwise, there really is not any difference. – Andy Nov 28 '17 at 13:07
  • 1
    @davidpacker only when printed – Sopel Nov 28 '17 at 13:14
  • @Sopel The likelihood of a collision from a 32-bit or 64-bit integer is much higher than a UUIDv4. There is a table with collision probability here: https://en.wikipedia.org/wiki/Birthday_problem#Probability_table You could use an arbitrary precision numeric data type like `BigInteger` but you would lose almost all the benefits of using numbers as ID's – Samuel Nov 28 '17 at 18:56
  • @Samuel You don't have to use `BigInteger` because you don't need full arithmetic. A int array with 'string semantics' would work best – Sopel Nov 28 '17 at 19:10
  • @Sopel integers take up more space in a URL too, because URLs [consist of caracters](https://tools.ietf.org/html/rfc3986#page-11) – Josef Dec 04 '17 at 09:51
  • @Josef do you store billions of urls in database? – Sopel Dec 04 '17 at 09:52
  • @Sopel as a matter of fact I do, but that doesn't matter. Billions of URLs are transferred trough the internet each day. Billions of youtube videos are referenced/embedded in other pages **by URL**. So there is some saving there. Still, the reason why they use base64 is probably because it looks nicer. But if you transfer characters anyway, it makes no sense to use a ID limited to only base2/10/16 – Josef Dec 04 '17 at 09:58
  • @Josef but you're still assuming that the number from the database cannot be sent as base64 to have benefits of both where needed. – Sopel Dec 04 '17 at 10:00
  • @Sopel no, of course the number can be stored as int64 in the DB and sent using base64 as URL. The URL then takes less space than using base10 – Josef Dec 04 '17 at 10:11
  • @Josef and that I agree with – Sopel Dec 04 '17 at 10:25
2

As you've pointed out that it would be easy to use a universally unique ID just using numbers because under the hood everything is just 0 and 1 and you could expand the number to more precision going up to 128 bit or more.

I think the main reason is that, assuming some arbitrary fixed range like uint32 (just for the sake of an example), if you use letters as well you can have a shorter ID in total.

I imagine that it's an esthetics reason for the URL. Instead of having 4,129,873,773 with letters it's much shorter Fu837t (just fictious made up by me). A user might even be able to remember the URL for giving it to a friend. Platforms like Youtube usually have longer UUIDs than 32 bit because they would run out of space quickly.

Ewald B.
  • 219
  • 1
  • 4
  • 3
    This i think is the answer. Using strings is neither more efficient nor easier to maintain uniqueness. The reason is that its easier to represent as a url – Sopel Nov 28 '17 at 10:01
  • if a user is able to remember Fu837t, but can't he remember 2390? – Rakori Nov 28 '17 at 21:23
  • 4
    @Rakori: Fu837t would compare to 2223955238, so yes. The 2390 would be encoded as "Vg", so: also yes. – Mooing Duck Nov 29 '17 at 01:28
  • @MooingDuck, no. How do you know what the algorithm for generating that string ID is? – Rakori Nov 29 '17 at 04:49
  • 3
    @Rakori it is not an algorithm, it is an encoding. There are algorithms to transfer numbers between different encodings, but which one is used does't matter as long as the encoding is well defined. Url safe base64 encoding is well known and [standardized](https://tools.ietf.org/html/rfc4648#page-7). – Josef Nov 29 '17 at 11:39
2

Content hashes

The word "hash" is not found in the existing, nice, answers, so here we go:

Often, data can be identified by its content hash instead of an independent, artificial ID. This is particularly evident in software like git or file systems like ZFS where this particular property of using content hashes not only makes stuff easier (for example de-duplication), but also has other nice properties like trivial caching, a secure history, detecting bit rot etc.

Hashes usually come as hex numbers (or an even larger letter space), so that's why you don't see integer IDs. There simply are no integers (in those cases).

Hashes are good if your data objects are immutable (like in ZFS or git); they would be great to store images, for example, on large CDNs. I do not know whether those particular IDs actually are hashes, but it would certainly make sense (and as Michael Kjörling commented, short IDs are probably not hashes for obvious reasons - as comparison, git uses SHA-1 values which are 20 bytes or 40 hex digits).

AnoE
  • 5,614
  • 1
  • 13
  • 17
  • 1
    At least Youtube video IDs are too short to be hashes. The birthday paradox applies; in short, on average, with a hash space of n bits, you will start seeing collisions after seeing 2^(n/2) input blobs. With ~60-70 bits in the ID, that's 30-35 bits of uniqueness, or a few billion entries. I'm pretty sure they host more videos than that by now. And, of course, most hashes are integers just fine; that they aren't normally printed in decimal form has no bearing on whether or not they are integers. Admittedly, the same data could probably be interpreted as floating-point binary data... – user Nov 28 '17 at 13:58
  • 4
    @MichaelKjörling: Well, YouTube video IDs are too short to be *cryptographic* hashes, but there are plenty of hash functions that have 64 bits of output or less — CRC-16/32/64, Java `hashCode()`, etc. Of course, the shorter the hash, the more likely random collisions are. – dan04 Nov 28 '17 at 22:38
  • If you wanted people to remember the URL, you wouldn't have made it case-significant. And having to say "upper" or "lower" in front of every letter is much less efficient than just saying numbers. – Lenne Nov 30 '17 at 13:19
2

A short URL is desirable since it makes linking and sharing simpler (e.g you can share a link in a SMS, it is faster to type and so on). Services like Youtube or Imgurl want you to share URLs casually, so this is an important consideration.

Using alphanumerical ID's rather than numerical means you need fewer characters to express an ID of the same bit-size. For example 6 digits give you a million unique id's but 6 alphanumeric characters (using the base64 set) gives you 68 billion unique identifiers.

For all we know, the alphanumerical identifiers could be sequential numbers, just encoded in an alphanumeric format like base64. But often commercial services eschew sequential codes to prevent people from guessing ID's and to avoid disclosing business information like the amount of customers.

JacquesB
  • 57,310
  • 21
  • 127
  • 176
1

There's several reasons why you would use non-numeric ids, but also understand that not all values with alphabetic characters are really strings. YouTube has the reputation of an incredible number of videos, on the order of 300 hours of video uploaded every minute (ref). The unique integers representing those videos can get quite long, so the use something like Base64 URL encoded numbers (ref).

Types of Identifier Representations:

  • Simple integers: (12345, 981027489382493)
  • Base 16 integers: 123456789abcdef -- also known as Hex
  • Base 64 integers: 9b6tMZS
  • Readable strings: 12032017-Read-my-awesome-article-01

They all have their strengths and weaknesses. The more unique characters you can use for your identifiers the fewer characters you need to represent a number. Base 64 numbers are a pretty good compromise because there is an established variant that works for URLs and compresses the number of characters needed to represent a number 6 to 8 (i.e. 3/4th the size).

Readable strings work for blogs because they can raise the searchability, and it's a lot easier to generate unique titles when the number of records is small.

Berin Loritsch
  • 45,784
  • 7
  • 87
  • 160
0

Ok one of the reasons is that the characters are sent as characters and not as integers anyhow. This is because of how an HTTP Get works.

When you say, "why not use an integer?" Well, the integer is then chopped up and every digit is sent as a character and you end up with a string of characters anyhow. So why not use all the options for a character?

There is also the human factor:

Take imgur for instance : https://imgur.com/*****/s6UqP

s6UqP,

The range for every character is: a through z capital, a through z sub-capital, and 0 to 9= 26+ 26+ 10 = 62 options for every position in the string. With five positions that's 916132832 possible combinations. If you would only use numbers, you would need 9 digits.

People can hold roughly 7 objects in memory, 9 digits is too much, 5 characters is doable.

Magical number 7

Pieter B
  • 12,867
  • 1
  • 40
  • 65
  • It remembers Gfycat: they use three words, two adjectives and an animal name. Because there are many possibilities ([1502 adjetives](https://assets.gfycat.com/adjectives) and [1751 animals](https://assets.gfycat.com/animals)) they have more than 3 billion combinations using just three objects. – Gustavo Rodrigues Nov 30 '17 at 14:15