7

Suppose

  • a program A opens a text file A using encoding A to decode the file, and
  • a program B opens a text file B using encoding B.

When we copy some text from file B in program B to file A in program A using mouse selection, Ctrl+C and then Ctrl+V, I have heard that the GUI layer of the OS (e.g. the X Window System on Linux, and I guess something similar on Windows) handles the transfer between the programs.

For example, program A can be any program which accepts text-paste, such as a text editor (e.g. emacs, gedit) or any other program, and program B can be any program which accepts text-copy, such as a text viewer (e.g. a web browser such as firefox, chrome), a text editor, or any other program.

Question:

Note that encoding A and encoding B can be different. What should happen under the hood of Ctrl+C and Ctrl+V so that the pasted text in file A in program A is consistent with the original text in file B?

  • When hitting ctrl+c in file B and program B, is the binary content of the copied text in the "clipboard" of GUI of the OS the same as the binary content of the original text in file B? I.e. is the encoding for the copied text in the "clipboard" still encoding B? What program should determine the encoding of the copied text in the "clipboard"?

  • When hitting ctrl+v in file A and program A, is the binary content of the pasted text in file A the same as the binary content of the original text in file B? I.e. in the new file A, the original text should be decoded with encoding A, and the pasted text should be decoded with encoding B? What program should determine the encoding of the pasted text in file A?

Rohit Gupta
Tim
  • A much better answer, one that really answers the OP's question, is posted here: [https://stackoverflow.com/a/1929874/2323252](https://stackoverflow.com/a/1929874/2323252) – Fer B. Jun 13 '19 at 08:37

2 Answers

2

The simplest solution is to use a standard encoding. For example, on Windows one standard clipboard format is "Unicode text" (CF_UNICODETEXT), which holds UTF-16, the encoding recommended for Windows applications. Programs that accept clipboard input have to be able to interpret that encoding. This is all documented on MSDN.

Unicode (Windows) Standard Clipboard Formats
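
As a rough illustration of that idea, the sketch below simulates a copy and a paste in Python. It does not touch the real clipboard: the function names `copy_to_clipboard` and `paste_from_clipboard`, and the choice of cp1252 and UTF-8 as the two file encodings, are assumptions made only for this example.

```python
# Simulated clipboard round trip through a standard encoding (UTF-16LE).
# Nothing here calls a real OS clipboard API; the names are hypothetical.

def copy_to_clipboard(file_bytes, file_encoding):
    """Program B: decode its own bytes, then publish them as UTF-16LE."""
    text = file_bytes.decode(file_encoding)
    return text.encode("utf-16-le")

def paste_from_clipboard(clipboard_bytes, target_encoding):
    """Program A: read UTF-16LE, then re-encode for its own file."""
    text = clipboard_bytes.decode("utf-16-le")
    return text.encode(target_encoding)

# File B is stored as cp1252 (assumed), file A as UTF-8 (assumed).
file_b_bytes = "café".encode("cp1252")
clipboard = copy_to_clipboard(file_b_bytes, "cp1252")
file_a_bytes = paste_from_clipboard(clipboard, "utf-8")

print(file_b_bytes)   # bytes as stored in file B
print(clipboard)      # bytes that would sit on the clipboard
print(file_a_bytes)   # bytes as written into file A
```

The three byte strings differ even though the text is the same, which is the point: each program converts between its own encoding and the agreed clipboard encoding, so neither needs to know the other's file encoding.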

Frank Hileman
  • What happens under the hood of Ctrl+C and Ctrl+V when the two files are given to you with arbitrary encodings? Same question when you don't know the encodings of the two files? – Tim Jan 15 '15 at 19:23
  • In Windows, it is the sending process (the copier to clipboard) that determines the clipboard format. It has no idea who will receive the clipboard data. One clipboard data object can contain multiple formats. – Frank Hileman Jan 15 '15 at 19:28
  • Practically speaking, any modern Windows process that puts text onto the clipboard will probably use UTF-16. This is unrelated to the file format. If you put a whole file onto the clipboard, it is not the file contents, but a pointer (path) to the file. – Frank Hileman Jan 15 '15 at 19:30
  • (1) When you hit Ctrl+C to copy part of file B in program B into the clipboard, what process puts the text into the clipboard? Is it part of the GUI of Windows? (2) You said the encoding for the content in the clipboard is probably UTF-16. If the encoding of file B isn't UTF-16, then must there be a change of encoding during Ctrl+C? – Tim Jan 15 '15 at 19:33
  • I'm fairly certain that the encoding isn't changed - the data is copied as-is. I've had to deal with this first-hand (as a user, not as a developer), where the data was encoded in one format and pasted into a program expecting a different format and the resulting character set was not what was expected. – Michael Jan 15 '15 at 19:38
  • This is starting to sound more like a user question, not a programming question. The "program" is the code executed, the "process" is the thing executing the code. When you copy part of a file, you have the file open in a process. So only the process is putting data in the clipboard. If you copy a file from the operating system GUI, it is not going to copy the file contents onto the clipboard, but the path to the file. – Frank Hileman Jan 15 '15 at 19:38
  • The process putting data into the clipboard has complete control over the format. This is why we use standard formats. – Frank Hileman Jan 15 '15 at 19:40
  • A Windows text editor opening a UTF-8 file most likely uses UTF-16 in memory to represent the text. So it has already been converted. – Frank Hileman Jan 15 '15 at 19:41
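
The points made in these comments (the sender chooses the format, and one clipboard object can carry several formats) can be sketched with a toy in-memory model. This is not the Windows API: the `Clipboard` class and the format labels `"unicode_text"` and `"ansi_text"` are hypothetical names invented for the illustration.

```python
# Toy model of a clipboard that holds several formats at once.
# The class and the format labels are hypothetical, not a real OS API.

class Clipboard:
    def __init__(self):
        self.formats = {}

    def set_data(self, formats):
        """Sender publishes one or more (format label -> bytes) entries."""
        self.formats = dict(formats)

    def get_data(self, preferred):
        """Receiver asks for the first format it understands."""
        for name in preferred:
            if name in self.formats:
                return name, self.formats[name]
        return None

# The sender already holds the text decoded in memory, whatever the file encoding was.
text_in_memory = "naïve"
clip = Clipboard()
clip.set_data({
    "unicode_text": text_in_memory.encode("utf-16-le"),
    "ansi_text": text_in_memory.encode("cp1252", errors="replace"),
})

# The receiver prefers the Unicode format and falls back to the legacy one.
chosen, data = clip.get_data(["unicode_text", "ansi_text"])
pasted = data.decode("utf-16-le") if chosen == "unicode_text" else data.decode("cp1252")
print(chosen, pasted)
```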
1

A program transferring text can do one of three things: Transfer without telling which encoding is used, transfer using a standard encoding, or transfer while specifying the encoding.

The receiving program will then either know the encoding or not, depending on what the sending program did. If the encoding is not known to the receiver, it can look at the text and make a guess. For Unicode text, preceding it with a byte order mark helps: a BOM makes quite clear which Unicode encoding is used, and the same byte sequence will be very, very rare at the start of text in any other encoding. If the text is encoded in some Windows code page other than 1252, there is basically no way for the receiver to figure out the encoding, so your text will be rubbish.
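
A minimal sketch of both cases, assuming the receiver only has raw bytes to work with. The helper name `sniff_bom` is hypothetical; the BOM constants come from Python's standard `codecs` module. The second part shows what "rubbish" looks like when cp1251 bytes are wrongly read as cp1252.

```python
import codecs

# UTF-32 BOMs are checked first because the UTF-32LE BOM starts with
# the same two bytes as the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return (encoding, payload without BOM), or (None, data) if no BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data

# Case 1: the sender prefixed a BOM, so the receiver can identify the encoding.
received = codecs.BOM_UTF16_LE + "héllo".encode("utf-16-le")
encoding, payload = sniff_bom(received)
decoded = payload.decode(encoding)

# Case 2: no BOM, and the receiver wrongly assumes cp1252 for cp1251 bytes.
cyrillic_bytes = "Привет".encode("cp1251")
no_bom_encoding, _ = sniff_bom(cyrillic_bytes)
mojibake = cyrillic_bytes.decode("cp1252")

print(encoding, decoded)
print(no_bom_encoding, mojibake)
```

In the second case nothing fails outright: the bytes decode without error, just into the wrong characters, which is why the receiver cannot tell that its guess was wrong.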

gnasher729