3

Is there a test to check whether a PDF file contains text or was created by scanning paper sheets?

  • text: plain text that, for example, I can copy and paste while reading the PDF. Not a hidden, searchable (OCR) text layer. Created, for example, by a word processor.

  • not text: PDFs from scanners, containing only JPG or TIFF images. There is no OCR; the PDF contains only images.

I cannot use external tools, because I am working on an embedded system (not Linux, not Windows) where there are no external libraries. I must work in bare C. That is why I prefer a plain function that does not depend on external libraries.

Is there any information in the PDF header?

My general idea is: read the first 64 KB and try to compress it with a generic library. PDFs with JPG and TIFF images, scanned from paper, usually compress badly. PDFs of text compress well. Will that work? Are there better methods?

Doc Brown
Massimo
  • A PDF can contain BOTH things. You need to find out whether one PDF file contains only one thing or only the other? – Tulains Córdova Aug 20 '16 at 22:51
  • Hmm, I will edit and explain better. – Massimo Aug 20 '16 at 23:19
  • 3
    I don't see "not depending on external libraries" as a viable way to attack this. It is just too much work. There are a number of free libraries around, like Apache PDFBox, that allow extraction of text from a PDF file. That could be a first step. – tofro Aug 21 '16 at 09:30
  • I report this useful comment from a deleted answer. PDFs normally have their own internal compression, so this isn't going to work. – Philip Kendall – Massimo Aug 21 '16 at 22:06
  • Uff, please stick to my question. On this embedded system (not Linux, not Windows) there are no external libraries. I must work in bare C. If this scenario is not worthwhile for you, please disregard my question. Thanks. – Massimo Aug 21 '16 at 22:08
  • @Massimo [Snort](https://www.snort.org/faq/readme-http_inspect) provides a prime example of how to do this with their "decompress_pdf" option. You should take a look at their implementation approach. – rwong Aug 22 '16 at 00:58

1 Answer

4

Edited to directly address Massimo's requirement that no external code be used (a.k.a. clean room development).

At a minimum, one has to implement all of the mechanics to parse (but not render) PDF document structure, according to the PDF specification, such as ISO 32000-1:2008, PDF 1.7 (purchase needed).

PDF document structure is a very big and deep tree, comparable to some of the world's most complicated XML documents. The "schema" of PDF is documented in the file format specification, in the linked document above.

Once you arrive at a document page's "content stream" node (there can be one or more, or nested), or at the "Form XObject" nodes nested inside a page, you can parse its content to look for content that is delimited by "BT" and "ET". There are lots of other things to watch out for, though.
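As a rough illustration of the "BT"/"ET" search, here is a minimal sketch in plain C. It assumes the content stream has already been located and decompressed into a buffer. The function name `count_text_show_ops` and its helpers are made up for this example, and it deliberately ignores inline images (BI/ID/EI), whose binary data could confuse the tokenizer. It counts text-showing operators (Tj, TJ, ' and ") that appear between BT and ET, skipping string literals, hex strings, names and comments so that a "BT" inside a string is not miscounted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* PDF white-space and delimiter characters. */
static int pdf_space(unsigned char c)
{
    return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
}

static int pdf_delim(unsigned char c)
{
    return c != 0 && strchr("()<>[]{}/%", c) != NULL;
}

static int tok_eq(const unsigned char *t, size_t len, const char *op)
{
    return strlen(op) == len && memcmp(t, op, len) == 0;
}

/* Hypothetical helper: counts text-showing operators inside BT ... ET
 * in an already decoded content stream. Inline images are NOT handled. */
size_t count_text_show_ops(const unsigned char *s, size_t n)
{
    size_t i = 0, count = 0;
    int in_text = 0;

    assert(s != NULL || n == 0);
    while (i < n) {
        unsigned char c = s[i];
        if (pdf_space(c)) {
            i++;
        } else if (c == '%') {                  /* comment: skip to end of line */
            while (i < n && s[i] != '\n' && s[i] != '\r') i++;
        } else if (c == '(') {                  /* literal string, parentheses may nest */
            int depth = 0;
            do {
                if (s[i] == '\\') i++;          /* skip the escaped character */
                else if (s[i] == '(') depth++;
                else if (s[i] == ')') depth--;
                i++;
            } while (i < n && depth > 0);
        } else if (c == '<' && i + 1 < n && s[i + 1] == '<') {
            i += 2;                             /* dictionary opener */
        } else if (c == '<') {                  /* hex string */
            while (i < n && s[i] != '>') i++;
            if (i < n) i++;
        } else if (c == '/') {                  /* name object such as /F1 */
            i++;
            while (i < n && !pdf_space(s[i]) && !pdf_delim(s[i])) i++;
        } else if (pdf_delim(c)) {
            i++;                                /* remaining delimiters */
        } else {                                /* number or operator token */
            size_t start = i, len;
            while (i < n && !pdf_space(s[i]) && !pdf_delim(s[i])) i++;
            len = i - start;
            if (tok_eq(s + start, len, "BT")) in_text = 1;
            else if (tok_eq(s + start, len, "ET")) in_text = 0;
            else if (in_text && (tok_eq(s + start, len, "Tj") ||
                                 tok_eq(s + start, len, "TJ") ||
                                 tok_eq(s + start, len, "'")  ||
                                 tok_eq(s + start, len, "\"")))
                count++;
        }
    }
    return count;
}
```

A result of zero across every page's content streams (and the Form XObjects they reference) would suggest that the file shows no text; any non-zero result means at least some text is drawn.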

Text content streams are almost always Deflate-compressed (the "FlateDecode" filter, commonly associated with the "zlib" library). So you will need to find the beginning of each deflated stream, find the length of the stream, decompress it, and search for strings inside.
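To make the "find the beginning and the length" step concrete, here is a minimal sketch in plain C. The names `pdf_stream`, `find_bytes` and `next_stream` are invented for this example. It takes two shortcuts that a real parser should not: it measures the stream up to the next "endstream" keyword instead of resolving the /Length entry (which may be an indirect reference), so the reported length can include the end-of-line bytes before "endstream"; and it treats everything between the preceding "obj" keyword and "stream" as the dictionary. Decompression itself (an inflate routine per RFC 1950/1951) is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    size_t data_off;  /* offset of the first stream byte */
    size_t data_len;  /* bytes up to "endstream" (may include a trailing EOL) */
    int    is_flate;  /* 1 if the dictionary mentions /FlateDecode */
} pdf_stream;

/* Returns the offset of pat in buf[from..n), or n if it is not found. */
static size_t find_bytes(const unsigned char *buf, size_t n, size_t from,
                         const char *pat)
{
    size_t m = strlen(pat), i;
    if (m == 0 || n < m) return n;
    for (i = from; i + m <= n; i++)
        if (memcmp(buf + i, pat, m) == 0) return i;
    return n;
}

/* Hypothetical helper: finds the next stream at or after offset `from`.
 * Returns 1 and fills *out on success, 0 when no further stream exists. */
int next_stream(const unsigned char *buf, size_t n, size_t from, pdf_stream *out)
{
    assert(buf != NULL && out != NULL);
    for (;;) {
        size_t p = find_bytes(buf, n, from, "stream");
        size_t d, e, o;
        if (p == n) return 0;
        from = p + 6;
        if (p >= 3 && memcmp(buf + p - 3, "end", 3) == 0)
            continue;                       /* this was "endstream" */
        d = p + 6;                          /* keyword is followed by CRLF or LF */
        if (d < n && buf[d] == '\r') d++;
        if (d < n && buf[d] == '\n') d++;
        else continue;                      /* not a real stream keyword */
        e = find_bytes(buf, n, d, "endstream");
        if (e == n) return 0;               /* truncated file */
        o = p;                              /* walk back to the preceding "obj" */
        while (o >= 3 && memcmp(buf + o - 3, "obj", 3) != 0) o--;
        if (o < 3) o = 0;
        out->data_off = d;
        out->data_len = e - d;
        out->is_flate = find_bytes(buf, p, o, "/FlateDecode") != p;
        return 1;
    }
}
```

Calling `next_stream` repeatedly with `from = s.data_off + s.data_len` walks all streams in the buffer; each one flagged `is_flate` would then go through your own inflate routine before searching for text operators.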

To find images, you will need to find "Image XObjects" and a few other object types that can contain image data.
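A crude way to spot Image XObjects without a full parser is to scan for a "/Subtype /Image" entry in object dictionaries. The sketch below, in plain C, does only that; `count_image_xobjects` is a made-up name. It assumes the dictionaries are stored uncompressed in the file, which is not true for files that pack objects into object streams (PDF 1.5 and later), and it can be fooled by binary stream data that happens to contain the same bytes. Treat the result as a hint, not as proof.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static int img_space(unsigned char c)
{
    return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
}

/* Hypothetical helper: counts "/Subtype /Image" entries in a raw PDF buffer.
 * White space between key and value is optional, so "/Subtype/Image" also matches. */
size_t count_image_xobjects(const unsigned char *buf, size_t n)
{
    static const char key[] = "/Subtype";
    static const char val[] = "/Image";
    const size_t klen = sizeof key - 1, vlen = sizeof val - 1;
    size_t count = 0, i;

    assert(buf != NULL || n == 0);
    for (i = 0; i + klen <= n; i++) {
        size_t j, k;
        if (memcmp(buf + i, key, klen) != 0) continue;
        j = i + klen;
        while (j < n && img_space(buf[j])) j++;
        if (j + vlen > n || memcmp(buf + j, val, vlen) != 0) continue;
        k = j + vlen;
        /* The name must end here, so that e.g. /ImageX is not counted. */
        if (k == n || img_space(buf[k]) || strchr("()<>[]{}/%", buf[k]) != NULL)
            count++;
    }
    return count;
}
```

Combined with a text-operator count, a file with one or more image objects per page and no text-showing operators at all is a plausible candidate for "scanned, no OCR".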

This is an incomplete explanation and is only applicable to those who are not using libraries or tools.


There are many approaches. All of them require libraries or tools; some are open-source or liberally licensed. Yet others require commercial licenses for commercial use, including use in hosted services. So a quick answer is no.

Content streams, whether they hold text or images, are usually compressed. Many of them are Deflate-compressed, but PDF defines several other compression and encoding methods (called filters) as well.

Tools exist to extract plain text from PDF. PDFs that contain scanned but not recognized text (i.e. human-readable but not machine-readable) must be rendered to an image and then OCRed.

Certain types of PDF contain neither text nor images: they contain vector graphic commands that render text strokes, one stroke at a time. The capital letter "A" may be stored as several strokes, for example. For this type of PDF, it is again necessary to render these graphic commands, producing an image that is human-readable, and then to OCR that image into machine-readable text.


There is no quick way to parse the "PDF header". In fact, just parsing the "PDF header" is complicated enough that doing it will make you a famous programmer and the CEO of a tech company.


To estimate whether a task is doable or not with PDF, it is important to have a first-hand view of a PDF file's structure. Here are some examples of tools used to obtain a tree-view of structure:

(Notes: There are many such tools. This is not an advertisement or endorsement. These diagnostic tools are for a programmer's use to understand a PDF structure, not software components that can be integrated or shipped or resold.)


I haven't used Poppler but it seems promising.


This part is off-topic but gives a reminder to others that the clean-room requirement can be relevant if, for example:

  • One is implementing this functionality in a new programming language (think of some of the rising stars e.g. Go and Rust).
  • One is implementing this in a language without creating external processes or calling foreign-language functions (such as implementing this in JavaScript that runs in browsers).
  • One is implementing this in hardware, where the basic computational elements are stream processing (compress, decompress, filter) and string pattern matching, such as firewalls and novel architectures, e.g. the Automata Processor.
rwong
  • rwong, terrific suggestions but off topic. I stated "cannot use external tools". – Massimo Aug 21 '16 at 00:45
  • @Massimo If you, for some reason, cannot use external tools, then I recommend you read these two books. I've perused them and I'm sure they will teach you all you need to know to achieve what you want. http://shop.oreilly.com/product/0636920021483.do http://shop.oreilly.com/product/0636920025269.do – Tulains Córdova Aug 21 '16 at 00:56
  • 3
    @Massimo Then buy [the PDF 1.7 specification](http://www.iso.org/iso/catalogue_detail.htm?csnumber=51502) and implement it yourself. – rwong Aug 21 '16 at 01:34
  • 2
    @Massimo You will have to implement a minimum of PDF parsing to reach each page's "content stream", look for the "BT" and "ET" operators, and do the same in the content streams of "Form XObjects". (I specifically mention the latter because overlooking it cost me some data loss.) – rwong Aug 21 '16 at 01:42