Importing PDF documents
You can import PDF documents into a MAXQDA project in several ways, for example:
- drag and drop PDF files from the Windows Explorer or macOS Finder directly into the “Document System” window,
- click the plus icon in the “Document System”, or
- click on the Texts, PDFs, Tables icon on the Import menu tab.
For general information on importing and organizing files, see Import and Group your Data.
Color highlighting and comments
Color highlighting in a PDF document is imported into a MAXQDA project as codings. A parent code with the name "Word/PDF highlighting" is created at the top level in the code system. For each color, a subcode with the color name in English is created and assigned to the corresponding text passage. If a highlighted passage in the PDF has a comment attached, that comment is saved as the comment on the coded text segment.
Comments in a PDF document are imported as in-document memos and can be displayed in the "Document Browser". Multiple related comments (threads) are combined into a single memo.
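The mapping described above can be pictured with a small Python sketch. This is only an illustration of the resulting structure, not MAXQDA's implementation or file format: the class names, field names, and the line-break separator used to join a comment thread are assumptions made for this example.

```python
from dataclasses import dataclass

# Parent code name as stated in the text above.
PARENT_CODE = "Word/PDF highlighting"


@dataclass
class Highlight:
    """A highlighted passage in the source PDF (hypothetical model)."""
    color: str         # English color name, e.g. "yellow"
    text: str          # the highlighted passage
    comment: str = ""  # optional comment attached to the highlight


@dataclass
class CodedSegment:
    """A coded segment after import (hypothetical model)."""
    code_path: tuple   # (parent code, subcode)
    text: str
    comment: str = ""


def highlights_to_codings(highlights):
    """Turn each highlight into a coding under parent code > color subcode."""
    return [
        CodedSegment((PARENT_CODE, h.color), h.text, h.comment)
        for h in highlights
    ]


def thread_to_memo(comments):
    """Combine the comments of one thread into a single memo text.

    Joining with line breaks is an assumption of this sketch.
    """
    return "\n".join(comments)


codings = highlights_to_codings([
    Highlight("yellow", "key finding", "check this"),
    Highlight("green", "method description"),
])
memo = thread_to_memo(["Is this figure correct?", "Yes, see page 3."])
```

In this model, two yellow highlights end up under the same "yellow" subcode, and a thread of replies becomes one memo rather than several.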
The option to code highlighted text and import comments as in-document memos during import can be turned on or off in the local preferences of the “Document System” window. To open the preferences, click the gear icon in the upper right corner of the window.
Working with PDF documents
Working with PDF documents requires some special considerations, because the PDF format was designed as a layout format for printing rather than for text processing. As a result, PDF files are much larger than simple text documents.
Saving PDF files outside the MAXQDA project file
By default, all PDF files smaller than 5 MB are saved in the project file upon import. PDF files larger than 5 MB are not saved in the MAXQDA project itself, but rather in the folder for externally saved files, and only a reference to the externally saved data is created. You can customize the maximum file size as well as the location for externally saved files in MAXQDA’s global preferences, which you can access via the preferences symbol in the lower left corner of MAXQDA's main window.
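The size rule can be expressed as a short Python sketch. It is a simplified illustration, not MAXQDA's code: the function name is hypothetical, 1 MB is assumed to be 1024 × 1024 bytes, and a file of exactly the limit size is treated as external, which the text above does not specify.

```python
MB = 1024 * 1024        # assumption: 1 MB = 1024 * 1024 bytes
DEFAULT_LIMIT_MB = 5    # default threshold stated in the text above


def storage_location(size_bytes, limit_mb=DEFAULT_LIMIT_MB):
    """Return where a PDF of the given size would be stored on import.

    "project"  -> embedded in the project file
    "external" -> saved in the folder for externally saved files,
                  with only a reference kept in the project
    """
    if size_bytes < limit_mb * MB:
        return "project"
    return "external"  # assumption: the exact limit counts as external


small = storage_location(2 * MB)                  # below the default limit
large = storage_location(12 * MB)                 # above the default limit
raised = storage_location(12 * MB, limit_mb=20)   # with a customized limit
```

Raising the limit in the global preferences corresponds to passing a larger `limit_mb` here: the same 12 MB file is then embedded instead of referenced.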
Coding text and image segments
Text and image segments in PDF documents can be coded with the mouse. Select the text, or draw a frame around the desired image area, and then code the selection. MAXQDA does not distinguish between text and image codings when counting code frequencies. However, when you search for overlaps in the Coding Query, text segments and image segments are evaluated independently, that is, an overlap between a text segment and an image segment is ignored. The “Near” function always returns a result of 0 for image segments, both in the Complex Coding Query and in the Code Relations Browser.
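The overlap behavior can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a text segment is a character range on a page and an image segment is a rectangle on a page; these classes and the `overlaps` function are hypothetical and do not reflect MAXQDA's internal data model.

```python
from dataclasses import dataclass


@dataclass
class TextSegment:
    page: int
    start: int  # first character position (inclusive)
    end: int    # last character position (exclusive)


@dataclass
class ImageSegment:
    page: int
    x0: float   # left edge
    y0: float   # top edge
    x1: float   # right edge
    y1: float   # bottom edge


def overlaps(a, b):
    """Return True only if two segments of the same kind overlap."""
    if type(a) is not type(b):
        return False  # text vs. image: never counted as an overlap
    if a.page != b.page:
        return False
    if isinstance(a, TextSegment):
        return a.start < b.end and b.start < a.end
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


text_a = TextSegment(page=1, start=0, end=50)
text_b = TextSegment(page=1, start=40, end=90)
frame = ImageSegment(page=1, x0=0, y0=0, x1=200, y1=100)

same_kind = overlaps(text_a, text_b)  # both text, ranges intersect
mixed = overlaps(text_a, frame)       # text vs. image, always ignored
```

Even if the frame visually covers the coded text on the page, the mixed pair is not reported as an overlap in this model.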
If a text is only available as a scanned PDF file, Optical Character Recognition (OCR) must be performed with a suitable program before importing the PDF into MAXQDA. OCR makes it possible to select and code the text in MAXQDA. Without it, only image segments could be selected.
Paragraphs in PDF files
Unlike text documents, PDF documents do not have a paragraph structure per se. For this reason, MAXQDA tries to recognize paragraphs in PDF documents based on various criteria, so that, for example, the functions for finding words within a paragraph or for autocoding paragraphs can be used.
Paragraph recognition works very well for most PDF documents, but please note the following limitations:
- For MAXQDA, PDF documents do not contain paragraphs across page boundaries. This means that even if the content of a paragraph continues on the next page, for MAXQDA the paragraph ends at the end of the page.
- Footnote characters in the text may be recognized as the end of a paragraph.
- The quality of paragraph recognition depends on how the PDF was created and how it is structured. For example, in PDF documents created from scanned text using OCR text recognition, the quality of paragraph recognition will be worse than in PDF documents created directly from Word.
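The page-boundary limitation can be shown with a short Python sketch. It is not MAXQDA's recognition algorithm, which uses various criteria; this example assumes plain page texts in which paragraphs are separated by blank lines, and the function name is hypothetical.

```python
def split_paragraphs(pages):
    """Split page texts into (page_number, paragraph) pairs.

    Paragraphs are detected per page, so a paragraph that continues on
    the next page is returned as two separate paragraphs.
    Assumption: blank lines separate paragraphs within a page.
    """
    result = []
    for page_no, page_text in enumerate(pages, start=1):
        for block in page_text.split("\n\n"):
            paragraph = " ".join(block.split())  # normalize whitespace
            if paragraph:
                result.append((page_no, paragraph))
    return result


pages = [
    "First paragraph.\n\nA sentence that continues",
    "on the next page.\n\nLast paragraph.",
]
paragraphs = split_paragraphs(pages)
```

Here the sentence spanning both pages yields one paragraph ending on page 1 and another starting on page 2, so a search for two words within the same paragraph would miss a pair split across the page break.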
Extract text from a PDF document and save as text document
After importing a PDF document into a MAXQDA project, you can extract the text from the PDF document. Images and formatting are ignored; only the plain text is inserted as a new text document in the “Document System”.
Select one or more PDF documents in the “Document System” and choose the function Insert PDF Text as New Document. The new text document will be inserted directly below the selected documents.
If you have excluded the header or footer areas of a PDF, as described in the section below, they will also be excluded when converting the PDF into a text document.
Exclude PDF header and footer areas
PDF header and footer sections can be excluded from all MAXQDA analyses, such as Word Frequencies and MAXDictio-based analyses. To do so, click on the respective icon in the Document Browser’s toolbar. Then drag the arrows at the top and bottom of the page with the mouse to adjust the header and footer areas separately, and click on "Save" to apply the changes to all pages of the PDF file.