PDF Documents

Importing PDF documents

You can import PDF documents into a MAXQDA project in various ways, for example:

- drag and drop PDF files from the Windows Explorer or macOS Finder directly into the “Document System” window,
- click the plus icon in the “Document System”, or
- click on the Texts, PDFs, Tables icon on the Import menu tab.

For general information on importing and organizing files, see Import and Group your Data.

Tip: MAXQDA does not support editable PDF form fields. To display content from PDF forms, save your PDF document via a PDF printer as a new PDF file that contains the contents of the form fields as plain text.

Color highlighting and comments

Color highlighting in a PDF document is imported into a MAXQDA project as coded segments. A parent code with the name "Word/PDF highlighting" is created at the top level of the code system. For each color, a subcode with the English color name is created and assigned to the corresponding text passage. If a highlighted passage in the PDF contains a comment, that comment is saved as the comment on the coded text segment.

Note: When importing the color highlighting, slight color deviations from the original are possible, since MAXQDA selects the most suitable color from a list of stored colors.
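The color matching described in the note can be pictured as a nearest-color lookup. The following is only an illustrative sketch: the palette, the function name `nearest_color_name`, and the use of squared RGB distance are assumptions, not MAXQDA's documented behavior.

```python
# Hypothetical palette for illustration only; the colors MAXQDA actually
# stores are not listed in this manual section.
PALETTE = {
    "red": (255, 0, 0),
    "green": (0, 128, 0),
    "blue": (0, 0, 255),
    "yellow": (255, 255, 0),
    "magenta": (255, 0, 255),
}


def nearest_color_name(rgb):
    """Return the palette name with the smallest squared RGB distance to rgb."""
    r, g, b = rgb

    def distance(name):
        pr, pg, pb = PALETTE[name]
        return (r - pr) ** 2 + (g - pg) ** 2 + (b - pb) ** 2

    return min(PALETTE, key=distance)


# A slightly off-yellow highlight is mapped to the stored "yellow".
print(nearest_color_name((250, 240, 20)))
```

This is why a highlight that is not an exact palette color can end up under a subcode whose color differs slightly from the original.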

Comments in a PDF document are imported as in-document memos and can be displayed in the "Document Browser". Multiple related comments (threads) are combined into a single memo.

The option to code highlighted text and import comments as in-document memos during import can be turned on/off in the local preferences of the “Document System” window. The preferences can be opened by clicking on the gear icon in the upper right corner of the window.

Paragraphs in PDF files

Unlike text documents, PDF documents do not have a paragraph structure per se. For this reason, MAXQDA tries to recognize paragraphs in PDF documents based on various criteria, so that, for example, the functions for finding words within a paragraph or for autocoding paragraphs can be used.

Paragraph recognition works very well for most PDF documents, but please note the following limitations:

- For MAXQDA, PDF documents do not contain paragraphs across page boundaries. This means that even if the content of a paragraph continues on the next page, for MAXQDA the paragraph ends at the end of the page.
- Footnote characters in the text may be recognized as the end of a paragraph.
- The quality of paragraph recognition depends on how the PDF was created and how it is structured. For example, in PDF documents created from scanned text using OCR text recognition, the quality of paragraph recognition will be worse than in PDF documents created directly from Word.
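
The page-boundary limitation can be illustrated with a small sketch. It assumes, purely for illustration, that paragraphs within a page are separated by blank lines; the criteria MAXQDA actually uses are not described here, and the function name `paragraphs_per_page` is hypothetical.

```python
def paragraphs_per_page(pages):
    """Split each page's text into paragraphs without merging across pages.

    pages: list of strings, one per PDF page.
    Returns a list of (page_number, paragraph_text) tuples.
    """
    result = []
    for page_number, text in enumerate(pages, start=1):
        # Assumption: a blank line marks a paragraph break within a page.
        for block in text.split("\n\n"):
            paragraph = " ".join(block.split())  # normalize whitespace
            if paragraph:
                result.append((page_number, paragraph))
    return result


pages = [
    "First paragraph.\n\nSecond paragraph starts here and",
    "continues on the next page.\n\nThird paragraph.",
]
for page_number, paragraph in paragraphs_per_page(pages):
    print(page_number, paragraph)
```

In this example the second paragraph is counted as two separate paragraphs, one per page, so a search for words "within a paragraph" would not match a word on page 1 together with a word on page 2.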

Working with PDF documents

There are some special considerations when working with PDF documents: the PDF format was designed as a layout format for printing rather than for text processing, and PDF files are usually much larger than simple text documents.

Saving PDF files outside the MAXQDA project file

By default, all PDF files smaller than 5 MB are saved in the project file upon import. PDF files larger than 5 MB are not saved in the MAXQDA project itself, but rather in the folder for externally saved files, and only a reference to the externally saved data is created. You can customize the maximum file size as well as the location for externally saved files in MAXQDA’s global preferences, which you can access via the preferences symbol in the lower left corner of MAXQDA's main window.
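
The default rule can be summarized as a simple size check. This is a sketch, not MAXQDA code: the function name `storage_location` is hypothetical, 1 MB is assumed to mean 1024 × 1024 bytes, and a file of exactly the limit is assumed to be stored externally, since the text above does not specify that case.

```python
DEFAULT_LIMIT_MB = 5  # default maximum size for PDFs saved in the project file


def storage_location(size_bytes, limit_mb=DEFAULT_LIMIT_MB):
    """Return "project" or "external" for a PDF of the given size.

    Assumptions (not stated in the manual): 1 MB = 1024 * 1024 bytes,
    and a file exactly at the limit is treated as external.
    """
    limit_bytes = limit_mb * 1024 * 1024
    return "project" if size_bytes < limit_bytes else "external"


print(storage_location(2 * 1024 * 1024))                 # 2 MB file, default limit
print(storage_location(12 * 1024 * 1024))                # 12 MB file, default limit
print(storage_location(12 * 1024 * 1024, limit_mb=20))   # 12 MB file, limit raised to 20 MB
```

Raising or lowering the limit in the global preferences corresponds to changing `limit_mb` in this sketch.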

Tip: If you work with many large PDF files, for example, with a total size of more than 50 MB, it makes sense to store them externally (regardless of their file size), so that the MAXQDA file remains small and can be easily backed up.

For detailed information, see External Files.

Coding text and image segments

Text and image segments in PDF documents can be coded with the mouse. Select the desired text, or draw a frame around the desired image area, and then code the selection. MAXQDA does not distinguish between text and image codings with regard to code frequency. However, when the Coding Query searches for overlap, it treats text and image segments independently, that is, overlap between a text segment and an image segment is ignored. The “Near” function for image segments always returns a result of 0, both in the Complex Coding Query and in the Code Relations Browser.

If a text is available only as a scanned PDF file, you can perform Optical Character Recognition (OCR). This process makes it possible to highlight and code the text in MAXQDA; otherwise, it would only be possible to highlight image segments.

Extracting text from images and PDFs using OCR

MAXQDA supports Optical Character Recognition (OCR) for extracting text from images and PDF documents. This feature is particularly useful when working with scanned PDFs or image files containing text that cannot be selected directly.

Open the document:
- Open the PDF document or image in the "Document Browser."
Select text segment for OCR:
- Use your mouse to draw a frame/rectangle around the portion of the document from which you want to extract text.
Perform OCR:
- Right-click on the selected segment and select Extract text from image (OCR).
Select language:
- A dialog will appear where you need to select the document’s language for accurate text detection.
- Click OK to proceed or Cancel to abort the process.
Review and edit extracted text:
- MAXQDA will analyze the selected segment and extract the text, displaying it in another dialog window.
- You can edit the extracted text if needed.
Save or copy text:
- Copy: Click this option to copy the extracted text to your clipboard.
- Save as Memo: Click this option to save the extracted text as a memo in your project.
- Save as Document: Click this option to save the extracted text as a new document in your project.
- Close: Click this option to close the dialog without saving the extracted text.

Save a PDF document as a text document

After importing a PDF document into a MAXQDA project, you can extract the text from the PDF document. Images and formatting are ignored; only the plain text is inserted as a new text document in the “Document System”.

Currently, extracting text from a PDF and saving it as a text document only works for PDFs that have a readable text layer.

Select one or more PDF documents in the “Document System” and select the function Insert PDF Text as New Document. The new text will be inserted directly below the selected documents.

Tip: With most PDF texts, the conversion makes it possible to search for co-occurrences of words within paragraphs when conducting a text search.

If you have excluded the header or footer areas of a PDF, as described in the section below, they will also be excluded when converting a PDF into a text document.

Exclude PDF header and footer areas

PDF header and footer sections can be excluded from all MAXQDA analyses, such as Word Frequencies and MAXDictio-based analyses. To do so, click on the respective icon in the Document Browser’s toolbar. You can then adjust the header and the footer separately by dragging the respective arrows at the top and bottom of the page with your mouse. Click on "Save" to apply the changes to all pages of the PDF file.

MAXQDA Manual