20.11.2019, 23:09

I'm looking for a way to exclude certain parts of a PDF, or alternatively to select which text in a PDF MAXDictio looks at.

The problem is that some websites repeat copious tag phrases on many of their pages, and MAXDictio counts these as frequent terms.

I use the Web Collector to create PDF documents for selected pages, and then analyze them with MAXDictio to generate a phrase cloud. This week I used a lot of pages from the same website. Each page includes a section of tag words, which are repeated, plus a list of "other pages" with the same anchor text. MAXDictio sees these as frequent sequences, and the resulting cloud is nonsense.

Other than going through a few hundred items and manually adding them to the Stop List, is there a way to select which part of the PDF document MAXDictio should search?

26.11.2019, 10:32

Hi Matthew,

Thanks for the question. There is indeed a way to select the text in a PDF, at least for the word frequency features, although I don't know whether using it is feasible in your case. In principle, you can use the "Only in retrieved segments" option to limit those features to the segments coded with, e.g., an "Actual content" code. However, this would require you to go through each page and select everything that's not a tag phrase. This would either need to be done manually or, in theory, you could use the lexical search with regular expressions to target those sections, provided they begin and end with a unique string (e.g. one of those tag phrases?).
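To illustrate the "begin and end with a unique string" idea, here is a minimal Python sketch of such a pattern. It is not MAXQDA code, and the regex dialect of the lexical search may differ from Python's. The marker strings "ARTICLE START" and "Tags:" are hypothetical placeholders; you would replace them with strings that really occur on your pages.

```python
import re

# Hypothetical markers (assumptions, not taken from any real site):
# the article body is assumed to start after START_MARKER and to end
# right before END_MARKER, which opens the tag block.
START_MARKER = "ARTICLE START"
END_MARKER = "Tags:"

# Non-greedy capture of everything between the two markers.
# re.DOTALL lets "." match newlines so a section can span several lines.
SECTION_PATTERN = re.compile(
    re.escape(START_MARKER) + r"(.*?)" + re.escape(END_MARKER),
    flags=re.DOTALL,
)


def content_sections(text: str) -> list[str]:
    """Return the text found between each start/end marker pair."""
    return [match.strip() for match in SECTION_PATTERN.findall(text)]


sample = "Site menu\nARTICLE START\nReal content here.\nTags: alpha, beta\nOther pages"
print(content_sections(sample))
```

The non-greedy `.*?` matters here: with a greedy match, a page containing several marker pairs would be captured as one long section running from the first start marker to the last end marker.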

Alternatively, as you said, one would usually exclude those tag phrases via the stop list.
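If you have the page text available outside MAXQDA, a rough way to collect stop list candidates is to look for lines that recur on most pages. The following is only a sketch under that assumption: the function name `repeated_lines` and the 80% threshold are illustrative choices, and the output is a list to review by hand, not a finished stop list.

```python
from collections import Counter


def repeated_lines(pages: list[str], min_share: float = 0.8) -> list[str]:
    """Return lines that occur on at least min_share of the pages."""
    counts: Counter[str] = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page, so a line
        # repeated within a single page does not inflate its score.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = min_share * len(pages)
    return sorted(line for line, n in counts.items() if n >= threshold)


# Made-up example pages; the tag line is shared, the bodies are not.
pages = [
    "Body of page A\nTags: alpha beta",
    "Body of page B\nTags: alpha beta",
    "Body of page C\nTags: alpha beta",
]
print(repeated_lines(pages))
```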

I hope this helps! If not, or in case of any other questions, please don't hesitate to contact us again.

28.11.2019, 19:12

Thanks Andreas - yes, "Only Coded Segments" would give me that control and would work well enough.

The more I thought about this problem, the more concerned I was about how loads of tag words and other extraneous text in collected web pages could be diluting the text and reducing my construct validity.

Your solution will help reduce that risk.
