This vignette will introduce you to phrase mining using the phm package. Those who are familiar with the tm package will recognize that there are similarities in functionality between the two.
Phrase mining is done on a corpus of texts, or on a vector where each element is a text, via the function phraseDoc. This function creates a phraseDoc object, which is equivalent to a term-document matrix stored in a more efficient manner. To see the term-document matrix, use the function as.matrix on the phraseDoc object.
The term-document matrix has phrases on its rows and documents on its columns; the entry at the intersection of a row and a column is the number of times that phrase occurs in that document.
When the phraseDoc object is created from a vector of texts, each element of the vector is considered a document, with the index of the element as its ID.
The phraseDoc function extracts principal phrases from the texts given to it. A principal phrase is a phrase that is frequent in its own right (so not merely as part of a longer phrase), is meaningful, does not cross punctuation marks, and does not start or end with so-called stop words (with a few exceptions). The phraseDoc function gives progress updates, since at times it may take a while to complete. These can be silenced if desired.
When using this function in a Shiny application, these progress updates can be given via a Shiny progress meter; the function uses about 100 progress steps, so it should be called inside a withProgress function with the max argument set to at least 100. The shiny argument of the phraseDoc function should be set to TRUE in that case.
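A minimal outline of such a call might look as follows. It is shown as a commented sketch only, since it needs a running Shiny session; texts is a placeholder for your own character vector or corpus, and the withProgress arguments follow shiny's documented interface:

```r
# Commented outline only: requires the shiny and phm packages and a live
# Shiny session. 'texts' is a placeholder, not an object defined here.
# library(shiny)
# withProgress(message = "Mining phrases", max = 100, {
#   pd <- phraseDoc(texts, shiny = TRUE)
# })
```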
When converting a phraseDoc object to a term-document matrix with as.matrix, by default the IDs of the documents are displayed on the columns. This can be changed to display the indices of the documents instead.
Once the phraseDoc object has been created, there are several functions available that will obtain information from it:
freqPhrases will display its most frequent phrases
getDocs will display all its documents that have nonzero frequencies for phrases that appear in a vector of phrases
getPhrases will display all phrases occurring in documents that appear in a vector of document IDs or document indices
removePhrases will return the phraseDoc object with a set of phrases removed
As an example, we create the following vector with texts:
tst=c("This is a test text", "This is a test text 2", "This is another test text",
      "This is another test text 2", "This girl will test text that man",
      "This boy will test text that man")
Create the phraseDoc object on it:
pd=phraseDoc(tst)
#> "2022-06-07 14:51:39 EDT"
#> "2022-06-07 14:51:39 EDT"
#> "Rectifying frequencies..."
#> "2022-06-07 14:51:39 EDT"
Display the term-document matrix:
as.matrix(pd)
#>                          docs
#> phrases                   1 2 3 4 5 6
#>   another test text       0 0 1 1 0 0
#>   test text               1 1 0 0 0 0
#>   will test text that man 0 0 0 0 1 1
Get the 3 most frequent principal phrases:
freqPhrases(pd,3)
#>                         frequency
#> will test text that man         2
#> test text                       2
#> another test text               2
Obtain all frequencies for documents with the phrases “test text” or “another test text”:
getDocs(pd,c("test text","another test text"))
#>                   1 2 3 4
#> another test text 0 0 1 1
#> test text         1 1 0 0
Obtain all frequencies for principal phrases in documents 1 and 2:
getPhrases(pd, 1:2)
#>           1 2
#> test text 1 1
Remove the phrase “test text” from the phrase document:
pd=removePhrases(pd, "test text")
as.matrix(pd)
#>                          docs
#> phrases                   1 2 3 4 5 6
#>   another test text       0 0 1 1 0 0
#>   will test text that man 0 0 0 0 1 1
The phm package also provides a distance measure that is well suited to text. Text distance is calculated as the proportion of unmatched frequencies, i.e., the number of unmatched frequencies divided by the total of the frequencies in the two vectors. The text clustering functions can be used with term-document matrices of phrases, as well as with regular term-document matrices whose terms are words (usually obtained via functions in the tm package).
Text distance is a number between 0 and 1, where 0 means that the two texts have the same terms and the same frequencies of those terms, and 1 indicates that they have no terms in common. A smaller number means that the texts are more alike, while a larger number (closer to 1) means they are less alike.
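As a quick illustration of this definition, the proportion of unmatched frequencies between two frequency vectors can be computed in a few lines of base R. This is a sketch of the formula as described above, not the package's own implementation:

```r
# Proportion of unmatched frequencies: sum of absolute per-term differences,
# divided by the sum of all frequencies in the two vectors.
text_dist_sketch <- function(x, y) {
  sum(abs(x - y)) / sum(x + y)
}
text_dist_sketch(c(1, 2, 0), c(0, 1, 1))
#> 0.6
```

The result agrees with textDist on the same vectors, shown next.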
textDist will calculate the text distance
between two numeric vectors:
textDist(c(1,2,0),c(0,1,1))
#> 0.6
Each vector represents a document, and the numbers in the vectors are the frequencies of terms. In the example, the first document/vector has one occurrence of the first term, while the second document has none.
textDist can also be used on matrices, in which case it returns a vector with the text distances between corresponding columns:
(M1=matrix(c(0,1,0,2,0,10,0,14),4))
#>      [,1] [,2]
#> [1,]    0    0
#> [2,]    1   10
#> [3,]    0    0
#> [4,]    2   14
(M2=matrix(c(12,0,8,0,1,3,1,2),4))
#>      [,1] [,2]
#> [1,]   12    1
#> [2,]    0    3
#> [3,]    8    1
#> [4,]    0    2
textDist(M1,M2)
#> 1.0000000 0.6774194
Note that the first columns of the two matrices have no terms in common, and so their distance is the highest possible: 1.
textDistMatrix calculates the text distance
between all combinations of the columns of a matrix:
M=matrix(c(0,1,0,2,0,10,0,14,12,0,8,0,1,0,1,0),4)
colnames(M)=1:4; rownames(M)=c("A","B","C","D")
M
#>   1  2  3 4
#> A 0  0 12 1
#> B 1 10  0 0
#> C 0  0  8 1
#> D 2 14  0 0
(tdm=textDistMatrix(M))
#>           1         2         3
#> 2 0.7777778
#> 3 1.0000000 1.0000000
#> 4 1.0000000 1.0000000 0.8181818
class(tdm)
#> "dist"
Note that the output of this function is of class dist.
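Because the result is a dist object, it can be passed directly to base R routines that accept one, such as hclust. Here is a small sketch with a hand-made dist object; the distance values are made up for illustration and are not phm output:

```r
# A hand-made 4x4 symmetric distance matrix, converted to a "dist" object.
d <- as.dist(matrix(c(0.00, 0.78, 1.00, 1.00,
                      0.78, 0.00, 1.00, 1.00,
                      1.00, 1.00, 0.00, 0.82,
                      1.00, 1.00, 0.82, 0.00), 4))
hc <- hclust(d, method = "average")  # average-linkage hierarchical clustering
cutree(hc, k = 2)                    # groups documents {1,2} and {3,4}
```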
textCluster will use text clustering to cluster any term-document matrix. Its output is similar to the output of kmeans. However, note that, if there are any documents without terms, they will all be put in the last cluster.
First we create a term-document matrix:
M=matrix(c(rep(0,4),0,1,0,2,0,10,0,14,12,0,8,0,1,0,1,0,rep(0,4)),4)
colnames(M)=1:6; rownames(M)=c("A","B","C","D")
M
#>   1 2  3  4 5 6
#> A 0 0  0 12 1 0
#> B 0 1 10  0 0 0
#> C 0 0  0  8 1 0
#> D 0 2 14  0 0 0
Then we cluster it into 3 clusters:
(tc=textCluster(M,3))
#> << textCluster >>
#> 3 clusters
#> 6 documents
We can look at the output of the function:
#This shows for each document what cluster it is in
tc$cluster
#> 1 2 3 4 5 6
#> 3 1 1 2 2 3
#This shows for each cluster how many documents it contains
tc$size
#> 1 2 3
#> 2 2 2
#This matrix shows the centroid for each cluster on the columns, with terms on
#the rows
tc$centroids
#>     1   2 3
#> A 0.0 6.5 0
#> B 5.5 0.0 0
#> C 0.0 4.5 0
#> D 8.0 0.0 0
showCluster will show the contents of one specific cluster. It also shows a column with, for each term, the number of documents it appears in, and a column with the total frequency of each term in the cluster. The terms are displayed in descending order of those last two columns, so the most common terms are displayed first.
Note that this function can be used with any clustering method; all it needs is the term-document matrix, a vector with the cluster ID for each document, and the number of the cluster to be displayed.
Let’s take a look at the clusters we have created:
showCluster(M,tc$cluster,1)
#>   2  3 nDocs totFreq
#> D 2 14     2      16
#> B 1 10     2      11
showCluster(M,tc$cluster,2)
#>    4 5 nDocs totFreq
#> A 12 1     2      13
#> C  8 1     2       9
showCluster(M,tc$cluster,3)
#> $docs
#> "1" "6"
#>
#> $note
#> "Documents have no terms"
We see for example that the first cluster consists of documents 2 and 3, which contain only terms “B” and “D”, both occurring in both documents, with “D” having the greatest overall frequency in the cluster, so it occurs first.
The second cluster consists of documents 4 and 5, which contain only terms “A” and “C”. Both documents contain those two terms, but the frequency of “A” is the largest and thus it appears first.
Note that the last cluster looks different from the others; it contains all documents without terms. These are documents 1 and 6.
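The nDocs and totFreq columns can be sketched in a few lines of base R, using the term-document matrix and cluster vector from above. This illustrates the computation as described; it is not phm's own code:

```r
# Term-document matrix and cluster assignment from the textCluster example.
M <- matrix(c(rep(0,4), 0,1,0,2, 0,10,0,14, 12,0,8,0, 1,0,1,0, rep(0,4)), 4,
            dimnames = list(c("A","B","C","D"), 1:6))
cl <- c(3, 1, 1, 2, 2, 3)              # cluster ID for each document
sub <- M[, cl == 1, drop = FALSE]      # columns of cluster 1 (documents 2 and 3)
nDocs   <- rowSums(sub > 0)            # number of documents containing each term
totFreq <- rowSums(sub)                # total frequency of each term in the cluster
res <- cbind(sub, nDocs, totFreq)[nDocs > 0, , drop = FALSE]
res[order(-res[, "nDocs"], -res[, "totFreq"]), ]
#>   2  3 nDocs totFreq
#> D 2 14     2      16
#> B 1 10     2      11
```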
We can create a corpus from a data frame such that all variables other than the text variable are stored in the meta fields of the documents. This is done using the function DFSource in conjunction with the VCorpus function, which resides in the tm package.
(df=data.frame(id=LETTERS[1:3],text=c("First text","Second text","Third text"),
   title=c("N1","N2","N3"),author=c("Smith","Jones","Jones")))
#>   id        text title author
#> 1  A  First text    N1  Smith
#> 2  B Second text    N2  Jones
#> 3  C  Third text    N3  Jones
#Create the corpus
co=tm::VCorpus(DFSource(df))
#The content of one of the documents
co[[1]]$content
#> "First text"
#The meta data of one of the documents; all variables are present.
co[[1]]$meta
#> author       : Smith
#> datetimestamp: 2022-06-07 18:51:39
#> description  : character(0)
#> heading      : character(0)
#> id           : A
#> language     : en
#> origin       : character(0)
#> title        : N1
Note that the data frame must have the variables id and text.
The PubMed website will allow a user to enter search criteria, and will return abstracts of medical publications related to those search criteria. These abstracts can be saved in PubMed format, in which case a file will be created on the user’s host system containing these abstracts together with some additional information for each publication.
Running the function
getPubMed on this file will create
a data table with its contents, with each row representing a
publication. The data table will have an id variable, containing the
PMIDs of the publications, a text variable containing the text of the
abstracts, and several other variables.
Running the function
DFSource on this data table will create a corpus with a
plain text document for each publication. This corpus may then be used
to perform phrase mining or regular (word) text mining.
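Putting the pieces together, the end-to-end workflow might look like the following commented outline. The file name is a placeholder for a file saved from the PubMed website, and the snippet assumes the phm and tm packages:

```r
# Commented outline only: requires phm, tm, and a PubMed-format file on disk.
# pubs <- getPubMed("pubmed_result.txt")   # file name is a placeholder
# co   <- tm::VCorpus(DFSource(pubs))      # one plain text document per publication
# pd   <- phraseDoc(co)                    # mine principal phrases from the corpus
# freqPhrases(pd, 10)                      # ten most frequent principal phrases
```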