3. Finding Patterns

Imagine you are building a chatbot and want to find utterances in user input that express one of the following:

ability, possibility, permission, or obligation (as opposed to utterances that describe real actions that have occurred, are occurring, or occur regularly)

For instance, we want to find “I can do it.” but not “I’ve done it.”

The pattern we look for is:

subject + auxiliary + verb + . . . + direct object + ...

The ellipses indicate that the direct object is not necessarily located immediately after the verb; there might be other words in between.
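To make the role of the ellipses concrete, here is a small library-free sketch on the sentence “I might send them a card as a reminder.” The dependency labels follow the displaCy output shown later in this section. The head indices, the final punctuation token, and the helper name `words_between_verb_and_object` are assumptions added for illustration, not output from a model.

```python
# Hand-annotated (text, dependency label, head index) triples for
# "I might send them a card as a reminder."
# Labels follow the displaCy output in this section; head indices and the
# final punctuation token are assumed for illustration.
TOKENS = [
    ("I", "nsubj", 2),
    ("might", "aux", 2),
    ("send", "ROOT", 2),
    ("them", "dative", 2),
    ("a", "det", 5),
    ("card", "dobj", 2),
    ("as", "prep", 2),
    ("a", "det", 8),
    ("reminder", "pobj", 6),
    (".", "punct", 2),
]


def words_between_verb_and_object(tokens):
    """Return the words between the root verb and its direct object.

    Returns None if the root verb has no direct object.
    """
    # Index of the root verb.
    root = next(i for i, (_, dep, _) in enumerate(tokens) if dep == "ROOT")
    for i, (_, dep, head) in enumerate(tokens):
        # A direct object is a token labelled dobj whose head is the root verb.
        if dep == "dobj" and head == root:
            return [text for text, _, _ in tokens[root + 1:i]]
    return None


gap = words_between_verb_and_object(TOKENS)
print(gap)  # ['them', 'a']
```

Here the direct object “card” is separated from the verb “send” by two other words, which is exactly the situation the first ellipsis in the pattern allows for.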

3.1. Check spaCy version

!pip show spacy
Name: spacy
Version: 3.0.5
Summary: Industrial-strength Natural Language Processing (NLP) in Python
Home-page: https://spacy.io
Author: Explosion
Author-email: contact@explosion.ai
License: MIT
Location: c:\programdata\anaconda3\envs\lda\lib\site-packages
Requires: cymem, blis, spacy-legacy, numpy, preshed, srsly, murmurhash, jinja2, thinc, tqdm, wasabi, requests, pydantic, setuptools, typer, catalogue, packaging, pathy
Required-by: en-core-web-sm

3.2. Hard-coded pattern discovery

To look for the subject + auxiliary + verb + . . . + direct object + ... pattern programmatically, we go through each token’s dependency label (not its part-of-speech label) to first find the sequence nsubj aux ROOT, where ROOT marks the root verb. Then, for each child of the root verb, we check whether it is a direct object (dobj) of the verb.

import spacy
nlp = spacy.load('en_core_web_sm')
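def dep_pattern(doc):
    # Stop two tokens before the end so that doc[i+2] is always a valid index.
    for i in range(len(doc)-2):
        # Look for three adjacent tokens labelled nsubj, aux and ROOT.
        if doc[i].dep_ == 'nsubj' and doc[i+1].dep_ == 'aux' and doc[i+2].dep_ == 'ROOT':
            # The direct object, if there is one, is a child of the root verb.
            for tok in doc[i+2].children:
                if tok.dep_ == 'dobj':
                    return True

    return False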
# doc = nlp(u'We can overtake them.')
doc = nlp(u'I might send them a card as a reminder.')

3.2.1. Use displaCy to visualise the dependencies

from spacy import displacy
displacy.render(doc, style='dep')
[displaCy dependency diagram. Tokens and POS tags: I (PRON), might (AUX), send (VERB), them (PRON), a (DET), card (NOUN), as (ADP), a (DET), reminder. (NOUN). Arc labels: nsubj, aux, dative, det, dobj, prep, det, pobj.]
options = {'compact': True, 'font': 'Tahoma'}
displacy.render(doc, style='dep', options=options)
[The same diagram rendered in compact mode with the Tahoma font: same tokens, POS tags and arc labels.]
if dep_pattern(doc):
    print('Found')
else:
    print('Not found')
Found

3.3. Using spaCy pattern matcher

spaCy has a built-in tool called Matcher that is specially designed to find sequences of tokens based on pattern rules. An implementation of the “subject + auxiliary + verb” pattern with Matcher might look like this:

import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{"DEP": "nsubj"}, {"DEP": "aux"}, {"DEP": "ROOT"}]
matcher.add("NsubjAuxRoot", [pattern])
doc = nlp("We can overtake them.")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print("Span: ", span.text)
    print("The positions in the doc are: ", start, "-", end)
    print("Match ID ", match_id)
    print(doc.vocab.strings[match_id]) 
    for tok in doc[end-1].children:
        if tok.dep_ == 'dobj':
            # Print the text of the direct object, not its dependency label.
            print("The direct object of {} is {}".format(doc[end-1], tok.text))
Span:  We can overtake
The positions in the doc are:  0 - 3
Match ID  10599197345289971701
NsubjAuxRoot
The direct object of overtake is them
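One caveat, offered as an observation rather than something the text above claims: the dependency pattern nsubj aux ROOT does not by itself separate “I can do it.” from “I’ve done it.”, because the perfect auxiliary “’ve” is typically also labelled aux. A possible refinement is to require that the auxiliary carries the fine-grained tag MD (modal). The sketch below shows the idea on hand-annotated tokens; the tags, labels, and the function name `has_nsubj_aux_root` are assumptions for illustration, not output from a model.

```python
# Hand-annotated (text, fine-grained tag, dependency label) triples.
# The tags and labels are assumed here, not produced by running a model.
CAN_DO = [
    ("I", "PRP", "nsubj"), ("can", "MD", "aux"), ("do", "VB", "ROOT"),
    ("it", "PRP", "dobj"), (".", ".", "punct"),
]
HAVE_DONE = [
    ("I", "PRP", "nsubj"), ("'ve", "VBP", "aux"), ("done", "VBN", "ROOT"),
    ("it", "PRP", "dobj"), (".", ".", "punct"),
]


def has_nsubj_aux_root(tokens, modal_only=False):
    """Check for three adjacent tokens labelled nsubj, aux, ROOT.

    With modal_only=True the auxiliary must also be tagged MD.
    """
    for i in range(len(tokens) - 2):
        window = tokens[i:i + 3]
        if [dep for _, _, dep in window] != ["nsubj", "aux", "ROOT"]:
            continue
        if modal_only and window[1][1] != "MD":
            continue
        return True
    return False


print(has_nsubj_aux_root(CAN_DO))                      # True
print(has_nsubj_aux_root(HAVE_DONE))                   # True
print(has_nsubj_aux_root(HAVE_DONE, modal_only=True))  # False
```

With Matcher, the same restriction could presumably be expressed by adding "TAG": "MD" to the second token dictionary of the pattern, since TAG is one of the token attributes Matcher patterns accept.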

3.4. Summary of Rule-based Matching

Steps for using the Matcher class:

  1. Create a Matcher instance by passing in a shared Vocab object;

  2. Specify the pattern as a list of dictionaries, one per token, each constraining a token attribute (here the dependency label, DEP);

  3. Add the pattern to the Matcher object;

  4. Input a Doc object to the matcher;

  5. Go through each match, which is a (match_id, start, end) tuple.
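To show what steps 2 to 5 amount to, here is a toy stand-in for the matching loop that runs on hand-annotated tokens. It is a simplified sketch, not spaCy's implementation: the function `find_matches` and the token dictionaries are hypothetical, it only supports exact attribute equality, and it returns the pattern name in place of spaCy's hashed match ID.

```python
def find_matches(tokens, patterns):
    """Return (name, start, end) triples for every pattern hit.

    tokens:   list of dicts such as {"TEXT": "We", "DEP": "nsubj"}
    patterns: dict mapping a pattern name to a list of token specs,
              where each spec is a dict of required attribute values
    """
    matches = []
    for name, pattern in patterns.items():
        width = len(pattern)
        # Slide a window of the pattern's length over the tokens.
        for start in range(len(tokens) - width + 1):
            window = tokens[start:start + width]
            if all(
                all(tok.get(attr) == value for attr, value in spec.items())
                for tok, spec in zip(window, pattern)
            ):
                # end is exclusive, like the end index of a slice.
                matches.append((name, start, start + width))
    return matches


# Hand-annotated tokens for "We can overtake them." (labels assumed).
doc_tokens = [
    {"TEXT": "We", "DEP": "nsubj"},
    {"TEXT": "can", "DEP": "aux"},
    {"TEXT": "overtake", "DEP": "ROOT"},
    {"TEXT": "them", "DEP": "dobj"},
    {"TEXT": ".", "DEP": "punct"},
]

# Steps 2 and 3: define the pattern and register it under a name.
patterns = {"NsubjAuxRoot": [{"DEP": "nsubj"}, {"DEP": "aux"}, {"DEP": "ROOT"}]}

# Steps 4 and 5: run the matcher and go through each match.
for name, start, end in find_matches(doc_tokens, patterns):
    span_text = " ".join(tok["TEXT"] for tok in doc_tokens[start:end])
    print(name, start, end, span_text)
```

The (start, end) pair behaves like slice bounds, which is why the match for “We can overtake” is reported as 0 and 3 in the Matcher output above.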

The example above used the token-based Matcher with dependency labels as its token attribute. spaCy offers further support for rule-based matching:

  • Token Matcher: regex, and patterns such as

  • Phrase Matcher: PhraseMatcher class

  • Entity Ruler

  • Combining models with rules

For more information on the different types of matchers, see the spaCy documentation on rule-based matching.

Reference: Chapter 6 of Natural Language Processing with Python and spaCy