Merge tokens based on preceding POS tags


I would like to implement some text manipulation as a pre-processing step for keyphrase extraction. Consider the example below:

import spacy
text = "conversion of existing underground gas storage facilities into storage facilities dedicated to hydrogen-storage"

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

for token in doc:
    print(f'{token.text:{8}} {token.pos_:{6}} {token.tag_:{6}} {token.dep_:{6}} {spacy.explain(token.pos_):{20}} {spacy.explain(token.tag_)}')

result:

conversion NOUN   NN     ROOT   noun                 noun, singular or mass
of       ADP    IN     prep   adposition           conjunction, subordinating or preposition
existing VERB   VBG    amod   verb                 verb, gerund or present participle
underground ADJ    JJ     amod   adjective            adjective (English), other noun-modifier (Chinese)
gas      NOUN   NN     compound noun                 noun, singular or mass
storage  NOUN   NN     compound noun                 noun, singular or mass
facilities NOUN   NNS    pobj   noun                 noun, plural
into     ADP    IN     prep   adposition           conjunction, subordinating or preposition
storage  NOUN   NN     compound noun                 noun, singular or mass
facilities NOUN   NNS    pobj   noun                 noun, plural
dedicated VERB   VBN    acl    verb                 verb, past participle
to       ADP    IN     prep   adposition           conjunction, subordinating or preposition
hydrogen NOUN   NN     compound noun                 noun, singular or mass
-        PUNCT  HYPH   punct  punctuation          punctuation mark, hyphen
storage  NOUN   NN     pobj   noun                 noun, singular or mass

I would like to recognize when a given word (for example, storage) is preceded by a NOUN (as in gas storage) in order to replace the space character with a hyphen (as already done in hydrogen-storage), but I don't want to change the space character when my word is preceded by a POS element that is not a NOUN (example: into storage).

Expected output: "conversion of existing underground gas-storage facilities into storage facilities dedicated to hydrogen-storage"

Is there an efficient way to do this?

Thank you in advance for any help

Answer by fsimonjetz:

spaCy provides a rule-based matcher. It lets you define rules to find patterns like a noun followed by a noun.

from spacy.matcher import Matcher

# nlp is the pipeline loaded in the question
pattern = [{"POS": "NOUN"}, {"POS": "NOUN"}]
matcher = Matcher(nlp.vocab)
matcher.add("MultiWordExpression", [pattern])

...which you can use to extract matching sequences (this is pretty much verbatim from the spaCy docs):

matches = matcher(doc)

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)

The output for your text is

8584524718281925236 MultiWordExpression 4 6 gas storage
8584524718281925236 MultiWordExpression 5 7 storage facilities
8584524718281925236 MultiWordExpression 8 10 storage facilities

Now there is also functionality to merge tokens via the retokenizer.merge method, but that does not work directly in this case - see below.

with doc.retokenize() as retokenizer:
    for match_id, start, end in matches:
        retokenizer.merge(doc[start:end])

In your case, there are overlapping spans ("gas storage" and "storage facilities" overlap), which results in ValueError: [E102] Can't merge non-disjoint spans.. You'd have to make sure you only get non-overlapping spans if you want to use spaCy's retokenizer, e.g., by changing the pattern to "a noun, followed by a singular noun" ([{"POS": "NOUN"}, {"TAG": "NN"}]), which would work and give the following result:

>>> for tok in doc:
...     print(tok)
conversion
of
existing
underground
gas storage # <- The match is now one token
facilities
into
storage
facilities
dedicated
to
hydrogen
-
storage
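Alternatively, spaCy ships a utility for exactly this overlap problem: spacy.util.filter_spans, which keeps the longest non-overlapping spans (preferring the earlier span on ties). A minimal sketch of how it would prune the matcher output above - here the spans are rebuilt by hand over a blank Doc so the snippet runs without the statistical model:

```python
from spacy.tokens import Doc
from spacy.util import filter_spans
from spacy.vocab import Vocab

# Stand-in for the matcher output above: overlapping spans over
# the example sentence, built without loading en_core_web_sm.
words = ("conversion of existing underground gas storage facilities "
         "into storage facilities dedicated to hydrogen - storage").split()
doc = Doc(Vocab(), words=words)
spans = [doc[4:6], doc[5:7], doc[8:10]]  # the three matcher hits

# filter_spans drops "storage facilities" (5:7) because it
# overlaps with the earlier "gas storage" (4:6).
kept = filter_spans(spans)
print([span.text for span in kept])  # ['gas storage', 'storage facilities']
```

The surviving spans are disjoint, so they can be fed to retokenizer.merge without raising E102 - in a real pipeline you would pass the spans built from the matcher matches instead of the hand-made ones.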

If you only need the string, I'd recommend using the matcher as demonstrated above to find spans and then writing a custom function that merges tokens based on those spans, which should be more flexible than the built-in retokenizer.
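Such a custom function could look like the sketch below. The function name and the token-triple input format are my own, not spaCy API; the triples can be built from a Doc with [(t.text, t.pos_, t.whitespace_) for t in doc]. It replaces the space before the target word with a hyphen whenever the preceding token is a NOUN:

```python
def hyphenate_after_noun(tokens, target="storage"):
    """tokens: list of (text, pos, trailing_whitespace) triples.
    Joins `target` to the preceding token with a hyphen when that
    token is a NOUN and the two are separated by a single space."""
    parts = []
    for i, (text, pos, ws) in enumerate(tokens):
        # "gas storage" -> "gas-storage"; "into storage" stays,
        # because "into" is ADP, not NOUN.
        if (text == target and i > 0 and tokens[i - 1][1] == "NOUN"
                and parts and parts[-1].endswith(" ")):
            parts[-1] = parts[-1][:-1] + "-"
        parts.append(text + ws)
    return "".join(parts)

# Token triples matching the POS table in the question
tokens = [
    ("conversion", "NOUN", " "), ("of", "ADP", " "),
    ("existing", "VERB", " "), ("underground", "ADJ", " "),
    ("gas", "NOUN", " "), ("storage", "NOUN", " "),
    ("facilities", "NOUN", " "), ("into", "ADP", " "),
    ("storage", "NOUN", " "), ("facilities", "NOUN", " "),
    ("dedicated", "VERB", " "), ("to", "ADP", " "),
    ("hydrogen", "NOUN", ""), ("-", "PUNCT", ""),
    ("storage", "NOUN", ""),
]
result = hyphenate_after_noun(tokens)
print(result)
# conversion of existing underground gas-storage facilities into
# storage facilities dedicated to hydrogen-storage
```

Because the function only inspects the previous token's POS and the whitespace between tokens, the already-hyphenated "hydrogen-storage" is left untouched (there is no space to replace), and "into storage" is skipped because "into" is not a NOUN.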