Searching for paragraphs in a text file using Python

I have a Word file with close to 700 pages of text in it. I want to filter it by particular headers and extract the entire content under each matching header. Once that is done, I want to store the result in an Excel sheet. I would love to do this in Python, but solutions in R are also welcome.

There is 1 answer

r2evans

I searched for "R" and "docx" files, and officer came up frequently. I checked its CRAN page, which pointed me to its home page, which included a section named "import Word document in a data.frame". That section linked to docx_summary which includes two lines of code. I expand on that.

But first, reproducible data.

  1. Open Word and paste the following text as unformatted text:

    Lorem Ipsum
    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
    Hodor Ipsum
    Hodor. Hodor hodor, hodor. Hodor hodor hodor hodor hodor. Hodor. Hodor! Hodor hodor, hodor; hodor hodor hodor. Hodor. Hodor hodor; hodor hodor - hodor, hodor, hodor hodor. Hodor, hodor. Hodor. Hodor, hodor hodor hodor; hodor hodor; hodor hodor hodor! Hodor hodor HODOR! Hodor hodor... Hodor hodor hodor...
    Hipster Ipsum
    Lorem ipsum dolor amet mustache knausgaard +1, blue bottle waistcoat tbh semiotics artisan synth stumptown gastropub cornhole celiac swag. Brunch raclette vexillologist post-ironic glossier ennui XOXO mlkshk godard pour-over blog tumblr humblebrag. Blue bottle put a bird on it twee prism biodiesel brooklyn. Blue bottle ennui tbh succulents.
    
  2. Select each of the two-word "Ipsum" lines and assign it the "Heading 1" style.

  3. Save. I saved it as Lorem Ipsum.docx.

Extraction in R

Let's find "Hodor Ipsum".

# library(officer) # optional, I'm doing the work without fully loading it
lorem <- officer::read_docx("Lorem Ipsum.docx")
summ <- officer::docx_summary(lorem)
summ
#   doc_index content_type style_name                                                                                                                                                                                                                                                                                                                                                   text level num_id
# 1         1    paragraph  heading 1                                                                                                                                                                                                                                                                                                                                            Lorem Ipsum    NA     NA
# 2         2    paragraph       <NA>         Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.    NA     NA
# 3         3    paragraph  heading 1                                                                                                                                                                                                                                                                                                                                            Hodor Ipsum    NA     NA
# 4         4    paragraph       <NA>                                      Hodor. Hodor hodor, hodor. Hodor hodor hodor hodor hodor. Hodor. Hodor! Hodor hodor, hodor; hodor hodor hodor. Hodor. Hodor hodor; hodor hodor - hodor, hodor, hodor hodor. Hodor, hodor. Hodor. Hodor, hodor hodor hodor; hodor hodor; hodor hodor hodor! Hodor hodor HODOR! Hodor hodor... Hodor hodor hodor...    NA     NA
# 5         5    paragraph  heading 1                                                                                                                                                                                                                                                                                                                                          Hipster Ipsum    NA     NA
# 6         6    paragraph       <NA> Lorem ipsum dolor amet mustache knausgaard +1, blue bottle waistcoat tbh semiotics artisan synth stumptown gastropub cornhole celiac swag. Brunch raclette vexillologist post-ironic glossier ennui XOXO mlkshk godard pour-over blog tumblr humblebrag. Blue bottle put a bird on it twee prism biodiesel brooklyn. Blue bottle ennui tbh succulents.    NA     NA
str(summ)
# 'data.frame': 6 obs. of  6 variables:
#  $ doc_index   : int  1 2 3 4 5 6
#  $ content_type: chr  "paragraph" "paragraph" "paragraph" "paragraph" ...
#  $ style_name  : chr  "heading 1" NA "heading 1" NA ...
#  $ text        : chr  "Lorem Ipsum" "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore m"| __truncated__ "Hodor Ipsum" "Hodor. Hodor hodor, hodor. Hodor hodor hodor hodor hodor. Hodor. Hodor! Hodor hodor, hodor; hodor hodor hodor. "| __truncated__ ...
#  $ level       : num  NA NA NA NA NA NA
#  $ num_id      : int  NA NA NA NA NA NA
ind <- with(summ, which(grepl("heading", style_name) & text == "Hodor Ipsum"))
ind
# [1] 3
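Since the question asked for Python, a rough equivalent of the docx_summary step can be sketched with the standard library alone, because a .docx file is a zip archive whose main part is word/document.xml. The helper names docx_paragraphs and find_heading are made up for this illustration. Matching on style IDs that start with "heading" is an assumption that holds for the built-in English heading styles (the XML stores an ID such as Heading1, not the display name "heading 1" that officer reports); a localized or custom template may use different IDs.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return (style_id, text) tuples, one per <w:p> element in document order.

    Simplification: paragraphs inside tables are included as well, and no
    attempt is made to handle text boxes, footnotes, or headers/footers.
    """
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    rows = []
    for p in root.iter(W + "p"):
        style = p.find(f"{W}pPr/{W}pStyle")
        style_id = style.get(W + "val") if style is not None else None
        # A paragraph's text is split across runs; join all <w:t> pieces.
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        rows.append((style_id, text))
    return rows


def find_heading(rows, title):
    """Index of the first heading-styled paragraph whose text equals title, or None."""
    for i, (style_id, text) in enumerate(rows):
        if style_id and style_id.lower().startswith("heading") and text == title:
            return i
    return None


# Usage against the sample file created above (left commented out so the
# sketch does not depend on the file being present):
# rows = docx_paragraphs("Lorem Ipsum.docx")
# idx = find_heading(rows, "Hodor Ipsum")
```

Note that the index returned here is 0-based, whereas the R result above is 1-based.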

This sample document doesn't really have much in the way of levels or other grouping styles/mechanisms, so I'm going to assume that the data.frame row after the applicable header is the paragraph I'm looking for.

# guard against no match (or several) before taking the row after the header
if (length(ind) == 1 && ind < nrow(summ)) summ$text[ind+1]
# [1] "Hodor. Hodor hodor, hodor. Hodor hodor hodor hodor hodor. Hodor. Hodor! Hodor hodor, hodor; hodor hodor hodor. Hodor. Hodor hodor; hodor hodor - hodor, hodor, hodor hodor. Hodor, hodor. Hodor. Hodor, hodor hodor hodor; hodor hodor; hodor hodor hodor! Hodor hodor HODOR! Hodor hodor... Hodor hodor hodor..."
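In a real 700-page document a header will usually be followed by more than one paragraph, so the "next row" assumption probably needs loosening to "every row up to the next heading". Below is a minimal Python sketch of that idea, since the question asked for Python. The rows list is a shortened, hypothetical stand-in shaped like the style_name and text columns of summ (with an extra Hodor paragraph added to show the multi-paragraph case), and sections_by_heading and write_sections_csv are names invented here. It ignores heading levels, so any heading ends the current section, and it writes a CSV file that Excel can open rather than a true .xlsx, which would need a third-party package such as openpyxl.

```python
import csv


def sections_by_heading(rows):
    """Group (style_name, text) rows into {heading text: [paragraph texts]}.

    Every heading starts a new section regardless of its level, and
    repeated heading titles are merged into one entry.
    """
    sections = {}
    current = None
    for style, text in rows:
        if style and style.lower().startswith("heading"):
            current = text
            sections.setdefault(current, [])
        elif current is not None and text.strip():
            # Non-empty body paragraph: attach it to the most recent heading.
            sections[current].append(text)
    return sections


def write_sections_csv(sections, wanted, path):
    """Write one row per wanted heading; headings not found are skipped."""
    # "utf-8-sig" adds a byte-order mark so Excel detects the encoding.
    with open(path, "w", newline="", encoding="utf-8-sig") as fh:
        writer = csv.writer(fh)
        writer.writerow(["header", "content"])
        for title in wanted:
            if title in sections:
                writer.writerow([title, "\n".join(sections[title])])


# Hypothetical, shortened rows mirroring the structure of `summ`
rows = [
    ("heading 1", "Lorem Ipsum"),
    (None, "Lorem ipsum dolor sit amet."),
    ("heading 1", "Hodor Ipsum"),
    (None, "Hodor. Hodor hodor, hodor."),
    (None, "Hodor hodor HODOR!"),
    ("heading 1", "Hipster Ipsum"),
    (None, "Blue bottle ennui tbh succulents."),
]

sections = sections_by_heading(rows)
print(sections["Hodor Ipsum"])
# ['Hodor. Hodor hodor, hodor.', 'Hodor hodor HODOR!']

# To save selected sections (file name is only an example):
# write_sections_csv(sections, ["Hodor Ipsum"], "sections.csv")
```

Joining the paragraphs with newlines keeps each section in a single cell; one row per paragraph would be an equally reasonable layout if the content needs filtering later.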