Insert dots/points in a messy string for textual analysis in Python


I am given a long, messy string that lacks sentence structure, i.e., the string does not consistently contain dots/points.

Therefore, I am currently unable to break the long string down into sentences, which is required for my textual analysis.

The following example best describes what I am given and what I would need as output.

example_string = "Football is the world's most popular sport Played on rectangular fields, two teams of eleven players each compete to score goals One of the most famous teams is Real Madrid."

output_string = "Football is the world's most popular sport. Played on rectangular fields, two teams of eleven players each compete to score goals. One of the most famous teams is Real Madrid."

I first thought of inserting a dot/point wherever a lower-case word is directly followed by a capitalized word without one in between. However, since certain words, especially names, also start with a capital letter, this would insert dots/points incorrectly (e.g., in the example, I would add a dot/point before "Real Madrid").

Any help is appreciated. Thank you!


There are 2 answers

Answer by petezurich (best answer)

How about leveraging an LLM (via an API) for that?

Quick test run with GPT-4:

Prompt

    Separate the following string into sentences. List each sentence with a bullet point: "Football is the world's most popular sport Played on rectangular fields, two teams of eleven players each compete to score goals One of the most famous teams is Real Madrid."

Output

    - Football is the world's most popular sport.
    - Played on rectangular fields, two teams of eleven players each compete to score goals.
    - One of the most famous teams is Real Madrid.
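
If you want to do this from Python, here is a minimal sketch of the plumbing around such a call. It only builds the prompt shown above and parses a bulleted reply back into sentences. The names `build_prompt`, `parse_bullets`, `split_with_llm` and `fake_llm` are hypothetical, and `fake_llm` is just an offline stand-in that returns the reply from my test run; you would swap in a function that calls whichever LLM API you actually use.

```python
def build_prompt(text):
    # Same wording as the prompt used in the test run above.
    return (
        "Separate the following string into sentences. "
        "List each sentence with a bullet point: "
        f'"{text}"'
    )


def parse_bullets(reply):
    # Keep only lines that look like bullets and drop the bullet marker.
    sentences = []
    for line in reply.splitlines():
        line = line.strip()
        if line.startswith(("-", "*", "•")):
            sentences.append(line.lstrip("-*• ").strip())
    return sentences


def split_with_llm(text, ask_llm):
    # ask_llm: any callable that takes a prompt string and returns the reply text.
    return parse_bullets(ask_llm(build_prompt(text)))


def fake_llm(prompt):
    # Stand-in for a real API call (assumption: the model answers with "- " bullets).
    return (
        "- Football is the world's most popular sport.\n"
        "- Played on rectangular fields, two teams of eleven players each compete to score goals.\n"
        "- One of the most famous teams is Real Madrid."
    )


example_string = (
    "Football is the world's most popular sport Played on rectangular fields, "
    "two teams of eleven players each compete to score goals "
    "One of the most famous teams is Real Madrid."
)
sentences = split_with_llm(example_string, fake_llm)
output_string = " ".join(sentences)
print(output_string)
```

Passing the LLM call in as a function keeps the parsing testable without network access. A real model may format its reply differently, so it is worth checking that the parsed sentences still add up to the original text.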
Answer by houssem.007

Maybe you can use a regular expression to find a lowercase letter followed by whitespace and an uppercase letter:

import re

example_string = "Football is the world's most popular sport Played on rectangular fields, two teams of eleven players each compete to score goals One of the most famous teams is Real Madrid."
# Match whitespace that follows a lowercase letter and precedes an uppercase letter.
pattern = re.compile(r'(?<=[a-z])\s+([A-Z])')
# Replace that whitespace with ". " and keep the captured uppercase letter.
output_string = pattern.sub(r'. \1', example_string)
print(output_string)

Output (note that this still inserts unwanted dots before "Real" and "Madrid"):

    Football is the world's most popular sport. Played on rectangular fields, two teams of eleven players each compete to score goals. One of the most famous teams is. Real. Madrid.
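
The over-splitting around "Real Madrid" in that output could perhaps be reduced with a word-level heuristic instead of a single regex. The sketch below is only an assumption about what might work for text like the example, not a general sentence splitter. It adds a dot before a capitalized word only if the previous word starts with a lowercase letter and has no trailing punctuation, the next word starts with a lowercase letter, and the word is not in a caller-supplied set of known names. The function name `add_sentence_dots` and the `known_names` parameter are made up for illustration.

```python
def add_sentence_dots(text, known_names=frozenset({"I"})):
    # Note: split() collapses runs of whitespace into single spaces.
    words = text.split()
    out = []
    for i, word in enumerate(words):
        prev = words[i - 1] if i > 0 else ""
        nxt = words[i + 1] if i + 1 < len(words) else ""
        starts_sentence = (
            word[:1].isupper()
            and prev[:1].islower()      # previous word is an ordinary lowercase word
            and prev[-1:].isalnum()     # ...with no punctuation after it already
            and nxt[:1].islower()       # skips multi-word names such as "Real Madrid"
            and word.strip(".,;:!?") not in known_names
        )
        if starts_sentence:
            out[-1] += "."              # close the previous sentence
        out.append(word)
    return " ".join(out)


example_string = (
    "Football is the world's most popular sport Played on rectangular fields, "
    "two teams of eleven players each compete to score goals "
    "One of the most famous teams is Real Madrid."
)
result = add_sentence_dots(example_string)
print(result)
```

On the example string this leaves "Real Madrid" untouched, but a single capitalized name in the middle of a sentence (say "Paris" followed by a lowercase word) would still get a dot in front of it unless you pass it in `known_names`. For anything beyond simple cases, a proper NLP library or the LLM approach is probably more reliable.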