Trying to strip tags from extracted subs file

I have been using a script on my Mac for years that was built with ccextractor. Since the app no longer works, I decided to switch to ffmpeg. My goal is to extract subtitles from a video file and have the resulting text in one paragraph without any line breaks. However, I've run into two issues that are beyond my skills:

  1. Line Breaks Issue: My existing awk command doesn't seem to work anymore, and the output contains many line breaks instead of a single paragraph.

  2. HTML Tags Issue: With some files, the extracted lines are embedded with HTML tags, such as <font face="Serif" size="18">Example line</font>.

Here's the code snippet I have. The first line is the ffmpeg command that replaces the ccextractor line in the script below:

ffmpeg -i "$filename" "${filename}.srt"

# If the file has subtitles, extract them
if [ "$has_subtitles" ]; then
  ccextractor "$filename" -o "${filename}.srt"
  # Use dos2unix to convert the file to the correct format
  dos2unix "${filename}.srt"
  # Remove the timestamps from the .srt file
  awk -v RS= '{
    for (i=5;i<=NF;i++){
      printf "%s%s", (sep ? " " : ""), $i
      sep=1
    }
  }
  END{ print "" }' "${filename}.srt" | pbcopy
  rm "${filename}.srt"
fi

I would greatly appreciate any assistance in resolving these issues. Specifically, I need help modifying the code to remove line breaks and HTML tags from the extracted subtitles, so the output is a single paragraph.

Thanks a lot!

I tried a few awk and sed commands to fix it but nothing works!

EDIT: I made some progress. The HTML tags are gone, but the line breaks still end up in the wrong places.

I changed the last line of the awk command to:

END{ print "" }' "${filename}.srt" | sed 's/<[^>]*>//g' | tr '\n' ' ' | pbcopy

There are 2 answers

Daweo On

have the resulting text in one paragraph without any line breaks

To convert a multi-line file to a single line, tell GNU AWK to use a non-default ORS (output record separator), either an empty string or a space. Consider the following simple example. Let file.txt contain

Able
Baker
Charlie

then

awk 'BEGIN{ORS=""}{print}' file.txt

gives output

AbleBakerCharlie

whilst

awk 'BEGIN{ORS=" "}{print}' file.txt

gives output

Able Baker Charlie 

Keep in mind that the output has no trailing newline. If you want exactly one trailing newline, use one of the following:

awk 'BEGIN{ORS=""}{print}END{printf "\n"}' file.txt
awk 'BEGIN{ORS=" "}{print}END{printf "\n"}' file.txt

(tested in GNU Awk 5.1.0)
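
Applied to an .srt file, the same idea needs a few extra rules, because cue numbers, timestamp lines, and blank separators have to be skipped as well. A minimal sketch, assuming ordinary SRT text arrives on stdin. The function name srt_to_paragraph is made up for illustration, and a separator is printed before each line (as in the question's own awk) instead of setting ORS, so no trailing space is left:

```shell
# srt_to_paragraph: hypothetical helper name, not part of any tool.
# Reads SRT text on stdin and prints the subtitle text as one line.
srt_to_paragraph() {
  awk '
    {
      sub(/\r$/, "")        # drop a trailing carriage return (CRLF files)
      gsub(/<[^>]*>/, "")   # strip HTML-style tags such as <font ...> or <i>
    }
    # Skip cue numbers, timestamp lines, and blank (or now-empty) lines.
    /^[0-9]+$/ || /-->/ || /^[ \t]*$/ { next }
    # Print a space before every line except the first one.
    { printf "%s%s", sep, $0; sep = " " }
    END { print "" }        # finish with exactly one newline
  '
}
```

Usage would look like srt_to_paragraph < "${filename}.srt". One limitation of this sketch: a subtitle line that consists only of digits is dropped as if it were a cue number.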

Fab dub On

I actually made it work this way:

sed -E '/^[0-9]+$/d; /^[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}$/d; s/<[^>]*>//g' "${filename}.srt" | tr -d '\r' | tr '\n' ' ' | tr -s ' ' | sed 's/^ *//;s/ *$//' | pbcopy
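
One caveat with the pipeline above: sed runs before tr -d '\r', so on a file with CRLF line endings the $-anchored patterns do not match, and the cue numbers and timestamps stay in the output. Below is a hedged variant that removes carriage returns first. The names clean_srt and extract_paragraph are hypothetical, and the ffmpeg call assumes the first subtitle stream is text-based (selected with -map 0:s:0) and that SRT is written to stdout:

```shell
# clean_srt: hypothetical helper; reads SRT text on stdin, prints one paragraph.
clean_srt() {
  tr -d '\r' |                       # remove carriage returns before matching
    sed -E \
      -e '/^[0-9]+$/d' \
      -e '/^[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}$/d' \
      -e 's/<[^>]*>//g' |            # drop cue numbers, timestamps, and tags
    tr '\n' ' ' |                    # join all lines with spaces
    tr -s ' ' |                      # squeeze runs of spaces into one
    sed 's/^ *//;s/ *$//'            # trim leading and trailing spaces
}

# extract_paragraph: hypothetical wrapper; assumes a text-based subtitle stream.
# Writes the SRT to stdout instead of a temporary file, then cleans it.
extract_paragraph() {
  ffmpeg -nostdin -loglevel error -i "$1" -map 0:s:0 -f srt - | clean_srt
}
```

On macOS the result can then go to the clipboard with extract_paragraph "$filename" | pbcopy, which also avoids creating and deleting the temporary .srt file.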