Parsing elements from a markdown file in python 3

Question

14.7k views · Asked by Andrew Spott on 03 December 2016 at 07:29

How might I get a list of elements from a markdown file in Python 3? I'm specifically interested in getting a list of all images and links (along with relevant information like alt text and link text) out of a markdown file.

There is some prior art in this area, but it is almost exactly 2 years old at this point, and I expect that the landscape has changed a bit.

Bonus points if the parser you come up with supports MultiMarkdown.

There are 3 answers

**Håken Lid** · Answer 1 · 2016-12-03T09:27:43+00:00

You can convert the markdown into HTML with Python-Markdown, and then extract what you want from the HTML document using Beautiful Soup, which makes extracting images and links very straightforward.

This might seem like a complicated pipeline, but it's certainly easier and more robust than, for instance, writing an ad hoc markdown parser using regular expressions. Both modules are battle-tested and efficient.
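A minimal sketch of that pipeline, assuming the `markdown` and `beautifulsoup4` packages are installed. The function name `extract_links_and_images` and the sample text are made up for illustration; they are not part of either library.

```python
from bs4 import BeautifulSoup
from markdown import markdown


def extract_links_and_images(md_text):
    """Return (links, images) found in a Markdown string.

    Hypothetical helper: converts Markdown to HTML, then reads the
    <a> and <img> tags back out of the HTML.
    """
    html = markdown(md_text)
    soup = BeautifulSoup(html, "html.parser")
    links = [
        {"text": a.get_text(), "href": a.get("href")}
        for a in soup.find_all("a")
    ]
    images = [
        {"alt": img.get("alt", ""), "src": img.get("src")}
        for img in soup.find_all("img")
    ]
    return links, images


# Illustrative input, not taken from the question.
sample = "See the [docs](https://example.com/docs).\n\n![A logo](logo.png)\n"
links, images = extract_links_and_images(sample)
print(links)
print(images)
```

Using `.get()` rather than indexing avoids a `KeyError` when a tag lacks the attribute, for example an image written without alt text.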

**Sergio Correia** · Answer 2 · 2016-12-07T03:55:49+00:00

If you take advantage of two Python packages, pypandoc and panflute, you could do it quite pythonically in a few lines:

Given a text file example.md, and assuming you have Python 3.3+ and have already run pip install pypandoc panflute, place the sample code in the same folder and run it from the shell or from, e.g., IDLE.

import io
import pypandoc
import panflute

def action(elem, doc):
    if isinstance(elem, panflute.Image):
        doc.images.append(elem)
    elif isinstance(elem, panflute.Link):
        doc.links.append(elem)

if __name__ == '__main__':
    data = pypandoc.convert_file('example.md', 'json')
    doc = panflute.load(io.StringIO(data))
    doc.images = []
    doc.links = []
    # No prepare hook is passed: doc.images and doc.links are already set above.
    doc = panflute.run_filter(action, doc=doc)

    print("\nList of image URLs:")
    for image in doc.images:
        print(image.url)

The steps are:

1. Use pypandoc to obtain a JSON string that contains the AST of the markdown document.
2. Load it into panflute to create a Doc object (panflute requires a stream, so we use StringIO).
3. Use the run_filter function to iterate over every element, and extract the Image and Link objects.
4. Then you can print the URLs, alt text, etc.

**pds** · Answer 3 · 2023-12-15T11:14:19+00:00

Here's a code example of @Håken Lid's answer:

import requests
from markdown import markdown
from bs4 import BeautifulSoup

url = 'https://raw.githubusercontent.com/StackExchange/Stacks/develop/README.md'
html = markdown(requests.get(url).text, output_format="html5")
soup = BeautifulSoup(html, "html.parser")
for img in soup.find_all('img'):
    print(img['src'])

This prints the src attribute of every image in the README, one per line.
