Parsing elements from a markdown file in python 3


How might I get a list of elements from a markdown file in python 3? I'm specifically interested in getting a list of all images and links (along with relevant information like alt-text and link text) out of a markdown file.

There is some prior art in this area, but it is almost exactly 2 years old at this point, and I expect that the landscape has changed a bit.

Bonus points if the parser you come up with supports multimarkdown.


There are 3 answers

2
Håken Lid

You can convert the markdown into html with Python-Markdown, and then extract what you want from the html document using Beautiful Soup, which makes extracting images and links very straightforward.

This might seem like a complicated pipeline, but it's certainly easier and more robust than, for instance, writing an ad hoc markdown parser using regular expressions. These modules are battle-tested and efficient.
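
As a rough sketch of that pipeline, the snippet below renders a small markdown string with Python-Markdown and pulls the link text, alt text, URLs and titles out with Beautiful Soup. The sample document and the helper name extract_links_and_images are made up for illustration, and it assumes you have run pip install markdown beautifulsoup4.

```python
from bs4 import BeautifulSoup
from markdown import markdown

# Made-up sample document, used only for illustration
SAMPLE_MD = """\
# Demo

See the [project page](https://example.com/project "Project home") for details.

![Company logo](images/logo.png "Our logo")
"""


def extract_links_and_images(md_text):
    """Return (links, images) as lists of dicts, one dict per element."""
    # Step 1: markdown -> HTML
    html = markdown(md_text)
    # Step 2: HTML -> parse tree, using the parser from the standard library
    soup = BeautifulSoup(html, "html.parser")
    # Step 3: collect the attributes of interest; .get() returns None
    # when an attribute (for example an optional title) is missing
    links = [
        {"text": a.get_text(), "href": a.get("href"), "title": a.get("title")}
        for a in soup.find_all("a")
    ]
    images = [
        {"alt": img.get("alt"), "src": img.get("src"), "title": img.get("title")}
        for img in soup.find_all("img")
    ]
    return links, images


links, images = extract_links_and_images(SAMPLE_MD)
for link in links:
    print(link)
for image in images:
    print(image)
```

To run this against a file instead, read its contents into a string first and pass that to the helper.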

2
Sergio Correia

If you take advantage of two Python packages, pypandoc and panflute, you could do it quite pythonically in a few lines (sample code):

Given a text file example.md, and assuming you have Python 3.3+ and have already run pip install pypandoc panflute, place the sample code in the same folder and run it from the shell or from an IDE such as IDLE. Note that pypandoc is a wrapper around pandoc, so pandoc itself must also be installed.

import io
import pypandoc
import panflute

def action(elem, doc):
    if isinstance(elem, panflute.Image):
        doc.images.append(elem)
    elif isinstance(elem, panflute.Link):
        doc.links.append(elem)

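if __name__ == '__main__':
    # Ask pandoc (via pypandoc) for the document's AST as a JSON string
    data = pypandoc.convert_file('example.md', 'json')
    # panflute.load() expects a file-like object, hence the StringIO wrapper
    doc = panflute.load(io.StringIO(data))
    # Attach the result lists to the Doc so that action() can reach them
    doc.images = []
    doc.links = []
    doc = panflute.run_filter(action, doc=doc)

    print("\nList of image URLs:")
    for image in doc.images:
        print(image.url)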

The steps are:

  1. Use pypandoc to obtain a json string that contains the AST of the markdown document
  2. Load it into panflute to create a Doc object (panflute requires a stream so we use StringIO)
  3. Use the run_filter function to iterate over every element, and extract the Image and Link objects.
  4. Then you can print the urls, alt text, etc.
0
pds

Here's a code example of @Håken Lid's answer:

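import requests
from markdown import markdown
from bs4 import BeautifulSoup

url = 'https://raw.githubusercontent.com/StackExchange/Stacks/develop/README.md'
# Download the raw markdown and fail early on an HTTP error
response = requests.get(url)
response.raise_for_status()
# Render the markdown to HTML, then parse the HTML
html = markdown(response.text, output_format="html5")
soup = BeautifulSoup(html, "html.parser")
# Print the source URL of every image (find_all replaces the older findAll)
for img in soup.find_all('img'):
    print(img['src'])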

Output: