In HXT (Haskell), how can I put arrow results into a custom data structure?


I'm trying to find a way to parse HTML and get it into a custom data structure. For example, I have a really short "novel" that looks like this in HTML:

Test.html:

<html>
    <head>
        <title>A Test</title>
    </head>
    <body>
        <div>
            <a name="#linkH2CH002"/>
            <p>Contents of chapter 2, para 1</p>
            <p>Contents of chapter 2, para 2</p>
            <p>Contents of chapter 2, para 3</p>
            <p>Contents of chapter 2, para 4</p>
        </div>
        <div>
            <a name="#linkH2CH003"/>
            <p>Contents of chapter 3, para 1</p>
            <p>Contents of chapter 3, para 2</p>
            <p>Contents of chapter 3, para 3</p>
            <p>Contents of chapter 3, para 4</p>
        </div>
    </body>
</html>

And I want to make this into something like:

Novel [ Chapter [Para, Para] , Chapter [Para, Para] ]

In other words, a novel has one or more chapters, and each chapter has one or more paragraphs, and each paragraph is a string.

Here's what I have so far:

module Main where

import Text.XML.HXT.Core
import Text.HandsomeSoup


data Novel =  Novel { title :: String,
                      chaps :: [Chapter] }

data Chapter = Chapter [Para]

data Para = Para [String]

main :: IO ()
main = do
  contents <- readFile "src/test.html"
  let doc = parseHtml contents
  -- Get all divs that have the child <a name="">
  let chapsRaw = doc >>> css "div" >>> (ifA (css "a" >>> hasAttr "name")(this)(none))
  chaps <- runX chapsRaw
  names <- runX $ chapsRaw >>> css "a" ! "name"
  print $ names
  print chaps
  -- Now to make Chapter [Para] for each chapter. 
  -- Something like this?
  -- Chapter $ [(runX chapsRaw >>> css "p" >>> Para)]

So far so good, but I'm stuck on getting this data into my custom data structure.

I kind of understand that an arrow will run on everything all at once, but I'm not sure how to cram all this data into my custom data structure, since arrows are still a little mysterious to me.

Answer by Isaac van Bakel:

Since you say that each paragraph is a string, I assume the declaration data Para = Para [String] in your code is a mistake, and that you instead want to declare

data Para = Para String

To get your data into your structure, you will probably want to use >>^, which post-composes a pure function onto an arrow. A data constructor is such a function, so arrow >>^ Constructor wraps every result of the arrow in that constructor.
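As a minimal sketch of what >>^ does, here is an example that needs only base. It uses Kleisli [] as a stand-in for an HXT list arrow (an assumption made for illustration: HXT arrows behave roughly like functions that return a list of results). The name splitWords is hypothetical.

```haskell
import Control.Arrow (Kleisli (..), (>>^))

newtype Para = Para String deriving (Eq, Show)

-- Hypothetical stand-in for an HXT arrow: one input, several results.
splitWords :: Kleisli [] String String
splitWords = Kleisli words

-- (>>^) post-composes a pure function, here the Para constructor,
-- so every result of splitWords is wrapped in Para.
wordParas :: Kleisli [] String Para
wordParas = splitWords >>^ Para

main :: IO ()
main = print (runKleisli wordParas "one two three")
-- prints [Para "one",Para "two",Para "three"]
```

The same pattern carries over to HXT: replace splitWords with an arrow that produces strings, such as one ending in getText.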

Collating paragraphs

In your sample, each paragraph node has exactly one text child, so it's possible to turn each <p>...</p> into a Para using

css "p" >>> getChildren >>> getText >>^ Para

which selects <p> nodes, pulls out their child text node, and turns it into a Para.
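The arrow above can be tried on an inline string, as in the sketch below. It assumes the hxt and HandsomeSoup packages from the question. The second arrow, parasDeep, is not part of the answer: it is a possible variant for paragraphs that contain nested markup such as <em>, where getChildren >>> getText would yield one Para per direct text child instead of one Para per paragraph. The helper name runOn is hypothetical.

```haskell
import Text.XML.HXT.Core
import Text.HandsomeSoup

newtype Para = Para String deriving (Eq, Show)

-- One Para per direct text child of each <p>.
paras :: IOSArrow XmlTree Para
paras = css "p" >>> getChildren >>> getText >>^ Para

-- Variant (an assumption about what is wanted): gather all text
-- nested anywhere inside a <p> and join it into a single Para.
parasDeep :: IOSArrow XmlTree Para
parasDeep = css "p" >>> listA (deep getText) >>> arr (Para . concat)

-- Hypothetical helper: parse an HTML string and run an arrow over it.
runOn :: IOSArrow XmlTree b -> String -> IO [b]
runOn arrow html = runX (parseHtml html >>> arrow)

main :: IO ()
main = do
  let html = "<div><p>plain</p><p>with <em>nested</em> markup</p></div>"
  runOn paras html >>= print
  runOn parasDeep html >>= print
```

For the HTML in the question, where every <p> holds plain text only, both arrows should give the same result.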

Collating chapters

Applying the above arrow directly to chapsRaw would yield one flat list of the paragraphs from all chapters, rather than a separate list of paragraphs for each chapter.

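There may be a more idiomatic way, but a simple way to do this per chapter is with listA, which runs an arrow on its input and gathers all of that arrow's results into a single list. Because it is applied once per chapter node, you get one list of paragraphs per chapter.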
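To make the difference concrete, the following sketch runs the paragraph arrow with and without listA on a small inline document. It assumes the same hxt and HandsomeSoup imports as the question; the names chapterDivs, para, flatParas and groupedParas are hypothetical, and every <div> is treated as a chapter to keep the example short.

```haskell
import Text.XML.HXT.Core
import Text.HandsomeSoup

newtype Para = Para String deriving (Eq, Show)

-- Simplification: every <div> counts as a chapter here.
chapterDivs :: IOSArrow XmlTree XmlTree
chapterDivs = css "div"

para :: IOSArrow XmlTree Para
para = css "p" >>> getChildren >>> getText >>^ Para

-- Without listA: one result per paragraph, chapter boundaries are lost.
flatParas :: String -> IO [Para]
flatParas html = runX (parseHtml html >>> chapterDivs >>> para)

-- With listA: one result (a list of paragraphs) per chapter.
groupedParas :: String -> IO [[Para]]
groupedParas html = runX (parseHtml html >>> chapterDivs >>> listA para)

main :: IO ()
main = do
  let html = "<body><div><p>a</p><p>b</p></div><div><p>c</p></div></body>"
  flatParas html >>= print
  groupedParas html >>= print
```

The first result should be a flat list of three paragraphs, the second a list of two lists, one per <div>.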

The whole thing

runX $ chapsRaw >>> listA (css "p" >>> getChildren >>> getText >>^ Para) >>^ Chapter
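Going one step further than the line above, here is a sketch of a complete program that also fills in the Novel record from the question. The parts beyond the answer are assumptions: the title is taken from the <title> element, &&& pairs it with the list of chapters, and the names getPara, getChapter, getTitle, getNovel, parseNovel and sample are hypothetical. The sample HTML is inlined so the program runs without src/test.html.

```haskell
module Main where

import Text.XML.HXT.Core
import Text.HandsomeSoup

data Novel = Novel { title :: String, chaps :: [Chapter] }
  deriving (Eq, Show)

newtype Chapter = Chapter [Para] deriving (Eq, Show)

newtype Para = Para String deriving (Eq, Show)

getPara :: IOSArrow XmlTree Para
getPara = css "p" >>> getChildren >>> getText >>^ Para

-- A chapter is a <div> containing an <a name="...">, as in the question.
getChapter :: IOSArrow XmlTree Chapter
getChapter =
  css "div"
    >>> ifA (css "a" >>> hasAttr "name") this none
    >>> listA getPara
    >>^ Chapter

getTitle :: IOSArrow XmlTree String
getTitle = css "title" >>> getChildren >>> getText

-- Pair the title with the list of all chapters, then apply Novel.
getNovel :: IOSArrow XmlTree Novel
getNovel = (getTitle &&& listA getChapter) >>^ uncurry Novel

parseNovel :: String -> IO [Novel]
parseNovel html = runX (parseHtml html >>> getNovel)

-- Shortened version of the question's Test.html.
sample :: String
sample =
  "<html><head><title>A Test</title></head><body>"
    ++ "<div><a name=\"c2\"></a><p>p1</p><p>p2</p></div>"
    ++ "<div><a name=\"c3\"></a><p>p3</p></div>"
    ++ "</body></html>"

main :: IO ()
main = parseNovel sample >>= print
```

runX returns a list because an arrow may produce any number of results. With exactly one <title> in the document, the list should contain a single Novel; a document without a <title> would give an empty list, since &&& needs a result from both sides.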