I'm trying to find a way to parse HTML and get it into a custom data structure. For example, I have a really short "novel" that looks like this in HTML:
Test.html:
<html>
<head>
<title>A Test</title>
</head>
<body>
<div>
<a name="#linkH2CH002"/>
<p>Contents of chapter 2, para 1</p>
<p>Contents of chapter 2, para 2</p>
<p>Contents of chapter 2, para 3</p>
<p>Contents of chapter 2, para 4</p>
</div>
<div>
<a name="#linkH2CH003"/>
<p>Contents of chapter 3, para 1</p>
<p>Contents of chapter 3, para 2</p>
<p>Contents of chapter 3, para 3</p>
<p>Contents of chapter 3, para 4</p>
</div>
</body>
</html>
And I want to make this into something like:
Novel [ Chapter [Para, Para] , Chapter [Para, Para] ]
In other words, a novel has one or more chapters, and each chapter has one or more paragraphs, and each paragraph is a string.
Here's what I have so far:
module Main where

import Text.XML.HXT.Core
import Text.HandsomeSoup

data Novel = Novel { title :: String,
                     chaps :: [Chapter] }
data Chapter = Chapter [Para]
data Para = Para [String]

main :: IO ()
main = do
  contents <- readFile "src/test.html"
  let doc = parseHtml contents
  -- Get all divs that have the child <a name="">
  let chapsRaw = doc >>> css "div" >>> (ifA (css "a" >>> hasAttr "name")(this)(none))
  chaps <- runX chapsRaw
  names <- runX $ chapsRaw >>> css "a" ! "name"
  print $ names
  print chaps
  -- Now to make Chapter [Para] for each chapter.
  -- Something like this?
  -- Chapter $ [(runX chapsRaw >>> css "p" >>> Para)]
So far so good, but I'm stuck on getting this data into my custom data structure.
I kind of understand that an arrow will run on everything all at once, but I'm not sure how to cram all this data into my custom data structure, since arrows are still a little mysterious to me.
Given that you say that each paragraph is a string, I assume you've made a mistake in your code and instead want to declare Para as wrapping a single String rather than a list of strings:
data Para = Para String
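To make the target concrete, here is a minimal sketch of the types with Para holding a single String, plus a hand-built value of the shape the parser should eventually produce. The deriving clauses and the `example` value are my own additions for illustration (the question's types derive nothing, so they cannot be printed as written), and `example` is abbreviated to one paragraph per chapter.

```haskell
-- Sketch of the target types. Para wraps a single String.
-- The deriving clauses are an addition so values can be printed and compared.
data Novel = Novel { title :: String
                   , chaps :: [Chapter] }
  deriving (Show, Eq)

data Chapter = Chapter [Para]
  deriving (Show, Eq)

data Para = Para String
  deriving (Show, Eq)

-- Hand-built value (hypothetical, abbreviated to one paragraph per chapter).
example :: Novel
example = Novel "A Test"
  [ Chapter [Para "Contents of chapter 2, para 1"]
  , Chapter [Para "Contents of chapter 3, para 1"]
  ]

main :: IO ()
main = print example
```

Everything that follows is about getting the arrows to build a value like `example` from the HTML instead of writing it by hand.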
In order to eventually get your data into your structure, you're probably going to want to use >>^, which lets you post-compose the constructor.

Collating paragraphs
For each chapter, a paragraph node will have one text child, so it's possible to turn each <p>...</p> into a Para using an arrow along these lines:
css "p" >>> getChildren >>> getText >>^ Para
which selects <p> nodes, pulls out their child text node, and turns it into a Para.

Collating chapters
Applying the above arrow directly to chapsRaw would collect the paragraphs of all chapters in one flat stream, rather than grouping them per chapter.
There may be a more idiomatic way, but a simple way to instead do this per chapter is with listA, which collects all the results of a list arrow into a single list.

The whole thing
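Putting the pieces together, a complete program might look like the sketch below. It assumes the hxt and HandsomeSoup packages, the Para String declaration, and the file path from the question. The names `chapterDivs`, `paragraphs`, `chapters` and `parseNovel` are mine, and the deriving Show clauses are added only so the result can be printed. Treat it as one way to do it rather than the definitive one.

```haskell
module Main where

import Text.XML.HXT.Core
import Text.HandsomeSoup

data Novel = Novel { title :: String
                   , chaps :: [Chapter] }
  deriving Show
data Chapter = Chapter [Para] deriving Show
data Para = Para String deriving Show

-- All <div>s that contain an <a name="..."> (same filter as in the question).
chapterDivs :: ArrowXml a => a XmlTree XmlTree
chapterDivs = css "div" >>> ifA (css "a" >>> hasAttr "name") this none

-- One Para per <p>: select the node, take its text child, wrap it.
paragraphs :: ArrowXml a => a XmlTree Para
paragraphs = css "p" >>> getChildren >>> getText >>^ Para

-- listA gathers the paragraphs of a single div into one list,
-- so each chapter div yields exactly one Chapter.
chapters :: ArrowXml a => a XmlTree Chapter
chapters = chapterDivs >>> listA paragraphs >>^ Chapter

-- Hypothetical helper: parse an HTML string into a Novel.
parseNovel :: String -> IO Novel
parseNovel html = do
  let doc = parseHtml html
  titles <- runX $ doc >>> css "title" >>> getChildren >>> getText
  chs    <- runX $ doc >>> chapters
  return $ Novel (concat titles) chs

main :: IO ()
main = do
  contents <- readFile "src/test.html"
  novel <- parseNovel contents
  print novel
```

The title is fetched with a second runX to keep the arrow code simple. It could also be combined with the chapters in a single arrow using &&&, at the cost of slightly denser code.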