Python Recipe: Adding Paragraph-Level Permalinks

Posted Thursday, February 11, 2010 at 9:06 p.m. by Chris Amico in Projects about Django, JavaScript, JQuery, Python and recipes

I mentioned in my last post how useful Ben Welsh's code recipe's are. Count this post as my effort to encourage the practice among coding journalists.

Since launching the NewsHour's Annotated State of the Union, I've gotten a few questions about how it worked, particularly about linking comments to paragraphs. What's needed is paragraph-level permalinks. As it turns out, that's pretty easy to do.

The first thing you'll need is a block of clean HTML. Then, you'll need something that can parse and modify that HTML. Fortunately, tools abound.

Doing it server side

BeautifulSoup remains the easiest-to-use Python DOM parser. You could also use ElementTree, but BeautifulSoup is more forgiving.

Start by importing the library and loading your HTML:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)

The soup instance gives you a bunch of methods to traverse a block of HTML. In this case, we want to find all the paragraphs, then loop through them and alter each one, added an id attribute, then adding an anchor link as a child element.

paras = soup.findAll('p', recursive=True)

The findAll method does exactly what it sounds like: It finds every <p> tag. Make sure to add recursive=True as an argument. That keeps BeautifulSoup from going more than one level deep in the DOM, which means it will ignore anything in blockquotes or lists. If you want permalinks for those, drop the recursive argument, or make it True.

Now, let's do something with these paragraphs:

for i, p in enumerate(paras):
    p['id'] = "p%s" % i

We're enumerating each paragraph here, creating a second variable, i, for each iteration of the loop. Use that to set the id attribute on each <p>, which just follows Python's dictionary style.

Now we just need to add a link back to the same paragraph. That means creating a new tag, then inserting it as a child of the p tag.

from BeautifulSoup import BeautifulSoup, Tag, NavigableString

soup = BeautifulSoup(html)
paras = soup.findAll('p', recursive=False)

for i, p in enumerate(paras):
    p['id'] = 'p%s' % i

    # here's where we create a new link
    a = Tag(soup, 'a')
    a['href'] = '#p%s' % i
    a['title'] = 'Link to this paragraph'
    a.insert(0, NavigableString(' #')) # link text
    p.append(a) # just add our new a tag to the end of p's contents

Create a tag by instantiating a Tag object, passing in our soup instance and the tag we're creating, <a> in this case. Then, like we did before, set attributes like dictionary keys. Link text (or any text) is a NavigableString instance, which we insert into our a tag. Once that's done, we have a complete link. Just append that to our p tag, and it will drop in at the end. The important thing to remember here is that BeautifulSoup treats HTML element attributes like dictionaries and contents like lists.

Finally, let's wrap this in a function so it's reusable. And since I'm a Django guy, I'll make it a template filter, so I can drop it into my blog entry template. Here's the complete code:

from django import template
from BeautifulSoup import BeautifulSoup, Tag, NavigableString

register = template.Library()

@register.filter
def enumerate_paras(html):
    """
    Given a block of HTML, create id attributes on each
    paragraph (p) and write in a link at the end of each.

    >>> html = '''<p>First line.</p>
    ... <p>Second line.</p>'''
    >>> print enumerate_paras(html)
    <p id="p0">First line.<a href="#p0" title="Link to this paragraph"> #</a></p>
    <p id="p1">Second line.<a href="#p1" title="Link to this paragraph"> #</a></p>
    """

    soup = BeautifulSoup(html)
    paras = soup.findAll('p', recursive=False)
    for i, p in enumerate(paras):
        p['id'] = 'p%s' % i
        a = Tag(soup, 'a')
        a['href'] = '#p%s' % i
        a['title'] = 'Link to this paragraph'
        a.insert(0, NavigableString(' #')) # link text
        p.append(a) # just add our new a tag to the end of p's contents

    return soup.renderContents()

That's it. A template filter is just a Python function. We're returning soup.renderContents() in this case, which outputs a string of HTML. You could also use soup.prettify() or str(soup) or just soup (in which case, Django's template system will turn it into unicode). Just make sure you return something.

And in case you didn't want to do this server side, or (gasp!) you're using something other than Django, I'll cover how to do the same thing with JavaScript in my next post. It's actually shorter.



Comments:

Comments are closed for this post. If you still have something to say, please email me.