Wednesday, March 9, 2011

Hacking configurable indent width into BeautifulSoup

Many a time I have been saved by the HTML parsing prowess of BeautifulSoup. When it comes to beautifying HTML dumps for easier analysis, it is king. However, I have always been bothered by its unorthodox choice of indentation width (just one space per level), so the other day I took the time to hack configurable indent width into the code for version 3.2.0.

The complete modified file is here; as long as you're using version 3.2.0, you can just paste it over the original file in your installation. For details on the changes, see the diff below:
***************
*** 544 ****
--- 545,547 ----
+         # Adjustable indentation patch
+         self.indentWidth = parser.indentWidth
+ 
***************
*** 752 ****
!             space = (' ' * (indentTag-1))
--- 755 ----
!             space = (self.indentWidth * (indentTag-1))
***************
*** 813 ****
!                     s.append(" " * (indentLevel-1))
--- 816 ----
!                     s.append(self.indentWidth * (indentLevel-1))
***************
*** 1080,1082 ****
!     def __init__(self, markup="", parseOnlyThese=None, fromEncoding=None,
!                  markupMassage=True, smartQuotesTo=XML_ENTITIES,
!                  convertEntities=None, selfClosingTags=None, isHTML=False):
--- 1083,1088 ----
!     def __init__(
!         self, markup="", parseOnlyThese=None, fromEncoding=None,
!         markupMassage=True, smartQuotesTo=XML_ENTITIES,
!         convertEntities=None, selfClosingTags=None, isHTML=False,
!         indentWidth=' '
!     ):
***************
*** 1111 ****
--- 1118,1121 ----
+ 
+         # Adjustable indentation patch
+         self.indentWidth = indentWidth
+ 
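Now setting a custom indent width is as simple as passing the indent string itself (the whitespace to repeat per level, not a number of spaces) as indentWidth: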
soup = BeautifulSoup(html_string.encode('utf-8'), indentWidth='    ')
beautiful_html_string = soup.prettify()
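
To make the arithmetic in the patch easier to follow, here is a minimal standalone sketch of the same idea: the indent is a string that gets repeated (level - 1) times. This is a toy, not BeautifulSoup code; the names render, tree and indent_unit are made up for illustration, and the nesting levels are assumed to start at 1.

```python
# Toy illustration only: this is NOT BeautifulSoup code.
# render, tree and indent_unit are hypothetical names.

def render(node, indent_unit=" ", level=1):
    """Render a (name, children) tuple as a list of indented tag lines."""
    name, children = node
    # Same arithmetic as the patch: indent string * (level - 1).
    # At level 1 the string is repeated 0 times, giving an empty prefix.
    prefix = indent_unit * (level - 1)
    lines = [prefix + "<" + name + ">"]
    for child in children:
        lines.extend(render(child, indent_unit, level + 1))
    lines.append(prefix + "</" + name + ">")
    return lines


tree = ("html", [("body", [("p", [])])])

# Four spaces per level instead of the default single space.
print("\n".join(render(tree, indent_unit="    ")))
```

Because the value is a string rather than a count, anything repeatable works as the unit, for example a tab ('\t') instead of spaces.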