The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests


Parsing simple html in Python

Posted by jpluimers on 2018/11/29

I was working to get fritzcap to emit a list of interfaces, so I could specify which one to capture.

For that I needed to parse the output of http://fritz.box/capture.lua which consists of HTML fragments like below.

What I needed, for each th tag and the first button tag that follows it, was:

  • content of the th tag
  • content of the value attribute of the button tag that has both a type="submit" attribute and a name="start" attribute

So before starting to work on it, I created issue #6 in jpluimers/fritzcap: "In order to fix #5, print a list of available interfaces to potentially capture from".

The goal was to get a series of key/value pairs:

4-138 = AP2 (2.4 + 5 GHz, ath1) - Interface 1
4-137 = AP2 (2.4 + 5 GHz, ath1) - Interface 0
4-132 = AP (2.4 GHz, ath0) - Interface 1
4-131 = AP (2.4 GHz, ath0) - Interface 0
4-129 = HW (2.4 GHz, wifi0) - Interface 0
4-128 = WLAN Management Traffic - Interface 0a

So I built a class descending from HTMLParser, the simple HTML and XHTML parser that ships with the Python standard library.

If I need more complex HTML parsing in the future, these links will help me choose a more feature-rich parser:

Back to the HTMLParser descendant in interfaces_dumper.py which can basically be condensed down to the code below.

  • handle_data is called for any text, both text that follows a start tag and text that follows an end tag. The th content arrives right after the th start tag; text that arrives after the th end tag (usually just whitespace) must not overwrite it, so you need to keep track of both last_start_tag and last_end_tag.
  • handle_endtag maintains last_end_tag to help handle_data.
  • handle_starttag maintains last_start_tag to help handle_data and also handles the button behaviour.
    • The button is only relevant if it has type="submit", name="start" and a value attribute, in that order.
    • Output is in data, which is a list of (key, value) pairs.

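try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

class CaptureLuaHtmlParser(HTMLParser):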

    def __init__(self):
        HTMLParser.__init__(self)
        self.last_end_tag = ""
        self.last_start_tag = ""
        self.last_th_data = ""
        self.data = [] # output list of (key,value) pairs
        self.recording = 0

    def handle_starttag(self, tag, attrs):
        self.last_start_tag = tag
        if tag != 'button':
            return
        button_attribute_type_submit = False
        button_attribute_name_start = False
        for name, value in attrs:
            if (name == 'type') and (value == 'submit'):
                button_attribute_type_submit = True
            if button_attribute_type_submit and (name == 'name') and (value == 'start'):
                button_attribute_name_start = True
            if button_attribute_type_submit and button_attribute_name_start and (name == 'value'):
                self.data.append((value, self.last_th_data)) # (button value, th content)

    def handle_endtag(self, tag):
        self.last_end_tag = tag

    def handle_data(self, data):
        if (self.last_start_tag == 'th') and (self.last_end_tag != 'th'):
            self.last_th_data = data
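
To see why both the last start tag and the last end tag need tracking, here is a minimal sketch that records every callback in order. TracingParser and the HTML fragment are made up for illustration (the real capture.lua markup may differ), and the import assumes Python 3.

```python
from html.parser import HTMLParser  # Python 3 module name


class TracingParser(HTMLParser):
    """Hypothetical helper: records every parser callback in order."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


tracer = TracingParser()
# Made-up fragment; the real capture.lua markup may differ.
tracer.feed("<tr><th>Example interface</th>\n<td>x</td></tr>")
tracer.close()
for event in tracer.events:
    print(event)
```

The line break after the closing th tag shows up as a separate data event while the most recent start tag is still th. Without the last_end_tag check, that whitespace would replace the th content.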

With all that, it’s pretty simple to get the data out and display the key value pairs using a loop:

        parser = CaptureLuaHtmlParser()
        parser.feed(html_content)
        self.logger.info("Fritz!Box interfaces from %s: key = value" % url)
        keyValuePairs = sorted(parser.data)
        for (key, value) in keyValuePairs:
            self.logger.info("  %-*s= %s" % (20,key, value))
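
The snippet above depends on self.logger, url and html_content from the surrounding fritzcap code. As a stand-alone sketch, the version below repeats a condensed copy of the class, feeds it a made-up HTML fragment (an assumption, the real capture.lua output may look different) and prints instead of logging. It assumes Python 3.

```python
from html.parser import HTMLParser  # Python 3 module name


class CaptureLuaHtmlParser(HTMLParser):
    """Condensed copy of the class above, so this sketch runs on its own."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.last_start_tag = ""
        self.last_end_tag = ""
        self.last_th_data = ""
        self.data = []  # output list of (key, value) pairs

    def handle_starttag(self, tag, attrs):
        self.last_start_tag = tag
        if tag != "button":
            return
        is_submit = False
        is_start = False
        for name, value in attrs:
            if name == "type" and value == "submit":
                is_submit = True
            if is_submit and name == "name" and value == "start":
                is_start = True
            if is_submit and is_start and name == "value":
                self.data.append((value, self.last_th_data))

    def handle_endtag(self, tag):
        self.last_end_tag = tag

    def handle_data(self, data):
        # Only text directly inside a th counts, not text after its end tag.
        if self.last_start_tag == "th" and self.last_end_tag != "th":
            self.last_th_data = data


# Made-up fragment for illustration; the real capture.lua output may differ.
html_content = """<table>
<tr><th>AP (2.4 GHz, ath0) - Interface 0</th>
<td><button type="submit" name="start" value="4-131">Start</button></td></tr>
<tr><th>AP (2.4 GHz, ath0) - Interface 1</th>
<td><button type="submit" name="start" value="4-132">Start</button>
<button type="submit" name="stop" value="4-132">Stop</button></td></tr>
</table>"""

parser = CaptureLuaHtmlParser()
parser.feed(html_content)
parser.close()
for key, value in sorted(parser.data):
    print("  %-*s= %s" % (20, key, value))
```

The button with name="stop" is skipped, so each th ends up paired with exactly one key.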

First it sorts the output. No key argument is needed: sorted compares the tuples element by element, so the pairs end up ordered by their first element. See sorted in the Python documentation, 2. Built-in Functions.
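
A small sketch of that sorting behaviour, using a few of the pairs from the goal list above. The second part is my own side note: the keys are strings, so the comparison is textual rather than numeric.

```python
pairs = [
    ("4-132", "AP (2.4 GHz, ath0) - Interface 1"),
    ("4-128", "WLAN Management Traffic - Interface 0a"),
    ("4-131", "AP (2.4 GHz, ath0) - Interface 0"),
]

# Tuples compare element by element, so the first element acts as the sort key.
ordered = sorted(pairs)
for key, value in ordered:
    print(key, "=", value)

# The keys are strings, so they are compared character by character:
# "4-138" sorts before "4-9" because "1" comes before "9".
text_order = sorted(["4-9", "4-138"])
print(text_order)
```

If numeric ordering ever matters, a key function that splits on the dash and converts both parts to int would be needed; for keys of equal length, as in the list above, plain sorted is enough.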

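This uses the built-in percent (%) string formatting operator, where %-*s left-aligns the key in a field whose width (here 20) is passed as an extra argument. The code is at: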

https://github.com/jpluimers/fritzcap/blob/feature/Show_available_fritzbox_interfaces/core/interfaces_dumper.py#L51
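
As a quick illustration of the format string used in the loop: the * takes the field width from the argument tuple, and the - left-aligns the key within that width. The str.format line is just an equivalent spelling for comparison, not something the fritzcap code uses.

```python
key, value = "4-131", "AP (2.4 GHz, ath0) - Interface 0"

# %-*s: "-" left-aligns, "*" reads the field width (20) from the tuple.
line = "  %-*s= %s" % (20, key, value)
print(repr(line))

# Equivalent with str.format, using a nested field for the width.
same_line = "  {:<{width}}= {}".format(key, value, width=20)
print(line == same_line)
```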

 

–jeroen
