Parsing HTML/XHTML


Parsing HTML/XHTML works almost exactly like parsing XML, but the underlying parser is different: XML parsing is built on xml.parsers.expat, while HTML/XHTML parsing is built on html.parser.

One more difference is that HTML/XHTML has no CDATA. Apart from that, all other usage is the same as for XML, so the XML documentation applies.
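
For context on the two standard-library parsers named above, the following minimal sketch uses them directly, without Universal Parser. It only illustrates that both are event-driven and that html.parser accepts markup that expat rejects as not well-formed. The names `TagCollector`, `html_tags`, and `xml_tags` are hypothetical helpers made up for this illustration; they are not part of Universal Parser, and the sketch makes no claim about how the library wires these parsers internally.

```python
from html.parser import HTMLParser
import xml.parsers.expat


class TagCollector(HTMLParser):
    """Records the start-tag names reported by the standard-library HTML parser."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_tags(markup):
    """Return start-tag names in document order, using html.parser."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def xml_tags(markup):
    """Return start-tag names in document order, using xml.parsers.expat."""
    tags = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: tags.append(name)
    parser.Parse(markup, True)  # True marks this as the final chunk
    return tags


snippet = "<table><tr><td>1</td></tr></table>"
print(html_tags(snippet))  # ['table', 'tr', 'td']
print(xml_tags(snippet))   # ['table', 'tr', 'td']

# An unclosed <br> is ordinary HTML but is not well-formed XML.
print(html_tags("<p>a<br>b</p>"))  # ['p', 'br']
try:
    xml_tags("<p>a<br>b</p>")
except xml.parsers.expat.ExpatError as exc:
    print("expat rejected the markup:", exc)
```

On well-formed input both parsers report the same tags; the difference shows up only with markup that is valid HTML but not valid XML.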

HTML/XHTML support has not been rigorously tested; the current focus is on Word, Excel, and PPT (that is, XML). If you need support in this area, please contact me: jiyangj@foxmail.com

HTML data used on this page

HTML
<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="UTF-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>Document</title>
    </head>
    <body>
        <table>
            <thead>
                <tr>
                    <th>序号</th>
                    <th>名称</th>
                    <th>简称</th>
                </tr>
            </thead>
            <tbody>
                <tr>
                    <td>1</td>
                    <td>大大大大</td>
                    <td>超大</td>
                </tr>
                <tr>
                    <td>2</td>
                    <td>小小小小</td>
                    <td>超小</td>
                </tr>
                <tr>
                    <td>3</td>
                    <td>666666</td>
                    <td>超6</td>
                </tr>
            </tbody>
        </table>
    </body>
</html>

Example 1: Getting table elements

import UniversalParser as UP

def get_tables(manager):
    """Return one (heads, bodys) tuple per table, each a list of rows of cell text."""
    datas = []
    tables = manager | 'table'
    _ancestor = manager.find_nodes_with_ancestor
    for table in tables:
        thead_trs = _ancestor(table.thead, tag_='tr')
        tbody_trs = _ancestor(table.tbody, tag_='tr')
        heads = []
        bodys = []
        for thead_tr in thead_trs:
            ths = _ancestor(thead_tr, tag_='th')
            heads.append([th & UP.SM.text for th in ths])
        for tbody_tr in tbody_trs:
            tds = _ancestor(tbody_tr, tag_='td')
            bodys.append([td & UP.SM.text for td in tds])
        datas.append((heads, bodys))
    return datas

if __name__ == '__main__':
    # html_data is assumed to hold the HTML data shown earlier on this page
    manager = UP.parse_html_or_xhtml(html_data, analysis_text=False)
    print(get_tables(manager))
    # [([['序号', '名称', '简称']], [['1', '大大大大', '超大'], ['2', '小小小小', '超小'], ['3', '666666', '超6']])]
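
For comparison, here is a rough sketch of the same kind of extraction written against the standard-library html.parser alone. It is not how Universal Parser implements the example above; `TableExtractor` and `extract_tables` are hypothetical names, and the sketch assumes tables are not nested and that rows outside a thead belong to the body.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects a (head_rows, body_rows) tuple for every <table> in a document."""

    def __init__(self):
        super().__init__()
        self.tables = []      # one (heads, bodys) tuple per table
        self._section = None  # 'thead' or 'tbody' while inside one
        self._row = None      # cells of the row currently being read
        self._cell = None     # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.tables.append(([], []))
        elif tag in ('thead', 'tbody'):
            self._section = tag
        elif tag == 'tr':
            self._row = []
        elif tag in ('th', 'td'):
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a th/td cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('th', 'td') and self._cell is not None and self._row is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None and self.tables:
            heads, bodys = self.tables[-1]
            (heads if self._section == 'thead' else bodys).append(self._row)
            self._row = None
        elif tag in ('thead', 'tbody'):
            self._section = None


def extract_tables(markup):
    extractor = TableExtractor()
    extractor.feed(markup)
    extractor.close()
    return extractor.tables


sample = (
    "<table><thead><tr><th>No.</th><th>Name</th></tr></thead>"
    "<tbody><tr><td>1</td><td>big</td></tr></tbody></table>"
)
print(extract_tables(sample))
# [([['No.', 'Name']], [['1', 'big']])]
```

The result has the same shape as the output of get_tables: a list with one (heads, bodys) tuple per table.
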
If you have a requirement and are not sure how to implement it with Universal Parser, please contact me.