
Help with BS4

    wudaown · 2017-05-16 23:03:29 +08:00 · 2035 views
    <body><html> <table border="1" width="100%" cellspacing="0" cellpadding="1"> <tr bgcolor="#3366FF"> <td align="left" width="10%" valign="top"><font color="#FFFFFF"> Date </font></td> <td align="left" width="10%" valign="top"><font color="#FFFFFF"> Day </font></td> <td align="left" width="10%" valign="top"><font color="#FFFFFF"> Time </font></td> <td align="left" width="10%" valign="top"><font color="#FFFFFF"> Course </font></td> <td align="left" width="40%" valign="top"><font color="#FFFFFF"> Course Title </font></td> <td align="left" width="10%" valign="top"><font color="#FFFFFF"> Duration </font></td> <tr align="yes" valign="yes" bgcolor="#99CCFF"> <td align="left" width="10%" valign="top"> 24 November 2017 </td> <td align="left" width="10%" valign="top"> Friday </td> <td align="left" width="10%" valign="top"> 9.00 am </td> <td align="left" width="10%" valign="top"> AC1101 </td> <td align="left" width="40%" valign="top"> ACCOUNTING I </td> <td align="left" width="10%" valign="top"> 2.5 </td> </tr> <tr align="yes" valign="yes" bgcolor="#FFFFFF"> <td align="left" width="10%" valign="top"> 24 November 2017 </td> <td align="left" width="10%" valign="top"> Friday </td> <td align="left" width="10%" valign="top"> 9.00 am </td> <td align="left" width="10%" valign="top"> AD1101 </td> <td align="left" width="40%" valign="top"> FINANCIAL ACCOUNTING </td> <td align="left" width="10%" valign="top"> 2.5 </td> </tr> <tr align="yes" valign="yes" bgcolor="#99CCFF"> <td align="left" width="10%" valign="top"> 24 November 2017 </td> <td align="left" width="10%" valign="top"> Friday </td> <td align="left" width="10%" valign="top"> 9.00 am </td> <td align="left" width="10%" valign="top"> BA3201 </td> <td align="left" width="40%" valign="top"> LIFE CONTINGENCIES AND DEMOGRAPHY </td> <td align="left" width="10%" valign="top"> 3 </td> <tr align="yes" valign="yes" bgcolor="#FFFFFF"> </table> </body></html>

    Given an HTML file like the one above, I want to export it to JSON in this format:

    {"AC1101":{"date":"21 April 2017","day":"Friday","time":"9.00 am","code":"AC1101","name":"ACCOUNTING I","duration":"2.5"},"AD1101":{"date":"21 April 2017","day":"Friday","time":"9.00 am","code":"AD1101","name":"FINANCIAL ACCOUNTING","duration":"2.5"},"BA2201":{"date":"21 April 2017","day":"Friday","time":"9.00 am","code":"BA2201","name":"ACTUARIAL ECONOMICS","duration":"2.5"}}

    https://gist.github.com/wudaown/c4f46daa4bd6edc42b8d870fd77c7322

    How can I do this export with bs4? I would rather not use regular expressions.

    Thanks.

    Addendum 1 · 2017-05-17 00:06:32 +08:00
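    #!/usr/bin/python3
    # -*- coding: utf-8 -*-

    from bs4 import BeautifulSoup

    # Parse the saved page; naming the parser avoids bs4's "no parser specified" warning.
    with open('tmp.html') as f:
        soup = BeautifulSoup(f, 'html.parser')

    # Flatten every <td> into one list of stripped strings (header cells included).
    data = []
    for i in soup.find_all('td'):
        data.append(i.text.strip())

    # Each table row contributes six cells.
    r = len(data) // 6

    d = dict()

    # Key each row by its course code (the fourth cell); start at 1 to skip the header row.
    for i in range(1, r):
        d.update({data[3+i*6]: {'date': data[0+i*6], 'day': data[1+i*6], 'time': data[2+i*6], 'code': data[3+i*6], 'name': data[4+i*6], 'duration': data[5+i*6]}})

    for k, v in d.items():
        print(k, v)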
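The flat-list approach above depends on every row having exactly six cells. A row-by-row variant is sketched here, assuming the `html.parser` backend; the names `KEYS` and `table_to_dict` and the trimmed `HTML` sample are illustrative and not from the thread. It walks each `<tr>`, skips the header and any incomplete row, and serializes the result with `json.dumps`.

```python
import json

from bs4 import BeautifulSoup

# Trimmed sample in the same shape as the table in the question,
# including the unclosed header <tr> and the empty trailing <tr>.
HTML = """
<table>
<tr><td><font> Date </font></td><td><font> Day </font></td><td><font> Time </font></td>
<td><font> Course </font></td><td><font> Course Title </font></td><td><font> Duration </font></td>
<tr><td> 24 November 2017 </td><td> Friday </td><td> 9.00 am </td>
<td> AC1101 </td><td> ACCOUNTING I </td><td> 2.5 </td></tr>
<tr><td> 24 November 2017 </td><td> Friday </td><td> 9.00 am </td>
<td> BA3201 </td><td> LIFE CONTINGENCIES AND DEMOGRAPHY </td><td> 3 </td>
<tr></tr>
</table>
"""

# Output field names, in column order (taken from the desired JSON in the question).
KEYS = ("date", "day", "time", "code", "name", "duration")


def table_to_dict(html):
    """Return {course code: row record} for every complete data row."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    # The first <tr> is the header row, so it is skipped.
    for row in soup.find_all("tr")[1:]:
        # recursive=False keeps only this row's own cells, even if the parser
        # nested a later <tr> inside an unclosed one.
        cells = [td.get_text(strip=True) for td in row.find_all("td", recursive=False)]
        if len(cells) != len(KEYS):
            continue  # empty or malformed row
        record = dict(zip(KEYS, cells))
        result[record["code"]] = record
    return result


print(json.dumps(table_to_dict(HTML), indent=2))
```

Working per row means a missing cell drops only that row instead of shifting every later index, which is the main weakness of slicing one long list by sixes.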
    4 replies · 2017-05-17 08:17:03 +08:00
    15015613 · #1 · 2017-05-16 23:55:52 +08:00
    In [1]: from lxml import etree
    In [2]: with open('tmp.html','r') as f:
       ...:     tree=etree.HTML(f.read())
    In [10]: tmp=tree.xpath('//tr')
    In [29]: import json
    In [37]: out=list()
        ...: for tmp1 in tmp[1:]:
        ...:     i=0
        ...:     dict_d={1:'Date',2:'Day',3:'Time',4:'Course',5:' Course Title',6:'Duration'}
        ...:     t1=dict()
        ...:     for t in tmp1:
        ...:         i=i+1
        ...:         t2=t.xpath('text()')[0]
        ...:         t1[dict_d[i]]=t2
        ...:     out.append(t1)
    In [45]: out2=dict()
        ...: for o in out:
        ...:     try:
        ...:         out2[o['Course']]={'Course Title':o[' Course Title'],'Date':o['Date'],'Day':o['Day'],'Duration':o['Duration'],'Time':o['Time']}
        ...:     except:
        ...:         pass
    In [46]: out2
    Out[46]:
    {' AC1101 ': {'Course Title': ' ACCOUNTING I ',
    'Date': ' 24 November 2017 ',
    'Day': ' Friday ',
    'Duration': ' 2.5 ',
    'Time': ' 9.00 am '},
    ' AD1101 ': {'Course Title': ' FINANCIAL ACCOUNTING ',
    'Date': ' 24 November 2017 ',
    'Day': ' Friday ',
    'Duration': ' 2.5 ',
    'Time': ' 9.00 am '},
    ' BA3201 ': {'Course Title': ' LIFE CONTINGENCIES AND DEMOGRAPHY ',
    'Date': ' 24 November 2017 ',
    'Day': ' Friday ',
    'Duration': ' 3 ',
    'Time': ' 9.00 am '}}
    15015613 · #2 · 2017-05-16 23:59:35 +08:00
    from lxml import etree
    with open('tmp.html','r') as f:
        tree=etree.HTML(f.read())
    tmp=tree.xpath('//tr')
    import json
    out=list()
    for tmp1 in tmp[1:]:
        i=0
        dict_d={1:'Date',2:'Day',3:'Time',4:'Course',5:' Course Title',6:'Duration'}
        t1=dict()
        for t in tmp1:
            i=i+1
            t2=t.xpath('text()')[0]
            t1[dict_d[i]]=t2
        out.append(t1)
    out2=dict()
    for o in out:
        try:
            out2[o['Course']]={'Course Title':o[' Course Title'],'Date':o['Date'],'Day':o['Day'],'Duration':o['Duration'],'Time':o['Time']}
        except:
            pass
    print(out2)
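The lxml version above keeps the surrounding spaces in every key and value and relies on a bare `except` to drop the empty last row. A possible tidier variant is sketched here, assuming the same table shape; the names `KEYS` and `rows_to_dict` and the trimmed `HTML` sample are illustrative, and the field names follow the JSON requested in the question rather than the column headers used in the reply.

```python
from lxml import etree

# Trimmed sample in the same shape as the table in the question.
HTML = """
<table>
<tr><td><font> Date </font></td><td><font> Day </font></td><td><font> Time </font></td>
<td><font> Course </font></td><td><font> Course Title </font></td><td><font> Duration </font></td>
<tr><td> 24 November 2017 </td><td> Friday </td><td> 9.00 am </td>
<td> AD1101 </td><td> FINANCIAL ACCOUNTING </td><td> 2.5 </td></tr>
<tr><td> 24 November 2017 </td><td> Friday </td><td> 9.00 am </td>
<td> BA3201 </td><td> LIFE CONTINGENCIES AND DEMOGRAPHY </td><td> 3 </td>
<tr></tr>
</table>
"""

# Output field names, in column order.
KEYS = ("date", "day", "time", "code", "name", "duration")


def rows_to_dict(html):
    """Return {course code: row record}, with whitespace trimmed by XPath."""
    tree = etree.HTML(html)
    result = {}
    # Skip the header row, as the reply does with tmp[1:].
    for tr in tree.xpath("//tr")[1:]:
        # normalize-space(.) trims and collapses whitespace in each cell's text.
        cells = [str(td.xpath("normalize-space(.)")) for td in tr.xpath("./td")]
        if len(cells) != len(KEYS):
            continue  # the empty trailing <tr> has no cells
        record = dict(zip(KEYS, cells))
        result[record["code"]] = record
    return result


print(rows_to_dict(HTML))
```

Checking the cell count explicitly replaces the `try`/`except`, so an unexpected error is no longer silently swallowed.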
    wudaown (OP) · #3 · 2017-05-17 00:05:16 +08:00
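    @15015613 Thank you very much for the answer. These are all things I have not seen before, so I will need some time to digest them. While waiting I already got it working with dict, list and bs4; the code just looks rather beginner-level.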
    justtery · #4 · 2017-05-17 08:17:03 +08:00 via Android
    Why not use pyquery? (smirk)