1) Crawler mindset: behave like a normal visitor
Example: connect directly, without adding any headers
# Fetch the HTML source of the movie board
import ssl
import urllib.request as request
# Disable certificate verification (private helper; for local testing only)
context = ssl._create_unverified_context()
src = 'https://www.ptt.cc/bbs/movie/index.html'
with request.urlopen(src, context=context) as response:
    data = response.read().decode("utf-8")
print(data)
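Error message:
urllib.error.HTTPError: HTTP Error 403: Forbidden
The server rejects the request outright. Press F12 to open the browser's developer tools and observe what a normal visit to the server sends.
A browser sends a long list of headers. The most important one is User-Agent, which identifies the OS and browser in use.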
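To see why the bare request stands out, it helps to look at the User-Agent that urllib sends when none is supplied. The short sketch below reads it from a default opener without touching the network; the variable names are illustrative and not part of the original post.

```python
import urllib.request as request

# A default opener carries the headers urllib adds to every request.
opener = request.build_opener()
default_headers = dict(opener.addheaders)

# The key is stored as "User-agent" by urllib.
default_ua = default_headers["User-agent"]
print(default_ua)  # a string of the form Python-urllib/<version>
```

A value like this identifies a script rather than a browser. That is a plausible reason for the 403 above, although the server's actual filtering rules are not visible from the client side.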
2) Improved version (add a header to the request)
# Fetch the HTML source of the movie board
import ssl
import urllib.request as request
# Disable certificate verification (private helper; for local testing only)
context = ssl._create_unverified_context()
src = 'https://www.ptt.cc/bbs/movie/index.html'
# Build a Request object and attach the header
req = request.Request(src, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36"
})
with request.urlopen(req, context=context) as response:
    data = response.read().decode("utf-8")
print(data)
Returned output:
PS C:\Users\85380\Desktop\LearnPy> python .\test2.py
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>看板 movie 文章列表 - 批踢踢實業坊</title>
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.26/bbs-common.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.26/bbs-base.css"
media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.26/bbs-custom.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.26/pushstream.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.26/bbs-print.css" media="print">
</head>
<body>
<div id="topbar-container">
<div id="topbar" class="bbs-content">
<a id="logo" href="/bbs/">批踢踢實業坊</a>
<span>›</span>
<a class="board" href="/bbs/movie/index.html"><span class="board-label">看板 </span>movie</a>
<a class="right small" href="/about.html">關於我們</a>
<a class="right small" href="/contact.html">聯絡資訊</a>
</div>
</div>
<div id="main-container">
<div id="action-bar-container">
<div class="action-bar">
<div class="btn-group btn-group-dir">
<a class="btn selected" href="/bbs/movie/index.html">看板</a>
<a class="btn" href="/man/movie/index.html">精華區</a>
</div>
<div class="btn-group btn-group-paging">
<a class="btn wide" href="/bbs/movie/index1.html">最
舊</a>
<a class="btn wide" href="/bbs/movie/index8210.html">‹ 上頁</a>
<a class="btn wide disabled">下頁 ›</a>
<a class="btn wide" href="/bbs/movie/index.html">最新
</a>
</div>
</div>
</div>
<div class="r-list-container action-bar-margin bbs-screen">
<div class="search-bar">
<form type="get" action="search" id="search-bar">
<input class="query" type="text" name="q" value="" placeholder="搜尋文章⋯">
</form>
</div>
<div class="r-ent">
<div class="nrec"><span class="hl f2">8</span></div>
<div class="title">
<a href="/bbs/movie/M.1565026014.A.B3C.html">[新聞]
「終局之戰」、「亂世佳人」、「阿凡達」誰真正票房冠軍?</a>
</div>
<div class="meta">
<div class="author">orz44444</div>
<div class="article-menu">
<div class="trigger">⋯</div>
<div class="dropdown">
<div class="item"><a href="/bbs/movie/search?q=thread%3A%5B%E6%96%B0%E8%81%9E%5D+%E3%80%8C%E7%B5%82%E5%B1%80%E4%B9%8B%E6%88%B0%E3%80%8D%E3%80%81%E3%80%8C%E4%BA%82%E4%B8%96%E4%BD%B3%E4%BA%BA%E3%80%8D%E3%80%81%E3%80%8C%E9%98%BF%E5%87%A1%E9%81%94%E3%80%8D%E8%AA%B0%E7%9C%9F%E6%AD%A3%E7%A5%A8%E6%88%BF%E5%86%A0%E8%BB%8D%EF%BC%9F">搜尋同標題文章</a></div>
<div class="item"><a href="/bbs/movie/search?q=author%3Aorz44444">搜尋看板內 orz44444 的文章</a></div>
</div>
</div>
<div class="date"> 8/06</div>
<div class="mark"></div>
</div>
</div>
<div class="r-ent">
<div class="nrec"><span class="hl f2">6</span></div>
<div class="title">
<a href="/bbs/movie/M.1565027230.A.041.html">Re: [新
聞] 凱文費奇透露《雷神索爾4》為何要拍女雷神</a>
</div>
<div class="meta">
<div class="author">godshibainu</div>
<div class="article-menu">
<div class="trigger">⋯</div>
<div class="dropdown">
<div class="item"><a href="/bbs/movie/search?q=thread%3A%5B%E6%96%B0%E8%81%9E%5D+%E5%87%B1%E6%96%87%E8%B2%BB%E5%A5%87%E9%80%8F%E9%9C%B2%E3%80%8A%E9%9B%B7%E7%A5%9E%E7%B4%A2%E7%88%BE4%E3%80%8B%E7%82%BA%E4%BD%95%E8%A6%81%E6%8B%8D%E5%A5%B3%E9%9B%B7%E7%A5%9E">搜尋同標題文章</a></div>
<div class="item"><a href="/bbs/movie/search?q=author%3Agodshibainu">搜尋看板內 godshibainu 的文章</a></div>
</div>
</div>
<div class="date"> 8/06</div>
<div class="mark"></div>
</div>
</div>
<div class="r-ent">
<div class="nrec"><span class="hl f3">10</span></div>
<div class="title">
<a href="/bbs/movie/M.1565027740.A.927.html">[新聞]
《復仇者4》驚見關史黛西!「就在蜘蛛人</a>
</div>
<div class="meta">
<div class="author">chufenyang</div>
<div class="article-menu">
<div class="trigger">⋯</div>
<div class="dropdown">
<div class="item"><a href="/bbs/movie/search?q=thread%3A%5B%E6%96%B0%E8%81%9E%5D+%E3%80%8A%E5%BE%A9%E4%BB%87%E8%80%854%E3%80%8B%E9%A9%9A%E8%A6%8B%E9%97%9C%E5%8F%B2%E9%BB%9B%E8%A5%BF%EF%BC%81%E3%80%8C%E5%B0%B1%E5%9C%A8%E8%9C%98%E8%9B%9B%E4%BA%BA">搜尋同標題文章</a></div>
<div class="item"><a href="/bbs/movie/search?q=author%3Achufenyang">搜尋看板內 chufenyang 的文章</a></div>
</div>
</div>
<div class="date"> 8/06</div>
<div class="mark"></div>
</div>
</div>
<div class="r-ent">
<div class="nrec"><span class="hl f2">4</span></div>
<div class="title">
<a href="/bbs/movie/M.1565031671.A.280.html">Re: [新
聞] 必備片單!帝國雜誌評選30年來30部經典代</a>
</div>
<div class="meta">
<div class="author">Payne22</div>
<div class="article-menu">
<div class="trigger">⋯</div>
<div class="dropdown">
<div class="item"><a href="/bbs/movie/search?q=thread%3A%5B%E6%96%B0%E8%81%9E%5D+%E5%BF%85%E5%82%99%E7%89%87%E5%96%AE%EF%BC%81%E5%B8%9D%E5%9C%8B%E9%9B%9C%E8%AA%8C%E8%A9%95%E9%81%B830%E5%B9%B4%E4%BE%8630%E9%83%A8%E7%B6%93%E5%85%B8%E4%BB%A3">搜尋同標題文章</a></div>
<div class="item"><a href="/bbs/movie/search?q=author%3APayne22">搜尋看板內 Payne22 的文章</a></div>
</div>
</div>
<div class="date"> 8/06</div>
<div class="mark"></div>
</div>
</div>
<div class="r-list-sep"></div>
<div class="r-ent">
<div class="nrec"><span class="hl f3">22</span></div>
<div class="title">
<a href="/bbs/movie/M.1559611458.A.DCA.html">[公告]
板規 2019/07/05</a>
</div>
<div class="meta">
<div class="author">ckshchen</div>
<div class="article-menu">
<div class="trigger">⋯</div>
<div class="dropdown">
<div class="item"><a href="/bbs/movie/search?q=thread%3A%5B%E5%85%AC%E5%91%8A%5D+%E6%9D%BF%E8%A6%8F+2019%2F07%2F05">搜尋同標題文章</a></div>
<div class="item"><a href="/bbs/movie/search?q=author%3Ackshchen">搜尋看板內 ckshchen 的文章</a></div>
</div>
</div>
<div class="date"> 6/04</div>
<div class="mark">M</div>
</div>
</div>
</div>
</div>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-32365737-1', {
cookieDomain: 'ptt.cc',
legacyCookieDomain: 'ptt.cc'
});
ga('send', 'pageview');
</script>
<script src="//ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="//images.ptt.cc/bbs/v2.26/bbs.js"></script>
</body>
</html>
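The request pattern above can be wrapped in a small helper that reports an HTTP error instead of crashing. This is a minimal sketch, not the original author's code: the names BROWSER_UA, build_request, and fetch_html and the 10-second timeout are assumptions, and it relies on Python's default certificate verification rather than the unverified context used above.

```python
import urllib.error
import urllib.request as request

# Browser-like User-Agent string, copied from the example above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request object that carries a browser-like User-Agent."""
    return request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Return the page text, or None if the server answers with an HTTP error."""
    req = build_request(url)
    try:
        with request.urlopen(req, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # err.code is the status code (e.g. 403); err.reason is the short message.
        print(f"Request failed: HTTP {err.code} {err.reason}")
        return None
```

Calling fetch_html('https://www.ptt.cc/bbs/movie/index.html') returns the page text, or None when the server responds with an error status such as 403.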
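3) Parse the HTML with the third-party package BeautifulSoup
# Fetch the HTML source of the movie board
import ssl
import urllib.request as request
import bs4  # third-party package: pip install beautifulsoup4
# Disable certificate verification (private helper; for local testing only)
context = ssl._create_unverified_context()
src = 'https://www.ptt.cc/bbs/movie/index.html'
# Build a Request object and attach the header
req = request.Request(src, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36"
})
with request.urlopen(req, context=context) as response:
    data = response.read().decode("utf-8")
# Parse the source and get the title of each post
root = bs4.BeautifulSoup(data, "html.parser")
# print(root.title.string)  # root.title is the <title> tag; root.title.string is the text inside it
# Look for what is distinctive about the target data in the HTML, e.g. a title such as 霸王别姬 sits inside <div><a></a></div>
# titles = root.find("div", class_="title")  # find the first div tag with class="title"
# print(titles.a.string)  # prints the string inside the <a> tag of that first matching div
titles = root.find_all("div", class_="title")
for title in titles:
    if title.a is not None:  # skip title divs that contain no <a> tag
        print(title.a.string)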
Result:
PS C:\Users\85380\Desktop\LearnPy> python .\test2.py
[新聞] 「終局之戰」、「亂世佳人」、「阿凡達」誰真正票房冠軍?
Re: [新聞] 凱文費奇透露《雷神索爾4》為何要拍女雷神
[新聞] 《復仇者4》驚見關史黛西!「就在蜘蛛人
Re: [新聞] 必備片單!帝國雜誌評選30年來30部經典代
Re: [新聞] 必備片單!帝國雜誌評選30年來30部經典代
[公告] 板規 2019/07/05
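The difference between find and find_all, and the reason for the title.a check, can be tried offline on a small hand-written fragment. The markup below is hypothetical: it imitates the board's div.title structure, and the entry without a link is an assumption about how a title div with no <a> tag (for example a removed post) might look.

```python
import bs4  # third-party package: pip install beautifulsoup4

# Hypothetical fragment imitating the board's markup (not real PTT output).
sample_html = """
<div class="r-ent"><div class="title"><a href="/bbs/movie/M.1.A.001.html">[News] First post</a></div></div>
<div class="r-ent"><div class="title">(entry without a link)</div></div>
<div class="r-ent"><div class="title"><a href="/bbs/movie/M.2.A.002.html">Re: [News] Second post</a></div></div>
"""

root = bs4.BeautifulSoup(sample_html, "html.parser")

# find returns only the first matching tag.
first = root.find("div", class_="title")
print(first.a.string)

# find_all returns every matching tag; some may have no <a> child.
titles = []
links = []
for div in root.find_all("div", class_="title"):
    if div.a is not None:  # div.a is None when the div contains no <a> tag
        titles.append(div.a.string)
        links.append(div.a["href"])

print(titles)
print(links)
```

Without the None check, the second entry would raise an AttributeError, because div.a is None there and None has no string attribute.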