
hmoban | Python FAQ | 2023-10-09

Python Web Crawler Tutorial: A Zhihu Spider Example

1. Spider code in zhihuSpider.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy.http import FormRequest, Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from zhihu.items import ZhihuItem


class ZhihuSpider(CrawlSpider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]
    start_urls = ["http://www.zhihu.com"]
    rules = (
        Rule(LinkExtractor(allow=(r"/question/\d+#.*?",)),
             callback="parse_page", follow=True),
        Rule(LinkExtractor(allow=(r"/question/\d+",)),
             callback="parse_page", follow=True),
    )
    headers = {
        "Accept": "*/*",
        "Accept-Encoding": "gzip,deflate",
        "Accept-Language": "en-US,en;q=0.8,zh-TW;q=0.6,zh;q=0.4",
        "Connection": "keep-alive",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "User-Agent": ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/38.0.2125.111 Safari/537.36"),
        "Referer": "http://www.zhihu.com/",
    }

    # Override the spider's start_requests method to send a custom request;
    # when it succeeds, the callback function is invoked.
    def start_requests(self):
        return [Request("https://www.zhihu.com/login",
                        meta={"cookiejar": 1},
                        callback=self.post_login)]

    # FormRequest ran into problems here
    def post_login(self, response):
        print("Preparing login")
        # Grab the value of the _xsrf field from the returned page;
        # it is needed to submit the form successfully.
        xsrf = response.xpath('//input[@name="_xsrf"]/@value').extract()[0]
        print(xsrf)
        # FormRequest.from_response is a Scrapy helper for posting a form.
        # After a successful login, the after_login callback is called.
        return [FormRequest.from_response(
            response,  # "http://www.zhihu.com/login",
            meta={"cookiejar": response.meta["cookiejar"]},
            headers=self.headers,  # note the headers here
            formdata={
                "_xsrf": xsrf,
                "email": "1095511864@qq.com",
                "password": "123456",
            },
            callback=self.after_login,
            dont_filter=True,
        )]

    def after_login(self, response):
        for url in self.start_urls:
            # Re-request each start URL in the logged-in session; the
            # cookiejar meta key is not sticky, so pass it along explicitly.
            yield Request(url,
                          meta={"cookiejar": response.meta["cookiejar"]},
                          dont_filter=True)

    def parse_page(self, response):
        item = ZhihuItem()
        item["url"] = response.url
        item["name"] = response.xpath(
            '//span[@class="name"]/text()').extract()
        print(item["name"])
        item["title"] = response.xpath(
            '//h2[@class="zm-item-title zm-editable-content"]/text()').extract()
        item["description"] = response.xpath(
            '//div[@class="zm-editable-content"]/text()').extract()
        item["answer"] = response.xpath(
            '//div[@class=" zm-editable-content clearfix"]/text()').extract()
        return item
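
The two `allow` patterns in `rules` decide which links are handed to `parse_page`. As a minimal offline sketch, the snippet below applies the same two patterns with the standard `re` module to a few made-up URLs. It assumes the link extractor searches for each pattern anywhere in the absolute URL (hence `re.search`). The helper name `matching_patterns` and the sample URLs are hypothetical and not part of the spider.

```python
import re

# The two patterns from the spider's rules, with the backslash in \d written out.
ALLOW_PATTERNS = (
    r"/question/\d+#.*?",  # a question URL followed by a fragment
    r"/question/\d+",      # any question URL
)


def matching_patterns(url):
    """Return the patterns that are found somewhere in url (hypothetical helper)."""
    return [p for p in ALLOW_PATTERNS if re.search(p, url)]


# Hypothetical sample URLs, used only for illustration.
samples = [
    "http://www.zhihu.com/question/12345",
    "http://www.zhihu.com/question/12345#answer-678",
    "http://www.zhihu.com/people/someone",
]

for url in samples:
    print(url, "->", matching_patterns(url))
```

Without the backslash, `/question/d+` would look for a literal letter `d` and would not match numeric question IDs. Under search semantics, every URL matched by the first pattern is also matched by the second, so the second rule alone would select the same links.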
