November 19, 2019

Getting Started with Python Web Scraping: Python Scraping Examples

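Preparation:

1. Download Python 3.6.5: https://www.python.org/downloads/

2. Download the PyCharm IDE: https://www.jetbrains.com/pycharm/download/#section=windows

3. Configure the environment variables
After installing Python, find its shortcut, right-click it, choose Properties, and copy the folder portion of the Target path (the directory that contains python.exe).

Right-click My Computer, then go to Properties > Advanced > Environment Variables > System variables and edit Path. Append a semicolon (;) to the end of the variable value, then paste the copied path after it.
Once the variable is set, test it: press Win+R, type cmd, then type python at the prompt. If the Python interactive prompt appears, as in the figure, the environment variable is set correctly.

Enter a simple statement, print("hello"), and the console prints hello.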
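Beyond the hello test, it can help to confirm which interpreter actually answers to the python command. The following is a minimal sketch; the minimum version (3, 6) and the helper name interpreter_ok are assumptions chosen to match the Python 3.6.5 install above, not something the original setup requires.

```python
import sys

# Assumed minimum, chosen to match the Python 3.6.5 install in this tutorial.
MIN_VERSION = (3, 6)


def interpreter_ok(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the given interpreter version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


# Show the full version string of the interpreter that is running this script.
print(sys.version)
print("OK" if interpreter_ok() else "Please install Python 3.6 or newer")
```

If the second line does not print OK, the Path entry probably points at an older installation.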

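With the preparation done, open the PyCharm IDE.

First use

1. Click Create New Project.

2. Enter the project name and path, and select a Python interpreter. If no Python interpreter is listed, add one in step 3.

3. Select the Python interpreter. Once an interpreter has been added, PyCharm scans the Python packages you have installed and shows the latest available version of each. (PyCharm presumably queries PyPI for this.)

4. After you click OK, an empty project is created. It contains a .idea folder, which PyCharm uses to manage the project.

5. Now write something to try PyCharm out: create a new .py file,

as shown in Figure 1: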

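Let's use the posts in the programming board as the example:
1. First, find the address: https://www.52pojie.cn/forum-24-1.html

2. I use Google Chrome. Press F12 and, in the Elements panel on the right, look inside the <body> tag for the post links and titles we need,

as shown in Figure 2:

First, let's outline the approach.
To make the extraction more precise, first grab the contents of each tbody:
<tbody id="normalthread_739688"></tbody>
Then extract the link and the title:

<a href="thread-739688-1-1.html" style="font-weight: bold;" class="s xst">【Python】萌新跟我来入门Python爬虫</a>

Now let's start writing the code: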

import urllib.request
import re

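We start by importing two modules. Here is a short explanation of each for newcomers; readers with more experience can dig deeper if they are interested.

urllib.request
Original documentation:
https://docs.python.org/3.5/library/urllib.request.html#module-urllib.request
urllib.request: extensible library for opening URLs. Source code: Lib/urllib/request.py. The urllib.request module defines functions and classes that help in opening URLs (mostly HTTP) in a complex world: basic and digest authentication, redirections, cookies, and more.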
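As a small illustration of the classes mentioned above, the sketch below builds a urllib.request.Request that carries a User-Agent header before anything is sent. The helper names build_request and fetch, the User-Agent string, and the 10-second timeout are assumptions made for this example; the tutorial itself calls urlopen directly on the URL.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0 (tutorial example)"):
    """Build a Request object with a User-Agent header (nothing is sent yet)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Open the URL and return the raw response bytes."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()


# Building the request is offline; only fetch() touches the network.
req = build_request("https://www.52pojie.cn/forum-24-1.html")
print(req.full_url)
print(req.get_method())
```

Calling fetch(url) returns bytes, just like urlopen(url).read(), so it can be decoded in the same way.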

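The re module
The re module gives Python full regular expression support. The compile function builds a regular expression object from a pattern string and optional flags. That object has a set of methods for matching and substitution. The re module also provides functions that behave exactly like those methods; they take the pattern string as their first argument.
For details, see: http://www.runoob.com/python/python-reg-expressions.html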
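The relationship between a compiled pattern object and the module-level functions can be seen in a short sketch. The sample string is made up for illustration; only re.compile and re.findall come from the standard library.

```python
import re

# Made-up sample text, shaped like the forum links used in this tutorial.
sample = '<a href="thread-1-1-1.html">first</a> <a href="thread-2-1-1.html">second</a>'
pattern_text = r'<a href="(thread-.+?)">(.+?)</a>'

# Option 1: compile once, then call the method on the pattern object.
compiled = re.compile(pattern_text)
via_object = compiled.findall(sample)

# Option 2: pass the pattern string as the first argument of the module function.
via_function = re.findall(pattern_text, sample)

print(via_object)
print(via_object == via_function)
```

Both calls return the same list of (link, title) tuples. Compiling is mostly worthwhile when the same pattern is reused many times.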

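Let's continue with the code:

url = "https://www.52pojie.cn/forum-24-1.html"  # address to scrape
page = urllib.request.urlopen(url).read()  # fetch the raw bytes of the page
page = page.decode('gbk')  # decode the bytes using the page's encoding
print(page)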

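Let's print the contents of page.

It should match the page source shown in Figure 2.

A word about the decoding step: 'gbk' is the encoding this page uses. We have to find out the page's encoding, otherwise the text comes out garbled. How do we find it? It is declared in a <meta> tag inside <head>,

as shown in the figure: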
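Reading the encoding by eye works, but it can also be looked up in code. The following is a rough sketch, assuming the page declares its charset in a <meta> tag near the top; the function name detect_charset, the 2048-byte window, and the utf-8 fallback are all choices made for this example.

```python
import re


def detect_charset(raw, default="utf-8"):
    """Look for charset=... in the first 2048 bytes; fall back to `default`."""
    match = re.search(rb'charset=["\']?([A-Za-z0-9_-]+)', raw[:2048], re.I)
    return match.group(1).decode("ascii").lower() if match else default


# Made-up page bytes that declare GBK, like the forum page in this tutorial.
sample = b'<html><head><meta charset="gbk" /><title>t</title></head></html>'
print(detect_charset(sample))
```

The returned name can be passed straight to decode, for example page.decode(detect_charset(page), 'ignore').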

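All the data from that address is now stored in page. How do we filter it to find what we need? This is where regular expressions come in, so let's go over the basics.

Find the content we want to match:

<a href="thread-739688-1-1.html" onclick="atarget(this)" class="s xst">【Python】萌新跟我来入门Python爬虫</a>
Now write the regular expression:
<a href="(thread-.+?)".+?class="s xst">(.+?)</a>

In Python it is written as a raw string, in the form zz = r'regular expression':
zz = r'<a href="(thread-.+?)".+? class="s xst">(.+?)</a>'

.+? matches one or more of any character, as few as possible (non-greedy)
(.+?) captures the text matched inside the parentheses
For example, for a post linked as thread-545836-1-1.html with the title 个人快速启动源码, those two pieces are what gets extracted.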
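To see why the pattern uses .+? rather than .+, compare the two on a string that contains two links. The sample HTML below is made up for this comparison.

```python
import re

# Two made-up links on one line, in the same shape as the forum markup.
html = ('<a href="thread-1-1-1.html" class="s xst">first post</a>'
        '<a href="thread-2-1-1.html" class="s xst">second post</a>')

# Non-greedy: each .+? stops at the earliest point where the rest can match.
lazy = re.findall(r'<a href="(thread-.+?)".+?class="s xst">(.+?)</a>', html)

# Greedy: each .+ grabs as much as it can, so the two links merge into one match.
greedy = re.findall(r'<a href="(thread-.+)".+class="s xst">(.+)</a>', html)

print(len(lazy), lazy)
print(len(greedy), greedy)
```

The non-greedy pattern yields one tuple per link, while the greedy one returns a single tuple whose first group swallows most of the line.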

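Let's continue with the code:

# regular expression
zz = r'<a href="(thread-.+?)".+? class="s xst">(.+?)</a>'
# find every match and store the results in the list html
html = re.findall(zz, page, re.S)  # re.S lets . match \n as well
print(html)

Let's print it and look at the result.

The output is not ideal yet, so let's improve it: separate the title from the link, and join the link onto the site address.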

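# match each thread's <tbody> block first, then the link and title inside it
zz_tbody = r'<tbody id="normalthread_.+?">(.+?)</tbody>'
zz_mData = r'<a href="(thread-.+?)".+? class="s xst">(.+?)</a>'
html = re.findall(zz_tbody, page, re.S)  # re.S lets . match \n as well
title_naumber = 1  # running number of each post
for line in html:
    html_link = re.findall(zz_mData, line, re.S)
    if not html_link:
        continue  # skip blocks without a matching link
    # title
    title = html_link[0][1]
    # link
    link = html_link[0][0]
    print('%d. %s https://www.52pojie.cn/%s' % (title_naumber, title, link))
    title_naumber = title_naumber + 1

The for loop assigns each item in the html list to line in turn.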
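The link is joined to the site address by string formatting. An alternative, shown here as a sketch rather than as the method used in this tutorial, is urllib.parse.urljoin, which also copes with links that are already absolute. BASE_URL and absolute_link are names chosen for this example.

```python
from urllib.parse import urljoin

# Site root that the relative thread links belong to.
BASE_URL = "https://www.52pojie.cn/"


def absolute_link(relative_link, base=BASE_URL):
    """Resolve a link found in the page against the base address."""
    return urljoin(base, relative_link)


print(absolute_link("thread-739688-1-1.html"))
# A link that is already absolute is returned unchanged.
print(absolute_link("https://static.52pojie.cn/static/image/common/5yeas.gif"))
```

The base can also be the address of the page being scraped, in which case urljoin resolves the link relative to that page.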

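Complete code:

import urllib.request
import re

url = "https://www.52pojie.cn/forum-24-1.html"  # address to scrape

page = urllib.request.urlopen(url).read()  # fetch the raw bytes of the page
page = page.decode('gbk')  # decode the bytes using the page's encoding
# print(page)

# matches each thread's <tbody> block
zz1 = r'<tbody id="normalthread_.+?">(.+?)</tbody>'
# matches the link and the title
zz_link = r'<a href="(thread-.+?)".+? class="s xst">(.+?)</a>'

# find every match and store the results in the list html
html = re.findall(zz1, page, re.S)  # re.S lets . match \n as well
# print(html)
title_naumber = 1  # running number of each post
for line in html:
    html_link = re.findall(zz_link, line, re.S)  # re.S lets . match \n as well
    if not html_link:
        continue  # skip blocks without a matching link
    # title
    title = html_link[0][1]
    # link
    link = html_link[0][0]
    print('%d. %s https://www.52pojie.cn/%s' % (title_naumber, title, link))
    title_naumber = title_naumber + 1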

Scraping every post in the programming board:

import urllib.request
import re

page_number = 1  # current page number
title_naumber = 1  # running number of each post
mData = {}
url = "https://www.52pojie.cn/forum-24-1.html"  # address of the first page
page = urllib.request.urlopen(url).read()  # fetch the raw bytes of the page
page = page.decode('gbk')  # decode the bytes using the page's encoding

zz_pageNumber = r'<span title="共.+?页">(.+?)</span>'
# extract the total number of pages
str_pagenumber = re.findall(zz_pageNumber, page, re.S)  # e.g. [' / 177 页', ' / 177 页']
# replace every non-digit with an empty string, then convert to int
page_Maxnumber = int(re.sub(r'\D', '', str_pagenumber[0]))  # \D matches any non-digit
# print(page_Maxnumber)
for index in range(page_Maxnumber):
    url = "https://www.52pojie.cn/forum-24-%d.html" % page_number  # address to scrape
    if page_number > 1:  # page 1 was already fetched above
        page = urllib.request.urlopen(url).read()  # fetch the raw bytes of the page
        page = page.decode('gbk', 'ignore')  # decode, skipping bytes that are not valid GBK
    # regular expressions
    # matches each thread's <tbody> block on the page
    zz = r'<tbody id="normalthread_.+?">(.+?)</tbody>'
    # matches the link and the title
    zz_mData = r'<a href="(thread-.+?)".+? class="s xst">(.+?)</a>'

    # find every match and store the results in the list html
    html = re.findall(zz, page, re.S)  # re.S lets . match \n as well
    # print(html)
    for line in html:
        html_link = re.findall(zz_mData, line, re.S)  # e.g. [('thread-739688-1-1.html', '【Python】萌新跟我来入门Python爬虫')]
        if not html_link:
            continue  # skip blocks without a matching link
        # title
        title = html_link[0][1]  # '【Python】萌新跟我来入门Python爬虫'
        # link
        link = html_link[0][0]  # 'thread-739688-1-1.html'
        print('%d. %s https://www.52pojie.cn/%s' % (title_naumber, title, link))
        title_naumber = title_naumber + 1
    print("Page %d\n" % page_number)
    page_number = page_number + 1

That's the whole program. Simple, isn't it? Let's look at the result.
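The page-count step is easy to try on its own. The sketch below isolates it in a helper; the name parse_page_count and the ValueError for labels without digits are additions made for this example, while the sample label ' / 177 页' is the one quoted in the code comment.

```python
import re


def parse_page_count(label):
    """Strip every non-digit from a label such as ' / 177 页' and return an int."""
    digits = re.sub(r'\D', '', label)
    if not digits:
        raise ValueError("no digits in page label: %r" % label)
    return int(digits)


print(parse_page_count(' / 177 页'))
```

Writing the pattern as the raw string r'\D' keeps the backslash intact, so the regular expression engine sees the non-digit class as intended.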

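The same approach works for other boards.

Let's try scraping the posts in the mobile board.
Only the url needs to change:
https://www.52pojie.cn/forum-65-1.html

as shown in the figure: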
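Since only the board number and the page number change between addresses, the URL can be built by a small helper. This is a sketch; forum_page_url and forum_page_urls are hypothetical names, and the URL shape is taken from the two addresses used in this tutorial (forum-24 and forum-65).

```python
def forum_page_url(forum_id, page_number):
    """Build the address of one listing page, e.g. forum-65-1.html."""
    return "https://www.52pojie.cn/forum-%d-%d.html" % (forum_id, page_number)


def forum_page_urls(forum_id, last_page):
    """Return the addresses of pages 1..last_page of a board."""
    return [forum_page_url(forum_id, n) for n in range(1, last_page + 1)]


print(forum_page_url(65, 1))
for address in forum_page_urls(24, 3):
    print(address)
```

Passing 65 instead of 24 is then the only change needed to switch from the programming board to the mobile board.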

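How do we scrape images?
We'll stick with the forum as the example. After a long search I couldn't find many images, so let's scrape the GIF images of the forum medals.

First find the address of the forum medals page, then press F12 to view the page source,
as shown in the figure:

<img src="https://static.52pojie.cn/static/image/common/5yeas.gif" alt="五年荣誉奖章" style="margin-top: 20px;width:auto; height: auto;">

Start writing the code: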

import urllib.request
import re
import os

url = "https://www.52pojie.cn/home.php?mod=medal"  # address to scrape
# <img src="https://static.52pojie.cn/static/image/common/5yeas.gif" alt="五年荣誉奖章" style="margin-top: 20px;width:auto; height: auto;">

page = urllib.request.urlopen(url).read()  # fetch the raw bytes of the page
page = page.decode('gbk')  # decode the bytes using the page's encoding
# print(page)

# regular expression: group 1 is the image URL, group 2 is the alt text
zz = r'<img src="(https://static.52pojie.cn/static/image/common/.+?)" alt="(.+?)" style=".+?">'
# find every match and store the results in the list html
html = re.findall(zz, page, re.S)  # re.S lets . match \n as well
print(html)

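Let's print it and look at the result:

The result is still not ideal: PNG images were matched as well.
To improve it, use a for loop to split up the items in the list, then check whether each link ends with gif.

for line in html:
    # check whether it is a GIF image
    if str(line[0]).endswith("gif"):
        p1 = line[0]  # image URL
        p2 = line[1]  # medal name (alt text)
        print(p2 + " " + p1)

Let's look at the result.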
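The endswith("gif") test works for the medal links shown here. A slightly more defensive variant, given as a sketch and not as part of the original code, compares the file extension of the URL path, so a query string or an uppercase extension does not change the result. The second entry in the sample list is a made-up PNG link used only to show the filtering.

```python
import posixpath
from urllib.parse import urlparse


def has_extension(url, extension=".gif"):
    """Return True if the path part of the URL ends with the given extension."""
    path = urlparse(url).path
    return posixpath.splitext(path)[1].lower() == extension


# (link, name) tuples in the same order that re.findall returns them.
medals = [
    ("https://static.52pojie.cn/static/image/common/5yeas.gif", "五年荣誉奖章"),
    ("https://static.52pojie.cn/static/image/common/example.png", "hypothetical PNG medal"),
]

gif_medals = [(name, link) for link, name in medals if has_extension(link)]
print(gif_medals)
```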

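We have the data; the next step is to download the images to the local disk.
Continue with the code:
First import the os library:

import os

Write a function that creates a folder:

def mkdir(path):
    folder = os.path.exists(path)

    if not folder:  # create the folder only if it does not exist yet
        os.makedirs(path)  # makedirs also creates any missing parent directories
        print("Creating new folder")

        print("Created successfully")
    else:
        print("The folder already exists")

Call the function, passing in the folder path:

img_path = "D:/photo/"
mkdir(img_path)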
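For comparison, os.makedirs accepts an exist_ok flag that folds the existence check into the call. The sketch below uses a temporary directory instead of D:/photo/ so it can run anywhere; ensure_dir is a name chosen for this example.

```python
import os
import tempfile


def ensure_dir(path):
    """Create `path` and any missing parents; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


# Use a throwaway location so the example does not depend on a D: drive.
demo_root = tempfile.mkdtemp()
target = ensure_dir(os.path.join(demo_root, "photo", "gif"))
ensure_dir(target)  # calling it again raises no error
print(os.path.isdir(target))
```

The mkdir function above remains useful when the printed messages are wanted; this variant only trades them for brevity.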

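Download the GIF images into the D:/photo/ folder. These lines go inside the if block of the loop above, which is why they are indented:

        url = p1
        # download the GIF image into the D:/photo/ folder
        web = urllib.request.urlopen(url)
        data = web.read()
        # name the file after the medal and write the bytes in binary mode
        with open(img_path + p2 + ".gif", "wb") as f:
            f.write(data)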
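The file name comes from the image's alt text, which could in principle contain characters that Windows does not allow in file names. The following sketch shows one way to guard against that; safe_filename and save_bytes are hypothetical helpers, the replacement character is an arbitrary choice, and the sample name and bytes are made up.

```python
import os
import re
import tempfile


def safe_filename(name, extension=".gif"):
    """Replace characters that Windows forbids in file names with underscores."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", name).strip() or "unnamed"
    return cleaned + extension


def save_bytes(folder, name, data):
    """Write `data` to folder/<safe name>.gif and return the full path."""
    path = os.path.join(folder, safe_filename(name))
    with open(path, "wb") as f:
        f.write(data)
    return path


# Write a few placeholder bytes to a temporary folder instead of D:/photo/.
demo_dir = tempfile.mkdtemp()
saved = save_bytes(demo_dir, "medal: 5/years?", b"GIF89a")
print(saved)
```

In the download loop, data would be the bytes returned by web.read() and name would be p2.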

That completes the code.
Complete code:

import urllib.request
import re
import os

url = "https://www.52pojie.cn/home.php?mod=medal"  # address to scrape
# <img src="https://static.52pojie.cn/static/image/common/5yeas.gif" alt="五年荣誉奖章" style="margin-top: 20px;width:auto; height: auto;">

page = urllib.request.urlopen(url).read()  # fetch the raw bytes of the page
page = page.decode('gbk')  # decode the bytes using the page's encoding
# print(page)

# regular expression: group 1 is the image URL, group 2 is the alt text
zz = r'<img src="(https://static.52pojie.cn/static/image/common/.+?)" alt="(.+?)" style=".+?">'
# find every match and store the results in the list html
html = re.findall(zz, page, re.S)  # re.S lets . match \n as well
# print(html)


def mkdir(path):
    folder = os.path.exists(path)

    if not folder:  # create the folder only if it does not exist yet
        os.makedirs(path)  # makedirs also creates any missing parent directories
        print("Creating new folder")

        print("Created successfully")
    else:
        print("The folder already exists")

img_path = "D:/photo/"
mkdir(img_path)

for line in html:
    # check whether it is a GIF image
    if str(line[0]).endswith("gif"):
        p1 = line[0]  # image URL
        p2 = line[1]  # medal name (alt text)
        print(p2 + " " + p1)
        url = p1
        # download the GIF image into the D:/photo/ folder
        web = urllib.request.urlopen(url)
        data = web.read()
        with open(img_path + p2 + ".gif", "wb") as f:
            f.write(data)

Let's look at the result.

Example: a scraper for NetEase Cloud Classroom

import requests
a={"_movies0":{
"movieid":'M6SGHFBMC',
"href":'http://open.163.com/movie/2008/1/M/C/M6SGF6VB4_M6SGHFBMC.html'
},
"_movies1":{
"movieid":'M6SGHJ9BO',
"href":'http://open.163.com/movie/2008/1/B/O/M6SGF6VB4_M6SGHJ9BO.html'
},
"_movies2":{
"movieid":'M6SGHM4EB',
"href":'http://open.163.com/movie/2008/1/E/B/M6SGF6VB4_M6SGHM4EB.html'
},
"_movies3":{
"movieid":'M6SGHKAED',
"href":'http://open.163.com/movie/2008/1/E/D/M6SGF6VB4_M6SGHKAED.html'
},
"_movies4":{
"movieid":'M6SGHMFAR',
"href":'http://open.163.com/movie/2008/1/A/R/M6SGF6VB4_M6SGHMFAR.html'
},
"_movies5":{
"movieid":'M6SGJVV7H',
"href":'http://open.163.com/movie/2008/1/7/H/M6SGF6VB4_M6SGJVV7H.html'
},
"_movies6":{
"movieid":'M6SGJVMC6',
"href":'http://open.163.com/movie/2008/1/C/6/M6SGF6VB4_M6SGJVMC6.html'
},
"_movies7":{
"movieid":'M6SGJVA93',
"href":'http://open.163.com/movie/2008/1/9/3/M6SGF6VB4_M6SGJVA93.html'
},
"_movies8":{
"movieid":'M6SGJV3FH',
"href":'http://open.163.com/movie/2008/1/F/H/M6SGF6VB4_M6SGJV3FH.html'
},
"_movies9":{
"movieid":'M6SGJURUO',
"href":'http://open.163.com/movie/2008/1/U/O/M6SGF6VB4_M6SGJURUO.html'
},
"_movies10":{
"movieid":'M6SGKG5LM',
"href":'http://open.163.com/movie/2008/1/L/M/M6SGF6VB4_M6SGKG5LM.html'
},
"_movies11":{
"movieid":'M6SGKGMOT',
"href":'http://open.163.com/movie/2008/1/O/T/M6SGF6VB4_M6SGKGMOT.html'
},
"_movies12":{
"movieid":'M6SGKK6L3',
"href":'http://open.163.com/movie/2008/1/L/3/M6SGF6VB4_M6SGKK6L3.html'
},
"_movies13":{
"movieid":'M6SGKIEME',
"href":'http://open.163.com/movie/2008/1/M/E/M6SGF6VB4_M6SGKIEME.html'
},
"_movies14":{
"movieid":'M6SGKINJV',
"href":'http://open.163.com/movie/2008/1/J/V/M6SGF6VB4_M6SGKINJV.html'
},
"_movies15":{
"movieid":'M6SGKSC2N',
"href":'http://open.163.com/movie/2008/1/2/N/M6SGF6VB4_M6SGKSC2N.html'
},
"_movies16":{
"movieid":'M6SGKVGN6',
"href":'http://open.163.com/movie/2008/1/N/6/M6SGF6VB4_M6SGKVGN6.html'
},
"_movies17":{
"movieid":'M6SGL3P1H',
"href":'http://open.163.com/movie/2008/1/1/H/M6SGF6VB4_M6SGL3P1H.html'
},
"_movies18":{
"movieid":'M6SGL2R35',
"href":'http://open.163.com/movie/2008/1/3/5/M6SGF6VB4_M6SGL2R35.html'
},
"_movies19":{
"movieid":'M6SGL3CE4',
"href":'http://open.163.com/movie/2008/1/E/4/M6SGF6VB4_M6SGL3CE4.html'
}
}
# Note: this headers dict is never passed to the requests.get() calls below.
headers = {
 
    'charset': 'utf-8',
    'Accept-Encoding': 'gzip',
    'referer': 'https://servicewechat.com/wx855c5d7718f218c9/414/page-frame.html',
    'xdk-version': 'V0.11.27.0',
    'xdk-versioncode': '221',
    'xdk-env': 'v2',
    'content-type': 'application/json',
    'token': 'eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJvY2h2cTBCR3NwTEhhT0FmLWRaeTlzdWxjR2lvIiwiYXVkaWVuY2UiOiJtb2JpbGUiLCJjcmVhdGVkIjoxNTQwMjg3MDcwNDI5LCJleHAiOjE1NDYzMzUxMjZ9.wl_4ZPbAxhV9pBhrcNQrfrOo1HGlhJZmjBUZstf4QNg',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 8.0.0; VTR-AL00 Build/HUAWEIVTR-AL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/66.0.3359.126 Mobile Safari/537.36 MicroMessenger/6.7.3.1360(0x26070336) NetType/WIFI Language/zh_CN Process/appbrand0',
    'Host': 'mars.sharedaka.com',
    'Connection': 'Keep-Alive'
 
}
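import re
import os
download_path = './rc'
os.makedirs(download_path, exist_ok=True)  # open() below fails if the folder is missing
rc_dict = {}  # maps numbered title -> video URL
# patterns for the video source and the title embedded in the page script
rcurl_s = re.compile(r"appsrc:'(.+?)'", re.DOTALL)
titlle_s = re.compile(r"title:'(.+?)'", re.DOTALL)
i = 1
for mk in a.keys():
    murl = a[mk]["href"]
    # print(murl)
    # strip newlines and spaces so the patterns do not depend on page formatting
    content = requests.get(murl).text.replace('\n', '').replace(' ', '')
    # print(content)
    # replace m3u8 with mp4 in the source address to get the video file URL
    rcurl = re.findall(rcurl_s, content)[0].replace('m3u8', 'mp4')
    titlle = str(i) + re.findall(titlle_s, content)[0]
    rc_dict[titlle] = rcurl
    i += 1
print(rc_dict)
for mk in rc_dict.keys():
    path = os.path.join(download_path, mk + '.mp4')
    print('--------->', path)
    if not os.path.exists(path):
        # stream the response so the whole video is never held in memory
        r = requests.get(rc_dict[mk], stream=True)
        with open(path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)
    else:
        print('existed')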

Result:
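The extraction step of this scraper can be tried offline, without downloading anything. The sketch below applies the same two patterns to a made-up script fragment; the sample text, the example.com address, and the helper name extract_video are assumptions for illustration and do not reproduce the real page.

```python
import re

# Same patterns as the scraper above, compiled once.
APPSRC_RE = re.compile(r"appsrc:'(.+?)'")
TITLE_RE = re.compile(r"title:'(.+?)'")


def extract_video(page_text):
    """Return (title, mp4_url) from the page text, or None if either is missing."""
    # Remove newlines and spaces first, as the scraper does.
    compact = page_text.replace("\n", "").replace(" ", "")
    src = APPSRC_RE.search(compact)
    title = TITLE_RE.search(compact)
    if not (src and title):
        return None
    return title.group(1), src.group(1).replace("m3u8", "mp4")


# Made-up fragment in the shape the patterns expect.
sample_page = """
var movie = {
    title : 'Lecture 1',
    appsrc : 'http://example.com/video/lecture1.m3u8'
};
"""

print(extract_video(sample_page))
```

Because every space is removed before matching, spaces inside a title disappear as well. Returning None instead of indexing [0] avoids an IndexError on pages where a pattern does not match.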


About Author

Gideon
