Fixing an Error in the Section 5.1 Example Program of Web Scraping with Python (《Python网络数据采集》), and How to Use re.sub

Problem

While working through Section 5.1 of Web Scraping with Python, I tried running the example program the author provides and got the following error:

C:\Anaconda3\python.exe G:/Github/python-scraping/chapter5/1-getPageMedia.py
http://pythonscraping.com/misc/jquery.js?v=1.4.4
Traceback (most recent call last):
  File "G:/Github/python-scraping/chapter5/1-getPageMedia.py", line 47, in <module>
    urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory))
  File "C:\Anaconda3\lib\urllib\request.py", line 197, in urlretrieve
    tfp = open(filename, 'wb')
OSError: [Errno 22] Invalid argument: 'downloaded/misc/jquery.js?v=1.4.4'

Process finished with exit code 1

On closer inspection, the problem lies in the src value, which the code does not sanitize. The first src URL the script retrieves is:

http://pythonscraping.com/misc/jquery.js?v=1.4.4

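The actual file should be jquery.js, but the URL carries a trailing version suffix, "?v=1.4.4".

Because "?" is not a valid character in a Windows file name, opening the file for writing fails.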
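
To make the cause concrete, here is a minimal sketch that uses the standard library's urllib.parse.urlsplit to separate the path from the query string in that URL, and then checks the file name against the characters Windows reserves. The variable names (src_url, WINDOWS_RESERVED, and so on) are illustrative and do not come from the book's code.

```python
from urllib.parse import urlsplit

# The first src URL printed by the script before it crashed
src_url = "http://pythonscraping.com/misc/jquery.js?v=1.4.4"

parts = urlsplit(src_url)
print(parts.path)   # /misc/jquery.js
print(parts.query)  # v=1.4.4

# Characters that Windows does not allow in file names; "?" is one of them
WINDOWS_RESERVED = set('<>:"/\\|?*')

# Last URL segment, which the script would use as the file name
file_name = src_url.rsplit("/", 1)[-1]
bad_chars = sorted(set(file_name) & WINDOWS_RESERVED)
print(file_name)    # jquery.js?v=1.4.4
print(bad_chars)    # ['?']
```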

Solution and how to use re.sub

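Since more than one src URL on the example site carries this kind of suffix, I decided to strip it with a regular expression.

A quick web search turned up an example of replacing strings with a regular expression:

A Detailed Explanation of re.sub in Python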
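
As a quick illustration of the call itself: re.sub(pattern, repl, string) returns a copy of string in which every match of pattern is replaced by repl. The sketch below applies the same pattern that appears in the modified getDownloadPath to a few URLs from the example site. The helper name strip_query is made up for this example.

```python
import re

def strip_query(url):
    """Remove everything from the first '?' to the end of the string."""
    # \? matches a literal question mark; .* matches the rest of the line
    return re.sub(r"\?.*", "", url)

urls = [
    "http://pythonscraping.com/misc/jquery.js?v=1.4.4",
    "http://pythonscraping.com/misc/drupal.js?nhx1dd",
    "http://pythonscraping.com/sites/default/files/lrg_0.jpg",
]

cleaned = [strip_query(u) for u in urls]
for u in cleaned:
    print(u)
```

URLs without a "?" pass through unchanged, because re.sub returns the original string when the pattern does not match.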

The book's example code can therefore be modified as follows.
First, import the re module:

import re

Then add a substitution statement to the getDownloadPath function:

def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory):
    path = absoluteUrl.replace("www.", "")
    path = path.replace(baseUrl, "")
    path = re.sub(r"\?.*", "", path)  # strip everything from the first "?" onward
    path = downloadDirectory + path
    directory = os.path.dirname(path)
    if not os.path.exists(directory):
        os.makedirs(directory)
    return path

That solves the problem. The modified code produces the following output:

C:\Anaconda3\python.exe G:/Github/python-scraping/chapter5/1-getPageMedia.py
http://pythonscraping.com/misc/jquery.js?v=1.4.4
http://pythonscraping.com/misc/jquery.once.js?v=1.2
http://pythonscraping.com/misc/drupal.js?nhx1dd
http://pythonscraping.com/sites/all/themes/skeletontheme/js/jquery.mobilemenu.js?nhx1dd
http://pythonscraping.com/sites/all/modules/google_analytics/googleanalytics.js?nhx1dd
http://pythonscraping.com/sites/default/files/lrg_0.jpg
http://pythonscraping.com/img/lrg%20(1).jpg

Process finished with exit code 0
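
The same trimming can also be done without a regular expression. The sketch below compares str.partition with slicing at str.rfind("?"); both function names are hypothetical. Note that rfind returns -1 when there is no "?", so an unguarded slice would cut off the last character of URLs such as the .jpg ones above.

```python
def strip_query_partition(path):
    # partition returns the text before the first "?",
    # or the whole string when there is no "?"
    return path.partition("?")[0]

def strip_query_rfind(path):
    # rfind returns -1 when "?" is absent, so guard before slicing
    idx = path.rfind("?")
    return path if idx == -1 else path[:idx]

with_query = "/misc/jquery.js?v=1.4.4"
without_query = "/sites/default/files/lrg_0.jpg"

print(strip_query_partition(with_query))     # /misc/jquery.js
print(strip_query_rfind(with_query))         # /misc/jquery.js
print(strip_query_partition(without_query))  # /sites/default/files/lrg_0.jpg
print(strip_query_rfind(without_query))      # /sites/default/files/lrg_0.jpg

# Unguarded slice: rfind gives -1, so the final character is dropped
unguarded = without_query[:without_query.rfind("?")]
print(unguarded)                             # /sites/default/files/lrg_0.jp
```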

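Comments

nidaye, posted 10 years ago

As this group's resident Python expert, let me tell you a simpler way: path = path[:path.rfind('?')]. Also, the code style is painful to look at. I suggest reading: https://www.python.org/dev/peps/pep-0008/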

Author

jingouwangzi, posted 10 years ago

That would work. The code is the one provided with the book, though, so let me figure out how it works first.

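孟祥涛, posted 8 years ago

Actually, the example in the book is not wrong. If you open the page source, you will see that of all the src attributes, the last one is a normal image (.jpg). The author was also rather crafty: the code does not put urlretrieve() inside the for loop, so it only downloads the last src and no error occurs. Still, your approach is good. It actually solves the problem instead of sidestepping it.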
