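When you need to collect every product image from a website, right-clicking and saving each one by hand is far too slow. Last week I helped a friend download 2,000 phone-case images from an e-commerce platform. Doing it by hand would have taken about three days; a Python script finished in 15 minutes.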

Preparation

Install Python 3.6+ (the Anaconda distribution is recommended)
In a terminal, run: pip install requests beautifulsoup4
Have the target URL ready (for example: https://example.com/products)

Core code walkthrough

import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Create the output directory (no error if it already exists)
os.makedirs('images', exist_ok=True)

# Send a browser-like User-Agent header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}

# Fetch the page HTML
def get_html(url):
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # treat 4xx/5xx responses as failures
        return response.text
    except requests.RequestException as e:
        print(f"Failed to fetch page: {e}")
        return None

# Extract image links
def parse_images(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Adjust the selector to match the actual page structure
    img_tags = soup.find_all('img', src=True)
    # Compare in lowercase so .JPG and .PNG also match
    return [img['src'] for img in img_tags if img['src'].lower().endswith(('.jpg', '.png'))]

# Download a single image
def download_image(url, filename):
    try:
        response = requests.get(url, headers=headers, stream=True, timeout=10)
        response.raise_for_status()
        with open(os.path.join('images', filename), 'wb') as f:
            for chunk in response.iter_content(1024):
                f.write(chunk)
        print(f"Downloaded: {filename}")
    except (requests.RequestException, OSError) as e:
        print(f"Download failed: {url} - {e}")

# Main program
if __name__ == "__main__":
    base_url = "https://example.com/products"
    html = get_html(base_url)
    if html:
        image_urls = parse_images(html)
        for idx, img_url in enumerate(image_urls):
            # Resolve relative paths against the page URL
            img_url = urljoin(base_url, img_url)
            # Keep the real extension instead of always writing .jpg
            ext = os.path.splitext(img_url)[1]
            download_image(img_url, f"product_{idx+1}{ext}")

Solutions to 5 common problems

Anti-scraping restrictions:
- Add a random delay: time.sleep(random.uniform(0.5, 2))
- Use a pool of proxy IPs
- Rotate the User-Agent
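The first and third points can be sketched roughly as follows. The User-Agent strings in the pool and the helper names random_headers and polite_pause are illustrative assumptions, not part of the original script.

```python
import random
import time

# Hypothetical pool of User-Agent strings; replace with the ones you want to send
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def random_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def polite_pause(min_s=0.5, max_s=2.0):
    """Sleep for a random interval and return the number of seconds slept."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

Calling polite_pause() before each request and passing headers=random_headers() to requests.get would combine the two. For proxies, requests.get also accepts a proxies= dictionary.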

Image path problems:

Convert relative paths to absolute paths:

from urllib.parse import urljoin
absolute_url = urljoin(base_url, relative_path)
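To make the behaviour of urljoin concrete, here is a small sketch with a few sample references. The image paths are made up for illustration; only the base URL comes from the article.

```python
from urllib.parse import urljoin

base_url = "https://example.com/products"

# Sample references of the kinds that typically appear in src attributes
refs = [
    "img/a.jpg",                    # relative to the page
    "/static/a.jpg",                # relative to the site root
    "//cdn.example.com/a.jpg",      # protocol-relative
    "https://other.example/a.jpg",  # already absolute
]

results = {ref: urljoin(base_url, ref) for ref in refs}
for ref, absolute in results.items():
    print(f"{ref} -> {absolute}")
# img/a.jpg -> https://example.com/img/a.jpg
# /static/a.jpg -> https://example.com/static/a.jpg
# //cdn.example.com/a.jpg -> https://cdn.example.com/a.jpg
# https://other.example/a.jpg -> https://other.example/a.jpg
```

Note the first case: because the base URL has no trailing slash, its last segment ("products") is replaced rather than extended. With "https://example.com/products/" as the base, the result would be https://example.com/products/img/a.jpg.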

Interrupted large-file downloads:
- Download in chunks (already implemented in the code)
- Add a retry mechanism
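A retry mechanism could look something like the sketch below. The helper name with_retries and the exponential backoff schedule are assumptions chosen for illustration; the article does not prescribe a particular strategy.

```python
import time

def with_retries(action, attempts=3, base_delay=1.0):
    """Call action() until it succeeds or the attempts run out.

    Waits base_delay, then 2x, then 4x that between attempts,
    and re-raises the last error if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as e:
            if attempt == attempts:
                raise
            wait = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({e}); retrying in {wait:.1f}s")
            time.sleep(wait)
```

One caveat: download_image in the core code catches its own exceptions and only prints them, so a retried version would need to let the exception propagate for a wrapper like this to notice the failure.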

Dynamically loaded images:
- Analyze the page's XHR requests
- Use Selenium to simulate a browser
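When the XHR analysis turns up a JSON endpoint, the image URLs can often be read straight from the response instead of from the HTML. The sketch below assumes a purely hypothetical response shape, {"items": [{"image": "..."}]}; a real site will use its own field names, which you would find in the browser's network panel.

```python
import json

def extract_image_urls(payload):
    """Collect image URLs from a decoded JSON payload.

    Assumes a hypothetical shape: {"items": [{"image": "<url>"}, ...]}.
    Entries without an "image" field are skipped.
    """
    items = payload.get("items", [])
    return [item["image"] for item in items if item.get("image")]

# Sample body standing in for what such an endpoint might return
sample_body = (
    '{"items": ['
    '{"image": "https://example.com/img/1.jpg"}, '
    '{"name": "no image here"}, '
    '{"image": "https://example.com/img/2.png"}'
    ']}'
)

urls = extract_image_urls(json.loads(sample_body))
print(urls)
# ['https://example.com/img/1.jpg', 'https://example.com/img/2.png']
```

With requests, the decoded payload would come from response.json() rather than from a hard-coded string.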

Filename collisions:

Name files by MD5 hash:

import hashlib
filename = hashlib.md5(img_url.encode()).hexdigest() + '.jpg'
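The one-liner above always appends .jpg. A slightly fuller sketch that keeps the extension found in the URL might look like this; the function name hashed_filename and the default_ext fallback are assumptions for illustration.

```python
import hashlib
import os
from urllib.parse import urlparse

def hashed_filename(img_url, default_ext=".jpg"):
    """Build a filename from the MD5 of the URL, keeping the URL's extension."""
    digest = hashlib.md5(img_url.encode("utf-8")).hexdigest()
    # Take the extension from the path only, so query strings are ignored
    ext = os.path.splitext(urlparse(img_url).path)[1].lower()
    return digest + (ext or default_ext)
```

Because the name depends only on the URL, downloading the same URL twice overwrites the same file instead of creating a duplicate.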

Advanced techniques

Multithreaded downloads (using concurrent.futures)

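from concurrent.futures import ThreadPoolExecutor

# download_image takes (url, filename), so pass one filename per URL
filenames = [f"product_{i+1}.jpg" for i in range(len(image_urls))]
with ThreadPoolExecutor(max_workers=5) as executor:
    executor.map(download_image, image_urls, filenames)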
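If you also want to know which downloads succeeded, one option is to submit the tasks individually and inspect each result. The sketch below uses a stand-in fake_download function so it runs without network access; run_downloads and fake_download are hypothetical names, not part of the original script.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_downloads(download, jobs, max_workers=5):
    """Run download(url, filename) for each job.

    Returns two sorted lists of filenames: (succeeded, failed).
    """
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(download, url, name): name for url, name in jobs}
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                succeeded.append(name)
            except Exception:
                failed.append(name)
    return sorted(succeeded), sorted(failed)

# Stand-in for a real download function: fails for one URL
def fake_download(url, filename):
    if "broken" in url:
        raise IOError(f"cannot fetch {url}")

jobs = [
    ("https://example.com/img/1.jpg", "product_1.jpg"),
    ("https://example.com/img/broken.jpg", "product_2.jpg"),
    ("https://example.com/img/3.jpg", "product_3.jpg"),
]
ok, bad = run_downloads(fake_download, jobs)
print(ok, bad)
# ['product_1.jpg', 'product_3.jpg'] ['product_2.jpg']
```

This only reports failures for a download function that raises on error; a function that catches and prints its own exceptions would always appear to succeed.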

Resumable downloads
Automatic detection of a site's pagination rules
A logging system
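As a starting point for the first item, resuming usually relies on the HTTP Range header: ask the server only for the bytes you do not have yet. The helper below is a minimal sketch under that assumption; resume_headers is a hypothetical name.

```python
import os

def resume_headers(path):
    """Return a Range header asking for the bytes after what is already on disk.

    Returns an empty dict when there is no partial file to resume from.
    """
    if os.path.exists(path):
        size = os.path.getsize(path)
        if size > 0:
            return {"Range": f"bytes={size}-"}
    return {}
```

The returned headers would be merged into the request, and the file opened in append mode ('ab') when the server answers 206 Partial Content. A plain 200 response means the server ignored the range, so the file should be rewritten from the start.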

Legal notice

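Before downloading in bulk, confirm the following:

Check the robots.txt file (for example: https://example.com/robots.txt)
Review the site's terms of service
Do not put excessive load on the server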
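The robots.txt check can be automated with the standard library's urllib.robotparser. The sketch below parses a sample file inline so that it runs offline; the rules and the crawler name MyImageBot are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content; a real run would load it from the site instead
sample_robots = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(sample_robots.splitlines())

agent = "MyImageBot"  # hypothetical crawler name
allowed_products = parser.can_fetch(agent, "https://example.com/products")
allowed_private = parser.can_fetch(agent, "https://example.com/private/a.jpg")
delay = parser.crawl_delay(agent)

print(allowed_products)  # True
print(allowed_private)   # False
print(delay)             # 2
```

Against a live site, parser.set_url("https://example.com/robots.txt") followed by parser.read() fetches the real file. A declared crawl delay is a sensible lower bound for the pause between requests.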

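One last reminder: this script saved me more than 200 hours of work last year, but the first time I ran it I brought the target site down... Keep your request rate under control and be a considerate scraper.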

Batch-Downloading Website Images with a Python Script: 5 Key Steps and Fixes for Common Problems
