IBoger/SmartWatch-Clawer

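Smartwatch Crawler System

A full-featured system for crawling, analyzing, and comparing smartwatch data. It crawls product information from ZOL, looks up prices on JD.com, and collects user reviews from Xiaohongshu.

Features

🔍 ZOL product crawling: automatically crawls smartwatch product information and detailed specifications from ZOL
💰 JD price lookup: searches JD.com for price information by product name
📱 Xiaohongshu reviews: collects real user reviews and usage experiences from Xiaohongshu
🗄️ Data storage: stores data in a SQLite database and exports to CSV, Excel, or JSON
📊 Data analysis: provides brand comparison, price analysis, and related functions
🎯 Modular design: a clearly organized code structure that is easy to maintain and extend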

Project Structure

smartwatch_crawler/
├── main.py                 # Main program entry point
├── config.py              # Configuration file
├── requirements.txt       # Dependency list
├── crawlers/              # Crawler modules
│   ├── __init__.py
│   ├── zol_crawler.py     # ZOL website crawler
│   ├── jd_crawler.py      # JD price crawler
│   └── xhs_crawler.py     # Xiaohongshu review crawler
├── models/                # Data models
│   ├── __init__.py
│   └── smartwatch.py      # Smartwatch data model
├── database/              # Database modules
│   ├── __init__.py
│   ├── db_manager.py      # Database manager
│   └── data_exporter.py   # Data exporter
├── utils/                 # Utility modules
│   ├── __init__.py
│   ├── logger.py          # Logging utility
│   └── http_client.py     # HTTP client
├── exports/               # Exported files
├── logs/                  # Log files
└── tests/                 # Tests

Installation and Usage

1. Requirements

Python 3.7+
pip

2. Install dependencies

pip install -r requirements.txt

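3. Basic usage

Crawl ZOL product information

# Crawl one page of products
python main.py crawl-zol

# Crawl multiple pages of products
python main.py crawl-zol --pages 3

Look up JD prices

# Look up JD prices for all products (up to 3 listings per product)
python main.py crawl-jd

# Customize the number of listings queried per product
python main.py crawl-jd --limit 5

Crawl Xiaohongshu reviews

# Crawl Xiaohongshu reviews for all products (up to 10 reviews per product)
python main.py crawl-xhs

# Customize the number of reviews per product
python main.py crawl-xhs --limit 20

Run the full pipeline

# Run all crawl tasks with one command
python main.py crawl-all

Export data

# Export to Excel (the default format)
python main.py export

# Export to CSV
python main.py export --format csv

# Export to JSON
python main.py export --format json

# Export a comparison report
python main.py export --format comparison

# Specify the output file name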
python main.py export --format excel --filename my_data.xlsx
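
The commands above suggest a subcommand-style CLI. The following is a minimal sketch of how such an interface could be declared with argparse; it is not taken from main.py, and the function name build_parser and the help strings are assumptions. Only the command names, flags, and defaults come from the usage examples above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented commands and defaults."""
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)

    # crawl-zol: one page unless --pages is given
    zol = sub.add_parser("crawl-zol", help="crawl ZOL product pages")
    zol.add_argument("--pages", type=int, default=1)

    # crawl-jd: up to 3 listings per product unless --limit is given
    jd = sub.add_parser("crawl-jd", help="look up JD prices")
    jd.add_argument("--limit", type=int, default=3)

    # crawl-xhs: up to 10 reviews per product unless --limit is given
    xhs = sub.add_parser("crawl-xhs", help="crawl Xiaohongshu reviews")
    xhs.add_argument("--limit", type=int, default=10)

    # crawl-all takes no options in the documented usage
    sub.add_parser("crawl-all", help="run every crawl task")

    # export: Excel by default, optional output file name
    export = sub.add_parser("export", help="export stored data")
    export.add_argument(
        "--format",
        choices=["excel", "csv", "json", "comparison"],
        default="excel",
    )
    export.add_argument("--filename", default=None)
    return parser
```

Calling build_parser().parse_args(["crawl-zol", "--pages", "3"]) yields a namespace whose command is "crawl-zol" and whose pages is 3, which a dispatcher could then route to the matching crawler.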

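4. Advanced usage

Changing the configuration

Edit config.py to change the crawler settings:

# Delay between requests (seconds)
REQUEST_DELAY = 1

# Maximum number of retries
MAX_RETRIES = 3

# Request timeout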
TIMEOUT = 30
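
To illustrate how these three settings typically interact, here is a minimal retry-loop sketch. It is not the project's http_client.py: the function fetch_with_retry and its injectable fetch and sleep arguments are hypothetical, and the sketch assumes MAX_RETRIES counts total attempts and that REQUEST_DELAY is the pause between them.

```python
import time
from typing import Callable

# Values mirror the config.py settings shown above.
REQUEST_DELAY = 1
MAX_RETRIES = 3
TIMEOUT = 30


def fetch_with_retry(
    fetch: Callable[[str, float], str],
    url: str,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url, TIMEOUT) up to MAX_RETRIES times, pausing between tries.

    `fetch` is any callable that returns the page body or raises on failure;
    injecting it keeps the sketch independent of a specific HTTP library.
    """
    last_error = None
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return fetch(url, TIMEOUT)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            if attempt < MAX_RETRIES:
                sleep(REQUEST_DELAY)
    raise RuntimeError(
        f"giving up on {url} after {MAX_RETRIES} attempts"
    ) from last_error
```

Raising REQUEST_DELAY lowers the load on the target site at the cost of slower crawls, while raising MAX_RETRIES helps with flaky connections but delays the final failure.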

Database operations

from database.db_manager import DatabaseManager

# Create a database manager
db = DatabaseManager()

# Get all products
products = db.get_all_products()

# Look up a product by URL
product = db.get_product_by_url("https://detail.zol.com.cn/GPSwatch/index2107338.shtml")

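Custom crawlers

from crawlers.zol_crawler import ZOLCrawler

# Create a ZOL crawler
crawler = ZOLCrawler()

# Get the product list
products = crawler.get_product_list(page=1)

# Get product details
# (product_url is a placeholder for a ZOL product page URL)
product = crawler.get_product_details(product_url)

# Get product parameters
# (param_url is a placeholder for that product's parameter page URL)
parameters = crawler.get_product_parameters(param_url)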

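Data Structures

Product table (smartwatch_products)

id: product ID
name: product name
brand: brand
model: model
zol_url: ZOL product page URL
param_url: parameter page URL
image_url: product image URL
price_range: price range

Product parameter table (product_parameters)

id: parameter ID
product_id: ID of the associated product
category: parameter category
param_name: parameter name
param_value: parameter value

JD price table (jd_price_info)

id: price record ID
product_id: ID of the associated product
jd_url: JD listing URL
jd_title: JD listing title
current_price: current price
original_price: original price
shop_name: shop name
rating: rating
comment_count: number of reviews

Xiaohongshu review table (xhs_reviews)

id: review ID
product_id: ID of the associated product
note_id: note ID
title: title
content: content
author: author
like_count: number of likes
comment_count: number of comments
tags: tags (JSON format)
publish_time: publish time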
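
The four tables above could be expressed as a SQLite schema roughly as follows. This is a sketch only: the table and column names come from the lists above, but the column types, the foreign-key references, and the helper init_db are assumptions, and the real db_manager.py may define them differently.

```python
import sqlite3

# Column names follow the README; types and constraints are assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS smartwatch_products (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    name        TEXT,
    brand       TEXT,
    model       TEXT,
    zol_url     TEXT,
    param_url   TEXT,
    image_url   TEXT,
    price_range TEXT
);
CREATE TABLE IF NOT EXISTS product_parameters (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    product_id  INTEGER REFERENCES smartwatch_products(id),
    category    TEXT,
    param_name  TEXT,
    param_value TEXT
);
CREATE TABLE IF NOT EXISTS jd_price_info (
    id             INTEGER PRIMARY KEY AUTOINCREMENT,
    product_id     INTEGER REFERENCES smartwatch_products(id),
    jd_url         TEXT,
    jd_title       TEXT,
    current_price  REAL,
    original_price REAL,
    shop_name      TEXT,
    rating         REAL,
    comment_count  INTEGER
);
CREATE TABLE IF NOT EXISTS xhs_reviews (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    product_id    INTEGER REFERENCES smartwatch_products(id),
    note_id       TEXT,
    title         TEXT,
    content       TEXT,
    author        TEXT,
    like_count    INTEGER,
    comment_count INTEGER,
    tags          TEXT,  -- JSON-encoded list of tag strings
    publish_time  TEXT
);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure all four tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Because tags is stored as JSON text, a caller would serialize with json.dumps(["tag1", "tag2"]) on insert and parse with json.loads on read.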

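Notes

Request rate: the program waits between requests (REQUEST_DELAY in config.py, 1 second by default) to avoid putting excessive load on the target sites
Anti-crawling: random User-Agent headers and a retry mechanism are used to cope with basic anti-crawling measures
Data accuracy: the parsing logic may need adjusting when a site's structure changes; until then, crawled data may be incomplete or wrong
Legal compliance: make sure your crawling complies with each target site's robots.txt and terms of use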
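
As an illustration of the anti-crawling and compliance notes above, the sketch below picks a random User-Agent and checks a URL against robots.txt rules using the standard library. It is not the project's code: the USER_AGENTS pool and the helpers random_headers and is_allowed are hypothetical, and the project itself is not stated to perform a robots.txt check.

```python
import random
from urllib.robotparser import RobotFileParser

# Hypothetical pool; a real crawler would likely keep a longer, current list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def random_headers() -> dict:
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def is_allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Check `url` against the text of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

Passing the robots.txt text in as a string keeps the check testable offline; in practice the text would first be downloaded from the target site's /robots.txt. A robots.txt check does not cover a site's terms of use, which still need to be read separately.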

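Troubleshooting

Common problems

Network connection errors
- Check your network connection
- Confirm that the target site is reachable
- Adjust the request timeout
Parsing errors
- The site's structure may have changed
- Check the log files for detailed error messages
- Update the parsing logic
Database errors
- Check the permissions on the SQLite file
- Check the available disk space
- Reinitialize the database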

Viewing logs

Log files are saved in the logs/ directory, organized by date and module:

main_YYYYMMDD.log: main program log
zol_crawler_YYYYMMDD.log: ZOL crawler log
jd_crawler_YYYYMMDD.log: JD crawler log
xhs_crawler_YYYYMMDD.log: Xiaohongshu crawler log
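
The naming scheme above can be reproduced with the standard logging module. The following is a sketch under assumptions, not the contents of utils/logger.py: the helpers log_path and get_logger, the message format, and the INFO level are all hypothetical.

```python
import logging
from datetime import date
from pathlib import Path
from typing import Optional


def log_path(module: str, log_dir: str = "logs", day: Optional[date] = None) -> Path:
    """Build the <module>_YYYYMMDD.log path for the given day (today by default)."""
    day = day or date.today()
    return Path(log_dir) / f"{module}_{day:%Y%m%d}.log"


def get_logger(module: str, log_dir: str = "logs") -> logging.Logger:
    """Return a logger that appends to today's log file for `module`."""
    path = log_path(module, log_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(module)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeat calls
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

With this layout, get_logger("zol_crawler").info("page 1 done") would append a line to logs/zol_crawler_YYYYMMDD.log for the current date.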

Contributing

Issues and pull requests that improve this project are welcome!

License

MIT License
