Dangdang.com Crawler
I iterate over all of Dangdang's top-level categories and crawl the product listings under each one. It is a simple crawl: I did not drill down into subcategories to fetch every single product.
Below is the crawler's spider:
import scrapy

from dangdang.items import DangdangItem


class SpiderSpider(scrapy.Spider):
    name = "spider"
    allowed_domains = ['www.dangdang.com']  # fixed typo: was "allowed_domians"

    def start_requests(self):
        # Entry point: Dangdang's category index page.
        start_url = 'http://category.dangdang.com/?ref=www-0-C'
        yield scrapy.Request(url=start_url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        # Collect every link in the left-hand category menu.
        menu_list = response.xpath('//*[contains(@class,"classify_left")]//a')
        for menu in menu_list:
            url = menu.xpath('@href').extract_first()
            title = menu.xpath('text()').extract_first()  # category name (unused)
            # Skip missing hrefs and javascript pseudo-links.
            if url and 'javascript' not in url:
                yield scrapy.Request(url=url, callback=self.parse_cid, dont_filter=True)

    def parse_cid(self, response):
        print('打印cid')  # debug marker ("printing cid")
        try:
            item = DangdangItem()
            # Each <li> under the search-results div is one product.
            note_list = response.xpath('//div[contains(@id,"search_nature_rg")]/ul/li')
            for note in note_list:
                name = note.xpath('./p[contains(@class,"name")]/a/text()').extract_first()
                price = note.xpath('./p[@class="price"]/span/text()').extract_first()
                level = note.xpath('./p[contains(@class,"star")]/a/text()').extract_first()
                shop = note.xpath('./p[contains(@class,"link")]/a/text()').extract_first()
                if not shop:
                    shop = u'當當自營'  # default: "Dangdang self-operated"
                # Copy each local variable into the item field of the same
                # name; relies on DangdangItem declaring matching fields.
                for field in item.fields:
                    item[field] = eval(field)
                yield item
            # Follow the "next page" arrow, which holds a relative link.
            next_href = response.xpath('//a[contains(@class,"arrow_r")]/@href').extract_first()
            next_page = 'http://category.dangdang.com/' + next_href
            if 'javascript:void(0);' not in next_page:
                yield scrapy.Request(url=next_page, callback=self.parse_cid, dont_filter=True)
        except:
            # Swallows all errors, including the missing next-page link on
            # the last page; crude, but it matches the original behavior.
            pass
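The eval(field) trick only works if DangdangItem declares fields whose names exactly match the local variables in parse_cid. The original post doesn't show dangdang/items.py; here is a minimal sketch of what it presumably looks like, inferred from the four fields the spider and pipeline use:

import scrapy

class DangdangItem(scrapy.Item):
    # Field names must match the local variable names in parse_cid,
    # since the spider copies them over with eval(field).
    name = scrapy.Field()   # product title
    price = scrapy.Field()  # listed price text
    level = scrapy.Field()  # star-rating / review text
    shop = scrapy.Field()   # shop name; defaults to 當當自營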
Below is the crawler's pipeline:
import MySQLdb

# The star import pulls in MYSQL_HOST, MYSQL_USER, MYSQL_PASSWD,
# MYSQL_DBNAME and table_name from the project settings.
from dangdang.settings import *


class DangdangPipeline(object):
    def __init__(self):
        self.item_array = []
        self.db = MySQLdb.connect(MYSQL_HOST, MYSQL_USER, MYSQL_PASSWD,
                                  MYSQL_DBNAME, charset='utf8mb4',
                                  use_unicode=True)
        self.cursor = self.db.cursor()
        self.insert_sql = """insert into {table_name}
            (name, price, level, shop)
            VALUES (%s, %s, %s, %s)""".format(table_name=table_name)

    def process_item(self, item, spider):
        params = (item['name'], item['price'], item['level'], item['shop'])
        # Buffer the row, flush the buffer with executemany, then commit.
        # As written, the buffer is cleared after every item, so each call
        # effectively inserts a single row.
        self.item_array.append(params)
        self.cursor.executemany(self.insert_sql, self.item_array)
        self.db.commit()
        self.item_array = []
        return item
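The pipeline star-imports its connection constants and table_name from the project settings, and Scrapy only invokes it if it is registered there. None of this is shown in the original post; the sketch below uses placeholder values, and the table schema in the comment is only an assumption matching the INSERT statement:

# dangdang/settings.py (relevant excerpt; all values are placeholders)
MYSQL_HOST = 'localhost'
MYSQL_USER = 'root'
MYSQL_PASSWD = 'secret'
MYSQL_DBNAME = 'dangdang'
table_name = 'products'  # interpolated into DangdangPipeline's INSERT

# Register the pipeline so process_item is actually called.
ITEM_PIPELINES = {
    'dangdang.pipelines.DangdangPipeline': 300,
}

# The INSERT assumes a table shaped roughly like:
#   CREATE TABLE products (
#       name  VARCHAR(255),
#       price VARCHAR(64),
#       level VARCHAR(64),
#       shop  VARCHAR(255)
#   ) DEFAULT CHARSET = utf8mb4;

With these in place, the crawl would be started with scrapy crawl spider (the spider's name attribute is "spider").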
Finally, the run statistics (screenshot in the original post): a single machine, without running to completion, collected roughly 100,000 records.
Summary

A simple Scrapy crawler: it walks Dangdang's top-level categories, scrapes each listing's name, price, rating, and shop, and stores the results in MySQL through a MySQLdb pipeline.