

Scraping JD.com laptop data with Python and Scrapy, followed by simple processing and analysis

Views: 174 Date: 2022-06-22 14:32:32

1. Environment setup

Python 3.8.3

PyCharm

Third-party packages required by the project:

pip install scrapy fake-useragent requests selenium virtualenv -i https://pypi.douban.com/simple

1.1 Create a virtual environment

Change to the target directory and create it there:

virtualenv .venv

Remember to activate the virtual environment once it has been created.

1.2 Create the project

scrapy startproject <project-name>

1.3 Open the project in PyCharm and configure it to use the virtual environment you created

1.4 Create the JD spider

scrapy genspider <spider-name> <url>

1.5 Edit the allowed domains and remove the https: prefix
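
Scrapy's allowed_domains expects bare host names rather than full URLs, which is why the scheme is removed in step 1.5. A minimal sketch of that normalization using only the standard library; the helper name to_allowed_domain is hypothetical and not part of Scrapy:

```python
from urllib.parse import urlparse


def to_allowed_domain(value):
    """Return only the host part of a URL or of an already-bare domain."""
    # urlparse fills in netloc only when the input has a scheme or a leading '//'.
    if '://' not in value:
        value = '//' + value
    return urlparse(value).netloc


allowed_domains = [to_allowed_domain(v) for v in ('https://search.jd.com', 'item.jd.com')]
print(allowed_domains)
```

The resulting list has the same form as the allowed_domains attribute of the spider in section 3.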

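2. Problem analysis

The approach is to collect the basic information from the search results pages first, and then fetch the detailed product information from each detail page. When scraping JD, a plain request returns only 40 items, so the author uses selenium inside a Scrapy downloader middleware, which scrolls the page and returns the complete page content. The fields scraped are: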

Product price

Number of product reviews

Shop (seller)

Product SKU (it can be searched directly on JD to find the corresponding product)

Product title

Product details

3. Spider
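
The search URLs used by the spider in this section carry the keyword 笔记本电脑 (laptop) in percent-encoded form, and later pages differ only in the page parameter. A small sketch of how such URLs could be assembled with the standard library; build_search_url is a hypothetical helper, and only the keyword, wq and page parameters from the article's URL are reproduced:

```python
from urllib.parse import urlencode

SEARCH_BASE = 'https://search.jd.com/Search'


def build_search_url(keyword, page):
    """Percent-encode the keyword (as UTF-8) and attach the page number."""
    params = {'keyword': keyword, 'wq': keyword, 'page': page}
    return f'{SEARCH_BASE}?{urlencode(params)}'


# Same page numbers as the loop in parse(): 2, 4, ..., 198
page_urls = [build_search_url('笔记本电脑', page) for page in range(2, 200, 2)]
print(page_urls[0])
```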

import re

import scrapy

from lianjia.items import jd_detailItem


class JiComputerDetailSpider(scrapy.Spider):
    name = 'ji_computer_detail'
    allowed_domains = ['search.jd.com', 'item.jd.com']
    start_urls = ['https://search.jd.com/Search?keyword=%E7%AC%94%E8%AE%B0%E6%9C%AC%E7%94%B5%E8%84%91&suggest=1.def.0.base&wq=%E7%AC%94%E8%AE%B0%E6%9C%AC%E7%94%B5%E8%84%91&page=1&s=1&click=0']

    def parse(self, response):
        # Each <li> in the result list is one product
        lls = response.xpath("//ul[@class='gl-warp clearfix']/li")
        for ll in lls:
            item = jd_detailItem()
            item['computer_price'] = ll.xpath(".//div[@class='p-price']/strong/i/text()").extract_first()
            item['computer_commit'] = ll.xpath(".//div[@class='p-commit']/strong/a/text()").extract_first()
            item['computer_p_shop'] = ll.xpath(".//div[@class='p-shop']/span/a/text()").extract_first()
            meta = {'item': item}
            shop_detail_url = ll.xpath(".//div[@class='p-img']/a/@href").extract_first()
            if not shop_detail_url:
                # No detail link in this entry, so there is nothing to follow
                continue
            # The href is protocol-relative (starts with //), so add the scheme
            shop_detail_url = 'https:' + shop_detail_url
            yield scrapy.Request(url=shop_detail_url, callback=self.detail_parse, meta=meta)

        # Queue the remaining result pages; Scrapy filters out duplicate requests
        for i in range(2, 200, 2):
            next_page_url = f'https://search.jd.com/Search?keyword=%E7%AC%94%E8%AE%B0%E6%9C%AC%E7%94%B5%E8%84%91&suggest=1.def.0.base&wq=%E7%AC%94%E8%AE%B0%E6%9C%AC%E7%94%B5%E8%84%91&page={i}&s=116&click=0'
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def detail_parse(self, response):
        item = response.meta.get('item')
        item['computer_sku'] = response.xpath("//a[@class='notice J-notify-sale']/@data-sku").extract_first()
        # Keep only non-whitespace characters of the title and of the parameter list
        computer_title = response.xpath("//div[@class='sku-name']/text()").extract_first(default='').strip()
        item['computer_title'] = ''.join(re.findall(r'\S', computer_title))
        computer_detail = response.xpath("string(//ul[@class='parameter2 p-parameter-list'])").extract_first(default='').strip()
        item['computer_detail'] = ''.join(re.findall(r'\S', computer_detail))
        yield item

4. Item

import scrapy


class jd_detailItem(scrapy.Item):
    # Fields collected for each laptop
    computer_sku = scrapy.Field()
    computer_price = scrapy.Field()
    computer_title = scrapy.Field()
    computer_commit = scrapy.Field()
    computer_p_shop = scrapy.Field()
    computer_detail = scrapy.Field()

5. Settings
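
In the settings that follow, DOWNLOAD_DELAY = random.uniform(0.5, 1) is evaluated once, when the settings module is imported, so every request in a run shares that one base value. Scrapy's RANDOMIZE_DOWNLOAD_DELAY setting, enabled by default, then waits between 0.5 and 1.5 times DOWNLOAD_DELAY for each request. The sketch below only simulates that arithmetic in plain Python; simulated_wait is a hypothetical name, not a Scrapy function:

```python
import random

# Evaluated a single time, like a module-level line in settings.py
DOWNLOAD_DELAY = random.uniform(0.5, 1)


def simulated_wait(base_delay):
    """Mimic a per-request wait of 0.5x to 1.5x the base delay."""
    return random.uniform(0.5 * base_delay, 1.5 * base_delay)


waits = [simulated_wait(DOWNLOAD_DELAY) for _ in range(5)]
print(f'base={DOWNLOAD_DELAY:.2f}', [round(w, 2) for w in waits])
```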

import random

from fake_useragent import UserAgent

ua = UserAgent()
USER_AGENT = ua.random  # chosen once, when the settings are loaded
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = random.uniform(0.5, 1)
DOWNLOADER_MIDDLEWARES = {
    'lianjia.middlewares.jdDownloaderMiddleware': 543
}
ITEM_PIPELINES = {
    'lianjia.pipelines.jd_csv_Pipeline': 300
}

6. Pipelines
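
The pipeline in this section writes one tab-separated line per item. As a standalone sketch of the same idea, the standard-library csv module can keep the header and the rows in one fixed column order and writes missing values as empty cells. This is an alternative under stated assumptions, not the article's pipeline; write_rows is a hypothetical helper and plain dicts stand in for Scrapy items:

```python
import csv
import io

FIELDS = ['computer_sku', 'computer_title', 'computer_p_shop',
          'computer_price', 'computer_commit', 'computer_detail']


def write_rows(fp, items):
    """Write a header plus one tab-separated row per item, always in FIELDS order."""
    writer = csv.DictWriter(fp, fieldnames=FIELDS, delimiter='\t',
                            lineterminator='\n', extrasaction='ignore')
    writer.writeheader()
    for item in items:
        # Missing keys and None values both end up as empty cells
        writer.writerow(dict(item))


# Made-up sample item, for illustration only
demo = io.StringIO()
write_rows(demo, [{'computer_price': '4999', 'computer_sku': '100', 'computer_title': None}])
print(demo.getvalue())
```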

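class jd_csv_Pipeline:
    # Column order shared by the header row and every data row
    fields = ['computer_sku', 'computer_title', 'computer_p_shop',
              'computer_price', 'computer_commit', 'computer_detail']

    def open_spider(self, spider):
        # Despite the .xlsx extension, the file holds tab-separated text
        self.fp = open('./jd_computer_message.xlsx', mode='w+', encoding='utf-8')
        self.fp.write('\t'.join(self.fields) + '\n')

    def process_item(self, item, spider):
        # Write one line per item
        try:
            line = '\t'.join(item[field] for field in self.fields) + '\n'
        except (KeyError, TypeError):
            # A field is missing or None: skip the row
            return item
        self.fp.write(line)
        return item

    def close_spider(self, spider):
        # Close the file
        self.fp.close()

7. Middlewares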
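
The middleware in this section renders only the search result pages in a browser and lets requests for item.jd.com detail pages go through Scrapy's normal downloader. The host test can be written without a regular expression; a sketch, where needs_browser is a hypothetical helper name and the detail URL is made up for illustration:

```python
from urllib.parse import urlparse

DETAIL_HOST = 'item.jd.com'


def needs_browser(url):
    """True for pages that should be rendered by selenium (anything but detail pages)."""
    return urlparse(url).netloc != DETAIL_HOST


print(needs_browser('https://search.jd.com/Search?page=1'))
print(needs_browser('https://item.jd.com/100.html'))  # illustrative URL
```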

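import re
import time

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver import ChromeOptions


class jdDownloaderMiddleware:
    def process_request(self, request, spider):
        # Only for the ji_computer_detail spider, and only for search result pages
        # (requests to item.jd.com detail pages are left to Scrapy)
        if spider.name == 'ji_computer_detail' and not re.search(r'item\.jd\.com', request.url):
            options = ChromeOptions()
            options.add_argument('--headless')
            driver = webdriver.Chrome(options=options)
            try:
                driver.get(request.url)
                # Scroll down in steps so the lazily loaded products are rendered
                for i in range(0, 15000, 5000):
                    driver.execute_script(f'window.scrollTo(0, {i})')
                    time.sleep(0.5)
                time.sleep(1)
                body = driver.page_source.encode()
            finally:
                # Always close the browser, otherwise one Chrome process leaks per page
                driver.quit()
            return HtmlResponse(url=request.url, body=body, encoding='utf-8', request=request)
        return None

8. Simple processing and analysis with Jupyter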

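Other files: the Baidu stop-word list and a Simplified Chinese font file.

Install the third-party packages:

!pip install seaborn jieba wordcloud Pillow imageio python-docx -i https://pypi.douban.com/simple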

8.1 Import the third-party packages

import re
import os
import jieba
import wordcloud
import pandas as pd
import numpy as np
from PIL import Image
import seaborn as sns
from docx import Document
from docx.shared import Inches
import matplotlib.pyplot as plt
from pandas import DataFrame, Series

8.2 Set the default plotting font and the seaborn style

sns.set_style('darkgrid')
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

8.3 Read the data

df_jp = pd.read_excel('./jd_shop.xlsx')

8.4 Extract the Intel i5, i7 and i9 processor models

def convert_one(s):
    # Return 'i5', 'i7' or 'i9' (checked in that order) if the detail text mentions it, else None
    for model in ('i5', 'i7', 'i9'):
        if model in str(s):
            return model

df_jp['computer_intel'] = df_jp['computer_detail'].map(convert_one)
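
Looking for the bare text i5 also matches unrelated tokens such as WiFi5. One possible refinement, offered as a sketch rather than the author's method, is a single pattern with a lookbehind that rejects a preceding Latin letter. extract_cpu is a hypothetical name and the sample strings are made up. Two trade-offs apply: text like Inteli5 with no separator would no longer match, and the first mention wins rather than i5 always taking priority.

```python
import re

# 'i' followed by 5, 7 or 9, but not directly after a Latin letter
CPU_PATTERN = re.compile(r'(?<![A-Za-z])i[579]')


def extract_cpu(detail):
    """Return the first i5/i7/i9 mention in the text, or None."""
    match = CPU_PATTERN.search(str(detail))
    return match.group(0) if match else None


print(extract_cpu('处理器:酷睿i7-10750H'))  # illustrative input
```

It can be applied in the same way, for example df_jp['computer_detail'].map(extract_cpu).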

8.5 Extract the laptop screen-size range

def convert_two(s):
    # Return the size range, e.g. '15.0英寸-15.9英寸', or None if it is absent
    found = re.findall(r'(\d+\.\d+英寸-\d+\.\d+英寸)', str(s))
    if found:
        return found[0]

df_jp['computer_in'] = df_jp['computer_detail'].map(convert_two)
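
Keeping the size range as text means later filters must compare against exact strings such as '15.0英寸-15.9英寸'. If numeric comparisons are wanted, the two bounds can be captured as floats instead. A sketch with a hypothetical helper screen_range; the sample string is made up for illustration:

```python
import re

# Two decimal numbers, each followed by 英寸 (inches), joined by a hyphen
SIZE_PATTERN = re.compile(r'(\d+\.\d+)英寸-(\d+\.\d+)英寸')


def screen_range(detail):
    """Return (low, high) in inches as floats, or None when no range is present."""
    match = SIZE_PATTERN.search(str(detail))
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


print(screen_range('屏幕尺寸:15.0英寸-15.9英寸'))
```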

8.6 Convert the review counts to integers

def convert_three(s):
    # '2万+' -> 20000, '500+' -> 500, anything else -> None
    found = re.findall(r'(\d+)万\+', str(s))
    if found:
        return int(found[0]) * 10000
    found = re.findall(r'(\d+)\+', str(s))
    if found:
        return int(found[0])

df_jp['computer_commit'] = df_jp['computer_commit'].map(convert_three)
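
The conversion is a small piece of arithmetic: a count followed by 万 is multiplied by 10000, otherwise the digits are taken as they are. The sketch below is a variant that also accepts a decimal before 万 (for example 1.5万+), which an integer-only pattern would misread as 5万. Whether JD actually emits such values is an assumption, and parse_count is a hypothetical name:

```python
import re

WAN_PATTERN = re.compile(r'(\d+(?:\.\d+)?)万')
PLAIN_PATTERN = re.compile(r'\d+')


def parse_count(text):
    """Convert a review-count label to an int; return None when it has no digits."""
    text = str(text)
    match = WAN_PATTERN.search(text)
    if match:
        # round() guards against float artifacts such as 22999.999999999996
        return round(float(match.group(1)) * 10000)
    match = PLAIN_PATTERN.search(text)
    return int(match.group(0)) if match else None


for label in ('2万+', '1.5万+', '500+'):
    print(label, parse_count(label))
```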

8.7 Select the brands to analyze

# Lenovo, HP, Huawei, Dell, ASUS, Xiaomi, Honor, Hasee, Alienware
BRANDS = ['联想', '惠普', '华为', '戴尔', '华硕', '小米', '荣耀', '神舟', '外星人']

def convert(s):
    # Return the first listed brand that appears in the shop name, else None
    for brand in BRANDS:
        if brand in str(s):
            return brand

df_jp['computer_p_shop'] = df_jp['computer_p_shop'].map(convert)

8.8 Drop the rows that have a null value in any of the specified fields

df_jp.dropna(subset=['computer_price', 'computer_commit', 'computer_p_shop', 'computer_sku', 'computer_detail', 'computer_intel', 'computer_in'], inplace=True)

8.9 View the average price of each brand

# Mean price per brand, computed once and reused for the bars and the labels
brand_mean = df_jp.groupby(by='computer_p_shop')[['computer_price']].mean().reset_index()
plt.figure(figsize=(10, 8), dpi=100)
ax = sns.barplot(x='computer_p_shop', y='computer_price', data=brand_mean)
for index, row in brand_mean.iterrows():
    # Print the rounded mean just above each bar
    ax.text(row.name, row['computer_price'] + 2, round(row['computer_price'], 2), color='black', ha='center')
ax.set_xlabel('品牌')  # brand
ax.set_ylabel('平均价格')  # average price
ax.set_title('各品牌平均价格')  # average price by brand
boxplot_fig = ax.get_figure()
boxplot_fig.savefig('各品牌平均价格.png', dpi=400)
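
What groupby(...).mean() computes here is simply the arithmetic mean of the prices within each brand. A plain-Python sketch of that calculation; mean_price_by_brand is a hypothetical helper and the prices below are illustrative, not scraped data:

```python
from collections import defaultdict
from statistics import mean


def mean_price_by_brand(rows):
    """rows is an iterable of (brand, price) pairs; return {brand: mean price}."""
    prices = defaultdict(list)
    for brand, price in rows:
        prices[brand].append(price)
    return {brand: round(mean(values), 2) for brand, values in prices.items()}


# Made-up prices for illustration
sample = [('联想', 4000), ('联想', 5000), ('戴尔', 6000)]
print(mean_price_by_brand(sample))
```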


8.10 View the price range of each brand

plt.figure(figsize=(10, 8), dpi=100)
# Ignore listings priced at 500 or below
ax = sns.boxenplot(x='computer_p_shop', y='computer_price', data=df_jp.query('computer_price>500'))
ax.set_xlabel('品牌')  # brand
ax.set_ylabel('价格区间')  # price range
ax.set_title('各品牌价格区间')  # price range by brand
boxplot_fig = ax.get_figure()
boxplot_fig.savefig('各品牌价格区间.png', dpi=400)


8.11 View the relationship between price and review count

df_jp['computer_commit'] = df_jp['computer_commit'].astype('int64')
# Scatter plot with a regression line: review count against price
ax = sns.jointplot(x='computer_commit', y='computer_price', data=df_jp, kind='reg', truncate=False, color='m', height=10)
ax.fig.savefig('评论数与价格的关系.png')


8.12 View the keywords that appear in the product titles

import imageio

# Convert the title column to a list
ls = df_jp['computer_title'].to_list()
# Replace every run of characters that is neither a Latin letter nor a Chinese character
feature_points = [re.sub(r'[^a-zA-Z\u4E00-\u9FA5]+', ' ', str(feature)) for feature in ls]
# Load the stop words
stop_world = list(pd.read_csv('./百度停用词表.txt', engine='python', encoding='utf-8', names=['stopwords'])['stopwords'])
feature_points2 = []
for feature in feature_points:  # loop over the titles
    words = jieba.lcut(feature)  # precise mode, no redundant tokens
    ind1 = np.array([len(word) > 1 for word in words], dtype=bool)  # is the token longer than one character?
    ser1 = pd.Series(words)
    ser2 = ser1[ind1]  # keep the tokens longer than one character
    ind2 = ~ser2.isin(stop_world)  # note the negation
    ser3 = ser2[ind2].unique()  # keep the tokens that are not stop words, without duplicates
    if len(ser3) > 0:
        feature_points2.append(list(ser3))
# Collect all tokens in one list
wordlist = [word for feature in feature_points2 for word in feature]
# Join all tokens into a single string
feature_str = ' '.join(wordlist)
# Word cloud of the titles
font_path = r'./simhei.ttf'
shoes_box_jpg = imageio.imread('./home.jpg')
wc = wordcloud.WordCloud(
    background_color='black',
    mask=shoes_box_jpg,
    font_path=font_path,
    min_font_size=5,
    max_font_size=50,
    width=260,
    height=260,
)
wc.generate(feature_str)
plt.figure(figsize=(10, 8), dpi=100)
plt.imshow(wc)
plt.axis('off')
plt.savefig('标题提取关键词')
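
The filtering steps in the loop above (drop one-character tokens, drop stop words, count each word once per title) can be sketched without pandas. Tokenization is assumed to have been done already, for example by jieba.lcut; clean_title and keyword_counts are hypothetical helper names and the sample titles are made up:

```python
import re
from collections import Counter

# Anything that is neither a Latin letter nor a character in U+4E00..U+9FA5
NON_WORD = re.compile(r'[^a-zA-Z\u4E00-\u9FA5]+')


def clean_title(title):
    """Collapse every run of digits, punctuation and spaces into one space."""
    return NON_WORD.sub(' ', str(title)).strip()


def keyword_counts(tokenized_titles, stop_words):
    """Count in how many titles each kept token appears."""
    counts = Counter()
    for words in tokenized_titles:
        # A set removes duplicates within one title, like unique() in the loop above
        kept = {w for w in words if len(w) > 1 and w not in stop_words}
        counts.update(kept)
    return counts


print(clean_title('联想(Lenovo)小新Air14 2021款'))
print(keyword_counts([['联想', '轻薄', '本', '的'], ['联想', '游戏']], {'游戏'}))
```

The resulting counts could also be passed to WordCloud.generate_from_frequencies instead of joining the tokens into one string.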


8.13 Filter the Lenovo laptops priced between 4000 and 5000 with an i5 processor and a 15-inch screen (the 15.0 to 15.9 inch range), and view their prices

df_jd_query = df_jp.loc[(df_jp['computer_price'] <= 5000) & (df_jp['computer_price'] >= 4000) & (df_jp['computer_p_shop'] == '联想') & (df_jp['computer_intel'] == 'i5') & (df_jp['computer_in'] == '15.0英寸-15.9英寸'), :].copy()
plt.figure(figsize=(20, 10), dpi=100)
ax = sns.barplot(x='computer_sku', y='computer_price', data=df_jd_query)
ax.set_xlabel('联想品牌SKU')  # Lenovo SKU
ax.set_ylabel('价格')  # price
ax.set_title('酷睿i5处理器屏幕15寸以上各SKU的价格')
boxplot_fig = ax.get_figure()
boxplot_fig.savefig('酷睿i5处理器屏幕15寸以上各SKU的价格.png', dpi=400)


8.14 Filter the Dell laptops priced between 4000 and 5000 with an i7 processor and a 15-inch screen (the 15.0 to 15.9 inch range), and view their prices

df_jp_daier = df_jp.loc[(df_jp['computer_price'] <= 5000) & (df_jp['computer_price'] >= 4000) & (df_jp['computer_p_shop'] == '戴尔') & (df_jp['computer_intel'] == 'i7') & (df_jp['computer_in'] == '15.0英寸-15.9英寸'), :].copy()
plt.figure(figsize=(10, 8), dpi=100)
ax = sns.barplot(x='computer_sku', y='computer_price', data=df_jp_daier)
ax.set_xlabel('戴尔品牌SKU')  # Dell SKU
ax.set_ylabel('价格')  # price
ax.set_title('酷睿i7处理器屏幕15寸以上各SKU的价格')
boxplot_fig = ax.get_figure()
boxplot_fig.savefig('酷睿i7处理器屏幕15寸以上各SKU的价格.png', dpi=400)


8.15 Prices by brand for the different Intel processors

plt.figure(figsize=(10, 8), dpi=100)
# One bar per brand and processor model
ax = sns.barplot(x='computer_p_shop', y='computer_price', data=df_jp, hue='computer_intel')
ax.set_xlabel('品牌')  # brand
ax.set_ylabel('价格')  # price
ax.set_title('不同酷睿处理器品牌的价格')
boxplot_fig = ax.get_figure()
boxplot_fig.savefig('不同酷睿处理器品牌的价格.png', dpi=400)


8.16 Prices by brand for the different screen sizes

plt.figure(figsize=(10, 8), dpi=100)
# One bar per brand and screen-size range
ax = sns.barplot(x='computer_p_shop', y='computer_price', data=df_jp, hue='computer_in')
ax.set_xlabel('品牌')  # brand
ax.set_ylabel('价格')  # price
ax.set_title('不同尺寸品牌的价格')
boxplot_fig = ax.get_figure()
boxplot_fig.savefig('不同尺寸品牌的价格.png', dpi=400)


