
Implementing a General-Purpose Crawler Parser with Recursion in Python

Date: 2022-06-22 13:07:30

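When writing crawlers, apart from working around anti-scraping measures, we spend almost all of our time writing parsing logic. Life is short, so why not write a general-purpose parser? Exactly, why not? Let's get started!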

Requirements analysis

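The pages a crawler has to parse are almost always HTML, JSON, or some binary file (video, Excel files, and so on). To build a general-purpose parser there are two possible approaches. The first is to convert all page content into one uniform format and then parse it with the matching rules, for example converting everything to HTML and extracting with XPath.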

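The second is to declare the type up front in a configuration file: whatever type you configure, the parser applies the corresponding parsing rules.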

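Unifying the page format requires a large amount of format conversion, whereas declaring the type in configuration requires specifying more fields for each parsing rule. By comparison, with the second approach what mostly changes in the future is the configuration, not the core code, so the risk of introducing bugs is lower. We therefore implement the parser the second way.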
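As a rough illustration of the configuration-driven approach, the sketch below selects an extractor from the configured parse method. The names `EXTRACTORS`, `extract_json`, `extract_xpath` and `extract` are hypothetical and not part of the original project, and the standard library's `xml.etree.ElementTree` (which only supports a subset of XPath and needs well-formed markup) stands in for a real HTML parser.

```python
import json
import xml.etree.ElementTree as ET


def extract_json(content: str, rule: str):
    """Follow a dotted key path such as 'data.title' through parsed JSON."""
    node = json.loads(content)
    for key in rule.split('.'):
        node = node[key]
    return node


def extract_xpath(content: str, rule: str):
    """Return the text of the first element matching a (limited) XPath rule."""
    root = ET.fromstring(content)
    return root.findtext(rule)


# Hypothetical registry: the configured parse method picks the extractor,
# so supporting a new page type means adding an entry, not editing callers.
EXTRACTORS = {
    'json': extract_json,
    'xpath': extract_xpath,
}


def extract(content: str, parse_method: str, parse_rule: str):
    """Dispatch to the extractor named in the configuration."""
    if parse_method not in EXTRACTORS:
        raise ValueError(f'unsupported parse method: {parse_method!r}')
    return EXTRACTORS[parse_method](content, parse_rule)


json_title = extract('{"data": {"title": "hello"}}', 'json', 'data.title')
html_title = extract('<html><body><h1>hello</h1></body></html>', 'xpath', './body/h1')
print(json_title, html_title)
```

With this shape, the core `extract` function stays untouched when a new type is added; only the registry and the configuration change.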

Further analysis

For a parser, extracting content from a web page is essentially no different from finding and organizing files on a local computer. For example, like this:

[Figure: example of files on a local computer]

Parsing means extracting the information we want and arranging it in the format we want. For the content above, the extracted result should look like this:

    {
        'design': '設計圖.psd',
        'software': 'sketch.dmg'
    }

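In real crawler development, page structures are far more complex than this. The problem we run into most often is a list nested inside another record or list, which we need to extract. For example: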

    {
        'a': 'a',
        'b': [
            {'c': 'c1', 'd': 'd1'},
            {'c': 'c2', 'd': 'd2'}
        ]
    }

After extraction it should look like this:

    [
        {'a': 'a', 'c': 'c1', 'd': 'd1'},
        {'a': 'a', 'c': 'c2', 'd': 'd2'}
    ]
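To make the target transformation concrete before any rules are involved, here is a minimal sketch that flattens such a record directly. The helper names `flatten` and `_is_dict_list` are made up for illustration; the sketch simply treats every non-empty list of dicts as a nested level and copies the outer fields into each inner row.

```python
def _is_dict_list(value) -> bool:
    """True for a non-empty list whose elements are all dicts."""
    return isinstance(value, list) and bool(value) and all(isinstance(v, dict) for v in value)


def flatten(record: dict) -> list:
    """Expand every list-of-dicts field, copying the plain fields into each row."""
    plain = {k: v for k, v in record.items() if not _is_dict_list(v)}
    rows = [plain]
    for value in record.values():
        if _is_dict_list(value):
            # Each inner item may itself contain nested lists, hence the recursion.
            rows = [
                {**row, **child}
                for row in rows
                for item in value
                for child in flatten(item)
            ]
    return rows


source = {'a': 'a', 'b': [{'c': 'c1', 'd': 'd1'}, {'c': 'c2', 'd': 'd2'}]}
rows = flatten(source)
for row in rows:
    print(row)
```

This hard-codes the structure detection, which is exactly what the rule-based design in the rest of the article avoids by declaring each level in configuration.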

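If you are familiar with algorithms, you will have noticed that this kind of traversal is very convenient to write recursively. Note, however, that Python limits the recursion depth. You can check the limit as follows: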

    import sys

    print(sys.getrecursionlimit())
    # 1000

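On my machine the limit is 1,000, which is more than enough for parsing web pages. If someone has nested the parsing logic of a page 1,000 levels deep, I suggest you go straight to your boss and propose dropping that page!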
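To see how far typical data is from that limit, the following sketch measures the nesting depth of a structure and then shows what happens when a recursive walk goes too deep. The helper `nesting_depth` is a hypothetical name introduced only for this illustration.

```python
import sys


def nesting_depth(value) -> int:
    """Return how many levels of dict/list nesting `value` contains."""
    if isinstance(value, dict):
        children = list(value.values())
    elif isinstance(value, list):
        children = value
    else:
        return 0
    return 1 + max([nesting_depth(child) for child in children], default=0)


sample = {'a': 'a', 'b': [{'c': 'c1', 'd': 'd1'}, {'c': 'c2', 'd': 'd2'}]}
depth = nesting_depth(sample)
limit = sys.getrecursionlimit()
print(f'nesting depth: {depth}, recursion limit: {limit}')

# Build a list nested deeper than the limit to show the failure mode.
deep = []
for _ in range(limit * 2):
    deep = [deep]

try:
    nesting_depth(deep)
    too_deep = False
except RecursionError:
    too_deep = True
print('recursive walk failed on the deep structure:', too_deep)
```

The sample record is only a few levels deep, so a recursive parser stays far below the limit on realistic pages.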

Going one step further

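We now know that general-purpose parsing means extracting the corresponding information from a page through configured parsing rules, and that pages with list hierarchies may also involve recursive traversal. So how do we configure such rules? It is actually simple: before entering each level, declare the data form of that level. Take this source data: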

    {
        'a': 'a',
        'b': [
            {'c': 'c1', 'd': 'd1'},
            {'c': 'c2', 'd': 'd2'}
        ]
    }

To extract the nested information, our parsing rules should look like this:

    [
        {
            '$name': 'a',
            '$value_type': 'raw',
            '$parse_method': 'json',
            '$parse_rule': 'a',
            '$each': []
        },
        {
            '$name': '__datas__',
            '$value_type': 'recursion',
            '$parse_method': 'json',
            '$parse_rule': 'b',
            '$each': [
                {
                    '$name': 'c',
                    '$value_type': 'raw',
                    '$parse_method': 'json',
                    '$parse_rule': 'c',
                    '$each': []
                },
                {
                    '$name': 'd',
                    '$value_type': 'raw',
                    '$parse_method': 'json',
                    '$parse_rule': 'd',
                    '$each': []
                }
            ]
        }
    ]

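The $name field is the field name we want in the final, outermost data. For a field that needs to recurse into an inner level, the list is stored under __datas__, and the inner structure is then parsed from that __datas__. In the rules above, $parse_rule is the path used to locate the value, and $each holds the rules applied to every element of a recursion field. The data structure we end up with should look like this: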

    [
        {'a': 'a', 'c': 'c1', 'd': 'd1'},
        {'a': 'a', 'c': 'c2', 'd': 'd2'}
    ]

So far we have only shown the parsing rules for JSON. What about parsing an HTML object? Simple: change the parse method to xpath and pass in an XPath expression as the parse rule.
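The article's code only implements the json method, so the following is a speculative sketch of what an xpath variant of the packing step could look like. It is not the author's implementation: `pack_xpath` is a hypothetical name, and the standard library's `xml.etree.ElementTree` is used as a stand-in, which supports only a subset of XPath and requires well-formed markup. A real project would more likely use a dedicated HTML library such as lxml.

```python
import xml.etree.ElementTree as ET


def pack_xpath(node, rules):
    """Apply xpath-style rules to an element, mirroring the json rule layout."""
    item = {}
    for rule in rules:
        if rule.get('$parse_method') != 'xpath':
            continue  # this sketch handles only xpath rules
        if rule.get('$value_type') == 'raw':
            # Text of the first element that matches the rule.
            item[rule['$name']] = node.findtext(rule['$parse_rule'])
        elif rule.get('$value_type') == 'recursion':
            # Every matching element becomes one inner record.
            children = node.findall(rule['$parse_rule'])
            item[rule['$name']] = [pack_xpath(c, rule['$each']) for c in children]
    return item


def xrule(name, value_type, path, each=()):
    """Hypothetical helper that builds one rule dict in the article's format."""
    return {'$name': name, '$value_type': value_type, '$parse_method': 'xpath',
            '$parse_rule': path, '$each': list(each)}


html = """
<div>
  <h1>a</h1>
  <ul>
    <li><span class="c">c1</span><span class="d">d1</span></li>
    <li><span class="c">c2</span><span class="d">d2</span></li>
  </ul>
</div>
""".strip()

rules = [
    xrule('a', 'raw', './h1'),
    xrule('__datas__', 'recursion', './ul/li', [
        xrule('c', 'raw', "./span[@class='c']"),
        xrule('d', 'raw', "./span[@class='d']"),
    ]),
]

packed = pack_xpath(ET.fromstring(html), rules)
print(packed)
```

The packed result has the same shape as in the json case, a dict with the plain fields plus a `__datas__` list, so the same unpacking step can flatten it.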

Code implementation

The implementation has two parts. The first part packs the raw result according to the rules, converting every field that involves recursion logic. The code is as follows:

    from glom import glom


    def _pack_json(result, rules):
        """Build one item from `result`, applying `rules` recursively."""
        item = {}
        for p_rule in rules:
            if p_rule.get('$value_type') == 'raw':
                if p_rule.get('$parse_method') == 'json':
                    # Plain field: copy the value found at the configured path.
                    item[p_rule.get('$name')] = glom(result, p_rule.get('$parse_rule'))
            elif p_rule.get('$value_type') == 'recursion':
                if p_rule.get('$parse_method') == 'json':
                    # Nested list: pack every element with the inner rules.
                    tmp_result = glom(result, p_rule.get('$parse_rule'))
                    total_result = []
                    for per_r in tmp_result:
                        total_result.append(_pack_json(per_r, p_rule.get('$each')))
                    item[p_rule.get('$name')] = total_result
        return item

The second part takes the packed result from the previous step and unpacks it, that is, lifts all nested data up to the outermost level. The code is as follows:

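    def _unpack_datas(result: dict) -> list:
        """Flatten a packed item so that nested '__datas__' rows reach the top level."""
        if '__datas__' not in result:
            return [result]
        item_results = []
        # Note: pop() mutates the packed input.
        all_item = result.pop('__datas__')
        for per_item in all_item:
            if '__datas__' in per_item:
                tmp_datas = per_item.pop('__datas__')
                for per_tmp_data in tmp_datas:
                    tmp_item = _unpack_datas(per_tmp_data)
                    for per_tmp_item in tmp_item:
                        # Merge the outer fields as well, so that top-level
                        # values are not lost when the nesting is deeper.
                        item_results.append({**result, **per_tmp_item, **per_item})
            else:
                item_results.append({**result, **per_item})
        return item_results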

Finally, wrap both in an entry point. The complete code is as follows:

    from loguru import logger
    from glom import glom


    def parse(result, rules):

        def _pack_json(result, rules):
            """Build one item from `result`, applying `rules` recursively."""
            item = {}
            for p_rule in rules:
                if p_rule.get('$value_type') == 'raw':
                    if p_rule.get('$parse_method') == 'json':
                        item[p_rule.get('$name')] = glom(result, p_rule.get('$parse_rule'))
                elif p_rule.get('$value_type') == 'recursion':
                    if p_rule.get('$parse_method') == 'json':
                        tmp_result = glom(result, p_rule.get('$parse_rule'))
                        total_result = []
                        for per_r in tmp_result:
                            total_result.append(_pack_json(per_r, p_rule.get('$each')))
                        item[p_rule.get('$name')] = total_result
            return item

        def _unpack_datas(result: dict) -> list:
            """Flatten a packed item so that nested '__datas__' rows reach the top level."""
            if '__datas__' not in result:
                return [result]
            item_results = []
            all_item = result.pop('__datas__')
            for per_item in all_item:
                if '__datas__' in per_item:
                    tmp_datas = per_item.pop('__datas__')
                    for per_tmp_data in tmp_datas:
                        tmp_item = _unpack_datas(per_tmp_data)
                        for per_tmp_item in tmp_item:
                            # Merge the outer fields as well, so that top-level
                            # values are not lost when the nesting is deeper.
                            item_results.append({**result, **per_tmp_item, **per_item})
                else:
                    item_results.append({**result, **per_item})
            return item_results

        pack_result = _pack_json(result, rules)
        logger.info(pack_result)
        return _unpack_datas(pack_result)
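To try the pack-then-unpack idea without installing anything, the sketch below restates it in compact form using only the standard library and runs it on a three-level example. Everything here is an assumption made for illustration: `get_path` is a simplified stand-in for `glom` that only follows dotted dict keys, `pack`, `unpack`, `raw` and `nested` are hypothetical names, and the shop/orders data is invented.

```python
def get_path(target, path):
    """Simplified stand-in for glom(target, 'a.b'): follow a dotted key path."""
    for key in path.split('.'):
        target = target[key]
    return target


def pack(result, rules):
    """Apply the rules, storing nested lists under their '$name' key."""
    item = {}
    for rule in rules:
        value = get_path(result, rule['$parse_rule'])
        if rule['$value_type'] == 'raw':
            item[rule['$name']] = value
        elif rule['$value_type'] == 'recursion':
            item[rule['$name']] = [pack(child, rule['$each']) for child in value]
    return item


def unpack(packed):
    """Lift every '__datas__' row to the top level, merging parent fields."""
    if '__datas__' not in packed:
        return [packed]
    outer = {k: v for k, v in packed.items() if k != '__datas__'}
    return [{**outer, **row} for child in packed['__datas__'] for row in unpack(child)]


def raw(name):
    """Rule for a plain field whose path equals its output name."""
    return {'$name': name, '$value_type': 'raw', '$parse_method': 'json',
            '$parse_rule': name, '$each': []}


def nested(path, each):
    """Rule for a list field that is parsed with the inner rules `each`."""
    return {'$name': '__datas__', '$value_type': 'recursion', '$parse_method': 'json',
            '$parse_rule': path, '$each': each}


# Invented sample: a shop with orders, each order with items.
source = {
    'shop': 's1',
    'orders': [
        {'order': 'o1', 'items': [{'sku': 'x'}, {'sku': 'y'}]},
        {'order': 'o2', 'items': [{'sku': 'z'}]},
    ],
}
rules = [raw('shop'), nested('orders', [raw('order'), nested('items', [raw('sku')])])]

rows = unpack(pack(source, rules))
for row in rows:
    print(row)
```

Each output row carries the fields of all three levels, which is the flattened shape the article aims for. Unlike the version that uses `pop`, this `unpack` does not modify the packed input.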

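That is the complete example of the general-purpose parser. The example only supports JSON; you can adapt it to other parsing methods for your own project. The general parser was in fact written by the author (雞仔) out of laziness, after noticing that most of the work in crawler development goes into parsing. With a front-end page on top of the general parser, operations staff and data analysts can configure the sites they want to crawl themselves. Life is short, you know what I mean. I'm off to slack off~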

For the implementation, see GitHub: https://github.com/hacksman/learn_lab/blob/master/small_bug_lab/general_parser.py
