Python: how to get the href value of every <a> tag on a page

Views: 13  Date: 2022-06-20 14:31:38

Let's look at the code.

# -*- coding: utf-8 -*-
# Python 3 (urllib.request and print() are Python 3, not Python 2.7)
# http://tieba.baidu.com/p/2460150866
# Tag operations

from bs4 import BeautifulSoup
import urllib.request

# For a live URL, the page can be read like this:
# html_doc = 'http://tieba.baidu.com/p/2460150866'
# req = urllib.request.Request(html_doc)
# webpage = urllib.request.urlopen(req)
# html = webpage.read()

html = '''<html><head><title>The Dormouse's story</title></head>
<body>The Dormouse's story
Once upon a time there were three little sisters; and their names were
<a rel='external nofollow' id='xiaodeng'></a>,
<a rel='external nofollow' id='link2'>Lacie</a> and
<a rel='external nofollow' id='link3'>Tillie</a>;
<a rel='external nofollow' id='xiaodeng'>Lacie</a>
and they lived at the bottom of a well.
...
'''
soup = BeautifulSoup(html, 'html.parser')  # document object

# soup.a returns only the first <a> tag
# print(soup.a)
# <a rel='external nofollow' id='xiaodeng'></a>

# The sample anchors as reproduced here carry no href or class attribute.
# k.get(name) returns None for a missing attribute, where k[name] would raise KeyError.
for k in soup.find_all('a'):
    print(k)
    print(k.get('class'))  # class attribute of the <a> tag
    print(k.get('id'))     # id value of the <a> tag
    print(k.get('href'))   # href value of the <a> tag
    print(k.string)        # string of the <a> tag

If an <a> tag contains other tags, for example .., k.string may return None, so use k.get_text() to extract the text inside <a>.
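The difference between the two can be sketched as follows. This is a minimal illustration, not code from the original article: the markup and the ids `plain` and `nested` are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the second <a> wraps part of its text in <b>.
sample = "<a id='plain'>Lacie</a><a id='nested'>Til<b>lie</b></a>"
soup = BeautifulSoup(sample, "html.parser")

plain = soup.find("a", id="plain")
nested = soup.find("a", id="nested")

plain_string = plain.string      # one text child, so the text itself is returned
nested_string = nested.string    # two children (text and <b>), so this is None
nested_text = nested.get_text()  # concatenates every nested text node

print(plain_string, nested_string, nested_text)
```

When a tag has several child nodes, `.string` cannot pick one of them, which is why `get_text()` is the safer choice for link text.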

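The following pattern usually works as well. It uses get(), which returns None instead of raising KeyError when a tag has no such attribute.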

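from urllib.request import urlopen
from bs4 import BeautifulSoup

# url was left undefined in the original snippet; the address from the first example is assumed here
url = 'http://tieba.baidu.com/p/2460150866'

html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
t1 = soup.find_all('a')
print(t1)

href_list = []
for t2 in t1:
    t3 = t2.get('href')  # None when the tag has no href
    href_list.append(t3)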
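A list built this way can contain None entries and relative paths. One possible cleanup step, sketched with the standard library only, is shown here. The helper name `absolute_links`, the base URL, and the sample values are assumptions for illustration and do not come from the original article.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Drop missing hrefs and resolve the rest against base_url."""
    links = []
    for href in hrefs:
        if not href:  # None (no href attribute) or an empty string
            continue
        links.append(urljoin(base_url, href))
    return links


# Hypothetical values of the kind t2.get('href') can return.
example = absolute_links(
    "http://example.com/a/index.html",
    [None, "page2.html", "/top", "http://example.org/x", ""],
)
print(example)
```

In the snippet above, this would be called as `absolute_links(url, href_list)`.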

Supplement: a Python crawler that gets any tag and attribute from any page (including the href attribute of <a> tags)

Let's look at the code.

# coding=utf-8
from bs4 import BeautifulSoup
import requests


# Print the attr attribute of every label tag on the page at url
def getHtml(url, label, attr):
    response = requests.get(url)
    response.encoding = 'utf-8'
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    for target in soup.find_all(label):
        # get() returns None for a missing attribute, so no try/except is needed
        value = target.get(attr)
        if value:
            print(value)


url = 'https://baidu.com/'
label = 'a'
attr = 'href'
getHtml(url, label, attr)
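If the values are needed for further processing rather than printed, a variant that returns a list may be more convenient. The sketch below is an assumption-laden illustration, not the author's code: `collect_attr` is a hypothetical name, and it takes HTML text instead of a URL so it can be tried without network access.

```python
from bs4 import BeautifulSoup


def collect_attr(html, label, attr):
    """Return the attr value of every label tag in html that has that attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # attrs={attr: True} matches only tags where the attribute is present
    return [tag[attr] for tag in soup.find_all(label, attrs={attr: True})]


# Hypothetical sample markup for illustration.
sample = "<a href='/one'>1</a><a id='no-link'>2</a><a href='/two'>3</a><img src='x.png'>"
hrefs = collect_attr(sample, "a", "href")
srcs = collect_attr(sample, "img", "src")
print(hrefs, srcs)
```

With a live page, the `response.text` obtained in getHtml could be passed as the `html` argument.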

The above is my personal experience, and I hope it serves as a useful reference. If anything is wrong or incomplete, corrections are welcome.