
Scraping Chinese novel websites with Python

Date: 2022-06-17 11:08:47

Contents: The idea first, Hands-on, Full code

The idea first

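As a long-time novel fan, I have noticed that many Chinese novel sites are cut from the same cloth, 🖊*閣 and the like. Most of them answer a GET request with plain HTML, and the chapter list sits in telltale <dl> and <dd> tags. So the rough idea is: send a GET request to the site, clean up the response, and write the result into a txt file that can be copied to a phone for easy reading.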
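The fetch, clean, write idea can be tried offline before touching a real site. The sketch below is only an illustration under stated assumptions: SAMPLE_HTML is a made-up page, ChapterLinkParser, extract_chapters and to_text are hypothetical names, and it uses the standard library's html.parser instead of the regular expressions used later in this article.

```python
from html.parser import HTMLParser

# Hypothetical table-of-contents markup; real sites differ in detail.
SAMPLE_HTML = """
<dl>
  <dt>Latest chapters</dt>
  <dd><a href="/book/1/100/5001.html">Chapter 1 Beginning</a></dd>
  <dd><a href="/book/1/100/5002.html">Chapter 2 Journey</a></dd>
</dl>
"""


class ChapterLinkParser(HTMLParser):
    """Collect (href, title) pairs for every <a> found inside a <dd>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_dd = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "dd":
            self._in_dd = True
        elif tag == "a" and self._in_dd:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only collect text while we are inside a chapter link
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None
        elif tag == "dd":
            self._in_dd = False


def extract_chapters(html):
    """Clean step: turn raw HTML into a list of (href, title) pairs."""
    parser = ChapterLinkParser()
    parser.feed(html)
    return parser.links


def to_text(chapters):
    """Write step: one 'title<TAB>href' line per chapter."""
    return "\n".join(f"{title}\t{href}" for href, title in chapters)


chapters = extract_chapters(SAMPLE_HTML)
print(to_text(chapters))
```

On a real site the string passed to extract_chapters would be the body of the GET response rather than SAMPLE_HTML.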

Hands-on

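I fell into a trap earlier. After reading a few pages, I saw that chapter URLs look like https://www.xxx.com/<novel id>/<chapter id>.html, and the chapter ids of the first few chapters were consecutive. So my first plan was to remember the starting chapter id and increment it in a loop. That turned out to be too hasty: after 100 chapters or so, gaps appear in the chapter ids. I honestly do not know why. With a fixed novel id acting like a table, shouldn't the chapter id simply be an auto-increment id? If anyone knows, please enlighten me! So we first fetch the novel's table of contents and turn it into a list that is easy to look things up in later. This part lives in the getList.py file.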
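A small illustration of why incrementing the id breaks and a lookup built from the table of contents does not. The ids below are invented for the example, and guess_by_increment and lookup_from_directory are hypothetical helper names, not part of getList.py.

```python
# Hypothetical chapter ids in table-of-contents order; note the jump after 1003.
directory_ids = [1001, 1002, 1003, 1050, 1051]


def guess_by_increment(first_id, count):
    """The naive plan: start at the first id and add one per chapter."""
    return [first_id + n for n in range(count)]


def lookup_from_directory(ids):
    """Map the chapter position (starting at 1) to the id the site really uses."""
    return {position: chapter_id for position, chapter_id in enumerate(ids, start=1)}


guessed = guess_by_increment(directory_ids[0], len(directory_ids))
index = lookup_from_directory(directory_ids)

# Ids the naive plan would request although no such chapter page is listed
missing = [g for g in guessed if g not in directory_ids]
print(missing)
print(index[4])
```

With these sample ids the naive plan asks for two pages that are not in the directory, while the lookup returns the real id of the fourth chapter.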

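Define a function that requests the table-of-contents page

import re
import requests

# Request the table-of-contents page
def req():
    url = "https://www.24kwx.com/book/4/4020/"
    strHtml = requests.get(url)
    return strHtml.text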

From the fetched content, extract an id (the unique chapter id used in the URL), a name (the chapter title) and a key (the chapter number, the X in 第X章)

# A chapter record
class Xs(object):
    def __init__(self, id, key, name):
        self._id = id
        self._key = key
        self._name = name

    @property
    def id(self):
        return self._id

    @property
    def key(self):
        return self._key

    @property
    def name(self):
        return self._name

    def getString(self):
        return 'id:%s,name:%s,key:%s' % (self._id, self._name, self._key)


# Convert the page into a list of chapters
def tranceList():
    key = 0
    xsList = []
    idrule = r'/4020/(.+?)\.html'
    # Only numeric chapter numbers are converted with int()
    keyrule = r'第(\d+)章'
    html = req()
    # Keep the markup after the second </dt> and before the next </dl>
    html = re.split("</dt>", html)[2]
    html = re.split("</dl>", html)[0]
    htmlList = re.split("</dd>", html)
    for i in htmlList:
        i = i.strip()
        if i:
            # Chapter id, taken from the link
            id = re.findall(idrule, i)[0]
            lsKeyList = re.findall(keyrule, i)
            if len(lsKeyList) > 0:
                # The title carries a chapter number
                key = int(lsKeyList[0])
            else:
                # No chapter number: continue from the previous one
                key = key + 1
            # Chapter name
            # lsname = re.findall(r'\.html">(.+?)</a>', i)[0]
            # name = re.sub('，', ' ', lsname, flags=re.IGNORECASE)
            name = re.findall(r'\.html">(.+?)</a>', i)[0]
            xsobj = Xs(id, key, name)
            xsList.append(xsobj.getString())
    writeList(xsList)

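A note: if you come to Python from another language, writing your first object can be confusing, because in Python an object is an instance of a class. The object I create here is just {id, key, name}, and writing it to the txt file still needs getString. Thinking about it afterwards, I could simply have built an 'id:xxx,name:xxx,key:xxx' string and skipped the class altogether, but I kept it so that readers have something to look at.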
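For comparison, here is one possible middle ground between the hand-written class and a bare string. It is a sketch and not the author's code: the Chapter dataclass and its to_line and from_line methods are hypothetical names, and the record is stored as one JSON object per line rather than in the 'id:...,name:...,key:...' format used in this article.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Chapter:
    id: str    # chapter id from the URL
    key: int   # chapter number
    name: str  # chapter title

    def to_line(self):
        """Serialize to a single line; ensure_ascii=False keeps Chinese readable."""
        return json.dumps(asdict(self), ensure_ascii=False)

    @classmethod
    def from_line(cls, line):
        """Rebuild a Chapter from a line produced by to_line."""
        return cls(**json.loads(line))


chapter = Chapter(id="3798160", key=1, name="第1章孫子，我是你爺爺")
line = chapter.to_line()
restored = Chapter.from_line(line)
print(line)
```

The dataclass generates __init__ and __eq__ automatically, and JSON keeps working when a title contains an ASCII comma, which would confuse a parser that splits the line on ",".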

Finally, write the list to a txt file

# Write to a text file
def writeList(list):
    # write() needs a string, so join the list first, otherwise:
    # TypeError: write() argument must be str, not list
    with open("xsList.txt", 'w', encoding='utf-8') as f:
        f.write('\n'.join(list))
    print('Written successfully')

# The resulting txt looks roughly like this:
# id:3798160,name:第1章孫子，我是你爺爺,key:1
# id:3798161,name:第2章孫子，等等我！,key:2
# id:3798162,name:第3章天上掉下個親爺爺,key:3
# id:3798163,name:第4章超級大客戶,key:4
# id:3798164,name:第5章一張退婚證明,key:5

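OK! Last one. The table of contents is now written, so next we read the novel's content. The approach is the same, and this part goes in writeTxt.py.

First, write the request

import re
import time
import requests

# Request a chapter page
def req(id):
    url = "https://www.24kwx.com/book/4/4020/" + id + ".html"
    strHtml = requests.get(url)
    return strHtml.text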

Read the table of contents we just saved

def getList():
    # readlines() returns a list with one entry per line
    with open("xsList.txt", 'r', encoding='utf-8') as f:
        line = f.readlines()
    return line

Define the rules for cleaning the data

# contextRule is kept for reference; getcontext below uses split instead
contextRule = r'<div class="content">(.+?)<script>downByJs\(\);</script>'
titleRule = r'<h1>(.+?)</h1>'

def getcontext(objstr):
    xsobj = re.split(",", objstr)
    id = re.split("id:", xsobj[0])[1]
    name = re.split("name:", xsobj[1])[1]
    html = req(id)
    lstitle = re.findall(titleRule, html)
    # Fall back to the name from the directory when the page has no <h1>
    title = lstitle[0] if len(lstitle) > 0 else name
    # Keep only what is inside <div class="showtxt"> ... </div>
    context = re.split('<div class="showtxt">', html)[1]
    context = re.split('</div>', context)[0]
    # Strip spaces, carriage returns and newlines
    context = re.sub(' |\r|\n', '', context)
    textList = re.split('<br />', context)
    textList.insert(0, title)
    for item in textList:
        writeTxt(item)
    print('%s--written successfully' % (title))
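The cleaning step can be checked without any network access by running it on a hand-written page. The sketch below makes several assumptions: SAMPLE_PAGE is invented (only the showtxt class name comes from the code above), clean_chapter is a hypothetical helper, and it decodes entities such as &nbsp; with html.unescape, which the article's own code does not do.

```python
import html
import re

# Hypothetical chapter page; only the "showtxt" class is taken from the article.
SAMPLE_PAGE = (
    '<h1>Chapter 1</h1>'
    '<div class="showtxt">&nbsp;&nbsp;First paragraph.<br />\r\n'
    '&nbsp;&nbsp;Second paragraph.<br /><br /></div>'
    '<div class="footer">ads</div>'
)


def clean_chapter(page):
    """Return (title, paragraphs) extracted from one chapter page."""
    match = re.search(r'<h1>(.+?)</h1>', page)
    title = match.group(1) if match else ''
    # Keep only the text between the showtxt div and its closing tag
    body = page.split('<div class="showtxt">', 1)[1]
    body = body.split('</div>', 1)[0]
    paragraphs = []
    for part in body.split('<br />'):
        # Decode entities, turn non-breaking spaces into plain ones, trim the ends
        text = html.unescape(part).replace('\xa0', ' ').strip()
        if text:
            paragraphs.append(text)
    return title, paragraphs


title, paragraphs = clean_chapter(SAMPLE_PAGE)
print(title)
print(paragraphs)
```

Unlike the re.sub call in getcontext, this version only trims whitespace at the ends of each paragraph, so spaces inside a sentence survive. For Chinese text that difference rarely matters.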

Then write to the file

def writeTxt(txt):
    if txt:
        # Append one line and close the file again
        with open("nr.txt", 'a', encoding="utf-8") as f:
            f.write(txt + '\n')

Finally, tie everything together

def getTxt():
    # Default settings
    startNum = 1261  # first chapter
    endNum = 1300    # last chapter
    # Main program: start with an empty nr.txt
    with open("nr.txt", 'w', encoding='utf-8') as f:
        f.write("")
    if endNum < startNum:
        print('The end number must not be smaller than the start number')
        return
    allList = getList()
    needList = allList[startNum-1:endNum]
    for item in needList:
        getcontext(item)
        # Pause briefly between requests
        time.sleep(0.2)
    print("All chapters scraped")

Full code

getList.py

import re
import requests

# Request the table-of-contents page
def req():
    url = "https://www.24kwx.com/book/4/4020/"
    strHtml = requests.get(url)
    return strHtml.text


# A chapter record
class Xs(object):
    def __init__(self, id, key, name):
        self._id = id
        self._key = key
        self._name = name

    @property
    def id(self):
        return self._id

    @property
    def key(self):
        return self._key

    @property
    def name(self):
        return self._name

    def getString(self):
        return 'id:%s,name:%s,key:%s' % (self._id, self._name, self._key)


# Convert the page into a list of chapters
def tranceList():
    key = 0
    xsList = []
    idrule = r'/4020/(.+?)\.html'
    # Only numeric chapter numbers are converted with int()
    keyrule = r'第(\d+)章'
    html = req()
    # Keep the markup after the second </dt> and before the next </dl>
    html = re.split("</dt>", html)[2]
    html = re.split("</dl>", html)[0]
    htmlList = re.split("</dd>", html)
    for i in htmlList:
        i = i.strip()
        if i:
            # Chapter id, taken from the link
            id = re.findall(idrule, i)[0]
            lsKeyList = re.findall(keyrule, i)
            if len(lsKeyList) > 0:
                # The title carries a chapter number
                key = int(lsKeyList[0])
            else:
                # No chapter number: continue from the previous one
                key = key + 1
            # Chapter name
            # lsname = re.findall(r'\.html">(.+?)</a>', i)[0]
            # name = re.sub('，', ' ', lsname, flags=re.IGNORECASE)
            name = re.findall(r'\.html">(.+?)</a>', i)[0]
            xsobj = Xs(id, key, name)
            xsList.append(xsobj.getString())
    writeList(xsList)


# Write to a text file
def writeList(list):
    # write() needs a string, so join the list first, otherwise:
    # TypeError: write() argument must be str, not list
    with open("xsList.txt", 'w', encoding='utf-8') as f:
        f.write('\n'.join(list))
    print('Written successfully')


def main():
    tranceList()


if __name__ == '__main__':
    main()

writeTxt.py

import re
import time
import requests

# Request a chapter page
def req(id):
    url = "https://www.24kwx.com/book/4/4020/" + id + ".html"
    strHtml = requests.get(url)
    return strHtml.text


def getList():
    # readlines() returns a list with one entry per line
    with open("xsList.txt", 'r', encoding='utf-8') as f:
        line = f.readlines()
    return line


# contextRule is kept for reference; getcontext below uses split instead
contextRule = r'<div class="content">(.+?)<script>downByJs\(\);</script>'
titleRule = r'<h1>(.+?)</h1>'


def getcontext(objstr):
    xsobj = re.split(",", objstr)
    id = re.split("id:", xsobj[0])[1]
    name = re.split("name:", xsobj[1])[1]
    html = req(id)
    lstitle = re.findall(titleRule, html)
    # Fall back to the name from the directory when the page has no <h1>
    title = lstitle[0] if len(lstitle) > 0 else name
    # Keep only what is inside <div class="showtxt"> ... </div>
    context = re.split('<div class="showtxt">', html)[1]
    context = re.split('</div>', context)[0]
    # Strip spaces, carriage returns and newlines
    context = re.sub(' |\r|\n', '', context)
    textList = re.split('<br />', context)
    textList.insert(0, title)
    for item in textList:
        writeTxt(item)
    print('%s--written successfully' % (title))


def writeTxt(txt):
    if txt:
        # Append one line and close the file again
        with open("nr.txt", 'a', encoding="utf-8") as f:
            f.write(txt + '\n')


def getTxt():
    # Default settings
    startNum = 1261  # first chapter
    endNum = 1300    # last chapter
    # Main program: start with an empty nr.txt
    with open("nr.txt", 'w', encoding='utf-8') as f:
        f.write("")
    if endNum < startNum:
        print('The end number must not be smaller than the start number')
        return
    allList = getList()
    needList = allList[startNum-1:endNum]
    for item in needList:
        getcontext(item)
        # Pause briefly between requests
        time.sleep(0.2)
    print("All chapters scraped")


def main():
    getTxt()


if __name__ == '__main__':
    main()
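One detail worth knowing: getTxt picks chapters by their position in xsList.txt (allList[startNum-1:endNum]), not by the key field that getList.py writes. When the directory contains entries without a chapter number, position and chapter number drift apart. The sketch below shows selection by key instead. parse_key and select_by_key are hypothetical helpers, and the first sample line with its id is invented for illustration.

```python
def parse_key(line):
    """Return the integer after the last ',key:' in a directory line."""
    return int(line.strip().rsplit(",key:", 1)[1])


def select_by_key(lines, start, end):
    """Keep the lines whose key lies in the inclusive range [start, end]."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return [line for line in lines if start <= parse_key(line) <= end]


# The first entry (a notice without chapter number) is hypothetical; the
# others follow the sample output of getList.py shown in this article.
lines = [
    "id:3798159,name:Notice,key:1\n",
    "id:3798160,name:第1章孫子，我是你爺爺,key:1\n",
    "id:3798161,name:第2章孫子，等等我！,key:2\n",
    "id:3798162,name:第3章天上掉下個親爺爺,key:3\n",
]

by_position = lines[2 - 1:3]          # the slice used in getTxt for 2..3
by_key = select_by_key(lines, 2, 3)   # selection by chapter number

print([parse_key(line) for line in by_position])
print([parse_key(line) for line in by_key])
```

With the extra notice line at the top, the positional slice returns chapters 1 and 2, while the key-based selection returns chapters 2 and 3 as requested.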
