Font Obfuscation: Scraping Résumé Information from 58同城 (58.com)

Published: 2023/12/8 | Category: Programming Q&A

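The font file is embedded in the page as a base64-encoded string. Extract that string, base64-decode it, and save the result as a woff file.
Then use fontTools to dump the woff file to XML, and start analyzing the dynamic mapping between the real data and the obfuscated data.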
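
The extraction step can be tried in isolation before touching the live site. The following is a minimal sketch, assuming the font sits in a CSS data URI of the form `url(data:...;base64,<payload>)`; the sample HTML and the helper name `extract_font_bytes` are made up for illustration, and the payload is a stand-in rather than a real font.

```python
import base64
import re

# Assumption: the payload runs from "base64," up to the closing parenthesis of url(...).
FONT_DATA_URI = re.compile(r"base64,(.*?)\)", flags=re.S)


def extract_font_bytes(html: str) -> bytes:
    """Return the decoded bytes of the font embedded in the page source."""
    match = FONT_DATA_URI.search(html)
    if match is None:
        raise ValueError("no base64 font payload found in page source")
    return base64.b64decode(match.group(1))


# Stand-in payload; a real page would carry an actual woff file here.
fake_font = b"wOFF-not-a-real-font"
sample_html = (
    "<style>@font-face{font-family:stonefont;"
    "src:url(data:application/font-woff;charset=utf-8;base64,"
    + base64.b64encode(fake_font).decode("ascii")
    + ") format('woff');}</style>"
)

font_bytes = extract_font_bytes(sample_html)
print(len(font_bytes), "bytes decoded")
```

With a real page, the returned bytes can be written to a `.woff` file or wrapped in `BytesIO` and passed to `TTFont`, whose `saveXML` method produces the XML dump used for the analysis.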

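The mapping looks like this:
Page source: &#x…; === decoded font data: uni…
The decoded glyph names change every time the page is refreshed, so we need a mapping that stays constant.
Finding that invariant yourself can take a long time, so here is the result:
Page source: &#x…; === decoded font data: uni… === difference between the first two (x, y) coordinates of the glyph in the XML file.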
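
To make the invariant concrete, here is a small sketch of the idea in plain Python. The glyph names (`uniE0A1`, `uniF3B2`) and the coordinate lists are hypothetical; only the two offsets `(0, 1549)` and `(1588, 0)` and their characters come from the `data_map` in the source code of this article. It simulates two page loads in which the glyph names are swapped while the outlines stay the same.

```python
def glyph_name_to_entity(glyph_name):
    """Turn a glyph name such as 'uniE0A1' into the entity prefix '&#xe0a1'."""
    return glyph_name.replace("uni", "&#x").lower()


def fingerprint(coordinates):
    """Offset (dx, dy) between the first two outline points of a glyph."""
    (x1, y1), (x2, y2) = coordinates[0], coordinates[1]
    return (x2 - x1, y2 - y1)


def build_font_map(glyphs, data_map):
    """Map each entity prefix to its real character via the glyph fingerprint."""
    return {
        glyph_name_to_entity(name): data_map[fingerprint(points)]
        for name, points in glyphs.items()
    }


# Two entries taken from the article's data_map.
data_map = {(0, 1549): "B", (1588, 0): "男"}

# Hypothetical outlines: the same shapes appear under different names on each load.
outline_b = [(100, 0), (100, 1549), (300, 1549)]
outline_male = [(50, 700), (1638, 700), (1638, 600)]
first_load = {"uniE0A1": outline_b, "uniF3B2": outline_male}
second_load = {"uniF3B2": outline_b, "uniE0A1": outline_male}

first_map = build_font_map(first_load, data_map)
second_map = build_font_map(second_load, data_map)
print(first_map)
print(second_map)
```

The entity that stands for a given character differs between the two loads, but the fingerprint lookup recovers the right character both times, which is why the table is keyed on coordinate offsets rather than glyph names.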

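The source code follows. If it is still unclear, see my tutorial video on Bilibili, which walks through the process in detail.
Bilibili uploader: 一只會唱歌的程序狗. The episode is titled 字體加密-58同城簡歷信息爬取.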

import requests
import re
import base64
from lxml import etree
from fontTools.ttLib import TTFont
from io import BytesIO

# Maps the (dx, dy) offset between a glyph's first two outline points to the real character.
data_map = {
    (0, 1549): 'B', (1588, 0): '男', (868, 0): '王', (825, 367): '大',
    (265, -118): '?',  # character garbled in the source; real value unknown
    (0, 1026): 'M',
    (-110, -150): '女', (1460, 0): '吳', (230, 390): '碩', (156, 262): '趙', (660, 0): '黃', (924, 0): '李',
    (0, 1325): '1', (0, 134): '8', (0, 144): '經', (0, 125): '2', (1944, 0): '下', (-52, -52): '本', (582, 0): '屆',
    (0, -227): '5', (146, 78): '應', (228, 306): '科', (-244, -426): '7', (770, 0): '中', (928, 0): '生',
    (-121, 62): '6', (-833, 0): 'E', (299, 0): '陳', (159, -123): '3', (164, 0): '以', (-764, 0): '楊',
    (-221, 0): 'A', (238, 0): '張', (0, -1023): '4', (784, 0): '無', (0, 410): '0', (128, -74): '9',
    (-46, -550): '驗', (0, 110): '博', (0, 132): '技', (746, 0): '士',
    (210, 358): '?',  # character garbled in the source; real value unknown
    (1298, 0): '高',
    (-74, -366): '劉', (0, -508): '周',
}

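def get_font_map(content):
    """Build a map from obfuscated entity prefixes ('&#x…') to real characters."""
    font_map = {}
    # The font is embedded as a base64 data URI that ends at the closing parenthesis.
    result = re.search(r"base64,(.*?)\)", content, flags=re.S).group(1)
    b = base64.b64decode(result)
    tf = TTFont(BytesIO(b))
    # print(tf.getGlyphNames())
    # Run this three times, saving the font as 01, 02 and 03, to compare the dumps.
    with open("ztku01.woff", "wb") as f:
        f.write(b)

    fonts = TTFont("ztku01.woff")
    fonts.saveXML("ztku01.xml")
    # The first and last glyph names are skipped; presumably they are not mapped characters.
    for i in tf.getGlyphNames()[1:-1]:
        temp = tf["glyf"][i].coordinates
        print(temp)
        x1, y1 = temp[0]
        x2, y2 = temp[1]
        # The offset between the first two points is the stable key into data_map.
        new = (x2 - x1, y2 - y1)
        key = i.replace("uni", "&#x").lower()
        # key = key.encode('utf-8').decode('unicode_escape')
        font_map[key] = data_map[new]
    print(font_map)
    return font_map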

def parse_html():
    url = "https://sz.58.com/searchjob/"
    header = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'cache-control': 'max-age=0',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36',
    }
    response = requests.get(url, headers=header)
    html = response.text
    font_map = get_font_map(html)
    # Replace every obfuscated entity in the page with its real character.
    for i in font_map:
        print(i + ";")
        html = html.replace(i + ";", font_map[i])
    print(html)
    data = etree.HTML(html)
    personal_information = data.xpath('//div[@id="infolist"]/ul/li//dl[@class="infocardMessage clearfix"]')
    for info in personal_information:
        # Name
        name = info.xpath('./dd//span[@class="infocardName fl stonefont resumeName"]/text()')[0]
        # Gender
        gender = info.xpath('./dd//div[@class="infocardBasic fl"]/div/em[1]/text()')[0]
        # Age
        age = info.xpath('./dd//div[@class="infocardBasic fl"]/div/em[2]/text()')[0]
        # Work experience
        work_experience = info.xpath('./dd//div[@class="infocardBasic fl"]/div/em[3]/text()')[0]
        # Education
        education = info.xpath('./dd//div[@class="infocardBasic fl"]/div/em[4]/text()')[0]
        print(name, gender, age, work_experience, education)

if __name__ == "__main__":
    parse_html()
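
The replacement loop in parse_html scans the whole page once per map entry. As an optional alternative, the substitution can be done in a single regex pass. This is only a sketch: the helper name `decode_entities` and the sample HTML are hypothetical, and the keys are assumed to have the `'&#x…'` shape (no trailing semicolon) that get_font_map produces.

```python
import re


def decode_entities(html, font_map):
    """Replace every obfuscated entity with its real character in one pass."""
    if not font_map:
        return html
    # Each key gets its trailing semicolon back; matching ignores hex-digit case.
    pattern = re.compile(
        "|".join(re.escape(key) + ";" for key in font_map), flags=re.I
    )
    # Strip the semicolon and lowercase the match to look it up in font_map.
    return pattern.sub(lambda m: font_map[m.group(0)[:-1].lower()], html)


# Hypothetical map and page fragment for illustration.
sample_map = {"&#xe0a1": "B", "&#xf3b2": "男"}
sample_page = "<em>&#xf3b2;</em><span>&#xE0A1;&#xe0a1;</span>"
decoded = decode_entities(sample_page, sample_map)
print(decoded)
```

Case-insensitive matching is a design choice made here, not something the original code does; it only matters if the page ever emits uppercase hex digits.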
