Font Encryption: Scraping Résumé Data from 58.com (58同城)
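The obfuscated font file is embedded in the page as a base64-encoded string. First extract that string, base64-decode it, and save the result as a .woff file.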
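The extraction step can be sketched as follows. This is only an illustration, assuming the font sits in a CSS `url(data:...;base64,...)` declaration; the sample page fragment, its payload, and the helper name `extract_font_bytes` are made up for the example.

```python
import base64
import re


def extract_font_bytes(html):
    """Return the decoded bytes of the first base64 data URI found in html."""
    # Assumption: the payload sits between "base64," and the closing ")" of url(...)
    match = re.search(r"base64,(.*?)\)", html, flags=re.S)
    if match is None:
        raise ValueError("no base64 payload found in the page")
    return base64.b64decode(match.group(1))


# Made-up page fragment; a real page would carry a complete WOFF file here
payload = base64.b64encode(b"wOFF-demo-bytes").decode("ascii")
sample_html = (
    "@font-face{src:url(data:application/font-woff;charset=utf-8;base64,"
    + payload
    + ") format('woff')}"
)

font_bytes = extract_font_bytes(sample_html)
# Saving works the same way for real data:
# with open("ztku01.woff", "wb") as f:
#     f.write(font_bytes)
```

Raising an error when nothing matches makes a changed page layout show up immediately, rather than as an `AttributeError` on `None`.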
Use fontTools to dump the .woff file to XML, then analyse the dynamic mapping between the real data and the obfuscated data.
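To give an idea of what the XML dump contains, here is a small sketch that reads glyph outlines from it using only the standard library. fontTools can write such a dump with `TTFont.saveXML`; the XML fragment below is hand-written in that layout, and the glyph name and coordinates in it are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the layout of a fontTools XML dump (values are made up)
SAMPLE_TTX = """
<ttFont>
  <glyf>
    <TTGlyph name="uniE0A1" xMin="150" yMin="0" xMax="700" yMax="1549">
      <contour>
        <pt x="150" y="0" on="1"/>
        <pt x="150" y="1549" on="1"/>
        <pt x="700" y="1549" on="1"/>
      </contour>
      <instructions/>
    </TTGlyph>
  </glyf>
</ttFont>
"""


def first_two_points(ttx_text):
    """Map each glyph name to the first two (x, y) points of its outline."""
    root = ET.fromstring(ttx_text)
    result = {}
    for glyph in root.iter("TTGlyph"):
        # Collect every <pt> in document order, then keep the first two
        points = [(int(pt.get("x")), int(pt.get("y"))) for pt in glyph.iter("pt")]
        if len(points) >= 2:
            result[glyph.get("name")] = points[:2]
    return result


first_points = first_two_points(SAMPLE_TTX)
```

Comparing these per-glyph points across several saved dumps is one way to see which values change between refreshes and which do not.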
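The mapping looks like this:
Page source: &#x…; === decoded font data: uni…
The decoded data changes every time the page is refreshed, so we need a mapping that stays constant.
Finding one by hand can take a long time, so here is the result directly:
Page source: &#x…; === decoded font data: uni… === the difference between the first two (x, y) coordinates of the glyph in the XML file.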
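The idea behind the stable mapping can be sketched like this. It is an illustration under assumptions, not the author's exact code: the difference is taken as second point minus first point (the text does not state the order), the glyph names and coordinates are invented, and `SHAPE_TO_CHAR` holds just two entries copied from the full table in the source.

```python
# Two entries copied from the article's data_map; the full table is much longer
SHAPE_TO_CHAR = {(0, 1549): "B", (868, 0): "王"}


def shape_key(points):
    """Difference between the first two outline points (assumed: second minus first)."""
    (x0, y0), (x1, y1) = points[0], points[1]
    return (x1 - x0, y1 - y0)


def build_font_map(glyph_points):
    """Map HTML entities such as '&#xe0a1;' to the characters they render as."""
    font_map = {}
    for name, points in glyph_points.items():
        if not name.startswith("uni") or len(points) < 2:
            continue
        char = SHAPE_TO_CHAR.get(shape_key(points))
        if char is not None:
            # 'uniE0A1' -> '&#xe0a1;' (lowercase hex is an assumption about the page)
            font_map["&#x" + name[3:].lower() + ";"] = char
    return font_map


# Two simulated page loads: the code points are shuffled, the outlines are not
refresh_1 = {"uniE0A1": [(150, 0), (150, 1549)], "uniF2B3": [(100, 700), (968, 700)]}
refresh_2 = {"uniF2B3": [(150, 0), (150, 1549)], "uniE0A1": [(100, 700), (968, 700)]}

map_1 = build_font_map(refresh_1)
map_2 = build_font_map(refresh_2)
```

Because the key depends only on the outline, the same table resolves both page loads even though the code points swapped. If lookups miss on real data, the subtraction order is the first thing to check.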
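The source code is below. If anything is still unclear, see my tutorial video on Bilibili, which walks through the whole process in detail.
Bilibili uploader: 一只會唱歌的程序狗. The relevant episode is titled 字体加密-58同城簡歷信息爬取.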
import requests
import re
import base64
from lxml import etree
from fontTools.ttLib import TTFont
from io import BytesIO
# Stable table: (x, y) difference between a glyph's first two points -> real character
data_map = {
    (0, 1549): 'B', (1588, 0): '男', (868, 0): '王', (825, 367): '大',
    (265, -118): '?',  # character lost in the source text; placeholder
    (0, 1026): 'M',
    (-110, -150): '女', (1460, 0): '吳', (230, 390): '碩', (156, 262): '趙', (660, 0): '黃', (924, 0): '李',
    (0, 1325): '1', (0, 134): '8', (0, 144): '經', (0, 125): '2', (1944, 0): '下', (-52, -52): '本', (582, 0): '屆',
    (0, -227): '5', (146, 78): '應', (228, 306): '科', (-244, -426): '7', (770, 0): '中', (928, 0): '生',
    (-121, 62): '6', (-833, 0): 'E', (299, 0): '陳', (159, -123): '3', (164, 0): '以', (-764, 0): '楊',
    (-221, 0): 'A', (238, 0): '張', (0, -1023): '4', (784, 0): '無', (0, 410): '0', (128, -74): '9',
    (-46, -550): '驗', (0, 110): '博', (0, 132): '技', (746, 0): '士',
    (210, 358): '?',  # character lost in the source text; placeholder
    (1298, 0): '高',
    (-74, -366): '劉', (0, -508): '周',
}
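def get_font_map(content):
    font_map = {}
    result = re.search(r"base64,(.*?)\)", content, flags=re.S).group(1)
    b = base64.b64decode(result)
    tf = TTFont(BytesIO(b))
    # print(tf.getGlyphNames())
    # Run the script three times, saving the font as 01, 02 and 03, to compare them
    with open("ztku01.woff", "wb") as f:
        f.write(b)
    # ASSUMED completion (not in the source): key = second point - first point; entity hex is lowercase
    for name in tf.getGlyphOrder():
        pts = tf["glyf"][name].getCoordinates(tf["glyf"])[0]
        if name.startswith("uni") and len(pts) >= 2:
            key = (pts[1][0] - pts[0][0], pts[1][1] - pts[0][1])
            if key in data_map:
                font_map["&#x" + name[3:].lower()] = data_map[key]
    return font_map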
def parse_html():
    url = "https://sz.58.com/searchjob/"
    header = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'cache-control': 'max-age=0',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36',
    }
    response = requests.get(url, headers=header)
    html = response.text
    font_map = get_font_map(html)
    # Replace every obfuscated entity in the page with its real character
    for i in font_map:
        print(i + ";")
        html = html.replace(i + ";", font_map[i])
    print(html)
    data = etree.HTML(html)
    personal_information = data.xpath('//div[@id="infolist"]/ul/li//dl[@class="infocardMessage clearfix"]')
    for info in personal_information:
        # Name
        name = info.xpath('./dd//span[@class="infocardName fl stonefont resumeName"]/text()')[0]
        # Gender
        gender = info.xpath('./dd//div[@class="infocardBasic fl"]/div/em[1]/text()')[0]
        # Age
        age = info.xpath('./dd//div[@class="infocardBasic fl"]/div/em[2]/text()')[0]
        # Work experience
        work_experience = info.xpath('./dd//div[@class="infocardBasic fl"]/div/em[3]/text()')[0]
        # Education
        education = info.xpath('./dd//div[@class="infocardBasic fl"]/div/em[4]/text()')[0]
        print(name, gender, age, work_experience, education)


if __name__ == "__main__":
    parse_html()