为了写个通用的小说下载程序,而不得不写的一个网页页面源码分析代码
原因也很简单,这个下载程序用winhttp api直接访问 html 页面,下载页面源码文本
得到网页源码后,需要获取特定范围的特定文本(小说页面的文本)
几个选择:
1. 是使用web控件的DOM对象根据元素路径或特定元素直接定位,
web控件的内核就是IE,缺点是浪费内存和cpu资源,且存在可能的解析错误(ie落伍了),如果使用web控件则控件不得设为不可见状态, .visible 的值必须为 true,否则不能加载网页,解决办法是把web控件放在边框外, 整个过程完全交由web控件下载加载获取,这个程序写出来虽然特别简单,但是我有资源洁癖。。。
2.直接创建HtmlDocument对象,加载网页源码,一样使用dom对象
最大的问题是几乎不可用,HtmlDocument对象是DOM对象内核,存在与web控件相同的问题,htmldocument没法在不显示出页面的情况下解析整个页面,他的元素子对象都是空的,说白了就是没法用
web控件/ie肯定是在加载了网页源码后,在页面渲染时做了什么,DOM对象才会真正的完全解析,总之这个方案不可行
3.为了那可怜的资源洁癖,还是忍着头痛写了几个函数,把网页源码给解析了。。。
下面的代码是不完整代码,实现了解析但是没有写提取和保存的代码,每个人思路不一样
有兴趣的可以自行修改使用,解析过程主要根据字节码解析,没有使用文本形式,这样大大增加了效率
解析逻辑:网页源码转换为 unicode 字节数组,对比 < > 关键符号,
webpage_Analyzer 负责找出所有的标签:即 < > 的内容,包括 < >
webpage_analyze_tag 负责把标签内的内容解析出来,忽略 < > ,按各种关键打断符号(回车/换行/跳格/空格/冒号/等号/引号)把内容提取出来
webpage_analyze_tag_attribs 负责根据标签内提取出来的内容进行识别,识别标签名,标签属性,标签类型等等。。。
由于结构体 dom_element_in_xpath_analyzer 设计的不符合要求,相当于没法存储解析出来的数据,所以后面的代码没有写,
'// 使用这样的代码调用解析程序, debug.print 输出解析结果,稍加跟踪就可以理解整个过程
call webpage_Analyzer Text1.text
使用凤凰网(ifeng.com)首页源码测试:
<!DOCTYPE html>
<html xmlns:wb="http://open.weibo.com/wb" lang="zh">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=Edge">
<meta content="always" name="referrer">
<title>凤凰网</title>
<meta name="keywor...................后面略..............
debug输出下面例子代码:
tag field: !DOCTYPE 67002.7
tag field: html 67003.13
tag field: html 67003.42
tag field: xmlns 67003.77
tag field: : 67004
tag field: wb 67004.25
tag field: = 67004.48
tag field: http://open.weibo.com/wb 67004.78
tag field: lang 67004.99
tag field: = 67005.22
tag field: zh 67005.62
tag field: head 67005.8
tag field: meta 67005.98
tag field: charset 67006.16
tag field: = 67006.37
tag field: utf-8 67006.55
tag field: meta 67006.78
tag field: http-equiv 67007.2
tag field: = 67007.42
tag field: X-UA-Compatible 67007.63
tag field: content 67007.91
tag field: = 67008.18
tag field: IE=Edge 67008.6
tag field: meta 67008.87
tag field: content 67009.39
tag field: = 67009.63
tag field: always 67009.95
tag field: name 67010.2
tag field: = 67010.57
tag field: referrer 67010.87
tag field: title 67011.09
tag field: /title 67011.41
tag field: meta 67011.64
tag field: name 67012.02
1 Private Type xpath_url_analyzer_type_attrib 2 3 attrName As String 4 attrValue As String 5 isCheckState As Long '// 0 = and , 1 = or 6 7 End Type 8 9 Private Type dom_element_in_xpath_analyzer 10 11 elementName As String 12 elementIndex As Long 13 AttribCount As Long 14 Attribs() As xpath_url_analyzer_type_attrib 15 16 isCloseElement As Boolean 17 elementTagClass As Long 18 19 innerHtml_from As Long 20 innerHtml_to As Long 21 22 innerText_from As Long 23 innerText_to As Long 24 25 tagText_from As Long 26 tagText_to As Long 28 29 End Type 30 31 Private Type simple_htmlanalyzer_documentobjectmodel 32 33 sizeOfHtml As Long 34 ByteOfHtml() As Byte '// html code 35 36 sizeOfElement As Long 37 htmlElement() As dom_element_in_xpath_analyzer '// 根据 xpath 分析器指定的路径解析出的元素 38 40 End Type 41 42 Private Type ThisClassSet 43 44 htmlElementsOfXpath As simple_htmlanalyzer_documentobjectmodel 45 47 End Type 48 Dim PG As ThisClassSet 49 50 'Dim htmlLabelArray() As dom_element_in_xpath_analyzer 51 Dim htmlelements As simple_htmlanalyzer_documentobjectmodel 54 Function webpage_analyze_GetElementInnerHtml() As String 55 56 End Function 59 Function webpage_analyze_GetElementInnerText() As String 60 61 End Function 62 63 Private Function webpage_analyze_tag_attribs(WebPageTagfield() As String, domElement As dom_element_in_xpath_analyzer) As Long 64 65 '// 此函数的思路,是以状态机形式,连续解析标签内的内容并存储 68 '// webpage_analyze_tag 函数最后应调用此函数,解析标签内的文本含义和存储
68 '//WebPageTagfield 数组保存着标签内全部已识别的字段,如:div class = xxx id = xxx style : color = xxx disabled
69 End Function 70 71 Private Function webpage_analyze_tag(webpageTag() As Byte, ByVal tagfrom As Long, ByVal tagto As Long) As dom_element_in_xpath_analyzer 72 73 ' Debug.Print "webpage_analyze_tag", webpageTag, Timer 74 75 76 Dim iCount As Long 77 78 79 If tagto - tagfrom = 2 Then Exit Function '// 没有具体内容, 形式为: <> 80 81 With webpage_analyze_tag 82 83 .tagText_from = tagfrom 84 .tagText_to = tagto 85 86 ReDim .Attribs(100) 87 88 End With 89 90 91 Dim tagAttribField() As String 92 Dim psAttribField As Long 93 94 Dim flag_tag_leftStart As Long 95 Dim flag_DanYinHao As Long '// 单引号 标志位 96 Dim flag_ShuangYinHao As Long '// 双引号 标志位 97 ' Dim flag_Space As Long '// 空格 标志位 98 99 ' Dim flag_tag_finished As Long '// 标签名提取完成 100 ' Dim state_tag_attribs_check As Long '// 状态机: =0 待提取tag, = -1 已提取标签名, =1 属性名已提取, = 2 等号已确认, = 3 前导引号已确认, 完成后重置状态机为 = -1 101 102 Dim ByteSymbolTable(255) As Byte 103 104 '**************************************** 做一个类型表 *********************************************** 105 106 '// 13, 10, 9 32 58,enter,changeline,tab,space, 107 ByteSymbolTable(9) = 1 108 ByteSymbolTable(10) = 1 109 ByteSymbolTable(13) = 1 110 ByteSymbolTable(32) = 1 111 112 ByteSymbolTable(62) = 1 '// > 113 ByteSymbolTable(58) = 1 '// 冒号 114 ByteSymbolTable(61) = 1 '// 等号 118 ' ByteSymbolTable(62) = 8 '// > 120 121 '// 34/39 双引号/单引号 122 ByteSymbolTable(34) = 2 123 ByteSymbolTable(39) = 3 124 125 126 127 ' '// / 闭合标签 128 ByteSymbolTable(47) = 6 129 130 '// ! 注释标签 131 ByteSymbolTable(33) = 5 132 133 134 135 ' 字母/数字/-/_标记为5 136 For iCount = 65 To 90: ByteSymbolTable(iCount) = 6: Next ' A-Z 137 For iCount = 97 To 122: ByteSymbolTable(iCount) = 6: Next ' a-z 138 ByteSymbolTable(45) = 6 ' - 139 ByteSymbolTable(95) = 6 ' 下划线 _ 140 141 ' 数字 142 For iCount = 48 To 57: ByteSymbolTable(iCount) = 7: Next ' 0-9 143 144 '**************************************** 类型表完成 *********************************************** 145 146 ' SpaceBreakSymbol = 1 147 ' SingleQuoteSymbol = 2 148 ' DoubleQuoteSymbol = 3 149 ' CloseSymbol = 4 150 ' CommentSymbol = 5 151 ' WordSymbol = 6 152 ' NumberSymbol = 7 153 154 ' Dim state_tag_attribs_check As Long 155 '// 状态机: =0 待提取tag, = -1 已提取标签名, 1 属性名已提取, = 2 等号已确认, = 3 前导引号已确认, 4 属性名后没有=直接新属性 完成后重置状态机为 = -1 156 157 ReDim tagAttribField(tagto - tagfrom) 158 159 160 For iCount = tagfrom + 2 To tagto Step 2 161 162 If webpageTag(iCount + 1) = 0 Then 163 164 Select Case ByteSymbolTable(webpageTag(iCount)) 165 166 '// 13, 10, 9 32,58 ,enter,changeline,tab,space, >, : , 167 Case 1 168 '// 忽略在标签字符前的中断符,实现一定的容错处理 169 170 '// 如果不在引号范围内 or 如果遇到结尾符号: > 171 If ((flag_DanYinHao = 0) And (flag_ShuangYinHao = 0)) Or (iCount = tagto) Then 172 173 '// 如果是字段结尾 174 If flag_tag_leftStart Then 175 176 tagAttribField(psAttribField) = String((iCount - flag_tag_leftStart) \ 2, 0) 177 178 CopyMemory ByVal StrPtr(tagAttribField(psAttribField)), webpageTag(flag_tag_leftStart), iCount - flag_tag_leftStart 179 180 psAttribField = psAttribField + 1 181 182 Debug.Print "tag field:", tagAttribField(psAttribField - 1), Timer 183 184 webpage_analyze_flag_setbitszero flag_tag_leftStart, flag_DanYinHao, flag_ShuangYinHao 185 186 187 End If 188 189 '// : , = 190 If (webpageTag(iCount) = 58) Or (webpageTag(iCount) = 61) Then 191 tagAttribField(psAttribField) = String(1, webpageTag(iCount)) 192 psAttribField = psAttribField + 1 193 194 Debug.Print "tag field:", tagAttribField(psAttribField - 1), Timer 195 196 webpage_analyze_flag_setbitszero flag_tag_leftStart, flag_DanYinHao, flag_ShuangYinHao 197 198 End If 199 200 End If 201 202 '// 34 双引号 203 Case 2 204 '// 敏感引号类型, 对于单双引号套双单引号的情况, 只对第1个引号类型敏感 205 If flag_DanYinHao = 0 Then 206 207 If flag_ShuangYinHao Then 208 209 '// 如果是字段结尾 210 If flag_tag_leftStart Then 211 212 tagAttribField(psAttribField) = String((iCount - flag_tag_leftStart) \ 2, 0) 213 214 CopyMemory ByVal StrPtr(tagAttribField(psAttribField)), webpageTag(flag_tag_leftStart), iCount - flag_tag_leftStart 215 216 psAttribField = psAttribField + 1 217 218 Debug.Print "tag field:", tagAttribField(psAttribField - 1), Timer 219 220 webpage_analyze_flag_setbitszero flag_tag_leftStart, flag_DanYinHao, flag_ShuangYinHao 221 222 223 End If 224 225 226 227 Else 228 flag_ShuangYinHao = iCount 229 End If 230 End If 231 232 233 '// 39 单引号 234 Case 3 235 '// 敏感引号类型, 对于单双引号套双单引号的情况, 只对第1个引号类型敏感 236 If flag_ShuangYinHao = 0 Then 237 238 If flag_DanYinHao Then 239 240 '// 如果是字段结尾 241 If flag_tag_leftStart Then 242 243 tagAttribField(psAttribField) = String((iCount - flag_tag_leftStart) \ 2, 0) 244 245 CopyMemory ByVal StrPtr(tagAttribField(psAttribField)), webpageTag(flag_tag_leftStart), iCount - flag_tag_leftStart 246 247 psAttribField = psAttribField + 1 248 249 Debug.Print "tag field:", tagAttribField(psAttribField - 1), Timer 250 251 webpage_analyze_flag_setbitszero flag_tag_leftStart, flag_DanYinHao, flag_ShuangYinHao 252 253 254 End If 255 256 257 Else 258 flag_DanYinHao = iCount 259 End If 260 End If 261 262 263 ' '// / 闭合标签 264 ' Case 4 265 ' 266 ' If flag_tag_leftStart = 0 Then webpage_analyze_tag.isCloseElement = iCount 267 268 269 270 '// ! 注释标签 271 Case 5 272 '// 注释前导 273 If (webpageTag(iCount + 2) = 45) And (webpageTag(iCount + 4) = 45) Then 274 If (webpageTag(iCount + 3) = 0) And (webpageTag(iCount + 5) = 0) Then 275 276 '// 标记注释标签 277 webpage_analyze_tag.elementName = "!" 278 279 End If 280 281 Else 282 '// ! 前导标签,非注释 283 If flag_tag_leftStart = 0 Then flag_tag_leftStart = iCount 284 285 End If 286 287 '// a~z:97 To 122, A~Z:65 To 90, 0~9:48 To 57, -:45,_:95, 288 Case 6, 7 289 290 If flag_tag_leftStart = 0 Then flag_tag_leftStart = iCount 291 292 293 294 '//Case 7 '// 应单独判断数字类型作为前导字符的情况: 如果没有字母前导/没有引号前导,则标签或属性无效 295 296 297 End Select 298 299 End If 300 301 Next 302 303 304 If psAttribField Then ReDim Preserve tagAttribField(psAttribField - 1) 305 306 307 308 End Function 309 310 Sub webpage_analyze_flag_setbits(ByVal exp1 As Long, ByVal exp2 As Long, flagVar As Long, ByVal bitsVal As Long) 311 '// 标志位带条件置位 312 If (exp1) Or (exp2) Then flagVar = bitsVal 313 End Sub 314 315 Sub webpage_analyze_flag_setbitszero(exp1 As Long, Optional exp2 As Long, Optional exp3 As Long, Optional exp4 As Long, Optional exp5 As Long, Optional exp6 As Long, Optional exp7 As Long) 316 '// 标志位全部清 0 317 exp1 = 0: exp2 = 0: exp3 = 0: exp4 = 0: exp5 = 0: exp6 = 0: exp7 = 0 318 End Sub 319 320 '// 分析web页面源码 321 Function webpage_Analyzer(ByVal strWebPage As String) As Long 322 323 Dim mulBitsBuffer() As Byte '// 提取的标签字节码 324 325 '// 获取webpage字节码,,缓存 html code, 保存 uboudund(html byte code) 326 PG.htmlElementsOfXpath.ByteOfHtml = strWebPage 327 328 '// 缓存webpage源码总字节量 329 PG.htmlElementsOfXpath.sizeOfHtml = UBound(PG.htmlElementsOfXpath.ByteOfHtml) 330 331 Dim iCount As Long '// 循环计次 332 333 Dim flag_left_FangKuoHao As Long '// 左 方括号标志位 334 Dim flag_DanYinHao As Long '// 单引号 标志位 335 Dim flag_ShuangYinHao As Long '// 双引号 标志位 336 Dim flag_Space As Long '// 空格 标志位 337 338 Dim flag_enter As Long 339 Dim flag_tab As Long 340 Dim flag_changeline As Long 341 342 For iCount = 0 To PG.htmlElementsOfXpath.sizeOfHtml Step 2 343 344 '// 查找 < 345 If PG.htmlElementsOfXpath.ByteOfHtml(iCount) = 60 And PG.htmlElementsOfXpath.ByteOfHtml(iCount + 1) = 0 Then 346 347 '// 未遇到过 < 348 If flag_left_FangKuoHao = 0 Then 349 350 flag_left_FangKuoHao = iCount 351 352 '// 在遇到过 < 但没有被引号包括时: 放弃之前的错误标签类型 353 Else 354 If (flag_ShuangYinHao = 0) Or (flag_DanYinHao = 0) Then 355 356 '// 标志位 清0 357 webpage_analyze_flag_setbitszero flag_left_FangKuoHao, flag_DanYinHao, flag_ShuangYinHao, flag_Space, flag_enter, flag_changeline, flag_tab 358 359 '// 标签识别起始位 置位 360 flag_left_FangKuoHao = iCount 361 362 End If 363 364 End If 365 ' End If 366 367 '// 查找 " 368 ElseIf PG.htmlElementsOfXpath.ByteOfHtml(iCount) = 34 And PG.htmlElementsOfXpath.ByteOfHtml(iCount + 1) = 0 Then 369 370 '// 敏感引号类型, 对于单双引号套双单引号的情况, 只对第1个引号类型敏感 371 If flag_DanYinHao = 0 Then 372 373 If flag_ShuangYinHao Then 374 flag_ShuangYinHao = 0 375 Else 376 flag_ShuangYinHao = iCount 377 End If 378 End If 379 ' End If 380 381 '// 查找 空格 382 ElseIf PG.htmlElementsOfXpath.ByteOfHtml(iCount) = 32 And PG.htmlElementsOfXpath.ByteOfHtml(iCount + 1) = 0 Then 383 webpage_analyze_flag_setbits flag_DanYinHao, flag_ShuangYinHao, flag_Space, iCount 384 385 '// 查找 enter 386 ElseIf PG.htmlElementsOfXpath.ByteOfHtml(iCount) = 13 And PG.htmlElementsOfXpath.ByteOfHtml(iCount + 1) = 0 Then 387 webpage_analyze_flag_setbits flag_DanYinHao, flag_ShuangYinHao, flag_enter, iCount 388 389 '// 查找 changeline 390 ElseIf PG.htmlElementsOfXpath.ByteOfHtml(iCount) = 10 And PG.htmlElementsOfXpath.ByteOfHtml(iCount + 1) = 0 Then 391 webpage_analyze_flag_setbits flag_DanYinHao, flag_ShuangYinHao, flag_changeline, iCount 392 393 '// 查找 tab 394 ElseIf PG.htmlElementsOfXpath.ByteOfHtml(iCount) = 9 And PG.htmlElementsOfXpath.ByteOfHtml(iCount + 1) = 0 Then 395 webpage_analyze_flag_setbits flag_DanYinHao, flag_ShuangYinHao, flag_tab, iCount 396 397 '// 查找 ' 398 ElseIf PG.htmlElementsOfXpath.ByteOfHtml(iCount) = 39 And PG.htmlElementsOfXpath.ByteOfHtml(iCount + 1) = 0 Then 399 400 '// 敏感引号类型, 对于单双引号套双单引号的情况, 只对第1个引号类型敏感 401 If flag_ShuangYinHao = 0 Then 402 403 If flag_DanYinHao Then 404 flag_DanYinHao = 0 405 Else 406 flag_DanYinHao = iCount 407 End If 408 End If 409 410 411 '// 查找 > 412 ElseIf PG.htmlElementsOfXpath.ByteOfHtml(iCount) = 62 And PG.htmlElementsOfXpath.ByteOfHtml(iCount + 1) = 0 Then 413 414 '// 遇到 > 时,表明标签结束 415 If flag_left_FangKuoHao Then 416 417 '// 如果单双引号都是闭合状态 418 If (flag_ShuangYinHao = 0) And (flag_DanYinHao = 0) Then 419 420 ' '// 提取标签内容, 包括左括号, 但不包括右括号 421 ' ReDim mulBitsBuffer(iCount - flag_left_FangKuoHao) 422 ' 423 ' CopyMemory mulBitsBuffer(0), PG.htmlElementsOfXpath.ByteOfHtml(flag_left_FangKuoHao + 2), iCount - flag_left_FangKuoHao - 2 424 ' 425 ' '// 解析标签 426 ' webpage_analyze_tag (mulBitsBuffer) 427 webpage_analyze_tag PG.htmlElementsOfXpath.ByteOfHtml, flag_left_FangKuoHao, iCount 428 429 webpage_analyze_flag_setbitszero flag_left_FangKuoHao, flag_DanYinHao, flag_ShuangYinHao, flag_Space, flag_enter, flag_changeline, flag_tab 430 431 432 Else 433 '// 单双引号非闭合状态, 符合:大于50个字符,且遇有空格/回车/tab/换行时,或超过256个字符时,发现 > 强制闭合 434 If iCount - flag_left_FangKuoHao > 100 Then 435 436 Select Case True 437 Case ((iCount - flag_left_FangKuoHao) > 512), ((flag_Space Or flag_enter Or flag_changeline Or flag_tab) <> 0) 438 439 ' '// 提取标签内容, 包括左括号, 但不包括右括号 440 ' ReDim mulBitsBuffer(iCount - flag_left_FangKuoHao) 441 ' 442 ' CopyMemory mulBitsBuffer(0), PG.htmlElementsOfXpath.ByteOfHtml(flag_left_FangKuoHao + 2), iCount - flag_left_FangKuoHao - 2 443 ' 444 ' '// 解析标签 445 ' webpage_analyze_tag (mulBitsBuffer) 446 447 webpage_analyze_tag PG.htmlElementsOfXpath.ByteOfHtml, flag_left_FangKuoHao, iCount 448 449 450 '// 标志位清0 451 webpage_analyze_flag_setbitszero flag_left_FangKuoHao, flag_DanYinHao, flag_ShuangYinHao, flag_Space, flag_enter, flag_changeline, flag_tab 452 453 454 End Select 455 456 End If 457 458 459 460 461 End If 462 463 Else 464 '// 未发现过 < ,单独发现闭合符号, 直接舍弃忽略, '// 标志位清0 465 webpage_analyze_flag_setbitszero flag_left_FangKuoHao, flag_DanYinHao, flag_ShuangYinHao, flag_Space, flag_enter, flag_changeline, flag_tab 466 467 468 End If 469 470 471 472 End If 473 474 Next 475 476 477 478 479 End Function
最后,还有 xpath 的路径解析函数 , 由于结构体设计的不好,代码也写的比较糊弄
1 Function xpath_Analyzer(ByVal strXpath As String) As Long 2 '/html/body/div 文档根目录下的body标签中的所有div标签(绝对路径)。 3 '/html/body/div[1] 4 '/html/body/div[@xx='cc'] 5 '/html/body/div[@xx='cc' and @xx2='cc2'] 6 '//div[1] 文档中第一个div标签(XPath 索引从 1 开始)。 7 8 '//div[last()] 文档中最后一个div标签。 9 '//div[position()<3] 文档中前两个div标签。 10 '//div[2]/p[1] 第二个div中的第一个p标签。 11 '//input[@type='text'][@name='username'] 所有type为text且name为username的input标签(多属性筛选)。 12 '//div[@class='item'][2] 第二个class为item的div标签(属性与索引组合)。 13 14 Debug.Print "XPath", strXpath, Timer 15 16 Dim iCount As Long 17 Dim arrayXPath() As String 18 Dim xpathPart As String 19 Dim currentLevel As Long 20 21 ' 分割XPath路径 22 arrayXPath = Split(strXpath, "/") 23 24 ' 至少需要 /html/body/xxx 三级路径 25 If UBound(arrayXPath) < 3 Then Exit Function 26 27 ' 初始化数组 28 ' ReDim PG.htmlElementsOfXpath.htmlElement(UBound(arrayXPath) - 1) 29 PG.htmlElementsOfXpath.sizeOfElement = UBound(arrayXPath) - 1 30 ReDim PG.htmlElementsOfXpath.htmlElement(PG.htmlElementsOfXpath.sizeOfElement) 31 32 ' 处理绝对路径(以/开头) 33 If Len(arrayXPath(0)) = 0 Then 34 35 ' 验证并设置html和body标签 36 If LCase(arrayXPath(1)) = "html" Then 37 PG.htmlElementsOfXpath.htmlElement(0).elementName = "html" 38 PG.htmlElementsOfXpath.htmlElement(0).elementIndex = 1 39 PG.htmlElementsOfXpath.htmlElement(0).AttribCount = 0 40 41 If LCase(arrayXPath(2)) = "body" Then 42 PG.htmlElementsOfXpath.htmlElement(1).elementName = "body" 43 PG.htmlElementsOfXpath.htmlElement(1).elementIndex = 1 44 PG.htmlElementsOfXpath.htmlElement(1).AttribCount = 0 45 46 ' 处理后续路径部分 47 For iCount = 3 To UBound(arrayXPath) 48 49 Debug.Print "xpart", arrayXPath(iCount), Timer 50 51 52 If xpath_analyze_label(Trim(arrayXPath(iCount)), PG.htmlElementsOfXpath.htmlElement(iCount - 1)) = 0 Then Exit Function 53 54 55 56 Next 57 58 xpath_Analyzer = UBound(PG.htmlElementsOfXpath.htmlElement) 59 60 End If 61 End If 62 End If 63 End Function 64 65 66 Private Function xpath_analyze_label(ByVal strHtmllabel As String, xpathType As dom_element_in_xpath_analyzer) As Long 67 68 Dim left_kuohao As Long 69 Dim right_kuohao As Long 70 71 '// 查找左括号 72 left_kuohao = InStr(1, strHtmllabel, "[", vbBinaryCompare) 73 74 '// 查找右括号 75 If Right(strHtmllabel, 1) = "]" Then right_kuohao = Len(strHtmllabel) 76 77 '// 如果不存在左括号, 整个条目都是标签 78 If left_kuohao = 0 Then 79 xpathType.elementName = strHtmllabel 80 xpathType.elementIndex = 1 81 Else 82 '// 有左括号的情况下, 左括号左边的都是标签 83 xpathType.elementName = Trim(Left(strHtmllabel, left_kuohao - 1)) 84 85 86 End If 87 88 '// 存在右括号时, 从提取括号中的内存 89 If right_kuohao Then 90 91 Dim strAttribsBuffer As String 92 93 strAttribsBuffer = Trim(Mid(strHtmllabel, left_kuohao + 1, right_kuohao - left_kuohao - 1)) 94 95 96 If Left(strAttribsBuffer, 1) = "@" Then 97 '// 检查括号中的内容, 第1个字符是否为 @ ,如果是, 说明这是个文本属性 98 If xpath_analyze_label_attribs(strAttribsBuffer, xpathType) = 0 Then Exit Function 99 100 Else 101 '// 没有 @ 标志的情况下,视为数值型索引属性 102 xpathType.elementIndex = Val(strAttribsBuffer) 103 104 End If 105 106 End If 107 108 xpath_analyze_label = 1 109 110 End Function 111 112 113 Private Function xpath_analyze_label_attribs(ByVal strlabelAttribs As String, xpathType As dom_element_in_xpath_analyzer) As Long 114 Dim workStr As String: workStr = strlabelAttribs & " " ' 末尾加空格,避免处理边界 115 Dim pos As Long: pos = 1 116 Dim prevPos As Long: prevPos = 1 117 Dim condCount As Long: condCount = 0 118 Dim isOr As Boolean 119 120 ' 初始化属性数组(从1开始,避免空元素) 121 ReDim xpathType.Attribs(1 To 1) 122 xpathType.AttribCount = 0 123 124 Do While pos < Len(workStr) 125 ' 查找"and"或"or"(前后带空格的完整单词) 126 Dim andPos As Long: andPos = InStr(pos, workStr, " and ") 127 Dim orPos As Long: orPos = InStr(pos, workStr, " or ") 128 129 ' 找到第一个逻辑运算符 130 If andPos > 0 And (orPos = 0 Or andPos < orPos) Then 131 ' 处理"and"前的条件 132 Call AddSimpleAttribute(Mid(workStr, prevPos, andPos - prevPos), 0, xpathType) 133 condCount = condCount + 1 134 pos = andPos + 5 ' 跳过" and " 135 prevPos = pos 136 ElseIf orPos > 0 Then 137 ' 处理"or"前的条件 138 Call AddSimpleAttribute(Mid(workStr, prevPos, orPos - prevPos), 1, xpathType) 139 condCount = condCount + 1 140 pos = orPos + 4 ' 跳过" or " 141 prevPos = pos 142 Else 143 ' 处理最后一个条件 144 Call AddSimpleAttribute(Mid(workStr, prevPos, Len(workStr) - prevPos), 0, xpathType) 145 condCount = condCount + 1 146 Exit Do 147 End If 148 Loop 149 150 ' 调整数组大小为实际条件数(删除空元素) 151 If condCount > 0 Then 152 ReDim Preserve xpathType.Attribs(1 To condCount) 153 xpathType.AttribCount = condCount 154 xpath_analyze_label_attribs = 1 155 Else 156 xpath_analyze_label_attribs = 0 157 End If 158 End Function 159 160 ' 极简版属性解析(只处理核心格式) 161 Private Sub AddSimpleAttribute(ByVal attrStr As String, ByVal op As Long, xpathType As dom_element_in_xpath_analyzer) 162 Dim trimStr As String: trimStr = Trim(attrStr) 163 If Left(trimStr, 1) <> "@" Then Exit Sub ' 必须是@开头的属性 164 165 ' 分割属性名和值(找第一个=) 166 Dim eqPos As Long: eqPos = InStr(trimStr, "=") 167 If eqPos = 0 Then Exit Sub 168 169 ' 提取属性名和值(简单处理引号) 170 Dim attrName As String: attrName = Trim(Mid(trimStr, 2, eqPos - 2)) 171 Dim attrValue As String: attrValue = Trim(Mid(trimStr, eqPos + 1)) 172 173 ' 去除值前后的引号(只处理最外层) 174 If Left(attrValue, 1) = "'" Or Left(attrValue, 1) = """" Then 175 attrValue = Mid(attrValue, 2, Len(attrValue) - 2) 176 End If 177 178 ' 扩容数组并赋值 179 xpathType.AttribCount = xpathType.AttribCount + 1 180 If xpathType.AttribCount > UBound(xpathType.Attribs) Then 181 ReDim Preserve xpathType.Attribs(1 To xpathType.AttribCount) 182 End If 183 184 With xpathType.Attribs(xpathType.AttribCount) 185 .attrName = attrName 186 .attrValue = attrValue 187 .isCheckState = op ' 0=and,1=or 188 End With 189 End Sub

浙公网安备 33010602011771号