PctGL SERIES  
http://pctgl.cnblogs.com

为了写个通用的小说下载程序,而不得不写的一个网页页面源码分析代码
原因也很简单,这个下载程序用winhttp api直接访问 html 页面,下载页面源码文本
得到网页源码后,需要获取特定范围的特定文本(小说页面的文本)

几个选择:

1. 是使用web控件的DOM对象根据元素路径或特定元素直接定位, 
web控件的内核就是IE,缺点是浪费内存和cpu资源,且存在可能的解析错误(ie落伍了),如果使用web控件则控件不得设为不可见状态, .visible  的值必须为 true,否则不能加载网页,解决办法是把web控件放在边框外, 整个过程完全交由web控件下载加载获取,这个程序写出来虽然特别简单,但是我有资源洁癖。。。

2.直接创建HtmlDocument对象,加载网页源码,一样使用dom对象
最大的问题是几乎不可用,HtmlDocument对象是DOM对象内核,存在与web控件相同的问题,htmldocument没法在不显示出页面的情况下解析整个页面,他的元素子对象都是空的,说白了就是没法用
web控件/ie肯定是在加载了网页源码后,在页面渲染时做了什么,DOM对象才会真正的完全解析,总之这个方案不可行

3.为了那可怜的资源洁癖,还是忍着头痛写了几个函数,把网页源码给解析了。。。
下面的代码是不完整代码,实现了解析但是没有写提取和保存的代码,每个人思路不一样
有兴趣的可以自行修改使用,解析过程主要根据字节码解析,没有使用文本形式,这样大大增加了效率
解析逻辑:网页源码转换为 unicode 字节数组,对比 < > 关键符号,

webpage_Analyzer 负责找出所有的标签:即 < > 的内容,包括 < >
webpage_analyze_tag 负责把标签内的内容解析出来,忽略 < > ,按各种关键打断符号(回车/换行/跳格/空格/冒号/等号/引号)把内容提取出来
webpage_analyze_tag_attribs  负责根据标签内提取出来的内容进行识别,识别标签名,标签属性,标签类型等等。。。
由于结构体 dom_element_in_xpath_analyzer 设计的不符合要求,相当于没法存储解析出来的数据,所以后面的代码没有写,

 

 
'// 使用这样的代码调用解析程序, debug.print 输出解析结果,稍加跟踪就可以理解整个过程
call webpage_Analyzer Text1.text

 使用凤凰网(ifeng.com)首页源码测试:

<!DOCTYPE html>
<html xmlns:wb="http://open.weibo.com/wb" lang="zh">

<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=Edge">
<meta content="always" name="referrer">
<title>凤凰网</title>

<meta name="keywor...................后面略..............


debug输出下面例子代码:

tag field: !DOCTYPE 67002.7
tag field: html 67003.13
tag field: html 67003.42
tag field: xmlns 67003.77
tag field: : 67004
tag field: wb 67004.25
tag field: = 67004.48
tag field: http://open.weibo.com/wb 67004.78
tag field: lang 67004.99
tag field: = 67005.22
tag field: zh 67005.62
tag field: head 67005.8
tag field: meta 67005.98
tag field: charset 67006.16
tag field: = 67006.37
tag field: utf-8 67006.55
tag field: meta 67006.78
tag field: http-equiv 67007.2
tag field: = 67007.42
tag field: X-UA-Compatible 67007.63
tag field: content 67007.91
tag field: = 67008.18
tag field: IE=Edge 67008.6
tag field: meta 67008.87
tag field: content 67009.39
tag field: = 67009.63
tag field: always 67009.95
tag field: name 67010.2
tag field: = 67010.57
tag field: referrer 67010.87
tag field: title 67011.09
tag field: /title 67011.41
tag field: meta 67011.64
tag field: name 67012.02

  1 Private Type xpath_url_analyzer_type_attrib
  2     
  3     attrName        As String
  4     attrValue       As String
  5     isCheckState    As Long     '// 0 = and , 1 = or
  6     
  7 End Type
  8 
  9 Private Type dom_element_in_xpath_analyzer
 10     
 11     elementName     As String
 12     elementIndex    As Long
 13     AttribCount     As Long
 14     Attribs()       As xpath_url_analyzer_type_attrib
 15     
 16     isCloseElement  As Boolean
 17     elementTagClass As Long
 18     
 19     innerHtml_from  As Long
 20     innerHtml_to    As Long
 21     
 22     innerText_from  As Long
 23     innerText_to    As Long
 24     
 25     tagText_from    As Long
 26     tagText_to      As Long
 28     
 29 End Type
 30 
 31 Private Type simple_htmlanalyzer_documentobjectmodel
 32     
 33     sizeOfHtml                  As Long
 34     ByteOfHtml()                As Byte                                 '// html code
 35     
 36     sizeOfElement               As Long
 37     htmlElement()               As dom_element_in_xpath_analyzer        '// 根据 xpath 分析器指定的路径解析出的元素
 38     
 40 End Type
 41 
 42 Private Type ThisClassSet
 43     
 44     htmlElementsOfXpath         As simple_htmlanalyzer_documentobjectmodel
 45     
 47 End Type
 48 Dim PG                      As ThisClassSet
 49 
 50 'Dim htmlLabelArray()    As dom_element_in_xpath_analyzer
 51 Dim htmlelements        As simple_htmlanalyzer_documentobjectmodel
 54 Function webpage_analyze_GetElementInnerHtml() As String
 55     
 56 End Function
 59 Function webpage_analyze_GetElementInnerText() As String
 60     
 61 End Function
 62 
 63 Private Function webpage_analyze_tag_attribs(WebPageTagfield() As String, domElement As dom_element_in_xpath_analyzer) As Long
 64     
 65     '// 此函数的思路,是以状态机形式,连续解析标签内的内容并存储
 68     '// webpage_analyze_tag 函数最后应调用此函数,解析标签内的文本含义和存储
 68     '//WebPageTagfield 数组保存着标签内全部已识别的字段,如:div class = xxx id = xxx style : color = xxx disabled
 69 End Function
 70 
 71 Private Function webpage_analyze_tag(webpageTag() As Byte, ByVal tagfrom As Long, ByVal tagto As Long) As dom_element_in_xpath_analyzer
 72     
 73 '    Debug.Print "webpage_analyze_tag", webpageTag, Timer
 74     
 75     
 76     Dim iCount              As Long
 77     
 78     
 79     If tagto - tagfrom = 2 Then Exit Function       '// 没有具体内容, 形式为: <>
 80     
 81     With webpage_analyze_tag
 82     
 83         .tagText_from = tagfrom
 84         .tagText_to = tagto
 85         
 86         ReDim .Attribs(100)
 87         
 88     End With
 89     
 90     
 91     Dim tagAttribField()        As String
 92     Dim psAttribField           As Long
 93     
 94     Dim flag_tag_leftStart      As Long
 95     Dim flag_DanYinHao          As Long         '// 单引号 标志位
 96     Dim flag_ShuangYinHao       As Long         '// 双引号 标志位
 97 '    Dim flag_Space              As Long         '// 空格 标志位
 98     
 99 '    Dim flag_tag_finished       As Long         '// 标签名提取完成
100 '    Dim state_tag_attribs_check As Long         '// 状态机: =0 待提取tag, = -1 已提取标签名, =1 属性名已提取, = 2 等号已确认, = 3 前导引号已确认, 完成后重置状态机为 = -1
101     
102     Dim ByteSymbolTable(255)       As Byte
103     
104     '**************************************** 做一个类型表 ***********************************************
105     
106     '// 13, 10, 9 32 58,enter,changeline,tab,space,
107     ByteSymbolTable(9) = 1
108     ByteSymbolTable(10) = 1
109     ByteSymbolTable(13) = 1
110     ByteSymbolTable(32) = 1
111     
112     ByteSymbolTable(62) = 1   '// >
113     ByteSymbolTable(58) = 1   '// 冒号
114     ByteSymbolTable(61) = 1   '// 等号
118 '    ByteSymbolTable(62) = 8     '// >
120     
121     '// 34/39    双引号/单引号
122     ByteSymbolTable(34) = 2
123     ByteSymbolTable(39) = 3
124     
125     
126     
127 '    '// /  闭合标签
128     ByteSymbolTable(47) = 6
129     
130     '// !  注释标签
131     ByteSymbolTable(33) = 5
132     
133     
134     
135     ' 字母/数字/-/_标记为5
136     For iCount = 65 To 90: ByteSymbolTable(iCount) = 6: Next         ' A-Z
137     For iCount = 97 To 122: ByteSymbolTable(iCount) = 6: Next        ' a-z
138     ByteSymbolTable(45) = 6                                     ' -
139     ByteSymbolTable(95) = 6                                     ' 下划线 _
140 
141     ' 数字
142     For iCount = 48 To 57: ByteSymbolTable(iCount) = 7: Next    ' 0-9
143     
144     '**************************************** 类型表完成 ***********************************************
145     
146 '    SpaceBreakSymbol = 1
147 '    SingleQuoteSymbol = 2
148 '    DoubleQuoteSymbol = 3
149 '    CloseSymbol = 4
150 '    CommentSymbol = 5
151 '    WordSymbol = 6
152 '    NumberSymbol = 7
153     
154 '   Dim state_tag_attribs_check As Long
155     '// 状态机: =0 待提取tag, = -1 已提取标签名, 1 属性名已提取, = 2 等号已确认, = 3 前导引号已确认, 4 属性名后没有=直接新属性 完成后重置状态机为 = -1
156     
157     ReDim tagAttribField(tagto - tagfrom)
158     
159     
160     For iCount = tagfrom + 2 To tagto Step 2
161         
162         If webpageTag(iCount + 1) = 0 Then
163         
164             Select Case ByteSymbolTable(webpageTag(iCount))
165                    
166                    '// 13, 10, 9 32,58 ,enter,changeline,tab,space, >, : ,
167                    Case 1
168                         '// 忽略在标签字符前的中断符,实现一定的容错处理
169                         
170                         '// 如果不在引号范围内 or 如果遇到结尾符号: >
171                         If ((flag_DanYinHao = 0) And (flag_ShuangYinHao = 0)) Or (iCount = tagto) Then
172                             
173                             '// 如果是字段结尾
174                             If flag_tag_leftStart Then
175                                 
176                                 tagAttribField(psAttribField) = String((iCount - flag_tag_leftStart) \ 2, 0)
177                                 
178                                 CopyMemory ByVal StrPtr(tagAttribField(psAttribField)), webpageTag(flag_tag_leftStart), iCount - flag_tag_leftStart
179                                 
180                                 psAttribField = psAttribField + 1
181                                 
182                                 Debug.Print "tag field:", tagAttribField(psAttribField - 1), Timer
183                                 
184                                 webpage_analyze_flag_setbitszero flag_tag_leftStart, flag_DanYinHao, flag_ShuangYinHao
185                                 
186                                 
187                             End If
188                             
189                             '// : , =
190                             If (webpageTag(iCount) = 58) Or (webpageTag(iCount) = 61) Then
191                                 tagAttribField(psAttribField) = String(1, webpageTag(iCount))
192                                 psAttribField = psAttribField + 1
193                                 
194                                 Debug.Print "tag field:", tagAttribField(psAttribField - 1), Timer
195                                 
196                                 webpage_analyze_flag_setbitszero flag_tag_leftStart, flag_DanYinHao, flag_ShuangYinHao
197 
198                             End If
199                             
200                         End If
201                    
202                    '// 34 双引号
203                    Case 2
204                         '// 敏感引号类型, 对于单双引号套双单引号的情况, 只对第1个引号类型敏感
205                         If flag_DanYinHao = 0 Then
206                         
207                             If flag_ShuangYinHao Then
208                                 
209                                 '// 如果是字段结尾
210                                 If flag_tag_leftStart Then
211                                     
212                                     tagAttribField(psAttribField) = String((iCount - flag_tag_leftStart) \ 2, 0)
213                                     
214                                     CopyMemory ByVal StrPtr(tagAttribField(psAttribField)), webpageTag(flag_tag_leftStart), iCount - flag_tag_leftStart
215                                     
216                                     psAttribField = psAttribField + 1
217                                     
218                                     Debug.Print "tag field:", tagAttribField(psAttribField - 1), Timer
219                                     
220                                     webpage_analyze_flag_setbitszero flag_tag_leftStart, flag_DanYinHao, flag_ShuangYinHao
221                                     
222                                     
223                                 End If
224                                 
225                                 
226                                 
227                             Else
228                                 flag_ShuangYinHao = iCount
229                             End If
230                         End If
231                    
232                    
233                    '// 39 单引号
234                    Case 3
235                         '// 敏感引号类型, 对于单双引号套双单引号的情况, 只对第1个引号类型敏感
236                         If flag_ShuangYinHao = 0 Then
237                         
238                             If flag_DanYinHao Then
239                                 
240                                 '// 如果是字段结尾
241                                 If flag_tag_leftStart Then
242                                     
243                                     tagAttribField(psAttribField) = String((iCount - flag_tag_leftStart) \ 2, 0)
244                                     
245                                     CopyMemory ByVal StrPtr(tagAttribField(psAttribField)), webpageTag(flag_tag_leftStart), iCount - flag_tag_leftStart
246                                     
247                                     psAttribField = psAttribField + 1
248                                     
249                                     Debug.Print "tag field:", tagAttribField(psAttribField - 1), Timer
250                                     
251                                     webpage_analyze_flag_setbitszero flag_tag_leftStart, flag_DanYinHao, flag_ShuangYinHao
252                                     
253                                     
254                                 End If
255                                 
256                                 
257                             Else
258                                 flag_DanYinHao = iCount
259                             End If
260                         End If
261                         
262                    
263 '                   '// /  闭合标签
264 '                   Case 4
265 '
266 '                        If flag_tag_leftStart = 0 Then webpage_analyze_tag.isCloseElement = iCount
267                         
268                         
269                         
270                    '// !  注释标签
271                    Case 5
272                         '// 注释前导
273                         If (webpageTag(iCount + 2) = 45) And (webpageTag(iCount + 4) = 45) Then
274                             If (webpageTag(iCount + 3) = 0) And (webpageTag(iCount + 5) = 0) Then
275                                 
276                                 '// 标记注释标签
277                                 webpage_analyze_tag.elementName = "!"
278 
279                             End If
280                             
281                         Else
282                             '// ! 前导标签,非注释
283                             If flag_tag_leftStart = 0 Then flag_tag_leftStart = iCount
284                             
285                         End If
286                    
287                    '// a~z:97 To 122, A~Z:65 To 90, 0~9:48 To 57, -:45,_:95,
288                    Case 6, 7
289                         
290                         If flag_tag_leftStart = 0 Then flag_tag_leftStart = iCount
291                         
292                         
293             
294                    '//Case 7   '// 应单独判断数字类型作为前导字符的情况: 如果没有字母前导/没有引号前导,则标签或属性无效
295                         
296             
297             End Select
298         
299         End If
300         
301     Next
302     
303     
304     If psAttribField Then ReDim Preserve tagAttribField(psAttribField - 1)
305     
306     
307     
308 End Function
309 
310 Sub webpage_analyze_flag_setbits(ByVal exp1 As Long, ByVal exp2 As Long, flagVar As Long, ByVal bitsVal As Long)
311     '// 标志位带条件置位
312     If (exp1) Or (exp2) Then flagVar = bitsVal
313 End Sub
314 
315 Sub webpage_analyze_flag_setbitszero(exp1 As Long, Optional exp2 As Long, Optional exp3 As Long, Optional exp4 As Long, Optional exp5 As Long, Optional exp6 As Long, Optional exp7 As Long)
316     '// 标志位全部清 0
317     exp1 = 0: exp2 = 0: exp3 = 0: exp4 = 0: exp5 = 0: exp6 = 0: exp7 = 0
318 End Sub
319 
320 '// 分析web页面源码
321 Function webpage_Analyzer(ByVal strWebPage As String) As Long
322     
323     Dim mulBitsBuffer()         As Byte         '// 提取的标签字节码
324     
325     '// 获取webpage字节码,,缓存 html code, 保存 uboudund(html byte code)
326     PG.htmlElementsOfXpath.ByteOfHtml = strWebPage
327     
328     '// 缓存webpage源码总字节量
329     PG.htmlElementsOfXpath.sizeOfHtml = UBound(PG.htmlElementsOfXpath.ByteOfHtml)
330 
331     Dim iCount                  As Long         '// 循环计次
332     
333     Dim flag_left_FangKuoHao    As Long         '// 左 方括号标志位
334     Dim flag_DanYinHao          As Long         '// 单引号 标志位
335     Dim flag_ShuangYinHao       As Long         '// 双引号 标志位
336     Dim flag_Space              As Long         '// 空格 标志位
337     
338     Dim flag_enter              As Long
339     Dim flag_tab                As Long
340     Dim flag_changeline         As Long
341     
342     For iCount = 0 To PG.htmlElementsOfXpath.sizeOfHtml Step 2
343         
344         '// 查找 <
345         If PG.htmlElementsOfXpath.ByteOfHtml(iCount) = 60 And PG.htmlElementsOfXpath.ByteOfHtml(iCount + 1) = 0 Then
346             
347             '// 未遇到过 <
348             If flag_left_FangKuoHao = 0 Then
349                 
350                 flag_left_FangKuoHao = iCount
351                 
352             '// 在遇到过 < 但没有被引号包括时: 放弃之前的错误标签类型
353             Else
354                 If (flag_ShuangYinHao = 0) Or (flag_DanYinHao = 0) Then
355                     
356                     '// 标志位 清0
357                     webpage_analyze_flag_setbitszero flag_left_FangKuoHao, flag_DanYinHao, flag_ShuangYinHao, flag_Space, flag_enter, flag_changeline, flag_tab
358                     
359                     '// 标签识别起始位 置位
360                     flag_left_FangKuoHao = iCount
361                     
362                 End If
363                 
364             End If
365 '        End If
366         
367         '// 查找 "
368         ElseIf PG.htmlElementsOfXpath.ByteOfHtml(iCount) = 34 And PG.htmlElementsOfXpath.ByteOfHtml(iCount + 1) = 0 Then
369             
370             '// 敏感引号类型, 对于单双引号套双单引号的情况, 只对第1个引号类型敏感
371             If flag_DanYinHao = 0 Then
372             
373                 If flag_ShuangYinHao Then
374                     flag_ShuangYinHao = 0
375                 Else
376                     flag_ShuangYinHao = iCount
377                 End If
378             End If
379 '        End If
380         
381         '// 查找 空格
382         ElseIf PG.htmlElementsOfXpath.ByteOfHtml(iCount) = 32 And PG.htmlElementsOfXpath.ByteOfHtml(iCount + 1) = 0 Then
383             webpage_analyze_flag_setbits flag_DanYinHao, flag_ShuangYinHao, flag_Space, iCount
384         
385         '// 查找 enter
386         ElseIf PG.htmlElementsOfXpath.ByteOfHtml(iCount) = 13 And PG.htmlElementsOfXpath.ByteOfHtml(iCount + 1) = 0 Then
387             webpage_analyze_flag_setbits flag_DanYinHao, flag_ShuangYinHao, flag_enter, iCount
388         
389         '// 查找 changeline
390         ElseIf PG.htmlElementsOfXpath.ByteOfHtml(iCount) = 10 And PG.htmlElementsOfXpath.ByteOfHtml(iCount + 1) = 0 Then
391             webpage_analyze_flag_setbits flag_DanYinHao, flag_ShuangYinHao, flag_changeline, iCount
392         
393         '// 查找 tab
394         ElseIf PG.htmlElementsOfXpath.ByteOfHtml(iCount) = 9 And PG.htmlElementsOfXpath.ByteOfHtml(iCount + 1) = 0 Then
395             webpage_analyze_flag_setbits flag_DanYinHao, flag_ShuangYinHao, flag_tab, iCount
396         
397         '// 查找 '
398         ElseIf PG.htmlElementsOfXpath.ByteOfHtml(iCount) = 39 And PG.htmlElementsOfXpath.ByteOfHtml(iCount + 1) = 0 Then
399             
400             '// 敏感引号类型, 对于单双引号套双单引号的情况, 只对第1个引号类型敏感
401             If flag_ShuangYinHao = 0 Then
402             
403                 If flag_DanYinHao Then
404                     flag_DanYinHao = 0
405                 Else
406                     flag_DanYinHao = iCount
407                 End If
408             End If
409         
410         
411         '// 查找 >
412         ElseIf PG.htmlElementsOfXpath.ByteOfHtml(iCount) = 62 And PG.htmlElementsOfXpath.ByteOfHtml(iCount + 1) = 0 Then
413             
414             '// 遇到 > 时,表明标签结束
415             If flag_left_FangKuoHao Then
416                 
417                 '// 如果单双引号都是闭合状态
418                 If (flag_ShuangYinHao = 0) And (flag_DanYinHao = 0) Then
419                     
420 '                    '// 提取标签内容, 包括左括号, 但不包括右括号
421 '                    ReDim mulBitsBuffer(iCount - flag_left_FangKuoHao)
422 '
423 '                    CopyMemory mulBitsBuffer(0), PG.htmlElementsOfXpath.ByteOfHtml(flag_left_FangKuoHao + 2), iCount - flag_left_FangKuoHao - 2
424 '
425 '                    '// 解析标签
426 '                    webpage_analyze_tag (mulBitsBuffer)
427                     webpage_analyze_tag PG.htmlElementsOfXpath.ByteOfHtml, flag_left_FangKuoHao, iCount
428                     
429                     webpage_analyze_flag_setbitszero flag_left_FangKuoHao, flag_DanYinHao, flag_ShuangYinHao, flag_Space, flag_enter, flag_changeline, flag_tab
430 
431                 
432                 Else
433                     '// 单双引号非闭合状态, 符合:大于50个字符,且遇有空格/回车/tab/换行时,或超过256个字符时,发现 > 强制闭合
434                     If iCount - flag_left_FangKuoHao > 100 Then
435 
436                         Select Case True
437                                Case ((iCount - flag_left_FangKuoHao) > 512), ((flag_Space Or flag_enter Or flag_changeline Or flag_tab) <> 0)
438 
439 '                                    '// 提取标签内容, 包括左括号, 但不包括右括号
440 '                                    ReDim mulBitsBuffer(iCount - flag_left_FangKuoHao)
441 '
442 '                                    CopyMemory mulBitsBuffer(0), PG.htmlElementsOfXpath.ByteOfHtml(flag_left_FangKuoHao + 2), iCount - flag_left_FangKuoHao - 2
443 '
444 '                                    '// 解析标签
445 '                                    webpage_analyze_tag (mulBitsBuffer)
446                                     
447                                     webpage_analyze_tag PG.htmlElementsOfXpath.ByteOfHtml, flag_left_FangKuoHao, iCount
448                                     
449                                     
450                                     '// 标志位清0
451                                     webpage_analyze_flag_setbitszero flag_left_FangKuoHao, flag_DanYinHao, flag_ShuangYinHao, flag_Space, flag_enter, flag_changeline, flag_tab
452 
453                                     
454                         End Select
455                         
456                     End If
457                     
458 
459                     
460                     
461                 End If
462                 
463             Else
464                 '// 未发现过 < ,单独发现闭合符号, 直接舍弃忽略, '// 标志位清0
465                 webpage_analyze_flag_setbitszero flag_left_FangKuoHao, flag_DanYinHao, flag_ShuangYinHao, flag_Space, flag_enter, flag_changeline, flag_tab
466 
467                 
468             End If
469             
470             
471             
472         End If
473         
474     Next
475     
476     
477     
478     
479 End Function

 

最后,还有 xpath 的路径解析函数 , 由于结构体设计的不好,代码也写的比较糊弄

  1 Function xpath_Analyzer(ByVal strXpath As String) As Long
  2     '/html/body/div                         文档根目录下的body标签中的所有div标签(绝对路径)。
  3     '/html/body/div[1]
  4     '/html/body/div[@xx='cc']
  5     '/html/body/div[@xx='cc' and @xx2='cc2']
  6     '//div[1]    文档中第一个div标签(XPath 索引从 1 开始)。
  7     
  8     '//div[last()]   文档中最后一个div标签。
  9     '//div[position()<3] 文档中前两个div标签。
 10     '//div[2]/p[1]   第二个div中的第一个p标签。
 11     '//input[@type='text'][@name='username'] 所有type为text且name为username的input标签(多属性筛选)。
 12     '//div[@class='item'][2] 第二个class为item的div标签(属性与索引组合)。
 13     
 14     Debug.Print "XPath", strXpath, Timer
 15     
 16     Dim iCount                  As Long
 17     Dim arrayXPath()            As String
 18     Dim xpathPart               As String
 19     Dim currentLevel            As Long
 20     
 21     ' 分割XPath路径
 22     arrayXPath = Split(strXpath, "/")
 23     
 24     ' 至少需要 /html/body/xxx 三级路径
 25     If UBound(arrayXPath) < 3 Then Exit Function
 26     
 27     ' 初始化数组
 28 '    ReDim PG.htmlElementsOfXpath.htmlElement(UBound(arrayXPath) - 1)
 29     PG.htmlElementsOfXpath.sizeOfElement = UBound(arrayXPath) - 1
 30     ReDim PG.htmlElementsOfXpath.htmlElement(PG.htmlElementsOfXpath.sizeOfElement)
 31     
 32     ' 处理绝对路径(以/开头)
 33     If Len(arrayXPath(0)) = 0 Then
 34     
 35         ' 验证并设置html和body标签
 36         If LCase(arrayXPath(1)) = "html" Then
 37             PG.htmlElementsOfXpath.htmlElement(0).elementName = "html"
 38             PG.htmlElementsOfXpath.htmlElement(0).elementIndex = 1
 39             PG.htmlElementsOfXpath.htmlElement(0).AttribCount = 0
 40             
 41             If LCase(arrayXPath(2)) = "body" Then
 42                 PG.htmlElementsOfXpath.htmlElement(1).elementName = "body"
 43                 PG.htmlElementsOfXpath.htmlElement(1).elementIndex = 1
 44                 PG.htmlElementsOfXpath.htmlElement(1).AttribCount = 0
 45                 
 46                 ' 处理后续路径部分
 47                 For iCount = 3 To UBound(arrayXPath)
 48                     
 49                     Debug.Print "xpart", arrayXPath(iCount), Timer
 50                     
 51                     
 52                     If xpath_analyze_label(Trim(arrayXPath(iCount)), PG.htmlElementsOfXpath.htmlElement(iCount - 1)) = 0 Then Exit Function
 53                     
 54                     
 55                     
 56                 Next
 57                 
 58                 xpath_Analyzer = UBound(PG.htmlElementsOfXpath.htmlElement)
 59                 
 60             End If
 61         End If
 62     End If
 63 End Function
 64 
 65 
 66 Private Function xpath_analyze_label(ByVal strHtmllabel As String, xpathType As dom_element_in_xpath_analyzer) As Long
 67     
 68     Dim left_kuohao     As Long
 69     Dim right_kuohao    As Long
 70     
 71     '// 查找左括号
 72     left_kuohao = InStr(1, strHtmllabel, "[", vbBinaryCompare)
 73     
 74     '// 查找右括号
 75     If Right(strHtmllabel, 1) = "]" Then right_kuohao = Len(strHtmllabel)
 76     
 77     '// 如果不存在左括号, 整个条目都是标签
 78     If left_kuohao = 0 Then
 79         xpathType.elementName = strHtmllabel
 80         xpathType.elementIndex = 1
 81     Else
 82         '// 有左括号的情况下, 左括号左边的都是标签
 83         xpathType.elementName = Trim(Left(strHtmllabel, left_kuohao - 1))
 84 
 85         
 86     End If
 87     
 88     '// 存在右括号时, 从提取括号中的内存
 89     If right_kuohao Then
 90         
 91         Dim strAttribsBuffer    As String
 92         
 93         strAttribsBuffer = Trim(Mid(strHtmllabel, left_kuohao + 1, right_kuohao - left_kuohao - 1))
 94         
 95         
 96         If Left(strAttribsBuffer, 1) = "@" Then
 97             '// 检查括号中的内容, 第1个字符是否为 @ ,如果是, 说明这是个文本属性
 98             If xpath_analyze_label_attribs(strAttribsBuffer, xpathType) = 0 Then Exit Function
 99         
100         Else
101             '// 没有 @ 标志的情况下,视为数值型索引属性
102             xpathType.elementIndex = Val(strAttribsBuffer)
103             
104         End If
105         
106     End If
107     
108     xpath_analyze_label = 1
109     
110 End Function
111 
112 
113 Private Function xpath_analyze_label_attribs(ByVal strlabelAttribs As String, xpathType As dom_element_in_xpath_analyzer) As Long
114     Dim workStr As String: workStr = strlabelAttribs & " " ' 末尾加空格,避免处理边界
115     Dim pos As Long: pos = 1
116     Dim prevPos As Long: prevPos = 1
117     Dim condCount As Long: condCount = 0
118     Dim isOr As Boolean
119     
120     ' 初始化属性数组(从1开始,避免空元素)
121     ReDim xpathType.Attribs(1 To 1)
122     xpathType.AttribCount = 0
123     
124     Do While pos < Len(workStr)
125         ' 查找"and"或"or"(前后带空格的完整单词)
126         Dim andPos As Long: andPos = InStr(pos, workStr, " and ")
127         Dim orPos As Long: orPos = InStr(pos, workStr, " or ")
128         
129         ' 找到第一个逻辑运算符
130         If andPos > 0 And (orPos = 0 Or andPos < orPos) Then
131             ' 处理"and"前的条件
132             Call AddSimpleAttribute(Mid(workStr, prevPos, andPos - prevPos), 0, xpathType)
133             condCount = condCount + 1
134             pos = andPos + 5 ' 跳过" and "
135             prevPos = pos
136         ElseIf orPos > 0 Then
137             ' 处理"or"前的条件
138             Call AddSimpleAttribute(Mid(workStr, prevPos, orPos - prevPos), 1, xpathType)
139             condCount = condCount + 1
140             pos = orPos + 4 ' 跳过" or "
141             prevPos = pos
142         Else
143             ' 处理最后一个条件
144             Call AddSimpleAttribute(Mid(workStr, prevPos, Len(workStr) - prevPos), 0, xpathType)
145             condCount = condCount + 1
146             Exit Do
147         End If
148     Loop
149     
150     ' 调整数组大小为实际条件数(删除空元素)
151     If condCount > 0 Then
152         ReDim Preserve xpathType.Attribs(1 To condCount)
153         xpathType.AttribCount = condCount
154         xpath_analyze_label_attribs = 1
155     Else
156         xpath_analyze_label_attribs = 0
157     End If
158 End Function
159 
160 ' 极简版属性解析(只处理核心格式)
161 Private Sub AddSimpleAttribute(ByVal attrStr As String, ByVal op As Long, xpathType As dom_element_in_xpath_analyzer)
162     Dim trimStr As String: trimStr = Trim(attrStr)
163     If Left(trimStr, 1) <> "@" Then Exit Sub ' 必须是@开头的属性
164     
165     ' 分割属性名和值(找第一个=)
166     Dim eqPos As Long: eqPos = InStr(trimStr, "=")
167     If eqPos = 0 Then Exit Sub
168     
169     ' 提取属性名和值(简单处理引号)
170     Dim attrName As String: attrName = Trim(Mid(trimStr, 2, eqPos - 2))
171     Dim attrValue As String: attrValue = Trim(Mid(trimStr, eqPos + 1))
172     
173     ' 去除值前后的引号(只处理最外层)
174     If Left(attrValue, 1) = "'" Or Left(attrValue, 1) = """" Then
175         attrValue = Mid(attrValue, 2, Len(attrValue) - 2)
176     End If
177     
178     ' 扩容数组并赋值
179     xpathType.AttribCount = xpathType.AttribCount + 1
180     If xpathType.AttribCount > UBound(xpathType.Attribs) Then
181         ReDim Preserve xpathType.Attribs(1 To xpathType.AttribCount)
182     End If
183     
184     With xpathType.Attribs(xpathType.AttribCount)
185         .attrName = attrName
186         .attrValue = attrValue
187         .isCheckState = op ' 0=and,1=or
188     End With
189 End Sub

 

posted on 2025-08-03 18:48  PctGL  阅读(41)  评论(0)    收藏  举报