Common usage of Ruby's basic libraries

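I have been writing Ruby scripts for a while now. The API documentation for some Ruby libraries is not very complete, socket in particular. In this post I excerpt some usages of Ruby's basic libraries. The logic of the code itself runs, but some accompanying YAML data files are not included, so the code cannot be run as-is.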

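I am writing this post in the hope that it helps some Ruby newcomers. There are not many Chinese-language Ruby tutorials at the moment, and many of them repeat each other. In my view Ruby is neither especially good nor especially bad; once you are familiar with its syntax and its "pass by object reference" semantics, you can't help developing both love and hate for it. This post mainly covers practical things that help get a task done quickly.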

 

I. Herbern m3u8 downloader

It uses the following from Ruby:

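OpenURI: the HTTP download library; remember to set the headers. In io = URI.open(link, hash), the string keys of hash are sent as the headers of the HTTP GET request (open-uri does not send POST requests), while symbol keys are open-uri options. If the hash specifies :encoding => 'utf-8', open-uri reads the downloaded stream as UTF-8; otherwise the encoding is taken from the charset in the response's Content-Type header, which may not be the encoding you expect.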
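A minimal sketch of the call described above. The helper name fetch_text, the header value and the URL in the usage comment are placeholders for illustration and do not come from the original scripts.

```ruby
require 'open-uri'

# String keys are sent as HTTP request headers.
REQUEST_HEADERS = {
  'User-Agent' => 'Mozilla/5.0 (example agent string)'
}.freeze

# Hypothetical helper: download link and return the body as a String.
# The symbol key :encoding is an open-uri option, not a header.
def fetch_text(link, encoding = 'utf-8')
  options = REQUEST_HEADERS.merge(encoding: encoding)
  URI.open(link, options) { |io| io.read }
end

# Usage (needs network access; the URL is a placeholder):
#   body = fetch_text('http://example.com/index.m3u8')
#   puts body.encoding
```

Passing a block to URI.open closes the stream when the block returns, so no separate close call is needed.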

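socket (run on Windows): reading and writing with TCPServer and TCPSocket. The class chain is TCPSocket < IPSocket < BasicSocket < IO (and likewise Socket < BasicSocket < IO), so read(Integer) is inherited from IO, and close_write is available on every socket (BasicSocket defines it, and IO has one as well).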
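The write, close_write, read sequence can be sketched with a server and a client in one process. This is an illustration only: port 0 (let the OS choose a free port) and the "got:" reply are assumptions, while the original scripts use the fixed port 6262.

```ruby
require 'socket'

# Port 0 asks the OS for any free port.
server = TCPServer.new('127.0.0.1', 0)
port = server.addr[1]

# Server side: read the whole request, then answer and close.
server_thread = Thread.new do
  conn = server.accept
  request = conn.read       # returns once the client has half-closed
  conn.print "got:#{request}"
  conn.close
end

client = TCPSocket.new('127.0.0.1', port)
client.print 'status_1234_5'
client.close_write          # half-close: the server now sees EOF
reply = client.read(30)     # read up to 30 bytes, or until the server closes
client.close

server_thread.join
server.close
puts reply                  # got:status_1234_5
```

Without close_write the server's read would keep waiting for more data, because the client would still have its writing side open.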

Set[] (sets): require 'set'; a = Set[3,4,5]; a.to_a # => [3,4,5]; [5,6,7].to_set # => #<Set: {5, 6, 7}>; a.include?(5) # => true; a << 2 # => #<Set: {3, 4, 5, 2}>; a.delete(5) # => #<Set: {3, 4, 2}>

Basic usage of the JSON library (the Psych YAML library is used in a similar way);
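A small round-trip sketch showing how similar the two libraries are. The sample data is made up for illustration.

```ruby
require 'json'
require 'psych'

infos = [{ 'fn' => 'seg0.ts', 'link' => 'http://example.com/seg0.ts' }]

# JSON: object -> text -> object
json_text = JSON.generate(infos)   # same result as infos.to_json
from_json = JSON.parse(json_text)

# YAML via Psych: the same two steps
yaml_text = Psych.dump(infos)      # same result as infos.to_yaml
from_yaml = Psych.load(yaml_text)

puts json_text
puts yaml_text
puts from_json == infos && from_yaml == infos
```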

Thread.new {}: multithreading; for the usage of Mutex and Queue, see the runoob (菜鸟教程) tutorial.
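As a rough sketch of how the three classes fit together: a Queue hands out work, several threads consume it, and a Mutex guards the shared result array. The job (squaring numbers) and the worker count are arbitrary choices for the example.

```ruby
# Queue is thread-safe: several threads can pop from it at once.
jobs = Queue.new
(1..10).each { |n| jobs << n }
jobs.close  # after close, pop returns nil once the queue is empty

results = []
lock = Mutex.new

workers = 3.times.map do
  Thread.new do
    while (n = jobs.pop)
      square = n * n
      # Only one thread at a time may append to the shared array.
      lock.synchronize { results << square }
    end
  end
end

workers.each(&:join)   # wait for every worker to finish
puts results.sort.inspect
```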

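The if ($0 == __FILE__) idiom, which runs code only when this file is the startup script, similar to Python's if __name__ == "__main__"; and ARGV, the array of command-line arguments.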
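A minimal sketch of the idiom. The function name run and the returned strings are made up for illustration; the command names mirror the ones the downloader accepts.

```ruby
# Hypothetical dispatcher: maps the first command-line argument to an action.
def run(argv)
  case argv[0]
  when nil    then 'resume the cached task'
  when 'new'  then 'start a new task'
  when 'make' then 'merge the downloaded files'
  else "unsupported command: #{argv[0]}"
  end
end

# $0 is the script Ruby was started with; __FILE__ is the current file.
# They match only when this file is run directly, not when it is required.
if $0 == __FILE__
  puts run(ARGV)
end
```

Keeping the dispatch in a function that takes the argument array makes it possible to call it from another file or from irb without touching ARGV.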

Simple regular expression usage: pay attention to whether multiline mode is on, and to replacing newline characters.

text0 = '<div class="el">
        <p class="t1 ">
            <em class="check" name="delivery_em" onclick="checkboxClick(this)"></em>
            <input class="checkbox" type="checkbox" name="delivery_jobid" value="110487870" jt="0" style="display:none" />
            <span>
                <a target="_blank" title="美容顾问(佛山)" href="https://jobs.51job.com/foshan/110487870.html?s=01&t=0"  onmousedown="">
                    美容顾问(佛山)                </a>
            </span>
                                                                    </p>
        <span class="t2"><a target="_blank" title="上海蝶翠诗商业有限公司(DHC)" href="https://jobs.51job.com/all/co298555.html">上海蝶翠诗商业有限公司(DHC)</a></span>
        <span class="t3">佛山</span>
        <span class="t4">6-8千/月</span>
        <span class="t5">02-24</span>
    </div>
    <div class="el">
        <p class="t1 ">
            <em class="check" name="delivery_em" onclick="checkboxClick(this)"></em>
            <input class="checkbox" type="checkbox" name="delivery_jobid" value="120125665" jt="0" style="display:none" />
            <span>
                <a target="_blank" title="电商业务专员" href="https://jobs.51job.com/foshan-sdq/120125665.html?s=01&t=0"  onmousedown="">
                    电商业务专员                </a>
            </span>
                                                                    </p>
        <span class="t2"><a target="_blank" title="美的集团热水器事业部" href="https://jobs.51job.com/all/co3224953.html">美的集团热水器事业部</a></span>
        <span class="t3">佛山-顺德区</span>
        <span class="t4">1-1.5万/月</span>
        <span class="t5">02-24</span>
    </div>'
# Remove line breaks first: without the /m flag, "." does not match "\n"
text0 = text0.gsub("\r", '').gsub("\n", '')
h = '<div class="el">'
b = '</div>'
re = /#{h}.*?#{b}/ # non-greedy: stop at the first closing </div>
elems = text0.scan(re)
puts elems.size, elems[0]
a = /(\d+)(abc)de/.match('12abcdef') # => #<MatchData "12abcde" 1:"12" 2:"abc">
a.to_a     # => ["12abcde", "12", "abc"] (element 0 is the whole match)
a.captures # => ["12", "abc"]
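As an alternative to stripping the line breaks, the multiline flag can be switched on. The sketch below uses a shortened, made-up HTML fragment and compares the same pattern with and without /m.

```ruby
text = <<~HTML
  <div class="el">
    <span>first</span>
  </div>
  <div class="el">
    <span>second</span>
  </div>
HTML

# Regexp.escape protects any characters that are special in a pattern.
open_tag  = Regexp.escape('<div class="el">')
close_tag = Regexp.escape('</div>')

# Without /m, "." stops at "\n", so nothing spanning several lines matches.
single = text.scan(/#{open_tag}.*?#{close_tag}/)

# With /m, "." also matches "\n"; the line breaks can stay in the text.
multi = text.scan(/#{open_tag}.*?#{close_tag}/m)

puts single.size  # 0
puts multi.size   # 2
```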

 

# m3u8 downloader "Herbern" (赫本)
require 'open-uri'
require 'fileutils'
require 'socket'
require 'json'
#~ require 'sinatra/base'
# mlink is the HTTP URL of the m3u8 playlist

HEADER = {
  'user-agent' => 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
  'connection' => 'keep-alive'
}

def parse_links(mfile, mlink)
  # Keep only the segment lines: drop blank lines and '#' directives
  lines = File.readlines(mfile).map(&:chomp)
  lines.reject! { |x| x == '' || x[0] == '#' }
  raise "no segment entries found in '#{mfile}'" if lines.empty?

  infos = []

  if lines[0][0] == '/'
    # Mode: scheme and host, followed by the absolute path on each line
    pos1 = mlink.index('//') + 2
    host = mlink[0, mlink.index('/', pos1)]
    lines.each do |line|
      fn = line.split('/')[-1]
      link = host + line
      infos << { 'fn' => fn, 'link' => link } #, :size => -1}
    end
  else
    # Mode: directory of the playlist URL, followed by the file name
    tmp = mlink.split('/')
    tmp.delete_at(-1)
    link_model = tmp.join('/')
    lines.each do |line|
      fn = line
      link = link_model + '/' + fn
      infos << { 'fn' => fn, 'link' => link } #, :size => -1}
    end
  end

  return infos
end

$cache_file = 'cache.json'

def socket_worker(upload_msg, recv = false)
  cli = nil
  begin
    cli = TCPSocket.new('127.0.0.1', 6262)
    cli.print upload_msg
    if recv
      # Half-close: the server sees EOF while the reply can still be read
      cli.close_write
      return cli.read(30)
    end
  rescue StandardError
    puts "Cannot communicate with hb_server()"
    exit
  ensure
    # Runs on every path, so the socket is also closed after a reply was read
    cli.close if cli
  end
end

def read_cache
  if File.exist?($cache_file)
    puts "Reading info from the cache"
    return JSON.parse(File.read($cache_file))
  else
    raise StandardError, 'Cannot read the cache info'
  end
end

def new_task(mlink)
  tmp = URI.parse(mlink)
  unless (tmp.is_a?(URI::HTTP) || tmp.is_a?(URI::HTTPS))
    raise StandardError, "Bad mlink format: '#{mlink}'"
  end
  begin
    # The block form closes the stream once the body has been read
    fc = URI.open(mlink, HEADER) { |io| io.read }
  rescue StandardError => ser
    puts "Error downloading the m3u8 file: '#{ser.message}'"
    exit
  end
  File.open('index.m3u8', 'w') do |fio|
    fio.puts "##{mlink}" # first line: the source URL, stored as a '#' comment
    fio.puts fc
  end

  infos = parse_links('index.m3u8', mlink)
  File.open($cache_file, 'w') do |fio|
    fio.print infos.to_json
  end
  download_worker(infos)
end

def generate_task_name
  # Assume the process pid is unique
  name = "#{Process.pid}"
  socket_worker("add_#{name}")
  return name
end

def download_worker(infos)
  check_dir

  # Generate the task name
  task_name = generate_task_name
  puts "Task name: '#{task_name}'"

  infos.reject! { |x| File.exist?(x['fn']) }
  puts "#{infos.size} files to download"

  err_count = 0
  File.delete('err.txt') if File.exist?('err.txt')

  until infos.empty?
    info = infos.shift
    # Ask hb_server() whether the task should stop
    ret = socket_worker("status_#{task_name}_#{infos.size}", true)
    #~ puts "ret=#{ret}"; STDIN.gets
    break if (ret != 'true')

    fn = info['fn']
    link = info['link']
    th = Thread.new do
      begin
        fc = URI.open(link, HEADER) { |io| io.read }
        File.open(fn, 'wb') do |fio|
          fio.print fc
        end
        File.open('../filestats.txt', 'a') do |fio3|
          # bytesize matches what File.size reports later in make_file
          fio3.puts "#{fn}||#{fc.bytesize}"
        end
      rescue StandardError => ser
        msg = "Error downloading #{fn}: '#{ser.message}'"
        err_count += 1
        File.open('err.txt', 'a') do |fio|
          fio.puts msg
        end
      end
    end
    # Wait up to 30 seconds; Thread#join returns nil if the thread is still running
    if th.join(30).nil?
      th.kill
      err_count += 1
      File.open('err.txt', 'a') do |fio|
        fio.puts "Timed out downloading #{fn}"
      end
    end
  end

  # Make sure everything has been saved

  Dir.chdir('..')
  if (err_count > 0)
    puts "Download errors: #{err_count}"
  end

  # Tell hb_server() that this task is exiting
  socket_worker("kill_#{task_name}")

  puts "Download run finished: #{Time.now.strftime('%M:%S')}"
  sleep(1)
end

def check_dir
  Dir.mkdir('./ts') unless Dir.exist?('./ts')
  Dir.chdir('./ts')
end


def make_file
  infos = read_cache

  # Read the info of the files already downloaded
  fstats = {}
  File.open('filestats.txt', 'r') do |fio|
    lines = fio.readlines.select { |x| x.include?('||') }
    lines.each do |ln|
      fn, size = ln.split('||')
      fstats[fn] = size.to_i # a later record for the same file overwrites an earlier one
    end
  end

  # Merge the files
  check_dir

  # The block form closes the output file even if a check below raises
  File.open('bin_output.ts', 'wb') do |fio2|
    infos.each do |x|
      fn = x['fn']
      unless File.exist?(fn)
        raise "File '#{fn}' does not exist"
      end

      if fstats[fn].nil?
        puts "No recorded file info for '#{fn}'"
      elsif (fstats[fn] != File.size(fn))
        raise "Size of '#{fn}' does not match the recorded info"
      end

      File.open(fn, 'rb') do |fio|
        fio2.print fio.read
      end
    end
  end

  Dir.chdir('..')
  FileUtils.mv('./ts/bin_output.ts', './bin_output.ts')

  File.open('ffm.cmd', 'w') do |fio|
    fio.puts "ffmpeg -i bin_output.ts -c:v copy -c:a copy 00result.mp4"
  end

  puts 'ffmpeg command written'

end

if ($0 == __FILE__)

  case ARGV[0]
  when nil
    infos = read_cache
    download_worker(infos)
  when 'new'
    print 'Enter m3u8 link>'
    str = STDIN.gets
    new_task(str.chomp)
  when 'make'
    make_file
  else
    raise 'Unsupported command'
  end

end

 

 

Herbern's socket server, which manages the state of each task:

require 'socket'
require 'set'
require 'readline' # gem install rb-readline

$tasks = {}

server1 = TCPServer.new('127.0.0.1', 6262)
# Messages: 'add_446' creates a task, 'kill_4664' removes one,
# 'status_446_12' reports the remaining count and asks whether to continue
#~ server2 = TCPServer.new('127.0.0.1', 6263)
# Query task status


Thread.new do
  loop do
    cli = server1.accept
    begin
      # read returns nil when the client closed without sending anything
      mt = cli.read(60).to_s.split('_')
      action = mt[0]
      task_name = mt[1]
      case action
      when 'add'
        if $tasks.include?(task_name)
          puts "Unexpected: duplicate pid task: #{task_name}"
        else
          $tasks[task_name] = 0
        end
      when 'kill'
        $tasks.delete(task_name)
      when 'status'
        if $tasks.include?(task_name)
          $tasks[task_name] = mt[2].to_i
          cli.print 'true'
        else
          cli.print 'false'
        end
      end
    ensure
      cli.close
    end
  end
end

#~ sleep(1)

puts "Enter a command:"

while true
  #~ print '>>'
  #~ uip = STDIN.gets
  #~ uip.chomp!
  uip = Readline.readline('>>', true) # true: add the line to the history
  if (uip.nil? || uip == 'exit') # nil means end of input (Ctrl-D)
    server1.close
    exit
  elsif (uip == 'list')
    puts "Task list: [#{$tasks.keys.join(',')}]"
  elsif (uip == 'irb')
    require 'irb'
    binding.irb
  else
    mt = /^(\w)\s(\d+)/.match(uip)
    # 'x 146' ends task 146
    # 'i 146' shows the progress of task 146
    if mt
      unless $tasks.key?(mt[2])
        puts "No such task name: #{mt[2]}"
        next
      end
      case mt[1]
      when 'x'
        $tasks.delete mt[2]
        puts 'Marked as finished'
      when 'i'
        puts "'#{mt[2]}'=#{$tasks[mt[2]]}"
      else
        puts "Unknown command: '#{uip}'"
      end
    else
      puts "Unknown command: '#{uip}'"
    end
  end
end
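The message format exchanged between downloader and server ('add_PID', 'status_PID_REMAINING', 'kill_PID') can be sketched without any sockets. The function handle_message below is a hypothetical restatement of the server's case statement as a pure function, not code from the original script.

```ruby
# Hypothetical pure function: applies one protocol message to the task table
# and returns the reply to send back (nil when no reply is sent).
def handle_message(tasks, message)
  action, task_name, remaining = message.to_s.split('_')
  case action
  when 'add'
    tasks[task_name] = 0 unless tasks.key?(task_name)
    nil
  when 'kill'
    tasks.delete(task_name)
    nil
  when 'status'
    return 'false' unless tasks.key?(task_name)
    tasks[task_name] = remaining.to_i  # remember how many files are left
    'true'
  end
end

tasks = {}
handle_message(tasks, 'add_1234')
puts handle_message(tasks, 'status_1234_17')  # true
handle_message(tasks, 'kill_1234')
puts handle_message(tasks, 'status_1234_16')  # false
```

A downloader stops as soon as it receives anything other than 'true', which is how deleting a task on the server side ends the download loop.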

 

II. A crawler for a recruitment website

(This code was written fairly early on, so it is a bit messy.)

It uses the following from Ruby:

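Nokogiri: an XML/HTML parsing library;

SQLite3: usage of the sqlite3 database,

  .results_as_hash: return result rows as hashes;

  sql.query('begin'); sql.query('commit'): SQL transactions (the gem also offers the block form sql.transaction { ... });

  escaping values in sqlite with placeholders: sql.query('update table set a = ? where b = ?', [a_value, b_value]);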

 

ERB: Ruby's template engine, similar in spirit to PHP templates. Usage: ERB.new(string).result(binding); see the official documentation for the template syntax.
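A minimal sketch of ERB.new(string).result(binding). The template text and the sample data are made up for illustration.

```ruby
require 'erb'

template = <<~TEMPLATE
  <h1><%= title %></h1>
  <ul>
  <% jobs.each do |job| %>
    <li><%= job['name'] %> (<%= job['company'] %>)</li>
  <% end %>
  </ul>
TEMPLATE

# The template sees the local variables of the binding passed to result.
title = 'Jobs for 20200630'
jobs = [
  { 'name' => 'Engineer', 'company' => 'Example Co.' },
  { 'name' => 'Analyst',  'company' => 'Sample Ltd.' }
]

html = ERB.new(template).result(binding)
puts html
```

Because the template reads variables through the binding, a variable that is only used inside the template (such as count in the crawler's write_html) still has to be assigned before result is called.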

String: usage of index(substring, start_pos);
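For example, the two-argument form can be used to cut the scheme and host out of a URL, the same way the downloader's parse_links does. The URL here is a placeholder.

```ruby
line = 'http://example.com/video/index.m3u8'

after_scheme = line.index('//') + 2           # position just past "//"
first_slash  = line.index('/', after_scheme)  # search starts at after_scheme
host_part    = line[0, first_slash]

puts after_scheme  # 7
puts first_slash   # 18
puts host_part     # http://example.com
```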

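.encode('utf-8', 'gbk', invalid: :replace, undef: :replace, replace: '?'): converts GBK text to UTF-8 and replaces bytes that cannot be converted with '?'. The options are written as keyword arguments here, because recent Ruby versions no longer accept them as a braced hash;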
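A short sketch of the conversion. The GBK input is produced inside the example by encoding a UTF-8 literal, which stands in for a page downloaded from a GBK-encoded site.

```ruby
# Build GBK bytes for the demo by encoding a UTF-8 literal.
gbk_bytes = '佛山'.encode('gbk')

# Convert GBK -> UTF-8; invalid or unconvertible bytes become '?'.
utf8 = gbk_bytes.encode('utf-8', 'gbk', invalid: :replace, undef: :replace, replace: '?')
puts utf8  # 佛山

# Input containing a byte that is not valid GBK is repaired instead of raising.
broken = "ok\xFF".b
fixed = broken.encode('utf-8', 'gbk', invalid: :replace, undef: :replace, replace: '?')
puts fixed
```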

Basic Ruby metaprogramming: Object#send(:method_name, *args); usage of require and load (require loads a file only once, load executes it again on every call);
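A minimal sketch of dispatching by name with send. The two parser methods and the config hash are hypothetical stand-ins for the per-site methods and the YAML entry that the crawler uses.

```ruby
# Hypothetical parsers, standing in for methods loaded from separate files.
def parse_site_a(html)
  "site A parsed #{html.length} chars"
end

def parse_site_b(html)
  "site B parsed #{html.length} chars"
end

# The method name comes from data (for example a YAML config entry),
# and send calls the method with that name.
config = { 'scanner' => 'parse_site_b' }
result = send(config['scanner'].to_sym, '<html></html>')
puts result  # site B parsed 13 chars
```

send also reaches private methods, which is why it works for methods defined at the top level of a loaded file.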

Getting the full path of the directory that contains the script: File.expand_path(File.dirname(__FILE__))
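A small sketch of why this is useful: paths built from the script's directory keep working no matter which directory the script is started from. The relative path 'ext/info.yml' mirrors the layout used in the crawler.

```ruby
# Directory that contains this file, as an absolute path.
script_dir = File.expand_path(File.dirname(__FILE__))

# Build paths relative to the script, independent of the current directory.
conf_path = File.join(script_dir, 'ext', 'info.yml')

puts script_dir
puts conf_path
```

On current Ruby versions the built-in __dir__ usually gives the same directory in one call.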

 

require 'psych'
require 'open-uri'
require 'erb'
require 'sqlite3'
require 'nokogiri'
require 'set'

=begin
  Usage:
  rw = RecruitWalker.new()
  rw.prepare_crawl
  rw.add_task('job51')
  rw.add_task('shundehr')
  rw.begin_tasks
  rw.write_html
=end




# Download link with open-uri; returns the body, or false on any error
def uri_open(link, header={})
  begin
    return URI.open(link, header) { |io| io.read }
  rescue StandardError
    return false
  end

end



class RecruitWalker

  attr_reader :header, :datadb_fn, :legacydb_fn, :joblist, :late10dates

  def initialize
    @path = File.expand_path(File.dirname(__FILE__))

    # Read the configuration
    @datadb_fn = "#{@path}/database/store5.db"
    @legacydb_fn = "#{@path}/database/oldjob.db"

    # Read the recruitment sites' configuration
    @conf_fn = "#{@path}/ext/info.yml"

    @joblist = Psych.load(File.read(@conf_fn))

    @late10dates = read_dates()

    return
  end

  def prepare_crawl
    # Prepare the crawl
    # Connect to the sqlite databases
    @data_sql = SQLite3::Database.new(@datadb_fn)
    @data_sql.results_as_hash = true
    @legacy_sql = SQLite3::Database.new(@legacydb_fn)
    @task_list = {}
    # Shape of the task list:
    # {
    #   task_name_1 => {:htmls => [], :infos => []},
    #   task_name_2 => {:htmls => [], :infos => []}
    # }

    # Read the company-name blacklist

    @blacklist = Psych.load(File.read("#{@path}/ext/blacklist.yml"))
    @header = {'User-Agent'=>'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36', 'cookie'=>''}
    return
  end

  def read_dates
    # The 10 most recent dates (today first), as 'YYYYMMDD' strings
    now = Time.now
    return (0...10).map { |i| (now - i * 24 * 3600).strftime('%Y%m%d') }
  end

  def add_task(name)
    if @joblist.has_key?(name)
      @task_list[name] = {:htmls => [], :infos => []}
    else
      raise StandardError, "Task name not found: '#{name}'"
    end
  end

  def begin_tasks
    # Crawl every task in the task list.
    # Pages of the different tasks are downloaded in parallel;
    # data tidying and SQL work are then processed one after another
    return if @task_list.empty?
    puts "Task run started, count: #{@task_list.keys.size}, time '#{Time.now}'"
    # Download the pages
    web_threads = []
    @task_list.each_pair do |k, v|
      th = Thread.new do
        v[:htmls] = scan_web_task(k, @joblist[k])
      end
      web_threads << th
    end
    # Wait for all downloads to finish
    web_threads.each(&:join)

    # HTML analysis
    @task_list.each_pair do |k, v|
      method = @joblist[k]['scanner'].to_sym
      puts "Loading the analysis library file: #{k}"
      load "#{@path}/ext/use-#{k}.rb"
      puts "Analysing htmls: start: '#{@joblist[k]['text']}'"
      tmp_keys = Set[]
      v[:htmls].each do |html|
        tmp_list = send(method, html)
        tmp_list.each do |x|
          # Set#add? returns nil when the jobid has already been seen
          v[:infos] << x if tmp_keys.add?(x['jobid'])
        end
        #~ v[:infos]
      end
      puts "Analysing htmls: done: '#{@joblist[k]['text']}'"
    end

    # Data migration
    @task_list.each_key do |k|
      data_migrate(@joblist[k]['table'])
    end

    # Data update and insertion
    @task_list.each_pair do |k, v|
      data_tidy(@joblist[k], v[:infos])
    end

    # Save the crawl time
    update_conf
    puts "Task run finished."
    return
  end

  def scan_web_task(name, conf)

    puts "Downloading: '#{conf['text']}'"

    local_header = @header.clone
    local_header[:encoding] = conf['encoding'] # symbol key: an open-uri option, not a header

    htmls = []
    p_begin = conf['page_begin']
    p_end = conf['page_end']

    (p_begin .. p_end).each do |pid|
      link = conf['link'].sub('{{pageid}}', pid.to_s)

      html = uri_open(link, local_header)
      if (html == false)
        puts "HTTP error downloading index page #{pid} of '#{name}'"
        next
      end

      htmls << html
      sleep(rand(2)) # pause 0 or 1 second between pages
    end
    puts "Download finished: '#{conf['text']}'"
    return htmls

  end

  def update_conf
    # Write the crawl time to the yaml file
    tn = Time.now.to_s
    @task_list.each_key do |k|
      @joblist[k]['update_time'] = tn
    end
    File.open(@conf_fn, 'w') do |fio|
      fio.print @joblist.to_yaml
    end
    return
  end

  def data_migrate(table)
    # table is the table name

    drop_cmds = [] # remove outdated rows from the matching table in the main database
    update_cmds = [] # insert the outdated jobids into the matching table in the oldjob database
    cmd = "select jobid, post_date from #{table}"
    #~ puts "migrate cmd='#{cmd}'"
    @data_sql.execute(cmd).each do |x|
      jobid = x['jobid']; p_date = x['post_date']
      unless @late10dates.include?(p_date)
        # Delete this outdated row from the main database
        drop_cmds << {:cmd => "delete from #{table} where jobid = ?", :params => [jobid]}
        # Record the jobid and date of this outdated row in the oldjob database
        update_cmds << {:cmd => "insert into #{table} (jobid, post_date) values (?, ?)", :params => [jobid, p_date]}
      end
    end

    # transaction wraps the block in begin/commit and rolls back if it raises
    @data_sql.transaction do
      drop_cmds.each do |c|
        @data_sql.execute(c[:cmd], c[:params])
      end
    end

    @legacy_sql.transaction do
      update_cmds.each do |c|
        @legacy_sql.execute(c[:cmd], c[:params])
      end
    end

    return
  end

  def data_tidy(conf, src_infos)
    table = conf['table']
    # Collect the outdated jobids (@legacy_sql returns rows as arrays)
    outdated_jobids = @legacy_sql.execute("select jobid from #{table}").map { |x| x[0] }

    black_count = 0 # number of postings blocked by the company-name blacklist

    infos = src_infos.select{  |x|

      bool1 = @late10dates.include?(x['post_date']) # drop postings outside the date range
      bool2 = !@blacklist.include?(x['company']) # drop postings from blacklisted companies
      black_count += 1 if (bool2 == false)
      bool3 = !outdated_jobids.include?(x['jobid']) # drop postings whose jobid is already in the oldjob table
      (bool1 && bool2 && bool3)
    }

    puts "Blocked #{black_count} postings using the company-name blacklist" if (black_count>0)
    puts "Data tidying: start: '#{conf['text']}'"

    # Collect the jobids of the postings currently stored
    lately_jobids = @data_sql.execute("select jobid from #{table}").map { |x| x['jobid'] }

    cmds = []
    update_count = 0
    insert_count = 0

    infos.each do |info|
      # Handle each newly downloaded posting
      entry = {}
      if lately_jobids.include?(info['jobid'])
        # The current database already holds a posting with the same jobid
        entry[:cmd] = "update #{table} set update_date = ? where jobid = ?"
        entry[:params] = [info['post_date'], info['jobid']]
        update_count += 1
      else
        entry[:cmd] = "insert into #{table} (jobid,post_date,company,link,name,srcsite,update_date) values (?, ?, ?, ?, ?, ?, ?)"
        entry[:params] = [
          info['jobid'],
          info['post_date'],
          info['company'],
          info['link'],
          info['name'],
          info['srcsite'],
          info['update_date']
        ]
        insert_count += 1
      end
      cmds << entry
    end

    puts "Updates: #{update_count}", "Inserts: #{insert_count}"
    puts 'Adding the new info to the database..'

    # transaction wraps the block in begin/commit and rolls back if it raises
    @data_sql.transaction do
      cmds.each do |c|
        @data_sql.execute(c[:cmd], c[:params])
      end
    end


    return
  end

  def write_html
    # Remove the html files saved last time
    Dir.chdir("#{@path}/html")
    Dir.glob('*.html').each do |fn|
      File.unlink(fn)
    end
    Dir.chdir(@path)
    # Write the welcome page
    fc1 = File.read("#{@path}/template2/index.erb")
    dates_text = @late10dates.map{|x| "'#{x}'"}.join(',')
    sites_text = @joblist.map{|k, v|
      "{'name':'#{k}', 'text':'#{v['text']}', 'upd_time':'#{v['update_time']}'}"

    }.join(',')

    html1 = ERB.new(fc1).result(binding)
    File.open("#{@path}/html/index.html", 'w') do |fio|
      fio.print html1
    end

    # Write one detail page per recruitment site and date
    #~ @data_sql
    fc2 = File.read("#{@path}/template2/details.erb")

    #~ puts "date list: #{dates}, joblist=#{rw.joblist}"
    @late10dates.each do |date|
      @task_list.each_pair do |k, v|
        #~ puts "writing: #{k}, #{date}"
        conf = @joblist[k]
        site_text = conf['text']
        table = conf['table']

        # The date goes in through a placeholder instead of string interpolation.
        # Postings of the same company are not grouped together.
        cmd = "select * from #{table} where post_date = ?"
        #~ puts "sql cmd=#{cmd}"
        infos = @data_sql.execute(cmd, [date])
        count = infos.size # this variable is used inside the erb template
        #~ puts "check infos: #{count} in total, [0]=", infos[0].inspect, ''
        html2 = ERB.new(fc2).result(binding)
        fn = "#{@path}/html/info-#{k}-#{date}.html"
        File.open(fn, 'w') do |fio|
          fio.print html2
        end
      end
    end

    puts "Writing: done."
    return
  end

end

 

The parsing routine for one of the recruitment sites:

def page_analyse_sdhr(html_content)#, rw)
  infos = []
  # Strip line breaks first so the regular expression can match across lines
  text0 = html_content.gsub("\r", '').gsub("\n", '')
  h = '<ul class="searchPost-list" id="resultList">'
  b = '<div class="centerPage">'
  re = /#{h}(.*)#{b}/
  mt = re.match(text0)
  return infos if mt.nil? # the result list was not found on this page
  text1 = mt[0]

  # <input> tags have no closing </input>, so use Nokogiri::HTML rather than Nokogiri::XML
  nk = Nokogiri::HTML(text1, nil, 'utf-8')
  elems = nk.xpath('//li')


  # Assume each page has 20 job postings

  elems.each do |node|
    # Get the date
    post_date = node.xpath('./div/div[@class="t5"]').text
    date_text = post_date.split('-').join

    job_name = node.xpath('./div/div[@class="t1"]/a').text
    link = node.xpath('./div/div[@class="t1"]/a/@href').text
    link = 'http://www.shundehr.com' + link
    company_name = node.xpath('./div/div[@class="t2"]/a').text
    jobid_text = node.xpath('./input/@value').text

    srcsite = 'shundehr'
    info = {
      'name'        => job_name,
      'company'     => company_name,
      'post_date'   => date_text,
      'update_date' => date_text,
      'jobid'       => jobid_text,
      'link'        => link,
      'srcsite'     => srcsite
    }

    #~ puts info.inspect; STDIN.gets

    # xpath returns an empty NodeSet when nothing matches and .text of that is "",
    # so check for empty strings (and nil) rather than for nil alone
    selects = info.select{|k, v| v.nil? || v.empty?} # selects is a hash here
    if selects.empty?
      infos << info
    else
      puts "Some elements were not found while analysing the html", info.inspect, '---'
    end
  end
  return infos
end

# Test
#~ require 'nokogiri'
#~ fc = File.read('../tmp/shundehr_sample.html')
#~ r = page_analyse_sdhr(fc)

 

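Some notes on Nokogiri usage follow. Nokogiri::XML parses regular XML, while Nokogiri::HTML handles HTML tags that have no closing tag, such as <img> and <input>. On Windows it is best to pass the 'utf-8' encoding argument explicitly; otherwise, in my experience, Nokogiri parses with the GBK encoding of the Windows terminal.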

Nokogiri supports xpath('./node'); to read the link out of <a href="link">text</a>, use xpath('./a/@href').text.

Nokogiri also has .to_html, .to_s, .text and .value; I cannot find my code for these at the moment, so try them out in real use.

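It is best to debug step by step when using Nokogiri, because what happens when an element is not found depends on the call: xpath returns an empty NodeSet (whose .text is an empty string), at_xpath returns nil, and calling a method on that nil raises NoMethodError.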

 

Enumerable#any? usage: [3, 4, 5].any?{|x| x % 2 == 0} # => true
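A short sketch that puts any? next to its siblings all? and none?; the numbers are arbitrary.

```ruby
numbers = [3, 4, 5]

puts numbers.any? { |x| x % 2 == 0 }   # true: 4 is even
puts numbers.all? { |x| x % 2 == 0 }   # false: 3 and 5 are odd
puts numbers.none? { |x| x > 10 }      # true: no element exceeds 10
```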

posted @ 2020-06-30 12:28  uu2crain