Common Usage of Ruby's Basic Libraries
I have been writing Ruby scripts for a while now. The API documentation for some Ruby libraries is not very complete, especially socket, so in this post I excerpt the usage of some of Ruby's basic libraries. The logic of the code itself runs, but some accompanying YAML data files are not included, so the code cannot be run directly as-is.
I am writing this post in the hope that it helps some Ruby newcomers. There really are not many Chinese-language Ruby tutorials right now, and quite a few repeat each other. In my view Ruby is neither especially good nor especially bad; once you get familiar with its syntax and with "pass by object reference", you will inevitably come to both love and hate it. This article mainly covers practical things that help you finish tasks quickly.
1. The "Hepburn" m3u8 downloader
This part uses the following pieces of Ruby:
OpenURI: an HTTP download library; mind the headers. In `io = URI.open(link, hash)`, the hash is roughly the headers sent with the HTTP GET/POST request. If you put `:encoding => 'utf-8'` in the hash, open-uri reads the downloaded stream as UTF-8; otherwise it reads it with the default ASCII-compatible encoding.
socket (run on Windows) reading and writing: TCPServer and TCPSocket. The hierarchy is Socket < BasicSocket < IO, and both close_write and read(n) are methods a socket inherits from IO.
Set[] sets: require 'set'; a = Set[3, 4, 5]; a.to_a # => [3, 4, 5]; [5, 6, 7].to_set # => #<Set: {5, 6, 7}>; a.include?(5) # => true; a << 2 # => #<Set: {3, 4, 5, 2}>; a.delete(5) # => #<Set: {3, 4, 2}>
Basic usage of the JSON library (the Psych YAML library works much the same way);
Thread.new {}: multithreading; for the usage of Mutex and Queue, see the "runoob" beginner tutorial.
The `if ($0 == __FILE__)` idiom, analogous to Python's `if __name__ == "__main__"`, which runs a block only when this file is the launch script; and ARGV, the array of launch arguments.
Simple regular-expression usage: mind whether multiline mode is enabled, and strip newline characters first.
```ruby
text0 = '<div class="el">
    <p class="t1 ">
        <em class="check" name="delivery_em" onclick="checkboxClick(this)"></em>
        <input class="checkbox" type="checkbox" name="delivery_jobid" value="110487870" jt="0" style="display:none" />
        <span>
            <a target="_blank" title="美容顾问(佛山)" href="https://jobs.51job.com/foshan/110487870.html?s=01&t=0" onmousedown="">
                美容顾问(佛山)            </a>
        </span>
    </p>
    <span class="t2"><a target="_blank" title="上海蝶翠诗商业有限公司(DHC)" href="https://jobs.51job.com/all/co298555.html">上海蝶翠诗商业有限公司(DHC)</a></span>
    <span class="t3">佛山</span>
    <span class="t4">6-8千/月</span>
    <span class="t5">02-24</span>
</div>
<div class="el">
    <p class="t1 ">
        <em class="check" name="delivery_em" onclick="checkboxClick(this)"></em>
        <input class="checkbox" type="checkbox" name="delivery_jobid" value="120125665" jt="0" style="display:none" />
        <span>
            <a target="_blank" title="电商业务专员" href="https://jobs.51job.com/foshan-sdq/120125665.html?s=01&t=0" onmousedown="">
                电商业务专员            </a>
        </span>
    </p>
    <span class="t2"><a target="_blank" title="美的集团热水器事业部" href="https://jobs.51job.com/all/co3224953.html">美的集团热水器事业部</a></span>
    <span class="t3">佛山-顺德区</span>
    <span class="t4">1-1.5万/月</span>
    <span class="t5">02-24</span>
</div>'
text0 = text0.gsub("\r", '').gsub("\n", '')
h = '<div class="el">'
b = '</div>'
re = /#{h}.*?#{b}/
elems = text0.scan(re)
puts elems.size, elems[0]

a = /(\d+)(abc)de/.match('12abcdef') # => #<MatchData "12abcde" 1:"12" 2:"abc">
a.to_a     # => ["12abcde", "12", "abc"] (index 0 is the whole match)
a.captures # => ["12", "abc"]
```
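The JSON, Thread/Mutex/Queue, and launch-guard items listed above can be sketched in a few lines. This is a minimal, self-contained example (no network access; the values shown are from running it locally):

```ruby
require 'json'

# JSON round-trip (Psych/YAML is analogous: obj.to_yaml / Psych.load)
h = {'name' => 'ruby', 'ver' => 3}
s = h.to_json            # => '{"name":"ruby","ver":3}'
h2 = JSON.parse(s)
raise 'round-trip failed' unless h2 == h

# Thread + Queue + Mutex: one producer, one consumer.
# Queue#pop blocks until something is pushed, so :done acts as a stop signal.
queue = Queue.new
mutex = Mutex.new
total = 0
consumer = Thread.new do
  while (n = queue.pop) != :done
    mutex.synchronize { total += n }
  end
end
(1..10).each { |i| queue.push(i) }
queue.push(:done)
consumer.join
puts total               # => 55

# Launch-script guard, like Python's if __name__ == "__main__"
puts 'running as the launch script' if $0 == __FILE__
```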
```ruby
# m3u8 downloader "Hepburn"
require 'open-uri'
require 'fileutils'
require 'socket'
require 'json'
#~ require 'sinatra/base'
# mlink is the http path of the m3u8 file

HEADER = {
  'user-agent' => 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
  'connection' => 'keep-alive'
}

def parse_links(mfile, mlink)
  fio2 = File.open(mfile, 'r')
  lines = fio2.readlines.map { |x| x.chomp }
  fio2.close
  lines.reject! { |x| x == '' || x[0] == '#' }
  raise if lines.empty?

  infos = []

  if lines[0][0] == '/'
    # mode: host + the content of each line
    pos1 = mlink.index('//') + 2
    host = mlink[0, mlink.index('/', pos1)]
    lines.each do |line|
      elems = line.split('/')
      fn = elems[-1]
      link = host + line
      infos << { 'fn' => fn, 'link' => link } #, :size => -1}
    end
  else
    # mode: base link + file name
    tmp = mlink.split('/')
    tmp.delete_at(-1)
    link_model = tmp.join('/')
    lines.each do |line|
      fn = line
      link = link_model + '/' + fn
      infos << { 'fn' => fn, 'link' => link } #, :size => -1}
    end
  end

  return infos
end

$cache_file = 'cache.json'

def socket_worker(upload_msg, recv = false)
  begin
    cli = TCPSocket.new('127.0.0.1', 6262)
    cli.print upload_msg
    if recv
      cli.close_write
      ret = cli.read(30)
      return ret
    end
    cli.close
  rescue StandardError
    puts 'cannot communicate with hb_server()'
    exit
  end
end

def read_cache
  if File.exist?($cache_file)
    puts 'reading info from the cache'
    fio2 = File.open($cache_file)
    infos = JSON.load(fio2.read)
    fio2.close
    return infos
  else
    raise StandardError, 'cannot read the cached info'
  end
end

def new_task(mlink)
  tmp = URI.parse(mlink)
  unless tmp.is_a?(URI::HTTP) || tmp.is_a?(URI::HTTPS)
    raise StandardError, "bad mlink format: '#{mlink}'"
  end
  begin
    io = URI.open(mlink, HEADER)
    fc = io.read
  rescue StandardError => ser
    puts "error downloading the m3u8 file: '#{ser.message}'"
    exit
  end
  File.open('index.m3u8', 'w') do |fio|
    fio.puts "##{mlink}"
    fio.puts fc
  end

  infos = parse_links('index.m3u8', mlink)
  File.open('cache.json', 'w') do |fio|
    fio.print infos.to_json
  end
  download_worker(infos)
end

def generate_task_name
  # assume the process pid is unique
  name = "#{Process.pid}"
  socket_worker("add_#{name}")
  return name
end

def download_worker(infos)
  check_dir

  # generate the task name
  task_name = generate_task_name
  puts "task name: '#{task_name}'"

  infos.reject! { |x| File.exist?(x['fn']) }
  puts "#{infos.size} files to download"

  err_count = 0
  File.delete('err.txt') if File.exist?('err.txt')

  until infos.empty?
    info = infos.shift
    # ask hb_server() whether this task should stop
    ret = socket_worker("status_#{task_name}_#{infos.size}", true)
    #~ puts "ret=#{ret}"; STDIN.gets
    break if ret != 'true'

    fn = info['fn']
    link = info['link']
    th = Thread.new do
      begin
        io = URI.open(link, HEADER)
        fc = io.read
        File.open(fn, 'wb') do |fio|
          fio.print fc
        end
        File.open('../filestats.txt', 'a') do |fio3|
          fio3.puts "#{fn}||#{fc.size}"
        end
      rescue StandardError => ser
        msg = "error downloading #{fn}: '#{ser.message}'"
        err_count += 1
        File.open('err.txt', 'a') do |fio|
          fio.puts msg
        end
      end
    end
    time = 0
    while th.alive? && time <= 30
      sleep 1
      time += 1
    end
    if th.alive? # still running after ~30 s: treat it as a timeout
      msg = "timeout downloading #{fn}"
      err_count += 1
      File.open('err.txt', 'a') do |fio|
        fio.puts msg
      end
    end

    th.kill
  end

  # make sure everything has been saved

  Dir.chdir('..')
  if err_count > 0
    puts "download errors: #{err_count}"
  end

  # tell hb_server() on exit
  socket_worker("kill_#{task_name}")

  puts "download flow finished: #{tn = Time.now; tn.min.to_s + ':' + tn.sec.to_s}"
  sleep(1)
end

def check_dir
  Dir.mkdir('./ts') unless Dir.exist?('./ts')
  Dir.chdir('./ts')
end

def make_file
  infos = read_cache

  # read the records of the files that were downloaded
  fstats = {}
  File.open('filestats.txt', 'r') do |fio|
    lines = fio.readlines.select { |x| x.include?('||') }
    lines.each do |ln|
      fn, size = ln.split('||')
      fstats[fn] = size.to_i # later records overwrite earlier ones
    end
  end

  # merge the files
  check_dir

  fio2 = File.open('bin_output.ts', 'wb')
  infos.each do |x|
    fn = x['fn']
    unless File.exist?(fn)
      raise "file '#{fn}' does not exist"
    end

    if fstats[fn] == nil
      puts "no record for file '#{fn}'"
    elsif fstats[fn] != File.size(fn)
      raise "size of '#{fn}' does not match the recorded size"
    end

    File.open(fn, 'rb') do |fio|
      fio2.print fio.read
    end
  end
  fio2.close

  Dir.chdir('..')
  FileUtils.mv('./ts/bin_output.ts', './bin_output.ts')

  File.open('ffm.cmd', 'w') do |fio|
    fio.puts 'ffmpeg -i bin_output.ts -c:v copy -c:a copy 00result.mp4'
  end

  puts 'ffmpeg command written'
end

if $0 == __FILE__
  case ARGV[0]
  when nil
    infos = read_cache
    download_worker(infos)
  when 'new'
    print 'enter the m3u8 link> '
    str = STDIN.gets
    new_task(str.chomp)
  when 'make'
    make_file
  else
    raise 'unsupported command'
  end
end
```
Hepburn's socket server, used to manage the state of each task:
```ruby
require 'socket'
require 'set'
require 'readline' # gem install rb-readline

$tasks = {}

server1 = TCPServer.new('127.0.0.1', 6262)
# creates a task: 'add_446'; removes one: 'kill_4664'
#~ server2 = TCPServer.new('127.0.0.1', 6263)
# query task status

Thread.new do
  loop do
    cli = server1.accept
    str = cli.read(60)
    mt = str.split('_')
    if mt != nil
      action = mt[0]
      task_name = mt[1]
      case action
      when 'add'
        if $tasks.include?(task_name)
          puts "unexpected: duplicate pid task: #{task_name}"
        else
          $tasks[task_name] = 0
        end
      when 'kill'
        $tasks.delete(task_name)
      when 'status'
        if $tasks.include?(task_name)
          $tasks[task_name] = mt[2].to_i
          cli.print 'true'
        else
          cli.print 'false'
        end
      end
      #~ else
      #~ cli.puts 'illegal'
    end
    cli.close
  end
end

#~ sleep(1)

puts 'enter a command:'

while true
  uip = Readline::readline('>>')
  Readline::HISTORY.push(uip)
  if uip == 'exit'
    server1.close
    exit
  elsif uip == 'list'
    puts "task list: [#{$tasks.keys.join(',')}]"
  elsif uip == 'irb'
    require 'irb'
    binding.irb
  else
    mt = /^(\w)\s(\d+)/.match(uip)
    # 'x 146' ends task 146
    # 'i 146' shows the progress of task 146
    if mt
      unless $tasks.key?(mt[2])
        puts "no such task name: #{mt[2]}"
        next
      end
      case mt[1]
      when 'x'
        $tasks.delete mt[2]
        puts 'marked as ended'
      when 'i'
        puts "'#{mt[2]}'=#{$tasks[mt[2]]}"
      else
        puts "unknown command: '#{uip}'"
      end
    else
      puts "unknown command: '#{uip}'"
    end
  end
end
```
2. A crawler for some job-recruitment sites
(This code was written fairly early on, so it is a bit messy.)
This part uses the following pieces of Ruby:
Nokogiri: an XML/HTML parsing library;
SQLite3: usage of the sqlite3 database,
.results_as_hash: return query results as hashes;
sql.query('begin'); sql.query('commit'): SQL transactions;
SQLite parameter escaping: sql.query('update table set a = ? where b = ?', [a_value, b_value]);
ERB: Ruby's template engine, similar to PHP's: ERB.new(string).result(binding); see the official docs for the template syntax.
String#index(str, start_pos) usage;
.encode('utf-8', 'gbk', {:invalid => :replace, :undef => :replace, :replace => '?'}) usage (see the code);
Basic Ruby metaprogramming: Object#send(:method, args); usage of require and load;
Getting the script's own full directory path: File.expand_path(File.dirname(__FILE__))
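The list above can be condensed into one small runnable sketch. The template text, URL, and GBK byte string here are made-up examples, not taken from the crawler itself:

```ruby
require 'erb'

# ERB: <%= %> tags interpolate Ruby expressions into the template
template = 'Hello, <%= name %>! You have <%= jobs.size %> jobs.'
name = 'reader'
jobs = ['a', 'b']
puts ERB.new(template).result(binding) # => Hello, reader! You have 2 jobs.

# String#index with a start position: find the slash after the scheme
s = 'https://jobs.51job.com/foshan'
pos1 = s.index('//') + 2
host = s[0, s.index('/', pos1)]        # => "https://jobs.51job.com"

# String#encode with replacement options: undecodable bytes become '?'
gbk = "\xD6\xD0".force_encoding('GBK') # the GBK bytes of one Chinese character
utf = gbk.encode('utf-8', :invalid => :replace, :undef => :replace, :replace => '?')

# Object#send: call a method by name (basic metaprogramming)
puts [3, 1, 2].send(:sort).inspect     # => [1, 2, 3]

# the absolute directory containing the current script
puts File.expand_path(File.dirname(__FILE__))
```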
```ruby
require 'psych'
require 'open-uri'
require 'erb'
require 'sqlite3'
require 'nokogiri'
require 'set'

=begin
Usage:
  rw = RecruitWalker.new()
  rw.prepare_crawl
  rw.add_task('job51')
  rw.add_task('shundehr')
  rw.begin_tasks
  rw.write_html
=end

def uri_open(link, header = {})
  # download with open-uri
  begin
    io = URI.open(link, header)
    c = io.read
    io.close
    return c
  rescue StandardError
    return false
  end
end

class RecruitWalker

  attr_reader :header, :datadb_fn, :legacydb_fn, :joblist, :late10dates

  def initialize
    @path = File.expand_path(File.dirname(__FILE__))

    # configuration
    @datadb_fn = "#{@path}/database/store5.db"
    @legacydb_fn = "#{@path}/database/oldjob.db"

    # configuration for the recruitment sites
    @conf_fn = "#{@path}/ext/info.yml"
    @joblist = Psych.load(File.read(@conf_fn))

    @late10dates = read_dates()
    return
  end

  def prepare_crawl
    # prepare the crawl flow: connect the sqlite databases
    @data_sql = SQLite3::Database.new(@datadb_fn)
    @data_sql.results_as_hash = true
    @legacy_sql = SQLite3::Database.new(@legacydb_fn)
    @task_list = {}
    # shape of the task list:
    # {
    #   task_name1 => {:htmls => [], :infos => []},
    #   task_name2 => {:htmls => [], :infos => []}
    # }

    # read the company-name blacklist
    @blacklist = Psych.load(File.read("#{@path}/ext/blacklist.yml"))
    @header = {
      'User-Agent' => 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36',
      'cookie' => ''
    }
    return
  end

  def read_dates
    # the 10 most recent dates
    tn = Time.now
    time_array = []
    10.times do
      time_array << tn.clone
      tn -= 24 * 3600
    end
    dates_array = []
    time_array.each do |tm|
      day = tm.to_s.split(' ')[0].gsub('-', '')
      dates_array << day
    end
    return dates_array # array of strings
  end

  def add_task(name)
    if @joblist.has_key?(name)
      @task_list[name] = {:htmls => [], :infos => []}
    else
      raise StandardError, "task name not found: '#{name}'"
    end
  end

  def begin_tasks
    # run every task in the task list:
    # pages are fetched in parallel; data tidying and SQL work are done sequentially
    return if @task_list.empty?
    puts "task flow started, count: #{@task_list.keys.size}, time '#{Time.now}'"

    # download pages
    web_threads = []
    @task_list.each_pair do |k, v|
      th = Thread.new do
        v[:htmls] = scan_web_task(k, @joblist[k])
      end
      web_threads << th
    end
    # wait for all downloads to finish
    web_threads.map { |x| x.join }

    # html analysis
    @task_list.each_pair do |k, v|
      method = @joblist[k]['scanner'].to_sym
      puts "loading analyser file: #{k}"
      load "#{@path}/ext/use-#{k}.rb"
      puts "analysing htmls: start: '#{@joblist[k]['text']}'"
      tmp_keys = Set[]
      v[:htmls].each do |html|
        tmp_list = send(method, html)
        tmp_list.each do |x|
          unless tmp_keys.include?(x['jobid'])
            v[:infos] << x
            tmp_keys << x['jobid']
          end
        end
      end
      puts "analysing htmls: done: '#{@joblist[k]['text']}'"
    end

    # data migration
    @task_list.each_key do |k|
      data_migrate(@joblist[k]['table'])
    end

    # data update and insert
    @task_list.each_pair do |k, v|
      data_tidy(@joblist[k], v[:infos])
    end

    # save the crawl time
    update_conf
    puts 'task flow finished.'
    return
  end

  def scan_web_task(name, conf)
    puts "downloading: '#{conf['text']}'"

    local_header = @header.clone
    local_header[:encoding] = conf['encoding']

    htmls = []
    p_begin = conf['page_begin']
    p_end = conf['page_end']

    (p_begin..p_end).each do |pid|
      link = conf['link'].sub('{{pageid}}', pid.to_s)
      html = uri_open(link, local_header)
      if html == false
        puts "http error while downloading index page #{pid} of '#{name}'"
        next
      end
      htmls << html
      sleep(rand(2))
    end
    puts "download finished: '#{conf['text']}'"
    return htmls
  end

  def update_conf
    # write the crawl time back into the yaml file
    tn = Time.now.to_s
    @task_list.each_key do |k|
      @joblist[k]['update_time'] = tn
    end
    File.open(@conf_fn, 'w') do |fio|
      fio.print @joblist.to_yaml
    end
    return
  end

  def data_migrate(table)
    # table is the table name
    drop_cmds = []   # remove outdated rows from the main database's table
    update_cmds = [] # insert the outdated jobids into the oldjob database's table
    cmd = "select jobid, post_date from #{table}"
    @data_sql.query(cmd).each do |x|
      jobid = x['jobid']; p_date = x['post_date']
      unless @late10dates.include?(p_date)
        # delete this outdated row from the main database
        drop_cmds << {:cmd => "delete from #{table} where jobid = ?", :params => [jobid]}
        # record its jobid and date in the outdated-info database
        update_cmds << {:cmd => "insert into #{table} (jobid, post_date) values (?, ?)", :params => [jobid, p_date]}
      end
    end

    @data_sql.query('begin')
    drop_cmds.each do |c|
      @data_sql.execute(c[:cmd], c[:params])
    end
    @data_sql.query('commit')

    @legacy_sql.query('begin')
    update_cmds.each do |c|
      @legacy_sql.execute(c[:cmd], c[:params])
    end
    @legacy_sql.query('commit')

    return
  end

  def data_tidy(conf, src_infos)
    table = conf['table']
    # collect the outdated jobids
    outdated_jobids = []
    @legacy_sql.query("select jobid from #{table}").each do |x|
      outdated_jobids << x[0]
    end

    black_count = 0 # postings blocked by the company-name blacklist

    infos = src_infos.select { |x|
      bool1 = @late10dates.include?(x['post_date']) # drop postings outside the date range
      bool2 = !@blacklist.include?(x['company'])    # drop postings from blacklisted companies
      black_count += 1 if bool2 == false
      bool3 = !outdated_jobids.include?(x['jobid']) # drop postings whose jobid is already in the oldjob table
      (bool1 && bool2 && bool3)
    }

    puts "blocked #{black_count} postings via the company-name blacklist" if black_count > 0
    puts "data tidy: start: '#{conf['text']}'"

    # jobids currently in the main table
    lately_jobids = []
    @data_sql.query("select jobid from #{table}").each do |x|
      lately_jobids << x['jobid']
    end

    cmds = []
    update_count = 0
    insert_count = 0

    infos.each do |info|
      # handle each freshly downloaded posting
      set = {}
      if lately_jobids.include?(info['jobid'])
        # the table already holds a row with the same jobid within the date range
        set[:cmd] = "update #{table} set update_date = ? where jobid = ?"
        set[:params] = [info['post_date'], info['jobid']]
        update_count += 1
      else
        set[:cmd] = "insert into #{table} (jobid,post_date,company,link,name,srcsite,update_date) values (?, ?, ?, ?, ?, ?, ?)"
        set[:params] = [
          info['jobid'],
          info['post_date'],
          info['company'],
          info['link'],
          info['name'],
          info['srcsite'],
          info['update_date']
        ]
        insert_count += 1
      end
      cmds << set
    end

    puts "updates: #{update_count}", "inserts: #{insert_count}"
    puts 'writing new rows to the database..'

    @data_sql.query('begin')
    cmds.each do |c|
      @data_sql.query(c[:cmd], c[:params])
    end
    @data_sql.query('commit')

    return
  end

  def write_html
    # remove the html files saved last time
    Dir.chdir("#{@path}/html")
    Dir.glob('*.html').each do |fn|
      File.unlink(fn)
    end
    Dir.chdir(@path)

    # write the welcome page
    fc1 = File.read("#{@path}/template2/index.erb")
    dates_text = @late10dates.map { |x| "'#{x}'" }.join(',')
    sites_text = @joblist.map { |k, v|
      "{'name':'#{k}', 'text':'#{v['text']}', 'upd_time':'#{v['update_time']}'}"
    }.join(',')

    html1 = ERB.new(fc1).result(binding)
    File.open("#{@path}/html/index.html", 'w') do |fio|
      fio.print html1
    end

    # write the per-site, per-date detail pages
    fc2 = File.read("#{@path}/template2/details.erb")

    @late10dates.each do |date|
      @task_list.each_pair do |k, v|
        conf = @joblist[k]
        site_text = conf['text']
        table = conf['table']

        infos = []
        cmd = "select * from #{table} where post_date = '#{date}'"
        @data_sql.query(cmd).each do |x|
          infos << x # not bothering to group postings from the same company
        end
        count = infos.size # used inside the erb template
        html2 = ERB.new(fc2).result(binding)
        fn = "#{@path}/html/info-#{k}-#{date}.html"
        File.open(fn, 'w') do |fio|
          fio.print html2
        end
      end
    end

    puts 'write flow: finished.'
    return
  end

end
```
The parsing routine for one of the recruitment sites:
```ruby
def page_analyse_sdhr(html_content) #, rw)
  infos = []
  text0 = html_content
  text0.gsub!("\r", '')
  text0.gsub!("\n", '') # strip newlines first to make the regex easier

  h = '<ul class="searchPost-list" id="resultList">'
  b = '<div class="centerPage">'
  re = /#{h}(.*)#{b}/
  text1 = re.match(text0)[0]

  # <input> tags have no closing </input>, so Nokogiri::XML will not do;
  # Nokogiri::HTML must be used instead
  nk = Nokogiri::HTML(text1, nil, 'utf-8')
  elems = nk.xpath('//li')

  # assume each page holds 20 postings
  elems.each do |node|
    # the posting date
    post_date = node.xpath('./div/div[@class="t5"]').text
    date_text = post_date.split('-').join

    job_name = node.xpath('./div/div[@class="t1"]/a').text
    link = node.xpath('./div/div[@class="t1"]/a/@href').text
    link = 'http://www.shundehr.com' + link
    company_name = node.xpath('./div/div[@class="t2"]/a').text
    jobid_text = node.xpath('./input/@value').text

    srcsite = 'shundehr'
    info = {
      'name' => job_name,
      'company' => company_name,
      'post_date' => date_text,
      'update_date' => date_text,
      'jobid' => jobid_text,
      'link' => link,
      'srcsite' => srcsite
    }

    #~ puts info.inspect; STDIN.gets

    # Nokogiri returns nil when it cannot find an element,
    # so it is worth checking here that no value is missing.
    selects = info.select { |k, v| (v == nil) } # selects is a hash
    if selects.empty?
      infos << info
    else
      puts 'some elements were missing while analysing the html', info.inspect, '---'
    end
  end
  return infos
end

# test
#~ require 'nokogiri'
#~ fc = File.read('../tmp/shundehr_sample.html')
#~ r = page_analyse_sdhr(fc)
```
Some Nokogiri usage follows. Note that Nokogiri::XML parses regular XML, while Nokogiri::HTML also supports void tags such as <image> and <input> that have no closing tag. On Windows it is best to pass Nokogiri's 'utf-8' argument explicitly; otherwise Nokogiri parses with the Windows console's GBK encoding.
Nokogiri supports xpath('./node'); to read the href of <a href="link">text</a>, use xpath('./a/@href').text.
Nokogiri also has .to_html, .to_s, .text, and .value; I cannot find that code at the moment, so try them out in practice.
When using Nokogiri it is best to debug step by step: when Nokogiri cannot find an element, it sometimes raises an exception and sometimes returns nil.
Enumerable#any? usage: [3, 4, 5].any? { |x| x % 2 == 0 } # => true
