February 2010 Post Archive - 刘超觉先

Notes for Hadoop the definitive guide

Summary: 1. Introduction to HDFS 1.1. HDFS Concepts 1.1.1. Blocks. HDFS too has the concept of a block, but it is a much larger unit: 64 MB by default. Like in a filesystem for a single disk, files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike a filesystem for...

posted @ 2010-02-27 23:01 刘超觉先 Views(5496) Comments(0) Recommendations(3)
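
The block arithmetic in the HDFS notes above can be made concrete with a small sketch. It assumes only the 64 MB default quoted in the summary; `HdfsBlockMath` and its methods are hypothetical helpers written for illustration and are not part of the Hadoop API.

```java
// Hypothetical helper (not a Hadoop class): models how a file's bytes
// are split into fixed-size, independently stored blocks.
final class HdfsBlockMath {
    // 64 MB, the default block size quoted in the notes.
    static final long DEFAULT_BLOCK_SIZE = 64L * 1024 * 1024;

    /** Number of block-sized chunks needed for fileSize bytes (ceiling division). */
    static long blockCount(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative / positive");
        }
        return (fileSize + blockSize - 1) / blockSize;
    }

    /** Bytes stored in the final chunk; 0 for an empty file. */
    static long lastBlockBytes(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative / positive");
        }
        if (fileSize == 0) {
            return 0;
        }
        long remainder = fileSize % blockSize;
        return remainder == 0 ? blockSize : remainder;
    }

    public static void main(String[] args) {
        long fileSize = 200L * 1024 * 1024; // a 200 MB file
        System.out.println(blockCount(fileSize, DEFAULT_BLOCK_SIZE));                       // prints 4
        System.out.println(lastBlockBytes(fileSize, DEFAULT_BLOCK_SIZE) / (1024 * 1024));   // prints 8
    }
}
```

A 200 MB file therefore maps to three full 64 MB chunks plus one 8 MB chunk.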

Five Java Interview Questions a Day (2010/02/26)

Summary: Question 1. public static void append(String str){ str += " Append!"; } public static void append(StringBuffer sBuffer){ sBuffer.append(" Append!"); } public void test(){ String str = "Nothing"; append(str); System.out.println(str); StringBuffer sBuffer = new StringBuffer("...

posted @ 2010-02-27 15:47 刘超觉先 Views(1574) Comments(1) Recommendations(1)
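
The interview snippet above is cut off, but the two `append` overloads suggest it contrasts immutable `String` with mutable `StringBuffer`. A minimal runnable sketch of that contrast follows; the class name `AppendDemo`, the wrapper methods, and the `StringBuffer` initial value are assumptions, since the original is truncated before that point.

```java
// Sketch of the String vs. StringBuffer contrast; names other than the two
// append overloads are assumptions, not taken from the truncated original.
final class AppendDemo {
    // Java passes the reference by value: += builds a new String and
    // rebinds only the local parameter, so the caller sees no change.
    static void append(String str) {
        str += " Append!";
    }

    // StringBuffer is mutable: the caller's object itself is modified.
    static void append(StringBuffer sBuffer) {
        sBuffer.append(" Append!");
    }

    static String viaString(String initial) {
        String str = initial;
        append(str);
        return str;
    }

    static String viaStringBuffer(String initial) {
        StringBuffer sBuffer = new StringBuffer(initial);
        append(sBuffer);
        return sBuffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(viaString("Nothing"));       // prints "Nothing"
        System.out.println(viaStringBuffer("Nothing")); // prints "Nothing Append!"
    }
}
```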

Notes for Advanced Linux Programming

Summary: The original book is in English, so these notes are in English as well; thank you for your understanding. 1. Getting Started http://www.cnblogs.com/forfuture1978/archive/2010/02/11/1667457.html 2. Writing Good GNU/Linux Software http://www.cnblogs.com/forfuture1978/archive/2010/02/11/1667458.html 3. Processes http://www.cnblogs.com/forfuture1978/archive/2010/02/12/1667789.ht..

posted @ 2010-02-25 13:08 刘超觉先 Views(1285) Comments(0) Recommendations(1)

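Lucene 3.0: Principles and Code Analysis

Summary: This series describes in detail the basic principles of a nearly current version of Lucene, together with an analysis of its code. The articles on overall architecture and index file format are based on Lucene 2.9; the analysis of the indexing process is based on Lucene 3.0. Because the index file format has not changed much, those articles have not been updated. The articles on principles and architecture borrow some diagrams from earlier authors that may depict earlier versions of Lucene, but this does not affect understanding of the principles and architecture. The series is still being written and will include chapters on analyzers, segment merging, QueryParser, query statements and query objects, the search process, and the derivation of the scoring formula. I am sharing it early and welcome criticism and corrections. Lucene Study Notes, Part 1: Basic Principles of Full-Text Search http://www.cnblogs.com/forfuture1978/archive/2009/ ...

posted @ 2010-02-22 20:25 刘超觉先 Views(9036) Comments(8) Recommendations(7)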

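O'Reilly President Tim O'Reilly: What Is Web 2.0

Summary: Translator's preface: The concept of Web 2.0 was put forward by Tim O'Reilly, president and CEO of O'Reilly Media. He is widely regarded as a legendary figure in the American IT industry and as the originator of the "open source" concept; he has long advocated open standards and has been active at the forefront of the open source movement. This article, written by Tim O'Reilly himself on the eve of the Web 2.0 conference he hosted last month, sparked heated discussion as soon as it was published and is regarded as the classic work on Web 2.0 to date. A key principle of Web 2.0 is that the more users there are, the better the service becomes. (Author: Tim O'Reilly; Translator: 玄伟剑) The bursting of the dot-com bubble in the fall of 2001 marked a turning point for the web. Many people ...

posted @ 2010-02-15 20:49 刘超觉先 Views(1202) Comments(0) Recommendations(1)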

Notes for Advanced Linux Programming - 6. Devices

Summary: 6. Devices. A device driver hides the hardware device’s communication protocols from the operating system and allows the system to interact with the device through a standardized interface. Processes can communicate with a device driver via file-like objects. 6.1 Device Types. A character device re...

posted @ 2010-02-12 11:10 刘超觉先 Views(838) Comments(0) Recommendations(0)

Notes for Advanced Linux Programming - 5. Interprocess Communication

Summary: 5. Interprocess Communication. Five types of interprocess communication: Shared memory permits processes to communicate by simply reading and writing to a specified memory location. Mapped memory is similar to shared memory, except that it is associated with a file in the filesystem. Pipes permit...

posted @ 2010-02-12 11:06 刘超觉先 Views(863) Comments(0) Recommendations(0)

Notes for Advanced Linux Programming - 4. Threads

Summary: 4. Threads. To use the POSIX standard thread API (pthreads), link your program against libpthread.so. 4.1. Thread Creation. Each thread in a process is identified by a thread ID of type pthread_t. The pthread_self function returns the thread ID of the current thread. Thread IDs can be compared with the p...

posted @ 2010-02-12 11:00 刘超觉先 Views(1324) Comments(0) Recommendations(0)

Notes for Advanced Linux Programming - 3. Processes

Summary: 3. Processes. Each process is identified by its unique process ID. Every process has a parent process. Processes are arranged in a tree, with the init process at its root. A program can obtain its process ID with getpid() and the process ID of its parent process with getppid(). #incl...

posted @ 2010-02-12 10:48 刘超觉先 Views(959) Comments(0) Recommendations(0)

Notes for Advanced Linux Programming - 1. Getting Started

Summary: 1. Getting Started 1.1. Compiling with GCC 1.1.1. Create the source code files. C source file: main.c #include <stdio.h> #include "reciprocal.hpp" int main (int argc, char **argv) { int i; i = atoi (argv[1]); printf ("The reciprocal of %d is %g\n", i, reciprocal (i)); return 0; } (rec..

posted @ 2010-02-11 11:52 刘超觉先 Views(986) Comments(0) Recommendations(0)

Notes for Advanced Linux Programming - 2. Writing Good GNU/Linux Software

Summary: 2. Writing Good GNU/Linux Software 2.1. Interaction With the Execution Environment 2.1.1. Command Line. When a program is invoked from the shell, the argument list contains both the name of the program and any command-line arguments provided. % ls -s / The argument list has three element..

posted @ 2010-02-11 11:52 刘超觉先 Views(732) Comments(0) Recommendations(0)

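Questions about Lucene (4): Four Ways to Influence How Lucene Scores Documents

Summary: Set the Document Boost and Field Boost at index time; they are stored in the (.nrm) file. If you want certain documents and certain fields to be more important than others, so that a document or field containing the query term scores higher, you can set the document's boost and the field's boost at index time. These values are written to the index files during indexing and stored in the normalization factor (.nrm) file; once set, they cannot be changed unless the document is deleted. If not set, Document Boost and Field Boost default to 1. Document Boost and Field Boost are set as follows: Document doc = new Document(); Field f = n...

posted @ 2010-02-08 23:44 刘超觉先 Views(5509) Comments(2) Recommendations(0)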

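Questions about Lucene (3): The Vector Space Model and Lucene's Scoring Mechanism

Summary: Question: Your article says: We therefore treat the weights (term weight) of all the terms in this document as a vector. Document = {term1, term2, ..., term N} Document Vector = {weight1, weight2, ..., weight N} Likewise, we treat the query as a simple document and also represent it as a vector. Query = {term1, term 2, ..., term N} Query Vector = {weight1, weight2, ..., weight N} We therefore treat the weights (term weight) of all the terms in this document...

posted @ 2010-02-06 13:05 刘超觉先 Views(5276) Comments(0) Recommendations(1)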
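
Once the document and the query are both weight vectors, as in the summary above, a common way to compare them is the cosine of the angle between the two vectors. The sketch below shows that plain textbook cosine only; it is not Lucene's actual scoring formula, and `CosineSimilarity` is a hypothetical class name.

```java
// Plain cosine similarity between a document vector and a query vector.
// Illustrative only: Lucene's real scoring adds further factors.
final class CosineSimilarity {
    static double cosine(double[] docWeights, double[] queryWeights) {
        if (docWeights.length != queryWeights.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double docNormSq = 0.0;
        double queryNormSq = 0.0;
        for (int i = 0; i < docWeights.length; i++) {
            dot += docWeights[i] * queryWeights[i];
            docNormSq += docWeights[i] * docWeights[i];
            queryNormSq += queryWeights[i] * queryWeights[i];
        }
        // An all-zero vector shares no terms with anything: treat as no match.
        if (docNormSq == 0.0 || queryNormSq == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(docNormSq) * Math.sqrt(queryNormSq));
    }

    public static void main(String[] args) {
        double[] document = {3.0, 4.0}; // weights of term1, term2 in the document
        double[] query = {3.0, 0.0};    // the query mentions only term1
        System.out.println(cosine(document, query));
    }
}
```

A score near 1 means the two vectors point in almost the same direction; a score of 0 means they share no weighted terms.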

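Questions about Lucene (2): Stemming and Lemmatization

Summary: Question: I tried out the stemming and lemmatization mentioned in the article. Reducing a word to its root form, such as "cars" to "car", is called stemming. Converting a word to its root form, such as "drove" to "drive", is called lemmatization. The experiment did not succeed; the code is as follows: public class TestNorms { public void createIndex() throws IOException { Directory d = new SimpleFSDirectory(new File("d:/falconTest/lucene3/...

posted @ 2010-02-06 13:04 刘超觉先 Views(6148) Comments(1) Recommendations(0)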
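
The difference between the two operations in the question above can be sketched without Lucene: stemming applies a mechanical rule, while lemmatization needs a dictionary. The rule and the lookup table below are toy assumptions chosen to reproduce the two examples from the summary; they are not a real stemmer (such as Porter's) or a real lemmatizer, and `StemVsLemma` is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration only: one suffix rule and a two-entry dictionary.
final class StemVsLemma {
    private static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("drove", "drive");
        LEMMAS.put("driven", "drive");
    }

    /** Stemming: strip a trailing "s" by rule, with no dictionary. */
    static String stem(String word) {
        if (word.length() > 2 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    /** Lemmatization: look the word up; unknown words are returned unchanged. */
    static String lemmatize(String word) {
        return LEMMAS.getOrDefault(word, word);
    }

    public static void main(String[] args) {
        System.out.println(stem("cars"));       // prints "car"
        System.out.println(stem("drove"));      // prints "drove": no rule applies
        System.out.println(lemmatize("drove")); // prints "drive"
    }
}
```

The second line of output shows why irregular forms need a dictionary: no suffix rule turns "drove" into "drive".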

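Algorithms, Part 1: An Age-Old Problem

Summary: There are several kinds of search algorithms. Enumeration: list all states of the problem in order to find those that satisfy it. It suits relatively simple problems with few states. Breadth-first search: starting from the initial node, expand the first layer of nodes according to the rules and check whether the target node is among them; if not, expand each first-layer node in turn to obtain the second layer, and continue expanding until the target node is found. It is well suited to problems that ask for the fewest steps or the shortest solution sequence. Typically a queue is used: put the start node into the queue, then take a node from the head of the queue and check whether it is the target; if not, expand it and append all the expanded nodes to the tail of the queue, then take the next node from the head, until the target node is found. Depth-first search: typically a stack sta...

posted @ 2010-02-03 00:31 刘超觉先 Views(3121) Comments(0) Recommendations(1)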
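
The queue-based breadth-first search described above can be sketched as follows. The graph, its node numbering, and the class name `BfsDemo` are made up for illustration; the visited-distance map is an addition the summary does not mention, needed so that nodes are not expanded twice.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BfsDemo {
    /** Fewest edges from start to target, or -1 if target is unreachable. */
    static int shortestSteps(Map<Integer, List<Integer>> graph, int start, int target) {
        Deque<Integer> queue = new ArrayDeque<>();
        Map<Integer, Integer> distance = new HashMap<>(); // doubles as the visited set
        queue.add(start);
        distance.put(start, 0);
        while (!queue.isEmpty()) {
            int node = queue.poll();              // take a node from the head
            if (node == target) {
                return distance.get(node);
            }
            for (int next : graph.getOrDefault(node, Collections.emptyList())) {
                if (!distance.containsKey(next)) {
                    distance.put(next, distance.get(node) + 1);
                    queue.add(next);              // expanded nodes go to the tail
                }
            }
        }
        return -1;
    }

    /** Hypothetical directed graph: 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4, 4 -> 5. */
    static Map<Integer, List<Integer>> sampleGraph() {
        Map<Integer, List<Integer>> graph = new HashMap<>();
        graph.put(1, Arrays.asList(2, 3));
        graph.put(2, Arrays.asList(4));
        graph.put(3, Arrays.asList(4));
        graph.put(4, Arrays.asList(5));
        return graph;
    }

    public static void main(String[] args) {
        System.out.println(shortestSteps(sampleGraph(), 1, 5)); // prints 3
    }
}
```

Because nodes leave the queue in order of their distance from the start, the first time the target is dequeued its recorded distance is the minimum number of steps.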

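Lucene Study Notes, Part 4: Analysis of the Lucene Indexing Process (4)

Summary: 6. Closing the IndexWriter object. Code: writer.close(); --> IndexWriter.closeInternal(boolean) --> (1) Write the index information from memory to disk: flush(waitForMerges, true, true); --> (2) Merge segments: mergeScheduler.merge(this); Segment merging will be discussed in a later chapter; here we discuss only the process of writing the index information to disk. Code: IndexWriter.flush(boolean triggerMerge, boolean flushDocStores, boole...

posted @ 2010-02-02 02:02 刘超觉先 Views(6392) Comments(5) Recommendations(3)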

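Lucene Study Notes, Part 4: Analysis of the Lucene Indexing Process (3)

Summary: 5. DocumentsWriter's buffer management of CharBlockPool, ByteBlockPool, and IntBlockPool. During indexing, DocumentsWriter stores term information in CharBlockPool, and stores document number (doc ID), term frequency (freq), and position (prox) information in ByteBlockPool. In ByteBlockPool, the buffer is allocated in slices, and slices are organized into levels: the higher the level, the larger its slices, and all slices within a level are the same size. nextLevelArray gives the level that follows the current level; the level after level 9 is still level 9, which means the highest level is 9. le...

posted @ 2010-02-02 02:01 刘超觉先 Views(6588) Comments(1) Recommendations(2)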

Lucene Study Notes, Part 4: Analysis of the Lucene Indexing Process (2)

Summary: 3. Adding a document to the IndexWriter. Code: writer.addDocument(doc); --> IndexWriter.addDocument(Document doc, Analyzer analyzer) --> doFlush = docWriter.addDocument(doc, analyzer); --> DocumentsWriter.updateDocument(Document, Analyzer, Term) Note: --> denotes one level of function call. IndexWriter in turn calls DocumentsWriter.addDocument, which in turn calls Docume...

posted @ 2010-02-02 01:59 刘超觉先 Views(11002) Comments(1) Recommendations(2)

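Lucene Study Notes, Part 4: Analysis of the Lucene Indexing Process (1)

Summary: Besides writing terms (Term) into the postings lists and ultimately into Lucene's index files, the Lucene indexing process also includes analysis (Analyzer) and segment merging (merge segments). This article does not cover those two parts; they will be analyzed in later articles. Many blogs and articles describe the Lucene indexing process. I recommend searching online for "Annotated Lucene", which I believe is titled 《Lucene源码剖析》 in Chinese; it is very good. The best way to really understand the Lucene indexing process is to step through the code in a debugger while reading the article. This not only gives you the most detailed and accurate grasp of the indexing process (descriptions always deviate somewhat, but code does not lie), but also lets you learn from some of Lucene's excellent implementations and draw on them in later ...

posted @ 2010-02-02 01:58 刘超觉先 Views(21024) Comments(3) Recommendations(2)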

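Lucene Study Notes, Part 3: The Lucene Index File Format (3)

Summary: 4. Detailed Format 4.2. Inverted information. The inverted information is the core of the index files, that is, the inverted index. The inverted index consists of two parts: the term dictionary (Term Dictionary) on the left and the postings lists (Posting List) on the right. In Lucene, these two parts are stored in separate files: the dictionary is stored in tii and tis, and the postings lists comprise two parts, the document numbers and term frequencies stored in frq, and the term position information stored in prx. Term Dictionary (tii, tis) --> Frequencies (.frq) --> Positions (.prx) 4.2.1. Dictionary (tis) and dictionary index (tii) information. In the dictionary, all terms are sorted in lexicographic order...

posted @ 2010-02-02 01:43 刘超觉先 Views(15554) Comments(2) Recommendations(3)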
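
The three-part structure named above (sorted term dictionary, then document numbers with frequencies, then positions) can be mimicked in memory. This is only an analogy for the logical layout, not Lucene's on-disk format or API; `TinyInvertedIndex` and its whitespace tokenizer are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// In-memory analogy: the sorted keys play the role of the term dictionary,
// the per-term doc map the role of .frq, and the position lists the role of .prx.
final class TinyInvertedIndex {
    // term -> (docId -> positions of the term in that document)
    private final TreeMap<String, TreeMap<Integer, List<Integer>>> postings = new TreeMap<>();

    void addDocument(int docId, String text) {
        String[] tokens = text.trim().toLowerCase().split("\\s+");
        for (int position = 0; position < tokens.length; position++) {
            if (tokens[position].isEmpty()) {
                continue; // only happens for blank input
            }
            postings.computeIfAbsent(tokens[position], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position);
        }
    }

    /** Terms in lexicographic order, as in the dictionary. */
    List<String> terms() {
        return new ArrayList<>(postings.keySet());
    }

    /** Ascending document numbers containing the term. */
    List<Integer> docs(String term) {
        TreeMap<Integer, List<Integer>> byDoc = postings.get(term);
        return byDoc == null ? Collections.emptyList() : new ArrayList<>(byDoc.keySet());
    }

    /** Positions of the term in one document; empty if absent. */
    List<Integer> positions(String term, int docId) {
        TreeMap<Integer, List<Integer>> byDoc = postings.get(term);
        if (byDoc == null || !byDoc.containsKey(docId)) {
            return Collections.emptyList();
        }
        return byDoc.get(docId);
    }

    /** Term frequency: how often the term occurs in the document. */
    int freq(String term, int docId) {
        return positions(term, docId).size();
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.addDocument(0, "lucene is fast");
        index.addDocument(1, "lucene loves lucene");
        System.out.println(index.terms());                // [fast, is, loves, lucene]
        System.out.println(index.docs("lucene"));         // [0, 1]
        System.out.println(index.positions("lucene", 1)); // [0, 2]
    }
}
```

Looking up a term first finds it in the sorted dictionary, then reads which documents contain it and how often, and only then the positions, mirroring the tii/tis, frq, prx chain in the summary.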
