QA MichaelPeng

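Odds and ends from a QA engineer

May 13, 2011

Spell Checking with a Markov Model

For the principle and the original write-up, see this article by Peter Norvig.

The original is based on word frequency; near the end it mentions that context can be used to improve accuracy.

 

The code below only handles the case where the word to correct is at the end of the sequence. It should also handle the word appearing in the middle or at the start of the sequence.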

 

import re, collections, random

def words(text): return re.findall('[a-z]+', text.lower())

def defaultdict_factoryn(n, default):
    # Factory for n nested defaultdict levels whose leaves start at `default`.
    if n == 1: return lambda: default
    return lambda: collections.defaultdict(defaultdict_factoryn(n - 1, default))

def multidict_set(d, l, v):
    # Walk down every key but the last, then assign at the leaf.
    l = list(l)
    for ele in l[:-1]:
        d = d[ele]
    d[l[-1]] = v

def multidict_add(d, l, v):
    # Same walk as multidict_set, but increment the leaf.
    l = list(l)
    for ele in l[:-1]:
        d = d[ele]
    d[l[-1]] += v

def multidict_get(d, l):
    for ele in l:
        d = d[ele]
    return d

def train(features, n):
    # Count every window of n consecutive words; counts start at 1 for smoothing.
    model = collections.defaultdict(defaultdict_factoryn(n, 1))
    prev = collections.deque(maxlen=n)
    for f in features:
        prev.append(f)
        if len(prev) == n:
            multidict_add(model, prev, 1)
    return model

def most_likely(prev):
    # Pick at random among the most frequent words seen after `prev`.
    l = multidict_get(Model, prev)
    if not l:
        return ""
    l = sorted(l.items(), key=lambda kv: kv[1], reverse=True)
    count = min(len(l) - 1, 10)
    return l[random.randint(0, count)][0]

stage = 1
Model = train(words(open('big.txt').read()), stage + 1)

def train_1(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train_1(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b     for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(prev, word):
    if len(prev) < stage:
        # Not enough context: fall back to plain word frequency.
        candidates = known([word]) or known(edits1(word)) or known_edits2(word) or set([word])
        return max(candidates, key=NWORDS.get)
    else:
        # Enough context: rank every candidate by its count after the last `stage` words.
        candidates = known([word]) | known(edits1(word)) | known_edits2(word) | set([word])
        context = list(prev)[-stage:]
        return max(candidates, key=lambda x: multidict_get(Model, context + [x]))


print(correct(["i"], "ove"))
print(correct([], "ove"))
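
The remaining cases (the word in the middle or at the start of the sequence) could be handled by also scoring the word that follows. The following is only a minimal sketch of that idea, not the author's code: the tiny corpus, the names `bigram_counts` and `pick`, and the add-one smoothing are all assumptions made for illustration.

```python
import collections
import re

def words(text):
    return re.findall('[a-z]+', text.lower())

def bigram_counts(tokens):
    # counts[(a, b)] = number of times b directly follows a
    counts = collections.Counter()
    for a, b in zip(tokens, tokens[1:]):
        counts[(a, b)] += 1
    return counts

def pick(candidates, counts, left=None, right=None):
    # Score a candidate by the pair it forms with each available neighbor.
    # Add-one smoothing keeps unseen pairs from zeroing the score.
    def score(c):
        s = 1
        if left is not None:
            s *= counts[(left, c)] + 1
        if right is not None:
            s *= counts[(c, right)] + 1
        return s
    # sorted() makes ties deterministic
    return max(sorted(candidates), key=score)

# Hypothetical toy corpus standing in for big.txt
corpus = "i love you . we love music . they move house . over the hill"
counts = bigram_counts(words(corpus))
candidates = {"love", "move", "over"}

print(pick(candidates, counts, left="i"))                 # word at the end
print(pick(candidates, counts, right="house"))            # word at the start
print(pick(candidates, counts, left="we", right="music")) # word in the middle
```

With only a left neighbor this reduces to the end-of-sequence case handled above; with only a right neighbor it covers a word at the start of the sequence.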

 

posted @ 2011-05-13 17:12 Michael Peng Views(69) Comments(0)

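April 29, 2011

Enterprise Software Programming Is Boring (Repost)

The original, But Martin, Enterprise Software IS Boring, was written in 2006 by a developer at ThoughtWorks. On the Chinese-language web, the first few pages of Google results are mostly g9's post 商业软件编程很无聊; I could not find a translation of the original.

Since the original is hosted outside the Great Firewall, I am copying it here.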

 

But Martin, Enterprise Software IS Boring

Martin Fowler writes about Customer Affinity, a factor he believes distinguishes a good enterprise developer from a bad one. So far, so good.

He says "I've often heard it said that enterprise software is boring, just shuffling data around, that people of talent will do "real" software that requires fancy algorithms, hardware hacks, or plenty of math."

This is almost exactly what I say. (Just remove the quotes around "real" ;-) ).

Martin goes on to disagree with this idea (which is fine) but then he says "I feel that this usually happens due to a lack of customer affinity." And this is where I disagree.

People who work on things like compilers and hardware hacks and "tough" algorithmic problems can have customers, just as the enterprise folks do. I know I have. The compiler I am writing now has a customer. The massively parallel neural network classifier framework I wrote a couple of years ago had a customer. The telecom fraud detection classifier cluster I worked on last year (along with some other talented programmers) is being used by a company which is very business (and customer) oriented. So the idea that a "customer" (and thus customer affinity) is restricted only to those folks who write database backed web apps is simply not true.

Math/algorithms/hardware hacks and the presence or absence of a "customer" (and thus, customer affinity) are orthogonal (you know, two axes at 90 degrees to each other) issues.

With the customer issue out of the way, let us examine the core issue - the notion that talented developers would prefer to do stuff like compilers and algorithms instead of a business app (by which I mean something like automating the loan disbursal processes of a bank).

The best way to get a sense of the truth is to examine the (desired) flow of people in both directions. I know dozens of people who are very very good at writing business software who yearn wistfully for a job doing "plumbing" like compilers and tcp/ip stacks, but I've never yet seen someone who is very very good at writing a compiler or operating system (and can make money doing so) desperately trying to get back to the world of banking software. A programmer might code enterprise apps for money, but at night, at home, he'll still hack on a compiler.

It's kind of a "Berlin Wall" effect. In the days of the cold war, to cut through the propaganda of whether Communism was better than Capitalism, all you had to do was to observe in which direction people were trying to breach the Berlin Wall. Hundreds of people risked their lives to cross from the East to the West but practically no one went the other way. Of course this could just mean that the folks didn't appreciate the virtues of people's power and the harangues of the commissars but I somehow doubt it.

To see whether living in India is better than living in Bangladesh (or whether living in the United States is better than living in India) look at who is trying to cross the barriers and in which direction.

Economic Theory suggests another way of finding an answer. Suppose you were offered, say, $10,000 a month for, say, 2 years, to develop the best business app you can imagine and, say, $8,000 a month, for the same two years to work on the best systems/research project you can imagine. Which one would you choose? Now invert the payment for each type of work. Change the amounts till you can't decide one way or another and your preference for either job is equal. Finding this point of indifference (actually it is an "indifference curve") will teach you about your preference for "enterprise over systems". I believe that while individuals will choose many possible points as their "points of indifference" on that curve, a majority of the most talented developers would prefer a lower pay for a good systems/research project in preference to an enterprise project, no matter how interesting. And to most developers, an enterprise project becomes interesting to the degree it needs "tough programming" skills (massively scalable databases, say... there is a good reason Amazon and Google emphasize algorithms, maths and other "plumbing" skills in their interviews and an equally good reason why a Wipro or a TCS doesn't bother testing for these skills).

The widespread (though not universal) preference of the very best software developers for "plumbing" over "enterprise" holds even inside Thoughtworks (where Martin works and where I once worked). I know dozens of Thoughtworkers who'd prefer to work in "core tech". While I was working for TW, the CEO, Roy Singham, periodically raised the idea of venturing into embedded and other non enterprise software and a large majority of developers (which invariably included most of the best devs) wanted this to happen. Sadly, this never actually came to pass.

A good number of Thoughtworkers do business software to put bread on the table, or while working towards being good enough to write "plumbing", but in their heart, they yearn to hack a kernel or program a robot. (Another group of people dream about starting their own web app companies). For most of these folks tomorrow never quite arrives, but some of the most "businessy" developers in Thoughtworks dream of writing game engines one day or learning deep math voodoo or earning an MS or PhD. And good people have left TW for all these reasons. And if this is the situation at arguably the best enterprise app development company (at least in the "consultants" subspace) one can imagine the situation in "lesser" companies.

The programmer who has the ability to develop business software and "plumbing" software and chooses to do business app development is so rare as to be almost non existent. Of course, most business app dev types are in no condition to even contemplate writing "hard stuff" but that is a topic for another day.

Paul Graham (admittedly a biased source) says (emphasis mine),

"It's pretty easy to say what kinds of problems are not interesting: those where instead of solving a few big, clear, problems, you have to solve a lot of nasty little ones....Another is when you have to customize something for an individual client's complex and ill-defined needs. To hackers these kinds of projects are the death of a thousand cuts.

The distinguishing feature of nasty little problems is that you don't learn anything from them. Writing a compiler is interesting because it teaches you what a compiler is. But writing an interface to a buggy piece of software doesn't teach you anything, because the bugs are random. [3] So it's not just fastidiousness that makes good hackers avoid nasty little problems. It's more a question of self-preservation. Working on nasty little problems makes you stupid. Good hackers avoid it for the same reason models avoid cheeseburgers."

Strangely enough, I heard Martin talk about the same concept (from a more positive view point, of course) when he last spoke in Bangalore. He explained how much of the complexity of business software is essentially arbitrary - how an arcane union regulation about paying overtime for one man going bear hunting on Thanksgiving can wreck the beauty of a software architecture. Martin said that this is what is fascinating about business software. But he and Paul essentially agree about business software being about arbitrariness. It is a strange sense of aesthetics that finds beauty in arbitrariness, but who am I to say it isn't valid?

Modulo these ideas, I agree with a lot of what Martin says in his blog post. The most significant sentence is

"The real intellectual challenge of business software is figuring out where what the real contribution of software can be to a business. You need both good technical and business knowledge to find that."

This is so true it needs repeating. And I have said this before. To write good banking software, for example, you need a deep knowledge of banking AND (say) the j2ee stack. Unfortunately the "projects and consultants" part of business software development doesn't quite work this way. I will go out on a limb here and say that software consulting companies *in general* (with the rare honorable exception blah blah ) have "business analysts" who are not quite good enough to be managers in the enterprises they consult for and "developers" who are not quite good enough to be "hackers" (in the Paul Graham sense).

Of course factoring in "outsourcing" makes the picture worse. By the time a typical business app project comes to India, most, if not all of the vital decisions have been made and the project moves offshore only to take advantage of low cost programmers, no matter what the company propaganda says about "worldwide talent", so at least in India this kind of "figuring out" is almost non existent and is replaced by an endless grind churning out jsp pages or database tables or whatever. It is hard to figure out the contribution of your software to people and businesses located half a world away, no matter how many "distributed agilists" or "offshore business analysts" you throw into the equation.

This is true for non businessy software also, though to a much lesser degree. The kind of work outsourced to India is still the non essential "grind" type (though often way less boring than their "enterprise" equivalents). The core of Oracle's database software, for example, is not designed or implemented in India. Neither is the core of Yahoo's multiple software offerings - the maintenance is, development is not. And contrary to what many Indians say, this is NOT about the "greedy white man's" exploiting the cheap brown skins. You need to do hard things and prove yourself before you are taken seriously. Otherwise people WILL see you as cheap drone workers. That's just the way of the world.

And I say this as someone who is deeply interested in business. 'Tis not that I loved business less but I loved programming more :-). I'd rather read code than a business magazine but I prefer a business magazine to say, fashion news. I actually LIKE reading about new business models. I read every issue of The Harvard Business Review (thanks Mack) and have dozens of businessy books on my book shelf along with all the technical books. (I gave away all (300 + books) of my J2EE/dotNet/agile/enterprise dev type book collection, but that is a different story). I am working through books on finance and investing and logistics. I have friends who did their MBA in Corporate Strategy from Dartmouth or are studying in Harvard (or plan to do so). I have friends who work for McKinsey, and Bain and Co, and Booz and Allen, and Lehman Brothers. If I were 10 years younger (and thus had a few extra years to live), I'd probably do an MBA myself. And of course I have spent 10 years (too long, oh, too long) doing "enterprise" software. In my experience (which could turn out to be atypical), the "plumbing" software is way tougher and waaaaay more intellectually meaningful than enterprise software, even of the hardest kind. I might enjoy an MBA, but I will NEVER work in the outsourced business app development industry again even if I have to starve on the streets of Bangalore. I'll kill myself first. So, yeah I am prejudiced. :-). Take everything I say with appropriate dosages of salt.

And just so I am clear, I do NOT think Martin is deliberately obfuscating issues. I think Martin is the rare individual who actually chooses enterprise software over systems software and is lucky enough to consistently find himself working with the best minds and in a position to figure out the business impact of the code he writes. This is very different from the position of the average "coding body", especially when he is selected more for being cheap than because he knows anything useful ("worldwide talent", remember :-) ).

Btw, Martin's use of the word "plumbing" is not used in a derogatory sense (though I've seen some pseudo "hackers" take it that way, especially since the word makes a not-so-occasional appearance in many of his speeches) but more in the sense of "infrastructure". If you are such a "hacker" offended by the use of the term, substitute "infrastructure" for "plumbing" in his article. That takes the blue collar imagery and sting out of that particular word, fwiw.

I think Martin has very many valuable things to say. I just think he is slightly off base in this blog post and so I respectfully disagree with some of his conclusions.

But then again, it could all be me being crazy. Maybe enterprise software is really as fascinating as systems software, and writing a banking or leasing or insurance app really as interesting as writing a compiler or operating system (and unicorns exist, and pigs can fly).

Hmmmm. Maybe. Meanwhile I'm halfway under the wall and have to come up on the other side before dawn. Back to digging.

 

posted @ 2011-04-29 11:07 Michael Peng Views(201) Comments(0)

March 16, 2011

VS2010 Debugger bug

Versions: VS2010, VS2010 SP1

Symptom: the debugger does not handle the scope of local variables correctly

Sample code:

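#include <stdio.h>
#include <tchar.h>

int _tmain(int argc, _TCHAR* argv[])
{
    int i = 5;
    int sum = 0;
    // The loop variable shadows the outer i only inside the loop.
    for (int i = 0; i < 10; ++i)
    {
        sum += i;
    }
    printf("%d\n", i);  // this i is the one declared in main
    return 0;
}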

 

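Set a breakpoint on the return statement: printf outputs 5, but the Watch window shows i as 10. The value 10 belongs to the loop's i; at this point i should refer to the i declared in main, whose value is 5.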

 

 

posted @ 2011-03-16 15:03 Michael Peng Views(142) Comments(3)

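March 10, 2011

Beginnings in Programming

I did not touch a computer until my freshman year. Introduction to Computing was taught by Professor Xie Baiqing. At the time, the intro course in many science departments went straight to Fortran after covering computer organization and binary. Professor Xie did otherwise: after the basic principles of computing, he first taught us office applications: copy, paste, the format painter, Excel charts and functions. These are commonplace now, but back then they were an eye-opener. So a computer could be used like this! It made us the experts in our dorm, and our interest in learning soared. Professor Xie also did not teach Fortran out of habit; he let students choose between Fortran and C. We knew nothing and only felt that C was a bit more fashionable than Fortran, so we learned C. Looking back, cultivating students' interest and respecting their choices, these two things Professor Xie did, directly shaped my fate as a coder.

 

In the second semester of freshman year I took Professor Qiu Zongyan's algorithms and data structures course. I worked through every programming exercise, and for the term project I wrote a DOS text editor in Turbo C. I even pulled all-nighters for the homework, but what I learned was put to use, and I gained a lot. Pointers were no longer a dry topic from C class but a sharp tool for building program features directly. Linked lists and arrays were no longer just ADTs in a textbook; they directly affected the performance of the program's operations.

 

Beyond coursework, practice matters a great deal. Programming is not drudgery but something fun. You can write programs with no utilitarian purpose at all, just for fun, and it brings a simple, pure joy. Thinking back on what I did in college: drawing fractals in Turbo C and watching a beautiful pattern unfold before my eyes bit by bit was a thrill no other pleasure could replace. I wrote game-modification tools, and got familiar with Windows programming while beating the games. After dynamic tracing and analysis of the assembly code, the moment I found a registration code brought, besides the thrill of breaking something, a deeper understanding of assembly and of how programs behave at run time. I wanted to play Chinese chess, and in solving 《桔中秘》 I picked up bit-manipulation tricks, alpha-beta pruning, and hash tables along the way.

 

In the beginning, programming comes down to one word: play. It is just fun; everything else is a by-product.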

posted @ 2011-03-10 12:00 Michael Peng Views(423) Comments(0)

February 5, 2011

编程之美 (Beauty of Programming) 1.4: A Constant Time and Space Solution to the Book-Buying Problem

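Problem:

During holidays, bookstores usually run promotions. Because the Harry Potter series sells so well, the manager decides to reward readers with a promotion. The paperback Harry Potter series on sale has five volumes, numbered 0, 1, 2, 3, 4. Each volume costs 8 euros when bought alone. A reader who buys two different volumes at once gets 5% off, and more for three volumes. The discounts are as follows:

Books     Discount
2         5%
3         10%
4         20%
5         25%

Within one order, depending on which volumes and how many copies are bought, different discount rules may apply. However, each book is covered by only one discount rule. For example, suppose a reader buys two copies of volume 1 and one copy of volume 2. One copy of volume 1 together with the copy of volume 2 gets the 5% discount; the other copy of volume 1 gets no discount. When several discounts are possible, the total should be as low as possible.
Given these requirements, design an algorithm that computes the lowest price for the batch of books a reader buys.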
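
To make the pricing rule concrete, here is a small sketch that prices an explicit grouping of an order. It is an illustration only: the names `group_price` and `order_price` are made up here, and prices are kept in euro cents so the arithmetic stays exact.

```python
UNIT_PRICE_CENTS = 800
# Discount in percent, by number of distinct volumes in a group
# (taken from the table in the problem statement).
DISCOUNT_PERCENT = {1: 0, 2: 5, 3: 10, 4: 20, 5: 25}

def group_price(group):
    # A discount only applies to a group of distinct volumes.
    assert len(set(group)) == len(group)
    k = len(group)
    return k * UNIT_PRICE_CENTS * (100 - DISCOUNT_PERCENT[k]) // 100

def order_price(groups):
    return sum(group_price(g) for g in groups)

# Two copies of one volume and one copy of another, grouped two ways:
print(order_price([[0, 1], [0]]))    # pair discounted, one copy at full price
print(order_price([[0], [0], [1]]))  # no discount used at all
```

The first grouping costs 2320 cents (23.20 euros) and the second 2400 cents, so the task is to find the grouping with the lowest total.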

 

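Analysis:

First, call the five volumes A, B, C, D, E, and let Na, Nb, Nc, Nd, Ne be the number of copies bought of each. Without loss of generality, assume Na >= Nb >= Nc >= Nd >= Ne.

Suppose all the books are partitioned into k groups, with no volume repeated within a group. Below, the discount of a grouping is measured in units of one book's price: a group of 2, 3, 4 or 5 books saves 0.1, 0.3, 0.8 or 1.25 respectively.

Then an optimal solution has the following properties: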

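1 All groups containing only one book contain the same volume

   If two single-book groups contained different volumes, merging them would give a better solution, contradicting optimality.

2 Single-book groups contain only A

 Suppose the single-book groups contain A'. If Na' == Na, simply swap the roles of A and A'.

 If Na' < Na, there must be some group that contains A but not A'; merging the single-book group of A' into that group gives a better solution.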

 

3 All groups containing exactly two books contain the same two volumes

(A', B'), (A', C') has discount 0.2; regrouping as (A', B', C'), (A') gives discount 0.3

 

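4 Groups containing two books must contain A

If a single-book group (A) exists and a two-book group does not contain A, merging them gives a better solution.

If no single-book group (A) exists and the two-book group is (A', A''), then some three-book or four-book group must contain A but lack A'.

(A', A''), (A, X, Y) has discount 0.4, worse than (A''), (A, X, Y, A') with 0.8

(A', A''), (A, X, Y, Z) has discount 0.9, worse than (A''), (A, X, Y, Z, A') with 1.25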

 

5 Groups containing two books must contain B

The proof is the same as above.

 

6 All groups containing exactly three books contain the same three volumes

(A, B, C), (A, B, D) has discount 0.6; (A, B), (A, B, C, D) has discount 0.9

 

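7 All groups containing three books contain A

If a group containing only A exists, merging gives a better solution.

If a group containing only A and B exists: (A, B), (X, Y, Z) has discount 0.4, worse than (B), (A, X, Y, Z) with 0.8

If there are no one-book or two-book groups, let the three-book group be (X, Y, Z); then there must be a four-book group that contains A and lacks X. (X, Y, Z), (A, A', A'', A''') has discount 1.1, worse than (Y, Z), (A, A', A'', A''', X) with 1.35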

 

8 All groups containing three books contain B and C

Proved in the same way.

 

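9 All groups containing four books contain A, B, C

Let the four-book group be (X, Y, Z, W).

(A, B, C), (B, C, D, E) has discount 1.1, worse than (B, C), (A, B, C, D, E) with 1.35

The other cases are proved similarly.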

 

10 Groups of five books and groups of three books never appear together

(A, B, C), (A, B, C, D, E) has discount 1.55, worse than (A, B, C, D), (A, B, C, E) with 1.6
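
The discount figures quoted in properties 3 to 10 are all sums of "group size times discount rate", measured in units of one book's price. As a quick sanity check, the sketch below recomputes each comparison; the helper name `saving` and the list layout are illustrative choices, not part of the original post.

```python
from fractions import Fraction

# Discount rate by group size, from the problem's discount table.
RATE = {1: Fraction(0), 2: Fraction(5, 100), 3: Fraction(10, 100),
        4: Fraction(20, 100), 5: Fraction(25, 100)}

def saving(*sizes):
    # Total discount of a grouping, in units of one book's price.
    return sum(k * RATE[k] for k in sizes)

# (group sizes before regrouping, group sizes after regrouping)
comparisons = [
    ((2, 2), (3, 1)),  # property 3:  0.2  vs 0.3
    ((2, 3), (1, 4)),  # property 4:  0.4  vs 0.8
    ((2, 4), (1, 5)),  # property 4:  0.9  vs 1.25
    ((3, 3), (2, 4)),  # property 6:  0.6  vs 0.9
    ((3, 4), (2, 5)),  # properties 7 and 9: 1.1 vs 1.35
    ((3, 5), (4, 4)),  # property 10: 1.55 vs 1.6
]
for before, after in comparisons:
    # In every case the regrouped arrangement saves strictly more.
    assert saving(*before) < saving(*after)
    print(before, float(saving(*before)), "<", after, float(saving(*after)))
```

Only the last comparison is surprising: everywhere else a larger group is better, but a group of five next to a group of three loses to two groups of four.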

 

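The proofs above lead to the following conclusions:

  1. Every group contains A, so the total number of groups equals Na
  2. Every group of two or more books contains B, so the number of such groups equals Nb
  3. Every group of three or more books contains C, so the number of such groups equals Nc
  4. Groups of three and groups of five do not coexist

 This gives the following solution:

Code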
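#include <algorithm>

const double BuyBook::UNIT_PRICE = 8;
const double BuyBook::DISCOUNTS[5] = {1, 0.95, 0.9, 0.8, 0.75};
static const int BOOK_KINDS = 5;

double BuyBook::SearchFast(int* books)
{
    // Sort is assumed to order the counts in descending order (Na >= ... >= Ne).
    Sort(books);
    // g[i] is the number of groups holding exactly i + 1 distinct volumes.
    int g[5];
    g[0] = books[0] - books[1];
    g[1] = books[1] - books[2];
    g[2] = books[2] - books[3];
    g[3] = books[3] - books[4];
    g[4] = books[4];
    // A group of three plus a group of five costs more than two groups of four.
    int t = std::min(g[2], g[4]);
    if (t > 0)
    {
        g[2] -= t;
        g[4] -= t;
        g[3] += 2 * t;
    }
    double sum = 0;
    for (int i = 0; i < BOOK_KINDS; ++i)
    {
        sum += g[i] * (i + 1) * UNIT_PRICE * DISCOUNTS[i];
    }
    return sum;
}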
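
One way to gain confidence in the constant-time formula is to compare it with an exhaustive search on small orders. The sketch below is a Python port written for this check, not the author's code: `search_fast` mirrors the function above (with prices in cents to keep the arithmetic exact), while `search_slow` and the tested range of 0 to 3 copies per volume are assumptions made for illustration.

```python
from functools import lru_cache
from itertools import combinations, product

# Price in cents of one group, by number of distinct volumes in it.
PRICE = {1: 800, 2: 1520, 3: 2160, 4: 2560, 5: 3000}

def search_fast(books):
    b = sorted(books, reverse=True) + [0]
    # g[i] = number of groups with exactly i + 1 distinct volumes
    g = [b[i] - b[i + 1] for i in range(5)]
    t = min(g[2], g[4])  # trade each (3, 5) pair for (4, 4)
    g[2] -= t
    g[4] -= t
    g[3] += 2 * t
    return sum(g[i] * PRICE[i + 1] for i in range(5))

@lru_cache(maxsize=None)
def search_slow(books):
    # books: tuple of 5 counts in descending order.
    # Exhaustive search: peel off every possible group of distinct volumes.
    if sum(books) == 0:
        return 0
    available = [i for i in range(5) if books[i] > 0]
    best = None
    for k in range(1, len(available) + 1):
        for group in combinations(available, k):
            rest = list(books)
            for i in group:
                rest[i] -= 1
            # Volumes are interchangeable, so re-sorting keeps the cache small.
            cost = PRICE[k] + search_slow(tuple(sorted(rest, reverse=True)))
            if best is None or cost < best:
                best = cost
    return best

# Compare both methods on every order with 0 to 3 copies of each volume.
for counts in product(range(4), repeat=5):
    key = tuple(sorted(counts, reverse=True))
    assert search_fast(counts) == search_slow(key)

print(search_fast([2, 2, 2, 1, 1]))  # two groups of four: 5120 cents
```

The exhaustive search grows quickly with the order size, which is why it is only used here as a cross-check on small inputs.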

 

 

  

 

posted @ 2011-02-05 17:22 Michael Peng Views(1604) Comments(0)

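December 27, 2010

Bug Hunting in the VC2010 std::tr1 bind Library

Summary: 1. An analysis of a bug in the VC2010 std::tr1 bind library. 2. Writing test cases with gtest. 3. The role QA can play in a team.

posted @ 2010-12-27 22:54 Michael Peng Views(1536) Comments(1)

December 25, 2010

Problems Building kigg 3.0 with VS2010

Summary: I ran into some problems while playing with kigg today and finally sorted them out with Google's help. I am recording the steps here for future reference, and for anyone who hits similar problems.

posted @ 2010-12-25 22:56 Michael Peng Views(209) Comments(0)

December 22, 2010

Code to List All Current Keyboard Shortcuts in VS

Summary: Code to list all current keyboard shortcuts in VS.

posted @ 2010-12-22 16:14 Michael Peng Views(2261) Comments(0)

December 16, 2010

Some Recent Reflections on Interviewing

Summary: Some impressions from a year of hiring.

posted @ 2010-12-16 14:57 Michael Peng Views(8365) Comments(159)

December 8, 2010

Frustrated by VS2010's std::tr1::bind These Past Two Days

Summary: Frustrated by VS2010's std::tr1::bind these past two days.

posted @ 2010-12-08 21:36 Michael Peng Views(663) Comments(3)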
