程序控

IPPP (Institute of Penniless Peasant-Programmer) Fellow


Time limit: 3.000 seconds

 

Background

Searching and sorting are part of the theory and practice of computer science. For example, binary search provides a good example of an easy-to-understand algorithm with sub-linear complexity. Quicksort is an efficient O(n log n) [average case] comparison-based sort.

KWIC-indexing is an indexing method that permits efficient "human search" of, for example, a list of titles.

 

The Problem

Given a list of titles and a list of "words to ignore", you are to write a program that generates a KWIC (Key Word In Context) index of the titles. In a KWIC-index, a title is listed once for each keyword that occurs in the title. The KWIC-index is alphabetized by keyword.

Any word that is not one of the "words to ignore" is a potential keyword.

For example, if words to ignore are "the, of, and, as, a" and the list of titles is:

Descent of Man
The Ascent of Man
The Old Man and The Sea
A Portrait of The Artist As a Young Man

A KWIC-index of these titles might be given by (the upper-case word in each line is its keyword):

 

a portrait of the ARTIST as a young man
the ASCENT of man
DESCENT of man
descent of MAN
the ascent of MAN
the old MAN and the sea
a portrait of the artist as a young MAN
the OLD man and the sea
a PORTRAIT of the artist as a young man
the old man and the SEA
a portrait of the artist as a YOUNG man

 

The Input

The input is a sequence of lines. The string :: separates the list of words to ignore from the list of titles. Each of the words to ignore appears in lower-case letters on a line by itself and is no more than 10 characters in length. Each title appears on a line by itself and may consist of mixed-case (upper and lower) letters. Words in a title are separated by whitespace. No title contains more than 15 words.

There will be no more than 50 words to ignore, no more than 200 titles, and no more than 10,000 characters in the titles and words to ignore combined. No characters other than 'a'-'z', 'A'-'Z', and white space will appear in the input.
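The input format above can be read in two phases. The following is a minimal sketch, not the code submitted for this problem; the names KwicInput, readKwicInput and readKwicInputFromText are made up for illustration.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for the two halves of the input.
struct KwicInput {
    std::vector<std::string> ignore;  // words to ignore, as given (lower case)
    std::vector<std::string> titles;  // raw title lines, original case kept
};

// Reads words up to the "::" separator, then treats every later
// non-blank line as one title.
KwicInput readKwicInput(std::istream& in) {
    KwicInput result;
    std::string token;
    while (in >> token && token != "::") {
        result.ignore.push_back(token);
    }
    std::string line;
    while (std::getline(in, line)) {
        // Skip the remainder of the "::" line and any blank lines.
        if (line.find_first_not_of(" \t\r") == std::string::npos) {
            continue;
        }
        result.titles.push_back(line);
    }
    return result;
}

// Convenience wrapper for trying the parser on an in-memory string.
KwicInput readKwicInputFromText(const std::string& text) {
    std::istringstream in(text);
    return readKwicInput(in);
}
```

Reading the ignore words with operator>> and the titles with getline means the first getline call returns the empty tail of the "::" line, which is why blank lines are skipped.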

 

The Output

The output should be a KWIC-index of the titles, with each title appearing once for each keyword in the title, and with the KWIC-index alphabetized by keyword. If a word appears more than once in a title, each instance is a potential keyword.

The keyword should appear in all upper-case letters. All other words in a title should be in lower-case letters. Titles in the KWIC-index with the same keyword should appear in the same order as they appeared in the input file. In the case where multiple instances of a word are keywords in the same title, the keywords should be capitalized in left-to-right order.
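One way to satisfy the ordering rule is to collect one entry per keyword occurrence in input order (titles top to bottom, words left to right) and then sort by keyword with a stable sort, so that ties keep their input order. This is a sketch of an alternative to the solution given later in this post; IndexEntry and sortIndex are hypothetical names.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical record: one line of the final index.
struct IndexEntry {
    std::string keyword;  // lower-case keyword, used only as the sort key
    std::string line;     // formatted title, keyword already in upper case
};

// Sorts by keyword only. std::stable_sort keeps entries with equal
// keywords in the order in which they were appended.
void sortIndex(std::vector<IndexEntry>& entries) {
    std::stable_sort(entries.begin(), entries.end(),
                     [](const IndexEntry& a, const IndexEntry& b) {
                         return a.keyword < b.keyword;
                     });
}
```

A plain std::sort would not be enough here, because it is allowed to reorder entries whose keywords compare equal.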

Case (upper or lower) is irrelevant when determining if a word is to be ignored.
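Since the words to ignore are given in lower case, a case-insensitive check only needs to lower-case the candidate word before the lookup. A small sketch, with toLowerCopy and isIgnored as hypothetical helper names and a std::set chosen here for the lookup:

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>

// Returns a lower-case copy of the word (ASCII letters only, as in the input).
std::string toLowerCopy(std::string word) {
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return word;
}

// True if the word, compared without regard to case, is in the ignore set.
// The set is assumed to hold lower-case words only.
bool isIgnored(const std::set<std::string>& ignore, const std::string& word) {
    return ignore.count(toLowerCopy(word)) > 0;
}
```

With the sample input, "IS" in the last title is therefore ignored just like "is".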

The titles in the KWIC-index need NOT be justified or aligned by keyword; all titles may be listed left-justified.

 

Sample Input

is
the
of
and
as
a
but
::
Descent of Man
The Ascent of Man
The Old Man and The Sea
A Portrait of The Artist As a Young Man
A Man is a Man but Bubblesort IS A DOG

 

Sample Output

a portrait of the ARTIST as a young man
the ASCENT of man
a man is a man but BUBBLESORT is a dog
DESCENT of man
a man is a man but bubblesort is a DOG
descent of MAN
the ascent of MAN
the old MAN and the sea
a portrait of the artist as a young MAN
a MAN is a man but bubblesort is a dog
a man is a MAN but bubblesort is a dog
the OLD man and the sea
a PORTRAIT of the artist as a young man
the old man and the SEA
a portrait of the artist as a YOUNG man

 

Analysis

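A reasonably fast approach is to keep, for each keyword, a list of the titles that contain it. Process the titles in order. Within each title, scan the words (skipping the words to ignore), and add any keyword seen for the first time to the keyword list. Then append the title to the end of that keyword's title list, formatted as required (the keyword in upper case, everything else in lower case). For the output, sort the keywords and print each keyword's title list in order. The algorithm is awkward to describe in words, but the code in the Solution section makes it much easier to follow.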
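The same idea can be sketched compactly with a std::map, which keeps its keys sorted and so removes the separate sorting step. This is an illustrative variant under that assumption, not the submitted solution; buildKwicIndex is a hypothetical name.

```cpp
#include <cctype>
#include <map>
#include <set>
#include <string>
#include <vector>

// Builds the index lines in output order. `ignore` is assumed to hold
// lower-case words; titles may be in mixed case.
std::vector<std::string> buildKwicIndex(const std::set<std::string>& ignore,
                                        const std::vector<std::string>& titles) {
    // keyword -> formatted titles, in the order they were encountered
    std::map<std::string, std::vector<std::string>> index;
    for (std::string line : titles) {
        for (char& c : line) {
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }
        std::size_t end = 0;
        for (;;) {
            // [beg, end) delimits the current word of the lower-cased title.
            std::size_t beg = line.find_first_not_of(" \t\r", end);
            if (beg == std::string::npos) break;
            end = line.find_first_of(" \t\r", beg);
            if (end == std::string::npos) end = line.size();
            std::string word = line.substr(beg, end - beg);
            if (ignore.count(word) > 0) continue;
            // Copy the title and upper-case only this occurrence.
            std::string formatted = line;
            for (std::size_t i = beg; i < end; ++i) {
                formatted[i] = static_cast<char>(
                    std::toupper(static_cast<unsigned char>(formatted[i])));
            }
            index[word].push_back(formatted);
        }
    }
    // std::map iterates in ascending key order, i.e. alphabetically.
    std::vector<std::string> out;
    for (const auto& kv : index) {
        out.insert(out.end(), kv.second.begin(), kv.second.end());
    }
    return out;
}
```

Because each keyword's list is only ever appended to, titles with the same keyword stay in input order, and repeated keywords within one title are capitalized from left to right.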

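This should have been a simple problem, but a language detail tripped me up badly. I was stuck for over two hours and submitted nearly 10 WAs. Eventually I found I had made a very basic mistake: the pointer returned by string::c_str() is only a temporary, read-only view. It is guaranteed to stay valid only until the string is next modified (or destroyed), so it cannot be held on to and reused across changes to the string.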
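A safer pattern is to change the characters through the string's own interface and to store owning copies rather than pointers obtained from c_str(). The sketch below illustrates this under those assumptions; upperCaseRange and highlight are hypothetical helpers, not functions from the solution.

```cpp
#include <cctype>
#include <string>

// Upper-cases s[pos, pos + len) in place via operator[], which yields
// writable references (c_str() only yields a const char*).
void upperCaseRange(std::string& s, std::size_t pos, std::size_t len) {
    // Clamp the range so it never runs past the end of the string.
    std::size_t last = (pos + len < s.size()) ? pos + len : s.size();
    for (std::size_t i = pos; i < last; ++i) {
        s[i] = static_cast<char>(std::toupper(static_cast<unsigned char>(s[i])));
    }
}

// Returns a copy of the title with one word highlighted. The copy owns
// its characters, so later changes to `title` cannot affect it.
std::string highlight(const std::string& title, std::size_t pos, std::size_t len) {
    std::string copy = title;
    upperCaseRange(copy, pos, len);
    return copy;
}
```

For example, highlight("the old man and the sea", 8, 3) gives "the old MAN and the sea" and leaves the original string unchanged.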

 

Solution

#include <algorithm>
#include <cctype>
#include <cstring>
#include <iostream>
#include <functional>
#include <vector>
#include <string>
#include <sstream>
using namespace std;
//Keyword record: the keyword string plus the list of titles that contain it
//The titles are kept in input order and are already formatted for output
struct KEYWORD {string Key; vector<string> Titles;};
//Orders two keywords, used to sort the whole keyword list
bool LessKey(KEYWORD const *k1, KEYWORD const *k2) {
	return k1->Key < k2->Key;
}
//Tests whether a keyword equals the given string, used to look up keywords
bool SameKey(string str, const KEYWORD *p2) {
	return p2->Key == str;
}
//Converts nCnt characters starting at pStr to lower case
void ToLower(char *pStr, size_t nCnt) {
	for (char *pEnd = pStr + nCnt; pStr != pEnd; ++pStr) {
		*pStr = tolower(*pStr);
	}
}
//Converts nCnt characters starting at pStr to upper case
void ToUpper(char *pStr, size_t nCnt) {
	for (char *pEnd = pStr + nCnt; pStr != pEnd; ++pStr) {
		*pStr = toupper(*pStr);
	}
}
//Main function
int main(void) {
	vector<KEYWORD*> Keywords;
	vector<string> Ignore;
	//Characters that separate the words of a title
	const char *pszSpace = " \t\r";
	//Read the words to ignore until "::" is reached
	for (string str; cin >> str && str != "::"; Ignore.push_back(str));
	//Read and process the titles one line at a time
	for (string strLine; getline(cin, strLine); ) {
		//Skip empty lines, including the rest of the "::" line
		if (strLine.empty()) {
			continue;
		}
		//Lower-case the whole title in place. The string's own buffer is
		//modified through &strLine[0], never through the pointer from c_str()
		ToLower(&strLine[0], strLine.size());
		//Walk the title word by word; [nBeg, nEnd) delimits the current word
		size_t nEnd = 0;
		for (size_t nBeg; (nBeg = strLine.find_first_not_of(pszSpace, nEnd))
			!= string::npos; ) {
			nEnd = strLine.find_first_of(pszSpace, nBeg);
			if (nEnd == string::npos) {
				nEnd = strLine.size();
			}
			string strWord = strLine.substr(nBeg, nEnd - nBeg);
			//Only words that are not in the ignore list are keywords
			if (find(Ignore.begin(), Ignore.end(), strWord) != Ignore.end()) {
				continue;
			}
			//Look the word up in the keyword list
			vector<KEYWORD*>::iterator iKey = Keywords.begin();
			while (iKey != Keywords.end() && !SameKey(strWord, *iKey)) {
				++iKey;
			}
			//Add it as a new keyword if it is not in the list yet
			if (iKey == Keywords.end()) {
				KEYWORD *pNewKey = new KEYWORD;
				pNewKey->Key = strWord;
				iKey = Keywords.insert(Keywords.end(), pNewKey);
			}
			//Upper-case this occurrence of the keyword
			ToUpper(&strLine[nBeg], nEnd - nBeg);
			//Append a copy of the formatted title to the keyword's title list
			(*iKey)->Titles.push_back(strLine);
			//Restore the keyword to lower case for the next word
			ToLower(&strLine[nBeg], nEnd - nBeg);
		}
	}
	//Sort the keywords alphabetically
	sort(Keywords.begin(), Keywords.end(), LessKey);
	//Print the title list of every keyword
	for (vector<KEYWORD*>::iterator i = Keywords.begin();
		i != Keywords.end(); ++i) {
		//Print this keyword's titles in input order
		for (vector<string>::iterator j = (*i)->Titles.begin();
			j != (*i)->Titles.end(); cout << *j++ << endl);
		//Free the keyword record to avoid a memory leak
		delete *i;
	}
	return 0;
}
posted on 2010-08-16 14:44  Devymex