TUP Masters Series, Session 4. Raghu Ramakrishnan, Chief Scientist for Search and Cloud Computing: "Cloud Computing in Practice", live Q&A transcript

Q: How does PNUTS perform if you have lots of updates, or very intensive data mining queries that run for a long time?

A: If you look at the Yahoo workloads I mentioned, data-mining-style queries would not run on PNUTS. You would run that analysis on Hadoop. As I showed, data in Hadoop can be moved into Sherpa, and from Sherpa into Hadoop, efficiently, so you move the data across and use Hadoop's computing power to analyze it. PNUTS supports insert, delete and modify operations on single records. Timeline consistency is important. Performance matters too, but in practice the main factor is availability rather than performance.
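
The answer above names two properties of PNUTS: operations on single records, and timeline consistency. The following is a minimal sketch of that idea, not Yahoo's implementation. It assumes that one master orders all writes to a record by version number and that each replica applies a record's updates strictly in that order, so a replica may lag but never shows a state the record did not pass through. The names `Replica`, `RecordMaster`, `write` and `propagate` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of the table; applies each record's updates in version order."""
    store: dict = field(default_factory=dict)  # key -> (version, value)

    def apply(self, key, version, value):
        current_version = self.store.get(key, (0, None))[0]
        if version == current_version + 1:  # next step on this record's timeline
            self.store[key] = (version, value)
            return True
        return False  # replayed or out-of-order update: ignored

    def read(self, key):
        return self.store.get(key)


class RecordMaster:
    """Hypothetical master that assigns a version to every single-record write."""

    def __init__(self):
        self.versions = {}  # key -> latest version handed out
        self.log = []       # ordered list of (key, version, value)

    def write(self, key, value):
        version = self.versions.get(key, 0) + 1
        self.versions[key] = version
        self.log.append((key, version, value))
        return version

    def propagate(self, replica, upto):
        """Ship the first `upto` log entries to one replica, in log order."""
        for key, version, value in self.log[:upto]:
            replica.apply(key, version, value)


master = RecordMaster()
near, far = Replica(), Replica()
master.write("user:42", "status=online")
master.write("user:42", "status=away")
master.propagate(near, 2)  # the nearby replica has seen both updates
master.propagate(far, 1)   # the distant replica lags by one update
```

Here `near.read("user:42")` returns version 2 and `far.read("user:42")` returns version 1: the distant replica is stale but still on the record's timeline, and it stays readable, which is the availability-first trade-off the answer describes.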

Q: I use the Cata database at work, and I have a question: why does Cata use two clusters, one for Hadoop and one for index sharing?
A: I'm sorry, I'm not familiar with Cata, so I can't answer this question. Are there other questions? Hopefully ones I can answer (laughs).

Q: I want to ask a general question. Cloud computing has had a big impact on databases. How do you see that impact?
A: Some of the things we talked about today will hopefully answer your question. Cloud computing essentially means you are building a multitenant system. The tenants may be end users or developers, and they all use your system through the cloud. They can ask for more capacity (storage or performance) at any time, and they should get it instantly.
So this is your goal: you need to be able to add capacity to your system, and the system should automatically distribute that capacity across its tenants.
Doing that requires many of the protocols I talked about. You also need high availability, which means masking the many kinds of failures that occur in a very large distributed system. That requires some of the protocols I talked about as well, and PNUTS provides many such mechanisms.
Think about it this way: a conventional database system that counts as very large has a dozen nodes [the next remark, about failures and replication, is unclear in the transcript]. At this scale, it requires a different way of thinking.
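
The goal described above, adding capacity and having the system spread load over it automatically, can be illustrated with a small rebalancing routine. This is only a sketch under stated assumptions: data is split into fixed partitions, every partition costs the same, and `rebalance` is a hypothetical helper, not how PNUTS or Sherpa actually places data.

```python
from collections import defaultdict


def rebalance(assignment, servers):
    """Return a new partition -> server map spread evenly over `servers`.

    Partitions stay where they are unless their server is over its share
    (or no longer exists), so adding a server moves only enough partitions
    to fill the new server.
    """
    servers = list(servers)
    quota, extra = divmod(len(assignment), len(servers))
    # The first `extra` servers may hold one partition more than the rest.
    limit = {s: quota + (1 if i < extra else 0) for i, s in enumerate(servers)}
    load = defaultdict(int)
    result, homeless = {}, []
    for partition in sorted(assignment):
        server = assignment[partition]
        if server in limit and load[server] < limit[server]:
            result[partition] = server  # keep it in place
            load[server] += 1
        else:
            homeless.append(partition)  # must move somewhere else
    for partition in homeless:
        target = next(s for s in servers if load[s] < limit[s])
        result[partition] = target
        load[target] += 1
    return result


# Eight partitions on two servers, then a third server is added.
before = {f"p{i}": ("a" if i < 4 else "b") for i in range(8)}
after = rebalance(before, ["a", "b", "c"])
moved = [p for p in before if before[p] != after[p]]
```

In this example only two of the eight partitions move, both onto the new server `c`; the other six stay put, which is what keeps adding capacity cheap for the tenants already being served.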


Q: I have a question about database serving in the cloud. As I understand it, Hadoop only stores the data. Can you also store procedures in it, like the stored procedures in a database?
A: The bottom line is that today we don't provide any special support for this. Some developers have tried workarounds: as you know, you could store a block of code and execute it in some way. We may do something to support that in the future.
Another thing people have asked for is the ability to make sure data stays in the same location where it was established [wording unclear in the transcript]. That is a feature we don't support yet.
So there are things like that which we know we need to do but haven't gotten to yet.

Q: You mentioned Hadoop and PNUTS. Do you think traditional database management systems have any chance in a cloud computing environment?
A: There are many kinds of application scenarios. The scenario I talked about today is the Yahoo scenario. Take the Yahoo login system as an example. It is a login database supporting 640 million users. Supporting a large distributed system like this requires the kinds of things I talked about today.
Let me give you a very different example of a cloud. There is a company called ADP (Automatic Data Processing). They provide database services for other companies. What they are actually doing is managing and running hundreds of thousands of small databases. Any one of these databases may run on a single box. Their customers want transactions, ACID, full SQL, but they don't need anything more than simple MySQL-style asynchronous replication. The design of a cloud system for that would be completely different.
It would still have some of the features I talked about: availability is important, multitenancy is important, elasticity is important. But it needs a very different design. They will try to run lots of copies of small traditional relational databases on a farm of a thousand servers and be able to move them from one farm to another, but the unit of movement is one entire database.
Many commercial vendors are also thinking about how to "cloudify" their products. Take Microsoft Azure: it is a way of adapting SQL Server to support simple database deployments in the cloud.
So the short answer is yes, but the full answer is a long story.

Q: I have two questions.
First, does PNUTS support multi-key queries?
Second, have you heard of Greenplum, and how would you describe the difference between Hadoop and Greenplum?
A: When you say multi-key query, you mean I give you several keys and ask you to return all the matching records?
Q: Yes.
A: Then that answers the first question (laughs).
On the second question: Greenplum is fundamentally an OLAP (Online Analytical Processing) system, a traditional relational OLAP system, but lately it has also started supporting an implementation of MapReduce, because MapReduce is popular and customers are asking for it. So it is an implementation of MapReduce with some OLAP capabilities.
The easy way to summarize: Hadoop is one particular implementation of MapReduce. Greenplum is another implementation of MapReduce, and it also has traditional OLAP capabilities.
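
Since the answer describes both Hadoop and Greenplum as implementations of MapReduce, a tiny in-memory version of the programming model may help readers who have not seen it. This is a sketch of the model only, run in a single process; it uses neither Hadoop's nor Greenplum's API, and `map_reduce`, `count_queries`, `total` and the log format are hypothetical.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):  # map phase
            groups[key].append(value)      # shuffle: group values by key
    return {key: reducer(key, values) for key, values in groups.items()}  # reduce


def count_queries(log_line):
    """Mapper: emit (query, 1) for one tab-separated 'user<TAB>query' line."""
    user, query = log_line.split("\t")
    yield query.lower(), 1


def total(key, values):
    """Reducer: add up the counts emitted for one query."""
    return sum(values)


log = ["u1\tMadonna", "u2\tmadonna", "u1\tHadoop"]
counts = map_reduce(log, count_queries, total)
```

With this input `counts` is `{"madonna": 2, "hadoop": 1}`. A real MapReduce system runs the same two user-supplied functions, but spreads the map, shuffle and reduce phases across many machines.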

Q: You mentioned next-generation search in your talk. Imagine a user wants information about Madonna, types "Madonna" in the search box and clicks search, and Yahoo returns results that may well differ from user to user. What requests are sent, and which components are involved, before the final results are presented to the user?
A: As I understand the question, someone says "tell me all about Madonna". What are the steps the request flows through before the user sees the results?
Q: Yes.
A: OK, the story has to begin before the user issues the query. If I really want to give you everything there is to know about Madonna, I have to anticipate that you are interested in celebrities. Therefore I have to get all the relevant information about celebrities from different sources: from people who maintain feeds on videos or movies, and from crawling the web and doing information extraction. I have to get the data, integrate it, and create the relevant tables in the web of concepts. All of this happens before you type the query.
Once you type the query, it goes through these steps: analyzing your query and comparing it with similar queries from other users, to make sure we understand what you are really looking for. Are you looking for Madonna the actress, Madonna the musician, or Madonna the mother of Christ? Once your main intent is established, the system invokes a call to the results of the earlier aggregation. This may be a system like [name missing in the transcript], which has the necessary data.
By the way, in order to interpret your query "Madonna", I probably also have profile data about you, and every time you do something, I update your profile. That profile data is very likely stored in PNUTS or another system, and we analyze it in Hadoop to interpret what you really want.
The first pass of gathering your profile data and creating the semantic aggregation for Madonna all uses Hadoop.

Q: OK, I want to ask the last question on behalf of today's audience. You have been very successful in research and academia, and you have now moved into industry. There are a lot of students from IT majors at the conference today. What are the most important abilities they should develop to prepare for their future careers and to be successful in research or in industry?
A: I think the most important ability is finding good people to work with and learn from.
When I was a student, I was fortunate to find a good teacher. As a teacher, I was fortunate to find good students. And in industry, I have been fortunate to find good colleagues.
I should mention that Yahoo recently opened a lab in Beijing. They are doing work on things like the web of concepts I mentioned. If that interests you, you could join us and work with us.
The short answer is: you learn from good people, you work with good people, and you will be successful.







posted @ 2011-04-06 21:50  _Luc_