[Repost] What is Google Binary Search and Should We Fear It?

 


Published 14 September 06 02:46 PM | msutton

Background

The so-called Google Binary Search (GBS) gained a fair bit of press attention in July 2006, when PC World published an article entitled 'Google's Binary Search Helps Identify Malware'. In the article, Websense revealed that it had used an undocumented Google search feature to identify malicious code. At the time, Websense's Senior Director of Security, Dan Hubbard, indicated that he planned to share the code they were using privately with fellow security researchers, but would not be making it public.

GBS was back in the news a couple of weeks later when PC World published a follow-up article. This time around, the article discussed HD Moore's Malware Search project, which had recently been made public. Moore downplayed the threat of GBS being used to obtain malcode and argued that it was more useful for identifying sites that distribute malware.

To the best of my knowledge, GBS is not publicly documented or mentioned by Google. This in and of itself is interesting for a company that typically provides early insight into research projects via Google Labs. After a fair bit of searching, I was unable to find much beyond Moore's Ruby source code to provide insight into how GBS works. If you don't speak Ruby, this blog post is for you.

How it works

Google's search engine appears to be downloading a sampling of executable files in addition to the web pages, documents, etc. that it is typically used to search for. With executable files, rather than indexing human-readable strings within the file, Google instead parses the header of the executable and indexes information from it. All executable files begin with a header, which contains basic information such as the type of executable in question, where various segments exist, etc. This information is needed by the operating system that is launching the file. Windows executables, object code and DLLs adhere to a file format known as the Portable Executable (PE) format. Each such file therefore carries a PE header (in executables and DLLs it follows a short MS-DOS stub at the very start of the file), and this is the information indexed by Google for Windows executables.

To see an example of this, do the following:

  1. Conduct a Google search for: "Signature: 00004550"+"Machine: Intel 386"
  2. Click on any of the "View as HTML" links in the search results

The page that you are viewing is simply a neatly formatted report containing PE header information for one of the executables that Google has indexed. Google treats this page, which it has created, just like any other web page. You can search for any of the unique phrases within the page and obtain results for executable files.
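To make the two search phrases above less mysterious, here is a minimal Python sketch of where they come from in the file itself. It is an illustration only, not Google's code: the function name `header_phrases` and the `MACHINE_LABELS` table are hypothetical, and the sketch assumes a well-formed PE file whose MS-DOS stub stores the offset of the PE signature at 0x3C.

```python
import struct

# Machine code -> label. 0x014C is the PE machine value for Intel 386;
# the label text mirrors the phrase used in the search above.
MACHINE_LABELS = {0x014C: "Intel 386"}


def header_phrases(data):
    """Return the 'Signature' and 'Machine' phrases for the bytes of a PE file."""
    # The 4-byte little-endian value at offset 0x3C of the MS-DOS stub
    # holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    # The signature is the four bytes "PE\0\0"; read as a little-endian
    # integer that is 0x00004550. The 2-byte Machine field follows it.
    signature, machine = struct.unpack_from("<IH", data, pe_offset)
    # Unknown machine codes fall back to their hexadecimal value.
    label = MACHINE_LABELS.get(machine, f"{machine:04X}")
    return [f"Signature: {signature:08X}", f"Machine: {label}"]
```

In other words, "Signature: 00004550" is simply the ASCII string "PE" followed by two zero bytes, which is why that phrase matches any indexed Windows executable rather than a specific one.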

What it can be used for

Now that we know what GBS is and how it works, the next logical question is "how can I use it for something useful?". Good question. As mentioned, GBS can be used to search for any executable file, but you first need a means of identifying unique information within the PE header. To explain how this is done, we'll download a copy of the popular telnet/SSH client Putty. Putty was selected because we need an executable file that is likely to be hosted on websites that Google would index. A popular self-extracting installation file is a solid choice, as a *.zip file would not be indexed by GBS.

Next, we need a means of viewing the PE header information. There are various freeware tools for doing this such as LordPE. Once your PE editor is installed, open the target executable. In LordPE, click on the 'PE Editor' button, browse to the Putty executable and open it. Once this is done, you will see the basic PE header as shown in the image below. Most of the search data that we require will come from this screen.

[Figure: LordPE - basic PE header information]

Next, click on the 'Sections' button to view the section table. The section table provides details on the location of various components of the executable file, such as the code and data segments. Below is a screenshot of the section table.

[Figure: LordPE - section table]
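For readers who would rather script this than click through a PE editor, the same header and section table values can be read directly from the file. The following is a minimal Python sketch, not the method used by LordPE or Google: `read_pe_fields` and its dictionary keys are hypothetical names, and the code assumes a well-formed PE file with no attempt at handling truncated or malformed input.

```python
import struct


def read_pe_fields(data):
    """Read a few PE header values and the raw size of the .text section."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # Offset of the PE signature, stored at 0x3C in the MS-DOS stub.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")

    # 20-byte file header: Machine, NumberOfSections, TimeDateStamp,
    # PointerToSymbolTable, NumberOfSymbols, SizeOfOptionalHeader,
    # Characteristics.
    file_header = pe_offset + 4
    (_machine, n_sections, time_date_stamp,
     _symtab, _nsyms, opt_size, _chars) = struct.unpack_from(
        "<HHIIIHH", data, file_header)

    # Optional header: AddressOfEntryPoint sits at offset 16 and
    # SizeOfImage at offset 56 (the same for PE32 and PE32+).
    optional = file_header + 20
    (entry_point,) = struct.unpack_from("<I", data, optional + 16)
    (size_of_image,) = struct.unpack_from("<I", data, optional + 56)

    # Section table: 40-byte entries starting right after the optional
    # header. Each begins with Name, VirtualSize, VirtualAddress and
    # SizeOfRawData (shown as RSize in LordPE).
    text_rsize = None
    sections = optional + opt_size
    for i in range(n_sections):
        name, _vsize, _vaddr, rsize = struct.unpack_from(
            "<8sIII", data, sections + 40 * i)
        if name.rstrip(b"\x00") == b".text":
            text_rsize = rsize
            break

    return {
        "TimeDateStamp": time_date_stamp,
        "SizeOfImage": size_of_image,
        "EntryPoint": entry_point,
        ".text RSize": text_rsize,
    }
```

Passing in the bytes of a downloaded executable, for example `read_pe_fields(open("putty.exe", "rb").read())`, returns the values as integers; LordPE shows the same numbers in hexadecimal.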

Now, we need to select unique values to search for. Following HD's lead, we'll use the following fields:

Google Field       LordPE Field     LordPE Screen
Time Date Stamp    TimeDateStamp    Basic PE Header Information
Size of Image      SizeOfImage      Basic PE Header Information
Entry Point        EntryPoint       Basic PE Header Information
Size of Code       .text-->RSize    Section Table

The table above lists the field names used by Google, the corresponding field names in LordPE and, finally, the LordPE screen where you will find the data. Now that we have our search data, we can construct our query.

"Time Date Stamp: 4252EA65"
"Size of Image: 0006D000"
"Entry Point: 0004265F"
"Size of Code: 0004A000"

Concatenating the individual search phrases into a single query leaves us with the following:

"Time Date Stamp: 4252EA65"+"Size of Image: 0006D000"+"Entry Point: 0004265F"+"Size of Code: 0004A000"
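The concatenation above is easy to automate. Below is a small Python sketch that builds the query string from the four values; the function name `gbs_query` is hypothetical, and the eight-digit uppercase hexadecimal formatting is an assumption based on the phrases shown above.

```python
def gbs_query(time_date_stamp, size_of_image, entry_point, size_of_code):
    """Build a GBS query string from four PE header values (integers)."""
    fields = [
        ("Time Date Stamp", time_date_stamp),
        ("Size of Image", size_of_image),
        ("Entry Point", entry_point),
        ("Size of Code", size_of_code),
    ]
    # Each value becomes a quoted phrase with an 8-digit uppercase hex
    # number; the phrases are joined with '+' as in the query above.
    return "+".join(f'"{label}: {value:08X}"' for label, value in fields)


# The Putty values from this post:
print(gbs_query(0x4252EA65, 0x0006D000, 0x0004265F, 0x0004A000))
```

Quoting each phrase matters: without the quotes the search would match pages that merely contain the individual words and numbers somewhere, rather than the exact field and value pair.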

Assuming that the Google index hasn't changed and that the version of Putty you downloaded for analysis is the same one, you should receive a single search result: a link to an identical copy of putty.exe at a location other than our original download site. What does this tell us? It tells us that Google is only indexing a small fraction of the executables that it locates. We know this because Putty is a popular program available for download at many sites, and we only found one. In fact, we didn't even find the initial download site, which we know exists.

What about Malcode?

The initial eWeek story expressed concern that GBS could be used to identify malcode samples. While it's true that it could be used for such a purpose, it's questionable just how useful that approach would be. For those who are curious, HD has saved you the trouble of generating your own search terms by publishing a signature database for common malcode samples. The format of the database is 'Descriptive Name:Time Date Stamp:Size of Image:Entry Point:Size of Code'. If you test the signatures in this database, you will find that for the most part you receive surprisingly few results given the prevalence of malcode. Once again, this can be attributed to the fact that the Google index contains only a sample of executable files. Beyond that, if you run the executables you obtain through an AV scanner, you'll see that many are false positives. Why? The signatures are far from perfect. Given the fields that we've chosen to search for, two completely different executables can easily share the same signature. Naturally, the false positives could be reduced by creating more precise signatures using additional fields. If you're really ambitious, you could create your own signatures by obtaining samples of malcode from one of the many public repositories, but if you've done that, you don't really need GBS to get malcode, now do you?
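To illustrate the signature format and why it produces false positives, here is a short Python sketch. It is not HD Moore's code: the function names are hypothetical, the sample line is made up using the Putty values from earlier in this post, and the sketch assumes the four values are stored as hexadecimal strings.

```python
FIELD_LABELS = ("Time Date Stamp", "Size of Image", "Entry Point", "Size of Code")


def parse_signature(line):
    """Split 'Name:TimeDateStamp:SizeOfImage:EntryPoint:SizeOfCode' into parts."""
    # Split from the right so a colon inside the descriptive name survives.
    name, *values = line.strip().rsplit(":", 4)
    if len(values) != 4:
        raise ValueError("expected a name followed by four fields")
    return name, dict(zip(FIELD_LABELS, (v.upper() for v in values)))


def same_signature(signature_fields, file_fields):
    """True when a file's four header values equal the signature's values."""
    return all(signature_fields[k] == file_fields[k] for k in FIELD_LABELS)


# Hypothetical database line, not taken from the real database.
name, sig = parse_signature("Example.Sample:4252EA65:0006D000:0004265F:0004A000")

# Any file with the same four header values matches, whatever its actual
# code and data contain. That is the source of the false positives.
unrelated_file = {
    "Time Date Stamp": "4252EA65",
    "Size of Image": "0006D000",
    "Entry Point": "0004265F",
    "Size of Code": "0004A000",
}
print(name, same_signature(sig, unrelated_file))
```

Nothing in the comparison looks at the body of the executable, so adding further header fields to the signature narrows the match but can never prove that two files are identical.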

Conclusion

I agree with HD Moore. Given the number of binary files being indexed at this point, GBS is not particularly useful for obtaining malcode samples. It is somewhat useful for identifying sites that may be hosting malcode, but even then, the results tend to reveal binary attachments in email messages sent to mailing lists. It's hardly surprising that malcode would be found in such a location. Moreover, if you're looking for malcode, there is no shortage of places to find it, with or without GBS. That's not to say, however, that GBS is not a useful tool. I have no doubt that over time, as the index grows, the results will become more useful to whitehats and blackhats alike.

- michael

posted @ 2006-10-10 18:34  Dream world 梦想天空