AOL-user-ct-collection
500k User Session Collection
----------------------------------------------
This collection is distributed for NON-COMMERCIAL RESEARCH USE ONLY.
Any application of this collection for commercial purposes is STRICTLY PROHIBITED.
Brief description:
This collection consists of ~20M web queries collected from ~650k users over three months.
The data is sorted by anonymous user ID and sequentially arranged.
The goal of this collection is to provide real query log data that is based on real users. It could be used for personalization, query reformulation or other types of search research.
The data set includes {AnonID, Query, QueryTime, ItemRank, ClickURL}.
AnonID - an anonymous user ID number.
Query - the query issued by the user, case shifted with
most punctuation removed.
QueryTime - the time at which the query was submitted for search.
ItemRank - if the user clicked on a search result, the rank of the
item on which they clicked is listed.
ClickURL - if the user clicked on a search result, the domain portion of
the URL in the clicked result is listed.
Each line in the data represents one of two types of events:
1. A query that was NOT followed by the user clicking on a result item.
2. A click through on an item in the result list returned from a query.
In the first case (query only) there is data in only the first three columns/fields -- namely AnonID, Query, and QueryTime (see above).
In the second case (click through), there is data in all five columns. For click through events, the query that preceded the click through is included. Note that if a user clicked on more than one result in the list returned from a single query, there will be TWO lines in the data to represent the two events. Also note that if the user requested the next "page" of results for some query, this appears as a subsequent identical query with a later time stamp.
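The two event types can be told apart mechanically. Below is a minimal Python sketch, assuming the files are tab-delimited with the five fields in the order listed above. The names `LogEvent` and `parse_line` are made up for this illustration, and QueryTime is kept as a plain string because the README does not spell out the timestamp format.

```python
from typing import NamedTuple, Optional


class LogEvent(NamedTuple):
    """One line of the log (hypothetical helper type, not part of the data set)."""
    anon_id: int
    query: str
    query_time: str            # kept as text; the exact format is not assumed here
    item_rank: Optional[int]   # None for query-only events
    click_url: Optional[str]   # None for query-only events

    @property
    def is_click_through(self) -> bool:
        # Click-through lines are the ones that carry an ItemRank.
        return self.item_rank is not None


def parse_line(line: str) -> LogEvent:
    """Parse one tab-delimited line into a LogEvent."""
    fields = line.rstrip("\r\n").split("\t")
    # Pad to five fields so query-only lines (three populated columns) parse too.
    fields += [""] * (5 - len(fields))
    anon_id, query, query_time, item_rank, click_url = fields[:5]
    return LogEvent(
        anon_id=int(anon_id),
        query=query,
        query_time=query_time,
        item_rank=int(item_rank) if item_rank else None,
        click_url=click_url or None,
    )
```

A line with only three populated fields comes back with `is_click_through` set to `False`. A line with all five comes back with the rank as an integer and the clicked domain as a string.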
CAVEAT EMPTOR -- SEXUALLY EXPLICIT DATA! Please be aware that these queries are not filtered to remove any content. Pornography is prevalent on the Web and unfiltered search engine logs contain queries by users who are looking for pornographic material. There are queries in this collection that use SEXUALLY EXPLICIT LANGUAGE. This collection of data is intended for use by mature adults who are not easily offended by the use of pornographic search terms. If you are offended by sexually explicit language you should not read through this data. Also be aware that in some states it may be illegal to expose a minor to this data. Please understand that the data represents REAL WORLD USERS, un-edited and randomly sampled, and that AOL is not the author of this data.
Basic Collection Statistics
Dates:
01 March, 2006 - 31 May, 2006
Normalized queries:
36,389,567 lines of data
21,011,340 instances of new queries (w/ or w/o click-through)
7,887,022 requests for "next page" of results
19,442,629 user click-through events
16,946,938 queries w/o user click-through
10,154,742 unique (normalized) queries
657,426 unique user ID's
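As a quick sanity check on these figures, the click-through and no-click counts should add up to the total line count, since every line is one of the two event types. A small sketch using only the numbers quoted above (the variable names are mine):

```python
# Figures copied from the statistics above.
total_lines = 36_389_567
click_through_events = 19_442_629
queries_without_click = 16_946_938
unique_users = 657_426

# Every line is either a click-through or a query without one,
# so the two counts should partition the file.
assert click_through_events + queries_without_click == total_lines

# Rough average number of log lines per anonymous user ID.
lines_per_user = total_lines / unique_users
print(f"{lines_per_user:.1f} lines per user on average")
```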
Please reference the following publication when using this collection:
G. Pass, A. Chowdhury, C. Torgeson, "A Picture of Search" The First
International Conference on Scalable Information Systems, Hong Kong, June,
2006.
You can download it from here: AOL-data.tgz.
According to Adam D’Angelo, the reason AOL published the data was for recognition in the search-engine research arena:
This was not a leak - it was intentional. In their desperation to gain recognition from the research community, AOL decided they would compromise their integrity to provide a data set that might become often-cited in research papers: “Please reference the following publication when using this collection: G. Pass, A. Chowdhury, C. Torgeson, ‘A Picture of Search’ The First International Conference on Scalable Information Systems, Hong Kong, June, 2006.” is the message before the download.
Here’s a breakdown of the core facts:
- 20,000,000 queries from 650,000 users in 2GB uncompressed tab-delimited files
- Uncensored queries for three months of AOL search service, spring 2006
- Essentially public domain
- Contains dangerous private information
Update
The data is rife with all kinds of personally identifiable data. For example, a quick grep for credit-card patterns produces the following:
grep -i -e "[0-9]\{4\}-[0-9]\{4\}-[0-9]\{4\}-[0-9]\{4\}" *.txt
- 9006-0512-xxxx-xxx
- 1550-0905-xxxx-xxxx
Looking for Social Security Numbers (SSN) turns up this HUGE amount of data:
grep -i -e "\b[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}\b" *.txt
- kristy nicole vega hammond la. social secruity number 437-67-xxxx birth date 03 08 xx drivers license number la. 00765xxxx address 41178 rene dr. hammond la.
- pamela button 079-60-xxxx
- thomas j finney socsec 370-40-xxxx
- 419-94-xxxx thomas black
- 458-87-xxxx seguro social
- social security number 545-29-xxxx
- ssn 436-47-xxxx
I’ve censored the personal information, but there are about 200 entries of social security numbers in the test data. Searching for things that look like email addresses ([a-zA-Z0-9_\-]*@[a-zA-Z0-9_\-]*\.) turns up another 60 or so.
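If you would rather do this scan from a script, here is a rough Python sketch of the same three patterns, together with a masking step like the one I applied by hand above. The function names `find_pii` and `mask` are made up for this example, and the sample strings in the usage note are invented, not taken from the data. As with the grep versions, these patterns only match things *shaped* like card numbers, SSNs or email addresses, so expect false positives.

```python
import re

# Python translations of the grep patterns used above.
CARD_RE = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}")
SSN_RE = re.compile(r"\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b")
# Same loose email pattern as in the text: it stops at the first dot after the "@".
EMAIL_RE = re.compile(r"[a-zA-Z0-9_\-]*@[a-zA-Z0-9_\-]*\.")


def find_pii(text):
    """Return (label, matched_text) pairs for every pattern hit in text."""
    hits = []
    for label, pattern in (("card", CARD_RE), ("ssn", SSN_RE), ("email", EMAIL_RE)):
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits


def mask(text):
    """Replace the trailing digit groups of card- and SSN-shaped strings with x's."""
    # Keep the first two groups of a card-shaped number ("dddd-dddd-").
    text = CARD_RE.sub(lambda m: m.group()[:10] + "xxxx-xxxx", text)
    # Keep the first two groups of an SSN-shaped number ("ddd-dd-").
    text = SSN_RE.sub(lambda m: m.group()[:7] + "xxxx", text)
    return text
```

For instance, `mask("ssn 123-45-6789")` gives back `"ssn 123-45-xxxx"`, while `find_pii` on the same string reports a single `("ssn", "123-45-6789")` hit.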
Update 2:
If you want to get this data into a more usable form, say MySQL, try this (note that we’re not going to bother storing duplicate queries, but you might want to):
mysql> CREATE TABLE aoldata (anonid int unsigned not null, query varchar(255), querytime datetime, itemrank int unsigned, clickurl varchar(255), PRIMARY KEY(anonid, query));
Then you just need to import it, as appropriate:
LOAD DATA LOCAL INFILE 'user-ct-test-collection-01.txt'
INTO TABLE aoldata
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(anonid, query, querytime, itemrank, clickurl);
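If you don't have a MySQL server handy, roughly the same thing can be done with SQLite from Python's standard library. This is a sketch that swaps in SQLite for MySQL, not a translation of the statements above: the column types are SQLite's, `load_log` is a name I made up, and skipping lines whose first field is not numeric is a guess in case the file starts with a header row. It keeps the same (anonid, query) primary key, so duplicate queries from the same user are dropped here too.

```python
import csv
import sqlite3


def _rows(path):
    """Yield (anonid, query, querytime, itemrank, clickurl) tuples from a log file."""
    with open(path, newline="", encoding="utf-8", errors="replace") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for fields in reader:
            # Skip blank lines and anything without a numeric ID (e.g. a header row).
            if not fields or not fields[0].isdigit():
                continue
            # Query-only lines may have fewer than five fields; pad them out.
            fields += [""] * (5 - len(fields))
            anonid, query, querytime, itemrank, clickurl = fields[:5]
            yield (
                int(anonid),
                query,
                querytime,
                int(itemrank) if itemrank.isdigit() else None,
                clickurl or None,
            )


def load_log(conn, path):
    """Load one log file into an aoldata table and return the table's row count."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS aoldata (
               anonid    INTEGER NOT NULL,
               query     TEXT NOT NULL,
               querytime TEXT,
               itemrank  INTEGER,
               clickurl  TEXT,
               PRIMARY KEY (anonid, query))"""
    )
    # INSERT OR IGNORE keeps the first row seen for each (anonid, query) pair.
    conn.executemany("INSERT OR IGNORE INTO aoldata VALUES (?, ?, ?, ?, ?)", _rows(path))
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM aoldata").fetchone()[0]
```

Something like `load_log(sqlite3.connect("aol.db"), "user-ct-test-collection-01.txt")` would then build the table in a local file called `aol.db` (the database file name is arbitrary).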