Becoming a data scientist

Data Week: Becoming a data scientist

How do I become a data scientist?

Background: I recently finished my bachelor's degree in computer science at Berkeley. Although it may be a bit late, I am just now getting interested in learning more about statistics and "data science." Unfortunately, I don't have much of a math background (I only took up to Linear Algebra and the required probability/discrete math course for CS). Although I have started working, I have the option of enrolling in an MS CS program in January. What courses should I be looking at, and would an MS in Statistics be more useful? If so, is it possible to get into an MS in Statistics program without a strong math background? I will probably be looking into taking machine learning and data visualization.

9 Answers

Alex Kamil

82 votes by Edwin Khoo, Anon User, Neil Kodner, (more)

Strictly speaking, there is no such thing as "data science" (see What is data science?). See also: Vardi, Science has only two legs: http://portal.acm.org/ft_gateway...

Here are some resources I've collected about working with data; I hope you find them useful (note: I'm an undergrad student, so this is not an expert opinion in any way).

1) Learn about matrix factorizations:

Take the Computational Linear Algebra course (it is sometimes called Applied Linear Algebra, Matrix Computations, Numerical Analysis or Matrix Analysis, and it can be either a CS or an Applied Math course). Matrix decomposition algorithms are fundamental to many data mining applications and are usually underrepresented in a standard "machine learning" curriculum. With terabytes of data, traditional tools such as Matlab are no longer suitable for the job; you cannot just run eig() on Big Data. Distributed matrix computation packages such as those included in Apache Mahout [1] are trying to fill this void, but you need to understand how the numeric algorithms and LAPACK/BLAS routines [2][3][4][5] work in order to use them properly, adjust for special cases, build your own and scale them up to terabytes of data on a cluster of commodity machines.[6] Numerics courses are usually built upon undergraduate algebra and calculus, so you should be fine with the prerequisites. I'd recommend these resources for self-study and reference:

BellKor, Matrix factorization for recommender systems: www2.research.att.com/~volinsky/...
BellKor, Scalable Collaborative Filtering..: public.research.att.com/~volinsk...
Press et al., Numerical Recipes in C++: http://www.amazon.com/Numerical-...
Golub & Van Loan: Matrix Computations: http://www.amazon.com/Computatio...
Watkins, Fundamentals of Matrix Computations (this is a very gentle intro to the field): http://www.amazon.com/Fundamenta...
Demmel, Applied Numerical Linear Algebra: http://www.amazon.com/Applied-Nu...
Trefethen & Bau, Numerical linear algebra: http://www.amazon.com/Numerical-...
Watkins: The Matrix Eigenvalue Problem: GR and Krylov Subspace Methods: http://www.amazon.com/Matrix-Eig...
Parlett, The Symmetric Eigenvalue Problem: http://www.amazon.com/Symmetric-...
Iverson, Algebra as a language: http://www.jsoftware.com/papers/...
Iverson, Algebra: an algorithmic treatment: http://www.amazon.com/Algebra-al...
Bertsekas, Parallel and Distributed Computation: Numerical Methods: http://www.amazon.com/Parallel-D...
Hamming, Numerical Methods for Scientists and Engineers: http://www.amazon.com/Numerical-...
Bierman, Factorization Methods for Discrete Sequential Estimation: http://www.amazon.com/Factorizat...
Wilkinson, The Algebraic Eigenvalue Problem: http://www.amazon.com/Algebraic-...
Horn, Matrix Analysis: http://www.amazon.com/Matrix-Ana...
Harville, Matrix Algebra From a Statistician's Perspective: http://www.amazon.com/gp/product...
Fiedler, Special Matrices: http://www.amazon.com/Special-Ma...
Higham, Accuracy and stability of numerical algorithms: http://www.amazon.com/gp/product...
Langville & Meyer, Google's PageRank and Beyond: http://www.amazon.com/Googles-Pa...
Nielsen, PageRank tutorial: http://michaelnielsen.org/blog/u...
Mannix, Numerical recipes in Hadoop: http://www.slideshare.net/jakema...
Godsil, Algebraic Graph Theory: http://www.amazon.com/Algebraic-...
Wheeler: On building a stupidly fast graph database: http://blog.directededge.com/200...
http://numpy.scipy.org/
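
To make the "you cannot just run eig()" point in item 1 a bit more concrete, here is a minimal sketch of power iteration, one of the simplest iterative eigenvalue methods and the basic idea behind PageRank-style computations. It only needs repeated matrix-vector products, which is the kind of operation that can be split across machines. The function names (mat_vec, power_iteration) and the 2x2 example matrix are my own illustrative choices, not taken from any of the references above, and the sketch assumes the matrix has a single eigenvalue of largest magnitude.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a dense matrix (a list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def power_iteration(matrix, num_iters=200, tol=1e-12):
    """Estimate the dominant eigenvalue and eigenvector of a square matrix.

    Assumption: there is one eigenvalue of strictly largest magnitude and the
    starting vector is not orthogonal to its eigenvector.
    """
    n = len(matrix)
    vector = [1.0 / math.sqrt(n)] * n  # arbitrary unit-length starting guess
    eigenvalue = 0.0
    for _ in range(num_iters):
        product = mat_vec(matrix, vector)
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            raise ValueError("matrix maps the current vector to zero")
        vector = [x / norm for x in product]
        # Rayleigh quotient v^T A v for the normalized vector v
        estimate = sum(v * w for v, w in zip(vector, mat_vec(matrix, vector)))
        converged = abs(estimate - eigenvalue) < tol
        eigenvalue = estimate
        if converged:
            break
    return eigenvalue, vector


# Small symmetric example; a real workload would use a large sparse matrix.
example_matrix = [[2.0, 1.0], [1.0, 3.0]]
dominant_value, dominant_vector = power_iteration(example_matrix)
print(dominant_value, dominant_vector)
```

On a cluster, the only part that has to change is mat_vec: each node multiplies its own block of rows by the current vector, and the partial results are gathered before normalizing.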

2) Start learning statistics by coding with R:

Pick up some R manuals (see What are essential references for R?) and experiment with some of these data sets: http://www.datawrangling.com/som... and the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/

Here are some good references to get started with regression analysis and R:

Gelman, Data Analysis Using Regression and Multilevel/Hierarchical Models: http://www.amazon.com/Analysis-R...
Albert, Bayesian computation with R: http://www.amazon.com/Bayesian-C...
Spector, Data Manipulation with R: http://www.amazon.com/Bayesian-C...

Gries, Quantitative corpus linguistics with R: http://www.amazon.com/Quantitati...
Duda & Hart, Pattern Classification: http://www.amazon.com/Pattern-Cl... (a classic book on statistical inference and a very readable intro to the field)
Go through Exploratory Data Analysis by Tukey: http://www.amazon.com/Explorator.... Read Hamming for inspiration: http://www.cs.virginia.edu/~robi...
If you want to get a job, look up "statistician" or "data scientist" job specs on Twitter and see what the market wants: http://twitter.com/#search?q=sta..., http://twitter.com/#search?q=%22...
E.g. here is Netflix's definition of the "data scientist" body of knowledge (http://jobs.netflix.com/DetailFl...): Multivariate Regression, Logistic Regression, Support Vector Machines, Bagging, Boosting, Decision Trees, Time Series Analysis, Optimization, Stochastic Processes, Experiment Analysis, Bootstrapping, R, SAS, Python, Weka, SQL and Excel. This looks like a standard Statistics curriculum.
According to a LinkedIn job posting (http://www.sanfranrecruiter.com/...), you need to know some of the following: algorithm design, information retrieval, relational databases (SQL) and non-relational databases (Hadoop/Pig), big data analytics, data classification, text mining, search algorithms. This seems to be more of a CS/IR oriented role.
Learn about Palantir (http://www.palantirtech.com/), Recorded Future (https://www.recordedfuture.com/) and Lyric Semiconductor (http://www.lyricsemiconductor.com/); they make interesting products.
Subscribe to DBWorld (it's a bit noisy but worth following): http://www.cs.wisc.edu/dbworld/; consider joining at least one of these interest groups: http://www.sigkdd.org/, http://www.sigir.org/, http://www.sigmod.org/, http://www.sigsam.org, http://www.amstat.org/, http://www.siam.org/
Choose an interesting problem to tackle, say temporal search: http://www.google.com/search?q=t...
See what interests you more and do your market research. Would you prefer working with vendor tools and doing mostly modeling and reporting, or building data mining systems yourself and writing a lot of code? Do you see yourself as a corporate employee, a researcher in academia or a startup founder in the future? What data interests you? Structure your curriculum based on that.

3) Learn about distributed systems and databases:

Note: this topic is not part of a standard Machine Learning track, but you can probably find courses such as Distributed Systems or Parallel Programming in your CS/EE catalog. I believe it is important to learn how to work with a Linux cluster and how to design scalable distributed algorithms if you want to work with big data. It is also becoming increasingly important to be able to utilize the full power of multicore (see http://en.wikipedia.org/wiki/Moo..., http://techresearch.intel.com/ar...).
Download Hadoop [8] and run some MapReduce jobs on your laptop in pseudo-distributed mode (see What's the best way to come up to speed on MapReduce, Hadoop, and Hive?)
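
Before setting up Hadoop, it can help to see the map, shuffle and reduce phases in a few lines of code. This is a single-process toy sketch in Python, not Hadoop's API: all the function names (map_phase, shuffle_phase, reduce_phase, run_job) and the sample input are hypothetical, and it ignores everything that makes the real thing hard (partitioning across machines, fault tolerance, disk I/O).

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record, collecting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_phase(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    """Run the three phases in sequence on one machine."""
    return reduce_phase(shuffle_phase(map_phase(records, mapper)), reducer)


def word_count_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def word_count_reducer(word, counts):
    """Sum the counts emitted for one word."""
    return sum(counts)


lines = ["big data is big", "data science is data"]
counts = run_job(lines, word_count_mapper, word_count_reducer)
print(counts)  # {'big': 2, 'data': 3, 'is': 2, 'science': 1}
```

Once this model is clear, the pseudo-distributed Hadoop setup is mostly about learning where the framework takes over the shuffle and the scheduling.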

Learn about the Google technology stack (MapReduce, BigTable, Dremel, Pregel, GFS, Chubby, Protobuf etc). (See What are the most interesting Google Research papers?, also http://research.google.com/pubs/... and http://www.umiacs.umd.edu/~jimmy..., http://www.columbia.edu/~ak2834/...)

Set up an account with Amazon AWS/EC2/S3/EBS and experiment with running Hadoop on a cluster with large data sets (you can use Cloudera or YDN images, but in my opinion you will understand the system better if you set it up from scratch, using the original distribution). Watch the costs.

Try out Hadoop alternatives, specifically the minimalist frameworks such as BashReduce: http://github.com/erikfrey/bashr... and CloudMapReduce: http://code.google.com/p/cloudma... (see What are some promising open-source alternatives to Hadoop MapReduce for map/reduce?)

Run Brian Cooper's Cloud Serving Benchmark on AWS and compare HBase vs Cassandra performance on a small cluster (6-8 nodes): http://wiki.github.com/brianfran...
Run LINPACK benchmark: http://www.datawrangling.com/on-...
Run some experiments with MPI (http://www.mcs.anl.gov/research/...): try to implement a simple clustering algorithm (e.g. http://en.wikipedia.org/wiki/K-m...) with MPI vs Hadoop/MapReduce and compare the performance, fault tolerance, ease of use etc. Learn the differences between the two approaches, and when it makes sense to use each one.
Check out Dongarra's papers: http://www.netlib.org/utk/people...
There is a new library called MapReduce-MPI (http://www.sandia.gov/~sjplimp/m...); see how it works and how it compares to other MapReduce implementations
Run some tests with ScaLAPACK [5], try to port one of the routines to Hadoop, compare the performance and scalability
Write your own simplified MapReduce runtime in C or any other programming language
Check out http://www.cascading.org/, http://clojure.org/ and http://github.com/bradford/infer
Learn about distributed hash tables (http://en.wikipedia.org/wiki/Dis...)
Learn about Paxos (http://en.wikipedia.org/wiki/Pax...), run some experiments with open source implementations.
Download Nutch (http://nutch.apache.org/) or Solr (http://lucene.apache.org/solr/), run a crawl on Wikipedia. Analyze the collected data with R (see item 2 above) or Python (http://www.nltk.org/)
Write your own simplified crawler/indexer, test the performance and scalability, look at the Lucene source for ideas, look at http://infolab.stanford.edu/~bac... for inspiration. You can probably build it as a term project in either an Information Retrieval or a Search Engines course.
Learn about prefix-sum: http://en.wikipedia.org/wiki/Pre..., parallel matrix multiplication: http://www.cs.berkeley.edu/~yeli..., streaming: http://infolab.stanford.edu/stream/ and BSP: http://en.wikipedia.org/wiki/Bul...
Pick one of the PGAS languages (http://en.wikipedia.org/wiki/Par...), e.g. X10 (http://en.wikipedia.org/wiki/X10...), go through the tutorials (http://ppppcourse.ning.com/forum...), run some HPC benchmarks (LU, FFT) and the examples (the streaming example in particular): see how it scales on a cluster/AWS, compare to sequential and Hadoop/MapReduce implementations, see what kind of performance/scalability gains it gives you on multicore boxes.
Some good references on parallel programming: Herlihy & Shavit, The Art of Multiprocessor Programming: http://www.amazon.com/Art-Multip..., Blelloch, Vector models for data-parallel computing: http://citeseerx.ist.psu.edu/vie..., Valiant, A bridging model for parallel computation: http://portal.acm.org/citation.c..., Hillis & Steele, Data Parallel Algorithms: http://portal.acm.org/citation.c...
Take a course in Parallel Computer Architecture: http://www.eecs.berkeley.edu/~cu...
Check out Cilk: http://software.intel.com/en-us/...
Run some experiments with Weka (http://www.cs.waikato.ac.nz/ml/w...) or RapidMiner (http://rapid-i.com/), pick a simple algorithm and port it to MapReduce, see how it scales on a cluster/AWS
Experiment with distributed 'NoSQL' data stores (Voldemort, HBase, Redis, Tokyo, Cassandra etc). Figure out what the CAP theorem is all about (http://www.allthingsdistributed....). Create a simple app with a key-value or column-based store as a back-end. Import several GBs of interesting data into it and run some simple clustering/KNN algos (http://en.wikipedia.org/wiki/Clu..., http://en.wikipedia.org/wiki/Nea...). Optimize your algo to better utilize random access patterns, and experiment with various tuning options. Build a front-end visualization for the results (check out Protovis or a similar visualization package: http://vis.stanford.edu/protovis/)
A good resource on 'NoSQL': Varley, No Relation: The Mixed Blessings of Non-Relational Databases: http://ianvarley.com/UT/MR/Varle...
Learn about main-memory databases: http://en.wikipedia.org/wiki/In-... , http://scholar.google.com/schola..., http://monetdb.cwi.nl/
Write a distributed hash table in C, here is a good reference: http://pdos.csail.mit.edu/papers...
Write a distributed file system in C. Learn how to write good systems code using the following resources:

http://swtch.com/~rsc/
http://herpolhode.com/rob/
http://www.cs.princeton.edu/~bwk/
http://cm.bell-labs.com/who/dmr/
http://www.cs.columbia.edu/~aho/
http://plan9.bell-labs.com/who/ken/
http://www.informatik.uni-trier....
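
As a starting point for the k-means exercise mentioned earlier in this section, here is a rough single-machine sketch of how one k-means iteration splits into a map step (assign each point to its nearest centroid) and a reduce step (average the points per centroid). The names and the sample points are made up for illustration; a real MPI or Hadoop version would distribute the points across nodes and would have to ship the updated centroids to every node between iterations.

```python
from collections import defaultdict


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_mapper(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(
        range(len(centroids)),
        key=lambda i: squared_distance(point, centroids[i]),
    )
    return nearest, point


def mean_reducer(points):
    """Reduce step: average the points assigned to one centroid."""
    count = len(points)
    return tuple(sum(coords) / count for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """One map/shuffle/reduce round of k-means."""
    groups = defaultdict(list)
    for point in points:
        key, value = assign_mapper(point, centroids)
        groups[key].append(value)
    # A centroid with no assigned points keeps its previous position.
    return [
        mean_reducer(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iters=100):
    """Repeat iterations until the centroids stop moving."""
    for _ in range(max_iters):
        updated = kmeans_iteration(points, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


sample_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
initial_centroids = [(0.0, 0.0), (10.0, 10.0)]
final_centroids = kmeans(sample_points, initial_centroids)
print(final_centroids)  # [(0.0, 0.5), (10.0, 10.5)]
```

The interesting comparison between MPI and MapReduce is in what happens around this loop: MPI can keep the points in memory across iterations, while a plain MapReduce job typically re-reads its input on every round.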

4) Learn about data compression:

To be added

5) Learn about machine learning:

This is an excellent resource for self-study: Cross, Learning about machine learning: http://measuringmeasures.com/blo... , also http://metaoptimize.com/qa/quest...
The alternative (and rather expensive) option is to enroll in a CS program/Machine Learning track if you prefer studying in a formal setting.
Since all the standard machine learning, data mining, IR, statistics, AI and NLP content is available online, can be forked on GitHub or purchased on Amazon, I personally don't see much value in studying for a Master's degree unless you want a corporate job afterwards.
See: Was your Master's in Computer Science (MS CS) degree worth it and why?, When is it a good idea to get an MS in Computer Science?, Was your Master's degree in Statistics/Applied Math/Symbolic systems worth it and why?, What are the advantages and disadvantages of doing a CS PhD?
[Higher Education] Which are the best universities for an MS or PhD related to Information Retrieval, and why?
See Lorica, How to nurture data scientists: http://practicalquant.blogspot.c...
You can structure your study program according to online course catalogs and curricula of MIT (http://web.mit.edu/catalog/degre..., http://ocw.mit.edu/courses/elect...), Stanford (http://www.stanford.edu/dept/reg...) or other top engineering schools. Experiment with data a lot, hack some code, ask questions, talk to good people, set up a web crawler in your garage (http://www.ngoprekweb.com/2006/1...).
Joining a well-capitalized data-driven startup and learning by doing (with some part-time self-study using the resources above) could be a good option. See

What are the hottest startups in the analytics space?
Who are the best VCs in the field of analytics / data mining / databases?
Which companies have the best data science teams?
What are the notable startups in the news space?
Does the US Census have a data team?
Why do so many data geeks join web companies instead of solving large scale data problems in biology?

6) Learn about least-squares estimation and Kalman filters:

This is a classic topic and "data science" par excellence in my opinion. It is also a good introduction to optimization and control theory. Start with Bierman's LLS tutorial, given to his colleagues at JPL; it is clearly written and inspiring (the Apollo mission trajectory was estimated using these methods): http://www.amazon.com/Factorizat... , also see Curkendall & Leondes: http://adsabs.harvard.edu/full/1974CeMec...8..481C and Quarles: http://citeseerx.ist.psu.edu/vie....
See Steven Kay's series on statistical signal estimation: http://www.amazon.com/Fundamenta..., also check out his short course outline at University of Rhode Island for a list of interesting topics to learn (this is usually part of EE curricula): http://www.ele.uri.edu/faculty/k...
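
To give a flavor of the estimation methods in item 6, here is a minimal sketch of a scalar Kalman filter that estimates a constant from noisy measurements. It is not taken from Bierman or Kay; the function name, the parameters and the sample readings are my own assumptions. With zero process noise and a very uncertain prior, the estimate approaches the running sample mean, which is the least-squares estimate of a constant.

```python
def kalman_constant(measurements, measurement_var,
                    initial_estimate=0.0, initial_var=1e6, process_var=0.0):
    """Scalar Kalman filter for a state that is modeled as constant.

    Returns a list of (estimate, variance) pairs, one per measurement.
    """
    estimate, variance = initial_estimate, initial_var
    history = []
    for z in measurements:
        # Predict: the state does not change, only its uncertainty can grow.
        variance += process_var
        # Update: blend the prediction and the measurement via the Kalman gain.
        gain = variance / (variance + measurement_var)
        estimate += gain * (z - estimate)
        variance *= (1.0 - gain)
        history.append((estimate, variance))
    return history


readings = [1.0, 3.0, 2.0, 2.0]  # made-up noisy observations of a constant
history = kalman_constant(readings, measurement_var=1.0)
final_estimate, final_variance = history[-1]
print(final_estimate, final_variance)
```

Each update costs the same regardless of how many measurements came before, which is why the recursive form is attractive for sequential data; the multivariate version replaces the scalars with matrices, and that is where the factorization methods come in.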

7 Comments • Wed Aug 25 18:08:10 UTC+0800 2010

Peter Skomoroch, Sr. Data Scientist @ LinkedIn - ... 19 endorsements

12 votes by Alex Kamil, Lakshmi Narasimhan Parthasarathy, Mat Kelcey, (more)

If you have the time to take courses, give it a shot.

1) Try to take some of the undergrad math courses you missed. Linear Algebra, Advanced Calculus, Diff. Eq., Probability, Statistics are the most important. After that, take some Machine Learning courses. Read a few of the leading ML textbooks and keep up with journals to get a good sense of the field.

2) Read up on what the top data companies are doing. After 1 or 2 machine learning courses you should have enough background to follow most of the academic papers. Implement some of these algorithms on real data.

3) If you are working with large datasets, get familiar with the latest techniques & tools (Hadoop, NoSQL, R, etc.) by putting them into practice at work (or outside of work).

Read these posts by Mike Driscoll:

* http://dataspora.com/blog/the-se...
* http://dataspora.com/blog/sexy-d...

Fri Sep 3 05:17:29 UTC+0800 2010

Joseph Misiti

6 votes by Charlie Cheever, Edwin Khoo, Mei Marker, (more)

I am currently working as a data engineer with a team of others and I can tell you what we all have in common:

1) MS or PhDs in Applied Mathematics or Electrical Engineering
2) Fluency in C++/Matlab/Python
3) Experience building distributed systems and algorithms.

I agree with Anon that CS is probably not the way to go unless you are going to MIT, Caltech, Stanford, CMU, etc. The way I ended up in the field was working as a software engineer designing real-time systems and getting an MS in Applied Math part-time. After 4 years I had skills from both fields and was offered a position doing ML/DM. With that said, I can tell you that it's an extremely interesting field, and it appears the skill set will only become more desirable in the future.

2 Comments • Thu Aug 26 10:24:51 UTC+0800 2010

Gregory Piatetsky, analytics/data mining consultant... 1 endorsement

5 votes by Peter Skomoroch, Susheel Kiran J, Carlos Leiva Burotto, (more)

A good start for becoming a data scientist is to get an MS (or PhD) in Machine Learning / Data Mining; along the way you will get plenty of experience with the relevant math and use the latest systems. Stanford, UCI, CMU and MIT are top schools, but there are many others in the USA (see http://www.kdnuggets.com/educati...) and in Europe (see http://www.kdnuggets.com/educati...).

Stanford has online courses in data mining / ML - check
http://www.kdnuggets.com/2010/06...
http://scpd.stanford.edu/

Thu Sep 9 02:08:55 UTC+0800 2010

Russell Jurney, Data Viznik, Hack Historian 2 endorsements

4 votes by Alex Kamil, Simplicio Gamboa III, Luis Alberto Santana and Mat Kelcey

The school route is well covered. This is the autodidactic route:

Look at some common problems solved with machine learning. Look at problems in your areas of interest with an abundance of available data. Intersect these sets, pick a problem to solve with ML. Learn whatever it takes to solve it poorly. Get people using the output of your model. Iterate, learn more techniques. Work on your maths as needed. Find mentors to talk with about problems you're working on. Keep them updated, collaborate, learn from them.

Get good at building things with data. Update your LinkedIn profile - congratulations, you're a data scientist!

Thu Sep 2 07:26:18 UTC+0800 2010

Paco Nathan, 45 years ago I couldn't even spe... 4 endorsements

4 votes by Joey Shurtleff, Edwin Khoo, Alex Kamil and Josh Wills

Stanford has an interdisciplinary degree specifically for data science, called Mathematical and Computational Sciences (MCS). It's sponsored by the Stats department and overlaps with CS, Math, Operations Research, etc. http://www.stanford.edu/group/ma... The BS degree dovetails particularly well with a co-term program to get an MS in Computer Science -- say, with a distributed systems specialization.

+1 to both Pete's and Russ' wise words above.

1 Comment • Wed Sep 8 11:57:34 UTC+0800 2010

Yaniv Goldenrand, Fraud and credit modeling

3 votes by Alex Kamil, Kevin Li and Seb Paquet

Get a job doing it; this way you'll learn what really matters and get paid in the process.
The standard way to become a data analyst is a master's in math/statistics + internship.

Other ways are:
- PhD in some empirical subject (economics, psychology).
- Get an engineering position in some data-intensive company and convert.
Some of the best modelers I know are ex-programmers.

Thu Aug 26 07:33:52 UTC+0800 2010

Sandro Saitta

1 vote by Alex Kamil

Reading data mining related blogs is also important to understand the wide application areas of data mining. You have a list of data mining blogs here: http://www.dataminingblog.com/li...

1 Comment • Fri Sep 24 18:29:28 UTC+0800 2010

Xuehua Shen

1) infrastructure for data processing, such as Hadoop/MapReduce, Pig/Hive, and automation/cron.
2) simple stats about data, such as mean, correlation, and p-value.
3) algorithms for data modeling, such as logistic regression and SVM.
4) visualization of data, such as charts and tables.
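
For the simple stats in item 2), here is a minimal sketch in plain Python of the mean and the Pearson correlation coefficient. The function names and the sample data are assumptions for illustration only; in practice you would likely use R or an existing library.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(spread_x * spread_y)


hours = [1.0, 2.0, 3.0, 4.0]   # made-up sample data
scores = [2.0, 4.0, 6.0, 8.0]  # perfectly linear in hours
print(mean(hours), pearson_correlation(hours, scores))
```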

Mon Sep 13 01:57:29 UTC+0800 2010

Posted on 2010-09-27 14:11 by 小司