Self-study list for becoming a data scientist

Data science is an exciting and growing field! It is also relatively new, and I’m sure lots of graduates have questions about it, so I’m putting together a list of useful resources for budding data scientists (myself included).


1. An Introduction to Statistical Learning: This is a great practical text on machine learning. I was impressed by its Amazon reviews (all very positive as of 3/11/2015) and, having read it, by its clarity and precision (Amazon link). The authors have kindly made it available as a free PDF online, but many readers enjoy it so much that they end up buying the hard copy.

2. Theory and Applications for Advanced Text Mining

3. 9 free data science books
Programming languages

4. There are a few programs specifically designed for PhD graduates who want to become data scientists. One of them is the Insight data science course. The link takes you to a page of recommended readings and preparatory material for their data science candidates. Applications for this fellowship are very competitive, and they’ve recently started taking medical doctors for their new Insight healthcare data science program. The course itself is very self-directed and is more of a crash course. I believe one of its best features is that it serves as a platform where employers can meet data science candidates; it also prepares candidates for interviews.

5. Mathematics and Statistics background: It is essential to have a good grasp of certain mathematical concepts from linear algebra and multivariable calculus. Probability and distribution theory and Bayesian statistics are also very valuable. The Coursera machine learning course by Andrew Ng is very highly recommended.

6. An extensive list of resources on how to become a data scientist can be found on Quora, under the question “How to become a data scientist”. Most of the answers were posted by current data scientists in various industries, so for the most part they should be up to date.

7. Programming: It’s crucial to have a good programming background, especially in high-level languages such as Python and R. If you are a complete beginner in programming, I would highly recommend Python. Not only does it have excellent online documentation (unlike R), its syntax is also very easy to understand, which makes it well suited to beginners. I started off learning R, but after picking up Python I noticed that my programming skills became much better, and concepts like control structures and algorithms were much easier to grasp. If you want to learn more about Python, I would recommend the MIT Python course on edX. I have personally taken this course and would suggest taking it after an introductory course on Python syntax. It has a reputation as a very thorough and difficult course that emphasises mastery of algorithms and other fundamental programming concepts. Database languages such as SQL and big data tools such as Hadoop and Spark/Scala are also highly desirable.

8. Harvard Data Science course: This is a free online course organised by Prof Joe Blitzstein and colleagues at Harvard University. Again, I’ve heard many good things about it, and there is definitely enough to keep you busy for months!

9. Open Source Data Science Masters: There are lots of resources here on the various data science domains (machine learning, maths/statistics, databases, visualisation).

10. Piotr Migdal, a recent physics PhD graduate and now freelance data scientist, has put together a fantastic article about his own journey from PhD student in quantum physics to data scientist. It’s such a good read and I would highly recommend it; there’s enough there to keep you busy for years.

11. MIT’s Analytics Edge: This is one of the best MOOCs out there, and you can take it absolutely free of charge. I’m currently taking this MOOC and it is amazing, to say the least. I will be writing a review of the entire course once I’ve completed it. Topics include machine learning, data visualisation, integer and linear optimization, and there is also a Kaggle competition.
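
To give a flavour of the “control structures and algorithms” mentioned in point 7, here is a minimal Python sketch of bisection search used to approximate a square root. It is only an illustration: the function name `bisection_sqrt` and its parameters are my own choices, not taken from any of the courses or books listed above.

```python
def bisection_sqrt(x, tolerance=1e-6, max_steps=200):
    """Approximate the square root of a non-negative number by bisection.

    Hypothetical helper written for illustration only.
    """
    if x < 0:
        raise ValueError("x must be non-negative")

    # The square root of x always lies between 0 and max(1, x).
    low, high = 0.0, max(1.0, x)
    guess = (low + high) / 2

    # A bounded loop: stop early once guess squared is close enough to x.
    for _ in range(max_steps):
        error = guess * guess - x
        if abs(error) <= tolerance:
            break
        if error < 0:
            low = guess   # guess is too small, search the upper half
        else:
            high = guess  # guess is too large, search the lower half
        guess = (low + high) / 2

    return guess


print(round(bisection_sqrt(25), 3))
```

The whole thing uses nothing more than a loop, a conditional and a couple of variables, which is roughly the level an introductory Python course gets you to before moving on to data-specific libraries.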