My review of MIT’s “Analytics Edge” MOOC (15.071x)

 


As a mathematical modelling PhD student, I have always had an interest in machine learning. There are many great machine learning MOOCs on the internet, and a very popular one is Andrew Ng’s machine learning course on Coursera (https://www.coursera.org/learn/machine-learning). Whilst this is a great course, it has the slight disadvantage of being taught in Octave (an open source alternative to MATLAB). I personally prefer a course taught in either Python or R, and I later came across the “Analytics Edge” course offered by MIT via the edX platform (https://www.edx.org/course/analytics-edge-mitx-15-071x-2). Having just completed this course, I can wholeheartedly recommend it to anyone willing to try their hand at a bit of machine learning.

The course runs for approximately 12 weeks and you can take it completely free of charge and collect an honours certificate provided you obtain a pass mark of 55% (verified certificates are also available for a fee). The whole course is taught in R and is of a very high standard (as expected of an MIT course). One feature that makes this course stand out from other analytics MOOCs is its very hands-on nature from day 1. As with most data science concepts, one of the best ways to understand machine learning is by implementing the various algorithms and exploring the results. Machine learning techniques covered include linear regression, logistic regression, decision trees (random forests, etc), clustering and visualisation. One thing I will say is that this is not a core programming course in R; it mainly emphasises the use of R for data analytics and machine learning. As such it is more of an introduction to the basics of R, with the emphasis on learning syntax rather than hard core programming. There are lots of very useful R packages covered which can be used in day-to-day analytics tasks (see point 2 below).

Towards the end of the course, we were introduced to linear and integer optimization implemented in Excel. Personally, I would have preferred to see more explanation of machine learning concepts.

Some nice features of the course

  1. It focuses on implementation of various machine learning algorithms without going into too much detail on the underlying theory/mathematics. This is a nice way for a beginner to understand machine learning: by doing rather than getting bogged down in theory. There are other courses that cover more of the theory (see Andrew Ng’s machine learning course, https://www.coursera.org/learn/machine-learning) and these can be taken after this course.
  2. The course introduced a number of machine learning techniques which have already been implemented in R packages. These include random forests, decision trees, logistic regression, linear regression, text analytics and clustering. Examples of packages used include rpart, randomForest, ROCR, caret, e1071, tm, ggplot2 and caTools, along with the built-in kmeans function.
  3. I really liked the variety of data sets and problems introduced in the course. They are very interesting, diverse, stimulating and taken from the real world. Examples of datasets used include the Framingham heart study, crime data, stock market data, demographics, climate, imaging data (MRI), music, polling, Twitter analytics, online dating, Netflix (movie recommendations), etc. Some of the datasets are very large and can be very challenging from a computational perspective. As you can see, there’s almost certainly something for everyone.
  4. There’s a kaggle competition in week 7 which takes the excitement of the course to a whole different level! The data is again taken from the real world and you’ll have the joy of competing with about 3000 students from around the world.
  5. The lectures are very clear and concise. Emphasis is more on the assignments which are relatively easy to complete using the course material. These may seem repetitive initially but without doubt, the best way to explore machine learning is to dive into the problems using the various implementations in R. You can worry about the details of the algorithms or mathematical theory later.
  6. The discussion forum was a great place to learn from several other more experienced course participants. The best predictions usually rely on an ensemble of machine learning techniques which can only be learned through experience.
  7. The number of lectures provided is very well balanced and the course is self-contained (it won’t take you totally away from other commitments). The assignment deadlines are realistic for most people taking similar courses (typically with day jobs and free time in evenings and weekends).
  8. The linear and logistic regression modules provide a reasonable grounding in the mathematical theory/assumptions behind the various implementations. Examples include the sum of squared errors (SSE), total sum of squares (SST), ROC curve, specificity and sensitivity, etc.
  9. The visualisation chapter is incredibly stimulating. I personally like data visualisation and so I took the liberty of enjoying this module. The main package used was ggplot2 but there were other packages for visualising maps/networks including maps, igraph and ggmap.
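
To make the quantities in point 8 concrete, here is a minimal sketch in base R. It is not taken from the course material: the choice of the built-in mtcars data, the toy label vectors and the helper name confusion_rates are all assumptions made purely for illustration.

```r
# R-squared from SSE and SST, using the built-in mtcars data (an assumed example)
model <- lm(mpg ~ wt, data = mtcars)
pred  <- predict(model, newdata = mtcars)

sse <- sum((mtcars$mpg - pred)^2)               # sum of squared errors of the model
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # squared errors of the mean-only baseline
r_squared <- 1 - sse / sst

# Sensitivity and specificity from 0/1 labels (hypothetical helper)
confusion_rates <- function(actual, predicted) {
  tp <- sum(actual == 1 & predicted == 1)  # true positives
  tn <- sum(actual == 0 & predicted == 0)  # true negatives
  fp <- sum(actual == 0 & predicted == 1)  # false positives
  fn <- sum(actual == 1 & predicted == 0)  # false negatives
  c(sensitivity = tp / (tp + fn), specificity = tn / (tn + fp))
}

# Toy labels, made up for this example
actual    <- c(1, 1, 1, 0, 0, 0, 0, 1)
predicted <- c(1, 0, 1, 0, 0, 1, 0, 1)
rates <- confusion_rates(actual, predicted)
```

Here r_squared should agree with summary(model)$r.squared, which is a handy sanity check when computing SSE and SST by hand.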

 

Without a shadow of a doubt, I would recommend this course to anyone curious about machine learning or “analytics”. This course will by no means turn you into an expert in machine learning or “data science”. If you prefer Python to R, there’s no reason not to try out the course in Python; this will probably be one of my side projects in the near future. Alongside the course, I found it really helpful to go through the statistical learning book by Gareth James et al (especially the random forest chapter). In my opinion, 15.071x is one of the best free MOOCs out there. It would be interesting to hear about other great machine learning courses.

 

A Collection of Useful R Codes for Data Manipulation

Two of the most important data science tools for data wrangling and manipulation are R and Python. Of these two, I personally prefer R for data manipulation since it was written specifically for this. One downside of R is that its documentation is quite poor, so it can be helpful to keep your own list of useful code snippets which you can refer to as and when needed. In this article, I will be posting a collection of really useful R snippets which I’ve found very handy over the course of my PhD.

        1. Install and/or load multiple R packages at once: This is quite handy as most times, you will find yourself needing to either install new packages or load previously installed packages. A good exercise would be to re-write the code below as a function which can take one or more R packages as an argument. The code below was obtained from http://diggdata.in/.

# List of packages to be installed/loaded
packageList <- c("ggplot2", "nlme")
check <- packageList %in% rownames(installed.packages())
if(any(!check)) install.packages(packageList[!check])
lapply(packageList, library, character.only = TRUE)

        2. Create new folders in R: It is possible to create a new folder in your current directory using R.

# get current working directory
getwd()
# create "folder_1" in current working directory
dir.create("folder_1")
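
As one possible answer to the exercise in point 1, the two snippets above can be wrapped into small reusable functions. This is only a sketch: the names load_packages and ensure_dir are my own (hypothetical) and do not come from the original source of the code.

```r
# Hypothetical helper: install any missing packages, then load them all.
# Returns (invisibly) a named logical vector saying which packages loaded.
load_packages <- function(pkgs) {
  missing <- !(pkgs %in% rownames(installed.packages()))
  if (any(missing)) install.packages(pkgs[missing])
  loaded <- vapply(pkgs, function(p) {
    suppressPackageStartupMessages(require(p, character.only = TRUE))
  }, logical(1))
  invisible(loaded)
}

# Hypothetical helper: create a folder only if it does not exist yet,
# which avoids the warning dir.create() gives for an existing folder.
ensure_dir <- function(path) {
  if (!dir.exists(path)) dir.create(path, recursive = TRUE)
  invisible(dir.exists(path))
}

# Example usage with packages that ship with R and a temporary folder
status  <- load_packages(c("stats", "utils"))
new_dir <- file.path(tempdir(), "folder_1")
created <- ensure_dir(new_dir)
```

Calling load_packages(c("ggplot2", "nlme")) would reproduce the behaviour of the original snippet, installing anything that is missing first.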

What are employers looking for in PhD graduates?

Many a maths PhD graduate may have envisioned themselves landing their dream job almost straight after their viva. This is not an unreasonable dream. However, it doesn’t always work out that way, and it is increasingly important to understand very clearly what skills graduates are able to bring to a role or industry. There is a need to reflect on the PhD journey with a view to highlighting achievements that may be of particular interest to a potential employer. It may seem obvious to any PhD graduate what skills they’ve accrued over the course of their research. However, not all employers really understand and appreciate the value of a PhD. It is therefore important when writing your CV that your skills are well profiled and relevant to the various roles of interest. This is a type of inverse problem where you have to first understand what an employer is looking for and then find ways of demonstrating that you are the right fit for the job.

There are two broad categories of jobs that maths/natural science PhDs often end up doing:

  1. Jobs that require a PhD qualification in the subject studied: These types of jobs are very specialist and research focused and hence do specify the need for a PhD qualification. As a result, most companies won’t hire until you’ve completed your viva. They need to be sure that you indeed have the skills they are after and this is fair. Industries in this category often hire graduates who have focused on a subject matter that’s directly relevant to that industry. Examples include machine learning, mathematical modelling, fluid dynamics, financial modelling and time series analysis, optimization, PDEs, cryptography and number theory, advanced statistical modelling, etc. These jobs tend to attract higher salary packages because employers value the skill set you’re bringing into the company. Examples of employers include the pharmaceutical industry (modelling and simulation), the finance industry (quantitative analysts and financial modellers), data science and analytics firms (Twitter, Quora, Facebook, etc), GCHQ, Google, IBM, Microsoft, academia, etc. These jobs are likely to value relevant publications in the respective field of research.
  2. Jobs that do not directly require a PhD qualification: Most PhD graduates may be wondering, what’s the point of going for roles that do not require my PhD? Whilst these roles may not necessarily require expertise in the subject matter of your PhD, it is the skills acquired and demonstrated in the process that are the main reasons for the hire. These include problem solving skills, the ability to learn difficult and technical concepts very quickly, programming, analysis, modelling skills, independent research, written and verbal communication, etc. These roles are often open to non-PhD graduates including Masters and sometimes holders of BSc degrees. However, a PhD graduate has an advantage because they’ve had more time to develop some of these key skills. Publications may not necessarily count for these types of roles. An employer in this category is interested in what you can add to their industry and this boils down to transferable skills.

There are also other types of roles that do not necessarily require the subject matter of a PhD but nonetheless require a PhD qualification. A good example is the insight data science fellowship program. In summary, the skills you’ve gained during your PhD are just as important as the PhD itself and these need to be reflected in your resume.

Self study list for becoming a data scientist

Data science is an exciting and growing field! It is relatively new and I’m sure lots of graduates have questions about it, so I’m putting together a list of useful resources for budding data scientists (myself included).

 

1. An introduction to statistical learning: This is a great practical text on machine learning. I was impressed by its Amazon reviews (all very positive as of 3/11/2015) and indeed very impressed by its clarity and precision (amazon link). The authors of the book have kindly made it available as a free pdf online but most readers enjoy it so much that they end up buying the hard copy.

2. Theory and Applications for Advanced Text Mining

3. 9 free data science books
Programming languages

4. There are a few programs specifically designed for PhD graduates who want to become data scientists. One of them is the Insight data science course. The link takes you to a page of recommended readings and preparatory material for their data science candidates. The application for this fellowship is very competitive and they’ve recently started taking medical doctors for their new Insight healthcare data science programme. The actual course is very self-directed and is more of a crash course. I believe one of the best things about the course is that it serves as a platform where employers can meet data science candidates, and it also prepares candidates for interviews.

5. Mathematics and Statistics background: It is essential to have a good grasp of certain mathematical concepts from linear algebra and multivariable calculus. Probability and distribution theory and Bayesian statistics would also be very valuable. The coursera machine learning course by Andrew Ng is very highly recommended.

6. An extensive list of resources on how to become a data scientist can be found on the Quora website (how to become a data scientist). Most of these were posted by current data scientists in various industries and so for the most part will be up to date.

7. Programming: It’s crucial to have a good programming background, especially with high level languages such as Python and R. If you are a complete beginner in programming, I would highly recommend Python. Not only does it have excellent online documentation (unlike R), its syntax is also very easy to understand, which makes it well suited to beginners. I started off learning R but after picking up Python, I noticed that my programming skills became so much better and concepts like control structures and algorithms were much easier to grasp. If you want to learn more about Python, I would recommend the MIT Python course on edX. I have personally taken this course and I would recommend it perhaps after taking an introductory course on Python syntax, etc. It has a reputation as a very thorough and difficult course which emphasises mastery of algorithms and other fundamental programming concepts. Database languages such as SQL and big data tools such as Hadoop and Spark/Scala are also highly desirable.

8. Harvard Data Science course: This is a free online course organised by Prof Joe Blitzstein and colleagues at Harvard University. Again, I’ve heard many good things about it and there is definitely enough to keep you busy for months!

9. Open source data science masters: There are lots of resources here on the various data science domains (machine learning, maths/statistics, databases, visualisation)

10. Piotr Migdal, a recent physics PhD graduate and now freelance data scientist, has put together a fantastic article about his own journey from a PhD student in quantum physics to becoming a data scientist. It’s such a good read and I would highly recommend it; there’s enough there to keep you busy for years.

11. MIT’s Analytics Edge: This is one of the best MOOCs out there that you can undertake absolutely free of charge. I’m currently taking this MOOC and it is just amazing to say the least. I will be writing a review of the entire course once I’ve completed it. Features include machine learning, data visualisation, integer and linear optimization and also a kaggle competition.

Data science careers

“Data science” is a relatively new field that combines knowledge of statistics, machine learning and programming to solve real world problems using data. The field is still very new and whilst many companies understand why they need data scientists, many don’t know what skills they should be looking for in potential hires. A lot of data science job descriptions have a long list of skills, which suggests that companies are looking for a “unicorn” with every possible combination of skills. However, in reality, most of these unicorns do not exist and it’s often more practical to build a data science team with individuals who have strengths in various areas.

Broadly speaking, data careers can be divided into data science and data engineering. Data science is more to do with analysing, visualising and deriving meaningful insights from data. Data engineering on the other hand is more concerned with building data pipelines to deal with large datasets, etc. Data engineering is more closely related to software engineering whilst data science is more suitable for candidates from a physical sciences background (including maths, physics, quantitative biology/neuroscience, computer science, engineering, etc). In reality, this distinction is often not clear in job descriptions. Also, due to a shortage of talent, overlap between data science and data engineering is quite common. See the four types of data scientists for more.

 

 

How do you get your first job as a data scientist?

Getting that first job as a data scientist can be a very important first step on your journey to becoming a competent data scientist. It can be very challenging, especially if you’re coming straight from university with little “real world” experience. A PhD qualification alone will not give you automatic entry into a data science job; you will have to be able to demonstrate what you can contribute to a data science team. It’s worth mentioning that there are various paths to becoming a data scientist and you don’t necessarily need a PhD to get into the field, but you do have to demonstrate a breadth of skills.

  1. PhD route: A common route for landing a data science role is through a PhD qualification in a quantitative/scientific field such as physics, mathematics, engineering, neuroscience, biology, bioinformatics, computer science, etc. There are bootcamps that specifically recruit PhD graduates where they get to work on a real world project in collaboration with other graduates. Some popular bootcamps include the ASI fellowship, S2DS and Insight fellowship programmes (note that the first two are based in the UK whilst the third is in the USA). For a more comprehensive list of data science bootcamps, see the following link.
  2. Masters/undergraduate route: Data scientists can be hired straight after an undergraduate or masters degree into entry level data science roles. It’s important to demonstrate your competencies through projects and, if possible, by doing an internship with a data-driven company.
  3. Portfolio/work experience route: This may be suitable for individuals who are already in industry and are wishing to move into data science roles. Increasingly, many software engineers may find themselves in this situation.

 

So the question is: “How do you land that first job?”

  1. Know the basics of data science theory very well. This includes mathematics/statistics (linear algebra, calculus, numerical optimization, regression, algorithms, etc), programming (at the very minimum, Python and R), machine learning techniques, visualisation and some familiarity with big data tools. If interested in data engineering, it’s crucial to get familiar with the big data tools (Scala, Spark, Hadoop, Apache). As a general rule, try to be comfortable with at least one of the data tools (R or Python); this means at least 10,000 hours of coding in that particular language.
  2. Demonstrate your interest by undertaking a data science project in your spare time. Find a question that you can address using online data and showcase your work on a github account. If you’re already studying for a masters or a PhD, try to demonstrate your data science interest through your projects.
  3. If possible get some relevant industry experience related to data science and to the industry of your choice. Domain knowledge is crucial in being an effective data scientist. This can be acquired through an internship, kaggle competition, hackathon or through previous work experience.
  4. Network within the data science community by attending meetups, conferences and arranging meetings with data driven companies of interest.
  5. Keep up to date with the field by reading new articles, publications and algorithms.

Career options for maths PhD graduates

The aim of this blog is to educate Maths PhD students (as well as their lecturers, career counsellors and the general public) on potential careers in industry available to them. The good news is that maths/STEM PhDs are in great demand in very attractive careers.

There’s no doubt that a vast number of career options are open to graduates with mathematical talent and education. A good example is this list of 85 job descriptions of mathematicians working in industry. I like this list because it gives concise descriptions of the various roles and as such gives a great insight into the skills required to do them. With so many options, finding the right career can feel like finding a needle in a haystack. This is the main reason why this blog was created.

To further complicate this, I have noticed that most mathematician roles in industry rarely carry the title “mathematician”. They are often called various other names including but not limited to the following:

1. Business analyst
2. Software Engineer
3. Computer scientist
4. Research associate
5. Data scientist
6. Operations researcher
7. Hydrologist
8. Basin modeller
9. Geologist
10. Statistician
11. Actuary
12. Cryptographer
13. Quantitative analyst
14. Financial Engineer

This is hardly surprising as applied mathematicians are so versatile and often find themselves in roles that are traditionally occupied by other science, technology and engineering graduates. Consequently, the general public do not have a true appreciation of what mathematicians really do apart from the glaringly obvious teaching of mathematics. The job titles are often a reflection of the environment a mathematician works in rather than the background/skills required to get the job done. Not surprisingly, a quick search on Google for “mathematician jobs” in industry may not yield very much. My first piece of advice for maths PhD graduates is not to dismiss any role based on its job title but to pay particular attention to the job description before making a decision. As a maths PhD student myself, I will be sharing my own discoveries which I hope will be of some use to other research students. I plan to update this page on a weekly basis so please do visit again for additional information.

1. The first resource I’m going to recommend can be found on the Society for Industrial and Applied Mathematics (SIAM) website. This website is packed with lots of useful resources on destinations of maths PhD students and a lot of case studies of mathematicians working in industry; it is definitely worth a look. You can also download their brochure and read it in your spare time. It seems to me that a mathematician is a true jack of all trades and master of all! Alongside this, I would also highly recommend the American Mathematical Society website.

 

2. This website is packed with lots of quality information about various career options for maths graduates. Although some of the first few links are no longer working, don’t be put off by this. I really recommend the financial mathematics section for those interested in this line of work.

 

3. Internships: An internship experience can serve as a straight entry route which allows you to explore a company and indeed secure that dream job. Most internships for PhD students are paid and these are usually offered in the penultimate year. For example, here is a list of companies offering PhD students internships in quantitative finance. Some companies however offer off cycle internships meaning you can apply anytime even after completion of your PhD.

4. This is a general website from the University of Manchester containing adverts for jobs outside academia; it is very useful if you don’t know where to start.

List of Companies specifically recruiting PhD applicants

A PhD in mathematics is a very valuable qualification to have on a CV. However, as valuable as it is, not all companies specifically recruit PhD graduates into industrial roles. This unfortunately means PhD graduates may have to compete with Masters or even undergraduate candidates (very annoying if you ask me). Hence, it is worth targeting companies with specific roles and schemes for PhD graduates (Google, McKinsey and Co, PwC, companies recruiting quantitative analysts, pharmaceutical companies, data analysis and tech companies, GCHQ, etc).

All in all, I think it’s important for each PhD graduate to make a list of the skills they feel they can offer and to target companies who are after those skills. For example, Google is very keen to employ PhD graduates who have strong skills in software development, machine learning and statistics.