
Transitioning from Engineering PhD to Data Scientist in Industry, Part I

Part I: What data scientists do

Jun 7, 2020

Around six years ago, I began a career transition from biomedical engineering research in an academic lab to working as a data scientist in the healthcare industry. Since then, I have advised dozens of other folks pursuing a similar career trajectory. Based on my own experience and on observing many of my peers, I’ve concluded that most engineering PhDs are well-equipped to be data scientists in industry and can make the transition after filling a few common gaps. (More specifically, I believe biomedical engineers are well-equipped to be clinical data scientists, but I won’t go into detail on that subfield here.) Since there is demand for data scientists and an oversupply of engineering PhDs searching for work outside of academia, I synthesized some notes to help folks figure out whether a data science career sounds appealing and what they need to work on to be competitive applicants. In this post, I describe the typical skills and methods that companies look for, many of which overlap with engineering PhD experience. In the next post (Part II: What engineering PhDs do well & common gaps), I break down, item by item, the typical strengths and weaknesses of engineering PhDs and suggest resources to help fill gaps.

Functions

To secure a position and thrive as a data scientist in industry, one typically must be able to perform two or more of the following functions with minimal guidance:

Data fetching, munging/manipulation & organization

Data fetching (e.g. database querying, understanding database organization, etc.)
Data cleaning & merging (and doing it quickly)
In a nutshell: self-service data retrieval from common data stores (i.e. not relying on others to fetch/clean data for analytics/modeling)
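As a rough illustration of what self-service fetching, cleaning, and merging can look like, here is a minimal sketch using Python's built-in sqlite3 module. The tables (patients, visits), their columns, and the values are all made up for this example; a real company data store will have a different engine and schema.

```python
import sqlite3

# Hypothetical in-memory database standing in for a company data store.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, clinic TEXT);
    CREATE TABLE visits (visit_id INTEGER PRIMARY KEY, patient_id INTEGER, cost REAL);
    INSERT INTO patients VALUES (1, ' North '), (2, 'south'), (3, 'North');
    INSERT INTO visits VALUES (10, 1, 120.0), (11, 1, NULL), (12, 2, 80.0), (13, 3, 200.0);
""")

# Fetch, clean, and merge in one query:
# - TRIM/LOWER normalizes inconsistently entered clinic names (cleaning)
# - JOIN links each visit to its patient (merging)
# - WHERE drops visits with a missing cost before aggregating
query = """
    SELECT LOWER(TRIM(p.clinic)) AS clinic_name,
           COUNT(v.visit_id)     AS n_visits,
           AVG(v.cost)           AS mean_cost
    FROM patients AS p
    JOIN visits AS v ON v.patient_id = p.patient_id
    WHERE v.cost IS NOT NULL
    GROUP BY LOWER(TRIM(p.clinic))
    ORDER BY clinic_name
"""
rows = conn.execute(query).fetchall()
conn.close()

for clinic_name, n_visits, mean_cost in rows:
    print(clinic_name, n_visits, mean_cost)
```

Doing the cleaning inside the query is only one option; many data scientists pull the raw rows and clean them in Python or R instead, depending on data size and team conventions.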

Analytics & Insights

Answering questions and discovering useful patterns/trends in data that can help with operations and/or decision-making

Note: unlike in academia, discovering or creating new knowledge for its own sake, without a clear application to the business, is often unhelpful (and de-prioritized)

Designing, helping administer, and analyzing experiments to test important hypotheses
Creating appropriate metrics for quantifying success (i.e. Key Performance Indicators, KPIs)
Generating automated reports/dashboards/data visualizations of KPIs and other data related to business operations

Applied artificial intelligence/machine learning (AI/ML)

Generating and deploying algorithms (that learn from data) to create a new or enhance an existing product/service

Some examples (non-exhaustive list):

Predicting unwanted outcomes and prioritizing early interventions (e.g. predicting churn/drop out)
Matching users to products/services (i.e. recommendation systems)
Forecasting trends (e.g. supply, demand, capacity, efficiency, etc.)
Classifying/processing high dimensional input (e.g. image classification, video/audio classification/processing/generation)
Deriving insights from/summarizing unstructured data (e.g. natural language processing)

Methods

Some common methods (non-exhaustive list) that data scientists employ to perform functions [related skills in brackets, full list below]:

1. Database knowledge and data retrieval [B]
2. Data pipelines & ETL (Extract, Transform, Load) [B, some A]
3. Programmatic data cleaning and analysis [B, some A]
4. Programmatic data visualization [A, B, E, some C, F]
5. Causal inference from observational data (e.g. propensity score matching) [A, C, some B]
6. Experimental design and analysis/statistical tests (e.g. A/B test, chi-square vs. t-test, etc.) [A, C, E, some B, F]
7. Applied machine learning (e.g. regression, tree ensembles, neural networks/deep learning, clustering, etc.) [A, B, some C, F]

Note: D (domain knowledge) and G (project management) can apply to all methods/projects
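To make the experiment-analysis method a little more concrete, below is a minimal sketch of a chi-square test on a two-arm A/B test with a binary outcome (e.g. converted or not). A chi-square test suits counts like these, whereas a t-test would be the more usual choice for comparing means of a continuous metric. The function name and the counts are made up for illustration, and in practice most people would call a statistics library rather than hand-roll the test.

```python
import math


def chi_square_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-square test of independence for a 2x2 table.

    No continuity correction is applied. Assumes both outcomes occur at
    least once overall, so no expected count is zero.
    """
    observed = [
        [conv_a, n_a - conv_a],  # arm A: converted, not converted
        [conv_b, n_b - conv_b],  # arm B: converted, not converted
    ]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if conversion were independent of the arm.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, the chi-square tail probability
    # reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up counts: 120 of 1000 users converted in arm A, 150 of 1000 in arm B.
chi2, p_value = chi_square_2x2(conv_a=120, n_a=1000, conv_b=150, n_b=1000)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```

The statistical test is only one part of the method; choosing the metric, the sample size, and the stopping rule before running the experiment matters at least as much.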

Skills

A concise list of generic skills (technical & interpersonal) that are the foundations of common methods used by data scientists:

Technical

A. Math, statistics & probability

B. Computer science, programming, software development

C. Experimental design, causal inference

D. Domain knowledge (relevant to the specific company)

Interpersonal

E. Communication with non-technical colleagues

F. User-centered design

G. Project management (for self and for a team)

Tools

Common software tools (non-exhaustive list) [related methods in brackets, full list above]:

Python / R (& Jupyter Notebook / RStudio) [3, 4, 5, 6, 7, some 1, 2]
SQL [1, 2, some 3]
Other database languages [1, 2, some 3]
Git/GitHub [2, 3, 4, 7]
Business intelligence tools (e.g. spreadsheets, Tableau, Looker, Cognos, etc.) [some 1, 2, 3, 4]

Definitions of terms:

Note: there is no consensus in the field on these terms or on what they encompass, so do not be surprised if other data scientists or job descriptions organize them differently. However, I believe this type of organization can help you understand how well your past experience and skills match up with a given job description, interview question, project, etc.

Functions: The activities that data scientists perform on a regular basis.
Methods: The approaches/procedures that underlie functions. Typically these are built upon one or more skills and involve tools.
Skills: The (generic) abilities that underlie relevant methods and enable performance of functions.
Tools: Common software tools that data scientists use.

Next, Part II: What engineering PhDs do well & common gaps