Transitioning from Engineering PhD to Data Scientist in Industry, Part I

Part I: What data scientists do

Nasir Bhanpuri, PhD

--

Around six years ago, I began a career transition from biomedical engineering research in an academic lab to working as a data scientist in the healthcare industry. Since then, I have advised dozens of other folks pursuing a similar career trajectory. Based on my own experience and on observing many of my peers, I’ve concluded that most engineering PhDs are well-equipped to be data scientists in industry and can make the transition after filling a few common gaps. (More specifically, I believe biomedical engineers are well-equipped to be clinical data scientists, but I won’t go into detail on those subfields here.)

Since there is demand for data scientists and an oversupply of engineering PhDs searching outside of academia, I synthesized some notes to help folks figure out whether a data science career sounds appealing and what they need to work on to be competitive applicants. In this post, I describe the typical skills/methods that companies are looking for, many of which overlap with engineering PhD experience. In the next post (Part II: What engineering PhDs do well & common gaps), I break down, item by item, the typical strengths and weaknesses of engineering PhDs and suggest resources to help fill gaps.

Functions

To secure a position and thrive as a data scientist in industry, one must typically be able to perform two or more of the following functions with minimal guidance:

Data fetching, munging/manipulation & organization

  • Data fetching (e.g. database querying, understanding database organization, etc.)
  • Data cleaning & merging (and doing it quickly)
  • In a nutshell: data retrieval self-service from common data stores (i.e. not relying on others to fetch/clean data for analytics/modeling)
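To make "self-service" a bit more concrete, here is a minimal sketch of fetching, merging, and cleaning data in a single query, using Python's built-in sqlite3 module. The `patients` and `visits` tables, their columns, and the function names are hypothetical, invented purely for illustration; a real company's data store will look different.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE visits (visit_id INTEGER PRIMARY KEY, patient_id INTEGER);
        INSERT INTO patients VALUES (1, '  Alice '), (2, 'BOB'), (3, NULL);
        INSERT INTO visits VALUES (10, 1), (11, 1), (12, 2), (13, 3);
    """)
    return conn


def fetch_visit_counts(conn):
    """Return (patient_id, clean_name, visit_count) rows, one per named patient."""
    query = """
        SELECT p.patient_id,
               TRIM(LOWER(p.name)) AS clean_name,  -- cleaning: normalize case/whitespace
               COUNT(v.visit_id)   AS visit_count  -- merging: aggregate joined visits
        FROM patients AS p
        LEFT JOIN visits AS v ON v.patient_id = p.patient_id
        WHERE p.name IS NOT NULL                   -- cleaning: drop rows with no name
        GROUP BY p.patient_id, p.name
        ORDER BY p.patient_id
    """
    return conn.execute(query).fetchall()


rows = fetch_visit_counts(build_demo_db())
print(rows)  # [(1, 'alice', 2), (2, 'bob', 1)]
```

In practice the same pattern usually runs against a shared warehouse rather than an in-memory database, but the skill is the same: writing the join and the cleaning rules yourself instead of waiting for someone else to hand over a tidy table.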

Analytics & Insights

  • Answering questions and discovering useful patterns/trends in data that can help with operations and/or decision-making

Note: unlike in academia, discovering or creating new knowledge for its own sake, without a clear application to the business, is often unhelpful (and de-prioritized).

  • Designing, helping administer, and analyzing experiments to test important hypotheses
  • Creating appropriate metrics for quantifying success (i.e. Key Performance Indicators, KPIs)
  • Generating automated reports/dashboards/data visualizations of KPIs and other data related to business operations

Applied artificial intelligence/machine learning (AI/ML)

  • Generating and deploying algorithms (that learn from data) to create a new or enhance an existing product/service

Some examples (non-exhaustive list):

  • Predicting unwanted outcomes and prioritizing early interventions (e.g. predicting churn/drop out)
  • Matching users to products/services (i.e. recommendation systems)
  • Forecasting trends (e.g. supply, demand, capacity, efficiency, etc.)
  • Classifying/processing high dimensional input (e.g. image classification, video/audio classification/processing/generation)
  • Deriving insights from/summarizing unstructured data (e.g. natural language processing)

Methods

Some common methods (non-exhaustive list) that data scientists employ to perform functions [related skills in brackets, full list below]:

  1. Database knowledge and data retrieval [B]
  2. Data pipelines & ETL (Extract, Transform, Load) [B, some A]
  3. Programmatic data cleaning and analysis [B, some A]
  4. Programmatic data visualization [A, B, E, some C, F]
  5. Causal inference from observational data (e.g. propensity score matching) [A, C, some B]
  6. Experimental design and analysis/statistical tests (e.g. A/B test, chi-square vs. t-test, etc.) [A, C, E, some B, F]
  7. Applied machine learning (e.g. regression, tree ensembles, neural networks/deep learning, clustering, etc.) [A, B, some C, F]

Note: skills D (domain knowledge) and G (project management) can apply to all methods/projects.
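As one illustration of method 6, here is a minimal sketch of analyzing an A/B test with a chi-square test, which suits a yes/no outcome such as "converted or not" (a t-test is the more usual choice for a continuous metric such as average spend). The function name and the example counts are made up for illustration, and the sketch assumes a plain 2x2 Pearson chi-square test without continuity correction, using only the Python standard library.

```python
import math


def chi_square_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-square test (no continuity correction) for a 2x2 A/B table.

    conv_a / conv_b: number of conversions in groups A and B.
    n_a / n_b: total number of users in groups A and B.
    Returns (statistic, p_value).
    """
    observed = [
        [conv_a, n_a - conv_a],  # group A: converted, not converted
        [conv_b, n_b - conv_b],  # group B: converted, not converted
    ]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]

    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if conversion were independent of group
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical experiment: 100/1000 conversions in A vs. 130/1000 in B
stat, p_value = chi_square_2x2(100, 1000, 130, 1000)
print(f"chi-square = {stat:.3f}, p = {p_value:.3f}")
```

A real analysis would more likely rely on a statistics library and would also involve deciding the sample size and success metric before the experiment starts, which is where skills C and E come in.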

Skills

A concise list of generic skills (technical & interpersonal) that are the foundations of common methods used by data scientists:

Technical

A. Math, statistics & probability

B. Computer science, programming, software development

C. Experimental design, causal inference

D. Domain knowledge (relevant to the specific company)

Interpersonal

E. Communication with non-technical colleagues

F. User-centered design

G. Project management (for self and for a team)

Tools

Common software tools (non-exhaustive list) [related methods in brackets, full list above]:

  • Python / R [3, 4, 5, 6, 7, some 1, 2] (& Jupyter Notebook / RStudio)
  • SQL [1, 2, some 3]
  • Other database languages [1, 2, some 3]
  • Git/GitHub [2, 3, 4, 7]
  • Business Intelligence (e.g. Spreadsheets/Tableau/Looker/Cognos etc.) [some 1, 2, 3, 4]

Definitions of terms:

Note: there is no consensus in the field on these terms or what they encompass, so do not be surprised if other data scientists or job descriptions organize them differently. However, I believe this type of organization can help you understand how well your past experience/skills match up with a given job description, interview question, project, etc.

  • Functions: The activities that data scientists perform on a regular basis.
  • Methods: The approaches/procedures that underlie functions. Typically these are built upon one or more skills and involve tools.
  • Skills: The (generic) abilities that underpin the relevant methods and enable performance of functions.
  • Tools: Common software tools that data scientists use.

Next, Part II: What engineering PhDs do well & common gaps

--

Nasir Bhanpuri, PhD

AI at Virta Health where I use data science to solve challenges in healthcare/medicine. I also use DS for sports, education, and music.