Transitioning from Engineering PhD to Data Scientist in Industry, Part I
Part I: What data scientists do
Around six years ago, I began a career transition from biomedical engineering research in an academic lab to working as a data scientist in the healthcare industry. Since then, I have advised dozens of other folks pursuing a similar career trajectory. Based on my own experience and on observing many of my peers, I’ve concluded that most engineering PhDs are well-equipped to be data scientists in industry and can make the transition after filling a few common gaps. (More specifically, I believe biomedical engineers are well-equipped to be clinical data scientists, but I won’t go into detail on those subfields here.) Because there is demand for data scientists and an oversupply of engineering PhDs searching outside of academia, I synthesized some notes to help folks figure out whether a data science career sounds appealing and what they need to work on to be competitive applicants. In this post, I describe typical skills/methods that companies are looking for, many of which overlap with engineering PhD experience. In the next post (Part II: What engineering PhDs do well & common gaps), I break down, item by item, the typical strengths and weaknesses of engineering PhDs and suggest resources to help fill gaps.
Functions
To secure a position and thrive as a data scientist in industry, one typically must be able to perform two or more of the following functions with minimal guidance:
Data fetching, munging/manipulation & organization
- Data fetching (e.g. database querying, understanding database organization, etc.)
- Data cleaning & merging (and doing it quickly)
- In a nutshell: data retrieval self-service from common data stores (i.e. not relying on others to fetch/clean data for analytics/modeling)
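As a rough illustration of what data retrieval "self-service" can look like, here is a minimal sketch using Python's built-in sqlite3 module. The tables, column names, and rows are hypothetical and invented for this example; a real company's data store, schema, and SQL dialect will differ.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, site TEXT);
    CREATE TABLE visits (
        visit_id INTEGER PRIMARY KEY,
        patient_id INTEGER,
        visit_date TEXT,
        cost REAL
    );
    INSERT INTO patients VALUES (1, ' Boston'), (2, 'boston'), (3, 'Denver');
    INSERT INTO visits VALUES
        (10, 1, '2023-01-05', 120.0),
        (11, 1, '2023-02-11', NULL),
        (12, 2, '2023-01-20', 80.0),
        (13, 3, '2023-03-02', 200.0);
""")

# Fetch, clean, and merge in a single query:
# - JOIN merges each visit with its patient's attributes
# - TRIM/LOWER normalizes inconsistently entered site labels
# - WHERE drops visits whose cost is missing
query = """
    SELECT LOWER(TRIM(p.site)) AS site_clean,
           COUNT(*)            AS n_visits,
           AVG(v.cost)         AS mean_cost
    FROM visits AS v
    JOIN patients AS p ON p.patient_id = v.patient_id
    WHERE v.cost IS NOT NULL
    GROUP BY LOWER(TRIM(p.site))
    ORDER BY site_clean
"""
rows = conn.execute(query).fetchall()
conn.close()

for site_clean, n_visits, mean_cost in rows:
    print(f"{site_clean}: {n_visits} visits, mean cost {mean_cost:.2f}")
```

The point is less the specific query than the workflow: understanding how the tables relate, pulling the data yourself, and handling messy labels and missing values before any analysis begins.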
Analytics & Insights
- Answering questions and discovering useful patterns/trends in data that can help with operations and/or decision-making
Note: unlike in academia, discovering or creating new knowledge without a clear application to the business is often unhelpful (and de-prioritized).
- Designing, helping administer, and analyzing experiments to test important hypotheses
- Creating appropriate metrics for quantifying success (i.e. Key Performance Indicators, KPIs)
- Generating automated reports/dashboards/data visualizations of KPIs and other data related to business operations
Applied artificial intelligence/machine learning (AI/ML)
- Generating and deploying algorithms (that learn from data) to create a new or enhance an existing product/service
Some examples (non-exhaustive list):
- Predicting unwanted outcomes and prioritizing early interventions (e.g. predicting churn/dropout)
- Matching users to products/services (i.e. recommendation systems)
- Forecasting trends (e.g. supply, demand, capacity, efficiency, etc.)
- Classifying/processing high-dimensional input (e.g. image classification, video/audio classification/processing/generation)
- Deriving insights from/summarizing unstructured data (e.g. natural language processing)
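To make the first example in the list above more concrete, here is a minimal sketch of churn prediction used to prioritize outreach. It fits a small logistic regression by gradient descent in plain Python. The features, labels, user names, learning rate, and epoch count are all hypothetical choices for illustration; in practice one would more likely reach for an established ML library and far more data.

```python
import math

# Hypothetical training data, invented for illustration.
# Features: (logins in the last 30 days / 10, support tickets filed)
# Label: 1 = the user churned, 0 = the user stayed
X = [(0.1, 3.0), (0.2, 2.0), (0.3, 3.0), (1.5, 0.0), (2.0, 1.0), (2.5, 0.0)]
y = [1, 1, 1, 0, 0, 0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Return the predicted probability of churn for one user."""
    return sigmoid(bias + sum(w * f for w, f in zip(weights, features)))


def fit(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression with batch gradient descent on log loss."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for features, label in zip(X, y):
            error = predict(weights, bias, features) - label
            grad_b += error
            for j, f in enumerate(features):
                grad_w[j] += error * f
        bias -= lr * grad_b / len(X)
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
    return weights, bias


weights, bias = fit(X, y)

# Score current (hypothetical) users and rank them by predicted churn risk,
# so that early interventions can target the riskiest users first.
active_users = {"user_a": (0.2, 4.0), "user_b": (2.2, 0.0), "user_c": (1.0, 1.0)}
risk = {name: predict(weights, bias, f) for name, f in active_users.items()}
ranked = sorted(risk, key=risk.get, reverse=True)
print(ranked)
```

The ranked list, rather than the raw probabilities, is often what the business acts on: it tells an operations team whom to contact first.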
Methods
Some common methods (non-exhaustive list) that data scientists employ to perform functions [related skills in brackets, full list below]:
1. Database knowledge and data retrieval [B]
2. Data pipelines & ETL (Extract, Transform, Load) [B, some A]
3. Programmatic data cleaning and analysis [B, some A]
4. Programmatic data visualization [A, B, E, some C, F]
5. Causal inference from observational data (e.g. propensity score matching) [A, C, some B]
6. Experimental design and analysis/statistical tests (e.g. A/B test, chi-square vs. t-test, etc.) [A, C, E, some B, F]
7. Applied machine learning (e.g. regression, tree ensembles, neural networks/deep learning, clustering, etc.) [A, B, some C, F]
Note: D (domain knowledge) and G (project management) can apply to all methods/projects.
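As one small example of the experimental analysis mentioned in the list above, here is a sketch of analyzing an A/B test on conversion rates. It uses a two-proportion z-test, which for a 2x2 table is equivalent to a chi-square test without continuity correction. The function name and the conversion counts are hypothetical, chosen only to illustrate the calculation.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (B vs. A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 2,000 users per arm,
# 200 conversions in control (A) and 260 in treatment (B).
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Choosing the test is part of the skill: a test on proportions suits binary outcomes such as converted vs. not converted, while a t-test is the more usual choice for comparing means of a continuous metric.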
Skills
A concise list of generic skills (technical & interpersonal) that are the foundations of common methods used by data scientists:
Technical
A. Math, statistics & probability
B. Computer science, programming, software development
C. Experimental design, causal inference
D. Domain knowledge (relevant to the specific company)
Interpersonal
E. Communication with non-technical colleagues
F. User-centered design
G. Project management (for self and for a team)
Tools
Common software tools (non-exhaustive list) [related methods in brackets, full list above]:
- Python / R [3, 4, 5, 6, 7, some 1, 2] (& Jupyter Notebook / RStudio)
- SQL [1, 2, some 3]
- Other database languages [1, 2, some 3]
- Git/GitHub [2, 3, 4, 7]
- Business Intelligence (e.g. Spreadsheets/Tableau/Looker/Cognos etc.) [some 1, 2, 3, 4]
Definitions of terms:
Note: there is no consensus in the field on these terms or on what they encompass, so do not be surprised if other data scientists or job descriptions organize them differently. However, I believe this type of organization can help you understand how well your past experience/skills match up with a given job description, interview question, project, etc.
- Functions: The activities that data scientists perform on a regular basis.
- Methods: The approaches/procedures that underlie functions. Typically these are built upon one or more skills and involve tools.
- Skills: The (generic) abilities that underpin relevant methods and enable performance of functions.
- Tools: Common software tools that data scientists use.