Data science has become a highly sought-after capability for businesses, academics, and organizations. The ability to take data and analyze it to detect insightful patterns or add value to a product can be incredibly beneficial. As of 2023, the field of data science has changed drastically. A variety of tools now automate processes, and coding isn’t always required on the job. This may lead some to believe that direct training in automated data analysis, visualization, or storage tools should be the emphasis of a data science curriculum, but every data scientist still needs a full understanding of the underlying mathematics, programming skills, domain expertise, and communication skills required to truly succeed in the field. Here, we break down a rudimentary framework for how you can self-study to prepare for a career in data science and what you should focus on learning.
1. Computer Fluency and Programmatic Skills
Data scientists work with data, and most data are stored in an electronic format nowadays. A simple place to start is to make sure you know how to operate a computer efficiently. As absurd as it may seem, computer fluency is actually declining amongst the population. The age of “apps” has meant that most processes a person would do on their computer are taken care of for them. Knowing how to navigate file folders, name files properly, and use a computer’s terminal and command-line commands is crucial for getting tasks done quickly and efficiently.
Another thing to understand is hardware. Most calculations done in data science require a decent CPU (central processing unit) and enough RAM (random access memory). Some methodologies use the GPU (graphics processing unit), but that is not common across most data science jobs. If the computer a data scientist is working on doesn’t have the CPU or RAM capacity to process the data, then programs and code will hang and either take extremely long to run or crash. Lastly, a lot of data scientists work in the cloud now, so sometimes all it takes is the click of a button to expand hardware capacity.
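To make the RAM point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a dense table of 8-byte numeric values, and the function names and the 50% headroom rule are illustrative assumptions rather than a standard. Real tools add their own overhead, so treat the result as a lower bound.

```python
def estimated_memory_gib(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> float:
    """Rough in-memory size of a dense numeric table, in GiB.

    Assumes every cell is a fixed-width number (8 bytes = 64-bit float).
    """
    return n_rows * n_cols * bytes_per_value / 1024**3


def fits_in_ram(n_rows: int, n_cols: int, ram_gib: float, headroom: float = 0.5) -> bool:
    """True if the table uses at most `headroom` (a fraction) of the available RAM.

    The 0.5 default is an assumed rule of thumb that leaves room for
    intermediate copies made during analysis.
    """
    return estimated_memory_gib(n_rows, n_cols) <= ram_gib * headroom


# Hypothetical example: 100 million rows, 20 numeric columns, 16 GiB laptop.
size = estimated_memory_gib(100_000_000, 20)
print(f"Estimated size: {size:.1f} GiB, fits comfortably: {fits_in_ram(100_000_000, 20, 16)}")
```

If the estimate is close to or above the machine’s RAM, that is usually the signal to sample the data, process it in chunks, or move to a larger cloud instance.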
Understanding file structures, the command line interface, computer hardware, and how to navigate the OS and access root-level files is a good skill set to have.
2. Math (Statistics, Algebra, Calculus)
It’s not necessary to be a master of these subjects, but it is necessary to have an intermediate understanding of algebra, statistics, and some calculus.
When it comes to data science and calculus, there aren’t a ton of computations you’ll be doing day to day. Machine learning algorithms do rely on calculus, however, so it is important to understand if your plan is to build predictive algorithms. Gradient descent is one of the most popular algorithms that uses calculus. Since most tools do these types of operations for the user, it’s not necessary to be able to do the calculations on paper, but it is necessary to have a good understanding of how the math behind the tool works.
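As an illustration of where the calculus shows up, here is a minimal sketch of gradient descent in plain Python. It minimizes the toy function f(x) = (x - 3)^2, whose derivative is 2(x - 3). The function, starting point, learning rate, and step count are all assumptions chosen for the example, and real machine learning libraries apply the same idea to many parameters at once.

```python
def gradient(x: float) -> float:
    """Derivative of f(x) = (x - 3)**2, obtained with basic calculus."""
    return 2 * (x - 3)


def gradient_descent(start: float, learning_rate: float = 0.1, steps: int = 100) -> float:
    """Repeatedly step against the gradient to approach the minimum of f."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)  # move downhill by a small amount
    return x


minimum = gradient_descent(start=0.0)
print(f"Estimated minimum at x = {minimum:.4f}")  # the true minimum is at x = 3
```

The derivative tells the algorithm which direction is downhill, and the learning rate controls how large each step is. Too large a rate can overshoot the minimum, while too small a rate makes convergence slow.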
Algebra is a must if you want to be a data scientist. Not only is basic algebra important, but it’s also crucial to have a good understanding of linear algebra, which covers matrix arithmetic (addition, subtraction, multiplication, and inversion, which is the matrix counterpart of division). Order of operations, unit conversions, and manipulating equations to derive a result are all necessary to the job.
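The following sketch shows basic matrix arithmetic in plain Python, using nested lists as matrices. Matrices have no direct division, so multiplying by an inverse plays that role. The helper names are illustrative assumptions, and the inverse is limited to the 2x2 case to keep the example short. In practice a library such as NumPy would handle this.

```python
def mat_add(a, b):
    """Element-wise sum of two matrices with the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def mat_mul(a, b):
    """Matrix product: each entry is a row of `a` dotted with a column of `b`."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def inverse_2x2(m):
    """Inverse of a 2x2 matrix, the matrix analogue of dividing by a number."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular and has no inverse")
    return [[d / det, -b / det], [-c / det, a / det]]


# Solve the system 2x + y = 5 and x + 3y = 10 by "dividing" by the matrix A.
A = [[2.0, 1.0], [1.0, 3.0]]
rhs = [[5.0], [10.0]]
solution = mat_mul(inverse_2x2(A), rhs)
print(solution)  # approximately x = 1, y = 3
```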
Statistics is the bread and butter of a data scientist’s toolkit, and anyone who doesn’t know algebra will probably struggle with statistics. Probability, measures of central tendency, bias and margins of error, ratios, correlation, regression, and much more are all statistical concepts that need to be understood.
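A few of these concepts can be sketched with Python’s standard library. The example below computes measures of central tendency, a Pearson correlation, and a simple least-squares regression line. The data values and function names are made up for illustration, and the formulas are the textbook definitions rather than any particular tool’s implementation.

```python
from statistics import mean, median, stdev


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sum((x - mx) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - my) ** 2 for y in ys) ** 0.5
    return covariance / (spread_x * spread_y)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return slope, intercept


# Hypothetical data: hours studied versus test score.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 65, 71, 80]

print("mean:", mean(scores), "median:", median(scores), "stdev:", round(stdev(scores), 2))
print("correlation:", round(pearson_r(hours, scores), 3))
print("regression (slope, intercept):", fit_line(hours, scores))
```

Notice that both helper functions are just algebra on sums and means, which is why a solid footing in algebra makes the statistics easier to follow.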
3. Domain Expertise
Domain expertise is the understanding of the subject to which you are applying data science principles and methods. There is no single field a data scientist works in. They can be analyzers of environmental, financial, health, or qualitative historical data (and the list goes on). Firm knowledge of the given field is necessary to select appropriate features, build effective models, validate and interpret results, and ultimately provide insights that are relevant and actionable in the context of the specific domain. For example, if your specialty is financial data, then you probably will not be able to handle, interpret, or validate results from an environmental data-driven study. If you’ve never heard of a shapefile (a data format for geographical information), then you probably shouldn’t be doing spatial statistics and analytics.
To acquire domain expertise, pick a field and maybe a subtopic within a field, and learn the ins and outs of it. Learn the theory, applied knowledge, and practice of that field. Learn the overall system and what variables and inputs influence changes. That’s a great way to become intimate with a subject and eventually earn domain expert status.
4. Communication Skills (Emotional Intelligence)
This is one of the hardest skills for a data scientist to acquire because it is hard to teach and very easy to ignore while on the job.
While data scientists work with computers all day, they very often have to communicate the value of their work to stakeholders, managers, and other team members (or even the general public). When it comes down to it, most people aren’t familiar with statistical jargon such as p-values, R-squared values, correlation strength (or how to properly interpret correlation results), and distribution graphs (normal, Poisson, binomial, etc.).
Being able to communicate the main findings from a data study in a concise and direct way will ultimately allow a data scientist to have the most impact. Additionally, being able to take feedback from the audience will allow a data scientist to modify assumptions in models to better fit the needs of the client.
5. Critical Thinking
The last, but certainly not least, important skill for a data scientist is the ability to think critically. This point goes along with point 4, but is much more refined. To be a data scientist you need to be able to connect the importance and necessity of the data you are using to the problem you are trying to solve. You need to understand how to build workflows and pick the proper tools for modeling or analyzing a problem. You need to be good at both inductive and deductive reasoning. Nowadays many data scientists use no-code tools for their workflow. Can you think about what deficiencies in a no-code tool may keep you from reaching the proper answer? Can you consider changes in data collection methods, biases of certain models, and limitations of data representation that may or may not influence the overall outcome of a result?
These are very important skills, and unfortunately the only way to gain them is to DO and THINK data science.
Summary
Data scientists need a combination of technical and mathematical skills, research skills, and communication and teamwork skills. Technical skills include programming, statistical analysis, and data visualization. Soft skills such as collaboration, public speaking, and analytical skills are also important.
The five critical skills for being a data scientist are domain expertise, critical thinking, math, computer fluency and programming, and communication. Domain expertise is essential for data scientists to understand the context of the data, select appropriate features, build effective models, validate and interpret the results, and ultimately provide insights that are relevant and actionable in the context of the specific domain. Critical thinking is important for data scientists to objectively analyze questions, hypotheses, and results. Math is important for data scientists to understand the fundamentals of data science, machine learning, and artificial intelligence as a whole. Computer fluency and programming are important for data scientists to maneuver and wrangle massive amounts of data to make sense of it all. Communication is important for data scientists to present their findings.