A statistics graduate student asked me some questions about how to prepare to be a “Big Data” era statistician. Since this is not the first time I have been asked questions of this kind, I decided to put them all together, in the hope that they will help others who are interested in analytical work. I started working in industry right after my PhD, so what follows is from an industry point of view, not an academic one. All of the questions are great, and none has a right or wrong answer. I just give my opinions based on personal experience. Many professionals in statistics have also weighed in on such questions:
David Donoho wrote an AWESOME article, “50 years of Data Science”. The article reviews the current state of “Data Science” and discusses how, and whether, Data Science is really different from Statistics.
Statistics is the least important part of data science (Andrew Gelman)
Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work (Harlan Harris, Sean Murphy, and Marck Vaisman)
……
Q1 How could I prepare myself to be a “Big Data” era statistician?
I think the term “Big Data” has been overused, which has created plenty of bubbles. Everyone is talking about big data, but no one can explain exactly what it is. It is fun to look up trends for certain words on Google occasionally, but that does not help solve real-life problems. A lot of data on its own is worthless. It isn’t the size of the data that’s important. It’s what you do with it. The big data skills that so many are touting today are not skills for better solving the real problem of inference from data. As David Donoho pointed out in his article “50 years of Data Science”:
……they are coping skills for dealing with organizational artifacts of large-scale cluster computing……the range of easily constructible algorithms shrinks dramatically compared to the single-processor model, so one inevitably tends to adopt inferential approaches which would have been considered rudimentary or even inappropriate in old times. Such coping……deforms our judgements about what is appropriate, and holds us back from data analysis strategies that we would otherwise eagerly pursue. Nevertheless, the scaling cheerleaders are yelling at the top of their lungs that using more data deserves a big shout.
A science doesn’t just spring into existence simply because a deluge of data will soon be filling telecom servers.
I couldn’t agree with David more. So I will instead talk about how to prepare to be a statistician and ignore “Big Data”.
“Statistician” is a very general word, and we need to specify which sub-field of statistics a statistician works in. You can refer to the American Statistical Association’s website, under “Which Industries Employ Statisticians?” If you click the industry you are interested in, you will see a page explaining how statistics fits into that industry. After reading those pages, you should have an idea of which statistical skills are required in your target area.
Q2 What’s the direction of the development of statistics?
This question is too big, and I am not knowledgeable enough to answer it. Since I have been working in marketing, I will talk about what I think is promising in marketing statistics.
Use Natural Language Processing/Text Mining techniques to extract information from unstructured data such as customer comments, forum posts, and emails. It is a less biased and more economical way to study customer perception and understand high-level market structure.
Learn the domain you work in and fit statistical techniques into its context. The value your statistical skills bring to the industry increases as your experience grows. At the end of the day, you will need to add a modifier to your title. For example, I was a Biostatistician before and am now a Marketing Data Scientist. Your statistical skills, domain knowledge, and soft skills together decide how irreplaceable you are in the organization.
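To make the text mining point above a little more concrete, here is a minimal sketch of a very first step: counting how many customer comments mention each term. The comments, the stop-word list, and the function names are all made up for illustration; a real project would use a proper NLP library and much more careful preprocessing.

```python
import re
from collections import Counter

# Hypothetical customer comments, invented for illustration only.
comments = [
    "The battery life is great but the screen is too dim.",
    "Great battery, terrible customer service.",
    "Screen is dim and the battery drains fast.",
]

# A tiny hand-picked stop-word list (an assumption; real projects use fuller lists).
STOP_WORDS = {"the", "is", "but", "too", "and", "a"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens, minus stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def document_frequency(docs):
    """Count the number of documents in which each term appears at least once."""
    counts = Counter()
    for doc in docs:
        # Using a set means a term counts once per comment, however often it repeats.
        counts.update(set(tokenize(doc)))
    return counts


term_counts = document_frequency(comments)
print(term_counts.most_common(3))
```

Even a crude count like this shows which topics customers bring up most often. Deciding what those topics mean for the business is where the domain knowledge comes in.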
Q3 How could statisticians stay competitive in the job market (both industry and academia) compared with computer scientists, and show our uniqueness from mathematicians? At the center of these questions is the statistician’s current identity crisis in the data science industry. This may prompt another question, though: what “data science” really refers to.
You are right that there is a lot of confusion around Data Scientist, Statistician, Business/Financial/Risk (etc.) Analyst, BI professional…… That is because of the obvious overlap among these roles. It took me two years to get through the identity crisis myself. Now I see data science as a discipline for making sense of data. To make sense of data, statistics is indispensable. Meanwhile, a data scientist needs many other skills. The article “What is a data scientist?” summarizes the differences among these roles. It provides very nice skill lists for the different roles and comparisons among them. Some of my comments:
It is good to think about what makes you competitive and to plan for that. But don’t get too bogged down in it. Ask yourself what you like to do and what makes you happy. If you are doing things you love, you are very likely to be more competitive than many others.
Mathematical background sets the limit of your role as a scientist. Today’s data science in the business world is, most of the time, just marketing hype, and the skills involved are far below the level of science. It is just a whole bunch of number selection, adding, subtracting, averaging …… think about every year’s tax return and I hope you get the point. I really don’t want to be too sarcastic about some business consulting companies. But if you want to earn a fairly decent salary by doing high school math, such a company is a great choice. Of course, you need to be able to TALK well. It seems that I have digressed from the topic….. Let’s get back. Since you can’t change the culture of a company, finding the right company becomes very important if you really want to do both science and art. Besides being a tool, mathematics is also a way to train your brain, one that I am familiar with and enjoy. Whatever is in your brain will give you security; at least it works for me.
I don’t think it is necessary to kill too many brain cells trying to distinguish ourselves from mathematicians. In spite of significant overlap, statistics and mathematics have a crucial difference: statistics aims to solve real-life problems and needs to fit into a context to bring value; mathematics is a philosophical game and can stand by itself.
Also, no matter what you do in the future, working for 10 years doesn’t equal 10 years of working experience. Many people just work for one year and then repeat that first year for many years after. That is certainly one thing I try my best to avoid. The most important and far-reaching thing you can get from a PhD program is learning how to learn. It is great and necessary to prepare ahead of time. But uncertainty is the most certain thing in life. We can never be fully prepared, but we can always be preparing and learning. Being a lifelong learner is the best way to prepare yourself to be a future statistician (and many other things). Good luck!
[I will add more questions later]