Data Scientist, Data Miner, Statistician or all of the above?

It was not so long ago that Steve Lohr quoted economist Hal Varian, Chief Economist at Google at the time, as saying that “the sexy job in the next 10 years will be statisticians.” Lohr spoke of a “new breed of statisticians (…) that use powerful computers and sophisticated mathematical models to hunt for meaningful patterns and insights in vast troves of data. The applications are as diverse as improving Internet search and online advertising, culling gene sequencing information for cancer research and analysing sensor and location data to optimise the handling of food shipments.” But then Lohr went on to say: “though at the fore, statisticians are only a small part of an army of experts using modern statistical techniques for data analysis (…), the new data sleuths come from backgrounds like economics, computer science and mathematics.”

Five years later, in 2014, Lohr clarified what he meant, baptized the sleuths as data scientists, and broadcast that the sexy part of the job, the data analysis and discovery, ensues only after spending as much as 80 percent of the time coding and programming to clean and prepare the data for analysis. This is not unfamiliar to statisticians, who know that data analysis has always involved a large amount of data cleaning, often delegated to the lower ranks of the statistics career (the data entry and data cleaning career). But that division of labor has now reached unprecedented dimensions. The large amounts of data brought about by the internet have resulted in startup companies specialising in cloud computing, software engineering and coding to prepare heterogeneous masses of data from the web, sensors, smartphones and corporate databases for machine learning and statistical analysis. Being a “janitor” of data (the old data management job) is now as lucrative a career as being a statistician. But being both, and also being a data miner, is priceless, whether the employer is a biotech company, the ONS or Google.

Data science is an old term that, according to Wikipedia, became popular when DJ Patil and Jeff Hammerbacher used the term “data scientist” to define their jobs at LinkedIn and Facebook, respectively. DJ Patil was recently named Chief Data Scientist of the White House in the United States, due in large part to his being, according to Forbes magazine, one of the top 7 data scientists in the U.S. The other six are indeed an army of experts from backgrounds like economics, computer science, mathematics and health sciences. Patil, in a memo to the American people, defined data science as “the ability to extract knowledge and insights from large and complex data sets,” with social media, search, and e-commerce being the areas benefitting most from this. He explains that the role of an organization’s CDO (Chief Data Officer) or CDS (Chief Data Scientist) is to help the organisation acquire, process, and leverage data in a timely fashion to create efficiencies, iterate on and develop new products, and navigate the competitive landscape. The job title was created before there was a CIP (Classification of Instructional Programs) code for data science. Perhaps that is why we find job advertisements for Statistics positions that require the same skills as Data Science positions.

Job seekers with statistics training have been wondering what role Statistics plays in data science. Fearful of being left behind, professional statistics groups have defined data science for their constituencies. For example, a recent statement by the American Statistical Association (ASA) on the role of Statistics in Data Science says: “While there is not yet a consensus on what precisely constitutes data science, three professional communities, all within computer science and/or statistics, are emerging as foundational to data science: (i) Database Management enables transformation, conglomeration, and organization of data resources; (ii) Statistics and Machine Learning convert data into knowledge; and (iii) Distributed and Parallel Systems provide the computational infrastructure to carry out data analysis.” (…) At its most fundamental level, the ASA says, “we view data science as a mutually beneficial collaboration among these three professional communities, complemented with significant interactions with numerous related disciplines. For data science to fully realise its potential requires maximum and multifaceted collaboration among these groups.” Thus, “the next generation of statistical professionals needs a broader skill set and must be more able to engage with database and distributed systems experts (…), there will be an ever-increasing demand for such ‘multi-lingual’ experts.”

It appears, then, that data sleuths, multi-lingual experts and data scientists do what applied statisticians always did, but at a much larger scale, and with the complexity of the digital world added to the data cleaning and data analysis cycle. Big data has made the task bigger and more complex. For those who are not sure where they fit, there is the survey at http://survey.datacommunitydc.org, which was used by Harlan D. Harris, Sean Patrick Murphy, and Marck Vaisman in 2012 to find the attributes seen in data science practitioners today, and their experiences in the job market. Those authors think that terms like “data scientist”, “analytics” and “big data” are the result of what one might call a “buzzword meat grinder” that often results in a lack of clarity about what is expected from job candidates. Their survey revealed that there is a variety of data scientists. They all do machine learning or big data, math, programming, statistics and data business. However, they do so to different extents depending on the context of their job. Some are more focused on researching data using a great deal of statistics (now called analytics by many), others do a lot of machine learning and big data handling, others do a lot of programming, and others are more focused on the business part. The difference between being a data scientist and a statistician may lie only in whether the employer is looking for a person who will do it all, or a person to join a team that embraces a variety of data scientists, each doing one of the tasks. The job is more likely to be well defined in the latter case and not very well defined in the former. Job seekers, beware and ask for details!

From an economic point of view, it appears that keeping job descriptions vague may be a profit-maximising strategy. Data science seems to have arisen from the need to capture the constant inflow of data that arrives at businesses and governments, manage that data, mine it, analyse it using statistics and machine learning, extract knowledge from it using large-scale computing, and communicate what is learned from this process. Twitter, for instance, is a prototypical example of what people call “Big Data.” According to Nicole Lazar, Twitter generates masses of information every second, information that is unstructured, with strings of text that have to be mined for meaning (not unlike the surveillance data monitored by governments constantly or Amazon data). There can be several “data scientists” looking at the number of users, tweets over a given time span, distribution of users around the world, subjects of tweets, and trends in subject. A full-fledged data scientist who understands and can do all the tasks would cost less than having specialized data miners, data janitors and data analysts. This economic principle extends from Amazon to the smallest theater company tracking attendees. The bigger employers (e.g., Facebook, LinkedIn, Google, Amazon, Twitter, National Defense) tend to have a diversified and specialised group, while the smaller ones may tend to hire just one person to do it all. But the trend in job advertising is going in the direction of playing it safe and asking for data scientists or business analysts. Job seekers, look carefully to see if your skills are requested.

 

We should not forget, however, that although big data is behind much of the data science phenomenon, it is not all that is needed for data science. The Cincinnati Shakespeare Company sells 25,000 tickets every year for 10 different productions. That is small data. Xinping Zhang, Byran J. Smucker and Jay Woffington show how a statistician with basic statistical skills used the company’s data to give more effective advance notice of possible shortfalls or windfalls. True, a job like that is now called “predictive analytics,” and true, were the company to go digital, the statistician would have to learn to mine the web to achieve the same goal and keep the job. To give another example, there are many medical companies that need to test a drug on a small number of patients and require statisticians who can design a clinical trial, manage the follow-up and analyse the data. Were such a company to do surveillance of populations using an app, the statistician would have to create some software to extract the data and transform it for analysis.

Whether at a large or small scale, extracting knowledge from data is what lies behind a “data science”, “analytics”, “big data”, “small data” or simply statistics job (although there are fewer and fewer job ads asking for a “statistician”). They all are likely to require data gathering and cleaning, data exploration, and statistics. Job candidates should ask how much of each, and at what level of each, at least until job descriptions get more specific. And do not forget to ask: “What expert domain knowledge do I need?” and “What type of data do you have?” But also, whether the job is a Statistician, a Data Scientist or an Analytics job, be prepared to sound knowledgeable about the following items put together by the National Science Foundation, particularly if the words Big Data, Big Questions, Analytics or Data Science appear in the job ad (and perhaps stay away from these terms if they don’t):

  • Reproducibility, replicability, and uncertainty quantification
  • Data confidentiality, privacy, and security issues as they relate to Big Data
  • Generating hypotheses, explanations, and models from data
  • Prioritizing, testing, scoring, and validating hypotheses
  • Interactive data visualization techniques
  • Scalable machine learning, statistical inference, and data mining
  • Eliciting causal relations from observations and experiments
  • Addressing foundational mathematical and statistical principles at the core of the new BIGDATA technologies

If that sounds like much, compare job ads (like those given in the Appendix). Statistics is center stage in all job ads that involve data, big or small. But it is clear that employers that have web sites, be it in Business, Government, or Science, want employees to transform that data into knowledge that advances their respective goals.

References
==========

ASA, Data science undergraduate degrees http://magazine.amstat.org/blog/2015/07/01/new-undergraduate-data-science-programs/

ASA, August 8, 2015 http://magazine.amstat.org/blog/2015/10/01/asa-statement-on-the-role-of-statistics-in-data-science/

Berkeley, Professional Master in Information and Data Science at Berkeley https://datascience.berkeley.edu

Columbia University Master of Data Science http://datascience.columbia.edu/master-of-science-in-data-science

Coursera online offerings on data science https://www.coursera.org/specializations/jhudatascience

Thomas H. Davenport and D.J. Patil, Data Scientist: The Sexiest Job of the 21st Century, Harvard Business Review, October 2012 https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/

Forbes Magazine   https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/

Harlan D. Harris, Sean Patrick Murphy, and Marck Vaisman, Analyzing the Analyzers, O’Reilly, 2013 http://www.oreilly.com/data/free/files/analyzing-the-analyzers.pdf

Harvard online course on Big data analysis http://www.online-learning.harvard.edu/course/big-data-analytics

Harvard Data science web page http://statistics.fas.harvard.edu/datascience

Lazar, Nicole. Now Trending on Twitter. Chance, Vol 28.2, 2015.

Steve Lohr, New York Times, August 6, 2009, For Today’s Graduate, Just One Word: Statistics.

Steve Lohr, New York Times, August 17, 2014, For Big Data Scientists, ‘Janitor Work’ is Key Hurdle to Insights. http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html

Southern Methodist University Master of Data Science http://www.datascience@smu

New York University, http://cds.nyu.edu/academics/ms-in-data-science/

Other site listing schools with degrees in data science, analytics, big data. http://101.datascience.community/2012/04/09/colleges-with-data-science-degrees/

Patil, https://www.whitehouse.gov/blog/2015/02/19/memo-american-people-us-chief-data-scientist-dr-dj-patil

Ranking of Master Degrees in Data Science http://www.mastersindatascience.org/schools/23-great-schools-with-masters-programs-in-data-science/

Ranking of Master in Business Analytics degrees. http://www.mba.com/us/plan-for-business-school/decide-to-go/specialized-masters-programs/big-data-programs.aspx

Stanford University’s data science track in the Master of Science Program. https://statistics.stanford.edu/academics/ms-statistics-data-science

Michael Vogelius, Nandini Kannan, and Xiaoming Huo, NSF Division of Mathematical Sciences, NSF Big Data Funding Opportunity for the Statistics Community

Xinping Zhang, Byran J. Smucker, Jay Woffington. Statistics and Show Business: Shakespeare Meets Predictive Analytics. Chance Vol