This was a training course organised by the NCRM (National Centre for Research Methods). Held at the LSE in Holborn and facilitated by Frauke Kreuter, the course dedicated two days to considering the ways in which social scientists could engage with Big Data. The content of the two days is supported by a book, Big Data and Social Science: A Practical Guide to Methods and Tools. It was a shame I could only find a hard copy at the time of purchase, as it really is a weighty tome and not something one wants to carry around.
What is Big Data?
This is a good question. One response, that Big Data is “anything that is too big to fit onto your computer” (Foster et al, 2017: p3), reveals how time-bound this defining characteristic is. As the capacity of personal computers increases, so does the ability to handle vast amounts of data on a desktop or laptop. So, this may not be a good yardstick for defining Big Data. Still, it gives us an indication of the ‘Bigness’ of Big Data. Three key characteristics of Big Data are volume (large datasets), velocity (data that may be in real time, or streamed), and variety (data in various formats and from multiple sources). This is discussed in more detail in Chapter 5 of Big Data and Social Science: A Practical Guide to Methods and Tools.
Accessing Big Data
References to the proliferation of Big Data and the datafication of everyday life can be found in social scientific literature (boyd and Crawford, 2012; van Dijck, 2014; McFarland et al, 2016). While data may be ‘everywhere’, it is important to know where to look and to develop the skills needed to access the data. Techniques such as web scraping were discussed. This involves searching for data on the web and extracting it.
There are tools such as Beautiful Soup to facilitate web scraping, and we discussed SelectorGadget, which can be used to identify the CSS selectors needed to pick out different parts of a web page. However, one of the challenges is that web sites change, so a scraper that works today may stop working tomorrow, meaning that this might not be a reliable way of extracting data. Further, web scraping may be illegal in some circumstances, as the providers have not given permission for their data to be accessed in this way.
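To make the idea of web scraping concrete, here is a minimal sketch. It is not taken from the course: it uses Python's standard-library HTMLParser as a stand-in for Beautiful Soup (which offers a more convenient interface for the same job), and the page content, the "course" class name and the function names are all invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this from a URL.
SAMPLE_HTML = """
<html><body>
  <h1>Course listings</h1>
  <ul>
    <li class="course">Big Data and Social Science</li>
    <li class="course">Text Analysis</li>
    <li class="note">Updated weekly</li>
  </ul>
</body></html>
"""


class CourseParser(HTMLParser):
    """Collects the text of every <li class="course"> element."""

    def __init__(self):
        super().__init__()
        self.in_course = False
        self.courses = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "li" and dict(attrs).get("class") == "course":
            self.in_course = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_course = False

    def handle_data(self, data):
        # Keep only non-empty text found inside a matching <li>.
        if self.in_course and data.strip():
            self.courses.append(data.strip())


def extract_courses(html):
    """Return the list of course titles found in an HTML string."""
    parser = CourseParser()
    parser.feed(html)
    return parser.courses


courses = extract_courses(SAMPLE_HTML)
print(courses)
```

The sketch also shows why scraping is fragile: if the site owner renamed the "course" class, extract_courses would quietly return an empty list rather than raise an error.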
Another approach is to use an Application Programming Interface, or API. In non-technical terms, this means ‘reading the data and putting it into something else’. It is distinct from web scraping in that the provider offers the data in a structured form intended for programs to read, rather than the researcher extracting it from pages designed for people. Chapter 2 in Big Data and Social Science: A Practical Guide to Methods and Tools provides more details on the methods and tools used in collecting data from web sources.
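As a rough illustration of ‘reading the data and putting it into something else’, the sketch below takes a JSON response of the kind an API typically returns and writes selected fields out as CSV. Everything specific here is an assumption: the response layout, the field names and the helper functions are invented, and fetch_json is defined only to show where a real request would happen (it is never called, so no real service is contacted).

```python
import csv
import io
import json
import urllib.request


def fetch_json(url):
    """Request a URL and decode the JSON body (shown for context, not called below)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


# Hypothetical response body, standing in for what fetch_json might return.
SAMPLE_RESPONSE = json.loads("""
{"results": [
  {"id": 1, "title": "First example record", "year": 2016},
  {"id": 2, "title": "Second example record", "year": 2015}
]}
""")


def to_csv(payload, fields):
    """Write the chosen fields of each record in payload["results"] to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fields,
        extrasaction="ignore",  # silently drop fields we did not ask for
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(payload["results"])
    return buffer.getvalue()


csv_text = to_csv(SAMPLE_RESPONSE, ["id", "year"])
print(csv_text)
```

Because the data arrives already labelled by field, there is no need to hunt through page markup for it, which is the practical difference from scraping.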
Big Data may be generated from more than one, indeed several, datasets. Tokle and Bender (2017) highlight the ways in which Big Data differs from the more usual survey data used by social scientists. Survey data usually contains all the data relevant to the area of research interest. Social scientists using Big Data may have to use data from several sources. This relates to the ‘organic’ characteristic of Big Data. That is, it is typically data that is found, rather than designed (as in survey data), and may come from the myriad everyday transactions of human activity. These include credit card transactions and social media use.
Researchers using Big Data may want to ‘match’ cases that appear in more than one dataset. In other words, data on individuals may be linked across datasets. This might be very useful to a researcher trying to gain a complete picture of the activity of interest.
Of course, in linking records, there is the possibility that individuals will be identified. We discussed how this meant that informed consent, usually essential for social scientists, is not enforceable. In fact, Big Data threatens informed consent as a value of social research. The consequences of using an individual’s data cannot yet be known. Such ethical concerns urgently need addressing by social scientists (boyd and Crawford, 2012). Chapter 3 in Big Data and Social Science: A Practical Guide to Methods and Tools covers more on record linkage and matching.
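The simplest form of record linkage can be sketched in a few lines. This is only an illustration under stated assumptions: it matches on a single, lightly normalised name field, whereas real linkage usually has to cope with misspellings and missing values. The two datasets, the field names and the functions are all invented.

```python
def normalise(name):
    """Lower-case a name and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def link_records(left, right, key):
    """Pair each record in `left` with records in `right` that share the same normalised key."""
    # Index the right-hand dataset by normalised key for quick lookup.
    index = {}
    for record in right:
        index.setdefault(normalise(record[key]), []).append(record)

    linked = []
    for record in left:
        for match in index.get(normalise(record[key]), []):
            # Merge the two records; values from `left` win where fields overlap.
            linked.append({**match, **record})
    return linked


# Hypothetical, invented records from two separate sources.
library = [
    {"name": "Ada Example", "library_visits": 12},
    {"name": "Ben Sample", "library_visits": 3},
]
attendance = [
    {"name": "ada  example", "lectures_attended": 18},
    {"name": "Cy Nobody", "lectures_attended": 5},
]

linked = link_records(library, attendance, "name")
print(linked)
```

Only the one person who appears in both sources is linked. The merged record says more about that person than either source did alone, which is exactly why linkage raises the identification concerns described above.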
Data visualisation was the most animated part of the session, which is testimony to the ability of visualisations to tell a story with data. Of course, this is nothing new. Historically, visualisations of data including Nightingale’s Coxcombs, Du Bois’ hand-coloured charts of Black Life in the USA, John Snow’s cholera map and Minard’s visualisation of Napoleon’s march on and retreat from Moscow have been used to tell powerful stories that data presented as raw statistics or in tabular form could not.
We discussed how there is now an expectation that visualisations will be interactive. One example we explored was Baby Name Voyager, which provided some fun as we entered various names. However, a shocking, dramatic visualisation was explored in Out of Sight, Out of Mind, which displays animations of drone strikes in Pakistan and the resulting fatalities.
Data visualisations are not just a way of presenting results. They are also used for presenting findings of work in progress, which has value for Learning Analytics. Chapter 9 in Big Data and Social Science: A Practical Guide to Methods and Tools covers visualisations in more detail.
What has this to do with Education?
Another way of phrasing this might be, why would Big Data not have anything to do with education? Education and educational practices have long been the subject of quantification (Smith, 2016). Today:
“Schools are increasingly caught up in the data/information frenzy” (Smith, 2016: 2).
Big Data has become part of the way in which education is governed (Sellar, 2015; Selwyn, 2015; Williamson, 2015). In particular, student performance data is increasingly used for accountability purposes. Leaders and managers of educational institutions will need to become familiar with Big Data analytics rapidly. Within Higher Education, institutions routinely collect data from every student transaction (lecture attendance, library visits, assignment submissions), constituting a wealth of digital data on students. Students may not be aware that we collect and use this data, and again this raises more ethical issues that researchers are engaged with. Along with Learning Analytics, this data may be used to identify those students at risk of failing or dropping out. As Learning Analytics develops, JISC has published a review of Learning Analytics practice in the UK and internationally.
A two-day course couldn’t cover everything, or produce Big Data experts. Other sessions included text analysis and machine learning, which both have relevance to education, and are covered in more detail in Big Data and Social Science: A Practical Guide to Methods and Tools.