Our team (Anatoliy Gruzd, Jenna Jacobson, and Elizabeth Dubois) recently hosted a workshop on Doing Research with Social Media Data at the 2016 iConference in Philadelphia. The primary purpose of the workshop was to bring together overlapping research communities of social media, information and communication scholars to identify some of the major challenges, opportunities, and possible interventions to address data issues related to Social Media Data Stewardship.

The session participants were divided into four groups, each led by an invited lead discussant. Group 1, led by Dr. Katie Shilton, discussed ethical issues related to the access and collection of social media data. Group 2, led by Dr. Jessica Vitak, discussed methodological questions related to study designs when collecting social media data. Group 3, led by Dr. Bryan Semaan, focused on the topic of sharing social media data with collaborators and the public. Finally, Group 4, led by Dr. Ayoung Yoon, examined whether and how current practices and metadata schemas might be adopted to help with preservation efforts of social media data.

This post summarizes each of the group discussions from the workshop.

1. Access and Collection – Ethics 

Lead discussant: Katie Shilton

Is informed consent necessary/recommended when collecting social media data? (Consider public vs. private data)

How can informed consent be collected?

When is informed consent not necessary?

Should academic researchers be held to a higher standard than industry or government?

Ethical questions need to be carefully considered throughout the entire research process; this is certainly true of social media research, which also presents unique ethical considerations. This group discussion specifically focused on ethics as it pertains to the access and collection of social media data for academic research purposes.


Institutional Research Ethics Boards (IRBs/REBs) are often ill-equipped when it comes to reviewing and advising university researchers as to how to properly collect and handle data from social media users. The group identified that this is primarily because IRBs/REBs might not have social media researchers on their committees and as a result, they may not necessarily understand the nuanced nature of social media data.

Specifically, what might or might not be considered “public” varies across platforms (and even within a platform), so each platform calls for its own considerations. There is a need for social media researchers to participate at conferences organized for and by IRB/REB professionals, such as the CAREB-ACCER National Conference organized by the Canadian Association of Research Ethics Boards.

Ethical Considerations with Minors/Minority Groups

Another key point that emerged from this discussion is that there should be different standards, or at least a different set of considerations, when collecting social media data produced by minors, minority groups or other sensitive groups. Furthermore, the expectation of privacy would likely differ depending on whether someone is talking about a movie on social media or about a more sensitive topic, such as personal health.

Informed Consent as a Teaching Moment

Finally, the discussion led to the idea of using the informed consent process as a teaching moment with social media users. Who is supposed to do the “citizen education” in this area? Is this the researcher’s task? One participant stated (and some agreed), “If we’re collecting the data, informing is at least partially our responsibility”. But this also raised the concern of potentially biasing the data since by informing users, we may (and likely will) influence users’ behaviour online. By informing users, we are also potentially moving from studying online publicly available data, which might not require IRB/REB’s approval, to studying online users, which has a whole set of other ethical implications and considerations.

2. Access and Collection – Research Design

Lead discussant: Jess Vitak

Where and how is data being collected?

Public vs. private posts/profiles (e.g., tweets vs. Facebook posts vs. Snapchat messages)

Time restrictions on data availability

What is an appropriate sampling/recruitment strategy to ensure an accurate representation of social media users and their views?

In developing any research project, it is important to consider issues of access and collection of data. The ready availability of APIs facilitates the use of social media datasets in research, but the technical constraints imposed by social media platforms themselves also hinder it. This raises questions as to how and where social media data is collected and how this affects the research itself.

Social Media Data ≠ All People

Researchers need to remember that social media research that relies on scraping social media platforms does not account for differences between users and non-users. Some concerns were raised as to a Hawthorne effect in research, where people know they are being studied and change how they act/post. Furthermore, researchers should recognize that there may be a selection bias as people have freely chosen to post on social media.

Collecting Data from APIs versus Collaborating with Social Media Platforms

Researchers can gain access to social media data in various ways, but typically do so through the API or through collaborating directly with social media platforms. Application Programming Interfaces (APIs) allow third-party developers to access and collect publicly available social media data directly from the social media platforms.

Alternatively, some researchers are collaborating directly with the social media platforms to get direct access to the data. There is a tradeoff between the two types of access: the process of gaining direct access is challenging and tedious, but without partnerships, data quality is typically lower. There are also differences in terms of the research questions one can ask based on the data available.

Extending the Shelf Life of Social Media Research

Even with appropriate access to social media data, the research design is especially challenging considering how platform interfaces are constantly changing. Researchers need to be aware and point to specific constraints/features of the social media sites at a given time, but it is also important to tie the findings to larger ideas that can be applicable over time; for example, connecting research questions to social science theories (e.g., homophily). 

3. Sharing For Collaboration

Lead discussant: Bryan Semaan

What are the issues associated with publishing or sharing social media datasets that you collected for your research?

What are effective models for social media data sharing? (e.g. repositories of data)

How can researchers negotiate with social media platforms to get the rights to share/preserve social media data?


One of the key questions that emerged from this group discussion was who has, and who should have, access to the “big data” social media datasets.

Big Data Divide

A new type of divide is emerging: The Big Data Divide. There is first a divide between industry and academic researchers, but there is also a divide within the categories themselves. For academic researchers, The Big Data Divide exists between: (1) those who have the computational expertise to build tools versus those who rely on the tools of others; and (2) those who have the funding to buy access to data (e.g., Twitter Firehose) versus those who rely on the freely available data sources (e.g., public APIs).

Terms of Service Restrictions

The divide is further problematized by the inability to share social media datasets as outlined in the various Terms of Service (ToS). See http://api.socialmediadata.org. Ideally, researchers should be able to share their datasets in order for other researchers to query, validate and reuse the data in different ways. However, sharing data is not only limited by the ToS; it also raises privacy and ethical concerns.

Data Reuse and Intent

If social media datasets were shared, then we need to consider how others having access to the data may result in the dataset being used in unintended ways, which may have negative implications. While academic researchers may be using the data for public good – to expand theory, knowledge, and understanding of the impact of social media in society – industry is largely concerned with a profit motive. As a result, the shared social media data may ultimately be used in ways that were unanticipated.

4. Preservation Metadata Schemas  

Lead discussant: Ayoung Yoon

What metadata schemas and what formats should be used to preserve social media data?

 What are the considerations for different data types such as posts, profiles, interactions, social networks?

What are the main strategies to ensure data reuse? For example, should researchers preserve all available metadata (“just in case”/“all you can get” approach)?


Data preservation allows researchers to share datasets, use datasets in new ways and test and re-test findings. The group found that the preservation of social media data presents unique challenges for the following three key reasons.

a) Not All Data is Equal

First, social media data is not all equal. For example, a “Like” on Facebook and a “Like” on Twitter are not necessarily the same. Further, it is very difficult to anticipate what social media data of new applications and platforms will look like in the future.

b) Experience of Social Media Differs

Second, social media as it is experienced is potentially very different from how it is presented through collected and stored data. For example, when an individual navigates a subreddit, factors such as the page layout, inclusion of visuals and links, and other current posts may all impact how that individual perceives and interacts with the presented information. When data is collected and stored in a database, however, the experience of interacting with posts changes.

c) Metadata Schemas

Third, social media data and related metadata do not clearly map onto existing metadata schemas. For example, social media data includes important relational information (such as “friend” and “following” relationships or co-liking and co-commenting instances) that other types of data do not. Relatedly, social media data is rarely static and different versions of a dataset can lead to very different results.
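To make the third point concrete, here is a minimal sketch of what a preservation record might need to capture beyond the posts themselves: relational information and the version of the dataset. It is not a proposed schema and does not follow any existing metadata standard. All class names, field names, account names, and the platform label are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field, asdict
import hashlib
import json


@dataclass(frozen=True)
class Relation:
    """A directed relationship between two accounts (hypothetical structure)."""
    source: str
    target: str
    kind: str  # e.g. "friend", "following", "co-liked"


@dataclass
class Snapshot:
    """One collected version of a dataset, plus the context of its collection."""
    platform: str
    collected_at: str  # ISO 8601 timestamp of the collection run
    posts: list = field(default_factory=list)
    relations: list = field(default_factory=list)

    def fingerprint(self) -> str:
        """Hash the snapshot content so that two versions can be told apart."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two illustrative versions of the "same" dataset, collected a day apart.
# The values below are made up for the example, not real data.
v1 = Snapshot(
    platform="example-platform",
    collected_at="2016-03-20T12:00:00+00:00",
    posts=[{"id": "p1", "text": "hello"}],
    relations=[Relation("alice", "bob", "following")],
)
v2 = Snapshot(
    platform="example-platform",
    collected_at="2016-03-21T12:00:00+00:00",
    posts=[{"id": "p1", "text": "hello"}],
    relations=[
        Relation("alice", "bob", "following"),
        Relation("bob", "alice", "following"),
    ],
)

# Different versions yield different fingerprints.
print(v1.fingerprint() == v2.fingerprint())
```

Storing relations as their own records, rather than as attributes of a post, is one way to keep the network structure reusable. Recording a fingerprint per snapshot lets others state exactly which version of a changing dataset a result was computed from.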

These issues bring forward many questions. For example:

  • Can different types of social media be compared fairly?
  • To what extent is the social and technical context of social media use important in the preservation process?
  • Should a copy of the visual display of social media be stored in addition to the data researchers tend to collect today?
  • Should multiple versions of data be collected?
  • Is it possible to find a single metadata scheme that is flexible enough while still being effective?

These issues are particularly challenging to deal with because there is little incentive for researchers to find or make use of solutions. Neither the peer-review process, the tenure and promotion process, nor the research process itself in the social sciences rewards the production of high-quality, searchable and reusable datasets. As such, a key concern as the discussion of social media data stewardship progresses is finding a way to incentivize the kinds of data preservation required.


Considering that research with social media data is so multifaceted, we need to develop a common framework to examine social media research through technical, ethical and policy perspectives. We propose using Social Media Data Stewardship as a framework and urge researchers to consider how their own research fits within this emerging area.

Challenges and Opportunities of Doing Research With Social Media Data