How to Write an HCI Systematic Review

For my qualifying exam at Indiana University, I conducted a systematic literature review of HCI research. As someone who thinks a lot about ordering tasks, I enjoyed the careful planning that goes into a review. I took away a lot of lessons about the process that I wanted to share with people who are considering conducting a review of their own. I’ve focused this post on conducting systematic reviews in general, regardless of which database you’re considering using, highlighting the steps along the way.

Home screen of the ACM Digital Library

Background

I performed a systematic literature review, meaning that I carefully searched for, sorted, and gathered data from published articles. Grant and Booth compared and contrasted 14 different types of reviews, mapping them to what the medical research community was using at the time. Systematic reviews differ from a “survey” or traditional literature review by being more thorough: far more time is spent carefully searching for and screening papers than in other types of reviews.

Throughout this post, I’ll be working through an example from my own experience. For my systematic review, I was interested in what tangible, ubiquitous technologies the HCI community has been designing to support older adults. I was interested in showing how most technology has been designed for and with older adults, and I wanted to highlight how supporting older adults in creating these technologies themselves could help propel HCI forward. Between June 2018 and September 2019, I conducted four iterations of a systematic review of tangible, ubiquitous technology designed for older adults, focusing only on the ACM Digital Library (ACM DL). I limited my search to work from 1991 onwards, the year Mark Weiser’s seminal work – The Computer for the 21st Century – laid the groundwork for ubiquitous computing.

Worth noting: I focused my review on the ACM DL only, but around December 1, 2019, ACM updated the DL, so some of the advice I would have given on data collection is now out of date. I’ve included a short commentary on the old ACM DL at the end.

Planning a Systematic Review

Systematic reviews require some initial planning: deciding where to send the review, considering a formal process, identifying the potential contributions, and choosing the paper databases. One of the first steps is thinking about where you plan to send your work – is the review for your qualifying exam? Are you trying to publish it somewhere? If you’re thinking about both, I would encourage you to first focus on meeting qualifying exam expectations before working towards building it out for a particular submission venue. I got caught trying to write a paper that would please both my committee and a potential HCI conference venue, which caused quite a bit of extra stress and frustration. You’ll have to do multiple iterations of the paper anyway, so why not treat your qualifying exam as a “first draft” for the submission venue? It can be a chance to flesh out some of those ideas and get initial feedback on them through the qualifying exam process.

Although I cover most of the steps here, you may consider using a formalized systematic review process like PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) to help structure your process. PRISMA was initially developed to standardize reviews in medical research, but its tools are helpful for structuring any review. For example, it has a great checklist and flow diagram. Even if you don’t follow every single checklist item, it’ll still help you be prepared for most methods questions from your committee or reviewers. I still found PRISMA helpful even though I was focused on HCI and not doing anything related to health; the only checklist items that were irrelevant were the ones related to risk, because HCI doesn’t tend to analyze risk the same way the medical community does.

PRISMA 2009 Checklist

Another key step is thinking carefully about your potential contributions to your research community. Knowing your contributions helps drive what data you will collect, so spending extra time there can save time later if you decide to tighten up your contributions. In my case, I initially included people with dementia alongside my older adult search terms, but as I iterated on the overall review, I realized that it was too hard to tie the two together – designing for people with dementia is an important line of research, but it has many qualities more in line with supporting people with disabilities than people without disabilities.

A third consideration is choosing the database(s) for the search. Certain libraries appeal to certain audiences. The ACM DL is where most HCI-focused research tends to be published, but you’ll still find some in IEEE Xplore, the journals on Taylor and Francis Online, and more discipline-specific places, such as Oxford Academic for the Journal of the American Medical Informatics Association. More engineering-oriented or technical HCI projects may appear more often in places like IEEE Xplore than in the ACM DL. There are also aggregator databases, such as Google Scholar and Microsoft Academic, that pull from many places. In my case, I chose to use only the ACM DL because I wanted an HCI-specific focus. My qualifying committee was comfortable with that, but excluding other databases limited my results for publishing. Reviewers, especially those coming from technical HCI backgrounds, did not appreciate excluding IEEE. In retrospect, I would have included at least the ACM DL and IEEE Xplore, and strongly considered including Google Scholar.

Setting Search Criteria

Formulating the search criteria helps to actualize your contributions and begin to test them out. Choosing the search terms is a careful process of negotiating the scope of the work you’re looking to accomplish. Stowell et al.’s recent review of mHealth interventions for vulnerable populations aimed to be comprehensive in order to point out how research has been missing support for vulnerable groups. Even with very specific search criteria, they searched through 64,249 titles to find the 83 papers that ultimately made it into the study. My review contributions were a bit narrower, focusing on technology design of tangible devices for older adults. I iterated about 15 times to end up with a set of search criteria that netted 4,279 titles to review. To help with iterating, I had 5 papers that I thought should be a part of the corpus, so I kept altering the terms until I had a reasonable number of results that still included those papers.

Ultimately, I ended up with the following search string: (“older adult%”, elder%, senior%, aging, ageing) AND (technology, computer, tablet, system%, ubiquitous, sense, smart, digital). Let’s break that down – the ACM DL’s advanced search page documents the Boolean logic for searches. The first half, before the AND, covers all of the relevant older adult terms, and the second half covers the generic technology terms I used to cast a wide net. However, casting such a wide net ended up requiring more manual review of the titles to weed out papers on topics like aging hardware. I also made careful use of the % wildcard to account for plural forms of the terms.
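To make the structure of that query concrete, here’s a minimal Python sketch of the same logic applied locally to a handful of made-up titles – terms are OR’d within each group, the two groups are AND’d together, and % is treated as a wildcard. This is purely illustrative; it is not how the ACM DL actually executes the search.

```python
import re

# Terms are OR'd within each group and the two groups are AND'd together,
# mirroring the search string above. The % wildcard becomes a regex wildcard.
OLDER_ADULT_TERMS = ["older adult%", "elder%", "senior%", "aging", "ageing"]
TECH_TERMS = ["technology", "computer", "tablet", "system%",
              "ubiquitous", "sense", "smart", "digital"]

def to_pattern(term):
    """Turn a term with a trailing % wildcard into a case-insensitive regex."""
    return re.compile(re.escape(term).replace("%", r"\w*"), re.IGNORECASE)

OLDER_ADULT_PATTERNS = [to_pattern(t) for t in OLDER_ADULT_TERMS]
TECH_PATTERNS = [to_pattern(t) for t in TECH_TERMS]

def matches_query(text):
    """True if the text contains at least one term from each group."""
    return (any(p.search(text) for p in OLDER_ADULT_PATTERNS)
            and any(p.search(text) for p in TECH_PATTERNS))

# Made-up titles: the second one matches but is off-topic, which is exactly
# why a wide net still requires manual title screening.
titles = [
    "A smart pillbox for older adults",
    "Evaluating aging computer hardware",
    "Tangible interfaces for children",
]
print([t for t in titles if matches_query(t)])
```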

ACM Digital Library Advanced Search

The updated ACM DL also allows for additional criteria based on the metadata, including which text to search (e.g., titles, abstracts, full text), people details (e.g., specific institutions), publication venues (e.g., journals) and formats (e.g., content type: research article vs. extended abstract), conferences, and publication date. As I mentioned earlier, I set a timeframe based on Mark Weiser’s work – in 1991 he wrote the paper that sparked ubiquitous computing. However, I had some reviewers argue that I should’ve looked at a narrower timeframe because technology changes so much. I feel that it’s open to interpretation, but be ready to justify your decision. Additionally, I limited the search to abstracts only. That helped cut the results down to a more manageable number, but it opened me up to arguments about being too limiting.

Filtering by content type is a helpful, but tricky, tool for cutting out entries that are not full-length peer-reviewed papers. Around 2017, ACM got better at labeling which database entries were full research articles and which weren’t, such as extended abstracts, posters, dissertations, and tutorials. Prior to 2017, there doesn’t appear to be any metadata to indicate which category a paper falls into. Tutorials, in particular, are tricky because ACM doesn’t make the criteria for what counts as a full-length research paper vs. a tutorial very clear. Entire publication venues, such as the Pervasive Healthcare Conference, are considered tutorials. As best I can tell, it’s because they’re juried (i.e., a pre-set reviewer pool) rather than open to a full peer review process (i.e., associate chairs reaching out to anyone to review). Regardless, it’s safest to stick to only papers labeled “Research Article” or an equivalent in other databases.

ACM DL Publications Filter Options

If I were starting a new search, I would use more specific criteria (similar to Stowell et al.) and open up the search to full text. That would help when casting a wider net across more databases that might not have the same sets of search criteria available.

Collecting the Corpus

The process of actually collecting the corpus of papers comes down to what tools you are familiar with, how much time you want to spend learning something new, and whether it needs to be “free”. There is specific systematic review software available, such as Covidence. Such software can be helpful for dividing up the work among multiple co-authors and formally adding validity measures, such as Cohen’s Kappa for agreement among reviewers, but it understandably costs money – Covidence costs $240 for one review with a small team. Alternatively, you may consider using more flexible “free” tools, such as XML with Python or Microsoft Excel. I had lots of experience with Excel and did not plan on working with many co-authors, so I ended up using Microsoft Excel. Excel was helpful because I could mold it to my process, and it also helped later with tricks like the sort feature and concatenate function. However, it does make adding validity measures more challenging, as well as tracking changes among multiple authors. If I were to do it again, I would strongly consider exploring Python with a machine-readable format like CSV or XML files to speed up some of the process even more and help with later integration, though I’d need to take the time to set up that environment.

Example of a CSV File downloaded from the old ACM Digital Library.

Regardless of what tool you decide to use, here are some recommended pieces of metadata to collect for analysis (if you’re considering a specific software package, you may need to revisit this list); a sketch of how these fields might look as a simple record follows the list:

  • Critical Metadata
    • Author list
    • Title
    • Abstract
    • Hyperlink or the DOI to create a hyperlink later
  • Helpful Metadata
    • Author’s Institution(s)
    • Publication Year
    • Publication Venue
      • Journal details (issue date, volume, etc.)
    • The database’s unique ID
    • Number of pages
    • Publisher
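
If you go the Python route mentioned above, that list maps naturally onto a simple record type. Here’s a rough sketch – the field names are my own and are not tied to any particular database’s export format.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PaperRecord:
    """One paper's corpus metadata (field names are illustrative only)."""
    # Critical metadata
    authors: list
    title: str
    abstract: str
    doi: str                        # used to build a hyperlink later
    # Helpful metadata
    institutions: list = field(default_factory=list)
    year: Optional[int] = None
    venue: Optional[str] = None     # plus journal details (issue, volume, etc.) if relevant
    database_id: Optional[str] = None
    num_pages: Optional[int] = None
    publisher: Optional[str] = None
```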

With the new ACM DL, collecting data is suddenly a lot more complicated than it was with the old one. The biggest issue is that you can no longer download a CSV of the search results; you can only save off BibTeX for 50 entries at a time. The DL also has an anti-bot policy, so it’s not possible to automate the task, either. Anecdotally, I’ve also heard that the ACM DL is getting stricter about downloading too many PDFs in one day, doing things like temporarily or permanently blocking accounts. The best way I can suggest to do a systematic review with the updated ACM DL is to get an XML dump directly from ACM. To get access, e-mail Craig Rodkin (rodkin@hq.acm.org), the Publications Operations Manager for ACM, who will ask for a simple agreement completed on institutional letterhead. The XML includes the standard metadata along with extracted full text from the published work, but not the PDFs.
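If your result set is small enough to export by hand and you don’t go the XML route, one workaround I’d consider is saving each 50-entry BibTeX batch to a file and merging the batches with a short script. Here’s a rough sketch assuming the third-party bibtexparser package (v1.x) and my own file and column naming – adjust the fields to whatever your export actually contains.

```python
import csv
import glob

import bibtexparser  # third-party: pip install bibtexparser (v1.x API shown)

# Merge every exported 50-entry batch (hypothetical file names) into one list.
entries = []
for path in sorted(glob.glob("acm_export_*.bib")):
    with open(path, encoding="utf-8") as f:
        entries.extend(bibtexparser.load(f).entries)

# Write the merged metadata to a single CSV for screening.
# Fields missing from the export (e.g., abstract) are simply left blank.
fields = ["doi", "title", "author", "year", "abstract"]
with open("corpus.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(entries)
```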

See the section on “Specific Issues with the Old ACM DL” for more details on the data collection issues I had with the CSV data from the old DL.

Preliminary Screening: Cleaning the Corpus

As with any data set, there will be some cleaning that needs to be done prior to analysis. Before I began in earnest, I cleaned up the data set by removing duplicates, sorting on the DOIs and using them as the primary key. Cutting out duplicate DOIs will be especially helpful if you end up working with multiple databases. I also removed entries that were not full peer-reviewed papers, such as dissertations, books, extended abstracts, posters, summaries of speeches, etc.
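As a concrete example, here’s roughly how that DOI-based de-duplication might look in Python, assuming the corpus lives in a CSV with a doi column (as in the earlier sketches); the first occurrence of each DOI wins.

```python
import csv

# De-duplicate on DOI, treating it as the primary key.
with open("corpus.csv", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    fieldnames = reader.fieldnames
    seen_dois, unique_rows = set(), []
    for row in reader:
        doi = (row.get("doi") or "").strip().lower()
        if doi and doi not in seen_dois:
            seen_dois.add(doi)
            unique_rows.append(row)

with open("corpus_deduped.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(unique_rows)
```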

Screening the Corpus

Sample Paper Refinement Chart from my systematic review.

Conceptually, I thought of sifting through the papers to find the final corpus as a funnel, with each layer of review filtering out papers with more and more scrutiny. Up until this point, I had a feel for what types of papers I wanted to include, but I had not set my inclusion and exclusion criteria in stone. I started by reviewing 100 titles only and began to set the inclusion and exclusion criteria. I wish I had instead reviewed 100-300 papers all the way through – reviewing the title, abstract, and full text – before committing to reviewing all of the papers. As you get into reviewing in depth, you’ll have a chance to see patterns in the titles or abstracts that will help you further refine your inclusion and exclusion criteria. For example, in my initial review of 100 titles, I had ruled out papers with “AAL” in the title, thinking it was irrelevant. Had I read the abstract and full text, I would’ve remembered that AAL means Ambient Assisted Living. I lost time later re-reviewing papers, searching for AAL in the title or abstract. If you’re considering publishing your work, I’d also consider adding more validity by having someone else review the same sample with you to make sure you’re on the right track.

After that initial setting of criteria, there are two broad strategies I considered for filtering papers. (1 – Complete Paper) Read a paper’s title, then abstract, then full text in one go. I wouldn’t have to refresh my memory of the details each time, but I wouldn’t be able to settle into a rhythm during review. The other broad option is to (2 – Layers at a Time) read all of the titles first, immediately moving to the next paper after each decision, and then repeat in the same way for the abstracts and full texts. I followed this second strategy. I was quickly able to settle into a rhythm and make fast decisions without needing to take the time to pull up each paper. However, when I made it to the full-text screening, I realized I was spending a lot of time re-familiarizing myself with each paper’s abstract before reading the full text anyway. I’d recommend a third strategy: (3 – Middle Ground) read through all of the titles first to make quick progress, then filter through the abstract and full text of each paper in one sitting.

Strategies for screening: 1. Complete Paper; 2. Layers at a Time; 3. Middle Ground.

Throughout my time filtering, I carefully tracked why I removed each paper and kept in mind that the next level of scrutiny could help confirm a decision if I wasn’t 100% sure a paper should be removed. Be clear about why you’re choosing each of your inclusion and exclusion criteria. Tracking why you removed papers is also important in case you need to explain to reviewers why one of their papers was removed, or if you need to go back and revisit any decisions. For example, one of my exclusion criteria was originally “Game” – I simply wasn’t interested in them. However, I later realized I didn’t really have a good reason for excluding games, so I went back and reviewed all the papers that fell into that category.
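If you’re tracking decisions in a spreadsheet or CSV, a few lines of Python make it easy to tally your exclusion reasons and pull a category back out for re-review, as I ended up doing with games. The column names below (“decision”, “exclusion_reason”) are just my own convention.

```python
import csv
from collections import Counter

# Load the screening sheet (column names are my own convention).
with open("screening.csv", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Tally why papers were excluded so the criteria stay visible and defensible.
reasons = Counter(r["exclusion_reason"] for r in rows if r["decision"] == "exclude")
print(reasons.most_common())

# Revisiting a criterion later (e.g., "Game") is then a one-liner.
to_revisit = [r for r in rows if r["exclusion_reason"] == "Game"]
print(f"{len(to_revisit)} papers to re-review")
```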

I’d also like to remind you about adding extra validity to your results, especially if you’re thinking of publishing them. As I mentioned earlier, systematic review software packages can let you bake validity measures in by having multiple people agree on the review. With flexible tools like Excel or Python, you’ll need to handle that yourself. Top venues will likely expect a formal measure like Cohen’s Kappa, but you can also consider having co-authors review a random sample of 10-20% and discuss any differences (e.g., Lazar et al.).
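For example, if two of you screen the same sample and record include/exclude decisions, Cohen’s Kappa is a one-liner with scikit-learn (you could also compute it by hand from the agreement table). The decision lists below are made up purely for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Include/exclude decisions from two reviewers on the same sample (made-up data).
reviewer_a = ["include", "exclude", "exclude", "include", "exclude", "include"]
reviewer_b = ["include", "exclude", "include", "include", "exclude", "include"]

kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance-level
```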

Pulling Data from the Corpus

Start thinking about what data will help to validate your contributions. As you’ve been reading through the full text of papers, you’ve probably started to get some ideas for the type of data you want to collect. Think carefully about the contributions you’re trying to make, and begin collecting data that helps back them up. You may consider making a diagram mapping your contributions to the data you’re collecting to help answer them.

I’d also recommend taking some initial “notes” until you have enough data to start to develop categories. For example, I wanted to talk about the overall focus of the work. I initially wrote down a few words on each, such as “smart wheelchair” or “in home physical therapy assistant”. Eventually I started to see patterns to define categories, such as “assistive devices” and “health”, and I iterated on those over time.

Some of the data you’re trying to collect may force you to collect data per study (i.e., per method) rather than per paper. Some HCI researchers wrap multiple different studies into the same paper, such as Piper et al., who reported a set of observations and two separate technology evaluations in one paper. I recorded multiple studies per paper to be able to add commentary about the individual methods people used in their studies (e.g., how many older adults per method, how many stakeholders).
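One lightweight way to handle this is a paper-level record with a nested list of study-level records, so the per-method details have somewhere to live. The structure below is only my own illustration, with made-up example values.

```python
from dataclasses import dataclass, field

@dataclass
class Study:
    """One study/method within a paper (e.g., observations, a technology evaluation)."""
    method: str
    num_older_adults: int = 0
    num_other_stakeholders: int = 0

@dataclass
class ReviewedPaper:
    doi: str
    title: str
    studies: list = field(default_factory=list)

# Made-up example of a single paper reporting three studies.
paper = ReviewedPaper(
    doi="10.1145/0000000",  # placeholder DOI
    title="Example multi-study paper",
    studies=[
        Study(method="observations", num_older_adults=12),
        Study(method="technology evaluation", num_older_adults=8, num_other_stakeholders=3),
        Study(method="technology evaluation", num_older_adults=10),
    ],
)
```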

I’d also recommend writing a tweet (a 140-character summary) of each paper to help you recall the details about it later. After a while, many of the papers start to bleed together in your memory, so a short summary can help speed up refreshing yourself about them.

There are also ways to add more validity for publishing. For example, Stowell et al. had two authors independently collect all of the data for each paper and then they later reconciled the differences. You could consider something similar, using a software package, or cross-checking 10-20% of them.

Writing the Paper

Regardless of whether you’re doing a systematic review for a qualifying exam or trying to publish it, keep in mind that readers are expecting a broad overview with some actionable insights based on the data you’ve collected. As with any paper, tailor your message and the way you present the data so that it appeals to your audience. That may mean tailoring it to a specific community even if you feel your results are more generalizable than that. Paper reviewers, in particular, are more critical of systematic reviews than of other types of papers because reviews can have a huge impact on the field.

Also be careful to summarize the key results – you’ve collected quite a bit of data, so try not to overwhelm readers by presenting everything. Most written reviews include hard-to-read inline citations, such as the following from my review:

“Overall, more were funded (73/116 papers – 62.9%) [6, 10–12, 14, 16, 19, 24–27, 29, 32, 33, 41, 42, 44, 45, 53, 59, 62, 65, 66, 69, 72–79, 84, 88, 89, 92–94, 98, 105, 106, 113, 115–118, 122, 126, 128, 129, 132–134, 139, 140, 142, 143, 145, 148, 151, 154, 157, 159, 160, 164–166, 170, 171, 174, 175, 177, 178] than not (43/116 papers – 37.1%) [1, 8, 9, 15, 21, 22, 28, 30, 31, 34, 38, 40, 43, 49, 54, 60, 61, 63, 64, 81, 85, 86, 95, 101, 102, 108–111, 119, 121, 124, 130, 149, 155, 156, 161, 162, 167, 172, 179–181].”

That’s a lot to take in. I’d recommend at the very least using colored citations so it’s easier to visually parse through the text.

Additionally, I used some Excel and Notepad++ magic to build the comma-separated lists of BibTeX citation keys for LaTeX. (1) I used Excel’s concatenate function to put every citation into the format “citationlabel, ” – the “, ” helps to spread them out. Next, (2) I used Excel’s sort and filter features to select the ones I wanted to copy, and (3) pasted them into Notepad++ and used keystroke macros to put them all on the same line. Note that all of this can be done with Python as well, as in the sketch below.
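Here’s roughly what that looks like in Python, assuming a CSV of the final corpus with a citation label column and a column for whatever you’re filtering on – the column names “citation_label” and “funded” are my own, echoing the funded/not-funded example above.

```python
import csv

# Build a "\cite{...}" string for every paper matching a criterion, replacing
# the concatenate/sort/macro steps described above.
with open("corpus_final.csv", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

funded_labels = [r["citation_label"] for r in rows if r.get("funded") == "yes"]
print("\\cite{" + ", ".join(funded_labels) + "}")
```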

What will I do the next time I do a systematic review?

I’ve sprinkled advice throughout this post, but I’d like to summarize a few key suggestions for the next time I do a systematic review – suggestions I’d also like you to consider:

  • Spend more time making a more restrictive set of search terms to speed up the manual process of screening papers. Use good Boolean logic.
  • Search through more than one database.
  • Decide if there’s a better way to pull the papers I need from the updated ACM DL, or if I need to try to get an XML download from the ACM database group.
  • Learn Python and compile the data in a machine-readable format (CSV/XML/JSON/TOML) so you can quickly manipulate the data for more flexibility between databases and screening. This can also help with choosing how to add validity.
  • Continue carefully tracking why a paper was removed from the final corpus.

Specific Issues with the Old ACM DL

As an HCI researcher, I’m glad ACM updated the ACM DL. It’s quite a visual upgrade, and they added a few more filters that weren’t available before. However, it’s now challenging to use for systematic reviews, and reaching out to ACM for the XML file is the only easy solution short of manually scraping data.

The old ACM DL had its own set of issues, but I was able to work around them. Interestingly, the old ACM DL would display one number of results (e.g., 895), but the CSV file would contain more entries (e.g., 967), with several duplicates added in that differed only in a minor detail. I was forced to clean it up based on the DOI. Saving off a CSV was very handy for systematic reviews, but you could only save off the metadata for about 2,000 entries at a time, so I had to run multiple searches broken up into year chunks (e.g., 1991-2011, 2012-2016, 2017-2018). Surprisingly, the CSVs didn’t include the abstract, so I was forced to load the webpage from the DOI each time to access it, and I spent a lot of time waiting for the ACM page to load for each paper. Lastly, the Boolean logic was not documented and was difficult to figure out through trial and error (e.g., the need to use normal parentheses rather than curly brackets), but the new Advanced Search describes it in detail.

Acknowledgements

I’d like to thank Alexander L. Hayes and Katie Siek for their feedback on this post.