Hello World! Welcome to my data visualisation project!

In this page, I will show you how I used publicly available data from the NHS to graph a time series of dementia diagnosis rates in England. This project has been developed for the PSY6422 Data Management and Visualisation module, which is part of the MSc in Psychological Research Methods with Data Science at the University of Sheffield. Please visit my github repository to find all documentation related to this project.

Motivation and Data Origins

I found a public data repository of dementia diagnoses managed by the NHS. Data from this repository have been used to develop a Dementia Publication Dashboard, which provides an insight into the prevalence of dementia in England.

Not every person suffering from dementia has a formal diagnosis. The national goal (i.e., benchmark rate) is to ensure that at least 2/3 (≥ 66.7%) of those suffering from dementia have a formal diagnosis. The NHS calculates dementia diagnosis rates by comparing the number of recorded diagnoses against the number of people estimated to have dementia. Diagnosis rates are calculated for people aged 65 and above.

Research Questions

As I was playing around with the dashboard, I noticed that there were no comparisons of regional vs national dementia diagnosis rates. I thus saw the perfect opportunity to ask the following questions:

1) How have regional rates (compared to national rates) changed overtime?

2) Are there regions that have consistently been above or below the national diagnosis rates and the benchmark rate?

For those unfamiliar with English geography, here’s a picture depicting the 9 (statistical) regions of England:

Retrieved from: https://www.staffordministries.org/blog/2018/4/18/geography-101

Data Preparation

This project was developed using RMarkdown. I used packages from these libraries:

library(tidyverse)
library(gghighlight)
library(rmarkdown)
library(purrr)
library(lubridate)
library(gganimate)
library(RColorBrewer)
library(jcolors)
library(plotly)
library(plyr)

Step 1: Get the data in

I used data from May 2016 until April 2021. Data were provided by the NHS in separate files containing a year’s worth of data. I uploaded 5 raw data sets:

df_2017 <- read.csv("https://raw.githubusercontent.com/jsgm21/gonzalezmartinez_psy6422_project/main/raw_data/dem-diag-ind-phe-apr-2017.csv") #data period start: May 2016

df_2018 <- read.csv("https://raw.githubusercontent.com/jsgm21/gonzalezmartinez_psy6422_project/main/raw_data/dem-diag-ind-phe-Apr-2018.csv")

df_2019 <- read.csv("https://raw.githubusercontent.com/jsgm21/gonzalezmartinez_psy6422_project/main/raw_data/dem-diag-ind-phe-Apr-2019.csv")

df_2020 <- read.csv("https://raw.githubusercontent.com/jsgm21/gonzalezmartinez_psy6422_project/main/raw_data/dem-diag-ind-phe-Apr-2020.csv")

df_2021 <- read.csv("https://raw.githubusercontent.com/jsgm21/gonzalezmartinez_psy6422_project/main/raw_data/dem-diag-ind-phe-Apr-2021.csv") #data period end: April 2021

Step 2: Check the data

Let’s have a look at the variables in the data set and a random sample of observations. The code below is representative of all data sets because they have been structured identically by the NHS.

names(df_2017) #returns the names of the variables in a data frame
## [1] "INDICATOR" "ORG_TYPE"  "ORG_CODE"  "NAME"      "ACH_DATE"  "MEASURE"  
## [7] "VALUE"     "DQ"
paged_table( #use to make table output neat for publishing
  sample_n( #returns a random sample of 10 observations
    df_2017, 10))

A couple of challenges were clear:

Challenge 1: Dates were not in the correct format (YYYY-MM-DD) and class type (i.e., date).

Challenge 2: The variable “MEASURE” needed to be split into separate variables. The NHS collected several dementia measures but I was only interested in visualising the measure “DIAG_RATE_65_PLUS”

Challenge 3: Organisation types (ORG_TYPE) and codes (ORG_CODE) represented a mix of local, regional, and national entities. I needed to extract England and its regions. Overcoming this challenge involved doing research and using resources from the Office for National Statistics to determine which regional tags were relevant for this project.

Challenge 4: Develop code that would address challenges 1-3 simultaneously for all raw data sets. I could not merge the data frames after upload because they exceeded the memory capacity of my computer.

Solution:

class(df_2017$ACH_DATE) #spot the problem: class type is a character rather than a date
## [1] "character"
#create a list of the data frames so changes are applied to all of them
list_dfs <- list(df_2017, df_2018, df_2019, df_2020, df_2021)

#use map_dfr and pipes to create the processed data frame. I chose this method because all data sets were structured identically - thank you, NHS!
df_proc <- map_dfr(
  list_dfs, ~{.x %>% #create a new date variable (DATE) by converting the old dates into the correct format
    mutate(
      DATE = as.Date(ACH_DATE, format = "%d%B%Y")) %>% #split the values of the MEASURE variable so each measure becomes its own variable 
  pivot_wider(
    names_from = MEASURE,
    values_from = VALUE) %>% #filter data using the ORG_TYPE codes that correspond to England and its regions
  filter(
    ORG_TYPE == "PHE_CENTRE" | ORG_TYPE == "COUNTRY_GEOGRAPHICAL")})

class(df_proc$DATE) #verify that class type is indeed a date
## [1] "Date"
paged_table(
  sample_n(df_proc, 10)) #returns a random sample of the processed data frame in a neat output

Visualising the data

Visual diagnostic

I developed a basic, diagnostic graph to get an insight of whether I needed to do more adjustments to the processed data.

ggplot(
  data = df_proc,
  mapping = aes(x = DATE,
                y = DIAG_RATE_65_PLUS,
                colour = NAME)) + #use colour = NAME so the regions get different colours
  #customise y axis, lower and upper axis limit
  ylim(
    c(50, 80))+
  #use geom_line to show regions as lines
  geom_line(
    size = 1) +
  #use geom_hline to set the benchmark rate, i.e., 66.7
  geom_hline(
    yintercept = 66.7, 
    linetype = 1, #1 for a solid line 
    size = 1, 
    colour = "dark grey") +
  #use the Paired palette colour from the RColorBrewer package
  scale_color_brewer(
    palette = "Paired") + 
  #customise theme to plain black and white 
  theme_bw() +
  #assign graph labels
  labs(
    x = "Year", #x axis label
    y = "Dementia diagnosis rate (%)", #y axis label
    title = "Dementia diagnosis rates in England", #assigns chart tittle
    caption = "Data Source: NHS") #assigns a caption

There were multiple issues with this graph, but the most salient one was that Yorkshire and the Humber was not shown consistently in the graph. This was likely due to a mistake in coding the raw data.

I standardised Yorkshire and the Humber and other regional names:

#correct "yorkshire and humber" to "yorkshire and the humber"
df_proc$NAME [df_proc$NAME == "YORKSHIRE AND HUMBER"] <- "YORKSHIRE AND THE HUMBER"

#use revalue to rename 'NAME' regions so they are no longer in all capitals 
df_proc$NAME <- revalue(
  df_proc$NAME, #"old value" first, then "new value"
  c("ENGLAND" = "England",
    "LONDON" = "London", 
    "WEST MIDLANDS" = "West Midlands",
    "NORTH EAST" = "North East",
    "YORKSHIRE AND THE HUMBER" = "Yorkshire and the Humber",
    "EAST MIDLANDS" = "East Midlands",
    "EAST OF ENGLAND" = "East of England",
    "NORTH WEST" = "North West",
    "SOUTH EAST" = "South East",
    "SOUTH WEST" = "South West"))

#use 'data.frame' and 'complete' to ensure that the region and date combinations are stable. It's a small line, but it is essential. 
df_proc <- data.frame(
  complete(
    df_proc, DATE, NAME))

Then I checked that the naming issues have been solved. I also made a few tweaks to the original diagnositc plot to make it more visually appealing.

#I used a structure similar to the previous graph. I will only comment code to highlight differences.
ggplot(
  data = df_proc,
  mapping = aes(
    x = DATE,
    y = DIAG_RATE_65_PLUS,
    colour = NAME)) +
  ylim(
    c(50, 80)) + 
  #introduce and customise x axis limits by the start and end date of data collection 
  scale_x_date(
    limits = as.Date(
      c("2016-05-31","2021-04-30")), #use 6 month breaks so the x axis values show intervals of 6 months
    date_breaks = "6 month",
    date_labels = "%b %Y",
    expand = c(0.02,0)) + #use expand to control the white space between the axes and the data
  geom_line(
    size = 1) +
  geom_hline(
    yintercept = 66.7,
    linetype = 2, #2 for a dashed line
    size = 1, #change line colour to black
    colour = "black") +
  annotate( #label parameters for the benchmark rate line
    "text",
    x = as.Date("2016-10-31"), #controls where 'Benchmark' label appears in the graph
    66.7, #y-value where the label is referenced
    vjust = -0.5, #label positioning just above the line
    label = "Benchmark rate", #name of the Benchmark label
    size = 4) + #controls label size
  #change the colour palette to "Spectral" for better contrast
  scale_color_brewer(
    palette = "Spectral") + 
  theme_bw() +
  theme( #customise grid lines
    panel.grid.minor = element_blank(), #no minor grid lines for any axis
    panel.grid.major.y = element_blank(), #no major y-grid lines so we're only left with major x-grid lines which are helpful for reading the chart
    #change margin size of the graph, value order: top, right, bottom, left
    plot.margin = margin( #set parameters and units to mm
      1,1,1,1,"mm")) +
  labs(
    x = "Month",
    y = "Dementia diagnosis rate (%)",
    title = "Dementia diagnosis rates in England",
    subtitle = "From May 2016 to April 2021", #introduce a subtitle to make graph more informative
    caption = "Data Source: NHS")

Developing an engaging plot

The plot shown above could answer my research questions, but it’s a boring graph, it’s hard to read, and it does not use the potential of R as a graphic tool.

I developed a series of interactive graphs with the objective to create a visualisation where trends were easily understandable and comparisons between regions could be easily made.

Attempt 1

I developed this animated plot with the intention to visualise better the progress of national and regional trends

#I used a structure similar to the previous graph. I will only comment code to highlight differences.

#save graph structure into an object that will be used later for animation
gr_int0 <- ggplot(
  data = df_proc,
  mapping = aes(
    x = DATE,
    y = DIAG_RATE_65_PLUS,
    colour = NAME)) +
  ylim(
    c(50, 80)) +
  scale_x_date(
    limits = as.Date(
      c("2016-05-31", "2021-11-30")), #change upper limit to a date beyond data collection to give space to animated labels
    date_breaks = "9 month", #change data breaks from 6 to 9 month intervals to facilitate reading
    date_labels =  "%b %Y",
    expand = c(.02,0)) +
  geom_point() + #introduce points for the animation
  geom_line(
    size = 1) +
  geom_hline(
    yintercept = 66.7,
    linetype = 2,
    size = 1,
    colour = "black") +
  annotate(
    "text",
    x = as.Date("2016-10-31"),
    66.7,
    vjust = -0.5,
    label = "Benchmark rate",
    size = 4) + #introduce shadowtext 
  shadowtext::geom_shadowtext( #geom_shadowtext takes NAME values (i.e., regions) and places them next to each region line in the animation. You will see soon!
    data = df_proc,
    mapping = aes(
      x = DATE,
      y = DIAG_RATE_65_PLUS,
      label = NAME), #animated labels will take NAME values
    hjust=-0.1, #makes region names show just next to the line
    vjust = 0,  #makes region names show just next to the line
    bg.color = "white") + #bg.color as white because there is no need for a background colour for our region names
  scale_color_jcolors(
    palette = "pal8") + #important change, introduce new colour package -jcolors- and new colour palette -pal8- to make animation more readable
  theme_bw() +
  theme(
     panel.grid.major.y = element_blank(),
     panel.grid.minor.y = element_blank(),
     legend.position = "none", #remove legend since we'll have animated features from shadowtext
     plot.margin = margin(1,1,1,1,"mm")) +
  labs(
    x = "Month",
    y = "Dementia diagnosis rate (%)",
    title = "Dementia diagnosis rates in England",
    subtitle = "From May 2016 to April 2021",
    caption = "Data Source: NHS") +
  transition_reveal(DATE) #activate animate display of time series

#animation parameters
animate(
  gr_int0, #object used to save the graph structure
  fps = 7, #control speed of animation
  end_pause = 35) #control immediate repetition of animation, so viewers have time to digest trends

I found the visualisation somewhat overwhelming, so I experimented with the function facet_wrap, which allows to split the animated graph.

Before using facet_wrap, I coded a “dummy” variable in the processed data set to control the order of the facets. I wanted to show England in the bottom part of the new graph and the regions on top.

#make a dummy variable for face_wrap later, future 'you' will thank me later
df_proc <- df_proc %>% #make new variable by revaluing ORG_TYPE values
  mutate(
    FACET_ORDER = revalue(ORG_TYPE, 
                         c("PHE_CENTRE" = 1,
                           "COUNTRY_GEOGRAPHICAL" = 2))) #assign values 1 and 2 to control order

Then I implemented facet_wrap to get the graph below:

#I used a structure similar to the previous graph. I will only comment code to highlight differences.
gr_int1 <- ggplot(
  data = df_proc,
  mapping = aes(x = DATE,
                y = DIAG_RATE_65_PLUS,
                colour = NAME)) +
  ylim(
    c(55, 80)) +
  scale_x_date(
    limits = as.Date(
      c("2016-05-31","2021-11-30")),
    date_breaks = "6 month", #change date breaks back to 6 month intervals for reading ease
    date_labels =  "%b %Y", 
    expand = c(.02,0)) +
  geom_point() +
  geom_line(
    size = 1) +
  geom_hline(
    yintercept = 66.7,
    linetype = 2,
    size = 1,
    colour = "black") +
  annotate(
    "text",
    x = as.Date("2016-10-31"),
    66.7,
    vjust = -0.5,
    label = "Benchmark rate",
    size = 4) +
  shadowtext::geom_shadowtext(
    data = df_proc,
    mapping = aes (x = DATE,
                   y = DIAG_RATE_65_PLUS,
                   label = NAME),
    hjust=-0.1,
    vjust = 0,
    bg.color = "white") +
 scale_color_jcolors(
   palette = "pal8") +
   facet_wrap( #introduce facets
     ~FACET_ORDER, #facets will be split based on order from "dummy" variable
     ncol = 1, #number of columns is 1 so we have top and bottom subgrpahs shown individually
     labeller = labeller(
       FACET_ORDER = #labeller lets us name the subgraphs using values from the "dummy" order variable
         c("1" = "Regional Trends",
           "2" = "National Trend"))) +
  theme_bw() +
  theme(
     panel.grid.major.y = element_blank(),
     panel.grid.minor.y = element_blank(),
     legend.position = "none",
     plot.margin = margin(1,1,1,1,"mm")) +
  labs(
    x = "Month",
    y = "Dementia diagnosis rate (%)",
    title = "Dementia diagnosis rates in England",
    subtitle = "From May 2016 to April 2021",
    caption = "Data Source: NHS") +
  transition_reveal(DATE)

animate(
  gr_int1,
  fps = 7,
  end_pause = 35)

I also experimented with individual facets, breaking down each geography into individual graphs. I used the variable NAME rather than FACET_ORDER to control the splitting of graphs:

#I used a structure similar to the previous graph. I will only comment code to highlight differences.
gr_int2 <- ggplot(
  data = df_proc,
  mapping = aes(
    x = DATE,
    y = DIAG_RATE_65_PLUS,
    colour = NAME)) +
  ylim(
    c(55, 80)) +
  scale_x_date(
    limits = as.Date(
      c("2016-05-31","2021-04-30")),
    date_breaks = "12 month", #change date break to 12 month intervals for reading ease 
    date_labels = "%b %Y",
    expand = c(.02,0)) +
  geom_point() +
  geom_line(
    size = 1) +
  geom_hline(
    yintercept = 66.7,
    linetype = 2,
    size = 1,
    colour = "black") +
  annotate(
    "text",
    x = as.Date("2017-05-31"),
    66.7,
    vjust = -0.5,
    label = "Benchmark rate",
    size = 4) +
  shadowtext::geom_shadowtext(
    data = df_proc,
    mapping = aes (x = DATE,
                   y = DIAG_RATE_65_PLUS,
                   label = NAME),
    hjust=-0.1,
    vjust = 0,
    bg.color = "white") +
scale_color_jcolors(
  palette = "pal8") +
  facet_wrap(
    ~NAME, #facets will now be determined by the NAME varaible so will get 10 facets
    ncol = 2) + #set number of columns to 5 pairs of side-by-side facets
  theme_bw() +
  theme(
     panel.grid.major.y = element_blank(),
     panel.grid.minor.y = element_blank(),
     legend.position = "none",
     plot.margin = margin(1,1,1,1,"mm")) +
  labs(
    x = "Month",
    y = "Dementia diagnosis rate (%)",
    title = "Dementia diagnosis rates in England",
    subtitle = "From May 2016 to April 2021",
    caption = "Data Source: NHS") +
  transition_reveal(DATE)

animate(
  gr_int2,
  fps = 7,
  end_pause = 35)

Attempt 2

Whilst I appreciated the motion provided by the animated time series, I reckoned that a more effective visualisation would be less overwhelming and would allow to select different regions to facilitate comparisons.

#I will fully comment this code to avoid confusion from graph structure changes

#save grapgh structure into an object for animation
gr_int3 <- ggplot(
  data = df_proc,
  mapping = aes(
    x = DATE,
    y = DIAG_RATE_65_PLUS,
    colour = NAME)) + #use colour = NAME so the regions get different colours
  ylim( #customise y axis, lower and upper axis limit
    c(50, 80)) +
  scale_x_date(
    limits = as.Date( #introduce and customise x axis limits by the start and end date of data collection 
      c("2016-05-31","2021-04-30")),
    date_breaks = "6 month", #use date breaks of 6 month intervals for reading ease
    date_labels = "%b %Y",
    expand = c(0.02,0)) + #use expand to control the white space between the axes and the data
  geom_line( #use geom_line to show regions as lines
    size = 1) +
  geom_hline( #use geom_hline to set the benchmark rate, i.e., 66.7
    yintercept = 66.7,
    linetype = 2, #2 for a dashed line 
    size = 1,
    colour = "black") + #line colour will be black
  annotate( #label parameters for the benchmark rate line
    "text",
    x = as.Date("2016-12-31"), #controls where 'Benchmark' label appears in the graph
    66.7, #y-value where the label is referenced
    vjust = -0.5, #label positioning just above the line
    label = "Benchmark rate", #name of the Benchmark label
    size = 4,) + #controls label size
  scale_color_brewer( #back to RColorBrewer and Spectral palette for aesthetics
    palette = "Spectral") +
  theme_bw() + #customise theme to plain black and white
  theme( #customise grid lines
    panel.grid.major.y = element_blank(), #no minor grid lines for any axis
    panel.grid.minor.y = element_blank(), #no major y-grid lines so we're only left with major x-grid lines which are helpful for reading the chart
    #change margin size of the graph, value order: top, right, bottom, left
    plot.margin = margin( #set parameters and units to mm
      1,1,1,1,"mm")) + #assign labels
  labs(
    x = "Month", #x axis label
    y = "Dementia diagnosis rate (%)", #y axis label
    title = "Dementia diagnosis rates in England - May 2016 to April 2021") #assigns chart tittle

#animation
ggplotly() %>% #final fix to the benchmark label so it is positioned correctly
  style(textposition = "top")

I wasn not quite happy with this graph so I made some adjustments.

The final visualisation

I renamed the variables of interest so they wouldn’t show in all capitals when hovering over the graph

#rename
df_int <- dplyr::rename(
  df_proc, #new name first, old name after = sign
  Date = DATE,
  Location = NAME,
  Rate = DIAG_RATE_65_PLUS)

Then, I created a graph that would allow highlights, which required some changes to the graph structure previously used.

#the d object will be useful for highlights based on Location (i.e, previosuly known as NAME variable)
d <- highlight_key(
  df_int, ~Location)

#assign graph structure to an object, code not fully commented because previous comments apply to this structure.
gr_int4 <- ggplot(d,
            mapping =
              aes(x = Date,
                  y= Rate,
                  group = Location,
                  colour = Location)) +
  ylim(c(50, 80)) +
  scale_x_date(
    limits = as.Date(
      c("2016-05-31","2021-04-30")), 
    date_breaks = "6 month", 
    date_labels = "%b %Y",
    expand = c(0.02,0)) +
  geom_line(
    size = 1) +
  geom_hline(
    yintercept = 66.7,
    linetype = 2,
    size = 1,
    colour = "black") +
  annotate(
    "text",
    x = as.Date("2016-12-31"),
    66.7,
    vjust = -0.5,
    label = "Benchmark rate",
    size = 4) +
  scale_color_brewer(
    palette = "Spectral") +
  theme_bw() +
  theme(
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank(),
    plot.margin = margin(1,1,1,1,"mm")) +
  labs(x = "Month", 
       y = "Dementia diagnosis rate (%)",
       title = "Dementia diagnosis rates in England - May 2016 to April 2021")

#Final visualisation - variation 1. WARNING: The code from here below produces a graph that allows hovering in different lines but cannot show values for different regions simultaneously.

#make a new object to customise the interactive plot
gg <- ggplotly(gr_int4, 
               tooltip = c("text", "x", "y")) %>% #tool tip feature control the text when hovering over the graph
  style(textposition = "top") 

#use highlight to manipulate highlight settings in graph
highlight(gg, 
          "plotly_hover", #mouse hovering highlights one line at a time
          defaultValues = "England") #graph will show England highlighted automatically
## Setting the `off` event (i.e., 'plotly_doubleclick') to match the `on` event (i.e., 'plotly_hover'). You can change this default via the `highlight()` function.

I like this graph because by default it lets users highlight different regions at once, which makes comparisons between regions easier. This graph also provides an interactive legend feature. Play with the single and double click features in the legend on different region names!

I wanted to introduce a unified hovering line to show values from different regions simultaneously, but this is where I found the limit to my coding abilities (more about this graph idea in the appendix).

Summary

I used publicly available data from the NHS to graph a time series of dementia diagnosis rates in England.

My graph depicts that:

Knowing these diagnosis rates is important because, as explained by the NHS, they can guide policies and access to services for the treatment of dementia in different areas.

Before jumping to conclusions, keep in mind these caveats:

1) The outbreak of COVID-19 has impacted the way General Practices work, which are the primary data sources for dementia rates. Use caution when reading trends after March 2020.

2) This was purely a data visualisation exercise. No statistical comparisons were calculated. The NHS advises to interpret trends carefully and from an observational rather than a causal point of view.

Future steps

Too many! Primarily, develop a visualisation that allows both highlighting and simultaneous hovering of all regions. It would also be interesting to perform some model fitting on the data and predict trends of diagnosis rates. One could predict rates using data from May 2016 to March 2020, then compare predicted trends against observed trends after March 2020. Doing so would provide an insight into the impact of COVID-19 on data collection and trends.

Appendix

I was not able to write code to reconcile highlight("plotly_hover", defaultValues = "England") and layout(hovermode = "x unified")

#the code below is referencing objects previously defined in the chunk of code immediately before this one

#Final visualisation - variation 2. WARNING: The code from here below produces a graph that allows unified hovering to show values form different regions, but the highlighting goes wrong. The interactive features also go wrong.

gg <- ggplotly(gr_int4, 
                tooltip = c("text", "x", "y")) %>% 
  style(textposition = "top") %>% 
  layout(hovermode = "x unified")

highlight(gg,
          defaultValues = "England")
## Setting the `off` event (i.e., 'plotly_doubleclick') to match the `on` event (i.e., 'plotly_click'). You can change this default via the `highlight()` function.