author: “Aaron D. Schroeder” description: Takes yearly student record data in long-format, transforms and casts the data to wide-format, then finds number of times each grade was attended (times attended > 1 = retention).
1. Add Libraries and Import Data
Sample data includes client id, grade year (e.g. KG, 1, 2), and entry date (first day of school)
library(reshape2)
retentionData <- read.csv("GradeRetentionData.csv")
## clientid gradeyear entrydate
## 1 ClientA 1 8/1/2010
## 2 ClientA 1 8/1/2011
## 3 ClientA 2 8/1/2012
## 4 ClientC 4 8/1/2008
## 5 ClientC 5 8/1/2009
## 6 ClientC 5 8/1/2010
## 7 ClientC 6 8/1/2011
## 8 ClientB PK 8/1/2003
## 9 ClientB KG 8/1/2004
## 10 ClientB KG 8/1/2004
## 11 ClientB 1 8/1/2005
2. Transform column data where necessary and build new data frame
Reduce date to just year
yearattend <- format(as.Date(retentionData$entrydate, "%m/%d/%Y"), "%Y")
Custom function gradeNum is added to convert text grade number (e.g. “KG”) to numeric (e.g. “0”) to help with sorting. The function is applied using mapply, a vectorized approach (as opposed to looping).
gradeNum <- function(x){ switch(x, "KG" = 0, "K" = 0, "PK" = -1, "P" = -1, x) }
gradeyear <- mapply(gradeNum, x = as.character(retentionData$gradeyear))
Construct new data frame (here we use same name to replace old data frame).
retentionData <- data.frame(clientid = retentionData$clientid, gradeyear, yearattend)
## clientid gradeyear yearattend
## 1 ClientA 1 2010
## 2 ClientA 1 2011
## 3 ClientA 2 2012
## 4 ClientC 4 2008
## 5 ClientC 5 2009
## 6 ClientC 5 2010
## 7 ClientC 6 2011
## 8 ClientB -1 2003
## 9 ClientB 0 2004
## 10 ClientB 0 2004
## 11 ClientB 1 2005
3. Eliminate duplicate records
Two records with identical clientid, gradeyear, and yearattend are duplicates, not a grade retention. Notice record 10 was eliminated as a duplicate.
retentionData <- unique(retentionData[,c("clientid","gradeyear","yearattend")])
## clientid gradeyear yearattend
## 1 ClientA 1 2010
## 2 ClientA 1 2011
## 3 ClientA 2 2012
## 4 ClientC 4 2008
## 5 ClientC 5 2009
## 6 ClientC 5 2010
## 7 ClientC 6 2011
## 8 ClientB -1 2003
## 9 ClientB 0 2004
## 11 ClientB 1 2005
4. Cast to wide data frame
Create a single row for each clientid + gradeyear combination. Order by clientid and gradeyear.
castDF <- dcast(retentionData, clientid + gradeyear ~ yearattend, value.var="yearattend", fun.aggregate=length)
castDF <- castDF[order(castDF$clientid, castDF$gradeyear),]
## clientid gradeyear 2003 2004 2005 2008 2009 2010 2011 2012
## 1 ClientA 1 0 0 0 0 0 1 1 0
## 2 ClientA 2 0 0 0 0 0 0 0 1
## 3 ClientB -1 1 0 0 0 0 0 0 0
## 4 ClientB 0 0 1 0 0 0 0 0 0
## 5 ClientB 1 0 0 1 0 0 0 0 0
## 6 ClientC 4 0 0 0 1 0 0 0 0
## 7 ClientC 5 0 0 0 0 1 1 0 0
## 8 ClientC 6 0 0 0 0 0 0 1 0
5. Create summary column
Create row sums column and add to new wide data frame
rsums <- rowSums(castDF[c(3:ncol(castDF))])
castDF$times_attended <- rsums
## clientid gradeyear 2003 2004 2005 2008 2009 2010 2011 2012 times_attended
## 1 ClientA 1 0 0 0 0 0 1 1 0 2
## 2 ClientA 2 0 0 0 0 0 0 0 1 1
## 3 ClientB -1 1 0 0 0 0 0 0 0 1
## 4 ClientB 0 0 1 0 0 0 0 0 0 1
## 5 ClientB 1 0 0 1 0 0 0 0 0 1
## 6 ClientC 4 0 0 0 1 0 0 0 0 1
## 7 ClientC 5 0 0 0 0 1 1 0 0 2
## 8 ClientC 6 0 0 0 0 0 0 1 0 1