Hi! My name is Carrie Wiley. I’m a Senior Staff Scientist at the Human Resources Research Organization (HumRRO) and would like to share some useful tips on manipulating strings in R. As educational evaluators, we often manage complex and messy datasets that we receive from schools, states, teachers, etc. One of the most common data preparation tasks I encounter is cleaning up character or string variables so that I can match data from different years/sources, run frequencies or descriptive statistics, or prepare for qualitative coding.
Hot Tips and Cool Tricks
Clean up First and Last Names
Often, we receive data with teacher’s first and last names with no standardized data entry (e.g., John, john, john , JOHN). Here are some ways I like to clean up these data.
1. Convert the names to title case (capitalize first letter only). You can use this function:
2. Trim leading and trailing white space. Sometimes whitespaces are there even if you think they’re not. Removing white space will make for a cleaner dataset and is crucial when using the variable as a key for matching two data sources. trimws(var1) will do the trick.
Concatenate First and Last Names
We can combine the newly formatted first and last names in case we need to match it with another dataset using teacher name as the matching key.
1. paste0(TchrFirst, TchrLast) will combine both variables (e.g., ‘JohnDoe’)
2. paste(var1, var2, sep = ‘, ’) will combine both variables separated by a comma-space (e.g., ‘John, Doe’)
Extract string values
Another useful function is to extract certain strings within a variable. For example, the domain of an email address often indicates the school district. You can extract everything after the ‘@’ symbol to then create the district name. We can do so in three steps using gsub() and then recoding.
1. extract everything after the first @ symbol
2. extract everything before the first period
3. Recode ‘distA’ to ‘District Name A’
Filter records to only include certain strings
Sometimes we need to pull only the records that contain certain values. For example, if you have teacher rosters for an entire district and you only want the “mathematics” teachers, you can search for certain strings to pull those classes (e.g., ‘Math’, ‘Alg’, ‘Geom’).
1. Create an index var to use that contains the patterns you want to search for using grepl() (you can also use grep()).
2. mathIndex returns a logical vector indicating that records 1, 2, and 5 in the dat dataframe contain ‘Math’, ‘Geo’, or ‘Alg’ in the Subject variable.
3. You can then filter the dat dataframe to only include those records
- Functions like paste0(), gsub(), tolower(), toupper(), substr() are all available in Base R and knowing regex comes in really handy:
- Understanding grep() and grepl() for pattern matching are also great resources for string manipulation: https://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html
- If you prefer to explore the tidyverse, check out the stringr package!:
The American Evaluation Association is hosting PreK-12 Ed Eval TIG Week with our colleagues in the PreK-12 Educational Evaluation Topical Interest Group. The contributions all this week to AEA365 come from our PreK-12 Ed Eval TIG members. Do you have questions, concerns, kudos, or content to extend this AEA365 contribution? Please add them in the comments section for this post on the AEA365 webpage so that we may enrich our community of practice. Would you like to submit an AEA365 Tip? Please send a note of interest to AEA365@eval.org. AEA365 is sponsored by the American Evaluation Association and provides a Tip-a-Day by and for evaluators. The views and opinions expressed on the AEA365 blog are solely those of the original authors and other contributors. These views and opinions do not necessarily represent those of the American Evaluation Association, and/or any/all contributors to this site.