Web Scraping with rvest: Exploring Sports Industry Jobs

Web scraping with rvest is easy and, surprisingly, comes in handy in situations that you may not have thought of.

For example, one of the unique things about academia is the constant need to stay “ahead of the curve,” meaning being nimble enough as a program to shift the curriculum around and provide students with training and education in the areas that are in demand within the industry (sports analytics included).

This is particularly essential in the field of sport management.

Case in point: I am currently serving on a “task force” within our department charged with redefining the “future of our program.” In other words: creating a five-year plan to align our program to fit the needs of the sports industry.

We want our students to have the necessary skills and education required to be as competitive on the job market as possible. And, as if the sports industry job market was not tough enough, the COVID-19 pandemic has only made it more difficult for graduates.

Because of this, we – as a committee – have devised multiple ways to “survey” the industry to see where it is heading in terms of popular and in-demand jobs.

A quick way to get a “broad” view of this is by simply scraping online job posting sites such as TeamWork Online or the NCAA Job Market.

Scraping these sites is relatively easy.

And, typically, the code is easy to adapt from one site to another (usually just a matter of changing the URL structure and the sequencing).

Let’s take a look at how to scrape, for example, the NCAA Job Market website.

Scraping Using rvest: Setting Up A Data.Frame

The first step in the web scraping process is setting up a data frame where all the information will be stored.

It takes a little bit of forward thinking in order to do this correctly.

For example, let’s take a look at the NCAA Job Market website:

When first examining a website for scraping you need to consider the exact information that you want to collect.

What would be beneficial? What would provide insightful data?

In this case, I see three things I want to scrape:

  1. The title of the job itself
  2. The institution that is hiring
  3. And the location of the job

Because of that, I need to create a data frame that includes those three variables. Doing so in RStudio is simple:

listings <- data.frame(title=character(),
                       school=character(), 
                       location=character(), 
                       stringsAsFactors=FALSE) 

Once you create your data frame, it is time to start constructing the script that will actually pull the information off of the site. The first step is developing the sequencing and ensuring that you provide the correct URL structure.

Web Scraping with rvest: Sequencing and URL Construction

If you were to visit the NCAA Job Market around the time I wrote this post, you would see that there are currently seven pages of jobs, with 25 jobs posted per page.

You, of course, want to grab all of the information beyond just the first page. In order to do this, you have to instruct rvest on how the URLs are structured on the website.

If you click ahead to Page 2 of the NCAA Job Market, you will see in your browser that the URL is structured as such:

https://ncaamarket.ncaa.org/jobs/?page=2

With that in mind, the code starts like this:

for (i in 1:7) {
  url_ds <- paste0("https://ncaamarket.ncaa.org/jobs/?page=", i)

Basically, you are instructing R to repeatedly “paste and go” that URL structure, running through the numbers 1 through 7 after page=.

As it does so, it pulls the information off of all seven pages.
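
If you want to see exactly what the loop will feed into the scraper, you can run the paste0() call on its own. This is purely an illustration and not part of the script itself:

# the seven URLs the loop will visit, one per iteration
paste0("https://ncaamarket.ncaa.org/jobs/?page=", 1:7)
# "https://ncaamarket.ncaa.org/jobs/?page=1"
# "https://ncaamarket.ncaa.org/jobs/?page=2"
# ...
# "https://ncaamarket.ncaa.org/jobs/?page=7"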

This is honestly the trickiest part of scraping, at least to me.

It takes a little trial and error sometimes to figure it out but, once you do it enough times, the process of piecing together this little puzzle becomes easier.

Once you get this part sorted out, you can move on to pulling the information for all the variables we listed above (title, school, and location).

Scraping with rvest: Pulling the Variables

At this point, the last thing you need to do is instruct rvest where exactly the information you are looking for is located on the site.

To better understand this, let’s look at the code:

  #job title
  title <- read_html(url_ds) %>% 
    html_nodes('#jobURL') %>%
    html_text() %>%
    str_extract("(\\w+.+)+")

  #school
  school <- read_html(url_ds) %>% 
    html_nodes('.bti-ui-job-result-detail-employer') %>%
    html_text() %>%
    str_extract("(\\w+).+") 

  #location
  location <- read_html(url_ds) %>% 
    html_nodes('.bti-ui-job-result-detail-location') %>%
    html_text() %>%
    str_extract("(\\w+).+") 

As you can see, you are using rvest to read the HTML of the URL you provided.

The most important part here, though, is the html_nodes section.

It is here that you tell rvest where to look for the information.

To get this information, you first need to install the Chrome extension called SelectorGadget.

Once you do that, visit the website, turn on SelectorGadget, and click on the information you want to scrape. You should see something like this:

You can see I highlighted the job title on the NCAA Job Market.

And, then, SelectorGadget is telling me that the title is nested within the element matched by the CSS selector ‘#jobURL.’

I take that information and simply insert it into the html_nodes section of the code.

And then do the same for school and location.
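
Also, before turning the loop loose on all seven pages, it is worth sanity-checking the selectors against a single page. Here is a rough sketch of how I might do that, with test_page being nothing more than a throwaway name for the check (and rvest loaded):

library(rvest)

# pull page one only and peek at what each selector returns
test_page <- read_html("https://ncaamarket.ncaa.org/jobs/?page=1")

test_page %>% html_nodes('#jobURL') %>% html_text() %>% head()
test_page %>% html_nodes('.bti-ui-job-result-detail-employer') %>% html_text() %>% head()
test_page %>% html_nodes('.bti-ui-job-result-detail-location') %>% html_text() %>% head()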

Once you do all of that, the last thing you need to do is make sure you do an rbind of all the data within the data frame. After doing so, the complete code looks like this:

# rvest handles the scraping and stringr handles the regex cleanup
library(rvest)
library(stringr)

listings <- data.frame(title=character(),
                       school=character(), 
                       location=character(), 
                       stringsAsFactors=FALSE) 

for (i in 1:7) {
  url_ds <- paste0("https://ncaamarket.ncaa.org/jobs/?page=", i)

  #job title
  title <- read_html(url_ds) %>% 
    html_nodes('#jobURL') %>%
    html_text() %>%
    str_extract("(\\w+.+)+")

  #school
  school <- read_html(url_ds) %>% 
    html_nodes('.bti-ui-job-result-detail-employer') %>%
    html_text() %>%
    str_extract("(\\w+).+") 

  #location
  location <- read_html(url_ds) %>% 
    html_nodes('.bti-ui-job-result-detail-location') %>%
    html_text() %>%
    str_extract("(\\w+).+") 

  #bind this page's results onto the running data frame
  listings <- rbind(listings, as.data.frame(cbind(title,
                                                  school,
                                                  location)))
}
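
One quick note on the str_extract() calls inside the loop: the raw text that html_text() returns usually comes back padded with newlines and indentation, and the regex simply keeps everything from the first word character onward, which trims that leading whitespace away. A small illustration with a made-up string (not pulled from the site):

library(stringr)

# made-up example of the padded text html_text() tends to return
raw_text <- "\n      Example State University\n    "

str_extract(raw_text, "(\\w+).+")
# [1] "Example State University"

The slightly different pattern used on the job title works out to essentially the same thing on a single line of text.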

Lastly, if you want to visualize the information, a wordcloud is a good choice.

wordcloud(paste(listings$title), type="text", 
          lang="english", excludeWords = c("experience","will","work"),
          textStemming = FALSE,  colorPalette="Paired",
          max.words=5000)
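
Depending on your setup, some of those extra arguments may simply be ignored by the base wordcloud() function and show up as warnings rather than errors. If you prefer a version that sticks to the base wordcloud() signature, here is a minimal sketch that builds the word frequencies with the tm package first; the excluded words mirror the ones above, and everything else is a standard text-mining cleanup rather than anything specific to this project:

library(tm)
library(wordcloud)
library(RColorBrewer)

# build a corpus from the job titles and do some light cleanup
corpus <- Corpus(VectorSource(listings$title))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords,
                 c(stopwords("english"), "experience", "will", "work"))

# count word frequencies and plot the most common terms
tdm   <- TermDocumentMatrix(corpus)
freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

wordcloud(names(freqs), freqs,
          max.words = 100,
          colors = brewer.pal(8, "Paired"))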

Scraping with rvest: Conclusion

As mentioned, the process of web scraping with rvest is not overly difficult.

Once you figure out the URL structure, the rest kind of falls in place. Of course, the use of SelectorGadget makes it even easier since you do not have to manually dig through the HTML to find where the information you want to scrape is nested.

As for the information gathered from the NCAA Job Market, it should not be surprising that coaching is an in-demand job. Specifically, it looks like Assistant Women’s Coaches are in especially high demand.

Doing the above process on other sites, such as Indeed or TeamWork Online, yields vastly different results, however.

You have to keep in mind that there is a limited amount of data on the NCAA Job Market – just under 200 jobs.

On the other hand, a search for “sports” on Indeed returns over a thousand results. TeamWork Online has nearly 700 jobs posted.

So, as you can imagine, the wordclouds and the conclusions you can draw from scraping those websites are a bit broader than those from the NCAA Job Market.

All said, though, the process of web scraping with rvest quickly produces some broad, overarching results that can lead to more nuanced discussion.

Brad Congelio

An Assistant Professor in the College of Business at Kutztown University of Pennsylvania, Brad Congelio uses data science and analytics to investigate the sport industry.

10 thoughts on “Web Scraping with rvest: Exploring Sports Industry Jobs”

  1. Did you run this in R 4.0.2? I installed libraries (tidyverse, rvest, wordcloud and tm), which were never mentioned. Data returned, but the wordcloud visual kicked out errors:
    graphical parameter “type” is obsolete
    “lang” is not a graphical parameter
    “excludeWords” is not a graphical parameter

    • I am not at my computer with R installed, but I regularly keep it up to date. Can’t imagine version is the issue, though.

      The wordcloud function is from here: WordCloud

      Try this: install.packages("SnowballC") # for text stemming

      I have not run into those specific errors, but running SnowballC (as outlined on STHDA) is a good start.

      Everything else you mentioned looks good.

  2. I just used the wordcloud library “wordcloud(listings$title)” and the visual showed up fine and without errors. Why did you use all the additional parameters?

  3. I think I understand why you used the parameters. Good articles on STHDA about snowball and text mining. I now understand about word stems. Also, I realized if I put “```{r, warning=FALSE}” in the chunk I get a good output. These were warnings not errors.
    Thanks for the help…

  4. Hi Brad,
    thanks for the useful and clear explanation.
    Could you please get into more details about the regex in your code?
    That is something I have not understood yet.
    Thank you for your time.
    Greetings from Italy

      • Hi –

        In short:

        for (i in 1:7)

        is instructing R to add the page number (that is, numbers 1-7) after “page=” in the URL.

        In other words, there were a total of seven pages … therefore, the process had to be repeated on all seven pages. The code is designed to grab all the material off of the first page, then ‘paste’ the # 2 after “page=” in the URL, grab all the material again, and so on and so forth until it reaches page 7.

        It is basically running the process over and over again, a total of 7 times, on the specific URL structure needed to grab information off all the pages, and then binding the info into the DF called ‘listings.’

        Hope that helps.
