Yield Rate Gender Gap, Part 3: Adding Rankings

This is Part 3 of an investigation into the yield rate disparity in college admissions between women and men. This is a personal project I started to help me tie together using python for web scraping, data cleaning, data visualization, hypothesis testing, statistical modeling, machine learning, and more. I appreciate feedback!

Our previously scraped and cleaned CollegeData.com table contains a lot of admissions information for each school, but it doesn’t contain anything about rankings. One of the most influential college rankings – for better or worse – are those put out by US News and World Report.

US News keeps multiple ranking lists, tracking what they call ‘National Universities’, ‘National Liberal Arts Colleges’, ‘Regional Universities’ (which are split into four regions), and ‘Regional Colleges’ (also split into four regions).

When scrolling down the page for each of these rankings, more rows will appear:

Scraping US News with BeautifulSoup

After scrolling down until the schools are no longer ranked, I opened Safari’s developer console and saved the generated HTML in its final state as a local .html files. I looked in the .html files to find the <tr> tags that correspond to a single row of results, and copied its attribute data-view = "colleges-search-results-table-row" to help identify all the <tr> tags each containing a row of rankings data. I then used BeautifulSoup to scrape the page and return a pandas DataFrame.

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

filepath = 'usnews.html'
row_id_attr = {"data-view":"colleges-search-results-table-row"}
field_labels = ['Name','Location','Rank','Tuition','Undergrads']

scraped = []
with open(filepath,'r') as file:
    page = BeautifulSoup(file.read(), "lxml")
    for tag in page('tr', attrs = row_id_attr):
        values = list(tag.stripped_strings)
        if '(tie)' in values: values.remove('(tie)') # unwanted field
        if '1' in values: values.remove('1') # unwanted field
        scraped.append(dict(zip(field_labels, values)))

rankings = pd.DataFrame(scraped)

rankings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1400 entries, 0 to 1399
Data columns (total 5 columns):
Location      1400 non-null object
Name          1400 non-null object
Rank          1400 non-null object
Tuition       1400 non-null object
Undergrads    1399 non-null object
dtypes: object(5)
memory usage: 54.8+ KB

US News had rankings for 1400 schools. Each listing also contains the tuition and the total number of undergraduates, which are helpful pieces of data that we did not get from CollegeData.com (they were both available on another one of the five pages for each school that we didn’t scrape).

We will want to join this new 'rankings' DataFrame with our saved 'college' DataFrame we scraped from CollegeData.com. We will discover that when we try to join tables on the 'Name' that US News and CollegeData.com don’t use the exact same names for each school, making it tricky to merge the tables together.


Cleaning the rankings

First, let’s clean 'rankings' column by column like we did for 'college' back in Part 2.

We’ll check the .mode() to find potential null values and replace them with np.nan:

rankings.mode().loc[0]
Location                             Chicago, IL
Name                               Union College
Rank          #231-#300 in National Universities
Tuition                                      N/A
Undergrads                                   N/A
Name: 0, dtype: object
rankings = rankings.replace('N/A',np.nan)

Next, we’ll extract the numeric 'Rank' as well as save the 'Type' of rank, that is, whether the school is a ‘National University’ or a ‘Regional Colleges North’, etc. We will also drop any schools we accidentally scraped from the end of the list that did not contain a ranking.

rankings[['Rank']].head()
Rank
0 #1 in Regional Colleges Midwest
1 #1 in Regional Colleges West
2 #1 in Regional Colleges North
3 #1 in Regional Universities Midwest
4 #1 in Regional Universities South
regexp = '([A-Z].*)$'
rankings['Type'] = rankings['Rank'].str.extract(regexp)
rankings['Type'] = rankings['Type'].fillna('Unranked')

unranked = rankings['Type'].str.contains('Unranked')
rankings = rankings.drop(rankings[unranked].index)
regional_colleges = rankings['Type'].str.contains('Regional Colleges')
rankings.loc[regional_colleges,'Type'] = 'Regional Colleges'
regional_universities = rankings['Type'].str.contains('Regional Universities')
rankings.loc[regional_universities,'Type'] = 'Regional Universities'

rankings['Type'] = rankings['Type'].astype('category')

rankings['Rank'] = rankings['Rank'].str.extract('#(\d*)').astype('float')
rankings = rankings.drop(rankings[rankings['Rank'].isnull()].index)

rankings[['Rank','Type']].head()
Rank Type
0 1.0 Regional Colleges
1 1.0 Regional Colleges
2 1.0 Regional Colleges
3 1.0 Regional Universities
4 1.0 Regional Universities

Next, we will split the 'City' and 'State' from the 'Location' column before dropping it:

rankings[['Location']].head()
Location
0 Grand Rapids, MI
1 Helena, MT
2 New York, NY
3 Omaha, NE
4 Elon, NC
regexp = '(.*), ([A-Z]{2})'
rankings[['City','State']] = rankings['Location'].str.extract(regexp)
rankings['State'] = rankings['State'].astype('category')
rankings = rankings.drop(columns='Location')
rankings[['City','State']].head()
City State
0 Grand Rapids MI
1 Helena MT
2 New York NY
3 Omaha NE
4 Elon NC

Next, we’ll extract the 'Tuition' and, when available, the 'Tuition (in-state)':

rankings.loc[slice(20,24),['Tuition']] # picked a slice w/in-state tuition
Tuition
20 $31,960
21 $51,848
22 $17,968 (out-of-state), $6,808 (in-state)
23 $39,062
24 $47,165
newcols = ['Tuition','Tuition (in-state)']
rankings[newcols] = rankings['Tuition'].str.replace(',','') \
                                       .str.extractall('\$(\d*)') \
                                       .astype('float').unstack()[0]
        
rankings.loc[slice(20,24), newcols] 
Tuition Tuition (in-state)
20 31960.0 NaN
21 51848.0 NaN
22 17968.0 6808.0
23 39062.0 NaN
24 47165.0 NaN

Looking at our last column:

rankings[['Undergrads']].head()
Undergrads
0 3,807
1 1,373
2 876
3 4,203
4 6,008

Finally, we just need to remove the commas from the 'Undergrads' column and convert it to a float:

rankings['Undergrads'] = rankings['Undergrads'].str.replace(',','')
rankings['Undergrads'] = rankings['Undergrads'].astype('float')
rankings[['Undergrads']].head()
Undergrads
0 3807.0
1 1373.0
2 876.0
3 4203.0
4 6008.0

Like in Part 2, we’ll append the 'State' in parenthesis to any duplicated 'Name':

duplicated = rankings['Name'].duplicated(keep=False)
duplicated.sum()  # check num of duplicates
32
for i, row in rankings[rankings['Name'].duplicated(keep=False)].iterrows():
    rankings.loc[i,'Name'] += ' (' + row['State'] + ')'
    
rankings.loc[duplicated, 'Name'].duplicated(keep=False).sum() # check again
0

Finally, set the 'Name' to be the DataFrame index.

rankings = rankings.set_index('Name', verify_integrity=True).sort_index()
rankings.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1381 entries, Abilene Christian University to Youngstown State University
Data columns (total 7 columns):
Rank                  1381 non-null float64
Tuition               1373 non-null float64
Undergrads            1268 non-null float64
Type                  1381 non-null category
City                  1381 non-null object
State                 1381 non-null category
Tuition (in-state)    515 non-null float64
dtypes: category(2), float64(4), object(1)
memory usage: 70.5+ KB

Matching the names in both tables

We have our final cleaned 'rankings' dataset from US News. Unfortunately, because of slight differences in our index 'Name', it won’t be so straightforward to join 'rankings' with our previously scraped 'college' dataset from CollegeData.com.

# IMPORT AND CLEAN UP COLLEGE DATAFRAME (NEED TO UPDATE PART 2)

# Import
college = pd.read_csv('collegedata_cleaned_backup.csv')
college['State'] = college['State'].astype('category')

# Differentiate dulicate names and set name as index
college = college.rename(columns={'schoolname':'Name'})
duplicated = college['Name'].duplicated(keep=False)
for i, row in college[college['Name'].duplicated(keep=False)].iterrows():
    college.loc[i,'Name'] += ' (' + row['City'] + ')'
college = college.set_index('Name', verify_integrity=True).sort_index()
matched = rankings.index.isin(college.index)
matched.sum()
1093

1093 of the 1381 'rankings' names are listed verbatim in 'college'. We will create a new temporary dataframe 'cleaned_rankings' to hold them for now, and then drop them from 'rankings'.

cleaned_rankings = rankings[matched]
rankings = rankings.drop(rankings[matched].index)
len(rankings)
288

Using ‘fuzzywuzzy’ to match names

There is a high probability many of the remaining 288 unmatched names in 'rankings' are written slightly differently than their yet-to-be-matched counterparts in 'college'. Using fuzzywuzzy, which was put together by the people at SeatGeek and built on difflib, we can find a list of closest names in the 'college' index for each name in the 'rankings' table. We’ll then check that the city of the matched 'college' name lines up with the city in the 'rankings' table, and if so, output the name to a new column callled 'Close Name' that we’ll visually inspect:

from fuzzywuzzy import process

def get_closest_name(row):
    close_names = process.extract(row.name, college.index)
    for close_name in close_names:
        if row['City'] == college.loc[close_name[0],'City']:
            return close_name[0]
    return np.nan

rankings['Close Name'] = rankings.apply(get_closest_name, axis=1)
rankings['Close Name'].count()
208

Using 'fuzzywuzzy.process.extract()', we got a 'Close Name' for 208 of the remaining 288 names in 'rankings'. To be safe, I’ll visually look through these 208 name matches before renaming and joining. WARNING: Prepare for a wall of a table!

pd.set_option('display.max_rows', 208)
rankings[['Close Name']].dropna()
Close Name
Name
Aquinas College Aquinas College (MI)
Arizona State University–Tempe Arizona State University
Auburn University–Montgomery Auburn University at Montgomery
Binghamton University–SUNY Binghamton University
Brigham Young University–Idaho Brigham Young University – Idaho
Brigham Young University–Provo Brigham Young University
Bryn Athyn College of the New Church Bryn Athyn College
CUNY–Hunter College Hunter College
CUNY–John Jay College of Criminal Justice John Jay College of Criminal Justice
CUNY–New York City College of Technology New York City College of Technology
CUNY–York College York College (NY)
California Polytechnic State University–San Luis Obispo California Polytechnic State University San Lu…
California State Polytechnic University–Pomona California State Polytechnic University Pomona
California State University–Bakersfield California State University Bakersfield
California State University–Channel Islands California State University Channel Islands
California State University–Chico California State University Chico
California State University–Dominguez Hills California State University Dominguez Hills
California State University–East Bay California State University East Bay
California State University–Fresno California State University Fresno
California State University–Fullerton California State University Fullerton
California State University–Long Beach California State University Long Beach
California State University–Los Angeles California State University Los Angeles
California State University–Monterey Bay California State University Monterey Bay
California State University–Northridge California State University Northridge
California State University–Sacramento California State University Sacramento
California State University–San Bernardino California State University San Bernardino
California State University–San Marcos California State University San Marcos
California State University–Stanislaus California State University Stanislaus
Central Christian College Central Christian College of Kansas
College at Brockport–SUNY College at Brockport
College of New Jersey The College of New Jersey
College of Saint Rose College of St. Rose
College of St. Mary College of Saint Mary
Colorado State University–Pueblo Colorado State University – Pueblo
Columbia College Columbia College (MO)
Concordia College–Moorhead Concordia College
Concordia University (CA) Concordia University Irvine
Concordia University (NE) Concordia University Nebraska
Concordia University–St. Paul Concordia University St. Paul
Cooper Union Cooper Union for the Advancement of Science an…
Edinboro University of Pennsylvania Edinboro University
Embry-Riddle Aeronautical University Embry-Riddle Aeronautical University – Daytona…
Emory and Henry College Emory & Henry College
Everglades University Everglades University – Boca Raton
Fairleigh Dickinson University Fairleigh Dickinson University – Metropolitan …
Farmingdale State College–SUNY Farmingdale State College
Florida A&M University Florida State University
Franklin and Marshall College Franklin & Marshall College
Georgia College & State University Georgia College and State University
Grace College and Seminary Grace College
Indiana University–Bloomington Indiana University Bloomington
Indiana University–Kokomo Indiana University Kokomo
Indiana University–South Bend Indiana University South Bend
Indiana University-Purdue University–Indianapolis Indiana University – Purdue University Indiana…
Johnson & Wales University Johnson & Wales University (RI)
Louisiana State University–Alexandria Louisiana State University Alexandria
Louisiana State University–Baton Rouge Louisiana State University
Louisiana State University–Shreveport Louisiana State University Shreveport
Miami University–Oxford Miami University
Minnesota State University–Mankato Minnesota State University Mankato
Minnesota State University–Moorhead Minnesota State University Moorhead
Missouri University of Science & Technology Missouri University of Science and Technology
Montana State University–Billings Montana State University – Billings
Montana State University–Northern Montana State University – Northern
Mount Saint Mary’s University Mount St. Mary’s University (CA)
Mount St. Mary’s University Mount St. Mary’s University (MD)
North Carolina State University–Raleigh North Carolina State University
Ohio State University–Columbus Columbus State University
Purchase College–SUNY Purchase College
Purdue University–Fort Wayne Indiana University – Purdue University Fort Wayne
Purdue University–Northwest Purdue University Northwest
Purdue University–West Lafayette Purdue University
SUNY College of Agriculture and Technology–Cobleskill State University of New York College of Agricu…
SUNY College of Environmental Science and Forestry State University of New York College of Enviro…
SUNY College–Old Westbury College at Old Westbury
SUNY–Fredonia State University of New York at Fredonia
SUNY–Geneseo State University of New York College at Geneseo
SUNY–Morrisville Morrisville State College
SUNY–Oswego State University of New York at Oswego
SUNY–Plattsburgh State University of New York Plattsburgh
Sewanee–University of the South Sewanee: University of the South
Silver Lake College Silver Lake College of the Holy Family
South University South University – Savannah
Southern Illinois University–Carbondale Southern Illinois University Carbondale
Southern Illinois University–Edwardsville Southern Illinois University Edwardsville
Southern University–New Orleans Southern University at New Orleans
St. Anselm College Saint Anselm College
St. Francis University Saint Francis University
St. John’s University (NY) St. John’s University
St. Joseph’s College St. Joseph’s College of Maine
St. Joseph’s University Saint Joseph’s University
St. Mary’s College Saint Mary’s College
St. Mary’s College of California Saint Mary’s College of California
St. Mary’s University of Minnesota Saint Mary’s University of Minnesota
St. Mary’s University of San Antonio St. Mary’s University
St. Michael’s College Saint Michael’s College
St. Xavier University Saint Xavier University
Sterling College Sterling College (KS)
Stony Brook University–SUNY Stony Brook University
Texas A&M University–College Station Texas A&M University
Texas A&M University–Commerce Texas A&M University – Commerce
Texas A&M University–Corpus Christi Texas A&M University – Corpus Christi
Texas A&M University–Kingsville Texas A&M University – Kingsville
Texas A&M University–Texarkana Texas A&M University – Texarkana
The Catholic University of America Catholic University of America
The New School Eugene Lang College The New School for Liberal…
University at Albany–SUNY University at Albany
University at Buffalo–SUNY University at Buffalo
University of Alabama–Birmingham University of Alabama at Birmingham
University of Alabama–Huntsville University of Alabama in Huntsville
University of Alaska–Anchorage University of Alaska Anchorage
University of Alaska–Fairbanks University of Alaska Fairbanks
University of Alaska–Southeast University of Alaska Southeast
University of Arkansas–Fort Smith University of Arkansas at Fort Smith
University of Arkansas–Little Rock University of Arkansas at Little Rock
University of Arkansas–Pine Bluff University of Arkansas at Pine Bluff
University of California–Berkeley University of California Berkeley
University of California–Davis University of California Davis
University of California–Irvine University of California Irvine
University of California–Los Angeles University of California Los Angeles
University of California–Merced University of California Merced
University of California–Riverside University of California Riverside
University of California–San Diego University of California San Diego
University of California–Santa Barbara University of California Santa Barbara
University of California–Santa Cruz University of California Santa Cruz
University of Colorado–Boulder University of Colorado Boulder
University of Colorado–Colorado Springs University of Colorado Colorado Springs
University of Colorado–Denver University of Colorado Denver
University of Hawaii–Hilo University of Hawaii at Hilo
University of Hawaii–Manoa University of Hawaii at Manoa
University of Hawaii–West Oahu University of Hawaii – West Oahu
University of Houston–Clear Lake University of Houston – Clear Lake
University of Houston–Downtown University of Houston – Downtown
University of Houston–Victoria University of Houston – Victoria
University of Illinois–Chicago University of Illinois at Chicago
University of Illinois–Springfield University of Illinois at Springfield
University of Illinois–Urbana-Champaign University of Illinois at Urbana-Champaign
University of Louisiana–Lafayette University of Louisiana at Lafayette
University of Louisiana–Monroe University of Louisiana at Monroe
University of Lynchburg Virginia University of Lynchburg
University of Maine–Augusta University of Maine at Augusta
University of Maine–Farmington University of Maine at Farmington
University of Maine–Fort Kent University of Maine at Fort Kent
University of Maine–Machias University of Maine at Machias
University of Maine–Presque Isle University of Maine at Presque Isle
University of Maryland–Baltimore County University of Maryland Baltimore County
University of Maryland–College Park University of Maryland
University of Maryland–Eastern Shore University of Maryland Eastern Shore
University of Massachusetts–Amherst University of Massachusetts Amherst
University of Massachusetts–Boston University of Massachusetts Boston
University of Massachusetts–Dartmouth University of Massachusetts Dartmouth
University of Massachusetts–Lowell University of Massachusetts Lowell
University of Michigan–Ann Arbor University of Michigan
University of Michigan–Dearborn University of Michigan – Dearborn
University of Michigan–Flint University of Michigan – Flint
University of Minnesota–Crookston University of Minnesota Crookston
University of Minnesota–Duluth University of Minnesota Duluth
University of Minnesota–Morris University of Minnesota Morris
University of Minnesota–Twin Cities University of Minnesota Twin Cities
University of Missouri University of Missouri – Columbia
University of Missouri–Kansas City University of Missouri – Kansas City
University of Missouri–St. Louis University of Missouri – St. Louis
University of Montana–Western University of Montana Western
University of Nebraska–Kearney University of Nebraska – Kearney
University of Nebraska–Lincoln University of Nebraska – Lincoln
University of Nebraska–Omaha University of Nebraska – Omaha
University of Nevada–Las Vegas University of Nevada Las Vegas
University of Nevada–Reno University of Nevada Reno
University of North Carolina–Asheville University of North Carolina at Asheville
University of North Carolina–Chapel Hill University of North Carolina at Chapel Hill
University of North Carolina–Charlotte University of North Carolina at Charlotte
University of North Carolina–Greensboro University of North Carolina at Greensboro
University of North Carolina–Pembroke University of North Carolina at Pembroke
University of North Carolina–Wilmington University of North Carolina at Wilmington
University of Northwestern–St. Paul University of Northwestern St. Paul
University of Saint Francis University of St. Francis (IN)
University of South Carolina–Aiken University of South Carolina Aiken
University of South Carolina–Beaufort University of South Carolina Beaufort
University of South Carolina–Upstate University of South Carolina Upstate
University of South Florida–St. Petersburg University of South Florida – Saint Petersburg
University of St. Francis University of St. Francis (IL)
University of St. Joseph University of Saint Joseph
University of Tennessee–Chattanooga University of Tennessee at Chattanooga
University of Tennessee–Martin University of Tennessee at Martin
University of Texas–Arlington University of Texas at Arlington
University of Texas–Austin University of Texas at Austin
University of Texas–Dallas University of Texas at Dallas
University of Texas–El Paso University of Texas at El Paso
University of Texas–San Antonio University of Texas at San Antonio
University of Texas–Tyler University of Texas at Tyler
University of Valley Forge Valley Forge University
University of Virginia–Wise University of Virginia’s College at Wise
University of Wisconsin–Eau Claire University of Wisconsin – Eau Claire
University of Wisconsin–Green Bay University of Wisconsin – Green Bay
University of Wisconsin–La Crosse University of Wisconsin – La Crosse
University of Wisconsin–Madison University of Wisconsin – Madison
University of Wisconsin–Milwaukee University of Wisconsin – Milwaukee
University of Wisconsin–Oshkosh University of Wisconsin – Oshkosh
University of Wisconsin–Parkside University of Wisconsin – Parkside
University of Wisconsin–Platteville University of Wisconsin – Platteville
University of Wisconsin–River Falls University of Wisconsin – River Falls
University of Wisconsin–Stevens Point University of Wisconsin – Stevens Point
University of Wisconsin–Stout University of Wisconsin – Stout
University of Wisconsin–Superior University of Wisconsin – Superior
University of Wisconsin–Whitewater University of Wisconsin – Whitewater
Washington and Jefferson College Washington & Jefferson College
William Paterson University of New Jersey William Paterson University
York College York College (NE)

The rare thing happened: the 'fuzzywuzzy.process.extract()' and city check still managed to match a 'ranking' name to a similar but incorrect 'college' name that just happened to be in the same city – three times!:

  • 'Florida A&M University' was incorrectly matched to 'Florida State University', both in Tallahassee, FL.

  • 'Ohio State University--Columbus' was incorrectly matched to 'Columbus State University', both in Columbus, OH.

  • 'University of Lynchburg' was incorrectly matched to 'Virginia University of Lynchburg', both in Lynchburg, VA.

By looking at all names in 'college' for each city – or by just directly searching CollegeData.com – and running a Google search when needed, it only takes a minute to find that these schools should be changed as follows:

rankings.loc['Florida A&M University',
             'Close Name'] = 'Florida Agricultural and Mechanical University'
rankings.loc['Ohio State University--Columbus',
             'Close Name'] = 'Ohio State University'
rankings.loc['University of Lynchburg',
             'Close Name'] = 'Lynchburg College'

We can now rename the index of these successfully matched 208 rows to the successful 'Close Name' field, drop 'Close Name', join them with 'college', and drop them from 'rankings'.

matched = ~rankings['Close Name'].isnull()
rankings = rankings.rename(dict(rankings.loc[matched,'Close Name']))
matched = ~rankings['Close Name'].isnull()
rankings = rankings.drop(columns='Close Name')

cleaned_rankings = cleaned_rankings.append(rankings[matched])

rankings = rankings.drop(rankings[matched].index)

len(rankings)
80

Even fuzzier matching

We have 80 rows left in 'rankings' that haven’t so far lined up with a row in 'college'. Maybe there is an issue where the city is written different between the two tables – we can try using 'fuzzywuzzy.process.extract()' again without checking the city this time.

def get_closest_name_modified(row):
    
    if row['City'] not in college['City'].values:
        city = process.extract(row['City'], college['City'].values)[0][0]
    else:
        city = row['City']
    
    names_in_city = college[college['City'] == city].index

    close_names = process.extract(row.name, names_in_city)
    for close_name in close_names:
        if row['State'] == college.loc[close_name[0],'State']:
            return close_name[0]
    return np.nan

rankings['Close Name'] = rankings.apply(get_closest_name_modified, axis=1)
rankings['Close Name'].count()
66

Our modified function matched 66 of the 80 remaining names. Let’s inspect the results. WARNING: Prepare for a slightly smaller wall of a table!

matched = ~rankings['Close Name'].isnull()
rankings.loc[matched, ['Close Name']]
Close Name
Name
Alderson Broaddus University Alderson-Broaddus College
Atlanta Metropolitan State College Carver College
Augsburg University University of Minnesota Twin Cities
Augusta University University of Phoenix Augusta Campus
Bay Path University Bay Path College
Bay State College Boston Architectural College
Bismarck State College University of Mary
Brigham Young University–Hawaii Brigham Young University – Hawaii
CUNY–Baruch College Baruch College (City University of New York)
CUNY–Brooklyn College Brooklyn College (City University of New York)
CUNY–City College Baruch College (City University of New York)
CUNY–College of Staten Island College of Staten Island (City University of N…
CUNY–Lehman College Lehman College (City University of New York)
CUNY–Medgar Evers College Brooklyn College (City University of New York)
CUNY–Queens College Queens College (City University of New York)
California State University–Maritime Academy California Maritime Academy
Castleton University Castleton State College
College of Our Lady of the Elms Elms College
Doane University Doane College
Dunwoody College of Technology Augsburg College
Felician University Felician College
Georgia Southern University–Armstrong Savannah State University
Greenville University Greenville College
Gwynedd Mercy University Gwynedd-Mercy College
Harvard University Lesley University
Iowa Wesleyan University Iowa Wesleyan College
Keiser University Nova Southeastern University
King University King College
Lincoln College Lincoln Christian University
Maranatha Baptist University Maranatha Baptist Bible College
Maryville University of St. Louis Maryville University of Saint Louis
Middle Georgia State University Mercer University
Midway University Midway College
Mount St. Joseph University Xavier University
Notre Dame College of Ohio Chamberlain College of Nursing (OH)
Oklahoma State University Institute of Technology–Okmulgee Bacone College
Pennsylvania State University–University Park Penn State University Park
Pensacola State College University of West Florida
Rutgers University–Camden Rutgers The State University of New Jersey – C…
Rutgers University–New Brunswick Rutgers The State University of New Jersey
Rutgers University–Newark Rutgers The State University of New Jersey – N…
SUNY Buffalo State Buffalo State College
SUNY College of Technology–Alfred Alfred State College
SUNY College of Technology–Canton St. Lawrence University
SUNY College–Cortland State University of New York College at Cortland
SUNY College–Oneonta State University of New York College at Oneonta
SUNY College–Potsdam State University of New York College at Potsdam
SUNY Maritime College State University of New York Maritime College
SUNY Polytechnic Institute–Albany/Utica Utica College
SUNY–New Paltz State University of New York at New Paltz
Southern University and A&M College Southern University and Agricultural and Mecha…
St. John’s University (MN) Saint John’s University
St. Joseph’s College New York Brooklyn College (City University of New York)
Tennessee Wesleyan University Tennessee Wesleyan College
The Master’s University and Seminary Master’s College
Truett McConnell University Truett-McConnell College
Tusculum University Tusculum College
University of Holy Cross Louisiana State University Health Sciences Center
University of Jamestown Jamestown College
University of Mount Olive Mount Olive College
University of North Texas at Dallas Art Institute of Dallas
University of Providence University of Great Falls
Virginia Tech Virginia Polytechnic Institute and State Unive…
Virginia Wesleyan University Virginia Wesleyan College
Waldorf University Waldorf College
Williams Baptist University Williams Baptist College

Some of these are obviously right or wrong, but many required searching Google and CollegeData.com to investigate. In many cases, the school has recently changed names. Unfortunately, for many cases the school is not listed in any way I could find on CollegeData.com. This process took significantly longer than the previous eye check – about 20 minutes of searching around. After manually renaming, let’s see how many remain unmatched:

renamings = {
    "Atlanta Metropolitan State College": np.nan,
    "Augsburg University": "Augsburg College",
    "Augusta University": "Georgia Regents University",
    "Bay State College": np.nan,
    "Bismarck State College": np.nan,
    "CUNY--City College": "City College of New York",
    "CUNY--Medgar Evers College": 
        "Medgar Evers College (City University of New York)",
    "Dunwoody College of Technology": np.nan,
    "Georgia Southern University--Armstrong": "Armstrong State University",
    "Harvard University": "Harvard College",
    "Keiser University": np.nan,
    "Middle Georgia State University": np.nan,
    "Mount St. Joseph University": "College of Mount St. Joseph",
    "Notre Dame College of Ohio": "Notre Dame College",
    "Oklahoma State University Institute of Technology--Okmulgee": np.nan,
    "Pensacola State College": np.nan,
    "SUNY College of Technology--Canton": np.nan,
    "SUNY Polytechnic Institute--Albany/Utica": 
        "State University of New York Institute of Technology",
    "St. Joseph's College New York":"St. Joseph's College - Brooklyn Campus",
    "University of Holy Cross":"Our Lady of Holy Cross College",
    "University of North Texas at Dallas": np.nan,
    "University of Providence": np.nan,
}

rankings.loc[list(renamings.keys()), 'Close Name'] = list(renamings.values())

matched = ~rankings['Close Name'].isnull()
rankings = rankings.rename(dict(rankings.loc[matched,'Close Name']))
matched = ~rankings['Close Name'].isnull()
rankings = rankings.drop(columns='Close Name')

cleaned_rankings = cleaned_rankings.append(rankings[matched])

rankings = rankings.drop(rankings[matched].index)
len(rankings)
25

No matches at all

There are 25 remaining schools in 'rankings' that did not fuzzy match with any school name in 'college', including the 11 that we just previously found to not exist at CollegeData.com. The other 14 are:

rankings[~rankings.index.isin(list(renamings.keys()))][['City','State']]
City State
Name
Abraham Baldwin Agricultural College Tifton GA
Alabama Agricultural and Mechanical University Normal AL
Chipola College Marianna FL
College of St. Benedict St. Joseph MN
Cottey College Nevada MO
Crowley’s Ridge College Paragould AR
DeVry University Downers Grove IL
Dean College Franklin MA
East Georgia State College Swainsboro GA
Gordon State College Barnesville GA
Indian River State College Fort Pierce FL
SUNY College of Technology–Delhi Delhi NY
South Florida State College Avon Park FL
University of Texas–Rio Grande Valley Edinburg TX

After some searching, two of these schools were indeed in CollegeData.com and were appropriately renamed. The notorious 'DeVry University' is listed on CollegeData.com – but not as a single school. CollegeData.com has 14 separate listings whereas the US News ranking considers it a single school. Unfortunately, while CollegeData.com has 14 pages for these schools, there is no admissions data provided for a single one of them. Due to this, I just matched the DeVry 'rankings' with the name 'DeVry University - Chicago' in the 'college' table.

After making changes, let’s see what remains:

renamings = {
    "Alabama Agricultural and Mechanical University": 
        "Alabama A&M University",
    "College of St. Benedict": "College of Saint Benedict",
    "DeVry University":"DeVry University - Chicago"
}

rankings.loc[list(renamings.keys()), 'Close Name'] = list(renamings.values())

matched = ~rankings['Close Name'].isnull()
rankings = rankings.rename(dict(rankings.loc[matched,'Close Name']))
matched = ~rankings['Close Name'].isnull()
rankings = rankings.drop(columns='Close Name')

cleaned_rankings = cleaned_rankings.append(rankings[matched])

rankings = rankings.drop(rankings[matched].index)
len(rankings)
22

We’ll return to these 22 unmatched schools in a moment.

Final inspection of cleaned rankings

For now, let’s check our 'cleaned_rankings' for any anomolies before we join it with 'college'. First, let’s see if there are any duplicates:

duplicated = cleaned_rankings.index.duplicated(keep=False)
cleaned_rankings[duplicated][['City','State']]
City State
Name
Concordia College Bronxville NY
Concordia College Moorhead MN

In terms of 'college' names, the Concordia in Bronxville, NY should be 'Concordia College New York' – the other one can stay as is.

second_duplicate = cleaned_rankings.index.duplicated(keep='last')
cleaned_rankings[second_duplicate]

idx = cleaned_rankings.index
idx = idx.where(~idx.duplicated(keep='last'), idx + ' New York')
cleaned_rankings.index = idx

cleaned_rankings[duplicated][['City','State']]
City State
Name
Concordia College New York Bronxville NY
Concordia College Moorhead MN

Let’s see any cities that don’t line up between the two tables:

unaligned_cities = college.loc[cleaned_rankings.index,'City'] != \
                                                    cleaned_rankings['City']
cleaned_rankings[unaligned_cities][['City']] \
                    .join(college[['City']], lsuffix=' (rankings)')
City (rankings) City
Name
American Jewish University Bel-Air Bel Air
Auburn University Auburn Auburn University
Bard College Annandale on Hudson Annandale-on-Hudson
Brewton-Parker College Mount Vernon Mt. Vernon
College of Mount St. Vincent Bronx Riverdale
Florida Memorial University Miami Miami-Dade
Lake Superior State University Sault Ste. Marie Sault Sainte Marie
Northern Illinois University DeKalb De Kalb
Nova Southeastern University Ft. Lauderdale Fort Lauderdale
Saint Mary-of-the-Woods College St. Mary-of-the-Woods Saint Mary of the Woods
St. Mary’s College of Maryland St. Marys City St. Mary’s City
University of Richmond Univ. of Richmond University of Richmond
Brigham Young University – Hawaii Laie Oahu Laie
Maryville University of Saint Louis St Louis St. Louis
Notre Dame College Cleveland South Euclid
Alabama A&M University Normal Huntsville
College of Saint Benedict St. Joseph Saint Joseph
DeVry University – Chicago Downers Grove Chicago

There’s nothing worrying here – I double checked that ‘Riverdale’ is a neighborhood in the ‘Bronx’, ‘Laie’ is on Oahu, ‘South Euclid’ is next to ‘Cleveland’, and ‘Normal’ is the post office name of the site of Alabama A&M inside ‘Huntsville’. Also recall we matched DeVry to the Chicago one. We can now drop the superfluous 'City' and 'State' columns from 'cleaned_rankings' in preparation for the join to 'college'.

cleaned_rankings = cleaned_rankings.drop(columns=['City','State'])

What’s being left behind?

Before we continue, let’s consider the 22 umatched schools that still remain in the uncleaned 'rankings' table that aren’t able to come along with the join. Many of them are recently converted community colleges though some, for whatever reason, just aren’t in the source dataset CollegeData.com used.

Let’s look at how big of a chunk these schools represent from our rankings in terms of undergrad population:

(rankings[['Type','Undergrads']].groupby('Type').sum() / 
    cleaned_rankings[['Type','Undergrads']].groupby('Type').sum())
Undergrads
Type
National Liberal Arts Colleges 0.000000
National Universities 0.005520
Regional Colleges 0.179471
Regional Universities 0.007486

It seems that the 'Regional Colleges' rank category in our soon-to-be-joined 'college' data set will be 18% short in terms of undergrads, but our other rank categories will be missing less than 1%.

Joining the tables

Finally, on with the join:

college = college.join(cleaned_rankings)

That’s it!

We’ll add a new rank category 'Unranked' for those schools in 'college' that were not on a US News ranking list, and then we’ll export the the whole thing to a .csv file:

college['Type'] = college['Type'].cat.add_categories('Unranked')
college['Type'] = college['Type'].fillna('Unranked')

college.to_csv('college.csv')

Now that our college admissions data is scraped, cleaned, and joined with rankings data, the hard part is over. We’ll move on to exploratory data analysis in the next post.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.