Yield Rate Gender Gap, Part 3: Adding Rankings

This is Part 3 of an investigation into the yield rate disparity in college admissions between women and men. This is a personal project I started to help me tie together using python for web scraping, data cleaning, data visualization, hypothesis testing, statistical modeling, machine learning, and more. I appreciate feedback!

Our previously scraped and cleaned CollegeData.com table contains a lot of admissions information for each school, but it doesn’t contain anything about rankings. One of the most influential college rankings – for better or worse – are those put out by US News and World Report.

US News keeps multiple ranking lists, tracking what they call ‘National Universities’, ‘National Liberal Arts Colleges’, ‘Regional Universities’ (which are split into four regions), and ‘Regional Colleges’ (also split into four regions).

When scrolling down the page for each of these rankings, more rows will appear:

Scraping US News with BeautifulSoup

After scrolling down until the schools are no longer ranked, I opened Safari’s developer console and saved the generated HTML in its final state as a local .html files. I looked in the .html files to find the <tr> tags that correspond to a single row of results, and copied its attribute data-view = "colleges-search-results-table-row" to help identify all the <tr> tags each containing a row of rankings data. I then used BeautifulSoup to scrape the page and return a pandas DataFrame.

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

filepath = 'usnews.html'
row_id_attr = {"data-view":"colleges-search-results-table-row"}
field_labels = ['Name','Location','Rank','Tuition','Undergrads']

scraped = []
with open(filepath,'r') as file:
    page = BeautifulSoup(file.read(), "lxml")
    for tag in page('tr', attrs = row_id_attr):
        values = list(tag.stripped_strings)
        if '(tie)' in values: values.remove('(tie)') # unwanted field
        if '1' in values: values.remove('1') # unwanted field
        scraped.append(dict(zip(field_labels, values)))

rankings = pd.DataFrame(scraped)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1400 entries, 0 to 1399
Data columns (total 5 columns):
Location      1400 non-null object
Name          1400 non-null object
Rank          1400 non-null object
Tuition       1400 non-null object
Undergrads    1399 non-null object
dtypes: object(5)
memory usage: 54.8+ KB

US News had rankings for 1400 schools. Each listing also contains the tuition and the total number of undergraduates, which are helpful pieces of data that we did not get from CollegeData.com (they were both available on another one of the five pages for each school that we didn’t scrape).

We will want to join this new 'rankings' DataFrame with our saved 'college' DataFrame we scraped from CollegeData.com. We will discover that when we try to join tables on the 'Name' that US News and CollegeData.com don’t use the exact same names for each school, making it tricky to merge the tables together.

Cleaning the rankings

First, let’s clean 'rankings' column by column like we did for 'college' back in Part 2.

We’ll check the .mode() to find potential null values and replace them with np.nan:

Location                             Chicago, IL
Name                               Union College
Rank          #231-#300 in National Universities
Tuition                                      N/A
Undergrads                                   N/A
Name: 0, dtype: object
rankings = rankings.replace('N/A',np.nan)

Next, we’ll extract the numeric 'Rank' as well as save the 'Type' of rank, that is, whether the school is a ‘National University’ or a ‘Regional Colleges North’, etc. We will also drop any schools we accidentally scraped from the end of the list that did not contain a ranking.

0 #1 in Regional Colleges Midwest
1 #1 in Regional Colleges West
2 #1 in Regional Colleges North
3 #1 in Regional Universities Midwest
4 #1 in Regional Universities South
regexp = '([A-Z].*)$'
rankings['Type'] = rankings['Rank'].str.extract(regexp)
rankings['Type'] = rankings['Type'].fillna('Unranked')

unranked = rankings['Type'].str.contains('Unranked')
rankings = rankings.drop(rankings[unranked].index)
regional_colleges = rankings['Type'].str.contains('Regional Colleges')
rankings.loc[regional_colleges,'Type'] = 'Regional Colleges'
regional_universities = rankings['Type'].str.contains('Regional Universities')
rankings.loc[regional_universities,'Type'] = 'Regional Universities'

rankings['Type'] = rankings['Type'].astype('category')

rankings['Rank'] = rankings['Rank'].str.extract('#(\d*)').astype('float')
rankings = rankings.drop(rankings[rankings['Rank'].isnull()].index)

Rank Type
0 1.0 Regional Colleges
1 1.0 Regional Colleges
2 1.0 Regional Colleges
3 1.0 Regional Universities
4 1.0 Regional Universities

Next, we will split the 'City' and 'State' from the 'Location' column before dropping it:

0 Grand Rapids, MI
1 Helena, MT
2 New York, NY
3 Omaha, NE
4 Elon, NC
regexp = '(.*), ([A-Z]{2})'
rankings[['City','State']] = rankings['Location'].str.extract(regexp)
rankings['State'] = rankings['State'].astype('category')
rankings = rankings.drop(columns='Location')
City State
0 Grand Rapids MI
1 Helena MT
2 New York NY
3 Omaha NE
4 Elon NC

Next, we’ll extract the 'Tuition' and, when available, the 'Tuition (in-state)':

rankings.loc[slice(20,24),['Tuition']] # picked a slice w/in-state tuition
20 $31,960
21 $51,848
22 $17,968 (out-of-state), $6,808 (in-state)
23 $39,062
24 $47,165
newcols = ['Tuition','Tuition (in-state)']
rankings[newcols] = rankings['Tuition'].str.replace(',','') \
                                       .str.extractall('\$(\d*)') \
rankings.loc[slice(20,24), newcols] 
Tuition Tuition (in-state)
20 31960.0 NaN
21 51848.0 NaN
22 17968.0 6808.0
23 39062.0 NaN
24 47165.0 NaN

Looking at our last column:

0 3,807
1 1,373
2 876
3 4,203
4 6,008

Finally, we just need to remove the commas from the 'Undergrads' column and convert it to a float:

rankings['Undergrads'] = rankings['Undergrads'].str.replace(',','')
rankings['Undergrads'] = rankings['Undergrads'].astype('float')
0 3807.0
1 1373.0
2 876.0
3 4203.0
4 6008.0

Like in Part 2, we’ll append the 'State' in parenthesis to any duplicated 'Name':

duplicated = rankings['Name'].duplicated(keep=False)
duplicated.sum()  # check num of duplicates
for i, row in rankings[rankings['Name'].duplicated(keep=False)].iterrows():
    rankings.loc[i,'Name'] += ' (' + row['State'] + ')'
rankings.loc[duplicated, 'Name'].duplicated(keep=False).sum() # check again

Finally, set the 'Name' to be the DataFrame index.

rankings = rankings.set_index('Name', verify_integrity=True).sort_index()
<class 'pandas.core.frame.DataFrame'>
Index: 1381 entries, Abilene Christian University to Youngstown State University
Data columns (total 7 columns):
Rank                  1381 non-null float64
Tuition               1373 non-null float64
Undergrads            1268 non-null float64
Type                  1381 non-null category
City                  1381 non-null object
State                 1381 non-null category
Tuition (in-state)    515 non-null float64
dtypes: category(2), float64(4), object(1)
memory usage: 70.5+ KB

Matching the names in both tables

We have our final cleaned 'rankings' dataset from US News. Unfortunately, because of slight differences in our index 'Name', it won’t be so straightforward to join 'rankings' with our previously scraped 'college' dataset from CollegeData.com.


# Import
college = pd.read_csv('collegedata_cleaned_backup.csv')
college['State'] = college['State'].astype('category')

# Differentiate dulicate names and set name as index
college = college.rename(columns={'schoolname':'Name'})
duplicated = college['Name'].duplicated(keep=False)
for i, row in college[college['Name'].duplicated(keep=False)].iterrows():
    college.loc[i,'Name'] += ' (' + row['City'] + ')'
college = college.set_index('Name', verify_integrity=True).sort_index()
matched = rankings.index.isin(college.index)

1093 of the 1381 'rankings' names are listed verbatim in 'college'. We will create a new temporary dataframe 'cleaned_rankings' to hold them for now, and then drop them from 'rankings'.

cleaned_rankings = rankings[matched]
rankings = rankings.drop(rankings[matched].index)

Using ‘fuzzywuzzy’ to match names

There is a high probability many of the remaining 288 unmatched names in 'rankings' are written slightly differently than their yet-to-be-matched counterparts in 'college'. Using fuzzywuzzy, which was put together by the people at SeatGeek and built on difflib, we can find a list of closest names in the 'college' index for each name in the 'rankings' table. We’ll then check that the city of the matched 'college' name lines up with the city in the 'rankings' table, and if so, output the name to a new column callled 'Close Name' that we’ll visually inspect:

from fuzzywuzzy import process

def get_closest_name(row):
    close_names = process.extract(row.name, college.index)
    for close_name in close_names:
        if row['City'] == college.loc[close_name[0],'City']:
            return close_name[0]
    return np.nan

rankings['Close Name'] = rankings.apply(get_closest_name, axis=1)
rankings['Close Name'].count()

Using 'fuzzywuzzy.process.extract()', we got a 'Close Name' for 208 of the remaining 288 names in 'rankings'. To be safe, I’ll visually look through these 208 name matches before renaming and joining. WARNING: Prepare for a wall of a table!

pd.set_option('display.max_rows', 208)
rankings[['Close Name']].dropna()
Close Name
Aquinas College Aquinas College (MI)
Arizona State University–Tempe Arizona State University
Auburn University–Montgomery Auburn University at Montgomery
Binghamton University–SUNY Binghamton University
Brigham Young University–Idaho Brigham Young University – Idaho
Brigham Young University–Provo Brigham Young University
Bryn Athyn College of the New Church Bryn Athyn College
CUNY–Hunter College Hunter College
CUNY–John Jay College of Criminal Justice John Jay College of Criminal Justice
CUNY–New York City College of Technology New York City College of Technology
CUNY–York College York College (NY)
California Polytechnic State University–San Luis Obispo California Polytechnic State University San Lu…
California State Polytechnic University–Pomona California State Polytechnic University Pomona
California State University–Bakersfield California State University Bakersfield
California State University–Channel Islands California State University Channel Islands
California State University–Chico California State University Chico
California State University–Dominguez Hills California State University Dominguez Hills
California State University–East Bay California State University East Bay
California State University–Fresno California State University Fresno
California State University–Fullerton California State University Fullerton
California State University–Long Beach California State University Long Beach
California State University–Los Angeles California State University Los Angeles
California State University–Monterey Bay California State University Monterey Bay
California State University–Northridge California State University Northridge
California State University–Sacramento California State University Sacramento
California State University–San Bernardino California State University San Bernardino
California State University–San Marcos California State University San Marcos
California State University–Stanislaus California State University Stanislaus
Central Christian College Central Christian College of Kansas
College at Brockport–SUNY College at Brockport
College of New Jersey The College of New Jersey
College of Saint Rose College of St. Rose
College of St. Mary College of Saint Mary
Colorado State University–Pueblo Colorado State University – Pueblo
Columbia College Columbia College (MO)
Concordia College–Moorhead Concordia College
Concordia University (CA) Concordia University Irvine
Concordia University (NE) Concordia University Nebraska
Concordia University–St. Paul Concordia University St. Paul
Cooper Union Cooper Union for the Advancement of Science an…
Edinboro University of Pennsylvania Edinboro University
Embry-Riddle Aeronautical University Embry-Riddle Aeronautical University – Daytona…
Emory and Henry College Emory & Henry College
Everglades University Everglades University – Boca Raton
Fairleigh Dickinson University Fairleigh Dickinson University – Metropolitan …
Farmingdale State College–SUNY Farmingdale State College
Florida A&M University Florida State University
Franklin and Marshall College Franklin & Marshall College
Georgia College & State University Georgia College and State University
Grace College and Seminary Grace College
Indiana University–Bloomington Indiana University Bloomington
Indiana University–Kokomo Indiana University Kokomo
Indiana University–South Bend Indiana University South Bend
Indiana University-Purdue University–Indianapolis Indiana University – Purdue University Indiana…
Johnson & Wales University Johnson & Wales University (RI)
Louisiana State University–Alexandria Louisiana State University Alexandria
Louisiana State University–Baton Rouge Louisiana State University
Louisiana State University–Shreveport Louisiana State University Shreveport
Miami University–Oxford Miami University
Minnesota State University–Mankato Minnesota State University Mankato
Minnesota State University–Moorhead Minnesota State University Moorhead
Missouri University of Science & Technology Missouri University of Science and Technology
Montana State University–Billings Montana State University – Billings
Montana State University–Northern Montana State University – Northern
Mount Saint Mary’s University Mount St. Mary’s University (CA)
Mount St. Mary’s University Mount St. Mary’s University (MD)
North Carolina State University–Raleigh North Carolina State University
Ohio State University–Columbus Columbus State University
Purchase College–SUNY Purchase College
Purdue University–Fort Wayne Indiana University – Purdue University Fort Wayne
Purdue University–Northwest Purdue University Northwest
Purdue University–West Lafayette Purdue University
SUNY College of Agriculture and Technology–Cobleskill State University of New York College of Agricu…
SUNY College of Environmental Science and Forestry State University of New York College of Enviro…
SUNY College–Old Westbury College at Old Westbury
SUNY–Fredonia State University of New York at Fredonia
SUNY–Geneseo State University of New York College at Geneseo
SUNY–Morrisville Morrisville State College
SUNY–Oswego State University of New York at Oswego
SUNY–Plattsburgh State University of New York Plattsburgh
Sewanee–University of the South Sewanee: University of the South
Silver Lake College Silver Lake College of the Holy Family
South University South University – Savannah
Southern Illinois University–Carbondale Southern Illinois University Carbondale
Southern Illinois University–Edwardsville Southern Illinois University Edwardsville
Southern University–New Orleans Southern University at New Orleans
St. Anselm College Saint Anselm College
St. Francis University Saint Francis University
St. John’s University (NY) St. John’s University
St. Joseph’s College St. Joseph’s College of Maine
St. Joseph’s University Saint Joseph’s University
St. Mary’s College Saint Mary’s College
St. Mary’s College of California Saint Mary’s College of California
St. Mary’s University of Minnesota Saint Mary’s University of Minnesota
St. Mary’s University of San Antonio St. Mary’s University
St. Michael’s College Saint Michael’s College
St. Xavier University Saint Xavier University
Sterling College Sterling College (KS)
Stony Brook University–SUNY Stony Brook University
Texas A&M University–College Station Texas A&M University
Texas A&M University–Commerce Texas A&M University – Commerce
Texas A&M University–Corpus Christi Texas A&M University – Corpus Christi
Texas A&M University–Kingsville Texas A&M University – Kingsville
Texas A&M University–Texarkana Texas A&M University – Texarkana
The Catholic University of America Catholic University of America
The New School Eugene Lang College The New School for Liberal…
University at Albany–SUNY University at Albany
University at Buffalo–SUNY University at Buffalo
University of Alabama–Birmingham University of Alabama at Birmingham
University of Alabama–Huntsville University of Alabama in Huntsville
University of Alaska–Anchorage University of Alaska Anchorage
University of Alaska–Fairbanks University of Alaska Fairbanks
University of Alaska–Southeast University of Alaska Southeast
University of Arkansas–Fort Smith University of Arkansas at Fort Smith
University of Arkansas–Little Rock University of Arkansas at Little Rock
University of Arkansas–Pine Bluff University of Arkansas at Pine Bluff
University of California–Berkeley University of California Berkeley
University of California–Davis University of California Davis
University of California–Irvine University of California Irvine
University of California–Los Angeles University of California Los Angeles
University of California–Merced University of California Merced
University of California–Riverside University of California Riverside
University of California–San Diego University of California San Diego
University of California–Santa Barbara University of California Santa Barbara
University of California–Santa Cruz University of California Santa Cruz
University of Colorado–Boulder University of Colorado Boulder
University of Colorado–Colorado Springs University of Colorado Colorado Springs
University of Colorado–Denver University of Colorado Denver
University of Hawaii–Hilo University of Hawaii at Hilo
University of Hawaii–Manoa University of Hawaii at Manoa
University of Hawaii–West Oahu University of Hawaii – West Oahu
University of Houston–Clear Lake University of Houston – Clear Lake
University of Houston–Downtown University of Houston – Downtown
University of Houston–Victoria University of Houston – Victoria
University of Illinois–Chicago University of Illinois at Chicago
University of Illinois–Springfield University of Illinois at Springfield
University of Illinois–Urbana-Champaign University of Illinois at Urbana-Champaign
University of Louisiana–Lafayette University of Louisiana at Lafayette
University of Louisiana–Monroe University of Louisiana at Monroe
University of Lynchburg Virginia University of Lynchburg
University of Maine–Augusta University of Maine at Augusta
University of Maine–Farmington University of Maine at Farmington
University of Maine–Fort Kent University of Maine at Fort Kent
University of Maine–Machias University of Maine at Machias
University of Maine–Presque Isle University of Maine at Presque Isle
University of Maryland–Baltimore County University of Maryland Baltimore County
University of Maryland–College Park University of Maryland
University of Maryland–Eastern Shore University of Maryland Eastern Shore
University of Massachusetts–Amherst University of Massachusetts Amherst
University of Massachusetts–Boston University of Massachusetts Boston
University of Massachusetts–Dartmouth University of Massachusetts Dartmouth
University of Massachusetts–Lowell University of Massachusetts Lowell
University of Michigan–Ann Arbor University of Michigan
University of Michigan–Dearborn University of Michigan – Dearborn
University of Michigan–Flint University of Michigan – Flint
University of Minnesota–Crookston University of Minnesota Crookston
University of Minnesota–Duluth University of Minnesota Duluth
University of Minnesota–Morris University of Minnesota Morris
University of Minnesota–Twin Cities University of Minnesota Twin Cities
University of Missouri University of Missouri – Columbia
University of Missouri–Kansas City University of Missouri – Kansas City
University of Missouri–St. Louis University of Missouri – St. Louis
University of Montana–Western University of Montana Western
University of Nebraska–Kearney University of Nebraska – Kearney
University of Nebraska–Lincoln University of Nebraska – Lincoln
University of Nebraska–Omaha University of Nebraska – Omaha
University of Nevada–Las Vegas University of Nevada Las Vegas
University of Nevada–Reno University of Nevada Reno
University of North Carolina–Asheville University of North Carolina at Asheville
University of North Carolina–Chapel Hill University of North Carolina at Chapel Hill
University of North Carolina–Charlotte University of North Carolina at Charlotte
University of North Carolina–Greensboro University of North Carolina at Greensboro
University of North Carolina–Pembroke University of North Carolina at Pembroke
University of North Carolina–Wilmington University of North Carolina at Wilmington
University of Northwestern–St. Paul University of Northwestern St. Paul
University of Saint Francis University of St. Francis (IN)
University of South Carolina–Aiken University of South Carolina Aiken
University of South Carolina–Beaufort University of South Carolina Beaufort
University of South Carolina–Upstate University of South Carolina Upstate
University of South Florida–St. Petersburg University of South Florida – Saint Petersburg
University of St. Francis University of St. Francis (IL)
University of St. Joseph University of Saint Joseph
University of Tennessee–Chattanooga University of Tennessee at Chattanooga
University of Tennessee–Martin University of Tennessee at Martin
University of Texas–Arlington University of Texas at Arlington
University of Texas–Austin University of Texas at Austin
University of Texas–Dallas University of Texas at Dallas
University of Texas–El Paso University of Texas at El Paso
University of Texas–San Antonio University of Texas at San Antonio
University of Texas–Tyler University of Texas at Tyler
University of Valley Forge Valley Forge University
University of Virginia–Wise University of Virginia’s College at Wise
University of Wisconsin–Eau Claire University of Wisconsin – Eau Claire
University of Wisconsin–Green Bay University of Wisconsin – Green Bay
University of Wisconsin–La Crosse University of Wisconsin – La Crosse
University of Wisconsin–Madison University of Wisconsin – Madison
University of Wisconsin–Milwaukee University of Wisconsin – Milwaukee
University of Wisconsin–Oshkosh University of Wisconsin – Oshkosh
University of Wisconsin–Parkside University of Wisconsin – Parkside
University of Wisconsin–Platteville University of Wisconsin – Platteville
University of Wisconsin–River Falls University of Wisconsin – River Falls
University of Wisconsin–Stevens Point University of Wisconsin – Stevens Point
University of Wisconsin–Stout University of Wisconsin – Stout
University of Wisconsin–Superior University of Wisconsin – Superior
University of Wisconsin–Whitewater University of Wisconsin – Whitewater
Washington and Jefferson College Washington & Jefferson College
William Paterson University of New Jersey William Paterson University
York College York College (NE)

The rare thing happened: the 'fuzzywuzzy.process.extract()' and city check still managed to match a 'ranking' name to a similar but incorrect 'college' name that just happened to be in the same city – three times!:

  • 'Florida A&M University' was incorrectly matched to 'Florida State University', both in Tallahassee, FL.

  • 'Ohio State University--Columbus' was incorrectly matched to 'Columbus State University', both in Columbus, OH.

  • 'University of Lynchburg' was incorrectly matched to 'Virginia University of Lynchburg', both in Lynchburg, VA.

By looking at all names in 'college' for each city – or by just directly searching CollegeData.com – and running a Google search when needed, it only takes a minute to find that these schools should be changed as follows:

rankings.loc['Florida A&M University',
             'Close Name'] = 'Florida Agricultural and Mechanical University'
rankings.loc['Ohio State University--Columbus',
             'Close Name'] = 'Ohio State University'
rankings.loc['University of Lynchburg',
             'Close Name'] = 'Lynchburg College'

We can now rename the index of these successfully matched 208 rows to the successful 'Close Name' field, drop 'Close Name', join them with 'college', and drop them from 'rankings'.

matched = ~rankings['Close Name'].isnull()
rankings = rankings.rename(dict(rankings.loc[matched,'Close Name']))
matched = ~rankings['Close Name'].isnull()
rankings = rankings.drop(columns='Close Name')

cleaned_rankings = cleaned_rankings.append(rankings[matched])

rankings = rankings.drop(rankings[matched].index)


Even fuzzier matching

We have 80 rows left in 'rankings' that haven’t so far lined up with a row in 'college'. Maybe there is an issue where the city is written different between the two tables – we can try using 'fuzzywuzzy.process.extract()' again without checking the city this time.

def get_closest_name_modified(row):
    if row['City'] not in college['City'].values:
        city = process.extract(row['City'], college['City'].values)[0][0]
        city = row['City']
    names_in_city = college[college['City'] == city].index

    close_names = process.extract(row.name, names_in_city)
    for close_name in close_names:
        if row['State'] == college.loc[close_name[0],'State']:
            return close_name[0]
    return np.nan

rankings['Close Name'] = rankings.apply(get_closest_name_modified, axis=1)
rankings['Close Name'].count()

Our modified function matched 66 of the 80 remaining names. Let’s inspect the results. WARNING: Prepare for a slightly smaller wall of a table!

matched = ~rankings['Close Name'].isnull()
rankings.loc[matched, ['Close Name']]
Close Name
Alderson Broaddus University Alderson-Broaddus College
Atlanta Metropolitan State College Carver College
Augsburg University University of Minnesota Twin Cities
Augusta University University of Phoenix Augusta Campus
Bay Path University Bay Path College
Bay State College Boston Architectural College
Bismarck State College University of Mary
Brigham Young University–Hawaii Brigham Young University – Hawaii
CUNY–Baruch College Baruch College (City University of New York)
CUNY–Brooklyn College Brooklyn College (City University of New York)
CUNY–City College Baruch College (City University of New York)
CUNY–College of Staten Island College of Staten Island (City University of N…
CUNY–Lehman College Lehman College (City University of New York)
CUNY–Medgar Evers College Brooklyn College (City University of New York)
CUNY–Queens College Queens College (City University of New York)
California State University–Maritime Academy California Maritime Academy
Castleton University Castleton State College
College of Our Lady of the Elms Elms College
Doane University Doane College
Dunwoody College of Technology Augsburg College
Felician University Felician College
Georgia Southern University–Armstrong Savannah State University
Greenville University Greenville College
Gwynedd Mercy University Gwynedd-Mercy College
Harvard University Lesley University
Iowa Wesleyan University Iowa Wesleyan College
Keiser University Nova Southeastern University
King University King College
Lincoln College Lincoln Christian University
Maranatha Baptist University Maranatha Baptist Bible College
Maryville University of St. Louis Maryville University of Saint Louis
Middle Georgia State University Mercer University
Midway University Midway College
Mount St. Joseph University Xavier University
Notre Dame College of Ohio Chamberlain College of Nursing (OH)
Oklahoma State University Institute of Technology–Okmulgee Bacone College
Pennsylvania State University–University Park Penn State University Park
Pensacola State College University of West Florida
Rutgers University–Camden Rutgers The State University of New Jersey – C…
Rutgers University–New Brunswick Rutgers The State University of New Jersey
Rutgers University–Newark Rutgers The State University of New Jersey – N…
SUNY Buffalo State Buffalo State College
SUNY College of Technology–Alfred Alfred State College
SUNY College of Technology–Canton St. Lawrence University
SUNY College–Cortland State University of New York College at Cortland
SUNY College–Oneonta State University of New York College at Oneonta
SUNY College–Potsdam State University of New York College at Potsdam
SUNY Maritime College State University of New York Maritime College
SUNY Polytechnic Institute–Albany/Utica Utica College
SUNY–New Paltz State University of New York at New Paltz
Southern University and A&M College Southern University and Agricultural and Mecha…
St. John’s University (MN) Saint John’s University
St. Joseph’s College New York Brooklyn College (City University of New York)
Tennessee Wesleyan University Tennessee Wesleyan College
The Master’s University and Seminary Master’s College
Truett McConnell University Truett-McConnell College
Tusculum University Tusculum College
University of Holy Cross Louisiana State University Health Sciences Center
University of Jamestown Jamestown College
University of Mount Olive Mount Olive College
University of North Texas at Dallas Art Institute of Dallas
University of Providence University of Great Falls
Virginia Tech Virginia Polytechnic Institute and State Unive…
Virginia Wesleyan University Virginia Wesleyan College
Waldorf University Waldorf College
Williams Baptist University Williams Baptist College

Some of these are obviously right or wrong, but many required searching Google and CollegeData.com to investigate. In many cases, the school has recently changed names. Unfortunately, for many cases the school is not listed in any way I could find on CollegeData.com. This process took significantly longer than the previous eye check – about 20 minutes of searching around. After manually renaming, let’s see how many remain unmatched:

renamings = {
    "Atlanta Metropolitan State College": np.nan,
    "Augsburg University": "Augsburg College",
    "Augusta University": "Georgia Regents University",
    "Bay State College": np.nan,
    "Bismarck State College": np.nan,
    "CUNY--City College": "City College of New York",
    "CUNY--Medgar Evers College": 
        "Medgar Evers College (City University of New York)",
    "Dunwoody College of Technology": np.nan,
    "Georgia Southern University--Armstrong": "Armstrong State University",
    "Harvard University": "Harvard College",
    "Keiser University": np.nan,
    "Middle Georgia State University": np.nan,
    "Mount St. Joseph University": "College of Mount St. Joseph",
    "Notre Dame College of Ohio": "Notre Dame College",
    "Oklahoma State University Institute of Technology--Okmulgee": np.nan,
    "Pensacola State College": np.nan,
    "SUNY College of Technology--Canton": np.nan,
    "SUNY Polytechnic Institute--Albany/Utica": 
        "State University of New York Institute of Technology",
    "St. Joseph's College New York":"St. Joseph's College - Brooklyn Campus",
    "University of Holy Cross":"Our Lady of Holy Cross College",
    "University of North Texas at Dallas": np.nan,
    "University of Providence": np.nan,

rankings.loc[list(renamings.keys()), 'Close Name'] = list(renamings.values())

matched = ~rankings['Close Name'].isnull()
rankings = rankings.rename(dict(rankings.loc[matched,'Close Name']))
matched = ~rankings['Close Name'].isnull()
rankings = rankings.drop(columns='Close Name')

cleaned_rankings = cleaned_rankings.append(rankings[matched])

rankings = rankings.drop(rankings[matched].index)

No matches at all

There are 25 remaining schools in 'rankings' that did not fuzzy match with any school name in 'college', including the 11 that we just previously found to not exist at CollegeData.com. The other 14 are:

City State
Abraham Baldwin Agricultural College Tifton GA
Alabama Agricultural and Mechanical University Normal AL
Chipola College Marianna FL
College of St. Benedict St. Joseph MN
Cottey College Nevada MO
Crowley’s Ridge College Paragould AR
DeVry University Downers Grove IL
Dean College Franklin MA
East Georgia State College Swainsboro GA
Gordon State College Barnesville GA
Indian River State College Fort Pierce FL
SUNY College of Technology–Delhi Delhi NY
South Florida State College Avon Park FL
University of Texas–Rio Grande Valley Edinburg TX

After some searching, two of these schools were indeed in CollegeData.com and were appropriately renamed. The notorious 'DeVry University' is listed on CollegeData.com – but not as a single school. CollegeData.com has 14 separate listings whereas the US News ranking considers it a single school. Unfortunately, while CollegeData.com has 14 pages for these schools, there is no admissions data provided for a single one of them. Due to this, I just matched the DeVry 'rankings' with the name 'DeVry University - Chicago' in the 'college' table.

After making changes, let’s see what remains:

renamings = {
    "Alabama Agricultural and Mechanical University": 
        "Alabama A&M University",
    "College of St. Benedict": "College of Saint Benedict",
    "DeVry University":"DeVry University - Chicago"

rankings.loc[list(renamings.keys()), 'Close Name'] = list(renamings.values())

matched = ~rankings['Close Name'].isnull()
rankings = rankings.rename(dict(rankings.loc[matched,'Close Name']))
matched = ~rankings['Close Name'].isnull()
rankings = rankings.drop(columns='Close Name')

cleaned_rankings = cleaned_rankings.append(rankings[matched])

rankings = rankings.drop(rankings[matched].index)

We’ll return to these 22 unmatched schools in a moment.

Final inspection of cleaned rankings

For now, let’s check our 'cleaned_rankings' for any anomolies before we join it with 'college'. First, let’s see if there are any duplicates:

duplicated = cleaned_rankings.index.duplicated(keep=False)
City State
Concordia College Bronxville NY
Concordia College Moorhead MN

In terms of 'college' names, the Concordia in Bronxville, NY should be 'Concordia College New York' – the other one can stay as is.

second_duplicate = cleaned_rankings.index.duplicated(keep='last')

idx = cleaned_rankings.index
idx = idx.where(~idx.duplicated(keep='last'), idx + ' New York')
cleaned_rankings.index = idx

City State
Concordia College New York Bronxville NY
Concordia College Moorhead MN

Let’s see any cities that don’t line up between the two tables:

unaligned_cities = college.loc[cleaned_rankings.index,'City'] != \
cleaned_rankings[unaligned_cities][['City']] \
                    .join(college[['City']], lsuffix=' (rankings)')
City (rankings) City
American Jewish University Bel-Air Bel Air
Auburn University Auburn Auburn University
Bard College Annandale on Hudson Annandale-on-Hudson
Brewton-Parker College Mount Vernon Mt. Vernon
College of Mount St. Vincent Bronx Riverdale
Florida Memorial University Miami Miami-Dade
Lake Superior State University Sault Ste. Marie Sault Sainte Marie
Northern Illinois University DeKalb De Kalb
Nova Southeastern University Ft. Lauderdale Fort Lauderdale
Saint Mary-of-the-Woods College St. Mary-of-the-Woods Saint Mary of the Woods
St. Mary’s College of Maryland St. Marys City St. Mary’s City
University of Richmond Univ. of Richmond University of Richmond
Brigham Young University – Hawaii Laie Oahu Laie
Maryville University of Saint Louis St Louis St. Louis
Notre Dame College Cleveland South Euclid
Alabama A&M University Normal Huntsville
College of Saint Benedict St. Joseph Saint Joseph
DeVry University – Chicago Downers Grove Chicago

There’s nothing worrying here – I double checked that ‘Riverdale’ is a neighborhood in the ‘Bronx’, ‘Laie’ is on Oahu, ‘South Euclid’ is next to ‘Cleveland’, and ‘Normal’ is the post office name of the site of Alabama A&M inside ‘Huntsville’. Also recall we matched DeVry to the Chicago one. We can now drop the superfluous 'City' and 'State' columns from 'cleaned_rankings' in preparation for the join to 'college'.

cleaned_rankings = cleaned_rankings.drop(columns=['City','State'])

What’s being left behind?

Before we continue, let’s consider the 22 umatched schools that still remain in the uncleaned 'rankings' table that aren’t able to come along with the join. Many of them are recently converted community colleges though some, for whatever reason, just aren’t in the source dataset CollegeData.com used.

Let’s look at how big of a chunk these schools represent from our rankings in terms of undergrad population:

(rankings[['Type','Undergrads']].groupby('Type').sum() / 
National Liberal Arts Colleges 0.000000
National Universities 0.005520
Regional Colleges 0.179471
Regional Universities 0.007486

It seems that the 'Regional Colleges' rank category in our soon-to-be-joined 'college' data set will be 18% short in terms of undergrads, but our other rank categories will be missing less than 1%.

Joining the tables

Finally, on with the join:

college = college.join(cleaned_rankings)

That’s it!

We’ll add a new rank category 'Unranked' for those schools in 'college' that were not on a US News ranking list, and then we’ll export the the whole thing to a .csv file:

college['Type'] = college['Type'].cat.add_categories('Unranked')
college['Type'] = college['Type'].fillna('Unranked')


Now that our college admissions data is scraped, cleaned, and joined with rankings data, the hard part is over. We’ll move on to exploratory data analysis in the next post.

Yield Rate Gender Gap, Part 2: Cleaning the Data

This is Part 2 of an investigation into the yield rate disparity in college admissions between women and men. This is a personal project I started to help me tie together using python for web scraping, data cleaning, data visualization, hypothesis testing, statistical modeling, machine learning, and more. I appreciate feedback!

Let’s import and check the shape of the the raw data we scraped from CollegeData.com back in Part 1:

import numpy as np
import pandas as pd

college = pd.read_csv('collegedata_scraped_backup.csv', index_col=0)

(1813, 53)

We scraped 53 columns of data from 1813 different schools. For reference, this is what Penn State’s data looks like in its original form on CollegeData.com.

Identifying null values

We scraped the school name, the location, and everything in the “Profile of Fall Admission” table. The column labels of our raw scraped data are:

 'Overall Admission Rate',
 'Students Enrolled',
 'Early Decision Admission Rate',
 'Early Action Admission Rate',
 'Students Offered Wait List',
 'Students Accepting Wait List Position',
 'Students Admitted From Wait List',
 'Average GPA',
 '3.75 and Above',
 '3.50 - 3.74',
 '3.25 - 3.49',
 '3.00 - 3.24',
 '2.50 - 2.99',
 '2.00 - 2.49',
 'SAT Math',
 'Score of 700 - 800',
 'Score of 600 - 700',
 'Score of 500 - 600',
 'Score of 400 - 500',
 'Score of 300 - 400',
 'Score of 200 - 300',
 'SAT Critical Reading',
 'Score of 700 - 800*',
 'Score of 600 - 700*',
 'Score of 500 - 600*',
 'Score of 400 - 500*',
 'Score of 300 - 400*',
 'Score of 200 - 300*',
 'SAT Writing',
 'Score of 700 - 800**',
 'Score of 600 - 700**',
 'Score of 500 - 600**',
 'Score of 400 - 500**',
 'Score of 300 - 400**',
 'Score of 200 - 300**',
 'ACT Composite',
 'Score of 30 - 36',
 'Score of 24 - 29',
 'Score of 18 - 23',
 'Score of 12 - 17',
 'Score of 6 - 11',
 'Score of 5 or Below',
 'High School Class Rank',
 'National Merit Scholar',
 'Class President',
 'Student Government Officer']

If you look up enough schools on CollegeData.com, you’ll notice not every school has information available for each column here. Very sparsely populated columns – that is columns filled with null values – may not be very useful in helping us analyze the yield rate disparity.

Pandas saves null values with the np.NaN value. CollegeData.com probably uses something more friendly to indicate a null value to a human reader. Assuming there is a bunch of null values in the dataset, can get an idea of the potential null value CollegeData.com uses by looking at the most common values that appear in each column of the DataFrame with the .mode() method.

['Chamberlain College of Nursing ',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Top tenth:   94% Top quarter:   99% Top half:  100%',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported']

It looks like the string “Not reported” is what CollegeData.com is using to indicate a null value in this dataset. We will replace all instances of “Not reported” with np.NaN.

college = college.replace('Not reported',np.NaN)

Dropping sparse columns

Now that we’ve got null data properly marked, we can check what columns are filled with null data and potentially drop them. I’m going to consider dropping columns that are more than 75% null.

drop_cols = college.columns[college.isnull().sum() > 0.75*len(college)]
['Early Decision Admission Rate',
 'Early Action Admission Rate',
 'Students Offered Wait List',
 'Students Accepting Wait List Position',
 'Students Admitted From Wait List',
 'National Merit Scholar',
 'Class President',
 'Student Government Officer']

In a future analysis of only the most selective schools, it could be useful to keep track of the few schools that offer early admission. Regarding the remaining columns, there is just too little data inconsistently reported for them to serve as useful predictors for any models we build to predict the yield rate disparity across all schools.

After dropping these columns, let’s see what remains:

college = college.drop(columns=drop_cols)
 'Overall Admission Rate',
 'Students Enrolled',
 'Average GPA',
 '3.75 and Above',
 '3.50 - 3.74',
 '3.25 - 3.49',
 '3.00 - 3.24',
 '2.50 - 2.99',
 '2.00 - 2.49',
 'SAT Math',
 'Score of 700 - 800',
 'Score of 600 - 700',
 'Score of 500 - 600',
 'Score of 400 - 500',
 'Score of 300 - 400',
 'Score of 200 - 300',
 'SAT Critical Reading',
 'Score of 700 - 800*',
 'Score of 600 - 700*',
 'Score of 500 - 600*',
 'Score of 400 - 500*',
 'Score of 300 - 400*',
 'Score of 200 - 300*',
 'SAT Writing',
 'Score of 700 - 800**',
 'Score of 600 - 700**',
 'Score of 500 - 600**',
 'Score of 400 - 500**',
 'Score of 300 - 400**',
 'Score of 200 - 300**',
 'ACT Composite',
 'Score of 30 - 36',
 'Score of 24 - 29',
 'Score of 18 - 23',
 'Score of 12 - 17',
 'Score of 6 - 11',
 'Score of 5 or Below',
 'High School Class Rank']

Dropping related columns

Referring back to Penn State’s CollegeData.com page, each column in the pattern of “Score of ### – ###” show the percentage of applicants who scored in that range on one of the standardized tests. While this is potentially interesting for a future analysis, we already get the mean and middle 50% for each test in the SAT Math, SAT Critical Reading, SAT Writing, and ACT Composite columns. Similarly, each “#.## – #.##” column contains the percentage of students who earned GPA in that ranges, but we already get the mean GPA of applicants in the Average GPA column.

Since these score range columns likely correlate with the mean score columns, we can drop the score range columns without losing the ability to use SAT/ACT/GPA scores as a potential predictor for the yield rate disparity.

As it turns out, to drop these columns, we can simply remove any column from the list that has a label containing a number.

drop_cols = college.columns[college.columns.str.contains(r'[\d]')]
college = college.drop(columns = drop_cols)

Removing commas

Large numbers in this dataset contain commas to separate thousands. We will remove all commas from the data in every column to make it easier to later extract numbers.

for col in college.columns:
    college[col] = college[col].str.replace(',','')

‘Students Enrolled’,’Women*’, and ‘Men*’ columns

Our surviving columns are:

original_columns = college.columns  # Save the names of the original columns for later so we can drop them.
 'Overall Admission Rate',
 'Students Enrolled',
 'Average GPA',
 'SAT Math',
 'SAT Critical Reading',
 'SAT Writing',
 'ACT Composite',
 'High School Class Rank']

Though we now only have these 14 columns for each school, there are more than 14 pieces of numeric data saved inside these columns.

For Penn State, for example, the 'Students Enrolled', 'Women*', and 'Men*' columns contain:

college.loc[59, 'Students Enrolled']
'9800 (33%) of 29878 admitted students enrolled'
college.loc[59, 'Women*']
'4710 (29%) of 16047 admitted students enrolled'
college.loc[59, 'Men*']
'5090 (37%) of 13831 admitted students enrolled'

We neeed to split these columns, and nearly all of the other columns as well, into separate columns:

We can use pandas .str.extract() method with a regular expression argument, which will output a new column of data for each group in the regular expression. We can then chain the .astype() method to convert the output columns to our desired data type. We can then convert any percentage columns to decimal numbers by dividing them by 100.

One at a time, we will search each of the columns 'Enrolled Students','Women*', and 'Men*' using a regular expression that captures three groups of numbers, outputting three new columns for each old column containing floating point numbes for ‘Enrolled’, ‘Yield Rate’, ‘Offers’.

regexp = '(\d*)\D*(\d*)%\D*(\d*)'

college[['Enrolled (all)','Yield Rate (all)','Offers (all)']] = college['Students Enrolled'].str.extract(regexp).astype('float')
college[['Enrolled (women)','Yield Rate (women)','Offers (women)']] = college['Women*'].str.extract(regexp).astype('float')
college[['Enrolled (men)','Yield Rate (men)','Offers (men)']] = college['Men*'].str.extract(regexp).astype('float')

college[['Yield Rate (all)','Yield Rate (women)','Yield Rate (men)']] /= 100   #to make rates a decimal percentage

The ‘location’ column

Let’s see what the raw location data looks like:

'University Park\xa0PA\xa0\xa016802'

We will split the ‘location’ string column into ‘City’ string, ‘State’ category, and ‘Zip’ integer columns.

regexp = '^(.*)\s([A-Z]{2})\s*(\d{5})'

college[['City','State','Zip']] = college['location'].str.extract(regexp)
college['State'] = college['State'].astype('category')
college['Zip'] = college['Zip'].astype('int')

The ‘Overall Admission Rate’, ‘Women’, and ‘Men’ columns

Let’s see what these columns look like:

college.loc[59,'Overall Admission Rate']
'56% of 52974 applicants were admitted'
'53% of 25925 applicants were admitted'
'59% of 27049 applicants were admitted'

We will split each of these columns into an ‘Offer Rate’ and ‘Apps’ column of floating point numbers.

regexp = '(\d*)%\D*(\d*)'

college[['Offer Rate (all)','Apps (all)']] = college['Overall Admission Rate'].str.extract(regexp).astype('float')
college[['Offer Rate (women)','Apps (women)']] = college['Women'].str.extract(regexp).astype('float')
college[['Offer Rate (men)','Apps (men)']] = college['Men'].str.extract(regexp).astype('float')

college[['Offer Rate (all)','Offer Rate (women)','Offer Rate (men)']] /= 100    #to make rates a decimal percentage

The ‘Average GPA’ column

Taking a look:

college.loc[59,'Average GPA']

The ‘Average GPA’ column doesn’t need to be split, but it does need to have it’s value extracted from any potential surrounding text that could accompany it at other schools. And it needs to be converted to a float column.

regexp = '(\d\.\d\d)'

college['GPA Mean'] = college['Average GPA'].str.extract(regexp).astype('float')

The ‘SAT Math’, ‘SAT Critical Reading’, ‘SAT Writing’, and ‘ACT Composite’ columns

Let’s look at these four columns:

college.loc[59,'SAT Math']
'612 average 560-670 range of middle 50%'
college.loc[59,'SAT Critical Reading']
'578 average 530-630 range of middle 50%'
college.loc[59,'SAT Writing']
college.loc[59,'ACT Composite']
'27 average 25-29 range of middle 50%'

CollegeData.com doesn’t report a value for Penn State’s 'SAT Writing' because Penn State does not consider that score in its admissions process. Looking at other schools, the 'SAT Writing' column does follow the same format as the other SAT columns.

We split each of the three ‘SAT’ columns – Math, Reading, and Writing – and the one ‘ACT’ columns into three float columns each, one for the Mean, one for the 25th percentile, and one for the 75th percentile.

regexp = '(\d*) average (\d*)-(\d*)'

college[['SAT Math Mean','SAT Math 25th','SAT Math 75th']] = college['SAT Math'].str.extract(regexp).astype('float')
college[['SAT Reading Mean','SAT Reading 25th','SAT Reading 75th']] = college['SAT Critical Reading'].str.extract(regexp).astype('float')
college[['SAT Writing Mean','SAT Writing 25th','SAT Writing 75th']] = college['SAT Writing'].str.extract(regexp).astype('float')
college[['ACT Mean','ACT 25th','ACT 75th']] = college['ACT Composite'].str.extract(regexp).astype('float')

The ‘High School Class Rank’ column

Finally, let’s look at the last column:

college.loc[59, 'High School Class Rank']
'Top tenth:   36% Top quarter:   77% Top half:   97% Bottom half:    3%'

We split this column into four float columns for the Top 10th, Top 25th, Top 50th, and Bottom 50th.

regexp = '(\d*)%\D*(\d*)%\D*(\d*)%\D*(\d*)%'

college[['HS Top 10th','HS Top 25th','HS Top 50th','HS Bottom 50th']] = college['High School Class Rank'].str.extract(regexp).astype('float')
college[['HS Top 10th','HS Top 25th','HS Top 50th','HS Bottom 50th']] /= 100   #decimal percentage

Changing the dataframe index

To make sure the name of each school is unique, we will append any duplicate school names with the state where the school is located.

for schoolId in college[college['schoolname'].duplicated(keep=False)].index:
    college.loc[schoolId,'schoolname'] = college.loc[schoolId,'schoolname']+' ('+college.loc[schoolId,'State']+')'

Instead of indexing our dataframe by CollegeData.com’s arbitrary schoolId, we will use the name of the school as the index.

college = college.set_index('schoolname', drop=False).sort_index()

Dropping old columns

And finally, we will delete all the old columns from the dataset.

college = college.drop(columns=original_columns)

Inspecting our cleaned dataframe

Let’s look at the .info() of our dataframe in its freshly cleaned state:

<class 'pandas.core.frame.DataFrame'>
Index: 1813 entries, Abilene Christian University to Youngstown State University
Data columns (total 35 columns):
Enrolled (all)        1303 non-null float64
Yield Rate (all)      1303 non-null float64
Offers (all)          1303 non-null float64
Enrolled (women)      1309 non-null float64
Yield Rate (women)    1309 non-null float64
Offers (women)        1309 non-null float64
Enrolled (men)        1265 non-null float64
Yield Rate (men)      1265 non-null float64
Offers (men)          1265 non-null float64
City                  1813 non-null object
State                 1813 non-null category
Zip                   1813 non-null int64
Offer Rate (all)      1548 non-null float64
Apps (all)            1548 non-null float64
Offer Rate (women)    1437 non-null float64
Apps (women)          1437 non-null float64
Offer Rate (men)      1431 non-null float64
Apps (men)            1431 non-null float64
GPA Mean              1343 non-null float64
SAT Math Mean         951 non-null float64
SAT Math 25th         951 non-null float64
SAT Math 75th         951 non-null float64
SAT Reading Mean      933 non-null float64
SAT Reading 25th      933 non-null float64
SAT Reading 75th      933 non-null float64
SAT Writing Mean      556 non-null float64
SAT Writing 25th      556 non-null float64
SAT Writing 75th      556 non-null float64
ACT Mean              1030 non-null float64
ACT 25th              1030 non-null float64
ACT 75th              1030 non-null float64
HS Top 10th           1035 non-null float64
HS Top 25th           1035 non-null float64
HS Top 50th           1035 non-null float64
HS Bottom 50th        1035 non-null float64
dtypes: category(1), float64(32), int64(1), object(1)
memory usage: 500.4+ KB

Our original, raw scraped dataframe was 1813 schools long and 53 dirty columns wide. In our cleaning, we dropped quite a few sparse columns. But we also split many original columns into clean, new columns.

We can check that Penn State’s data was properly cleaned. Since we’ve reset the index, we can select it directly by name now.

college.loc['Penn State University Park']
Enrolled (all)                   9800
Yield Rate (all)                 0.33
Offers (all)                    29878
Enrolled (women)                 4710
Yield Rate (women)               0.29
Offers (women)                  16047
Enrolled (men)                   5090
Yield Rate (men)                 0.37
Offers (men)                    13831
City                  University Park
State                              PA
Zip                             16802
Offer Rate (all)                 0.56
Apps (all)                      52974
Offer Rate (women)               0.59
Apps (women)                    27049
Offer Rate (men)                 0.53
Apps (men)                      25925
GPA Mean                         3.59
SAT Math Mean                     612
SAT Math 25th                     560
SAT Math 75th                     670
SAT Reading Mean                  578
SAT Reading 25th                  530
SAT Reading 75th                  630
SAT Writing Mean                  NaN
SAT Writing 25th                  NaN
SAT Writing 75th                  NaN
ACT Mean                           27
ACT 25th                           25
ACT 75th                           29
HS Top 10th                      0.36
HS Top 25th                      0.77
HS Top 50th                      0.97
HS Bottom 50th                   0.03
Name: Penn State University Park, dtype: object

Everything looks good here. The NaN for the SAT Writing is appropriate since Penn State did not consider the SAT Writing in admissions – CollegeData.com showed values of ‘Not reported’.

Exporting the dataframe

We will save the cleaned dataframe to a new .csv file.


In Part 3, we will import a separate dataset of US News and World Report’s college rankings and combine it with our cleaned school dataset.

Yield Rate Gender Gap, Part 1: Scraping the Web

This is Part 1 of an investigation into the yield rate disparity in college admissions between women and men. This is a personal project I started to help me tie together using python for web scraping, data cleaning, data visualization, hypothesis testing, statistical modeling, machine learning, and more. I appreciate feedback!

University admissions and the yield rate

In my previous work running the education startup Newton China, I frequently ended up using the website CollegeData.com as a reference – it lists a ton of data for a lot of schools.

As an example, searching my alma mater brings up Penn State’s page, which is actually one six seperate CollegeData pages of information: Overview, Admission, Money Matters, Academics, Campus Life, and Students.

Scrolling down, you’ll find a bunch of numbers organized in somewhat regularly into rows. Some table rows, like Average GPA, are easy enough, containing only one value. But others, like the row labeled Students Enrolled contain three pieces of data.

The first number in 'Students Enrolled' is the straightforward number of students enrolled, presumably 9,800 students attending school in the fall as incoming freshmen. The third number – 29,878 – is the number of students who were offered admission. The second number – 33% – is called the yield rate, and it is the number that raised questions that inspired me to collect and analyze this data.

But why care about the yield rate?

The yield rate is the percentage of offered students who actually enrolled and it is of considerable importance to many college admissions departments. If the admissions department enrolls too few students, the school will not receive enough tuition to cover its costs. If it enrolls too many students, the extra students will bring dorms and classrooms beyond capacity.

Not every student who is given an offer letter – a letter of acceptance – will accept it. In fact it seems only 33% of the Penn State’s offer letters were accepted. Presumably, the other 67% of students who were offered elected to attend another school. If this 33% yield rate is relatively consistent year to year, which it normally is, the school will have a good idea how many offer letters should be given out so they may end up with the right yield.

Curiously, the yield rate at Penn State for men (37%) seems a good amount higher than that for women (29%). Men seem more likely to accept Penn State’s offers than women are. After looking at a few other schools’ yield rates, I became curious if there actually is a significant disparity in yield rate among schools – and if so, how is this disparity distributed and what factors may predict it?

Scraping collegedata.com

The collegedata.com url for Penn State’s Admission information page looks like this:

The 59 at the end of the URL is the schoolId. The 02 part of the URL indicates you are on the 'Admission' information page. Not every schoolId corresponds to a school, but most do, especially in the lower range of numbers. Attempting to access a page for a number with no school will load a default page with the title "Retrieve a Saved Search". After poking around a bit, I found at larger schoolId, especially over 1000, real school pages became more sparse, and I’m fairly confident there are no schools listed with a schoolId over 2500.

BASE_URL = "http://www.collegedata.com/cs/data/college/college_pg02_tmpl.jhtml?schoolId="
END_ID = 2500

In order to determine if the yield disparity is real, I only need to scrape the yield rate from each page. But grabbing as much other data as possible from the six collegedata.com pages of information for each school at the same time could prove useful later when trying to find predictors for yield rate.

I decided a good balance would be to only get the data in the <td> tags under the heading 'Profile of Fall Admission', identified as <div id='section8'> in the HTML, in addition to the school’s name and its location. I can always come back to scrape more if I need it in the future.

Building a BeautifulSoup scraper

I used the wonderful BeautifulSoup library to help me get what I need from CollegeData’s pages. It maps the HTML into a tree that can be descended down (and up) and side to side to access the elements and strings you want.

The string 'schoolname' is conveniently located in the first <h1> tag. The 'location' is a bit more difficult to grab, as it isn’t nested inside a unique tag, but can be found easily by using the search feature to find the text 'City, State, Zip' and going from there.

Finding 'section8' is no problem, from there a relatively simple for loop through all of its <tr> descendants, each of which may have a <th> child to be used as a key and a <td> child to be used as a value. Though most of the <th> values are unique, some are not, including 'Women','Men', and the SAT Score information further down the section.

from bs4 import BeautifulSoup

def scrape(response, REJECT_PAGE_TITLE = "Retrieve a Saved Search"):
    # response is from requests.get()
    page = BeautifulSoup(response.text, "lxml")
    if page.h1 is None or page.h1.string == REJECT_PAGE_TITLE:
        return None

        scraped = {}        
        scraped['schoolname'] = page.h1.string
        scraped['location'] = page.find('th', text="City, State, Zip").next_sibling.string
        for tag in page.find(id="section8").descendants:
            if tag.name == 'tr' and tag.th is not None and tag.td is not None:
                key = " ".join(string for string in tag.th.stripped_strings)
                val = " ".join(string for string in tag.td.stripped_strings)
                while key in scraped:  # Temporarily deal with identical table headers cells
                    key += "*"
                scraped[key] = val
        return scraped

An empty dictionary 'scraped' is created to hold our scraped values. We add 'schoolname' and 'location' before looping through id=section8, saving every non-empty pair of <th> and <td>. To avoid duplicates, we add a '*' to the key as a temporary marker until we can later clean up the key names. Finally, a 'scraped' dictionary is returned to the caller.

Requesting 2500 pages!

I’ve read how to implement methods to evade anti-scraping measures, but I encountered no push back from CollegeData when making these basic requests. It took me about 40 minutes to run through this loop.

from requests import get
from IPython.core.display import clear_output
from warnings import warn

scraped_pages = {}
for schoolId in range(START_ID, END_ID+1):
    url = BASE_URL + str(schoolId) 
    print(url)   # a status update
    response = get(url)
    if response.status_code != 200:
        warn('Request {} caused status code {}.'.format(schoolId, response.status_code))
    scraped = scrape(response)
    if scraped:
        scraped_pages[schoolId] = scraped
    clear_output(wait = True) # clear the status update

print('Requested schoolId {} to {}, scraping {} pages.'.format(START_ID, END_ID, len(scraped_pages)))
Requested schoolId 0 to 2500, scraping 1813 pages.

We now have a dictionary, 'scraped_pages', filled with keys 'schoolId' associated with scraped dictionaries, which themselves are filled with scraped <th> keys and <td> values, in addition to 'location' and 'schoolname'. Pandas’ 'DataFrame.from_dict' method will quickly convert this to a DataFrame, which we will then backup to .csv before moving forward.

Inspecting the results

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1813 entries, 6 to 2099
Data columns (total 53 columns):
schoolname                               1813 non-null object
location                                 1813 non-null object
Overall Admission Rate                   1813 non-null object
Women                                    1813 non-null object
Men                                      1813 non-null object
Students Enrolled                        1813 non-null object
Women*                                   1813 non-null object
Men*                                     1813 non-null object
Early Decision Admission Rate            245 non-null object
Early Action Admission Rate              392 non-null object
Students Offered Wait List               335 non-null object
Students Accepting Wait List Position    295 non-null object
Students Admitted From Wait List         324 non-null object
Average GPA                              1813 non-null object
3.75 and Above                           1807 non-null object
3.50 - 3.74                              1807 non-null object
3.25 - 3.49                              1799 non-null object
3.00 - 3.24                              1804 non-null object
2.50 - 2.99                              1774 non-null object
2.00 - 2.49                              1557 non-null object
SAT Math                                 1813 non-null object
Score of 700 - 800                       1813 non-null object
Score of 600 - 700                       1813 non-null object
Score of 500 - 600                       1813 non-null object
Score of 400 - 500                       1813 non-null object
Score of 300 - 400                       1813 non-null object
Score of 200 - 300                       1813 non-null object
SAT Critical Reading                     1813 non-null object
Score of 700 - 800*                      1813 non-null object
Score of 600 - 700*                      1813 non-null object
Score of 500 - 600*                      1813 non-null object
Score of 400 - 500*                      1813 non-null object
Score of 300 - 400*                      1813 non-null object
Score of 200 - 300*                      1813 non-null object
SAT Writing                              1813 non-null object
Score of 700 - 800**                     1813 non-null object
Score of 600 - 700**                     1813 non-null object
Score of 500 - 600**                     1813 non-null object
Score of 400 - 500**                     1813 non-null object
Score of 300 - 400**                     1813 non-null object
Score of 200 - 300**                     1813 non-null object
ACT Composite                            1813 non-null object
Score of 30 - 36                         1813 non-null object
Score of 24 - 29                         1813 non-null object
Score of 18 - 23                         1813 non-null object
Score of 12 - 17                         1813 non-null object
Score of 6 - 11                          1813 non-null object
Score of 5 or Below                      1813 non-null object
High School Class Rank                   1107 non-null object
National Merit Scholar                   1813 non-null object
Valedictorian                            1813 non-null object
Class President                          1813 non-null object
Student Government Officer               1813 non-null object
dtypes: object(53)
memory usage: 764.9+ KB

I scraped 53 seperate columns for 1813 schools. According to the US Dept. of Education’s National Center for Education Statistics, there were 2870 4-year colleges as of 2011. It would be very helpful to know what criteria the people behind CollegeData used to add schools to their site. Are CollegeData’s schools only the largest? Only those who filled out a survey? Only those who paid?

I’m not sure how accurately this sample represents the overall population of 4-year colleges – and how accurately insights drawn from analysis of it can be generalized to the greater population of all schools – and more investigation into CollegeData’s data acquisition process is warranted. Be warned!

import pandas as pd
college = pd.DataFrame.from_dict(scraped_pages, orient='index') # convert scraped pages to a pandas DataFrame
college.to_csv('collegedata_scraped_backup.csv')   # backup

With that said, I will move on to cleaning the data. Perhaps after some exploration we will have a better idea the quality of this sample and the quality of inferences made about the yield rate disparity.