Yield Rate Gender Gap, Part 4: Iterative imputation

This is Part 4 of an investigation into the yield rate disparity between women and men in college admissions. This is a personal project I started to tie together using Python for web scraping, data cleaning, data visualization, hypothesis testing, statistical modeling, machine learning, and more. I appreciate feedback!

Let’s load our scraped, cleaned, and combined CollegeData.com and US News dataset and remind ourselves of what it contains:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set style of plots.
sns.set_style("whitegrid")

# Load CSV and set categorical variables.
df = pd.read_csv('data/college.csv', index_col=0, header=[0,1])

# Create new rank type, "Top 50 Universities".
cat = ('Category','Rank Type')
top_50 = (df[cat] == 'National Universities') & (df['Rank','Rank Number']<=50)
df.loc[top_50, cat] = 'Top 50 Universities'
cats = ['Top 50 Universities','National Universities',
        'National Liberal Arts Colleges','Regional Universities',
        'Regional Colleges','Unranked']
df[cat] = df[cat].astype('category')
df[cat] = df[cat].cat.reorder_categories(cats)

# Plot the amount of missingness per column.
f, ax = plt.subplots(1, 1, figsize=(6,9))
df.isna().sum().plot(kind='barh', ax=ax)
plt.title('Missing values, by column')
plt.xlabel('Missing count')
plt.xlim(0, df.shape[0])
plt.show()

# Show the shape.
df.shape

(2030, 42)

We have 42 different columns of values for 2030 schools, but a good number of values in those columns are missing. Before we deal with the missing values, let's first check that all of the existing values make sense.


Finding and removing bad values

Values that exceed limits

There could be nonsensical values in the dataframe from improper scraping back in Part 1 or even errors from the websites themselves that were never caught. We'll set lower and upper value limits for each column and check how many, if any, values exceed those limits for each column.

# Define column value limits.
limits = {
    'ACT':(1, 36),
    'Age':(18, 65),
    'Apps':(0, 500000),
    'Demographics':(0,1),
    'Enrolled':(0, 500000),
    'GPA':(1,5),
    'Grads':(0, 500000),
    'HS Rank':(0,1),
    'Intl':(0,1),
    'Offer Rate':(0,1),
    'Offers':(0, 500000),
    'SAT':(200, 800),
    'Undergrads':(0,500000),
    'Yield Rate':(0,1)
}

badvals = pd.DataFrame(False, index=df.index, columns=df.columns)
    
# Check if col vals exceed limits.
for col in limits.keys():
    low, high = limits[col]
    badvals[col] = (df[col] < low) | (df[col] > high)

# Plot bad values count, by column.
badvals.loc[badvals.any(axis=1), badvals.any(axis=0)].sum().plot(kind='barh')
plt.title('Bad values due to exceeding limits, by column')
plt.xlabel('Bad value count')
plt.show()

# Show bad values.
df.loc[badvals.any(axis=1), badvals.any(axis=0)]

SAT ACT GPA
Math Mean Reading Mean Mean Mean
Name
Dickinson State University 510.0 420.0 20.0 3120.00
Guilford College 513.0 22.0 488.0 3.18
Kuyper College 410.0 NaN 447.0 3.30
Lake Forest College 520.0 52.0 26.0 3.62
Louisiana State University Alexandria 17.0 429.0 20.0 3.16
Marian University (WI) 16.0 17.0 20.0 3.00
Oakland University 19.0 NaN 23.0 3.40
Ohio Valley University 44.0 410.0 18.0 3.17
Texas Woman's University 410.0 940.0 17.0 3.10
University of Pennsylvania 140.0 723.0 33.0 3.91

It looks like some ACT and SAT scores might have been swapped, and a decimal place may be off for one of the GPAs. Instead of guessing the correct values, we'll just remove the bad values by setting them to NaN.

df = df.mask(badvals)
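For anyone unfamiliar with `mask`: it replaces entries with NaN wherever a boolean frame is True. A tiny sketch on made-up values (not from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up scores, purely to illustrate mask(): School B's SAT and ACT
# values look swapped, and both fall outside their limits.
toy = pd.DataFrame({'SAT': [510.0, 17.0], 'ACT': [22.0, 420.0]},
                   index=['School A', 'School B'])

limits = {'SAT': (200, 800), 'ACT': (1, 36)}
bad = pd.DataFrame({col: (toy[col] < lo) | (toy[col] > hi)
                    for col, (lo, hi) in limits.items()})

# mask() sets a value to NaN wherever the boolean frame is True.
cleaned = toy.mask(bad)
print(cleaned)
```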

Improper admissions values

Our dataframe has 'All', 'Men', and 'Women' values for 'Apps', 'Offers', 'Enrolled', and 'Undergrads'. If the number of Offers exceeds the number of Apps, the number of Enrolled exceeds the Offers, or the number of Enrolled exceeds the total Undergrads, then we have bad values somewhere.

badvals = ((df['Apps'] < df['Offers'])
           | (df['Offers'] < df['Enrolled'])
           | (df['Enrolled'] > df['Undergrads']))

# Plot count of schools with bad admissions values.
badvals.sum().plot(kind='barh')
plt.title('Count of schools with bad admissions values,\nby gender of values')
plt.xlabel('School count')
plt.show()

Let's see what's wrong with the one school with bad 'Women' admissions values:

# Look at the values at schools with bad Women admissions values.
cols = ['Apps','Offers','Enrolled','Undergrads']
df.loc[badvals['Women'], cols].dropna(axis=1)
Apps Undergrads
All Men Women All Men Women
Name
College of the Ozarks 2896.0 1090.0 1806.0 1512.0 684.0 828.0

It looks like this school is missing 'All' and 'Men' values for 'Offers' and 'Enrolled'. This itself is not a problem, but the fact that the number of Enrolled Women exceeds Offers Women is a problem. I'm not sure which value is bad, so we'll remove both of them.

Now we'll look at the erroneous 'All' values:

# Remove the Women values for Enrolled and Offers.
df.loc[badvals['Women'], [('Enrolled','Women'),('Offers','Women')]] = np.nan

# Look at the values at schools with bad All admissions values.
df.loc[badvals['All'], cols].dropna(axis=1)
Apps Enrolled Undergrads
All Men Women All All
Name
Blessing-Reiman College of Nursing 546.0 30.0 516.0 362.0 233.0
Medgar Evers College (City University of New York) 10105.0 4005.0 6100.0 9864.0 6819.0
Northwestern Polytechnic University 249.0 150.0 99.0 248.0 164.0
Paine College 3964.0 1396.0 2568.0 977.0 502.0
University of the Ozarks 744.0 311.0 433.0 720.0 651.0

These schools also only have gender breakdown for Apps, and they all have no information at all regarding Offers. Again, this is not an error, but all of these schools do have a common error - the Enrolled is somehow higher than the Undergrads. It's difficult to determine which of these two columns is the 'bad' one, so we'll just remove both values for these schools:

df.loc[badvals['All'], [('Enrolled','All'),('Undergrads','All')]] = np.nan

Improper gender sums

Now let's check for any schools where the Apps, Offers, Enrolled, and Undergrads columns, when Men/Women gender breakdown is available, erroneously do not add up to the total All amount:

def check_bad_admissions_sums(data):
    admissions_cols = ['Apps','Offers','Enrolled','Undergrads']
    badvals = pd.DataFrame(False, index=data.index, columns=admissions_cols)
    
    for col in admissions_cols:
        has_all_val = data[col]['All'].notna()
        has_gender_val = data[col][['Men','Women']].notna().any(axis=1)
        gender_val_sum = data[col][['Men','Women']].sum(axis=1).round()
        all_val = data[col,'All'].round()
        
        badvals[col] = (has_all_val 
                        & has_gender_val 
                        & (gender_val_sum != all_val))
    return badvals
        

badvals = check_bad_admissions_sums(df)

# Plot count of schools with bad sums of gender values.
badvals.sum().plot(kind='barh')
plt.title('Count of bad admissions column sums,\nby column')
plt.xlabel('Bad sum count')
plt.show()

We'll 'trust' the All values for these schools and remove both of their Men/Women values:

for col in badvals.columns:
    df.loc[badvals[col], [(col,'Men'),(col,'Women')]] = np.nan

Improper demographics sums

Now let's check for any schools with domestic undergraduate demographics data summing to something far from 1.00:

# Find demographic sums not near 1.00.
has_some_data = df['Demographics'].notna().any(axis=1)
demo_sum = df['Demographics'].sum(axis=1)

bad_demos = has_some_data & ((demo_sum < 0.95) | (demo_sum > 1.05))

# Plot histogram of anomalous demographic sums.
hist_bins = np.concatenate((
    np.linspace(demo_sum.min(), 0.95, 5), 
    np.linspace(1.05, demo_sum.max(), 5)))

demo_sum.loc[bad_demos].plot.hist(bins=hist_bins)

plt.axvline(0.95, color='k', linestyle='dashed')
plt.axvline(1.05, color='k', linestyle='dashed')

plt.title('Distribution of undergraduate domestic demographics sum,\n'
          'for schools with sum <0.95 or >1.05')
plt.xlabel('Sum of undergrad domestic demographics')
plt.show()

I'm not going to be able to figure out what demographics values are correct or erroneous, so I'll just drop all demographics values for these schools:

df.loc[bad_demos, 'Demographics'] = np.nan

Dropping outlier schools

Now that we've dealt with erroneous values, we're still not quite ready to impute replacements for our missing data. During the imputation procedure, we'll rely on the values of other schools with the relevant data to provide estimates for our missing data. This can become inaccurate if some of those other schools are so different that they effectively form a separate subpopulation.

Unusual gender ratios

Since our end goal is to analyze the gender gap in the yield rate, we're not going to be terribly interested in schools that are not coeducational. Fortunately, single gender schools are already identified in our dataset. Unfortunately, there are many schools that, while nominally coed, are over 85% one gender. These schools can be imagined as members of a subpopulation different from those we are interested in analyzing.

We can identify any school with more than 85% of the student body of one gender and classify it as 'Men Mostly' or 'Women Mostly', in addition to our existing 'Men Only' and 'Women Only' categories:

coed_cat = ('Category','Coed')
rank_cat = ('Category','Rank Type')

# Set limits for unusual gender ratios.
LOW = 0.15
HIGH = 0.85

# Mark nominally 'coed' schools with unusual ratio as 'mostly' men or women.
pct_women = df['Undergrads','Women'] / df['Undergrads','All']
mostly_women = ((pct_women > HIGH) & (df[coed_cat] != 'Women Only'))
mostly_men = ((pct_women < LOW) & (df[coed_cat] != 'Men Only'))  
df.loc[mostly_women, coed_cat] = 'Women Mostly'
df.loc[mostly_men, coed_cat] = 'Men Mostly'

# Make coed status a categorical variable.
df[coed_cat] = df[coed_cat].astype('category')

# Get a count of non-coed schools by rank type.
non_coed_schools = df[coed_cat] != 'Yes'
non_coed_count = df.loc[non_coed_schools].groupby(rank_cat).size()

# Get a count of all schools by rank type.
all_count = df.groupby(rank_cat).size()

# Plot non-coed fraction of schools by US News rank type.
(non_coed_count / all_count).plot(kind='barh')
plt.xlim(0,1)
plt.title('Schools with heavy gender imbalance,\nby US News rank type')
plt.xlabel('Fraction of schools')
plt.ylabel('')
plt.show()

It seems our Top 50 Universities and National Universities contain none of these schools, though around 15% of the unranked schools are effectively single-gender, officially or not.


Unusual freshmen fraction of undergraduates

The Freshmen Ratio, the ratio of 'Enrolled' to 'Undergrads', can be thought of as the incoming freshmen percentage of the total undergraduate population. Ideally, this would be 0.25 if every enrolled student remained enrolled for four years and then graduated, with no increase or decrease in incoming class size from year to year. In practice, it varies due to rapid increases or decreases in enrollment, as well as large numbers of dropouts or incoming transfers.

We're going to arbitrarily mark schools as 'unusual' if they have a Freshmen Ratio lower than 0.15 or higher than 0.50. If fewer than 15% of undergraduates are freshmen, enrollment could be plummeting and the school dying, or students could be transferring in at a tremendous rate. If the ratio is above 0.50, enrollment might be skyrocketing, or students might be transferring out en masse. Either way, these schools don't seem likely to have the typical values we would want to use for our imputation procedure.

# Compute new freshmen ratio column.
for gender in ['All','Men','Women']:
    df['Freshmen Ratio',gender] = \
                            df['Enrolled',gender] / df['Undergrads',gender]

# Create new category.
df['Category','Unusual Freshmen Ratio'] = False

# Set limits for unusual freshmen ratios.
LOW = 0.15
HIGH = 0.50
df['Category','Unusual Freshmen Ratio'] = \
      (df['Freshmen Ratio','All'] < LOW) | (df['Freshmen Ratio','All'] > HIGH)
    
# Get a count of schools with unusual freshmen ratio by US News rank type.
unusual_ratio = df['Category','Unusual Freshmen Ratio'] == True
unusual_ratio_count = df.loc[unusual_ratio].groupby(rank_cat).size()

# Plot unusual freshmen ratio fraction of schools by US News rank type.
(unusual_ratio_count / all_count).plot(kind='barh')
plt.title('Schools with unusual freshmen ratio by US News rank type\n'
          '(as defined by freshmen <15% or >50% of total undergrads)')
plt.xlim(0,1)
plt.xlabel('Fraction of schools')
plt.ylabel('')
plt.show()

Around 5% of schools were identified as having 'unusual' Freshmen Ratios.


Missing all admissions data

Finally, let's see how many schools are completely missing all of the admissions data - the columns Apps, Offers, Enrolled, and Undergrads:

# Assess missingness of admissions columns.
cols = ['Apps','Offers','Enrolled','Undergrads']
missing_all = df[cols].isna().all(axis=1)
missing_some = df[cols].isna().any(axis=1) & ~missing_all
complete_case = df[cols].notna().all(axis=1)

# Add new category for missing status.
col = ('Category','Admissions Missing')
df.loc[missing_all, col] = 'Missing All'
df.loc[missing_some, col] = 'Missing Some'
df.loc[complete_case, col] = 'Complete Case'
df[col] = df[col].astype('category')

# Get a count of schools missing all admissions data by US News rank type.
missing_all_count = df.loc[missing_all].groupby(rank_cat).size()

# Plot fraction of schools missing all admissions data by US News rank type.
(missing_all_count / all_count).plot(kind='barh')
plt.title('Schools missing all admissions data,\nby US News rank type')
plt.xlim(0,1)
plt.xlabel('Fraction of schools')
plt.ylabel('')
plt.show()

About 15% of unranked schools have no admissions data whatsoever. There is little basis for accurate imputation at these schools.


Outlier summary

When imputing missing values, we will want to only consider schools that are coed, schools that don't have an unusual Freshmen Ratio, and schools that aren't missing all of their admissions data. Counting up these schools by US News rank type:

# Drop schools that are not coed, that have an unusual freshmen ratio,
#   or that are missing all admissions values.
to_drop = non_coed_schools | unusual_ratio | missing_all

# Get count of schools in reduced dataframe by US News rank type.
drop_count = df.loc[to_drop].groupby(rank_cat).size()

# Plot fraction of schools dropped by US News rank type.
(drop_count/all_count).plot(kind='barh')
plt.title('Fraction of schools dropped,\nby US News rank type')
plt.xlabel('Fraction of schools')
plt.ylabel('')
plt.xlim(0,1)
plt.show()

# Summarize how many schools will be dropped.
print("Original schools: {}".format(len(df)))
print("Schools to drop: {}".format(len(df.loc[to_drop])))
print("Total fraction to be dropped: {:.2f}".format(len(df.loc[to_drop])/len(df)))

Original schools: 2030
Schools to drop: 354
Total fraction to be dropped: 0.17

We're going to cut 17% of schools from the imputation procedure.


Imputation

Reducing dependent columns

Before we get started, note that many of our variables depend on other variables. For example, the three columns ('Undergrads','All'), ('Undergrads','Men'), and ('Undergrads','Women') are not independent. They can be reduced to two columns: a single 'Undergrads' column holding the All values, and an 'Undergrads % Women' column. Multiplying these two independent columns together recovers the original ('Undergrads','Women') column, and subtracting the Women values from the All values recovers ('Undergrads','Men').
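As a quick sanity check, here is the round trip on made-up numbers (a hypothetical school, not from the dataset):

```python
# Hypothetical school: 1000 total undergrads, 60% of them women.
undergrads_all = 1000.0
undergrads_pct_women = 0.6

# Reduce: store only the total and the women's share.
# Expand: recover the Women column, then subtract to get Men.
undergrads_women = undergrads_all * undergrads_pct_women
undergrads_men = undergrads_all - undergrads_women

print(undergrads_women, undergrads_men)
```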

With the end goal of removing all dependent columns in mind, as well as dropping the schools from the prior section's analysis, we create a new reduced dataframe:

# Create new blank dataframe with the same index as the original.
reduced_df = pd.DataFrame(index=df.index)

# Copy the total undergrad count as it is.
reduced_df['Undergrads'] = df['Undergrads','All']

# Compute rates connecting Undergrads, Enrolled, Offers, and Apps.
reduced_df['Freshmen Ratio'] = df['Enrolled','All'] / df['Undergrads','All']
reduced_df['Yield Rate'] = df['Enrolled','All'] / df['Offers','All']
reduced_df['Offer Rate'] = df['Offers','All'] / df['Apps','All']

# Compute rates connecting All, Men, and Women values of the prior columns.
reduced_df['Undergrads % Women'] = \
                             df['Undergrads','Women'] / df['Undergrads','All']
reduced_df['Enrolled % Women'] = df['Enrolled','Women'] / df['Enrolled','All']
reduced_df['Offers % Women'] = df['Offers','Women'] / df['Offers','All']
reduced_df['Apps % Women'] = df['Apps','Women'] / df['Apps','All']

# Copy the Demographics values as they are.
for col in df['Demographics']:
    reduced_df[col] = df['Demographics',col]

# Copy the SAT values as they are.
for col in df['SAT']:
    reduced_df['SAT '+col] = df['SAT',col]

# Copy the HS Rank values as they are.
for col in df['HS Rank']:
    reduced_df['HS Rank '+col] = df['HS Rank',col]
    
# Copy the ACT, GPA, Intl, Age, and Rank values as they are.
reduced_df['ACT'] = df['ACT','Mean']
reduced_df['GPA'] = df['GPA','Mean']
reduced_df['Intl'] = df['Intl','Intl']
reduced_df['Age'] = df['Age','Mean']
reduced_df['Rank'] = df['Rank','Rank Number']

# Copy the Grad % as it is.
reduced_df['Grads %'] = df['Grads','All'] / (df['Undergrads','All']
                                             + df['Grads','All'])
reduced_df = reduced_df.loc[~to_drop]

reduced_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1676 entries, Abilene Christian University to Youngstown State University
Data columns (total 27 columns):
Undergrads             1668 non-null float64
Freshmen Ratio         1222 non-null float64
Yield Rate             1196 non-null float64
Offer Rate             1195 non-null float64
Undergrads % Women     1584 non-null float64
Enrolled % Women       1177 non-null float64
Offers % Women         1177 non-null float64
Apps % Women           1316 non-null float64
American Indian        1424 non-null float64
Asian                  1427 non-null float64
Black                  1430 non-null float64
Hispanic               1439 non-null float64
Multi                  1406 non-null float64
Native Hawaiian        1406 non-null float64
White                  1432 non-null float64
Unknown                1417 non-null float64
SAT Math Mean          1201 non-null float64
SAT Reading Mean       1191 non-null float64
SAT Writing Mean       777 non-null float64
HS Rank Top 10th       1016 non-null float64
HS Rank Bottom 50th    984 non-null float64
ACT                    1283 non-null float64
GPA                    1258 non-null float64
Intl                   1108 non-null float64
Age                    1499 non-null float64
Rank                   1248 non-null float64
Grads %                1254 non-null float64
dtypes: float64(27)
memory usage: 366.6+ KB

We've reduced our original 2030 schools with 42 columns to 1676 schools with 27 independent numeric columns.


Simple mean imputation

There are still missing values in this reduced dataframe that need to be imputed. Let's look at the relationship between SAT Math Mean and GPA with a scatter plot. We'll first plot the original data, and next to it we'll plot the data with the column means imputed to replace the missing values:

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,4))

# Plot left figure - original vals only.
reduced_df.plot.scatter('GPA', 'SAT Math Mean', label='Original vals', ax=ax1)

# Plot right figure - original vals with mean imputed vals.
reduced_df.plot.scatter('GPA', 'SAT Math Mean', label='Original vals', ax=ax2)
reduced_df.fillna(reduced_df.mean()).where(reduced_df.isna())\
                            .plot.scatter('GPA', 'SAT Math Mean', ax=ax2, 
                                          color='red', label='Imputed vals')
plt.show()

# Descriptive statistics without and with mean imputation.
reduced_df[['GPA','SAT Math Mean']].describe().join(\
    reduced_df.fillna(reduced_df.mean())[['GPA','SAT Math Mean']]\
                                .describe(), rsuffix=' with imputed').round(2)

GPA SAT Math Mean GPA with imputed SAT Math Mean with imputed
count 1258.00 1201.00 1676.00 1676.00
mean 3.35 521.19 3.35 521.19
std 0.32 78.06 0.28 66.07
min 2.00 248.00 2.00 248.00
25% 3.18 470.00 3.24 487.00
50% 3.37 511.00 3.35 521.19
75% 3.56 562.00 3.50 542.00
max 4.66 772.00 4.66 772.00

There are hundreds of missing values for each of these two variables. Notice that since the column mean is used to fill them, every imputed point has a GPA of 3.35 and an SAT Math Mean of 521.19, so those hundreds of points all map onto the same location in the right plot: the red dot in the middle.

While simple, this imputation method artificially lowers the standard deviation and tightens the distribution of column values, which could wash out or overemphasize possible correlations these columns have with other columns. We can do better.
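To see the shrinkage concretely, here is a toy column (made-up values, not from the dataset) with two missing entries; filling with the mean leaves the mean unchanged but pulls the standard deviation down:

```python
import numpy as np
import pandas as pd

# Toy column with two missing values.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan, np.nan])

# Mean imputation: fill the NaNs with the column mean (3.0).
filled = s.fillna(s.mean())

# The mean is unchanged, but the standard deviation shrinks.
print(s.std(), filled.std())
```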


Iterative imputation

An improved method for dealing with missing values, called iterative imputation, holds one target column constant while the missing values in the other, predictor columns are filled with their column means. These imputed predictor columns are then used in a multiple regression to impute values for the missing data in the target column.

We then repeat the process with the next column as the target: its previously imputed values are removed, restoring its original missingness, and the remaining columns (with their imputed values) serve as predictors in another multiple regression.

We run this procedure for every column in the dataframe; that's one cycle. We will run 20 cycles, then check how it turned out for the GPA and SAT Math Mean scatterplot:

from sklearn import linear_model

# Mark the originally missing values.
missing_df = reduced_df.isna()

# Fill missing values with column means.
reduced_df = reduced_df.fillna(reduced_df.mean())

# Set new col limits to our current col min and max
reduced_limits = {}
for col in reduced_df.columns:
    reduced_limits[col] = (reduced_df[col].min(), reduced_df[col].max())

# This is the number of imputation cycles through the columns.
CYCLES = 20
for i in range(0, CYCLES):
    for col in reduced_df.columns:

        # The target column's original missingness is restored.
        Y = reduced_df[col].mask(missing_df[col])

        # The predictor columns are all cols except the target.
        X = reduced_df.drop(columns=col)

        # Fit lin. reg. model using non-missing Y vals and associated X vals.
        myRegressor = linear_model.LinearRegression()
        myRegressor.fit(X[Y.notna()], Y[Y.notna()])
        
        # Get errors between predicted Y and actual Y for non-missing Y vals.
        err = myRegressor.predict(X[Y.notna()]) - Y[Y.notna()]

        # Randomly draw errors for all the Y values with missing vals.
        random_err = np.random.choice(err, Y.isna().sum())

        # Impute missing Y vals with predicted Y plus randomly drawn error.
        if Y.isna().sum() > 0:
            Y.loc[Y.isna()] = myRegressor.predict(X[Y.isna()]) + random_err
            
            # Clip imputed Y value if it exceeds pre-defined col limits.
            if col in reduced_limits:
                low, high = reduced_limits[col]
                Y = Y.clip(low, high)
        
        # Overwrite target col in df with new col with missing vals imputed.
        reduced_df.loc[:,col] = Y
        

# Check imputation:
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,4))

# Plot left figure - original vals only.
reduced_df.mask(missing_df).plot.scatter('GPA', 'SAT Math Mean',
                                         label='Original vals', ax=ax1)
ax1.set_title('SAT Math Mean vs. GPA')

# Plot right figure - original vals with mean imputed vals.
reduced_df.mask(missing_df).plot.scatter('GPA', 'SAT Math Mean', 
                                         label='Original vals', ax=ax2)
reduced_df.where(missing_df).plot.scatter('GPA', 'SAT Math Mean', color='red',
                                          label='Imputed vals', ax=ax2)
ax2.set_title('SAT Math Mean vs. GPA\nwith imputed values')
plt.show()

# Descriptive statistics without and with iterative imputation.
reduced_df.mask(missing_df)[['GPA','SAT Math Mean']].describe().join(\
    reduced_df[['GPA','SAT Math Mean']].describe(), \
    rsuffix=' with imputed').round(2)

GPA SAT Math Mean GPA with imputed SAT Math Mean with imputed
count 1258.00 1201.00 1676.00 1676.00
mean 3.35 521.19 3.34 509.31
std 0.32 78.06 0.34 76.75
min 2.00 248.00 2.00 248.00
25% 3.18 470.00 3.14 459.94
50% 3.37 511.00 3.35 503.00
75% 3.56 562.00 3.56 551.00
max 4.66 772.00 4.66 772.00

The iteratively imputed values have a more natural distribution than those from the simple mean imputation, but their distributions are not identical to those of the original values. For example, while the mean GPA remained roughly unchanged after imputation, the SAT Math Mean decreased (perhaps significantly) with imputation. This means the imputed SAT Math values were on average lower than the original SAT Math values.

The imputation procedure suggests that schools with missing SAT scores would likely have reported lower SAT scores, on average, had they reported them. Which particular factors in the iterative imputation led to this result is not immediately obvious without further exploration of the dataset, which we will do in the next part.


Expanding the imputed reduced dataframe

To facilitate easier analysis, we will expand our reduced data frame to include the same columns as our original dataframe:

# Un-reduce the imputed dataframe back to a new dataframe with original cols.
imputed_df = pd.DataFrame(index=reduced_df.index, columns=df.columns)

# Copy the total Undergrads as it is.
imputed_df['Undergrads','All'] = reduced_df['Undergrads']

# Reconstruct the total Enrolled/Offers/Apps from the rates.
imputed_df['Enrolled','All'] = \
            imputed_df['Undergrads','All'] * reduced_df['Freshmen Ratio']
imputed_df['Offers','All'] = \
            imputed_df['Enrolled','All'] / reduced_df['Yield Rate']
imputed_df['Apps','All'] = \
            imputed_df['Offers','All'] / reduced_df['Offer Rate']

# Reconstruct the Women columns using the Women percentages.
imputed_df['Undergrads','Women'] = \
            imputed_df['Undergrads','All'] * reduced_df['Undergrads % Women']
imputed_df['Enrolled','Women'] = \
            imputed_df['Enrolled','All'] * reduced_df['Enrolled % Women']
imputed_df['Offers','Women'] = \
            imputed_df['Offers','All'] * reduced_df['Offers % Women']
imputed_df['Apps','Women'] = \
            imputed_df['Apps','All'] * reduced_df['Apps % Women']

# Subtract Women from All to get the Men columns.
for col in ['Undergrads','Enrolled','Offers','Apps']:
    imputed_df[col,'Men'] = imputed_df[col,'All'] - imputed_df[col,'Women']

# Compute the Offer Rates and Yield Rates, by gender.
imputed_df['Offer Rate'] = imputed_df['Offers'] / imputed_df['Apps']
imputed_df['Yield Rate'] = imputed_df['Enrolled'] / imputed_df['Offers']
imputed_df['Freshmen Ratio'] =\
                            imputed_df['Enrolled'] / imputed_df['Undergrads']

# Compute the total Grads.
imputed_df['Grads','All'] = \
            imputed_df['Undergrads','All'] * reduced_df['Grads %'] \
            / (1 - reduced_df['Grads %'])
imputed_df['Grads','Percent'] = reduced_df['Grads %']

# Copy the Demographics values as they are.
for col in imputed_df['Demographics']:
    imputed_df['Demographics',col] = reduced_df[col]

# Copy the SAT values as they are.
for col in imputed_df['SAT']:
    imputed_df['SAT',col] = reduced_df['SAT '+col]

# Copy the HS Rank values as they are.
for col in imputed_df['HS Rank']:
    imputed_df['HS Rank',col] = reduced_df['HS Rank '+col]
    
# Copy the ACT, GPA, Intl, Age, and Rank values as they are.
imputed_df['ACT','Mean'] = reduced_df['ACT']
imputed_df['GPA','Mean'] = reduced_df['GPA']
imputed_df['Intl','Intl'] = reduced_df['Intl']
imputed_df['Age','Mean'] = reduced_df['Age']
imputed_df['Rank','Rank Number'] = reduced_df['Rank']

# Remove unused columns from imputed dataframe.
imputed_df.columns = imputed_df.columns.remove_unused_levels()

# If imputed yield rate > 1 for a gender, set its offers = enrollments.
for gender in ['Men','Women']:
    bad_yld_rate = imputed_df['Yield Rate',gender] > 1
    imputed_df.loc[bad_yld_rate, ('Offers',gender)] = \
                            imputed_df.loc[bad_yld_rate, ('Enrolled',gender)]

# Recompute the totals and rates after the fix.
imputed_df['Offers','All'] = \
                imputed_df['Offers','Men'] + imputed_df['Offers','Women']
imputed_df['Yield Rate'] = imputed_df['Enrolled'] / imputed_df['Offers']
imputed_df['Offer Rate'] = imputed_df['Offers'] / imputed_df['Apps']
    
# If imputed offer rate > 1 for a gender, set its apps = offers.
for gender in ['Men','Women']:
    bad_off_rate = imputed_df['Offer Rate',gender] > 1
    imputed_df.loc[bad_off_rate, ('Apps',gender)] \
                            = imputed_df.loc[bad_off_rate, ('Offers',gender)]

# Recompute the total and rate after the fix.
imputed_df['Apps','All'] = \
                    imputed_df['Apps','Men'] + imputed_df['Apps','Women']
imputed_df['Offer Rate'] = imputed_df['Offers'] / imputed_df['Apps']
    

# Fill remaining values in the imputed dataframe with the original values.
imputed_df = imputed_df.fillna(df)

Yield rate gap

Distribution of yield rates by case

Not counting the 105 schools that are missing all admissions values, we have four cases:
- The original population of 1925 schools, including 1238 with originally complete values.
- The dropped subpopulation of 249 schools, including 84 with originally complete values.
- The kept subpopulation of 1676 schools, including 1154 with originally complete values.
- The imputed kept subpopulation of 1676 schools, all with complete (imputed) admissions values.

We will look at the empirical cumulative distribution functions (ECDFs) of 'Yield Rate (men)' and 'Yield Rate (women)' for each of these four cases. We will also create a new column, 'Yield Rate (diff)', that holds the difference between the two, and check its ECDFs for the four cases.
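As a refresher, an ECDF pairs each sorted value with the fraction of observations at or below it. A minimal sketch on made-up rates (using one common convention; the plotting helper below uses a linspace instead):

```python
import numpy as np

# Made-up yield rates, purely for illustration.
vals = np.array([0.2, 0.5, 0.3, 0.9])

x = np.sort(vals)                      # sorted values
y = np.arange(1, len(x) + 1) / len(x)  # fraction of observations <= each value

print(list(zip(x, y)))
```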

cases = ['All','Dropped','Kept','Imputed']

def get_rate_diffs(col):
    # Get the all, men, women, diff rate columns for each of the four cases.
    col1 = (col,'Men')
    m_cols = [df[col1], df.loc[to_drop,col1], 
              df.loc[~to_drop,col1], imputed_df[col1]]
    col2 = (col,'Women')
    w_cols = [df[col2], df.loc[to_drop,col2], 
              df.loc[~to_drop,col2], imputed_df[col2]]
    gap_col = (col,'Diff')
    df[gap_col] = df[col,'Men'] - df[col,'Women']
    imputed_df[gap_col] = imputed_df[col,'Men'] - imputed_df[col,'Women']
    d_cols = [df[gap_col], df.loc[to_drop,gap_col], 
              df.loc[~to_drop,gap_col], imputed_df[gap_col]]
    
    # Create all, men, women, diff rate dfs, each w/rates for each case.
    idx = pd.MultiIndex.from_product([[col+' (men)'],cases])
    men = pd.DataFrame(m_cols, index=idx).T
    idx = pd.MultiIndex.from_product([[col+' (women)'],cases])
    women = pd.DataFrame(w_cols, index=idx).T
    idx = pd.MultiIndex.from_product([[col+' (diff)'],cases])
    diff = pd.DataFrame(d_cols, index=idx).T

    # Create yield rate table 
    rates = diff.join([men, women])
    
    return rates


def plot_rate_ECDFs(rates, col):
    labels = cases  # plot one panel per case
    # Plot ECDFs of men and women yield rates, by case.
    f, axes = plt.subplots(1, 4, figsize=(10, 4), sharey=True)
    ax_dict = dict(zip(labels, axes))
    for label in labels:
        ax = ax_dict[label]
        ECDF(rates[col+' (men)'][label].rename('Men'), ax=ax)
        ECDF(rates[col+' (women)'][label].rename('Women'), ax=ax)
        ax.set_title(col+',\n{}'.format(label))
        ax.set_xlabel(col)
        ax.set_ylabel('')
        ax.set_xlim((0,1))
    plt.tight_layout()
    plt.show()
    
    
    # Plot ECDFs for Yield Rate (diff), by case.
    f, axes = plt.subplots(1, 4, figsize=(10, 4), sharey=True)
    ax_dict = dict(zip(cases, axes))
    for label in cases:
        ax = ax_dict[label]
        ECDF(rates[col+' (diff)'][label].rename('Diff'), ax=ax)
        ax.set_title(col+' Difference,\n{}'.format(label))
        ax.set_xlabel(col)
        ax.axvline(0, color='k')
        ax.set_ylabel('')
        ax.set_xlim((-0.5,0.5))
    plt.tight_layout()
    plt.show()
    
    
# Function to plot ECDF for a series on current axes named 'ax'.
def ECDF(series, ax):
    X = series.dropna().sort_values()
    Y = np.linspace(0, 1, series.dropna().shape[0])
    ax.scatter(X, Y, label=series.name)
    ax.set_title('ECDF of ' + series.name)
    ax.set_xlabel(series.name)
    ax.set_ylabel('ECDF')
    ax.legend()

# Get yield rate differences.
yld_rates = get_rate_diffs('Yield Rate')

# Plot ECDFs of yield rate differences.
plot_rate_ECDFs(yld_rates, 'Yield Rate')

# Show table of summary statistics for men and women yield rates.
yld_rates.describe().round(2)

Yield Rate (diff) Yield Rate (men) Yield Rate (women)
All Dropped Kept Imputed All Dropped Kept Imputed All Dropped Kept Imputed
count 1264.00 87.00 1177.00 1676.00 1286.00 97.00 1189.00 1676.00 1328.00 120.00 1208.00 1676.00
mean 0.05 0.05 0.05 0.05 0.42 0.49 0.42 0.45 0.38 0.41 0.38 0.40
std 0.08 0.11 0.08 0.09 0.20 0.23 0.20 0.22 0.20 0.19 0.20 0.21
min -0.41 -0.34 -0.41 -0.50 0.08 0.08 0.08 0.07 0.05 0.05 0.09 0.06
25% 0.01 0.00 0.01 0.01 0.26 0.32 0.26 0.27 0.23 0.29 0.23 0.24
50% 0.05 0.05 0.05 0.05 0.39 0.48 0.38 0.43 0.34 0.38 0.34 0.37
75% 0.09 0.09 0.09 0.09 0.54 0.62 0.54 0.59 0.49 0.51 0.49 0.53
max 0.64 0.50 0.64 0.64 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

It seems there is a consistent 5% yield rate difference no matter the case. Across schools, 'Yield Rate (men)' runs about 5 percentage points higher than 'Yield Rate (women)', meaning admitted men are overall more likely to accept an offer of admission than admitted women. But is this 5% yield rate gap statistically significant?


Hypothesis testing the yield rate gender gap

Our null hypothesis $H_0$ is the claim that the population mean $\mu_{w}$ of 'Yield Rate (women)' $= X_{w}$ is no different from the population mean $\mu_{m}$ of 'Yield Rate (men)' $= X_{m}$, or in terms of their difference $d$:

$$H_0: \mu_{m} - \mu_{w} = \mu_{d} = 0$$
$$H_a: \mu_{m} - \mu_{w} = \mu_{d} \neq 0$$

Since each observation of the 'Yield Rate (diff)'$=d_i=X_{m,i}-X_{w,i}$ requires a linked pair of values from 'Yield Rate (men)' and 'Yield Rate (women)', we should perform a two-tailed matched pair t-test. However, since our number of schools $n \gg 30$ for each of our four subpopulations, we could approximate each t-test using a z-test.
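To see why the approximation is safe, here is a quick sketch on synthetic differences (not our actual data; the mean and spread are chosen to roughly mimic the summary table above) showing that the one-sample t-test on the differences, which is equivalent to the matched pair t-test, and the large-sample z-test agree:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the 'Yield Rate (diff)' column (NOT our real data).
rng = np.random.default_rng(0)
n = 1200
d = rng.normal(loc=0.05, scale=0.08, size=n)

# The matched pair t-test is the one-sample t-test on the differences.
t_stat, t_pval = stats.ttest_1samp(d, 0.0)

# Large-sample z approximation: same statistic, normal CDF instead of t.
z_stat = d.mean() / (d.std(ddof=1) / np.sqrt(n))
z_pval = 2 * stats.norm.sf(abs(z_stat))
```

For $n$ this large the two p-values are indistinguishable, so either test leads to the same conclusion.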

For the matched pair z-test to be valid, our $X_m$ and $X_w$ do not necessarily need to be normally distributed, but their differences $d_i = X_{m,i} - X_{w,i}$ should be approximately normal for all populations:

plt.title('Distribution of Yield Rate (diff),\nby case')
yld_rates['Yield Rate (diff)'].boxplot()
plt.ylabel('Yield Rate (diff)')
plt.axhline(0, color='r')
plt.show()

Since these differences look approximately normal, we can continue with our hypothesis testing.
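The boxplot is only an eyeball check; scipy also offers the D'Agostino-Pearson omnibus normality test as scipy.stats.normaltest. A minimal sketch on synthetic differences (to test the real column, swap in something like yld_rates['Yield Rate (diff)']['All'].dropna()); note that with over a thousand observations even tiny departures from normality can reach significance, so treat this as a complement to the plot, not a replacement:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for one 'Yield Rate (diff)' column.
rng = np.random.default_rng(1)
d = rng.normal(loc=0.05, scale=0.08, size=1200)

# Null hypothesis: the sample is drawn from a normal distribution.
stat, pval = stats.normaltest(d)
print(pval)  # a small p-value would argue against normality
```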

from scipy.stats import ttest_rel

def test_diff(series1, series2, ALPHA=0.01):
    t, ttest_pval = ttest_rel(series1, series2, nan_policy='omit')
    reject_equal_means = ttest_pval < ALPHA
    return ttest_pval, reject_equal_means    

def diff_summary(rates, col):
    # Create new dataframe to hold summary of equal means hypothesis tests.
    summary = pd.DataFrame(columns=cases)

    # Get yield rate means and difference for each case.
    for case in cases:
        has_all = rates.xs(case, axis=1, level=1).notna().all(axis=1)
        col_m = (col+' (men)', case)
        col_w = (col+' (women)', case)
        col_d = (col+' (diff)', case)
        summary.loc['Count', case] = has_all.sum()
        summary.loc[col_m] = rates.loc[has_all, col_m].mean()
        summary.loc[col_w] = rates.loc[has_all, col_w].mean()
        summary.loc[col_d] = rates.loc[has_all, col_d].mean()
        
    
    # Test null hyp., mean yld. rate men = mean yld. rate women for each case.
    for label in cases:
        summary.loc['t-test pval',label], summary.loc['reject null',label] =\
           test_diff(rates[col+' (men)', label], rates[col+' (women)', label])
    
    
    return summary

# Show test result summary for each case.
diff_summary(yld_rates, 'Yield Rate')
All Dropped Kept Imputed
Count 1264 87 1177 1676
Yield Rate (men) 0.419256 0.471609 0.415387 0.454624
Yield Rate (women) 0.370245 0.41908 0.366636 0.4021
Yield Rate (diff) 0.0490111 0.0525287 0.0487511 0.0525241
t-test pval 1.83247e-86 1.72852e-05 1.15855e-83 5.50018e-101
reject null True True True True

It seems there is indeed a significant yield rate gender gap of about 5% across all four cases.
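With samples this large, statistical significance is almost guaranteed, so it is worth also noting the effect size. Cohen's d for paired data is the mean difference divided by the standard deviation of the differences; plugging in the rounded summary statistics from the table above:

```python
# Cohen's d for paired samples, using the rounded moments of
# 'Yield Rate (diff)' from the summary table (mean 0.05, std 0.08).
mean_diff, sd_diff = 0.05, 0.08
cohens_d = mean_diff / sd_diff
print(round(cohens_d, 2))  # 0.62, a "medium" effect by Cohen's convention
```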


Offer rate gap and freshmen ratio gap

While the machinery is all built, we might as well look at the offer rate gaps:

# Get offer rate differences.
offer_rates = get_rate_diffs('Offer Rate')
diff_summary(offer_rates, 'Offer Rate')
All Dropped Kept Imputed
Count 1450 128 1322 1676
Offer Rate (men) 0.647648 0.668047 0.645673 0.657006
Offer Rate (women) 0.663097 0.681094 0.661354 0.672795
Offer Rate (diff) -0.0154483 -0.0130469 -0.0156808 -0.0157887
t-test pval 3.23483e-18 0.0763218 5.25908e-18 1.03297e-21
reject null True False True True

Though not significant in the smaller (N=128) 'Dropped' case, all other cases show a significant offer rate gap of about -1.5%, this time in favor of women. Checking the freshmen ratio:

# Get freshmen ratio differences.
fresh_ratios = get_rate_diffs('Freshmen Ratio')
diff_summary(fresh_ratios, 'Freshmen Ratio')
All Dropped Kept Imputed
Count 1509 108 1401 1676
Freshmen Ratio (men) 0.278088 0.208837 0.283426 0.285906
Freshmen Ratio (women) 0.263552 0.197856 0.268616 0.265727
Freshmen Ratio (diff) 0.0145359 0.0109805 0.01481 0.0201795
t-test pval 1.00217e-30 0.0583493 8.7037e-31 7.94735e-63
reject null True False True True

Again, with the exception of the smaller (N=108) 'Dropped' case, all other cases show a significant 1-2% freshmen ratio gap, in favor of men.


Apps per enrollment

Before wrapping up, let's create one new set of columns - the number of apps per enrollment - and check if there is a significant gender gap with it:

# Create new columns in both dataframes.
for gender in ['All','Men','Women']:
    df['Apps per Enrolled',gender] = df['Apps',gender] / df['Enrolled',gender]
    imputed_df['Apps per Enrolled',gender] = \
                             imputed_df['Apps',gender] / df['Enrolled',gender]

# Get apps per enrollment differences.
apps_per_enrollment_rates = get_rate_diffs('Apps per Enrolled')
diff_summary(apps_per_enrollment_rates, 'Apps per Enrolled')
All Dropped Kept Imputed
Count 1240 86 1154 1402
Apps per Enrolled (men) 5.40016 4.86352 5.44016 5.62797
Apps per Enrolled (women) 6.00226 5.45748 6.04286 6.19887
Apps per Enrolled (diff) -0.602093 -0.593954 -0.602699 -0.570904
t-test pval 3.32147e-32 0.0230773 6.16879e-32 2.15517e-20
reject null True False True True

Again, ignoring the 'Dropped' case, there is a significant difference between the number of apps per enrollment for women, roughly 6, and the number of apps per enrollment for men, roughly 5.5.


Interpretation - what do these gaps mean?

We have no direct evidence that men and women who apply to college end up attending at the same rates. If that assumption holds, though, then men are enrolling in college after submitting fewer applications than women. And even though women submit more applications than men, they surprisingly receive offer letters at a higher rate. With more offers in hand, women must then decline offers from more schools than men, which drives their yield rate below men's.
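These three quantities are tied together within a single school by an identity: Enrolled = Apps × Offer Rate × Yield Rate, so Yield Rate = 1 / (Apps per Enrolled × Offer Rate). A toy check with made-up numbers (note the identity holds per school, not for the means of these ratios across schools, which is why the table averages above don't compose exactly):

```python
# Made-up numbers for a single hypothetical school.
apps, offer_rate, yield_rate = 1000, 0.60, 0.40

offers = apps * offer_rate            # 600 offers sent
enrolled = offers * yield_rate        # 240 students enrolled
apps_per_enrolled = apps / enrolled   # ~4.17 applications per enrollment

# The identity recovers the yield rate from the other two quantities.
implied_yield = 1 / (apps_per_enrolled * offer_rate)
print(round(implied_yield, 2))  # 0.4
```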

Let's save the original (fixed) dataframe as well as the new imputed dataframe to CSV:

df.to_csv('data/fixed_college.csv')
imputed_df.to_csv('data/imputed_college.csv')

The next post will contain further exploratory data analysis to find potential variations in these gaps and explore potential methods to predict them.

Yield Rate Gender Gap, Part 3: Adding Rankings

This is Part 3 of an investigation into the yield rate disparity in college admissions between women and men. This is a personal project I started to help me tie together using Python for web scraping, data cleaning, data visualization, hypothesis testing, statistical modeling, machine learning, and more. I appreciate feedback!

Our previously scraped and cleaned CollegeData.com table contains a lot of admissions information for each school, but it doesn’t contain anything about rankings. Among the most influential college rankings – for better or worse – are those put out by US News and World Report.

US News keeps multiple ranking lists, tracking what they call ‘National Universities’, ‘National Liberal Arts Colleges’, ‘Regional Universities’ (which are split into four regions), and ‘Regional Colleges’ (also split into four regions).

When scrolling down the page for each of these rankings, more rows will appear:

Scraping US News with BeautifulSoup

After scrolling down until the schools were no longer ranked, I opened Safari’s developer console and saved the generated HTML in its final state as local .html files. I looked through the .html files to find the <tr> tags that each correspond to a single row of results, and copied their attribute data-view = "colleges-search-results-table-row" to help identify all the <tr> tags containing a row of rankings data. I then used BeautifulSoup to scrape the page and return a pandas DataFrame.

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

filepath = 'usnews.html'
row_id_attr = {"data-view":"colleges-search-results-table-row"}
field_labels = ['Name','Location','Rank','Tuition','Undergrads']

scraped = []
with open(filepath,'r') as file:
    page = BeautifulSoup(file.read(), "lxml")
    for tag in page('tr', attrs = row_id_attr):
        values = list(tag.stripped_strings)
        if '(tie)' in values: values.remove('(tie)') # unwanted field
        if '1' in values: values.remove('1') # unwanted field
        scraped.append(dict(zip(field_labels, values)))

rankings = pd.DataFrame(scraped)

rankings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1400 entries, 0 to 1399
Data columns (total 5 columns):
Location      1400 non-null object
Name          1400 non-null object
Rank          1400 non-null object
Tuition       1400 non-null object
Undergrads    1399 non-null object
dtypes: object(5)
memory usage: 54.8+ KB

US News had rankings for 1400 schools. Each listing also contains the tuition and the total number of undergraduates, which are helpful pieces of data that we did not get from CollegeData.com (they were both available on another one of the five pages for each school that we didn’t scrape).

We will want to join this new 'rankings' DataFrame with our saved 'college' DataFrame scraped from CollegeData.com. When we try to join the tables on 'Name', however, we will discover that US News and CollegeData.com don’t use exactly the same name for each school, making it tricky to merge the tables together.
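Before any fuzzy matching, a quick way to quantify the mismatch is a plain index comparison. A minimal sketch with made-up names (the real comparison uses rankings.index and college.index):

```python
import pandas as pd

# Toy stand-ins for the two school-name indexes.
rankings_idx = pd.Index(['Elon University', 'CUNY--Hunter College'])
college_idx = pd.Index(['Elon University', 'Hunter College'])

matched = rankings_idx.isin(college_idx)   # exact-match mask
unmatched = rankings_idx[~matched]         # names needing fuzzy matching
print(list(unmatched))  # ['CUNY--Hunter College']
```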


Cleaning the rankings

First, let’s clean 'rankings' column by column like we did for 'college' back in Part 2.

We’ll check the .mode() to find potential null values and replace them with np.nan:

rankings.mode().loc[0]
Location                             Chicago, IL
Name                               Union College
Rank          #231-#300 in National Universities
Tuition                                      N/A
Undergrads                                   N/A
Name: 0, dtype: object
rankings = rankings.replace('N/A',np.nan)

Next, we’ll extract the numeric 'Rank' as well as save the 'Type' of rank, that is, whether the school is a ‘National University’ or a ‘Regional Colleges North’, etc. We will also drop any schools we accidentally scraped from the end of the list that did not contain a ranking.

rankings[['Rank']].head()
Rank
0 #1 in Regional Colleges Midwest
1 #1 in Regional Colleges West
2 #1 in Regional Colleges North
3 #1 in Regional Universities Midwest
4 #1 in Regional Universities South
regexp = '([A-Z].*)$'
rankings['Type'] = rankings['Rank'].str.extract(regexp)
rankings['Type'] = rankings['Type'].fillna('Unranked')

unranked = rankings['Type'].str.contains('Unranked')
rankings = rankings.drop(rankings[unranked].index)
regional_colleges = rankings['Type'].str.contains('Regional Colleges')
rankings.loc[regional_colleges,'Type'] = 'Regional Colleges'
regional_universities = rankings['Type'].str.contains('Regional Universities')
rankings.loc[regional_universities,'Type'] = 'Regional Universities'

rankings['Type'] = rankings['Type'].astype('category')

rankings['Rank'] = rankings['Rank'].str.extract(r'#(\d*)').astype('float')
rankings = rankings.drop(rankings[rankings['Rank'].isnull()].index)

rankings[['Rank','Type']].head()
Rank Type
0 1.0 Regional Colleges
1 1.0 Regional Colleges
2 1.0 Regional Colleges
3 1.0 Regional Universities
4 1.0 Regional Universities

Next, we will split the 'City' and 'State' from the 'Location' column before dropping it:

rankings[['Location']].head()
Location
0 Grand Rapids, MI
1 Helena, MT
2 New York, NY
3 Omaha, NE
4 Elon, NC
regexp = '(.*), ([A-Z]{2})'
rankings[['City','State']] = rankings['Location'].str.extract(regexp)
rankings['State'] = rankings['State'].astype('category')
rankings = rankings.drop(columns='Location')
rankings[['City','State']].head()
City State
0 Grand Rapids MI
1 Helena MT
2 New York NY
3 Omaha NE
4 Elon NC

Next, we’ll extract the 'Tuition' and, when available, the 'Tuition (in-state)':

rankings.loc[slice(20,24),['Tuition']] # picked a slice w/in-state tuition
Tuition
20 $31,960
21 $51,848
22 $17,968 (out-of-state), $6,808 (in-state)
23 $39,062
24 $47,165
newcols = ['Tuition','Tuition (in-state)']
rankings[newcols] = rankings['Tuition'].str.replace(',','') \
                                       .str.extractall(r'\$(\d*)') \
                                       .astype('float').unstack()[0]
        
rankings.loc[slice(20,24), newcols] 
Tuition Tuition (in-state)
20 31960.0 NaN
21 51848.0 NaN
22 17968.0 6808.0
23 39062.0 NaN
24 47165.0 NaN

Looking at our last column:

rankings[['Undergrads']].head()
Undergrads
0 3,807
1 1,373
2 876
3 4,203
4 6,008

Finally, we just need to remove the commas from the 'Undergrads' column and convert it to a float:

rankings['Undergrads'] = rankings['Undergrads'].str.replace(',','')
rankings['Undergrads'] = rankings['Undergrads'].astype('float')
rankings[['Undergrads']].head()
Undergrads
0 3807.0
1 1373.0
2 876.0
3 4203.0
4 6008.0

Like in Part 2, we’ll append the 'State' in parentheses to any duplicated 'Name':

duplicated = rankings['Name'].duplicated(keep=False)
duplicated.sum()  # check num of duplicates
32
for i, row in rankings[rankings['Name'].duplicated(keep=False)].iterrows():
    rankings.loc[i,'Name'] += ' (' + row['State'] + ')'
    
rankings.loc[duplicated, 'Name'].duplicated(keep=False).sum() # check again
0

Finally, set the 'Name' to be the DataFrame index.

rankings = rankings.set_index('Name', verify_integrity=True).sort_index()
rankings.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1381 entries, Abilene Christian University to Youngstown State University
Data columns (total 7 columns):
Rank                  1381 non-null float64
Tuition               1373 non-null float64
Undergrads            1268 non-null float64
Type                  1381 non-null category
City                  1381 non-null object
State                 1381 non-null category
Tuition (in-state)    515 non-null float64
dtypes: category(2), float64(4), object(1)
memory usage: 70.5+ KB

Matching the names in both tables

We have our final cleaned 'rankings' dataset from US News. Unfortunately, because of slight differences in our index 'Name', it won’t be so straightforward to join 'rankings' with our previously scraped 'college' dataset from CollegeData.com.

# IMPORT AND CLEAN UP COLLEGE DATAFRAME (NEED TO UPDATE PART 2)

# Import
college = pd.read_csv('collegedata_cleaned_backup.csv')
college['State'] = college['State'].astype('category')

# Differentiate duplicate names and set name as index
college = college.rename(columns={'schoolname':'Name'})
duplicated = college['Name'].duplicated(keep=False)
for i, row in college[college['Name'].duplicated(keep=False)].iterrows():
    college.loc[i,'Name'] += ' (' + row['City'] + ')'
college = college.set_index('Name', verify_integrity=True).sort_index()
matched = rankings.index.isin(college.index)
matched.sum()
1093

1093 of the 1381 'rankings' names are listed verbatim in 'college'. We will create a new temporary dataframe 'cleaned_rankings' to hold them for now, and then drop them from 'rankings'.

cleaned_rankings = rankings[matched]
rankings = rankings.drop(rankings[matched].index)
len(rankings)
288

Using ‘fuzzywuzzy’ to match names

There is a high probability that many of the remaining 288 unmatched names in 'rankings' are written slightly differently than their yet-to-be-matched counterparts in 'college'. Using fuzzywuzzy, which was put together by the people at SeatGeek and built on difflib, we can find a list of the closest names in the 'college' index for each name in the 'rankings' table. We’ll then check that the city of the matched 'college' name lines up with the city in the 'rankings' table, and if so, output the name to a new column called 'Close Name' that we’ll visually inspect:

from fuzzywuzzy import process

def get_closest_name(row):
    close_names = process.extract(row.name, college.index)
    for close_name in close_names:
        if row['City'] == college.loc[close_name[0],'City']:
            return close_name[0]
    return np.nan

rankings['Close Name'] = rankings.apply(get_closest_name, axis=1)
rankings['Close Name'].count()
208
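For reference, process.extract() returns a list of (candidate, score) tuples sorted by similarity, which is why get_closest_name() indexes close_name[0] above. The stdlib difflib that fuzzywuzzy builds on exposes the same idea; a minimal sketch with made-up names:

```python
import difflib

# Made-up candidate names standing in for the 'college' index.
candidates = ['Hunter College', 'Baruch College', 'Boston College']
query = 'CUNY--Hunter College'

# Closest candidates by similarity ratio, best first; analogous to
# fuzzywuzzy.process.extract(query, candidates).
matches = difflib.get_close_matches(query, candidates, n=3, cutoff=0.5)
print(matches[0])  # Hunter College
```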

Using 'fuzzywuzzy.process.extract()', we got a 'Close Name' for 208 of the remaining 288 names in 'rankings'. To be safe, I’ll visually look through these 208 name matches before renaming and joining. WARNING: Prepare for a wall of a table!

pd.set_option('display.max_rows', 208)
rankings[['Close Name']].dropna()
Close Name
Name
Aquinas College Aquinas College (MI)
Arizona State University–Tempe Arizona State University
Auburn University–Montgomery Auburn University at Montgomery
Binghamton University–SUNY Binghamton University
Brigham Young University–Idaho Brigham Young University – Idaho
Brigham Young University–Provo Brigham Young University
Bryn Athyn College of the New Church Bryn Athyn College
CUNY–Hunter College Hunter College
CUNY–John Jay College of Criminal Justice John Jay College of Criminal Justice
CUNY–New York City College of Technology New York City College of Technology
CUNY–York College York College (NY)
California Polytechnic State University–San Luis Obispo California Polytechnic State University San Lu…
California State Polytechnic University–Pomona California State Polytechnic University Pomona
California State University–Bakersfield California State University Bakersfield
California State University–Channel Islands California State University Channel Islands
California State University–Chico California State University Chico
California State University–Dominguez Hills California State University Dominguez Hills
California State University–East Bay California State University East Bay
California State University–Fresno California State University Fresno
California State University–Fullerton California State University Fullerton
California State University–Long Beach California State University Long Beach
California State University–Los Angeles California State University Los Angeles
California State University–Monterey Bay California State University Monterey Bay
California State University–Northridge California State University Northridge
California State University–Sacramento California State University Sacramento
California State University–San Bernardino California State University San Bernardino
California State University–San Marcos California State University San Marcos
California State University–Stanislaus California State University Stanislaus
Central Christian College Central Christian College of Kansas
College at Brockport–SUNY College at Brockport
College of New Jersey The College of New Jersey
College of Saint Rose College of St. Rose
College of St. Mary College of Saint Mary
Colorado State University–Pueblo Colorado State University – Pueblo
Columbia College Columbia College (MO)
Concordia College–Moorhead Concordia College
Concordia University (CA) Concordia University Irvine
Concordia University (NE) Concordia University Nebraska
Concordia University–St. Paul Concordia University St. Paul
Cooper Union Cooper Union for the Advancement of Science an…
Edinboro University of Pennsylvania Edinboro University
Embry-Riddle Aeronautical University Embry-Riddle Aeronautical University – Daytona…
Emory and Henry College Emory & Henry College
Everglades University Everglades University – Boca Raton
Fairleigh Dickinson University Fairleigh Dickinson University – Metropolitan …
Farmingdale State College–SUNY Farmingdale State College
Florida A&M University Florida State University
Franklin and Marshall College Franklin & Marshall College
Georgia College & State University Georgia College and State University
Grace College and Seminary Grace College
Indiana University–Bloomington Indiana University Bloomington
Indiana University–Kokomo Indiana University Kokomo
Indiana University–South Bend Indiana University South Bend
Indiana University-Purdue University–Indianapolis Indiana University – Purdue University Indiana…
Johnson & Wales University Johnson & Wales University (RI)
Louisiana State University–Alexandria Louisiana State University Alexandria
Louisiana State University–Baton Rouge Louisiana State University
Louisiana State University–Shreveport Louisiana State University Shreveport
Miami University–Oxford Miami University
Minnesota State University–Mankato Minnesota State University Mankato
Minnesota State University–Moorhead Minnesota State University Moorhead
Missouri University of Science & Technology Missouri University of Science and Technology
Montana State University–Billings Montana State University – Billings
Montana State University–Northern Montana State University – Northern
Mount Saint Mary’s University Mount St. Mary’s University (CA)
Mount St. Mary’s University Mount St. Mary’s University (MD)
North Carolina State University–Raleigh North Carolina State University
Ohio State University–Columbus Columbus State University
Purchase College–SUNY Purchase College
Purdue University–Fort Wayne Indiana University – Purdue University Fort Wayne
Purdue University–Northwest Purdue University Northwest
Purdue University–West Lafayette Purdue University
SUNY College of Agriculture and Technology–Cobleskill State University of New York College of Agricu…
SUNY College of Environmental Science and Forestry State University of New York College of Enviro…
SUNY College–Old Westbury College at Old Westbury
SUNY–Fredonia State University of New York at Fredonia
SUNY–Geneseo State University of New York College at Geneseo
SUNY–Morrisville Morrisville State College
SUNY–Oswego State University of New York at Oswego
SUNY–Plattsburgh State University of New York Plattsburgh
Sewanee–University of the South Sewanee: University of the South
Silver Lake College Silver Lake College of the Holy Family
South University South University – Savannah
Southern Illinois University–Carbondale Southern Illinois University Carbondale
Southern Illinois University–Edwardsville Southern Illinois University Edwardsville
Southern University–New Orleans Southern University at New Orleans
St. Anselm College Saint Anselm College
St. Francis University Saint Francis University
St. John’s University (NY) St. John’s University
St. Joseph’s College St. Joseph’s College of Maine
St. Joseph’s University Saint Joseph’s University
St. Mary’s College Saint Mary’s College
St. Mary’s College of California Saint Mary’s College of California
St. Mary’s University of Minnesota Saint Mary’s University of Minnesota
St. Mary’s University of San Antonio St. Mary’s University
St. Michael’s College Saint Michael’s College
St. Xavier University Saint Xavier University
Sterling College Sterling College (KS)
Stony Brook University–SUNY Stony Brook University
Texas A&M University–College Station Texas A&M University
Texas A&M University–Commerce Texas A&M University – Commerce
Texas A&M University–Corpus Christi Texas A&M University – Corpus Christi
Texas A&M University–Kingsville Texas A&M University – Kingsville
Texas A&M University–Texarkana Texas A&M University – Texarkana
The Catholic University of America Catholic University of America
The New School Eugene Lang College The New School for Liberal…
University at Albany–SUNY University at Albany
University at Buffalo–SUNY University at Buffalo
University of Alabama–Birmingham University of Alabama at Birmingham
University of Alabama–Huntsville University of Alabama in Huntsville
University of Alaska–Anchorage University of Alaska Anchorage
University of Alaska–Fairbanks University of Alaska Fairbanks
University of Alaska–Southeast University of Alaska Southeast
University of Arkansas–Fort Smith University of Arkansas at Fort Smith
University of Arkansas–Little Rock University of Arkansas at Little Rock
University of Arkansas–Pine Bluff University of Arkansas at Pine Bluff
University of California–Berkeley University of California Berkeley
University of California–Davis University of California Davis
University of California–Irvine University of California Irvine
University of California–Los Angeles University of California Los Angeles
University of California–Merced University of California Merced
University of California–Riverside University of California Riverside
University of California–San Diego University of California San Diego
University of California–Santa Barbara University of California Santa Barbara
University of California–Santa Cruz University of California Santa Cruz
University of Colorado–Boulder University of Colorado Boulder
University of Colorado–Colorado Springs University of Colorado Colorado Springs
University of Colorado–Denver University of Colorado Denver
University of Hawaii–Hilo University of Hawaii at Hilo
University of Hawaii–Manoa University of Hawaii at Manoa
University of Hawaii–West Oahu University of Hawaii – West Oahu
University of Houston–Clear Lake University of Houston – Clear Lake
University of Houston–Downtown University of Houston – Downtown
University of Houston–Victoria University of Houston – Victoria
University of Illinois–Chicago University of Illinois at Chicago
University of Illinois–Springfield University of Illinois at Springfield
University of Illinois–Urbana-Champaign University of Illinois at Urbana-Champaign
University of Louisiana–Lafayette University of Louisiana at Lafayette
University of Louisiana–Monroe University of Louisiana at Monroe
University of Lynchburg Virginia University of Lynchburg
University of Maine–Augusta University of Maine at Augusta
University of Maine–Farmington University of Maine at Farmington
University of Maine–Fort Kent University of Maine at Fort Kent
University of Maine–Machias University of Maine at Machias
University of Maine–Presque Isle University of Maine at Presque Isle
University of Maryland–Baltimore County University of Maryland Baltimore County
University of Maryland–College Park University of Maryland
University of Maryland–Eastern Shore University of Maryland Eastern Shore
University of Massachusetts–Amherst University of Massachusetts Amherst
University of Massachusetts–Boston University of Massachusetts Boston
University of Massachusetts–Dartmouth University of Massachusetts Dartmouth
University of Massachusetts–Lowell University of Massachusetts Lowell
University of Michigan–Ann Arbor University of Michigan
University of Michigan–Dearborn University of Michigan – Dearborn
University of Michigan–Flint University of Michigan – Flint
University of Minnesota–Crookston University of Minnesota Crookston
University of Minnesota–Duluth University of Minnesota Duluth
University of Minnesota–Morris University of Minnesota Morris
University of Minnesota–Twin Cities University of Minnesota Twin Cities
University of Missouri University of Missouri – Columbia
University of Missouri–Kansas City University of Missouri – Kansas City
University of Missouri–St. Louis University of Missouri – St. Louis
University of Montana–Western University of Montana Western
University of Nebraska–Kearney University of Nebraska – Kearney
University of Nebraska–Lincoln University of Nebraska – Lincoln
University of Nebraska–Omaha University of Nebraska – Omaha
University of Nevada–Las Vegas University of Nevada Las Vegas
University of Nevada–Reno University of Nevada Reno
University of North Carolina–Asheville University of North Carolina at Asheville
University of North Carolina–Chapel Hill University of North Carolina at Chapel Hill
University of North Carolina–Charlotte University of North Carolina at Charlotte
University of North Carolina–Greensboro University of North Carolina at Greensboro
University of North Carolina–Pembroke University of North Carolina at Pembroke
University of North Carolina–Wilmington University of North Carolina at Wilmington
University of Northwestern–St. Paul University of Northwestern St. Paul
University of Saint Francis University of St. Francis (IN)
University of South Carolina–Aiken University of South Carolina Aiken
University of South Carolina–Beaufort University of South Carolina Beaufort
University of South Carolina–Upstate University of South Carolina Upstate
University of South Florida–St. Petersburg University of South Florida – Saint Petersburg
University of St. Francis University of St. Francis (IL)
University of St. Joseph University of Saint Joseph
University of Tennessee–Chattanooga University of Tennessee at Chattanooga
University of Tennessee–Martin University of Tennessee at Martin
University of Texas–Arlington University of Texas at Arlington
University of Texas–Austin University of Texas at Austin
University of Texas–Dallas University of Texas at Dallas
University of Texas–El Paso University of Texas at El Paso
University of Texas–San Antonio University of Texas at San Antonio
University of Texas–Tyler University of Texas at Tyler
University of Valley Forge Valley Forge University
University of Virginia–Wise University of Virginia’s College at Wise
University of Wisconsin–Eau Claire University of Wisconsin – Eau Claire
University of Wisconsin–Green Bay University of Wisconsin – Green Bay
University of Wisconsin–La Crosse University of Wisconsin – La Crosse
University of Wisconsin–Madison University of Wisconsin – Madison
University of Wisconsin–Milwaukee University of Wisconsin – Milwaukee
University of Wisconsin–Oshkosh University of Wisconsin – Oshkosh
University of Wisconsin–Parkside University of Wisconsin – Parkside
University of Wisconsin–Platteville University of Wisconsin – Platteville
University of Wisconsin–River Falls University of Wisconsin – River Falls
University of Wisconsin–Stevens Point University of Wisconsin – Stevens Point
University of Wisconsin–Stout University of Wisconsin – Stout
University of Wisconsin–Superior University of Wisconsin – Superior
University of Wisconsin–Whitewater University of Wisconsin – Whitewater
Washington and Jefferson College Washington & Jefferson College
William Paterson University of New Jersey William Paterson University
York College York College (NE)

A rare thing happened: even with the city check, 'fuzzywuzzy.process.extract()' still managed to match a 'rankings' name to a similar but incorrect 'college' name that just happened to be in the same city – three times:

  • 'Florida A&M University' was incorrectly matched to 'Florida State University', both in Tallahassee, FL.

  • 'Ohio State University--Columbus' was incorrectly matched to 'Columbus State University', both in Columbus, OH.

  • 'University of Lynchburg' was incorrectly matched to 'Virginia University of Lynchburg', both in Lynchburg, VA.

By looking at all names in 'college' for each city – or by searching CollegeData.com directly – and running a Google search when needed, it takes only a minute to determine that these schools should be renamed as follows:

rankings.loc['Florida A&M University',
             'Close Name'] = 'Florida Agricultural and Mechanical University'
rankings.loc['Ohio State University--Columbus',
             'Close Name'] = 'Ohio State University'
rankings.loc['University of Lynchburg',
             'Close Name'] = 'Lynchburg College'

We can now rename the index of these 208 matched rows to their 'Close Name' values, drop the 'Close Name' column, move them into 'cleaned_rankings', and drop them from 'rankings'.

# Mask of matched rows.
matched = ~rankings['Close Name'].isnull()
rankings = rankings.rename(dict(rankings.loc[matched,'Close Name']))
matched = ~rankings['Close Name'].isnull()  # recompute: the index labels just changed
rankings = rankings.drop(columns='Close Name')

# Move the matched rows into the cleaned table and out of 'rankings'.
cleaned_rankings = cleaned_rankings.append(rankings[matched])

rankings = rankings.drop(rankings[matched].index)

len(rankings)
80

Even fuzzier matching

We have 80 rows left in 'rankings' that haven’t yet lined up with a row in 'college'. Maybe the city is written differently between the two tables – we can try 'fuzzywuzzy.process.extract()' again, this time without requiring an exact city match.

def get_closest_name_modified(row):
    # If the city itself isn't an exact match, fuzzy match the city first.
    if row['City'] not in college['City'].values:
        city = process.extract(row['City'], college['City'].values)[0][0]
    else:
        city = row['City']

    names_in_city = college[college['City'] == city].index

    # Return the closest name within that city whose state also matches.
    close_names = process.extract(row.name, names_in_city)
    for close_name in close_names:
        if row['State'] == college.loc[close_name[0],'State']:
            return close_name[0]
    return np.nan

rankings['Close Name'] = rankings.apply(get_closest_name_modified, axis=1)
rankings['Close Name'].count()
66

Our modified function matched 66 of the 80 remaining names. Let’s inspect the results. WARNING: Prepare for a slightly smaller wall of a table!

matched = ~rankings['Close Name'].isnull()
rankings.loc[matched, ['Close Name']]
Close Name
Name
Alderson Broaddus University Alderson-Broaddus College
Atlanta Metropolitan State College Carver College
Augsburg University University of Minnesota Twin Cities
Augusta University University of Phoenix Augusta Campus
Bay Path University Bay Path College
Bay State College Boston Architectural College
Bismarck State College University of Mary
Brigham Young University–Hawaii Brigham Young University – Hawaii
CUNY–Baruch College Baruch College (City University of New York)
CUNY–Brooklyn College Brooklyn College (City University of New York)
CUNY–City College Baruch College (City University of New York)
CUNY–College of Staten Island College of Staten Island (City University of N…
CUNY–Lehman College Lehman College (City University of New York)
CUNY–Medgar Evers College Brooklyn College (City University of New York)
CUNY–Queens College Queens College (City University of New York)
California State University–Maritime Academy California Maritime Academy
Castleton University Castleton State College
College of Our Lady of the Elms Elms College
Doane University Doane College
Dunwoody College of Technology Augsburg College
Felician University Felician College
Georgia Southern University–Armstrong Savannah State University
Greenville University Greenville College
Gwynedd Mercy University Gwynedd-Mercy College
Harvard University Lesley University
Iowa Wesleyan University Iowa Wesleyan College
Keiser University Nova Southeastern University
King University King College
Lincoln College Lincoln Christian University
Maranatha Baptist University Maranatha Baptist Bible College
Maryville University of St. Louis Maryville University of Saint Louis
Middle Georgia State University Mercer University
Midway University Midway College
Mount St. Joseph University Xavier University
Notre Dame College of Ohio Chamberlain College of Nursing (OH)
Oklahoma State University Institute of Technology–Okmulgee Bacone College
Pennsylvania State University–University Park Penn State University Park
Pensacola State College University of West Florida
Rutgers University–Camden Rutgers The State University of New Jersey – C…
Rutgers University–New Brunswick Rutgers The State University of New Jersey
Rutgers University–Newark Rutgers The State University of New Jersey – N…
SUNY Buffalo State Buffalo State College
SUNY College of Technology–Alfred Alfred State College
SUNY College of Technology–Canton St. Lawrence University
SUNY College–Cortland State University of New York College at Cortland
SUNY College–Oneonta State University of New York College at Oneonta
SUNY College–Potsdam State University of New York College at Potsdam
SUNY Maritime College State University of New York Maritime College
SUNY Polytechnic Institute–Albany/Utica Utica College
SUNY–New Paltz State University of New York at New Paltz
Southern University and A&M College Southern University and Agricultural and Mecha…
St. John’s University (MN) Saint John’s University
St. Joseph’s College New York Brooklyn College (City University of New York)
Tennessee Wesleyan University Tennessee Wesleyan College
The Master’s University and Seminary Master’s College
Truett McConnell University Truett-McConnell College
Tusculum University Tusculum College
University of Holy Cross Louisiana State University Health Sciences Center
University of Jamestown Jamestown College
University of Mount Olive Mount Olive College
University of North Texas at Dallas Art Institute of Dallas
University of Providence University of Great Falls
Virginia Tech Virginia Polytechnic Institute and State Unive…
Virginia Wesleyan University Virginia Wesleyan College
Waldorf University Waldorf College
Williams Baptist University Williams Baptist College

Some of these are obviously right or wrong, but many required searching Google and CollegeData.com to investigate. In many cases the school has recently changed its name. Unfortunately, some schools are not listed on CollegeData.com in any form I could find. This process took significantly longer than the previous eye check – about 20 minutes of searching around. After manually renaming, let’s see how many remain unmatched:

renamings = {
    "Atlanta Metropolitan State College": np.nan,
    "Augsburg University": "Augsburg College",
    "Augusta University": "Georgia Regents University",
    "Bay State College": np.nan,
    "Bismarck State College": np.nan,
    "CUNY--City College": "City College of New York",
    "CUNY--Medgar Evers College": 
        "Medgar Evers College (City University of New York)",
    "Dunwoody College of Technology": np.nan,
    "Georgia Southern University--Armstrong": "Armstrong State University",
    "Harvard University": "Harvard College",
    "Keiser University": np.nan,
    "Middle Georgia State University": np.nan,
    "Mount St. Joseph University": "College of Mount St. Joseph",
    "Notre Dame College of Ohio": "Notre Dame College",
    "Oklahoma State University Institute of Technology--Okmulgee": np.nan,
    "Pensacola State College": np.nan,
    "SUNY College of Technology--Canton": np.nan,
    "SUNY Polytechnic Institute--Albany/Utica": 
        "State University of New York Institute of Technology",
    "St. Joseph's College New York":"St. Joseph's College - Brooklyn Campus",
    "University of Holy Cross":"Our Lady of Holy Cross College",
    "University of North Texas at Dallas": np.nan,
    "University of Providence": np.nan,
}

rankings.loc[list(renamings.keys()), 'Close Name'] = list(renamings.values())

matched = ~rankings['Close Name'].isnull()
rankings = rankings.rename(dict(rankings.loc[matched,'Close Name']))
matched = ~rankings['Close Name'].isnull()
rankings = rankings.drop(columns='Close Name')

cleaned_rankings = cleaned_rankings.append(rankings[matched])

rankings = rankings.drop(rankings[matched].index)
len(rankings)
25

No matches at all

There are 25 remaining schools in 'rankings' that did not fuzzy match with any school name in 'college', including the 11 we just found to be missing from CollegeData.com entirely. The other 14 are:

rankings[~rankings.index.isin(list(renamings.keys()))][['City','State']]
City State
Name
Abraham Baldwin Agricultural College Tifton GA
Alabama Agricultural and Mechanical University Normal AL
Chipola College Marianna FL
College of St. Benedict St. Joseph MN
Cottey College Nevada MO
Crowley’s Ridge College Paragould AR
DeVry University Downers Grove IL
Dean College Franklin MA
East Georgia State College Swainsboro GA
Gordon State College Barnesville GA
Indian River State College Fort Pierce FL
SUNY College of Technology–Delhi Delhi NY
South Florida State College Avon Park FL
University of Texas–Rio Grande Valley Edinburg TX

After some searching, two of these schools were indeed on CollegeData.com and were renamed accordingly. The notorious 'DeVry University' is listed on CollegeData.com – but not as a single school: CollegeData.com has 14 separate listings, whereas the US News ranking treats DeVry as one school. Unfortunately, none of those 14 pages provides any admissions data, so I simply matched the DeVry row in 'rankings' to the name 'DeVry University - Chicago' in the 'college' table.

After making changes, let’s see what remains:

renamings = {
    "Alabama Agricultural and Mechanical University": 
        "Alabama A&M University",
    "College of St. Benedict": "College of Saint Benedict",
    "DeVry University":"DeVry University - Chicago"
}

rankings.loc[list(renamings.keys()), 'Close Name'] = list(renamings.values())

matched = ~rankings['Close Name'].isnull()
rankings = rankings.rename(dict(rankings.loc[matched,'Close Name']))
matched = ~rankings['Close Name'].isnull()
rankings = rankings.drop(columns='Close Name')

cleaned_rankings = cleaned_rankings.append(rankings[matched])

rankings = rankings.drop(rankings[matched].index)
len(rankings)
22

We’ll return to these 22 unmatched schools in a moment.

Final inspection of cleaned rankings

For now, let’s check our 'cleaned_rankings' for any anomalies before we join it with 'college'. First, let’s see if there are any duplicates:

duplicated = cleaned_rankings.index.duplicated(keep=False)
cleaned_rankings[duplicated][['City','State']]
City State
Name
Concordia College Bronxville NY
Concordia College Moorhead MN

In terms of 'college' names, the Concordia in Bronxville, NY should be 'Concordia College New York' – the other one can stay as is.

second_duplicate = cleaned_rankings.index.duplicated(keep='last')
cleaned_rankings[second_duplicate]

idx = cleaned_rankings.index
idx = idx.where(~idx.duplicated(keep='last'), idx + ' New York')
cleaned_rankings.index = idx

cleaned_rankings[duplicated][['City','State']]
City State
Name
Concordia College New York Bronxville NY
Concordia College Moorhead MN

Let’s see any cities that don’t line up between the two tables:

unaligned_cities = college.loc[cleaned_rankings.index,'City'] != \
                                                    cleaned_rankings['City']
cleaned_rankings[unaligned_cities][['City']] \
                    .join(college[['City']], lsuffix=' (rankings)')
City (rankings) City
Name
American Jewish University Bel-Air Bel Air
Auburn University Auburn Auburn University
Bard College Annandale on Hudson Annandale-on-Hudson
Brewton-Parker College Mount Vernon Mt. Vernon
College of Mount St. Vincent Bronx Riverdale
Florida Memorial University Miami Miami-Dade
Lake Superior State University Sault Ste. Marie Sault Sainte Marie
Northern Illinois University DeKalb De Kalb
Nova Southeastern University Ft. Lauderdale Fort Lauderdale
Saint Mary-of-the-Woods College St. Mary-of-the-Woods Saint Mary of the Woods
St. Mary’s College of Maryland St. Marys City St. Mary’s City
University of Richmond Univ. of Richmond University of Richmond
Brigham Young University – Hawaii Laie Oahu Laie
Maryville University of Saint Louis St Louis St. Louis
Notre Dame College Cleveland South Euclid
Alabama A&M University Normal Huntsville
College of Saint Benedict St. Joseph Saint Joseph
DeVry University – Chicago Downers Grove Chicago

There’s nothing worrying here – I double checked that ‘Riverdale’ is a neighborhood in the ‘Bronx’, ‘Laie’ is on Oahu, ‘South Euclid’ borders ‘Cleveland’, and ‘Normal’ is the post-office name for the Alabama A&M campus within ‘Huntsville’. Also recall we matched DeVry to the Chicago listing. We can now drop the superfluous 'City' and 'State' columns from 'cleaned_rankings' in preparation for the join to 'college'.

cleaned_rankings = cleaned_rankings.drop(columns=['City','State'])
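Before joining, it is worth guarding against names that slipped through: every label in the cleaned rankings should already be an index label in 'college', or that school's ranking would be silently lost. A minimal sketch of the check, on hypothetical toy labels rather than the real frames:

```python
import pandas as pd

# Toy stand-ins for college.index and cleaned_rankings.index.
college_names = pd.Index(['Alpha University', 'Beta College', 'Gamma Institute'])
ranking_names = pd.Index(['Alpha University', 'Gamma Institute'])

# Labels present in the rankings but absent from the college table.
missing = ranking_names.difference(college_names)
print(len(missing))  # 0 means every ranking name will find its row
```

In the real notebook, the same `difference()` call against `college.index` would surface any remaining mismatches.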

What’s being left behind?

Before we continue, let’s consider the 22 unmatched schools that still remain in the uncleaned 'rankings' table and won’t come along with the join. Many of them are recently converted community colleges, though some, for whatever reason, just aren’t in CollegeData.com’s dataset.

Let’s look at how big of a chunk these schools represent from our rankings in terms of undergrad population:

(rankings[['Type','Undergrads']].groupby('Type').sum() / 
    cleaned_rankings[['Type','Undergrads']].groupby('Type').sum())
Undergrads
Type
National Liberal Arts Colleges 0.000000
National Universities 0.005520
Regional Colleges 0.179471
Regional Universities 0.007486

It seems that the 'Regional Colleges' rank category in our soon-to-be-joined 'college' data set will be 18% short in terms of undergrads, but our other rank categories will be missing less than 1%.

Joining the tables

Finally, on with the join:

college = college.join(cleaned_rankings)

That’s it!
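The simplicity works because '.join()' is a left join on the index by default: 'college' keeps every row, and schools with no ranking simply get NaN in the new columns. A toy sketch (with made-up frames) shows the behavior:

```python
import pandas as pd

# Toy frames mirroring the real join.
college_demo = pd.DataFrame({'City': ['Ames', 'Boise', 'Chico']},
                            index=['A Univ', 'B College', 'C Institute'])
rankings_demo = pd.DataFrame({'Rank': [1, 2]}, index=['A Univ', 'C Institute'])

# Left join on the index: all three college rows survive.
joined = college_demo.join(rankings_demo)
print(joined['Rank'].isnull().sum())  # 1 -- 'B College' has no ranking
```

Those NaN ranking values are exactly what gets filled with 'Unranked' next.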

We’ll add a new rank category 'Unranked' for those schools in 'college' that were not on a US News ranking list, and then we’ll export the whole thing to a .csv file:

college['Type'] = college['Type'].cat.add_categories('Unranked')
college['Type'] = college['Type'].fillna('Unranked')

college.to_csv('college.csv')

Now that our college admissions data is scraped, cleaned, and joined with rankings data, the hard part is over. We’ll move on to exploratory data analysis in the next post.

Yield Rate Gender Gap, Part 2: Cleaning the Data

This is Part 2 of an investigation into the yield rate disparity in college admissions between women and men. This is a personal project I started to help me tie together using python for web scraping, data cleaning, data visualization, hypothesis testing, statistical modeling, machine learning, and more. I appreciate feedback!


Let’s import and check the shape of the raw data we scraped from CollegeData.com back in Part 1:

import numpy as np
import pandas as pd

college = pd.read_csv('collegedata_scraped_backup.csv', index_col=0)

college.shape
(1813, 53)

We scraped 53 columns of data from 1813 different schools. For reference, this is what Penn State’s data looks like in its original form on CollegeData.com.


Identifying null values

We scraped the school name, the location, and everything in the “Profile of Fall Admission” table. The column labels of our raw scraped data are:

list(college.columns)
['schoolname',
 'location',
 'Overall Admission Rate',
 'Women',
 'Men',
 'Students Enrolled',
 'Women*',
 'Men*',
 'Early Decision Admission Rate',
 'Early Action Admission Rate',
 'Students Offered Wait List',
 'Students Accepting Wait List Position',
 'Students Admitted From Wait List',
 'Average GPA',
 '3.75 and Above',
 '3.50 - 3.74',
 '3.25 - 3.49',
 '3.00 - 3.24',
 '2.50 - 2.99',
 '2.00 - 2.49',
 'SAT Math',
 'Score of 700 - 800',
 'Score of 600 - 700',
 'Score of 500 - 600',
 'Score of 400 - 500',
 'Score of 300 - 400',
 'Score of 200 - 300',
 'SAT Critical Reading',
 'Score of 700 - 800*',
 'Score of 600 - 700*',
 'Score of 500 - 600*',
 'Score of 400 - 500*',
 'Score of 300 - 400*',
 'Score of 200 - 300*',
 'SAT Writing',
 'Score of 700 - 800**',
 'Score of 600 - 700**',
 'Score of 500 - 600**',
 'Score of 400 - 500**',
 'Score of 300 - 400**',
 'Score of 200 - 300**',
 'ACT Composite',
 'Score of 30 - 36',
 'Score of 24 - 29',
 'Score of 18 - 23',
 'Score of 12 - 17',
 'Score of 6 - 11',
 'Score of 5 or Below',
 'High School Class Rank',
 'National Merit Scholar',
 'Valedictorian',
 'Class President',
 'Student Government Officer']

If you look up enough schools on CollegeData.com, you’ll notice not every school has information available for each column here. Very sparsely populated columns – that is, columns filled mostly with null values – may not be very useful in helping us analyze the yield rate disparity.

Pandas represents null values with np.NaN. CollegeData.com probably uses something more friendly to indicate a missing value to a human reader. Assuming the dataset contains plenty of null values, we can get an idea of the placeholder CollegeData.com uses by looking at the most common value in each column of the DataFrame with the .mode() method.

list(college.mode().loc[0])
['Chamberlain College of Nursing ',
 'Phoenix,\xa0AZ\xa0\xa085021',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 '0',
 '0',
 '0',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 '0%',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 '0%',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 '0%',
 '0%',
 'Top tenth:   94% Top quarter:   99% Top half:  100%',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported']

It looks like the string “Not reported” is what CollegeData.com is using to indicate a null value in this dataset. We will replace all instances of “Not reported” with np.NaN.

college = college.replace('Not reported',np.NaN)

Dropping sparse columns

Now that we’ve got null data properly marked, we can check what columns are filled with null data and potentially drop them. I’m going to consider dropping columns that are more than 75% null.

drop_cols = college.columns[college.isnull().sum() > 0.75*len(college)]
list(drop_cols)
['Early Decision Admission Rate',
 'Early Action Admission Rate',
 'Students Offered Wait List',
 'Students Accepting Wait List Position',
 'Students Admitted From Wait List',
 'National Merit Scholar',
 'Valedictorian',
 'Class President',
 'Student Government Officer']

In a future analysis of only the most selective schools, it could be useful to keep track of the few schools that offer early admission. As for the remaining columns, their data is too sparse and too inconsistently reported to serve as useful predictors in any model of the yield rate disparity across all schools.

After dropping these columns, let’s see what remains:

college = college.drop(columns=drop_cols)
list(college.columns)
['schoolname',
 'location',
 'Overall Admission Rate',
 'Women',
 'Men',
 'Students Enrolled',
 'Women*',
 'Men*',
 'Average GPA',
 '3.75 and Above',
 '3.50 - 3.74',
 '3.25 - 3.49',
 '3.00 - 3.24',
 '2.50 - 2.99',
 '2.00 - 2.49',
 'SAT Math',
 'Score of 700 - 800',
 'Score of 600 - 700',
 'Score of 500 - 600',
 'Score of 400 - 500',
 'Score of 300 - 400',
 'Score of 200 - 300',
 'SAT Critical Reading',
 'Score of 700 - 800*',
 'Score of 600 - 700*',
 'Score of 500 - 600*',
 'Score of 400 - 500*',
 'Score of 300 - 400*',
 'Score of 200 - 300*',
 'SAT Writing',
 'Score of 700 - 800**',
 'Score of 600 - 700**',
 'Score of 500 - 600**',
 'Score of 400 - 500**',
 'Score of 300 - 400**',
 'Score of 200 - 300**',
 'ACT Composite',
 'Score of 30 - 36',
 'Score of 24 - 29',
 'Score of 18 - 23',
 'Score of 12 - 17',
 'Score of 6 - 11',
 'Score of 5 or Below',
 'High School Class Rank']

Dropping related columns

Referring back to Penn State’s CollegeData.com page, each column matching the pattern “Score of ### – ###” shows the percentage of applicants who scored in that range on one of the standardized tests. While this is potentially interesting for a future analysis, we already get the mean and middle 50% for each test in the SAT Math, SAT Critical Reading, SAT Writing, and ACT Composite columns. Similarly, each “#.## – #.##” column contains the percentage of students who earned a GPA in that range, but we already get the mean GPA of applicants in the Average GPA column.

Since these score range columns likely correlate with the mean score columns, we can drop the score range columns without losing the ability to use SAT/ACT/GPA scores as a potential predictor for the yield rate disparity.

As it turns out, to drop these columns, we can simply remove any column from the list that has a label containing a number.

drop_cols = college.columns[college.columns.str.contains(r'[\d]')]
college = college.drop(columns = drop_cols)

Removing commas

Large numbers in this dataset contain commas to separate thousands. We will remove all commas from the data in every column to make it easier to later extract numbers.

for col in college.columns:
    college[col] = college[col].str.replace(',','')
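An equivalent one-liner is DataFrame.replace with regex=True, which substitutes within strings instead of matching whole cells – a sketch on toy data (the column values here are invented):

```python
import pandas as pd

# Toy frame with comma-formatted numbers.
demo = pd.DataFrame({'Apps (all)': ['52,974', '1,234'],
                     'City': ['University Park', 'Chicago']})

# Strip every comma in every string cell at once.
demo = demo.replace(',', '', regex=True)
print(list(demo['Apps (all)']))  # ['52974', '1234']
```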

‘Students Enrolled’,’Women*’, and ‘Men*’ columns

Our surviving columns are:

original_columns = college.columns  # Save the names of the original columns for later so we can drop them.
list(original_columns)
['schoolname',
 'location',
 'Overall Admission Rate',
 'Women',
 'Men',
 'Students Enrolled',
 'Women*',
 'Men*',
 'Average GPA',
 'SAT Math',
 'SAT Critical Reading',
 'SAT Writing',
 'ACT Composite',
 'High School Class Rank']

Though we now only have these 14 columns for each school, there are more than 14 pieces of numeric data saved inside these columns.

For Penn State, for example, the 'Students Enrolled', 'Women*', and 'Men*' columns contain:

college.loc[59, 'Students Enrolled']
'9800 (33%) of 29878 admitted students enrolled'
college.loc[59, 'Women*']
'4710 (29%) of 16047 admitted students enrolled'
college.loc[59, 'Men*']
'5090 (37%) of 13831 admitted students enrolled'

We need to split these columns, and nearly all of the other columns as well, into separate columns.

We can use pandas’ .str.extract() method with a regular expression argument, which outputs a new column of data for each group in the regular expression. We can then chain the .astype() method to convert the output columns to our desired data type, and convert any percentage columns to decimals by dividing them by 100.

One at a time, we will search each of the columns 'Students Enrolled', 'Women*', and 'Men*' using a regular expression that captures three groups of numbers, outputting three new floating point columns for each old column: ‘Enrolled’, ‘Yield Rate’, and ‘Offers’.

regexp = r'(\d*)\D*(\d*)%\D*(\d*)'

college[['Enrolled (all)','Yield Rate (all)','Offers (all)']] = college['Students Enrolled'].str.extract(regexp).astype('float')
college[['Enrolled (women)','Yield Rate (women)','Offers (women)']] = college['Women*'].str.extract(regexp).astype('float')
college[['Enrolled (men)','Yield Rate (men)','Offers (men)']] = college['Men*'].str.extract(regexp).astype('float')

college[['Yield Rate (all)','Yield Rate (women)','Yield Rate (men)']] /= 100   #to make rates a decimal percentage
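As a quick sanity check, here is the same pattern in miniature on Penn State's raw 'Students Enrolled' string from above – the three captured digit groups become three float columns:

```python
import pandas as pd

# Penn State's raw string, run through the extraction pattern used above.
s = pd.Series(['9800 (33%) of 29878 admitted students enrolled'])
out = s.str.extract(r'(\d*)\D*(\d*)%\D*(\d*)').astype('float')
print(out.iloc[0].tolist())  # [9800.0, 33.0, 29878.0]
```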

The ‘location’ column

Let’s see what the raw location data looks like:

college.loc[59,'location']
'University Park\xa0PA\xa0\xa016802'

We will split the ‘location’ string column into ‘City’ string, ‘State’ category, and ‘Zip’ integer columns.

regexp = r'^(.*)\s([A-Z]{2})\s*(\d{5})'

college[['City','State','Zip']] = college['location'].str.extract(regexp)
college['State'] = college['State'].astype('category')
college['Zip'] = college['Zip'].astype('int')

The ‘Overall Admission Rate’, ‘Women’, and ‘Men’ columns

Let’s see what these columns look like:

college.loc[59,'Overall Admission Rate']
'56% of 52974 applicants were admitted'
college.loc[59,'Men']
'53% of 25925 applicants were admitted'
college.loc[59,'Women']
'59% of 27049 applicants were admitted'

We will split each of these columns into an ‘Offer Rate’ and ‘Apps’ column of floating point numbers.

regexp = r'(\d*)%\D*(\d*)'

college[['Offer Rate (all)','Apps (all)']] = college['Overall Admission Rate'].str.extract(regexp).astype('float')
college[['Offer Rate (women)','Apps (women)']] = college['Women'].str.extract(regexp).astype('float')
college[['Offer Rate (men)','Apps (men)']] = college['Men'].str.extract(regexp).astype('float')

college[['Offer Rate (all)','Offer Rate (women)','Offer Rate (men)']] /= 100    #to make rates a decimal percentage

The ‘Average GPA’ column

Taking a look:

college.loc[59,'Average GPA']
'3.59'

The ‘Average GPA’ column doesn’t need to be split, but its value does need to be extracted from any surrounding text that could accompany it at other schools, and converted to a float:

regexp = r'(\d\.\d\d)'

college['GPA Mean'] = college['Average GPA'].str.extract(regexp).astype('float')

The ‘SAT Math’, ‘SAT Critical Reading’, ‘SAT Writing’, and ‘ACT Composite’ columns

Let’s look at these four columns:

college.loc[59,'SAT Math']
'612 average 560-670 range of middle 50%'
college.loc[59,'SAT Critical Reading']
'578 average 530-630 range of middle 50%'
college.loc[59,'SAT Writing']
nan
college.loc[59,'ACT Composite']
'27 average 25-29 range of middle 50%'

CollegeData.com doesn’t report a value for Penn State’s 'SAT Writing' because Penn State does not consider that score in its admissions process. Looking at other schools, the 'SAT Writing' column does follow the same format as the other SAT columns.

We split each of the three ‘SAT’ columns – Math, Reading, and Writing – and the one ‘ACT’ columns into three float columns each, one for the Mean, one for the 25th percentile, and one for the 75th percentile.

regexp = r'(\d*) average (\d*)-(\d*)'

college[['SAT Math Mean','SAT Math 25th','SAT Math 75th']] = college['SAT Math'].str.extract(regexp).astype('float')
college[['SAT Reading Mean','SAT Reading 25th','SAT Reading 75th']] = college['SAT Critical Reading'].str.extract(regexp).astype('float')
college[['SAT Writing Mean','SAT Writing 25th','SAT Writing 75th']] = college['SAT Writing'].str.extract(regexp).astype('float')
college[['ACT Mean','ACT 25th','ACT 75th']] = college['ACT Composite'].str.extract(regexp).astype('float')

The ‘High School Class Rank’ column

Finally, let’s look at the last column:

college.loc[59, 'High School Class Rank']
'Top tenth:   36% Top quarter:   77% Top half:   97% Bottom half:    3%'

We split this column into four float columns for the Top 10th, Top 25th, Top 50th, and Bottom 50th.

regexp = r'(\d*)%\D*(\d*)%\D*(\d*)%\D*(\d*)%'

college[['HS Top 10th','HS Top 25th','HS Top 50th','HS Bottom 50th']] = college['High School Class Rank'].str.extract(regexp).astype('float')
college[['HS Top 10th','HS Top 25th','HS Top 50th','HS Bottom 50th']] /= 100   #decimal percentage

Changing the dataframe index

To make sure the name of each school is unique, we will append any duplicate school names with the state where the school is located.

for schoolId in college[college['schoolname'].duplicated(keep=False)].index:
    college.loc[schoolId,'schoolname'] = college.loc[schoolId,'schoolname']+' ('+college.loc[schoolId,'State']+')'
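The same renaming can be done without the loop, using a vectorized boolean mask – a sketch on toy data (the school names here are invented):

```python
import pandas as pd

# Toy frame with one duplicated school name across two states.
demo = pd.DataFrame({'schoolname': ['Union College', 'Union College', 'Hope College'],
                     'State': ['NY', 'KY', 'MI']})

# Mark every occurrence of a duplicated name, then append the state to each.
dup = demo['schoolname'].duplicated(keep=False)
demo.loc[dup, 'schoolname'] = (demo.loc[dup, 'schoolname']
                               + ' (' + demo.loc[dup, 'State'] + ')')
print(list(demo['schoolname']))
```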

Instead of indexing our dataframe by CollegeData.com’s arbitrary schoolId, we will use the name of the school as the index.

college = college.set_index('schoolname', drop=False).sort_index()

Dropping old columns

And finally, we will delete all the old columns from the dataset.

college = college.drop(columns=original_columns)

Inspecting our cleaned dataframe

Let’s look at the .info() of our dataframe in its freshly cleaned state:

college.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1813 entries, Abilene Christian University to Youngstown State University
Data columns (total 35 columns):
Enrolled (all)        1303 non-null float64
Yield Rate (all)      1303 non-null float64
Offers (all)          1303 non-null float64
Enrolled (women)      1309 non-null float64
Yield Rate (women)    1309 non-null float64
Offers (women)        1309 non-null float64
Enrolled (men)        1265 non-null float64
Yield Rate (men)      1265 non-null float64
Offers (men)          1265 non-null float64
City                  1813 non-null object
State                 1813 non-null category
Zip                   1813 non-null int64
Offer Rate (all)      1548 non-null float64
Apps (all)            1548 non-null float64
Offer Rate (women)    1437 non-null float64
Apps (women)          1437 non-null float64
Offer Rate (men)      1431 non-null float64
Apps (men)            1431 non-null float64
GPA Mean              1343 non-null float64
SAT Math Mean         951 non-null float64
SAT Math 25th         951 non-null float64
SAT Math 75th         951 non-null float64
SAT Reading Mean      933 non-null float64
SAT Reading 25th      933 non-null float64
SAT Reading 75th      933 non-null float64
SAT Writing Mean      556 non-null float64
SAT Writing 25th      556 non-null float64
SAT Writing 75th      556 non-null float64
ACT Mean              1030 non-null float64
ACT 25th              1030 non-null float64
ACT 75th              1030 non-null float64
HS Top 10th           1035 non-null float64
HS Top 25th           1035 non-null float64
HS Top 50th           1035 non-null float64
HS Bottom 50th        1035 non-null float64
dtypes: category(1), float64(32), int64(1), object(1)
memory usage: 500.4+ KB

Our original, raw scraped dataframe was 1813 schools long and 53 dirty columns wide. In our cleaning, we dropped quite a few sparse columns. But we also split many original columns into clean, new columns.

We can check that Penn State’s data was properly cleaned. Since we’ve reset the index, we can select it directly by name now.

college.loc['Penn State University Park']
Enrolled (all)                   9800
Yield Rate (all)                 0.33
Offers (all)                    29878
Enrolled (women)                 4710
Yield Rate (women)               0.29
Offers (women)                  16047
Enrolled (men)                   5090
Yield Rate (men)                 0.37
Offers (men)                    13831
City                  University Park
State                              PA
Zip                             16802
Offer Rate (all)                 0.56
Apps (all)                      52974
Offer Rate (women)               0.59
Apps (women)                    27049
Offer Rate (men)                 0.53
Apps (men)                      25925
GPA Mean                         3.59
SAT Math Mean                     612
SAT Math 25th                     560
SAT Math 75th                     670
SAT Reading Mean                  578
SAT Reading 25th                  530
SAT Reading 75th                  630
SAT Writing Mean                  NaN
SAT Writing 25th                  NaN
SAT Writing 75th                  NaN
ACT Mean                           27
ACT 25th                           25
ACT 75th                           29
HS Top 10th                      0.36
HS Top 25th                      0.77
HS Top 50th                      0.97
HS Bottom 50th                   0.03
Name: Penn State University Park, dtype: object

Everything looks good here. The NaN for the SAT Writing is appropriate since Penn State did not consider the SAT Writing in admissions – CollegeData.com showed values of ‘Not reported’.


Exporting the dataframe

We will save the cleaned dataframe to a new .csv file.

college.to_csv('collegedata_cleaned_backup.csv')
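As a quick sanity check (not part of the original workflow), we can verify that a frame like ours survives the CSV round trip – the school-name index and values should come back intact when read with index_col=0. The tiny frame below is a toy stand-in for the full cleaned dataframe:

```python
import pandas as pd

# Toy stand-in for the cleaned 'college' dataframe, using Penn State's figures.
demo = pd.DataFrame(
    {'Enrolled (all)': [9800.0], 'Yield Rate (all)': [0.33]},
    index=['Penn State University Park'],
)
demo.to_csv('roundtrip_demo.csv')

# Reading back with index_col=0 restores the school-name index and values.
restored = pd.read_csv('roundtrip_demo.csv', index_col=0)
print(restored.shape)        # (1, 2)
print(list(restored.index))  # ['Penn State University Park']
```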

In Part 3, we will import a separate dataset of US News and World Report’s college rankings and combine it with our cleaned school dataset.

Yield Rate Gender Gap, Part 1: Scraping the Web

This is Part 1 of an investigation into the yield rate disparity in college admissions between women and men. This is a personal project I started to help me tie together using python for web scraping, data cleaning, data visualization, hypothesis testing, statistical modeling, machine learning, and more. I appreciate feedback!


University admissions and the yield rate

In my previous work running the education startup Newton China, I frequently ended up using the website CollegeData.com as a reference – it lists a ton of data for a lot of schools.

As an example, searching my alma mater brings up Penn State’s page, which is actually one of six separate CollegeData pages of information: Overview, Admission, Money Matters, Academics, Campus Life, and Students.

Scrolling down, you’ll find a bunch of numbers organized somewhat regularly into rows. Some table rows, like Average GPA, are easy enough, containing only one value. But others, like the row labeled Students Enrolled, contain three pieces of data.

The first number in 'Students Enrolled' is the straightforward count of students enrolled – presumably 9,800 students attending school in the fall as incoming freshmen. The third number – 29,878 – is the number of students who were offered admission. The second number – 33% – is called the yield rate, and it is the number that raised the questions that inspired me to collect and analyze this data.


But why care about the yield rate?

The yield rate is the percentage of offered students who actually enrolled, and it is of considerable importance to many college admissions departments. If the admissions department enrolls too few students, the school will not receive enough tuition to cover its costs. If it enrolls too many, the extra students will push dorms and classrooms beyond capacity.

Not every student who is given an offer letter – a letter of acceptance – will accept it. In fact, it seems only 33% of Penn State’s offer letters were accepted. Presumably, the other 67% of students who were offered admission elected to attend another school. If this 33% yield rate is relatively consistent year to year, which it normally is, the school will have a good idea of how many offer letters to give out in order to end up with the right number of enrolled students.

Curiously, the yield rate at Penn State for men (37%) seems a good amount higher than that for women (29%). Men seem more likely to accept Penn State’s offers than women are. After looking at a few other schools’ yield rates, I became curious whether there actually is a significant disparity in yield rate across schools – and if so, how is this disparity distributed and what factors may predict it?
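These reported numbers are easy to double-check against each other. A quick computation, using the figures from Penn State’s page, recovers the yield rates and the gap:

```python
# Figures from Penn State's CollegeData page.
enrolled = {'all': 9800, 'women': 4710, 'men': 5090}
offers = {'all': 29878, 'women': 16047, 'men': 13831}

# Yield rate = enrolled / offered, rounded to two places as reported.
yield_rate = {k: round(enrolled[k] / offers[k], 2) for k in enrolled}
print(yield_rate)  # {'all': 0.33, 'women': 0.29, 'men': 0.37}

# The gap at Penn State: men's yield is 8 percentage points above women's.
gap = yield_rate['men'] - yield_rate['women']
print(round(gap, 2))  # 0.08
```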


Scraping collegedata.com

The collegedata.com url for Penn State’s Admission information page looks like this:
https://www.collegedata.com/cs/data/college/college_pg02_tmpl.jhtml?schoolId=59

The 59 at the end of the URL is the schoolId. The 02 part of the URL indicates you are on the 'Admission' information page. Not every schoolId corresponds to a school, but most do, especially in the lower range of numbers. Attempting to access a page for a number with no school loads a default page with the title "Retrieve a Saved Search". After poking around a bit, I found that at larger schoolIds, especially over 1000, real school pages became sparser, and I’m fairly confident there are no schools listed with a schoolId over 2500.

BASE_URL = "http://www.collegedata.com/cs/data/college/college_pg02_tmpl.jhtml?schoolId="
START_ID = 0
END_ID = 2500
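Judging from the single example URL, the other five information pages presumably differ only in the pgNN segment. The template below is my assumption from that one URL, not something documented by CollegeData:

```python
# Assumed URL template: pg01..pg06 select the information tab (pg02 = Admission).
URL_TMPL = ("https://www.collegedata.com/cs/data/college/"
            "college_pg{page:02d}_tmpl.jhtml?schoolId={school_id}")

url = URL_TMPL.format(page=2, school_id=59)
print(url)
# https://www.collegedata.com/cs/data/college/college_pg02_tmpl.jhtml?schoolId=59
```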

In order to determine if the yield disparity is real, I only need to scrape the yield rate from each page. But grabbing as much other data as possible from the six collegedata.com pages of information for each school at the same time could prove useful later when trying to find predictors for yield rate.

I decided a good balance would be to only get the data in the <td> tags under the heading 'Profile of Fall Admission', identified as <div id='section8'> in the HTML, in addition to the school’s name and its location. I can always come back to scrape more if I need it in the future.

Building a BeautifulSoup scraper

I used the wonderful BeautifulSoup library to help me get what I need from CollegeData’s pages. It maps the HTML into a tree that can be descended down (and up) and side to side to access the elements and strings you want.

The string 'schoolname' is conveniently located in the first <h1> tag. The 'location' is a bit more difficult to grab, as it isn’t nested inside a unique tag, but it can be found easily by searching for the <th> text 'City, State, Zip' and going from there.

Finding 'section8' is no problem. From there, a relatively simple for loop runs through all of its <tr> descendants, each of which may have a <th> child to be used as a key and a <td> child to be used as a value. Though most of the <th> values are unique, some are not, including 'Women', 'Men', and the SAT score information further down the section.

from bs4 import BeautifulSoup

def scrape(response, REJECT_PAGE_TITLE = "Retrieve a Saved Search"):
    # response is from requests.get()
            
    page = BeautifulSoup(response.text, "lxml")
    
    if page.h1 is None or page.h1.string == REJECT_PAGE_TITLE:
        
        return None

    else:        
        scraped = {}        
        scraped['schoolname'] = page.h1.string
        scraped['location'] = page.find('th', text="City, State, Zip").next_sibling.string
        
        for tag in page.find(id="section8").descendants:
            
            if tag.name == 'tr' and tag.th is not None and tag.td is not None:
                key = " ".join(string for string in tag.th.stripped_strings)
                val = " ".join(string for string in tag.td.stripped_strings)
                
                while key in scraped:  # Temporarily mark duplicate <th> header cells
                    key += "*"
                
                scraped[key] = val
                
        return scraped

An empty dictionary 'scraped' is created to hold our scraped values. We add 'schoolname' and 'location' before looping through id="section8", saving every non-empty pair of <th> and <td>. To avoid clobbering duplicate keys, we append a '*' to the key as a temporary marker until we can clean up the key names later. Finally, the 'scraped' dictionary is returned to the caller.
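The '*' marking is easiest to see in isolation – each repeated header key picks up one more asterisk than the previous collision. A minimal demonstration (with placeholder values):

```python
# Minimal demonstration of the key de-duplication used in scrape().
scraped = {}
for key, val in [('Women', '1'), ('Men', '2'), ('Women', '3'), ('Women', '4')]:
    while key in scraped:  # keep adding '*' until the key is unique
        key += '*'
    scraped[key] = val

print(scraped)
# {'Women': '1', 'Men': '2', 'Women*': '3', 'Women**': '4'}
```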


Requesting 2500 pages!

I’ve read about methods to evade anti-scraping measures, but I encountered no pushback from CollegeData when making these basic requests. It took about 40 minutes to run through this loop.

from requests import get
from IPython.core.display import clear_output
from warnings import warn

scraped_pages = {}
    
for schoolId in range(START_ID, END_ID+1):
    
    url = BASE_URL + str(schoolId) 
    print(url)   # a status update
    
    response = get(url)
    
    if response.status_code != 200:
        warn('Request {} caused status code {}.'.format(schoolId, response.status_code))
    
    scraped = scrape(response)
    
    if scraped:
        scraped_pages[schoolId] = scraped
        
    clear_output(wait = True) # clear the status update

print('Requested schoolId {} to {}, scraping {} pages.'.format(START_ID, END_ID, len(scraped_pages)))
Requested schoolId 0 to 2500, scraping 1813 pages.
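Had CollegeData pushed back, a simple throttle-and-retry wrapper could have been dropped around the GET call. This sketch is hypothetical (I didn’t need it), and it takes the fetch function as a parameter so it can be exercised without hitting the network:

```python
import time

def polite_get(fetch, url, delay=0.0, retries=3):
    """Hypothetical wrapper: pause before each request, and retry with
    exponential backoff on any non-200 status."""
    response = None
    for attempt in range(retries):
        time.sleep(delay)            # be polite between requests
        response = fetch(url)
        if response.status_code == 200:
            return response
        delay = max(2 * delay, 0.1)  # back off after each failure
    return response                  # caller inspects the final status

# Stub fetcher that fails twice before succeeding, to exercise the retry path.
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code

calls = []
def flaky_fetch(url):
    calls.append(url)
    return FakeResponse(200 if len(calls) >= 3 else 503)

resp = polite_get(flaky_fetch, 'http://example.com/?schoolId=59')
print(resp.status_code, len(calls))  # 200 3
```

In the real loop, fetch would simply be requests.get.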

We now have a dictionary, 'scraped_pages', filled with keys 'schoolId' associated with scraped dictionaries, which themselves are filled with scraped <th> keys and <td> values, in addition to 'location' and 'schoolname'. Pandas’ 'DataFrame.from_dict' method will quickly convert this to a DataFrame, which we will then backup to .csv before moving forward.
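On a toy version of the nested dictionary (the inner field values here are made up), orient='index' turns the outer schoolId keys into the row index and the union of inner keys into columns, filling missing fields with NaN:

```python
import pandas as pd

# Toy stand-in for scraped_pages; the field values are illustrative only.
toy_pages = {
    59: {'schoolname': 'Penn State University Park', 'Overall Admission Rate': '56%'},
    60: {'schoolname': 'Another School'},
}
toy = pd.DataFrame.from_dict(toy_pages, orient='index')
print(toy.shape)        # (2, 2)
print(list(toy.index))  # [59, 60]
print(toy.loc[60, 'Overall Admission Rate'])  # nan (missing fields become NaN)
```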


Inspecting the results

import pandas as pd

college = pd.DataFrame.from_dict(scraped_pages, orient='index')  # convert scraped pages to a DataFrame
college.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1813 entries, 6 to 2099
Data columns (total 53 columns):
schoolname                               1813 non-null object
location                                 1813 non-null object
Overall Admission Rate                   1813 non-null object
Women                                    1813 non-null object
Men                                      1813 non-null object
Students Enrolled                        1813 non-null object
Women*                                   1813 non-null object
Men*                                     1813 non-null object
Early Decision Admission Rate            245 non-null object
Early Action Admission Rate              392 non-null object
Students Offered Wait List               335 non-null object
Students Accepting Wait List Position    295 non-null object
Students Admitted From Wait List         324 non-null object
Average GPA                              1813 non-null object
3.75 and Above                           1807 non-null object
3.50 - 3.74                              1807 non-null object
3.25 - 3.49                              1799 non-null object
3.00 - 3.24                              1804 non-null object
2.50 - 2.99                              1774 non-null object
2.00 - 2.49                              1557 non-null object
SAT Math                                 1813 non-null object
Score of 700 - 800                       1813 non-null object
Score of 600 - 700                       1813 non-null object
Score of 500 - 600                       1813 non-null object
Score of 400 - 500                       1813 non-null object
Score of 300 - 400                       1813 non-null object
Score of 200 - 300                       1813 non-null object
SAT Critical Reading                     1813 non-null object
Score of 700 - 800*                      1813 non-null object
Score of 600 - 700*                      1813 non-null object
Score of 500 - 600*                      1813 non-null object
Score of 400 - 500*                      1813 non-null object
Score of 300 - 400*                      1813 non-null object
Score of 200 - 300*                      1813 non-null object
SAT Writing                              1813 non-null object
Score of 700 - 800**                     1813 non-null object
Score of 600 - 700**                     1813 non-null object
Score of 500 - 600**                     1813 non-null object
Score of 400 - 500**                     1813 non-null object
Score of 300 - 400**                     1813 non-null object
Score of 200 - 300**                     1813 non-null object
ACT Composite                            1813 non-null object
Score of 30 - 36                         1813 non-null object
Score of 24 - 29                         1813 non-null object
Score of 18 - 23                         1813 non-null object
Score of 12 - 17                         1813 non-null object
Score of 6 - 11                          1813 non-null object
Score of 5 or Below                      1813 non-null object
High School Class Rank                   1107 non-null object
National Merit Scholar                   1813 non-null object
Valedictorian                            1813 non-null object
Class President                          1813 non-null object
Student Government Officer               1813 non-null object
dtypes: object(53)
memory usage: 764.9+ KB

I scraped 53 separate columns for 1813 schools. According to the US Dept. of Education’s National Center for Education Statistics, there were 2870 4-year colleges as of 2011. It would be very helpful to know what criteria the people behind CollegeData used to add schools to their site. Are CollegeData’s schools only the largest? Only those that filled out a survey? Only those that paid?

I’m not sure how accurately this sample represents the overall population of 4-year colleges – or how well insights drawn from analyzing it can be generalized to the greater population of all schools – so more investigation into CollegeData’s data acquisition process is warranted. Be warned!

import pandas as pd
college = pd.DataFrame.from_dict(scraped_pages, orient='index') # convert scraped pages to a pandas DataFrame
college.to_csv('collegedata_scraped_backup.csv')   # backup

With that said, I will move on to cleaning the data. Perhaps after some exploration we will have a better idea of the quality of this sample and of the inferences made about the yield rate disparity.