Yield Rate Gender Gap, Part 2: Cleaning the Data

This is Part 2 of an investigation into the yield rate disparity between women and men in college admissions. It’s a personal project I started to tie together Python skills: web scraping, data cleaning, data visualization, hypothesis testing, statistical modeling, machine learning, and more. I appreciate feedback!


Let’s import the raw data we scraped from CollegeData.com back in Part 1 and check its shape:

import numpy as np
import pandas as pd

college = pd.read_csv('collegedata_scraped_backup.csv', index_col=0)

college.shape
(1813, 53)

We scraped 53 columns of data from 1813 different schools. For reference, this is what Penn State’s data looks like in its original form on CollegeData.com.


Identifying null values

We scraped the school name, the location, and everything in the “Profile of Fall Admission” table. The column labels of our raw scraped data are:

list(college.columns)
['schoolname',
 'location',
 'Overall Admission Rate',
 'Women',
 'Men',
 'Students Enrolled',
 'Women*',
 'Men*',
 'Early Decision Admission Rate',
 'Early Action Admission Rate',
 'Students Offered Wait List',
 'Students Accepting Wait List Position',
 'Students Admitted From Wait List',
 'Average GPA',
 '3.75 and Above',
 '3.50 - 3.74',
 '3.25 - 3.49',
 '3.00 - 3.24',
 '2.50 - 2.99',
 '2.00 - 2.49',
 'SAT Math',
 'Score of 700 - 800',
 'Score of 600 - 700',
 'Score of 500 - 600',
 'Score of 400 - 500',
 'Score of 300 - 400',
 'Score of 200 - 300',
 'SAT Critical Reading',
 'Score of 700 - 800*',
 'Score of 600 - 700*',
 'Score of 500 - 600*',
 'Score of 400 - 500*',
 'Score of 300 - 400*',
 'Score of 200 - 300*',
 'SAT Writing',
 'Score of 700 - 800**',
 'Score of 600 - 700**',
 'Score of 500 - 600**',
 'Score of 400 - 500**',
 'Score of 300 - 400**',
 'Score of 200 - 300**',
 'ACT Composite',
 'Score of 30 - 36',
 'Score of 24 - 29',
 'Score of 18 - 23',
 'Score of 12 - 17',
 'Score of 6 - 11',
 'Score of 5 or Below',
 'High School Class Rank',
 'National Merit Scholar',
 'Valedictorian',
 'Class President',
 'Student Government Officer']

If you look up enough schools on CollegeData.com, you’ll notice that not every school has information available for each of these columns. Very sparsely populated columns, that is, columns consisting mostly of null values, may not be very useful in analyzing the yield rate disparity.

Pandas represents null values with np.nan. CollegeData.com probably uses something friendlier to signal a missing value to a human reader. Assuming the dataset contains many null values, we can get an idea of the placeholder CollegeData.com uses by looking at the most common value in each column of the DataFrame with the .mode() method.

list(college.mode().loc[0])
['Chamberlain College of Nursing ',
 'Phoenix,\xa0AZ\xa0\xa085021',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 '0',
 '0',
 '0',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 '0%',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 '0%',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported',
 '0%',
 '0%',
 'Top tenth:   94% Top quarter:   99% Top half:  100%',
 'Not reported',
 'Not reported',
 'Not reported',
 'Not reported']

It looks like the string “Not reported” is CollegeData.com’s placeholder for a null value. We will replace every instance of “Not reported” with np.nan.

college = college.replace('Not reported', np.nan)
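
As a quick sanity check, no cell should still contain the placeholder string after the replacement:

(college == 'Not reported').sum().sum()
0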

Dropping sparse columns

Now that null data is properly marked, we can check which columns consist mostly of nulls and consider dropping them. I’m going to consider dropping columns that are more than 75% null.

drop_cols = college.columns[college.isnull().sum() > 0.75*len(college)]
list(drop_cols)
['Early Decision Admission Rate',
 'Early Action Admission Rate',
 'Students Offered Wait List',
 'Students Accepting Wait List Position',
 'Students Admitted From Wait List',
 'National Merit Scholar',
 'Valedictorian',
 'Class President',
 'Student Government Officer']
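
As an aside, the same selection can be written with .isnull().mean(), which gives the fraction of nulls per column directly; either form selects the same columns:

drop_cols = college.columns[college.isnull().mean() > 0.75]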

In a future analysis of only the most selective schools, it could be useful to keep track of the few schools that offer early admission. The remaining columns, though, are reported too rarely and too inconsistently to serve as useful predictors in any model of the yield rate disparity across all schools.

After dropping these columns, let’s see what remains:

college = college.drop(columns=drop_cols)
list(college.columns)
['schoolname',
 'location',
 'Overall Admission Rate',
 'Women',
 'Men',
 'Students Enrolled',
 'Women*',
 'Men*',
 'Average GPA',
 '3.75 and Above',
 '3.50 - 3.74',
 '3.25 - 3.49',
 '3.00 - 3.24',
 '2.50 - 2.99',
 '2.00 - 2.49',
 'SAT Math',
 'Score of 700 - 800',
 'Score of 600 - 700',
 'Score of 500 - 600',
 'Score of 400 - 500',
 'Score of 300 - 400',
 'Score of 200 - 300',
 'SAT Critical Reading',
 'Score of 700 - 800*',
 'Score of 600 - 700*',
 'Score of 500 - 600*',
 'Score of 400 - 500*',
 'Score of 300 - 400*',
 'Score of 200 - 300*',
 'SAT Writing',
 'Score of 700 - 800**',
 'Score of 600 - 700**',
 'Score of 500 - 600**',
 'Score of 400 - 500**',
 'Score of 300 - 400**',
 'Score of 200 - 300**',
 'ACT Composite',
 'Score of 30 - 36',
 'Score of 24 - 29',
 'Score of 18 - 23',
 'Score of 12 - 17',
 'Score of 6 - 11',
 'Score of 5 or Below',
 'High School Class Rank']

Dropping related columns

Referring back to Penn State’s CollegeData.com page, each column matching the pattern “Score of ### – ###” shows the percentage of applicants who scored in that range on one of the standardized tests. While this is potentially interesting for a future analysis, we already get the mean and middle 50% for each test in the SAT Math, SAT Critical Reading, SAT Writing, and ACT Composite columns. Similarly, each “#.## – #.##” column contains the percentage of students whose GPA fell in that range, but we already get the mean GPA of applicants in the Average GPA column.

Since these score range columns likely correlate strongly with the mean score columns, we can drop them without losing the ability to use SAT, ACT, and GPA scores as potential predictors of the yield rate disparity.

Conveniently, every column we want to drop, and none that we want to keep, has a label containing a digit, so we can select them all at once:

drop_cols = college.columns[college.columns.str.contains(r'[\d]')]
college = college.drop(columns = drop_cols)

Removing commas

Large numbers in this dataset contain commas to separate thousands. We will remove all commas from the data in every column to make it easier to later extract numbers.

for col in college.columns:
    college[col] = college[col].str.replace(',','')
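
Since every column is still a plain string (object) column at this point, the same thing could be written as a one-liner with .apply(); the explicit loop above just reads more clearly:

college = college.apply(lambda col: col.str.replace(',', ''))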

The ‘Students Enrolled’, ‘Women*’, and ‘Men*’ columns

Our surviving columns are:

original_columns = college.columns  # Save the names of the original columns for later so we can drop them.
list(original_columns)
['schoolname',
 'location',
 'Overall Admission Rate',
 'Women',
 'Men',
 'Students Enrolled',
 'Women*',
 'Men*',
 'Average GPA',
 'SAT Math',
 'SAT Critical Reading',
 'SAT Writing',
 'ACT Composite',
 'High School Class Rank']

Though we now only have these 14 columns for each school, there are more than 14 pieces of numeric data saved inside these columns.

For Penn State, for example, the 'Students Enrolled', 'Women*', and 'Men*' columns contain:

college.loc[59, 'Students Enrolled']
'9800 (33%) of 29878 admitted students enrolled'
college.loc[59, 'Women*']
'4710 (29%) of 16047 admitted students enrolled'
college.loc[59, 'Men*']
'5090 (37%) of 13831 admitted students enrolled'

We need to split these columns, and nearly all of the other columns as well, into separate numeric columns.

We can use pandas’ .str.extract() method, which takes a regular expression and outputs a new column of data for each capture group in the expression. We can then chain the .astype() method to convert the output columns to our desired data type, and divide any percentage columns by 100 to turn them into decimals.
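
As a minimal illustration of the mechanics, here is .str.extract() applied to a toy Series (not our dataframe) holding one string in the same format; each capture group becomes one output column:

s = pd.Series(['9800 (33%) of 29878 admitted students enrolled'])
s.str.extract(r'(\d*)\D*(\d*)%\D*(\d*)').astype('float')
        0     1        2
0  9800.0  33.0  29878.0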

One at a time, we will search each of the 'Students Enrolled', 'Women*', and 'Men*' columns with a regular expression that captures three groups of digits, turning each old column into three new floating point columns: ‘Enrolled’, ‘Yield Rate’, and ‘Offers’.

regexp = r'(\d*)\D*(\d*)%\D*(\d*)'

college[['Enrolled (all)','Yield Rate (all)','Offers (all)']] = college['Students Enrolled'].str.extract(regexp).astype('float')
college[['Enrolled (women)','Yield Rate (women)','Offers (women)']] = college['Women*'].str.extract(regexp).astype('float')
college[['Enrolled (men)','Yield Rate (men)','Offers (men)']] = college['Men*'].str.extract(regexp).astype('float')

college[['Yield Rate (all)','Yield Rate (women)','Yield Rate (men)']] /= 100   # convert rates from percentages to decimals
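
Spot-checking Penn State’s row, the three new columns reproduce the numbers from the original string, and 9800 / 29878 is indeed roughly 0.33:

college.loc[59, ['Enrolled (all)','Yield Rate (all)','Offers (all)']]
Enrolled (all)       9800.00
Yield Rate (all)         0.33
Offers (all)        29878.00
Name: 59, dtype: float64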

The ‘location’ column

Let’s see what the raw location data looks like:

college.loc[59,'location']
'University Park\xa0PA\xa0\xa016802'

We will split the ‘location’ string column into ‘City’ (string), ‘State’ (category), and ‘Zip’ (integer) columns.

regexp = r'^(.*)\s([A-Z]{2})\s*(\d{5})'

college[['City','State','Zip']] = college['location'].str.extract(regexp)
college['State'] = college['State'].astype('category')
college['Zip'] = college['Zip'].astype('int')
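
One caveat worth noting: casting ZIP codes to integers silently drops leading zeros (as in many New England ZIPs), so the integer form is best treated as a numeric convenience rather than an identifier:

int('02138')   # a Boston-area ZIP loses its leading zero as an integer
2138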

The ‘Overall Admission Rate’, ‘Women’, and ‘Men’ columns

Let’s see what these columns look like:

college.loc[59,'Overall Admission Rate']
'56% of 52974 applicants were admitted'
college.loc[59,'Men']
'53% of 25925 applicants were admitted'
college.loc[59,'Women']
'59% of 27049 applicants were admitted'

We will split each of these columns into an ‘Offer Rate’ and ‘Apps’ column of floating point numbers.

regexp = r'(\d*)%\D*(\d*)'

college[['Offer Rate (all)','Apps (all)']] = college['Overall Admission Rate'].str.extract(regexp).astype('float')
college[['Offer Rate (women)','Apps (women)']] = college['Women'].str.extract(regexp).astype('float')
college[['Offer Rate (men)','Apps (men)']] = college['Men'].str.extract(regexp).astype('float')

college[['Offer Rate (all)','Offer Rate (women)','Offer Rate (men)']] /= 100    # convert rates from percentages to decimals
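
As a sanity check, the women’s and men’s applicant counts should sum to the overall count, and for Penn State they do (27049 + 25925 = 52974):

college.loc[59,'Apps (women)'] + college.loc[59,'Apps (men)'] == college.loc[59,'Apps (all)']
True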

The ‘Average GPA’ column

Taking a look:

college.loc[59,'Average GPA']
'3.59'

The ‘Average GPA’ column doesn’t need to be split, but its value does need to be extracted from any surrounding text that might accompany it at other schools, and then converted to a float. Note that the regex below assumes the average is reported with exactly two decimal places, which appears to be CollegeData.com’s convention.

regexp = r'(\d\.\d\d)'

college['GPA Mean'] = college['Average GPA'].str.extract(regexp).astype('float')

The ‘SAT Math’, ‘SAT Critical Reading’, ‘SAT Writing’, and ‘ACT Composite’ columns

Let’s look at these four columns:

college.loc[59,'SAT Math']
'612 average 560-670 range of middle 50%'
college.loc[59,'SAT Critical Reading']
'578 average 530-630 range of middle 50%'
college.loc[59,'SAT Writing']
nan
college.loc[59,'ACT Composite']
'27 average 25-29 range of middle 50%'

CollegeData.com doesn’t report a value for Penn State’s 'SAT Writing' because Penn State does not consider that score in its admissions process. Looking at other schools, the 'SAT Writing' column does follow the same format as the other SAT columns.

We split each of the three SAT columns (Math, Critical Reading, and Writing) and the single ACT column into three float columns apiece: one for the mean, one for the 25th percentile, and one for the 75th percentile.

regexp = r'(\d*) average (\d*)-(\d*)'

college[['SAT Math Mean','SAT Math 25th','SAT Math 75th']] = college['SAT Math'].str.extract(regexp).astype('float')
college[['SAT Reading Mean','SAT Reading 25th','SAT Reading 75th']] = college['SAT Critical Reading'].str.extract(regexp).astype('float')
college[['SAT Writing Mean','SAT Writing 25th','SAT Writing 75th']] = college['SAT Writing'].str.extract(regexp).astype('float')
college[['ACT Mean','ACT 25th','ACT 75th']] = college['ACT Composite'].str.extract(regexp).astype('float')

The ‘High School Class Rank’ column

Finally, let’s look at the last column:

college.loc[59, 'High School Class Rank']
'Top tenth:   36% Top quarter:   77% Top half:   97% Bottom half:    3%'

We split this column into four float columns for the Top 10th, Top 25th, Top 50th, and Bottom 50th.

regexp = r'(\d*)%\D*(\d*)%\D*(\d*)%\D*(\d*)%'

college[['HS Top 10th','HS Top 25th','HS Top 50th','HS Bottom 50th']] = college['High School Class Rank'].str.extract(regexp).astype('float')
college[['HS Top 10th','HS Top 25th','HS Top 50th','HS Bottom 50th']] /= 100   # convert percentages to decimals
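
Since ‘Top half’ and ‘Bottom half’ partition a class, those two columns should sum to 1 wherever both are reported. Penn State checks out:

college.loc[59,'HS Top 50th'] + college.loc[59,'HS Bottom 50th']
1.0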

Changing the dataframe index

To make sure the name of each school is unique, we will append the school’s state to any duplicated school name.

for schoolId in college[college['schoolname'].duplicated(keep=False)].index:
    college.loc[schoolId,'schoolname'] = college.loc[schoolId,'schoolname']+' ('+college.loc[schoolId,'State']+')'
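
Assuming no two identically named schools share a state, every name should now be unique:

college['schoolname'].is_unique
True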

Instead of indexing our dataframe by CollegeData.com’s arbitrary schoolId, we will use the name of the school as the index.

college = college.set_index('schoolname', drop=False).sort_index()

Dropping old columns

And finally, we will delete all the old columns from the dataset.

college = college.drop(columns=original_columns)

Inspecting our cleaned dataframe

Let’s look at the .info() of our dataframe in its freshly cleaned state:

college.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1813 entries, Abilene Christian University to Youngstown State University
Data columns (total 35 columns):
Enrolled (all)        1303 non-null float64
Yield Rate (all)      1303 non-null float64
Offers (all)          1303 non-null float64
Enrolled (women)      1309 non-null float64
Yield Rate (women)    1309 non-null float64
Offers (women)        1309 non-null float64
Enrolled (men)        1265 non-null float64
Yield Rate (men)      1265 non-null float64
Offers (men)          1265 non-null float64
City                  1813 non-null object
State                 1813 non-null category
Zip                   1813 non-null int64
Offer Rate (all)      1548 non-null float64
Apps (all)            1548 non-null float64
Offer Rate (women)    1437 non-null float64
Apps (women)          1437 non-null float64
Offer Rate (men)      1431 non-null float64
Apps (men)            1431 non-null float64
GPA Mean              1343 non-null float64
SAT Math Mean         951 non-null float64
SAT Math 25th         951 non-null float64
SAT Math 75th         951 non-null float64
SAT Reading Mean      933 non-null float64
SAT Reading 25th      933 non-null float64
SAT Reading 75th      933 non-null float64
SAT Writing Mean      556 non-null float64
SAT Writing 25th      556 non-null float64
SAT Writing 75th      556 non-null float64
ACT Mean              1030 non-null float64
ACT 25th              1030 non-null float64
ACT 75th              1030 non-null float64
HS Top 10th           1035 non-null float64
HS Top 25th           1035 non-null float64
HS Top 50th           1035 non-null float64
HS Bottom 50th        1035 non-null float64
dtypes: category(1), float64(32), int64(1), object(1)
memory usage: 500.4+ KB

Our original raw scraped dataframe was 1813 schools long and 53 dirty columns wide. In our cleaning, we dropped quite a few sparse columns, but we also split many of the originals apart, leaving us with 35 clean columns.

We can check that Penn State’s data was properly cleaned. Since we’ve re-indexed the dataframe by school name, we can select its row directly by name now.

college.loc['Penn State University Park']
Enrolled (all)                   9800
Yield Rate (all)                 0.33
Offers (all)                    29878
Enrolled (women)                 4710
Yield Rate (women)               0.29
Offers (women)                  16047
Enrolled (men)                   5090
Yield Rate (men)                 0.37
Offers (men)                    13831
City                  University Park
State                              PA
Zip                             16802
Offer Rate (all)                 0.56
Apps (all)                      52974
Offer Rate (women)               0.59
Apps (women)                    27049
Offer Rate (men)                 0.53
Apps (men)                      25925
GPA Mean                         3.59
SAT Math Mean                     612
SAT Math 25th                     560
SAT Math 75th                     670
SAT Reading Mean                  578
SAT Reading 25th                  530
SAT Reading 75th                  630
SAT Writing Mean                  NaN
SAT Writing 25th                  NaN
SAT Writing 75th                  NaN
ACT Mean                           27
ACT 25th                           25
ACT 75th                           29
HS Top 10th                      0.36
HS Top 25th                      0.77
HS Top 50th                      0.97
HS Bottom 50th                   0.03
Name: Penn State University Park, dtype: object

Everything looks good here. The NaN values for SAT Writing are appropriate, since Penn State did not consider the SAT Writing in admissions; CollegeData.com showed “Not reported” for those fields.


Exporting the dataframe

We will save the cleaned dataframe to a new .csv file.

college.to_csv('collegedata_cleaned_backup.csv')
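
If we need the cleaned data in a later session, we can reload it the same way we loaded the raw backup, with the school name as the index. (One caveat: the ‘State’ category dtype won’t survive the CSV round trip and would need to be re-cast after loading.)

college = pd.read_csv('collegedata_cleaned_backup.csv', index_col=0)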

In Part 3, we will import a separate dataset of US News and World Report’s college rankings and combine it with our cleaned school dataset.
