The Higher Education Grant Algorithm helps officials calculate the amount of Federal grant funding to be awarded to student applicants. The amount is calculated based on relevant information about the school and the tuition amount.
Last Updated: January 5, 2020
The intent of this algorithm is to model the relationship between institutional factors and the federal grant awards made to prospective students, producing a fair and accurate recommendation for the grant award amount. Machine learning in situations like this can reduce implicit or latent biases that limit financial access to higher education. To prevent algorithmic discrimination, the only dimensions used for modeling grant awards are the location of the school, the school type, whether the school is in the prospective student's home state, and the tuition amount. All other factors have the potential to introduce biases into the model and are not required to produce a performant model.
Data representing grant awards at the individual student level, including the percent of tuition covered by the grant, was either non-existent or unavailable at the time this algorithm was designed. However, combining multiple public datasets allows us to approximate this student-level data.
The features of the model, or independent variables, reflect qualities about the educational institution, and the dependent variable is the percentage amount of tuition that was covered by a federal grant.
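As an illustration of that shape, here are a couple of made-up rows (the values are invented, not drawn from any dataset) showing the features and target the model will eventually consume:

```python
import pandas as pd

# Illustrative rows only -- the values here are invented, not real data.
example = pd.DataFrame([
    {'school_type': 'public',  'region': 'midwest', 'in_state': True,  'grant_coverage_percent': 1.12},
    {'school_type': 'private', 'region': 'south',   'in_state': False, 'grant_coverage_percent': 1.45},
])
print(example)
```

The remainder of the notebook assembles a table with exactly these columns from public datasets.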
%config InlineBackend.figure_format = 'retina'
import glob
import json
import os
import numpy as np
import pandas as pd
import requests
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from IPython.display import display
from sklearn_transformers.preprocessing import FeatureSelector
from sklearn_transformers.preprocessing import MultiColumnLabelEncoder
from sklearn_transformers.regression import RegressorWithNoise
from helpers import regression_report
To model the scenario outlined above, the data required is an anonymized individual-level federal grant award dataset, where each row represents one person's school, the school's tuition, and the percent of the tuition covered by the grant. Unfortunately, the government does not readily provide such a dataset; however, these features can be assembled from other existing datasets: the Pell Grant data for full-time undergraduate students, which provides the maximum amount awarded per student; the Federal Student Aid Title IV program reports, which list the total grant amount per educational institution; and the educational institution's tuition, provided by College Scorecard.
The first dataset (Sources of Grant Aid for Full-Time Undergraduates by Sector, 2011-12) provides the maximum Pell Grant amount awarded per year. The maximum award amount will later be used to determine the percent of tuition covered.
pell_grant_filename = 'data/pell_grant.xlsx'
pell_grant_url = 'http://research.collegeboard.org/sites/default/files/sa-2018-figure-21a.xlsx'
response = requests.get(pell_grant_url)
with open(pell_grant_filename, 'wb') as out_file:
    out_file.write(response.content)
def parse_pell_grant_data_start_year(x):
    # Two-digit academic years pivot at 77, the first Pell year (1977-78).
    s = int(x[:2])
    if s >= 77:
        return str(1900 + s)
    else:
        return str(2000 + s)
pell_grant_data = pd.read_excel(
    pell_grant_filename,
    header=1,
    usecols=[0, 1, 2],
)
pell_grant_data = pell_grant_data[:41]
pell_grant_data['start_year'] = pell_grant_data['Academic Year'].apply(parse_pell_grant_data_start_year)
pell_grant_data.columns = [col.strip().lower().replace(' ', '_') for col in pell_grant_data.columns]
print(pell_grant_data.shape)
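As a quick sanity check on the year parsing above: two-digit academic years pivot at 77, since the Pell program's first academic year was 1977-78.

```python
def parse_pell_grant_data_start_year(x):
    # Two-digit years 77-99 fall in the 1900s; 00-76 fall in the 2000s.
    s = int(x[:2])
    if s >= 77:
        return str(1900 + s)
    else:
        return str(2000 + s)

print(parse_pell_grant_data_start_year('77-78'))  # 1977
print(parse_pell_grant_data_start_year('11-12'))  # 2011
```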
The second dataset is the Federal Student Aid Title IV volume reports, which detail grant disbursements per institution per year (from https://studentaid.ed.gov/sa/about/data-center/student/title-iv). The Q4 report contains the complete data for the given year.
base_url = 'https://studentaid.gov/sites/default/files/'
student_loans_data_urls = []
for year in range(6, 18):
    this_year = f'{year}'.zfill(2)
    next_year = f'{year + 1}'.zfill(2)
    # Files for award years 2013-14 onward live under a different path.
    if year >= 13:
        url = f'{base_url}fsawg/datacenter/library/'
    else:
        url = base_url
    student_loans_data_urls.append(f'{url}Q4{this_year}{next_year}AY.xls')
for url in student_loans_data_urls:
    print(url)
    # Save into data/ so the filepaths used below resolve correctly.
    filename = os.path.join('data', os.path.basename(url))
    response = requests.get(url)
    with open(filename, 'wb') as out_file:
        out_file.write(response.content)
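The URL pattern can be spot-checked. The helper below (`title_iv_url` is introduced here purely for illustration; it is not part of the notebook) reproduces the same construction:

```python
base_url = 'https://studentaid.gov/sites/default/files/'

def title_iv_url(year):
    # Award-year files from 2013-14 onward sit under fsawg/datacenter/library/.
    this_year = str(year).zfill(2)
    next_year = str(year + 1).zfill(2)
    prefix = f'{base_url}fsawg/datacenter/library/' if year >= 13 else base_url
    return f'{prefix}Q4{this_year}{next_year}AY.xls'

print(title_iv_url(6))   # https://studentaid.gov/sites/default/files/Q40607AY.xls
print(title_iv_url(17))  # the 2017-18 file, under the library path
```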
The grants data is in semi-regularly formatted Excel spreadsheets. The code below extracts the relevant data from the sheets and formats it for use in a dataframe.
grant_fileinfo = [
    {'filepath': 'data/Q40607AY.xls', 'sheet_name': 'Q4 0607 YTD', 'header_col': 3},
    {'filepath': 'data/Q40708AY.xls', 'sheet_name': 'Q4 0708 YTD', 'header_col': 3},
    {'filepath': 'data/Q40809AY.xls', 'sheet_name': 'Q4 0809 YTD', 'header_col': 3},
    {'filepath': 'data/Q40910AY.xls', 'sheet_name': 'Q4 0910 YTD', 'header_col': 4},
    {'filepath': 'data/Q41011AY.xls', 'sheet_name': 'Q4 1011 YTD', 'header_col': 4},
    {'filepath': 'data/Q41112AY.xls', 'sheet_name': 'Q4 1112 YTD', 'header_col': 4},
    {'filepath': 'data/Q41213AY.xls', 'sheet_name': 'Q4 1213 YTD', 'header_col': 4},
    {'filepath': 'data/Q41314AY.xls', 'sheet_name': 'Award Year Summary', 'header_col': 4},
    {'filepath': 'data/Q41415AY.xls', 'sheet_name': 'Award Year Summary', 'header_col': 4},
    {'filepath': 'data/Q41516AY.xls', 'sheet_name': 'Award Year Summary', 'header_col': 4},
    {'filepath': 'data/Q41617AY.xls', 'sheet_name': 'Award Year Summary', 'header_col': 4},
    {'filepath': 'data/Q41718AY.xls', 'sheet_name': 'Award Year Summary', 'header_col': 4},
]
grant_dfs = []
for info in grant_fileinfo:
    filepath = info['filepath']
    sheet_name = info['sheet_name']
    header_col = info['header_col']
    print(filepath)
    temp_df = pd.read_excel(filepath, sheet_name=sheet_name, header=[header_col, header_col + 1])
    # Keep only the unnamed index columns and the Pell Grant columns.
    col_index = [col.startswith('Unnamed') or col == 'FEDERAL PELL GRANT PROGRAM'
                 for col in temp_df.columns.get_level_values(0)]
    temp_df = temp_df.iloc[:, col_index]
    temp_df.columns = temp_df.columns.droplevel()
    label = filepath.split('/')[-1][:-4]
    temp_df['spreadsheet'] = label
    temp_df['academic_year'] = f'20{label[2:4]}'
    temp_df.columns = map(str.lower, temp_df.columns)
    temp_df.index.name = 'ope id'
    # Normalize column names that vary across spreadsheet years.
    temp_df = temp_df.rename(columns={
        'sum of recipients': 'pell grant ytd recipients',
        'ytd recipients': 'pell grant ytd recipients',
        'sum of disbursements': 'pell grant ytd disbursements',
        'ytd disbursements': 'pell grant ytd disbursements',
    })
    temp_df = temp_df.reset_index()
    # OPE IDs are read as integers; restore the 8-character zero-padded form.
    temp_df['ope id'] = temp_df['ope id'].apply(lambda x: str(x).zfill(8))
    grant_dfs.append(temp_df)
grants_df = pd.concat(grant_dfs)
print(grants_df.shape)
grants_df.sample(5)
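One detail worth noting: Excel loads OPE IDs as integers, silently dropping leading zeros, so the `zfill(8)` call above restores the canonical 8-character form that College Scorecard's `ope8_id` field uses:

```python
# Excel reads the ID '00123456' as the integer 123456; zero-padding restores it.
ope_id = 123456
print(str(ope_id).zfill(8))  # 00123456

# Already-8-character strings pass through unchanged.
print('12345678'.zfill(8))  # 12345678
```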
Lastly, we need the tuition cost for each educational institution listed in the grants dataset. This data is provided by College Scorecard on data.gov; however, it requires an API key (available via the data.gov signup) to download programmatically. The API is also rate limited, so this step can take a while; the cell is designed to be run multiple times until the full data is collected.
url = 'https://api.data.gov/ed/collegescorecard/v1/schools'
API_KEY = '<INSERT YOUR DATA.GOV API KEY HERE>'
params = {
    'api_key': API_KEY,
    '_per_page': 100,
}
# Skip any schools whose tuition data has already been downloaded.
ids = set(grants_df['ope id'].values) - {fp.split('/')[-1][:8] for fp in glob.glob('data/*.json')}
for ope_id in ids:
    # Update rather than replace params, so the API key is preserved.
    params['ope8_id'] = ope_id
    local_filename = f'data/{ope_id}_tuition_data.json'
    response = requests.get(url, params=params)
    if response.status_code == 200:
        with open(local_filename, 'wb') as out_file:
            out_file.write(response.content)
    else:
        print(f'error: {response.url}, {response.status_code}, {json.dumps(response.text)}')
    if int(response.headers['X-RateLimit-Remaining']) == 0:
        print('stopping, over rate limit')
        break
print('X-RateLimit-Remaining:', int(response.headers['X-RateLimit-Remaining']))
filepaths = glob.glob('data/*_tuition_data.json')
school_data = []
for fp in filepaths:
    with open(fp, 'r') as infile:
        d = json.load(infile)
    if len(d['results']) == 0:
        continue
    # Each result maps year keys to nested cost data; skip non-year keys
    # and years with missing tuition fields.
    for k, v in d['results'][0].items():
        try:
            row = {
                'ope8_id': fp.split('/')[-1][:8],
                'year': int(k),
                'out_of_state_tuition': v['cost']['tuition']['out_of_state'],
                'in_state_tuition': v['cost']['tuition']['in_state'],
            }
            school_data.append(row)
        except (KeyError, TypeError, ValueError):
            pass
tuition_df = pd.DataFrame(school_data, columns=['ope8_id', 'year', 'out_of_state_tuition', 'in_state_tuition'])
tuition_df = tuition_df.sort_values(by=['ope8_id', 'year']).reset_index(drop=True)
print(tuition_df.shape)
print(len(tuition_df.ope8_id.unique()))
tuition_df.sample(5)
This step combines the three datasets into a single dataframe.
grants_df = grants_df.merge(
    pell_grant_data[['start_year', 'maximum_pell_grant']],
    left_on=['academic_year'],
    right_on=['start_year'],
)
grants_df['academic_year'] = grants_df['academic_year'].astype('int64')
_df = grants_df.merge(
    tuition_df,
    how='left',
    left_on=['ope id', 'academic_year'],
    right_on=['ope8_id', 'year'],
)
print(_df.shape)
_df.sample(5)
In order to transform the raw data into a state where it can be properly modeled, some preprocessing must occur. There are three predominant transformations: calculate_average_amount(), convert_region(), and determining the grant_coverage_percent based on in-state or out-of-state tuition, along with the cleanup and standardization of some columns.
_df = _df.rename(columns={'school type': 'school_type'})
_df = _df[[
    'school_type',
    'pell grant ytd disbursements',
    'pell grant ytd recipients',
    'state',
    'maximum_pell_grant',
    'out_of_state_tuition',
    'in_state_tuition',
]]
# Some spreadsheets use '-' to mark missing recipient counts.
_df = _df[~_df['pell grant ytd recipients'].isin(['-'])]
def convert_to_numeric(x):
    try:
        return pd.to_numeric(x)
    except (ValueError, TypeError):
        return np.nan
_df['pell grant ytd disbursements'] = _df['pell grant ytd disbursements'].apply(convert_to_numeric)
_df['pell grant ytd recipients'] = _df['pell grant ytd recipients'].apply(convert_to_numeric)
_df['school_type'] = _df['school_type'].str.lower()
The first transformation, calculate_average_amount(), calculates the average grant disbursement per recipient: it divides the total grant amount per institution per year by the total number of grant recipients per institution per year.
def calculate_average_amount(row):
    if row['pell grant ytd recipients'] > 0:
        return row['pell grant ytd disbursements'] / row['pell grant ytd recipients']
    else:
        return 0.0
_df['avg_amount'] = _df.apply(calculate_average_amount, axis=1)
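Applied to a single made-up row (the figures here are invented for illustration), the calculation works out as follows:

```python
def calculate_average_amount(row):
    # Average disbursement per recipient; 0.0 when there are no recipients.
    if row['pell grant ytd recipients'] > 0:
        return row['pell grant ytd disbursements'] / row['pell grant ytd recipients']
    else:
        return 0.0

row = {'pell grant ytd disbursements': 1_000_000.0, 'pell grant ytd recipients': 250}
print(calculate_average_amount(row))  # 4000.0
```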
The second transformation, convert_region(), determines the educational institution's geographical region from the state it's located in. This reduces feature complexity and allows broader geographic patterns to be identified within the model.
state_regions = {
    'northeast': ['CT', 'ME', 'MA', 'NH', 'RI', 'VT', 'NY', 'NJ', 'PA'],
    'midwest': ['IL', 'IN', 'MI', 'OH', 'WI', 'IA', 'KS', 'MN', 'MO', 'NE', 'ND', 'SD'],
    'south': ['DE', 'FL', 'GA', 'MD', 'NC', 'SC', 'VA', 'DC', 'WV', 'AL', 'KY', 'MS', 'TN', 'AR', 'LA', 'OK', 'TX', 'PR', 'AS', 'VI'],
    'west': ['AZ', 'CO', 'ID', 'MT', 'NV', 'NM', 'UT', 'WY', 'AK', 'CA', 'HI', 'OR', 'WA', 'GU', 'FM', 'MH', 'MP', 'PW'],
}
def convert_region(state_abbr):
    for region, states in state_regions.items():
        if state_abbr in states:
            return region
    return None  # unmapped states become NaN and are dropped below
_df['region'] = _df['state'].apply(convert_region)
_df = _df.dropna(subset=['region'])
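For example, using a trimmed-down subset of the state_regions mapping (two regions only, for brevity):

```python
# Subset of the full state_regions mapping above, for illustration only.
state_regions = {
    'northeast': ['CT', 'ME', 'MA', 'NH', 'RI', 'VT', 'NY', 'NJ', 'PA'],
    'west': ['AZ', 'CO', 'ID', 'MT', 'NV', 'NM', 'UT', 'WY', 'AK', 'CA', 'HI', 'OR', 'WA'],
}

def convert_region(state_abbr):
    for region, states in state_regions.items():
        if state_abbr in states:
            return region
    return None  # unmapped states become NaN and are dropped

print(convert_region('CA'))  # west
print(convert_region('TX'))  # None (not in this trimmed subset)
```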
The following preprocessing step generates separate out-of-state and in-state records, keeping only institutions whose tuition could be fully covered by the maximum Pell Grant, and computes the grant_coverage_percent for each.
df_1 = _df[_df['maximum_pell_grant'] >= _df['out_of_state_tuition']].copy()
df_1['in_state'] = False
df_1['grant_coverage_percent'] = df_1['avg_amount'] / df_1['out_of_state_tuition']
df_2 = _df[_df['maximum_pell_grant'] >= _df['in_state_tuition']].copy()
df_2['in_state'] = True
df_2['grant_coverage_percent'] = df_2['avg_amount'] / df_2['in_state_tuition']
df = pd.concat([
    df_1[['school_type', 'region', 'in_state', 'grant_coverage_percent']],
    df_2[['school_type', 'region', 'in_state', 'grant_coverage_percent']],
])
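To make the target concrete with invented figures: an in-state student at a school with $3,000 tuition and an average award of $3,300 gets a coverage ratio of 1.1.

```python
avg_amount = 3300.0
in_state_tuition = 3000.0

# grant_coverage_percent is a ratio: 1.0 means the average award exactly
# covers tuition; values above 1.0 mean the grant more than covers it.
grant_coverage_percent = avg_amount / in_state_tuition
print(grant_coverage_percent)  # 1.1
```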
Lastly, the final preprocessing step filters on grant_coverage_percent: only rows where the average grant fully covered tuition (a ratio of at least 1) are kept, and infinite or missing values are dropped.
df = df[df['grant_coverage_percent'] >= 1]
df = df.replace([np.inf, -np.inf], np.nan).dropna(subset=['grant_coverage_percent'])
print(df.shape)
df.sample(5)
feature_columns = ['school_type', 'region', 'in_state']
target_column = 'grant_coverage_percent'
x = df[feature_columns].copy()
y = df[target_column].copy()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1337)
print(len(x_train), len(x_test))
model = Pipeline([
    ('feature_column_encoder', MultiColumnLabelEncoder(columns=feature_columns)),
    ('estimator', RegressorWithNoise(
        RandomForestRegressor(
            n_estimators=1000,
            max_features='sqrt',
            max_depth=30,
            random_state=1337,
        )
    ))
])
model.fit(x_train, y_train)
regression_report.evaluate(model, x_test, y_test, encoder_step_label='feature_column_encoder')
display(model.predict(x_test.sample(5)))