Bail Amount Algorithm

The Bail Amount Algorithm assists judges by objectively determining the amount bail should be set at based on relevant information about the defendant. It can be used in conjunction with a risk assessment algorithm or independently.

Enter defendant information:

Age Group
Crime Class
Detainer
Gender
Offense Class
Race

Last Updated: January 5, 2020

Bail Amount Algorithm Source Code:

The intent of this algorithm is to model the complex socio-economic factors a judge considers when setting a defendant’s bail amount for pre-trial detention, and to produce a rigorous, math-based determination of the dollar amount at which bail should be set. A machine learning algorithm matters in these situations because human judges are subject to human biases (whether intentional or latent), and those biases can have an unintended and disastrous impact on the defendant. In this case, an algorithm carefully designed to mitigate these biases will provide the optimal outcome.

The features of the model reflect demographic and charge information about the defendant, and the target variable to be predicted is a class that corresponds to the recommended dollar amount at which bail should be set.

Training data:

In [1]:
%config InlineBackend.figure_format = 'retina'
  
import re
import requests
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

from sklearn_transformers.classifier_helpers import FeatureSelector
from sklearn_transformers.classifier_helpers import MultiColumnLabelEncoder
from sklearn_transformers.classifier_helpers import BinaryClassifierWithNoise
import helpers.classifier_report as classifier_report

Download data:

Individual-level pre-trial detention data is extremely scarce, yet a model intended for individual use has to be trained on individual-level data. Fortunately, the Connecticut Department of Correction maintains an anonymized dataset, updated daily, that is sufficient for modeling according to the intent stated above: the "Accused Pre-Trial Inmates in Correctional Facilities" dataset.

In [2]:
bail_data_filename = 'data/bail_data.csv'
bail_data_url = 'https://data.ct.gov/api/views/b674-jy6w/rows.csv?accessType=DOWNLOAD'

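# Download the CSV export and fail loudly on an HTTP error
# instead of silently writing an error page to disk.
response = requests.get(bail_data_url)
response.raise_for_status()
with open(bail_data_filename, 'wb') as out_file:
    out_file.write(response.content)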

_df = pd.read_csv(bail_data_filename)
print(_df.shape)
_df.sample(5)
(4572123, 10)
Out[2]:
| | DOWNLOAD DATE | IDENTIFIER | LATEST ADMISSION DATE | RACE | GENDER | AGE | BOND AMOUNT | OFFENSE | FACILITY | DETAINER |
|---|---|---|---|---|---|---|---|---|---|---|
| 1191475 | 08/14/2017 | ZZSERLCR | 06/27/2017 | WHITE | M | 27 | 25000 | LARCENY, FOURTH DEGREE AM | NEW HAVEN CC | NONE |
| 883189 | 05/10/2017 | ZZSEJEBH | 02/08/2017 | WHITE | M | 23 | 130100 | CRIM VIOL OF PROTECTIVE ORDER DF | HARTFORD CC | NONE |
| 4381366 | 05/01/2020 | ZZSHRZCZ | 08/28/2019 | HISPANIC | M | 39 | 1500000 | MANSLAUGHTER 2ND WITH MV (INTOX) CF | WALKER RC | NONE |
| 638559 | 02/19/2017 | ZZHLLSLW | 01/18/2017 | WHITE | M | 55 | 50000 | FAILURE TO APPEAR, SECOND DEGREE AM | BRIDGEPORT CC | NONE |
| 1538470 | 12/08/2017 | ZZSZHJLZ | 09/11/2017 | WHITE | M | 57 | 75000 | VIOLATION OF PROBATION OR COND DISCHG | BRIDGEPORT CC | NONE |

Preprocess:

In [3]:
df = _df.copy()
df = df.rename(columns={column: column.strip().lower().replace(' ', '_') for column in df.columns})

In order to transform the raw data into a state prepared for modeling, a fair amount of preprocessing must occur. There are three predominant transformations: extract_classes(), determine_age_group(), and determine_bond_amount(), as well as the cleanup and standardization of some columns.

The first transformation, extract_classes(), extracts and decodes the last two characters of each value in the OFFENSE column. These characters represent the 'offense class' and the 'crime class' respectively. Extracting these two values yields a higher-level categorical variable than the specific offense, which gives the model a more generalizable feature than the specific 'offense' value. It also reduces dimensionality, making it easier for a model to find patterns in the features.

In [4]:
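# A raw string keeps '\s' from being treated as a string escape.
# Group 1: optional offense class letter; group 2: crime class letter.
class_pattern = re.compile(r'\s([A-Z]?)([A-Z])$')
df[['offense_class', 'crime_class']] = df['offense'].str.extract(class_pattern, expand=False)
# Offenses without a trailing class code produce NaN in both columns.
df = df.fillna({'offense_class': 'unknown', 'crime_class': 'unknown'})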
df.shape
Out[4]:
(4572123, 12)
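
To see how the pattern behaves on individual values, here is a minimal standalone sketch that applies the same regular expression to a few offense strings taken from the sample rows in this notebook. The helper name extract_classes_from() and its 'unknown' fallback are assumptions made for illustration; the notebook itself works on the whole column with pandas.

```python
import re

# Same pattern as the notebook cell, written as a raw string.
CLASS_PATTERN = re.compile(r'\s([A-Z]?)([A-Z])$')


def extract_classes_from(offense):
    """Return (offense_class, crime_class) codes, or 'unknown' where absent.

    Hypothetical helper for illustration only.
    """
    match = CLASS_PATTERN.search(offense)
    if match is None:
        return ('unknown', 'unknown')
    offense_class, crime_class = match.groups()
    # The optional first group matches an empty string for one-letter suffixes.
    return (offense_class or 'unknown', crime_class)


for offense in [
    'LARCENY, FOURTH DEGREE AM',
    'EVADING RESPONSIBILITY M',
    'VIOLATION OF PROBATION OR COND DISCHG',
]:
    print(offense, '->', extract_classes_from(offense))
```

A two-letter suffix such as AM yields offense class A and crime class M, a single trailing letter fills only the crime class, and a string with no trailing code yields 'unknown' for both.
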
In [5]:
field_replacements = {
    'race': {
        'amer ind': 'native american',
    },
    'gender': {
        'm': 'male',
        'f': 'female',
    },
    'detainer': {
        'state of ct': 'state',
        'consec sent': 'consecutive sentence',
        'governor wrnt': 'governor warrant',
        'other state': 'state',
        'none': 'unknown',
        '': 'unknown',
    },
    'crime_class': {
        'F': 'felony',
        'M': 'misdemeanor',
        'C': 'unknown',
        'I': 'unknown',
    },
    'offense_class': {
        'A': 'a',
        'B': 'b',
        'C': 'c',
        'D': 'd',
        'U': 'unknown',
        '': 'unknown',
    }
}

df['crime_class'] = df['crime_class'].replace(field_replacements['crime_class'])
df['offense_class'] = df['offense_class'].replace(field_replacements['offense_class'])
df['race'] = df['race'].str.lower().replace(field_replacements['race'])
df['gender'] = df['gender'].str.lower().replace(field_replacements['gender'])
df['detainer'] = df['detainer'].str.lower().replace(field_replacements['detainer'])
print(df.shape)
df.sample(5)
(4572123, 12)
Out[5]:
| | download_date | identifier | latest_admission_date | race | gender | age | bond_amount | offense | facility | detainer | offense_class | crime_class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2756172 | 12/04/2018 | ZZSZEWSL | 11/05/2018 | black | male | 26 | 40000 | INJURY OR RISK OF INJURY TO MINOR F | OSBORN CI | unknown | unknown | felony |
| 3027276 | 02/20/2019 | ZZSEZECE | 12/03/2018 | hispanic | male | 25 | 230000 | CRIMINAL POSSESSION OF A PISTOL DF | BRIDGEPORT CC | unknown | d | felony |
| 198863 | 09/14/2016 | ZZSELLRJ | 08/29/2016 | hispanic | male | 34 | 250000 | CRIM VIOL OF PROTECTIVE ORDER DF | HARTFORD CC | unknown | d | felony |
| 1604725 | 12/28/2017 | ZZRJLWRW | 02/21/2017 | black | male | 36 | 175000 | POSSESSION OF NARCOTICS | NEW HAVEN CC | unknown | unknown | unknown |
| 1819843 | 03/04/2018 | ZZRWLSHW | 11/01/2017 | hispanic | male | 38 | 100050 | EVADING RESPONSIBILITY M | HARTFORD CC | unknown | unknown | misdemeanor |

Similarly, the next transformation, determine_age_group(), converts the numeric value of the age column into a categorical variable, again to reduce dimensionality.

In [6]:
def determine_age_group(age):
    # Each integer age falls into exactly one bin; ages 17 and 64
    # no longer fall through to the final '65+' branch.
    if age < 18:
        return '<18'
    elif age <= 24:
        return '18–24'
    elif age <= 34:
        return '25–34'
    elif age <= 44:
        return '35–44'
    elif age <= 54:
        return '45–54'
    elif age <= 64:
        return '55–64'
    else:
        return '65+'

# Apply to the age column directly rather than row by row.
df['age_group'] = df['age'].apply(determine_age_group)
df.sample(5)
Out[6]:
| | download_date | identifier | latest_admission_date | race | gender | age | bond_amount | offense | facility | detainer | offense_class | crime_class | age_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4154165 | 02/12/2020 | ZZSRZCHJ | 05/23/2017 | hispanic | male | 33 | 250000 | SEXUAL ASSAULT, FIRST DEGREE F | CORRIGAN CI | unknown | unknown | felony | 25–34 |
| 3705315 | 09/19/2019 | ZZSHEJBW | 08/21/2019 | hispanic | male | 18 | 35050 | BURGLARY, SECOND DEGREE CF | MANSON YI | unknown | c | felony | 18–24 |
| 3746095 | 10/01/2019 | ZZRWCJRC | 01/16/2019 | black | male | 41 | 750000 | MURDER AF | WALKER RC | unknown | a | felony | 35–44 |
| 17326 | 07/24/2016 | ZZELBBJZ | 05/09/2016 | black | male | 53 | 500000 | ASSAULT, SECOND DEGREE DF | NORTHERN CI | unknown | d | felony | 45–54 |
| 117765 | 08/06/2016 | ZZHSBLBE | 05/23/2011 | black | male | 25 | 1527000 | HOME INVASION AF | BRIDGEPORT CC | special parole | a | felony | 25–34 |
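
The binning can also be expressed as a lookup over sorted upper bounds, which makes the bin edges easy to audit. The sketch below uses the standard library's bisect module; the function name age_group_for() and the '<18' label for the lowest bin are assumptions, chosen so that every integer age falls into exactly one labelled group.

```python
from bisect import bisect_left

# Inclusive upper bound of every bin except the open-ended last one.
AGE_UPPER_BOUNDS = [17, 24, 34, 44, 54, 64]
AGE_LABELS = ['<18', '18–24', '25–34', '35–44', '45–54', '55–64', '65+']


def age_group_for(age):
    """Return the label of the bin that contains `age` (hypothetical helper)."""
    # bisect_left finds the first upper bound that is >= age.
    return AGE_LABELS[bisect_left(AGE_UPPER_BOUNDS, age)]


for age in [16, 17, 18, 24, 25, 64, 65]:
    print(age, age_group_for(age))
```

Checking the boundary ages (17, 18, 24, 25, 64, 65) is a quick way to confirm that no age slips into the wrong group.
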

The last transformation is the most important because it enables this model to be a 'fair' and 'common sense' algorithm. Due to the history of structural racism inherent in the criminal justice system, the features present in the training data are considered to be highly biased against the victims of this oppression, and any fair model needs to account for this. Operating with the understanding that data is always biased, and that unbiased data is a logically impossible and naive ideal, the task here isn't to remove the bias but to shift its impact away from the people whose lives have been negatively impacted by bail and onto the bail system itself.

The setting of a bail amount is an occasion for structural racism to occur. This can easily be addressed by transforming the value of the bail amount. The transformation, determine_bond_amount(), does this by turning the bail amount from a continuous variable (any dollar amount) into a binary categorical variable (one of two dollar amounts: 0.00 or 1.00). Original bonds above 2,000,000 map to 1.00, and all others map to 0.00. The categories were chosen as representative of the original spirit of the Eighth Amendment, where the two classes inherently prevent the model from inadvertently producing an 'excessive' bail amount recommendation.

This value will be used as the dependent variable (label) for classification.

In [7]:
def determine_bond_amount(series):
    return np.where(series > 2000000, '1.00', '0.00')

df['bond_amount'] = determine_bond_amount(df['bond_amount'])
df.sample(5)
Out[7]:
| | download_date | identifier | latest_admission_date | race | gender | age | bond_amount | offense | facility | detainer | offense_class | crime_class | age_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 530829 | 01/17/2017 | ZZHRLEHE | 05/02/2016 | black | male | 31 | 0.00 | VIOLATION OF PROBATION OR COND DISCHG | HARTFORD CC | unknown | unknown | unknown | 25–34 |
| 3845531 | 11/01/2019 | ZZRJLSJZ | 10/10/2019 | black | male | 36 | 0.00 | VIOLATION OF PROBATION OR COND DISCHG | GARNER | unknown | unknown | unknown | 35–44 |
| 156393 | 08/17/2016 | ZZSELEHB | 07/29/2016 | hispanic | female | 33 | 0.00 | PROSTITUTION AM | YORK CI | unknown | a | misdemeanor | 25–34 |
| 2930492 | 01/23/2019 | ZZRBZWSZ | 08/23/2018 | hispanic | male | 54 | 0.00 | SEXUAL ASSAULT, FOURTH DEGREE AM | CORRIGAN CI | unknown | a | misdemeanor | 45–54 |
| 437078 | 12/20/2016 | ZZHBHHRB | 11/17/2015 | black | male | 21 | 0.00 | ASSAULT, SECOND DEGREE DF | HARTFORD CC | unknown | d | felony | 18–24 |
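
The effect of the threshold can be checked on the bond amounts from the five sample rows shown after the download step. This is a plain-Python sketch of the same rule; the helper name bond_class() is hypothetical, and the 2,000,000 cut-off is taken from determine_bond_amount().

```python
from collections import Counter

BOND_THRESHOLD = 2_000_000  # same cut-off as determine_bond_amount()


def bond_class(amount):
    """Map a dollar amount to one of the two target classes (hypothetical helper)."""
    return '1.00' if amount > BOND_THRESHOLD else '0.00'


# Bond amounts from the five sample rows shown after the download step.
sample_bonds = [25000, 130100, 1500000, 50000, 75000]
class_counts = Counter(bond_class(amount) for amount in sample_bonds)
print(class_counts)

# The comparison is strict, so a bond of exactly 2,000,000 stays in '0.00'.
print(bond_class(2_000_000), bond_class(2_000_001))
```

All five sampled bonds land in the '0.00' class, which hints that the two classes may be very unevenly represented in the full dataset.
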

Finally, the cleanup and standardization mapping expands the values of abbreviated input data (f → female, consec sent → consecutive sentence) and corrects the exonym AMER IND ('american indian'), opting for native american instead.

One thing to note is that the ethnic category HISPANIC remains in the race category because there is not enough information in the data to disambiguate ethnicity from race. It is also worth mentioning the decision to retain racial/ethnic information as features in this model at all. There are countless opportunities for mis-categorization of race, from the perspective of the individual inputting the data (we have no reliable way of knowing whether every instance within the data was reported by the individual the data represents) to the unnecessary reduction of diverse and complicated family heritage into brittle and outdated categories. However, the intent of the model is to diminish the impact of these racial features, and the only way to demonstrate that these features do not impact the recommendations is to include them as features in the model.
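
One way to make that demonstration concrete is a counterfactual check: hold every other feature fixed, vary only the race value, and confirm that the recommendation does not change. The sketch below is an assumption about how such a check could look, not part of the original notebook; recommendation_is_invariant() and both toy predictors are hypothetical stand-ins for the trained model.

```python
def recommendation_is_invariant(predict, defendant, feature, values):
    """True if predict() returns the same class for every value of `feature`."""
    outcomes = {predict({**defendant, feature: value}) for value in values}
    return len(outcomes) == 1


# Hypothetical predictors standing in for the trained model on a single row.
def predict_ignoring_race(row):
    is_class_a_felony = row['crime_class'] == 'felony' and row['offense_class'] == 'a'
    return '1.00' if is_class_a_felony else '0.00'


def predict_using_race(row):
    return '1.00' if row['race'] == 'black' else '0.00'


defendant = {
    'crime_class': 'misdemeanor', 'offense_class': 'a', 'age_group': '25–34',
    'race': 'white', 'gender': 'male', 'detainer': 'unknown',
}
races = ['white', 'black', 'hispanic', 'native american']

print(recommendation_is_invariant(predict_ignoring_race, defendant, 'race', races))
print(recommendation_is_invariant(predict_using_race, defendant, 'race', races))
```

The first predictor passes the check and the second fails it. With the real pipeline, predict could wrap model.predict on a one-row DataFrame, and the check could be repeated for every row of the test set.
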

Model:

In [8]:
feature_columns = ['crime_class', 'offense_class', 'age_group', 'race', 'gender', 'detainer']
target_column = 'bond_amount'

x = df.copy()[feature_columns]
y = df.copy()[target_column]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1337)

print(len(x_train), len(x_test))
3657698 914425
In [9]:
model = Pipeline([
    ('feature_column_encoder', MultiColumnLabelEncoder(columns=feature_columns)),
    ('classifier', BinaryClassifierWithNoise(
        RandomForestClassifier(
            n_estimators=10,
            n_jobs=1,
            random_state=1337,
        )
    )),
])
model.fit(x_train, y_train)
Out[9]:
Pipeline(steps=[('feature_column_encoder', <sklearn_transformers.classifier_helpers.MultiColumnLabelEncoder object at 0x7f95c8022208>), ('classifier', BinaryClassifierWithNoise(classifier=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='aut...mators=10, n_jobs=1, oob_score=False, random_state=1337,
            verbose=0, warm_start=False)))])
In [10]:
classifier_report.performance(model, x_test, y_test, encoder_step_label='feature_column_encoder', cross_validate=False)
classifier_report.roc_curve(model, x_test, y_test)
display(model.predict_proba(x_test.sample(5)))
Accuracy:  0.987956
Recall:    0.999836
F-beta:    0.993939
Precision: 0.988111

----------------------------------------
Feature Importances:
crime_class=misdemeanor       0.333207
crime_class=felony            0.256596
crime_class=unknown           0.191440
offense_class=c               0.112013
offense_class=a               0.086826
offense_class=b               0.019918
[{'0.00': 0.99943522035904442, '1.00': 0.00056477964095558164},
 {'0.00': 0.88664721592895601, '1.00': 0.11335278407104399},
 {'0.00': 0.9995901015570815, '1.00': 0.00040989844291849931},
 {'0.00': 0.99988266355152688, '1.00': 0.00011733644847311542},
 {'0.00': 0.99919976145037281, '1.00': 0.00080023854962718577}]
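
The reported figures can be cross-checked against each other. F-beta combines precision P and recall R as (1 + β²)·P·R / (β²·P + R). The sketch below recomputes it from the precision and recall printed above; β = 1 is an assumption, since the report does not state which β it uses, and f_beta() is a hypothetical helper.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Values copied from the performance report.
reported_precision = 0.988111
reported_recall = 0.999836

score = f_beta(reported_precision, reported_recall, beta=1.0)
print(round(score, 6))
```

With β = 1 the result rounds to 0.993939, which matches the reported F-beta value, so the report appears to be using the F1 score.
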
In [11]:
y_pred = model.predict(x_test)
class_names = model.named_steps['classifier'].classifier.classes_
classifier_report.plot_confusion_matrix(y_test, y_pred, class_names=class_names)
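
The plot itself is not reproduced here, so the sketch below shows how the underlying counts can be tallied and compared with a majority-class baseline, a useful reference point when one class dominates. The label lists are invented for illustration and are not drawn from the dataset; confusion_counts() and majority_baseline_accuracy() are hypothetical helpers.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(y_true, y_pred))


def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most common label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Invented labels for illustration only.
y_true = ['0.00'] * 8 + ['1.00'] * 2
y_pred = ['0.00'] * 8 + ['0.00', '1.00']

counts = confusion_counts(y_true, y_pred)
correct = sum(n for (actual, predicted), n in counts.items() if actual == predicted)
accuracy = correct / len(y_true)

print(dict(counts))
print(accuracy, majority_baseline_accuracy(y_true))
```

A classifier is only informative to the extent that its accuracy exceeds the baseline, so the two numbers are best read side by side.
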