Data description
The Challenge uses data
on roughly 26,000 individuals from the State of Georgia released from Georgia
prisons on discretionary parole to the custody of the Georgia Department of
Community Supervision (GDCS) for the purpose of post-incarceration supervision
between January 1, 2013 and December 31, 2015. This dataset is split into two
sets, training and test. We used a 70/30 split: 70% of the data
is in the training dataset and 30% is in the test dataset. The training dataset
includes four dichotomous (yes/no) dependent variables: whether an individual
recidivated at any point in the three-year follow-up period, and whether the
individual recidivated in each time period (year 1, year 2, or year 3). Recidivism is measured as an arrest
for a new felony or misdemeanor crime within three years of the supervision
start date. The test dataset does not include the four dependent variables. The
initial test dataset will include all individuals selected in the 30% test
dataset. After the first Challenge period (forecasting the probability that
individuals recidivated in year 1) concludes, a second test dataset will be
released containing only those individuals who did not recidivate in year 1. The
same will be done after the second Challenge period. Note also that variables
describing supervision activities, such as drug testing and employment, will
not appear in the test dataset until the second Challenge period (i.e., the
year 2 dataset). We
believe this is more reflective of practice where activities must accrue and
correctional officers must become aware prior to a recidivism event. The
additional data released at the second Challenge period will not change at the
third Challenge period release (i.e., year 3 dataset); they are measures of
supervision activities during the entire time people were under supervision or
until the date of recidivism for those arrested. The only change with the
third Challenge period release is the removal of those individuals who
recidivated in year 2.
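The release sequence described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used to build the Challenge files: the function names, the random seed, and the toy IDs and outcomes are all hypothetical.

```python
import random

def split_population(ids, train_fraction=0.7, seed=0):
    """Randomly split ids into (train, test) lists (hypothetical helper)."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def next_test_set(test_ids, recidivated_ids):
    """Drop individuals who recidivated in the period that just concluded."""
    recidivated = set(recidivated_ids)
    return [i for i in test_ids if i not in recidivated]

# Toy population of 20 made-up IDs (the real population is roughly 26,000).
population = list(range(1, 21))
train, test_1 = split_population(population)

# Made-up outcomes: pretend the first two test individuals recidivated in year 1.
year1_recidivists = test_1[:2]
test_2 = next_test_set(test_1, year1_recidivists)

# Made-up outcomes: pretend one remaining individual recidivated in year 2.
year2_recidivists = test_2[:1]
test_3 = next_test_set(test_2, year2_recidivists)

print(len(train), len(test_1), len(test_2), len(test_3))
```

Each later test set is a subset of the one before it, so a forecast for year 2 or year 3 applies only to individuals who had not yet recidivated.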
Both the GDCS and the Georgia Bureau of Investigation provided data. The GDCS
data included demographics, prison and parole case information,
prior community supervision history, conditions of supervision as articulated
by the Board of Pardons and Paroles, and supervision activities (violations,
drug tests, program attendance, employment, residential moves, and accumulation
of delinquency reports for violating conditions of parole). The Georgia Bureau
of Investigation provided data from the Georgia Crime Information Center (GCIC)
statewide criminal history records repository. The GCIC data provides the
measures of prior criminal history in Georgia, including arrest and conviction
episodes prior to prison entry. GCIC “rap sheet” data captures all charges at an arrest
episode, defined as a custodial arrest where a person is fingerprinted by law
enforcement. Arrest episodes with multiple charges are described in this data
by the most serious charge. The exceptions are the criminal history measures for
domestic violence and gun charges, which count all charges across all episodes. GCIC data also
provides the recidivism measure, defined as a new felony or misdemeanor arrest
episode within three years of parole supervision start date.
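As a rough illustration of the counting rule described above, the sketch below summarizes each arrest episode by its most serious charge while counting domestic violence and gun charges across every charge in every episode. The severity ranking, charge categories, field names, and sample episodes are hypothetical; the source does not specify how seriousness is ranked.

```python
from collections import Counter

# Hypothetical severity ranking (higher = more serious); not from the source.
SEVERITY = {"violent": 3, "property": 2, "drug": 1, "other": 0}

def summarize_history(episodes):
    """Summarize a list of arrest episodes.

    Each episode is a list of charge dicts with a 'category' key and
    optional boolean 'domestic_violence' and 'gun' keys (assumed layout).
    """
    episode_counts = Counter()
    dv_charges = 0
    gun_charges = 0
    for charges in episodes:
        if not charges:
            continue
        # One count per episode, attributed to its most serious charge.
        most_serious = max(charges, key=lambda c: SEVERITY[c["category"]])
        episode_counts[most_serious["category"]] += 1
        # The exception: these are counted per charge, across all episodes.
        dv_charges += sum(1 for c in charges if c.get("domestic_violence"))
        gun_charges += sum(1 for c in charges if c.get("gun"))
    return {
        "episodes_by_most_serious": dict(episode_counts),
        "domestic_violence_charges": dv_charges,
        "gun_charges": gun_charges,
    }

# Two made-up episodes.
episodes = [
    [
        {"category": "drug"},
        {"category": "violent", "domestic_violence": True},
        {"category": "other", "domestic_violence": True},
    ],
    [
        {"category": "property"},
        {"category": "other", "gun": True},
    ],
]
result = summarize_history(episodes)
print(result)
```

In this toy input, the first episode contributes one "violent" episode but two domestic violence charges, which is the distinction the text draws between episode-level and charge-level counts.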
Posting schedule:
- April 30, 2021, Initial release of training data
- April 30, 2021, Initial release of test data
- May 31, 2021, End of submission period 1
- June 1, 2021, Release of updated test data
- June 15, 2021, End of submission period 2
- June 16, 2021, Release of final test data
- June 30, 2021, End of submission period 3
Datasets
The codebook for the Challenge datasets is in Appendix 2
of the Challenge document (https://nij.ojp.gov/funding/recidivism-forecasting-challenge#g0jtto).
Training Dataset
The training dataset is a 70% random sample of the overall
population described above. This dataset provides the dependent variables
so you can develop and train algorithms to apply to the test datasets.
Please click "View Source Data" above to view and download data.
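Before training, it may help to check that the four dependent variables agree with one another. The sketch below assumes that an individual is flagged in at most one yearly variable and that the three-year flag is set exactly when one of the yearly flags is set. The column names and rows are placeholders, not the names in the codebook.

```python
# Placeholder column names; consult the codebook for the actual ones.
YEAR_FLAGS = ("recid_year1", "recid_year2", "recid_year3")
OVERALL_FLAG = "recid_within_3_years"

def outcomes_consistent(row):
    """Return True if the overall flag agrees with the yearly flags."""
    yearly = [bool(row[name]) for name in YEAR_FLAGS]
    at_most_one_year = sum(yearly) <= 1
    overall_matches = bool(row[OVERALL_FLAG]) == any(yearly)
    return at_most_one_year and overall_matches

# Made-up rows for illustration only.
rows = [
    {"id": 1, "recid_year1": 0, "recid_year2": 1, "recid_year3": 0, "recid_within_3_years": 1},
    {"id": 2, "recid_year1": 0, "recid_year2": 0, "recid_year3": 0, "recid_within_3_years": 0},
    {"id": 3, "recid_year1": 1, "recid_year2": 0, "recid_year3": 0, "recid_within_3_years": 0},
]
inconsistent_ids = [row["id"] for row in rows if not outcomes_consistent(row)]
print(inconsistent_ids)
```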
Test Dataset 1
The initial test dataset is the remaining 30% of the
population described above. This dataset does not include the dependent variables,
as those are what you are expected to forecast.
Test Dataset 2
Second Test Dataset including supervision activities.
Test Dataset 3
Third Test Dataset including supervision activities.
Full Dataset
This is the dataset of all individuals (training and test) with all variables released.