Data description 

The Challenge uses data on roughly 26,000 individuals released from Georgia prisons on discretionary parole to the custody of the Georgia Department of Community Supervision (GDCS) for post-incarceration supervision between January 1, 2013 and December 31, 2015. The data are split 70/30: 70% of the individuals are in the training dataset and 30% are in the test dataset. The training dataset includes four dichotomous (yes/no) dependent variables: whether an individual recidivated at any point in the three-year follow-up period, and whether the individual recidivated in year 1, year 2, or year 3. Recidivism is measured as an arrest for a new felony or misdemeanor crime within three years of the supervision start date. The test dataset does not include the four dependent variables.
The initial test dataset will include all individuals selected for the 30% test sample. After the first Challenge period (forecasting the probability that individuals recidivated in year 1) concludes, a second test dataset will be released containing only those individuals who did not recidivate in year 1. The same will be done after the second Challenge period.
The test datasets will also contain variables that describe supervision activities, such as drug testing and employment, but these variables will not appear until the second Challenge period (i.e., the year 2 dataset). We believe this better reflects practice, where activities must accrue and correctional officers must become aware of them before a recidivism event. The additional data released at the second Challenge period will not change at the third Challenge period release (i.e., the year 3 dataset); they measure supervision activities during the entire time people were under supervision, or until the date of recidivism for those arrested. The only change in the third Challenge period release is the removal of those individuals who recidivated in year 2.
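The staged release of the test datasets can be pictured as a simple filter. The sketch below is an illustration only: the field names (`id`, `recid_year1`, `recid_year2`) and the toy records are hypothetical and are not taken from the codebook. In the Challenge itself the outcomes are withheld from participants, so a filter like this would be applied by the organizers.

```python
# Toy illustration of the staged test-set release.
# Field names and records are hypothetical, not from the codebook.

def drop_recidivists(records, outcome_field):
    """Return only the records whose outcome_field is False."""
    return [r for r in records if not r[outcome_field]]

population = [
    {"id": 1, "recid_year1": True, "recid_year2": False},
    {"id": 2, "recid_year1": False, "recid_year2": True},
    {"id": 3, "recid_year1": False, "recid_year2": False},
]

test_1 = population                               # period 1: everyone in the test sample
test_2 = drop_recidivists(test_1, "recid_year1")  # period 2: year 1 recidivists removed
test_3 = drop_recidivists(test_2, "recid_year2")  # period 3: year 2 recidivists removed

print([r["id"] for r in test_2])
print([r["id"] for r in test_3])
```

Each later test dataset is therefore a subset of the one before it.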
Both the GDCS and the Georgia Bureau of Investigation provided data. The GDCS data include demographics, prison and parole case information, prior community supervision history, conditions of supervision as articulated by the Board of Pardons and Paroles, and supervision activities (violations, drug tests, program attendance, employment, residential moves, and accumulation of delinquency reports for violating conditions of parole). The Georgia Bureau of Investigation provided data from the Georgia Crime Information Center (GCIC) statewide criminal history records repository. The GCIC data provide the Georgia prior criminal history measures, including arrest and conviction episodes prior to prison entry. GCIC “rap sheet” data capture all charges at an arrest episode, defined as a custodial arrest where a person is fingerprinted by law enforcement. Arrest episodes with multiple charges are described in these data by the most serious charge. The exception is the criminal history measures for domestic violence and gun charges, which count all charges across all episodes. GCIC data also provide the recidivism measure, defined as a new felony or misdemeanor arrest episode within three years of the parole supervision start date.
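The counting rule for prior arrest episodes can be sketched roughly as follows. This is a sketch under stated assumptions, not the procedure used to build the data: the charge categories, the `SEVERITY` ranking, and the `dv` and `gun` flags are hypothetical placeholders chosen for illustration.

```python
# Sketch of the counting rule. Charge categories and severity ranks are
# hypothetical placeholders, not values from the codebook.
SEVERITY = {"property": 1, "drug": 2, "violent": 3}  # higher = more serious (assumed)

def summarize_history(episodes):
    """Summarize prior arrest episodes.

    episodes: list of episodes; each episode is a list of charge dicts with a
    "type" key and optional boolean "dv" and "gun" flags.
    """
    by_most_serious = {}
    dv_charges = 0
    gun_charges = 0
    for charges in episodes:
        # One count per episode, labeled by its most serious charge.
        top = max(charges, key=lambda c: SEVERITY[c["type"]])
        by_most_serious[top["type"]] = by_most_serious.get(top["type"], 0) + 1
        # Exception: domestic violence and gun charges count every charge.
        dv_charges += sum(1 for c in charges if c.get("dv"))
        gun_charges += sum(1 for c in charges if c.get("gun"))
    return {
        "by_most_serious": by_most_serious,
        "dv_charges": dv_charges,
        "gun_charges": gun_charges,
    }

episodes = [
    [{"type": "drug"}, {"type": "violent", "dv": True}, {"type": "violent", "dv": True}],
    [{"type": "property", "gun": True}],
]
summary = summarize_history(episodes)
print(summary)
```

In this toy input the first episode contributes one "violent" episode but two domestic violence charges, which is the distinction the text draws.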

Posting schedule:

  • April 30, 2021, Initial release of training data
  • April 30, 2021, Initial release of test data
  • May 31, 2021, End of submission period 1
  • June 1, 2021, Release of updated test data
  • June 15, 2021, End of submission period 2
  • June 16, 2021, Release of final test data
  • June 30, 2021, End of submission period 3

Datasets 

The codebook for the Challenge datasets is in Appendix 2 of the Challenge document (https://nij.ojp.gov/funding/recidivism-forecasting-challenge#g0jtto).

Training Dataset

The training dataset is a 70% random sample of the overall population described above. This dataset includes the dependent variables so that you can develop and train algorithms to apply to the test datasets.
Please click "View Source Data" above to view and download data.
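For readers who want to see what a 70/30 random split looks like in code, here is a minimal sketch. The exact procedure and random seed used for the Challenge are not described here, so the `seed` value and the function name `split_70_30` are assumptions, and 26,000 stands in for the approximate population size.

```python
import random

def split_70_30(ids, seed=0):
    """Shuffle ids reproducibly and split them 70/30 (illustrative only)."""
    rng = random.Random(seed)        # assumed seed, for reproducibility
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = len(shuffled) * 7 // 10    # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]

train_ids, test_ids = split_70_30(range(26000))
print(len(train_ids), len(test_ids))
```

With 26,000 individuals this gives 18,200 training records and 7,800 test records, with no individual appearing in both.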

Test Dataset 1

The initial test dataset is the remaining 30% of the population described above. This dataset does not include the dependent variables, because those are what you are asked to forecast.
Please click "View Source Data" above to view and download data.
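As a point of reference for forecasting a probability for each person in the test dataset, the simplest possible baseline assigns everyone the recidivism rate observed in the training data. The sketch below assumes toy labels and made-up IDs; it says nothing about the required submission format, and any real model would be expected to improve on it.

```python
def base_rate(train_outcomes):
    """Share of training individuals who recidivated (True = recidivated)."""
    return sum(train_outcomes) / len(train_outcomes)

# Toy stand-ins: the labels and IDs below are hypothetical.
train_outcomes = [True, False, False, True, False]
test_ids = [101, 102, 103]

p = base_rate(train_outcomes)
forecasts = {person_id: p for person_id in test_ids}  # same probability for everyone
print(forecasts)
```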

Test Dataset 2

The second test dataset, which contains only those individuals who did not recidivate in year 1 and adds the supervision activity variables.
Please click "View Source Data" above to view and download data.

Test Dataset 3

The third test dataset, which includes the supervision activity variables and removes those individuals who recidivated in year 2.
Please click "View Source Data" above to view and download data.

Full Dataset

This is the dataset of all individuals (training and test) with all variables released.
Please click "View Source Data" above to view and download data.