According to the creators, the dataset was created “To allow large-scale, cross-jurisdictional analyses of criminal arrests” and to “enhance many types of research – for example, identification of high-frequency offenders, measurement of changes in policing strategies, and quantification of legislative efficacy – giving policy makers the best data upon which to base law enforcement decisions”.
The codebook lists Gabe Haarsma, Sasha Davenport, Pablo A. Ormachea & David M. Eagleman as authors.
The dataset was created to improve on the information available from the UCR SRS program. Specifically, according to the creators, the advantages of this novel dataset include: (1) individual identifiers allow for recidivism analysis, albeit only for repeated bookings within the same jurisdiction, (2) the presence of all the charges allows for deeper understanding of all crime, not just a subset, (3) more and different offender-specific variables than the UCR, (4) the data represent a comprehensive and growing picture of information available to judges and prosecutors, and (5) more and different disposition-specific variables, enabling assessment of small variations in punishment.
There may be an updated version of this dataset, which is not freely available, at http://scilaw.org/risk-assessment/.
Records of criminal charges. The specific variables vary by jurisdiction, as described below.
Not within each jurisdiction.
Harris County, TX: 3.1 million records, spanning from 1977 to April, 2012.
New York City, NY: 9.8 million records spanning from 1977 to 2013.
Miami-Dade County, FL: 5.7 million records spanning from 1971 to 2012.
The dataset includes all records from each jurisdiction, within the stated time frame. Some data instances were removed in pre-processing. In addition:
(1) The database contains no juvenile records, as those are not included in basic Freedom of Information Act requests. We note that juvenile is defined differently in each locale, so 17-year-olds are included in Harris County records whereas only 18-year-olds appear in New York City and Miami-Dade County records.
(2) The database does not include sealed or expunged records, as those are typically removed from the underlying county databases. It is likely that this disproportionately affects certain crime types (e.g., traffic offenses).
In the Harris County dataset, each instance contains information regarding the:
1. Offense: date, code, name, degree, bond amount at the time of arrest, category, broad category.
2. Defendant: unique ID, race, gender, DOB (mm/yyyy), height, weight, citizenship status.
3. Case: unique case ID, date filed, offense degree, case bond, case status.
4. Attorney: hired or assigned.
5. Grand jury: date, defendant present, and jury action code.
6. Disposition: date, plea, disposition (e.g., dismissed).
In the New York City dataset, each instance contains information regarding the:
1. Offense: month, year.
2. Arrest: county, month, year, charge, crime category, broad crime category.
3. Defendant: race, gender, age at arrest.
4. Disposition: county, month, year, charge, disposition.
In the Miami-Dade County dataset, each instance contains information regarding the:
1. Arrest: date, code, crime category, broad crime category.
2. Case: date filed, date closed, offense degree, trial type (Bench / Jury), case code, case status.
3. Defendant: race, gender, DOB (mm/yyyy).
4. Disposition: code, plea, disposition.
There is no pre-specified target label. However, disposition is the variable most suitable for use as a target label.
No.
Yes. For race, this originates from the raw data and it is not clear whether it is based on the individual’s self-description.
The jurisdictions within the datasets do not identify offenders of Hispanic descent. To obtain a better understanding of the demographics, the creators have estimated the Hispanic population by last name.
All the records in the database were originally entered by humans. The creators attempted to fix typographical errors. However, a larger problem is missing data. For example, some fields have become more populated with time. Birth date was not as commonly entered in some of the earlier records from the 1970s and 1980s, but was entered more rigorously over time.
The dataset does not contain corrections records, as most states do not consider those public. Therefore, while we know each offender’s sentence at the end of trial or plea bargaining, we cannot know how long an offender actually served.
The dataset contains partial information on criminal offending, as well as demographic information. The partial criminal offending can be constructed as the dataset contains unique identification numbers that can be linked across multiple offenses in an area. For example, in Harris County, Texas, 44% of the 1.2M uniquely identified offenders have multiple offenses – and therefore a partial record of offense (see Figure [fig:offense_dist]).
[fig:offense_dist]
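The partial record described above amounts to grouping records by the defendant identifier. The following is a minimal sketch of that idea, not the creators' code: the field name `defendant_id` and the toy records are assumptions made for illustration, and each record is treated as one offense.

```python
from collections import Counter

def repeat_offender_share(records, id_field="defendant_id"):
    """Fraction of uniquely identified defendants with more than one record.

    `id_field` is a hypothetical column name; records without an ID are skipped.
    """
    counts = Counter(
        rec[id_field] for rec in records if rec.get(id_field) is not None
    )
    if not counts:
        return 0.0
    repeaters = sum(1 for n in counts.values() if n > 1)
    return repeaters / len(counts)

# Toy, made-up records (not taken from the dataset).
sample = [
    {"defendant_id": "A", "offense_date": "1999-01-02"},
    {"defendant_id": "A", "offense_date": "2001-05-09"},
    {"defendant_id": "B", "offense_date": "2000-03-04"},
    {"defendant_id": "C", "offense_date": "2003-07-01"},
    {"defendant_id": "C", "offense_date": "2004-11-20"},
    {"defendant_id": None, "offense_date": "1985-02-14"},
]

share = repeat_offender_share(sample)  # A and C repeat, B does not
```

Because identifiers are only linked within a jurisdiction, a share computed this way describes repeat bookings in that jurisdiction only.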
Possibly, if comparing to other sources such as news articles. Only relevant for cases that attracted media attention.
Examples of papers that have used this dataset are .
No.
The dataset can be used for research questions around case disposition and sentencing. A partial criminal record can be constructed from the Harris County dataset.
The dataset only contains arrest data and not incident-based data, thus providing a picture of crime at the courthouse level. This means that previous stages in the law enforcement process (e.g., 911 calls, house calls, etc.) could skew the arrests that make it into courthouse databases.
The recidivism analysis that this dataset allows applies only to repeated bookings within the same jurisdiction. This approach will systematically undercount the true recidivism rate, because offenders who relocate and reoffend elsewhere are not captured.
The dataset does not have victim data, precluding the analysis of, for example, whether ethnicity or age of victim affects sentencing.
Some jurisdictions have more limited data than the rest. For example, New York City’s records only list the most serious offense per arrest and do not yet include an identifier.
While the creators' broad categorization allows for comparisons across jurisdictions, the detailed categorization does not. The subcategories become populated only if the jurisdictions’ labels or code citations provided enough detail.
To acquire the underlying data, the dataset creators “contacted New York City (New York), Harris County (Houston), and Miami-Dade County (Miami), to obtain copies of their criminal records from their justice information management systems. As public records, the data were obtained via Freedom of Information Act requests”.
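As a rough illustration of a cross-jurisdiction comparison on the broad categories, the sketch below computes each category's share of records per jurisdiction. The field names `jurisdiction` and `broad_category` and the category labels in the toy data are hypothetical, not the dataset's actual column names or values.

```python
from collections import Counter, defaultdict

def broad_category_shares(records):
    """Return {jurisdiction: {broad_category: share of that jurisdiction's records}}."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec["jurisdiction"]][rec["broad_category"]] += 1
    shares = {}
    for jurisdiction, counter in counts.items():
        total = sum(counter.values())
        shares[jurisdiction] = {cat: n / total for cat, n in counter.items()}
    return shares

# Toy, made-up records with invented category labels.
toy_records = [
    {"jurisdiction": "Harris County", "broad_category": "theft"},
    {"jurisdiction": "Harris County", "broad_category": "theft"},
    {"jurisdiction": "Harris County", "broad_category": "assault"},
    {"jurisdiction": "Harris County", "broad_category": "drug"},
    {"jurisdiction": "New York City", "broad_category": "theft"},
    {"jurisdiction": "New York City", "broad_category": "assault"},
]

shares = broad_category_shares(toy_records)
```

Shares rather than raw counts are compared here because the jurisdictions differ in record volume and time span.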
No. The data was derived from a dataset of criminal records used by respective local authorities. It was not collected for research purposes.
The data was entered into the courts' data systems by employees of the courts.
Harris County, TX – 1977 to April, 2012.
New York City, NY – 1977 to 2013.
Miami-Dade County, FL – 1971 to 2012.
Unknown. The dataset creators do state that “The Institutional Review Board at Baylor College of Medicine exempted this release of an anonymized dataset from human subject research oversight because they consist of publicly available records”.
It is likely the individuals know of their criminal charges. It is unlikely they knew or gave consent for it to be used as part of a research dataset.
Unknown.
Yes. Data processing is described in detail by the dataset creators and in the codebook. Broadly, the data was cleaned and standardized, and duplicated entries were removed. Entries have been de-identified by removing names, addresses, etc. DOB was replaced with the month and year only. In the Harris County dataset, defendants and cases were given unique identifiers. The creators added seven calculated variables to all the datasets, plus an eighth for Harris County only:
1. Broad crime category (32 categories).
2. Detailed crime category (approximately 150 to 175 categories).
3. Standardized disposition.
4. Gender, using given name to determine gender when missing or unknown.
5. Race, using surname to add Hispanic ethnicity.
6. The defendant's age at the time the case was filed or at the arrest date.
7. The year the case was filed.
8. Aggregated case numbers to combine multiple offenses into a single case (Harris County only).
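To make two of these processing steps concrete, here is a minimal sketch of truncating a date of birth to month and year and deriving an age at filing from the truncated value. It is not the creators' procedure: the function names are hypothetical, and because the day of birth is unknown, the sketch assumes the birthday has already passed when the filing falls in the birth month.

```python
from datetime import date

def truncate_dob(dob):
    """De-identify a full date of birth by keeping only (month, year)."""
    return dob.month, dob.year

def age_at_filing(dob_month, dob_year, filed):
    """Approximate age in whole years on the filing date.

    Assumption: if the filing is in the birth month, the birthday is
    treated as already passed, since the day of birth is not available.
    """
    age = filed.year - dob_year
    if filed.month < dob_month:
        age -= 1
    return age

# Made-up example values.
month, year = truncate_dob(date(1980, 6, 17))
age_before = age_at_filing(month, year, date(2000, 5, 15))
age_same_month = age_at_filing(month, year, date(2000, 6, 1))
```

With month-level precision the derived age can be off by one year for filings in the birth month, which matters for analyses near an age threshold such as the juvenile cutoff.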
Yes. The calculated age, race and gender variables are added to the dataset alongside the raw variables.
No.
Yes. The dataset can be found here .
The dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.
The dataset is not maintained. There may be an updated version of this dataset, which is not freely available, at http://scilaw.org/risk-assessment/.
Unknown.
No.
N/A.
No.
The owners can be contacted at: UCR-NIBRS@fbi.gov