Collated Police Incident Index (CPII)


For what purpose was the dataset created?

The dataset was created to assess the effect of New York’s bail reform on crime, and ultimately to determine: "did bail reform increase crime (as measured by the reconstructed index crime) in NYC, relative to shared co-movements in crime across the nation?". It is not a single dataset per se, but an index of 27 datasets with some common variables that can be combined or compared.

Who created the dataset?
Is it an official law enforcement or government body? An academic research team? Other?

The dataset was created by researchers at UC Berkeley, Cornell University, and the New York City Criminal Justice Agency: Angela Zhou, Andrew Koo, Nathan Kallus, Rene Ropac, Richard Peterson, Stephen Koppel, and Tiffany Bergin.

Was there a specific task in mind, or gap that needed to be filled?

The authors wished to assess the impact of New York State’s Bail Elimination Act, which “eliminates money bail and pretrial detention for nearly all misdemeanor and nonviolent felony defendants”. Specifically, they wished to investigate whether the Act had any impact on observed crime rates, positing that bail and pretrial detention may have served as a deterrent. To do this, they assess New York’s crime rate against a synthetic control constructed by reweighting the aggregated crime rates from 19 other municipal police departments.
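The reweighting idea behind a synthetic control can be illustrated with a minimal sketch. It assumes each city is summarised as a list of pre-intervention crime rates, and it fits non-negative weights that sum to one using exponentiated-gradient updates. The function names, the optimiser, and the numbers are illustrative assumptions, not the estimator or data used by the authors.

```python
import math


def synthetic_series(donors, weights):
    """Weighted combination of the donor series, period by period."""
    return [sum(w * d[t] for w, d in zip(weights, donors))
            for t in range(len(donors[0]))]


def mse(target, donors, weights):
    """Mean squared gap between the target and the synthetic series."""
    synthetic = synthetic_series(donors, weights)
    return sum((y - s) ** 2 for y, s in zip(target, synthetic)) / len(target)


def fit_weights(target, donors, step=0.1, iters=2000):
    """Fit non-negative weights summing to 1 on pre-intervention data.

    `step` must be small relative to the scale of the series; rescale
    the data (e.g. to rates near 1) before using the default.
    """
    n_periods = len(target)
    weights = [1.0 / len(donors)] * len(donors)  # start from a uniform mix
    for _ in range(iters):
        synthetic = synthetic_series(donors, weights)
        residual = [y - s for y, s in zip(target, synthetic)]
        # Gradient of the mean squared error with respect to each weight.
        grad = [-2.0 / n_periods * sum(r * x for r, x in zip(residual, d))
                for d in donors]
        # Multiplicative update keeps every weight non-negative ...
        weights = [w * math.exp(-step * g) for w, g in zip(weights, grad)]
        # ... and renormalising keeps the weights summing to 1.
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights


# Made-up pre-period rates for three hypothetical donor cities.
donor_rates = [
    [1.0, 1.2, 0.9, 1.1, 1.0],
    [2.0, 1.6, 1.9, 1.5, 1.8],
    [0.5, 0.9, 0.6, 1.0, 0.4],
]
# Made-up target series built as a known mix of the donors.
target_rates = [0.5 * a + 0.3 * b + 0.2 * c for a, b, c in zip(*donor_rates)]

weights = fit_weights(target_rates, donor_rates)
uniform = [1.0 / 3] * 3
```

In a setup like this, the fitted weights would then be applied to the donors' post-intervention rates, and the gap between the observed and synthetic series read as the estimated effect.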


What do the instances that comprise the dataset represent?
For example: crimes, offenders, court cases, police officers

Each instance represents a recorded crime report.

Are there multiple types of instances?
For example: offenders, victims, and the relationship between them.


How many instances are there in total?
Of each type, if appropriate.

There are a total of 27 datasets in this index; each one has between 10K and 1M instances.

Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
For example, if it is traffic stops from a territory, is it all traffic stops conducted within that territory within a specific time? If not, is it a representative sample of all stops? Describe how representativeness was validated/verified. If it is not representative, please describe why.

The compiled crime data covers 27 cities across the United States for the period Jan 1, 2018 to Mar 15, 2020. “These cities were chosen based on population size and public crime data availability: we assessed the list of cities in decreasing order of population, and downloaded data when it was available for the 30 most populous cities, ending up with 27 cities with available crime reporting data after omitting some due to significant reporting discontinuities in the data”.

What data does each instance consist of?
If there is a large number of variables, please provide a broad description of what is included.

As the data is compiled from 27 different sources, each source has a different set of variables. All sources report the date, time, and location of the crime (as recorded) and the type of offense. See Table [tab:variable_matrix] for further detail.


Is there a target label associated with each instance?
Please include labels that are likely to be used as target labels, e.g. recidivism.

No. The data is in its record-based form. Once the data is aggregated, the crime rate could be considered as a target variable.
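The aggregation step mentioned above can be sketched as follows: record-level reports are counted per ISO week within the study window and converted to a rate per 100,000 residents. The field name `occurred_on`, the ISO date format, and the population figure are assumptions for illustration; each source names and formats its date column differently.

```python
from collections import Counter
from datetime import date, datetime

# Study window stated in the datasheet.
WINDOW_START = date(2018, 1, 1)
WINDOW_END = date(2020, 3, 15)


def weekly_counts(records, date_field="occurred_on"):
    """Count records per (ISO year, ISO week) inside the study window."""
    counts = Counter()
    for record in records:
        day = datetime.strptime(record[date_field], "%Y-%m-%d").date()
        if WINDOW_START <= day <= WINDOW_END:
            iso = day.isocalendar()
            counts[(iso[0], iso[1])] += 1
    return counts


def rate_per_100k(count, population):
    """Convert a raw count to a rate per 100,000 residents."""
    return 100_000 * count / population


# Made-up records; the first and last fall outside the window.
records = [
    {"occurred_on": "2017-12-31"},
    {"occurred_on": "2018-01-01"},
    {"occurred_on": "2018-01-03"},
    {"occurred_on": "2018-01-08"},
    {"occurred_on": "2020-03-16"},
]
counts = weekly_counts(records)
first_week_rate = rate_per_100k(counts[(2018, 1)], population=400_000)
```

Weekly bins are only one choice; daily or monthly bins follow the same pattern with a different grouping key.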


Does the dataset contain data on race and ethnicity?
If so, is it based on the individual’s self-description, or based on officer’s impression? Was it collected or derived in post-processing? For example, by name analysis.

Some of the 27 datasets in this index include information on offender and victim race. As the raw data is crime incident reports, this information is likely a mix of officer impression, victim impression and self-description.

Are there any known errors, sources of noise, bias or missing data, or variables collected for only part of the datasets?
If so, please provide a description.

No. However, the data is not standardized, and different agencies may employ different crime recording standards. Note that these are initial reporting figures produced for the local areas and may be updated at a later date.

Does the dataset contain data on criminal history or other data that might be considered confidential or sensitive in any way?
For example: sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; If so, please provide a description.


Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset?
If so, please describe how.



What type of tasks, if any, has the dataset been used for?
If so, please provide examples and include citations.

To date, this dataset has only been used to determine the impact of NYC bail reform.


What (other) tasks could the dataset be used for?
For example: testing predictive policing systems, predicting recidivism.

The dataset could be used as an alternative to the UCR Summary Reporting System to obtain aggregate reports of crime. This dataset index was compiled at a point when 2020 UCR data was not yet available. Given that the 2020 NIBRS data has now been released, there are two main reasons to use this dataset: (1) it includes cities that do not report to NIBRS, and (2) it reports location in a more fine-grained manner.

Many of the variables do not match across the index, including the type of location used (for example, tract or latitude/longitude). These differences will have to be resolved for many use cases. Additionally, some datasets report arrests, whereas others report incidents. This needs to be carefully managed when comparing data from different localities.
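One way to manage these mismatches is to map each source onto a small common schema that records the location type and whether a row is an arrest or an incident. The sketch below is illustrative only: the source keys, column names, and sample rows are hypothetical, and the real mappings would have to be read from each agency's data dictionary.

```python
# Hypothetical per-source column mappings (not the real column names).
SOURCE_SCHEMAS = {
    "city_a": {"date": "occur_date", "offense": "offense_desc",
               "location": "census_tract", "location_type": "tract",
               "record_type": "incident"},
    "city_b": {"date": "ArrestDate", "offense": "Charge",
               "location": "LatLon", "location_type": "lat_lon",
               "record_type": "arrest"},
}


def harmonize(source, row):
    """Rename one raw row into the common schema for its source."""
    schema = SOURCE_SCHEMAS[source]
    return {
        "source": source,
        "date": row[schema["date"]],
        "offense": row[schema["offense"]],
        "location": row[schema["location"]],
        # Keep the location type explicit so tracts and coordinates
        # are never compared as if they were the same thing.
        "location_type": schema["location_type"],
        "record_type": schema["record_type"],
    }


def comparable(records, record_type="incident"):
    """Keep one record type so arrests are not mixed with incidents."""
    return [r for r in records if r["record_type"] == record_type]


# Made-up rows for the two hypothetical sources.
rows = [
    harmonize("city_a", {"occur_date": "2019-05-01",
                         "offense_desc": "BURGLARY",
                         "census_tract": "001100"}),
    harmonize("city_b", {"ArrestDate": "2019-05-02",
                         "Charge": "ROBBERY",
                         "LatLon": "39.29,-76.61"}),
]
incidents_only = comparable(rows)
```

Converting between location types (for example, geocoding coordinates to tracts) is a separate step that this sketch does not attempt.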

Collection Process

How was the data associated with each instance acquired?
e.g. the data was collected by survey, or the raw data is routinely collected by the courts.

The data in the index is hosted on the law enforcement agencies’ respective websites.

Was the information self-reported?
If the data was self-reported, was the data validated/verified? If so, please describe how.


Who was involved in the data collection process?
Was this done as part of their other duties? If not, were they compensated?

The authors of the study compiled the list of datasets. The raw data was collected as part of routine law enforcement work.

Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)?
If not, please describe the timeframe in which the data associated with the instances was created. If the collection was not continuous within the timeframe, please specify the intervals, for example, annually, every 4 years, irregularly.

The data was compiled in 2021, and concerns the 2018 – March 2020 period.

An ethical review is not mentioned in the paper.


An analysis of the potential impact was not mentioned in the paper.

Pre-processing, cleaning, labeling

Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, removal of instances, processing of missing values)?
If so, please provide a description and reference to the documentation. If not, you may skip the remaining questions in this section.

From the paper: “We removed Atlanta and Fort Worth because of data quality reporting issues: due to changes in reporting scheme, the observed time series has a large discontinuity. Fort Worth and Houston both moved to NIBRS reporting in 2018 which aligns with the anomalies for those cities. Kansas City also moved from encoding with UCR codes to NIBRS descriptions in 2019; there also appears to be a data changepoint in the series in that time range”.

Yes, as the dataset is in fact an index of the original datasets.



Is the data publicly available? How and where can it be accessed (e.g., website, GitHub)?
Does the dataset have a digital object identifier (DOI)?

Yes. Please see index below:

Each local dataset is subject to an individual license.


Is the dataset maintained? Who is supporting/hosting/maintaining the dataset?

No, the index is not maintained. The raw data is likely maintained by the respective agencies.

How can the owner/curator/manager of the dataset be contacted (e.g., email address)?

The authors of the study can be contacted at:








Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)?


Will older versions of the dataset continue to be supported/hosted/maintained?


If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so?
If so, please provide a description.

Contact the authors.