National Crime Victimization Survey (NCVS)


For what purpose was the dataset created?

The National Crime Victimization Survey (NCVS) series was designed to achieve four primary objectives :

  1. To develop detailed information about the victims and consequences of crime

  2. To estimate the number and types of crimes not reported to police

  3. To provide uniform measures of selected types of crime

  4. To permit comparisons over time and types of areas

Who created the dataset?
Is it an official law enforcement or government body? An academic research team? Other?

The survey is administered by the U.S. Census Bureau (under the U.S. Department of Commerce) on behalf of the Bureau of Justice Statistics (under the U.S. Department of Justice).

Was there a specific task in mind, or gap that needed to be filled?

The NCVS began in 1972 and was developed following a survey done by the National Opinion Research Center and the President’s Commission on Law Enforcement and Administration of Justice. The survey highlighted that many crimes were not reported to the police. The NCVS was created to assess the levels of and gain better understanding of criminal victimization, including from crimes that were never reported to law enforcement .


What do the instances that comprise the dataset represent?
For example: crimes, offenders, court cases, police officers

Instances in NCVS correspond to a record of a criminal victimization incident.

Are there multiple types of instances?
For example: offenders, victims, and the relationship between them.

Yes, there are four types of records:

  1. Address ID Record

  2. Household Record

  3. Person Record

  4. Incident Record

Each person can have multiple incident records, each household can include several persons.

How many instances are there in total?
Of each type, if appropriate.

Data is collected bi-annually, from a nationally representative sample of  ∼ 49, 000 households comprising  ∼ 100, 000 persons on the frequency, characteristics, and consequences of criminal victimization in the United States. The number of incidents will vary from year to year.

Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
For example, if it is traffic stops from a territory, is it all traffic stops conducted within that territory within a specific time? If not, is it a representative sample of all stops? Describe how representativeness was validated/verified. If it is not representative, please describe why.

The dataset contains a sample of 100,000 persons from the United States. Weights are provided at the person, household, and incident level to produce a representative sample of the US . Excluded are persons who are crews of vessels, in institutions (e.g., prisons and nursing homes) or members of the armed forces living in military barracks . Once in the sample, respondents are interviewed every six months for a total of seven interviews over a three-year period.

What data does each instance consist of?
If there is a large number of variables, please provide a broad description of what is included.

Each instance consists of the following data :

Is there a target label or associated with each instance?
Please include labels that are likely to be used as target labels, e.g. recidivism.

No variable in the dataset is designated as a label. However, whether or not a crime was reported and whether or not an arrest was made following that report may be suitable as target labels .


Does the dataset contain data on race and ethnicity?
If so, is it based on the individual’s self-description, or based on officer’s impression? Was it collected or derived in post-processing? For example, by name analysis.

Yes, race and ethnicity are reported for the victim, and sometimes for the offender as well.

For the victim, that is, the survey respondent, race and ethnicity are self-reported. The race categories are: White, Black or African American, American Indian or Alaska Native, Asian, Native Hawaiian or Other Pacific Islander, and Other. Only Hispanic ethnicity is recorded.

For the offender, race and ethnicity are perceived and reported by the respondent, i.e., the victim, if they saw the offender. The race categories are: Mostly White, Mostly Black or African American, Mostly American Indian or Alaska Native, Mostly Asian, Mostly Native Hawaiian or Other Pacific Islander, Equal number of each race, and Don’t Know. The ethnicity categories are: mostly Hispanic or Latino, mostly non-Hispanic, equal number of Hispanic and non-Hispanic, and don’t know.

Are there any known errors, sources of noise, bias or missing data, or variables collected for only part of the datasets?
If so, please provide a description.

Weights are provided on a person, household and incident level to produce a representative sample of the US. Prisoners are excluded from the sample.

Does the dataset contain data on criminal history or other data that might be considered confidential or sensitive in any way?
For example: sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; If so, please provide a description.

The experiences of criminal victimization themselves and their consequences can be considered sensitive, especially for sexual assault and rape. In addition, the dataset contains information on: age, race, gender, and income.

Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset?
If so, please describe how.

No, the data is sufficiently anonymized.


What type of tasks, if any, has the dataset been used for?
If so, please provide examples and include citations.

The dataset has been used for a range of victimization studies, looking at things such as:

Yes. Papers that cited this dataset can be found in:

What (other) tasks could the dataset be used for?
For example: testing predictive policing systems, predicting recidivism.

This dataset can be used for a variety of tasks which requires an understanding of the level of victimization for specific crimes, with demographic information on both the victim and offender, and information on whether the crime was reported and an arrest was made.

According to the authors of a book chapter on the potential sources of error in the NCVS , there are questions as to whether the rape and sexual assault are underestimated as they do not align with alternative surveys. “The Bureau of Justice Statistics does not provide public information on the edit process in the National Crime Victimization Survey, although processing and editing errors are an important part of any major survey data collection. The lack of transparency about these processes makes it difficult for data users to fully understand the survey’s estimate” .

Collection Process

How was the data associated with each instance acquired?
e.g. the data collected survey, the raw data is routinely collected by the courts.

The data was acquired from a bi-annual survey.

Was the information self-reported?
If the data was self-reported, was the data validated/verified? If so, please describe how.

Yes. The data is collected in a survey. However, the raw survey responses are not provided.

Who was involved in the data collection process?
Was this done as part of their other duties? If not, were they compensated?

The survey is administered by the U.S. Census Bureau.

Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)?
If not, please describe the timeframe in which the data associated with the instances was created. If the collection was not continuous within the timeframe, please specify the intervals, for example, annually, every 4 years, irregularly.

The data is collected twice a year and released once a year. Data is available from the year 1979 and collection is still ongoing.


Yes, the survey is optional.


Pre-processing, cleaning, labeling

Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, removal of instances, processing of missing values)?
If so, please provide a description and reference to the documentation. If not, you may skip the remaining questions in this section.

Yes, as the survey responses were processed into the data available in the dataset. However, information on pre-processing is not supplied in the codebook.

It is not part of the publicly available dataset.“



Is the data publicly available? How and where can it be accessed (e.g., website, GitHub)?
Does the dataset have a digital object identifier (DOI)?

The dataset is hosted at:

There are multiple DOIs associated with this dataset, depending on the version and years of collection.

The license is not specified, but a citation and deposit requirement are listed:

Citation Requirement: Publications based on ICPSR data collections should acknowledge those sources by means of bibliographic citations. To ensure that such source attributions are captured for social science bibliographic utilities, citations must appear in footnotes or in the reference section of publications.
Deposit Requirement: To provide funding agencies with essential information about use of archival resources and to facilitate the exchange of information about ICPSR participants’ research activities, users of ICPSR data are requested to send to ICPSR bibliographic citations for each completed manuscript or thesis abstract. Visit the ICPSR Web site for more information on submitting citations.


Is the dataset maintained? Who is supporting/hosting/maintaining the dataset?

The dataset is hosted and supported by the Ministry of Justice and the US Census Bureau.

How can the owner/curator/manager of the dataset be contacted (e.g., email address)?

By contacting the US Census Bureau.

Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)?

New versions of the dataset are released yearly.

Are older versions of the dataset continue to be supported/hosted/maintained?

Yes. Previous years of the dataset will continue to be hosted by the Ministry of Justice and are avilable to download.

If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so?
If so, please provide a description.


  1. The dark figure of crime is term used to illustrate the extent of committed crimes that are never reported or discovered by law enfrocment.