Scientific citations are meant to let readers trace claims back to original evidence. When a reference cannot be found, that chain of trust breaks. A 2026 preprint on arXiv examines how large language models (LLMs) relate to "ghost citations", defined as plausible but nonexistent papers.

The authors built a verification pipeline called CiteVerifier and applied it to model outputs, archival papers, and a researcher survey. They link model behavior, observed citation errors in top venues, and self-reported verification gaps, while emphasizing that their data show temporal correlation rather than direct causation.

Key Findings from the GhostCite Study


  • Every tested language model produced ghost citations, with rates ranging from about 14 percent to 95 percent.
  • Prompt engineering and online search rarely lowered hallucination rates across models.
  • Invalid or fabricated citations appeared in 1.07 percent of 56,381 papers from top-tier AI/ML and Security venues, with an 80.9 percent increase in 2025 alone.
  • 41.5 percent of surveyed researchers paste citations unchecked; 76.7 percent of reviewers skip systematic verification.
  • The authors propose automated checks and retrieval grounded AI tools to restore confidence in references.

The CiteVerifier Framework


CiteVerifier tackles a messy input: reference sections scraped from PDF files, which often carry optical character recognition (OCR) errors. These errors are misread letters introduced when software turns scanned images into text.

The pipeline first parses references with the open source tool GROBID, then cleans fragments with an LLM based reparser. It runs a four stage search: a local cache of known papers, bibliographic indexes such as DBLP, Google Scholar queries, and a general web search fallback.

Each stage stops once a confident match appears, which keeps database calls economical for large batches. Matching relies on title similarity scored with normalized Levenshtein distance, a simple edit distance metric that counts how many single character changes turn one string into another.
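The cheap-to-expensive cascade with early stopping can be sketched as follows. The stage functions and confidence values here are illustrative stand-ins, not the paper's implementation:

```python
from typing import Callable, Optional

# Each stage takes a cited title and returns (matched_title, confidence),
# or None when it finds nothing. Stage order mirrors the pipeline's
# cheap-to-expensive search: local cache, DBLP, Google Scholar, web.
Stage = Callable[[str], Optional[tuple[str, float]]]

def staged_lookup(title: str, stages: list[Stage],
                  threshold: float = 0.9) -> Optional[tuple[str, float]]:
    """Return the first sufficiently confident match, trying stages in order."""
    for stage in stages:
        hit = stage(title)
        if hit is not None and hit[1] >= threshold:
            return hit  # stop early: later, costlier stages are never called
    return None  # no stage produced a confident match -> flag as suspicious

# Illustrative stand-in stages (a real pipeline would query DBLP and so on).
cache = {"attention is all you need": ("Attention Is All You Need", 1.0)}
def local_cache(title):  return cache.get(title.lower())
def dblp_stub(title):    return None  # pretend the index returned nothing
```

Because most real citations resolve in the cache or a bibliographic index, the expensive web-search fallback runs only for the residue, which is what keeps large batches affordable.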

CiteVerifier flags a citation as suspicious when similarity falls below 0.9, an empirically chosen cutoff that gives more false alarms than missed fakes. Because the tool focuses on untraceable titles, it cannot catch every metadata typo.
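A minimal version of the title-similarity test might look like this. The 0.9 cutoff comes from the paper; the pure-Python edit distance and case normalization are a sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def title_similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]: 1.0 means identical titles."""
    a, b = a.lower().strip(), b.lower().strip()
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def is_suspicious(cited: str, best_match: str, cutoff: float = 0.9) -> bool:
    return title_similarity(cited, best_match) < cutoff
```

Normalizing by the longer title's length keeps the score comparable across short and long titles, so a single threshold can serve the whole corpus.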

The authors show that it scales by applying CiteVerifier to about 2.2 million citations in their archival study. They describe the framework as open source to support venue side checks at submission time.

LLM Benchmarking Reveals Widespread Hallucination


Using OpenRouter APIs, the team queried 13 commercial and research LLMs across 40 computer science domains, with standardized prompts for each model-domain combination to generate citations.

CiteVerifier processed 331,809 references extracted from 375,440 generated citations and found hallucination rates spanning about 14 percent for DeepSeek to nearly 95 percent for Hunyuan. Domain choice mattered as much as model choice.

Digital library prompts triggered around 80 percent ghost citations on average, while computation and language prompts generated about 29 percent. Across domains, average hallucination rates differed by more than 50 percentage points.

This indicates that some subfields face much higher exposure to hallucinated references than others. Prompt variations did not help consistently in the authors' experiments.

Enabling online search with chain of thought rationales or changing batch size sometimes reduced error rates for one model but raised them for another. This supports the view that hallucination reflects limits in models' bibliographic knowledge rather than a simple prompting issue.

The authors also tested a common strategy of asking an LLM to validate citations it or another model produced. Using 100 bibliography entries with known ground truth, they asked each model to classify entries as valid or invalid.

Average accuracy was 38 percent, which is worse than random guessing. ERNIE reached 56 percent accuracy but did so largely by flagging many legitimate references as invalid.
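A small hypothetical illustrates how a validator can beat a coin flip mainly by over-flagging. The 60/40 split below is illustrative only; the paper does not report the composition of its 100-entry set:

```python
def accuracy(truth, verdicts):
    """Fraction of entries where the validator's verdict matches ground truth."""
    return sum(t == v for t, v in zip(truth, verdicts)) / len(truth)

# Hypothetical 100-entry set: 60 invalid (False) and 40 valid (True) citations.
truth = [False] * 60 + [True] * 40

# A "validator" that blindly flags every entry as invalid...
flag_everything = [False] * 100

# ...scores 60 percent without verifying anything, beating a 50 percent coin flip.
print(accuracy(truth, flag_everything))  # 0.6
```

This is why accuracy alone is a poor measure here: a high invalid rate can be reached by rejecting legitimate references wholesale, exactly the pattern reported for the best-scoring model.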

The authors conclude that users cannot rely on having language models check their own generated citations. External validation remains necessary until generation pipelines integrate reliable retrieval of identifiers such as DOIs or URLs.

Archival Analysis Shows an Accelerating Problem


To see whether ghost citations already appear in published work, the team collected 56,381 papers published between 2020 and 2025 at eight flagship artificial intelligence, machine learning, and security venues, including NeurIPS, ICML, and IEEE S&P.

CiteVerifier screened 2,199,409 references and flagged 2,530 for manual review by a 16 person committee. Manual checks, followed by expert cross review, confirmed 739 invalid citations: 603 were nonexistent ghost references and 136 had severe metadata errors such as incorrect titles, authors, or venues.
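Quick arithmetic on the reported counts shows how selective the pipeline-plus-committee process was, and why the reference-level invalid rate is far below the paper-level 1.07 percent (each paper carries many references):

```python
# Counts reported in the archival study.
references_screened = 2_199_409
flagged_for_review = 2_530
confirmed_invalid = 739

# Roughly 29 percent of flagged references survived manual review...
print(round(confirmed_invalid / flagged_for_review, 3))          # 0.292

# ...and confirmed-invalid references are about 0.034 percent of all screened.
print(round(confirmed_invalid / references_screened * 100, 4))   # 0.0336
```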

The share of affected papers hovered around one percent from 2020 through 2024, then rose to 1.61 percent in 2025. That 80.9 percent relative increase aligns in time with wider adoption of AI assisted writing and autonomous agent workflows.

The authors describe this as a correlation rather than proof that LLMs caused the change. Some invalid references also recurred across different publications.

One phantom paper appeared in 16 independent papers, copied verbatim across different author groups. The pattern resembles earlier observations about citation misprints.

The authors suggest that automation and AI assisted workflows may accelerate rather than create long standing copying habits. They note that automated checks at submission, based on tools such as CiteVerifier, could catch many of these issues before publication.

Researcher Survey Highlights Verification Gaps


The authors also surveyed 97 researchers, including faculty, graduate students, and industry scientists, and analyzed 94 responses after filtering out inconsistent samples.

Of those who answered, 87.2 percent reported using AI tools to draft or polish prose for research. In the survey, 41.5 percent of respondents said they copy BibTeX entries without checking against source databases.

When encountering suspicious references in published work, 44.4 percent reported that they either verify privately or ignore the issue rather than contacting authors or journals. Among the 30 active peer reviewers in the sample, 76.7 percent said they do not systematically check reference lists.

Furthermore, 80 percent reported that they have never suspected a fake citation during review. Several respondents cited time limits as a reason that deep verification is rare in standard reviewing practice.

Notably, 70.2 percent of surveyed researchers favored automated checks at submission. Respondents who opposed automated screening raised concerns about false positives delaying publication schedules.

This echoes committee feedback that any tool should surface clear and low noise alerts.

Proposed Mitigations and Remaining Limits


The paper recommends layered responsibility for citation integrity. Authors should cross check titles and DOIs in trusted indexes such as Crossref before final submission.

Conferences could integrate CiteVerifier or equivalent services into existing plagiarism scans to generate brief risk dashboards for reviewers. For tool makers, the authors recommend a shift toward retrieval grounded generation.

In this approach, the model fetches metadata from verified sources before composing prose. They also suggest enforcing structured outputs with evidence fields such as DOIs, URLs, or database identifiers.
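As an illustration of the retrieval-grounded direction, the sketch below builds a query against the public Crossref works API for a cited title and picks the closest record from a response. The response shown is a trimmed, fabricated sample (the DOIs and titles are placeholders), and the 0.9 matching cutoff is an assumption carried over from the verification discussion:

```python
import difflib
import json
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"

def crossref_query_url(title: str, rows: int = 3) -> str:
    """Build a Crossref title query URL for a cited title."""
    return f"{CROSSREF_WORKS}?{urlencode({'query.title': title, 'rows': rows})}"

def best_doi(response_text: str, cited_title: str, cutoff: float = 0.9):
    """Return (DOI, similarity) of the closest record, or None below cutoff."""
    items = json.loads(response_text)["message"]["items"]
    best = None
    for item in items:
        for title in item.get("title", []):
            score = difflib.SequenceMatcher(
                None, title.lower(), cited_title.lower()).ratio()
            if best is None or score > best[1]:
                best = (item["DOI"], score)
    return best if best and best[1] >= cutoff else None

# Trimmed sample in Crossref's response shape (placeholder data, not real DOIs).
sample = json.dumps({"message": {"items": [
    {"DOI": "10.0000/example.1", "title": ["Retrieval Grounded Citation Checking"]},
    {"DOI": "10.0000/example.2", "title": ["An Unrelated Record"]},
]}})
```

A generator wired this way can attach a verified DOI to each reference it emits, or report "not found" when no record clears the cutoff, rather than composing metadata from memory.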

Systems should surface "not found" signals instead of fabricating metadata.

CiteVerifier's design still misses some errors.

It uses only title similarity, so a real paper cited with the correct title but other mistaken fields can be treated as valid. Gaps or inconsistencies in bibliographic databases limit the completeness of any automated verification pipeline.

The authors therefore describe their 1.07 percent invalid paper rate as conservative. Because their pipeline focuses on ghost citations and title level validity rather than all possible citation errors, broader miscitation remains outside scope.

Wrong page numbers, mismatched venues or years, or citations that do not actually support the claimed result could mean the true reliability deficit is higher.

Conclusion


The study provides quantitative evidence that automated writing can make it harder to trace the evidence behind published claims. Until retrieval-first generation and venue level verification become routine, readers may need to verify some references independently.

If tools and norms do not change, growing use of LLMs in academic writing could increase the number of unsupported or invalid citations in the literature. That outcome would raise the cost of later verification work for researchers who rely on citation trails to assess prior results.
