Embedding-Based Summarization and Document Redaction

Introduction

Large text documents often contain repetitive or redundant information that can be removed to produce a more concise summary. In machine learning and natural language processing, one effective approach to identify such redundancy is through the use of **vector embeddings** and **cosine similarity**. By breaking a document into smaller chunks (sentences or paragraphs) and representing each chunk as a high-dimensional embedding vector, we can quantitatively measure the semantic similarity between any two chunks. High similarity scores indicate overlapping or repetitive content. This forms the basis of an automated summarization technique: detect clusters of similar chunks and remove the less important ones, thereby **reducing the document length** without losing key information.

The same principle can be extended to **document redaction** in a multi-document scenario. For example, given two documents, we can identify portions of one document that substantially overlap with content in the other document and **redact (remove or mask) those overlapping portions**. Instead of exact keyword matching, semantic embeddings allow us to catch paraphrased or reworded similarities – a crucial feature noted in plagiarism detection research.

In this white paper, we detail the methodology for embedding-based content summarization and show how it can be applied to cross-document redaction. We also provide mathematical formulations, matrix operations, and visual illustrations of the similarity matrices involved.

Method Overview

 

Step 1: Document Chunking

The input document is split into chunks (e.g., sentences or paragraphs). Each chunk is treated as a semantic unit.
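As a minimal sketch (assuming sentence-level chunks and a simple regex-based splitter; a production system might use a proper sentence tokenizer instead), chunking could look like this:

```python
import re

def chunk_document(text):
    """Split a document into sentence-level chunks on ., !, ? boundaries."""
    # Naive splitter: break after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # Drop empty fragments so every chunk is a usable semantic unit.
    return [s for s in sentences if s]

chunks = chunk_document("Cats are mammals. Felines are mammals. Dogs bark.")
# chunks -> ['Cats are mammals.', 'Felines are mammals.', 'Dogs bark.']
```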

 

Step 2: Embedding Generation

Each text chunk is converted into a numerical vector using a pre-trained language model or embedding model. For example, one could use a BERT-based model or other transformer-based embedding models to encode the semantic content of each chunk into a high-dimensional vector. Let $v_i \in \mathbb{R}^d$ denote the embedding vector for the $i$-th chunk. These embeddings capture semantic meaning beyond exact word overlap – e.g., two chunks discussing the same concept with different wording will have similar vectors.
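A minimal sketch using the sentence-transformers library (the model name `all-MiniLM-L6-v2` is just one possible choice; any model that returns fixed-length vectors works the same way):

```python
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Any pre-trained embedding model can be used; this one produces 384-dimensional vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")

# `chunks` is the list of sentence/paragraph chunks from Step 1.
embeddings = model.encode(chunks)   # NumPy array of shape (N, d), one row per chunk
```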

Step 3: Cosine Similarity Matrix 

Self-Redaction

We compute pairwise cosine similarities between all chunk vectors. Each embedding is represented as a list of floats, e.g., [0.12127365, -1.3553627, …], whose length $d$ depends on the model's dimensionality. The cosine similarity between two vectors $v_i, v_j \in \mathbb{R}^d$ is defined as:

$$\cos(v_i, v_j) = \frac{v_i \cdot v_j}{\lVert v_i \rVert \, \lVert v_j \rVert}$$

Computing this for every pair of chunks yields the $N \times N$ similarity matrix $M$, with $M_{ij} = \cos(v_i, v_j)$.
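A sketch of the $N \times N$ similarity matrix in NumPy (the embeddings are row-normalized first, so a single matrix product yields all pairwise cosine similarities):

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Return the N x N matrix of pairwise cosine similarities."""
    V = np.asarray(embeddings, dtype=float)
    # Normalize each row to unit length; dot products then equal cosine similarities.
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return V @ V.T

M = cosine_similarity_matrix(embeddings)   # M[i, j] = cos(v_i, v_j)
```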

 

Redaction by reference

This approach is applied when redacting a target document Q using a reference document R. Both Q and R are first divided into discrete chunks. Cosine similarity is then computed between each chunk of Q and each chunk of R, producing an $m \times n$ similarity matrix, where $m$ and $n$ represent the number of chunks in Q and R, respectively:

$$M_{ij} = \cos(q_i, r_j), \quad 1 \le i \le m, \; 1 \le j \le n$$

where $q_i$ and $r_j$ are the embeddings of the $i$-th chunk of Q and the $j$-th chunk of R.
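The same normalization trick gives the rectangular matrix; here `Q_emb` and `R_emb` are illustrative names for the chunk embeddings of the target and reference documents:

```python
import numpy as np

def cross_similarity_matrix(Q_emb, R_emb):
    """Return the m x n matrix of cosine similarities between Q's and R's chunks."""
    Q = np.asarray(Q_emb, dtype=float)
    R = np.asarray(R_emb, dtype=float)
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    R = R / np.linalg.norm(R, axis=1, keepdims=True)
    return Q @ R.T   # entry [i, j] = cos(q_i, r_j)

M_ref = cross_similarity_matrix(Q_emb, R_emb)   # shape: (m, n)
```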

Step 4: Masking Self-Similarity

In the self-redaction case the similarity matrix is square and every diagonal entry equals 1 ($M_{ii} = 1$), since each chunk is perfectly similar to itself. These diagonal entries are masked so that a chunk's similarity to itself does not affect the scores computed in the next step. This condition does not arise when performing redaction against an external reference document, where the matrix is rectangular and compares chunks from two different texts.
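A one-line sketch of the masking step, continuing the variables above (the diagonal is zeroed so self-similarity cannot inflate later scores; this only applies to the square, self-redaction matrix):

```python
import numpy as np

M_masked = M.copy()
np.fill_diagonal(M_masked, 0.0)   # ignore M[i][i] = 1 when scoring chunks
```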

Step 5: Significance Score

Self-Redaction

Given a similarity matrix $M \in \mathbb{R}^{N \times N}$, where $M_{ij}$ is the cosine similarity between chunk $i$ and chunk $j$, and $M_{ii} = 1$ (perfect similarity with itself), we define the significance score for chunk $i$, denoted $s_i$, as its average similarity to all other chunks:

$$s_i = \frac{1}{N - 1} \sum_{j \ne i} M_{ij}$$
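With the diagonal already masked to zero, this reduces to a row average (a minimal sketch continuing the variables above):

```python
N = M_masked.shape[0]
# Average similarity of chunk i to every other chunk (self-similarity excluded).
significance = M_masked.sum(axis=1) / (N - 1)   # significance[i] = s_i
```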

Redaction by reference

When redacting against a reference document, the score for chunk $i$ of Q is its highest similarity to any chunk of R:

$$s_i = \max_{1 \le j \le n} M_{ij}$$
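In code this is a row-wise maximum over the rectangular matrix (sketch; `M_ref` is the m × n matrix built earlier):

```python
# Highest similarity of each chunk of Q to any chunk of the reference document R.
overlap = M_ref.max(axis=1)   # overlap[i] = s_i = max_j M_ref[i, j]
```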

 

Step 6: Redundant Cluster Detection

Define a similarity threshold $\theta$. Chunks whose pairwise similarity meets or exceeds $\theta$ are linked, and these links are used to build connected components of similar chunks.

Self-Redaction

Chunk pairs with $M_{ij} \ge \theta$ (for $i \ne j$) are treated as edges of a graph; the connected components of this graph form clusters of mutually redundant chunks.

Redaction by reference

A chunk $i$ of Q is flagged as overlapping whenever $M_{ij} \ge \theta$ for some chunk $j$ of R, i.e., whenever it substantially overlaps with any chunk of the reference document.
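A minimal connected-components sketch over the thresholded similarity graph (pure Python; a library routine such as scipy.sparse.csgraph.connected_components could be used instead; the threshold value 0.8 is only illustrative):

```python
def similarity_clusters(M_masked, theta):
    """Group chunk indices into connected components of the graph where
    an edge links i and j whenever M_masked[i, j] >= theta."""
    n = M_masked.shape[0]
    visited, clusters = set(), []
    for start in range(n):
        if start in visited:
            continue
        # Traverse from `start` to collect its whole component.
        component, queue = [], [start]
        visited.add(start)
        while queue:
            i = queue.pop()
            component.append(i)
            for j in range(n):
                if j not in visited and M_masked[i, j] >= theta:
                    visited.add(j)
                    queue.append(j)
        clusters.append(component)
    return clusters

clusters = similarity_clusters(M_masked, theta=0.8)   # hypothetical output, e.g. [[0, 1], [2, 3], [4]]
```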


 

Step 7: Chunk Selection

For each cluster of similar chunks identified in the previous step, we retain only the most significant chunk and mark the rest for removal. The "most significant" chunk can be defined as the one with the highest significance score $s_i$ (average similarity to the others), since it is the most representative of the cluster's content. By keeping the chunk with the highest $s_i$ and discarding the other chunks in that cluster, we preserve the information content (because the kept chunk is highly similar to the ones removed) while eliminating repetition. In our example, for the cluster {Chunk 0, Chunk 1}, suppose $s_0 > s_1$; then Chunk 0 is kept and Chunk 1 is removed. Similarly, in {Chunk 2, Chunk 3}, if $s_2 > s_3$, keep Chunk 2 and remove Chunk 3. This step yields the set of redundant chunks to drop. A chunk that was not part of any high-similarity pair is inherently unique and is retained.

Step 8: Summary Reconstruction

The remaining chunks are concatenated, preserving their original order, to form the final summary.
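Steps 7 and 8 can be sketched together, continuing the variables from the earlier snippets: within each cluster keep only the chunk with the highest significance score, then rebuild the summary in document order.

```python
def summarize(chunks, clusters, significance):
    """Keep the most significant chunk of each cluster and rejoin in document order."""
    keep = set()
    for cluster in clusters:
        # Retain the chunk whose average similarity to the others is highest.
        best = max(cluster, key=lambda i: significance[i])
        keep.add(best)
    # Singleton clusters (unique chunks) are kept automatically by the loop above.
    return " ".join(chunks[i] for i in sorted(keep))

summary = summarize(chunks, clusters, significance)
```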

Mathematical Formulation

- **Matrix product for similarity**: with row-normalized embedding matrices, all pairwise cosine similarities are obtained in a single matrix product.
  - For a single document, with $V \in \mathbb{R}^{N \times d}$ holding the unit-normalized chunk embeddings: $M = V V^{\top}$
  - For redaction by reference, with normalized $Q \in \mathbb{R}^{m \times d}$ and $R \in \mathbb{R}^{n \times d}$: $M = Q R^{\top}$

Application to Summarization

- Keeps most representative chunk per idea
- Removes redundant content
- Retains unique, informative content

Application to Cross-Document Redaction

- Compares a target document Q against a reference document R
- Removes a chunk of Q if its similarity to any chunk of R exceeds the threshold $\theta$
- Uses the rectangular $m \times n$ similarity matrix
- Can detect paraphrased overlaps (a minimal end-to-end sketch follows below)
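Putting the cross-document pieces together, a minimal sketch of reference-based redaction (the `embed` argument stands in for whatever embedding model is used, e.g. `model.encode` from Step 2; `theta` is the same overlap threshold as before):

```python
import numpy as np

def redact_by_reference(target_chunks, reference_chunks, embed, theta=0.8):
    """Drop every chunk of the target document whose cosine similarity to
    any chunk of the reference document reaches the threshold theta."""
    Q = np.asarray(embed(target_chunks), dtype=float)
    R = np.asarray(embed(reference_chunks), dtype=float)
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    R = R / np.linalg.norm(R, axis=1, keepdims=True)
    M = Q @ R.T                      # m x n similarity matrix
    overlap = M.max(axis=1)          # highest similarity to any reference chunk
    kept = [c for c, s in zip(target_chunks, overlap) if s < theta]
    return " ".join(kept)

# Example usage (chunks_Q and chunks_R are the chunked target and reference documents):
# redacted = redact_by_reference(chunks_Q, chunks_R, model.encode, theta=0.8)
```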

Comparison to Other Methods

- Compared with LexRank, TextRank, MMR
- More robust due to embedding-based semantic comparison
- Handles paraphrasing better than TF-IDF-based models

Conclusion

This white paper presents a principled embedding-based approach for both document summarization and redaction using cosine similarity. By constructing and analyzing similarity matrices (square or rectangular), we can efficiently reduce redundancy or filter overlap between texts. The technique offers clear advantages in flexibility, semantic fidelity, and interpretability, and is supported by visualizations and matrix math.

Figure 1: Cosine Similarity Matrix (N × N)

Figure 2: Cosine Similarity Matrix (m × n)
