Embedding-Based Summarization and Document Redaction

Introduction

Large text documents often contain repetitive or redundant information that can be removed to produce a more concise summary. In machine learning and natural language processing, one effective approach to identify such redundancy is through the use of **vector embeddings** and **cosine similarity**. By breaking a document into smaller chunks (sentences or paragraphs) and representing each chunk as a high-dimensional embedding vector, we can quantitatively measure the semantic similarity between any two chunks. High similarity scores indicate overlapping or repetitive content. This forms the basis of an automated summarization technique: detect clusters of similar chunks and remove the less important ones, thereby **reducing the document length** without losing key information.

The same principle can be extended to **document redaction** in a multi-document scenario. For example, given two documents, we can identify portions of one document that substantially overlap with content in the other document and **redact (remove or mask) those overlapping portions**. Instead of exact keyword matching, semantic embeddings allow us to catch paraphrased or reworded similarities – a crucial feature noted in plagiarism detection research.

In this white paper, we detail the methodology for embedding-based content summarization and show how it can be applied to cross-document redaction. We also provide mathematical formulations, matrix operations, and visual illustrations of the similarity matrices involved.

Method Overview

 

Step 1: Document Chunking

The input document is split into chunks (e.g., sentences or paragraphs). Each chunk is treated as a semantic unit.
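As a minimal sketch (assuming sentence-level chunks and a simple regex-based splitter; a production system might use a proper sentence tokenizer instead), chunking could look like this:

```python
import re

def chunk_document(text):
    """Split a document into sentence-level chunks on ., !, ? boundaries."""
    # Naive splitter: break after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # Drop empty fragments so every chunk is a usable semantic unit.
    return [s for s in sentences if s]

chunks = chunk_document("Cats are mammals. Felines are mammals. Dogs bark.")
# chunks -> ['Cats are mammals.', 'Felines are mammals.', 'Dogs bark.']
```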

 

Step 2: Embedding Generation

Each text chunk is converted into a numerical vector using a pre-trained language model or embedding model. For example, one could use a BERT-based model or other transformer-based embedding models to encode the semantic content of each chunk into a high-dimensional vector. Let $v_i \in \mathbb{R}^d$ denote the embedding vector for the $i$-th chunk. These embeddings capture semantic meaning beyond exact word overlap – e.g., two chunks discussing the same concept with different wording will have similar vectors.
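A minimal sketch using the sentence-transformers library (the model name `all-MiniLM-L6-v2` is just one possible choice; any model that returns fixed-length vectors works the same way):

```python
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Any pre-trained embedding model can be used; this one produces 384-dimensional vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")

# `chunks` is the list of sentence/paragraph chunks from Step 1.
embeddings = model.encode(chunks)   # NumPy array of shape (N, d), one row per chunk
```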

Step 3: Cosine Similarity Matrix 

Self-Redaction

We compute pairwise cosine similarities between all chunk vectors. Each embedding is represented as a list of floats, e.g., [0.12127365, -1.3553627, …], whose length $d$ depends on the model's dimensionality. The cosine similarity between two vectors $v_i, v_j \in \mathbb{R}^d$ is defined as:

$$\cos(v_i, v_j) = \frac{v_i \cdot v_j}{\lVert v_i \rVert \, \lVert v_j \rVert}$$

Computing this for every pair of chunks yields the $N \times N$ similarity matrix $M$, with $M_{ij} = \cos(v_i, v_j)$.
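A sketch of the $N \times N$ similarity matrix in NumPy (the embeddings are row-normalized first, so a single matrix product yields all pairwise cosine similarities):

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Return the N x N matrix of pairwise cosine similarities."""
    V = np.asarray(embeddings, dtype=float)
    # Normalize each row to unit length; dot products then equal cosine similarities.
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return V @ V.T

M = cosine_similarity_matrix(embeddings)   # M[i, j] = cos(v_i, v_j)
```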

 

Redaction by reference

This approach is applied when redacting a target document Q using a reference document R. Both Q and R are first divided into discrete chunks. Cosine similarity is then computed between each chunk of Q and each chunk of R, producing an $m \times n$ similarity matrix, where $m$ and $n$ represent the number of chunks in Q and R, respectively:

$$M_{ij} = \cos(q_i, r_j), \quad 1 \le i \le m, \; 1 \le j \le n$$

where $q_i$ and $r_j$ are the embeddings of the $i$-th chunk of Q and the $j$-th chunk of R.
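The same normalization trick gives the rectangular matrix; here `Q_emb` and `R_emb` are illustrative names for the chunk embeddings of the target and reference documents:

```python
import numpy as np

def cross_similarity_matrix(Q_emb, R_emb):
    """Return the m x n matrix of cosine similarities between Q's and R's chunks."""
    Q = np.asarray(Q_emb, dtype=float)
    R = np.asarray(R_emb, dtype=float)
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    R = R / np.linalg.norm(R, axis=1, keepdims=True)
    return Q @ R.T   # entry [i, j] = cos(q_i, r_j)

M_ref = cross_similarity_matrix(Q_emb, R_emb)   # shape: (m, n)
```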

Step 4: Masking Self-Similarity

In the self-redaction case the similarity matrix is square and every diagonal entry equals 1 ($M_{ii} = 1$), since each chunk is perfectly similar to itself. These diagonal entries are masked so that a chunk's similarity to itself does not affect the scores computed in the next step. This condition does not arise when performing redaction against an external reference document, where the matrix is rectangular and compares chunks from two different texts.
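A one-line sketch of the masking step, continuing the variables above (the diagonal is zeroed so self-similarity cannot inflate later scores; this only applies to the square, self-redaction matrix):

```python
import numpy as np

M_masked = M.copy()
np.fill_diagonal(M_masked, 0.0)   # ignore M[i][i] = 1 when scoring chunks
```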

Step 5: Significance Score

Self-Redaction

Given a similarity matrix $M \in \mathbb{R}^{N \times N}$, where $M_{ij}$ is the cosine similarity between chunk $i$ and chunk $j$, and $M_{ii} = 1$ (perfect similarity with itself), we define the significance score for chunk $i$, denoted $s_i$, as its average similarity to all other chunks:

$$s_i = \frac{1}{N - 1} \sum_{j \ne i} M_{ij}$$
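With the diagonal already masked to zero, this reduces to a row average (a minimal sketch continuing the variables above):

```python
N = M_masked.shape[0]
# Average similarity of chunk i to every other chunk (self-similarity excluded).
significance = M_masked.sum(axis=1) / (N - 1)   # significance[i] = s_i
```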

Redaction by reference

When redacting against a reference document, the score for chunk $i$ of Q is its highest similarity to any chunk of R:

$$s_i = \max_{1 \le j \le n} M_{ij}$$
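In code this is a row-wise maximum over the rectangular matrix (sketch; `M_ref` is the m × n matrix built earlier):

```python
# Highest similarity of each chunk of Q to any chunk of the reference document R.
overlap = M_ref.max(axis=1)   # overlap[i] = s_i = max_j M_ref[i, j]
```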

 

Step 6: Redundant Cluster Detection

Define a similarity threshold $\theta$. Chunks whose pairwise similarity meets or exceeds $\theta$ are linked, and these links are used to build connected components of similar chunks.

Self-Redaction

Chunk pairs with $M_{ij} \ge \theta$ (for $i \ne j$) are treated as edges of a graph; the connected components of this graph form clusters of mutually redundant chunks.

Redaction by reference

A chunk $i$ of Q is flagged as overlapping whenever $M_{ij} \ge \theta$ for some chunk $j$ of R, i.e., whenever it substantially overlaps with any chunk of the reference document.
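A minimal connected-components sketch over the thresholded similarity graph (pure Python; a library routine such as scipy.sparse.csgraph.connected_components could be used instead; the threshold value 0.8 is only illustrative):

```python
def similarity_clusters(M_masked, theta):
    """Group chunk indices into connected components of the graph where
    an edge links i and j whenever M_masked[i, j] >= theta."""
    n = M_masked.shape[0]
    visited, clusters = set(), []
    for start in range(n):
        if start in visited:
            continue
        # Traverse from `start` to collect its whole component.
        component, queue = [], [start]
        visited.add(start)
        while queue:
            i = queue.pop()
            component.append(i)
            for j in range(n):
                if j not in visited and M_masked[i, j] >= theta:
                    visited.add(j)
                    queue.append(j)
        clusters.append(component)
    return clusters

clusters = similarity_clusters(M_masked, theta=0.8)   # hypothetical output, e.g. [[0, 1], [2, 3], [4]]
```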


 

Step 7: Chunk Selection

For each cluster of similar chunks identified in the previous step, we retain only the most significant chunk and mark the rest for removal. The "most significant" chunk can be defined as the one with the highest significance score $s_i$ (average similarity to the others), since it is the most representative of the cluster's content. By keeping the chunk with the highest $s_i$ and discarding the other chunks in that cluster, we preserve the information content (because the kept chunk is highly similar to the ones removed) while eliminating repetition. In our example, for the cluster {Chunk 0, Chunk 1}, suppose $s_0 > s_1$; then Chunk 0 is kept and Chunk 1 is removed. Similarly, in {Chunk 2, Chunk 3}, if $s_2 > s_3$, keep Chunk 2 and remove Chunk 3. This step yields the set of redundant chunks to drop. A chunk that was not part of any high-similarity pair is inherently unique and is retained.

Step 8: Summary Reconstruction

The remaining chunks are concatenated, preserving their original order, to form the final summary.
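Steps 7 and 8 can be sketched together, continuing the variables from the earlier snippets: within each cluster keep only the chunk with the highest significance score, then rebuild the summary in document order.

```python
def summarize(chunks, clusters, significance):
    """Keep the most significant chunk of each cluster and rejoin in document order."""
    keep = set()
    for cluster in clusters:
        # Retain the chunk whose average similarity to the others is highest.
        best = max(cluster, key=lambda i: significance[i])
        keep.add(best)
    # Singleton clusters (unique chunks) are kept automatically by the loop above.
    return " ".join(chunks[i] for i in sorted(keep))

summary = summarize(chunks, clusters, significance)
```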

Mathematical Formulation

- **Matrix product for similarity**: with row-normalized embedding matrices, all pairwise cosine similarities are obtained in a single matrix product.
  - For a single document, with $V \in \mathbb{R}^{N \times d}$ holding the unit-normalized chunk embeddings: $M = V V^{\top}$
  - For redaction by reference, with normalized $Q \in \mathbb{R}^{m \times d}$ and $R \in \mathbb{R}^{n \times d}$: $M = Q R^{\top}$

Application to Summarization

- Keeps most representative chunk per idea
- Removes redundant content
- Retains unique, informative content

Application to Cross-Document Redaction

- Compares a target document Q against a reference document R
- Removes a chunk of Q if its similarity to any chunk of R exceeds the threshold $\theta$
- Uses the rectangular $m \times n$ similarity matrix
- Can detect paraphrased overlaps (a minimal end-to-end sketch follows below)
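Putting the cross-document pieces together, a minimal sketch of reference-based redaction (the `embed` argument stands in for whatever embedding model is used, e.g. `model.encode` from Step 2; `theta` is the same overlap threshold as before):

```python
import numpy as np

def redact_by_reference(target_chunks, reference_chunks, embed, theta=0.8):
    """Drop every chunk of the target document whose cosine similarity to
    any chunk of the reference document reaches the threshold theta."""
    Q = np.asarray(embed(target_chunks), dtype=float)
    R = np.asarray(embed(reference_chunks), dtype=float)
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    R = R / np.linalg.norm(R, axis=1, keepdims=True)
    M = Q @ R.T                      # m x n similarity matrix
    overlap = M.max(axis=1)          # highest similarity to any reference chunk
    kept = [c for c, s in zip(target_chunks, overlap) if s < theta]
    return " ".join(kept)

# Example usage (chunks_Q and chunks_R are the chunked target and reference documents):
# redacted = redact_by_reference(chunks_Q, chunks_R, model.encode, theta=0.8)
```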

Comparison to Other Methods

- Compared with LexRank, TextRank, MMR
- More robust due to embedding-based semantic comparison
- Handles paraphrasing better than TF-IDF-based models

Conclusion

This white paper presents a principled embedding-based approach for both document summarization and redaction using cosine similarity. By constructing and analyzing similarity matrices (square or rectangular), we can efficiently reduce redundancy or filter overlap between texts. The technique offers clear advantages in flexibility, semantic fidelity, and interpretability, and is supported by visualizations and matrix math.

Figure 1: Cosine Similarity Matrix (N × N)

Figure 2: Cosine Similarity Matrix (m × n)
