Posts

Embedding-Based Summarization and Document Redaction

Introduction: Large text documents often contain repetitive or redundant information that can be removed to produce a more concise summary. In machine learning and natural language processing, one effective way to identify such redundancy is to use **vector embeddings** and **cosine similarity**. By breaking a document into smaller chunks (sentences or paragraphs) and representing each chunk as a high-dimensional embedding vector, we can quantitatively measure the semantic similarity between any two chunks. High similarity scores indicate overlapping or repetitive content. This forms the basis of an automated summarization technique: detect clusters of similar chunks and remove the less important ones, thereby **reducing the document length** without losing key information. The same principle can be extended to **document redaction** in a multi-document scenario. For example, given two documents, we can identify portions of one document that substantially overlap with c...
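As a rough illustration of the chunk-similarity idea described in this excerpt, the following Python sketch scores chunks against each other with cosine similarity and drops near-duplicates. It is only a sketch under stated assumptions: a plain word-count vector stands in for a real embedding model, the first chunk of each similar group is the one kept, and the names `embed`, `cosine_similarity`, `drop_redundant` and the 0.8 threshold are hypothetical choices rather than anything taken from the post.

```python
import math
import re
from collections import Counter


def embed(chunk):
    """Stand-in embedding (assumption): a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", chunk.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_redundant(chunks, threshold=0.8):
    """Keep a chunk only if it is not too similar to any chunk already kept."""
    kept, kept_vectors = [], []
    for chunk in chunks:
        vector = embed(chunk)
        if all(cosine_similarity(vector, other) < threshold for other in kept_vectors):
            kept.append(chunk)
            kept_vectors.append(vector)
    return kept


chunks = [
    "The cat sat on the mat.",
    "The cat sat on the mat today.",
    "Vector embeddings capture semantic meaning.",
]
summary = drop_redundant(chunks)
```

With a real embedding model, only `embed` and the vector arithmetic would need to change; the keep-or-drop loop stays the same. The threshold would have to be tuned for the model and the documents at hand.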

Unordered JSON compare for differences using JavaScript

Sometimes we need to programmatically compare JSON from two or more sources and determine the differences between them. The program below performs a deep comparison of JSON objects even if their keys are in a different order or they contain substructures and arrays. It produces a detailed list of mismatches and misses in both JSON objects by calling out attributes that have the same key but different values, as well as missing keys. const obj1={ "k1":"aq", "k2":"b", "k3":"c", "k4":{ "kk1":"aq", "kk2":"b", "kk3":"c", "kk4":{ "kkk1":"" }, "kk5":"abc" }, "k5":[{"kk5":"abc","kk6":"53"},{"kk5":"abc","kk6":"54"}] } const obj2={ "k1":"a", "k2...

HashiCorp Vault Integration with Ansible Etower using approle

HashiCorp Vault is a secrets management tool specifically designed to control access to sensitive credentials in a low-trust environment. It can be used to store sensitive values and, at the same time, dynamically generate access for specific services/applications on lease. Integrating Vault with Ansible Etower provides robust and secure automation. Following is the step-by-step guide for the integration. Enable the key-value secrets engine (also known as the "kv" engine) in HashiCorp Vault. Let's call the engine "kv". Create a secret inside "kv". A secret can be a collection of key-value pairs or a JSON document for nested structures. Let's assume that secrets are stored as JSON in the format { "my_app":{ "service_account_name": "some_service", "service_account_password": "some_password" } } Create a secret policy defining what can be done with the above-defined secret. Create an...

Ansible variable inside variable.

Let's assume there is a JSON document with the following structure: { "dev": { "app_1": { "key": "element", "value": "somevalue" }, "app_2": { "key": "element", "value": "somevalue" } }, "test": { "app_1": { "key": "element", "value": "somevalue" }, "app_2": { "key": "element", "value": "somevalue" } }, "production": { "app_1": { "key": "element", "value": "somevalue" }, "app_2": { "key": ...

Java Currency Formatter Changing $ to ¤

A Java currency formatting problem: the following code shows the conditions under which the currency symbol is replaced by the generic currency sign (¤). import java.text.NumberFormat; import java.util.Locale; public class Test { public static void main(String[] args){ double num = 1323.526; Locale eng = new Locale("", ""); NumberFormat engFormat = NumberFormat.getCurrencyInstance(eng); System.out.println(engFormat.format(num)); eng = new Locale("en", ""); engFormat = NumberFormat.getCurrencyInstance(eng); System.out.println(engFormat.format(num)); eng = new Locale("en", "en_US"); engFormat = NumberFormat.getCurrencyInstance(eng); System.out.println(engFormat.format(num)); eng = new Locale("", "US"); engFormat = NumberFormat.getCurrencyInstance(eng); System.out.println(engFormat.format(num)); eng = new Locale(...

Delete horizontal, vertical and angled lines from an image using Python to clear noise and read text with minimum errors

I was working on an algorithm to detect lines in an image. I found many solutions online, but none of them worked to my satisfaction, so I decided to write my own. This algorithm can detect horizontal, vertical and angled lines. Following is the pseudo code of the algorithm. The implementation is in Python and it uses third-party code. pix_array = Convert the image into a 2D pixel array. Convert the pixel array to monochrome by setting values between 0-200 to 0 (black) and values greater than 200 to 255 (white). image_width = length of pix_array[0] image_height = length of pix_array Loop through pix_array, i.e. 1 pixel high horizontal slice of image: h = x coordinate of current pixel k = y coordinate of current pixel choose a value of radius as r draw a circle taking h,k as center using equation (X-h)^2 + (Y-k)^2 = r^2 derive values of X and Y and check if pix_array[Y][X] is black (0) if black add them to possible line c...
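The first steps of the pseudo code above (thresholding to black and white, then checking which points on a circle around a pixel are black) could look roughly like the Python sketch below. It covers only the part of the pseudo code visible in this excerpt and is not the author's implementation. The function names `binarize`, `circle_points` and `black_points_on_circle` are hypothetical, the image is a plain nested list rather than a loaded file, and the array is assumed to be indexed row-first (`pix_array[y][x]`), which follows from the width and height definitions.

```python
import math


def binarize(pix_array, threshold=200):
    """Map values 0..threshold to 0 (black) and anything above to 255 (white)."""
    return [[0 if value <= threshold else 255 for value in row] for row in pix_array]


def circle_points(h, k, r):
    """Integer points (X, Y) approximately on (X-h)^2 + (Y-k)^2 = r^2."""
    points = set()
    for x in range(h - r, h + r + 1):
        # Solve the circle equation for Y at this X and round to a pixel.
        dy = round(math.sqrt(r * r - (x - h) ** 2))
        points.add((x, k + dy))
        points.add((x, k - dy))
    return points


def black_points_on_circle(pix_array, h, k, r):
    """Circle points that fall inside the image and are black (0)."""
    height, width = len(pix_array), len(pix_array[0])
    return sorted(
        (x, y)
        for x, y in circle_points(h, k, r)
        if 0 <= x < width and 0 <= y < height and pix_array[y][x] == 0
    )


# Tiny 5x5 grayscale image with a dark horizontal line on row 2.
image = [
    [255, 255, 255, 255, 255],
    [255, 255, 255, 255, 255],
    [10, 20, 30, 40, 50],
    [255, 255, 255, 255, 255],
    [255, 255, 255, 255, 255],
]
mono = binarize(image)
candidates = black_points_on_circle(mono, h=2, k=2, r=2)
```

Here the circle around the centre pixel of the dark row meets black pixels only at the two ends of that row, which is the kind of evidence the pseudo code collects as a possible line.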