Utility to extract dates from text with Java


A date pattern recognition algorithm that not only identifies date patterns but also returns the probable date in a Java date format. The algorithm is fast and lightweight: processing time is linear, and all dates are identified in a single pass.
The algorithm resolves dates by tree traversal. Custom tree data structures encode the supported date, time and month patterns.

The following trees are used (note: the ^ sign denotes a complete pattern).

Date Pattern Tree

Month Pattern Tree (Jan and January are both valid, identified using the ^ sign)

Time Pattern Tree (identified as a suffix to the date; the $ sign indicates a space character)
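
As a rough illustration of how a month pattern tree with a completion marker can work, here is a minimal character-trie sketch. It is not the library's actual code: the class MonthTrie, its methods add, withEnglishMonths and longestMatch, and the complete flag (standing in for the ^ sign) are all hypothetical names chosen for this example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a month pattern tree; not taken from the DateParser source.
class MonthTrie {

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean complete; // stands in for the ^ sign: a full pattern ends here
    }

    private final Node root = new Node();

    // Adds one month name, creating one node per character.
    void add(String word) {
        Node node = root;
        for (char c : word.toLowerCase(Locale.ROOT).toCharArray()) {
            node = node.next.computeIfAbsent(c, k -> new Node());
        }
        node.complete = true;
    }

    // Builds a tree holding short and long English month names (assumed list).
    static MonthTrie withEnglishMonths() {
        MonthTrie trie = new MonthTrie();
        String[] names = {
            "jan", "january", "feb", "february", "mar", "march", "apr", "april",
            "may", "jun", "june", "jul", "july", "aug", "august",
            "sep", "september", "oct", "october", "nov", "november", "dec", "december"
        };
        for (String name : names) {
            trie.add(name);
        }
        return trie;
    }

    // Walks the tree from text[from] and returns the length of the longest
    // complete month name found there, or 0 if none matches.
    int longestMatch(String text, int from) {
        Node node = root;
        int best = 0;
        for (int i = from; i < text.length(); i++) {
            node = node.next.get(Character.toLowerCase(text.charAt(i)));
            if (node == null) {
                break; // no branch for this character
            }
            if (node.complete) {
                best = i - from + 1; // remember the longest complete pattern so far
            }
        }
        return best;
    }

    public static void main(String[] args) {
        MonthTrie trie = withEnglishMonths();
        System.out.println(trie.longestMatch("January-10", 0)); // 7
        System.out.println(trie.longestMatch("Jan 10", 0));     // 3
        System.out.println(trie.longestMatch("Jxn", 0));        // 0
    }
}
```

Each character is visited once, so the walk stays linear in the length of the text, and both Jan and January are accepted because both end on a node marked complete.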

These tree structures are used in the following algorithms.

Date Part Identification algorithm flowchart

Time part identification algorithm flowchart

The flowcharts are largely self-explanatory. The --> sign denotes the next match in the respective tree structure.

The algorithm also tolerates multiple space characters between date literals. E.g. DD DD DD and DD     DD DD are both considered valid dates.


The following date patterns are considered valid and are identifiable by this algorithm.

dd MM(MM) yy(yy)
yy(yy) MM(MM) dd
MM(MM) dd yy(yy)

Where M is a month literal in alphabetic format, like Jan or January

Allowed delimiters between date parts are '/', '\', ' ', ',', '|', '-'
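
To make the delimiter handling concrete, here is a small single-pass sketch that splits a candidate string on the delimiters listed above. It is an illustration only, not the library's code: the class DelimiterScanner and its split method are hypothetical, and treating every run of delimiters (not just spaces) as one separator is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not taken from the DateParser source.
class DelimiterScanner {

    // The delimiters listed in the text: / \ space , | -
    private static final String DELIMITERS = "/\\ ,|-";

    // Splits a candidate date string into its parts in one pass.
    // Assumption: any run of delimiters counts as a single separator,
    // which also covers repeated spaces between date literals.
    static List<String> split(String candidate) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < candidate.length(); i++) {
            char c = candidate.charAt(i);
            if (DELIMITERS.indexOf(c) >= 0) {
                if (current.length() > 0) {
                    parts.add(current.toString()); // close the part in progress
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            parts.add(current.toString()); // last part has no trailing delimiter
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("10     Jan 2015"));  // [10, Jan, 2015]
        System.out.println(split("2015-January-10"));  // [2015, January, 10]
        System.out.println(split("01/10|15"));         // [01, 10, 15]
    }
}
```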

It also recognizes a trailing time pattern in the following formats
hh(24):mm:ss.SSS am / pm
hh(24):mm:ss am / pm

Resolution time is linear; no pattern matching or brute force is used. The algorithm is based on tree traversal and returns a list of dates, each with the following three components
- the date string identified in the text
- a converted and formatted date string
- a SimpleDateFormat format string

Using the date string and the format string, users are free to convert the string into date objects as their requirements dictate.


The library is available on Maven Central.

        <dependency>
            <groupId>net.rationalminds</groupId>
            <artifactId>DateParser</artifactId>
            <version>1.4</version>
        </dependency>


Sample code using the library is below.

 import java.util.List;
 import net.rationalminds.LocalDateModel;
 import net.rationalminds.Parser;

 public class Test {
   public static void main(String[] args) throws Exception {
     Parser parser = new Parser();
     // parse() returns one LocalDateModel for every date recognized in the text
     List<LocalDateModel> dates = parser.parse("Identified date :'2015-January-10 18:00:01.704', converted");
     System.out.println(dates);
   }
 }


Output:
[LocalDateModel [originalText=2015-january-10 18:00:01.704, dateTimeString=2015-01-10 18:00:01.704, conDateFormat=yyyy-MM-dd HH:mm:ss, identifiedDateFormat=YYYY-MMMMM-DD HH:mm:ss.SSS, start=18, end=45]]
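
As a sketch of the conversion step, the dateTimeString and conDateFormat values from the output above can be handed to SimpleDateFormat. The snippet uses the two strings as plain literals rather than calling any accessor on LocalDateModel, since the accessor names are not shown here. The class DateModelConversion, the method toDate, and the choice of UTC as the time zone are assumptions made for this example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper; the two strings are copied from the output shown above.
class DateModelConversion {

    // Parses dateTimeString with the given SimpleDateFormat pattern.
    // UTC is an assumption; use the zone that fits your data.
    static Date toDate(String dateTimeString, String format) {
        SimpleDateFormat sdf = new SimpleDateFormat(format);
        sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            // parse(String) reads from the start of the text and ignores anything
            // left over, so the ".704" is dropped with a pattern that lacks .SSS
            return sdf.parse(dateTimeString);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse: " + dateTimeString, e);
        }
    }

    public static void main(String[] args) {
        Date date = toDate("2015-01-10 18:00:01.704", "yyyy-MM-dd HH:mm:ss");
        System.out.println(date.getTime()); // 1420912801000
    }
}
```

If the milliseconds matter, the pattern passed in needs to end in .SSS; with the pattern exactly as printed above they are silently discarded.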


The complete source is available on GitHub at https://github.com/vbhavsingh/DateParser

Algorithm flow charts are also available as http://draw.io source files.




Comments

  1. The following String doesn't provide any result, i.e., 0 dates.
    "Scheduled to be cancelled by 2018-09-24T21:14:13.877Z"

  2. I like your idea here. In brief testing, I found that your code is able to parse four year dates, but not two year dates. So, for instance, it finds the date in this string: "The meeting is scheduled for 07/25/2020. See you there."

    But not in this string: "The meeting is scheduled for 07/25/20. See you there."

  3. I was parsing strings that contained dates in multiple formats. This utility saved me time and code. Thanks so much! For anyone considering this, make sure to use the latest version as it will include more formats.
