In 2007, the Yellow Pages Association (YPA) presented SageKey with a daunting task: take the PDF pages from a
published Yellow Pages directory and, using fast, cost-effective methods, find every heading on each page in the book
and output each heading with its page number in YPA's standardized format.
There was little tolerance for error, and the system had to handle up to four thousand Yellow Pages directories per
year.
The Solution
Using state-of-the-art data extraction techniques and pattern recognition algorithms, SageKey developed software that, with
some help from a trained operator, could inspect a PDF file, recognize the patterns of color, space and images and discern
which text was a heading. The process is estimated to be 95% automated.
Because each publisher, and in many cases each directory within a publisher, has its own colors, layouts and quirks, the
system had to be powerful yet flexible enough to handle fonts, colors and layouts that change from directory
to directory.
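As a rough illustration only, the idea of telling headings apart by color, spacing and font, with settings that vary per directory, can be sketched as below. This is not SageKey's actual implementation. The names TextSpan, DirectoryProfile and find_headings are hypothetical, the thresholds are invented, and the sketch assumes the text spans have already been extracted from the PDF by some other tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextSpan:
    """One run of text already extracted from a PDF page (hypothetical structure)."""
    text: str
    page: int
    font_size: float    # in points
    color: str          # hex RGB, e.g. "#ffffff"
    space_above: float  # blank space above the span, in points


@dataclass(frozen=True)
class DirectoryProfile:
    """Per-directory settings, assumed to be tuned by a trained operator."""
    heading_colors: frozenset  # lowercase hex colors used for headings
    min_font_size: float
    min_space_above: float


def find_headings(spans, profile):
    """Return (heading, page) pairs for spans that match the directory profile."""
    results = []
    for span in spans:
        looks_like_heading = (
            span.color.lower() in profile.heading_colors
            and span.font_size >= profile.min_font_size
            and span.space_above >= profile.min_space_above
        )
        if looks_like_heading:
            results.append((span.text.strip(), span.page))
    return results


# Example with made-up data: one heading and one ordinary listing on page 12.
sample_spans = [
    TextSpan("Plumbers", 12, 9.0, "#FFFFFF", 4.0),
    TextSpan("Acme Plumbing 555-0100", 12, 6.5, "#000000", 1.0),
]
sample_profile = DirectoryProfile(
    heading_colors=frozenset({"#ffffff"}),
    min_font_size=8.0,
    min_space_above=3.0,
)
headings = find_headings(sample_spans, sample_profile)
```

Keeping the thresholds in a separate profile, rather than in the matching logic, is one way to let the same code handle directories with different colors and layouts.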
The system has been in production since mid-2007 and has processed more than 6,300 Yellow Pages directories.