For a couple of recent projects I've needed the ability to tag articles automatically based on keywords or topics and manipulate articles pulled from a variety of news or content sources. This has generally been for automatic deployment of articles to other systems - so it was important to have as much hands-off control as possible.
Part of the manipulation of article content required the ability to create summaries of the main text - something a little more classy than the usual sub-string manipulation kludge that appears across the Internet. As I didn't want any manual intervention in this step, I looked at the various algorithms available for text summarisation.
There is a plethora of information out there for those wishing to construct their own summarisation classes - but the method I settled on was to use the wonderful Open Text Summarizer (OTS), which comes as a binary you can install directly on your operating system.
Installing this to your system is exceptionally easy - especially if on a Debian-based system:
This will install the system and provide you with the ability to launch summarisation of text files with the ots command.
Various percentage summarisation options are available, and in my experience the system works well both as a scheduled cron job and for event-based summarisation.
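To make the percentage option concrete, here is a minimal Python sketch of calling the utility from a script. It assumes the `ots` binary is on the PATH and that your build accepts a `--ratio=<int>` option (check `ots --help` to confirm); the function names `build_ots_command` and `summarise_file` are made up for this example.

```python
import subprocess


def build_ots_command(path, ratio=20):
    """Return the argument list for summarising `path` at `ratio` percent."""
    if not 0 < ratio <= 100:
        raise ValueError("ratio must be between 1 and 100")
    # Assumption: the installed ots accepts --ratio=<int>; verify with `ots --help`.
    return ["ots", f"--ratio={ratio}", str(path)]


def summarise_file(path, ratio=20, timeout=30):
    """Run ots on a text file and return whatever it prints to stdout."""
    result = subprocess.run(
        build_ots_command(path, ratio),
        capture_output=True,
        text=True,
        check=True,       # raise CalledProcessError on a non-zero exit code
        timeout=timeout,  # stop waiting if ots hangs on a bad input
    )
    return result.stdout.strip()
```

Something like `summarise_file("article.txt", ratio=10)` would then give a shorter summary for a phone feed, while a larger ratio suits a longer article teaser.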
Using this system utility I've put together some sub-routines that take incoming article information, store it in a local file for processing, update the database content with the summary and then remove the temporary file:
First of all, create the temporary files that you need to work through with your cron job:
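As a rough illustration of this step, a minimal Python sketch might look like the following. The spool directory, the `<id>.txt` naming scheme and the function name `spool_article` are all assumptions made for the example; adapt them to however your articles are identified.

```python
import tempfile
from pathlib import Path


def spool_article(spool_dir, article_id, body):
    """Write one article body to <spool_dir>/<article_id>.txt and return the path."""
    spool = Path(spool_dir)
    spool.mkdir(parents=True, exist_ok=True)
    # int() keeps the file name to a plain number, so an odd id cannot
    # escape the spool directory.
    target = spool / f"{int(article_id)}.txt"
    # Write under a ".part" name first and rename afterwards, so a cron job
    # globbing for *.txt never picks up a half-written file.
    with tempfile.NamedTemporaryFile(
        "w", encoding="utf-8", dir=spool, suffix=".part", delete=False
    ) as handle:
        handle.write(body)
        part_name = handle.name
    Path(part_name).replace(target)
    return target
```

Naming the file after the article id means the later processing step can work out which database row to update from the file name alone.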
Once the files are on the system you can work through them all: summarise each one, update the database and then remove the temporary file.
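The loop described above could be sketched in Python roughly as follows. This is an illustration under stated assumptions rather than a definitive implementation: it uses SQLite, a hypothetical `articles` table with `id` and `summary` columns, files named `<id>.txt`, and a `summarise` callable that you supply (for instance a small wrapper that runs `ots` on the file and returns its output).

```python
import sqlite3
from pathlib import Path


def process_spool(spool_dir, conn: sqlite3.Connection, summarise):
    """Summarise every spooled file, store the result, then delete the file.

    `summarise` is any callable that takes a Path and returns summary text.
    Returns the number of files processed.
    """
    processed = 0
    for path in sorted(Path(spool_dir).glob("*.txt")):
        summary = summarise(path)
        # Assumption: the file stem is the article's numeric id.
        article_id = int(path.stem)
        with conn:  # commits on success, rolls back if the UPDATE fails
            conn.execute(
                "UPDATE articles SET summary = ? WHERE id = ?",
                (summary, article_id),
            )
        # Only reached if summarising and the update both succeeded, so a
        # failed file stays in place for the next cron run.
        path.unlink()
        processed += 1
    return processed
```

Passing the summariser in as a parameter keeps the loop easy to test without the binary installed, and lets you swap in a different ratio per device.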
It's not particularly difficult to arrange and it makes a massive difference to the way in which you can display content for various devices. I've used this system to generate summaries for small phone app news feeds as well as for larger article summaries.
Why not try plugging it into your systems and generating more meaningful article summaries, without the usual sub_str and ellipsis you see on sites?