

Automated Text Summarisation

Published: April 22nd 2014

For a couple of recent projects I've needed the ability to tag articles automatically based on keywords or topics and manipulate articles pulled from a variety of news or content sources. This has generally been for automatic deployment of articles to other systems - so it was important to have as much hands-off control as possible.

Part of the manipulation of article content required the ability to create summaries of the main text - something a little more classy than the usual sub-string manipulation kludge that appears across the Internet. Not wanting any manual intervention here, I looked at the various algorithms available for text summarisation.

Open Text Summarizer

There is a plethora of information out there for those wishing to construct their own summarisation classes - but the method I settled on was to use the wonderful Open Text Summarizer (OTS), which comes as a binary you can install directly on your operating system.

Installing this to your system is exceptionally easy - especially if on a Debian-based system:

sudo apt-get install libots0

This will install the system and provide you with the ability to launch summarisation of text files with the ots command.

The size of the summary is set as a percentage of the original text, and I find the system works well in most cases, whether it is run from a scheduled cron job or triggered by an event.
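As a rough illustration of the percentage option, here is a minimal sketch of a helper that builds the ots command line for a given ratio. The function name buildOtsCommand is made up for this example; the only ots option it relies on is -r, the same one used in the summarising routine further down. It clamps the ratio to 0-100 and quotes the file path so that odd characters in a filename can't break the shell command.

```php
<?php
// Hypothetical helper (not part of ots): build the command line for one file.
function buildOtsCommand(string $path, int $percent): string
{
    // Keep the ratio inside the 0-100 range that a percentage allows.
    $percent = max(0, min(100, $percent));

    // escapeshellarg() quotes the path so spaces or quotes can't break the command.
    return 'ots ' . escapeshellarg($path) . ' -r ' . $percent;
}

// Usage: ask for a 25% summary of one file.
// This only produces output if ots is installed and the file exists.
// echo shell_exec(buildOtsCommand('/path/to/article.txt', 25));
```

Keeping the command-building separate from the shell_exec call makes it easy to log or inspect the exact command before running it from cron.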

Using this system utility I've put together some sub-routines that take incoming article information, store it in a local file for processing, update the database with the summary and then remove the temporary file:

Create the file

First of all, create the temporary files that you need to work through with your cron job:

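/*
* Description:
* Store the article as a text file ready for summarisation
* Parameters:
* $article_id for unique identification of the article and its temp file
* $fullarticle is the content from whatever content-scraper you have
* Returns:
* true if the file was written, false otherwise
*/
public function storeArticleFile($article_id, $fullarticle)
{
  $exportfolder = "/path/to/temp/article/store/";
  $exportfilename = $exportfolder.$article_id.".txt";

  // check the folder exists for the file, creating it if necessary
  if (!is_dir($exportfolder) && !mkdir($exportfolder, 0775, true)) {
    return false;
  }

  // write the whole article in one call; file_put_contents returns false on failure
  return file_put_contents($exportfilename, $fullarticle) !== false;
}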

Summarise and update the DB

Once the files are on the system you can work through them all: summarise each one, update the database and then remove the temporary file.
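The "work through them all" step could be sketched roughly as follows: a cron-run driver that finds every stored .txt file and hands its article ID to a summarising callback. The name summariseStoredArticles, its parameters and the 20% default are assumptions made for this illustration, not part of the original routines.

```php
<?php
// Hypothetical cron-side driver: find every stored article file and pass its
// ID to a summariser callback. Returns the IDs that were handed over.
function summariseStoredArticles(string $storeDir, callable $summarise, int $percent = 20): array
{
    $processed = [];

    // glob() returns the matching paths, or false on error (hence the ?: fallback).
    $files = glob(rtrim($storeDir, '/') . '/*.txt') ?: [];

    foreach ($files as $file) {
        // The article ID is the filename without its .txt extension.
        $articleId = basename($file, '.txt');
        $summarise($articleId, $percent);
        $processed[] = $articleId;
    }

    return $processed;
}

// Usage from a cron-run script, assuming a summariseMyText($article_id, $percentSummarise)
// function like the one in this post is defined:
// summariseStoredArticles('/path/to/temp/article/store', 'summariseMyText', 20);
```

Passing the summariser in as a callback keeps the loop easy to try out with a dummy function before pointing it at the real one.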

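/*
* Description:
* Run text through an open summariser
* Parameters:
* $article_id for unique identification of the article and its temp file
* $percentSummarise for the percentage summarisation - passed as integer values between 0 and 100
*
*/
function summariseMyText($article_id, $percentSummarise)
{
  $articlefilename = "/path/to/temp/article/store/".$article_id.".txt";

  // Summarise the article with ots, quoting the path and forcing the ratio to an integer
  $commandToSummarise = "ots ".escapeshellarg($articlefilename)." -r ".(int)$percentSummarise;

  $summarised = shell_exec($commandToSummarise);

  // shell_exec gives back null or false when the command fails or prints nothing,
  // so keep the temp file for a later retry rather than storing an empty summary
  if (!is_string($summarised) || trim($summarised) === '') {
    return;
  }

  // NOTE: escape or bind these values using whatever your db layer provides;
  // concatenating them as-is breaks on quotes and allows SQL injection
  $sql = "UPDATE article_table
      SET article_summary='".$summarised."'
      WHERE article_id='".$article_id."'
      LIMIT 1";

  db::execute($sql);

  // remove the temporary file from the same path it was stored under
  unlink($articlefilename);
  return;
}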
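Summaries will regularly contain apostrophes and quotes, so the UPDATE is safer with bound parameters than with string concatenation. Here is a minimal sketch using PDO. That is an assumption on my part, since the db::execute wrapper above may offer its own binding mechanism. The saveSummary name is hypothetical; the table and column names are the ones used above. LIMIT 1 is left out on the assumption that article_id is unique.

```php
<?php
// Sketch only: the same UPDATE as above, but with bound parameters via PDO.
// Using PDO here is an assumption; the table and column names come from the post.
function saveSummary(PDO $pdo, string $articleId, string $summary): bool
{
    $stmt = $pdo->prepare(
        'UPDATE article_table SET article_summary = :summary WHERE article_id = :id'
    );

    // PDO takes care of quoting, so apostrophes in the summary are harmless.
    return $stmt->execute([':summary' => trim($summary), ':id' => $articleId]);
}

// Usage (connection details are placeholders):
// $pdo = new PDO('mysql:host=localhost;dbname=example', 'user', 'password');
// saveSummary($pdo, '42', $summarised);
```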

It's not particularly difficult to arrange and it makes a massive difference to the way in which you can display content for various devices. I've used this system to generate summaries for small phone app news feeds as well as for larger article summaries.

Why not try plugging it into your systems and generating more meaningful article summaries without the usual substr and ellipsis you see on sites!

Happy hacking!
gingerCoder()


