You have probably noticed that popular Internet search engines do not remove a web page from their index immediately; it may remain in the index for many months after the page has been removed from the Net. You can easily do the same thing with ESP; the approach depends on the connector used to feed content.

Enterprise Crawler

With the Enterprise Crawler, the easiest way is to configure the HTTP Response Code and Errors section for your collection. This section allows you to define the crawler's store and feed behavior based on the HTTP response code or other error message received. For example, if a site is crawled once a week and you want documents removed from the index 30 days after they are removed from the site, enter 404 (the HTTP "page not found" response) in the Error box and DELETE:4 in the Value box, then click the right arrow. In this example it could actually take up to 35 days (the original try plus 4 retries, times the 7-day crawl cycle) before the document is removed from the index. See the FAST Enterprise Crawler: Crawl Guide in the ESP documentation for more details.
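If it helps to see the retry arithmetic spelled out, here is a tiny illustrative sketch in Python. It is not an ESP API, just the calculation behind the DELETE:n value described above.

    # Purely illustrative; models the arithmetic behind DELETE:n, not an ESP API.
    def worst_case_removal_days(crawl_interval_days, retries):
        """Worst-case delay before a missing page is dropped from the index.

        The crawler sees the 404 on one cycle (the original try) and then
        retries the configured number of times, once per cycle, before it
        issues the delete.
        """
        return (1 + retries) * crawl_interval_days

    # DELETE:4 with a weekly crawl cycle -> up to 35 days, as in the example.
    print(worst_case_removal_days(crawl_interval_days=7, retries=4))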

File Traverser

It does take a little work to automate delayed deleting with the File Traverser, but it can be done using File Traverser commands and a few command line tools, such as diff. Here is the basic procedure, assuming we are feeding HTML pages from a web server (c:\webroot) into a collection called webdocuments and removing documents 30 days after they disappear from the directory (a script that ties these steps together is sketched after the list):

  1. Delete from the index all documents whose files were removed 30 days ago. (filetraverser -r c:/webroot -c webdocuments -u delete_file_<today>.txt) This delete file was created 30 days ago (see step 4).
  2. Run the File Traverser with the -M switch to pick up all files that have changed or been added since the last feed. (filetraverser -r c:/webroot -c webdocuments -s html -s htm -M)
  3. Generate a listing of all files currently in the index but no longer in the directory by running the File Traverser in report mode. (filetraverser -r c:/webroot -c webdocuments -R -o output_file_<today>.txt -K)
  4. Clean up the report before using it for deletes. The file output_file_<today>.txt lists all documents removed from the directory during the previous 30 days. By diffing output_file_<today>.txt against output_file_<today-1>.txt you can generate a list of files removed from the directory since the previous feed; call this file delete_file_<today+30>.txt. There will still be some extraneous lines in the file that can be removed. If you choose not to remove them you may get some errors during feeding, but they can be ignored.
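One way to tie these steps together is a small script run once per feed cycle. The sketch below is a Python version of the procedure; the file-name pattern, date format, and the assumption that the report file lists one document reference per line are all illustrative, and the only filetraverser switches used are the ones shown in the steps above.

    # Sketch of the delayed-delete procedure above, run once per feed cycle.
    # Assumptions: the report file lists one document reference per line, and
    # the file-name pattern and date format are illustrative only.
    import os
    import subprocess
    from datetime import date, timedelta

    ROOT = "c:/webroot"
    COLLECTION = "webdocuments"

    def fname(prefix, day):
        return "%s_%s.txt" % (prefix, day.isoformat())

    today = date.today()
    yesterday = today - timedelta(days=1)

    # Step 1: delete documents listed in the delete file written 30 days ago
    # (it was named for today's date when it was created in step 4).
    delete_file = fname("delete_file", today)
    if os.path.exists(delete_file):
        subprocess.call(["filetraverser", "-r", ROOT, "-c", COLLECTION,
                         "-u", delete_file])

    # Step 2: feed files added or changed since the last run.
    subprocess.call(["filetraverser", "-r", ROOT, "-c", COLLECTION,
                     "-s", "html", "-s", "htm", "-M"])

    # Step 3: report documents still in the index but gone from the directory.
    report = fname("output_file", today)
    subprocess.call(["filetraverser", "-r", ROOT, "-c", COLLECTION,
                     "-R", "-o", report, "-K"])

    # Step 4: a set difference against yesterday's report gives the files
    # removed since the previous feed; schedule them for deletion in 30 days.
    with open(report) as f:
        current = set(f.read().splitlines())
    previous = set()
    old_report = fname("output_file", yesterday)
    if os.path.exists(old_report):
        with open(old_report) as f:
            previous = set(f.read().splitlines())

    with open(fname("delete_file", today + timedelta(days=30)), "w") as out:
        for line in sorted(current - previous):
            out.write(line + "\n")

On Windows this could just as easily be a batch file that calls diff, as in the original procedure; the set difference here simply avoids the extraneous diff output mentioned in step 4.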

For more information see the FAST Enterprise Search Platform: File Traverser Guide in the ESP documentation. Note: the above procedure is not in the guide.

Advanced Topics

Content Transformation Services (CTS) is an integral part of FAST Search for Internet Sites. CTS can be used in more advanced scenarios for deleting documents. The actual flow used will vary depending on your business case, but the basics are:

  1. Feed content to CTS, including delete requests.
  2. When a connector feeds a document, it includes not only the content but also an operation telling ESP what to do with the document (e.g. add, update, or delete). Intercept the delete requests and decide whether the document should be deleted immediately or queued up to be deleted later. The queue could be as simple as a file containing the document's contentID, collection name, and delete date.
  3. Create another flow that is run manually on a periodic schedule. This flow reads the queue file, determines which documents should be deleted that day, and forwards the delete requests to ESP (see the sketch after this list).
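As a rough illustration of the queue idea, here is a minimal sketch in Python. Everything in it is an assumption about how you might structure the queue file; in particular, send_delete() is only a placeholder for whatever CTS operator or content API call actually forwards the delete request to ESP.

    # Minimal sketch of a delayed-delete queue: one tab-separated record per
    # line (contentID, collection, date the delete becomes due). The queue
    # file name and send_delete() are placeholders, not part of CTS or ESP.
    from datetime import date, datetime, timedelta

    QUEUE_FILE = "delete_queue.txt"  # illustrative name

    def queue_delete(content_id, collection, delay_days=30):
        """Called when a delete request is intercepted in the flow."""
        due = date.today() + timedelta(days=delay_days)
        with open(QUEUE_FILE, "a") as f:
            f.write("%s\t%s\t%s\n" % (content_id, collection, due.isoformat()))

    def send_delete(content_id, collection):
        # Placeholder: forward the delete request to ESP here.
        print("delete %s from %s" % (content_id, collection))

    def process_due_deletes(today=None):
        """Run periodically: forward deletes that are due, keep the rest queued."""
        today = today or date.today()
        try:
            with open(QUEUE_FILE) as f:
                lines = f.read().splitlines()
        except IOError:
            return
        remaining = []
        for line in lines:
            content_id, collection, due = line.split("\t")
            if datetime.strptime(due, "%Y-%m-%d").date() <= today:
                send_delete(content_id, collection)
            else:
                remaining.append(line)
        with open(QUEUE_FILE, "w") as f:
            for line in remaining:
                f.write(line + "\n")

The first function belongs in the flow that intercepts delete requests (step 2); the second function is what the periodic flow in step 3 would run.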

Related Topics
For more information about document feeding and index backup strategies, you should attend 10804 Microsoft FAST Search for Internet Sites for Application Developers. The class answers the following questions related to content feeding:

  • How do you feed basic content with the Enterprise Crawler, File Traverser, and JDBC Connector for Databases?
  • How do you feed and process content using Content Transformation Services and Document Processing?
  • What other options are there for feeding strategies?

To register online visit www.fastuniversity.com or contact an Education Consultant for assistance at fastuniv@microsoft.com.

By Brian Barry