Macromedia ColdFusion MX 7 Manual

Chapter 9:  Indexing Collections with Verity Spider
-norobo
Type: Web crawling only
Specifies that Verity Spider ignores any robots.txt files it encounters. Many websites use a robots.txt file to 
specify which parts of the site indexers should avoid. The default is to honor any robots.txt files.
If you are re-indexing a site and the robots.txt file has changed, Verity Spider deletes documents 
that have been newly disallowed by the robots.txt file.
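As an illustration of what honoring a robots.txt file means for a crawler, the following is a minimal Python sketch built on the standard library's urllib.robotparser module. It is not Verity Spider code: the robots.txt content, the example.com URLs, the agent name, and the allowed_urls helper are all hypothetical. The honor_robots=False case only approximates the effect of -norobo.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(urls, robots_txt, honor_robots=True, agent="example-spider"):
    """Return the URLs a crawler would keep (hypothetical helper)."""
    if not honor_robots:
        # Approximates -norobo: the robots.txt rules are not consulted at all.
        return list(urls)
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(agent, url)]


urls = [
    "http://www.example.com/index.html",
    "http://www.example.com/private/report.html",
]

print(allowed_urls(urls, ROBOTS_TXT))                      # default: honor robots.txt
print(allowed_urls(urls, ROBOTS_TXT, honor_robots=False))  # rules ignored
```

With the rules honored, the URL under /private/ is dropped; with the rules ignored, both URLs are kept.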
Use this option with discretion and extreme care, especially in conjunction with the 
option.
See also 
-pathlen
Syntax
-pathlen num_pathsegments
Limits indexing to the specified number of path segments in the URL or file system path. The 
path length is determined as follows:
The host name and drive letter are not included; for example, neither www.spider.com:80/ nor 
C:\ would be included in determining the path length. 
All elements following the host name are included. 
The actual filename, if present, is included; for example, /world.html would be included in 
determining the path length. 
Any directory paths between the host and the actual filename are included. 
Example
For the following URL, the path length would be four: 
http://www.spider:80/comics/fun/funny/world.html
                     <-1--> <2> <-3-> <---4---->
For the following file system path, the path length would be three:
C:\files\docs\datasheets
   <-1-> <-2> <---3---->
The default value is 100 path segments.
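The counting rules above can be sketched in Python as follows. This is an illustrative approximation rather than Verity Spider's own logic: the path_length function is hypothetical, and it assumes that URLs carry a scheme such as http:// and that file system paths begin with a Windows drive letter.

```python
import re
from urllib.parse import urlsplit


def path_length(location):
    """Count path segments, excluding the host name or drive letter.

    Hypothetical helper that mirrors the rules described in the text.
    """
    if "://" in location:
        # URL: keep only the part that follows the host name and port.
        path = urlsplit(location).path
    else:
        # File system path: drop a leading drive letter such as "C:".
        path = re.sub(r"^[A-Za-z]:", "", location)
    # Treat both slash styles as separators and ignore empty segments.
    return len([segment for segment in re.split(r"[\\/]+", path) if segment])


print(path_length("http://www.spider:80/comics/fun/funny/world.html"))  # 4
print(path_length(r"C:\files\docs\datasheets"))                         # 3
```

The two calls reproduce the counts given in the examples: four segments for the URL and three for the file system path.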
-refreshtime
Syntax
-refreshtime timeunits
Specifies not to refresh any documents that have been indexed within the period given by the timeunits value; that is, documents indexed more recently than timeunits ago are skipped.
The following is the syntax for timeunits: 
n day n hour n min n sec
Where n is a positive integer. You must include the spaces. Because only the first three letters of each time 
unit are parsed, you can use either the singular or the plural form of the word.
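To make the timeunits format concrete, the following Python sketch parses such a string into a duration. It is an assumption-laden illustration, not Verity Spider's parser: the parse_timeunits function is hypothetical, and it assumes that any subset of the four units may be given, that unit names are case-insensitive, and that a repeated unit is summed.

```python
from datetime import timedelta

# The first three letters of each unit, mapped to timedelta keyword arguments.
_UNIT_KEYS = {"day": "days", "hou": "hours", "min": "minutes", "sec": "seconds"}


def parse_timeunits(text):
    """Parse a string such as "1 day 6 hours" into a timedelta (hypothetical helper)."""
    tokens = text.split()
    if not tokens or len(tokens) % 2:
        raise ValueError("expected space-separated 'n unit' pairs: %r" % text)
    parts = {}
    for number, unit in zip(tokens[0::2], tokens[1::2]):
        # Only the first three letters of the unit are inspected,
        # so "hour" and "hours" are treated the same way.
        key = _UNIT_KEYS.get(unit[:3].lower())
        if key is None or not number.isdecimal() or int(number) <= 0:
            raise ValueError("invalid time unit pair: %r %r" % (number, unit))
        parts[key] = parts.get(key, 0) + int(number)
    return timedelta(**parts)


print(parse_timeunits("1 day 6 hours"))
print(parse_timeunits("30 min") == parse_timeunits("30 minutes"))
```

A value written without the required spaces, such as "1day", is rejected by this sketch with a ValueError.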