IBM SPSS Modeler 15 User's Guide

Performance Considerations for Streams and Nodes

The following operations cannot be performed in most databases. They should be placed in the
stream after the operations in the preceding list:

Operations on any nondatabase data, such as flat files

Merge by order

Balance

Distinct operations in discard mode or where only a subset of fields is selected as distinct

Any operation that requires accessing data from records other than the one being processed

State and count field derivations

History node operations

Operations involving “@” (time-series) functions

Type-checking modes Warn and Abort

Model construction, application, and analysis

Note: Decision trees, rulesets, linear regression, and factor-generated models can generate
SQL and can therefore be pushed back to the database.

Data output to anywhere other than the same database that is processing the data
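
The ordering advice above can be illustrated with a small sketch. It is not part of IBM SPSS Modeler: the operation labels and the split_stream function are hypothetical, and the sketch assumes a simplified model in which database processing stops at the first operation that cannot be pushed back.

```python
from itertools import takewhile

# Hypothetical labels for operations that, per the list above,
# cannot be performed in most databases.
NOT_PUSHABLE = {"flat_file_read", "merge_by_order", "balance", "history", "model_build"}


def split_stream(operations):
    """Split an ordered list of operations into (in_database, in_modeler).

    Simplifying assumption: everything from the first non-pushable
    operation onward is processed outside the database.
    """
    in_database = list(takewhile(lambda op: op not in NOT_PUSHABLE, operations))
    in_modeler = operations[len(in_database):]
    return in_database, in_modeler


# A sort placed after Balance is no longer handled by the database.
in_db, in_modeler = split_stream(["select", "aggregate", "balance", "sort"])
```

Under this assumption, moving the sort ahead of the Balance operation would let the database handle it, which is the reason for placing the operations in this list as late in the stream as possible.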

Node Caches

To optimize stream running, you can set up a cache on any nonterminal node. When you set up a
cache on a node, the cache is filled with the data that passes through the node the next time you
run the data stream. From then on, the data is read from the cache (which is stored on disk in a
temporary directory) rather than from the data source.

Caching is most useful following a time-consuming operation such as a sort, merge, or
aggregation. For example, suppose that you have a source node set to read sales data from a
database and an Aggregate node that summarizes sales by location. You can set up a cache on the
Aggregate node rather than on the source node because you want the cache to store the aggregated
data rather than the entire data set.
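
The sales example can be sketched in plain Python to show the idea. This is only an illustration of the fill-once, read-afterwards behavior, not how IBM SPSS Modeler implements caching. The CachedNode class, the aggregate_sales function, the file name node_cache.json, and the sample data are all hypothetical.

```python
import json
import os
import tempfile
from collections import defaultdict


def aggregate_sales(rows):
    """Sum sales per location (a stand-in for an Aggregate node)."""
    totals = defaultdict(float)
    for location, amount in rows:
        totals[location] += amount
    return dict(totals)


class CachedNode:
    """Hypothetical wrapper: fills a disk cache on the first run, reads it afterwards."""

    def __init__(self, compute, cache_dir):
        self.compute = compute
        self.cache_path = os.path.join(cache_dir, "node_cache.json")
        self.compute_runs = 0  # counts how often the expensive step actually ran

    def run(self, rows):
        # Later runs read the stored result instead of recomputing it.
        if os.path.exists(self.cache_path):
            with open(self.cache_path) as f:
                return json.load(f)
        # First run: do the expensive work and fill the cache.
        self.compute_runs += 1
        result = self.compute(rows)
        with open(self.cache_path, "w") as f:
            json.dump(result, f)
        return result


sales = [("North", 120.0), ("South", 80.0), ("North", 30.0)]
with tempfile.TemporaryDirectory() as tmp:
    node = CachedNode(aggregate_sales, tmp)
    first = node.run(sales)   # computes the totals and writes the cache
    second = node.run(sales)  # served from the cache file
```

Because the wrapper sits on the aggregation step, the cache file holds one total per location rather than every sales record, which mirrors the reason for caching at the Aggregate node instead of the source node.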

Note: Caching at source nodes, which simply stores a copy of the original data as it is read into
IBM® SPSS® Modeler, will not improve performance in most circumstances.

Nodes with caching enabled are displayed with a small document icon in the upper right corner.
When the data is cached at the node, the document icon is green.