Cisco Tidal Enterprise Adapter for zLinux Data Sheet
© 2016 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public.
20. Cisco Workload Automation Adapter for Hadoop
Adapters Overview
Cisco Workload Automation Sqoop Adapter Overview
The Cisco Workload Automation (CWA) Sqoop Adapter provides easy import and export of data from structured
data stores such as relational databases and enterprise data warehouses. Sqoop is a tool designed to transfer
data between Hadoop and relational databases. You can use Sqoop to import data from a relational database
management system (RDBMS) into the Hadoop Distributed File System (HDFS), transform the data in Hadoop
MapReduce, and then export the data back into an RDBMS. Sqoop Adapter allows users to automate the tasks
carried out by Sqoop.
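As an illustrative sketch of the kind of Sqoop invocation the adapter can automate (the JDBC URL, credentials, table name, and HDFS paths below are placeholder values, not part of this data sheet):

```shell
# Import the "orders" table from a relational database into HDFS.
# Connection details and names are hypothetical examples.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4
```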
The Sqoop Adapter allows for the definition of the following job tasks:
• Code Generation – This task generates Java classes that encapsulate and interpret imported records. The Java definition of a record is instantiated as part of the import process, but can also be performed separately. If the Java source is lost, it can be recreated using this task. New versions of a class can be created that use different field delimiters or a different package name.
• Export – The export task exports a set of files from HDFS back to an RDBMS. The target table must already exist in the database. The input files are read and parsed into a set of records according to the user-specified delimiters. The default operation is to transform these into a set of INSERT statements that inject the records into the database. In "update mode," Sqoop will generate UPDATE statements that replace existing records in the database.
• Import – The import tool imports structured data from an RDBMS to HDFS. Each row from a table is represented as a separate record in HDFS. Records can be stored as text files (one record per line) or in a binary representation such as Avro or SequenceFiles.
• Merge – The merge tool allows you to combine two datasets, where entries in one dataset overwrite entries of an older dataset. For example, an incremental import run in last-modified mode will generate multiple datasets in HDFS, where successively newer data appears in each dataset. The merge tool will "flatten" two datasets into one, taking the newest available record for each primary key. This can be used with SequenceFile-, Avro-, and text-based incremental imports. The file types of the newer and older datasets must be the same. The merge tool is typically run after an incremental import in last-modified mode.
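The remaining tasks correspond to Sqoop's codegen, export, and merge tools. The sketches below assume the same hypothetical database and dataset names as above; key columns, paths, and the generated jar/class names are illustrative placeholders:

```shell
# Regenerate the Java record class for the "orders" table
# (placeholder JDBC URL and package name).
sqoop codegen \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table orders \
  --package-name com.example.records

# Export HDFS files back into an existing "orders" table; in update
# mode, rows matching the --update-key column are replaced with
# UPDATE statements instead of inserted.
sqoop export \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table orders \
  --export-dir /data/orders \
  --update-key order_id \
  --update-mode updateonly

# Flatten an incremental import onto the older dataset, keeping the
# newest record for each value of the --merge-key column.
sqoop merge \
  --new-data /data/orders_inc \
  --onto /data/orders_base \
  --target-dir /data/orders_merged \
  --jar-file orders.jar \
  --class-name orders \
  --merge-key order_id
```

In a CWA job definition, each of these invocations would map to one of the Sqoop Adapter task types described above.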
Cisco Workload Automation MapReduce Adapter Overview
Hadoop MapReduce is a software framework for writing applications that process large amounts of data (multi-terabyte data sets) in parallel on large clusters (up to thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. A Cisco Workload Automation MapReduce Adapter job divides the input data set into independent chunks that are processed by the map tasks in parallel. The framework sorts the map's outputs,