Analytical Process Manager

Overview of Input Engines

In a microarray experiment, raw data from an image scanner are typically stored in a scanner-specific format on a per-chip or per-slide basis. The associated experimental factors (such as disease status, cell line, and treatment) and the gene annotation for each chip are usually not included in the raw data. For statistical analysis, the raw data from all of the individual chips need to be combined with the associated experimental factors and annotation information.

Input engines in SAS Scientific Discovery Solutions provide an easy and automatic way to import raw data and experimental factors into the SAS Scientific Discovery Solutions repository in the form of a SAS data set. The structure of this output data set is in stacked form, in which all of the intensity measurements are stacked into one variable, and various experimental and user-defined variables are contained in other columns.

The following sections describe the ten input engines that are provided with Version 1.3.

Input Engines for Generic Data

Input Engines for Microarray-Specific Data Formats

Input Engines for Genetic Marker Data Formats

Experimental Design File for the Affymetrix, Agilent, Experiment, GenePix, QuantArray, and Scanalyze Input Engines

When importing raw data through the Affymetrix, Agilent, Arlequin, Experiment, GenePix, QuantArray, and Scanalyze input engines, you must provide an associated experimental design file in the form of a table in one of the following formats: Microsoft Excel, tab delimited, comma separated, or SAS data set. If the table is in Microsoft Excel, tab-delimited, or comma-separated format, the first row supplies the variable names to be created in the output SAS data set. If the table is a SAS data set, its column names become the variable names in the output SAS data set. Each record (row) holds the experimental information that is associated with one experimental sample (a chip, or one channel of a slide). You must provide the following information:

It is also recommended that you place the variables related to the experimental sample to the left of the File column and the variables read from the raw files to the right of the Intensity column.

A log base 2 transformation option is available for the Intensity column.

For more information on input engine experimental design files, see Input Engine Recommendations.

Experimental Design File Examples

For the Affymetrix Human Latin Square (netaffx.com) and Drosophila Aging experiments [Jin, et al., "The contributions of sex, genotype and age to transcriptional variance in Drosophila melanogaster." Nature Genetics 29, 389-395 (2001)], the first five records of the associated experimental design tables are shown in Tables 1 and 2, respectively. In Table 1, ChipID is the chip identifier, Experiment indicates the experimental configuration, and File indicates the name of the CEL data file. In Table 2, Array is the array identifier; Dye, Line, Sex, and Age indicate the associated experimental factors for each channel of each array; File indicates the name of the data file; and Intensity indicates the name of the associated intensity measurement in the raw data files. Note that in this example, the two rows for every file correspond to the Cy3 and Cy5 channels.

Table 1: Experimental Design Table for the Affymetrix Human Latin Square Experiments
ChipID  Experiment  File
1       a           1532a99hpp_av04
2       b           1532b99hpp_av04
3       c           1532c99hpp_av04
4       d           1532d99hpp_av04
5       e           1532e99hpp_av04

Table 2: Experimental Design Table for the Drosophila Aging Experiment
Array  Dye  Line  Sex  Age  File         Intensity
1      Cy3  ORE   FEM  WK1  OF1A3.OF6A5  Ch1i
1      Cy5  ORE   FEM  WK6  OF1A3.OF6A5  Ch2i
2      Cy3  ORE   FEM  WK1  OF1B3.OF6B5  Ch1i
2      Cy5  ORE   FEM  WK6  OF1B3.OF6B5  Ch2i
3      Cy3  ORE   MAL  WK1  OM1A3.OM6A5  Ch1i

The input engines combine the experimental information with the intensity measurements and some platform-specific variables in the output SAS data set. Partial output data sets for the Affymetrix Human Latin Square and Drosophila Aging experiments are shown in Tables 3 and 4, respectively.

In Table 3, Unit, AffyID, and Probe are variables that are specific to the Affymetrix GeneChip; they indicate the unit number, the Affymetrix GeneChip internal gene identifier, and the probe identifier, respectively. Log2i is the base 2 logarithm of the perfect match probe intensity. In Table 4, Spot is a Scanalyze-specific variable that indicates the spot number within an array, and Log2i is the base 2 logarithm of the intensity. After the output SAS data set is created, it is stored in the data warehouse and is then available as an input data set for analytical processes.

Table 3: Output Merging Data Set for Affymetrix Human Latin Square Experiment
Unit   AffyID    Probe  ChipID  Experiment  Series  Log2i
12274  476_s_at  3      1       a           4       6.8073549221
872    33034_at  7      1       a           4       8.7032114674
3855   39967_at  1      1       a           4       8.3456270122
5504   35247_at  13     1       a           4       7.5101707512
6798   39416_at  11     1       a           4       7.9265925101

Table 4: Output Merging Data Set for the Drosophila Aging Experiment
Spot  Array  Dye  Line  Sex  Age  Log2i
4     1      Cy3  ORE   FEM  WK1  14.143383
8     1      Cy3  ORE   FEM  WK1  13.351905
9     1      Cy3  ORE   FEM  WK1  13.518899
14    1      Cy3  ORE   FEM  WK1  11.469133
15    1      Cy3  ORE   FEM  WK1  11.401413
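
The way a design table and raw files combine into a stacked data set can be illustrated outside SAS. The following Python sketch is not the input engine implementation; it only mimics the merge described above, using the first two rows of Table 2. The raw file contents, the intensity values, and the function name stack are hypothetical and chosen for illustration.

```python
import math

# First two rows of Table 2 (the experimental design table).
design = [
    {"Array": 1, "Dye": "Cy3", "Line": "ORE", "Sex": "FEM", "Age": "WK1",
     "File": "OF1A3.OF6A5", "Intensity": "Ch1i"},
    {"Array": 1, "Dye": "Cy5", "Line": "ORE", "Sex": "FEM", "Age": "WK6",
     "File": "OF1A3.OF6A5", "Intensity": "Ch2i"},
]

# Hypothetical raw file contents, keyed by file name: one record per spot.
# The intensity values are made up for illustration.
raw_files = {
    "OF1A3.OF6A5": [
        {"Spot": 4, "Ch1i": 1024.0, "Ch2i": 2048.0},
        {"Spot": 8, "Ch1i": 512.0, "Ch2i": 4096.0},
    ],
}


def stack(design, raw_files, log2=True):
    """Return one output row per (design row, spot) pair."""
    stacked = []
    for sample in design:
        # File and Intensity only locate the measurement; they are not copied.
        factors = {k: v for k, v in sample.items()
                   if k not in ("File", "Intensity")}
        for spot in raw_files[sample["File"]]:
            value = spot[sample["Intensity"]]
            row = {"Spot": spot["Spot"], **factors}
            if log2:
                row["Log2i"] = math.log2(value)
            else:
                row["Intensity"] = value
            stacked.append(row)
    return stacked


rows = stack(design, raw_files)
```

Each design row selects one intensity column from one file, so a two-channel file contributes two blocks of rows to the stacked result, one per dye.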

Stacked Versus Rectangular Data Formats

Most of the SDS analytical processes require the data to be in stacked form, in which all of the intensity measurements are stacked into one variable. The output data set from the various input engines is therefore analysis ready for these processes. Other analytical processes require the data to be in rectangular form with, for example, genes as rows and samples as columns. To transform stacked data into rectangular form, use the Data Transpose analytical process from the Utilities folder. In addition, the Data Transpose Rectangular analytical process is available for manipulating data that are already in rectangular form.
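
The difference between the two layouts can be shown with a small Python sketch. It is not the Data Transpose analytical process, only an illustration of the reshaping idea; the variable names Gene and Sample, the values, and the function to_rectangular are assumptions made for the example.

```python
# Stacked form: one row per measurement (hypothetical values).
stacked = [
    {"Gene": "g1", "Sample": "S1", "Log2i": 10.0},
    {"Gene": "g1", "Sample": "S2", "Log2i": 11.0},
    {"Gene": "g2", "Sample": "S1", "Log2i": 7.5},
    {"Gene": "g2", "Sample": "S2", "Log2i": 8.5},
]


def to_rectangular(stacked, row_key, col_key, value_key):
    """Pivot stacked records into {row: {column: value}}."""
    columns = []  # column labels in order of first appearance
    table = {}
    for rec in stacked:
        col = rec[col_key]
        if col not in columns:
            columns.append(col)
        table.setdefault(rec[row_key], {})[col] = rec[value_key]
    return columns, table


columns, table = to_rectangular(stacked, "Gene", "Sample", "Log2i")

# Print genes as rows and samples as columns.
print("Gene", *columns)
for gene, values in table.items():
    print(gene, *(values[c] for c in columns))
```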

Input of Rectangular Formatted Data

If your original raw data are already in rectangular form, for example, in a single Excel file, you can use one of the following two options to create a SAS data set in stacked form in SDS:

Table 5: Experimental Design Table for a Single Rectangular Formatted Raw Data File
Sample  TrT  File   Intensity
1       0    Data1  Sample1
2       0    Data1  Sample2
3       0    Data1  Sample3
4       1    Data1  Sample4
5       1    Data1  Sample5
6       1    Data1  Sample6
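
Table 5 suggests that every design row points to the same file and names, in its Intensity column, the column of that file that holds the sample's measurements. The following Python sketch assumes that reading and is not the input engine implementation. The contents of Data1, the Gene identifier column, and the function to_stacked are hypothetical.

```python
# Design rows from Table 5.
design = [
    {"Sample": 1, "TrT": 0, "File": "Data1", "Intensity": "Sample1"},
    {"Sample": 2, "TrT": 0, "File": "Data1", "Intensity": "Sample2"},
    {"Sample": 3, "TrT": 0, "File": "Data1", "Intensity": "Sample3"},
    {"Sample": 4, "TrT": 1, "File": "Data1", "Intensity": "Sample4"},
    {"Sample": 5, "TrT": 1, "File": "Data1", "Intensity": "Sample5"},
    {"Sample": 6, "TrT": 1, "File": "Data1", "Intensity": "Sample6"},
]

# Hypothetical rectangular file: one row per gene, one column per sample.
files = {
    "Data1": [
        {"Gene": "g1", "Sample1": 1.0, "Sample2": 2.0, "Sample3": 3.0,
         "Sample4": 4.0, "Sample5": 5.0, "Sample6": 6.0},
        {"Gene": "g2", "Sample1": 10.0, "Sample2": 20.0, "Sample3": 30.0,
         "Sample4": 40.0, "Sample5": 50.0, "Sample6": 60.0},
    ],
}


def to_stacked(design, files, id_key="Gene"):
    """Stack the column named by each design row into one Intensity variable."""
    stacked = []
    for sample in design:
        for record in files[sample["File"]]:
            stacked.append({
                id_key: record[id_key],
                "Sample": sample["Sample"],
                "TrT": sample["TrT"],
                "Intensity": record[sample["Intensity"]],
            })
    return stacked


stacked = to_stacked(design, files)
```

With six design rows and two genes, the result has twelve rows, each carrying the treatment factor of its sample.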

See also:
SAS Scientific Discovery Solutions Analytical Processes