In a microarray experiment, raw data from an image scanner are typically stored in a scanner-specific format on a per-chip or per-slide basis. The associated experimental factors (such as disease status, cell line, treatment, etc.) and gene annotation for each chip are usually not directly included in the raw data. For statistical analysis, the raw data from all single chips need to be combined with the associated experimental factors and annotation information.
Input engines in SAS Scientific Discovery Solutions provide an easy and automatic way to import raw data and experimental factors into the SAS Scientific Discovery Solutions repository in the form of a SAS data set. The structure of this output data set is in stacked form, in which all of the intensity measurements are stacked into one variable, and various experimental and user-defined variables are contained in other columns.
The following sections describe the ten input engines that are provided with Version 1.3.
When importing raw data through the Affymetrix, Agilent, Arlequin, Experiment, GenePix, QuantArray, and Scanalyze input engines, you must provide an associated experimental design file in the form of a table in one of the following formats: Microsoft Excel, tab-delimited, comma-separated, or SAS data set. If the table is in Microsoft Excel, tab-delimited, or comma-separated format, the first row supplies the variable names to be created in the output SAS data set. If the table is a SAS data set, its column names become the variable names in the output SAS data set. Each record (row) holds the experimental information that is associated with one experimental sample (a chip, or one channel of a slide). You must provide the following information:
It is also recommended that the variables related to the experimental sample be placed to the left of the File column and that the variables read from the raw files be placed to the right of the Intensity column.
A log base 2 transformation is available as an option for the Intensity column.
For more information on input engine experimental design files, see Input Engine Recommendations.
For the Affymetrix Human Latin Square (netaffx.com) and Drosophila Aging experiments [Jin, et al., "The contributions of sex, genotype and age to transcriptional variance in Drosophila melanogaster." Nature Genetics 29, 389-395 (2001)], the first five records of the associated experimental design tables are shown in Tables 1 and 2, respectively. In Table 1, ChipID is the chip identifier, Experiment indicates the experimental configuration, and File indicates the name of the CEL data file. In Table 2, Array is the array identifier; Dye, Line, Sex, and Age indicate the associated experimental factors for each channel of each array; File indicates the name of the data file; and Intensity indicates the name of the corresponding intensity measurement in the raw data files. Note that in this example, the two rows for each file correspond to the Cy3 and Cy5 channels.
Table 1: Experimental Design Table for the Affymetrix Human Latin Square Experiments
ChipID | Experiment | File |
---|---|---|
1 | a | 1532a99hpp_av04 |
2 | b | 1532b99hpp_av04 |
3 | c | 1532c99hpp_av04 |
4 | d | 1532d99hpp_av04 |
5 | e | 1532e99hpp_av04 |
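Because a SAS data set is one of the accepted design table formats, the rows of Table 1 could be entered with a short DATA step. The following is a minimal sketch only: the data set name `latin_design` and the character lengths are assumptions chosen for illustration, and the step does not show how the table is registered with an input engine.

```sas
/* Sketch: Table 1 entered as a SAS data set.                      */
/* The name latin_design and the lengths below are assumptions.    */
data latin_design;
   /* Declare ChipID first so the columns keep the order           */
   /* ChipID, Experiment, File (sample variables left of File).    */
   length ChipID 8 Experiment $1 File $20;
   input ChipID Experiment $ File $;
   datalines;
1 a 1532a99hpp_av04
2 b 1532b99hpp_av04
3 c 1532c99hpp_av04
4 d 1532d99hpp_av04
5 e 1532e99hpp_av04
;
run;

/* List the table to check variable names and order. */
proc print data=latin_design noobs;
run;
```

The LENGTH statement matters here: without it, list input would truncate the file names to the default character length of 8.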
Table 2: Experimental Design Table for the Drosophila Aging Experiment
Array | Dye | Line | Sex | Age | File | Intensity |
---|---|---|---|---|---|---|
1 | Cy3 | ORE | FEM | WK1 | OF1A3.OF6A5 | Ch1i |
1 | Cy5 | ORE | FEM | WK6 | OF1A3.OF6A5 | Ch2i |
2 | Cy3 | ORE | FEM | WK1 | OF1B3.OF6B5 | Ch1i |
2 | Cy5 | ORE | FEM | WK6 | OF1B3.OF6B5 | Ch2i |
3 | Cy3 | ORE | MAL | WK1 | OM1A3.OM6A5 | Ch1i |
The input engines combine the experimental information with the intensity measurements and some platform-specific variables in the output SAS data set. Partial output data sets for the Affymetrix Human Latin Square and Drosophila Aging experiments are shown in Tables 3 and 4, respectively.
In Table 3, Unit, AffyID, and Probe are variables that are specific to the Affymetrix GeneChip; they indicate the unit number, the Affymetrix GeneChip internal gene identifier, and the probe identifier, respectively. Log2i is the base 2 logarithm of the perfect match probe intensity. In Table 4, Spot is a Scanalyze-specific variable that indicates the spot number within each array, and Log2i is the base 2 logarithm of the intensity. After the output SAS data set is created, it is stored in the data warehouse and is then available as an input data set for analytical processes.
Table 3: Output Merging Data Set for Affymetrix Human Latin Square Experiment
Unit | AffyID | Probe | ChipID | Experiment | Series | Log2i |
---|---|---|---|---|---|---|
12274 | 476_s_at | 3 | 1 | a | 4 | 6.8073549221 |
872 | 33034_at | 7 | 1 | a | 4 | 8.7032114674 |
3855 | 39967_at | 1 | 1 | a | 4 | 8.3456270122 |
5504 | 35247_at | 13 | 1 | a | 4 | 7.5101707512 |
6798 | 39416_at | 11 | 1 | a | 4 | 7.9265925101 |
Table 4: Output Merging Data Set for the Drosophila Aging Experiment
Spot | Array | Dye | Line | Sex | Age | Log2i |
---|---|---|---|---|---|---|
4 | 1 | Cy3 | ORE | FEM | WK1 | 14.143383 |
8 | 1 | Cy3 | ORE | FEM | WK1 | 13.351905 |
9 | 1 | Cy3 | ORE | FEM | WK1 | 13.518899 |
14 | 1 | Cy3 | ORE | FEM | WK1 | 11.469133 |
15 | 1 | Cy3 | ORE | FEM | WK1 | 11.401413 |
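The input engines perform this combination automatically. To make the idea concrete, the following Base SAS sketch joins a design table to raw two-channel data by file name and selects the intensity column named in each design row. It is an illustration under stated assumptions, not the engines' implementation: the data set names `design`, `raw`, and `stacked` are hypothetical, and the raw intensity values are made up.

```sas
/* Design rows for one slide, taken from Table 2. */
data design;
   length Array 8 Dye $3 Line $3 Sex $3 Age $3 File $12 Intensity $4;
   input Array Dye $ Line $ Sex $ Age $ File $ Intensity $;
   datalines;
1 Cy3 ORE FEM WK1 OF1A3.OF6A5 Ch1i
1 Cy5 ORE FEM WK6 OF1A3.OF6A5 Ch2i
;
run;

/* Hypothetical raw data: one row per spot, one column per channel. */
/* The intensity values are invented for illustration.              */
data raw;
   length File $12;
   input File $ Spot Ch1i Ch2i;
   datalines;
OF1A3.OF6A5 4 18100 17500
OF1A3.OF6A5 8 10450 9900
;
run;

/* Each design row picks up every spot in its file; the Intensity   */
/* value decides which raw column is log2-transformed into Log2i.   */
proc sql;
   create table stacked as
   select r.Spot, d.Array, d.Dye, d.Line, d.Sex, d.Age,
          log2(case d.Intensity
                  when 'Ch1i' then r.Ch1i
                  when 'Ch2i' then r.Ch2i
               end) as Log2i
   from design as d inner join raw as r
        on d.File = r.File
   order by d.Array, d.Dye, r.Spot;
quit;
```

With two design rows and two spots, the result has four rows, one per channel and spot, in the same layout as Table 4.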
Most of the SDS analytical processes require the data to be in stacked form, in which all of the intensity measurements are stacked into one variable, and so the resulting output data set from the various input engines is analysis ready for these analytical processes. Other analytical processes require the data to be in a rectangular form with, for example, genes as rows and samples as columns. To transform the data into this rectangular form, use the Data Transpose analytical process from the Utilities folder. In addition, the Data Transpose Rectangular analytical process is available for manipulating data that is already in rectangular form.
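The relationship between the two layouts can also be illustrated outside SDS with Base SAS PROC TRANSPOSE. The following sketch is not the Data Transpose analytical process itself; the data set names `stacked` and `rect`, the gene labels, and the intensity values are hypothetical.

```sas
/* Small stacked data set: one row per gene and sample. */
data stacked;
   length Gene $8;
   input Gene $ Sample Log2i;
   datalines;
g1 1 6.8
g1 2 7.1
g2 1 8.7
g2 2 8.3
;
run;

/* BY-group processing requires the data to be sorted by Gene. */
proc sort data=stacked;
   by Gene;
run;

/* One output row per gene; one column per sample,               */
/* named Sample1, Sample2 from PREFIX= plus the ID value.        */
proc transpose data=stacked out=rect(drop=_name_) prefix=Sample;
   by Gene;
   id Sample;
   var Log2i;
run;
```

The resulting `rect` data set has genes as rows and samples as columns, which is the rectangular form described above.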
If your original raw data are already in rectangular form, for example, in a single Excel file, you can use one of the following two options to create a SAS data set in stacked form in SDS:
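data stacked;                         /* hypothetical output data set name */
   set rect;                          /* hypothetical input: columns sample1-sample6 */
   array samp[*] sample1-sample6;
   do sample = 1 to 6;
      log2i = log2(samp[sample]);     /* index the array by the loop variable */
      if sample <= 3 then trt = 0;    /* samples 1-3: treatment 0 */
      else trt = 1;                   /* samples 4-6: treatment 1 */
      output;                         /* one stacked row per sample */
   end;
run;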
Table 5: Experimental Design Table for a Single Rectangular Formatted Raw Data File
Sample | TrT | File | Intensity |
---|---|---|---|
1 | 0 | Data1 | Sample1 |
2 | 0 | Data1 | Sample2 |
3 | 0 | Data1 | Sample3 |
4 | 1 | Data1 | Sample4 |
5 | 1 | Data1 | Sample5 |
6 | 1 | Data1 | Sample6 |
See also:
SAS Scientific Discovery Solutions Analytical Processes