Genotype inference software

This software infers genotypes in families. It works with two types of family structures:

with three-generation families, in which high-density genotypes are available for grandparents and parents, and sparse genotypes are available for all family members, and
with nuclear (two-generation) families, in which high-density genotypes are available for parents and one sib, and sparse genotypes are available for all family members.

For further information about this program, or help using it, e-mail Josh Burdick (jburdick@gradient.cis.upenn.edu) or Vivian Cheung (vcheung@mail.med.upenn.edu).

Note that the Merlin program also infers genotypes, in a more general way.

If you're using HapMap data...

Genotypes inferred by running this program on the HapMap data (with IBD information from Vivian Cheung's lab) are available from the HapMap website, in essentially the same format as the other HapMap genotypes. Currently, inferred genotypes from HapMap builds 16c1 and 21 are available, so if you're using those datasets, you shouldn't need to run this program.

Data needed

You will need genotype data in "pedigree" format, and the locations of the genetic markers.

You will also need identity-by-descent (IBD) information between the relatives within a pedigree. One convenient way to calculate this is using Goncalo Abecasis' program Merlin, which outputs an IBD file in the format this program is designed to read. Other methods of determining IBD can be used; however, you will need to translate the IBD file to the same format that Merlin outputs.

The data formats are described below.

Getting the program

Download either this ZIP archive or this .tar.gz archive. Each contains:

the C++ source code (which should be portable to any platform with a C++ compiler and an implementation of the C++ Standard Template Library (STL)).
several sample files and test scripts (one of these requires Perl >= 5.8.7).

If you'd rather not build the program from source, pre-compiled executables are available for

x86 Linux

Compiling

To compile the program, and run it on some test data sets, do:

tar xvfz genotypeinference.tar.gz
cd genotypeinference/src
make test

The executable is called inferrer.

Running the program

The program expects input from text files similar to those used by programs such as Merlin; their names need to be provided as command-line options.

-m : the name of the marker (map) file
-p : the name of the pedigree file
-i : the name of the IBD file

Assuming the inferrer executable is in your $PATH, you can run, for example,

inferrer -m infer.map -p infer.ped -i infer.ibd > infer2.ped

This will infer genotypes for the children and write the pedigree, with the inferred child genotypes added, to standard output (which in this case is redirected to infer2.ped).

Marker file

The markers should be listed in a file, with each row containing chromosome, marker name, and location. For example:

21 rs990141 15.0096740000000000
21 rs1005526 15.3884800000000000
21 rs926166 15.4379340000000000
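As an illustration of the three-column layout, here is a minimal Python sketch that reads marker rows into records. It is not part of the distributed program; the names Marker and parse_marker_lines are made up for this example, and it assumes columns are separated by whitespace and that blank lines can be skipped.

```python
from typing import Iterable, List, NamedTuple


class Marker(NamedTuple):
    """One row of the marker (map) file. Hypothetical helper type."""
    chromosome: str
    name: str
    location: float


def parse_marker_lines(lines: Iterable[str]) -> List[Marker]:
    """Parse rows of 'chromosome marker-name location'.

    Assumption: columns are whitespace-separated and blank lines are ignored.
    """
    markers = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 columns, got {len(fields)}")
        chromosome, name, location = fields
        markers.append(Marker(chromosome, name, float(location)))
    return markers


# The three rows from the example above.
sample = """\
21 rs990141 15.0096740000000000
21 rs1005526 15.3884800000000000
21 rs926166 15.4379340000000000
"""
markers = parse_marker_lines(sample.splitlines())
for marker in markers:
    print(marker.chromosome, marker.name, marker.location)
```

The chromosome is kept as text rather than converted to a number, so that labels such as X would also be accepted.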

Pedigree file

The family structure and genotypes should be listed in the pedigree file format used by Merlin; many other programs use a similar pedigree format.

This format is a text file with one line per person. Items are separated by whitespace (or, optionally, slashes). The first five items are the family ID, the individual ID, the father's ID, the mother's ID (each parent ID is X or 0 if this person is a founder), and sex (1 = male, 2 = female). The IDs can be arbitrary non-whitespace text. The genotypes follow, in the same order as in the marker file. Each genotype is a pair of alleles represented by numbers, with 0 for a missing allele. For example:

1 3 x x 1 1/1 0/0 1/1 1/2 0/0
1 4 x x 2 2/2 0/0 0/0 0/0 1/1
1 6 3 4 2 1/2 0/0 0/0 0/0 0/0
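To make the layout concrete, the following Python sketch splits one pedigree line into its five leading items and a list of allele pairs. It is only an illustration and does not reproduce the parser in the C++ source. The names Person, parse_pedigree_line, is_founder, and is_missing are invented here. The sketch assumes that IDs contain no slashes (since slashes are treated as separators) and that a founder marker may be written as x or X.

```python
import re
from typing import List, NamedTuple, Tuple


class Person(NamedTuple):
    """One line of the pedigree file. Hypothetical helper type."""
    family: str
    individual: str
    father: str
    mother: str
    sex: str  # "1" = male, "2" = female
    genotypes: List[Tuple[str, str]]  # one allele pair per marker


def parse_pedigree_line(line: str) -> Person:
    """Split a pedigree line on whitespace and slashes.

    Assumption: IDs contain no slashes, so splitting on both is safe.
    """
    tokens = [t for t in re.split(r"[\s/]+", line.strip()) if t]
    if len(tokens) < 5 or (len(tokens) - 5) % 2 != 0:
        raise ValueError("expected 5 leading items followed by allele pairs")
    family, individual, father, mother, sex = tokens[:5]
    alleles = tokens[5:]
    # Pair up consecutive alleles: (a1, a2), (a3, a4), ...
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return Person(family, individual, father, mother, sex, genotypes)


def is_founder(person: Person) -> bool:
    """A founder has X or 0 for both parents (case-insensitive here by assumption)."""
    return person.father.lower() in ("x", "0") and person.mother.lower() in ("x", "0")


def is_missing(genotype: Tuple[str, str]) -> bool:
    """A genotype is treated as missing if either allele is 0."""
    return "0" in genotype


founder = parse_pedigree_line("1 3 x x 1 1/1 0/0 1/1 1/2 0/0")
child = parse_pedigree_line("1 6 3 4 2 1/2 0/0 0/0 0/0 0/0")
print(founder.genotypes)
print(is_founder(founder), is_founder(child))
```

The number of allele pairs on each line should equal the number of rows in the marker file, which is a useful consistency check before running the program.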

IBD file

The IBD information should be in the same format that Merlin generates; the first line should be the same as what Merlin outputs, indicating the family ID, IDs of the two individuals in question, marker name, and the probability of sharing 0, 1, or 2 alleles. For example:

FAMILY ID1 ID2 MARKER P0 P1 P2
1 1 1 RS0 0.0 0.0 1.0
1 2 1 RS0 1.0 0.0 0.0
1 2 2 RS0 0.0 0.0 1.0
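The seven-column layout shown above can be read with a few lines of Python. This is a sketch for checking an IBD file by eye, not the reader used by the program. The names parse_ibd_lines and most_likely_state are hypothetical, and the sketch covers only the non-extended format with the header shown here; extended IBD state files are not handled.

```python
from typing import Dict, Iterable, Tuple

IbdKey = Tuple[str, str, str, str]  # (family, id1, id2, marker)
IbdProbs = Tuple[float, float, float]  # (P0, P1, P2)

EXPECTED_HEADER = ["FAMILY", "ID1", "ID2", "MARKER", "P0", "P1", "P2"]


def parse_ibd_lines(lines: Iterable[str]) -> Dict[IbdKey, IbdProbs]:
    """Parse the seven-column IBD layout shown above (non-extended only)."""
    iterator = iter(lines)
    header = next(iterator, "").split()
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {header}")
    table = {}
    for line in iterator:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 7:
            raise ValueError(f"expected 7 columns, got {len(fields)}")
        family, id1, id2, marker = fields[:4]
        p0, p1, p2 = (float(value) for value in fields[4:])
        table[(family, id1, id2, marker)] = (p0, p1, p2)
    return table


def most_likely_state(probs: IbdProbs) -> int:
    """Return the number of shared alleles (0, 1, or 2) with the highest probability."""
    return max(range(3), key=lambda k: probs[k])


sample = """\
FAMILY ID1 ID2 MARKER P0 P1 P2
1 1 1 RS0 0.0 0.0 1.0
1 2 1 RS0 1.0 0.0 0.0
1 2 2 RS0 0.0 0.0 1.0
"""
ibd = parse_ibd_lines(sample.splitlines())
for key, probs in ibd.items():
    print(key, probs, most_likely_state(probs))
```

In the sample, each individual paired with itself shares 2 alleles with probability 1, while individuals 2 and 1 share none.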

Running

merlin --ibd --singlePoint --markerNames

should generate the correct IBD file (see the Merlin documentation for what filenames to use, and other useful options for Merlin.)

If you are doing two-generation inference, then the IBD file should contain extended IBD state information; see the Merlin documentation about extended IBD states for details.

Also, the program expects single-point IBD; it will infer fewer genotypes if it's given multipoint IBD. (So, when computing IBD using Merlin, you should use the --singlepoint option.)

Family structure

The program supports inferring genotypes in two types of family structure. By default, it assumes grandparent and parent genotypes are given, and attempts to infer genotypes for all of the children.

If the option --infer sibs is specified, the program will assume that genotypes for parents and at least one child are included in the pedigree file, and will attempt to infer genotypes for the remaining sibs. Note that if you use this option, you will need to include extended IBD state information with the -i flag (for instance, by running merlin --extended), instead of the non-extended IBD information.
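For scripted runs, the command line can be assembled in Python. The sketch below uses only the options described in this document (-m, -p, -i, and --infer sibs). The function names build_inferrer_command and run_inferrer, and the file name infer_ext.ibd, are made up for illustration. It assumes the inferrer executable is in your $PATH.

```python
import subprocess
from typing import List


def build_inferrer_command(map_file: str, ped_file: str, ibd_file: str,
                           infer_sibs: bool = False,
                           executable: str = "inferrer") -> List[str]:
    """Assemble an inferrer command line from the options described above."""
    command = [executable, "-m", map_file, "-p", ped_file, "-i", ibd_file]
    if infer_sibs:
        # Two-generation inference; the IBD file must then contain
        # extended IBD state information.
        command += ["--infer", "sibs"]
    return command


def run_inferrer(command: List[str], output_path: str) -> None:
    """Run the command and store standard output in output_path."""
    with open(output_path, "w") as output:
        subprocess.run(command, stdout=output, check=True)


# Default (three-generation) inference.
three_generation = build_inferrer_command("infer.map", "infer.ped", "infer.ibd")
# Two-generation inference; infer_ext.ibd is a hypothetical extended IBD file.
two_generation = build_inferrer_command("infer.map", "infer.ped", "infer_ext.ibd",
                                        infer_sibs=True)
print(" ".join(three_generation))
print(" ".join(two_generation))
```

Calling run_inferrer(three_generation, "infer2.ped") would then correspond to the shell example given under "Running the program".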

Output format

After doing inference, the program can write its output in several formats. By default, it uses the pedigree file format that its input was in.

However, given the command-line option --outputFormat csv, the output will be a comma-separated values (CSV) file containing the columns "Family", "Individual", "Marker ID", and "Genotype". Genotypes that were completely missing are omitted. This option only works for SNP genotypes.
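A CSV file with these four columns can be loaded with Python's standard csv module. The sketch below is illustrative only. It assumes the first row of the file is a header containing exactly the column names listed above, which this document does not confirm. The function name read_inferred_csv is hypothetical, and the sample rows, including the way the genotype values are written, are placeholders rather than real program output.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO, Tuple


def read_inferred_csv(handle: TextIO) -> Dict[Tuple[str, str], Dict[str, str]]:
    """Group genotypes by (family, individual), keyed by marker ID.

    Assumption: the first row is a header naming the four columns.
    Genotype values are kept as opaque strings.
    """
    by_person = defaultdict(dict)
    for row in csv.DictReader(handle):
        person = (row["Family"], row["Individual"])
        by_person[person][row["Marker ID"]] = row["Genotype"]
    return dict(by_person)


# Placeholder rows for illustration only; not real program output.
sample = io.StringIO(
    "Family,Individual,Marker ID,Genotype\n"
    "1,6,rs990141,1/2\n"
    "1,6,rs926166,1/1\n"
    "1,7,rs990141,2/2\n"
)
genotypes = read_inferred_csv(sample)
for person, calls in genotypes.items():
    print(person, calls)
```

Because completely missing genotypes are left out of the CSV output, a marker that is absent for a person should be read as "not inferred" rather than as an error.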

Sample data

The testdata directory contains three sets of data files. For convenience, the IBD file generated by Merlin is included.

synthetic - a set of genotypes constructed to show some of the possible cases for genotype inference.
chr21_example - a set of genotypes on chromosome 21 for members of several CEPH families. The sparse genotype data are from The SNP Consortium, and the dense genotype data are from the HapMap project. The dense genotypes are centered on a region of chromosome 21 about 44 Mb from the p-terminus.
chr22_example - an example using some genotypes on chromosome 22, from the same sources as above. First, it does three-generation inference. It then masks the inferred genotypes of all but one child in each of the families, and does two-generation inference. This test requires Perl (tested using 5.8.7, but it will probably work with other Perl versions).

last modified 20100910