Genotype inference software

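This software infers genotypes in families.  It works with two types of family structure: either grandparent and parent genotypes are given and the children's genotypes are inferred, or genotypes for the parents and at least one child are given and the remaining sibs' genotypes are inferred.  See "Family structure" below.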

For further information about this program, or help using it, e-mail Josh Burdick (jburdick@gradient.cis.upenn.edu) or Vivian Cheung (vcheung@mail.med.upenn.edu).

Note that the Merlin program also infers genotypes, in a more general way.

If you're using HapMap data...

Genotypes inferred by running this program on the HapMap data (with IBD information from Vivian Cheung's lab) are available from the HapMap website, in essentially the same format as the other HapMap genotypes.  Currently, inferred genotypes from HapMap builds 16c1 and 21 are available, so if you're using those datasets, you shouldn't need to run this program.

Data needed

You will need genotype data in "pedigree" format, and the locations of the genetic markers.

You will also need identity-by-descent (IBD) information between the relatives within a pedigree. One convenient way to calculate this is using Goncalo Abecasis' program Merlin, which outputs an IBD file in the format this program is designed to read. Other methods of determining IBD can be used; however, you will need to translate the IBD file to the same format that Merlin outputs.

The data formats are described below.

Getting the program

Download either this ZIP archive or this .tar.gz archive. Each contains: If you'd rather not build the program from source, pre-compiled executables are available for

Compiling

To compile the program, and run it on some test data sets, do:

tar xvfz genotypeinference.tar.gz
cd genotypeinference/src
make test

The executable is called inferrer.

Running the program

The program expects input from text files similar to those used by programs such as Merlin; their names need to be provided as command-line options.

Assuming the inferrer executable is in your $PATH, you can run, for example,

inferrer -m infer.map -p infer.ped -i infer.ibd > infer2.ped

This will infer genotypes for the children, and write the pedigree, with the inferred child genotypes added, to standard output (which in this case is redirected to infer2.ped).
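If you want to drive the program from a script, the command above can be assembled as an argument list. The following is only a sketch: the helper names build_inferrer_command and run_inferrer are made up for illustration, and it uses only the -m, -p, and -i options shown above.

```python
import shlex
import subprocess
from typing import List


def build_inferrer_command(map_file: str, ped_file: str, ibd_file: str) -> List[str]:
    """Return the argument list for the example command shown above."""
    # Assumes the inferrer executable is on $PATH.
    return ["inferrer", "-m", map_file, "-p", ped_file, "-i", ibd_file]


def run_inferrer(map_file: str, ped_file: str, ibd_file: str, out_file: str) -> None:
    """Run inferrer and store its standard output in out_file."""
    cmd = build_inferrer_command(map_file, ped_file, ibd_file)
    with open(out_file, "w") as out:
        # check=True raises CalledProcessError if inferrer exits with an error.
        subprocess.run(cmd, stdout=out, check=True)


command = build_inferrer_command("infer.map", "infer.ped", "infer.ibd")
print(shlex.join(command))
```

Calling run_inferrer("infer.map", "infer.ped", "infer.ibd", "infer2.ped") would then correspond to the shell redirection used above.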

Marker file

The markers should be listed in a file, with each row containing chromosome, marker name, and location.  For example:

21 rs990141 15.0096740000000000
21 rs1005526 15.3884800000000000
21 rs926166 15.4379340000000000
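To check a marker file before running the program, the rows above can be read with a few lines of Python. This is a minimal sketch, not part of the program: the Marker type and parse_marker_lines function are hypothetical names, and it assumes exactly three whitespace-separated columns per row.

```python
from typing import Iterable, List, NamedTuple


class Marker(NamedTuple):
    chromosome: str
    name: str
    position: float


def parse_marker_lines(lines: Iterable[str]) -> List[Marker]:
    """Parse 'chromosome name location' rows, skipping blank lines."""
    markers = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 3:
            raise ValueError(f"expected 3 columns, got {len(fields)}: {line!r}")
        chromosome, name, position = fields
        markers.append(Marker(chromosome, name, float(position)))
    return markers


example = """\
21 rs990141 15.0096740000000000
21 rs1005526 15.3884800000000000
21 rs926166 15.4379340000000000
"""
markers = parse_marker_lines(example.splitlines())
print([m.name for m in markers])
```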

Pedigree file

The family structure and genotypes should be listed in the pedigree file format used by Merlin; many other programs use a similar pedigree format.

This format is a text file, with one line per person. Items are separated by whitespace (or, optionally, slashes). The first five items are the family ID, the individual ID, the IDs of this person's father and mother (X or 0 if this person is a founder), and sex (1 = male, 2 = female). The IDs can be arbitrary non-whitespace text. These are followed by the genotypes, in the same order as in the marker file. Alleles are represented by numbers, with 0 for a missing allele.  For example:

1 3 x x 1 1/1 0/0 1/1 1/2 0/0
1 4 x x 2 2/2 0/0 0/0 0/0 1/1
1 6 3 4 2 1/2 0/0 0/0 0/0 0/0
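The layout of a pedigree line can be made concrete with a small parser. This is an illustrative sketch only: Person, parse_pedigree_line, is_founder, and is_missing are hypothetical names, it treats founder codes case-insensitively (the example uses lowercase x), and because it splits on slashes as well as whitespace it assumes IDs do not themselves contain slashes.

```python
import re
from typing import List, NamedTuple, Tuple


class Person(NamedTuple):
    family: str
    individual: str
    father: str
    mother: str
    sex: str
    genotypes: List[Tuple[str, str]]


# Assumption: "x"/"X" and "0" all mark a missing parent.
FOUNDER_CODES = {"x", "0"}


def parse_pedigree_line(line: str) -> Person:
    """Split one pedigree line into its five ID fields and allele pairs."""
    tokens = re.split(r"[\s/]+", line.strip())
    if len(tokens) < 5 or (len(tokens) - 5) % 2 != 0:
        raise ValueError(f"malformed pedigree line: {line!r}")
    alleles = tokens[5:]
    # Consecutive alleles form one genotype per marker.
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return Person(*tokens[:5], genotypes)


def is_founder(person: Person) -> bool:
    return (person.father.lower() in FOUNDER_CODES
            and person.mother.lower() in FOUNDER_CODES)


def is_missing(genotype: Tuple[str, str]) -> bool:
    """A genotype is treated as missing if either allele is 0."""
    return "0" in genotype


child = parse_pedigree_line("1 6 3 4 2 1/2 0/0 0/0 0/0 0/0")
print(child.father, child.mother, child.genotypes[0], is_founder(child))
```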

IBD file

The IBD information should be in the same format that Merlin generates.  The first line should be the header line that Merlin outputs; each following line gives the family ID, the IDs of the two individuals in question, the marker name, and the probabilities of sharing 0, 1, or 2 alleles.  For example:

FAMILY ID1 ID2 MARKER P0 P1 P2
1 1 1 RS0 0.0 0.0 1.0
1 2 1 RS0 1.0 0.0 0.0
1 2 2 RS0 0.0 0.0 1.0
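If you produce the IBD file with a tool other than Merlin, it may help to sanity-check it first. The sketch below covers only the non-extended format shown above; parse_ibd_lines is a hypothetical helper, and the tolerance used when checking that P0 + P1 + P2 is close to 1 is an arbitrary assumption to allow for rounding.

```python
import math
from typing import Dict, Iterable, Tuple

EXPECTED_HEADER = ["FAMILY", "ID1", "ID2", "MARKER", "P0", "P1", "P2"]

IbdKey = Tuple[str, str, str, str]
IbdProbs = Tuple[float, ...]


def parse_ibd_lines(lines: Iterable[str]) -> Dict[IbdKey, IbdProbs]:
    """Map (family, id1, id2, marker) to the (P0, P1, P2) probabilities."""
    rows = iter(lines)
    header = next(rows).split()
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {header}")
    table = {}
    for line in rows:
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 7:
            raise ValueError(f"expected 7 columns: {line!r}")
        family, id1, id2, marker = fields[:4]
        probs = tuple(float(value) for value in fields[4:])
        # Assumed tolerance; sharing probabilities should sum to about 1.
        if not math.isclose(sum(probs), 1.0, abs_tol=1e-3):
            raise ValueError(f"probabilities do not sum to 1: {line!r}")
        table[(family, id1, id2, marker)] = probs
    return table


example = """\
FAMILY ID1 ID2 MARKER P0 P1 P2
1 1 1 RS0 0.0 0.0 1.0
1 2 1 RS0 1.0 0.0 0.0
1 2 2 RS0 0.0 0.0 1.0
"""
ibd = parse_ibd_lines(example.splitlines())
print(ibd[("1", "2", "1", "RS0")])
```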

Running

merlin --ibd --singlepoint --markerNames

should generate the correct IBD file (see the Merlin documentation for which filenames to use, and for other useful Merlin options).

If you are doing two-generation inference (inferring sibs from the parents and at least one genotyped child; see "Family structure" below), then the IBD file should contain extended IBD state information; see the Merlin documentation about extended IBD states for details.

Also, the program expects single-point IBD; it will infer fewer genotypes if it's given multipoint IBD.  (So, when computing IBD using Merlin, you should use the --singlepoint option.)

Family structure

The program supports inferring genotypes in two types of family structure.  By default, it assumes grandparent and parent genotypes are given, and attempts to infer genotypes for all of the children.

If the option --infer sibs is specified, the program will assume that genotypes for parents and at least one child are included in the pedigree file, and will attempt to infer genotypes for the remaining sibs.  Note that if you use this option, you will need to include extended IBD state information with the -i flag (for instance, by running merlin --extended), instead of the non-extended IBD information.

Output format

After doing inference, the program can write its output in several formats.  By default, it uses the pedigree file format that its input was in.

However, given the command-line option --outputFormat csv, the output will be a comma-separated values (CSV) file, containing columns "Family", "Individual", "Marker ID", and "Genotype".  Genotypes that were completely missing are omitted from the output.  This option only works for SNP genotypes.
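CSV output in this shape is easy to summarise with Python's standard csv module. The following is only a sketch: the column names come from the description above, but the sample rows, the presence of a header row, and the way genotypes are written (here "1/2") are assumptions made for illustration, and count_genotypes_per_individual is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Hypothetical sample; real output may encode genotypes differently.
sample_output = """\
Family,Individual,Marker ID,Genotype
1,6,rs990141,1/2
1,6,rs926166,1/1
1,7,rs990141,2/2
"""


def count_genotypes_per_individual(csv_text: str) -> Counter:
    """Count how many genotype rows each (family, individual) pair has."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter()
    for row in reader:
        counts[(row["Family"], row["Individual"])] += 1
    return counts


counts = count_genotypes_per_individual(sample_output)
print(counts)
```

Because completely missing genotypes are left out of the CSV, a count like this gives a rough picture of how many genotypes are present for each person.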

Sample data

The testdata directory contains three sets of data files. For convenience, the IBD file generated by Merlin is included.


last modified 20100910