This software infers genotypes in families. It works with two types of family structures:
For further information about this program, or help using it, e-mail Josh Burdick (jburdick@gradient.cis.upenn.edu) or Vivian Cheung (vcheung@mail.med.upenn.edu).
Genotypes inferred by running this, using the HapMap data (and IBD information from Vivian Cheung's lab)
are available from the HapMap website, in essentially the same format
as the other HapMap genotypes. Currently, inferred
genotypes from HapMap builds 16c1 and 21 are available, so if
you're using those datasets, you shouldn't need to run this program.
You will need genotype data in "pedigree" format, and the locations of the genetic markers.
You will also need identity-by-descent (IBD) information between the relatives within a pedigree. One convenient way to calculate this is using Goncalo Abecasis' program Merlin, which outputs an IBD file in the format this program is designed to read. Other methods of determining IBD can be used; however, you will need to translate the IBD file to the same format that Merlin outputs.
The data formats are described below.
tar xvfz genotypeinference.tar.gz
cd genotypeinference/src
make test
The executable is called inferrer.
The program expects input from text files similar to those used by programs such as Merlin; their names need to be provided as command-line options.
-m : the name of the marker (map) file
-p : the name of the pedigree file
-i : the name of the IBD file
Assuming the inferrer executable is in your $PATH, you can run, for example,
inferrer -m infer.map -p infer.ped -i infer.ibd > infer2.ped
This will infer genotypes for the children, and write out the pedigree, with inferred child genotypes added, to standard output (which in this case will be stored in infer2.ped).
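If you want to drive the program from a script instead of the shell, the invocation above can be sketched in Python. This is only a minimal sketch: the function names build_inferrer_command and run_inferrer are hypothetical, and the use of check=True assumes that inferrer exits with a non-zero status on failure, which this document does not state.

```python
import shlex
import subprocess
from typing import List


def build_inferrer_command(map_file: str, ped_file: str, ibd_file: str,
                           executable: str = "inferrer") -> List[str]:
    """Build the argument list using the -m, -p and -i options described above."""
    return [executable, "-m", map_file, "-p", ped_file, "-i", ibd_file]


def run_inferrer(map_file: str, ped_file: str, ibd_file: str,
                 output_file: str) -> None:
    """Run inferrer and store its standard output (the new pedigree) in output_file.

    Assumption: a non-zero exit status means failure, so check=True raises.
    """
    command = build_inferrer_command(map_file, ped_file, ibd_file)
    with open(output_file, "w") as out:
        subprocess.run(command, stdout=out, check=True)


# Show the command that run_inferrer would execute, without running it.
command = build_inferrer_command("infer.map", "infer.ped", "infer.ibd")
print(shlex.join(command))
```

Calling run_inferrer("infer.map", "infer.ped", "infer.ibd", "infer2.ped") would then be the scripted equivalent of the shell redirection shown above.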
The markers should be listed in a file, with each row containing chromosome, marker name, and location. For example:
21 rs990141 15.0096740000000000
21 rs1005526 15.3884800000000000
21 rs926166 15.4379340000000000
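To check a marker file before running the program, the three-column layout above can be read with a few lines of Python. This is a sketch, not part of the distributed software; the names Marker and parse_marker_lines are hypothetical, and it assumes the columns are separated by whitespace as in the example.

```python
from typing import Iterable, List, NamedTuple


class Marker(NamedTuple):
    chromosome: str
    name: str
    position: float


def parse_marker_lines(lines: Iterable[str]) -> List[Marker]:
    """Parse rows of 'chromosome marker-name location' into Marker records."""
    markers = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
        chromosome, name, position = fields
        markers.append(Marker(chromosome, name, float(position)))
    return markers


# The example rows from the marker file description above.
example = """\
21 rs990141 15.0096740000000000
21 rs1005526 15.3884800000000000
21 rs926166 15.4379340000000000
"""
markers = parse_marker_lines(example.splitlines())
print([m.name for m in markers])
```

The order of the returned list matters, because the genotype columns in the pedigree file follow the same order as the marker file.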
The family structure and genotypes should be listed in the pedigree file format used by Merlin; many other programs use a similar pedigree format.
This format is a text file, with one line per person. Different items are separated by whitespace (or optionally slashes). The first five items are a family ID, individual ID, the IDs of this person's father and mother (or X or 0 if this person is a founder), and sex (1 = male, 2 = female). The IDs can be arbitrary non-whitespace text. This is followed by the genotypes, in the same order as in the marker file. Genotypes are represented by numbers, with 0 for missing alleles. For example:
1 3 x x 1 1/1 0/0 1/1 1/2 0/0
1 4 x x 2 2/2 0/0 0/0 0/0 1/1
1 6 3 4 2 1/2 0/0 0/0 0/0 0/0
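The pedigree layout above can be read as follows. This is a minimal sketch rather than the program's own parser: the names Person, parse_pedigree_line, is_founder and count_missing are hypothetical, and it assumes the founder code may appear in either case (the example uses a lowercase x) and that alleles within a genotype may be separated by either a slash or whitespace.

```python
from typing import List, NamedTuple, Tuple


class Person(NamedTuple):
    family: str
    individual: str
    father: str
    mother: str
    sex: str
    genotypes: List[Tuple[str, str]]


def parse_pedigree_line(line: str) -> Person:
    """Split one pedigree row into its five ID fields and its allele pairs."""
    fields = line.split(None, 5)
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields: {line!r}")
    family, individual, father, mother, sex = fields[:5]
    # Treat slashes like whitespace, then pair the alleles up two at a time.
    alleles = fields[5].replace("/", " ").split() if len(fields) > 5 else []
    if len(alleles) % 2 != 0:
        raise ValueError(f"odd number of alleles: {line!r}")
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return Person(family, individual, father, mother, sex, genotypes)


def is_founder(person: Person) -> bool:
    """A founder has both parents coded as X (any case, an assumption) or 0."""
    return all(p.lower() in ("x", "0") for p in (person.father, person.mother))


def count_missing(person: Person) -> int:
    """Count genotypes in which at least one allele is 0 (missing)."""
    return sum(1 for a, b in person.genotypes if a == "0" or b == "0")


# The example rows from the pedigree description above.
example = """\
1 3 x x 1 1/1 0/0 1/1 1/2 0/0
1 4 x x 2 2/2 0/0 0/0 0/0 1/1
1 6 3 4 2 1/2 0/0 0/0 0/0 0/0
"""
people = [parse_pedigree_line(row) for row in example.splitlines()]
for person in people:
    print(person.individual, is_founder(person), count_missing(person))
```

Counting missing genotypes per person before and after a run is one simple way to see how much the inference added.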
The IBD information should be in the same format that Merlin generates; the first line should be the same as what Merlin outputs, indicating the family ID, IDs of the two individuals in question, marker name, and the probability of sharing 0, 1, or 2 alleles. For example:
FAMILY ID1 ID2 MARKER P0 P1 P2
1 1 1 RS0 0.0 0.0 1.0
1 2 1 RS0 1.0 0.0 0.0
1 2 2 RS0 0.0 0.0 1.0
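If you produce the IBD file with a tool other than Merlin, a quick sanity check of the translated file can catch formatting mistakes early. The sketch below covers only the non-extended layout shown above; the names parse_ibd_lines and EXPECTED_HEADER are hypothetical, and the tolerance used when checking that the three probabilities sum to 1 is an arbitrary choice meant to allow for rounded output.

```python
import math
from typing import Dict, Iterable, Tuple

EXPECTED_HEADER = ["FAMILY", "ID1", "ID2", "MARKER", "P0", "P1", "P2"]

IbdKey = Tuple[str, str, str, str]
IbdProbs = Tuple[float, float, float]


def parse_ibd_lines(lines: Iterable[str]) -> Dict[IbdKey, IbdProbs]:
    """Map (family, id1, id2, marker) to the (P0, P1, P2) sharing probabilities."""
    rows = [line.split() for line in lines if line.strip()]
    if not rows or rows[0] != EXPECTED_HEADER:
        raise ValueError("first line does not match the expected header")
    table = {}
    for fields in rows[1:]:
        if len(fields) != 7:
            raise ValueError(f"expected 7 fields, got {len(fields)}: {fields}")
        family, id1, id2, marker = fields[:4]
        probs = (float(fields[4]), float(fields[5]), float(fields[6]))
        # P0 + P1 + P2 should be 1; the tolerance (an assumption) allows rounding.
        if not math.isclose(sum(probs), 1.0, abs_tol=1e-3):
            raise ValueError(f"probabilities do not sum to 1: {fields}")
        table[(family, id1, id2, marker)] = probs
    return table


# The example rows from the IBD description above.
example = """\
FAMILY ID1 ID2 MARKER P0 P1 P2
1 1 1 RS0 0.0 0.0 1.0
1 2 1 RS0 1.0 0.0 0.0
1 2 2 RS0 0.0 0.0 1.0
"""
ibd = parse_ibd_lines(example.splitlines())
print(ibd[("1", "2", "1", "RS0")])
```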
Running
merlin --ibd --singlePoint --markerNames
should generate the correct IBD file (see the Merlin documentation for what filenames to use, and other useful options for Merlin.)
If you are doing two-generation inference, then the IBD file should contain extended IBD state information; see the Merlin documentation about extended IBD states for details.
Also, the program expects single-point IBD; it will infer fewer genotypes if it's given multipoint IBD. (So, when computing IBD using Merlin, you should use the --singlepoint option.)
For two-generation inference, give the extended IBD file to the -i flag (for instance, one produced by running merlin --extended), instead of the non-extended IBD information.
The testdata directory contains three sets of data files. For convenience, the IBD file generated by Merlin is included.
synthetic - a set of genotypes constructed to show some of the possible cases for genotype inference.
chr21_example - a set of genotypes on chromosome 21 for members of several CEPH families. The sparse genotype data is from The SNP Consortium, and the dense genotype data is from the HapMap project. The dense genotypes are centered around chromosome 21, 44 Mb from the p-terminus.
last modified 20100910