Phased Genotype Matrices#

Class Family Overview#

The PhasedGenotypeMatrix object family derives from the GenotypeMatrix object family. It serves the same purposes and offers the same functionality as its parent family, but it represents phased genotypes. Phased genotype matrices are essential to the representation of genomes in PyBrOpS.

Summary of Phased Genotype Matrix Classes#

Phased genotype matrix classes in PyBrOpS are found in the pybrops.popgen.gmat module. Contained in this module are several PhasedGenotypeMatrix class type definitions which are summarized in the table below.

Summary of phased genotype matrix classes in the pybrops.popgen.gmat module#

Class Name | Class Type | Class Description
PhasedGenotypeMatrix | Abstract | Interface for all phased genotype matrix child classes.
DensePhasedGenotypeMatrix | Concrete | Class representing dense, phased genotype matrices.

Loading Class Modules#

Phased genotype matrix classes can be imported as follows:

# import the PhasedGenotypeMatrix class (an abstract interface class)
from pybrops.popgen.gmat.PhasedGenotypeMatrix import PhasedGenotypeMatrix

# import the DensePhasedGenotypeMatrix class (a concrete implemented class)
from pybrops.popgen.gmat.DensePhasedGenotypeMatrix import DensePhasedGenotypeMatrix

Creating Phased Genotype Matrices#

Like the GenotypeMatrix family of classes, members of the PhasedGenotypeMatrix family can be constructed from raw NumPy arrays, by reading data from VCF files, or by reading data from an HDF5 file containing a saved phased genotype matrix.

Creating phased genotype matrices from NumPy arrays#

Phased genotype matrices can be constructed from raw NumPy arrays using the constructor. The example below demonstrates the construction of a phased genotype matrix using the DensePhasedGenotypeMatrix class. Numerical matrix inputs containing genotypic codings must have a shape of (m,n,p), where m is the number of phases, n is the number of taxa, and p is the number of marker variants. The genotype code matrix must be binary in nature ({0,1}) and must have an int8 data type.

Like the DenseGenotypeMatrix class, additional optional metadata may be stored along with a DensePhasedGenotypeMatrix including taxa names (taxa), taxa groups (taxa_grp), marker variant chromosome assignments (vrnt_chrgrp), marker variant chromosome physical positions (vrnt_phypos), marker variant names (vrnt_name), marker variant genetic map positions (vrnt_genpos), sequential recombination probabilities between markers (vrnt_xoprob), marker variant haplotype group assignments (vrnt_hapgrp), reference haplotype (vrnt_hapref), alternative haplotype (vrnt_hapalt), and a variant mask (vrnt_mask).

# import NumPy, which is used to build the example arrays
import numpy

# shape parameters for random genotypes
ntaxa = 100
nvrnt = 1000
ngroup = 20
nchrom = 10
nphase = 2

# create random genotypes
mat = numpy.random.randint(0, 2, size = (nphase,ntaxa,nvrnt)).astype("int8")

# create taxa names
taxa = numpy.array(["taxon"+str(i+1).zfill(3) for i in range(ntaxa)], dtype = object)

# create taxa groups
taxa_grp = numpy.random.randint(1, ngroup+1, ntaxa)

# create marker variant chromosome assignments
vrnt_chrgrp = numpy.random.randint(1, nchrom+1, nvrnt)

# create marker physical positions
vrnt_phypos = numpy.random.choice(1000000, size = nvrnt, replace = False)

# create marker variant names
vrnt_name = numpy.array(["SNP"+str(i+1).zfill(4) for i in range(nvrnt)], dtype = object)

# create a phased genotype matrix from scratch using NumPy arrays
pgmat = DensePhasedGenotypeMatrix(
    mat = mat,
    taxa = taxa,
    taxa_grp = taxa_grp,
    vrnt_chrgrp = vrnt_chrgrp,
    vrnt_phypos = vrnt_phypos,
    vrnt_name = vrnt_name,
    ploidy = nphase
)
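The layout rules described above can be checked with plain NumPy before a matrix is handed to a constructor. The sketch below is only an illustration: check_phased_array is a hypothetical helper, not part of PyBrOpS, and the axis order (phases, taxa, variants) is taken from the shape description in this section.

```python
import numpy

# hypothetical helper (not part of PyBrOpS): verify the layout described above
def check_phased_array(mat: numpy.ndarray) -> None:
    if mat.ndim != 3:
        raise ValueError("expected 3 dimensions: (phases, taxa, variants)")
    if mat.dtype != numpy.int8:
        raise TypeError("expected an int8 array")
    if not numpy.isin(mat, (0, 1)).all():
        raise ValueError("expected binary allele codes in {0,1}")

# build a small random phased array with shape (m, n, p)
rng = numpy.random.default_rng(0)
nphase, ntaxa, nvrnt = 2, 4, 6
mat = rng.integers(0, 2, size=(nphase, ntaxa, nvrnt), dtype=numpy.int8)
check_phased_array(mat)

# summing over the phase axis (axis 0) collapses the phases into
# unphased genotype codes in {0,1,2} with shape (n, p)
geno = mat.sum(axis=0, dtype=numpy.int8)
print(geno.shape)
```

Collapsing over the phase axis discards the information about which chromosome copy carries each allele, which is exactly what a phased matrix preserves.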

Loading phased genotype matrices from VCF files#

Data from VCF files can be loaded using the from_vcf method. This method assumes that the provided VCF file has already been phased. An unphased input file is loaded as if it were correctly phased, which is problematic for heterozygous loci, where the phase assignment matters.

# read a phased genotype matrix from VCF file
pgmat = DensePhasedGenotypeMatrix.from_vcf("widiv_2000SNPs.vcf.gz")
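Because the loader assumes phased input, it can be worth checking the genotype fields first. In the VCF format, alleles in a GT field are separated by "|" when phased and by "/" when unphased. The sketch below uses two hypothetical helpers (gt_is_phased and split_gt, neither part of PyBrOpS) and assumes GT strings without missing calls.

```python
# hypothetical helpers, not part of PyBrOpS
def gt_is_phased(gt: str) -> bool:
    """Return True if every allele separator in the GT string is '|'."""
    return "|" in gt and "/" not in gt

def split_gt(gt: str) -> list:
    """Split a GT string such as '0|1' into integer allele codes."""
    return [int(a) for a in gt.replace("/", "|").split("|")]

calls = ["0|1", "1|0", "0/1", "1|1"]

# which calls are phased?
phased = [gt_is_phased(gt) for gt in calls]
print(phased)

# allele codes per call; for "0/1" the order carries no phase information
alleles = [split_gt(gt) for gt in calls]
print(alleles)
```

The calls "0|1" and "1|0" are distinct phased heterozygotes, whereas "0/1" says only that one copy of each allele is present.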

Loading phased genotype matrices from HDF5 files#

Like regular genotype matrices, phased genotype matrices can be exported to HDF5 files via the to_hdf5 method. These files can later be read into PyBrOpS using the from_hdf5 method. The example below illustrates loading a DensePhasedGenotypeMatrix into memory from an HDF5 file:

# read a phased genotype matrix from HDF5 file
pgmat = DensePhasedGenotypeMatrix.from_hdf5("widiv_2000SNPs.h5")

Phased Genotype Matrix Properties#

General properties#

Summary of PhasedGenotypeMatrix general properties#

Property | Description
mat | Pointer to the raw phased genotype matrix
mat_ndim | The number of dimensions for the phased genotype matrix
mat_shape | Genotype matrix shape
mat_format | Genotype matrix format
ploidy | The ploidy of the taxa represented by the phased genotype matrix

Phase properties#

Summary of PhasedGenotypeMatrix phase properties#

Property | Description
nphase | The number of chromosome phases represented by the phased genotype matrix
phase_axis | The matrix axis along which phases are stored

Taxa properties#

Summary of PhasedGenotypeMatrix taxa properties#

Property | Description
ntaxa | The number of taxa represented by the phased genotype matrix
taxa | The names of the taxa
taxa_axis | The matrix axis along which taxa are stored
taxa_grp | Taxa group label
taxa_grp_name | The names of the taxa groups
taxa_grp_stix | The start indices (inclusive) for each taxa group, post sorting and grouping
taxa_grp_spix | The stop indices (exclusive) for each taxa group, post sorting and grouping
taxa_grp_len | The length of each taxa group, post sorting and grouping
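The group name, start index, stop index, and length descriptions above can be made concrete with plain NumPy. The sketch below is an assumption about how such metadata relates to sorted group labels; it does not reproduce PyBrOpS internals, and all variable names are illustrative.

```python
import numpy

# unsorted taxa group labels for six taxa
taxa_grp = numpy.array([3, 1, 2, 1, 3, 3])

# sort the labels so that members of each group are contiguous
order = numpy.argsort(taxa_grp, kind="stable")
sorted_grp = taxa_grp[order]

# group names, start indices (inclusive), and lengths of each group
names, stix, lens = numpy.unique(
    sorted_grp, return_index=True, return_counts=True
)

# stop indices (exclusive) follow from start index + length
spix = stix + lens

print(names, stix, spix, lens)
```

With this bookkeeping, the slice sorted_grp[stix[i]:spix[i]] contains exactly the members of group names[i].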

Marker variant properties#

Summary of PhasedGenotypeMatrix marker variant properties#

Property | Description
nvrnt | The number of marker variants represented by the phased genotype matrix
vrnt_name | The names of the marker variants
vrnt_axis | The axis along which marker variants are stored
vrnt_chrgrp | The chromosome to which a marker variant belongs
vrnt_phypos | The physical position of a marker variant
vrnt_genpos | The genetic position of a marker variant
vrnt_xoprob | The crossover probability between a marker variant and the previous marker
vrnt_hapref | The reference haplotype for the marker variant
vrnt_hapalt | The alternative haplotype for the marker variant
vrnt_hapgrp | The haplotype grouping for the marker variant
vrnt_mask | A mask associated with the marker variants
vrnt_chrgrp_name | The names of the chromosomes
vrnt_chrgrp_stix | The start indices (inclusive) for each chromosome, post sorting and grouping
vrnt_chrgrp_spix | The stop indices (exclusive) for each chromosome, post sorting and grouping
vrnt_chrgrp_len | The length of each chromosome, post sorting and grouping

Copying Phased Genotype Matrices#

# import the standard library copy module
import copy

# copy a phased genotype matrix
tmp = copy.copy(pgmat)
tmp = pgmat.copy()

# deep copy a phased genotype matrix
tmp = copy.deepcopy(pgmat)
tmp = pgmat.deepcopy()
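The difference between a shallow and a deep copy matters whenever an object wraps a NumPy array. The sketch below shows the generic Python semantics using a hypothetical stand-in class (Box); it is not a PyBrOpS class, and whether a particular PyBrOpS copy shares its underlying arrays should be checked against the library itself.

```python
import copy
import numpy

class Box:
    """Hypothetical stand-in for an object that wraps a NumPy array."""
    def __init__(self, mat: numpy.ndarray):
        self.mat = mat

a = Box(numpy.zeros((2, 3, 4), dtype="int8"))

# a shallow copy creates a new object that references the same array
shallow = copy.copy(a)

# a deep copy creates a new object with its own copy of the array
deep = copy.deepcopy(a)

# modify the original array in place
a.mat[0, 0, 0] = 1

print(shallow.mat[0, 0, 0])  # the change is visible through the shallow copy
print(deep.mat[0, 0, 0])     # the deep copy is unaffected
```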

Phased Genotype Matrix Element Copy-On-Manipulation#

Adjoining elements#

# create a new genotype matrix to demonstrate
new = pgmat.deepcopy()

# adjoin genotype matrices along the taxa axis
tmp = pgmat.adjoin(new, axis = pgmat.taxa_axis)
tmp = pgmat.adjoin_taxa(new)

# adjoin genotype matrices along the variant axis
tmp = pgmat.adjoin(new, axis = pgmat.vrnt_axis)
tmp = pgmat.adjoin_vrnt(new)
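At the array level, adjoining corresponds to concatenation along one axis, which returns a new array and leaves its inputs untouched. The sketch below uses plain NumPy and assumes the (phases, taxa, variants) axis order described earlier, so taxa are axis 1 and marker variants are axis 2.

```python
import numpy

# two small phased arrays sharing the same number of phases
a = numpy.zeros((2, 3, 5), dtype="int8")   # 3 taxa, 5 variants
b = numpy.ones((2, 4, 5), dtype="int8")    # 4 taxa, 5 variants
c = numpy.ones((2, 3, 2), dtype="int8")    # 3 taxa, 2 variants

# join along the taxa axis: variant counts must match
by_taxa = numpy.concatenate([a, b], axis=1)

# join along the variant axis: taxa counts must match
by_vrnt = numpy.concatenate([a, c], axis=2)

print(by_taxa.shape, by_vrnt.shape)
```

A real genotype matrix also has to combine its metadata (taxa names, variant positions, and so on) along the same axis, which this sketch leaves out.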

Deleting elements#

delete taxa#

# delete first taxon using an integer
tmp = pgmat.delete(0, axis = pgmat.taxa_axis)
tmp = pgmat.delete_taxa(0)

# delete first five taxa using a slice
tmp = pgmat.delete(slice(0,5), axis = pgmat.taxa_axis)
tmp = pgmat.delete_taxa(slice(0,5))

# delete first five taxa using a Sequence
tmp = pgmat.delete([0,1,2,3,4], axis = pgmat.taxa_axis)
tmp = pgmat.delete_taxa([0,1,2,3,4])

delete marker variants#

# delete first marker variant using an integer
tmp = pgmat.delete(0, axis = pgmat.vrnt_axis)
tmp = pgmat.delete_vrnt(0)

# delete first five marker variants using a slice
tmp = pgmat.delete(slice(0,5), axis = pgmat.vrnt_axis)
tmp = pgmat.delete_vrnt(slice(0,5))

# delete first five marker variants using a Sequence
tmp = pgmat.delete([0,1,2,3,4], axis = pgmat.vrnt_axis)
tmp = pgmat.delete_vrnt([0,1,2,3,4])
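The three index forms used above (integer, slice, and sequence) mirror what numpy.delete accepts. The sketch below applies them to a plain array along an assumed taxa axis of 1; it illustrates the indexing semantics only and is not the PyBrOpS implementation.

```python
import numpy

# a small array with 2 phases, 6 taxa, 4 variants
mat = numpy.arange(2 * 6 * 4, dtype="int8").reshape(2, 6, 4)

# delete the first taxon using an integer
by_int = numpy.delete(mat, 0, axis=1)

# delete the first five taxa using a slice
by_slice = numpy.delete(mat, slice(0, 5), axis=1)

# delete the first five taxa using a sequence of indices
by_list = numpy.delete(mat, [0, 1, 2, 3, 4], axis=1)

print(by_int.shape, by_slice.shape, by_list.shape)
```

In every case numpy.delete returns a new array, so mat itself keeps all six taxa.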

Inserting elements#

# create a new genotype matrix to demonstrate
new = pgmat.deepcopy()

# insert genotype matrix along the taxa axis before index 0
tmp = pgmat.insert(0, new, axis = pgmat.taxa_axis)
tmp = pgmat.insert_taxa(0, new)

# insert genotype matrix along the variant axis before index 0
tmp = pgmat.insert(0, new, axis = pgmat.vrnt_axis)
tmp = pgmat.insert_vrnt(0, new)

Selecting elements#

# select first five taxa using a Sequence
tmp = pgmat.select([0,1,2,3,4], axis = pgmat.taxa_axis)
tmp = pgmat.select_taxa([0,1,2,3,4])

# select first five marker variants using a Sequence
tmp = pgmat.select([0,1,2,3,4], axis = pgmat.vrnt_axis)
tmp = pgmat.select_vrnt([0,1,2,3,4])
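Selection by a sequence of indices behaves like numpy.take on the underlying array: the result contains the requested elements, in the requested order, along one axis. The sketch below assumes taxa on axis 1 and marker variants on axis 2 and uses plain NumPy only.

```python
import numpy

# a small array with 2 phases, 6 taxa, 4 variants
mat = numpy.arange(2 * 6 * 4, dtype="int8").reshape(2, 6, 4)

# select the first five taxa
sel_taxa = numpy.take(mat, [0, 1, 2, 3, 4], axis=1)

# select two marker variants, in reversed order
sel_vrnt = numpy.take(mat, [3, 0], axis=2)

print(sel_taxa.shape, sel_vrnt.shape)
```

Because the index order is honored, selection can also duplicate or rearrange elements, not only subset them.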

Phased Genotype Matrix Element In-Place-Manipulation#

Appending elements#

# append genotype matrices along the taxa axis
tmp = pgmat.deepcopy()                   # copy original
tmp.append(pgmat, axis = tmp.taxa_axis)  # append original to copy

tmp = pgmat.deepcopy()                   # copy original
tmp.append_taxa(pgmat)                   # append original to copy

# append genotype matrices along the variant axis
tmp = pgmat.deepcopy()                   # copy original
tmp.append(pgmat, axis = tmp.vrnt_axis)  # append original to copy

tmp = pgmat.deepcopy()                   # copy original
tmp.append_vrnt(pgmat)                   # append original to copy
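The contrast between copy-on-manipulation methods (adjoin) and in-place methods (append) can be sketched with a hypothetical stand-in class. TinyMatrix below is not a PyBrOpS class and carries no metadata; it only illustrates which object changes under each style.

```python
import numpy

class TinyMatrix:
    """Hypothetical stand-in, not a PyBrOpS class."""
    def __init__(self, mat: numpy.ndarray):
        self.mat = mat

    def adjoin(self, other: "TinyMatrix", axis: int) -> "TinyMatrix":
        # copy-on-manipulation: build and return a new object
        return TinyMatrix(numpy.concatenate([self.mat, other.mat], axis=axis))

    def append(self, other: "TinyMatrix", axis: int) -> None:
        # in-place manipulation: replace this object's own array, return nothing
        self.mat = numpy.concatenate([self.mat, other.mat], axis=axis)

a = TinyMatrix(numpy.zeros((2, 3, 4), dtype="int8"))
b = TinyMatrix(numpy.ones((2, 3, 4), dtype="int8"))

# adjoin returns a new object; a keeps its original shape
c = a.adjoin(b, axis=1)
shape_after_adjoin = a.mat.shape

# append modifies a itself
a.append(b, axis=1)
shape_after_append = a.mat.shape

print(c.mat.shape, shape_after_adjoin, shape_after_append)
```

This is why the in-place examples in this section first take a deep copy: the copy absorbs the modification and the original matrix stays intact.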

Removing elements#

remove taxa#

# remove first taxon using an integer
tmp = pgmat.deepcopy()                           # copy original
tmp.remove(0, axis = pgmat.taxa_axis)            # remove from copy

tmp = pgmat.deepcopy()                           # copy original
tmp.remove_taxa(0)                               # remove from copy

# remove first five taxa using a slice
tmp = pgmat.deepcopy()                           # copy original
tmp.remove(slice(0,5), axis = pgmat.taxa_axis)   # remove from copy

tmp = pgmat.deepcopy()                           # copy original
tmp.remove_taxa(slice(0,5))                      # remove from copy

# remove first five taxa using a Sequence
tmp = pgmat.deepcopy()                           # copy original
tmp.remove([0,1,2,3,4], axis = pgmat.taxa_axis)  # remove from copy

tmp = pgmat.deepcopy()                           # copy original
tmp.remove_taxa([0,1,2,3,4])                     # remove from copy

remove marker variants#

# remove first marker variant using an integer
tmp = pgmat.deepcopy()                           # copy original
tmp.remove(0, axis = pgmat.vrnt_axis)            # remove from copy

tmp = pgmat.deepcopy()                           # copy original
tmp.remove_vrnt(0)                               # remove from copy

# remove first five marker variants using a slice
tmp = pgmat.deepcopy()                           # copy original
tmp.remove(slice(0,5), axis = pgmat.vrnt_axis)   # remove from copy

tmp = pgmat.deepcopy()                           # copy original
tmp.remove_vrnt(slice(0,5))                      # remove from copy

# remove first five marker variants using a Sequence
tmp = pgmat.deepcopy()                           # copy original
tmp.remove([0,1,2,3,4], axis = pgmat.vrnt_axis)  # remove from copy

tmp = pgmat.deepcopy()                           # copy original
tmp.remove_vrnt([0,1,2,3,4])                     # remove from copy

Incorporating elements#

# incorp genotype matrix along the taxa axis before index 0
tmp = pgmat.deepcopy()                           # copy original
tmp.incorp(0, pgmat, axis = pgmat.taxa_axis)     # incorporate into copy

tmp = pgmat.deepcopy()                           # copy original
tmp.incorp_taxa(0, pgmat)                        # incorporate into copy

# incorp genotype matrix along the variant axis before index 0
tmp = pgmat.deepcopy()                           # copy original
tmp.incorp(0, pgmat, axis = pgmat.vrnt_axis)     # incorporate into copy

tmp = pgmat.deepcopy()                           # copy original
tmp.incorp_vrnt(0, pgmat)                        # incorporate into copy

Concatenating matrices#

# concatenate along the taxa axis
tmp = pgmat.concat([pgmat, pgmat], axis = pgmat.taxa_axis)
tmp = pgmat.concat_taxa([pgmat, pgmat])

# concatenate along the variant axis
tmp = pgmat.concat([pgmat, pgmat], axis = pgmat.vrnt_axis)
tmp = pgmat.concat_vrnt([pgmat, pgmat])

Grouping and Sorting#


reorder taxa#

# create reordering indices
indices = numpy.arange(pgmat.ntaxa)
tmp = pgmat.deepcopy()

# reorder values along the taxa axis
tmp.reorder(indices, axis = tmp.taxa_axis)

reorder marker variants#

# create reordering indices
indices = numpy.arange(pgmat.nvrnt)
tmp = pgmat.deepcopy()

# reorder values along the marker variant axis
tmp.reorder(indices, axis = tmp.vrnt_axis)
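The indices built with numpy.arange above form the identity permutation, so those reorder calls leave the order unchanged. The sketch below uses a non-trivial permutation on plain NumPy arrays to show the intended effect, assuming taxa on axis 1; the key point is that any per-taxon metadata must be reordered with the same indices.

```python
import numpy

# a small array with 2 phases, 3 taxa, 4 variants, plus taxa names
mat = numpy.arange(2 * 3 * 4).reshape(2, 3, 4)
taxa = numpy.array(["a", "b", "c"], dtype=object)

# a non-trivial permutation: new position i holds old element indices[i]
indices = numpy.array([2, 0, 1])

# apply the same permutation to the matrix and to its metadata
mat_re = mat[:, indices, :]
taxa_re = taxa[indices]

print(taxa_re.tolist())
```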


lexsort taxa#

# create lexsort keys for taxa
key1 = numpy.random.randint(0, 10, pgmat.ntaxa)
key2 = numpy.arange(pgmat.ntaxa)

# lexsort along the taxa axis
pgmat.lexsort((key2,key1), axis = pgmat.taxa_axis)

lexsort marker variants#

# create lexsort keys for marker variants
key1 = numpy.random.randint(0, 10, pgmat.nvrnt)
key2 = numpy.arange(pgmat.nvrnt)

# lexsort along the marker variant axis
pgmat.lexsort((key2,key1), axis = pgmat.vrnt_axis)
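The key order in the tuples above follows the numpy.lexsort convention, in which the last key in the sequence is the primary sort key. A small sketch with plain NumPy, independent of PyBrOpS, makes this visible:

```python
import numpy

# key1 is the primary key because it comes last in the tuple
key1 = numpy.array([1, 0, 1, 0])
key2 = numpy.array([3, 2, 1, 0])

# indices that sort by key1 first, breaking ties with key2
order = numpy.lexsort((key2, key1))

print(order)
print(key1[order], key2[order])
```

With the tuple (key2, key1), elements are ordered by key1, and key2 only decides the order among elements that share the same key1 value.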


sort taxa#

# sort along taxa axis
tmp = pgmat.deepcopy()
tmp.sort(axis = tmp.taxa_axis)

sort marker variants#

# sort along marker variant axis
tmp = pgmat.deepcopy()
tmp.sort(axis = tmp.vrnt_axis)


group taxa#

# group along taxa axis
tmp = pgmat.deepcopy()
tmp.group(axis = tmp.taxa_axis)

# determine whether grouping has occurred along the taxa axis
tmp.is_grouped(axis = tmp.taxa_axis)

group marker variants#

# group along vrnt axis
tmp = pgmat.deepcopy()
tmp.group(axis = tmp.vrnt_axis)

# determine whether grouping has occurred along the vrnt axis
tmp.is_grouped(axis = tmp.vrnt_axis)
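One way to think about grouping is that every label occupies a single contiguous run along the axis. The sketch below tests that condition on a plain label array; labels_are_grouped is a hypothetical helper, and PyBrOpS's is_grouped may instead rely on stored group metadata.

```python
import numpy

# hypothetical helper, not part of PyBrOpS
def labels_are_grouped(labels) -> bool:
    """Return True if each distinct label forms one contiguous run."""
    labels = numpy.asarray(labels)
    if labels.size == 0:
        return True
    # count runs of equal adjacent labels
    nruns = 1 + numpy.count_nonzero(labels[1:] != labels[:-1])
    # grouped means: exactly one run per distinct label
    return bool(nruns == numpy.unique(labels).size)

grouped = labels_are_grouped([1, 1, 2, 3, 3])
ungrouped = labels_are_grouped([1, 2, 1])
print(grouped, ungrouped)
```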

Summary Statistics#

# count the number of major alleles across all taxa
out = pgmat.acount()
out = pgmat.acount(dtype = "int32")

# calculate the allele frequency across all taxa
out = pgmat.afreq()
out = pgmat.afreq(dtype = "float32")

# calculate whether a locus is polymorphic across all taxa
out = pgmat.apoly()
out = pgmat.apoly(dtype = int)

# count the number of genotypes across all taxa
out = pgmat.gtcount()
out = pgmat.gtcount(dtype = "int32")

# calculate the genotype frequency across all taxa
out = pgmat.gtfreq()
out = pgmat.gtfreq(dtype = "float32")

# calculate the minor allele frequency across all taxa
out = pgmat.maf()
out = pgmat.maf(dtype = "float32")

# calculate the mean expected heterozygosity for the population
out = pgmat.meh()
out = pgmat.meh(dtype = "float32")

# count the number of major alleles individually within taxa
out = pgmat.tacount()
out = pgmat.tacount(dtype = "int32")

# calculate the allele frequency individually within taxa
out = pgmat.tafreq()
out = pgmat.tafreq(dtype = "float32")
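Several of these statistics can be written directly against a (phases, taxa, variants) array. The sketch below uses textbook biallelic definitions on plain NumPy; it is an assumption about what the quantities mean, and the exact formulas and scaling used by PyBrOpS (for example for meh) may differ.

```python
import numpy

rng = numpy.random.default_rng(1)
mat = rng.integers(0, 2, size=(2, 5, 8), dtype=numpy.int8)
nphase, ntaxa, nvrnt = mat.shape

# allele count per marker across all taxa and phases, shape (nvrnt,)
acount = mat.sum(axis=(0, 1), dtype=numpy.int64)

# allele frequency per marker across all taxa
afreq = acount / (nphase * ntaxa)

# minor allele frequency: the smaller of p and 1 - p
maf = numpy.minimum(afreq, 1.0 - afreq)

# a marker is polymorphic if both alleles are present
apoly = (afreq > 0.0) & (afreq < 1.0)

# allele count and frequency within each taxon, shape (ntaxa, nvrnt)
tacount = mat.sum(axis=0, dtype=numpy.int64)
tafreq = tacount / nphase

# mean expected heterozygosity, using 2p(1-p) per marker
meh = float((2.0 * afreq * (1.0 - afreq)).mean())

print(acount.shape, tacount.shape, meh)
```

Summing the within-taxon counts over taxa recovers the population-level counts, which is a quick consistency check between the two views.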

Saving Phased Genotype Matrices#

Write to HDF5#

# write a phased genotype matrix to an HDF5 file