Glossary

A
Allele: Any of two or more genes/DNA sequences that have the same relative position on homologous chromosomes and are responsible for alternative characteristics.
Allele frequency: The relative frequency of an allele at a genetic locus in a population, usually expressed as a proportion or a percentage.
Annotation: Annotation is extra information associated with a particular point in a document or other piece of information. Molecular biology and bioinformatics have used DNA annotation since the 1980s: a previously unknown sequence of genetic material is annotated with information that relates positions to intron-exon boundaries, regulatory sequences, repeats, gene names, protein products, and so on. This annotation is usually stored in predefined fields in biological databases, especially sequence databases.
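As a minimal sketch of the definition above, the following counts alleles across diploid genotypes at one locus and reports each as a proportion. The function name and the sample genotypes are hypothetical and chosen only for illustration.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Return {allele: proportion} for diploid genotypes at a single locus."""
    # Flatten the genotypes into individual allele copies and count them.
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

# Hypothetical sample: four individuals, two alleles ("A" and "a").
genotypes = [("A", "A"), ("A", "a"), ("a", "a"), ("A", "a")]
freqs = allele_frequencies(genotypes)
print(freqs)
```

Multiplying each proportion by 100 gives the percentage form.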
B
Base pair: Two complementary nucleotides which form hydrogen bonds between the two antiparallel DNA strands.
Biological process: One of the three categories used by the Gene Ontology project, biological process describes broad biological goals, such as mitosis or purine metabolism.
Bit Score: The bit score is derived from the raw alignment score in which the statistical properties of the scoring system used have been taken into account. Because bit scores have been normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.
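The normalization described above is conventionally written, in Karlin-Altschul statistics, as S' = (lambda * S - ln K) / ln 2, where S is the raw score and lambda and K are parameters of the scoring system. The sketch below assumes that formula; the function name is hypothetical and the lambda and K values in the usage line are illustrative placeholders, not values taken from this glossary.

```python
import math

def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score: (lam*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

# Illustrative parameter values only; real values depend on the
# scoring matrix and gap penalties used in the search.
print(bit_score(100, lam=0.267, k=0.041))
```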
BLAST: Basic Local Alignment Search Tool is a search algorithm developed by Altschul et al. (1990). It is a very fast search algorithm that is used by the blastn, blastp, and blastx programs to separately search protein or DNA databases. BLAST is best used for sequence similarity searching, rather than for motif searching.
blastn: A BLAST program that compares a nucleotide query sequence against a nucleotide sequence database.
BLOSUM 80: An alternative scoring matrix for BLAST searches.
BLOSUM 45: An alternative scoring matrix for BLAST searches.
BLOSUM 62: A scoring matrix that is used as the default in blastp, blastx, tblastx, and tblastn BLAST searches.
C
cDNA / Complementary DNA: Complementary DNA (cDNA) is DNA synthesized from a mature mRNA template in a reaction catalysed by the enzyme reverse transcriptase. cDNA is often used to clone eukaryotic genes.
cDNA library: A cDNA library is a collection of cloned cDNAs that represents the mRNAs contained within a cell. Because working with mRNA is difficult (mRNA is unstable and is easily degraded by RNases, which can be found even on the skin), researchers use an enzyme called reverse transcriptase to produce a DNA copy of each mRNA strand. These reverse-transcribed mRNAs are referred to as cDNA and are collectively known as the library.
Cellular Component: One of the three categories used by the Gene Ontology project, cellular component encompasses subcellular structures, locations, and macromolecular complexes. Examples include nucleus, membrane, and ribosome.
Cloning: The process of making copies of a specific piece of DNA, usually a gene.
Clustal W: Clustal W is an alignment program for DNA and proteins with improved sensitivity for the alignment of divergent protein sequences.
Comparative genomics: Comparative genomics is the study of relationships between the genomes of different species or strains. Comparative genomics is an attempt to take advantage of the information provided by the signatures of selection to understand the function and evolutionary processes that act on genomes.
D
DNA: Deoxyribonucleic acid. A molecule made of two strands in antiparallel orientation that are held together by hydrogen bonds. Each strand is a chain of nucleotides, and each nucleotide is made up of a sugar (deoxyribose), a phosphate group and one of four bases (adenine, guanine, cytosine or thymine).
DUST: A program for filtering low complexity regions from nucleic acid sequences. DUST filtering is performed by default in blastn searches.
E
Exon: The region of a gene that encodes amino acids in a protein.
Electrophoresis: The process in which DNA fragments can be separated according to size and electrical charge by applying an electric current to them, generally in a matrix of agarose gel or polyacrylamide gel.
Expressed sequence tags: An expressed sequence tag or EST is a short sub-sequence of a transcribed spliced nucleotide sequence (either protein-coding or not). They are intended as a way to identify gene transcripts, and are instrumental in gene discovery and gene sequence determination. An EST is produced by one-shot sequencing of a cloned mRNA (i.e. sequencing several hundred base pairs from an end of a cDNA clone taken from a cDNA library).
E-Value: In a BLAST search, the E-Value (Expectation Value) is the number of different alignments with scores equivalent to or better than the observed alignment score that are expected to occur in a database search by chance. The lower the E-Value, the more significant the score.
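In Karlin-Altschul statistics the expectation value is commonly related to the bit score S' by E = m * n * 2^(-S'), where m and n are the effective lengths of the query and the database. The sketch below assumes that relationship; the function and parameter names are hypothetical.

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance alignments scoring at least bit_score."""
    # Each additional bit of score halves the expected number of chance hits.
    return query_len * db_len * 2.0 ** (-bit_score)

# A larger search space raises E; a higher bit score lowers it.
print(expect_value(bit_score=10, query_len=32, db_len=32))
```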
Evidence Code: Every GO annotation must indicate the type of evidence that supports it; the evidence codes correspond to broad categories of experimental or other support. The evidence code indicates how annotation to a particular term is supported.
F
FASTA File: A FASTA file is a simple text format primarily used to store genetic sequence information. FASTA files are easily created in a text editor. Each record consists of a header line beginning with a '>', holding a name or identifier and any additional information about the sequence, followed by one or more lines containing the DNA or protein sequence.
Filter Options: Filtering masks off portions of a query sequence that have low compositional complexity (such as short internal repeats or poly-A sequences) to reduce the frequency of statistically significant but biologically uninteresting BLAST results.
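The layout described above can be read with a few lines of code. This is a minimal sketch, not a full parser: the function name and the sample records are hypothetical, and it assumes every record starts with a '>' header line.

```python
def parse_fasta(text):
    """Return {header: sequence} from FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]          # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}

# Hypothetical two-record example.
example = """>seq1 hypothetical example
ACGTACGT
ACGT
>seq2
MKV
"""
records = parse_fasta(example)
print(records)
```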
Functional genomics: Functional genomics is a field of molecular biology that attempts to make use of the vast wealth of data produced by genomic projects (such as genome sequencing projects) to describe gene (and protein) functions and interactions. Unlike genomics and proteomics, functional genomics focuses on the dynamic aspects such as gene transcription, translation, and protein-protein interactions, as opposed to the static aspects of the genomic information such as DNA sequence or structures.
G
Gene: One of the chromosomal units that transmit specific hereditary traits: a segment of the self-reproducing molecule, deoxyribonucleic acid.
Gene expression: Gene expression is the process by which a gene's DNA sequence is converted into functional proteins.
Gene Ontology (GO): The Gene Ontology (GO) project was established to provide a common language to describe aspects of a gene product's biology. The use of a consistent vocabulary allows genes from different species to be compared based on their GO annotations. For each of three categories of biological information--molecular function, biological process, and cellular component--a set of terms has been selected and organized. Each set of terms uses a controlled vocabulary, and parent-child relationships between terms are defined. This combination of a controlled vocabulary with defined relationships between items is referred to as an ontology. Within an ontology, a child may be a "part of" or an example ("instance") of its parent. The three controlled vocabularies, or gene ontologies, are organized independently of one another. Many-to-many parent-child relationships are allowed in the ontologies. A gene may be annotated to any level in an ontology, and to more than one item within an ontology. The Gene Ontology project is a collaboration between three model organism databases, FlyBase (Drosophila), Saccharomyces Genome Database (SGD) and Mouse Genome Informatics (MGI).
Genetic code: The set of rules by which each group of three nucleotides (a codon) specifies a particular amino acid in a protein during translation.
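One way to picture an ontology with many-to-many parent-child relationships is as a directed graph in which a term may have several parents. The sketch below uses made-up term names and a hypothetical helper, and is not drawn from the GO files themselves; it only shows how all ancestors of a term can be collected.

```python
# Hypothetical toy ontology: child term -> list of (relationship, parent term).
PARENTS = {
    "term_c": [("instance_of", "term_a"), ("part_of", "term_b")],
    "term_a": [("instance_of", "root")],
    "term_b": [("instance_of", "root")],
    "root": [],
}

def ancestors(term, parents=PARENTS):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in parents.get(current, []):
            if parent not in seen:  # a parent can be reached by several paths
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(ancestors("term_c")))
```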
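As an illustration, the standard genetic code can be written as a lookup table from codons to one-letter amino acid codes. The sketch below builds the table from the conventional TCAG ordering and translates a short, made-up DNA sequence; the function name is hypothetical, and unknown codons are reported as "X" by assumption.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a coding DNA sequence codon by codon from position 0."""
    dna = dna.upper()
    # Trailing bases that do not fill a whole codon are ignored.
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

print(translate("ATGGCCTAA"))  # -> MA*
```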
Genetic map: A map of chromosomes showing the position of known genes and/or markers relative to each other, rather than as specific physical points on each chromosome.
Genetic Markers: Alleles of DNA polymorphisms, used as experimental tags to keep track of an individual, a tissue, a cell, a nucleus, a chromosome, or a gene. Stated another way, any character that acts as a signpost or signal of the presence or location of a gene or hereditary characteristic in an individual of a population.
Genome: The total genetic complement of an organism, i.e. an organism's complete set of DNA sequences.
Genotype: The actual alleles present in an individual; the genetic makeup of an organism.
H
Homolog: A gene related to a second gene by descent from a common ancestral DNA sequence. The term homolog may apply to the relationship between genes separated by the event of speciation (see ortholog) or to the relationship between genes separated by the event of genetic duplication.
I
Intron: A noncoding sequence of DNA that is initially copied into RNA but is spliced out of the final RNA transcript.
L
Locus: The position on a chromosome of a gene or a particular segment of DNA (marker).
Low Complexity Region: Regions of biased composition including homopolymeric runs, short-period repeats, and more subtle overrepresentation of some residues. The SEG program is used to mask or filter LCRs in amino acid queries. The DUST program is used to mask or filter LCRs in nucleic acid queries.
M
Microsatellite: Tandem repeats of short simple DNA sequence, generally of 1-6 bases.
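A simple way to locate such repeats is a regular expression with a back-reference. The sketch below is one possible approach, not a published tool: the minimum of three tandem copies is an assumption, and the function name and test sequence are hypothetical.

```python
import re

REPEAT_TYPE = {1: "mono", 2: "di", 3: "tri", 4: "tetra", 5: "penta", 6: "hexa"}

# A motif of 1-6 bases (shortest tried first) followed by at least two
# more copies of itself, i.e. three or more tandem copies in total.
SSR_PATTERN = re.compile(r"([ACGT]{1,6}?)\1{2,}")

def find_ssrs(sequence):
    """Return (start, motif, copies, repeat type) for each perfect repeat."""
    hits = []
    for match in SSR_PATTERN.finditer(sequence.upper()):
        motif = match.group(1)
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies, REPEAT_TYPE[len(motif)]))
    return hits

print(find_ssrs("GGTACACACACTT"))  # -> [(3, 'AC', 4, 'di')]
```

This finds perfect repeats only; imperfect and compound repeats would need additional logic.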
Molecular Function: One of the three categories used by the Gene Ontology project, molecular function describes the tasks performed by individual gene products; examples are transcription factor and DNA binding.
Motif: Recurring pattern in a DNA sequence.
Mutation: A disruption in the normal sequence of a DNA strand resulting in a varied sequence at the same position.
mRNA / Messenger RNA: Messenger Ribonucleic Acid (mRNA) is a molecule of RNA encoding a chemical "blueprint" for a protein product. mRNA is transcribed from a DNA template, and carries coding information to the sites of protein synthesis: the ribosomes.
N
Non-coding DNA: The segment of DNA that does not carry the information necessary to make an amino acid sequence.
Nucleotide: One of the structural components, or building blocks, of DNA and RNA. A nucleotide consists of a base (one of five chemicals: adenine, thymine in DNA, uracil in RNA, guanine, and cytosine) plus a molecule of sugar and one of phosphoric acid.
P
PAM 30: Sequence alignment matrix that allows 30 accepted point mutations per 100 amino acids. A higher PAM is more suitable for comparing distantly related sequences, while a lower PAM is suitable for comparing closely related sequences (Schwartz and Dayhoff, 1978).
PAM 70: Sequence alignment matrix that allows 70 accepted point mutations per 100 amino acids. A higher PAM (for example, PAM250) is more suitable for comparing distantly related sequences, while a lower PAM is suitable for comparing more closely related sequences (Schwartz and Dayhoff, 1978).
Phenotype: The visible characteristics of an organism, resulting from its genotype and environment.
Plasmid: A plasmid is a DNA molecule separate from the chromosomal DNA and capable of autonomous replication. It is typically circular and double-stranded. It usually occurs in bacteria, sometimes in eukaryotic organisms. Plasmid size varies from 1 to over 400 kilobase pairs (kbp).
poly(A) tail: The 3' poly(A) tail is a long sequence of adenine nucleotides (often several hundred) added to the "tail" or 3' end of the pre-mRNA through the action of an enzyme, polyadenylate polymerase. In higher eukaryotes, the poly(A) tail is added onto transcripts that contain a specific sequence, the AAUAAA signal.
Polymerase Chain Reaction: Polymerase Chain Reaction (PCR) is a method that exponentially amplifies a specific DNA sequence or region, allowing rapid DNA analysis.
Primer: A short oligonucleotide sequence used to amplify DNA sequences in a polymerase chain reaction.
R
Recombinant DNA: Recombinant DNA is a form of artificial DNA which is engineered through the combination or insertion of one or more DNA strands, thereby combining DNA sequences which would not normally occur together.
Repeat element: Short stretches of DNA with the capacity to move between different points within a genome.
Repeat kind: Nature of repeat sequence, e.g. perfect repeat, imperfect repeat, compound repeat.
Repeat type: Based on the number of nucleotides in a motif, repeat type is classified into mono, di, tri, tetra, penta, hexa.
Restriction enzyme: A restriction enzyme (or restriction endonuclease) is an enzyme that cuts double-stranded DNA. The enzyme makes two incisions, one through each of the sugar-phosphate backbones of the double helix without damaging the nitrogenous bases.
S
SEG: A program for filtering low complexity regions in amino acid sequences. Residues that have been masked are represented as "X" in an alignment. SEG filtering is performed by default in blastp, blastx, tblastx, and tblastn searches.
Splicing: Splicing is the process by which pre-mRNA is modified to remove certain stretches of non-coding sequences called introns; the stretches that remain include protein-coding sequences and are called exons. Sometimes pre-mRNA messages may be spliced in several different ways, allowing a single gene to encode multiple proteins. This process is called alternative splicing. Splicing is usually performed by an RNA-protein complex called the spliceosome, but some RNA molecules are also capable of catalyzing their own splicing.
SSR: Simple sequence repeats (same as microsatellites). A repeated sequence of 1 to 6 bases.
STR: Short tandem repeats (same as microsatellites). A repeated sequence of 1 to 6 bases.
T
tblastn: A BLAST program that compares a protein query against the six-frame translations of a nucleotide sequence database.
tblastx: A BLAST program that translates the query nucleotide sequence in all six possible frames and compares it against the six-frame translations of a nucleotide sequence database. The purpose of tblastx is to find very distant relationships between nucleotide sequences.
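The six frames mentioned in the tblastn and tblastx entries are the three codon offsets on the given strand plus the three on its reverse complement. The sketch below only enumerates the nucleotide frames, trimmed to whole codons; it does not translate them or perform any search. The function names and the sample sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def six_frames(dna):
    """Return the six reading frames, each trimmed to whole codons."""
    frames = []
    for strand in (dna.upper(), reverse_complement(dna)):
        for offset in range(3):
            frame = strand[offset:]
            # Drop trailing bases that do not complete a codon.
            frames.append(frame[: len(frame) - len(frame) % 3])
    return frames

print(six_frames("ATGCCC"))
# -> ['ATGCCC', 'TGC', 'GCC', 'GGGCAT', 'GGC', 'GCA']
```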
Transcription: Transcription is the process through which a DNA sequence is enzymatically copied by an RNA polymerase to produce a complementary RNA. In other words, it is the transfer of genetic information from DNA into RNA.
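In sequence terms, the RNA produced is the reverse complement of the template strand with uracil in place of thymine. A minimal sketch, assuming the template strand is given 5' to 3' and using a hypothetical function name and sample sequence:

```python
# Map each template DNA base to the RNA base that pairs with it.
DNA_TO_RNA_COMPLEMENT = str.maketrans("ACGT", "UGCA")

def transcribe(template_strand):
    """Return the RNA (5'->3') complementary to a template strand (5'->3')."""
    paired = template_strand.upper().translate(DNA_TO_RNA_COMPLEMENT)
    # Reverse so that the RNA is also reported in the 5'->3' direction.
    return paired[::-1]

print(transcribe("GGCCAT"))  # -> AUGGCC
```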
Transcriptome: The transcriptome is the set of all messenger RNA (mRNA) molecules, or "transcripts", produced in one or a population of cells. The term can be applied to the total set of transcripts in a given organism, or to the specific subset of transcripts present in a particular cell type. Unlike the genome, which is roughly fixed for a given cell line (excluding mutations), the transcriptome can vary with external environmental conditions.
V
Vector: A vector is a DNA molecule into which foreign fragments of DNA may be inserted. A vector functions like a "molecular carrier", which will carry fragments of DNA into a host cell. Vectors are usually derived from plasmids, which are small, circular, double-stranded DNA molecules occurring naturally in the cytoplasm of bacteria. Vectors contain an origin of replication, which enables the vector, together with the foreign DNA fragment inserted into it, to replicate. Vectors contain genetic markers that allow for selection of cells which have taken up the plasmid DNA. Vector DNA functions to insert and amplify a gene into a target genome. Vector DNA can be used in a DNA vaccine.
W
Wildcard: A wildcard is a character that may be used in a search term to represent one or more other characters.
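As an illustration only, the sketch below uses shell-style wildcards from Python's standard fnmatch module, where "?" stands for exactly one character and "*" stands for any run of characters (including none). Which wildcard characters a particular search tool accepts is an assumption here; the list of terms is taken from this glossary.

```python
from fnmatch import fnmatchcase

terms = ["blastn", "blastp", "blastx", "tblastn", "tblastx"]

# "?" matches exactly one character.
one_char = [t for t in terms if fnmatchcase(t, "blast?")]
# "*" matches any run of characters, including an empty one.
any_prefix = [t for t in terms if fnmatchcase(t, "*blastn")]

print(one_char)    # -> ['blastn', 'blastp', 'blastx']
print(any_prefix)  # -> ['blastn', 'tblastn']
```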
Word Size: The Word Size (W) is a BLAST parameter that determines the minimum length of a match. The query sequence is split up into every possible 'word' of the selected size. BLAST first searches for a perfect match of at least the word length. Once a match is found, BLAST tries to extend it into a high-scoring segment pair (HSP).
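The word-splitting and exact-match steps can be sketched as follows. This is a simplified illustration, not the BLAST implementation: it covers only perfect word matches and omits the extension step, and the function names and sequences are hypothetical.

```python
def words(query, word_size):
    """Return every overlapping word of length word_size in the query."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]

def exact_seed_hits(query, subject, word_size):
    """Return (query position, subject position) for each perfect word match."""
    hits = []
    for q_pos, word in enumerate(words(query, word_size)):
        s_pos = subject.find(word)
        while s_pos != -1:
            hits.append((q_pos, s_pos))
            s_pos = subject.find(word, s_pos + 1)  # look for further matches
    return hits

print(words("ACGTAC", 3))                     # -> ['ACG', 'CGT', 'GTA', 'TAC']
print(exact_seed_hits("ACGT", "TTACGTT", 3))  # -> [(0, 2), (1, 3)]
```

A larger word size yields fewer seed hits and a faster but less sensitive search.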