Statistical, functional and bioinformatic analysis of ChIP-Seq and RNA-Seq data in regulatory gene networks

Pedro Madrigal, Institute of Plant Genetics, Polish Academy of Sciences

TF binding site analysis in ChIP-Seq

Transcription factors (TFs) and other chromatin-associated proteins are involved in important mechanisms that influence phenotype. Determining how proteins interact with DNA to regulate gene expression is essential for a full understanding of many biological processes and disease states, including the regulatory gene networks that control reproductive development and flowering in Arabidopsis thaliana.

Currently, transcription factor binding sites are identified from ChIP-Seq datasets produced by wet labs mainly by applying peak-searching criteria along the enriched regions, under certain assumptions about the statistical distribution of the treated sample and the background control (if available). The vast majority of ChIP-Seq software packages follow this approach, with very few exceptions that turn the problem from one of peak identification into one of peak deconvolution. In the current state of the art, peak detection is one of the stages of ChIP-Seq data analysis that has a strong impact on the final conclusions.
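To make the peak-searching idea concrete, the following minimal Python sketch tests each genomic window for enrichment of the treated sample over the control. It is an illustration under stated assumptions, not the method of any particular package: reads are assumed to be already counted in fixed-size windows, the background is assumed to be Poisson, the control is scaled by total read count, and the genome-wide mean is used as a minimum background rate. All function names are hypothetical.

```python
import math

def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), summing the tail directly."""
    if k <= 0:
        return 1.0
    total = 0.0
    i = k
    while True:
        term = math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        total += term
        # Past the mean the terms only shrink, so stop once they are negligible.
        if i > lam and term <= 1e-16 * total:
            break
        i += 1
    return min(total, 1.0)

def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

def call_enriched_windows(chip, control, alpha=0.05):
    """Return indices of windows enriched in `chip` relative to `control`.

    chip, control: read counts per window (same windows, non-empty samples).
    """
    scale = sum(chip) / sum(control)   # assumption: crude library-size scaling
    floor = sum(chip) / len(chip)      # assumption: genome-wide mean as minimum rate
    pvals = [poisson_upper_tail(c, max(scale * b, floor))
             for c, b in zip(chip, control)]
    qvals = benjamini_hochberg(pvals)
    return [i for i, q in enumerate(qvals) if q <= alpha]
```

For example, `call_enriched_windows([3, 2, 4, 40, 3, 2, 35, 3], [3, 3, 2, 4, 3, 2, 3, 4])` flags only the two windows with 40 and 35 reads. The choice of background distribution and normalisation is exactly the kind of assumption that differs between peak callers.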

Therefore, new approaches are required to improve the accuracy, in terms of sensitivity, specificity and false discovery rate, of the results obtained from existing experimental methods. These approaches should be combined with user-friendly tools that are not too computationally expensive to process the large amounts of data coming from high-throughput sequencing. This gap can be filled by developing new signal processing techniques based on mathematical and statistical models adapted to next-generation sequencing data.

The aim of my project is to implement and test new statistical algorithms that determine and characterize the enriched regions using mathematical models typically employed in functional data analysis, and that detect, locate and classify the binding sites through a new strategy, which may include classification methods such as neural networks or pattern recognition.
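One simple way to picture the functional view of an enriched region is to treat the read coverage as a noisy sample of a smooth curve and to locate candidate binding sites at its local maxima. The Python sketch below does this with a truncated Gaussian kernel smoother. This is a stand-in chosen for brevity, not the project's algorithm: functional data analysis more typically uses basis expansions such as B-splines, and the function names, bandwidth and height threshold here are all hypothetical.

```python
import math

def gaussian_smooth(signal, bandwidth):
    """Kernel-smooth a coverage vector with a Gaussian kernel truncated at 3 bandwidths."""
    radius = int(3 * bandwidth)
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-0.5 * (d / bandwidth) ** 2) for d in offsets]
    n = len(signal)
    smoothed = []
    for i in range(n):
        num = 0.0
        den = 0.0
        for d, w in zip(offsets, weights):
            j = i + d
            if 0 <= j < n:      # renormalise the kernel at the edges
                num += w * signal[j]
                den += w
        smoothed.append(num / den)
    return smoothed

def local_maxima(curve, min_height):
    """Return indices where the curve stops rising and is at least `min_height`."""
    peaks = []
    for i in range(1, len(curve) - 1):
        rising_before = curve[i] > curve[i - 1]
        not_rising_after = curve[i] >= curve[i + 1]
        if rising_before and not_rising_after and curve[i] >= min_height:
            peaks.append(i)
    return peaks

def candidate_binding_sites(coverage, bandwidth=2.0, min_height=5.0):
    """Smooth the coverage and return positions of its local maxima."""
    return local_maxima(gaussian_smooth(coverage, bandwidth), min_height)
```

Applied to a coverage profile with two bumps on a flat background, `candidate_binding_sites` returns the positions of the two summits. Classifying the resulting sites (for example by curve shape) would be a separate step.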

Fig 1. Protein binding signal on chromosome 4 of Arabidopsis thaliana, obtained by analyzing ChIP-Seq data with the CSAR R/Bioconductor package [7].

Data integration in ChIP-Seq

As NGS technologies become cheaper and faster, new methodologies are needed to merge and unify the biological insights contained in samples coming from different replicates (technical or biological), from different time points, or even from proteins that bind DNA in different configurations. The next step of the project is the construction of new software libraries for the joint analysis of samples from different sources, incorporating information on chromatin structure and nucleosome occupancy.
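As a small illustration of what joint analysis of replicates can look like at its simplest, the Python sketch below keeps only the genomic regions that are covered by peaks from a minimum number of replicates. It is a hypothetical example, not the planned library: peaks are assumed to be half-open (start, end) intervals on a single chromosome, each replicate's peaks are assumed not to overlap one another, and `consensus_regions` and `min_support` are names introduced here.

```python
def consensus_regions(replicates, min_support=2):
    """Return regions covered by peaks from at least `min_support` replicates.

    replicates: list of peak lists; each peak is a half-open (start, end) tuple.
    Assumes peaks within one replicate do not overlap, so the sweep depth
    equals the number of supporting replicates.
    """
    events = []
    for peaks in replicates:
        for start, end in peaks:
            events.append((start, 1))    # a peak opens
            events.append((end, -1))     # a peak closes
    # At equal positions, closings sort before openings, so peaks that
    # merely touch are not counted as overlapping.
    events.sort()

    regions = []
    depth = 0
    region_start = None
    for pos, delta in events:
        previous = depth
        depth += delta
        if previous < min_support <= depth:
            region_start = pos                   # support threshold reached
        elif depth < min_support <= previous:
            regions.append((region_start, pos))  # support threshold lost
    return regions
```

With three replicates `[(100, 200), (500, 600)]`, `[(150, 250), (900, 950)]` and `[(180, 220), (550, 650)]`, requiring two supporting replicates yields `[(150, 220), (550, 600)]`, while requiring all three yields `[(180, 200)]`. A sweep over sorted interval endpoints keeps the cost at O(n log n) in the total number of peaks.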

Identification of differentially expressed genes from RNA-Seq data and statistical testing

RNA-Seq data can be used to quantify gene expression, to discover novel transcripts and exon junctions, and to detect and quantify isoforms. Biological replicates are essential in RNA-Seq experiments in order to draw general conclusions, and new procedures will be developed to address the lack of statistical and computational methodologies in this area.
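The role of biological replicates can be illustrated with a deliberately simple Python sketch: raw counts are normalised to log2 counts per million, and each gene is summarised by a log2 fold change and a Welch t statistic computed across replicates. This is only a sketch under stated assumptions, not the procedure to be developed in the project. Count-based models such as the negative binomial are commonly preferred for RNA-Seq, no p-values are computed here, and the function names and the pseudo-count of 0.5 are hypothetical.

```python
import math
from statistics import mean, variance

def log_cpm(sample_counts, pseudo=0.5):
    """Convert one sample's raw gene counts to log2 counts per million."""
    total = sum(sample_counts)
    return [math.log2((c + pseudo) / total * 1e6) for c in sample_counts]

def welch_t(a, b):
    """Welch t statistic for mean(a) - mean(b); needs >= 2 values per group."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

def compare_conditions(group_a, group_b):
    """Return per-gene (log2 fold change of B over A, Welch t statistic).

    group_a, group_b: lists of replicates; each replicate is a list of raw
    counts with genes in the same order.
    """
    logs_a = [log_cpm(rep) for rep in group_a]
    logs_b = [log_cpm(rep) for rep in group_b]
    results = []
    for g in range(len(group_a[0])):
        a = [rep[g] for rep in logs_a]
        b = [rep[g] for rep in logs_b]
        results.append((mean(b) - mean(a), welch_t(b, a)))
    return results
```

Without replicates the variance terms in `welch_t` cannot be estimated at all, which is one way to see why replicated designs are needed before any statement about differential expression can be generalized.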

Fig 2. IGB wiggle plot of gene expression along an Arabidopsis thaliana chromosome, obtained by analyzing RNA-Seq data with TopHat [4].


[1] Wilbanks EG, Facciotti MT (2010) Evaluation of algorithm performance in ChIP-Seq peak detection. PLoS ONE 5, e11471.

[2] Szalkowski AM, Schmid CD (2010) Rapid innovation in ChIP-seq peak-calling algorithms is outdistancing benchmarking efforts. Brief Bioinform, doi:10.1093/bib/bbq068.

[3] Pepke S, Wold B, Mortazavi A (2009) Computation for ChIP-seq and RNA-seq studies. Nat Methods 6, S22-32.

[4] Trapnell C, Pachter L, Salzberg SL (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25(9), 1105-11.

[5] Nicol JW, et al. (2009) The Integrated Genome Browser: free software for distribution and exploration of genome-scale datasets. Bioinformatics 25(20), 2730-1.

[6] Ferrier T, et al. (2011) Arabidopsis paves the way: genomic and network analyses in crops. Curr Opin Biotechnol 22, 260-70.

[7] Muino JM, et al. (2011) ChIP-seq analysis in R (CSAR): An R package for the statistical detection of protein-bound genomic regions. Plant Methods 7, 11.