Evaluation of genome-wide selection prediction using artificial neural networks

Carlos Henrique Paiva Camisa Nova¹, Daniel Furtado Dardengo Sant Anna², Davi Leal Barbosa³, Norberto da Silva Rocha⁴, Matheus Lima Corrêa Abreu⁵, Kamila da Silva Alvarenga⁶, Antônio Paulo Oliveira Neto⁷, Leonardo Siqueira Glória⁸

1 - Universidade Estadual do Norte Fluminense - Darcy Ribeiro
2 - Universidade Estadual do Norte Fluminense - Darcy Ribeiro
3 - Universidade Estadual do Norte Fluminense - Darcy Ribeiro
4 - Universidade Federal dos Vales do Jequitinhonha e Mucuri
5 - Universidade Federal de Minas Gerais
7 - Universidade Estadual do Norte Fluminense - Darcy Ribeiro
8 - Universidade Estadual do Norte Fluminense - Darcy Ribeiro

SUMMARY -

Interest has recently grown in the use of nonparametric methods, such as artificial neural networks (ANN), in genome-wide selection (GWS). A special class of ANN is that with Bayesian regularization, which does not require prior knowledge of the genetic architecture of the trait. The objective of the present study was to apply an ANN based on Bayesian regularization to the prediction of genomic breeding values using simulated data sets, in order to select the most relevant SNP markers by means of two different methods. The simplest architecture of the Bayesian regularized neural network gave the best results for the two traits evaluated, and these results were very similar to those of the traditional methods RR-BLUP and Bayesian Lasso (BLASSO).

Keywords: machine learning, computing, animal breeding

Assessment of Genome-Wide Prediction by Using Bayesian Regularized Neural Networks

ABSTRACT - Interest in nonparametric methods, such as artificial neural networks (ANN), has been increasing in the field of genome-wide selection (GWS). A special class of ANN is that with Bayesian regularization, which requires no prior knowledge of the genetic architecture of the trait. The goal of this study was to apply an ANN based on Bayesian regularization to the prediction of genomic breeding values in a set of simulated data, and to select the most relevant SNP markers by two different methods. The simplest Bayesian regularized neural network gave the best results for the two traits evaluated, and these were similar to those of the traditional methods RR-BLUP and Bayesian Lasso (BLASSO).

Keywords: machine learning, programming, animal breeding

Introduction

The importance of genome-enabled selection (GS) has increased in animal breeding over the last few years, owing to the rapid development of DNA sequencing technologies that allow large-scale genotyping of thousands of genetic markers in animal species. As a result, the number of markers (p) typically runs into the thousands and often far exceeds the number of phenotypes (n), leading to the classic p >> n problem. This is only one of several statistical and computational challenges. In genome-enabled prediction, the high-dimensional nature of high-throughput SNP marker data sets calls for regularization methods, such as the Bayesian regularized neural network (BRNN) (de los Campos et al., 2013). Several studies using this method have been published in recent years (Gianola et al., 2011; González-Camacho et al., 2012; Okut et al., 2011). The aim of the present study was to apply an artificial neural network (ANN) based on Bayesian regularization to genome-enabled prediction in simulated data sets, and to select the most relevant SNP markers using two proposed methods.

Literature Review

Advances in animal breeding have come mainly from computer science and statistics, which made it possible to develop estimators of the genetic values of individuals raised in different environments and, at the same time, to estimate a vast number of parameters from large data sets (Brito, 2011). In this context, many methods for predicting marker effects have been proposed, such as BLUP, Bayes A and Bayes B (Meuwissen et al., 2001), the Bayesian LASSO (Park & Casella, 2008; de los Campos et al., 2009) and Bayesian ridge regression (Gianola et al., 2003). Given the large number of SNP markers spread across the genome, these markers are used in genome-wide selection (GWS) to predict the genetic merit of individuals for desirable traits and thus to guide selection directly. According to Gianola et al. (2003), this setting requires statistical methods that handle covariate selection (the multicollinearity problem) and regularize the estimation process (the dimensionality problem). The RR-BLUP method (ridge regression BLUP) estimates the effects of all markers simultaneously (Meuwissen et al., 2001), treating them as random effects with a common variance; in other words, it assumes that all markers contribute equally to the genetic variance (no genes of large effect). The Bayesian LASSO estimator (BLASSO) assigns a specific variance to each marker and shrinks the estimates toward zero, as RR-BLUP does. However, it effectively allows some estimates to be equal to zero, so that shrinkage and covariate selection are performed simultaneously (de los Campos et al., 2009). According to Gianola et al. (2011), additive models based on multiple linear regression do not properly capture some of the effects involved in the genetic mechanism of a trait. These effects include complex interactions between genes, interactions between genes and environment, and epigenetic effects.
In this context, interest has grown in prediction methods that can exploit information on these effects to increase the accuracy of predicted individual genetic values. Prominent among them are machine learning methods, in particular artificial neural networks (ANN).
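
As a compact reference for the contrast between RR-BLUP and BLASSO described above, the two models can be written side by side. This is a sketch in standard notation rather than the authors' own formulation: the symbols y, X, beta, mu, sigma_e, sigma_beta, tau_j and lambda are assumptions introduced here for illustration, and the BLASSO hierarchy follows the form given by Park & Casella (2008).

```latex
% Common marker regression model (notation assumed for illustration)
\mathbf{y} = \mathbf{1}\mu + \mathbf{X}\boldsymbol{\beta} + \mathbf{e},
\qquad \mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma^2_e)

% RR-BLUP: one variance shared by all markers,
% equivalent to ridge regression with penalty sigma_e^2 / sigma_beta^2
\beta_j \sim N(0, \sigma^2_\beta), \qquad
\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}}
\left\{ \lVert \mathbf{y} - \mathbf{1}\mu - \mathbf{X}\boldsymbol{\beta} \rVert^2
+ \frac{\sigma^2_e}{\sigma^2_\beta} \sum_{j=1}^{p} \beta_j^2 \right\}

% BLASSO: marker-specific variances with an exponential prior (rate lambda^2 / 2)
\beta_j \mid \sigma^2_e, \tau_j^2 \sim N(0, \sigma^2_e \tau_j^2), \qquad
\tau_j^2 \sim \mathrm{Exp}\!\left(\lambda^2 / 2\right)
```

Integrating out the marker-specific variances gives each marker effect a double-exponential prior, which places more mass near zero and in the tails than the normal prior of RR-BLUP. This is why small effects are shrunk more strongly under BLASSO.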

Materials and Methods

An outbred population was simulated for the 16th QTLMAS Workshop (2012). True breeding values (TBV) for the two simulated traits were calculated as the sum of the additive effects of the 50 QTLs affecting each trait. Random residuals were drawn from normal distributions with mean zero and trait-specific residual variances, so as to simulate a heritability of 0.35 for milk yield (T1) and fat yield (T2). Phenotypes, given as individual yield deviations, were available only for the 3,000 females from generations G1 to G3. The remaining 1,020 genotyped individuals, from the fourth generation, were used as the validation set for the genomic estimated breeding values (GEBV). The Neural Network Toolbox™ of MATLAB® (Beale et al., 2010) was used for the analysis, with the "trainbr" function providing Bayesian regularization in all cases and 1,000 training epochs for each network. Six combinations of activation function, number of layers and number of neurons per layer were tested:
Net1 - one layer with one neuron and identity activation function.
Net2 - two layers: a logistic layer with two neurons followed by an identity layer with one neuron.
Net3 - two layers, both with identity activation function, with two and one neuron respectively.
Net4 - two layers: a hyperbolic tangent layer with two neurons followed by an identity layer with one neuron.
Net5 - three layers: logistic, hyperbolic tangent and identity activation functions, with two, two and one neuron respectively.
Net6 - three layers: hyperbolic tangent, logistic and identity activation functions, with two, two and one neuron respectively.
The Bayesian Lasso method was implemented with the BLR package (Perez-Rodriguez et al., 2013) of the R software, using 30,000 posterior samples from an MCMC chain of 100,000 iterations, obtained by keeping every second iteration after a burn-in of 40,000 iterations. The RR-BLUP method was implemented with the rrBLUP package (Endelman, 2011) of the R software. SNP effects in the genome-enabled prediction with the different neural networks were estimated with the methods proposed by Garson (1991) and Dimopoulos et al. (1995).
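
To make the weight-based estimation of SNP relevance more concrete, the following is a minimal Python sketch of one commonly used statement of Garson's weight-partitioning method for a network with a single hidden layer and a single output. It is an illustration under assumptions, not the authors' MATLAB code: the function name `garson_importance`, the weight layout and the toy weights are all hypothetical, and the exact formulas used in the study are not given in the text.

```python
def garson_importance(input_hidden, hidden_output):
    """Relative importance of each input for a one-hidden-layer network.

    input_hidden[i][j] is the weight from input i to hidden neuron j;
    hidden_output[j] is the weight from hidden neuron j to the single output.
    Bias terms are ignored (an assumption of this sketch).
    """
    n_inputs = len(input_hidden)
    n_hidden = len(hidden_output)
    # Sum of absolute incoming weights for each hidden neuron.
    column_totals = [
        sum(abs(input_hidden[i][j]) for i in range(n_inputs))
        for j in range(n_hidden)
    ]
    # Share of each hidden neuron attributed to input i, weighted by the
    # absolute hidden-to-output weight, then summed over hidden neurons.
    raw = []
    for i in range(n_inputs):
        total = 0.0
        for j in range(n_hidden):
            if column_totals[j] > 0:
                share = abs(input_hidden[i][j]) / column_totals[j]
                total += share * abs(hidden_output[j])
        raw.append(total)
    # Normalise so that the importances sum to one.
    grand_total = sum(raw)
    return [value / grand_total for value in raw]


# Toy example: 3 hypothetical SNPs feeding 2 hidden neurons (the layout of Net2 to Net4).
weights_in = [[0.8, -0.1], [0.1, 0.3], [-0.1, 0.6]]
weights_out = [1.5, -0.5]
importance = garson_importance(weights_in, weights_out)
print([round(v, 3) for v in importance])
```

Because the method works on absolute weights, it ranks markers by magnitude of contribution and carries no information about the sign of an effect. The sketch also assumes at least one nonzero weight, since otherwise the final normalisation would divide by zero.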

Results and Discussion

The marker effects were distributed across the five chromosomes for both traits. For both traits, BLASSO penalized the effects more heavily than the other methods (RR-BLUP and Net1); consequently, the number of SNPs with an estimated effect equal to 0 was higher with BLASSO, and the other methods detected more QTLs than BLASSO did. In addition, for trait 1, the correlations between the true effects of the top fifty (most relevant) SNPs and the estimates from the Garson (1991) method, the Dimopoulos et al. (1995) method, RR-BLUP and BLASSO were 0.61, 0.60, 0.60 and 0.55, respectively. For trait 2, these correlations were 0.81, 0.81, 0.81 and 0.71, respectively. According to Gianola et al. (2011), results from BRNN applied to genomic prediction with real dairy cattle data suggest that the method may be useful for predicting complex traits from high-dimensional genomic information, a situation in which the number of coefficients to be estimated exceeds the sample size. Furthermore, BRNN can capture nonlinearities adaptively, which may be useful in the study of quantitative traits under complex gene action. When applied to real data on litter size in pigs (Tusell et al., 2013), the nonparametric models gave predictions similar to those of their parametric counterparts, with BRNN giving the best prediction (r = 0.31), leading the authors to conclude that the method showed promising results under certain scenarios.
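
The comparison reported above, between the true effects of the fifty most relevant SNPs and their estimates, can be sketched in Python as follows. The text does not say how the top fifty were chosen or whether signs were kept, so this sketch assumes that SNPs are ranked by absolute true effect and that absolute values are compared. The function names and toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def top_k_correlation(true_effects, estimated_effects, k=50):
    """Correlate true and estimated effects over the k largest true effects.

    Assumption of this sketch: SNPs are ranked by absolute true effect and
    absolute values are compared, since importance scores are non-negative.
    """
    ranked = sorted(
        range(len(true_effects)),
        key=lambda i: abs(true_effects[i]),
        reverse=True,
    )[:k]
    true_top = [abs(true_effects[i]) for i in ranked]
    estimated_top = [abs(estimated_effects[i]) for i in ranked]
    return pearson(true_top, estimated_top)


# Toy data: 5 hypothetical SNPs, keeping the 3 with the largest true effects.
true_effects = [0.0, 3.0, -2.0, 1.0, 0.1]
estimated = [5.0, 6.0, 4.0, 2.0, 9.0]
r_top3 = top_k_correlation(true_effects, estimated, k=3)
print(round(r_top3, 2))
```

Restricting the correlation to the top-ranked SNPs measures agreement on the markers that matter most, while ignoring how each method handles the many markers with no true effect.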

Conclusions

The simplest Bayesian Regularized Neural Network (BRNN) model gave consistent predictions for both traits, which were similar to the results obtained from the traditional RR-BLUP and BLASSO methods.

References

Beale, M. H., Hagan, M. T. and Demuth, H. B. (2010) Neural Network Toolbox 7. User's Guide, MathWorks.
Brito, F. V. (2011) Diversidade genética e acurácia da informação genômica em bovinos de corte. Doctoral thesis, Universidade Federal do Rio Grande do Sul.
de los Campos, G., Naya, H., Gianola, D., Crossa, J., Legarra, A., Manfredi, E., Weigel, K. and Cotes, J. M. (2009) Predicting quantitative traits with regression models for dense molecular markers and pedigree. Genetics, 182, 375-385.
de los Campos, G., Hickey, J. M., Pong-Wong, R., Daetwyler, H. D. and Calus, M. P. (2013) Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics, 193, 327-345.
Dimopoulos, Y., Bourret, P. and Lek, S. (1995) Use of some sensitivity criteria for choosing networks with good generalization ability. Neural Processing Letters, 2, 1-4.
Endelman, J. B. (2011) Ridge regression and other kernels for genomic selection with R package rrBLUP. The Plant Genome, 4, 250-255.
Garson, D. G. (1991) Interpreting neural network connection weights.
Gianola, D., Perez-Enciso, M. and Toro, M. A. (2003) On marker-assisted prediction of genetic value: beyond the ridge. Genetics, 163, 347-365.
Gianola, D., Okut, H., Weigel, K. A. and Rosa, G. J. (2011) Predicting complex quantitative traits with Bayesian neural networks: a case study with Jersey cows and wheat. BMC Genetics, 12, 87.
González-Camacho, J., de los Campos, G., Pérez, P., Gianola, D., Cairns, J., Mahuku, G., Babu, R. and Crossa, J. (2012) Genome-enabled prediction of genetic values using radial basis function neural networks. Theoretical and Applied Genetics, 125, 759-771.
Meuwissen, T. H. E., Hayes, B. J. and Goddard, M. E. (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics, 157, 1819-1829.
Okut, H., Gianola, D., Rosa, G. J. M. and Weigel, K. A. (2011) Prediction of body mass index in mice using dense molecular markers and a regularized neural network. Genetics Research, 93, 189-201.
Park, T. and Casella, G. (2008) The Bayesian Lasso. Journal of the American Statistical Association, 103, 681-686.
Perez-Rodriguez, P., Gianola, D., Weigel, K. A., Rosa, G. J. and Crossa, J. (2013) Technical note: An R package for fitting Bayesian regularized neural networks with applications in animal breeding. Journal of Animal Science, 91, 3522-3531.
Tusell, L., Pérez-Rodríguez, P., Forni, S., Wu, X.-L. and Gianola, D. (2013) Genome-enabled methods for predicting litter size in pigs: a comparison. Animal, 7, 1739-1749.