# Programme And Abstracts For Thursday 14^{th} Of December

Keynote: Thursday 14^{th} 9:10 098 Lecture Theatre (260-098)

## ALTREP: Alternate Representations Of Basic R Objects

Luke Tierney

University of Iowa

**Abstract:** The ALTREP framework provides infrastructure to support alternate representations of basic R objects. Some examples include R vectors with data in memory-mapped files, compact representation of arithmetic sequences, deferred computations, and adding meta-data to objects. This talk will outline the framework, present some examples of its use, and describe the current state of incorporating the framework into the R distribution.
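The idea of a compact representation of an arithmetic sequence can be illustrated outside R. The following is only an analogy in Python, not the ALTREP C interface: the class `CompactSeq` and its fields are hypothetical names introduced for this sketch. It stores three numbers instead of the elements, yet behaves like an ordinary sequence to its callers.

```python
from collections.abc import Sequence


class CompactSeq(Sequence):
    """Arithmetic sequence stored as (start, length, step), not as its elements."""

    def __init__(self, start, length, step=1):
        self._start, self._length, self._step = start, length, step

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        if isinstance(index, slice):
            return [self[k] for k in range(*index.indices(self._length))]
        if index < 0:
            index += self._length
        if not 0 <= index < self._length:
            raise IndexError("index out of range")
        # Elements are computed on demand and never stored.
        return self._start + index * self._step


# A billion-element sequence that occupies only a few machine words.
seq = CompactSeq(1, 10**9)
print(len(seq), seq[0], seq[-1])
```

Code that only uses the sequence interface (length, indexing, iteration) cannot tell this object from a fully materialized list, which is the property an alternate representation relies on.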

Thursday 14^{th} 10:30 098 Lecture Theatre (260-098)

## Penalized Vector Generalized Additive Models

Thomas Yee^{1}, Chanatda Somchit^{2}, and Chris Wild^{1}

^{1}University of Auckland

^{2}University of Phayao

**Abstract:** Over the last two decades generalized additive models (GAMs) have become an indispensable tool for modern data analysis and regression. First-generation GAMs as developed by Hastie and Tibshirani are based on backfitting (e.g., the `gam` R package). Second-generation GAMs have automatic smoothing parameter selection (e.g., the `mgcv` package by Simon Wood) and are based on, e.g., P-splines. Until recently, these two implementations were largely confined to the exponential family. However, since the 1990s, the vector generalized linear and additive model (VGLM/VGAM) classes have been developed by Yee and coworkers, and these form a much larger class of models. First-generation VGAMs were based on vector splines and vector backfitting. This talk will describe second-generation VGAMs using O-splines and P-splines. We illustrate them with examples to show that automatic smoothing parameter selection based on optimizing a predictive quantity such as generalized cross validation can be very useful. The speaker’s `VGAM` package implementation will be described.

**Keywords:** Automatic smoothing parameter selection, O-splines, P-splines, Vector generalized additive models, VGAM R package

**References:**

*Vector Generalized Linear and Additive Models: With an Implementation in R*. New York, USA: Springer.

Thursday 14^{th} 10:30 OGGB4 (260-073)

## A Package For Multiple Precision Floating-Point Computation On R

Ei-Ji Nakama^{1} and Junji Nakano^{2}

^{1}COM-ONE Ltd.

^{2}Institute of Statistical Mathematics

**Abstract:** As numerical computations performed in R become larger and more complicated, errors from floating-point arithmetic become problematic. R usually performs double precision floating-point arithmetic, which may not be adequate or precise enough in some situations. Multiple precision arithmetic is useful for detecting and avoiding the errors of double precision arithmetic. Several multiple precision arithmetic packages exist for R, but their abilities are limited. We therefore provide another multiple precision arithmetic package, Rmpenv, which can handle multiple precision arithmetic for real and complex numbers, matrix products and inversion, etc. We also provide syntactic sugar that makes multiple precision computation in R easy. We use MPACK, a free and open source library for multiple precision arithmetic and linear algebra computation.
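The kind of double precision error at issue can be shown with a small cancellation example. This sketch uses Python's standard `decimal` module as a stand-in for multiple precision arithmetic; it does not use Rmpenv or MPACK, and the function names and the choice of 50 digits are assumptions made for illustration.

```python
from decimal import Decimal, localcontext


def cancel_double():
    """(1e16 + 1) - 1e16 in double precision: the added 1 is lost to rounding."""
    big = 1e16
    return (big + 1.0) - big


def cancel_decimal(digits=50):
    """The same expression evaluated with `digits` significant decimal digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        big = Decimal(10) ** 16
        return (big + 1) - big


print(cancel_double())   # double precision result
print(cancel_decimal())  # multiple precision result
```

Comparing the two results is also a simple way to detect that a double precision computation has gone wrong, which is one of the uses of multiple precision arithmetic named in the abstract.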

**Keywords:** Double precision, floating-point arithmetic, MPACK

Thursday 14^{th} 10:30 OGGB5 (260-051)

## Dissimilarities Between Groups Of Data

Nobuo Shimizu^{1}, Junji Nakano^{1}, and Yoshikazu Yamamoto^{2}

^{1}Institute of Statistical Mathematics

^{2}Tokushima Bunri University

**Abstract:** We often have “big data” expressed by both continuous real variables and categorical variables. When the data are huge, it is almost impossible to inspect each individual observation. We therefore divide them into a small number of groups that have clear domain meanings, and describe each group using information up to second order moments. For example, means, variances and covariances are used to summarize the continuous real variables, and a Burt matrix, which consists of the contingency tables for all pairs of categorical variables, is used to summarize the categorical variables. We call such a set of descriptive statistics “aggregated symbolic data (ASD)”.

We propose dissimilarities between two ASDs based on a pseudo-likelihood ratio test statistic and a chi-squared test statistic. The former is theoretically derived and the latter is heuristically given. We apply the two dissimilarities to cluster districts in Tokyo using ASD derived from a huge real estate data set.
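The group summary described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' software: the function names are hypothetical, the covariance uses divisor n (an assumption), and the dissimilarities themselves are not reproduced here.

```python
from collections import Counter
from itertools import product


def moments(rows):
    """Means and covariance matrix (divisor n) of the continuous variables."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    cov = [[sum((r[j] - means[j]) * (r[k] - means[k]) for r in rows) / n
            for k in range(p)] for j in range(p)]
    return means, cov


def burt_matrix(rows):
    """Burt matrix: co-occurrence counts for every pair of (variable, category) labels."""
    p = len(rows[0])
    labels = sorted({(j, r[j]) for r in rows for j in range(p)})
    counts = Counter()
    for r in rows:
        present = [(j, r[j]) for j in range(p)]
        # Each observation contributes 1 to every pair of categories it shows.
        counts.update(product(present, repeat=2))
    return labels, [[counts[(a, b)] for b in labels] for a in labels]


# One group with two continuous and two categorical variables (toy values).
means, cov = moments([(1.0, 2.0), (3.0, 6.0)])
labels, burt = burt_matrix([("a", "x"), ("a", "y"), ("b", "x")])
```

The diagonal blocks of the Burt matrix hold the category counts of each variable and the off-diagonal blocks hold the pairwise contingency tables, so the matrix plays the role for categorical variables that the covariance matrix plays for continuous ones.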

**Keywords:** Aggregated symbolic data, Chi-squared test statistic, clustering, pseudo-likelihood ratio test statistic

Thursday 14^{th} 10:30 Case Room 2 (260-057)

## Comparison Of Tests Of Mean Difference In Longitudinal Data Based On Block Resampling Methods

Hirohito Sakurai and Masaaki Taguri

National Center for University Entrance Examinations

**Abstract:** Let us consider a two-sample problem in longitudinal data, and discuss comparison of tests of mean difference using block resampling methods. The testing methods are based on moving block bootstrap (MBB), circular block bootstrap (CBB) and stationary bootstrap (SB). These block resampling techniques are used to approximate the null distributions of the following four types of test statistics: sum of absolute values of difference between two mean sequences (\(T_1\)), sum of squares of difference between two mean sequences (\(T_2\)), area-difference between two mean curves (\(T_3\)), and difference of kernel estimators based on two mean sequences (\(S_n\)). Our testing algorithm generates blocks of observations in each sample as in MBB, CBB or SB, and draws resamples *with replacement* or *without replacement* from the mixed blocks generated from the two samples. In block resampling, a resample is usually generated *with replacement* from blocks of observations; our discussion also includes block resampling *without replacement*, a permutation analogue of MBB, CBB and SB, with each of \(T_1\), \(T_2\), \(T_3\) and \(S_n\). Monte Carlo simulations are carried out to examine the empirical level and power of the testing methods.
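The block construction and the two simplest statistics above can be sketched as follows. This is an illustration under assumptions, not the authors' algorithm: the function names are hypothetical, a fixed block length is used (so the stationary bootstrap, whose block lengths are random, is omitted), and \(T_3\) and \(S_n\) are left out.

```python
import random


def moving_blocks(series, block_len):
    """All overlapping blocks of length block_len (moving block bootstrap)."""
    n = len(series)
    return [list(series[i:i + block_len]) for i in range(n - block_len + 1)]


def circular_blocks(series, block_len):
    """Blocks that wrap around the end of the series (circular block bootstrap)."""
    n = len(series)
    return [[series[(i + j) % n] for j in range(block_len)] for i in range(n)]


def resample_from_blocks(blocks, length, rng, replace=True):
    """Concatenate randomly drawn blocks and trim to the requested length."""
    n_needed = -(-length // len(blocks[0]))  # ceiling division
    if replace:
        drawn = [rng.choice(blocks) for _ in range(n_needed)]
    else:
        # Without replacement: each block is used at most once.
        drawn = rng.sample(blocks, n_needed)
    return [x for block in drawn for x in block][:length]


def t1(mean_a, mean_b):
    """Sum of absolute differences between two mean sequences."""
    return sum(abs(a - b) for a, b in zip(mean_a, mean_b))


def t2(mean_a, mean_b):
    """Sum of squared differences between two mean sequences."""
    return sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))


rng = random.Random(0)
pooled = moving_blocks([1, 2, 3, 4, 5, 6], 2) + moving_blocks([2, 2, 3, 5, 5, 7], 2)
resample = resample_from_blocks(pooled, 6, rng, replace=False)
```

Repeating the resampling step many times and recomputing a statistic on each resample gives the approximate null distribution against which the observed value is compared.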

**Keywords:** moving block bootstrap, circular block bootstrap, stationary bootstrap, with/without replacement, empirical level/power

**References:**

*Resampling Methods for Dependent Data*. New York: Springer.

Thursday 14^{th} 10:30 Case Room 3 (260-055)

## SSREM: A Summary-Statistics-Based Random Effect Model To Estimating Heritability, Co-Heritability And Effect Sizes In GWAS Data Analysis

Jin Liu^{1} and Can Yang^{2}

^{1}Duke-NUS Medical School

^{2}Hong Kong University of Science and Technology

**Abstract:** Most existing methods for GWAS data analysis require individual-level genotype data as their input. However, it is often not easy to get access to individual-level data, due to many practical issues, such as privacy protection and disagreement on data-sharing among multiple research groups. In this talk, we introduce SSREM, a Summary-Statistics-based approach to estimating heritability, co-heritability and effect sizes in GWAS data analysis. This is achieved by Bayesian analysis with the standard random-effect prior and a summary-statistics-based likelihood function. We have implemented a parallel Gibbs sampling strategy, which allows us to handle genome-wide-scale datasets. Our analysis results suggest that summary-statistics-based analysis can achieve comparable performance to individual-level data analysis.

**Keywords:** Summary statistics; Genome-wide association study; Probabilistic model; Gibbs sampling; Heritability; Co-heritability

Thursday 14^{th} 10:50 098 Lecture Theatre (260-098)

## Consistency Of Linear Mixed-Effects Model Selection With Inconsistent Covariance Parameter Estimators

Chihhao Chang

National University of Kaohsiung

**Abstract:** For linear mixed-effects models with data collected within one cluster, the covariance parameters cannot be estimated consistently by maximum likelihood (ML). Hence the asymptotic behaviors of likelihood-based information criteria, such as Akaike’s information criterion (AIC), are rarely discussed in the literature. Instead, the number of clusters is generally assumed to go to infinity with the sample size, which guarantees the consistency of the covariance parameter estimators and thereby the consistency of the model selection procedures. In this talk, under some mild conditions, we establish asymptotic theorems for the ML estimators of covariance parameters when the number of clusters is fixed. Furthermore, we study the asymptotic behaviors of the generalized information criterion, which includes AIC as a special case.
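For orientation, one common way to write a generalized information criterion for a candidate model \(M\) is sketched below. The notation (\(\ell\), \(\hat\theta_M\), \(\lambda\), \(\dim(M)\)) is assumed here for illustration and may differ from the form used in the talk.

\[
\mathrm{GIC}_{\lambda}(M) = -2\,\ell(\hat\theta_M) + \lambda\,\dim(M)
\]

Here \(\ell(\hat\theta_M)\) is the maximized log-likelihood and \(\dim(M)\) is the number of parameters in \(M\); setting \(\lambda = 2\) recovers AIC.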

**References:**

Fan, Y. and Li, R. (2012). Variable selection in linear mixed effects models. In: *The Annals of Statistics*, **40**, 2043-2068.

Jiang, J., Rao, J. S., Gu, Z. and Nguyen, T. (2008). Fence methods for mixed model selection. In: *The Annals of Statistics*, **36**, 1669-1692.

Müller, S., Scealy, J. L. and Welsh, A. H. (2013). Model Selection in Linear Mixed Models. In: *Statistical Science*, **28**, 135-167.

*The Annals of Statistics*, **35**, 2795-2814.

Thursday 14^{th} 10:50 OGGB4 (260-073)

## An Incomplete-Data Fisher Scoring With An Acceleration Method

Keiji Takai

Kansai University

**Abstract:** Incomplete data complicate conventional statistical analyses because the analyses presume complete data are always available. The primary problem is the complication of the parameter estimation. The parameter estimation is based on the observed-data log-likelihood function that consists of the sum of the logarithm of the marginalized likelihood with respect to the missing values, and thus the log-likelihood function becomes complicated to handle. The EM algorithm was proposed to make it easy to handle the log-likelihood function. However, the EM algorithm still has some problems that are often criticized (McLachlan and Krishnan, 2002); namely, slow convergence and unavailability of the standard error.

In my talk, I propose an incomplete-data Fisher scoring (IFS) method with an acceleration method to overcome these problems. The IFS method uses a Newton-Raphson type iteration, yet it produces a sequence that is identical or approximately equal to the sequence produced by the EM algorithm. A notable feature of the IFS method is that it can accelerate itself by adjusting its step length, and it can produce the standard error using only the functions needed for the acceleration. Its convergence is faster than that of the EM algorithm. In the talk, I provide the convergence theorem and practical examples.

**Keywords:** Incomplete data, EM algorithm, Fisher scoring, acceleration method

**References:**

Barnett, J.A., Payne, R.W. and Yarrow, D. (1990). *Yeasts: Characteristics and identification: Second Edition.* Cambridge: Cambridge University Press.

McLachlan, G., and Krishnan, T. (2002). The EM algorithm and extensions, 2nd Edition. Wiley.

(ed.) Barnett, V., Payne, R. and Steiner, R. (1995). *Agricultural Sustainability: Economic, Environmental and Statistical Considerations*. Chichester: Wiley.

Payne, R.W. (1997). *Algorithm AS314 Inversion of matrices Statistics*, **46**, 295–298.

*COMPSTAT90 Proceedings in Computational Statistics*, 297–302. Heidelberg: Physica-Verlag.

Thursday 14^{th} 10:50 OGGB5 (260-051)

## Interactive Visualization Of Aggregated Symbolic Data

Yoshikazu Yamamoto^{1}, Junji Nakano^{2}, and Nobuo Shimizu^{2}

^{1}Tokushima Bunri University

^{2}Institute of Statistical Mathematics

**Abstract:** When we have new “big data”, the first step may be to visualize them. For visualizing continuous multivariate data, the interactive parallel coordinate plot is known to be appropriate. However, when the number of observations is huge and some variables are categorical, a simple parallel coordinate plot is not applicable. We propose to divide big data into rather small groups, summarize them as aggregated symbolic data (ASD), and visualize them by triangularly arranged parallel coordinate plots.

We have developed statistical graphics software for this purpose. Our software provides interactive operations such as selection and linked highlighting, and is written in Java and R, using big data processing technologies such as Apache Hadoop and Apache Spark.

Aggregated symbolic data is a set of descriptive statistics calculated from moments of up to second order of the variables in each group. We also propose a further summarization of ASD that describes the characteristics of each variable and of each pair of variables, for visualizing the differences among ASDs. Real example data are visualized by our software and interpreted intuitively.

**Keywords:** Apache Hadoop, Apache Spark, Parallel coordinate plot, Symbolic data analysis

Thursday 14^{th} 10:50 Case Room 2 (260-057)

## Analysis Of Spatial Data With A Gaussian Mixture Markov Random Field Model

Wataru Sakamoto

Okayama University

**Abstract:** In spatial data, detecting regions with higher relative risk is of primary interest. A latent Markov random field model with Gaussian mixture components is introduced, in which the probit or the logit of the mixture weight for each location follows a Gaussian Markov random field such as an intrinsic auto-regressive model (Besag *et al.*, 1991). A mixture model with spatially correlated weights was proposed by Fernández and Green (2002), and our modeling with a Gaussian mixture Markov random field can be extended to cases involving covariates and random effects. Parameters are estimated by a Bayesian approach, and the posterior mean of the mixture weight for each location, which varies smoothly, gives a meaningful interpretation of the spatial structure. Our computation was conducted with the RStan package, in which the Hamiltonian Monte Carlo method is implemented. Some applications to disease mapping data are illustrated.

**Keywords:** Bayesian modeling, spatial cluster detection, spatial correlation

**References:**

Fernández, C. and Green, P. J. (2002). Modelling spatially correlated data via mixtures: a Bayesian approach. *J. Roy. Statist. Soc. B*, **64**, 805–826.

Besag, J., York, J. and Mollié, A. (1991). Bayesian image restoration, with two applications in spatial statistics. *Ann. Inst. Statist. Math.*, **43**, 1–59.

*Gaussian Markov Random Fields: Theory and Applications.* Chapman and Hall.

Thursday 14^{th} 10:50 Case Room 3 (260-055)

## Forward Selection In Regression Models Based On Robust Estimation

Shan Luo^{1} and Zehua Chen^{2}

^{1}Shanghai Jiao Tong University

^{2}National University of Singapore

**Abstract:** Feature selection in ultra-high dimensional regression models requires a sequence of candidate models and a criterion for selecting the “best” model from them. Under different scenarios, various methods have been proposed to achieve these two goals. Intuitively, it is straightforward to choose appropriate loss and penalty functions in a regularization method to accommodate specific characteristics of the data. However, the computation can be expensive in certain cases. Recent studies show that sequential methods are promising for producing good candidate models for ultra-high dimensional data. Moreover, they can be easily extended to complex models other than the linear regression model. In this paper, we propose a new feature selection method based on robust estimation.

**Keywords:** Feature selection, robust estimation, sequential method

Thursday 14^{th} 11:10 098 Lecture Theatre (260-098)

## Selecting Generalised Linear Models Under Inequality Constraints

Daniel Gerhard

University of Canterbury

**Abstract:** Model selection by information criteria can be used to identify a single best model or for inference based on weighted support from a set of competing models, incorporating model selection uncertainty into parameter estimates and estimates of precision. Anraku (1999) proposed a modified version of the well known Akaike information criterion, selecting models in the one-way analysis of variance models when the population means are subject to monotone trends. A generalization of this order-restricted information criterion was proposed by Kuiper et al. (2011), allowing a restriction of population means by a mixture of linear equality and inequality constraints.

An extension to this approach is presented, applying the generalized order-restricted information criterion to model selection from a set of generalized linear models. The class of models can comprise linear equality or inequality constraints of population parameters assuming a distribution of the exponential family for the response. The methodology is illustrated using the open source environment R with the add-on package `goric`.
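The "weighted support from a set of competing models" mentioned in the abstract is commonly expressed through weights of the Akaike type. The sketch below assumes that form; the function name and the criterion values are hypothetical, and it does not reproduce the `goric` package or the order-restricted penalty.

```python
import math


def criterion_weights(ic_values):
    """Relative support for each model from information criterion values.

    A smaller criterion value means more support. Weights sum to one.
    """
    best = min(ic_values)
    # Subtracting the best value first keeps the exponentials well scaled.
    raw = [math.exp(-0.5 * (value - best)) for value in ic_values]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical criterion values for three competing models.
weights = criterion_weights([100.0, 102.0, 110.0])
print([round(w, 3) for w in weights])
```

Such weights can be used either to pick a single best model or to average parameter estimates across models, which carries model selection uncertainty into the final estimates.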

**Keywords:** Model selection, Order-restriction, GLM

**References:**

Anraku, K. (1999). An information criterion for parameters under a simple order restriction. *Biometrika*, **86**, 141–152.

*Biometrika*, **98**, 495–501.

Thursday 14^{th} 11:10 OGGB4 (260-073)

## Improvement Of Computation For Nonlinear Multivariate Methods

Masahiro Kuroda^{1}, Yuichi Mori^{1}, and Masaya Iizuka^{2}

^{1}Okayama University of Science

^{2}Okayama University

**Abstract:** Nonlinear multivariate methods (NL-MM) using optimal scaling as a quantification technique can analyze any data including quantitative and qualitative variables. The alternating least squares (ALS) algorithm is the most popular iterative algorithm in NL-MM. While the algorithm has a stable convergence property, it requires many iterations and a large computational cost, especially for a large data set involving many qualitative variables, because its convergence is linear. It is therefore important to improve the speed of computation when NL-MM with the ALS algorithm is applied. Kuroda and his co-workers tried to accelerate the convergence of the ALS algorithm using the vector \(\varepsilon\) (v\(\varepsilon\)) accelerator. In this talk, the v\(\varepsilon\) acceleration for the ALS algorithm is implemented in NL-MM, e.g., nonlinear principal component analysis and nonlinear factor analysis, and the performances are demonstrated in numerical experiments.
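A single vector \(\varepsilon\) acceleration step can be sketched as below. This uses one common form of the update, built on the vector inverse \(x / \lVert x \rVert^2\), applied to three consecutive iterates; it is stated here as an assumption rather than as the speakers' exact implementation. The function names are hypothetical, and a simple linearly convergent map stands in for an ALS update.

```python
def vector_inverse(v):
    """Inverse of a vector in the sense v / ||v||^2 (undefined for the zero vector)."""
    norm_sq = sum(x * x for x in v)
    return [x / norm_sq for x in v]


def v_epsilon_step(theta0, theta1, theta2):
    """Accelerated estimate of the limit from three consecutive iterates."""
    d1 = [b - a for a, b in zip(theta0, theta1)]
    d2 = [c - b for b, c in zip(theta1, theta2)]
    diff = [p - q for p, q in zip(vector_inverse(d2), vector_inverse(d1))]
    correction = vector_inverse(diff)
    return [b + c for b, c in zip(theta1, correction)]


def linear_map(theta, rate=0.9, shift=(1.0, 2.0)):
    """Toy linearly convergent iteration standing in for one ALS update."""
    return [rate * t + s for t, s in zip(theta, shift)]


theta0 = [0.0, 0.0]
theta1 = linear_map(theta0)
theta2 = linear_map(theta1)
accelerated = v_epsilon_step(theta0, theta1, theta2)
print(theta2, accelerated)
```

For this toy map, whose fixed point is (10, 20), the plain iteration is still far from the limit after two steps while the accelerated estimate lands on it up to rounding. Real ALS sequences are only approximately linear, so the gain there is a reduction in the number of iterations, not exact recovery.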

**Keywords:** Alternating least squares algorithm, Optimal scaling, Acceleration of convergence

**References:**

Gifi, A. (1990). *Nonlinear multivariate analysis*. Wiley.

Kuroda, M., Mori, Y., Iizuka, M. and Sakakihara, M. (2011). Acceleration of the alternating least squares algorithm for principal components analysis. *Computational Statistics and Data Analysis*, **55**, 143–153.

*Nonlinear principal component analysis and its Applications*. JSS Research Series in Statistics, Springer.

Thursday 14^{th} 11:10 Case Room 3 (260-055)

## Feature Selection In High-Dimensional Models With Complex Block Structures

Zehua Chen^{1} and Shan Luo^{2}

^{1}National University of Singapore

^{2}Shanghai Jiao Tong University

**Abstract:** We consider feature selection in multivariate regression models where the response variables as well as the covariates are high-dimensional and both have intrinsic group structures. The models arise naturally in many biology studies for detecting associations between multiple traits and multiple features where the traits and features are embedded in biological functioning groups such as genes or pathways. We propose a sequential procedure for selecting the feature groups based on a correlation principle. At each step of the procedure, the response groups are fitted to the already selected feature groups and the residuals are obtained for the response groups; the feature group that has the highest correlation with the residuals of any response group is then selected next. The correlation measure is the trace of the sample canonical correlation matrix between two vectors. The EBIC is used as the stopping rule of the procedure. This procedure possesses the property of selection consistency. Compared with a group penalization approach, our method is more accurate and demands much less computation.

**Keywords:** Canonical correlation, correlation principle, grouped data, simultaneous feature selection, selection consistency

**References:**

Luo, S., and Chen, Z. (2017). *Sequential group feature selection by correlation principle in sparse high-dimensional models with complex block structures*. Manuscript, submitted.

*Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure.* *Biometrics* **71(2)**, 354–363.

Thursday 14^{th} 11:30 098 Lecture Theatre (260-098)

## Statistical Generalized Derivative Applied To The Profile Likelihood Estimation In A Mixture Of Semiparametric Models

Yuichi Hirose and Ivy Liu

Victoria University of Wellington

**Abstract:** Estimating the variance of the profile likelihood estimator in the joint model of longitudinal and survival data is difficult. We resolve this difficulty by introducing the “statistical generalized derivative”. The derivative is used to show the asymptotic normality of the estimator without assuming that the second derivative of the density function in the model exists.

**Keywords:** Efficiency, Efficient information bound, Efficient score, Implicitly defined function, Profile likelihood, Semi-parametric model, Joint model, EM algorithm, Mixture model

**References:**

Hsieh, F., Tseng, Y.K. and Wang, J.L. (2006). *Joint modeling of survival and longitudinal data: likelihood approach revisited.* Biometrics **62**, 1037–1043.

Hirose, Y. (2016). *On differentiability of implicitly defined function in semi-parametric profile likelihood estimation.* Bernoulli **22** 589–614.

*Joint modeling of survival and longitudinal ordered data using a semiparametric approach.* Australian & New Zealand Journal of Statistics **58**, 153–172.