# Programme And Abstracts For Tuesday 12th Of December

Keynote: Tuesday 12th 9:10 098 Lecture Theatre (260-098)

## Could Do Better … A Report Card For Statistical Computing

Ross Ihaka and Brendon McArdle
University of Auckland

Abstract: Since the introduction of R, research in Statistical Computing has plateaued. Although R is, at best, a stop-gap system, there appears to be very little active research on creating better computing environments for Statistics.

When work on R commenced there were a multitude of software systems for statistical data analysis in use and under development. There was friendly competition and collaboration between developers. While R can be seen as providing a useful unification for users, its success and dominance can be viewed as now holding back research and the development of new systems.

In this talk we’ll examine what might be behind this and also look at some research aimed at exploring some of the design space for new systems. The aim is to show constructively that new work in the area is still possible.

Tuesday 12th 10:30 098 Lecture Theatre (260-098)

## R&D Policy Regimes In France: New Evidence From A Spatio-Temporal Analysis

Benjamin Montmartin1, Marcos Herrera2, and Nadine Massard3
1GREDEG CNRS
2CONICET
3GAEL

Abstract: Using a unique database containing information on the amount of R&D tax credits and regional, national and European subsidies received by firms in French NUTS3 regions over the period 2001-2011, we provide new evidence on the efficiency of R&D policies, taking into account spatial dependency across regions. By estimating a spatial Durbin model with regimes and fixed effects, we show that in a context of yardstick competition between regions, national subsidies are the only instrument that displays a total leverage effect. For the other instruments, internal and external effects offset each other, resulting in insignificant total effects. Structural breaks corresponding to tax credit reforms are also revealed.

Keywords: Additionality, French policy mix, Spatial panel, Structural break


Tuesday 12th 10:30 OGGB4 (260-073)

## Analysing Scientific Collaborations Of New Zealand Institutions Using Scopus Bibliometric Data

Samin Aref1, David Friggens2, and Shaun Hendy1
1University of Auckland
2Ministry of Business Innovation & Employment

Abstract: Scientific collaborations are among the main enablers of development in small national science systems. Although analysing scientific collaborations is a well-established subject in scientometrics, evaluations of the collaborative activities of countries remain speculative, with studies based on a limited number of fields or on data too inadequate to fully represent collaborations at a national level. This study provides a unique view of the collaborative aspect of scientific activities in New Zealand. We perform a quantitative study based on all Scopus publications in all subjects for over 1500 New Zealand institutions over a period of six years to generate an extensive mapping of New Zealand scientific collaborations. The comparative results reveal the levels of collaboration between New Zealand institutions and business enterprises, government institutions, higher education providers, and private not-for-profit organisations in 2010-2015. Constructing a collaboration network of institutions, we observe a power-law distribution indicating that a small number of New Zealand institutions account for a large proportion of national collaborations. Network centrality measures are deployed to identify the country's most influential institutions in terms of scientific collaboration. We also provide comparative results on 15 universities and Crown Research Institutes based on 27 subject classifications. This study was based on Scopus custom data and supported by the Te Pūnaha Matatini internship programme at the Ministry of Business, Innovation & Employment.

Keywords: Big data modelling, Scientific collaboration, Scientometrics, Network analysis, Scopus, New Zealand
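As an illustration of the network-centrality idea used in this study, here is a minimal sketch in Python; the institution names and edge list are hypothetical placeholders, not taken from the Scopus data:

```python
from collections import defaultdict

# Toy collaboration network; each edge links two (hypothetical) institutions.
edges = [
    ("Univ A", "CRI B"), ("Univ A", "Firm C"), ("Univ A", "Govt D"),
    ("CRI B", "Firm C"), ("Govt D", "Firm C"),
]

def degree_centrality(edge_list):
    """Normalised degree centrality: distinct partners / (n - 1)."""
    adj = defaultdict(set)
    for u, v in edge_list:
        adj[u].add(v)
        adj[v].add(u)
    n = len(adj)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}

cent = degree_centrality(edges)
# "Univ A" collaborates with all 3 other institutions -> centrality 1.0
```

In a real analysis other centrality measures (betweenness, eigenvector) would likely also be computed, but the normalisation by `n - 1` shown here is the standard one for degree centrality.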

Tuesday 12th 10:30 OGGB5 (260-051)

## Family Structure And Academic Achievements Of High School Students In Tonga

Losana Vao Latu Latu
University of Canterbury

Abstract: In this study we examine how family structure affects the academic achievement of students at the secondary level of education in Tonga. It is a comparative study aiming to find out whether there is a significant difference between the academic achievements of students from a traditional family and those from a non-traditional family. We define a Tongan traditional family as one with two biological parents (or adoptive parents from birth), one male and one female, whereas a non-traditional family can be a single-parent family, or one where the student has no parent present (for example, they are staying with relatives or friends). In our study we are looking at the key drivers of success and trying to understand the relationship between academic achievement and family structure. We hope the study will provide evidence-based information to help administrators, other educators and parents adopt the best practices and actions for the students. The target population for this study is high school students aged 13 to 18 in Tonga. The study is limited to the high schools on the main island of Tonga, Tongatapu, which has 12 high schools: two are government schools and the others are private schools run by different religious denominations. In April we surveyed 360 students, 60 from each of 6 high schools, and present here our preliminary results.

Keywords: Education, policy, stratified sampling

Tuesday 12th 10:30 Case Room 2 (260-057)

## Analysis Of Multivariate Binary Longitudinal Data: Metabolic Syndrome During Menopausal Transition

Geoff Jones
Massey University

Abstract: Metabolic syndrome (MetS) is a major multifactorial condition that predisposes adults to type 2 diabetes and cardiovascular disease. It is defined as having at least three of five cardiometabolic risk components: 1) high fasting triglyceride level, 2) low high-density lipoprotein (HDL) cholesterol, 3) elevated fasting plasma glucose, 4) large waist circumference (abdominal obesity) and 5) hypertension. In the US Study of Women’s Health Across the Nation (SWAN), a 15-year multi-centre prospective cohort study of women from five racial/ethnic groups, the incidence of MetS increased as midlife women underwent the menopausal transition (MT). A model is sought to examine the interdependent progression of the five MetS components and the influence of demographic covariates.

Keywords: Multivariate binary data, longitudinal analysis, metabolic syndrome
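The "at least three of five components" definition of MetS can be expressed directly; this toy function is illustrative only and is not part of the SWAN analysis:

```python
def has_mets(high_triglycerides, low_hdl, high_glucose, large_waist, hypertension):
    """MetS is defined as having at least three of the five risk components."""
    components = [high_triglycerides, low_hdl, high_glucose, large_waist, hypertension]
    return sum(components) >= 3  # True counts as 1, False as 0

# Three components present -> MetS; two present -> no MetS
has_mets(True, True, True, False, False)
has_mets(True, True, False, False, False)
```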

Tuesday 12th 10:30 Case Room 3 (260-055)

## Clustering Of Curves On A Spatial Domain Using A Bayesian Partitioning Model

Chae Young Lim
Seoul National University

Abstract: We propose a Bayesian hierarchical model for spatial clustering of high-dimensional functional data based on the effects of functional covariates. We couple the functional mixed-effects model with a generalized spatial partitioning method for: (1) identifying subregions for the high-dimensional spatio-functional data; (2) improving computational feasibility via parallel computing over subregions or multi-level partitions; and (3) addressing the near-boundary ambiguity in model-based spatial clustering techniques. The proposed model extends existing spatial clustering techniques to produce spatially contiguous partitions for spatio-functional data. In an application, the model successfully captured the regional effects of atmospheric and cloud properties on spectral radiance measurements, highlighting the importance of considering spatially contiguous partitions for identifying regional effects and small-scale variability.

Keywords: spatial clustering, Bayesian wavelets, Voronoi tessellation, functional covariates

Tuesday 12th 10:30 Case Room 4 (260-009)

## The Uncomfortable Entrepreneurs: Bad Working Conditions And Entrepreneurial Commitment

Catherine Laffineur
Université Côte d’Azur, GREDEG-CNRS

Abstract: In contrast to previous models that cast necessity entrepreneurs as individuals facing push factors due to lack of employment, we consider the possibility of push factors faced by employed individuals (Folta et al., 2010). The theoretical model yields distinctive predictions relating occupational characteristics to the probability of entry into entrepreneurship. Using PSED and O*NET data, we investigate how the characteristics of individuals' primary occupations affect the effort nascent entrepreneurs put into venture creation. The empirical evidence shows that necessity entrepreneurs are not confined to unemployed individuals. We find compelling evidence that individuals facing arduous working conditions (e.g. a stressful environment and physical tiredness) have a higher likelihood of entering and succeeding in self-employment than others. Conversely, individuals who experience a high degree of self-realization, independence and responsibility in the workplace are less committed to their business than individuals exposed to arduous working conditions. These findings have strong implications for how we interpret and analyze necessity entrepreneurs and provide novel insights into the role of occupational experience in the process of venture emergence.

Keywords: Entrepreneurship, Motivation, Occupational characteristics, Employment choice.

References:

Folta, T. B., Delmar, F., & Wennberg, K. 2010. Hybrid entrepreneurship. Management Science, 56(2), 253-269.

Tuesday 12th 10:50 098 Lecture Theatre (260-098)

## Spatial Surveillance With Scan Statistics By Controlling The False Discovery Rate

Xun Xiao
Massey University

Abstract: In this paper, I investigate a false discovery approach, based on spatial scan statistics, to detect spatial disease clusters in a geographical region, as proposed by Li et al. (2016). The incidence of disease is assumed to follow the inhomogeneous Poisson model discussed in Kulldorff (1997). I show that, although spatial scan statistics are highly correlated, the simple Benjamini-Hochberg (linear step-up) procedure controls their false discovery rate, by proving that the multivariate Poisson distribution satisfies the PRDS condition (positive regression dependence on a subset) of Benjamini and Yekutieli (2001).

Keywords: False Discovery Rate, Poisson Distribution, PRDS, Spatial Scan Statistics
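A minimal sketch of the linear step-up procedure referred to in the abstract; the p-values used in the example are made up for illustration:

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg linear step-up procedure: find the largest rank k
    with p_(k) <= k * q / m, then reject the k hypotheses with the smallest
    p-values. Returns a rejection flag for each input p-value."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k = rank
    rejected = set(order[:k])
    return [i in rejected for i in range(m)]

flags = benjamini_hochberg([0.01, 0.02, 0.03, 0.5], q=0.1)
```

The point of the abstract is that this procedure remains valid for the highly correlated scan statistics because the PRDS condition holds; the code above is just the generic procedure, agnostic to how the p-values arise.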

References:

Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency, Annals of Statistics, 29(4), 1165–1188.

Kulldorff, M. (1997). A spatial scan statistic, Communications in Statistics-Theory and Methods 26(6), 1481–1496.

Li, Y., Shu, L., and Tsung, F. (2016). A false discovery approach for scanning spatial disease clusters with arbitrary shapes, IIE Transactions, 48(7), 684–698.

Tuesday 12th 10:50 OGGB4 (260-073)

## Statistical Models For The Source Attribution Of Zoonotic Diseases: A Study Of Campylobacteriosis

Sih-Jing Liao, Martin Hazelton, Jonathan Marshall, and Nigel French
Massey University

Abstract: Preventing and controlling zoonoses with public health policy depends on what scientists know about the transmitted pathogens. Modelling the epidemiological data and genetic information jointly provides a methodology for tracing the source of infection. However, the increased model complexity makes it difficult to assess how much the genetic information contributes to the final statistical inferences. To explore the genetic effects in the joint model, we develop a genetic-free model and compare it to the joint model. We apply the two models to a recent campylobacteriosis study to estimate the attribution probability for each source. A spatial covariate is also considered in the models in order to investigate the effect of the level of rurality on the source attributions. Comparing the attributions generated by the two models, we find that: i) the genetic information integrated in the joint model gives slightly more precise inference for the sparse cases observed in highly rural areas than the genetic-free model; ii) on the logit scale, source attribution probabilities follow linear trends against the level of rurality; and iii) poultry is the dominant source of campylobacteriosis in urban centres, whereas ruminants are the most attributable source in rural areas.

Keywords: source attribution, Campylobacter, multinomial model, Dirichlet prior, HPD interval, DIC
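As a toy illustration of multinomial source attribution with a Dirichlet prior, the simplest conjugate version is sketched below; this is not the authors' joint model, and the case counts are hypothetical:

```python
def posterior_mean_attribution(case_counts, alpha=1.0):
    """Posterior mean source-attribution probabilities under a symmetric
    Dirichlet(alpha) prior with multinomial counts of attributed cases."""
    total = sum(case_counts.values())
    k = len(case_counts)
    return {src: (c + alpha) / (total + k * alpha)
            for src, c in case_counts.items()}

# Hypothetical counts of cases attributed to each source
post = posterior_mean_attribution({"poultry": 60, "ruminants": 25,
                                   "water": 10, "other": 5})
# poultry: (60 + 1) / (100 + 4), the familiar "add-one" posterior mean
```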


Tuesday 12th 10:50 OGGB5 (260-051)

## Towards An Informal Test For Goodness-Of-Fit

Anna Fergusson and Maxine Pfannkuch
University of Auckland

Abstract: Informal approaches to goodness-of-fit testing often involve examining the visual fit of the model to data 'by eye'. Such approaches are problematic from a pedagogical perspective for Year 13 and undergraduate students and teachers, as key aspects such as sample size, the number of categories and the expected variation of sample proportions are difficult to consider. In formal tests for goodness-of-fit, a test statistic is used in reference to its sampling distribution to decide whether the model distribution can be rejected. In general, a numeric test statistic does not have an obvious graphical representation within the data itself. This talk presents a new informal goodness-of-fit test that uses a simulation-based modelling tool. Drawing on ideas from graphical inference, the proposed test uses not numerical test statistics but plots as test statistics. Comparisons of performance demonstrate that the proposed test leads to similar decisions about the fit of the model distribution as the chi-square goodness-of-fit test. A research study with Year 13 teachers indicated that there could be pedagogical benefits to using this informal goodness-of-fit test in terms of introducing important modelling and hypothesis testing concepts.
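The simulation-based comparison that such a tool automates can be sketched as a Monte Carlo chi-square test; this is the plain numeric version, not the plot-based test proposed in the talk:

```python
import random

def chisq_stat(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def simulate_gof_pvalue(observed, probs, n_sim=2000, seed=1):
    """Monte Carlo goodness-of-fit: simulate samples of the same size from
    the model distribution and compare their chi-square statistics with
    the observed one."""
    rng = random.Random(seed)
    n = sum(observed)
    expected = [n * p for p in probs]
    obs_stat = chisq_stat(observed, expected)
    exceed = 0
    for _ in range(n_sim):
        counts = [0] * len(probs)
        for _ in range(n):
            u, acc = rng.random(), 0.0
            for i, p in enumerate(probs):
                acc += p
                if u < acc:
                    counts[i] += 1
                    break
            else:
                counts[-1] += 1  # guard against floating-point undershoot
        if chisq_stat(counts, expected) >= obs_stat:
            exceed += 1
    return (exceed + 1) / (n_sim + 1)
```

For example, 60 rolls of a fair die with observed counts close to 10 each yields a large p-value, while a grossly lopsided sample yields a tiny one, matching the decisions of the classical chi-square test.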

Tuesday 12th 10:50 Case Room 2 (260-057)

## Identifying Clusters Of Patients With Diabetes Using A Markov Birth-Death Process

Mugdha Manda, Thomas Lumley, and Susan Wells
University of Auckland

Abstract: Estimating disease trajectories has become increasingly essential for clinical practitioners administering effective treatment to their patients. Part of describing disease trajectories involves taking patients' medical histories and sociodemographic factors into account and grouping patients into similar groups, or clusters. Advances in computerised patient databases have paved the way for identifying such trajectories by recording a patient's medical history over a long period of time (longitudinal data): we studied data from the PREDICT-CVD dataset, a national primary-care cohort from which people with diabetes were identified through routine clinical practice over 2002-2015. We fitted a Bayesian hierarchical linear model with latent clusters to the repeated measurements of HbA$$_{1c}$$ and eGFR, using the Markov birth-death process proposed by Stephens (2000) to handle the changes in dimensionality as clusters were added or removed.

Keywords: Diabetes management, longitudinal data, Markov chain Monte Carlo, birth-death process, mixture model, Bayesian analysis, latent clusters, hierarchical models, primary care, clinical practice

References:

Stephens, M. (2000). Bayesian Analysis of Mixture Models with an Unknown Number of Components - An Alternative to Reversible Jump Methods. The Annals of Statistics, 28(1), 40–74.

Tuesday 12th 10:50 Case Room 3 (260-055)

## Bayesian Temporal Density Estimation Using Autoregressive Species Sampling Models

Youngin Jo1, Seongil Jo2, and Jaeyong Lee3
1Kakao Corporation
2Chonbuk National University
3Seoul National University

Abstract: We propose a Bayesian nonparametric (BNP) model, built on a class of species sampling models, for estimating density functions of temporal data. In particular, we introduce species sampling mixture models with temporal dependence. To accommodate temporal dependence, we define dependent species sampling models by modeling random support points and weights through an autoregressive model, and then we construct the mixture models based on the collection of these dependent species sampling models. We propose an algorithm to generate posterior samples and present simulation studies to compare the performance of the proposed models with competitors based on Dirichlet process mixture models. We apply our method to the estimation of densities for apartment prices in Seoul, the closing price of the Korea Composite Stock Price Index (KOSPI), and climate variables (daily maximum temperature and precipitation) around the Korean peninsula.

Keywords: Autoregressive species sampling models; Dependent random probability measures; Mixture models; Temporal structured data

Acknowledgements: This work is a part of the first author’s Ph.D. thesis at Seoul National University. Research of Seongil Jo was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2017R1D1A3B03035235). Research of Jaeyong Lee was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2011-0030811).

Tuesday 12th 10:50 Case Room 4 (260-009)

## How Does The Textile Set Describe Geometric Structures Of Data?

Ushio Tanaka1 and Tomonari Sei2
1Osaka Prefecture University
2University of Tokyo

Abstract: The textile set is defined from the textile plot proposed by Kumasaka and Shibata (2007, 2008), which is a powerful tool for visualizing high-dimensional data. The textile plot is based on a parallel coordinate plot, where the ordering, locations and scales of the axes are chosen simultaneously so that all connecting lines, each of which signifies an observation, are aligned as horizontally as possible. The textile plot transforms a data matrix in order to delineate a parallel coordinate plot. Using the geometric properties of the textile set derived by Sei and Tanaka (2015), we show that the textile set describes intrinsic geometric structures of data.

Keywords: Parallel coordinate plot, Textile set, Differentiable manifold

References:

Kumasaka, N. and Shibata, R. (2007). The Textile Plot Environment, Proceedings of the Institute of Statistical Mathematics, 55, 47–68.

Kumasaka, N. and Shibata, R. (2008). High-dimensional data visualisation: The textile plot, Computational Statistics and Data Analysis, 52, 3616–3644.

Sei, T. and Tanaka, U. (2015). Geometric Properties of Textile Plot: Geometric Science of Information, Lecture Notes in Computer Science, 9389, 732–739.

Tuesday 12th 11:10 098 Lecture Theatre (260-098)

## Intensity Estimation Of Spatial Point Processes Based On Area-Aggregated Data

Hsin-Cheng Huang and Chi-Wei Lai

Abstract: We consider estimation of the intensity function of spatial point processes based on area-aggregated data. A standard approach for estimating the intensity function of a spatial point pattern is to use a kernel estimator. However, when data are only available in a spatially aggregated form, with the numbers of events given for geographical subregions, traditional methods developed for individual-level event data become infeasible. In this research, a kernel-based method will be proposed to produce a smooth intensity function based on aggregated count data. Some numerical examples will be provided to demonstrate the effectiveness of the proposed method.

Keywords: Area censoring, inhomogeneous spatial point processes, kernel density estimation
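A naive version of the idea, placing each subregion's count at its centroid under a Gaussian kernel, can be sketched as follows; this is an assumption-laden simplification for illustration, not the estimator proposed in the talk:

```python
import math

def aggregated_intensity(s, regions, bandwidth):
    """Kernel intensity estimate at location s from area-aggregated counts.
    Each region is (centroid_x, centroid_y, count); a bivariate Gaussian
    kernel is centred at each centroid and weighted by the region's count."""
    h = bandwidth
    total = 0.0
    for cx, cy, count in regions:
        d2 = (s[0] - cx) ** 2 + (s[1] - cy) ** 2
        total += count * math.exp(-d2 / (2 * h * h)) / (2 * math.pi * h * h)
    return total
```

Collapsing a region to its centroid discards within-region information, which is precisely the limitation a properly designed aggregated-data estimator must address.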

Tuesday 12th 11:10 OGGB4 (260-073)

## Bayesian Inference For Population Attributable Measures

Sarah Pirikahu, Geoff Jones, Martin Hazelton, and Cord Heuer
Massey University

Abstract: Epidemiologists often wish to determine the population impact of an intervention to remove or reduce a risk factor. Population attributable measures, such as the population attributable risk (PAR) and population attributable fraction (PAF), provide a means of assessing this impact in a way that is accessible for a non-statistical audience. To apply these concepts to epidemiological data, the calculation of estimates and confidence intervals for these measures should take into account the study design (cross-sectional, case-control, survey) and any sources of uncertainty (such as measurement error in exposure to the risk factor). We provide methods to produce estimates and Bayesian credible intervals for the PAR and PAF from common epidemiological study types and assess their frequentist properties. The model is then extended by incorporating uncertainty due to the use of imperfect diagnostic tests for disease or exposure. The resulting model can be non-identifiable, causing convergence problems for common MCMC samplers such as Gibbs and Metropolis-Hastings. An alternative importance sampling method performs much better for these non-identifiable models and can be used to explore the limiting posterior distribution. The data used to estimate these population attributable measures may include multiple risk factors in addition to the one being considered for removal. Uncertainty regarding the distribution of these risk factors in the population affects the inference for the PAR and PAF. To allow for this we propose a methodology involving the Bayesian bootstrap. We also extend the analysis to allow for complex survey designs with unequal weights, stratification and clustering.
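For a cross-sectional 2x2 table, a Bayesian credible interval for the PAF can be sketched with a flat Dirichlet posterior on the cell probabilities; this is a minimal version of the approach, and the cell counts in the test are hypothetical:

```python
import random

def paf_credible_interval(n_de, n_du, n_he, n_hu, draws=4000, seed=7):
    """Posterior draws of the population attributable fraction
    PAF = (P(D) - P(D | unexposed)) / P(D) from a cross-sectional 2x2
    table (cells: diseased/healthy x exposed/unexposed), using a flat
    Dirichlet prior on cell probabilities sampled via Gamma variates."""
    rng = random.Random(seed)
    samples = []
    for _ in range(draws):
        g = [rng.gammavariate(c + 1.0, 1.0) for c in (n_de, n_du, n_he, n_hu)]
        total = sum(g)
        p_de, p_du, p_he, p_hu = (x / total for x in g)
        p_d = p_de + p_du                     # marginal disease probability
        p_d_unexp = p_du / (p_du + p_hu)      # disease risk among unexposed
        samples.append((p_d - p_d_unexp) / p_d)
    samples.sort()
    return samples[int(0.025 * draws)], samples[int(0.975 * draws)]
```

The Gamma-normalisation trick is the standard way to draw from a Dirichlet posterior; extensions for case-control designs, misclassification and survey weights, as in the talk, require replacing this simple likelihood.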

Tuesday 12th 11:10 OGGB5 (260-051)

## An Information Criterion For Prediction With Auxiliary Variables Under Covariate Shift

Takahiro Ido1, Shinpei Imori1,2, and Hidetoshi Shimodaira2,3
1Osaka University
2RIKEN Center for Advanced Intelligence Project (AIP)
3Kyoto University

Abstract: It is beneficial, when modeling data of interest, to exploit secondary information. This secondary information consists of auxiliary variables, which may not be observed in testing data because they are not of primary interest. In this paper, we incorporate the auxiliary variables into a framework of supervised learning. Furthermore, we consider a covariate shift situation, which allows the density function of the covariates to change between testing and training data. It is known that the Maximum Log-likelihood Estimate (MLE) is not a good estimator under model misspecification and covariate shift. This problem can be resolved by the Maximum Weighted Log-likelihood Estimate (MWLE).

When there are multiple candidate models, we need to select the best one, with optimality measured by the expected Kullback-Leibler (KL) divergence. The Akaike information criterion (AIC) is a well-known criterion based on the KL divergence and the MLE; therefore, its validity is not guaranteed when the MWLE is used under covariate shift. An information criterion for the covariate shift setting was proposed in Shimodaira (2000, JSPI), but it does not take the use of the auxiliary variables into account. Hence, we resolve this problem by deriving a new criterion. In addition, simulations are conducted to examine the improvement.

Keywords: Auxiliary variables; Covariate shift; Information criterion; Kullback-Leibler divergence; Misspecification; Predictions.
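In the simplest Gaussian regression case, the MWLE reduces to weighted least squares with weights w(x) = q(x)/p(x), the test density over the training density. The sketch below shows only this special case, not the authors' general formulation:

```python
def weighted_mle_slope(x, y, w):
    """Weighted ML estimate of the slope in y = b*x + Gaussian noise,
    with importance weights w(x) = q(x)/p(x): maximising the weighted
    log-likelihood is equivalent to weighted least squares here."""
    num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    den = sum(wi * xi * xi for wi, xi in zip(w, x))
    return num / den
```

When the model is correctly specified the weights do not change the target of estimation (as the test below illustrates with exact data); under misspecification plus covariate shift they are what makes the estimator focus on the test-time covariate distribution.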

Tuesday 12th 11:10 Case Room 2 (260-057)

## Analysis Of A Brief Telephone Intervention For Problem Gambling And Examining The Impact On Co-Existing Depression

Nick Garrett, Maria Bellringer, and Max Abbott
Auckland University of Technology

Abstract: This study investigated the outcomes of a brief telephone intervention for problem gambling. A total of 150 callers were recruited and followed for 36 months. After giving consent, participants received a baseline assessment followed by a manualised version of the helpline's standard care. Eighty-six percent of participants were re-assessed at three months, and 79% at later follow-up. Depression is often found to be associated with problem gambling behaviour, and analysis was undertaken to examine the impact of a brief telephone intervention for problem gambling on rates of depression using logistic regression. At baseline, depression was found to be associated with gender, problem gambling risk (PGSI), and deprivation (NZiDep). A multiple-variable model found that PGSI and mental health medication best explained depression at baseline. A repeated-measures logistic regression utilising all 36 months of data found that PGSI, NZiDep, and mental health medication were the best variables to explain the change over time. The conclusion was that the intervention's impact on problem gambling behaviour also changed depression rates; however, deprivation and mental health medication also contributed.

Tuesday 12th 11:10 Case Room 3 (260-055)

## Prior-Based Bayesian Information Criterion

M. J. Bayarri1, James Berger2, Woncheol Jang3, Surajit Ray4, Luis Pericchi5, and Ingmar Visser6
1University of Valencia
2Duke University
3Seoul National University
4University of Glasgow
5University of Puerto Rico
6University of Amsterdam

Abstract: We present a new approach to model selection and Bayes factor determination, based on Laplace expansions (as in BIC), which we call the Prior-based Bayes Information Criterion (PBIC). In this approach, the Laplace expansion is only done with the likelihood function, and then a suitable prior distribution is chosen to allow exact computation of the (approximate) marginal likelihood arising from the Laplace approximation and the prior. The result is a closed-form expression similar to BIC, but it now involves a term arising from the prior distribution (which BIC ignores) and also incorporates the idea that different parameters can have different effective sample sizes (whereas BIC only allows one overall sample size $$n$$). We also consider a modification of PBIC which is more favorable to complex models.

Keywords: Bayes factors, model selection, Cauchy priors, consistency, effective sample size, Fisher information, Laplace expansions, robust priors
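For contrast, ordinary BIC, the starting point that PBIC refines, is a one-liner; the numbers in the example are arbitrary:

```python
import math

def bic(loglik, k, n):
    """Schwarz's BIC: -2 * (maximised log-likelihood) + k * log(n).
    PBIC replaces the single overall n with per-parameter effective sample
    sizes and retains a prior-dependent term that BIC drops."""
    return -2.0 * loglik + k * math.log(n)

# A model with a slightly higher likelihood but more parameters can still
# lose on BIC: compare bic(-100.0, 2, 50) with bic(-98.0, 5, 50).
```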

Tuesday 12th 11:10 Case Room 4 (260-009)

## Early Childhood Dental Decay

Sarah Sonal
University of Canterbury

Abstract: Our teeth are some of our most useful tools. They let us eat tasty food, take those plastic tags off new clothes and enhance our smiles to convey joy. They also have to last us a lifetime and need to be looked after. Teeth are a mutually supportive structure; even one extraction can destabilize the remaining teeth. Early intervention in oral health can prevent a lifetime of discomfort, embarrassment and expensive treatments. An issue facing dentists in New Zealand and abroad is preschool children missing treatment appointments. These children have more dental issues in later childhood.

The research question I aim to answer is: does early dental neglect increase dental issues in later childhood? My thesis will use traditional statistics along with data mining and machine learning techniques to investigate these anecdotal claims.

Using the geographical information in the dataset, I will utilize deprivation data from Statistics New Zealand to investigate whether these children come from more deprived neighborhoods.

Tuesday 12th 11:30 098 Lecture Theatre (260-098)

## Geographically Weighted Principal Component Analysis For Spatio-Temporal Statistical Dataset

Narumasa Tsutsumida1, Paul Harris2, and Alexis Comber3
1Kyoto University
2Rothamsted Research
3University of Leeds

Abstract: Spatio-temporal statistical datasets are becoming widely available for social, economic, and environmental research; however, it is often difficult to summarize them and uncover hidden spatial/temporal patterns because of their complexity. Geographically weighted principal component analysis (GWPCA), which uses a moving window or kernel to apply localized PCAs over geographical space, is a promising approach, although optimizing the kernel bandwidth size and determining the number of components to retain (NCR) have been the main concerns (Tsutsumida et al. (2017)). In this research we determine both simultaneously, by minimizing the leave-one-out residual coefficient of variation of the GWPCA as the bandwidth size and NCR change. As a case study we use annual goat population statistics across 341 administrative units in Mongolia over 1990-2012, and show spatio-temporal variations in the data, especially those influenced by natural disasters.

Keywords: Geographically weighted model, Spatio-temporal data, Parameter optimization
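A single-site sketch of the GWPCA computation for two variables, using a Gaussian kernel and the closed-form eigenvalues of a 2x2 matrix; the data in the test are synthetic, not the Mongolian goat statistics:

```python
import math

def local_pc1_share(coords, values, site, bandwidth):
    """At one site, compute a Gaussian-kernel geographically weighted
    covariance of two variables and return the variance share of the
    first local principal component (closed form for the 2x2 case)."""
    w = [math.exp(-((x - site[0]) ** 2 + (y - site[1]) ** 2)
                  / (2 * bandwidth ** 2)) for x, y in coords]
    sw = sum(w)
    m1 = sum(wi * a for wi, (a, b) in zip(w, values)) / sw
    m2 = sum(wi * b for wi, (a, b) in zip(w, values)) / sw
    s11 = sum(wi * (a - m1) ** 2 for wi, (a, b) in zip(w, values)) / sw
    s22 = sum(wi * (b - m2) ** 2 for wi, (a, b) in zip(w, values)) / sw
    s12 = sum(wi * (a - m1) * (b - m2) for wi, (a, b) in zip(w, values)) / sw
    tr = s11 + s22
    disc = math.sqrt(max(tr * tr / 4 - (s11 * s22 - s12 * s12), 0.0))
    return (tr / 2 + disc) / tr  # largest eigenvalue / total local variance
```

A full GWPCA repeats this at every location and over many variables; the bandwidth in this sketch is fixed, whereas the talk's contribution is choosing it jointly with the number of components to retain.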

References:

Tsutsumida, N., Harris, P., and Comber, A. (2017). The Application of a Geographically Weighted Principal Component Analysis for Exploring Twenty-three Years of Goat Population Change across Mongolia. Annals of the American Association of Geographers, 107(5), 1060–1074.

Tuesday 12th 11:30 OGGB4 (260-073)

## Dimensionality Reduction Of Multivariate Data For Bayesian Analysis

Anjali Gupta1, James Curran1, Sally Coulson2, and Christopher Triggs1
1University of Auckland
2ESR

Abstract: In 2004, Aitken and Lucy published an article detailing a two-level likelihood ratio for multivariate trace evidence. This model has been adopted in a number of forensic disciplines such as the interpretation of glass, drugs (MDMA), and ink. Modern instrumentation is capable of measuring many elements in very low quantities and, not surprisingly, forensic scientists wish to exploit the potential of this extra information to increase the weight of this evidence. The issue, from a statistical point of view, is that the increase in the number of variables (dimension) in the problem leads to an increased demand for data to understand both the variability within a source and between sources. Such information will come in time, but usually we do not have enough. One solution to this problem is to attempt to reduce the dimensionality through methods such as principal component analysis. This practice is quite common in high-dimensional machine learning problems. In this talk, I will describe a study where we attempt to quantify the effects of this approach on the resulting likelihood ratios using data obtained from a SEM-EDX instrument.
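A dependency-free sketch of extracting the first principal component, the kind of reduction discussed above, via power iteration on a toy data matrix; this is illustrative and uses synthetic numbers, not the study's SEM-EDX data:

```python
def leading_component(data, iters=200):
    """Power iteration for the first principal component of a small data
    matrix (list of rows): centre the columns, form the sample covariance
    matrix, and iterate v <- C v / ||C v|| until it aligns with the
    dominant eigenvector."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    X = [[row[j] - means[j] for j in range(p)] for row in data]
    C = [[sum(X[i][a] * X[i][b] for i in range(n)) / (n - 1)
          for b in range(p)] for a in range(p)]
    v = [1.0] * p
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v
```

Projecting the elemental measurements onto a few such components before fitting the two-level model is exactly the step whose effect on the likelihood ratio the study quantifies.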

Tuesday 12th 11:30 OGGB5 (260-051)

## An EWMA Chart For Monitoring Covariance Matrix Based On Dissimilarity Index

Longcheen Huwang
National Tsing Hua University

Abstract: In this talk, we propose an EWMA chart for monitoring a covariance matrix based on the dissimilarity index of two matrices. It differs from conventional EWMA charts for monitoring a covariance matrix, which compare the sum or the product (or both) of the eigenvalues of the estimated EWMA covariance matrix with those of the in-control (IC) covariance matrix. The proposed chart instead monitors the covariance matrix by comparing the individual eigenvalues of the estimated EWMA covariance matrix with those of the covariance matrix estimated from the IC phase I data. We evaluate the performance of the proposed chart by comparing it with the best existing chart under the multivariate normal process. Furthermore, to prevent the control limit of the proposed EWMA chart, computed from the limited IC phase I data, from producing excessive false alarms, we use a bootstrap method to adjust the control limit so that the proposed chart attains an actual IC average run length no less than the nominal one with a certain probability. Finally, we use an example to demonstrate the applicability and implementation of the proposed chart.

Keywords: Average run length, dissimilarity index, EWMA, out-of-control

References:

Hawkins, D.M. and Maboudou-Tchao E.M. (2008). Multivariate exponentially weighted moving covariance matrix. Technometrics, 50, 155-166.

Kano, M., Hasebe, S. and Hashimoto, I. (2002). Statistical process monitoring based on dissimilarity of process data. AIChE Journal, 48, 1231-1240.

Tuesday 12th 11:30 Case Room 2 (260-057)

Patrick Graham
Stats NZ and Bayesian Research

Keywords: Record linkage, Missing data, Bayesian inference, Gibbs sampler, Multiple imputation

Tuesday 12th 11:30 Case Room 3 (260-055)

## Bayesian Semiparametric Hierarchical Models For Longitudinal Data Analysis With Application To Dose-Response Studies

Taeryon Choi
Korea University

Abstract: In this work, we propose semiparametric Bayesian hierarchical additive mixed effects models for analyzing longitudinal or clustered data, with applications to dose-response studies. In the semiparametric mixed effects model structure, we estimate nonparametric smoothing functions of continuous covariates using a spectral representation of Gaussian processes, and the subject-specific random effects using Dirichlet process mixtures. In this framework, we develop semiparametric mixed effects models that include normal regression and quantile regressions with or without shape restrictions. In addition, we deal with Bayesian nonparametric measurement error models, or errors-in-variables regression models, using Fourier series and Dirichlet process mixtures, in which the true covariate is not observable and only a surrogate of it is observed. The proposed methodology is compared with other existing approaches to additive mixed models in simulation studies and benchmark data examples. More importantly, we consider a real data application for dose-response analysis, in which measurement errors and shape constraints in the regression functions need to be incorporated along with inter-study variability.

Keywords: Cadmium toxicity, Cosine series, Dose-response study, Hierarchical Model, Measurement errors, Shape restriction

Tuesday 12th 11:30 Case Room 4 (260-009)

## Optimizing Junior Rugby Weight Limits

Emma Campbell, Ankit Patel, and Paul Bracewell
DOT Loves Data

Abstract: The New Zealand rugby community is aware of safety issues within the junior game and has applied weight limits for each tackle grade to minimize injury risk. However, for heavier children this can create an uncomfortable situation as they may no longer be playing with their peer group. The study evaluated almost 13,000 observations from junior rugby players across three seasons (2015-2017) using data supplied by Wellington Rugby. To protect privacy, the data was structured so that an individual could not be readily identified but could be tracked across seasons to determine churn. As data for several consecutive seasons was available, we could determine the likelihood of a junior player returning the following season and isolate the drivers of this behaviour. Applying a logistic regression and repeated measures analysis, the study determined whether children who are over the specified weight limit for their age group are more likely to leave the game. Furthermore, assuming the importance of playing with peers, the study identified the impact of age in relation to the date-of-birth cut-off of January 1st. This is of interest given that a child playing above their age-weight grade could be competing against individuals three school years above them. The study primarily focuses on determining the optimal age-weight bands while the secondary focus is on determining the likelihood of a junior Wellington rugby player returning the following season and isolating the drivers of this behaviour.
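A retention model of this kind can be sketched as follows; the data, variable names (`over_limit`, `age_gap`) and effect sizes are hypothetical illustrations, not values from the Wellington data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: returned = 1 if the player came back next season.
n = 2000
over_limit = rng.integers(0, 2, n)      # above the grade weight limit?
age_gap = rng.normal(0, 1, n)           # age relative to the Jan 1 cut-off
true_logit = 1.0 - 1.2 * over_limit + 0.3 * age_gap   # assumed effects
returned = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(float)

# Fit a logistic regression by gradient ascent on the log-likelihood.
X = np.column_stack([np.ones(n), over_limit, age_gap])
beta = np.zeros(3)
for _ in range(5000):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += 0.5 * X.T @ (returned - p) / n   # average-gradient step

print(beta)   # a negative over_limit coefficient means lower retention
```

A negative fitted coefficient on `over_limit` would correspond to over-weight players being less likely to return, which is the kind of effect the study tests for.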

Keywords: Logistic regression, repeated measures, player retention, optimization

Tuesday 12th 11:50 098 Lecture Theatre (260-098)

## Spatial Scan Statistics For Matched Case-Control Data

Inkyung Jung
Yonsei University College of Medicine

Abstract: Spatial scan statistics are widely used for cluster detection analysis in geographical disease surveillance. While the method has been developed for various types of data such as binary, count and continuous data, spatial scan statistics for matched case-control data, which often arise in spatial epidemiology, have not been considered yet. In this paper, we propose two spatial scan statistics for matched case-control data. The proposed test statistics properly consider the correlations between matched pairs. We evaluate the statistical power and cluster detection accuracy of the proposed methods through simulations comparing with the Bernoulli-based method. We illustrate the methods with a real data example.
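As a loose one-dimensional illustration of the idea (not the proposed scan statistics themselves), a McNemar statistic computed over moving windows can flag a region where the matched pairs are unusually discordant; the data and cluster location below are simulated:

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated matched pairs along a line: exposure indicators for each
# case and its matched control. Inside [80, 120) cases are exposed far
# more often than controls (an artificial cluster).
loc = np.arange(200)
case = rng.random(200) < np.where((loc >= 80) & (loc < 120), 0.8, 0.2)
ctrl = rng.random(200) < 0.2

def mcnemar_stat(case, ctrl):
    """McNemar chi-square from the discordant matched pairs."""
    b = np.sum(case & ~ctrl)       # case exposed, control not
    c = np.sum(~case & ctrl)       # control exposed, case not
    return (b - c) ** 2 / (b + c) if b + c else 0.0

# scan fixed-width windows and keep the most discordant one
stats = [mcnemar_stat(case[t:t + 40], ctrl[t:t + 40]) for t in range(161)]
t_hat = int(np.argmax(stats))
print(t_hat, stats[t_hat])
```

The winning window overlaps the simulated cluster; the paper's statistics do this over spatial scanning windows while keeping the pairwise correlation structure intact.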

Keywords: Spatial epidemiology, cluster detection, SaTScan, McNemar test, conditional logistic regression

Tuesday 12th 11:50 OGGB4 (260-073)

## Whitebait In All Its Varieties: One Fish, Two Fish, Three, Four, Five Fish.

Bridget Armstrong
University of Canterbury

Abstract: There are five species of fishes of the genus Galaxias that make up whitebait catches in New Zealand, although one species (G. maculatus) makes up >90% of the catch. Whitebait are immature post-larval fish that have yet to develop the distinctive morphological traits of adults, and at this tiny stage the five species are difficult to tell apart. There are also distinct spatial (rivers) and temporal (different months in the whitebait fishing season) differences among the species and even within species. To manage the fishery better it is necessary to identify regional differences in the species composition of catches, which is difficult because of the time and effort required to sample catches and identify species morphologically or genetically. In my study, I will use a recently compiled database comprising 17,000 entries of whitebait samples, species composition, and variability to develop a statistical model to predict the species composition of catches throughout New Zealand. This probabilistic model could potentially be a powerful tool in the fishery and conservation of whitebait species, some of which are considered to be threatened.

Tuesday 12th 11:50 OGGB5 (260-051)

## Latent Variable Models And Multivariate Binomial Data

John Holmes
University of Otago

Abstract: A large body of work has been devoted to latent variable models applicable to multivariate binary data. However, little work has gone into extending these models to cases where the observed data are multivariate binomial. In this paper, we will first show that models using either a logit or probit link function offer the same level of modelling flexibility in the binary case, but only the logit link fits into a data augmentation approach that extends compactly from binary to binomial data. Secondly, we will demonstrate that multivariate binomial data provide greater flexibility in how the link function can be represented. Lastly, we will consider properties of the implied distribution of latent probabilities under a logit link.
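The implied distribution of latent probabilities under a logit link (a logit-normal distribution; cf. Johnson, 1949) can be illustrated by pushing Gaussian latent scores through the inverse link; the parameter values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)

# One latent Gaussian score per subject, pushed through the inverse
# logit link: the implied success probabilities are logit-normal.
mu, sigma = 0.5, 1.0
z = rng.normal(mu, sigma, 100_000)
p = 1 / (1 + np.exp(-z))          # latent probabilities in (0, 1)

# Binomial observations given the latent probabilities (m trials each),
# as in the multivariate binomial setting with a single dimension shown.
m = 10
y = rng.binomial(m, p)

print(p.mean(), y.mean() / m)     # the two agree: E[y/m] = E[p]
```

The binomial layer adds information about each subject's latent probability that a single binary response cannot, which is what makes the binomial extension attractive.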

Keywords: Multivariate binomial data, principal components/factor analysis, item response theory, link functions, logit-normal distributions

References:

Bartholomew, D. J., Knott, M. and Moustaki, I. (2011). Latent Variable Models and Factor Analysis: A Unified Approach. Chichester: John Wiley & Sons.

Johnson, N.L. (1949). Systems of frequency curves generated by methods of translation. Biometrika, 36, 149–176.

Polson, N. G. and Scott, J. G. and Windle, J. (2013). Bayesian inference for logistic models using Pólya-gamma latent variables. Journal of the American Statistical Association, 108, 1339–1349.

Tuesday 12th 11:50 Case Room 2 (260-057)

## Asking About Sex In General Health Surveys: Comparing The Methods And Findings Of The 2010 Health Survey For England With Those Of The Third National Survey Of Sexual Attitudes And Lifestyles

Philip Prah1, Anne Johnson2, Soazig Clifton2, Jennifer Mindell2, Andrew Copas2, Chloe Robinson3, Rachel Craig3, Sarah Woodhall2, Wendy Macdowall4, Elizabeth Fuller3, Bob Erens2, Pam Sonnenberg2, Kaye Wellings4, Catherine Mercer2, and Anthony Nardone5
1Auckland University of Technology
2University College London
3NatCen
4London School of Hygiene & Tropical Medicine
5Public Health England

Abstract: Including questions about sexual health in the annual Health Survey for England (HSE) provides opportunities for regular measurement of key public health indicators, augmenting Britain’s decennial National Survey of Sexual Attitudes and Lifestyles (Natsal). However, contextual and methodological differences may limit comparability of the findings. For instance, both surveys used self-completion for administering sexual behaviour questions, but this was via computer-assisted self-interview (CASI) in Natsal-3 and a pen-and-paper questionnaire in HSE 2010. We examine the extent of these differences between HSE 2010 and Natsal-3 (undertaken 2010-2012) and investigate their impact on parameter estimates. For inclusion in this study, we restricted participants to men and women in HSE 2010 (n = 2,782 men and 3,588 women) and Natsal-3 (n = 4,882 men and 6,869 women) aged 16-69 years and resident in England. We compared their demographic characteristics, the amount of non-response to, and estimates from, the sexual health questions. We used complex survey analysis to take into account stratification, clustering, and weighting of the data in each survey. Logistic regression was used to measure the extent to which sexual health estimates differ in HSE 2010 relative to Natsal-3, with multivariable models to adjust for significant demographic confounders. Additionally, we investigated age-group interactions to see if differences between the surveys varied by age. The surveys achieved similar response rates, both around 60%. While a relatively high response to sexual health questions in HSE 2010 demonstrates the feasibility of asking such questions in a general health survey, differences with Natsal-3 do exist. These are likely due to the HSE’s context as a general health survey and methodological limitations such as its current use of pen-and-paper questionnaires.

Tuesday 12th 11:50 Case Room 3 (260-055)

## Bayesian Continuous Space-Time Model Of Burglaries

Chaitanya Joshi, Paul Brown, and Stephen Joe
University of Waikato

Abstract: Building a predictive model of crime with good predictive accuracy has great value in enabling efficient use of policing resources and reduction in crime. Building such models is not straightforward, though, due to the dynamic nature of the crime process. Crime not only evolves over both space and time, but is also related to several complex socio-economic factors, not all of which can be measured directly and accurately. The last decade or more has seen a surge in the effort to model crime more accurately, yet many of the models developed so far have failed to capture crime with a high degree of accuracy. The main reasons could be that these models discretise space using grid cells and that they are spatial, not spatio-temporal. We fit a log Gaussian Cox process model using the INLA-SPDE approach. This not only allows us to capture crime as a process continuous in both space and time, but also allows us to include socio-economic factors as well as the ‘near repeat’ phenomenon. In this talk, we will discuss the model building process and the accuracy achieved.

Keywords: Bayesian spatio-temporal model, INLA-SPDE, predicting crime

Tuesday 12th 11:50 Case Room 4 (260-009)

## Tolerance Limits For The Reliability Of Semiconductor Devices Using Longitudinal Data

Vera Hofer1, Johannes Leitner1, Horst Lewitschnig2, and Thomas Nowak1
1University of Graz
2Infineon Technologies Austria AG

Abstract: Especially in the automotive industry, semiconductor devices are key components for the proper functioning of the entire vehicle. Therefore, issues concerning the reliability of these components are of crucial importance to manufacturers of semiconductor devices.

In this quality control task, we consider longitudinal data from high temperature operating life tests. Manufacturers then need to find appropriate tolerance limits for their final electrical product tests, such that the proper functioning of their devices is ensured. Based on these datasets, we compute tolerance limits that could then be used by automated test equipment for the ongoing quality control process. Devices with electrical parameters within their respective tolerance limits can successfully finish the production line, while all other devices will be discarded. In calculating these tolerance limits, our approach consists of two steps: First, the observed measurements are transformed in order to capture measurement biases and gauge repeatability and reproducibility. Then, in the second step, we compute tolerance limits based on a multivariate copula model with skew normal distributed margins. In order to solve the resulting optimization problem, we propose a new derivative-free optimization procedure.

The capability of the model is demonstrated by computing optimal tolerance limits for several drift patterns that are expected to cover a wide range of scenarios. Based on these computations, we show the resulting yield losses and analyze the performance of the tolerance limits in a large simulation study.

Acknowledgment

This work was supported by the ECSEL Joint Undertaking under grant agreement No. 662133 - PowerBase. This Joint Undertaking receives support from the European Union’s Horizon 2020 research and innovation programme and Austria, Belgium, Germany, Italy, Netherlands, Norway, Slovakia, Spain and United Kingdom.

Keywords: quality control, tolerance limits, copulas, skew normal distribution

Tuesday 12th 16:00 098 Lecture Theatre (260-098)

## Model-Checking For Regressions: A Local Smoothing-Based Global Smoothing Test

Lingzhu Li and Lixing Zhu
Hong Kong Baptist University

Abstract: Local smoothing tests and global smoothing tests, the two main kinds of methods for the model specification problem, exhibit different characteristics. Compared with global smoothing tests, local smoothing tests can only detect local alternatives distinct from the null hypothesis at a much slower rate when the dimension of the predictor vector is high, but can be more sensitive to high-frequency alternatives. We suggest a projection-based test that builds a bridge between the local and global smoothing methodologies to benefit from both of their advantages. The test construction is based on a kernel estimation-based local smoothing method, and the resulting test becomes a distance-based global smoothing test. A closed-form expression of the test statistic is derived and the asymptotic properties are investigated. Simulations and a real data analysis are conducted to evaluate the performance of the test in finite sample cases.

Keywords: Global smoothing test, projection-based methods, local smoothing test

References:

Zheng, J. X. (1996). A consistent test of functional form via nonparametric estimation techniques. Journal of Econometrics, 75(2), 263–289.

Bierens, H. J. (1982). Consistent model specification tests. Journal of Econometrics, 20, 105–134.

Lavergne, P. and Patilea, V. (2012). One for all and all for one: regression checks with many regressors. Journal of Business & Economic Statistics, 30(1), 41–52.

Tuesday 12th 16:00 OGGB4 (260-073)

## Breeding Value Estimation In Partially-Genotyped Populations

Alastair Lamont
University of Otago

Abstract: In livestock, a primary goal is the identification of individuals’ breeding values, a measure of their genetic worth. This identification can aid selective breeding, but is non-trivial given how large the data can be.

Measured traits are typically modelled as being caused by both breeding values and environmental fixed effects. An efficient method for fitting this model, based upon generalized least squares, was developed by Henderson (1984). This method can be applied to data where the pedigree (how the animals are related to one another) is fully known.

Improvements in technology have allowed the genetic information of an animal to be directly measured. These measurements can be taken very early in life, with the goal of informing selective breeding faster and more efficiently. Meuwissen (2001) adapted the standard model to incorporate genetic data, and additionally developed multiple fitting methods for this model.

Modern datasets are frequently only partially genotyped. The methods of Meuwissen cannot be used for these data, as they are only applicable to populations in which every individual is genotyped. Modern fitting approaches aim to make use of the available genetic information without requiring that all individuals be genotyped.

These approaches tend to either impute or average over missing genotype data, which can affect the overall accuracy of breeding value estimation. We are developing an alternative which instead incorporates missing data within the model, rather than having to adapt fitting approaches to accommodate it.

Preliminary results suggest that approaching fitting in this way can lead to improved accuracy of estimation in certain situations.

Tuesday 12th 16:00 OGGB5 (260-051)

## BIVAS: A Scalable Bayesian Method For Bi-Level Variable Selection

Mingxuan Cai1, Mingwei Dai2, Jingsi Ming1, Jin Liu3, Can Yang4, and Heng Peng1
1Hong Kong Baptist University
2Xi’an Jiaotong University
3Duke-NUS Medical School
4Hong Kong University of Science and Technology

Abstract: In this paper we propose a bi-level variable selection approach, BIVAS, for linear regression under the Bayesian framework. This model assumes that each variable is assigned to a pre-specified group, where only a subset of the groups truly contribute to the response variable. Moreover, within the active groups, only a small number of variables are important. A hierarchical formulation is adopted to mimic this pattern, in which a spike-and-slab prior is placed at both the individual-variable level and the group level. A computationally efficient algorithm is developed using variational inference. Both simulation studies and real examples are analyzed, through which we illustrate the advantages of our method for both variable selection and parameter estimation under certain conditions.

Tuesday 12th 16:00 Case Room 2 (260-057)

## Ranking Potential Shoplifters In Real Time

Barry McDonald
Massey University

Abstract: A company with a focus on retail crime prevention brought to MINZ (Mathematics in Industry in New Zealand) the task of “Who is most likely to offend in my store, now”. The company supplied an anonymised set of data on incidents and offenders. The task, for the statisticians and mathematicians involved, was to try to find ways to use the data to nominate, say, the top ten likely offenders for any particular store and any particular time, using up-to-the-minute information (real time). The problem was analogous to finding a regression model when every row of data has response identically 1 (an incident), and for many places and times there is no data. This talk will describe how the problem was tackled.

Keywords: Retail crime, ranking, ZINB, regression, real time

Tuesday 12th 16:00 Case Room 3 (260-055)

## Two Stage Approach To Data-Driven Subgroup Identification In Clinical Trials

Toshio Shimokawa and Kensuke Tanioka
Wakayama Medical University

Abstract: Personalized medicine has been advanced through the statistical analysis of big data such as registry data. In this research area, subgroup identification analysis has been a focus. The purpose of the analysis is to detect subgroups in which the medical treatment is effective, based on predictive factors for the treatment.

Foster et al. (2011) proposed a subgroup identification method based on a two-stage approach, called the Virtual Twins (VT) method. In the first stage of VT, the difference in treatment effect between the treatment group and the control group is estimated by Random Forest. In the second stage, responders are identified using CART, with these estimated differences as the predictor variables.

However, the prediction accuracy of Random Forest tends to be lower than that of boosting. Therefore, a generalized boosted model (Ridgeway, 2006) is adopted in the first stage. In addition, the number of rules tends to be large in the second stage when CART is used. In this paper, we adopt the a priori algorithm in the same way as SIDES (Lipkovich et al., 2011).

Keywords: A priori algorithm, boosting, personalized medicine

References:

Foster, J.C., Taylor, J.M.G. and Ruberg, S.J. (2011). Subgroup identification from randomized clinical trial data. Stat. Med., 30, 2867-2880.

Lipkovich, I., Dmitrienko, A., Denne, J. and Enas, G. (2011). Subgroup identification based on differential effect search-recursive partitioning method for establishing response to treatment in patient subpopulations. Stat.Med, 30, 2601-2880.

Ridgeway, G. (2006). gbm: Generalized boosted regression models. R package version 1.5-7. Available at http://www.i-pensieri.com/gregr/gbm.shtml.

Tuesday 12th 16:20 098 Lecture Theatre (260-098)

## Inverse Regression For Multivariate Functional Data

Ci-Ren Jiang1 and Lu-Hung Chen2
2National Chung Hsing University

Abstract: Inverse regression is an appealing dimension reduction method for regression models with multivariate covariates. Recently, it has been extended to cases with functional or longitudinal covariates. However, these extensions focus on a single functional/longitudinal covariate only. In this work, we extend functional inverse regression to cases with multivariate functional covariates. The asymptotic properties of the proposed estimators are investigated. Simulation studies and data analysis are also provided to demonstrate the performance of our method.

Keywords: Multidimensional/Multivariate Functional Data Analysis, Inverse Regression, Parallel Computing, Smoothing

Tuesday 12th 16:20 OGGB4 (260-073)

## Including Covariate Estimation Error When Predicting Species Distributions: A Simulation Exercise Using Template Model Builder

Andrea Havron and Russell Millar
University of Auckland

Abstract: Ecological managers often require knowledge about species distributions across a spatial region in order to facilitate best management practices. Statistical models are frequently used to infer relationships between species observations (e.g., presence, abundance, biomass) and environmental covariates in order to predict values at unobserved locations. Issues remain for situations where covariate information is not available at a predictive location. In these cases, spatial maps of covariates are often generated using tools such as kriging; however, the uncertainties from this statistical estimation are not carried through to the final species distribution map. New advances in spatial modelling using the automatic differentiation software Template Model Builder allow both the spatial process of the environmental covariates and the observations to be modelled simultaneously by maximizing the marginal likelihood of the fixed effects with a Laplace approximation after integrating out the random spatial effects. This method allows the uncertainty of the covariate estimation process to be included in the standard errors of the final predictions, as well as in any derived quantities, such as total biomass for a spatial region. We intend to demonstrate this method and compare our predictions to those from a model where regional covariate information is supplied by a kriging model.

Keywords: spatial model, predicting covariates, Template Model Builder

References:

Kristensen, K., Nielsen, A., Berg, C.W., Skaug, H. and Bell, B. (2015). TMB: Automatic Differentiation and Laplace Approximation. Journal of Statistical Software, 70, 1–21.

Tuesday 12th 16:20 OGGB5 (260-051)

Ke Wan1, Kensuke Tanioka1, Kun Yang2, and Toshio Shimokawa1
1Wakayama Medical University
2Southwest Jiaotong University

Abstract: In questionnaire surveys, multiple regression analysis is usually used to evaluate influencing factors. In addition, data mining methods such as Classification and Regression Trees (Breiman et al., 1984) are also used. In tourism research, however, it is difficult to derive policies concerning landscape or buildings from such results; we call these factors “uncontrollable explanatory variables”. On the other hand, policies concerning the amount of garbage or inhabitant consciousness can be derived from the results; we call these factors “controllable explanatory variables”. The purpose of this report is to grade each subject based on the controllable explanatory variables while adjusting for the effects of the uncontrollable explanatory variables. Concretely, we modify the AIM method (Tian and Tibshirani, 2011) and conduct grading based on the sum of the production rules for the controllable explanatory variables, adjusting for the effects of the uncontrollable explanatory variables.

Keywords: logistic regression, production rule, grading

References:

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984). Classification and Regression Trees. Wadsworth.

Tian, L., and Tibshirani, R. (2011). Adaptive index models for marker-based risk stratification. Biostatistics, 12, 68–86.

Tuesday 12th 16:20 Case Room 2 (260-057)

## Factors Influencing On Growth Of Garments Industry In Bangladesh

Auckland University of Technology

Keywords: Bangladesh Garments, Growth of Garment Industry, Performance of Manufacturers and Traders, Statistical Model

Tuesday 12th 16:20 Case Room 3 (260-055)

## Comparison Of Exact And Approximate Testing Procedures In Clinical Trials With Multiple Binary Endpoints

Takuma Ishihara and Kouji Yamamoto
Osaka City University

Abstract: In confirmatory clinical trials, the efficacy of a test treatment is sometimes assessed using multiple primary endpoints. We consider a trial in which the efficacy of a test treatment is confirmed only when it is superior to the control for at least one of the endpoints and not clinically inferior for the remaining endpoints. Nakazuru et al. (2014) proposed a testing procedure that is applicable to the above case when the endpoints are continuous variables. In this presentation, we first propose a testing procedure for the case where all of the endpoints are binary.

Westfall and Troendle (2008) proposed multivariate permutation tests. Using this method, we also propose an exact multiple testing procedure.

Finally, we compare the exact and approximate testing procedures proposed above. The performance of the proposed procedures is examined through Monte Carlo simulations.

Keywords: Clinical trial, multivariate Bernoulli distribution, non-inferiority, superiority

References:

Nakazuru, Y., Sozu, T., Hamada, C. and Yoshimura, I. (2014). A new procedure of one-sided test in clinical trials with multiple endpoints. Japanese Journal of Biometrics, 35, 17-35.

Westfall PH and Troendle JF. (2008). Multiple testing with minimal assumptions. Biometrical Journal, 50(5), 745-755.

Tuesday 12th 16:40 098 Lecture Theatre (260-098)

## Multiple Function-On-Function Linear Regression With Application To Weather Forecast Calibration

Min-Chia Huang, Xin-Hua Wang, and Lu-Hung Chen
National Chung Hsing University

Abstract: We suggest a direct approach to estimating the coefficient functions in function-on-function linear regression models. To avoid the risk of discarding useful information for the regression, the approach does not depend on basis representations or dimension reductions. It can accommodate multiple functional responses and multiple functional predictors on different multidimensional domains, observed on dense or irregular sparse grids. We demonstrate the performance of the approach through simulation studies and a real application to calibrating numerical weather forecasts.

Tuesday 12th 16:40 OGGB4 (260-073)

## Modelling The Distribution Of Lifetime Using Compound Time-Homogeneous Poisson Process

Kien Tran
Victoria University of Wellington

Abstract: Modelling the distribution of lifetime has traditionally been done by constructing a deterministic function for the survival function and/or the force of mortality. This paper outlines previous research and presents the author’s initial attempts to model the force of mortality and remaining lifetime using time-homogeneous compound Poisson processes.

The paper presents two models. In model 1, the force of mortality of an individual is modelled as a random sum of i.i.d. random variables (i.e. a compound Poisson process). In model 2, each individual is assumed to have an initial normally distributed innate lifetime, and their remaining life is a shifted compound Poisson process. In other words, we assume that there are random events arriving at a constant rate that modify either the force of mortality or the remaining lifetime of individuals. Simulations in R are then run to find the optimized parameters, and the empirical survival function, force of mortality and distribution of lifetime are constructed. Finally, these outputs are compared with existing models and actual demographic data.

It turns out that for model 1, it is very difficult to model the force of mortality using a time-homogeneous compound Poisson process without introducing additional complications such as the inclusion of event times. For model 2, however, if we allow the events to be Cauchy random variables, then we can model the survival function of the New Zealand population much better than several existing well-known specifications such as the Weibull.
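Model 2 can be sketched in a few lines; the rates and distributions below are illustrative stand-ins (Gaussian events rather than the Cauchy variant discussed above, and arbitrary parameter values, not the author's fitted ones):

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_lifetimes(n, rate=0.5, window=20, innate_mu=70, innate_sd=10):
    """Model 2 sketch: a normally distributed innate lifetime shifted by
    a compound Poisson sum of random life-modifying events."""
    innate = rng.normal(innate_mu, innate_sd, n)
    # number of events over a fixed exposure window (constant rate)
    k = rng.poisson(rate * window, n)
    shifts = np.array([rng.normal(0, 2, ki).sum() for ki in k])
    return innate + shifts

t = simulate_lifetimes(10_000)
s70 = (t > 70).mean()     # empirical survival function evaluated at 70
print(s70)
```

From the simulated lifetimes one can build the empirical survival function and compare it against parametric specifications such as the Weibull, as the paper does with the R simulations.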

Keywords: Distribution of lifetime, force of mortality, survival function, time-homogeneous compound Poisson process, innate lifetime, R simulation

References:

Khmaladze, E. (2013). Statistical Methods with Applications to Demography and Life Insurance. CRC Press.

Weibull, W. (1939). A statistical theory of the strength of materials. Generalstabens litografiska anstalts förlag, 1st edition.

Gompertz, B. (1825). On the nature of the function expressive of the law of human mortality, and on a new mode of determining the value of life contingencies. Philosophical Transactions of the Royal Society of London, 115, 513-583.

Tuesday 12th 16:40 OGGB5 (260-051)

## Detecting Change-Points In The Stress-Strength Reliability P(X<Y)

Hang Xu1, Philip L.H. Yu1, and Mayer Alvo2
1University of Hong Kong
2University of Ottawa

Abstract: We address the statistical problem of detecting change-points in the stress-strength reliability $$R=P(X<Y)$$ in a sequence of paired variables $$(X,Y)$$. Without specifying their underlying distributions, we embed this non-parametric problem into a parametric framework and apply the maximum likelihood method via a dynamic programming approach to determine the locations of the change-points in R. Under some mild conditions, we show the consistency and asymptotic properties of the procedure to locate the change-points. Simulation experiments reveal that in comparison with existing parametric and non-parametric change-point detection methods, our proposed method performs well in detecting both single and multiple change-points in R in terms of the accuracy of the location estimation and the computation time. It offers robust and effective detection capability without the need to specify the exact underlying distribution of the variables. Applications to real data demonstrate the usefulness of our proposed methodology for detecting the change-points in the stress-strength reliability R.
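A crude empirical version of the idea (not the likelihood-based dynamic-programming procedure above) estimates R by the fraction of pairs with X < Y and locates a single change-point by maximizing the before/after difference in that fraction:

```python
import numpy as np

rng = np.random.default_rng(4)

# Paired (X, Y) with a change-point at t = 100: the reliability
# R = P(X < Y) drops from about 0.76 to about 0.24.
x = rng.normal(0, 1, 200)
y = np.concatenate([rng.normal(1, 1, 100), rng.normal(-1, 1, 100)])

ind = (x < y).astype(float)      # pairwise indicators of the event X < Y

# single change-point: split maximizing the difference in empirical R
diffs = [abs(ind[:t].mean() - ind[t:].mean()) for t in range(20, 180)]
t_hat = 20 + int(np.argmax(diffs))

r_before, r_after = ind[:t_hat].mean(), ind[t_hat:].mean()
print(t_hat, r_before, r_after)
```

This brute-force single-split search is what the dynamic programming in the paper generalizes efficiently to multiple change-points.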

Keywords: Multiple change-points detection; Stress-strength model; Dynamic programming

Tuesday 12th 16:40 Case Room 2 (260-057)

## New Zealand Crime And Victims Survey: Filling The Knowledge Gap

Andrew Butcher and Michael Slyuzberg
NZ Ministry of Justice

Abstract: The key objective of the Ministry of Justice is to ensure that New Zealand has a strong justice system that contributes to a safe and just society. To achieve this objective, the Ministry and the wider Justice Sector need to know whether they are focusing their efforts in the right places and really making a difference. This is often difficult because we lack a crucial piece of information: how much crime is actually out there. Administrative data do not provide an answer, as only about 30% of crime is reported to Police. The New Zealand Crime and Victims Survey (NZCVS) is being introduced to fill this knowledge gap. The survey, which is currently in the pilot phase, was designed to meet the recommendations of Statistics New Zealand and the demands of key stakeholders. It will interview about 8,000 New Zealand residents aged 15 years and over, and aims to: provide information about the extent (volumes and prevalence) and nature of crime and victimisation in New Zealand; provide a geographical breakdown of victimisation; provide extensive victim demographics; measure how much crime gets reported to Police; understand the experiences of victims; and measure crime trends in New Zealand.

The paper summarises the core requirements for the NZCVS obtained from extended discussions with key stakeholders and describes the key design features to be implemented to meet them. These requirements include, but are not limited to: measuring the extent and nature of reported and unreported crime across New Zealand; providing in-depth story-telling of victims' experiences; providing frequent and timely information to support the Investment Approach for Justice and wider decision making; and reducing information gaps by matching the NZCVS with administrative data in Statistics New Zealand's Integrated Data Infrastructure (IDI).

In particular, the paper discusses the modular survey design, which includes core crime and victimisation questions plus revolving modules added annually; stratified random sampling; a new, highly automated approach to offence coding through extended screening; measuring the harm from being victimised; obtaining respondents' informed consent for data matching; the use of survey data for extended analysis and forecasting; and other important survey features.

Tuesday 12th 16:40 Case Room 3 (260-055)

## Missing Data In Randomised Controlled Trials: Stepped Multiple Imputation

Rose Sisk and Alain Vandal
Auckland University of Technology

Abstract: Missing data in randomised controlled trials are usually unavoidable, but can present considerable problems for analysis in an Intention-to-Treat (ITT) setting. Multiple imputation is often regarded as the most appropriate method of handling missing data when compared with simpler methods such as complete case analysis and mean/mode imputation. However, in practice it can be tricky to implement when working with large longitudinal datasets. The Sodium Lowering in Dialysate (SOLID) trial is a randomised controlled trial seeking to improve cardiovascular and other outcomes by lowering the dialysate sodium concentration of patients on home haemodialysis. The trial includes 99 participants and over 30 primary and secondary outcomes. Missing data from various sources are present at baseline and at follow-up time points. Attempting to multiply impute a large number of outcomes, each measured at up to 4 follow-up times, proved to be a challenging task in this study. Several attempts to obtain sensible imputations were made, but many failed due to the presence of highly correlated outcomes that were often missing together. This presentation discusses the approach taken to overcome this problem: defining sets of outcomes to impute in successive rounds, so that similar (highly correlated, missing together) outcomes are not imputed in the same round. Once a round of imputation is completed, the next set of outcomes to be imputed is matched onto the completed dataset. This process is repeated until the full ITT dataset contains no missing values in any outcome. We call this "stepped imputation". Theory from mixed models was also applied to identify measures associated with the missingness mechanism, with the potential to include them in the final model to further reduce any bias resulting from missing data. Results from a simulation testing the validity of stepped imputation will be presented. In the simulation, data are generated with relationships mimicking those among the outcomes in the SOLID trial. Results from the "gold standard" analysis with no missing data and from the complete case analysis are compared with the stepped imputation method.
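The round-based scheduling described above can be sketched as follows. This is a minimal illustration of the control flow only: it fills each round with a single mean imputation, whereas the trial used proper multiple imputation, and the outcome names, groupings, and function names are hypothetical.

```python
from statistics import mean

def impute_round(data, outcomes):
    """Fill missing values (None) in the given outcome columns with the
    mean of the observed values, so later rounds can condition on a
    completed dataset. (Placeholder for a proper imputation model.)"""
    for col in outcomes:
        observed = [row[col] for row in data if row[col] is not None]
        fill = mean(observed)
        for row in data:
            if row[col] is None:
                row[col] = fill
    return data

def stepped_imputation(data, rounds):
    """Impute outcome sets round by round. Outcomes that are highly
    correlated and tend to be missing together are assigned to
    different rounds, so each round's imputation can draw on the
    outcomes completed in earlier rounds."""
    for outcomes in rounds:
        data = impute_round(data, outcomes)
    return data

# Hypothetical example: two correlated outcomes imputed in separate rounds.
records = [
    {"sbp": 1.0, "dbp": None},
    {"sbp": 3.0, "dbp": 4.0},
    {"sbp": None, "dbp": 2.0},
]
completed = stepped_imputation(records, rounds=[["sbp"], ["dbp"]])
```

The key design choice mirrored here is that each round sees the dataset as completed by previous rounds, which is what prevents jointly missing, highly correlated outcomes from destabilising a single simultaneous imputation.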