Programme And Abstracts For Tuesday 12th Of December
Keynote: Tuesday 12th 9:10 098 Lecture Theatre (260-098)
Could Do Better … A Report Card For Statistical Computing
Ross Ihaka and Brendon McArdle
University of Auckland
Abstract: Since the introduction of R, research in Statistical Computing has plateaued. Although R is, at best, a stop-gap system, there appears to be very little active research on creating better computing environments for Statistics.
When work on R commenced there were a multitude of software systems for statistical data analysis in use and under development. There was friendly competition and collaboration between developers. While R can be seen as providing a useful unification for users, its success and dominance can be viewed as now holding back research and the development of new systems.
In this talk we’ll examine what might be behind this and also look at some research aimed at exploring some of the design space for new systems. The aim is to show constructively that new work in the area is still possible.
Tuesday 12th 10:30 098 Lecture Theatre (260-098)
R&D Policy Regimes In France: New Evidence From A Spatio-Temporal Analysis
Benjamin Montmartin1, Marcos Herrera2, and Nadine Massard3
Abstract: Using a unique database containing information on the amount of R&D tax credits and regional, national and European subsidies received by firms in French NUTS3 regions over the period 2001-2011, we provide new evidence on the efficiency of R&D policies, taking into account spatial dependency across regions. By estimating a spatial Durbin model with regimes and fixed effects, we show that, in a context of yardstick competition between regions, national subsidies are the only instrument that displays a total leverage effect. For the other instruments, internal and external effects balance each other, resulting in insignificant total effects. Structural breaks corresponding to tax credit reforms are also revealed.
Keywords: Additionality, French policy mix, Spatial panel, Structural break
Tuesday 12th 10:30 OGGB4 (260-073)
Analysing Scientific Collaborations Of New Zealand Institutions Using Scopus Bibliometric Data
Samin Aref1, David Friggens2, and Shaun Hendy1
1University of Auckland
2Ministry of Business Innovation & Employment
Abstract: Scientific collaborations are among the main enablers of development in small national science systems. Although analysing scientific collaborations is a well-established subject in scientometrics, evaluations of collaborative activities of countries remain speculative with studies based on a limited number of fields or using data too inadequate to fully represent collaborations at a national level. This study provides a unique view on the collaborative aspect of scientific activities in New Zealand. We perform a quantitative study based on all Scopus publications in all subjects for over 1500 New Zealand institutions over a period of 6 years to generate an extensive mapping of New Zealand scientific collaborations. The comparative results reveal the levels of collaboration between New Zealand institutions and business enterprises, government institutions, higher education providers, and private not for profit organisations in 2010-2015. Constructing a collaboration network of institutions, we observe a power-law distribution indicating that a small number of New Zealand institutions account for a large proportion of national collaborations. Network centrality measures are deployed to identify the most influential institutions of the country in terms of scientific collaboration. We also provide comparative results on 15 universities and crown research institutes based on 27 subject classifications. This study was based on Scopus custom data and supported by the Te Pūnaha Matatini internship program at Ministry of Business, Innovation & Employment.
ArXiv preprint link: https://arxiv.org/pdf/1709.02897
Keywords: Big data modelling, Scientific collaboration, Scientometrics, Network analysis, Scopus, New Zealand
Tuesday 12th 10:30 OGGB5 (260-051)
Family Structure And Academic Achievements Of High School Students In Tonga
Losana Vao Latu Latu
University of Canterbury
Abstract: In this study we examine how family structure affects the academic achievement of secondary school students in Tonga. It is a comparative study aiming to find out whether there is a significant difference between the academic achievements of students from a traditional family and those from a non-traditional family. We define a Tongan traditional family as one with two biological parents (or adoptive parents from birth), one male and one female, whereas a non-traditional family can be a single-parent family or one in which the student has no parent present (for example, they are staying with relatives or friends). In our study we look at the key drivers of success and try to understand the relationship between academic achievement and family structure. We hope the study will provide evidence-based information to help administrators, other educators and parents adopt the best practices and actions for the students. The target population for this study is high school students aged 13 to 18 in Tonga. The study is limited to the high schools on the main island of Tonga, Tongatapu, which has 12 high schools: two are government schools and the others are private schools run by different religions. In April we surveyed 360 students, 60 from each of 6 high schools, and present here our preliminary results.
Keywords: Education, policy, stratified sampling
Tuesday 12th 10:30 Case Room 2 (260-057)
Analysis Of Multivariate Binary Longitudinal Data: Metabolic Syndrome During Menopausal Transition
Abstract: Metabolic syndrome (MetS) is a major multifactorial condition that predisposes adults to type 2 diabetes and cardiovascular disease. It is defined as having at least three of five cardiometabolic risk components: 1) high fasting triglyceride level, 2) low high-density lipoprotein (HDL) cholesterol, 3) elevated fasting plasma glucose, 4) large waist circumference (abdominal obesity) and 5) hypertension. In the US Study of Women’s Health Across the Nation (SWAN), a 15-year multi-centre prospective cohort study of women from five racial/ethnic groups, the incidence of MetS increased as midlife women underwent the menopausal transition (MT). A model is sought to examine the interdependent progression of the five MetS components and the influence of demographic covariates.
Keywords: Multivariate binary data, longitudinal analysis, metabolic syndrome
Tuesday 12th 10:30 Case Room 3 (260-055)
Clustering Of Curves On A Spatial Domain Using A Bayesian Partitioning Model
Chae Young Lim
Seoul National University
Abstract: We propose a Bayesian hierarchical model for spatial clustering of the high-dimensional functional data based on the effects of functional covariates. We couple the functional mixed-effects model with a generalized spatial partitioning method for: (1) identifying subregions for the high-dimensional spatio-functional data; (2) improving the computational feasibility via parallel computing over subregions or multi-level partitions; and (3) addressing the near-boundary ambiguity in model-based spatial clustering techniques. The proposed model extends the existing spatial clustering techniques to produce spatially contiguous partitions for spatio-functional data. The model successfully captured the regional effects of the atmospheric and cloud properties on the spectral radiance measurements. This elaborates the importance of considering spatially contiguous partitions for identifying regional effects and small-scale variability.
Keywords: spatial clustering, Bayesian wavelets, Voronoi tessellation, functional covariates
Tuesday 12th 10:30 Case Room 4 (260-009)
The Uncomfortable Entrepreneurs: Bad Working Conditions And Entrepreneurial Commitment
Université Côte d’Azur, GREDEG-CNRS
Abstract: In contrast to previous models that characterise necessity entrepreneurs as individuals facing push factors due to lack of employment, we consider the possibility of push factors faced by employed individuals (Folta et al. (2010)). The theoretical model yields distinctive predictions relating occupation characteristics and the probability of entry into entrepreneurship. Using PSED and ONET data, we investigate how the characteristics of individuals' primary occupations affect nascent entrepreneurs' effort put into venture creation. The empirical evidence shows that necessity entrepreneurs are not confined to unemployed individuals. We find compelling evidence that individuals facing arduous working conditions (e.g. stressful environment and physical tiredness) have a higher likelihood of entering and succeeding in self-employment than others. Conversely, individuals who experience a high degree of self-realization, independence and responsibility in the workplace are less committed to their business than individuals exposed to arduous working conditions. These findings have strong implications for how we interpret and analyze necessity entrepreneurs and provide novel insights into the role of occupational experience in the process of venture emergence.
Keywords: Entrepreneurship, Motivation, Occupational characteristics, Employment choice.
References: Folta, T. B., Delmar, F., & Wennberg, K. 2010. Hybrid entrepreneurship. Management Science, 56(2), 253-269.
Tuesday 12th 10:50 098 Lecture Theatre (260-098)
Spatial Surveillance With Scan Statistics By Controlling The False Discovery Rate
Abstract: In this paper, I investigate the false discovery approach proposed by Li et al. (2016), which is based on spatial scan statistics, for detecting spatial disease clusters in a geographical region. The incidence of disease is assumed to follow the inhomogeneous Poisson model discussed in Kulldorff (1997). I show that, although spatial scan statistics are highly correlated, the simple Benjamini-Hochberg (linear step-up) procedure can control their false discovery rate, by proving that the multivariate Poisson distribution satisfies the PRDS condition (positive regression dependence on a subset) of Benjamini and Yekutieli (2001).
Keywords: False Discovery Rate, Poisson Distribution, PRDS, Spatial Scan Statistics
Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency, Annals of Statistics, 29(4), 1165–1188.
Kulldorff, M. (1997). A spatial scan statistic, Communications in Statistics-Theory and Methods, 26(6), 1481–1496.
Li, Y., Shu, L., and Tsung, F. (2016). A false discovery approach for scanning spatial disease clusters with arbitrary shapes, IIE Transactions, 48(7), 684–698.
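For readers unfamiliar with the linear step-up procedure named in the abstract, the following is a minimal sketch of the generic Benjamini-Hochberg rule applied to a list of p-values. It is not the author's scan-statistic implementation; the function name `benjamini_hochberg` and the level `q` are illustrative assumptions.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Generic Benjamini-Hochberg linear step-up rule (illustrative sketch).

    Returns a list of booleans, True where the hypothesis is rejected
    at false discovery rate level q.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected
```

Because the rule is step-up, a small p-value can be rejected even when it misses its own threshold, provided a larger p-value meets a later threshold.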
Tuesday 12th 10:50 OGGB4 (260-073)
Statistical Models For The Source Attribution Of Zoonotic Diseases: A Study Of Campylobacteriosis
Sih-Jing Liao, Martin Hazelton, Jonathan Marshall, and Nigel French
Abstract: Preventing and controlling zoonoses with a public health policy depends on the knowledge scientists have about the transmitted pathogens. Modelling jointly the epidemiological data and genetic information provides a methodology for tracing back the source of infection. However, the increased model complexity makes it difficult to assess the effect of the genetic information on the final statistical inferences. To explore the genetic effects in the joint model, we develop a genetic-free model and compare it to the joint model. We apply the two models to a recent campylobacteriosis study to estimate the attribution probability for each source. A spatial covariate is also considered in the models in order to investigate the effect of the level of rurality on the source attributions. Comparing the attributions generated by the two models, we find that: i) the genetic information integrated in the joint model gives slightly more precise inference for the sparse cases observed in highly rural areas than the genetic-free model; ii) on the logit scale, source attribution probabilities follow linear trends against level of rurality; and iii) poultry is the dominant source of campylobacteriosis in urban centres, whereas ruminants are the most attributable source in rural areas.
Keywords: source attribution, Campylobacter, multinomial model, Dirichlet prior, HPD interval, DIC
Tuesday 12th 10:50 OGGB5 (260-051)
Towards An Informal Test For Goodness-Of-Fit
Anna Fergusson and Maxine Pfannkuch
University of Auckland
Tuesday 12th 10:50 Case Room 2 (260-057)
Identifying Clusters Of Patients With Diabetes Using A Markov Birth-Death Process
Mugdha Manda, Thomas Lumley, and Susan Wells
University of Auckland
Abstract: Estimating disease trajectories has become increasingly essential for clinical practitioners to administer effective treatment to their patients. Part of describing disease trajectories involves taking patients’ medical histories and sociodemographic factors into account and grouping patients into similar groups, or clusters. Advances in computerised patient databases have paved the way for identifying such trajectories by recording a patient’s medical history over a long period of time (longitudinal data). We studied data from the PREDICT-CVD dataset, a national primary-care cohort from which people with diabetes from 2002-2015 were identified through routine clinical practice. We fitted a Bayesian hierarchical linear model with latent clusters to the repeated measurements of HbA\(_{1c}\) and eGFR, using the Markov birth-death process proposed by Stephens (2000) to handle the changes in dimensionality as clusters were added or removed.
Keywords: Diabetes management, longitudinal data, Markov chain Monte Carlo, birth-death process, mixture model, Bayesian analysis, latent clusters, hierarchical models, primary care, clinical practice
References: Stephens, M. (2000). Bayesian Analysis of Mixture Models with an Unknown Number of Components - An Alternative to Reversible Jump Methods. In: The Annals of Statistics, 28(1), 40-74.
Tuesday 12th 10:50 Case Room 3 (260-055)
Bayesian Temporal Density Estimation Using Autoregressive Species Sampling Models
Youngin Jo1, Seongil Jo2, and Jaeyong Lee3
2Chonbuk National University
3Seoul National University
Abstract: We propose a Bayesian nonparametric (BNP) model, which is built on a class of species sampling models, for estimating density functions of temporal data. In particular, we introduce species sampling mixture models with temporal dependence. To accommodate temporal dependence, we define dependent species sampling models by modeling random support points and weights through an autoregressive model, and then we construct the mixture models based on the collection of these dependent species sampling models. We propose an algorithm to generate posterior samples and present simulation studies to compare the performance of the proposed models with competitors that are based on Dirichlet process mixture models. We apply our method to the estimation of densities for apartment prices in Seoul, the closing price of the Korea Composite Stock Price Index (KOSPI), and climate variables (daily maximum temperature and precipitation) around the Korean peninsula.
Keywords: Autoregressive species sampling models; Dependent random probability measures; Mixture models; Temporal structured data
Acknowledgements: This work is a part of the first author’s Ph.D. thesis at Seoul National University. Research of Seongil Jo was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2017R1D1A3B03035235). Research of Jaeyong Lee was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2011-0030811).
Tuesday 12th 10:50 Case Room 4 (260-009)
How Does The Textile Set Describe Geometric Structures Of Data?
Ushio Tanaka1 and Tomonari Sei2
1Osaka Prefecture University
2University of Tokyo
Abstract: The textile set is defined from the textile plot proposed by Kumasaka and Shibata (2007, 2008), which is a powerful tool for visualizing high-dimensional data. The textile plot is based on a parallel coordinate plot, where the ordering, locations and scales of the axes are simultaneously chosen so that all connecting lines, each of which signifies an observation, are aligned as horizontally as possible. The textile plot transforms a data matrix in order to delineate a parallel coordinate plot. Using the geometric properties of the textile set derived by Sei and Tanaka (2015), we show that the textile set describes the intrinsic geometric structure of data.
Keywords: Parallel coordinate plot, Textile set, Differentiable manifold
Kumasaka, N. and Shibata, R. (2007). The Textile Plot Environment, Proceedings of the Institute of Statistical Mathematics, 55, 47–68.
Kumasaka, N. and Shibata, R. (2008). High-dimensional data visualisation: The textile plot, Computational Statistics and Data Analysis, 52, 3616–3644.
Sei, T. and Tanaka, U. (2015). Geometric Properties of Textile Plot: Geometric Science of Information, Lecture Notes in Computer Science, 9389, 732–739.
Tuesday 12th 11:10 098 Lecture Theatre (260-098)
Intensity Estimation Of Spatial Point Processes Based On Area-Aggregated Data
Hsin-Cheng Huang and Chi-Wei Lai
Abstract: We consider estimation of the intensity function for spatial point processes based on area-aggregated data. A standard approach for estimating the intensity function for a spatial point pattern is to use a kernel estimator. However, when data are only available in a spatially aggregated form, with the numbers of events available in geographical subregions, traditional methods developed for individual-level event data become infeasible. In this research, a kernel-based method will be proposed to produce a smooth intensity function based on aggregated count data. Some numerical examples will be provided to demonstrate the effectiveness of the proposed method.
Keywords: Area censoring, inhomogeneous spatial point processes, kernel density estimation
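As background, the standard kernel estimator for individual-level event data mentioned in the abstract can be sketched as below. This is the textbook Gaussian-kernel form without edge correction, not the authors' method for aggregated counts; the function name `kernel_intensity` and its arguments are illustrative assumptions.

```python
import math


def kernel_intensity(points, x, y, bandwidth):
    """Gaussian kernel estimate of the intensity at location (x, y).

    points is a list of (px, py) event locations. No edge correction is
    applied, so estimates near the boundary of the study region are
    biased downwards.
    """
    h2 = bandwidth ** 2
    total = 0.0
    for px, py in points:
        # Squared distance from the evaluation location to this event.
        d2 = (x - px) ** 2 + (y - py) ** 2
        # Bivariate Gaussian density with standard deviation h, centred on the event.
        total += math.exp(-d2 / (2.0 * h2)) / (2.0 * math.pi * h2)
    return total
```

Each event contributes one unit of mass, so the estimate integrates to the number of events over the whole plane. With only subregion counts available, the event locations `points` are unknown, which is the difficulty the abstract addresses.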
Tuesday 12th 11:10 OGGB4 (260-073)
Bayesian Inference For Population Attributable Measures
Sarah Pirikahu, Geoff Jones, Martin Hazelton, and Cord Heuer
Tuesday 12th 11:10 OGGB5 (260-051)
An Information Criterion For Prediction With Auxiliary Variables Under Covariate Shift
Takahiro Ido1, Shinpei Imori1,2, and Hidetoshi Shimodaira2,3
2RIKEN Center for Advanced Intelligence Project (AIP)
Abstract: It is beneficial for modeling data of interest to exploit secondary information. The secondary information is called auxiliary variables, which may not be observed in testing data because they are not of primary interest. In this paper, we incorporate the auxiliary variables into a framework of supervised learning. Furthermore, we consider a covariate shift situation that allows a density function of covariates to change between testing and training data. It is known that the Maximum Log-likelihood Estimate (MLE) is not a good estimator under model misspecification and the covariate shift. This problem can be resolved by the Maximum Weighted Log-likelihood Estimate (MWLE).
When we have multiple candidate models, we need to select the best candidate model, where optimality is measured by the expected Kullback-Leibler (KL) divergence. The Akaike information criterion (AIC) is a well-known criterion based on the KL divergence and the MLE; its validity is therefore not guaranteed when the MWLE is used under covariate shift. An information criterion under covariate shift was proposed in Shimodaira (2000, JSPI), but this criterion does not take the use of auxiliary variables into account. Hence, we resolve this problem by deriving a new criterion. In addition, simulations are conducted to examine the improvement.
Keywords: Auxiliary variables; Covariate shift; Information criterion; Kullback-Leibler divergence; Misspecification; Predictions.
Tuesday 12th 11:10 Case Room 2 (260-057)
Analysis Of A Brief Telephone Intervention For Problem Gambling And Examining The Impact On Co-Existing Depression?
Nick Garrett, Maria Bellringer, and Max Abbott
Auckland University of Technology
Tuesday 12th 11:10 Case Room 3 (260-055)
Prior-Based Bayesian Information Criterion
M. J. Bayarri1, James Berger2, Woncheol Jang3, Surajit Ray4, Luis Pericchi5, and Ingmar Visser6
1University of Valencia
3Seoul National University
4University of Glasgow
5University of Puerto Rico
6University of Amsterdam
Abstract: We present a new approach to model selection and Bayes factor determination, based on Laplace expansions (as in BIC), which we call Prior-based Bayes Information Criterion (PBIC). In this approach, the Laplace expansion is only done with the likelihood function, and then a suitable prior distribution is chosen to allow exact computation of the (approximate) marginal likelihood arising from the Laplace approximation and the prior. The result is a closed-form expression similar to BIC, but now involves a term arising from the prior distribution (which BIC ignores) and also incorporates the idea that different parameters can have different effective sample sizes (whereas BIC only allows one overall sample size \(n\)). We also consider a modification of PBIC which is more favorable to complex models.
Keywords: Bayes factors, model selection, Cauchy priors, consistency, effective sample size, Fisher information, Laplace expansions, robust priors
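For comparison, the standard BIC that the abstract contrasts with PBIC is usually written as follows, where \(L(\hat{\theta})\) is the maximized likelihood, \(k\) is the number of free parameters and \(n\) is the single overall sample size. The notation here is ours rather than the authors', and the PBIC expression itself is not given in the abstract.

\[
\mathrm{BIC} = -2 \log L(\hat{\theta}) + k \log n
\]

The penalty \(k \log n\) contains no prior term and uses the same \(n\) for every parameter, which are the two features the abstract says PBIC revises.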
Tuesday 12th 11:10 Case Room 4 (260-009)
Early Childhood Dental Decay
University of Canterbury
Abstract: Our teeth are some of our most useful tools. They let us eat tasty food, take those plastic tags off new clothes and enhance our smiles to convey joy. They also have to last us a lifetime and need to be looked after. Teeth are a mutually supportive structure: even one extraction can destabilize the remaining teeth. Early intervention in oral health can prevent a lifetime of discomfort, embarrassment and expensive treatments. An issue facing dentists in New Zealand and abroad is preschool children missing treatment appointments. These children have more dental issues in later childhood.
The research question I aim to answer is: Does early dental neglect increase dental issues in later childhood? My thesis will use traditional statistics along with data mining and machine learning techniques to investigate these anecdotal claims.
Using the geographical information in the dataset, I will use deprivation data from Statistics New Zealand to investigate whether these children are from more deprived neighborhoods.
Tuesday 12th 11:30 098 Lecture Theatre (260-098)
Geographically Weighted Principal Component Analysis For Spatio-Temporal Statistical Dataset
Narumasa Tsutsumida1, Paul Harris2, and Alexis Comber3
3University of Leeds
Abstract: Spatio-temporal statistical datasets are becoming widely available for social, economic, and environmental research; however, it is often difficult to summarize them and uncover hidden spatial/temporal patterns due to their complexity. Geographically weighted principal component analysis (GWPCA), which uses a moving window or kernel and applies localized PCAs over geographical space, may be well suited to this task, although optimizing the kernel bandwidth size and determining the number of components to retain (NCR) have been the main concerns (Tsutsumida et al. (2017)). In this research we determine both simultaneously so as to minimize the leave-one-out residual coefficient of variation of GWPCA as the bandwidth size and NCR change. As a case study we use annual goat population statistics across 341 administrative units in Mongolia in 1990-2012, and show spatiotemporal variations in the data, especially those influenced by natural disasters.
Keywords: Geographically weighted model, Spatio-temporal data, Parameter optimization
References: Tsutsumida, N., Harris, P., and Comber, A. (2017). The Application of a Geographically Weighted Principal Component Analysis for Exploring Twenty-three Years of Goat Population Change across Mongolia. Annals of the American Association of Geographers, 107(5), 1060–1074.
Tuesday 12th 11:30 OGGB4 (260-073)
Dimensionality Reduction Of Multivariate Data For Bayesian Analysis
Anjali Gupta1, James Curran1, Sally Coulson2, and Christopher Triggs1
1University of Auckland
Tuesday 12th 11:30 OGGB5 (260-051)
An EWMA Chart For Monitoring Covariance Matrix Based On Dissimilarity Index
National Tsing Hua University
Abstract: In this talk, we propose an EWMA chart for monitoring the covariance matrix based on the dissimilarity index of two matrices. It differs from the conventional EWMA charts for monitoring the covariance matrix, which are based on comparing the sum, the product, or both, of the eigenvalues of the estimated EWMA covariance matrix with those of the IC covariance matrix. The proposed chart essentially monitors the covariance matrix by comparing the individual eigenvalues of the estimated EWMA covariance matrix with those of the covariance matrix estimated from the IC phase I data. We evaluate the performance of the proposed chart by comparing it with the best existing chart under the multivariate normal process. Furthermore, to prevent the proposed EWMA chart, whose control limit is estimated from limited IC phase I data, from producing excessive false alarms, we use a bootstrap method to adjust the control limit to guarantee that the proposed chart has an actual IC average run length not less than the nominal one with a certain probability. Finally, we use an example to demonstrate the applicability and implementation of the proposed chart.
Keywords: Average run length, dissimilarity index, EWMA, out-of-control
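The "estimated EWMA covariance matrix" referred to in the abstract is commonly built with the recursion \(S_t = (1-\lambda) S_{t-1} + \lambda x_t x_t^\top\). The sketch below implements only that recursion, assuming the observations have already been centred (and typically standardised) using the phase I estimates. It does not implement the dissimilarity index or the bootstrap-adjusted control limit, and the name `ewma_covariance` is an illustrative assumption.

```python
def ewma_covariance(observations, lam, initial):
    """Run the EWMA covariance recursion and return the final matrix.

    observations: list of centred observation vectors (lists of floats).
    lam: smoothing constant in (0, 1].
    initial: starting p x p matrix S_0 as a list of lists.
    """
    p = len(initial)
    # Copy the starting matrix so the caller's list is not modified.
    s = [row[:] for row in initial]
    for x in observations:
        # S_t = (1 - lam) * S_{t-1} + lam * x x^T, element by element.
        s = [[(1.0 - lam) * s[i][j] + lam * x[i] * x[j] for j in range(p)]
             for i in range(p)]
    return s
```

A chart of the kind described would then compare the eigenvalues of each \(S_t\) with those of the in-control covariance matrix.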
Tuesday 12th 11:30 Case Room 2 (260-057)
Bayesian Semiparametric Hierarchical Models For Longitudinal Data Analysis With Application To Dose-Response Studies
Abstract: In this work, we propose semiparametric Bayesian hierarchical additive mixed effects models for analyzing either longitudinal data or clustered data with applications to dose-response studies. In the semiparametric mixed effects model structure, we estimate nonparametric smoothing functions of continuous covariates by using a spectral representation of Gaussian processes and the subject-specific random effects by using Dirichlet process mixtures. In this framework, we develop semiparametric mixed effects models that include normal regression and quantile regressions with or without shape restrictions. In addition, we deal with the Bayesian nonparametric measurement error models, or errors-in-variable regression models, using Fourier series and Dirichlet process mixtures, in which the true covariate is not observable and only a surrogate of it is observed. The proposed methodology is compared with other existing approaches to additive mixed models in simulation studies and benchmark data examples. More importantly, we consider a real data application for dose-response analysis, in which measurement errors and shape constraints in the regression functions need to be incorporated with inter-study variability.
Keywords: Cadmium toxicity, Cosine series, Dose-response study, Hierarchical Model, Measurement errors, Shape restriction
Tuesday 12th 11:30 Case Room 4 (260-009)
Optimizing Junior Rugby Weight Limits
Emma Campbell, Ankit Patel, and Paul Bracewell
DOT Loves Data
Abstract: The New Zealand rugby community is aware of safety issues within the junior game and has applied weight limits for each tackle grade to minimize injury risk. However, for heavier children this can create an uncomfortable situation as they may no longer be playing with their peer group. The study evaluated almost 13,000 observations from junior rugby players across three seasons (2015-2017) using data supplied by Wellington Rugby. To protect privacy, the data was structured so that an individual could not be readily identified but could be tracked across seasons to determine churn. As data for several consecutive seasons was available, we could determine the likelihood of a junior player returning the following season and isolate the drivers of this behaviour. Applying a logistic regression and repeated measures analysis the study determined if children who are over the specified weight limit for their age group are more likely to leave the game. Furthermore, assuming the importance of playing with peers, the study identified the impact of age in relation to the date-of-birth cut-off of January 1st. This is of interest given that a child playing above their age-weight grade could be competing against individuals three school years above them. The study primarily focuses on determining the optimal age-weight bands while the secondary focus is on determining the likelihood of a junior Wellington rugby player returning the following season and isolating the drivers of this behaviour.
Keywords: Logistic regression, repeated measures, player retention, optimization
Tuesday 12th 11:50 098 Lecture Theatre (260-098)
Spatial Scan Statistics For Matched Case-Control Data
Yonsei University College of Medicine
Abstract: Spatial scan statistics are widely used for cluster detection analysis in geographical disease surveillance. While the method has been developed for various types of data such as binary, count and continuous data, spatial scan statistics for matched case-control data, which often arise in spatial epidemiology, have not been considered yet. In this paper, we propose two spatial scan statistics for matched case-control data. The proposed test statistics properly consider the correlations between matched pairs. We evaluate statistical power and cluster detection accuracy of the proposed methods through simulations comparing with the Bernoulli-based method. We illustrate the methods with the use of a real data example.
Keywords: Spatial epidemiology, cluster detection, SaTScan, McNemar test, conditional logistic regression
Tuesday 12th 11:50 OGGB4 (260-073)
Whitebait In All Its Varieties: One Fish, Two Fish, Three, Four, Five Fish.
University of Canterbury
Tuesday 12th 11:50 OGGB5 (260-051)
Latent Variable Models And Multivariate Binomial Data
University of Otago
Abstract: A large body of work has been devoted to latent variable models applicable to multivariate binary data. However, little work has been put into extending these models to cases where the observed data is multivariate binomial. In this paper, we will first show that models that use either a logit or probit link function offer the same level of modelling flexibility in the binary case, but only the logit link fits into a data augmentation approach that compactly extends from binary to binomial. Secondly, we will demonstrate that multivariate binomial data provides greater flexibility in how the link function can be represented. Lastly, we will consider properties of the implied distribution of latent probabilities under a logit link.
Keywords: Multivariate binomial data, principal components/factor analysis, item response theory, link functions, logit-normal distributions
Bartholomew, D. J., Knott, M. and Moustaki, I. (2011). Latent Variable Models and Factor Analysis: A Unified Approach. Chichester: John Wiley & Sons.
Johnson, N.L. (1949). Systems of Frequency Curves Generated by Methods of Translation. Biometrika, 36, 149–276.
Polson, N. G., Scott, J. G. and Windle, J. (2013). Bayesian inference for logistic models using Pólya-gamma latent variables. Journal of the American Statistical Association, 108, 1339–1349.
Tuesday 12th 11:50 Case Room 2 (260-057)
Asking About Sex In General Health Surveys: Comparing The Methods And Findings Of The 2010 Health Survey For England With Those Of The Third National Survey Of Sexual Attitudes And Lifestyles
Philip Prah1, Anne Johnson2, Soazig Clifton2, Jennifer Mindell2, Andrew Copas2, Chloe Robinson3, Rachel Craig3, Sarah Woodhall2, Wendy Macdowall4, Elizabeth Fuller3, Bob Erens2, Pam Sonnenberg2, Kaye Wellings4, Catherine Mercer2, and Anthony Nardone5
1Auckland University of Technology
2University College London
4London School of Hygiene & Tropical Medicine
5Public Health England
Tuesday 12th 11:50 Case Room 3 (260-055)
Bayesian Continuous Space-Time Model Of Burglaries
Chaitanya Joshi, Paul Brown, and Stephen Joe
University of Waikato
Abstract: Building a predictive model of crime with good predictive accuracy has great value in enabling efficient use of policing resources and reduction in crime. Building such models is not straightforward, though, due to the dynamic nature of the crime process. Crime not only evolves over both space and time, but is also related to several complex socio-economic factors, not all of which can be measured directly and accurately. The last decade or more has seen a surge in the effort to model crime more accurately. Many of the models developed so far have failed to capture crime with a great degree of accuracy. The main reasons could be that all these models discretise the space using grid cells and that they are spatial, not spatio-temporal. We fit a log Gaussian Cox process model using the INLA-SPDE approach. This not only allows us to capture crime as a process continuous in both space and time, but also allows us to include socio-economic factors as well as the ‘near repeat’ phenomenon. In this talk, we will discuss the model building process and the accuracy achieved.
Keywords: Bayesian spatio-temporal model, INLA-SPDE, predicting crime
Tuesday 12th 11:50 Case Room 4 (260-009)
Tolerance Limits For The Reliability Of Semiconductor Devices Using Longitudinal Data
Vera Hofer1, Johannes Leitner1, Horst Lewitschnig2, and Thomas Nowak1
1University of Graz
2Infineon Technologies Austria AG
Abstract: Especially in the automotive industry, semiconductor devices are key components for the proper functioning of the entire vehicle. Therefore, issues concerning the reliability of these components are of crucial importance to manufacturers of semiconductor devices.
In this quality control task, we consider longitudinal data from high temperature operating life tests. Manufacturers then need to find appropriate tolerance limits for their final electrical product tests, such that the proper functioning of their devices is ensured. Based on these datasets, we compute tolerance limits that could then be used by automated test equipment for the ongoing quality control process. Devices with electrical parameters within their respective tolerance limits can successfully finish the production line, while all other devices will be discarded. In calculating these tolerance limits, our approach consists of two steps: First, the observed measurements are transformed in order to capture measurement biases and gauge repeatability and reproducibility. Then, in the second step, we compute tolerance limits based on a multivariate copula model with skew normal distributed margins. In order to solve the resulting optimization problem, we propose a new derivative-free optimization procedure.
The capability of the model is demonstrated by computing optimal tolerance limits for several drift patterns that are expected to cover a wide range of scenarios. Based on these computations, we show the resulting yield losses and analyze the performance of the tolerance limits in a large simulation study.
This work was supported by the ECSEL Joint Undertaking under grant agreement No. 662133 - PowerBase. This Joint Undertaking receives support from the European Union’s Horizon 2020 research and innovation programme and Austria, Belgium, Germany, Italy, Netherlands, Norway, Slovakia, Spain and United Kingdom.
Keywords: quality control, tolerance limits, copulas, skew normal distribution
Tuesday 12th 16:00 098 Lecture Theatre (260-098)
Model-Checking For Regressions: A Local Smoothing-Based Global Smoothing Test
Lingzhu Li and Lixing Zhu
Hong Kong Baptist University
Abstract: Local smoothing tests and global smoothing tests, the two main classes of methods for the model specification problem, exhibit different characteristics. Compared with global smoothing tests, local smoothing tests can only detect local alternatives distinct from the null hypothesis at a much slower rate when the dimension of the predictor vector is high, but can be more sensitive to high-frequency alternatives. We suggest a projection-based test that builds a bridge between the local and global smoothing methodologies to benefit from the advantages of both. The test construction is based on a kernel estimation-based local smoothing method and the resulting test becomes a distance-based global smoothing test. A closed-form expression of the test statistic is derived and the asymptotic properties are investigated. Simulations and a real data analysis are conducted to evaluate the performance of the test in finite sample cases.
Keywords: Global smoothing test, projection-based methods, local smoothing test
Zheng, J. X. (1996). A consistent test of functional form via nonparametric estimation techniques. Journal of Econometrics, 75(2), 263–289.
Bierens, H. J. (1982). Consistent model specification tests. Journal of Econometrics, 20, 105–134.
Lavergne, P. and Patilea, V. (2012). One for all and all for one: regression checks with many regressors. Journal of Business & Economic Statistics, 30(1), 41–52.
Tuesday 12th 16:00 OGGB4 (260-073)
Breeding Value Estimation In Partially-Genotyped Populations
University of Otago
Abstract: In livestock, a primary goal is the identification of individuals’ breeding values - a measure of their genetic worth. This identification can be used to aid selective breeding, but is non-trivial because the datasets involved can be very large.
Measured traits are typically modelled as being caused by both breeding values and environmental fixed effects. An efficient method for fitting this model was developed by Henderson (1984), based upon generalized least squares. This method could be applied to data where the pedigree - how the animals were related to one another - was fully known.
Improvements in technology have allowed the genetic information of an animal to be directly measured. These measurements can be taken very early in life, with the goal of informing selective breeding faster and more efficiently. Meuwissen (2001) adapted the standard model to incorporate genetic data, and additionally developed multiple fitting methods for this model.
Modern datasets are frequently only partially genotyped. The methods of Meuwissen cannot be used for these data, as they are only applicable to populations in which every individual is genotyped. Modern fitting approaches aim to make use of the available genetic information without requiring that all individuals be genotyped.
These approaches tend to either impute or average over missing genotype data, which can affect the overall accuracy of breeding value estimation. We are developing an alternative which instead incorporates missing data within the model, rather than having to adapt fitting approaches to accommodate it.
Preliminary results suggest that approaching fitting in this way can lead to improved accuracy of estimation in certain situations.
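For readers unfamiliar with the pedigree-based model mentioned in the abstract, its standard textbook form can be written as below. The notation is an assumption of this note and does not come from the abstract: \(y\) is the vector of trait records, \(b\) the fixed effects, \(u\) the breeding values, \(X\) and \(Z\) the corresponding design matrices, \(A\) the pedigree relationship matrix, and \(\lambda\) the ratio of residual to additive genetic variance.

```latex
% Linear mixed model for breeding values (standard form, assumed notation)
\[
  y = Xb + Zu + e, \qquad
  u \sim N(0, A\sigma_u^2), \qquad
  e \sim N(0, I\sigma_e^2)
\]
% Henderson's mixed model equations, with \lambda = \sigma_e^2 / \sigma_u^2
\[
  \begin{pmatrix}
    X^\top X & X^\top Z \\
    Z^\top X & Z^\top Z + \lambda A^{-1}
  \end{pmatrix}
  \begin{pmatrix} \hat{b} \\ \hat{u} \end{pmatrix}
  =
  \begin{pmatrix} X^\top y \\ Z^\top y \end{pmatrix}
\]
```

Solving this system gives the fixed effect estimates and the predicted breeding values together, which is why the full pedigree, through \(A^{-1}\), has to be known.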
Tuesday 12th 16:00 OGGB5 (260-051)
BIVAS: A Scalable Bayesian Method For Bi-Level Variable Selection
Mingxuan Cai1, Mingwei Dai2, Jingsi Ming1, Jin Liu3, Can Yang4, and Heng Peng1
1Hong Kong Baptist University
2Xi’an Jiaotong University
3Duke-NUS Medical School
4Hong Kong University of Science and Technology
Tuesday 12th 16:00 Case Room 2 (260-057)
Ranking Potential Shoplifters In Real Time
Abstract: A company with a focus on retail crime prevention brought to MINZ (Mathematics in Industry in New Zealand) the task of “Who is most likely to offend in my store, now”. The company supplied an anonymised set of data on incidents and offenders. The task, for the statisticians and mathematicians involved, was to try to find ways to use the data to nominate, say, the top ten likely offenders for any particular store and any particular time, using up-to-the-minute information (real time). The problem was analogous to finding a regression model when every row of data has response identically 1 (an incident), and for many places and times there is no data. This talk will describe how the problem was tackled.
Keywords: Retail crime, ranking, ZINB, regression, real time
Tuesday 12th 16:00 Case Room 3 (260-055)
Two Stage Approach To Data-Driven Subgroup Identification In Clinical Trials
Toshio Shimokawa and Kensuke Tanioka
Wakayama Medical University
Abstract: Personalized medicine has been advanced through the statistical analysis of big data such as registry data. In this research, subgroup identification analysis has attracted attention. The purpose of the analysis is to detect subgroups in which the medical treatment is effective, based on predictive factors for the treatment.
Foster et al. (2011) proposed a subgroup identification method based on a two-stage approach, called the Virtual Twins (VT) method. In the first stage of VT, the difference in treatment effect between the treatment group and the control group is estimated by Random Forest. In the second stage, responders are identified by using CART, where these estimated differences are set as the predictor variables.
However, the prediction accuracy of Random Forest tends to be lower than that of boosting. Therefore, we adopt the generalized boosted model (Ridgeway, 2006) in the first stage. In addition, the number of rules tends to be large in the second stage when CART is used. In this paper, we adopt the Apriori algorithm in the same way as SIDES (Lipkovich et al., 2011).
Keywords: Apriori algorithm, boosting, personalized medicine
Foster, J.C., Taylor, J.M.G. and Ruberg, S.J. (2011). Subgroup identification from randomized clinical trial data. Stat.Med, 30, 2867-2880.
Lipkovich, I., Dmitrienko, A., Denne, J. and Enas, G. (2011). Subgroup identification based on differential effect search-recursive partitioning method for establishing response to treatment in patient subpopulations. Stat.Med, 30, 2601-2621.
Ridgeway, G. (2006). gbm: Generalized boosted regression models. R package version 1.5-7. Available at
Tuesday 12th 16:20 098 Lecture Theatre (260-098)
Inverse Regression For Multivariate Functional Data
Ci-Ren Jiang1 and Lu-Hung Chen2
2National Chung Hsing University
Abstract: Inverse regression is an appealing dimension reduction method for regression models with multivariate covariates. Recently, it has been extended to cases with functional or longitudinal covariates. However, the extensions focus on one functional/longitudinal covariate only. In this work, we extend functional inverse regression to cases with multivariate functional covariates. The asymptotic properties of the proposed estimators are investigated. Simulation studies and data analysis are also provided to demonstrate the performance of our method.
Keywords: Multidimensional/Multivariate Functional Data Analysis, Inverse Regression, Parallel Computing, Smoothing
Tuesday 12th 16:20 OGGB4 (260-073)
Including Covariate Estimation Error When Predicting Species Distributions: A Simulation Exercise Using Template Model Builder
Andrea Havron and Russell Millar
University of Auckland
Abstract: Ecological managers often require knowledge about species distributions across a spatial region in order to facilitate best management practices. Statistical models are frequently used to infer relationships between species observations (e.g. presence, abundance, biomass, etc.) and environmental covariates in order to predict values at unobserved locations. Issues remain for situations where covariate information is not available for a predictive location. In these cases, spatial maps of covariates are often generated using tools such as kriging; however, the uncertainties from this statistical estimation are not carried through to the final species distribution map. New advances in spatial modelling using the automatic differentiation software, Template Model Builder, allow both the spatial process of the environmental covariates and the observations to be modelled simultaneously by maximizing the marginal likelihood of the fixed effects with a Laplace approximation after integrating out the random spatial effects. This method allows for the uncertainty of the covariate estimation process to be included in the standard errors of final predictions as well as any derived quantities, such as total biomass for a spatial region. We intend to demonstrate this method and compare our predictions to those from a model where regional covariate information is supplied from a kriging model.
Keywords: spatial model, predicting covariates, Template Model Builder
Kristensen, K., Nielsen, A., Berg, C.W., Skaug, H. and Bell, B. (2015). TMB: Automatic Differentiation and Laplace Approximation. Journal of Statistical Software, 70, 1–21.
Tuesday 12th 16:20 OGGB5 (260-051)
Adjusted Adaptive Index Model For Binary Response
Ke Wan1, Kensuke Tanioka1, Kun Yang2, and Toshio Shimokawa1
1Wakayama Medical University
2Southwest Jiaotong University
Abstract: In questionnaire surveys, multiple regression analysis is usually used to evaluate influential factors. In addition, data mining methods such as Classification and Regression Trees (Breiman et al., 1984) are also used. In tourism research, results concerning factors such as landscape or buildings are difficult to translate into policy. In this paper, we call these factors “uncontrollable explanatory variables”. On the other hand, results concerning factors such as the amount of garbage or the consciousness of inhabitants can be translated into policy. We call these factors “controllable explanatory variables”. The purpose of this report is to grade each subject based on the controllable explanatory variables while adjusting for the effects of the uncontrollable explanatory variables. Concretely, we modify the AIM method (Tian and Tibshirani, 2011) and conduct the grading based on the sum of the production rules for the controllable explanatory variables, adjusting for the effects of the uncontrollable explanatory variables.
Keywords: logistic regression, production rule, grading
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984). Classification and Regression Trees. Wadsworth.
Tian, L., and Tibshirani, R. (2011). Adaptive index models for marker-based risk stratification. Biostatistics, 12, 68–86.
Tuesday 12th 16:20 Case Room 2 (260-057)
Factors Influencing On Growth Of Garments Industry In Bangladesh
Md. Shahidul Islam and Mohammad Sazzad Mosharrof
Auckland University of Technology
Abstract: If globalization provides the backdrop for drama, then the achievements of the garment industry in Bangladesh are indeed dramatic. The garment industry in particular has played a pioneering role in the development of the industrial sector of Bangladesh; it has grown rapidly over the last 15 years, and Bangladesh is now one of the largest garment exporters in the world. Our study examines the successful development process of the Bangladesh garment industry and explores the keys to its success. To this end, we collected primary and secondary data on garment manufacturers and traders to investigate further the key role and mechanism of technology transfer in operating a garment industry in Bangladesh. We then apply statistical models, namely a random effects model, a tobit model and a probit model, to assess the performance of our variables. We also use dummy variables for the different years in all the models. The results of our statistical models indicate a highly significant relationship between the high education level of manufacturers and enterprise performance. The reason for this close relationship is that manufacturers have to upgrade their skills and know-how continuously in order to survive the intense competition in the world garment market, and that a high level of general human capital is needed for the entrepreneur to manage an increasing number of managers and experts. The results also show that formal training received by the garment entrepreneur in a foreign country, and the entrepreneur’s experience of working at a garment enterprise, have only a small effect on the growth of the garment industry. This is because garment workers who had acquired skills and know-how could not readily help new manufacturers, or afford to start trading houses, without good marketing and communication skills.
But the traders who received formal training abroad have provided higher-valued services for manufacturers and contributed more to the proliferation of manufacturers. We have also found that foreign-owned trading houses perform better than indigenous trading houses, which suggests that there are still skills and know-how to be learned from foreign countries. Technology transfer therefore seems to be a long-term process, and its effect also seems to last over the long term. Finally, our findings strongly suggest that the performance of manufacturers and traders, as well as production technologies, have high potential to drive industrial growth. Developing the garment industry offers a great opportunity to earn foreign currency and to contribute to economic development.
Keywords: Bangladesh Garments, Growth of Garment Industry, Performance of Manufacturers and Traders, Statistical Model
Tuesday 12th 16:20 Case Room 3 (260-055)
Comparison Of Exact And Approximate Testing Procedures In Clinical Trials With Multiple Binary Endpoints
Takuma Ishihara and Kouji Yamamoto
Osaka City University
Abstract: In confirmatory clinical trials, the efficacy of a test treatment is sometimes assessed by using multiple primary endpoints. We consider a trial in which the efficacy of a test treatment is confirmed only when it is superior to control for at least one of the endpoints and not clinically inferior for the remaining endpoints. Nakazuru et al. (2014) proposed a testing procedure that is applicable to the above case when endpoints are continuous variables. In this presentation, firstly, we propose a testing procedure for the case in which all of the endpoints are binary.
Westfall and Troendle (2008) proposed multivariate permutation tests. Using these methods, we also propose an exact multiple testing procedure.
Finally, we compare the exact and approximate testing procedures proposed above. The performance of the proposed procedures was examined through Monte Carlo simulations.
Keywords: Clinical trial; Multivariate Bernoulli distribution; Non-inferiority; Superiority.
Nakazuru, Y., Sozu, T., Hamada, C. and Yoshimura, I. (2014). A new procedure of one-sided test in clinical trials with multiple endpoints. Japanese Journal of Biometrics, 35, 17-35.
Westfall, P.H. and Troendle, J.F. (2008). Multiple testing with minimal assumptions. Biometrical Journal, 50(5), 745-755.
Tuesday 12th 16:40 098 Lecture Theatre (260-098)
Multiple Function-On-Function Linear Regression With Application To Weather Forecast Calibration
Min-Chia Huang, Xin-Hua Wang, and Lu-Hung Chen
National Chung Hsing University
Tuesday 12th 16:40 OGGB4 (260-073)
Modelling The Distribution Of Lifetime Using Compound Time-Homogenous Poisson Process
Victoria University of Wellington
Abstract: Modelling the distribution of lifetime has traditionally been done by constructing a deterministic function for the survival function and/or force of mortality. This paper outlines previous research and presents the author’s initial attempts to model the force of mortality and remaining lifetime using time-homogenous compound Poisson processes.
The paper presents two models. In model 1, the force of mortality of an individual is modelled as a random sum of i.i.d. random variables (i.e. a compound Poisson process). In model 2, each individual is assumed to have an initial normally distributed innate lifetime, and their remaining life is a shifted compound Poisson process. In other words, we assume that there are random events coming at a constant rate modifying either the force of mortality or remaining lifetime of individuals. Simulations in R are then run to find the optimized parameters, and the empirical survival function, force of mortality and distribution of lifetime are then constructed. Finally, these outputs are compared with existing models and actual demographic data.
It turns out that for model 1, it is very difficult to model the force of mortality using a time-homogenous compound Poisson process without introducing additional complications such as the inclusion of event times. For model 2, however, if we allow the events to be Cauchy random variables, then we can model the survival function of the New Zealand population much better than several existing well-known specifications such as Weibull.
Keywords: Distribution of lifetime, force of mortality, survival function, time-homogenous compound Poisson process, innate lifetime, R simulation
Khmaladze, E (2013). Statistical methods with application to demography and life insurance. CRC Press.
Weibull, W (1939). A statistical theory of the strength of materials. Generalstabens litografiska anstalts förlag, 1st edition.
Gompertz, B (1825). On the Nature of the Function Expressive of the Law of Human Mortality, and on a New Mode of Determining the Value of Life. Philosophical Transactions of the Royal Society of London, 115, 513-583.
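The building block of both models in the abstract above is a time-homogenous compound Poisson process, that is, a sum of i.i.d. jumps arriving at a constant rate. The sketch below simulates such a process in Python rather than R. It is a generic illustration under stated assumptions, not the author's models: the function names are hypothetical, and the jumps are taken to be exponential purely for convenience (the abstract considers other jump distributions, including Cauchy).

```python
import random


def sample_compound_poisson(rate: float, horizon: float,
                            jump_mean: float, rng: random.Random) -> float:
    """Draw S(horizon) = X_1 + ... + X_N for a compound Poisson process.

    Events arrive at constant rate `rate`, so inter-arrival times are
    exponential. Each jump X_i is exponential with mean `jump_mean`
    (an assumption made only for this illustration).
    """
    total = 0.0
    clock = rng.expovariate(rate)  # arrival time of the first event
    while clock <= horizon:
        total += rng.expovariate(1.0 / jump_mean)  # add one jump
        clock += rng.expovariate(rate)             # wait for the next event
    return total


def mean_of_samples(n: int, rate: float, horizon: float,
                    jump_mean: float, seed: int = 0) -> float:
    """Monte Carlo estimate of E[S(horizon)], which equals rate * horizon * jump_mean."""
    rng = random.Random(seed)
    draws = (sample_compound_poisson(rate, horizon, jump_mean, rng) for _ in range(n))
    return sum(draws) / n


if __name__ == "__main__":
    estimate = mean_of_samples(20000, rate=2.0, horizon=3.0, jump_mean=0.5, seed=1)
    print(f"simulated mean = {estimate:.3f}, theoretical mean = {2.0 * 3.0 * 0.5:.3f}")
```

Comparing the simulated mean with `rate * horizon * jump_mean` is a quick sanity check before the process is embedded in a lifetime model.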
Tuesday 12th 16:40 OGGB5 (260-051)
Detecting Change-Points In The Stress-Strength Reliability P(X<Y)
Hang Xu1, Philip L.H. Yu1, and Mayer Alvo2
1University of Hong Kong
2University of Ottawa
Abstract: We address the statistical problem of detecting change-points in the stress-strength reliability \(R=P(X<Y)\) in a sequence of paired variables \((X,Y)\). Without specifying their underlying distributions, we embed this non-parametric problem into a parametric framework and apply the maximum likelihood method via a dynamic programming approach to determine the locations of the change-points in R. Under some mild conditions, we show the consistency and asymptotic properties of the procedure to locate the change-points. Simulation experiments reveal that in comparison with existing parametric and non-parametric change-point detection methods, our proposed method performs well in detecting both single and multiple change-points in R in terms of the accuracy of the location estimation and the computation time. It offers robust and effective detection capability without the need to specify the exact underlying distribution of the variables. Applications to real data demonstrate the usefulness of our proposed methodology for detecting the change-points in the stress-strength reliability R.
Keywords: Multiple change-points detection; Stress-strength model; Dynamic programming
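To make the idea of a change-point in \(R=P(X<Y)\) concrete, the sketch below treats the indicator of \(X_i<Y_i\) as a Bernoulli variable with success probability \(R\) and searches for a single change-point by maximum likelihood. This is a deliberately simplified stand-in, not the authors' procedure: it handles one change-point by exhaustive search rather than multiple change-points by dynamic programming, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def bernoulli_loglik(successes: int, n: int) -> float:
    """Maximised Bernoulli log-likelihood for `successes` out of `n` trials."""
    if n == 0 or successes == 0 or successes == n:
        return 0.0  # the fitted probability is 0 or 1, so the log-likelihood is 0
    p = successes / n
    return successes * math.log(p) + (n - successes) * math.log(1.0 - p)


def single_change_point(xs: Sequence[float],
                        ys: Sequence[float]) -> Tuple[int, float, float]:
    """Locate one change-point in R = P(X < Y) from paired observations.

    Returns (k, r_before, r_after): pairs 0..k-1 form the first segment,
    pairs k.. the second, and the two values are the segment estimates of R.
    """
    z = [1 if x < y else 0 for x, y in zip(xs, ys)]
    n = len(z)
    if n < 2:
        raise ValueError("need at least two pairs")
    total = sum(z)
    best_k, best_ll = 1, -math.inf
    left = 0
    for k in range(1, n):
        left += z[k - 1]  # successes in the first k pairs
        ll = bernoulli_loglik(left, k) + bernoulli_loglik(total - left, n - k)
        if ll > best_ll:
            best_k, best_ll = k, ll
    r_before = sum(z[:best_k]) / best_k
    r_after = sum(z[best_k:]) / (n - best_k)
    return best_k, r_before, r_after


if __name__ == "__main__":
    # Hypothetical data: X exceeds Y in the first ten pairs, then falls below it.
    xs = [1.0] * 10 + [0.0] * 10
    ys = [0.0] * 10 + [1.0] * 10
    print(single_change_point(xs, ys))
```

Extending the search to several change-points is where a dynamic programming recursion over segment boundaries, as mentioned in the abstract, would replace the single loop.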
Tuesday 12th 16:40 Case Room 2 (260-057)
New Zealand Crime And Victims Survey: Filling The Knowledge Gap
Andrew Butcher and Michael Slyuzberg
NZ Ministry of Justice
Abstract: The key objective of the Ministry of Justice is to ensure that New Zealand has a strong justice system that contributes to a safe and just society. To achieve this objective, the Ministry and the wider Justice Sector need to know whether they are focusing their efforts in the right places and really making a difference. This is often difficult because we lack a crucial piece of information: how much crime is actually out there. Administrative data does not provide an answer as only about 30 […] The New Zealand Crime and Victims Survey (NZCVS) is introduced to fill this knowledge gap. The survey, which is currently in the pilot phase, was designed to meet the recommendations of Statistics New Zealand and the demands of key stakeholders. It will interview about 8,000 New Zealand residents aged 15 years and over and aims to: provide information about the extent (volumes and prevalence) and nature of crime and victimisation in New Zealand; provide a geographical break-down of victimisation; provide extensive victims’ demographics; measure how much crime gets reported to Police; understand the experiences of victims; measure crime trends in New Zealand.
The paper summarises the core requirements for the NZCVS obtained from extended discussions with key stakeholders and describes key design features to be implemented in order to meet these requirements. These key requirements include, but are not limited to: measuring the extent and nature of reported and unreported crime across New Zealand; providing in-depth story-telling of victims’ experiences; providing frequent and timely information to support the Investment Approach for Justice and wider decision making; reducing information gaps by matching the NZCVS with administrative data in Statistics New Zealand’s Integrated Data Infrastructure (IDI).
In particular, the paper discusses a modular survey design which includes core crime and victimisation questions and revolving modules added annually, stratified random sampling, a new highly automated approach to offence coding through extended screening, measuring harm from being victimised, obtaining respondents’ informed consent for data matching, use of survey data for extended analysis and forecasting, and other important survey features.
Tuesday 12th 16:40 Case Room 3 (260-055)
Missing Data In Randomised Control Trials: Stepped Multiple Imputation
Rose Sisk and Alain Vandal
Auckland University of Technology