Mathematics and Statistics

Regression With Missing Data:  An Investigation of the Case with Uniform Predictors and Missingness Related to the Response Variable

Jack T. Ervasti

Missing data is a pervasive problem in many fields, including the social, behavioral, and medical sciences. As a result, a number of techniques for analyzing data sets with missing values have been developed and refined in the last few decades. There has also been a significant amount of research on the bias that different types of missing data introduce when these techniques are applied.

In this paper, I investigate how various types of missingness affect the bias of regression parameters under imputation and complete case analysis. Using simulated data sets, I examine cases with normally and uniformly distributed predictor variables and different types of simulated missingness. I find that uniformly distributed predictors cause bias under different circumstances than normally distributed predictors when missing values are imputed. In particular, I find that if the predictors are uniformly distributed, regression parameters are biased when missingness is related to the response variable and are approximately unbiased when missingness is related to missing values. These results indicate a lack of investigation into missing data with uniformly distributed variables and missingness that is conditional on the response variable. Based on these findings, I perform an experiment to gain a deeper understanding of the relationship between types of missingness and the bias of regression parameters in the case with uniform predictor variables.
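
As a rough illustration of how missingness tied to the response can bias a regression slope, the following sketch simulates a single uniform predictor and compares the least-squares slope on the full data with the slope from a complete case analysis. Everything specific here is an assumption made for illustration and is not taken from the thesis: the linear model, the noise level, the rule that drops every row whose response exceeds a cutoff, and the names `ols_slope` and `simulate`.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of y on x for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate(n=20000, true_slope=2.0, cutoff=2.5, seed=0):
    """Return (slope on full data, slope on complete cases).

    Assumed setup (hypothetical, for illustration only):
    x ~ Uniform(0, 1), y = 1 + true_slope * x + standard normal noise,
    and a row is treated as missing whenever y exceeds `cutoff`.
    """
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [1.0 + true_slope * x + rng.gauss(0.0, 1.0) for x in xs]

    full_slope = ols_slope(xs, ys)

    # Complete case analysis: keep only the rows that were not dropped.
    kept = [(x, y) for x, y in zip(xs, ys) if y <= cutoff]
    cc_slope = ols_slope([x for x, _ in kept], [y for _, y in kept])
    return full_slope, cc_slope


if __name__ == "__main__":
    full_slope, cc_slope = simulate()
    print("slope, full data:     ", round(full_slope, 3))
    print("slope, complete cases:", round(cc_slope, 3))
```

Under these assumptions the complete case slope is pulled toward zero, because rows with large responses are removed more often at large predictor values. The thesis studies imputation as well, which this sketch does not attempt.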

A Trajectory Smoothing and Clustering Method for the Identification of Potent shRNAs

Alexander H. Greaves-Tunnell

RNA interference (RNAi) is a potent and specific mechanism of gene silencing with extensive applications to research, biotechnology, and medicine. Recently, there has been considerable interest in short hairpin RNAs (shRNAs) as triggers for “programmable” RNAi, due in part to the fact that they enable stable and heritable gene silencing. However, the experimental identification of potent shRNAs is costly and inefficient, and prediction of potent shRNAs for novel targets remains a major challenge. In this paper, we introduce a smoothing and clustering method for data collected from the Sensor assay, the first massively parallel biological procedure for the identification of potent shRNAs. This method is based on a novel treatment of the data as fundamentally longitudinal in nature. We identify a set of roughly 300 top-performing shRNAs for the given targets, and conduct preliminary validation based on three sequence and thermodynamic features of known potent shRNAs.

Benford’s Law and Stick Fragmentation

Joy Jing

Many datasets and real-life functions exhibit a leading digit bias, where the first digit base 10 of a number equals 1 not 11% of the time, as we would expect if all nine possible leading digits were equally likely, but closer to 30% of the time. This phenomenon is known as Benford’s Law, and has applications ranging from the detection of tax fraud to analyzing the Fibonacci sequence. It is especially applicable in today’s world of ‘Big Data’ and can be used for fraud detection to test data integrity, as most people are unaware of the phenomenon.

The cardinal goal is often determining which datasets follow Benford’s Law. We know that the decomposition of a finite stick based on a repeated cutting pattern determined by a ‘nice’ probability density function will tend toward Benford’s Law. We extend these previous results to show that this is also true when the cuts are determined by a finite set of nice probability density functions. We further conjecture that when we apply the exact same cut at every level, as long as that cut is not equal to 0.5, the distribution of lengths will still follow Benford’s Law.
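
The stick decomposition described above can be sketched in a few lines. This is a minimal simulation under stated assumptions, not the construction used in the thesis: each piece is cut once per level at a fraction drawn from Uniform(0, 1) (one convenient choice of density, assumed here), and the function names `fragment_stick`, `leading_digit`, and `leading_digit_frequencies` are hypothetical. The Benford probability of leading digit d is log10(1 + 1/d), which is about 0.301 for d = 1.

```python
import math
import random


def benford_probability(d):
    """Benford's Law probability that the leading base-10 digit is d."""
    return math.log10(1 + 1 / d)


def fragment_stick(length=1.0, levels=14, seed=0):
    """Cut every piece in two at each level; returns 2**levels pieces.

    Assumption for illustration: each cut fraction is drawn
    independently from Uniform(0, 1).
    """
    rng = random.Random(seed)
    pieces = [length]
    for _ in range(levels):
        next_pieces = []
        for piece in pieces:
            p = rng.random()
            next_pieces.append(piece * p)
            next_pieces.append(piece * (1 - p))
        pieces = next_pieces
    return pieces


def leading_digit(x):
    """Leading base-10 digit of a positive number."""
    mantissa = 10 ** (math.log10(x) % 1.0)  # lies in [1, 10] up to rounding
    return min(9, math.floor(mantissa))     # guard against rounding up to 10


def leading_digit_frequencies(values):
    """Observed frequency of each leading digit 1..9 among positive values."""
    positive = [v for v in values if v > 0]
    counts = {d: 0 for d in range(1, 10)}
    for v in positive:
        counts[leading_digit(v)] += 1
    return {d: counts[d] / len(positive) for d in counts}


if __name__ == "__main__":
    freqs = leading_digit_frequencies(fragment_stick())
    for d in range(1, 10):
        print(d, round(freqs[d], 3), round(benford_probability(d), 3))
```

Replacing the random fraction `p` with a fixed constant gives a way to explore the conjecture about applying the same cut at every level.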

Perimeter-Minimizing Tilings by Convex and Non-Convex Pentagons

Zane K. Martin

We study the presumably unnecessary convexity hypothesis in the theorem of Chung et al. on perimeter-minimizing planar tilings by convex pentagons. We prove that the theorem holds without the convexity hypothesis in certain special cases, and we offer direction for further research.

Clustering Time Dependent PITCHf/x Data

Christopher P. Picardo

In this paper, I extend the powerful model-based clustering framework to data that incorporates an entire time period, specifically single seasons from the PITCHf/x database. Traditional clustering methods are reviewed and described in detail in order to motivate the introduction of model-based clustering. In order to apply model-based clustering to the time-indexed data, a cluster consistency algorithm is proposed that treats the cluster selection problem as equivalent to a model selection problem from the supervised learning literature. Finally, the cluster consistency procedure is applied to the PITCHf/x dataset to select the appropriate number of clusters for several pitchers over an entire season. The PITCHf/x season data for two starting pitchers is then analyzed using the cluster movements for the entire season.

Generalizing Nondeterminism for Algebraic Computation Machines

Scott Sanderson

In this thesis we present an introduction to the BSS Machine model, which serves as a generalization of the Turing Machine model of computation. Motivated by the classical equivalence of nondeterministic computation and deterministic verifiability, we develop an extension to the BSS Machine model that preserves important structural features of nondeterministic Turing Machines. We use our machines to develop a new family of relativized complexity classes, and we prove some containment relations between these and the BSS Machine generalizations of P and NP.

The Forest Through the Trees in Multilabel Classification

Benjamin Bradbury Seiler

Traditional machine learning classification algorithms are not suited for statistical classification problems in which an instance can simultaneously belong to more than one class. Such multilabel classification problems have prompted significant research in recent years, including a concerted effort to bridge the gap between established classification techniques and this nonstandard framework. In works as recent as Tsoumakas and Katakis [2007] and Vogrincic and Bosnic [2011], the vast majority of novel multilabel classification algorithms are compared to baseline problem transformation techniques using only support vector machines or linear models. In this study, we broaden the pool of potential base learners for problem transformation techniques and discover significant evidence to suggest the superiority of partition-tree-based methods in many cases, thereby raising the bar for baseline competitiveness.

Formal Fibers of Height-n Primes and Completions of Complete Intersection Domains

Philip D. Tosteson

Of interest in commutative algebra is the relationship between a Noetherian local ring and its completion. This thesis investigates the relationship between a complete Noetherian local ring (T,M) and Noetherian local subrings R of T that have T as their completion. In particular, given an ideal I of T and a countable collection of prime ideals C of T, we ask whether there exists a subring R, with completion T, such that (I intersect R) is prime, and the formal fiber of R at (I intersect R) has maximal elements precisely C. This question quickly relates to the construction of complete intersection domains whose completions are complete intersection rings and which have a specified generic formal fiber. We study this question in several special cases, and further discuss progress and a method of attack on a more general case.