During their thesis defense, PhD candidates introduce and motivate the problems they attacked during their course of studies, defend the novelty and significance of their research, and contextualize their contributions within their field. This is the final step in the process of obtaining a PhD, and a successful defense indicates the acknowledgment of the doctoral committee that the candidate is an expert in their field. The defense talks are open to all members of the RPI community, and we welcome those interested to attend.
Modern technological advancements and innovation has led to an explosive growth of data in various domains, ranging from physics and biological sciences to economics and social sciences. Research on mathematical libraries has been on the leading edge of the high-performance computing(HPC) community's effort to address the long imposing set of challenges posed by Big Data. Primary among those challenges are the need for asynchronous communication and bridging the gap between computing power and network bandwidth. This has led to the advent of randomization in math libraries to develop scalable algorithms for large-scale numerical linear algebra (NLA) problems. In this dissertation, we focus on the design, implementation and analysis of randomized algorithms for scalable mining of terabyte sized matrices and above. We focus on three fundamental problems that are pervasive throughout large-scale data analytics where randomized NLA algorithms have shown significant impact over state-of-the-art approaches: least-squares regression, low-rank approximation and kernel ridge regression.
This dissertation is divided into three parts. In part I, we explore the behavior of randomized matrix algorithms based on the Blendenpik algorithm in a distributed memory setting. We show that a variant of the algorithm that uses a batchwise transformation leads to an implementation that is not only faster than state-of-the-art implementations of baseline least-squares solvers, but is also able to scale to much larger matrix sizes. In particular, we show that a Blendenpik based algorithm can solve least-squares regression problems for dense terabyte-sized (and larger) input matrices as well as sparse ill-conditioned matrices that outperform state-of-the-art least-squares solvers in terms of performance and scalability while demonstrating comparable numerical stability in terms of established accuracy metrics.
In part II of the dissertation, we explore the behavior of randomized block iterative solvers to compute low rank matrix approximations for dense terabyte sized matrices. We are particularly interested in the behavior of randomized block iterative solvers on matrices with clustered singular values. We analyze the scalability and numerical stability of our block iterative solvers and demonstrate the performance of these randomized solvers for varying spectral gaps. Experiments with real-world large-scale datasets showed high quality approximations for the kernel PCA problem while achieving significant speedups over state-of-the-art direct solvers.
In part III of the dissertation, we explore the behavior of large-scale kernel approximations using the Nystrom approach to solve the kernel ridge regression (KRR) problem. We demonstrate the scalability of one such Nystrom approximation approach based on the FALKON algorithm and contrast it with a state-of-the-art algorithm based on randomized feature maps for the KRR problem.
Many approaches currently exist in the pursuit of knowledge representation from unstructured documents. With the extensive amount of information available, the need to derive a common meaning behind data has become more prevalent than ever before. In this thesis, we develop a workflow to support knowledge extraction and representation of data from the biomedical domain. To determine the hidden meaning behind text, we utilize popular Information Extraction methods to extract events from biomedical papers and store the output from these methods into a combined textual object and annotation representation format. This data is converted into an RDF Graph format based on mappings to scientific ontologies and provenance semantics.
By employing the extensible knowledge graph curation and analysis platform called Whyis, we can validate the constructed interactions discovered by our workflow in comparison to interactions stated in a domain-specific assertion database. Assessment of this data flow showcase the associated benefits, difficulties, precision, and potential usages alongside established ontologies. The data workflow implemented in this study provides an innovative approach for knowledge discovery, connectivity, and curation of information using a linked data methodology and assertion-based evaluation criteria.
Infrastructure-as-a-Service (IaaS) clouds such as Amazon EC2 offer various types of virtual machines (VMs) through pay-per-use pricing. Elastic resource allocation allows us to allocate and release VMs as computing demand changes while satisfying Quality-of-Service (QoS) requirements. In this thesis, we explore QoS-aware elastic resource allocation for three different data processing models: batch, micro-batch, and streaming.
First, we present two frameworks for elastic batch data processing. The first elastic batch data processing framework supports autonomous VM scaling using application-level migration. It does not require any prior knowledge about the target application, but dynamically reconfigures the application to keep the CPU utilization within a certain range. The second framework uses Workload-tailored Elastic Compute Units as a measure of computing resources analogous to Amazon EC2's ECUs. Given a deadline, our framework finds the cost-optimal resource configuration of heterogeneous VMs to satisfy the required throughput.
Next, we propose an elastic micro-batch data processing framework for continuous air traffic optimization. Air traffic optimization is commonly formulated as an integer linear programming (ILP) problem. For continuous optimization, we periodically solve ILP problems with regular intervals, where each problem is a micro-batch data processing job. Since the fluctuating number of flights creates dynamically changing computational demand, our framework predicts future workload and proactively schedules VMs to solve the ILP problems in a timely manner.
Finally, we propose a framework for sustainable elastic stream processing based on the concept of Maximum Sustainable Throughput (MST). It is the maximum processing throughput a streaming application can process indefinitely for a number of VMs. Stream processing is sustainable if the system's MST is always greater than the input data rates of incoming workload. Using MST and future workload prediction models, our framework proactively schedules VMs to keep the stream processing sustainable. It explicitly incorporates uncertainties in both MST and workload prediction models, and estimates the number of VMs to satisfy a certain probability criteria.
Our studies show that QoS-aware elastic data processing is effective for these processing models in both performance scalability and cost savings. For batch processing, elastic resource scheduling helps achieve the target QoS metrics such as CPU utilization and job completion time. For both micro-batch and stream processing with fluctuating workloads, QoS-aware elastic scheduling saves up to 49% cost compared to a static scheduling that covers the peak workload to achieve a similar level of QoS. These results show potential for future fully automated cloud computing resource management systems that efficiently enable truly elastic and scalable general-purpose workload.
Querying information over knowledge graphs is typically a user driven trial and error process. This is because there is no intuitive mechanism to help users to obtain information. Since knowledge graphs have some taxonomic content, automatic query reformulation systems have generally used relaxation to loosen the conditions of an input query.They typically move up the taxonomy to relax a specific concept to a more generic concept. However, a large percentage of the data in knowledge graphs is entity specific data. Also, query log statistics show that a majority of the queries issued to knowledge graphs are often entity based queries. Entities don't have a taxonomy and they end up being generalized. To address this issue, I propose a schema-data and instance-data aware query reformulation system. More specifically I utilize the information and entity features present in the graph to suggest rewrites. Once the features are identified, the entity in concern is reformulated as a set of features. Since entities in large-scale graphs can have a large number of features, we introduce strategies that select the top-k most relevant and informative ranked features. This is then augmented to the original query to create a valid reformulation. We then evaluate our approach by showing that our reformulation strategy produces results that are more contextual and crisp when compared with the state-of-the-art.
Software metrics represent an application of conventional computer science to itself in an attempt to quantify the complexity of software systems. Many of the conventional metrics proposed in academia and industry find their roots in a top-down composition that has parallels to the way conventional AI has been done prior to the surge in machine learning techniques. As software becomes increasingly networked together, the need for bottom-up approaches to software complexity becomes more apparent. The complexity concepts created for biological systems, such as ecological biodiversity, neural pattern recognition, and genetic mutation rate, offer ”proof of concept” illustrations that show how such bottom-up metrics can provide insights into the behavior of complex systems. Thus computational analogs to these biological complexity metrics may offer deeper understanding of how complexity occurs in contemporary networked computing systems. This dissertation compares traditional software metrics to these bio-inspired measures in several contexts, applying them to tasks such as bug-finding, measuring student comprehension, and using metrics as a way of quantifying generative justice.
First we study the topic modeling task, which is to summarize a collection of documents in terms of the themes within it. We show how the problem can be broken up into three independent modules. (1) Model specification and inference (2) Evaluation (3) Hyper-parameter optimization. We present tools to address the needs of each of the modules. We hope to facilitate future research by allowing practitioners to focus on one module at a time.
Next we study the non-negative matrix factorization (NMF) and singular value decomposition (SVD) in a row-distributed setting. The NMF and SVD have multiple applications in machine learning. For NMF these are topic modeling, recommender systems, hyper-spectral imaging, and sparse coding/dictionary learning. The SVD has even more applications; a well known one is principal component analysis. We study the scenario when rows of the matrix start distributed among multiple parties. E.g. maybe these parties are different business collecting customer purchase histories, or different datacenters of a social network, or a sensor network. We show how the exact NMF and SVD can be computed with less communication cost than what it would take to simply upload all the data to one server.
Finally, we study the privacy guarantees that we can make about our distributed NMF and SVD algorithms. In the row-distributed setting, privacy is interesting because business interests or legislation (e.g. the GDPR) make it undesirable or illegal for a party to share its data directly. We develop a method to measure how much information about the absence/presence of a particular document can be inferred from the communication transcript of our distributed algorithms. Our method is based on the Kolmogorov-Smirnov hypothesis test. We examine the relationship between our notion of privacy and Differential Privacy. Finally, we give empirical evidence to suggest that privacy is maintained for several applications of both NMF and SVD.
The requirement to extract the hidden information out of the data stream is rising, however, traditional stream processing systems cannot meet this requirement as they are not designed to do so. This gives birth to the new research domain of stream reasoning that aims to bring semantic reasoning into stream processing. An example is to predict highway traffic jam, given the explicit sensor data streams of cars' number and speed. It is very easy for humans to observe the traffic then forecast a traffic congestion. This is because humans know that a bigger car number and slower car speed can usually lead to a traffic jam. Unfortunately, machines do not. What they can ``see'' is probably a sequence of numerical numbers that are separated by commas.
Streaming data is boundless, enormous, and heterogeneous, which adds extra dimensions to the challenges of realizing the vision of stream reasoning, in addition to temporal constraints. A widely-adopted way to process the streams is via leveraging a window that isolates the latest streaming portion. This snapshot, mostly managed by the first in first out (FIFO) strategy under a popular silent assumption that the latest data is the most important, is all that a window can know about the stream. This inevitably provides only limited information during the processing. However, modeling the importance of the data is not necessarily based on pure arrival timestamps. If the latest data does not convey the necessary information to answer the query, there is surely no need to do anything other than evicting it.
Streaming data intrinsically has many different orderings, such as temporarily, precision, provenance, and trust, etc. If diverse data orderings can be utilized to model the data importance, stream reasoning can be benefited by being data-discriminative. It is able to understand the concept of importance so as to identify, and leverage more important data that are crucial to the query answering, which can improve the system performance. The notion that models the data importance is named as semantic importance. It is an umbrella-like concept with multiple branches, such that each branch models one aspect of currently included data orderings. The combinations of different branches describe the data importance, and enable various smart and flexible window management strategies that are previously dominated and limited by FIFO.
Generally speaking, this dissertation delivers a conceptual model, and a set of infrastructure that can facilitate its general application in stream reasoning. Specifically, the first contribution is an innovative notion of semantic importance. It is formalized in an ontology, represented in a priority vector, and works with carefully extended window semantics. The second contribution introduces a general sequential stream reasoning architecture, with the purpose of both showing how semantic importance can be used in stream reasoning systems, and providing pragmatic performance metrics to configure stream reasoning systems in different scale scenarios. Two exemplar real world use cases are implemented and evaluated based on this architecture and semantic importance. The third contribution proposes a generalization and benchmark framework for semantic importance. This part focuses on how to reuse and benchmark semantic importance in a generic and quantitative way. The semantic importance is generalized by connecting itself to the state of the art stream reasoning techniques. This framework also provides a benchmark interface compatible with a wide range of continuous queries, ontologies, data streams, and a set of built-in data-aware window management strategies enabled by semantic importance. The key performance indicators recorded for the benchmark includes precision, response time, memory consumption and throughput. The results are analyzed and visualized so as to facilitate decision-making on how to compose and deploy the suitable semantic importance in real use cases.
Modern networked systems are in perpetual need of novel tools that can fend off cyber attacks arising from different kinds of threats and vulnerabilities. The mas- sive upsurge in the number of devices connected to a network and associated traffic volume, as well as addition of new, complex technologies to existing ones have inten- sified the need to have a deep understanding of systems. From a security perspective, we are noticing a behavior bordering high on anxiety, which is further pushing the boundaries of data collection even before understanding the real need behind it. Mirroring this behavior is the inclusion of security experts as part of the core team that designs and builds systems and networks. This increases the awareness of potential attack vectors that may impact these systems. However, data analytics (not necessarily security-related) has further intensified the need to gather as much data as possible, leaving it to the security experts to come up with tools to analyze them. While this is a daunting task for all involved parties and cannot be excused from, it is critical to scale these monitoring and analysis infrastructures to meet the demands and purpose of collecting data.
Existing technologies come in the form of intrusion detection systems like anti- virus tools, and intrusion prevention systems like firewalls. They deploy various approaches to succeed in detecting and preventing intrusions. Generally, intrusion detection techniques are classified into two categories: misuse detection and anomaly detection. However, the implementation has been ever-changing keeping the types of systems, data analysis, increasing attack surface and intrusions in mind. And with the rapid evolution in the digital milieu, and an explosion in data collected, there is a need to propose novel approaches that address the big data problem in security.
This thesis proposes a new approach to intrusion detection in networks. This approach is based on Information Centrality (IC) using which systemic attacks can perform same level of intrusion detection using approximately 60% of the total nodes. IC labels network nodes with better vantage points for detecting network-based anomalies as central nodes. The main idea is that since these central nodes already ”observe” most of the data flowing through the network, they are in a good position to detect anomalous behavior much before other nodes.
This research first dives into the important role played by graphs in under- standing the topology and flow of information. We then introduce the usage of information centrality, a centrality based index, to reduce data collection in existing communication networks. IC identifies important nodes that can accelerate anomaly detection when armed with a suitable anomaly detection technique. We also come up with a more efficient way to compute Information centrality for large networks. Finally, we demonstrate that central nodes detect anomalous behavior much faster than other non-central nodes, given the anomaly is systemic in nature.
Since the introduction of the Semantic Web vision in 2001 as an extension to the Web, where machines can reason about the content, the main research focus in semantic reasoning was on the soundness and completeness of the reasoners. These reasoners also assume the veracity of the input data while the Web of data is inherently noisy. In 2010, Hitzler and van Harmelen called for questioning the model-theoretic semantic reasoning and investigation of machine learning (ML) for semantic reasoning  as ML techniques are more robust to noisy data. Four years later, a position paper about machine learning on linked data  considered the research efforts to couple both fields still ”disappointing”.
Recent research work on semantic reasoning with noise-tolerance focuses on type inference and does not aim for full RDFS reasoning. This thesis documents a novel approach that takes previous research efforts in noise-tolerance in the Semantic Web to the next level of full RDFS reasoning by utilizing advances in deep learning research. Deep learning techniques- even though robust to noise and very effective in generalizing across a number of fields including machine vision, natural language understanding, speech recognition etc. - require large amounts of data of low-level representations rather than “symbolic representations used in knowledge representation” (). This challenge constitutes what we refer to as the Neural-Symbolic gap.
This thesis aims to provide a stepping stone towards bridging the Neural-Symbolic gap specifically in the Semantic Web field and RDFS reasoning in particular. This is accomplished through layering Resource Description Framework (RDF) graphs and encoding them in the form of 3D adjacency matrices. Each layer layout in the 3D adjacency matrices form what termed as graph word. Every input graph and its corresponding inference are then represented as sequences of graph words. The RDFS inference becomes equivalent to graph words translation that is achieved through neural network translation.
The evaluation confirms that deep learning can in fact be used to learn RDFS rules from both synthetic as well as real-world Semantic Web data while showing noise-tolerance capabilities compared to rule-based reasoners.
 P. Hitzler and F. van Harmelen, “A reasonable semantic web,” Semantic Web, vol. 1, no. 1, 2, pp. 39–44, 2010.
 P. Bloem and G. K. De Vries, “Machine learning on linked data, a position paper,” Linked Data for Knowledge Discovery, p. 69, 2014.
 A. Garcez, T. R. Besold, L. De Raedt, P. Földiak, P. Hitzler, T. Icard, K.-U. Kühnberger, L. C. Lamb, R. Miikkulainen, and D. L. Silver, “Neural-symbolic learning and reasoning: Contributions and challenges,” 2015.