During their thesis defense, PhD candidates introduce and motivate the problems they attacked during their course of studies, defend the novelty and significance of their research, and contextualize their contributions within their field. This is the final step in the process of obtaining a PhD, and a successful defense indicates the acknowledgment of the doctoral #committee that the candidate is an expert in their field. The defense talks are open to all members of the RPI community, and we welcome those interested to attend.
Data sets invariably require versioning systems to manage changes due to an imperfect collection environment. Data versioning systems are employed to manage changes to data, logging new data sets and communicating that change to data consumers. Versioning discussion remains imprecise, lacking standardization or formal specifications. Many works tend to dene versions around examples and local characteristics but lack a broader foundation. This imprecision results in a reliance on change brackets and dot-decimal identifiers without quantitative measures to justify their application. No difference exists between the versioning practices of a group which updates their data regularly and a group which adds many new les but rarely replaces them. This work attempts to improve discussion by capturing version relationships into a linked data model, taking inspiration from provenance models that incorporate versioning concepts such as PROV and Provenance, Authorship, Versioning (PAV) ontologies. The model captures addition, invalidation, and modification relationships between versions to provide change log-like characterization of the differences.
The data set landscape largely lacks the practice of including detailed change documentation like the logs commonly accompanying software projects. The result likely originates from the size of data sets requiring time-intensive work to manually construct. A process is presented to automate change log creation for data sets, improving adoption as well as encoding the change logs with linked data. The linked data equipped change logs makes the documentation consumable by machines, ensuring not only efficient creation, but also utilization. A drawback, as found from larger data set change, the encoding causes significant bloating as compared to plain text change logs. The versioning graphs encoded into the document, describing the additions, invalidations, and modifications (AIM) made by the new version, allow new avenues of exploration into data that can be standardized across data sets.
The AIM changes allow versions within the same data set to be compared using counts of the changes. The comparisons provide quantifiable basis for communicating change as compared to the qualitative version identifiers currently used by the Global Change Master Directory (GCMD) Keywords. The analysis revealed conflicts between the observed change in the data set at Version 8.5's release depending upon the perspective of the data consumer or producers. Data from the Marine Biodiversity Virtual Laboratory was also explored to give new context to the linked data version comparison process. The changes computed provided evidence towards measurably better performance in marine biology classification using the versioning model.
Version change logs link an amount of change to a particular version, but versions do not have to be published at regular intervals. That observation points towards the idea that just looking at versions may obscure trends in change over time by breaking up changes by version. An analysis was performed to observe the change rates of each version distributed over time for the GCMD Keywords and discovered that the versions could be clustered according to different change behaviors. The Earth Observing Laboratories were also studied for trends over time to study a collection of similar data sets. The AIM changes revealed unexpected behaviors in the data sets with respect to the way versions were updated, showing a strong relation between adding and removing files. The distribution of changes over time were also compared to the general distribution of versions over time and revealed a significant difference in behavior.
Change analysis for data sets need a great amount of work as big data sets become more common. Terms and practices need to be standardize and formalized which begins with producing discoverable and consumable change documentation. The procedures explored showed promise using linked data models, but suffered from size bloating necessary to make the documents machine consumable. Once computable, producers can begin providing better quantitative measure for change in data, but analysis has shown that the perceived change may differ depending on the consumer of the data set. The experience highlights an obscured dynamic in change information between data producers and data consumer in which producers often dictate the means of evaluating version change. Diverging from version-primary practices and including more detailed change accounting becomes a priority after discovering that versions can hide trends in the actual change rate. The difference between data set change rate and behavior suggests that future research is necessary to determine if the differences indicate versioning practices also need to be different.
Increases in the computational power of massively parallel machines have facilitated the implementation and execution of numerical simulations of unprecedented scale and fidelity. When the physical mechanism governing the action of interest in a simulation occurs at a scale orders removed from the desired resolution of the solution, even modern leadership class machines are insufficient to provide full granularity. Multi-scale numerical simulations allow the coupling of multiple scales of interest in order for the multi-scale simulation to reflect important properties of each scale.
The whole-clothe implementation of multi-scale simulations targeting massively parallel machines requires a substantial investment of man-hours. Leveraging existing, well-established single-scale simulations already in wide use in order to construct new multi-scale systems will allow a reduction in development time and costs to produce novel new multi-scale systems.
The design and development of the Adaptive Multi-scale Simulation Infrastructure (AMSI) along with the design and implementation of a multi-scale soft biological tissues code is discussed. The parallel characteristics and implications of the AMSI component-code coupling approach are considered in the context of the soft tissue multi-scale code.
Modern simulations on leadership-class machines also leverage dynamic and adaptive capabilities to load balance simulations during execution and reduce time-to-solution. Integration of such codes into a multi-scale system must allow for the use of dynamic load balancing operations while maintaining all inter-scale linkages.
The deployment of codes designed for massively parallel machines relies on the leveraging of various component libraries, often fulfilling similar purposes but targeted at specific machine architectures to provide optimal performance. Accessing these component libraries through mechanisms providing for the use of a generic interface over a component code targeted to a particular platform allows for the development of algorithms once and their deployment and optimization on various platforms without the need to explicitly modify code. Implementation of such a mechanism should necessarily introduce minimal overhead into the resulting simulation, with a goal of zero-overhead abstraction of the various component back-ends.
Image classification is an approach of pattern recognition in computer vision. The objective is to differentiate between different classes of images based on the quantification of contextual information in the images. Image classification is performed using data analytic pipelines. These pipelines are organized as interdependent and individual components. The components include image acquisition, image preprocessing, feature extraction, feature preprocessing, dimensionality reduction and learning algorithms among others. More steps may be added or removed from the pipeline. The quality of the image classification pipeline is measured by the validation error or test error in classification of unseen image instances that quantify the generalization error. Even though the error appears as a result of application of the learning algorithm, it is actually an accumulation of the errors introduced from different parts of the pipeline starting from the acquisition of the image from the sample, the feature preprocessing algorithms, the feature extraction algorithms and finally to the learning algorithm used for performing the classification. This error is propagated down the pipeline which is finally accumulated in the form of the aforementioned classification error. In this thesis, we attempt to holistically, optimize, quantify and understand the contribution of error from the various components of image classification pipelines.
Mobile manipulators can be useful in many scenarios, including warehouses, factories, hospitals, and so forth. Working in these places often requires the robots to have some level of autonomy and intelligence, so that they can automatically adapt to changes without needing constant supervisions from humans. This is non-trivial as many challenges mix into the problem: robust object recognition, compliant control, optimal motion planning, etc. Among all of them, we make contributions in two planning related directions: 1) capturing a manipulator's own structural capability to utilize its reachable and dexterous workspace to accomplish tasks, and 2) learning and planning with task-relevant knowledge critical to the task's success.
A reachability database, or a reachability map, characterizes a robot arm's reachable workspace through deterministic discretizations. Matching the reachable portions of a robot's reachability database with a desired Cartesian end effector (e.g. hand) path helps the robot find good poses for its base, from where it can reach the entire path. Sometimes, it also helps to have a reachability database for a grasped tool frame instead of the original end effector frame. However, building a new reachability database for the tool frame would take hours or even days. Thus we introduced a new orientation-based structure for reachability database that supports fast extensions. By keeping clustered elements in the same cluster during extension, our method allows new reachability databases to be extended from an existing one in fractions of a second while maintaining the orientation-based structure.
Another contribution of the dissertation is our virtual learning from demonstration system, which lets us obtain motion demonstrations through virtual reality and use them to guide motion planning. With enough demonstrations, Gaussian mixture models (GMMs) are used to fit the data. The learned models help provide biased samples as well as calculate costs for an optimizing motion planner. We show that GMM-based sampling and cost calculation both favor motions similar to the demonstrations, hence capable of providing guidance to motion planning.
Combining our efforts in both directions, we designed a planning system that uses a reachability database together with data and models obtained from the virtual learning from demonstration system. The reachability database helps identify reachable demonstrated motions and plan base placements if the robot cannot reach a task with its current base placement. Only reachable motions are fit into GMMs so that the models focus on reachable portions of the robot's workspace. By providing samples and calculating costs, the learned GMMs guide an optimizing motion planner to find effective plans that honor implicit task knowledge conveyed in the demonstrations.
Modern technological advancements and innovation has led to an explosive growth of data in various domains, ranging from physics and biological sciences to economics and social sciences. Research on mathematical libraries has been on the leading edge of the high-performance computing(HPC) community's effort to address the long imposing set of challenges posed by Big Data. Primary among those challenges are the need for asynchronous communication and bridging the gap between computing power and network bandwidth. This has led to the advent of randomization in math libraries to develop scalable algorithms for large-scale numerical linear algebra (NLA) problems. In this dissertation, we focus on the design, implementation and analysis of randomized algorithms for scalable mining of terabyte sized matrices and above. We focus on three fundamental problems that are pervasive throughout large-scale data analytics where randomized NLA algorithms have shown significant impact over state-of-the-art approaches: least-squares regression, low-rank approximation and kernel ridge regression.
This dissertation is divided into three parts. In part I, we explore the behavior of randomized matrix algorithms based on the Blendenpik algorithm in a distributed memory setting. We show that a variant of the algorithm that uses a batchwise transformation leads to an implementation that is not only faster than state-of-the-art implementations of baseline least-squares solvers, but is also able to scale to much larger matrix sizes. In particular, we show that a Blendenpik based algorithm can solve least-squares regression problems for dense terabyte-sized (and larger) input matrices as well as sparse ill-conditioned matrices that outperform state-of-the-art least-squares solvers in terms of performance and scalability while demonstrating comparable numerical stability in terms of established accuracy metrics.
In part II of the dissertation, we explore the behavior of randomized block iterative solvers to compute low rank matrix approximations for dense terabyte sized matrices. We are particularly interested in the behavior of randomized block iterative solvers on matrices with clustered singular values. We analyze the scalability and numerical stability of our block iterative solvers and demonstrate the performance of these randomized solvers for varying spectral gaps. Experiments with real-world large-scale datasets showed high quality approximations for the kernel PCA problem while achieving significant speedups over state-of-the-art direct solvers.
In part III of the dissertation, we explore the behavior of large-scale kernel approximations using the Nystrom approach to solve the kernel ridge regression (KRR) problem. We demonstrate the scalability of one such Nystrom approximation approach based on the FALKON algorithm and contrast it with a state-of-the-art algorithm based on randomized feature maps for the KRR problem.
Many approaches currently exist in the pursuit of knowledge representation from unstructured documents. With the extensive amount of information available, the need to derive a common meaning behind data has become more prevalent than ever before. In this thesis, we develop a workflow to support knowledge extraction and representation of data from the biomedical domain. To determine the hidden meaning behind text, we utilize popular Information Extraction methods to extract events from biomedical papers and store the output from these methods into a combined textual object and annotation representation format. This data is converted into an RDF Graph format based on mappings to scientific ontologies and provenance semantics.
By employing the extensible knowledge graph curation and analysis platform called Whyis, we can validate the constructed interactions discovered by our workflow in comparison to interactions stated in a domain-specific assertion database. Assessment of this data flow showcase the associated benefits, difficulties, precision, and potential usages alongside established ontologies. The data workflow implemented in this study provides an innovative approach for knowledge discovery, connectivity, and curation of information using a linked data methodology and assertion-based evaluation criteria.
Infrastructure-as-a-Service (IaaS) clouds such as Amazon EC2 offer various types of virtual machines (VMs) through pay-per-use pricing. Elastic resource allocation allows us to allocate and release VMs as computing demand changes while satisfying Quality-of-Service (QoS) requirements. In this thesis, we explore QoS-aware elastic resource allocation for three different data processing models: batch, micro-batch, and streaming.
First, we present two frameworks for elastic batch data processing. The first elastic batch data processing framework supports autonomous VM scaling using application-level migration. It does not require any prior knowledge about the target application, but dynamically reconfigures the application to keep the CPU utilization within a certain range. The second framework uses Workload-tailored Elastic Compute Units as a measure of computing resources analogous to Amazon EC2's ECUs. Given a deadline, our framework finds the cost-optimal resource configuration of heterogeneous VMs to satisfy the required throughput.
Next, we propose an elastic micro-batch data processing framework for continuous air traffic optimization. Air traffic optimization is commonly formulated as an integer linear programming (ILP) problem. For continuous optimization, we periodically solve ILP problems with regular intervals, where each problem is a micro-batch data processing job. Since the fluctuating number of flights creates dynamically changing computational demand, our framework predicts future workload and proactively schedules VMs to solve the ILP problems in a timely manner.
Finally, we propose a framework for sustainable elastic stream processing based on the concept of Maximum Sustainable Throughput (MST). It is the maximum processing throughput a streaming application can process indefinitely for a number of VMs. Stream processing is sustainable if the system's MST is always greater than the input data rates of incoming workload. Using MST and future workload prediction models, our framework proactively schedules VMs to keep the stream processing sustainable. It explicitly incorporates uncertainties in both MST and workload prediction models, and estimates the number of VMs to satisfy a certain probability criteria.
Our studies show that QoS-aware elastic data processing is effective for these processing models in both performance scalability and cost savings. For batch processing, elastic resource scheduling helps achieve the target QoS metrics such as CPU utilization and job completion time. For both micro-batch and stream processing with fluctuating workloads, QoS-aware elastic scheduling saves up to 49% cost compared to a static scheduling that covers the peak workload to achieve a similar level of QoS. These results show potential for future fully automated cloud computing resource management systems that efficiently enable truly elastic and scalable general-purpose workload.
Querying information over knowledge graphs is typically a user driven trial and error process. This is because there is no intuitive mechanism to help users to obtain information. Since knowledge graphs have some taxonomic content, automatic query reformulation systems have generally used relaxation to loosen the conditions of an input query.They typically move up the taxonomy to relax a specific concept to a more generic concept. However, a large percentage of the data in knowledge graphs is entity specific data. Also, query log statistics show that a majority of the queries issued to knowledge graphs are often entity based queries. Entities don't have a taxonomy and they end up being generalized. To address this issue, I propose a schema-data and instance-data aware query reformulation system. More specifically I utilize the information and entity features present in the graph to suggest rewrites. Once the features are identified, the entity in concern is reformulated as a set of features. Since entities in large-scale graphs can have a large number of features, we introduce strategies that select the top-k most relevant and informative ranked features. This is then augmented to the original query to create a valid reformulation. We then evaluate our approach by showing that our reformulation strategy produces results that are more contextual and crisp when compared with the state-of-the-art.
Software metrics represent an application of conventional computer science to itself in an attempt to quantify the complexity of software systems. Many of the conventional metrics proposed in academia and industry find their roots in a top-down composition that has parallels to the way conventional AI has been done prior to the surge in machine learning techniques. As software becomes increasingly networked together, the need for bottom-up approaches to software complexity becomes more apparent. The complexity concepts created for biological systems, such as ecological biodiversity, neural pattern recognition, and genetic mutation rate, offer ”proof of concept” illustrations that show how such bottom-up metrics can provide insights into the behavior of complex systems. Thus computational analogs to these biological complexity metrics may offer deeper understanding of how complexity occurs in contemporary networked computing systems. This dissertation compares traditional software metrics to these bio-inspired measures in several contexts, applying them to tasks such as bug-finding, measuring student comprehension, and using metrics as a way of quantifying generative justice.
First we study the topic modeling task, which is to summarize a collection of documents in terms of the themes within it. We show how the problem can be broken up into three independent modules. (1) Model specification and inference (2) Evaluation (3) Hyper-parameter optimization. We present tools to address the needs of each of the modules. We hope to facilitate future research by allowing practitioners to focus on one module at a time.
Next we study the non-negative matrix factorization (NMF) and singular value decomposition (SVD) in a row-distributed setting. The NMF and SVD have multiple applications in machine learning. For NMF these are topic modeling, recommender systems, hyper-spectral imaging, and sparse coding/dictionary learning. The SVD has even more applications; a well known one is principal component analysis. We study the scenario when rows of the matrix start distributed among multiple parties. E.g. maybe these parties are different business collecting customer purchase histories, or different datacenters of a social network, or a sensor network. We show how the exact NMF and SVD can be computed with less communication cost than what it would take to simply upload all the data to one server.
Finally, we study the privacy guarantees that we can make about our distributed NMF and SVD algorithms. In the row-distributed setting, privacy is interesting because business interests or legislation (e.g. the GDPR) make it undesirable or illegal for a party to share its data directly. We develop a method to measure how much information about the absence/presence of a particular document can be inferred from the communication transcript of our distributed algorithms. Our method is based on the Kolmogorov-Smirnov hypothesis test. We examine the relationship between our notion of privacy and Differential Privacy. Finally, we give empirical evidence to suggest that privacy is maintained for several applications of both NMF and SVD.
The requirement to extract the hidden information out of the data stream is rising, however, traditional stream processing systems cannot meet this requirement as they are not designed to do so. This gives birth to the new research domain of stream reasoning that aims to bring semantic reasoning into stream processing. An example is to predict highway traffic jam, given the explicit sensor data streams of cars' number and speed. It is very easy for humans to observe the traffic then forecast a traffic congestion. This is because humans know that a bigger car number and slower car speed can usually lead to a traffic jam. Unfortunately, machines do not. What they can ``see'' is probably a sequence of numerical numbers that are separated by commas.
Streaming data is boundless, enormous, and heterogeneous, which adds extra dimensions to the challenges of realizing the vision of stream reasoning, in addition to temporal constraints. A widely-adopted way to process the streams is via leveraging a window that isolates the latest streaming portion. This snapshot, mostly managed by the first in first out (FIFO) strategy under a popular silent assumption that the latest data is the most important, is all that a window can know about the stream. This inevitably provides only limited information during the processing. However, modeling the importance of the data is not necessarily based on pure arrival timestamps. If the latest data does not convey the necessary information to answer the query, there is surely no need to do anything other than evicting it.
Streaming data intrinsically has many different orderings, such as temporarily, precision, provenance, and trust, etc. If diverse data orderings can be utilized to model the data importance, stream reasoning can be benefited by being data-discriminative. It is able to understand the concept of importance so as to identify, and leverage more important data that are crucial to the query answering, which can improve the system performance. The notion that models the data importance is named as semantic importance. It is an umbrella-like concept with multiple branches, such that each branch models one aspect of currently included data orderings. The combinations of different branches describe the data importance, and enable various smart and flexible window management strategies that are previously dominated and limited by FIFO.
Generally speaking, this dissertation delivers a conceptual model, and a set of infrastructure that can facilitate its general application in stream reasoning. Specifically, the first contribution is an innovative notion of semantic importance. It is formalized in an ontology, represented in a priority vector, and works with carefully extended window semantics. The second contribution introduces a general sequential stream reasoning architecture, with the purpose of both showing how semantic importance can be used in stream reasoning systems, and providing pragmatic performance metrics to configure stream reasoning systems in different scale scenarios. Two exemplar real world use cases are implemented and evaluated based on this architecture and semantic importance. The third contribution proposes a generalization and benchmark framework for semantic importance. This part focuses on how to reuse and benchmark semantic importance in a generic and quantitative way. The semantic importance is generalized by connecting itself to the state of the art stream reasoning techniques. This framework also provides a benchmark interface compatible with a wide range of continuous queries, ontologies, data streams, and a set of built-in data-aware window management strategies enabled by semantic importance. The key performance indicators recorded for the benchmark includes precision, response time, memory consumption and throughput. The results are analyzed and visualized so as to facilitate decision-making on how to compose and deploy the suitable semantic importance in real use cases.
Modern networked systems are in perpetual need of novel tools that can fend off cyber attacks arising from different kinds of threats and vulnerabilities. The mas- sive upsurge in the number of devices connected to a network and associated traffic volume, as well as addition of new, complex technologies to existing ones have inten- sified the need to have a deep understanding of systems. From a security perspective, we are noticing a behavior bordering high on anxiety, which is further pushing the boundaries of data collection even before understanding the real need behind it. Mirroring this behavior is the inclusion of security experts as part of the core team that designs and builds systems and networks. This increases the awareness of potential attack vectors that may impact these systems. However, data analytics (not necessarily security-related) has further intensified the need to gather as much data as possible, leaving it to the security experts to come up with tools to analyze them. While this is a daunting task for all involved parties and cannot be excused from, it is critical to scale these monitoring and analysis infrastructures to meet the demands and purpose of collecting data.
Existing technologies come in the form of intrusion detection systems like anti- virus tools, and intrusion prevention systems like firewalls. They deploy various approaches to succeed in detecting and preventing intrusions. Generally, intrusion detection techniques are classified into two categories: misuse detection and anomaly detection. However, the implementation has been ever-changing keeping the types of systems, data analysis, increasing attack surface and intrusions in mind. And with the rapid evolution in the digital milieu, and an explosion in data collected, there is a need to propose novel approaches that address the big data problem in security.
This thesis proposes a new approach to intrusion detection in networks. This approach is based on Information Centrality (IC) using which systemic attacks can perform same level of intrusion detection using approximately 60% of the total nodes. IC labels network nodes with better vantage points for detecting network-based anomalies as central nodes. The main idea is that since these central nodes already ”observe” most of the data flowing through the network, they are in a good position to detect anomalous behavior much before other nodes.
This research first dives into the important role played by graphs in under- standing the topology and flow of information. We then introduce the usage of information centrality, a centrality based index, to reduce data collection in existing communication networks. IC identifies important nodes that can accelerate anomaly detection when armed with a suitable anomaly detection technique. We also come up with a more efficient way to compute Information centrality for large networks. Finally, we demonstrate that central nodes detect anomalous behavior much faster than other non-central nodes, given the anomaly is systemic in nature.
Since the introduction of the Semantic Web vision in 2001 as an extension to the Web, where machines can reason about the content, the main research focus in semantic reasoning was on the soundness and completeness of the reasoners. These reasoners also assume the veracity of the input data while the Web of data is inherently noisy. In 2010, Hitzler and van Harmelen called for questioning the model-theoretic semantic reasoning and investigation of machine learning (ML) for semantic reasoning  as ML techniques are more robust to noisy data. Four years later, a position paper about machine learning on linked data  considered the research efforts to couple both fields still ”disappointing”.
Recent research work on semantic reasoning with noise-tolerance focuses on type inference and does not aim for full RDFS reasoning. This thesis documents a novel approach that takes previous research efforts in noise-tolerance in the Semantic Web to the next level of full RDFS reasoning by utilizing advances in deep learning research. Deep learning techniques- even though robust to noise and very effective in generalizing across a number of fields including machine vision, natural language understanding, speech recognition etc. - require large amounts of data of low-level representations rather than “symbolic representations used in knowledge representation” (). This challenge constitutes what we refer to as the Neural-Symbolic gap.
This thesis aims to provide a stepping stone towards bridging the Neural-Symbolic gap specifically in the Semantic Web field and RDFS reasoning in particular. This is accomplished through layering Resource Description Framework (RDF) graphs and encoding them in the form of 3D adjacency matrices. Each layer layout in the 3D adjacency matrices form what termed as graph word. Every input graph and its corresponding inference are then represented as sequences of graph words. The RDFS inference becomes equivalent to graph words translation that is achieved through neural network translation.
The evaluation confirms that deep learning can in fact be used to learn RDFS rules from both synthetic as well as real-world Semantic Web data while showing noise-tolerance capabilities compared to rule-based reasoners.
 P. Hitzler and F. van Harmelen, “A reasonable semantic web,” Semantic Web, vol. 1, no. 1, 2, pp. 39–44, 2010.
 P. Bloem and G. K. De Vries, “Machine learning on linked data, a position paper,” Linked Data for Knowledge Discovery, p. 69, 2014.
 A. Garcez, T. R. Besold, L. De Raedt, P. Földiak, P. Hitzler, T. Icard, K.-U. Kühnberger, L. C. Lamb, R. Miikkulainen, and D. L. Silver, “Neural-symbolic learning and reasoning: Contributions and challenges,” 2015.