• PhD Proposal Presentation: Liliya Besaleva

    Smart e-Commerce Personalization Using Customized Algorithms


    Monday, December 21, 2015
    Liliya Besaleva
    Advisor: Alf Weaver
    Attending Faculty: Worthy Martin (Chair); Jack Stankovic, Hongning Wang and Larry Richards (MAE).

    1:00 PM, Rice Hall, Rm. 242

    PhD Proposal Presentation
    Smart e-Commerce Personalization Using Customized Algorithms

    ABSTRACT

    Applications of machine learning algorithms can be observed in numerous places in our modern lives. From medical diagnosis predictions to smarter ways of shopping online, big, fast data is streaming in and being utilized constantly. Unfortunately, unusual instances of data, called imbalanced data, are still largely ignored because of the inadequacies of analytical methods that are designed to handle homogeneous data sets and to “smooth out” outliers. Consequently, rare use cases of significant importance remain neglected and lead to high-cost losses or even tragedies. In the past decade, a myriad of approaches to this problem, ranging from data modifications to alterations of existing algorithms, have appeared with varying success. Yet the majority of them have major drawbacks when applied to different application domains because of the non-universal nature of the generated data. Within the vast domain of e-commerce, we propose a new approach for handling imbalanced data: a hybrid classifier combining adaptable data formatting with algorithmic modifications. Our solution is divided into two main phases serving different purposes. In phase one, we classify outliers with less accuracy for faster, more urgent situations that require immediate predictions and can tolerate possible classification errors. In phase two, we perform a deeper analysis of the results, aiming to precisely identify high-cost imbalanced data with larger impact. The goal of this work is to provide a solution that improves the data usability, classification accuracy, and resulting costs of analyzing massive data sets in e-commerce.
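
    The two-phase idea can be pictured as a toy triage pipeline (an illustration only, not the proposed classifier): a cheap, error-tolerant first pass flags anything that might be a rare case, and a stricter second pass keeps only high-confidence instances. The `score` function, thresholds, and data below are all hypothetical stand-ins.

    ```python
    def phase_one(records, score, loose=0.3):
        """Fast, error-tolerant pass: keep anything that might be a rare case."""
        return [r for r in records if score(r) >= loose]

    def phase_two(candidates, score, strict=0.8):
        """Deeper pass: keep only high-confidence rare cases."""
        return [r for r in candidates if score(r) >= strict]

    def classify_rare(records, score):
        return phase_two(phase_one(records, score), score)

    # Toy "rarity" score: distance of a transaction amount from the norm.
    score = lambda amount: min(abs(amount - 50) / 100.0, 1.0)
    print(classify_rare([48, 52, 90, 180, 50], score))  # → [180]
    ```

    Only the extreme outlier survives both passes; the borderline value 90 is flagged by the cheap pass but discarded by the precise one.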

  • PhD Qualifying Exam Presentation: Chunkun Bo

    Entity Resolution Acceleration using Micron's Automata Processor


    Tuesday, December 15, 2015
    Chunkun Bo
    Advisor: Kevin Skadron
    Attending Faculty: Worthy Martin (Chair); Gabriel Robins, Yanjun Qi

    2:00 PM, Rice Hall, Rm. 242

    PhD Qualifying Exam Presentation
    Entity Resolution Acceleration using Micron's Automata Processor

    ABSTRACT

    Entity Resolution (ER), the process of finding identical entities across different databases, is critical to many information integration applications. As database sizes explode in the big-data era, it becomes computationally expensive to recognize identical entities across all records when variations are allowed across multiple databases. Profiling results show that approximate matching is the primary bottleneck. Micron's Automata Processor (AP), an efficient and scalable semiconductor architecture for parallel automata processing, provides a new opportunity for hardware acceleration of ER. We propose an AP-accelerated ER solution, which accelerates the performance bottleneck of fuzzy matching for similar but potentially inexactly matched names, and use a real-world application to illustrate its effectiveness. Results show 121x to 4200x speedups for matching one record, with better accuracy (9.2% more correct pairs and 43% lower generalized merge distance cost) than the existing CPU method. The proposed method runs even faster with an improved algorithm.
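
    For context, the approximate-matching bottleneck that the AP accelerates is essentially bounded edit-distance comparison over candidate names. A minimal CPU-side sketch of that operation (an illustration, not Micron's or the authors' implementation):

    ```python
    def edit_distance(a, b):
        """Classic dynamic-programming Levenshtein distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,           # deletion
                               cur[j - 1] + 1,        # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    def fuzzy_match(name, records, max_edits=2):
        """Return records whose name is within max_edits of the query."""
        return [r for r in records if edit_distance(name, r) <= max_edits]

    print(fuzzy_match("Jonathan Smith", ["Jonathon Smith", "Jon Smith", "Jane Doe"]))
    ```

    Each comparison is quadratic in the string lengths, which is why scanning every record this way dominates ER runtime and why a parallel automata substrate helps.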

  • Master's Project Presentation: Zeming Lin

    Robust classifiers against adversarial attacks using model randomization


    Monday, December 14, 2015
    Zeming Lin
    Advisor: Yanjun (Jane) Qi
    Attending Faculty: Westley Weimer

    10:00 AM, Rice Hall, Rm. 504

    Master's Project Presentation
    Robust classifiers against adversarial attacks using model randomization

    ABSTRACT

    Machine learning models are widely used for detecting malware in technologies like antivirus software and email attachment scanners. However, learning models usually assume a stationary data distribution, an assumption that is generally violated in the presence of an attacker who can manipulate test samples. To mitigate such attacks, we propose a defense strategy that randomly generates many different models through a diversity tactic. Our technique can generate models quickly, which allows us to deploy it to millions of users. We provide experimental results for two different attack scenarios and show that our technique can prevent attacks on both image classifiers and PDF malware classifiers.
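
    One simple way to picture a diversity tactic (a toy sketch, not the authors' method) is to derive many models from one base model by small random weight perturbations, serving each user a different draw so that no single fixed decision boundary can be targeted by an attacker:

    ```python
    import random

    def make_models(base_weights, n_models=10, noise=0.1, seed=1):
        """Cheaply derive many diverse models by perturbing a base linear
        model's weights; each user would be served a different draw."""
        rng = random.Random(seed)
        return [[w + rng.uniform(-noise, noise) for w in base_weights]
                for _ in range(n_models)]

    def predict(model, x, bias=0.0):
        """Simple linear threshold classifier."""
        return 1 if sum(w * xi for w, xi in zip(model, x)) + bias > 0 else 0

    base = [0.5, -0.2, 0.8]          # hypothetical trained weights
    models = make_models(base)
    x = [1.0, 0.0, 1.0]              # a sample far from the boundary
    print(sum(predict(m, x) for m in models))  # → 10
    ```

    Samples far from the boundary keep the same label under every perturbed model, so accuracy is preserved, while an adversarial input crafted against one draw need not transfer to the others.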

  • PhD Qualifying Exam Presentation: Beilun Wang

    Fast and Scalable Joint Estimators for Learning Multiple Related Sparse Gaussian Graphical Models


    Wednesday, November 4, 2015
    Beilun Wang
    Advisor: Jane Qi
    Attending Faculty: Gabe Robins (Chair), Worthy Martin and James Cohoon

    Rice Hall, Rm. 242 at 11:00 AM

    PhD Qualifying Exam Presentation
    Fast and Scalable Joint Estimators for Learning Multiple Related Sparse Gaussian Graphical Models

    ABSTRACT
    In this paper, we infer multiple sparse Gaussian Graphical Models (sGGMs) jointly from data samples of many tasks (large $K$) in a high-dimensional (large $p$) setting. Most previous studies of the joint estimation of multiple sGGMs rely on penalized log-likelihood estimators that involve expensive and difficult non-smooth optimization. We propose a novel approach, FASJEM (fast and scalable joint estimator for multiple sGGMs), for structure estimation of multiple sGGMs at a large scale. As the first study to use the M-estimator framework for this problem, FASJEM has the following sound properties: (1) We solve FASJEM in an entry-wise manner, which is parallelizable. (2) We choose a proximal algorithm to optimize FASJEM. This improves its computational efficiency significantly, from $O(Kp^3)$ to $O(Kp^2)$, and eases its memory requirement from $O(Kp^2)$ to $O(K)$. (3) We theoretically prove that FASJEM achieves a consistent estimation with a convergence rate of $\max\{ O(\log(Kp)/n_{tot}), O(p\log(Kp/n_{tot})) \}$. On one synthetic and three real-world datasets, FASJEM shows significant improvements over baselines in accuracy, computational complexity, and memory requirements.
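
    The entry-wise, parallelizable character of a proximal update can be illustrated with the l1 proximal operator (soft-thresholding). This toy snippet sketches only the general idea, not the actual FASJEM update:

    ```python
    def soft_threshold(x, lam):
        """Proximal operator of the l1 penalty: shrink x toward zero by lam."""
        if x > lam:
            return x - lam
        if x < -lam:
            return x + lam
        return 0.0

    def prox_l1(matrix, lam):
        """Apply the operator entry-wise. Every entry is updated
        independently, so the step parallelizes trivially across all
        entries of a stack of precision matrices."""
        return [[soft_threshold(v, lam) for v in row] for row in matrix]

    print(prox_l1([[0.75, -0.125], [-0.5, 0.25]], 0.25))
    ```

    Because no entry's update depends on any other, the per-iteration work can be split across cores or machines, which is what drives the improved time and memory scaling.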

  • PhD Dissertation Defense: Vidyabhushan Mohan

    Towards Building Energy Efficient, Reliable and Scalable NAND Flash based Storage Systems


    Friday, October 16, 2015
    Vidyabhushan Mohan
    Advisors: Mircea Stan and Kevin Skadron
    Attending Faculty: Jack Davidson (Chair), Marty Humphrey, Jiwei Lu and Jack Frayer (SanDisk)

    1:00 PM, Rice Hall, Rm. 242

    Ph.D. Dissertation Defense Presentation
    Towards Building Energy Efficient, Reliable and Scalable NAND Flash based Storage Systems

    ABSTRACT

    NAND Flash (or Flash) is the most popular solid-state non-volatile memory technology used today. As the technology scales and costs fall, flash has replaced Hard Disk Drives (HDDs) to become the de facto storage technology. However, flash memory scaling has adversely impacted the power efficiency and reliability of flash-based storage systems. And while smaller flash geometries have driven storage system capacity to approach the petabyte limit, the performance of such high-capacity storage systems is also a major limitation. In this dissertation, we address the power, reliability, and performance scalability challenges of NAND flash-based storage systems by modeling key metrics, exploring the tradeoffs between these metrics, and evaluating the design space to build application-optimal NAND flash-based storage systems.

    To address the power efficiency of flash memory, this dissertation presents FlashPower, a detailed analytical power model for flash memory chips. Using FlashPower, this dissertation provides detailed insights on how various parameters affect flash energy dissipation and proposes several architecture level optimizations to reduce memory power consumption. 

    To address the reliability challenges facing modern flash memory systems, this dissertation presents FENCE, a transistor-level model to study various failure mechanisms that affect flash memories and analyze the trade-off between flash geometries and operation conditions like temperature and usage frequency. Using FENCE, this dissertation proposes both firmware and architecture level solutions to design reliable and application optimal storage systems. 

    Finally, to address scalability limitations of flash based high capacity Solid State Disks (SSDs), this dissertation evaluates the bottlenecks faced by conventional SSD architectures to show that the processing power available in conventional SSD architectures severely limit SSD performance at petabyte scale capacity. This dissertation proposes FScale, a scalable distributed processor based SSD architecture that can match the scaling rate of NAND flash memory and enable high performance petabyte scale SSDs.

  • PhD Proposal Presentation: Juhi Ranjan

    Object User Recognition Using Smart Wearable Devices


    Friday, October 9, 2015
    Juhi Ranjan
    Advisor: Kamin Whitehouse
    Attending Faculty: Jack Stankovic (Chair); James Scott (Microsoft Research, Cambridge, UK), Gabriel Robins and Laura Barnes

    9:00 AM, Rice Hall, Rm. 404

    PhD Proposal Presentation
    Object User Recognition Using Smart Wearable Devices

    ABSTRACT

    Context-aware computing is now an integral part of many commercial 'smart' products. Some examples of the contexts used are home occupancy, road traffic, and global position coordinates. Identity of a person is also an important context for many ubiquitous computing applications. For example, some bathroom scales can recognize the person stepping on them to provide long-term weight and other body measurement trends. In order for objects to perform personalized or contextual functions based on the object user's identity, they must solve what we call the object user recognition problem: understanding who is actually using an object. 

    Many techniques have already been designed to solve this problem. Some objects, such as computers or smartphones, identify a user based on a passcode or fingerprint. Other techniques use RFID tags to detect when a person's hand is near an object, or thermal cameras to detect when a person wearing a unique thermal tag is near an object. Still other objects, which have embedded sensors, can recognize the object user based on the unique way in which the object is touched or held. In our research, we hypothesize that it is possible to perform object user recognition using sensors currently available in commercial smart wrist devices. These sensors provide rich information about the location and hand gestures of the device bearer. We speculate that this information is sufficient to identify an object user from the set of people with access to the object.

  • PhD Qualifying Exam Presentation: Weilin Xu

    Automatically Evading Classifiers: A Case Study on PDF Malware Classifiers


    Thursday, October 1, 2015
    Weilin Xu
    Advisors: David Evans and Yanjun Qi
    Attending Faculty: Wes Weimer (Chair), Hongning Wang

    11:00 AM, Rice Hall, Rm. 242

    PhD Qualifying Exam Presentation 
    Automatically Evading Classifiers: A Case Study on PDF Malware Classifiers

    ABSTRACT

    Machine learning is widely used to develop classifiers for security tasks. However, the robustness of these methods against motivated adversaries is uncertain. In this work, we propose a generic method to evaluate the robustness of classifiers under attack. The key idea is to stochastically manipulate a malicious sample to find a variant that preserves the malicious behavior but is classified as benign by the classifier. We present a general approach to search for evasive variants and report on results from experiments using our techniques against two PDF malware classifiers, PDFrate and Hidost. Our method is able to automatically find evasive variants for all of the 500 malicious seeds in our study. Our results suggest a general method for evaluating classifiers used in security applications, and raise serious doubts about the effectiveness of classifiers based on superficial features in the presence of adversaries.
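
    The stochastic search can be pictured abstractly as mutate, check, repeat. Everything below (the sample encoding, mutation operator, behavior oracle, and classifier) is a hypothetical toy, not PDFrate, Hidost, or the authors' actual harness:

    ```python
    import random

    def find_evasive(seed_sample, mutate, is_malicious, classify, tries=1000, seed=0):
        """Stochastically mutate the sample, keep only mutants that preserve
        malicious behavior, and stop once one is scored benign."""
        rng = random.Random(seed)
        variant = seed_sample
        for _ in range(tries):
            candidate = mutate(variant, rng)
            if not is_malicious(candidate):
                continue                      # mutation broke the payload
            variant = candidate
            if classify(variant) == "benign":
                return variant
        return None

    # Toy problem: a "sample" is a list of ints, the payload is the value 7,
    # and the (weak) classifier flags anything whose sum exceeds 20.
    is_malicious = lambda v: 7 in v
    classify = lambda v: "benign" if sum(v) <= 20 else "malicious"

    def mutate(v, rng):
        out = list(v)
        if len(out) > 1 and rng.random() < 0.5:
            out.pop(rng.randrange(len(out)))  # drop a random element
        else:
            out.append(rng.randrange(10))     # pad with a random value
        return out

    evasive = find_evasive([7, 9, 9], mutate, is_malicious, classify)
    print(evasive is not None and 7 in evasive and classify(evasive) == "benign")
    ```

    The search quickly finds a variant that keeps the payload yet slips under the classifier's superficial feature, which is precisely the failure mode the abstract warns about.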

  • PhD Qualifying Exam Presentation: Dezhi Hong

    The Building Adapter: Towards Quickly Applying Building Analytics at Scale


    Tuesday, September 22, 2015
    Dezhi Hong
    Advisor: Kamin Whitehouse
    Attending Faculty: Marty Humphrey (Committee Chair), Yanjun Qi, and Alfred Weaver.

    9:30 AM, Rice Hall, Rm. 404

    PhD Qualifying Exam Presentation 
    The Building Adapter: Towards Quickly Applying Building Analytics at Scale

    ABSTRACT

    Commercial and industrial buildings account for a considerable fraction of all the energy consumed in the U.S., and reducing this energy consumption has become a national grand challenge. Based on the large deployment of sensors in modern commercial buildings, many organizations are applying data analytic solutions to the thousands of sensing and control points to detect wasteful, incorrect and inefficient operations for energy savings. Scaling this approach is challenging, however, because the metadata about these sensing and control points is inconsistent between buildings, or even missing altogether. As a result, an analytics engine cannot be applied to a new building without first addressing the issue of mapping: creating a match between the sensor streams and the inputs of a data analytic engine. The mapping process requires significant integration effort and anecdotally can take a week or longer for each commercial building. Thus, metadata mapping is a major obstacle to scaling up building analytics.

    In this work, we demonstrate first steps towards an automatic metadata mapping solution that requires minimal human intervention. We develop two different techniques, fully automated mapping and semi-automated mapping, to differentiate sensors in buildings by type, e.g., temperature vs. humidity. Our first technique performs automatic mapping without any manual intervention. The approach builds on and improves techniques from transfer learning: it learns a set of statistical classifiers from the metadata of a labeled building and adaptively applies those models to another, unlabeled building, even if the two buildings have very different metadata conventions. The second approach involves iterative manual labeling, in which a clustering-based active learning algorithm exploits the data's cluster structure to acquire human labels for informative instances and propagates those labels to nearby unlabeled neighbors to accelerate learning.

    We perform a comprehensive study on a data set collected from over 20 different sensor types and 2,500 sensor streams in three commercial buildings on two campuses. The transfer-learning-based solution can automatically label at least 36% of the points with more than 85% accuracy, while the best baseline achieves only 63% label accuracy on average. Our active-learning-based technique achieves more than 92% accuracy for type classification with far fewer labeled instances than the baselines. As a proof of concept, we also demonstrate a typical analytic application enabled by the normalized metadata. These techniques represent a first step towards technology that would enable any new building analytics engine to scale quickly to the tens of millions of commercial buildings across the globe, with minimal need for manual mapping on a per-building basis.
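
    The clustering-plus-propagation idea behind the active learning approach can be sketched in miniature. The 1-D `feature` statistic, the radius-based clustering, and the data below are simplified stand-ins for the real time-series features and clustering algorithm:

    ```python
    def label_streams(streams, feature, oracle, radius=1.0):
        """Query the human oracle once per cluster and propagate that label
        to nearby unlabeled streams."""
        labels, queries = {}, []
        for s in sorted(streams, key=feature):
            for t in list(labels):
                if abs(feature(s) - feature(t)) <= radius:
                    labels[s] = labels[t]     # propagate to the neighbor
                    break
            else:
                queries.append(s)
                labels[s] = oracle(s)         # new cluster: ask the human
        return labels, queries

    # Toy data: readings near 70 behave like temperatures, near 40 like humidities.
    oracle = lambda s: "temperature" if s > 60 else "humidity"
    labels, queries = label_streams([70, 71, 69, 40, 41], lambda s: s, oracle)
    print(len(queries))  # → 2: one human label per cluster, the rest propagated
    ```

    Five streams are labeled with only two human queries; the saving grows with cluster size, which is what lets the real system reach high accuracy with far fewer labels than the baselines.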

  • PhD Proposal Presentation: Jian Xiang

    Real-World Types and Their Applications


    Tuesday, September 8, 2015
    Jian Xiang
    Adviser: John Knight
    Attending Faculty: Jack Davidson (Chair), Kevin Sullivan, Hongning Wang, and Houston Wood

    10:00 AM in Rice 404

    Ph.D. Proposal Presentation
    Real-World Types and Their Applications

    ABSTRACT
    Software systems, especially cyber-physical systems and embedded systems, interact with the real world in order to sense and affect it. For these systems to function correctly, such software should obey rules inherited from the real world that it senses, models, and affects, as well as from the machine world in which the software executes. Frequently, however, important characteristics of real-world entities are undocumented, or are documented incompletely, informally, and implicitly in source code. As a result, real-world attributes, such as units and dimensions, and associated real-world constraints, such as not mixing units, are stated and enforced either in ad hoc ways or not at all. In addition, crucial relationships between machine-world representations and real-world entities, such as the accuracy of sensed values, remain under-specified. The result is that programs tend to treat machine-world representations as if they were the real-world entities themselves. This practice introduces faults into systems through unrecognized discrepancies, and executions end up violating rules inherited from the real world. The results are software and system failures and adverse downstream consequences.

    This thesis proposes the notion of real-world types to document relevant characteristics of real-world entities for use in programs and to document relationships between real-world types and machine-level representations. By using real-world types, programmers can enforce real-world constraints on programs in a systematic way, thereby enabling a new class of software fault detection. This research will examine the principles and practices necessary to (1) comprehensively document real-world entities and (2) leverage such documentation to systematically enforce real-world constraints. This proposal provides an overview of real-world types along with key research questions, a research plan to answer them, and preliminary results.
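
    A minimal illustration of the general idea, using a hypothetical pair of length types rather than the proposal's actual type system: once units live in the types, mixing them becomes a detectable fault instead of a silent numeric error.

    ```python
    class Meters(float):
        """A real-world type: a length in metres. Adding anything that is
        not metres is rejected (here at run time; a static checker could
        reject it at compile time)."""
        def __add__(self, other):
            if not isinstance(other, Meters):
                raise TypeError("cannot add a non-metre quantity to Meters")
            return Meters(float(self) + float(other))

    class Feet(float):
        """A length in feet, with an explicit, documented conversion."""
        def to_meters(self):
            return Meters(float(self) * 0.3048)

    altitude = Meters(120.0) + Feet(30.0).to_meters()
    print(round(float(altitude), 3))  # → 129.144

    try:
        Meters(120.0) + Feet(30.0)    # mixing units without converting
    except TypeError as e:
        print("rejected:", e)
    ```

    The same discipline extends to dimensions, coordinate frames, and sensed-value accuracy: the type records the real-world meaning that plain floats discard.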

  • PhD Qualifying Exam Presentation: Soheil Nematihaji

    On the Power of Algebraic Tools in Code Obfuscation


    Monday, February 29, 2016

    10:00 AM in Rice 314

    Advisor & Chair: Mohammad Mahmoody
    Attending Faculty: David Evans, Gabriel Robins, abhi shelat.

    Title: On the Power of Algebraic Tools in Code Obfuscation

    ABSTRACT:
    Obfuscating programs to make them “unintelligible” while preserving their functionality is one of the most sought-after holy grails in cryptography, due to its numerous applications. The celebrated work of Barak et al. [BGI+01] was the first to launch a formal study of this notion in its various forms. Virtual Black-Box (VBB) obfuscation is a strong form of obfuscation in which the obfuscated code does not reveal any secret bit about the obfuscated program unless that information could already be obtained through black-box access to the program. The same work [BGI+01] also defined a weaker notion of obfuscation, called indistinguishability obfuscation (iO). The security of iO only requires that the obfuscations of two equivalent, same-size circuits C1, C2 be computationally indistinguishable to efficient adversaries.

    In this proposal, studying the complexity of obfuscation plays the central role. We already know that VBB obfuscation is impossible in general, but that does not stop us from exploring the possibility of VBB obfuscation in idealized models or of achieving VBB for special classes of functions. Whether VBB obfuscation is possible in appealing and natural idealized models is one of the fundamental questions in cryptography. There are several interesting idealized models, such as TDP, the generic group model of Shoup [Sho97] (GGM), and GEM [BGK+14]. One can ask the same questions about iO as well. The main focus of this proposal is to study the existence of VBB obfuscation and iO in some of these idealized models.

  • PhD Proposal Presentation: Md Anindya Prodhan

    Market Based on-Demand Computing in Co-operative Grid Environment (CG-Market)


    Wednesday, March 2, 2016
    11:00 AM in Rice 504

    Advisor: Andrew Grimshaw
    Attending Faculty: Alfred Weaver (Committee Chair), Kamin Whitehouse, Malathi Veeraraghavan and Federico Ciliberto (Minor Representative)

    Title: Market Based on-Demand Computing in Co-operative Grid Environment (CG-Market)

    ABSTRACT:

    Computational and data scientists at universities have heterogeneous job resource requirements. Most universities maintain a set of shared resources to support these computational needs. Access to these resources is often free, and the access policy is First Come First Serve (FCFS). However, FCFS policies on shared resources often yield sub-optimal value from the organization's point of view, because different jobs contribute different values to their users. Furthermore, the set of resources at a single institution may fail to satisfy the diverse needs of the institution's researchers. We argue that the solution is differentiated quality of service (QoS), based on users' willingness to pay, to rationalize resource usage, together with federation of university resources to increase both the size of the resource pool and the diversity of resources. The proposed XSEDE Campus Bridging (CB) Shared Virtual Compute Facility (SVCF) provides both. The CB-SVCF will be deployed using the existing XSEDE Execution Management Services (EMS) and the XSEDE Global Federated File System (GFFS).

    Although there is a rich literature on both computational grids and grid economies, much of the work has been done in simulation. Researchers did not have access to production-quality systems on which to test their simulation results, nor did they have production-quality infrastructure and a real user community. We propose to carry out our experiments on a production-quality testbed with real data traces from real users and to answer the fundamental question in grid economy: does federation of resources among universities with market-based resource allocation yield more value to the organizations, individually and/or overall? Three university resources from UVa and IU will serve as a hardware testbed for the system. A modest number of researchers at each university will be authorized to submit and run jobs on the testbed and will be charged based on their resource and QoS requirements. With these real data traces we will calculate the organizational value achieved with and without CG-Market for the two universities, and compare them to verify whether the advantages of a market-based grid actually hold on real infrastructure.

  • PhD Proposal Presentation: In Kee Kim

    Proactive Resource Management to Ensure Predictable End-to-End Performance for Cloud Applications


    Monday, February 29, 2016
    9:00 AM in Rice 504

    Advisor: Marty Humphrey
    Attending Faculty: Alfred Weaver, Yanjun Qi, Hongning Wang, Byungkyu Brian Park (Civil Engineering at UVA)

    Title:  Proactive Resource Management to Ensure Predictable End-to-End Performance for Cloud Applications

    ABSTRACT:

    Public IaaS clouds have become an essential infrastructure for many organizations to run their applications, thanks to diverse resource types, cost-efficient pay-as-you-go pricing models, and the scalability and elasticity of resources. To leverage public IaaS clouds effectively, cloud users tend to employ resource managers that elastically control cloud resources to handle dynamic changes in workloads. These resource managers typically have two interrelated goals: maximizing SLA (Service Level Agreement) satisfaction and minimizing execution cost. However, existing cloud resource managers have difficulty meeting these two goals due to the low accuracy and poor generalizability of their workload predictors and cloud performance models. Designing these two components, and a resource manager that combines them, is challenging because of uncertainties in public IaaS clouds, namely 1) uncertainty in future workload patterns and 2) uncertainty in cloud resource performance.

    This project creates a new cloud resource management framework that contains a workload predictor, a performance model, and a dynamic resource reconfiguration mechanism. The framework maximizes SLA satisfaction and minimizes cloud cost by ensuring predictable end-to-end performance for cloud applications. The project consists of four parts. First, we develop a workload prediction model that forecasts future job arrivals to cloud applications by dynamically aggregating the best workload predictors for diverse workload patterns. This predictor provides accurate forecasts that enable the resource manager to determine when to scale which cloud resources. Second, we develop a cloud performance model that predicts performance uncertainty. This model is based on actual measurements of performance on real cloud infrastructures and provides a statistical guarantee for the end-to-end execution of user jobs. Third, we develop a resource management framework that provides dynamic resource reconfiguration through a near-optimal combination of horizontal and vertical scaling mechanisms. This framework offers online, adaptive approaches to reconfigure available cloud resources to minimize the financial cost of resource use under SLA requirements. Last, we develop a simulation framework that supports large-scale evaluation of cloud applications and resource management policies. This simulation framework provides trustworthy results under particular workloads and enables users to test various real-world scenarios with a minimal amount of effort.
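
    The "dynamically aggregate the best predictor" idea can be sketched as backtesting a pool of simple predictors on recent history and forecasting with the current winner. The predictors and arrival counts below are toy stand-ins, not the project's actual models:

    ```python
    def best_predictor(history, predictors):
        """Pick whichever predictor had the lowest absolute error on the
        recent observations, then use it for the next forecast."""
        def error(p):
            return sum(abs(p(history[:i]) - history[i])
                       for i in range(1, len(history)))
        return min(predictors, key=error)

    last = lambda h: h[-1]                       # naive persistence
    mean = lambda h: sum(h) / len(h)             # running mean
    trend = lambda h: h[-1] + (h[-1] - h[-2]) if len(h) > 1 else h[-1]

    arrivals = [10, 12, 14, 16, 18]              # steadily growing workload
    p = best_predictor(arrivals, [last, mean, trend])
    print(p(arrivals))  # → 20: the trend predictor wins on this pattern
    ```

    On a bursty or cyclic trace a different predictor would win the backtest, which is the point of aggregating rather than committing to one model.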

    This research improves the performance of cloud resource management systems under workload and performance uncertainties in public IaaS clouds. In addition, it addresses the two main cloud scaling problems: when and how to scale.


  • PhD Proposal Presentation: John Hott

    Evolving Networks: Structure & Dynamics


    Tuesday, March 15, 2016
    3:00 PM in Rice 242

    Advisor: Worthy Martin
    Attending Faculty: Alfred Weaver, Gabriel Robins, Luther Tychonievich, Jeffrey Holt (minor representative)

    Evolving Networks: Structure & Dynamics

    ABSTRACT
    Network analysis, especially social network analytics, has become widespread with the growing amount of linked data available. Many researchers have started to consider evolving networks, i.e., Time-Varying Graphs (TVGs), to begin to understand how these networks change over time. I expand on current practice in three ways: I develop sampling methods for examining an evolving network at any given time-point, apply social network measures to the sampled graphs, and examine evolving networks under various definitions of node identity. Specifically, I propose six methods for sampling a TVG to include contextual events in an interval around any time-point, producing representative static graphs of the TVG. I then apply betweenness and harmonic centralities throughout the overall lifespan of the TVG to yield distributions characterizing the dynamics of the network's evolution. I also propose the new concepts of “node-identity class” and “node-identity function” to analyze and compare different views of the same graph. Each of these proposed extensions and concepts will be validated against synthetic and real-world datasets, including Mormon marriages in mid-1800s Nauvoo, IL, an arXiv co-citation network, and the Social Networks and Archival Contexts historical social-document network.
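
    The sample-then-measure pipeline can be sketched on a toy edge list. The interval rule below is a deliberate simplification of the six proposed sampling methods, and the graph is hypothetical:

    ```python
    from collections import deque

    def snapshot(events, t0, t1):
        """Sample a time-varying edge list into a static undirected graph
        using only events inside the interval [t0, t1]."""
        adj = {}
        for t, a, b in events:
            if t0 <= t <= t1:
                adj.setdefault(a, set()).add(b)
                adj.setdefault(b, set()).add(a)
        return adj

    def harmonic_centrality(adj, v):
        """Sum of 1/d(v, u) over all reachable u != v, via BFS."""
        dist = {v: 0}
        q = deque([v])
        while q:
            u = q.popleft()
            for w in adj.get(u, ()):
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        return sum(1.0 / d for u, d in dist.items() if u != v)

    events = [(1, "a", "b"), (2, "b", "c"), (9, "c", "d")]
    g = snapshot(events, 0, 5)          # the late edge is excluded
    print(harmonic_centrality(g, "b"))  # → 2.0
    ```

    Sliding the interval and recomputing the measure at each time-point yields the centrality distributions that characterize the network's evolution.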

  • PhD Qualifying Exam Presentation: Md Mustafizur Rahman

    Hidden Topic Sentiment Model


    1:30 PM - Tuesday, April 5th, 2016

    Rice Hall, Room 504

    Advisor: Hongning Wang

    Attending Faculty: Worthy Martin (Chair), Westley Weimer, Kamin Whitehouse

    Title: Hidden Topic Sentiment Model

    Abstract: Various topic models have been developed for sentiment analysis tasks, but the simple topic-sentiment mixture assumption prevents them from finding fine-grained dependencies between topical aspects and sentiments. In this project, we build a Hidden Topic Sentiment Model (HTSM) to explicitly capture topic coherence and sentiment consistency in an opinionated text document, in order to accurately extract latent aspects and their corresponding sentiment polarities. In HTSM, 1) topic coherence is achieved by enforcing words in the same sentence to share the same topic assignment and by modeling topic transitions between successive sentences; 2) sentiment consistency is imposed by constraining topic transitions via tracking sentiment changes; and 3) both topic transitions and sentiment transitions are guided by a parameterized logistic function based on linguistic signals directly observable in the document. Extensive experiments on four categories of product reviews from both Amazon and NewEgg validate the effectiveness of the proposed model.