Data Reduction by Generalized Graph Domination – OSU10-ENTR

This project reduces a massive data-set to a manageable size without significant loss of information through a graph-based data mining approach.

Sponsor:

Entero Technologies, LLC

Research Team:

Balabhaskar Balasundaram, Baski Balasundaram

Universities Involved:

Oklahoma State University

Start Date:

02/01/10

End Date:

12/31/10

Summary:

Entero Technologies deals with massive multi-attribute data in which each data point is a vector whose components prescribe settings for a collection of engineering design parameters related to oil field engineering services. The problem is to reduce this large data-set to a manageable size without significant loss of information represented.

A graph-based data mining approach was developed for this problem that enabled the reduction of a 30,000 point sample data to less than 6000 points using the concept of dominating sets in the graph model of this data-set.

Problem Statement:

——————

Entero Technologies deals with massive multi-attribute data in which each data point is a vector whose components prescribe settings for a collection of engineering design parameters related to oil field engineering services. Each data point (vector) describes a design that needs to be evaluated by simulation. Each simulation takes significant amount of time to complete, and there are 30,000 designs to be evaluated. The problem is to reduce this large dataset to a manageable size without significant loss of information represented.

Technical Approach:

——————-

A graph-based data mining approach is developed for this problem. Each design is a numerical vector, which is represented by a vertex in the graph model. An edge is created between two vertices if the distance between them is less than a specified threshold. Adjacent vertices then represent similar points in the dataset. A minimum k-dominating set in this graph then represents the smallest set of vertices such that every non-dominated vertex has at least k neighbors in the set. This identifies data points to be simulated such that every data point not selected for simulation experiments has at least k points similar to it among those selected for simulation.
The proposed graph-based data mining approach was implemented and tested. Different approaches to measuring pairwise distance, selection of threshold parameter for the graph model, and the choice of parameter k, were investigated. The optimization model was formulated as an integer program and implemented using C++ and IBM ILOG CPLEX. The implementation was able to solve the problem on the given 30,000 point data-set to under 10% optimality gap. The solution was validated by the sponsor regarding how well it represented the large data-set, by a combination of user experience and simulation of a limited number of randomly selected points from the ones not chosen in the solution, and comparing them to the points that dominated them. The sponsor found the solution to meet their expectations and subsequently used the results in their continuing simulation studies.

Broader Value to CELDi Members:

——————————-

In general, while the model is developed for a graph-based data mining problem, it is also applicable to robust facility location problems. That is, when a subset of locations need to be selected from a large set of candidate locations for establishing facilities so that every location either receives a facility or is close to at least k facilities. The case where k=1 can be viewed as an alternative to the well-known p-median problem, when the number of facilities is not fixed a priori (to be p), rather a minimum number of locations needs to be identified so that the demand points are close enough (according to a user-specified threshold) to at least one facility. For larger values of k, we get more “robust” solutions where every location that doesn’t receive a facility is close enough to at least k facilities. Here, the solution is said to be robust in the sense that if one facility is unavailable due to some reason, an alternate facility is nearby.

Future Research and Potential Extensions:

—————————————–

The proposed methodology is suitable for solving the problem on one large data-set to obtain a good feasible solution without any particular consideration to run-time. Metaheuristic approaches could be investigated to find solutions faster in applications without any emphasis on optimality (or bounds). Polyhedral approaches can lead to better exact algorithms via facets and strong valid inequalities used in a branch-and-cut approach.