Performance Evaluation of Nature-Inspired Algorithms in Constrained Optimization

Abstract— In almost all scientific contributions to the field of Nature-Inspired Algorithms (NIAs), the researchers select benchmark test suites that make it possible to draw conclusions on the merit of the proposed algorithm. Hence, it is a vital task to compose comprehensive test suites that cover a variety of different scenarios. Furthermore, while conducting a comparative analysis of results obtained with NIAs, the selection of proper performance indicators is of paramount importance. This paper addresses these two topics with special emphasis on NIAs designed for constrained optimization.

Throughout this work, a constrained optimization problem (COP) is stated as minimizing $f(\vec{x})$ over the feasible region $F \subseteq S$, where $S = \{\vec{x} \in \mathbb{R}^n \mid l_j \leq x_j \leq u_j,\ j = 1, \ldots, n\}$ and $F = \{\vec{x} \in S \mid g_i(\vec{x}) \leq 0 \text{ and } h_i(\vec{x}) = 0\}$; $\vec{x} = [x_1, \ldots, x_n]$ is the solution vector, $r$ is the number of inequality constraints, and $m - r$ is the number of equality constraints.

Innumerable benchmark problems have been introduced in the field of nonlinear constrained optimization. In the vast realm of nonlinear programming with nature-inspired algorithms (NIAs), collecting all constrained optimization problems (COPs) is a cumbersome task. Keeping this fact in mind, this work attempts to bring together the COPs most commonly used by practitioners when testing the performance of NIAs. Moreover, the composition of the test suites should allow evaluating the performance of an algorithm under various conditions.
The second aim of this work is to address the performance measures used in the domain and to contrast their informative characteristics. While conducting a comparative analysis of results obtained with NIAs, the selection of proper performance indicators is of paramount importance.
The rest of this work is organized as follows: the essential guidelines for selecting a representative benchmark set and the general properties of the most common COP benchmarks are given in Section II. Section III is devoted to setting up proper performance measures to verify the merit of a proposed NIA in the constrained optimization domain. Lastly, the findings are summarized in Section IV.

Selection Guidelines
The benchmark problems in the domain may be grouped into two main classes based on their origin (El-Ghazali, 2009):

Artificial problems:
From the early stages of intelligent problem solving with NIAs, researchers from various backgrounds have introduced countless COPs. Gradually, their individual contributions have been collected to form benchmark problem sets. In general, the primary goal of such benchmark sets is to cover as many problem variants as possible in order to provide a reliable test basis. A further advantage of such collections is that they have been evaluated by various methods and the results obtained by these methods are easily accessible (Eiben and Smith, 2008).
Another main source of benchmark problems is the problem generators developed by researchers. The two most frequently cited examples of such generators can be found in (Spears, 2004) and (Gallagher and Yuan, 2006). Their main drawback is that the problems produced by a given generator are of the same or similar type, since test problem generators usually rely on particular problem construction principles. Thus, they may not yield reliable results when testing algorithms claiming to be robust (Yu and Gen, 2010).

Real-life problems:
This type of problem originates from real-life challenges. Such problems form especially valuable benchmarks when testing the applicability of an algorithm in various fields of science. On the other hand, the data of a real-life problem is usually unavailable to the whole research community due to potential copyright and confidentiality issues. Thus, repeating the numerical experiments on the data is mostly not a feasible option. This, in turn, makes the comparison of different algorithms on the same problem barely possible (Yu and Gen, 2010). Furthermore, generalizing the obtained results is highly difficult since such problems usually involve domain-specific aspects.
Having introduced the two main sources of benchmark problems, we now specify the main characteristics that a good benchmark should possess. Namely, a NIA must be evaluated on various test problems covering different variants of COPs in order to investigate its convergence behavior. In other words, the set of instances must be diverse in terms of difficulty and structure (El-Ghazali, 2009). In particular, COPs with challenging properties, including multimodality, sparseness of the feasible space, and non-separability, should be included in the test bed to evaluate the robustness of an algorithm. However, we should keep in mind that robustness is only one of the criteria when proving the merit of an algorithm. Alternatively, a performance evaluation study may reveal the conditions under which the algorithm can be successfully applied.
Throughout the extensive course of research in the domain, some general guidelines for composing test beds for NIAs have been distilled by Eiben and Smith (2008) from various resources:
• A few unimodal instances should be provided to test the convergence speed of the algorithm.
• Several multimodal functions with a large number of local optima must be included in the test suite to examine the behavior of the algorithm when dealing with many local optima. Additionally, this type of function is a good touchstone for evaluating the robustness of the algorithm: an algorithm that reaches solutions of the same quality in each distinct run, despite the many local optima, demonstrates its robustness.
• As an additional robustness measure, the algorithm should be tested on problems with random noise.
• The test suite must contain some problems that are scalable in terms of problem variables and search range. By testing the algorithm on scalable problems, we may check for possible performance deficiencies caused by the enlargement of the search region.
• The convergence performance of the algorithm should be tested also on non-separable objective functions.
In addition to these points, which apply to general-purpose NIAs, we extracted some standards for nonlinear constrained optimization test suites. We propose the following guidelines specifically for constrained optimization problems:
• The algorithm should provide a proper way of handling not only inequality constraints but also equality constraints. Thus, COPs with a high number of equality constraints must be present in the test set. In particular, highly nonlinear equality constraints may be challenging for a constraint-handling method.
• Some COPs with a sparse feasible domain should be included in the benchmark set. This allows testing the explorative power of the algorithm in such scenarios.
• COPs whose global optimum lies on the constraint boundary are challenging benchmarks for most NIAs, and they should comprise an essential part of the test suite.
• Some NIAs may exploit the above-mentioned property of a COP. Thus, the opposite case, a global optimum that does not lie on the boundary, must also be analyzed.
• Some constraint-handling mechanisms may easily be distracted in COPs with a small number of constraints and a simple structure, although they behave well in highly constrained environments. This distraction usually manifests itself as unnecessary computational cost in such COPs. Thus, some COPs of low complexity should be added to observe this effect.
Obviously, there is still room for more work in this direction. More suitable test suites may be composed to test the above-mentioned issues. The next section states the most common test suites used to test NIAs and summarizes the general properties of the COPs included in them.

CEC2006 Benchmark
The CEC2006 test suite is composed of 24 problems that have been proposed by various researchers (Liang et al., 2006). Since the problems are collected from different resources, the problem set may be considered as a unique case covering a wide range of problem classes with novel properties.
The 24 problems were brought together by Liang et al. (2006) and were the subject of a competition organized by the IEEE in 2006. The general properties of the benchmark problems are given in Table 1 (Liang et al., 2006), where n indicates the number of variables, and ρ is the ratio of feasible individuals over 1,000,000 random individuals generated with a uniform distribution within the definition domain of the problem. LI, NI, LE, and NE stand for the numbers of linear inequalities, nonlinear inequalities, linear equalities, and nonlinear equalities, respectively. Here, act shows the number of active constraints at the optimum point $\vec{x}^*$.
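As an illustration of how ρ can be estimated, the following sketch samples points uniformly within the variable bounds and counts the fraction that is feasible. The toy two-variable problem and the equality tolerance eq_tol are our own illustrative assumptions, not part of the CEC2006 definition.

import numpy as np

def estimate_rho(lower, upper, inequalities, equalities, n_samples=1_000_000,
                 eq_tol=1e-4, seed=0):
    """Estimate rho: the fraction of uniformly sampled points that are feasible."""
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    feasible = 0
    for _ in range(n_samples):
        x = rng.uniform(lower, upper)
        ok = all(g(x) <= 0 for g in inequalities)
        # Equality constraints are treated as satisfied within a small tolerance (our assumption).
        ok = ok and all(abs(h(x)) <= eq_tol for h in equalities)
        feasible += ok
    return feasible / n_samples

# Hypothetical toy COP used only for illustration.
g1 = lambda x: x[0]**2 + x[1]**2 - 4.0     # feasible inside a circle of radius 2
g2 = lambda x: 1.0 - x[0] - x[1]           # x1 + x2 >= 1
rho = estimate_rho([-5, -5], [5, 5], [g1, g2], [], n_samples=100_000)
print(f"estimated rho = {rho:.4f}")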
Mezura-Montes et al. (2010) classify the CEC2006 problems based on two criteria, the number of decision variables and the type of constraints, as given in Table 2 and Table 3.

CEC2010 Benchmark
The second test suite worth mentioning was generated by Mallipeddi and Suganthan (2010c); see Table 4. Their intention was to offer an alternative benchmark set to the well-known 24 problems, which have been extensively studied and solved by many algorithms, so that it has become almost impossible to demonstrate the superiority of newly designed constraint-handling methods on them. Another motivation was to construct a test suite with problems that are scalable in terms of decision variables (Mallipeddi and Suganthan, 2010).
Additionally, in this test suite the objective functions and constraints are rotated by a certain rotation matrix M. The rotation operation is justified by the fact that a COP with multiple sparse feasible regions parallel to the coordinate axes can be solved better by algorithms employing line search or the difference of two or more solution vectors. Namely, the rotation aims at fostering a fair environment for comparison (Mallipeddi and Suganthan, 2010). Since all problems in the test set are scalable, a classification based on the number of problem variables is not reasonable. The competition organized by the IEEE (CEC 2010 Competition on Constrained Real-Parameter Optimization) required the participating researchers to solve the problems for $n = 10$ and $n = 30$. The classification of the problems based on the type of constraints is shown in Table 5.
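To make the rotation idea concrete, the sketch below wraps a separable toy problem so that its objective and constraint are evaluated on M·x, turning the problem non-separable. The construction of M from a QR decomposition of a random Gaussian matrix is an illustrative assumption on our part, not necessarily the procedure used by Mallipeddi and Suganthan (2010).

import numpy as np

rng = np.random.default_rng(42)

def random_rotation(n):
    """Return a random n x n orthogonal matrix via QR decomposition (illustrative choice)."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Fix column signs so the factorization is well defined.
    return q * np.sign(np.diag(r))

def rotate_problem(f, constraints, M):
    """Wrap an objective f and a list of constraints so they are evaluated on M @ x."""
    f_rot = lambda x: f(M @ x)
    cons_rot = [lambda x, g=g: g(M @ x) for g in constraints]
    return f_rot, cons_rot

# Hypothetical separable base problem, rotated to become non-separable.
n = 10
f = lambda x: np.sum(x**2)                 # sphere objective
g = lambda x: np.sum(np.abs(x)) - 5.0      # one inequality constraint
M = random_rotation(n)
f_rot, (g_rot,) = rotate_problem(f, [g], M)
x = rng.uniform(-1, 1, n)
print(f_rot(x), g_rot(x))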

Engineering Design Problems
The third group of problems introduced here consists of engineering design problems collected from various resources. Here, we note that the number of problems that could be placed in this set is larger than what we include in this work (Yiqing, Xigang and Yongjian, 2007; Lin, Hwang and Wang, 2004; Costa and Oliveira, 2001). However, we have selected the most common problems. The first problem is the pressure vessel design problem (g40) stated in (Zahara and Kao, 2009). Problem g41 is known in the literature as the welded beam design problem; it was first formulated in (Kannan and Kramer, 1994) and attained its standard form in later works. The tension/compression spring (g42) and speed reducer (g43) problems are also two of the most cited design problems in the literature (Liu, Cai and Wang, 2010), whereas the car side impact design (g44) and stepped cantilever beam (g45) problems are not frequently used for test purposes (Gandomi, Yang and Alavi, 2011).
The formulations we refer to are the most common ones stated in conventional research papers; there are slight differences in the problem statements across sources. In general, the engineering design problems have been modified and standardized since their first appearance in the NIA domain. Thus, the collection process was not a straightforward task. The general properties of the selected problems are summarized in Table 6.

PERFORMANCE MEASURES
Performance evaluation plays an essential role in proving the merit of a heuristic algorithm. Namely, we need objective criteria relying on a sound basis to confirm the effectiveness of the proposed method. Hence, it is highly important to define the performance measures in a systematic manner. This is a necessary step not only when comparing two different algorithms but also when tuning the control parameters of an algorithm during the design stage.
The employed performance measures may be obtained directly from the solutions found by the algorithm or extracted indirectly based on statistical comparison methods. Needless to say, the performance evaluations should be based on experiments over several independent runs due to the stochastic nature of NIAs. In general, the success of an algorithm is measured based on three basic metrics, or on some indirect values derived from them (Eiben and Smith, 2008):
• Solution quality (effectiveness)
• Speed (efficiency)
• Success rate (robustness)
The first performance measure considers the objective function value achieved within a limited computational time as the main indicator of the success of an algorithm, whereas the second metric measures success based on the computational cost needed to achieve a predefined solution. Namely, the former fixes the computational time and assesses the obtained objective function value, while the latter does the reverse.
In general, the solution quality is defined as the mean best objective function value (MBOV) over a certain number of independent runs. MBOV is calculated by identifying the best objective function value achieved in each run and taking the average of these values (Eiben and Smith, 2008). The best-ever or the worst-ever objective function value may also be of great interest for some test cases. A complementary strategy is to identify a satisfactory candidate solution and to measure the computational effort needed to reach that fitness level. A frequently used metric of computational effort is the number of fitness function evaluations (FES) required to find that solution.
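A minimal sketch of how these quality indicators can be computed from per-run results; the array of per-run best objective values is hypothetical.

import numpy as np

# Hypothetical per-run best objective values collected over independent runs (minimization).
best_per_run = np.array([0.012, 0.015, 0.009, 0.031, 0.011])

mbov       = best_per_run.mean()        # mean best objective value (MBOV) over all runs
best_ever  = best_per_run.min()         # best solution found in any run
worst_ever = best_per_run.max()         # worst of the per-run best values
std_dev    = best_per_run.std(ddof=1)   # sample standard deviation, often reported as well

print(f"MBOV = {mbov:.4f}, best = {best_ever:.4f}, "
      f"worst = {worst_ever:.4f}, std = {std_dev:.4f}")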
Although FES is the most common metric, the direct indicator of computational effort is the CPU time, which is defined by El-Ghazali (2009) as the time a processor spends in the execution of the algorithm. However, CPU time depends on the specifications of the machine on which the algorithm is run, the operating system, the programming language, and the software architecture. Namely, a comparison based on CPU times necessitates the re-implementation of the algorithms under consideration. Because re-implementing an algorithm based on the details given in a scientific paper is, in general, very cumbersome, the most common practice in the literature we refer to is to report more generic indicators, including the number of function evaluations (FES), the best value obtained so far, the percentage of successful runs, and the standard deviation, all of which are indifferent to machine settings.
While employing FES as a performance metric, it is assumed that an essential proportion of the CPU time is used by the evaluation of the fitness function. This assumption holds for most real-world optimization problems with computation-intensive objective functions (El-Ghazali, 2009). In this regard, the algorithm complexity can be calculated as the relative overhead $AC = (T_2 - T_1)/T_1$, where $T_2$ is the total CPU time needed to execute the algorithm for a certain number of fitness evaluations and $T_1$ is the time spent purely on those fitness function evaluations (Mallipeddi and Suganthan, 2010). To test whether FES is a proper metric for computational effort in CEC2010, we extracted the algorithm complexity values reported for the 10-dimensional problem set in each paper that participated in the competition. Surprisingly, the values range from 6.59% to 853%. Namely, the computational overhead induced by an algorithm may vary over a broad interval. Hence, FES may not be a good indicator of speed for algorithms with very complex structures.
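The following sketch illustrates how such a complexity figure might be measured in practice, using a placeholder random-search "algorithm" and a toy fitness function; the overhead is computed as (T2 − T1)/T1, following the description above.

import time
import numpy as np

def fitness(x):
    """Hypothetical objective; stands in for a (possibly expensive) fitness function."""
    return np.sum(x**2)

def random_search(budget, n=10, seed=0):
    """Placeholder 'algorithm': pure random search spending exactly `budget` evaluations."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(budget):
        best = min(best, fitness(rng.uniform(-5, 5, n)))
    return best

budget, n = 10_000, 10
rng = np.random.default_rng(1)
samples = rng.uniform(-5, 5, (budget, n))

t0 = time.perf_counter()
for x in samples:                       # T1: time spent purely on fitness evaluations
    fitness(x)
t1_time = time.perf_counter() - t0

t0 = time.perf_counter()
random_search(budget, n)                # T2: total time of the algorithm for the same budget
t2_time = time.perf_counter() - t0

overhead = (t2_time - t1_time) / t1_time   # algorithm complexity as described in the text
print(f"T1 = {t1_time:.3f}s, T2 = {t2_time:.3f}s, complexity = {overhead:.1%}")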
In addition to the criticism above, Eiben and Smith (2008) state that employing FES may be misleading if a NIA uses "hidden labor", for instance, a time-consuming local search heuristic embedded into the algorithm (Pelley, Innocente and Sienz, 2011; Sun and Garibaldi, 2010). In such cases, the extra computational cost or the additional function evaluations required by the local search procedure usually remain unaccounted for.
The third criterion evaluates the robustness of the algorithm in terms of the number of runs in which a predefined objective function value has been achieved within the specified computation time. The success rate (SR) is defined as the ratio of successful runs over the total number of runs. The total number of runs should be specified such that the obtained results allow drawing statistical conclusions.
To sum up, a proper performance evaluation method should consider all these three aspects together. Also, we should keep in mind that different performance measures may yield different conclusions (Garcia et al., 2008;Smith, 2007).
Additionally, some graphical tools may be useful to visualize the performance differences between various methods. Convergence graphs and box plots are very common means of visualization. A convergence graph demonstrates the convergence behavior of the algorithm towards a given objective value over the number of fitness function evaluations or generations; it is usually represented on a logarithmic scale. Box plots are used to show the solution quality and reliability of the algorithm over a specified number of independent runs. Obviously, an algorithm with a better mean fitness value and a lower deviation is desired.
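A brief matplotlib sketch of the two visualizations, using synthetic convergence histories for 25 hypothetical runs.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical convergence histories: best objective value recorded every 100 evaluations,
# for 25 independent runs of the same algorithm (values are synthetic).
fes_grid = np.arange(100, 50_001, 100)
histories = np.array([10.0 * np.exp(-fes_grid / rng.uniform(3_000, 8_000))
                      + rng.uniform(1e-6, 1e-3) for _ in range(25)])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Convergence graph: median best-so-far value versus FES, logarithmic y-axis.
ax1.plot(fes_grid, np.median(histories, axis=0))
ax1.set_yscale("log")
ax1.set_xlabel("fitness evaluations (FES)")
ax1.set_ylabel("best objective value")
ax1.set_title("Convergence graph")

# Box plot of the final objective values over the independent runs.
ax2.boxplot(histories[:, -1])
ax2.set_ylabel("final objective value")
ax2.set_title("Solution quality over 25 runs")

plt.tight_layout()
plt.show()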

3.1. Performance Measures for COPs
This section is devoted to the performance measures specifically designed for COPs. In addition to the objective function value, we may define two additional solution quality measures related to the feasibility status of a solution: (i) the sum of constraint violations, and (ii) the number of constraints violated. In some cases, a feasible solution, though not necessarily the optimal one, may be of paramount importance; thus, finding feasible points quickly in a highly constrained search region is also a desired property. Mezura-Montes et al. (2010) have utilized four metrics to measure the performance of NIAs for constrained optimization. At this point, we note that the performance measures given below rely on the assumption that a target solution can be identified, being either the known global optimum or a satisfactory candidate solution. We denote this target solution by $\vec{x}^*$ and its objective value by $f(\vec{x}^*)$.
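A minimal sketch of these two feasibility-related measures; the toy constraints and the equality tolerance eq_tol are illustrative assumptions.

import numpy as np

def violation_measures(x, inequalities, equalities, eq_tol=1e-4):
    """Return (sum of constraint violations, number of violated constraints) at point x."""
    viol = [max(0.0, g(x)) for g in inequalities]               # g(x) <= 0 desired
    viol += [max(0.0, abs(h(x)) - eq_tol) for h in equalities]  # |h(x)| <= eq_tol desired
    total = float(np.sum(viol))
    count = int(np.sum(np.array(viol) > 0.0))
    return total, count

# Hypothetical constraints for illustration.
g1 = lambda x: x[0] + x[1] - 1.0       # x1 + x2 <= 1
h1 = lambda x: x[0] - x[1]             # x1 = x2
print(violation_measures(np.array([0.8, 0.4]), [g1], [h1]))   # approx. (0.6, 2)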
Feasibility Rate (FR): With reference to the above discussion, they define the feasibility rate (FR) metric as the percentage of runs in which feasible solutions are found. A feasible run is an independent trial with at least one feasible solution. FR is formulated as $FR = N_{f}/N_{t}$, where $N_{f}$ is the number of feasible trials and $N_{t}$ is the total number of independent runs. Of necessity, FR lies between 0 and 1 (Mezura-Montes, Miranda-Varela and Del Carmen Gómez-Ramón, 2010).

Success Rate (SR):
A successful trial is an independent run in which the absolute difference between the best solution found, $f(\vec{x})$, and the target/optimal value, $f(\vec{x}^*)$, is less than a predefined threshold. SR is the ratio of successful trials over the total number of independent runs.
Average Number of Fitness Evaluations for Optimality (AFESO): It is calculated by averaging, over the successful trials, the number of FES needed to reach the close neighborhood of $f(\vec{x}^*)$. AFESO is formulated as $AFESO = \frac{1}{N_s}\sum_{k=1}^{N_s} FES_k$, where $FES_k$ is the number of fitness function evaluations spent in successful trial $k$ and $N_s$ is the number of successful trials.
Average Number of Fitness Evaluations for Feasibility (AFESF): Similarly, we propose an additional metric that measures the average number of FES required by the algorithm to identify the first feasible solution. AFESF is formulated as $AFESF = \frac{1}{N_f}\sum_{k=1}^{N_f} FFES_k$, where $FFES_k$ denotes the number of fitness function evaluations needed to find the first feasible point in feasible trial $k$ and $N_f$ is the number of feasible trials.
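Assuming each independent run is logged as a small record (the FES at which the first feasible point was found and the FES at which the target accuracy was reached, with None if the event never occurred; the field names are our own), the four metrics can be computed as follows.

import numpy as np

# Hypothetical per-run records; None means the event never occurred in that run.
runs = [
    {"fes_first_feasible": 1_200, "fes_success": 18_400},
    {"fes_first_feasible":   950, "fes_success": None},    # feasible but never reached target
    {"fes_first_feasible": 2_100, "fes_success": 25_300},
    {"fes_first_feasible":  None, "fes_success": None},    # never found a feasible point
]

total = len(runs)
feasible_runs   = [r for r in runs if r["fes_first_feasible"] is not None]
successful_runs = [r for r in runs if r["fes_success"] is not None]

FR = len(feasible_runs) / total                  # feasibility rate
SR = len(successful_runs) / total                # success rate
AFESO = np.mean([r["fes_success"] for r in successful_runs]) if successful_runs else float("nan")
AFESF = np.mean([r["fes_first_feasible"] for r in feasible_runs]) if feasible_runs else float("nan")

print(f"FR = {FR:.2f}, SR = {SR:.2f}, AFESO = {AFESO:.0f}, AFESF = {AFESF:.0f}")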
Success Performance (SP): Mezura-Montes et al. (2010) combine two metrics, AFESO and SR, to measure the speed and reliability of an algorithm: $SP = AFESO / SR$ (9). A low SP value indicates that the algorithm is able to find the global optimum with fewer FES and with high consistency.
Feasibility Performance (FP): Similarly to SP, we may define a feasibility performance metric that combines the two metrics AFESF and FR to measure the speed and reliability of an algorithm in identifying at least one feasible solution: $FP = AFESF / FR$ (10). A low FP value is preferred, as it means that the algorithm requires fewer FES to find the first feasible solution and is able to show the same behavior over several runs.
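Continuing the sketch above with its resulting values (reproduced here so the snippet is self-contained), the two combined indicators follow directly, guarding against zero rates:

# Example values taken from the previous sketch (illustrative only).
AFESO, SR = 21_850.0, 0.50
AFESF, FR = 1_416.7, 0.75

SP = AFESO / SR if SR > 0 else float("inf")   # success performance: lower is better
FP = AFESF / FR if FR > 0 else float("inf")   # feasibility performance: lower is better
print(f"SP = {SP:.0f}, FP = {FP:.0f}")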
The performance metrics SR and AFESO employ a stopping criterion of the form $|f(\vec{x}) - f(\vec{x}^*)| \leq \varepsilon$. Namely, the absolute error should fall below a predefined threshold before the search is terminated. However, the magnitude of $f(\vec{x}^*)$ is not considered when setting $\varepsilon$. More specifically, a threshold value of about $10^{-4}$ is not a proper choice for a target fitness value of a similar order of magnitude. Alternatively, we propose to replace the absolute error criterion with a relative percentage error (RPE) measure to terminate the search process: $RPE = \left| \frac{f(\vec{x}) - f(\vec{x}^*)}{f(\vec{x}^*)} \right| \times 100$. Hence, RPE provides a relative comparison, and the search is terminated when $RPE \leq \varepsilon$.
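A minimal sketch of the proposed RPE stopping rule; the threshold and the objective values are illustrative, and the formula assumes the target value is nonzero.

def relative_percentage_error(f_best, f_target):
    """Relative percentage error between the current best and target objective values.
    Assumes f_target != 0."""
    return abs((f_best - f_target) / f_target) * 100.0

# Illustrative check: terminate the search once RPE drops below a chosen threshold.
epsilon = 0.01                                  # 0.01 % relative error, an example value
f_target, f_best = -6961.81388, -6961.20        # illustrative target and current best values
if relative_percentage_error(f_best, f_target) <= epsilon:
    print("terminate search")
else:
    print("continue search")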

Figure 1: Precision-accuracy trade-off in optimization
Another point that must be highlighted is that the temporal requirements of the optimization problem to be solved may play an essential role in selecting a proper set of performance measures. Obviously, the principle of the precision-accuracy trade-off applies in the optimization domain, and each optimization problem has a certain focal point lying mainly on one side of the precision-accuracy scale, as depicted in Figure 1.
Namely, certain types of problems may require the optimization algorithm to deliver a good solution with high precision in a single run due to temporal limitations. In general, this is the case where optimization problems must be solved repetitively in a dynamic environment within a short time interval (for instance, optimization of the traffic load in a network system). Thus, the available computational budget is finite; the algorithm cannot be rerun several times and should consistently provide good (though not necessarily globally optimal) solutions. For this type of COP, high average performance and low deviation are vital aspects. Hence, FR, SR, AFES, or their combinations might be appropriate measures.
On the other hand, the nature of the optimization problem may allow exploring the whole search region thoroughly for the global optimum within an adequately large temporal budget. Engineering design problems are usually of this type, where a high-quality solution in the close vicinity of the global optimum should be found over several trials. In this case, precision is not the main objective since the best candidate point over all runs is selected as the final solution. The best-ever objective function value is a proper indicator of performance.

CONCLUSION
Firstly, this chapter illustrated the principal characteristics of a good benchmark. Based on the extracted guidelines, benchmark problems are selected to test the different variants of the proposed algorithm.
Precisely defined performance metrics are of paramount importance during the design stage of a NIA dealing with COPs. Also, a sound basis is necessary when comparing the proposed algorithm with other methods. Thus, the most common performance measures were discussed in this chapter. Moreover, we highlighted that the set of proper performance measures may change depending on the specifications of the COP. Based on the findings of this chapter, a parameter analysis of the proposed algorithms will be conducted and the resulting algorithm will be compared with state-of-the-art methods addressing the same set of benchmark problems.