
A novel reinforcement learning framework for disassembly sequence planning using Q-learning technique optimized using an enhanced simulated annealing algorithm

Published online by Cambridge University Press:  01 April 2024

Mirothali Chand
Affiliation:
Department of Computer Science and Engineering, National Institute of Technology Puducherry, Karaikal, India
Chandrasekar Ravi*
Affiliation:
Department of Computer Science and Engineering, National Institute of Technology Puducherry, Karaikal, India
*
Corresponding author: Chandrasekar Ravi; Email: chand191987@gmail.com

Abstract

The increase in Electrical and Electronic Equipment (EEE) usage across sectors has given rise to repair and maintenance units. Disassembling parts requires proper planning, which is carried out through the Disassembly Sequence Planning (DSP) process. Because manual disassembly is constrained by time and labor, such planning is essential. Effective disassembly planning can encourage the reuse and recycling sector, reducing raw-material mining, and an efficient DSP lowers both time and cost. To address the challenges in DSP, this research introduces an innovative framework based on Q-Learning (QL) within the domain of Reinforcement Learning (RL). Furthermore, an Enhanced Simulated Annealing (ESA) algorithm is introduced to improve the exploration–exploitation balance in the proposed RL framework. The proposed framework is extensively evaluated against state-of-the-art frameworks and benchmark algorithms using a diverse set of eight products as test cases. The findings reveal that the proposed framework outperforms benchmark algorithms and state-of-the-art frameworks in terms of time consumption, memory consumption, and solution optimality. Specifically, for complex large products, the proposed technique achieves a minimum reduction of 60% in time consumption and 30% in memory usage compared to other state-of-the-art techniques. Additionally, qualitative analysis demonstrates that the proposed approach generates sequences with high fitness values, indicating more stable and less time-consuming disassemblies. This framework enables various real-world disassembly applications, thereby making a significant contribution to sustainable practices in the EEE industries.

Type
Research Article
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Introduction

Rapid technological advancement, population growth, and urbanization have led to high demand for electrical and electronic products. As products depreciate and become damaged after use, they accumulate, which is a significant problem. The repair–reuse–recycle (RRR) process manages this equipment waste, and carrying it out smoothly requires efficient planning. This is where disassembly planning comes in: the process of forming an efficient plan to dismantle a product into separate entities. The RRR process is then carried out based on the parts' quality and requirements.

Disassembly sequence planning (DSP) is the primary disassembly planning method. In DSP, a sequence is generated according to which the parts are disassembled; the sequence generation is based on various input data about the product. Both humans and robots can be employed to disassemble products, but robot-based disassembly reduces manpower and time consumption and increases laborer safety. DSP is considered a hard problem in the manufacturing sector because of the difficulty of planning sequences for complex products and the associated time and cost.

After the assembly process in the manufacturing unit, the next stage is repair and maintenance. Based on the product model obtained from the assembly unit, the repair and maintenance of the products are done in factories and service centers. The disassembly process has three primary purposes: repairing, remanufacturing, and recycling the products. These purposes help reduce the mining of new raw materials and encourage the reuse of the same products or parts. In most cases, however, high time and cost consumption discourages manufacturers from disassembly. To overcome this problem, an efficient method for generating a disassembly sequence must be planned, one that reduces the respective time and cost. DSP is classified based on the types and levels of disassembly (Chand and Ravi, Reference Chand and Ravi2023). The type of DSP varies from product to product based on its structure and complexity.

Based on the type of disassembling process, DSP is divided into the following:

  1. Sequential DSP – The parts are disassembled sequentially, one by one.

  2. Parallel DSP – The parts are disassembled in parallel, with two or more parts disassembled simultaneously.

Similarly, DSP is divided into three types based on the levels of disassembly.

  1. Complete DSP – Every part of the product is dismantled into individual parts.

  2. Partial DSP – Dismantling is done up to a particular level. Based on the requirements, it can be the product’s initial, middle, or final levels.

  3. Selective DSP – A particular part of the product is selectively disassembled based on its requirement.

Various algorithms have been employed for the DSP sequence generation process, ranging from basic traditional methods to advanced meta-heuristic techniques. The DSP process involves two primary stages: first, the product model is analyzed and the disassembly attributes are extracted; then, algorithms process these attributes to generate feasible and optimal/near-optimal disassembly sequences.

Disassembly attributes

The disassembly attributes are extracted from the computer-aided design (CAD) models of the product automatically. The various types of matrices used for product representation in this work are explained below using the ink–pen example shown in Figure 1.

  1. Stability

  2. Liaison

  3. Geometric feasibility

  4. Precedence

Figure 1. Ink–pen example.

Stability matrix

The stability matrix (S) is generated from the stability data between the parts, which indicate whether the other parts of the product remain stable when one part is disassembled. The matrix shows the relationship between part i and part j. If a part i (Pi) faces no disturbance in any direction during the dismantling of part j (Pj), the relation is considered “completely stable” and denoted as “2” in the matrix S. If Pi is stable in one direction but disturbed in another during the disassembly of Pj, the relation is termed “partially stable” and denoted as “1.” If Pi has no possibility of proper stability during the disassembly of Pj, it is considered “unstable” and denoted as “0” in the matrix S. The stability relation is given in Eq. (1), and the stability matrix for the considered ink–pen example is represented in Figure 2.

(1) $$ S\left({P}_i,{P}_j\right)=\left\{\begin{array}{l}2, if\;{P}_i\; is\ completely\ stable\ during\ disassembly\ of\;{P}_j\\ {}1, if\;{P}_i\; is\ partially\ stable\ during\ disassembly\ of\;{P}_j\\ {}0, if\;{P}_i\; is\ unstable\ during\ disassembly\ of\;{P}_j\end{array}\right\} $$

Figure 2. Stability matrix.

Liaison matrix

This matrix L denotes the contact relationship between the elements. If part i has contact or connection with part j, it is denoted by “1.” If not, it is denoted by “0.” The contact within the parts of a product is represented in this matrix. The liaison matrix equation is given in Eq. (2). The liaison matrix for the ink–pen example is given in Figure 3.

Figure 3. Liaison matrix.

(2) $$ L\left({P}_i,{P}_j\right)=\left\{\begin{array}{c}1, if\ there\ exists\ a\ relation\ between\;{P}_i\; and\;{P}_j\\ {}0, if\ there\ exists\ no\ relation\end{array}\right\} $$

Disassembly feasibility matrix (geometric)

The parts’ geometric, direction-based relationships are given in the disassembly feasibility matrix (D). A part can be disassembled in any direction based on the product structure. The ±XYZ directions (d) in which a part can be dismantled from another part are given as six matrices. If part i can be disassembled from part j in a given direction, it is denoted as “1”; if dismantling is not feasible, it is denoted as “0.” The X- and Y-axes represent the horizontal and vertical directions, respectively, whereas the Z-axis represents the gravitational direction. To explain the concept of geometric feasibility with a simple example, six directions are represented here. For larger products with complex connections, the number of directions is higher: a part may have to be disassembled at angles that cannot be grouped into these six directions, in which case each possible angle is considered a direction (Anil Kumar et al., Reference Anil Kumar, Bahubalendruni, Prasad and Sankaranarayanasamy2021). This representation is termed the “oblique-directional interference matrix.” In any case, the objective is to reduce the total number of directional changes during the disassembly process.

The disassembly feasibility matrix equation is given in Eq. (3). The disassembly feasibility matrix for ink–pen is given in Figure 4. The orientation change (Oc) score is calculated from the matrix (D) based on the number of possible directions between part connections i and j. It is given in Eq. (4).

(3) $$ D({P}_i,{P}_j)=\left\{\begin{array}{c}1, if\hskip0.35em {P}_i\hskip0.35em can\hskip0.35em be\hskip0.35em disassembled\ without\ disturbing\hskip0.35em {P}_j\\ {}0, if\hskip0.35em {P}_i\hskip0.35em disturbs\hskip0.35em {P}_j\hskip0.35em during\ disassembly.\end{array}\right\} $$
(4) $$ {Oc}_{ij}={\left(\left(+X\right)+\left(+Y\right)+\left(+Z\right)+\left(-X\right)+\left(-Y\right)+\left(-Z\right)\right)}_{ij} $$

Figure 4. Disassembly feasibility matrix.

Precedence matrix

The precedence matrix (Pr) gives information regarding the precedence relationship between the parts of a product. The matrix is constructed between any two parts i and j. It is given the value “1” if a particular part i has to be disassembled before part j. If the part has no dependency on the other part and can be disassembled freely, the matrix value is “0.” These matrix data are used to check the feasibility condition of the disassembly sequence. The precedence matrix for the ink–pen example is given in Figure 5, and its equation is given in Eq. (5).

Figure 5. Precedence matrix.

(5) $$ {P}_r\left({P}_i,{P}_j\right)=\left\{\begin{array}{c}1, if\;{P}_i\; has\; to\; be\; disassembled\ before\;{P}_j\\ {}0, if\;{P}_i\; can\; be\; disassembled\ freely,\ with\ no\ precedence\ on\;{P}_j\end{array}\right\} $$
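As a concrete illustration, the precedence check described above can be sketched in a few lines of Python; the 4-part matrix below is a hypothetical example, not the ink–pen data.

```python
def is_feasible(sequence, Pr):
    """Return True if the sequence respects every precedence constraint:
    Pr[i][j] == 1 means part i must be disassembled before part j."""
    position = {part: k for k, part in enumerate(sequence)}
    n = len(Pr)
    for i in range(n):
        for j in range(n):
            if Pr[i][j] == 1 and position[i] > position[j]:
                return False
    return True

# Hypothetical 4-part precedence matrix: part 0 must precede parts 1 and 2.
Pr = [[0, 1, 1, 0],
      [0, 0, 0, 0],
      [0, 0, 0, 0],
      [0, 0, 0, 0]]

print(is_feasible([0, 1, 2, 3], Pr))  # True
print(is_feasible([1, 0, 2, 3], Pr))  # False
```

A sequence is rejected as soon as any part appears after one it is required to precede, which is exactly the feasibility condition the matrix encodes.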

Usually, a product comprises various parts with multiple connections, resulting in many possible disassembly sequences. As a product’s part count increases, the number of connections and the complexity grow exponentially. In addition, DSP requires processing more product and material data depending on the objectives considered. This makes DSP a challenging and complex problem that requires an efficient and systematic framework to generate feasible and optimal sequences.

The objectives of DSP

The common objective of DSP is to minimize the time and cost taken to disassemble a complete product into separate parts. The primary disassembly time ( $ {D}_p $ ) is the essential time required to dismantle one part of the given product; a constant value is taken as the primary disassembly time (Luo et al., Reference Luo, Peng and Gu2016). The disassembly fitness function, derived from the total time and cost incurred in the disassembly process, is used to evaluate the objective. The optimality of a solution is based on the minimization of disassembly time ( $ {D}_t $ ) and disassembly cost ( $ {D}_c $ ). The disassembly time depends on the number of directional changes; the disassembly cost is based on the cost of the various tools and labor used in the disassembly process.

In this work, due to the non-availability of actual cost data, the feasibility property between parts is used to calculate $ {D}_c $. Here, the cost does not refer to monetary cost; rather, it reflects how the parts’ connections affect the feasibility of the disassembly sequence, so a high-cost sequence is less feasible or non-feasible. $ {D}_c $ is calculated from the liaison and stability attributes and assumed weight factors. The liaison property is the basic requirement for a feasible sequence, so its weight factor must be high and is taken as 10: if the connection of parts (Pi, Pj) to be disassembled has no liaison relationship, the initial cost ( $ {D}_c=1 $ ) is increased by 10. The next requirement is that the parts in the sequence be completely or partially stable, so their weight factors are assigned accordingly: if a connection in the sequence has low stability, the cost is increased by 3, and if it has no stability, by 5. This ensures that feasible sequences have low cost values. After calculating the cost for every part connection, the disassembly cost ( $ {D}_c $ ) data for all part connections are generated. The disassembly time ( $ {D}_t $ ) is calculated from the time penalty ( $ {T}_p $ ) and the primary disassembly time ( $ {D}_p $ ) as in Eq. (6): a time penalty of 3 is given for part connections whose number of feasible orientations is less than or equal to 2, and $ {D}_p $ is assumed to be 1. The DSP fitness function ( $ {D}_f $ ) is based on maximizing the inverse of $ {D}_t $ plus $ {D}_c $, as given in Eq. (7).

(6) $$ {D}_t={T}_p+{D}_p $$
(7) $$ {D}_f=\max \hskip0.35em \left(1/\left({D}_t+{D}_c\right)\right) $$
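A minimal Python sketch of these scoring rules follows, using the weights from the text (10 for a missing liaison, 3 and 5 for partial and no stability, a time penalty of 3 when at most two orientations are feasible, and $ {D}_p=1 $). Summing $ 1/({D}_t+{D}_c) $ over consecutive connections is one plausible way to aggregate the per-connection fitness, and the two-part matrices are hypothetical.

```python
def connection_cost(i, j, L, S):
    """Disassembly cost D_c for connection (i, j): start at 1,
    +10 if no liaison, +3 for partial stability, +5 for no stability."""
    Dc = 1
    if L[i][j] == 0:
        Dc += 10          # liaison missing: heavily penalized
    if S[i][j] == 1:
        Dc += 3           # partially stable
    elif S[i][j] == 0:
        Dc += 5           # unstable
    return Dc

def connection_time(i, j, Oc, Dp=1):
    """Disassembly time D_t = T_p + D_p, Eq. (6); T_p = 3 when the
    number of feasible orientations Oc[i][j] is at most 2."""
    Tp = 3 if Oc[i][j] <= 2 else 0
    return Tp + Dp

def fitness(sequence, L, S, Oc):
    """Fitness per Eq. (7): the inverse of D_t + D_c, summed here
    over consecutive part connections in the sequence."""
    return sum(1.0 / (connection_time(i, j, Oc) + connection_cost(i, j, L, S))
               for i, j in zip(sequence, sequence[1:]))

# Hypothetical two-part product: in contact, completely stable,
# all six directions feasible.
L_m  = [[0, 1], [1, 0]]
S_m  = [[2, 2], [2, 2]]
Oc_m = [[6, 6], [6, 6]]
print(fitness([0, 1], L_m, S_m, Oc_m))  # 1 / (1 + 1) = 0.5
```

A fully feasible connection thus contributes the maximum per-connection fitness of 0.5, while missing liaisons or poor stability push the score down.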

Table 1 represents the disassembly sequences for the ink–pen with their respective directions and directional changes, which are generated through the optimization algorithms used for comparison purposes in this work.

Table 1. Sequence and directional data

Related works

Several research works have been published based on the DSP problem. For better clarity, the related works are divided based on their basic types and explained in this section.

Graph-based methods

At the beginning stage of research in DSP, graph-based methods were used by most researchers. In the graph representation, nodes denote the parts and edges represent the relationships between those parts. The most used graph-based approach is the AND/OR graph (De Floriani and Nagy, Reference De Floriani and Nagy1989), and an extended version of the AND/OR graph was introduced by Ma et al. (Reference Ma, Jun, Kim and Lee2011). Other graph methods include the extended process graph (Kim and Lee, Reference Kim and Lee2017; Tian et al., 2019a), the disassembly precedence graph (Han et al., Reference Han, Yu and Lee2013), and the graph cut-set method (Gunji et al., Reference Gunji, Pabba, Rajaram, Sorakayala, Dubey, Deepak, Biswal and Bahubalendruni2021). The graph-based approach is preferred because it generates feasible sequences from the product data. To process the uncertainties in dynamic DSP, Vongbunyong et al. (Reference Vongbunyong, Kara and Pagnucco2012) proposed a cognitive robot-based DSP approach; later work (Vongbunyong et al., Reference Vongbunyong, Kara and Pagnucco2013) introduced basic and advanced behavior-control strategies for a cognition-based disassembly robot. Graph-based methods give a proper representation of the various disassembly possibilities, but the graph definitions must be supplied manually, and automatic generation suffers from combinatorial explosion.

Matrix-based methods

Various matrix-based representations, such as contact (liaison), translational (Smith et al., Reference Smith, Smith and Chen2012), precedence (Azab et al., Reference Azab, Ziout and ElMaraghy2011; Ren et al., Reference Ren, Tian, Zhao, Yu and Zhang2017), directional, and interference (Kheder et al., Reference Kheder, Trigui and Aifaoui2017) matrices, have been proposed by researchers. These matrices depict the pairwise relationships between the parts of a product regarding their stability, priority, and spatial interference. Matrix-based data serve as the input to the optimization and sequence generation processes. Apart from graph- and matrix-based approaches, Petri net-based representations (Petri and Reisig, Reference Petri and Reisig2008; Kuo, Reference Kuo2013) are also used. The matrix-based representation is the most suitable type for different computational processes; a further advantage is that the matrix data can be obtained automatically from the product’s CAD models.

Mathematical and meta-heuristic methods

Various computational techniques and optimization algorithms are used to generate feasible or optimal disassembly sequences. Mathematics-based computational methods like branch-and-bound (Kim and Lee, Reference Kim and Lee2017), linear (Zhu et al., Reference Zhu, Sarigecili and Roy2013), and nonlinear (Ullerich, Reference Ullerich2014) methods are used to solve DSP. The mathematical models have the capability to find the optimal solutions, but the quality of the solution is entirely based on the product’s objective function and representation format. Meta-heuristic methods like genetic algorithm (GA) (Giudice and Fargione, Reference Giudice and Fargione2007; Hui et al., Reference Hui, Dong and Guanghong2008; Tseng and Lee, Reference Tseng and Lee2018), ant colony optimization (ACO) (Wang et al., Reference Wang, Liu, Li and Zhong2003; Kheder et al., Reference Kheder, Trigui and Aifaoui2017), particle swarm optimization (PSO) (Kheder et al., Reference Kheder, Trigui and Aifaoui2017), and artificial bee colony optimization (Ren et al., Reference Ren, Zhang, Zhao, Xiao and Tian2018; Tian et al., 2019a,b; Liu et al., Reference Liu, Zhou, Pham, Xu, Yan, Liu, Ji and Liu2018, Reference Liu, Zhou, Pham, Xu, Ji and Liu2020) are used by various researchers to solve the DSP problem. The meta-heuristic approaches can be applied to complex products to get near-optimal or optimal solutions, but the quality of solutions varies based on the constraints of that particular approach.

Hence, these approaches are enhanced and combined with other meta-heuristic methods to produce hybrid approaches (Tian et al., Reference Tian, Zhou and Li2018). For instance, Tseng et al. (Reference Tseng, Chang, Lee and Huang2019) introduced a hybrid technique based on ACO. A combined version of GA and the artificial fish swarm algorithm was presented by Guo et al. (Reference Guo, Zhong, Li, Du and Guo2019). The recent work of Xing et al. (Reference Xing, Wu and Qu2021) is based on an improved max–min ant colony system (IMMAS). Hybrid methods can generate better solutions than single heuristic or meta-heuristic methods, but they are not straightforward: they apply different techniques at different stages of the algorithm, which prevents a generalized approach.

Other approaches in DSP

In addition to computational and optimization approaches, various advanced techniques and technologies have been explored for DSP. These include simulation-dependent techniques such as CAD (Issaoui et al., Reference Issaoui, Aifaoui and Benamara2017), decision-making-based de-manufacturing (Anil Kumar et al., Reference Anil Kumar, Bahubalendruni, Prasad and Sankaranarayanasamy2021), virtual reality (Mitrouchev et al., Reference Mitrouchev, Wang and Chen2016, Reference Mitrouchev, Wang, Chen, Eynard, Nigrelli, Oliveri, Fajarnes and Rizzuti2017), and augmented reality (AR) (Osti et al., Reference Osti, Ceruti, Liverani and Caligiana2017; Chang et al., Reference Chang, Nee and Ong2020). A robot–human collaborative approach to DSP was implemented using the artificial bee colony algorithm (Liu et al., Reference Liu, Zhou, Pham, Xu, Yan, Liu, Ji and Liu2018, Reference Liu, Zhou, Pham, Xu, Ji and Liu2020; Xu et al., Reference Xu, Tang, Liu, Liu, Zhou and Pham2020); this work minimizes manual labor and utilizes human knowledge for both single-objective and multi-objective problems. These advanced techniques, however, generate disassembly sequences only for specific products with special orientations, so they cannot be used as general techniques for other products, and benchmarking them against standard algorithms is difficult.

Synthesis of literature study

Although many works address DSP, most are based on traditional optimization algorithms and nature-inspired heuristic algorithms. Human–robot collaborative approaches have been employed for various disassembly problems, but only a few works have been published specifically for DSP, leaving room for further research in these areas. Machine learning approaches have been used only to a limited extent in DSP because of slow convergence and local minima. To address these challenges, adaptive parameter techniques have emerged as a viable solution. Numerous adaptive parameter techniques have been introduced across various learning problems, including learning automata-based approaches (Beigy and Meybodi, Reference Beigy and Meybodi2000, Reference Beigy and Meybodi2001; Meybodi and Beigy, Reference Meybodi and Beigy2000, Reference Meybodi and Beigy2002). Simulations of these techniques demonstrate their feasibility and effectiveness in various learning problems; in particular, parameter adaptation in the back-propagation (BP) algorithm has been applied to train multilayer neural networks. Inspired by this concept of parameter adaptation for overcoming local minima and slow convergence, an enhanced simulated annealing (ESA) algorithm is implemented in this work. By enhancing convergence rates, a wide range of machine learning models can be explored to address the challenges in DSP (Syed Shahul Hameed and Rajagopalan, Reference Syed Shahul Hameed and Rajagopalan2022, Reference Syed Shahul Hameed and Rajagopalan2023).

Based on the study of related works, it is evident that there is a lack of exploration of machine learning methods, particularly reinforcement learning (RL), in solving DSP problems. Additionally, there is a significant need for an efficient framework capable of effectively handling various products.

Research contributions

To address these gaps, an optimized Q-learning (OQL)-based RL approach, specifically tailored for DSP (RL-DSP), is proposed in this research. The primary objective of this work is to overcome the challenges in DSP by utilizing the potential of the RL technique. The contributions of this work are as follows:

  1. A novel RL framework for DSP: An original RL framework is contributed that generates optimal sequences for products with diverse levels of complexity. While most existing works incorporate only limited DSP attributes, this study considers all necessary attributes and weights them according to their influence in generating optimal disassembly sequences.

  2. Implementation of the proposed ESA: An ESA is proposed in this work. Its innovative approach narrows the search space over time and optimally balances exploration and exploitation, thereby avoiding local optima and enhancing the likelihood of finding global optima.

  3. Development of the OQL approach: The proposed OQL approach integrates an epsilon-greedy (EG) approach with the ESA. This improves the action selection process by introducing a flexible, adaptive learning model that solves problems more robustly, leading the OQL method to outperform the classic QL (CQL) technique in a variety of complex scenarios.

The proposed work is benchmarked against standard and state-of-the-art techniques, selected based on their demonstrated performance in current literature. The results show that the proposed RL-DSP approach provides better solutions in terms of optimality, as well as reduced time and memory consumption.

The organization of the remainder of this paper is as follows: Section “Background study” presents a background study of RL and QL in relation to the research. Section “Proposed methodology” outlines the proposed methodology, including the novel ESA approach and the OQL-based RL framework for DSP. Section “Experiments and analysis” analyzes the experimental study and discusses the results, highlighting the effectiveness of the proposed approach. Finally, the paper concludes with a summary of key findings, contributions, and suggestions for future research.

Background study

Reinforcement learning (RL)

The RL technique is explored because of its efficiency in handling nondeterministic polynomial time (NP)-hard problems (Sutton and Barto, Reference Sutton and Barto2018). The technique aims to make the agent select, in the current state, an action that produces the best results. Based on the reward (feedback) received for every action, the agent (decision-maker) analyzes the current state and environment (scenario) to take a particular action (step) at that time. The state of a given problem changes according to the actions picked by the agent: if the agent takes an appropriate action, it is rewarded. The agent aims to maximize its total reward by taking desirable actions based on its feedback experience. In this method, the state set (S) denotes the group of all possible states in a problem; the action set (A(S)) represents the actions that can be taken in a particular state; the reward (R) denotes the points given to the agent for taking a desirable action; the penalty is the low or negative reward given for an undesirable or wrong action; the cost is a measure of the quality of the solution or state; and time indicates the period taken for the learning process. The general framework of RL is illustrated in Figure 6. The main task of the agent is to learn a policy $ \pi $: S → A. The learned policy (knowledge) must maximize the total reward M. Both the learning rate (α) and the discount factor (γ) must satisfy 0 < α, γ < 1. The total reward is defined as $ M={r}_0+\gamma {r}_1+{\gamma}^2{r}_2+\cdots $.

Figure 6. General framework of reinforcement learning process.

Classic Q-learning (CQL) approach

CQL is the standard QL technique, based on temporal difference (TD) learning. Q-values are updated according to Eq. (8), and the calculated values are stored in the Q-table. In Eq. (8), s and a are the state and action at the current time t; $ {Q}_t $(s, a) is the current Q-value for the current state and the action taken; $ {Q}_{t+1} $ is the updated Q-value; R(s, a) denotes the reward obtained for the pair (s, a); and $ {\max}_{a^{\prime }}Q({s}^{\prime },{a}^{\prime }) $ is the maximum reward attainable from the new state s′ over all possible actions a′. Algorithm 1 gives the step-by-step CQL algorithm, and Figure 7 depicts the flowchart of the CQL process.

(8) $$ {Q}_{t+1}\left(s,a\right)={Q}_t\left(s,a\right)+\alpha \left(R\left(s,a\right)+\gamma {\max}_{a^{\prime }}Q\left({s}^{\prime },{a}^{\prime}\right)-{Q}_t\left(s,a\right)\right) $$

Algorithm 1. Classic Q-Learning (CQL) Algorithm.

  1. Set the parameters: (learning rate α, discount factor γ)

  2. For each pair of state (s) & action (a), Q-matrix (s, a) = 0

  3. Observe the state s

  4. repeat

  5. Select the action a using the epsilon-greedy method

  6. Take the action a

  7. Receive immediate reward r(s, a)

  8. Observe the new state s’

  9. Update Q(s, a) with Eq. (8)

  10. s = s’

  11. until the stopping criterion is satisfied

Figure 7. Classic Q-learning (CQL) flowchart.
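Algorithm 1 can be sketched as a short tabular Q-learning routine in Python. The 4-state chain environment below is an illustrative assumption for demonstration, not the paper's DSP environment.

```python
import random

def q_learning(n_states, n_actions, step, terminal,
               alpha=0.1, gamma=0.9, epsilon=0.3, episodes=500):
    """Tabular CQL (Algorithm 1): epsilon-greedy action selection
    followed by the TD update of Eq. (8)."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != terminal:
            if random.random() < epsilon:                        # explore
                a = random.randrange(n_actions)
            else:                                                # exploit
                a = max(range(n_actions), key=lambda x: Q[s][x])
            s2, r = step(s, a)
            # TD update, Eq. (8)
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

# Toy 4-state chain: action 1 moves right, reaching state 3 pays reward 1.
def step(s, a):
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0)

random.seed(0)
Q = q_learning(n_states=4, n_actions=2, step=step, terminal=3)
```

After training, the learned Q-values favor the rewarded action (moving right) in the states adjacent to the goal, which is the greedy policy the Q-table encodes.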

Epsilon-greedy (EG) approach

The exploration–exploitation trade-off is a pivotal aspect of RL. The agent must choose the best among already exploited actions to maximize rewards, yet it must also explore further actions to find other potentially better solutions to a given problem. Various strategies exist for action selection; one of them is the epsilon-greedy method. The epsilon ($ \varepsilon $) values should lie in the range $ 0<\varepsilon <1 $. Initially, when the epsilon rate is high (~1), the agent explores the environment more; as the epsilon rate decreases, the agent exploits more.

Through this exploration, the agent gains knowledge about the environment, and the required policy is built. The policy $ \pi $(s) is applied according to Eq. (9) (Ottoni et al., Reference Ottoni, Nepomuceno, de Oliveira and de Oliveira2021). However, the EG approach lacks efficiency in its epsilon-decreasing structure.

(9) $$ \pi \left(\mathrm{s}\right)=\left\{\begin{array}{l}a\ast, with\ probability\hskip0.4em 1-\varepsilon \\ {} ar, with\ probability\hskip0.4em \varepsilon\;\end{array}\right. $$

$ \pi (s) $ denotes the decision policy for the current state s, a* denotes the best-estimated action for the state s at the current time, and ar denotes the random action selected with probability $ \varepsilon $ .
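Eq. (9) amounts to only a few lines of Python; the Q-row below is a hypothetical vector of action values.

```python
import random

def eg_policy(Q_row, epsilon):
    """Eq. (9): with probability 1 - epsilon pick the greedy action a*,
    otherwise pick a random action ar."""
    if random.random() < epsilon:
        return random.randrange(len(Q_row))                  # explore
    return max(range(len(Q_row)), key=lambda a: Q_row[a])    # exploit

print(eg_policy([0.1, 0.9, 0.3], epsilon=0.0))  # 1 (pure exploitation)
```

With epsilon fixed at 0 the policy is purely greedy; the EG approach instead starts epsilon near 1 and decreases it, which is the decay step the proposed ESA later restructures.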

Proposed methodology

Proposed enhanced simulated annealing (ESA) algorithm

The proposed ESA algorithm is the enhanced version of the standard SA method. In the standard SA method, the new solution is found using the solution function and compared with the old for calculating the difference value (Kirkpatrick et al., Reference Kirkpatrick, Gelatt and Vecchi1983). The difference value is compared with the temperature (temp) factor, and the next decision is taken. For annealing, the temperature reduction factor (β) is used. The standard SA algorithm is given in Algorithm 2.

Algorithm 2. The Standard Simulated Annealing Algorithm.

  1. Set the parameters: (temp, β)

  2. solution_new = sol_function

  3. diff = solution_new − solution_prev

  4. repeat

  5. if diff < 0 or e^(−diff/temp) > random(0, 1):

  6. solution = solution_new

  7. temp = temp * β

  8. until the stopping criterion is satisfied
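Algorithm 2 becomes runnable once a cost function and a neighbor move are supplied; the 1-D objective below is purely illustrative.

```python
import math
import random

def simulated_annealing(cost, neighbor, x0, temp=100.0, beta=0.95,
                        iters=2000):
    """Standard SA (Algorithm 2): accept a worse neighbor with
    probability e^(-diff/temp), then cool with temp = temp * beta."""
    x, fx = x0, cost(x0)
    best, fbest = x, fx
    for _ in range(iters):
        y = neighbor(x)
        diff = cost(y) - fx
        # Accept improvements always; accept worse moves with
        # temperature-dependent probability.
        if diff < 0 or math.exp(-diff / temp) > random.random():
            x, fx = y, fx + diff
            if fx < fbest:
                best, fbest = x, fx
        temp *= beta
    return best, fbest

# Illustrative objective: minimize (x - 3)^2 starting from x0 = 0.
random.seed(0)
best, fbest = simulated_annealing(lambda x: (x - 3) ** 2,
                                  lambda x: x + random.uniform(-1, 1),
                                  x0=0.0)
```

Early on, the high temperature accepts almost any move (broad exploration); as temp shrinks geometrically, only improving moves survive, which is the convergence behavior the next paragraph discusses.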

In SA, the temperature has to be reduced in a phased manner from its initial rate. This helps the algorithm achieve convergence. In ESA, the additional parameters ε, λ, and tr are used. The SA needs a more structured decreasing rate for the temperature. In order to solve this issue, λ and tr are added. λ and tr denote the temperature decay factor and the temperature regularizing criterion, respectively. Their values are given between 0 and 1. The temperature regularizing criterion is used to reduce the temperature rate in a slow-phased and structured manner, whereas the temperature decay factor is used to maintain the temperature value within the positive range of values.

Based on these parameters, the equation for ESA is formulated. The epsilon gets decreased for each run based on Eq. (10). Then, a decision is taken based on the comparison with random values as given in Algorithm 3.

(10) $$ \varepsilon =\varepsilon -\left(1/\left(\mathit{\log}\left( temp+\lambda \right)+{t}_r\right)\right) $$

Algorithm 3. Proposed Enhanced Simulated Annealing (ESA) Algorithm.

  1. Set the parameters: (temp, λ, tr)

  2. solution_new = sol_function

  3. diff = solution_new − solution_prev

  4. repeat

  5. if diff < 0 or ε > random(0, 1):

  6. solution = solution_new

  7. ε = ε − (1 / (log(temp + λ) + tr))

  8. until the stopping criterion is satisfied
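The decay step of Eq. (10) can be sketched as a small Python helper (a minimal sketch with illustrative names):

```python
import math

def esa_epsilon_decay(epsilon, temp, lam, t_r):
    """One ESA decay step, Eq. (10):
    epsilon = epsilon - 1 / (log(temp + lam) + t_r).
    lam is the temperature decay factor and t_r the temperature
    regularizing criterion; both lie in (0, 1)."""
    return epsilon - 1.0 / (math.log(temp + lam) + t_r)
```

Because temp is set to the total number of iterations, a larger iteration budget yields a smaller per-step decrement, i.e. a slower, more structured decay of ε.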

Proposed optimized Q-learning (OQL) approach

The existing epsilon-greedy (EG) approach in CQL relies heavily on randomness when decreasing the epsilon rate, resulting in an unstructured decay of epsilon values. A structured decrease in the epsilon rate is required to obtain more accurate results from the algorithm. The proposed ESA approach acts as a good optimization technique to gradually reduce the epsilon value. Its cooling schedule, which reduces the exploration rate over time, aligns well with the RL learning process. This approach is used in conjunction with the standard EG approach to make the eventual decay of the epsilon value more controlled and organized. To define the standard values for the various parameters in this proposed OQL technique, the temperature (temp) is defined as the total number of iterations. This is because the more iterations the program runs, the less exploration is required to find the best solution.

Thus, 1/log(temp) decreases the epsilon value based on the number of iterations consumed by the problem. Additional parameters are used to add more regularization to the decaying epsilon values. The value of the decay rate (λ) is set to (1 − α) due to the interdependence of the exploration and learning factors. The decreasing rate of epsilon is also based on the learning rate value of the algorithm. As the agent learns more about the environment, less effort is required for exploration. There is a need for a regularizing value that further adds structure to the decreasing rate. For that, the temperature regularizing criterion (tr) is set to the range (0, 1) to maintain the same proportion as the other parameters used in RL. The algorithm for the proposed OQL technique is given in Algorithm 4.

Algorithm 4. Proposed Optimized Q-Learning Technique (OQL) Algorithm.

  1. Set the parameters: (learning rate α)

  2. For each pair of state (s) & action (a), Q-matrix(s, a) = 0

  3. Observe the state s

  4. repeat

  5. Select the action a using the OQL method

    ε ∈ (0, 1); ¥ = random(0, 1); tr = 0.1995

    temp = no. of iterations; λ = (1 − α)

    ε = ε − 1 / (log(temp + λ) + tr)

    If ¥ > ε:

    Select the best action

    Else:

    Select a random action

  6. Take the action a

  7. Receive immediate reward r(s, a)

  8. Observe the new state s′

  9. Update Q(s, a) with Eq. (6)

  10. s = s′

  11. until the stopping criterion is satisfied
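The action-selection step of Algorithm 4 can be sketched as follows (a hedged illustration; `q_row` holds the Q-values of the current state, the feasible-action list comes from the precedence data, and the helper names are hypothetical):

```python
import math
import random

def decayed_epsilon(epsilon, temp, alpha, t_r=0.1995):
    """OQL decay of the exploration rate: lam = 1 - alpha and temp is
    the number of iterations, as in Algorithm 4; epsilon is floored at 0."""
    lam = 1.0 - alpha
    return max(0.0, epsilon - 1.0 / (math.log(temp + lam) + t_r))

def select_action(q_row, feasible_actions, epsilon, rng=random):
    """Epsilon-greedy selection over the feasible actions: exploit the
    best-known Q-value when the random draw exceeds epsilon, else explore."""
    if rng.random() > epsilon:
        return max(feasible_actions, key=lambda a: q_row[a])  # exploit
    return rng.choice(feasible_actions)                       # explore
```

Calling `decayed_epsilon` once per episode gives the structured decay that replaces the purely random EG schedule of CQL.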

For the proposed ESA [Eq. (10)], the following values are taken: epsilon (ε) = 1.0; temperature (temp) = no. of iterations; λ = 1 − learning rate (α); and temperature regularizing criterion (tr) = 0.1995.

To determine the optimal value for tr, a range of values from 0.01 to 0.2 was tested through a series of experiments. It was observed that initially the values of (Df) showed significant variation, but they eventually converged when the value of tr exceeded 0.1995. These experiments were conducted across all eight different products, and the results consistently indicated that the OQL method achieved better convergence and higher quality solutions when tr was equal to or greater than 0.1995. Based on these findings, tr is chosen as 0.1995. Since the OQL technique offers better convergence and higher quality solutions by controlling the decay of epsilon values in a structured manner, this proposed OQL technique helps in obtaining the required results in less time compared to CQL.

Proposed RL framework for DSP

RL is preferred for solving DSP because of its efficiency on optimization problems. Supervised and unsupervised learning techniques are not used because large datasets are unavailable for the DSP problem. DSP requires processing the disassembly attributes and generating the solution based on the objectives without any training process, which differs from the usual problems solved with supervised/unsupervised learning techniques. To apply RL to DSP, the precedence, stability, liaison, and geometric feasibility data are taken as conditions for generating the reward and penalty values. The initial state is the complete product with all parts connected, and the terminal state is the disassembled product with individual parts. The disassembly sequence is the required output. The primary objective of the RL agent is to generate a sequence that consumes less disassembly time and cost. The disconnection of each part of the product is treated as an action. Thus, the state and action of DSP are formulated based on the parts already disassembled and the parts remaining to be disassembled.

The main objective of this RL framework is to prepare the agent to predict an efficient DSP sequence that is feasible and optimal with minimal disassembly directions. The RL structure for DSP is given as follows:

States: The states are based on the total number (n) of parts of the product to be disassembled. Initially, the complete product with all the parts is the starting state. As each part is removed from the product, the state changes, and in the final state, only the last part is left.

Action: Action is the disassembling of one part from the remaining parts of the product. Action is taken based on the information of the set of parts that can be possibly disassembled given the current state. In each state, an action is taken (a part is removed), and it is carried out till there is no part left to disassemble.

Rewards: The reward function is defined to associate the disassembly time/cost with the dismantling process. For each action (disassembly of a part), a reward is given. The reward generation is based on the disassembly fitness (Df) function given in Eq. (11). The better the optimality, the higher the reward.

(11) $$ Reward\;(R)=1000\ast \left({D}_f\right) $$
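The state/action/reward formulation above can be sketched as a minimal model (a simplified illustration in which parts are numbered 0..n−1 and a state is the set of still-attached parts; the helper names are hypothetical):

```python
def initial_state(n_parts):
    """Starting state: the complete product with all n parts attached."""
    return frozenset(range(n_parts))

def disassemble(state, part):
    """Action: remove one part from the set of still-attached parts."""
    if part not in state:
        raise ValueError("part already disassembled")
    return state - {part}

def is_terminal(state):
    """Terminal state: every part has been removed."""
    return not state

def reward(d_f):
    """Eq. (11): the reward is proportional to the disassembly fitness D_f."""
    return 1000.0 * d_f
```

Representing a state as an immutable set of remaining parts makes it directly usable as a Q-table key.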

RL-DSP methodology

The RL-DSP methodology proposed in this work is built on four steps.

  1. 1. The stability, liaison, and geometric data are processed to generate the cost (Dc) and time values (Dt) of the disassembly. Then, the disassembly fitness (Df) value is calculated based on the Dc and Dt values.

  2. 2. The Df values are given as the reward matrix to the RL program; then, the actions are taken based on the OQL method. This is followed by the feasibility checking process using precedence data.

  3. 3. Once the feasibility conditions are satisfied, the Q-table is generated based on the states, actions, and the QL formula given in Eq. (8).

  4. 4. Based on the Q-table values, the RL agent generates the disassembly sequences. The parameters tr, α, γ, and λ are tuned to get better optimality.

The flowchart of the proposed RL-DSP framework is shown in Figure 8. The calculation process of fitness and reward values is given in Algorithm 5. The DSP attributes are considered, and the Df value is calculated based on the Dc and Dt values. The Dc and Dt values are allotted to each attribute based on their importance in generating feasible and optimal sequences. Reward values are declared as directly proportional to Df. So, the main objective of the agent will be to maximize the rewards, thereby increasing the fitness value.

Algorithm 5. Proposed RL-DSP Algorithm.

  1. Initialize RL-DSP product data (stability, liaison, geometric feasibility); Dp = 1

  2. Calculate Dc, Tp, Dt, Df

  3. repeat

  4. If L = 0,

    Dc = 15;

    Else,

    Dc = 1;

  5. If S = 0,

    Dc = Dc + 10

    Else if S = 1,

    Dc = Dc + 5

    Else Dc = Dc

  6. If Ocij < 2, then Tp = Tp + 3

    Else, Tp = 0

  7. Dt = Tp + Dp

  8. Df = 1/(Dc + Dt)

  9. Reward = 1000*Df

  10. until the stopping criterion is satisfied
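Steps 4–9 of Algorithm 5 can be condensed into a per-connection Python sketch (an illustration under the assumption that the liaison value L, stability value S, and geometric feasibility count Oc_ij are supplied for each part connection; the function name is hypothetical):

```python
def connection_fitness(liaison, stability, oc, d_p=1):
    """Per-connection cost, time, fitness, and reward following Algorithm 5.
    liaison (L) and stability (S) come from the product's liaison and
    stability data; oc is the geometric feasibility count Oc_ij and
    d_p is the base disassembly time (set to 1 in Algorithm 5)."""
    d_c = 15 if liaison == 0 else 1      # heavy cost when no liaison exists
    if stability == 0:
        d_c += 10                        # unstable removal: largest penalty
    elif stability == 1:
        d_c += 5                         # partially stable removal
    t_p = 3 if oc < 2 else 0             # time penalty for few feasible directions
    d_t = t_p + d_p
    d_f = 1.0 / (d_c + d_t)
    return d_f, 1000.0 * d_f             # fitness Df and reward (Eq. 11)
```

Because Df is the reciprocal of the summed cost and time, cheaper and faster removals earn proportionally higher rewards.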

Figure 8. Proposed reinforcement learning framework for DSP – flowchart.

The rewards’ data are taken as the input of the OQL algorithm. A part is selected at random, and the Q-value calculation is initiated, followed by the generation of the Q-values of the subsequent part connections. The OQL procedure is followed to take the proper action after analyzing the exploration and exploitation factors. Based on that analysis, the algorithm either searches for new solutions in the entire search space or searches for the best solution within the explored space. The action is then checked against the feasibility conditions based on the precedence matrix (Pr). If it satisfies the criteria, the action is selected; otherwise, another action is taken. This process is repeated until all the part connections are processed. Then, the Q-values are calculated for all the part disassemblies, and the final sequence is generated. Finally, the disassembly sequence with the highest reward is chosen as the solution by the agent.
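The precedence-based feasibility check described above can be sketched as follows (a simplified illustration in which the precedence matrix Pr is represented as a hypothetical dict mapping each part to the set of parts that must be removed before it):

```python
def is_feasible(part, removed, precedence):
    """Feasibility check against the precedence data: a part may be
    disassembled only after all of its predecessors have been removed."""
    return precedence.get(part, set()) <= set(removed)
```

During sequence generation, any candidate action failing this check is discarded and another action is sampled.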

Experiments and analysis

Several algorithms presented in the current literature were analyzed for their performance, and the selection was based on the following factors:

  1. This study preferred methods widely employed for solving DSP problems across various products, including traditional methods such as brute force (BF), dynamic programming (DP), and GM. In the case of meta-heuristic methods, ACO and GA are often preferred, either in their original, enhanced, or hybrid forms. Hence, these methods are considered for comparison purposes.

  2. With this consideration, the recent advancement in DSP is the improved max–min ant system (IMMAS), which is included in the comparison (Xing et al., 2021).

  3. Some methods were excluded due to their inferior performance or unsatisfactory results (Alshibli et al., 2016; Ren et al., 2017).

  4. Additionally, many algorithms proposed in other research were product-specific, limiting their effectiveness across various products (Guo et al., 2019; Tseng et al., 2011; Yeh et al., 2012; Wu et al., 2019).

  5. The CQL method, which employs a standard QL approach, provides a necessary comparison to the proposed OQL method, an enhanced version of CQL.

Experimental setup

The RL-based DSP program is implemented as a Python program and tested on different products using the PyCharm IDE running on a Windows 10 system. The results are analyzed and visualized using the Matplotlib library. The system used for testing has the following specifications: 8 GB of random access memory (RAM), a 1 TB hard disk drive (HDD), and an Intel i5 6th-generation processor. To test the efficiency of the proposed RL-DSP technique, eight different products with disassembly attributes are taken into consideration. The product information is given in Table 2.

Table 2. Product details

Out of eight products, five are taken from the literature work from which the data are extracted manually, and for the remaining three products, the attributes are obtained from their CAD models. All these products are of different complexities with multidirectional disassembly feasibilities.

Influence of parameters

The proposed RL-DSP framework has been tested on eight different products with multiple parts across various episodes (100, 200, 500, and 1000). Along with the number of episodes, parameters such as the temperature regularizing criterion (tr), alpha (α), gamma (γ), and epsilon (ε) also play a crucial role in generating solutions. Different parameter settings have been explored to generate optimal sequences for the diverse set of products. Considering the satisfactory results obtained within 100 episodes, the default number of episodes is set to 100. To determine the appropriate tr value, fitness values of various disassembly sequences are compared with their respective tr values across 10 runs. The experimental study reveals that tr values below 0.1895 exhibit significant variation in solution quality. However, for tr values of 0.1995 and 0.2095, the quality of solutions remains consistent. This can be attributed to tr, which facilitates the reduction of temperature and the structured decay of the epsilon rate. Consequently, the algorithm achieves a more organized and controlled epsilon decay, leading to improved convergence. Similarly, the values of α = 0.825, γ = 0.35, and λ = 0.0125 are chosen based on their ability to produce high-quality results.

Based on the experimentation, it is concluded that higher learning rates (α) and lower discount factors (γ) contribute to the generation of optimal solutions. On the other hand, higher values of λ result in increased randomness of epsilon (ε), which negatively affects the convergence rate, leading to higher iterations and time consumption.

Evaluation metrics

For the evaluation of the proposed RL framework, three metrics are considered. They are time consumption, memory consumption, and the optimality (fitness value) of the solution. The time consumption given in seconds (s) denotes the period required by the various optimization methods to give the results. The memory consumption given in terms of megabytes (MB) refers to the amount of RAM consumed by the application (optimization methods) to run the program and provide the results. The optimality of the solution is determined by fitness values, where higher values indicate better quality results. An optimal sequence with a high fitness value exhibits feasibility, stability, and minimal directional changes. In this work, the optimality criterion is categorized as no-results, infeasible, near-optimal, and best near-optimal results for clarity and explanation purposes. Based on these three metrics, the performance of the RL-DSP framework (CQL and OQL) and other algorithms is compared.
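The two quantitative metrics can be collected with the Python standard library, roughly as follows (an illustrative measurement harness, not the authors' code; the function name is hypothetical):

```python
import time
import tracemalloc

def measure_run(solver, *args):
    """Record the wall-clock time (seconds) and peak memory (MB) of a
    single solver run, matching the two quantitative metrics used in
    the comparison; solver is any callable (e.g. a DSP algorithm)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = solver(*args)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak / (1024 * 1024)  # bytes -> megabytes
```

`tracemalloc` tracks only Python-level allocations, so figures obtained this way approximate, rather than exactly reproduce, whole-process RAM usage.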

Results and discussion

A detailed analysis and a comprehensive discussion interpreting the results based on its performance metrics are given in this section.

Time and memory observation

The exact methods explore all the possible solutions. Hence, they always produce correct results but consume more time for products with a large number of parts. Optimization algorithms such as ACO, GA, and IMMAS process only a selected set of possible solutions based on their strategies. Being strategic in their processing of solutions, these algorithms give near-optimal results for products of high complexity and matched optimal solutions for smaller or less complex products. Thus, they demonstrate lower consumption of time and memory compared to the exact methods. The gap in metrics such as time and memory between the tested algorithms is mainly due to the nature of their processing and searching mechanisms. Table 3 and Figure 9 show the time consumption comparisons of the various algorithms for the considered products. Similarly, the memory consumption comparison is given in Table 4 and Figure 10.

Table 3. Time consumption comparison of various algorithms

Legend: No Solution; Infeasible Solution; Near-Optimal Solution; Best Near-Optimal Solution.

Table 4. Memory consumption comparison of various algorithms

Legend: No Solution; Infeasible Solution; Near-Optimal Solution; Best Near-Optimal Solution.

Figure 9. Time consumption analysis chart.

Figure 10. Memory consumption analysis chart.

From Tables 3 and 4, it is evident that traditional algorithms exhibit increased time and memory consumption for products with larger part counts. The BF approach requires more than 500 seconds and 900 MB of memory for products with more than nine parts. The DP approach experiences longer execution time and memory usage for 21-part products and does not consistently yield optimal results. Similarly, the GM consumes more time and memory for products with 16 and 21 parts, failing to provide optimal solutions for products with fewer parts (8, 12, and 13). In terms of optimization algorithms, IMMAS demonstrates superior time performance compared to ACO and GA.

Additionally, IMMAS delivers best near-optimal solutions for products of up to 13 parts. Following IMMAS, GA exhibits satisfactory performance in terms of time and memory consumption for sequence generation compared to ACO. ACO performs better than all the traditional algorithms but falls short when compared to other optimization algorithms such as GA and IMMAS. In the case of RL methods, both the CQL and the proposed OQL outperform other algorithms in terms of solution quality. The advantage of RL methods is their efficient learning and decision-making processes. They offer best near-optimal solutions for all products by learning from the consequences of past actions, while minimizing time and memory consumption.

When comparing CQL and the proposed OQL, it is observed that OQL consumes less time and memory than CQL. From this observation, the OQL approach appears to perform better for products with more parts. The graphical representation of these data in Figures 9 and 10 indicates that the overall time and memory consumption of the RL methods is lower than that of the other algorithms. Moreover, the time and memory consumption of the RL methods increases steadily as the number of parts in the product rises. When comparing the CQL and OQL methods, the OQL technique demonstrates a slight advantage and performs better for products with more parts.

Fitness and reward value observation

The fitness values for the disassembly sequences generated by all the considered algorithms are presented in Table 5 and visualized in Figure 11. High fitness values indicate stable sequences with minimal directional changes. From Figure 11, it is observed that both RL methods (CQL and OQL) consistently generate sequences with high fitness values for all the products considered. On the other hand, traditional algorithms, due to their high time consumption, are limited to generating solutions within 900 seconds, resulting in poorer fitness values. Both RL methods, CQL and OQL, provide best near-optimal solutions with slight differences in fitness values for large products. Although this difference may seem small, it can have a significant impact on the performance of larger and more complex products.

Table 5. Fitness value comparison of various algorithms

Legend: No Solution; Infeasible Solution; Near-Optimal Solution; Best Near-Optimal Solution.

Figure 11. Fitness value analysis chart.

To compare the performance of CQL and OQL techniques in the RL aspect, the rewards obtained over 100 episodes by both RL methods for four products with 4, 8, 12, and 21 parts are analyzed and visualized in Figure 12. These four products are selected specifically to showcase the variation in rewards obtained in both RL methods.

Figure 12. Reward vs episode comparison – CQL and OQL.

Observing Figure 12, it is evident that for the 4-part product, the OQL method achieves the highest reward at the 40th episode and then starts converging, while the CQL approach reaches its highest reward only at the end of 80 episodes. In the case of the 8-part product, convergence occurs within 50 episodes with OQL, while CQL starts converging only from the 80th episode. For the 12-part product, convergence begins at 50 episodes with OQL, whereas CQL achieves convergence only after 70 episodes. In the case of the large product with 21 parts, OQL produces the best result after 50 episodes, while CQL, with variations, fails to achieve a high-reward result within 100 episodes.

Summarizing the results, it is clear that OQL achieves good solutions within 100 episodes, with the rewards stabilizing beyond 60 episodes on average for all the considered products. In contrast, CQL exhibits more variations in rewards and only reaches good solutions toward the end of the 100 episodes in most of the cases. Based on this experimental study, it is evident that the OQL method shows significant improvements over CQL with the EG approach. The introduction of the ESA technique in OQL allows for a more structured and organized decay of epsilon values, resulting in faster convergence and higher-quality solutions compared to the CQL method. The results clearly indicate that the OQL method can provide better results faster than CQL, making it a more efficient approach for addressing the DSP problem.

Conclusion

The need for an efficient DSP method for effectively managing the repair–reuse–recycle (RRR) process has been addressed in this work. An RL framework has been proposed for the DSP problem. The QL technique has been employed for generating optimal disassembly sequences. To resolve the exploration–exploitation dilemma, an OQL method based on the proposed ESA technique has been introduced. The proposed RL framework with the OQL approach outperforms the standard benchmark algorithms and state-of-the-art frameworks in terms of time consumption, memory consumption, and solution optimality. The optimality of the solution has been evaluated using the DSP objective function. The results have demonstrated that the proposed RL-DSP framework is effective for various products and yields best near-optimal results. In conclusion, this work has demonstrated that the DSP problem can be effectively solved using the RL approach. Moreover, when the proposed ESA method is incorporated into the QL technique, it has been shown to produce superior results compared to the CQL method. In future work, the employment of deep RL (DRL) techniques to handle large products with multiple parts, connections, and sub-connections is planned.

Data availability statement

The data that support the findings will be available upon valid request.

Funding statement

This work received no specific grant from any funding agency, commercial or not-for-profit sectors.

Competing interest

The authors declare none.

M.C. is a Research Scholar in the Department of Computer Science and Engineering at the National Institute of Technology Puducherry, Karaikal. His research interests include artificial intelligence, machine learning, and optimization.

C.R. is an Assistant Professor in the Department of Computer Science and Engineering at the National Institute of Technology Puducherry, Karaikal. His areas of interest include artificial intelligence, soft computing, and augmented reality.

References

Alshibli, M, El Sayed, A, Kongar, E, Sobh, TM and Gupta, SM (2016) Disassembly sequencing using Tabu search. Journal of Intelligent and Robotic Systems: Theory and Applications 82(1), 6979. https://doi.org/10.1007/s10846-015-0289-9.CrossRefGoogle Scholar
Anil Kumar, G, Bahubalendruni, MVAR, Prasad, VSS and Sankaranarayanasamy, K (2021) A multi-layered disassembly sequence planning method to support decision making in de-manufacturing. Sadhana - Academy Proceedings in Engineering Sciences 46(2), 116. https://doi.org/10.1007/S12046-021-01622-3.Google Scholar
Azab, A, Ziout, A and ElMaraghy, W (2011) Modeling and optimization for disassembly planning. Jordan Journal of Mechanical and Industrial Engineering 5(1), 18.Google Scholar
Bahubalendruni, MVAR and Varupala, VP (2021) Disassembly sequence planning for safe disposal of end-of-life waste electric and electronic equipment. National Academy Science Letters 44(3), 243247. https://doi.org/10.1007/S40009-020-00994-0.CrossRefGoogle Scholar
Beigy, H and Meybodi, MR (2000) Adaptation of parameters of BP algorithm using learning automata. In Proceedings - Brazilian Symposium on Neural Networks, 2000-January, vol 1, pp. 2431. IEEE, Rio de Janeiro, Brazil. https://doi.org/10.1109/SBRN.2000.889708CrossRefGoogle Scholar
Beigy, H and Meybodi, MR (2001) Backpropagation algorithm adaptation parameters using learning automata. International Journal of Neural Systems 11(3), 219228. World Scientific, Singapore. https://doi.org/10.1142/S0129065701000655.CrossRefGoogle ScholarPubMed
Chand, M and Ravi, C (2023) A state-of-the-art literature survey on artificial intelligence techniques for disassembly sequence planning. CIRP Journal of Manufacturing Science and Technology 41, 292310. https://doi.org/10.1016/j.cirpj.2022.11.017.CrossRefGoogle Scholar
Chang, MML, Nee, AYC and Ong, SK (2020) Interactive AR-assisted product disassembly sequence planning (ARDIS). International Journal of Production Research, 58(16), 49164931. https://doi.org/10.1080/00207543.2020.1730462CrossRefGoogle Scholar
De Floriani, L and Nagy, G (1989) Graph model for face-to-face assembly. ICRA, 1, 7578. https://doi.org/10.1109/robot.1989.99970.Google Scholar
Ghandi, S and Masehian, E (2015) Assembly sequence planning of rigid and flexible parts. Journal of Manufacturing Systems 36, 128146. https://doi.org/10.1016/jjmsy.2015.05.002.CrossRefGoogle Scholar
Giudice, F and Fargione, G (2007) Disassembly planning of mechanical systems for service and recovery: A genetic algorithms based approach. Journal of Intelligent Manufacturing 18(3), 313329. https://doi.org/10.1007/s10845-007-0025-9.CrossRefGoogle Scholar
Gunji, BM, Pabba, SK, Rajaram, IRS, Sorakayala, PS, Dubey, A, Deepak, BBVL, Biswal, BB and Bahubalendruni, MVAR (2021) Optimal disassembly sequence generation and disposal of parts using stability graph cut-set method for end of life product. Sadhana - Academy Proceedings in Engineering Sciences 46(1), 21. https://doi.org/10.1007/S12046-020-01525-9.Google Scholar
Guo, J, Zhong, J, Li, Y, Du, B and Guo, S (2019) A hybrid artificial fish swam algorithm for disassembly sequence planning considering setup time. Assembly Automation 39(1), 140153. https://doi.org/10.1108/AA-12-2017-180.CrossRefGoogle Scholar
Han, HJ, Yu, JM and Lee, DH (2013) Mathematical model and solution algorithms for selective disassembly sequencing with multiple target components and sequence-dependent setups. International Journal of Production Research 51(16), 49975010. https://doi.org/10.1080/00207543.2013.788794.CrossRefGoogle Scholar
Hui, W, Dong, X and Guanghong, D (2008) A genetic algorithm for product disassembly sequence planning. Neurocomputing 71(13), 27202726. https://doi.org/10.1016/j.neucom.2007.11.042.CrossRefGoogle Scholar
Issaoui, L, Aifaoui, N and Benamara, A (2017) Modelling and implementation of geometric and technological information for disassembly simulation in CAD environment. The International Journal of Advanced Manufacturing Technology 89(5), 17311741. https://doi.org/10.1007/s00170-016-9128-9.CrossRefGoogle Scholar
Kheder, M, Trigui, M and Aifaoui, N (2017) Optimization of disassembly sequence planning for preventive maintenance. International Journal of Advanced Manufacturing Technology 90(5–8), 13371349. https://doi.org/10.1007/s00170-016-9434-2.CrossRefGoogle Scholar
Kim, HW and Lee, DH (2017) An optimal algorithm for selective disassembly sequencing with sequence-dependent set-ups in parallel disassembly environment. International Journal of Production Research 55(24), 73177333. https://doi.org/10.1080/00207543.2017.1342879.CrossRefGoogle Scholar
Kirkpatrick, S, Gelatt, CD and Vecchi, MP (1983) Optimization by simulated annealing. Science 220(4598), 671680. https://doi.org/10.1126/SCIENCE.220.4598.671.CrossRefGoogle ScholarPubMed
Kuo, TC (2013) Waste electronics and electrical equipment disassembly and recycling using Petri net analysis: Considering the economic value and environmental impacts. Computers and Industrial Engineering 65(1), 5464. https://doi.org/10.1016/j.cie.2011.12.029.CrossRefGoogle Scholar
Liu, J, Zhou, Z, Pham, DT, Xu, W, Ji, C and Liu, Q (2020) Collaborative optimization of robotic disassembly sequence planning and robotic disassembly line balancing problem using improved discrete bees algorithm in remanufacturing ☆. Robotics and Computer-Integrated Manufacturing, 61, 101829. https://doi.org/10.1016/j.rcim.2019.101829.CrossRefGoogle Scholar
Liu, J, Zhou, Z, Pham, DT, Xu, W, Yan, J, Liu, A, Ji, C and Liu, Q (2018) An improved multi-objective discrete bees algorithm for robotic disassembly line balancing problem in remanufacturing. International Journal of Advanced Manufacturing Technology 97(9–12), 39373962. https://doi.org/10.1007/s00170-018-2183-7.CrossRefGoogle Scholar
Luo, Y, Peng, Q and Gu, P (2016) Integrated multi-layer representation and ant colony search for product selective disassembly planning. Computers in Industry 75, 1326. https://doi.org/10.1016/j.compind.2015.10.011.CrossRefGoogle Scholar
Ma, YS, Jun, HB, Kim, HW and Lee, DH (2011) Disassembly process planning algorithms for end-of-life product recovery and environmentally conscious disposal. International Journal of Production Research 49(23), 70077027. https://doi.org/10.1080/00207543.2010.495089.CrossRefGoogle Scholar
Meybodi, MR and Beigy, H (2000) A note on learning automata based schemes for adaptation of BP parameters. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 1983, pp. 145151. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44491-2_22.Google Scholar
Meybodi, MR and Beigy, H (2002) New learning automata based algorithms for adaptation of backpropagation algorithm parameters. International Journal of Neural Systems 12(1), 4567. World Scientific, Singapore. https://doi.org/10.1142/S012906570200090X.CrossRefGoogle ScholarPubMed
Mitrouchev, P, Wang, C and Chen, J (2016) Virtual disassembly sequences generation and evaluation. Procedia CIRP 44, 347352. https://doi.org/10.1016/j.procir.2016.02.001.CrossRefGoogle Scholar
Mitrouchev, P, Wang, C and Chen, J (2017) Disassembly process simulation in virtual reality environment. In Eynard, B, Nigrelli, V, Oliveri, SM, Fajarnes, GP and Rizzuti, S (eds.), Advances on Mechanics, Design Engineering and Manufacturing. Lecture Notes in Mechanical Engineering. Springer International Publishing, pp. 631638. https://doi.org/10.1007/978-3-319-45781-9_63.CrossRefGoogle Scholar
Osti, F, Ceruti, A, Liverani, A and Caligiana, G (2017) Semi-automatic design for disassembly strategy planning: An augmented reality approach. Procedia Manufacturing, 11, 14811488. https://doi.org/10.1016/j.promfg.2017.07.279.CrossRefGoogle Scholar
Ottoni, ALC, Nepomuceno, EG, de Oliveira, MS and de Oliveira, DCR (2021) Reinforcement learning for the traveling salesman problem with refueling. Complex & Intelligent Systems, 8, 20012015. https://doi.org/10.1007/s40747-021-00444-4.CrossRefGoogle Scholar
Petri, CA and Reisig, W (2008) Petri net. Scholarpedia 3(4), 6477. https://doi.org/10.4249/scholarpedia.6477.CrossRefGoogle Scholar
Ren, Y, Tian, G, Zhao, F, Yu, D and Zhang, C (2017) Selective cooperative disassembly planning based on multi-objective discrete artificial bee colony algorithm. Engineering Applications of Artificial Intelligence, 64, 415431. https://doi.org/10.1016/j.engappai.2017.06.025.CrossRefGoogle Scholar
Ren, Y, Zhang, C, Zhao, F, Xiao, H and Tian, G (2018) An asynchronous parallel disassembly planning based on genetic algorithm. European Journal of Operational Research 269(2), 647660. https://doi.org/10.1016/j.ejor.2018.01.055.CrossRefGoogle Scholar
Smith, S, Smith, G and Chen, WH (2012) Disassembly sequence structure graphs: An optimal approach for multiple-target selective disassembly sequence planning. Advanced Engineering Informatics 26(2), 306316. https://doi.org/10.1016/j.aei.2011.11.003.CrossRefGoogle Scholar
Sutton, RS and Barto, AG (2018) Reinforcement Learning: An Introduction. MIT Press Ltd, Massachusetts.Google Scholar
Syed Shahul Hameed, AS and Rajagopalan, N (2022) SPGD: Search party gradient descent algorithm, a simple gradient-based parallel algorithm for bound-constrained optimization. Mathematics 10(5), 800. https://doi.org/10.3390/math10050800.CrossRefGoogle Scholar
Syed Shahul Hameed, AS and Rajagopalan, N (2023) MABSearch: The bandit way of learning the learning rate—A harmony between reinforcement learning and gradient descent. National Academy Science Letters 1, 16. https://doi.org/10.1007/S40009-023-01292-1.Google Scholar
Tian, G, Ren, Y, Feng, Y, Zhou, MC, Zhang, H and Tan, J (2019a) Modeling and planning for dual-objective selective disassembly using AND/OR graph and discrete artificial bee colony. IEEE Transactions on Industrial Informatics 15(4), 2456–2468. https://doi.org/10.1109/TII.2018.2884845.
Tian, Y, Zhang, X, Liu, Z, Jiang, X and Xue, J (2019b) Product cooperative disassembly sequence and task planning based on genetic algorithm. International Journal of Advanced Manufacturing Technology 105(5–6), 2103–2120. https://doi.org/10.1007/s00170-019-04241-9.
Tian, G, Zhou, M and Chu, J (2013) A chance constrained programming approach to determine the optimal disassembly sequence. IEEE Transactions on Automation Science and Engineering 10(4), 1004–1013. https://doi.org/10.1109/TASE.2013.2249663.
Tian, G, Zhou, MC and Li, P (2018) Disassembly sequence planning considering fuzzy component quality and varying operational cost. IEEE Transactions on Automation Science and Engineering 15(2), 748–760. https://doi.org/10.1109/TASE.2017.2690802.
Tseng, HE, Chang, CC, Lee, SC and Huang, YM (2019) Hybrid bidirectional ant colony optimization (hybrid BACO): An algorithm for disassembly sequence planning. Engineering Applications of Artificial Intelligence 83, 45–56. https://doi.org/10.1016/j.engappai.2019.04.015.
Tseng, HE and Lee, SC (2018) Disassembly sequence planning using interactive genetic algorithms. In ICNC-FSKD 2018 – 14th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, pp. 77–84. IEEE, Huangshan, China. https://doi.org/10.1109/FSKD.2018.8686887.
Tseng, YJ, Yu, FY and Huang, FY (2011) A green assembly sequence planning model with a closed-loop assembly and disassembly sequence planning using a particle swarm optimization method. International Journal of Advanced Manufacturing Technology 57(9–12), 1183–1197. https://doi.org/10.1007/s00170-011-3339-x.
Ullerich, C (2014) Advanced Disassembly Planning: Flexible, Price-Quantity Dependent, and Multi-Period Planning Approaches. Springer Gabler, Wiesbaden. https://doi.org/10.1007/978-3-658-03118-3.
Vongbunyong, S, Kara, S and Pagnucco, M (2012) A framework for using cognitive robotics in disassembly automation. In Leveraging Technology for a Sustainable World – Proceedings of the 19th CIRP Conference on Life Cycle Engineering, pp. 173–178. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29069-5_30.
Vongbunyong, S, Kara, S and Pagnucco, M (2013) Application of cognitive robotics in disassembly of products. CIRP Annals – Manufacturing Technology 62(1), 31–34. https://doi.org/10.1016/j.cirp.2013.03.037.
Wang, JF, Liu, JH, Li, SQ and Zhong, YF (2003) Intelligent selective disassembly using the ant colony algorithm. Artificial Intelligence for Engineering Design, Analysis and Manufacturing 17(4), 325–333. https://doi.org/10.1017/S0890060403174045.
Wu, YJ, Cao, Y and Wang, QF (2019) Assembly sequence planning method based on particle swarm algorithm. Cluster Computing 22(s1), 835–846. https://doi.org/10.1007/s10586-017-1331-4.
Xing, Y, Wu, D and Qu, L (2021) Parallel disassembly sequence planning using improved ant colony algorithm. International Journal of Advanced Manufacturing Technology 113(7–8), 2327–2342. https://doi.org/10.1007/s00170-021-06753-9.
Xu, W, Tang, Q, Liu, J, Liu, Z, Zhou, Z and Pham, DT (2020) Disassembly sequence planning using discrete bees algorithm for human-robot collaboration in remanufacturing. Robotics and Computer-Integrated Manufacturing 62, 101860. https://doi.org/10.1016/j.rcim.2019.101860.
Yeh, W, Lin, C and Wei, S (2012) Disassembly sequencing problems with stochastic processing time using simplified swarm optimization. International Journal of Innovation, Management and Technology 3(3), 226–231.
Zhu, B, Sarigecili, MI and Roy, U (2013) Disassembly information model incorporating dynamic capabilities for disassembly sequence generation. Robotics and Computer-Integrated Manufacturing 29(5), 396–409. https://doi.org/10.1016/j.rcim.2013.03.003.
Figure 1. Ink–pen example.
Figure 2. Stability matrix.
Figure 3. Liaison matrix.
Figure 4. Disassembly feasibility matrix.
Figure 5. Precedence matrix.
Table 1. Sequence and directional data.
Figure 6. General framework of reinforcement learning process.
Figure 7. Classic Q-learning (CQL) flowchart.
Figure 8. Proposed reinforcement learning framework for DSP – flowchart.
Table 2. Product details.
Table 3. Time consumption comparison of various algorithms.
Table 4. Memory consumption comparison of various algorithms.
Figure 9. Time consumption analysis chart.
Figure 10. Memory consumption analysis chart.
Table 5. Fitness value comparison of various algorithms.
Figure 11. Fitness value analysis chart.
Figure 12. Reward vs episode comparison – CQL and OQL.