American Journal of Intelligent Systems
p-ISSN: 2165-8978 e-ISSN: 2165-8994
2016; 6(1): 1-13
doi:10.5923/j.ajis.20160601.01

Machine Learning Algorithms in Heavy Process Manufacturing
Karl Hansson, Siril Yella, Mark Dougherty, Hasan Fleyeh
The School of Technology and Business Studies, Dalarna University, Borlänge, Sweden
Correspondence to: Hasan Fleyeh , The School of Technology and Business Studies, Dalarna University, Borlänge, Sweden.
Copyright © 2016 Scientific & Academic Publishing. All Rights Reserved.
This work is licensed under the Creative Commons Attribution International License (CC BY).
http://creativecommons.org/licenses/by/4.0/

In a global economy, manufacturers mainly compete on the cost efficiency of production, as the price of raw materials is similar worldwide. Heavy industry has two big issues to deal with. On the one hand, there are large amounts of data which need to be analyzed in an effective manner; on the other hand, making big improvements via investments in corporate structure or new machinery is neither economically nor physically viable. Machine learning offers a promising way for manufacturers to address both these problems, as they are in an excellent position to employ learning techniques with their massive resource of historical production data. However, choosing a modelling strategy in this setting is far from trivial, and this is the objective of this article. The article investigates characteristics of the most popular classifiers used in industry today. Support Vector Machines, Multilayer Perceptrons, Decision Trees, Random Forests, and the meta-algorithms Bagging and Boosting are mainly investigated in this work. Lessons from real-world implementations of these learners are also provided, together with directions on when different learners are expected to perform well. The importance of feature selection and relevant selection methods in an industrial setting is further investigated. Performance metrics are also discussed for the sake of completeness.
Keywords: Heavy Process Manufacturing, Machine Learning, SVM, MLP, DT, RF, Feature Selection, Calibration
Cite this paper: Karl Hansson , Siril Yella , Mark Dougherty , Hasan Fleyeh , Machine Learning Algorithms in Heavy Process Manufacturing, American Journal of Intelligent Systems, Vol. 6 No. 1, 2016, pp. 1-13. doi: 10.5923/j.ajis.20160601.01.
Each sample has a labelled class variable, $y$, which specifies the class to which the sample belongs. The sample also contains a feature vector, $\mathbf{x}$, which specifies the feature values available to predict $y$. The vector $\mathbf{x}$ can contain both numerical and categorical data. Supervised learning is formalized as follows:

$$f : \mathbf{x} \mapsto y \qquad (1)$$

The classifier, $f$, optimizes some performance metric given the dataset, $D$.
To estimate generalization performance, k-fold cross-validation splits the dataset $D$ into k equally sized disjunct subsets, $D_1, \dots, D_k$. For each fold, one of the subsets, $D_i$, is held out as a validation set, while the learner is trained using the remaining subsets. The performance is then measured as the mean performance over the folds. The strength of cross-validation is that the entire dataset is used for both training and validation, where each point is validated exactly once. Common practice is to perform 10-fold cross-validation to find appropriate model complexity.
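As an illustration, the following is a minimal sketch of 10-fold cross-validation using scikit-learn; the synthetic dataset and the choice of a decision tree learner are stand-ins for illustration, not taken from the article's case studies.

```python
# Minimal sketch of 10-fold cross-validation, assuming scikit-learn and a
# synthetic tabular dataset X (feature vectors) and y (class labels).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))               # 200 samples, 5 numerical features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # synthetic binary class labels

clf = DecisionTreeClassifier(max_depth=3)

# Each sample is used for validation exactly once across the 10 folds;
# the reported performance is the mean accuracy over the folds.
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print(f"10-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```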
The hidden units use a sigmoid activation function, $\sigma(\cdot)$, which operates element-wise, and the output is thresholded with the Heaviside step-function, $H(\cdot)$. In this case, $W_1$ and $\mathbf{b}_1$ denote the weights and biases of the hidden layer, and $\mathbf{w}_2$ and $b_2$ those of the output unit. The classification itself is expressed using Equation 2.

$$\hat{y} = H\!\left(\mathbf{w}_2^{\top}\,\sigma\!\left(W_1\mathbf{x} + \mathbf{b}_1\right) + b_2\right) \qquad (2)$$
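The classification rule above can be written directly in a few lines of code. The following numpy sketch assumes a single hidden layer with weights W1, b1 and an output unit with weights w2, b2; the parameter values are illustrative placeholders rather than trained weights.

```python
# Minimal numpy sketch of the MLP classification rule in Equation 2: an
# element-wise sigmoid in the hidden layer and a Heaviside step on the output.
# In practice the weights are found by a training algorithm such as
# back-propagation; here they are random placeholders.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # operates element-wise

def heaviside(z):
    return (z >= 0).astype(int)            # step function for the final class decision

def mlp_classify(x, W1, b1, w2, b2):
    h = sigmoid(W1 @ x + b1)               # hidden layer activations
    return heaviside(w2 @ h + b2)          # hard 0/1 class label

# Toy parameters: 3 input features, 4 hidden units, 1 output unit.
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
w2, b2 = rng.normal(size=4), 0.0

print(mlp_classify(np.array([0.5, -1.2, 0.3]), W1, b1, w2, b2))
```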
Figure 1. The structure of a Decision Tree, where the class label is estimated via logical tests of the features in a feature vector, and the split values are numerical or categorical constants determined by the learning algorithm
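A small scikit-learn sketch of the structure described in Figure 1: a CART tree is fitted and its learned logical tests and split constants are printed. The iris dataset is only a stand-in for illustration.

```python
# Minimal sketch of a CART decision tree whose internal nodes are logical tests
# on feature values and whose split thresholds are constants learned from data.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3).fit(data.data, data.target)

# export_text prints the learned tests, e.g. "petal width (cm) <= 0.80".
print(export_text(tree, feature_names=data.feature_names))
```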
Breiman defines a Random Forest as "a classifier consisting of a collection of tree-structured classifiers $\{h(\mathbf{x}, \Theta_k), k = 1, \dots\}$ where the $\{\Theta_k\}$ are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input $\mathbf{x}$." What governs the strength of the RF is that each tree in the forest only sees a small part of the problem and specializes by only considering a small portion of the features. The individual trees in the forest are grown using the CART algorithm. Instead of letting each tree use all features, one samples p features which the tree is allowed to use as a basis for classification. Individual trees in the RF are not pruned, since bagging is used in the final prediction; it does not matter if individual trees overfit as long as there are sufficiently many trees in the forest as a whole. Breiman pointed to the strong law of large numbers to show that overfitting is not an issue for RF. It has also been shown that RF is robust against mislabelled, noisy, and missing samples, and that no special encoding scheme is needed. The RF is also fast in training and evaluation, and scales well with both the number of training samples and high-dimensional feature vectors. Few parameters need to be tuned, making it easy to implement without much training and testing.

However, the Random Forest has some weaknesses that are apparent in industrial applications. Since RF is an ensemble algorithm, it produces a rather complex model that is hard to comprehend. Looking at an individual tree gives little insight into the full model, and there is no clear way of visualizing the forest in its entirety. Furthermore, the model itself tends to be rather large, which can prove a limitation for implementation on weak hardware.

Unlike learners such as SVM and MLP, neither of the tree learners produces a smooth decision surface in any vector space. This behaviour comes from the fact that the trees perform binary splits. This enables the tree learners to work nicely on problems that are discrete, such as market forecasting [21].
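The following scikit-learn sketch mirrors the description above: many unpruned trees, each split considering only a small sample of the features. The synthetic dataset and parameter values are illustrative assumptions (note that scikit-learn samples features per split rather than once per tree).

```python
# Minimal sketch of a Random Forest, assuming scikit-learn; max_features controls
# how many features each split may consider, and individual trees are left unpruned.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=40, n_informative=8,
                           random_state=0)

rf = RandomForestClassifier(
    n_estimators=300,        # enough trees that individual overfitting averages out
    max_features="sqrt",     # each split samples only a small subset of the features
    n_jobs=-1,
    random_state=0,
)

print(cross_val_score(rf, X, y, cv=10).mean())
```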
Given a dataset $D$ with size n, bagging re-samples $D$ uniformly with replacement into m new datasets, $D_1, \dots, D_m$, each with size $n'$. Usually one chooses $n' = n$. In this case, roughly 63% of the samples are expected to be unique within each of the produced subsets. The value of m should be sufficiently large in order to utilize all samples at least once. After re-sampling, a model of choice is fitted for each $D_i$. The final prediction is made by a majority vote over all produced models, each being equally weighted.

Bagging is especially suitable when there are many features compared to the number of samples. For example, the number of changeovers for a manufacturer might be in the range of hundreds per year while the number of features is in the range of thousands. Bagging is also widely used in other fields where the number of features is large compared to the number of samples, such as bioinformatics [23].

Some learners, like RF, have bagging incorporated as a part of the algorithm itself, which contributes to their strength in generalization. For these types of learners, there is no need to apply extra bagging as a post-improvement method on the produced classifier.
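A minimal sketch of bagging with scikit-learn, following the recipe above: m bootstrap resamples of size n' = n, one decision tree fitted per resample, and an equally weighted majority vote. The synthetic "few samples, many features" dataset is an assumption for illustration.

```python
# Minimal sketch of bagging, assuming scikit-learn: bootstrap resamples of the
# data, one model fitted per resample, and an equally weighted majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# A setting loosely mimicking "few samples, many features".
X, y = make_classification(n_samples=200, n_features=1000, n_informative=20,
                           random_state=0)

bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,     # m: large enough that every sample is used at least once
    max_samples=1.0,      # n' = n, so roughly 63% unique samples per resample
    bootstrap=True,       # sample uniformly with replacement
    random_state=0,
).fit(X, y)

print(bag.predict(X[:5]))
```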
Many classifiers produce a continuous output score, and a discrimination threshold is used to determine if the output should belong to one class or the other.

$$\hat{y} = \begin{cases} 1, & f(\mathbf{x}) \geq \theta \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$
The Mean Squared Error (MSE) of a classifier $f$ is calculated as follows:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(f(\mathbf{x}_i) - H(\mathbf{x}_i)\right)^2 \qquad (5)$$

where H is the true classifier. With the standard definition of variance, the MSE can be reformulated as:

$$\mathrm{MSE} = \operatorname{Var}\!\left(f(\mathbf{x}) - H(\mathbf{x})\right) + \left(\mathbb{E}\!\left[f(\mathbf{x}) - H(\mathbf{x})\right]\right)^2 \qquad (6)$$
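A small numeric check of the reformulation in Equation 6 as reconstructed above: the mean squared error of the prediction error f − H equals its variance plus its squared mean. The simulated outputs are purely illustrative.

```python
# Numeric sketch of Equations 5-6: the MSE of the prediction error f - H
# equals its variance plus its squared mean (the systematic bias).
import numpy as np

rng = np.random.default_rng(0)
H = rng.integers(0, 2, size=1000).astype(float)               # true classes
f = np.clip(H * 0.7 + 0.2 + rng.normal(0, 0.1, 1000), 0, 1)   # probabilistic outputs

err = f - H
mse = np.mean(err ** 2)                        # Equation 5
decomposed = np.var(err) + np.mean(err) ** 2   # Equation 6

print(mse, decomposed)   # the two values agree
```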
Platt argued from empirical experience that a logistic regression, as illustrated in Equation 8, is a good transformation of outputs for many real-world problems.

$$P(y = 1 \mid f(\mathbf{x})) = \frac{1}{1 + \exp\!\left(A f(\mathbf{x}) + B\right)} \qquad (8)$$

where the parameters A and B are fitted to the training data.
Figure 2. Decision process to choose a calibration technique
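For completeness, a minimal sketch of Platt scaling with scikit-learn, where method="sigmoid" fits the logistic transformation of Equation 8 to held-out scores; the SVM base learner and the synthetic data are assumptions for illustration.

```python
# Minimal sketch of Platt scaling, assuming scikit-learn: the raw scores of an
# SVM are passed through a fitted sigmoid ("sigmoid" is scikit-learn's name for
# Platt's method) to obtain calibrated class probabilities.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5).fit(X, y)
print(calibrated.predict_proba(X[:5]))   # calibrated class probabilities
```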
Consider a dataset $D$ consisting of N samples and M features. The task of feature selection is to find a compact subset of features, $S$, where $S$ imposes no significant loss in predictive information regarding the target class compared to the full feature set.
When working with thousands of different features, some are bound to be important in describing the process while others are virtually meaningless. There are three main objectives in feature selection:

1. Provide better insight into the underlying process. If the dimensionality of the problem can be reduced, it becomes easier for domain experts to analyse and validate produced models.
2. Since machine learning algorithms fit their view of reality based on the data they are trained on, they can have a tendency to find patterns within noise. This means that if there are too few samples with a high dimensionality, there is a substantial risk that a model encodes the noise and overfits. With a compact feature set, better generalization can be achieved.
3. If a predictor is to be implemented, limiting the feature set makes models smaller, faster, and more cost-efficient.

When choosing a feature selection approach there are, as always, multiple things to consider: the optimality of feature sets, the time taken to perform feature selection, the complexity of the problem to model, and the actual structure of the data. For more discussion regarding the subject, see Guyon and Elisseeff's introduction to feature selection [33].

In feature selection, there are three main classes of methods for performing the selection of variables, namely wrappers, filters, and embedded methods; a filter-style example is sketched below.

Figure 3. Decision process for which feature selection technique to employ
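The sketch below shows a simple filter-style selection with scikit-learn: features are ranked by mutual information with the class and a compact subset is kept. The synthetic data and the choice of k are illustrative assumptions; for a rigorous comparison the selection should be performed inside each cross-validation fold.

```python
# Minimal sketch of filter-style feature selection, assuming scikit-learn:
# rank the M features by mutual information with the class, keep a compact
# subset, and compare cross-validated accuracy on the full and reduced sets.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=1000, n_informative=10,
                           random_state=0)

selector = SelectKBest(mutual_info_classif, k=20).fit(X, y)
X_small = selector.transform(X)          # the compact subset of features

clf = DecisionTreeClassifier(random_state=0)
print("all features:", cross_val_score(clf, X, y, cv=5).mean())
print("selected 20 :", cross_val_score(clf, X_small, y, cv=5).mean())
```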