Why is feature extraction necessary




















Word frequency refers to the number of times a word appears in a text. Feature selection by word frequency means deleting the words whose frequencies fall below a certain threshold in order to reduce the dimensionality of the feature space. This method rests on the hypothesis that words with low frequencies have little impact on filtration [3, 11, 12]. However, in information retrieval research it is held that words occurring less frequently sometimes carry more information.

Therefore, it is inappropriate to delete a large number of words during feature selection based solely on word frequency [11, 12].
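As a minimal illustration of this kind of frequency-based pruning (the threshold, the toy documents, and the helper name prune_by_frequency are arbitrary assumptions, not part of any cited method), a short Python sketch:

```python
from collections import Counter

def prune_by_frequency(tokenized_docs, min_count=3):
    """Drop candidate feature words whose corpus frequency falls below min_count."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    vocabulary = {word for word, count in counts.items() if count >= min_count}
    # Keep only the retained vocabulary in every document.
    return [[token for token in doc if token in vocabulary] for doc in tokenized_docs]

docs = [["cheap", "pills", "online"], ["meeting", "agenda", "online"], ["cheap", "offer"]]
print(prune_by_frequency(docs, min_count=2))  # only "cheap" and "online" survive
```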

MI (mutual information) [13, 14], which measures the mutual dependence of two objects, is a common tool in the analysis of computational linguistic models. In filtration, it is employed to measure how well features discriminate among topics.

The definition of mutual information is similar to that of cross entropy. Mutual information, originally a concept in information theory, is used to represent relationships between information and serves as a statistical measure of the correlation between two random variables [13, 14].

Using mutual information theory for feature extraction is based on the hypothesis that a word whose frequency is high in a certain class but low in the others has relatively large mutual information with that class.

Usually, mutual information is used as the measure between a feature word and a class: if the feature word belongs to the class, their mutual information is largest. Since this method requires no assumptions about the nature of the relationship between feature words and classes, it is well suited to registering the association between text classification features and classes [14].
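To make the idea concrete, a small sketch of the usual pointwise form of mutual information between a term and a class, computed from document counts; the argument names and toy numbers are illustrative assumptions rather than the exact formulation of [13, 14]:

```python
import math

def mutual_information(n_tc, n_t, n_c, n):
    """Pointwise mutual information MI(t, c) = log( P(t, c) / (P(t) * P(c)) ).
    n_tc: documents of class c containing term t
    n_t:  documents containing term t
    n_c:  documents belonging to class c
    n:    total number of documents
    """
    if n_tc == 0:
        return float("-inf")  # t never co-occurs with c
    return math.log((n_tc / n) / ((n_t / n) * (n_c / n)))

# A term that is frequent in class c but rare elsewhere scores high.
print(mutual_information(n_tc=40, n_t=50, n_c=100, n=1000))  # strongly associated (> 0)
print(mutual_information(n_tc=5,  n_t=50, n_c=100, n=1000))  # roughly independent (~ 0)
```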

The time complexity of computing mutual information is similar to that of information gain, and its mean value over classes is the information gain. The deficiency of mutual information is that its score is strongly affected by the marginal probabilities of words [13, 14]. IG (information gain) is a common method in machine learning. In filtration, it is used to measure whether a known feature appears in a text on a certain relevant topic and how much predictive information about the topic it provides.

By computing information gain, features that occur frequently in positive samples but rarely in negative ones (or the other way around) can be identified [15, 16]. Information gain, an evaluation method based on entropy, involves a good deal of mathematical theory and entropy-related formulas. It is defined as the amount of information that a feature item provides for the whole classification, measured not by the entropy of the feature itself but by the difference in entropy obtained with and without the feature [17].

According to the training data, the method computes the information gain of each feature item, deletes the items with small information gain, and ranks the rest in descending order of information gain. Reference [18] proposed that DF (document frequency) is the simplest of these methods, but it makes poor use of words with the lowest frequencies; reference [19] pointed out that IG (information gain) can reduce the dimension of the vector space model by setting a threshold, but an appropriate threshold is hard to choose; reference [20] argued that MI assigns higher scores to the lowest-frequency words than the other methods do, because it favors such words.
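A rough sketch of scoring a binary term feature by information gain, i.e., the entropy of the class labels minus the conditional entropy given the presence or absence of the term; the toy data and helper names are assumptions for illustration only:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_present, labels):
    """IG(feature) = H(labels) - sum_v P(v) * H(labels | feature = v)."""
    total = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for present, y in zip(feature_present, labels) if present == value]
        if subset:
            gain -= (len(subset) / total) * entropy(subset)
    return gain

# Toy data: whether the word "goal" appears in each document, and the document class.
has_goal = [True, True, True, False, False, False]
labels   = ["sport", "sport", "sport", "politics", "politics", "sport"]
print(information_gain(has_goal, labels))  # higher values mean a more informative feature
```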

In reference [21], a survey of intelligent techniques for feature selection and classification used in intrusion detection is presented and discussed. In addition, a new feature selection algorithm, an intelligent rule-based attribute selection algorithm, and a novel classification algorithm, an intelligent rule-based enhanced multi-class support vector machine, are proposed. In reference [22], to address the low efficiency and poor accuracy of keyword extraction with the traditional TF-IDF (term frequency-inverse document frequency) algorithm, a text keyword extraction method based on word frequency statistics is put forward.

Experimental results show that the TF-IDF algorithm based on word frequency statistics not only outperforms the traditional TF-IDF algorithm in precision, recall, and F1 for keyword extraction, but also reduces the run time of keyword extraction considerably. In reference [23], a feature extraction algorithm based on the average word frequency of feature words within and outside the class is presented.
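For reference, a minimal sketch of the baseline TF-IDF weighting that reference [22] builds on, using scikit-learn (assumed to be available); this is not the modified word-frequency-statistics algorithm of [22] itself:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the match ended with a late goal",
    "parliament passed the budget bill",
    "the striker scored the winning goal",
]

vectorizer = TfidfVectorizer()          # term frequency weighted by inverse document frequency
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

# Show the highest-weighted term (a candidate keyword) of each document.
terms = vectorizer.get_feature_names_out()
for row in range(tfidf.shape[0]):
    weights = tfidf[row].toarray().ravel()
    print(terms[weights.argmax()])
```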

This algorithm improves classification effectively. In reference [24], a modified text feature extraction algorithm is proposed. The experimental results suggest that this algorithm describes text features more accurately and is better suited to text feature processing, Web text data mining, and other fields of Chinese information processing. In reference [25], a method that targets the characteristics of short texts and automatically recognizes their feature words is brought forward.

According to the experimental results, compared with traditional feature extraction methods, this method is more suitable for the classification of short texts. Reference [26] presented an ensemble-based multi-filter feature selection method that combines the top one-third of the features ranked as important by information gain, gain ratio, chi-squared, and ReliefF. Fusion requires integrating specific classifiers, and the search has to be conducted over an exponentially growing interval.

Its time complexity is therefore high [27, 28], so it is inappropriate for feature extraction from large-scale texts [27, 28]. The weighting method is a special class of fusion; it assigns each feature a weight in [0, 1] and adjusts the weights during training.

A weighting method integrated with linear classifiers is highly efficient. The K-nearest neighbors (KNN) algorithm is an instance-based learning method [29]. Han [30] put forward a weighted feature extraction method combined with a KNN classifier; it accumulates continuous values for each class and achieves a good classification effect.

As a nonparametric, simple, and effective text categorization method based on statistical pattern recognition, KNN performs outstandingly and can achieve high classification accuracy and recall [29, 30, 31].
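A minimal sketch of KNN text categorization over TF-IDF features, assuming scikit-learn is available; the toy corpus, neighbor count, and cosine metric are illustrative choices, not those of [29, 30, 31]:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

train_texts = ["great goal in the final minute", "the election results were announced",
               "the team won the championship", "the senate debated the new law"]
train_labels = ["sport", "politics", "sport", "politics"]

# TF-IDF features followed by a cosine-distance vote over the nearest neighbors.
model = make_pipeline(TfidfVectorizer(),
                      KNeighborsClassifier(n_neighbors=3, metric="cosine"))
model.fit(train_texts, train_labels)

print(model.predict(["a stunning goal decided the match"]))  # expected: ['sport']
```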

A weighted center vector classification method was proposed by Shankar [32]. It first defines a measure of the discriminating ability of features, weights the features by this ability, and obtains a new center vector.

The algorithm requires repeated weighting until the classification ability drops. Mapping has been widely applied to text classification and has achieved good results [33].

Dumais et al. proposed LSA (latent semantic analysis), a computational theory or method used for knowledge acquisition and representation. It applies statistical computation to analyze a large collection of texts, extracts the latent semantic structure between words, and uses this latent structure to represent words and texts, thereby eliminating correlations between words and reducing dimensionality by simplifying the text vectors [17].

The basic idea of latent semantic analysis is to map texts represented in the high-dimensional VSM (vector space model) to a lower-dimensional latent semantic space. This mapping is achieved through SVD (singular value decomposition) of the term-document matrix [19, 29]. Applications of LSA include information filtering, document indexing, video retrieval, text classification and clustering, image retrieval, information extraction, and so on.
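A small sketch of this mapping using truncated SVD over a TF-IDF term-document matrix, which is the usual way LSA is realized with scikit-learn (assumed available); the number of latent dimensions is an arbitrary choice:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stock prices fell sharply", "the market rallied after the report"]

tfidf = TfidfVectorizer().fit_transform(docs)       # high-dimensional, sparse VSM
lsa = TruncatedSVD(n_components=2, random_state=0)  # SVD of the term-document matrix
latent = lsa.fit_transform(tfidf)                   # each row: a document in latent space

print(latent.shape)  # (4, 2): four documents, two latent semantic dimensions
```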

Jeno [33] studied high-dimensional data reduction from the perspective of center vectors and least squares. He argued that this form of dimensionality reduction has an advantage over SVD, because the clustered center vectors reflect the structure of the raw data, whereas SVD takes no account of that structure. Reference [34] proposes a novel filter-based probabilistic feature selection method, the DFS (distinguishing feature selector), for text classification.

The comparison is carried out over different datasets, classification algorithms, and success measures [34]. Experimental results explicitly indicate that DFS offers competitive performance with respect to the abovementioned approaches in terms of classification accuracy, dimension reduction rate, and processing time [34]. Clustering-based feature extraction primarily exploits the essential similarity of text features in order to cluster them; the center of each cluster is then used to replace the features of that cluster.

The advantage of this method is that it achieves a very low compression ratio while the basic classification accuracy stays essentially constant.

Its disadvantage is extremely high time complexity [35, 36]. In text classification, CI (concept indexing) [37] is a simple but efficient dimensionality reduction method whose advantage is a relatively low time complexity [15, 16]. By taking the center of each class as a basis vector to construct a subspace (the CI subspace) and then mapping each text vector onto this subspace, a representation of the text vectors in that subspace is obtained. The number of classes in the training set is exactly the dimensionality of the CI subspace, which is usually much smaller than the dimensionality of the text vector space, so dimensionality reduction of the vector space is achieved.
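A rough NumPy sketch of the concept indexing idea, with class centroids as basis vectors and least-squares coordinates of each text in that subspace; the random data and helper function are simplifications for illustration, not the exact algorithm of [37]:

```python
import numpy as np

def concept_indexing(X, y):
    """Project text vectors X (n_docs x n_terms) onto the subspace spanned by class centroids."""
    classes = sorted(set(y))
    # Basis matrix: one centroid column per class (n_terms x n_classes).
    C = np.stack([X[np.array(y) == c].mean(axis=0) for c in classes], axis=1)
    # Least-squares coordinates of every document in the centroid subspace.
    coords, *_ = np.linalg.lstsq(C, X.T, rcond=None)
    return coords.T  # n_docs x n_classes: dimensionality equals the number of classes

X = np.random.RandomState(0).rand(6, 50)   # six documents over a 50-term vocabulary
y = ["a", "a", "b", "b", "c", "c"]
print(concept_indexing(X, y).shape)        # (6, 3): reduced from 50 to 3 dimensions
```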

In reference [40], the authors describe two approaches for reducing large feature spaces to an efficient size using genetic algorithms and fuzzy clustering techniques.

Finally, the classification of patterns is achieved by using adaptive neuro-fuzzy techniques. The aim of the entire work is to implement a recognition scheme for classifying tumor lesions that appear in the human brain as space-occupying lesions in CT and MR images.

Deep learning was put forward by Hinton et al., and its concept comes from the study of artificial neural networks. A multi-layer perceptron with multiple hidden layers is a deep learning structure. By combining lower-level features to form more abstract, higher-level representations of attribute categories or features, deep learning discovers distributed feature representations of data [2].

Deep learning stands in contrast to shallow learning. Many current learning methods are shallow-structure algorithms, and they have limitations: with limited samples, their ability to represent complex functions is limited, and their generalization ability on complex classification problems is restricted to a certain extent [42].

By learning a deep nonlinear network structure, deep learning implements complex function approximation, characterizes the distributed representation of the input data, and shows the ability to learn the essential characteristics of a data set from a small set of samples [63].

The major difference between deep learning and traditional pattern recognition methods is that deep learning automatically learns features from big data instead of adopting handcrafted features [2]. In the history of computer vision, a widely recognized good feature emerged only once every 5 to 10 years.

For new applications, however, deep learning can quickly acquire effective new feature representations from training data. Deep learning technology is applied to common NLP (natural language processing) tasks such as semantic parsing [43], information retrieval [44, 45], semantic role labeling [46, 47], sentiment analysis [48], question answering [49, 50, 51, 52], machine translation [53, 54, 55, 56], text classification [57], summarization [58, 59], and text generation [60], as well as information extraction, including named entity recognition [61, 62], relation extraction [63, 64, 65, 66, 67], and event detection [68, 69, 70].

Convolutional neural networks and recurrent neural networks are two popular models employed in this work [71]. Next, several deep learning methods, their applications and improvements, and the steps used for text feature extraction are introduced.

An autoencoder, first introduced by Rumelhart et al., usually has one hidden layer between the input and output layers. The hidden layer usually has a more compact representation than the input and output layers, i.e., fewer units. The input and output layers usually have the same configuration, which allows an autoencoder to be trained unsupervised by feeding the same data to the input and comparing the output layer against it.

The training process is the same as for a traditional neural network with backpropagation; the only difference is that the error is computed by comparing the output with the input data itself [2]. A stacked autoencoder, as described by Mitchell et al., is the deep counterpart of the autoencoder and can be built simply by stacking up layers.

For every layer, the input is the representation learned by the previous layer, and the layer learns a more compact representation of it. A stacked sparse autoencoder, discussed by Gravelines et al., additionally imposes a sparsity constraint on the hidden units; a stacked denoising autoencoder, introduced by Vincent et al., is trained to reconstruct the original input from a corrupted version of it. In reference [76], a feature extraction and clustering algorithm based on a deep denoising autoencoder is brought forward for the characteristics of short texts. The algorithm converts the high-dimensional, sparse space vectors of short texts into new, lower-dimensional, substantive feature spaces by using a deep learning network.

According to the experimental results, applying the extracted text features to short text clustering significantly improves the clustering effect and effectively handles the high-dimensional, sparse space vectors of short texts. Experiments also show that with small training sets the classification performance of the SD algorithm is lower than that of a traditional SVM (support vector machine), but when processing high-dimensional data the SD algorithm achieves higher accuracy and recall than SVM.
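As a rough illustration of the denoising autoencoder idea used in work such as [76], a minimal PyTorch sketch; the layer sizes, corruption level, and random toy batch are assumptions and do not reproduce the cited network:

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Map sparse, high-dimensional text vectors to a compact code and reconstruct them."""
    def __init__(self, input_dim=2000, code_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, code_dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(code_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim), nn.Sigmoid())

    def forward(self, x):
        noisy = x * (torch.rand_like(x) > 0.3).float()  # randomly drop ~30% of the inputs
        code = self.encoder(noisy)
        return self.decoder(code), code

model = DenoisingAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(32, 2000)                 # a toy batch of 32 term-weight vectors
for _ in range(5):                       # a few illustrative training steps
    reconstruction, _ = model(x)
    loss = loss_fn(reconstruction, x)    # error: output compared with the clean input
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

_, features = model(x)                   # learned low-dimensional text features
print(features.shape)                    # torch.Size([32, 64])
```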

In reference [78], unsupervised pre-training with an autoencoder is combined with a deep ConvNet in order to recognize handwritten Bangla digits, and the proposed approach is reported to achieve high recognition accuracy. In reference [79], human motion data are high-dimensional time-series data that usually contain measurement error and noise. In their experiments, the authors compared using the raw data with three types of feature extraction methods—principal component analysis, a shallow sparse autoencoder, and a deep sparse autoencoder—for pattern recognition [79].

The proposed method, the application of a deep sparse autoencoder, thus enabled higher recognition accuracy, better generalization, and more stability than could be achieved with the other methods [79].

The RBM (restricted Boltzmann machine), originally known as Harmonium when invented by Smolensky [80], is a version of the Boltzmann machine with the restriction that there are no connections between visible units or between hidden units [2].

The network is composed of visible units (corresponding to the components of the visible vectors, i.e., the data) and hidden units (corresponding to the components of the hidden vectors). The whole system is a bipartite graph: edges exist only between visible units and hidden units, and there are no connections between visible units or between hidden units, as illustrated in the corresponding figure.

During the forward pass, each input is combined with a weight and a bias, and the result is transmitted to the hidden layer. During the backward pass, each activation is combined with a weight and a bias, and the result is transmitted back to the visible layer for reconstruction. At the visible layer, the KL divergence between the reconstruction and the original input is used to judge the quality of the result.

Steps (a)-(c) are repeated with different weights and biases until the reconstruction and the input are as close as possible; a toy sketch of this cycle is given below.
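A small NumPy sketch of this reconstruction cycle with a contrastive-divergence-style update (binary units, single training vector, CD-1); it is a simplification for illustration, not the exact procedure of the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden, lr = 6, 3, 0.1
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
b_visible = np.zeros(n_visible)
b_hidden = np.zeros(n_hidden)

v0 = np.array([1., 1., 0., 0., 1., 0.])  # one binary training vector

for step in range(100):
    # (a) forward pass: visible -> hidden probabilities and samples
    h_prob = sigmoid(v0 @ W + b_hidden)
    h_sample = (rng.random(n_hidden) < h_prob).astype(float)
    # (b) backward pass: hidden -> visible reconstruction
    v_prob = sigmoid(h_sample @ W.T + b_visible)
    # (c) compare reconstruction with the input and update weights (CD-1 style)
    h_recon = sigmoid(v_prob @ W + b_hidden)
    W += lr * (np.outer(v0, h_prob) - np.outer(v_prob, h_recon))
    b_visible += lr * (v0 - v_prob)
    b_hidden += lr * (h_prob - h_recon)

print(np.round(v_prob, 2))  # the reconstruction should be close to v0 after training
```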

In reference [81], the RBM, a machine learning tool with strong representational power, is utilized as the feature extractor in a large variety of classification problems [81]. The authors use the RBM to extract discriminative low-dimensional features from high-dimensional raw data and then use the extracted features as the input of an SVM for regression.

Experimental results indicate that this approach to stock price prediction yields a substantial improvement, with lower forecasting errors than an SVM applied to the raw data.

The results demonstrate that, with a proper structure and parameters, the performance of the proposed deep learning method on sentiment classification is better than that of state-of-the-art shallow learning models such as SVM or NB, which shows that the DBN is suitable for short-length document classification with the proposed feature dimensionality extension method [82].

The DBN (deep belief network) was introduced by Hinton et al. In terms of network structure, a DBN can be regarded as a stack of restricted Boltzmann machines in which the hidden layer of one RBM serves as the visible layer of the next. The training process of a DBN includes two phases: the first step is layer-wise pre-training, and the second step is fine-tuning [2, 84]. The RBM of each layer is trained separately and without supervision, ensuring that as feature vectors are mapped into different feature spaces, as much feature information as possible is retained.

The RBM of each layer can only ensure that the weights in its own layer are optimal for the feature vectors of that layer, not for the feature vectors of the whole DBN. Therefore, a backpropagation network propagates error information top-down to each RBM layer and fine-tunes the whole DBN.

The RBM training process can be regarded as the initialization of the weight parameters of a deep BP network. This enables the DBN to overcome the weakness of deep BP networks, in which random initialization of the weight parameters easily leads to local optima and long training times.

Any classifier suited to the specific application domain can be used in the supervised learning layer; it does not have to be a BP network [16, 84]. In reference [85], a novel text classification approach based on a deep belief network is proposed. The proposed method outperforms a traditional classifier based on support vector machines. Detailed experiments are also carried out to show the effect of different fine-tuning strategies and network structures on the performance of the deep belief network [85].

Reference [86] proposed a biomedical domain-specific word embedding model incorporating stem, chunk, and entity information and used the embeddings for DBN-based DDI extraction and RNN (recurrent neural network)-based gene mention extraction.

In reference [87], a novel hybrid text classification model based on the deep belief network and softmax regression is proposed. The experimental results on the Reuters and 20 Newsgroups corpora show that the proposed model converges at the fine-tuning stage and performs significantly better than classical algorithms such as SVM and KNN [87].

The CNN (convolutional neural network) [88] is a highly efficient recognition method developed in recent years that has attracted extensive attention. Fukushima's neocognitron was the first implementation of a CNN-like network and also the first application of the receptive field concept in the field of artificial neural networks [89].

Subsequently, LeCun et al. further developed the approach, and CNNs have now become a highly efficient method of identification in the field of image recognition [90]. A CNN is a multi-layer neural network; each layer is composed of multiple two-dimensional planes, and each plane is composed of multiple independent neurons [91]. A group of local units in one layer serves as the input of a unit in the adjacent upper layer; this kind of local connection originates from the perceptron [92, 93].

The CNN is an artificial neural network with strong adaptability that is good at mining the local characteristics of data. Its weight-sharing network structure makes it more similar to biological neural networks, reduces the complexity of the network model and the number of weights, and has allowed the CNN to be applied in various fields of pattern recognition with very good results [94, 95].

By combining local receptive fields, weight sharing, and subsampling in space or time, the CNN makes full use of the locality and other features contained in the data itself, optimizes the network structure, and guarantees a degree of shift invariance [93].
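To illustrate local receptive fields and weight sharing in a text setting, a minimal PyTorch sketch of a one-dimensional convolutional text classifier; the vocabulary size, embedding width, filter count, and window size are arbitrary assumptions:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Embed tokens, slide shared convolution filters over word windows, max-pool, classify."""
    def __init__(self, vocab_size=5000, embed_dim=64, n_filters=32, window=3, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # The same filter weights are shared across every position in the sentence.
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size=window)
        self.classifier = nn.Linear(n_filters, n_classes)

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))                   # local features from word windows
        x = x.max(dim=2).values                        # max-pooling over time
        return self.classifier(x)

model = TextCNN()
batch = torch.randint(0, 5000, (8, 20))  # 8 toy sentences of 20 token ids each
print(model(batch).shape)                # torch.Size([8, 2])
```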

Feature Extraction

In real life, the data we collect comes in large amounts, and we need a process to make sense of it, because processing it manually is not possible. Suppose you want to work on a large machine learning project or in a popular domain such as deep learning, where you might use images to build an object detection project, or on computer vision projects that work with thousands of images in a data set.

To work with such data, you need a feature extraction procedure, which makes the task much easier. Feature extraction is part of the dimensionality reduction process, in which an initial set of raw data is divided and reduced into more manageable groups so that subsequent processing is easier. The most important characteristic of these large data sets is that they have a large number of variables, and these variables require a lot of computing resources to process. Feature extraction helps to obtain the best features from big data sets by selecting and combining variables into features, thus effectively reducing the amount of data.

These features are easy to process but are still able to describe the actual data set accurately and faithfully. Extracting features is useful when you have a large data set and need to reduce the number of resources required without losing important or relevant information.

Feature extraction also helps to reduce the amount of redundant data in the data set. In this section, we start from scratch: the first thing we need to understand is how a machine can read and store images.

Loading images, reading them, and then processing them with a machine is difficult, because the machine does not have eyes as we do.

Machines see any image as a matrix of numbers, and the size of this matrix depends on the number of pixels in the input image. The pixel value of each pixel describes how bright that pixel is and what color it should be. In the simplest case of binary images, the pixel value is a 1-bit number indicating either foreground or background. In general, pixels are numbers, and the pixel values denote the intensity or brightness of each pixel.

Smaller numbers, closer to zero, represent black, and larger numbers (closer to the maximum value, for example 255 in an 8-bit image) represent white. This is the concept of pixels and how a machine, without eyes, sees images through numbers.

The dimensions of such an image are given in pixels (for example, 28 × 28), and you can verify them by counting the number of pixels.
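A small sketch of this, assuming Pillow and NumPy are installed and using a placeholder file name, shows the matrix a machine actually works with:

```python
import numpy as np
from PIL import Image  # Pillow

# "digit.png" is a placeholder path to any grayscale image on disk.
image = Image.open("digit.png").convert("L")  # "L" = 8-bit grayscale
pixels = np.array(image)

print(pixels.shape)                # (height, width): the image's dimensions in pixels
print(pixels.min(), pixels.max())  # values near 0 are dark, values near 255 are bright
print(pixels[:3, :3])              # top-left corner of the matrix the machine "sees"
```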


