How to learn data mining?
What exactly is data mining? Let's first look at a picture.
[Figure: the skills needed to become a data analyst]
The picture lists the skills you need to become a data analyst. You can't help but sigh: that is a daunting list!
So let’s start with a simple example.
Imagine the vast sea, and suppose we want to extract valuable oil from it.
To the average person, the sea is unfathomable, let alone a place to find treasure. But to skilled oil prospectors, the sea has coordinates. They survey the geology and analyze the formations to work out where oil is most likely to be found, then use drilling tools to dig deeper and deeper until they strike oil.
For the prospector, the sea is the data source, the geological survey gives the coordinates to search, and the oil is the result of the analysis.
The data mining work we do is like that drilling: by analyzing the data, we discover patterns hidden in a huge volume of data and find the treasure.
The basic process of data mining
Before I formally get to the data mining knowledge list, let me walk you through the basic process of data mining.
The process of data mining can be divided into the following six steps.
1. Business understanding: Data mining is not our goal. Our goal is to better help the business. So the first step is to understand the project requirements from a business perspective. On this basis, we will define the goals of data mining.
2. Data understanding: Try to collect some data, and then explore the data, including data description, data quality verification, and so on. This will help you to have a preliminary understanding of the data collected.
3. Data preparation: Begin collecting the data, and perform data cleaning and data integration to complete the preparation work before data mining.
4. Model building: Select and apply various data mining models and optimize them for better classification results.
5. Model evaluation: Evaluate the model and examine each step of building the model to confirm that the model achieves the intended business objectives.
6. Online release: The role of the model is to find the gold mine in the data, which is what we call "knowledge". The knowledge gained needs to be translated into a form users can actually use. This can be a report, or a more complex, repeatable data mining process. If the data mining results become part of day-to-day operations, subsequent monitoring and maintenance become important.
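To make these six steps concrete, here is a minimal sketch of how they might map onto code. Everything in it is an assumption for illustration: a hypothetical customers.csv file with a binary "churned" label column, and a simple decision tree standing in for whatever model a real project would choose.

```python
# A minimal sketch of the six-step process, assuming a hypothetical
# customers.csv file with a binary "churned" label column.
import joblib
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Business understanding: the (assumed) goal is to predict churn.
# 2. Data understanding: collect and explore the data.
df = pd.read_csv("customers.csv")
print(df.describe())  # basic description and quality check

# 3. Data preparation: simple cleaning, e.g. dropping missing rows.
df = df.dropna()
X, y = df.drop(columns=["churned"]), df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# 4. Model building: pick a model and fit it.
model = DecisionTreeClassifier(max_depth=5)
model.fit(X_train, y_train)

# 5. Model evaluation: check whether it meets the business objective.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 6. Online release: persist the model so it can serve day-to-day use.
joblib.dump(model, "churn_model.joblib")
```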
Ten algorithms for data mining
In order to carry out data mining tasks, data scientists have proposed a variety of models. Among the many data mining models, the authoritative IEEE International Conference on Data Mining (ICDM) selected ten classic algorithms.
Based on their purposes, I have divided these algorithms into four categories so that you can understand them better:
- Classification algorithm: C4.5, Naive Bayes, SVM, KNN, Adaboost, CART
- Clustering algorithm: K-Means, EM
- Association analysis: Apriori
- Link analysis: PageRank
1. C4.5
The C4.5 algorithm received the most votes and can be called the top of the top ten. C4.5 is a decision tree algorithm. It creatively prunes during the construction of the decision tree, and it can handle both continuous attributes and incomplete data, making it a landmark algorithm in decision tree classification.
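C4.5's signature idea is choosing splits by information gain ratio rather than raw information gain. Here is a small sketch of that computation, on a made-up play/don't-play example:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, groups):
    """Information gain of a split, normalized by the split's own entropy."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    gain = entropy(labels) - remainder
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups)
    return gain / split_info if split_info else 0.0

# Made-up example: play / don't play, split by a weather attribute.
labels = ["yes", "yes", "no", "no", "yes", "no"]
groups = [["yes", "yes", "no"], ["no", "yes", "no"]]  # sunny vs. rainy
print(gain_ratio(labels, groups))
```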
2. Naive Bayes
The naive Bayes model is based on probability theory. Its idea is this: for a given object to be classified, compute the probability of each category occurring given that this object occurs, and assign the object to the category with the largest such probability.
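To see the "largest posterior wins" idea in code, here is a minimal sketch using scikit-learn's Gaussian Naive Bayes on made-up height and weight data:

```python
from sklearn.naive_bayes import GaussianNB

# Made-up toy data: [height_cm, weight_kg] -> class label.
X = [[180, 80], [175, 75], [160, 50], [155, 48]]
y = ["male", "male", "female", "female"]

clf = GaussianNB().fit(X, y)
# For a new object, the model scores P(class | features) for every
# class and reports the class with the largest posterior probability.
print(clf.predict([[165, 55]]))        # most probable class
print(clf.predict_proba([[165, 55]]))  # the posteriors themselves
```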
3. SVM
Support Vector Machine, SVM for short. During training, an SVM builds a hyperplane as its classification model. If you don't know what a hyperplane is, don't worry; I will introduce it in a later article.
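As a small taste before that article, here is a minimal sketch of a linear SVM on made-up 2-D points; in two dimensions the hyperplane is just a separating line:

```python
from sklearn.svm import SVC

# Made-up 2-D points from two classes.
X = [[1, 1], [2, 1], [1, 2], [6, 5], [7, 7], [6, 6]]
y = [0, 0, 0, 1, 1, 1]

# A linear SVM learns a separating hyperplane (a line in 2-D)
# with the widest possible margin between the two classes.
clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([[3, 2], [6, 7]]))
print(clf.support_vectors_)  # the points that pin down the margin
```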
4. KNN
KNN stands for K-Nearest Neighbors. The idea is that every sample can be represented by its K nearest neighbors: if most of a sample's K nearest neighbors belong to category A, then the sample itself also belongs to category A.
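A minimal sketch of that majority-vote idea, using scikit-learn's K-nearest-neighbors classifier on made-up points:

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 8]]  # made-up points
y = ["A", "A", "A", "B", "B", "B"]

# With K=3, a new sample takes the majority label of its
# three nearest neighbors.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[2, 1], [9, 9]]))  # -> ['A' 'B']
```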
5. AdaBoost
AdaBoost builds a combined classification model during training. "Boost" means to lift or promote, so AdaBoost is a boosting algorithm for building classifiers. It allows us to combine multiple weak classifiers into one strong classifier, so AdaBoost is also a common classification algorithm.
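Here is a minimal sketch with scikit-learn's AdaBoostClassifier, whose default weak learner is a depth-1 decision tree; the synthetic dataset is just for illustration:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=0)

# By default each weak learner is a depth-1 decision tree (a "stump");
# boosting reweights the samples so later stumps focus on the mistakes
# of earlier ones, then combines them into one strong classifier.
clf = AdaBoostClassifier(n_estimators=50).fit(X, y)
print(clf.score(X, y))
```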
6. CART
CART stands for Classification and Regression Trees. As the name suggests, it can build two kinds of trees: classification trees and regression trees. Like C4.5, it is a decision tree learning method.
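scikit-learn's tree classes use a CART-style algorithm, so they can illustrate both kinds of trees. A minimal sketch on a made-up 1-D feature:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[1], [2], [3], [10], [11], [12]]  # made-up 1-D feature

# Classification tree: predicts a label.
clf = DecisionTreeClassifier().fit(X, ["low", "low", "low", "high", "high", "high"])
print(clf.predict([[2.5], [11.5]]))

# Regression tree: predicts a numeric value.
reg = DecisionTreeRegressor().fit(X, [1.0, 1.1, 0.9, 5.0, 5.2, 4.9])
print(reg.predict([[2.5], [11.5]]))
```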
7. Apriori
Apriori is an algorithm for mining association rules. It uncovers the relationships between items by mining frequent itemsets, so it is widely used in business analysis and network security. A frequent itemset is a collection of items that often appear together, and an association rule implies that there may be a strong relationship between two items.
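The full Apriori algorithm generates and prunes candidate itemsets level by level; the sketch below only shows the core support-counting idea for item pairs, on made-up shopping baskets:

```python
from itertools import combinations
from collections import Counter

# Made-up shopping baskets.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "butter"},
]

# Count the support of every item pair; pairs that appear in at
# least half the baskets are frequent itemsets.
min_support = 0.5
counts = Counter(
    pair for b in baskets for pair in combinations(sorted(b), 2)
)
for pair, c in counts.items():
    if c / len(baskets) >= min_support:
        print(pair, c / len(baskets))  # e.g. ('bread', 'milk') 0.5
```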
8. K-Means
The K-Means algorithm is a clustering algorithm. You can think of it this way: ultimately, I want to divide the objects into K classes. Suppose each class has a "central point", which you can think of as an opinion leader, the core of that class. When I have a new point to categorize, I just compute the distance between the new point and each of the K central points; the point belongs to whichever center is closest.
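A minimal sketch of that nearest-center rule, using scikit-learn's KMeans on made-up 2-D points:

```python
from sklearn.cluster import KMeans

# Made-up 2-D points that fall into two obvious groups.
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # the two "central points"

# A new point joins whichever center it is closest to.
print(km.predict([[2, 2], [9, 9]]))
```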
9. EM
The EM algorithm, also called the expectation-maximization algorithm, is a method for finding maximum likelihood estimates of parameters. The principle is this: suppose we want to estimate parameter A and parameter B, and both are unknown at the start, but knowing A lets us derive B, and in turn knowing B lets us derive A. We can first give A an initial value to obtain an estimate of B, then re-estimate A from that estimate of B, and continue this process until the values converge.
EM algorithms are often used in the field of clustering and machine learning.
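One common use of EM is fitting a Gaussian mixture model for clustering. A minimal sketch with scikit-learn, whose GaussianMixture is trained by exactly this estimate-and-re-estimate loop:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Made-up 1-D data drawn from two overlapping groups.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])

# E-step: estimate each point's cluster membership from the current
# parameters; M-step: re-estimate the parameters from those
# memberships; repeat until convergence.
gm = GaussianMixture(n_components=2).fit(data.reshape(-1, 1))
print(gm.means_.ravel())  # should land near 0 and 5
```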
10. PageRank
PageRank originated from the way the influence of academic papers is calculated: the more often a paper is cited, the stronger its influence. Google creatively applied the same idea to computing the weight of web pages: the more a page is linked to, the more "citations" it has, and the more frequently it is linked, the higher its weight. Based on this principle, we can derive a ranking of pages.
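A minimal sketch of that idea: PageRank by power iteration on a made-up four-page link graph, where each page passes a share of its weight to the pages it links to.

```python
# A tiny PageRank by power iteration on a made-up 4-page link graph.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}
damping = 0.85

for _ in range(50):  # iterate until the ranks settle
    new = {p: (1 - damping) / len(pages) for p in pages}
    for page, outgoing in links.items():
        share = damping * rank[page] / len(outgoing)
        for target in outgoing:
            new[target] += share  # a link passes on part of its weight
    rank = new

print(rank)  # "C" ends up highest: it is linked to the most
```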
Algorithms can be said to be the soul of data mining, and they are its most essential part. These ten classic algorithms received the most votes across the entire data mining field, and many other algorithms are essentially improvements and innovations built on top of them. Today you have gained a preliminary understanding of the top ten algorithms; you only need to know roughly what each one does. It doesn't matter if some of the details escape you. I will explain each of them to you in detail later.
The mathematical principles of data mining
I have introduced many classic data mining algorithms, but without probability theory and mathematical statistics it is still difficult to grasp their essence; without linear algebra it is difficult to appreciate the value of matrix and vector operations in data mining; and without a notion of optimization methods, you won't understand iterative convergence. Therefore, to understand data mining methods more deeply, it is well worth understanding the mathematical principles behind them.
1. Probability theory and mathematical statistics
Most of us learned probability theory in college, but university courses tend to emphasize probability and cover statistics less. Data mining uses probability theory in many places: for example, the concepts of conditional probability and independence, and the concepts of random variables and multidimensional random variables.
The essence of many algorithms is related to probability theory, so probability theory and mathematical statistics are important mathematical foundations of data mining.
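As a tiny illustration of conditional probability, here is a sketch that computes P(buy | click) from a made-up table of visit counts:

```python
# Conditional probability from a made-up joint table:
# counts of (bought, clicked_ad) over 100 site visits.
counts = {("buy", "click"): 20, ("buy", "no_click"): 10,
          ("no_buy", "click"): 30, ("no_buy", "no_click"): 40}
total = sum(counts.values())

p_click = (counts[("buy", "click")] + counts[("no_buy", "click")]) / total
p_buy_and_click = counts[("buy", "click")] / total

# P(buy | click) = P(buy, click) / P(click)
print(p_buy_and_click / p_click)  # 0.4
```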
2. Linear algebra
Vectors and matrices are the key concepts of linear algebra, and they are used widely in data mining. For example, we often abstract objects as matrices: an image can be abstracted into a matrix. We also frequently compute eigenvalues and eigenvectors, and use the eigenvectors to approximate the features of an object. This is the basic idea behind dimensionality reduction for big data.
Matrix-based operations help us solve many practical problems; methods such as PCA, SVD, and MF/NMF are all widely used in data mining.
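A minimal sketch of the eigenvalue computation behind PCA-style dimensionality reduction, using NumPy on a small made-up matrix:

```python
import numpy as np

# Eigen-decomposition of a small symmetric matrix: the eigenvector
# with the largest eigenvalue captures the dominant direction of the
# data, which is the core idea behind PCA-style dimensionality reduction.
A = np.array([[4.0, 2.0],
              [2.0, 3.0]])
values, vectors = np.linalg.eig(A)
print(values)         # eigenvalues
print(vectors[:, 0])  # eigenvector for the first eigenvalue
```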
3. Graph theory
The rise of social networks has made the applications of graph theory more and more extensive. The relationship between two people can be represented by an edge between two nodes in a graph, and the degree of a node can be understood as the number of friends a person has. We have all heard of the six degrees of separation theory, and Facebook's data showed that, on average, connecting one person to another takes only 3.57 intermediaries. Graph theory is very effective for analyzing network structure, and it also plays an important role in relationship mining and image segmentation.
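A minimal sketch of those graph ideas, using the networkx library on a made-up friendship graph: node degree is the number of friends, and shortest path length is the degrees of separation.

```python
import networkx as nx

# A made-up friendship graph.
G = nx.Graph()
G.add_edges_from([("alice", "bob"), ("bob", "carol"),
                  ("carol", "dave"), ("alice", "carol")])

print(G.degree("carol"))                            # number of friends: 3
print(nx.shortest_path_length(G, "alice", "dave"))  # degrees of separation: 2
```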
4. Optimization method
The optimization method corresponds to the machine's self-learning process. Once the machine knows its objective, it must adjust itself iteratively during training, and optimization is that adjustment process. In general, an unguided process of learning and iterating is long and meandering; optimization methods were proposed so that convergence can be reached in less time and with better results.
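Gradient descent is perhaps the simplest optimization method of this kind. A minimal sketch, minimizing the made-up function f(x) = (x - 3)^2 by iterative adjustment until it converges:

```python
# Gradient descent on f(x) = (x - 3)^2: a minimal picture of how an
# optimization method adjusts a parameter step by step until it converges.
x = 0.0
learning_rate = 0.1
for step in range(100):
    gradient = 2 * (x - 3)  # derivative of (x - 3)^2
    x -= learning_rate * gradient
print(x)  # converges to ~3, the minimizer
```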