Predictive modelling uses statistics to predict outcomes based on historical data. It is also referred to as supervised machine learning.
• Spam filtering: predict whether a new email is spam or non-spam, based on annotated examples of past spam and non-spam emails
• Car insurance: assign risk of accidents to policy holders and potential customers
• Healthcare: predict which disease a patient has, based on their symptoms
• Algorithmic trading: predictive models can be built for different assets like stocks, futures, currencies, etc., based on historical data and company information
• Classification: Learn from a labeled training set to make a prediction to assign a new "unseen" example to one of a fixed number of classes
• Regression: Learn from an existing training set to decide the value of a continuous output variable (i.e. the output is a number)
Scikit-learn is a comprehensive open source Python package for machine learning and data analysis. Anaconda includes scikit-learn as part of its free distribution.
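As a minimal sketch of the two task types in scikit-learn, the example below fits one classifier and one regressor. The data values are invented for illustration, and the choice of `KNeighborsClassifier` and `LinearRegression` is an assumption made here; many other scikit-learn models follow the same `fit` / `predict` pattern.

```python
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Classification: invented examples with two numeric features and a class label.
X_class = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]
y_class = ["non-spam", "non-spam", "spam", "spam"]

classifier = KNeighborsClassifier(n_neighbors=1)
classifier.fit(X_class, y_class)                          # learn from labelled examples
predicted_label = classifier.predict([[0.95, 0.9]])[0]    # assign a class to an unseen example

# Regression: invented examples where the output is a number (here y = 2x).
X_reg = [[1.0], [2.0], [3.0], [4.0]]
y_reg = [2.0, 4.0, 6.0, 8.0]

regressor = LinearRegression()
regressor.fit(X_reg, y_reg)                               # learn from existing examples
predicted_value = regressor.predict([[5.0]])[0]           # predict a continuous value

print(predicted_label)
print(round(predicted_value, 2))
```

In both cases the model is first fitted to past examples and then asked about an input it has not seen; the only difference is whether the output is a class label or a number.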
When analysing a new dataset, as a first step we might visualise the data to identify any obvious patterns.
• Visualising data using scatter plots can provide us with a sense of the relationship between two variables
• Relationships can have different directions (positive and negative) and different strengths (weak or strong)
• How can we quantify this numerically?
• It is often useful to know the strength of the relationship between Y and X, independent of the units of measurement
• The correlation between Y and X is a statistical measure of how strongly two variables are related. It is dimensionless, i.e. a unit-free measure of the relationship between variables
• Takes a value in [-1, +1], where +1 is total positive correlation, 0 is no correlation, and -1 is total negative correlation
• If our data is stored in a Pandas DataFrame, we can use the `df.corr()` method to compute the pairwise correlations between its columns
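The following is a small sketch of computing correlations with Pandas. The column names and values are made up for illustration; `hours_gaming` is deliberately constructed as an exact decreasing linear function of `hours_studied`, so that pair shows a total negative correlation.

```python
import pandas as pd

# Invented data: three numeric columns measured in different units.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 55, 61, 70, 72],
    "hours_gaming": [10, 8, 6, 4, 2],   # exactly 12 - 2 * hours_studied
})

# Pairwise correlation between every pair of numeric columns.
corr_matrix = df.corr()

# Correlation between two specific columns.
r_score = df["hours_studied"].corr(df["exam_score"])
r_gaming = df["hours_studied"].corr(df["hours_gaming"])

print(corr_matrix)
print(r_score, r_gaming)
```

Each value in the resulting matrix lies in [-1, +1], and the diagonal is 1 because every column is perfectly correlated with itself.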
Causation: indicates that one event is the result of the occurrence of the other event, i.e. there is a causal relationship between the two events.
A correlation between variables does not automatically mean that the change in one variable is the cause of the change in values of the other variable.
Regression analysis: A common statistical process for estimating the relationships between variables. This can allow us to make numeric predictions based on past data.
Linear Regression: a simple approach to predictive modelling. It assumes that the dependence of a response (dependent) variable Y on input (independent) variables X1, X2, ... is linear.
• Simple Linear Regression: Method for predicting a numeric response using a single input variable
• Goal is to learn the model coefficients from existing data
• Once we have learned the model, we can make future predictions
• We learn the model by finding the best line (coefficients), i.e. the one which minimises the sum of squared vertical distances between our examples and the line
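The steps above can be sketched in plain Python for the single-input case, where the model is Y = b0 + b1 * X. This is an illustrative sketch rather than a library implementation: the function names and data values are invented here, and the coefficients come from the standard closed-form least-squares solution.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Predict the response for a new input using the learned coefficients."""
    return intercept + slope * x


# Invented training data: one input variable and a numeric response.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_simple_linear_regression(xs, ys)
prediction = predict(intercept, slope, 6.0)

print(f"intercept={intercept:.2f}, slope={slope:.2f}, prediction={prediction:.2f}")
```

Once the two coefficients have been learned from the existing data, making a future prediction only requires evaluating the line at the new input value.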