
Univariate Logistic Regression

Logistic regression is a supervised classification algorithm. It is similar to linear regression, but is used when the dependent variable is categorical (despite the word "regression" in its name). So it targets classification problems, unlike linear regression, which targets regression problems. The objective in classification is to map an observation (input) to its associated class or label.

Linear regression can be used for classification by applying a threshold to its output; however, the least-squares criterion (minimizing the sum of squared residuals) is not well suited to separating classes.

So, for categorical targets the usual choices are logistic regression, decision trees, SVMs, random forests, and so on.

Logistic regression predicts the probability of the outcome for a given input, whereas a linear model predicts the outcome itself.
So the target modeled in logistic regression is a binary label (0 or 1), and the model outputs the probability of that label, instead of a continuous numeric value as in linear regression.

Say P is the probability of success of an event; then the probability of failure is 1-P.
By definition, probabilities and proportions cannot exceed 1 or fall below 0. The odds of success are defined as the ratio of the probability of success to the probability of failure.
Odds = P/(1-P)
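For example, if P = 0.8 the odds of success are 0.8/0.2 = 4, and the log-odds are log(4) ≈ 1.386. A minimal check in Python (an illustrative snippet, not part of the original notebook):

import numpy as np

p = 0.8                  # probability of success
odds = p / (1 - p)       # odds of success = 4.0
log_odds = np.log(odds)  # log-odds (logit) ≈ 1.386
print(odds, log_odds)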

$$ \log\left(\dfrac{P(x)}{1-P(x)}\right) = \theta_0 + \sum_{i=1}^{m} \theta_i x_i $$ In vector form (with $x_0 = 1$ for the intercept), $$ = \theta^T X $$

This model assumes that the log-odds of an observation can be expressed as a linear function of the input variables.

The LHS is the logit of P, hence the name logistic regression.
The RHS is linear, just as in linear regression.

The inverse of the logit function is

$$P(x) = \dfrac{e^z}{1+e^z} $$

$$ = \dfrac{1}{1+e^{-z}}$$ where $z = \theta^T X$

The RHS is the sigmoid function of $z$, which maps the real line onto the interval (0, 1).
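A minimal standalone sketch of this inverse relationship (illustrative only; a similar sigmoid helper is defined again later in the notebook):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # maps any real z into (0, 1)

def logit(p):
    return np.log(p / (1 - p))    # inverse mapping back to the real line

z = np.array([-5.0, 0.0, 5.0])
p = sigmoid(z)
print(p)          # roughly [0.0067, 0.5, 0.9933]
print(logit(p))   # recovers [-5., 0., 5.]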

Flow chart

Logistic Regression - Flow chart

The algorithm is very similar to that of linear regression; the only things we need to change are the representation of the response (the prediction) and the cost function.

For the response: while in linear regression we use a linear function of X ($\theta^T X$), in logistic regression we use the sigmoid of that linear function.

Hypothesis function:
Linear Regression : $$h_\theta(x) = (\theta^T X)$$

Logistic Regression : $$h_\theta(x) = g(\theta^T X)$$

In this model $g$ is the sigmoid (the inverse of the logit link); it maps the linear predictor $\theta^T X$ back to a probability.

$$h_\theta(x) = \dfrac{1}{1+e^{-\theta^T X}}$$

This hypothesis function represents the estimated probability that y = 1 for given input x parameterized by θ: $$h_\theta(x) = P(y=1 | x;\theta)$$

If we reused the squared-error cost from linear regression, the sigmoid in the hypothesis would make that cost non-convex (in linear regression the hypothesis is linear, so the squared-error cost is convex), and gradient descent could get stuck in one of many local minima.
To obtain a convex cost, the error is instead defined using the logarithm of the predicted probability.

Image source: Coursera, Neural Networks

$$ Error(h_\theta(x),y)=\begin{cases} -log(h_\theta(x)), & \text{if $y=1$}\\ -log(1-h_\theta(x)), & \text{if $y=0$} \end{cases}$$

Cost function :

Mean of the errors: $$J(\theta) = \dfrac{1}{m} \sum_{i=1}^{m} Error(h_\theta(x^{(i)}), y^{(i)})$$

We are trying to minimize the difference between the prediction and the response.

Putting the two functions into one compact function, we get $$J(\theta) = -\dfrac{1}{m} \sum_{i=1}^{m} \big[y^{(i)} log(h_\theta(x^{(i)})) +(1-y^{(i)}) log(1-h_\theta(x^{(i)}))\big] $$
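Differentiating this cost with respect to each parameter gives a gradient of the same form as in linear regression, just with the sigmoid hypothesis in place of the linear one; this is the update the gradient descent code below uses:

$$ \dfrac{\partial J(\theta)}{\partial \theta_j} = \dfrac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_j^{(i)}, \qquad \theta_j := \theta_j - \alpha \dfrac{\partial J(\theta)}{\partial \theta_j} $$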

Binomial/binary/univariable/univariate logistic regression - the dependent variable is a binary variable [True/False, 0/1].

Multinomial logistic regression - the dependent variable has more than two outcomes.
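As a quick illustrative aside (not part of the original notebook), a multinomial fit on all three iris classes with scikit-learn could look like the sketch below; in this post we stick to the binary case.

from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
clf = LogisticRegression(max_iter=1000)  # default lbfgs solver handles the multi-class case
clf.fit(iris.data, iris.target)          # three classes: setosa, versicolor, virginica
print(clf.predict(iris.data[:3]))        # predicted class labels for the first three rows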

We'll try univariate logistic regression here, using the iris data from the sklearn datasets.
We will consider only one feature (sepal length) and restrict the target to two classes (setosa, versicolor).

Working

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn import linear_model
In [2]:
iris = datasets.load_iris()
In [3]:
iris.feature_names
Out[3]:
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']
In [4]:
iris.target_names
Out[4]:
array(['setosa', 'versicolor', 'virginica'],
      dtype='<U10')
In [5]:
X = iris.data[:, 0]          # sepal length only

y_bool = iris.target != 2    # keep setosa (0) and versicolor (1), drop virginica (2)

y = iris.target[y_bool]      # binary target: 0 or 1

X = X[y_bool]                # matching sepal length values
In [6]:
plt.scatter(X, y)
plt.xlabel('Sepal Length ', fontsize=15)

plt.ylabel('0 - setosa, 1 - versicolor ', fontsize=15)
plt.show()

Algorithm

In [7]:
X = np.c_[np.ones((X.shape[0],1)), X[:]]
y = y.reshape(-1,1)


# Parameters required for Gradient Descent
alpha = 0.1   #learning rate
m = y.size  #no. of samples
np.random.seed(10)
theta = np.random.rand(2)  #initializing theta with some random values
theta = theta.reshape(-1,1)
In [8]:
def gradient_descent(x, y, m, theta,  alpha):
    cost_list = []   #to record all cost values to this list
    theta_list = []  #to record all theta_0 and theta_1 values to this list 
    prediction_list = []
    run = True
    cost_list.append(1e10)    #we append some large value to the cost list
    i=0
    while run:
        Z = np.dot(x, theta)
        prediction = 1 / (1 + np.exp(-Z))   #predicted y values 
        prediction_list.append(prediction)
        error = prediction - y
        cost = np.sum(-(y * np.log(prediction) + (1 - y) * np.log(1 - prediction))) / m   # log loss: (1/m)*sum[-(y*log(p) + (1-y)*log(1-p))]

        cost_list.append(cost)
        theta = theta - (alpha * (1/m) * np.dot(x.T, error))   # alpha * (1/m) * sum[error*x]
        theta_list.append(theta)
        if cost_list[i]-cost_list[i+1] < 1e-9:   #checking if the change in cost function is less than 10^(-9)
            run = False

        i+=1
    cost_list.pop(0)   # Remove the large number we added in the beginning
    return prediction_list, cost_list, theta_list
In [9]:
prediction_list, cost_list, theta_list = gradient_descent(X, y, m, theta, alpha)
theta = theta_list[-1]
In [10]:
plt.title('Cost Function J', size = 30)
plt.xlabel('No. of iterations', size=20)
plt.ylabel('Cost', size=20)
plt.plot(cost_list)
plt.show()

Weights we got

In [11]:
theta
Out[11]:
array([[-27.53475787],
       [  5.08546084]])

We'll plot the fitted curve and decision boundary using these coefficients.

We'll take sample inputs spanning the training data range.

In [12]:
X_test = np.linspace(4, 7, 300)
In [13]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
In [14]:
loss = sigmoid(X_test*theta[1] + theta[0])   # predicted probabilities over the test inputs
In [15]:
plt.figure(1, figsize=(8, 6))
plt.clf()
plt.plot(X_test, loss, c='C0', label='Fitted sigmoid')
plt.scatter(X[:,1], y, c='C1', label='Training data')
plt.legend()
plt.xlabel('Sepal Length (Input)', fontsize=15)
plt.ylabel("Probability of the output" "\n" "(0 - setosa, 1 - versicolor)", fontsize=15)
plt.show()

We keep a threshold of 0.5: for a given input, if the predicted probability is 0.5 or above we classify it as 1, and if it is below 0.5 we classify it as 0.

In [16]:
# 0.5 threshold corresponds to 
boundary = X_test[np.where(loss >= 0.5)[0][0]]
print(round(boundary,3))
5.415

So a sepal length above 5.415 cm is classified as versicolor, and anything below that is classified as setosa.

In [17]:
plt.figure(1, figsize=(8, 6))
plt.clf()
plt.plot(X_test, loss, c='C0', label='Fitted sigmoid')
plt.scatter(X[:,1], y, c='C1', label='Training data')
plt.axhline(0.5, c='C2',label='0.5 Threshold', linewidth=0.7)
plt.axvline(boundary, c='C3',label='Boundary', linewidth=0.7)
plt.legend()
plt.xlabel('Sepal Length (Input)', fontsize=15)
plt.ylabel("Probability of the output" "\n" "(0 - setosa, 1 - versicolor)", fontsize=15)
plt.show()
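As a quick sanity check (an illustrative addition reusing X, y, theta and sigmoid from the cells above), we can apply the 0.5 threshold to the training data itself and look at the training accuracy:

probs = sigmoid(np.dot(X, theta))          # X still includes the column of ones
predicted = (probs >= 0.5).astype(int)     # apply the 0.5 threshold
print((predicted == y).mean())             # fraction of training samples classified correctly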

Scikit-learn

We can check this with scikit-learn's logistic regression module.

In [18]:
from sklearn.linear_model import LogisticRegression
In [19]:
lr = LogisticRegression(C=1e100)

Here C, from the formal definition in the sklearn docs, is the "Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization."

Since we are not using any regularization here, we keep C ($= 1/\lambda$) very large, so that the regularization strength $\lambda \approx 0$.

Regularization is used for avoiding overfitting.
We will see more about regularization in another post.
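As a rough illustration of what C does (an assumed sketch reusing X and y from above): with the default L2 penalty, decreasing C increases λ = 1/C and pulls the fitted coefficient toward zero.

for C in (1e100, 1.0, 0.01):
    model = LogisticRegression(C=C)              # default L2 penalty
    model.fit(X[:, 1].reshape(-1, 1), y.ravel())
    print(C, model.coef_[0, 0])                  # coefficient shrinks as C gets smaller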

In [20]:
X.shape
Out[20]:
(100, 2)
In [21]:
X[:5]
Out[21]:
array([[ 1. ,  5.1],
       [ 1. ,  4.9],
       [ 1. ,  4.7],
       [ 1. ,  4.6],
       [ 1. ,  5. ]])

We added a column of ones for our earlier algorithm, which we don't need here, so we drop that column.

In [22]:
lr = lr.fit(X[:,1].reshape(-1,1),y.ravel())

Checking the weights

From scikit-learn

In [23]:
'Theta_0 and Theta_1 are {},{}'.format(round(lr.intercept_[0],3), round(lr.coef_[0,0],3))
Out[23]:
'Theta_0 and Theta_1 are -27.831,5.14'

From scratch

In [24]:
theta[0,0], theta[1,0]
Out[24]:
(-27.534757871504066, 5.085460835465244)
In [25]:
'Theta_0 and Theta_1 are {},{}'.format(round(theta[0,0],3),round(theta[1,0],3))
Out[25]:
'Theta_0 and Theta_1 are -27.535,5.085'
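The two sets of weights agree closely. As a final illustrative check (not part of the original notebook), scikit-learn's predict should switch from 0 to 1 around the same boundary of roughly 5.4 cm:

print(lr.predict(np.array([[5.3], [5.5]])))   # expected: [0 1], i.e. setosa then versicolor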