A Brief Overview of Logistic Regression and Least Squares Probability Classification, with Examples
Published: 2019-06-18


Logistic Regression & Least Squares Probability Classification

1. Logistic Regression

The likelihood function, as described on Wikipedia, plays one of the key roles in statistical inference, especially in methods of estimating a parameter from a set of statistics. In this article, we will make full use of it.

Pattern recognition works by learning the posterior probability $p(y|x)$ of a pattern $x$ belonging to class $y$. Given a pattern $x$, we assign it to the class whose posterior probability is largest, i.e.

$$\hat{y} = \mathop{\arg\max}_{y=1,\dots,c} p(y|x)$$
The posterior probability can be seen as the credibility of assigning pattern $x$ to class $y$.
In the logistic regression algorithm, we use a log-linear model to express the posterior probability:

$$q(y|x,\theta) = \frac{\exp\left(\sum_{j=1}^{b}\theta_j^{(y)}\phi_j(x)\right)}{\sum_{y'=1}^{c}\exp\left(\sum_{j=1}^{b}\theta_j^{(y')}\phi_j(x)\right)}$$
Note that the denominator is a normalization term that makes the posteriors sum to one over the $c$ classes. Logistic regression is then defined by the following optimization problem:

$$\max_{\theta}\ \sum_{i=1}^{m}\log q(y_i|x_i,\theta)$$
We can solve it by stochastic gradient ascent (a minimal one-step sketch in MATLAB is given after the list):
  1. Initialize $\theta$.
  2. Pick a training sample $(x_i, y_i)$ at random.
  3. Update $\theta = (\theta^{(1)T},\dots,\theta^{(c)T})^T$ along the direction of gradient ascent:
     $$\theta^{(y)} \leftarrow \theta^{(y)} + \epsilon\,\nabla_y J_i(\theta),\quad y=1,\dots,c$$
     where
     $$\nabla_y J_i(\theta) = -\frac{\exp\left(\theta^{(y)T}\phi(x_i)\right)\phi(x_i)}{\sum_{y'=1}^{c}\exp\left(\theta^{(y')T}\phi(x_i)\right)} + \begin{cases}\phi(x_i) & (y=y_i)\\ 0 & (y\neq y_i)\end{cases}$$
  4. Repeat steps 2 and 3 until $\theta$ reaches a suitable precision.
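Below is a minimal MATLAB sketch of a single update in step 3, assuming a generic design matrix Phi (n-by-b, with Phi(i,j) = $\phi_j(x_i)$), a parameter matrix theta (b-by-c), labels y, and a step size eps0; these variable names are illustrative and not taken from the original code.

% One stochastic gradient-ascent update for the multiclass logistic model.
% Assumed variables: Phi (n-by-b design matrix), theta (b-by-c parameters),
% y (n-by-1 labels in 1..c), eps0 (step size), n (number of samples).
i   = ceil(rand*n);                        % pick a training sample at random
phi = Phi(i,:)';                           % phi(x_i), b-by-1
p   = exp(theta'*phi); p = p/sum(p);       % current posterior estimates (softmax)
theta = theta - eps0*phi*p';               % expectation term, applied to every class
theta(:,y(i)) = theta(:,y(i)) + eps0*phi;  % add phi(x_i) for the observed class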

Take the Gaussian kernel model as an example:

$$q(y|x,\theta) \propto \exp\left(\sum_{j=1}^{n}\theta_j K(x,x_j)\right)$$
where $K(x,c) = \exp\left(-\frac{\|x-c\|^2}{2h^2}\right)$ is the Gaussian kernel with bandwidth $h$.
If you are not familiar with the Gaussian kernel model, refer to this article:

Here is the corresponding MATLAB code:

n=90; c=3; y=ones(n/c,1)*(1:c); y=y(:);          % 90 samples, 3 classes, labels
x=randn(n/c,c)+repmat(linspace(-3,3,c),n/c,1);   % Gaussian clusters centered at -3, 0, 3
x=x(:);
hh=2*1^2;                                        % 2*h^2 for the Gaussian kernel (h=1)
t0=randn(n,c);                                   % initialize parameters theta
for o=1:n*1000
    i=ceil(rand*n); yi=y(i);                     % pick a random training sample
    ki=exp(-(x-x(i)).^2/hh);                     % kernel features phi(x_i)
    ci=exp(ki'*t0);                              % unnormalized class scores
    t=t0-0.1*(ki*ci)/(1+sum(ci));                % gradient-ascent step: expectation term
    t(:,yi)=t(:,yi)+0.1*ki;                      % gradient-ascent step: true-class term
    if norm(t-t0)<0.000001                       % stop when the update is negligible
        break;
    end
    t0=t;
end
N=100; X=linspace(-5,5,N)';                      % test points for plotting
K=exp(-(repmat(X.^2,1,n)+repmat(x.^2',N,1)-2*X*x')/hh);   % kernel matrix between X and x
figure(1); clf; hold on; axis([-5,5,-0.3,1.8]);
C=exp(K*t); C=C./repmat(sum(C,2),1,c);           % estimated posteriors q(y|x)
plot(X,C(:,1),'b-');
plot(X,C(:,2),'r--');
plot(X,C(:,3),'g:');
plot(x(y==1),-0.1*ones(n/c,1),'bo');             % training samples of each class
plot(x(y==2),-0.2*ones(n/c,1),'rx');
plot(x(y==3),-0.1*ones(n/c,1),'gv');
legend('q(y=1|x)','q(y=2|x)','q(y=3|x)');

(Figure: estimated posteriors q(y=1|x), q(y=2|x), q(y=3|x) produced by the script above, with training samples of the three classes marked along the bottom of the plot.)
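As a quick illustration of how the trained model is used, the following hypothetical snippet classifies a single new point; it reuses x, t, and hh from the script above, and x_new is an arbitrary example value.

% Hypothetical usage: classify a new point with the learned parameters t.
x_new = 0.5;
k_new = exp(-(x - x_new).^2/hh);      % kernel features of the new point
q = exp(k_new'*t); q = q/sum(q);      % estimated posterior q(y|x_new)
[~, y_hat] = max(q);                  % predicted class label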

2. Least Squares Probability Classification

In least squares probability classification, a linearly parameterized model is used to express the posterior probability:

$$q(y|x,\theta^{(y)}) = \sum_{j=1}^{b}\theta_j^{(y)}\phi_j(x) = \theta^{(y)T}\phi(x),\quad y=1,\dots,c$$
Each model depends on its own parameter vector $\theta^{(y)} = (\theta_1^{(y)},\dots,\theta_b^{(y)})^T$ for class $y$, which differs from the shared parameterization used by the logistic classifier. Learning these models means minimizing the following squared error:
$$J_y(\theta^{(y)}) = \frac{1}{2}\int\left(q(y|x,\theta^{(y)}) - p(y|x)\right)^2 p(x)\,dx = \frac{1}{2}\int q(y|x,\theta^{(y)})^2 p(x)\,dx - \int q(y|x,\theta^{(y)})\,p(y|x)\,p(x)\,dx + \frac{1}{2}\int p(y|x)^2 p(x)\,dx$$
where $p(x)$ represents the probability density of the training samples $\{x_i\}_{i=1}^{n}$.
By Bayes' rule,
$$p(y|x)\,p(x) = p(x,y) = p(x|y)\,p(y)$$
Hence $J_y$ can be reformulated as
$$J_y(\theta^{(y)}) = \frac{1}{2}\int q(y|x,\theta^{(y)})^2 p(x)\,dx - \int q(y|x,\theta^{(y)})\,p(x|y)\,p(y)\,dx + \frac{1}{2}\int p(y|x)^2 p(x)\,dx$$
Note that the first and second terms above are expectations with respect to $p(x)$ and $p(x|y)$ respectively, which are usually impossible to compute directly. The last term is independent of $\theta$ and can therefore be omitted.
Since $p(x|y)$ is the probability density of samples $x$ belonging to class $y$, we can estimate the first and second terms by the following sample averages:
$$\frac{1}{n}\sum_{i=1}^{n}q(y|x_i,\theta^{(y)})^2, \qquad \frac{p(y)}{n_y}\sum_{i:y_i=y}q(y|x_i,\theta^{(y)})$$
where $n_y$ is the number of training samples of class $y$; estimating the class prior as $p(y)\approx n_y/n$ makes the coefficient of the second average equal to $1/n$.
Next, we introduce a regularization term to obtain the following training criterion:
$$\hat{J}_y(\theta^{(y)}) = \frac{1}{2n}\sum_{i=1}^{n}q(y|x_i,\theta^{(y)})^2 - \frac{1}{n}\sum_{i:y_i=y}q(y|x_i,\theta^{(y)}) + \frac{\lambda}{2n}\|\theta^{(y)}\|^2$$
Let $\pi^{(y)} = (\pi_1^{(y)},\dots,\pi_n^{(y)})^T$ with
$$\pi_i^{(y)} = \begin{cases}1 & (y_i = y)\\ 0 & (y_i \neq y)\end{cases}$$
and let $\Phi$ be the $n\times b$ design matrix with $\Phi_{ij} = \phi_j(x_i)$. Then
$$\hat{J}_y(\theta^{(y)}) = \frac{1}{2n}\theta^{(y)T}\Phi^T\Phi\,\theta^{(y)} - \frac{1}{n}\theta^{(y)T}\Phi^T\pi^{(y)} + \frac{\lambda}{2n}\|\theta^{(y)}\|^2.$$
This is a convex optimization problem, so we obtain the analytic solution by setting the derivative with respect to $\theta^{(y)}$ to zero:
$$\hat{\theta}^{(y)} = (\Phi^T\Phi + \lambda I)^{-1}\Phi^T\pi^{(y)}.$$
To keep the estimated posterior probability from becoming negative, we clip negative outputs at zero and renormalize:
$$\hat{p}(y|x) = \frac{\max\left(0,\hat{\theta}^{(y)T}\phi(x)\right)}{\sum_{y'=1}^{c}\max\left(0,\hat{\theta}^{(y')T}\phi(x)\right)}$$
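A minimal MATLAB sketch of this closed-form training step follows; the names Phi (n-by-b design matrix), y (labels), c (number of classes), and lambda are assumptions for illustration and do not appear in the original code.

% Closed-form training of least squares probability classification.
% Assumed variables: Phi (n-by-b design matrix), y (n-by-1 labels in 1..c),
% c (number of classes), lambda (regularization weight).
[n, b] = size(Phi);
Theta = zeros(b, c);
for yy = 1:c
    piy = double(y == yy);                                   % indicator vector pi^(y)
    Theta(:,yy) = (Phi'*Phi + lambda*eye(b)) \ (Phi'*piy);   % ridge solution
end
% Posterior estimate for a new feature vector phi_new (b-by-1):
% p_hat = max(0, Theta'*phi_new); p_hat = p_hat/sum(p_hat);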

We again take the Gaussian kernel model as an example:

n=90; c=3; y=ones(n/c,1)*(1:c); y=y(:);          % 90 samples, 3 classes, labels
x=randn(n/c,c)+repmat(linspace(-3,3,c),n/c,1);   % Gaussian clusters centered at -3, 0, 3
x=x(:);
hh=2*1^2; x2=x.^2; l=0.1;                        % 2*h^2 (h=1) and regularization weight lambda
N=100; X=linspace(-5,5,N)';                      % test points for plotting
k=exp(-(repmat(x2,1,n)+repmat(x2',n,1)-2*x*(x'))/hh);     % kernel matrix among training points
K=exp(-(repmat(X.^2,1,n)+repmat(x2',N,1)-2*X*(x'))/hh);   % kernel matrix between X and x
Kt=zeros(N,c);
for yy=1:c
    yk=(y==yy); ky=k(:,yk);                      % kernel columns of class yy
    ty=(ky'*ky +l*eye(sum(yk)))\(ky'*yk);        % closed-form ridge solution for class yy
    Kt(:,yy)=max(0,K(:,yk)*ty);                  % clip negative outputs at zero
end
ph=Kt./repmat(sum(Kt,2),1,c);                    % normalized posterior estimates
figure(1); clf; hold on; axis([-5,5,-0.3,1.8]);
plot(X,ph(:,1),'b-');
plot(X,ph(:,2),'r--');
plot(X,ph(:,3),'g:');
plot(x(y==1),-0.1*ones(n/c,1),'bo');             % training samples of each class
plot(x(y==2),-0.2*ones(n/c,1),'rx');
plot(x(y==3),-0.1*ones(n/c,1),'gv');
legend('q(y=1|x)','q(y=2|x)','q(y=3|x)');

(Figure: estimated posteriors q(y=1|x), q(y=2|x), q(y=3|x) from the least squares probability classifier, with training samples of the three classes marked along the bottom of the plot.)
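If hard class labels are needed at the test points, one option (a small assumed follow-up using ph from the script above) is:

% Hypothetical usage: predicted label at each test point in X,
% taken as the class with the largest estimated posterior in ph.
[~, y_pred] = max(ph, [], 2);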

3. Summary

Logistic regression handles small sample sets well because it works in a simple way. However, once the number of samples becomes large, it is better to turn to the least squares probability classifier, whose parameters are obtained analytically rather than by iterative updates.
