层次聚类

本文最后更新于 2020年8月5日 早上

概念

层次聚类(hierarchical clustering)就是不断把最近的类合并知道达到要求为止。这是一种树形结构

大致过程:

  1. 初始化,每个实例看做一类
  2. 合并, 算出每两个类之间的距离,然后把距离最近的两个类合并成一个类
  3. 终止条件可以是最近两个类之间的距离

计算两个类之间距离的方法

  1. SingleLinkage: 这种方法是以两个类中最短距离代表两个类之间的距离。但是这种方法可能出现链式反应,即抓到了一个离其他点十分近的点就抓到了一群点,但实际上分属两团
  2. CompleteLinkage: 这种方法是找两个类中最长点距离
  3. AverageLinkage: 这种方法是把两个类中所有点的距离求出来再求平均值,或者也可以取中值

应用

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
from numpy import *

"""
Code for hierarchical clustering, modified from
Programming Collective Intelligence by Toby Segaran
(O'Reilly Media 2007, page 33).
"""

class cluster_node: # 树节点
def __init__(self,vec,left=None,right=None,distance=0.0,id=None,count=1):
self.left=left
self.right=right
self.vec=vec
self.id=id
self.distance=distance
self.count=count #only used for weighted average

def L2dist(v1,v2):
return sqrt(sum((v1-v2)**2))

def L1dist(v1,v2):
return sum(abs(v1-v2))

# def Chi2dist(v1,v2):
# return sqrt(sum((v1-v2)**2))

def hcluster(features,distance=L2dist):
#cluster the rows of the "features" matrix
distances={}
currentclustid=-1

# clusters are initially just the individual rows
clust=[cluster_node(array(features[i]),id=i) for i in range(len(features))]

while len(clust)>1:
lowestpair=(0,1)
closest=distance(clust[0].vec,clust[1].vec)

# loop through every pair looking for the smallest distance
for i in range(len(clust)):
for j in range(i+1,len(clust)):
# distances is the cache of distance calculations
if (clust[i].id,clust[j].id) not in distances:
distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec)

d=distances[(clust[i].id,clust[j].id)]

if d<closest:
closest=d
lowestpair=(i,j) # 这里是使用第一种方法

# calculate the average of the two clusters
mergevec=[(clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0 \
for i in range(len(clust[0].vec))]
# 平均距离

# create the new cluster
newcluster=cluster_node(array(mergevec),left=clust[lowestpair[0]],
right=clust[lowestpair[1]],
distance=closest,id=currentclustid)

# cluster ids that weren't in the original set are negative
currentclustid-=1
del clust[lowestpair[1]]
del clust[lowestpair[0]]
clust.append(newcluster)

return clust[0]


def extract_clusters(clust,dist):
# extract list of sub-tree clusters from hcluster tree with distance<dist
clusters = {}
if clust.distance<dist:
# we have found a cluster subtree
return [clust]
else:
# check the right and left branches
cl = []
cr = []
if clust.left!=None:
cl = extract_clusters(clust.left,dist=dist)
if clust.right!=None:
cr = extract_clusters(clust.right,dist=dist)
return cl+cr

def get_cluster_elements(clust):
# return ids for elements in a cluster sub-tree
if clust.id>=0:
# positive id means that this is a leaf
return [clust.id]
else:
# check the right and left branches
cl = []
cr = []
if clust.left!=None:
cl = get_cluster_elements(clust.left)
if clust.right!=None:
cr = get_cluster_elements(clust.right)
return cl+cr


def printclust(clust,labels=None,n=0):
# indent to make a hierarchy layout
for i in range(n): print ' ',
if clust.id<0:
# negative id means that this is branch
print '-'
else:
# positive id means that this is an endpoint
if labels==None: print clust.id
else: print labels[clust.id]

# now print the right and left branches
if clust.left!=None: printclust(clust.left,labels=labels,n=n+1)
if clust.right!=None: printclust(clust.right,labels=labels,n=n+1)



def getheight(clust):
# Is this an endpoint? Then the height is just 1
if clust.left==None and clust.right==None: return 1

# Otherwise the height is the same of the heights of
# each branch
return getheight(clust.left)+getheight(clust.right)

def getdepth(clust):
# The distance of an endpoint is 0.0
if clust.left==None and clust.right==None: return 0

# The distance of a branch is the greater of its two sides
# plus its own distance
return max(getdepth(clust.left),getdepth(clust.right))+clust.distance


层次聚类
https://www.xinhecuican.tech/post/2ba82a45.html
作者
星河璀璨
发布于
2020年7月30日
许可协议