I'm following a tutorial in which the 'kmeans' algorithm is the main chunk of the overall example. The 'rows' list is the data to be clustered. The pearson function is passed as the 2nd parameter and provides the correlation coefficient, and k=3 is the number of clusters. 'bestmatches', returned by the kmeans function, is a list of grouped/clustered index values: for each cluster, the indices of the elements of rows that belong to it. I need to make a scatter plot, so I need the values. How do I return the values instead of the indices?
    from math import sqrt
    import random

    rows=[(1,1),(3,6),(11,2),(7,19),(22,11),(32,11)]

    def pearson(v1,v2):
        # sums
        sum1=sum(v1)
        sum2=sum(v2)
        print(sum1)
        # sums of the squares
        sum1sq=sum([pow(v,2) for v in v1])
        sum2sq=sum([pow(v,2) for v in v2])
        # sum of the products
        psum=sum([v1[i]*v2[i] for i in range(len(v1))])
        # calculate Pearson r
        num=psum-(sum1*sum2/len(v1))
        den=sqrt((sum1sq-pow(sum1,2)/len(v1))*(sum2sq-pow(sum2,2)/len(v1)))
        if den==0: return 0
        # returned as a distance: smaller means more strongly correlated
        return 1.0-num/den

    def kmeans(rows,distance=pearson,k=3):
        # determine the min and max values for each dimension of "rows" (the data)
        ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows]))
                for i in range(len(rows[0]))]

        # create k randomly placed centroids within the range of the data
        clusters=[[random.random()*(ranges[i][1]-ranges[i][0])+ranges[i][0]
                   for i in range(len(rows[0]))] for j in range(k)]

        lastmatches=None
        for t in range(100):
            print('iteration %d' % t)
            bestmatches=[[] for i in range(k)]

            # find which centroid is closest to each row
            for j in range(len(rows)):
                row=rows[j]
                bestmatch=0
                for i in range(k):
                    d=distance(clusters[i],row)
                    if d<distance(clusters[bestmatch],row): bestmatch=i
                bestmatches[bestmatch].append(j)

            # stop once the assignments no longer change
            if bestmatches==lastmatches: break
            lastmatches=bestmatches

            # move the centroids to the average of their members
            for i in range(k):
                avgs=[0.0]*len(rows[0])
                if len(bestmatches[i])>0:
                    #print(len(bestmatches[i]))
                    for rowid in bestmatches[i]:
                        for m in range(len(rows[rowid])):
                            avgs[m]+=rows[rowid][m]
                    for j in range(len(avgs)):
                        avgs[j]/=len(bestmatches[i])
                    clusters[i]=avgs

        return bestmatches
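On the narrower point of getting values rather than indices: since each entry of 'bestmatches' is a list of positions in 'rows', a lookup should be enough. The sketch below is only an illustration. The 'bestmatches' value is made up for the example (a real run starts from random centroids and will differ), and the names 'clustered_rows' and 'cluster_xy' are hypothetical.

```python
rows = [(1, 1), (3, 6), (11, 2), (7, 19), (22, 11), (32, 11)]

# Made-up example of what kmeans() might return:
# one list of row indices per cluster.
bestmatches = [[0, 1], [2, 4, 5], [3]]

# Replace each index with the row it points to.
clustered_rows = [[rows[i] for i in cluster] for cluster in bestmatches]

# Split each cluster into separate x and y lists,
# which is the shape most scatter-plot calls expect.
cluster_xy = [([p[0] for p in c], [p[1] for p in c]) for c in clustered_rows]

print(clustered_rows)
```

Each pair in 'cluster_xy' can then be handed to whatever plotting call you use, one cluster at a time, so that every cluster gets its own colour.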
Do not use k-means with Pearson correlation.
This may fail badly, because Pearson correlation and the mean are incompatible, which may prevent the algorithm from converging. Worse, it may yield invalid values.
If you take the two vectors
1 2 3 4 5
9 8 7 6 5
then their mean is
5 5 5 5 5
and the resulting mean cannot be used with Pearson correlation, because it is a constant vector: its variance is zero, so the correlation is undefined.
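A quick check of this example, as a minimal sketch. The helper 'pearson_r' is a hypothetical name and not the function from the question; it returns None where the correlation is undefined. Note that the code in the question returns 0 in that case, and since the value is used as a distance, a constant centroid would look like a perfect match for every row.

```python
from math import sqrt

def pearson_r(v1, v2):
    """Plain Pearson correlation; returns None when it is undefined."""
    n = float(len(v1))
    mean1 = sum(v1) / n
    mean2 = sum(v2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(v1, v2))
    var1 = sum((a - mean1) ** 2 for a in v1)
    var2 = sum((b - mean2) ** 2 for b in v2)
    den = sqrt(var1 * var2)
    if den == 0:
        return None  # at least one vector is constant (zero variance)
    return cov / den

a = [1, 2, 3, 4, 5]
b = [9, 8, 7, 6, 5]

# The k-means update step: component-wise mean of the members.
centroid = [(x + y) / 2.0 for x, y in zip(a, b)]

print(centroid)                # every component is 5.0
print(pearson_r(a, b))         # the two vectors are perfectly anti-correlated
print(pearson_r(centroid, a))  # undefined: the centroid is constant
```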
K-means is correct for Bregman divergences, such as squared Euclidean distance. This is because it performs variance minimization, not distance minimization.
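To make the "variance, not distance" point concrete, here is a small sketch on a made-up one-dimensional cluster. It compares the mean with the median as a candidate centre: the mean gives the smaller sum of squared deviations, but not the smaller sum of plain absolute distances. The helper names 'sum_sq' and 'sum_abs' are hypothetical.

```python
points = [0.0, 0.0, 10.0]  # made-up 1-D cluster

def sum_sq(center):
    """Sum of squared deviations from center (what k-means minimizes)."""
    return sum((p - center) ** 2 for p in points)

def sum_abs(center):
    """Sum of plain absolute distances from center."""
    return sum(abs(p - center) for p in points)

mean = sum(points) / len(points)
median = sorted(points)[len(points) // 2]

# The mean wins on squared deviations ...
print(sum_sq(mean), sum_sq(median))
# ... but loses to the median on unsquared distances.
print(sum_abs(mean), sum_abs(median))
```

So taking the mean as the new centre is only the right update step when the quantity being minimized is the squared deviation.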
K-means cannot be used with arbitrary distances. Use k-medoids (PAM) or another clustering algorithm if you have other distances.
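As a rough illustration of why a medoid-based method copes with arbitrary distances: instead of averaging, the centre is chosen among the cluster's own members, so only pairwise distances are ever needed. The sketch below shows just that update step, not the full PAM algorithm. The names 'medoid' and 'manhattan' are hypothetical, and Manhattan distance is used only as a stand-in for "some other distance".

```python
def medoid(members, distance):
    """Return the member with the smallest total distance to all members."""
    return min(members, key=lambda c: sum(distance(c, m) for m in members))

def manhattan(a, b):
    """Example distance; any function of two rows could be passed instead."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Made-up cluster taken from the first three rows of the question's data.
cluster = [(1, 1), (3, 6), (11, 2)]

center = medoid(cluster, manhattan)
print(center)
```

Because the centre is always an actual data point, it can never be a degenerate vector like the constant mean above, unless such a point is already in the data.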