algorithm - Python: Clustering numerical data


I'm following a tutorial on the 'kmeans' algorithm; the code below is the main chunk of the overall example. The 'rows' list is the data to be clustered. The pearson function is passed as the 2nd parameter and provides the correlation-based distance, and k=3 is the number of clusters. The 'bestmatches' list returned by the kmeans function is a list of grouped/clustered index values, showing which elements of rows belong to each cluster. I need to make a scatter plot, so I need the values. How can I return the values instead of the indexes?

    from math import sqrt
    import random

    rows = [(1, 1), (3, 6), (11, 2), (7, 19), (22, 11), (32, 11)]

    def pearson(v1, v2):
        # Sums
        sum1 = sum(v1)
        sum2 = sum(v2)
        print(sum1)
        # Sums of squares
        sum1sq = sum([pow(v, 2) for v in v1])
        sum2sq = sum([pow(v, 2) for v in v2])

        # Sum of products
        psum = sum([v1[i] * v2[i] for i in range(len(v1))])

        # Calculate Pearson r
        num = psum - (sum1 * sum2 / len(v1))
        den = sqrt((sum1sq - pow(sum1, 2) / len(v1)) * (sum2sq - pow(sum2, 2) / len(v1)))
        if den == 0: return 0

        return 1.0 - num / den

    def kmeans(rows, distance=pearson, k=3):
        # Determine the min and max values for each dimension:
        # count through "rows" (the data) and find the min and max values
        ranges = [(min([row[i] for row in rows]), max([row[i] for row in rows]))
                  for i in range(len(rows[0]))]

        # Create k randomly placed centroids within the range of the data
        clusters = [[random.random() * (ranges[i][1] - ranges[i][0]) + ranges[i][0]
                     for i in range(len(rows[0]))] for j in range(k)]

        lastmatches = None
        for t in range(100):
            print('iteration %d' % t)

            bestmatches = [[] for i in range(k)]

            # Find the centroid closest to each row
            for j in range(len(rows)):
                row = rows[j]
                bestmatch = 0
                for i in range(k):
                    d = distance(clusters[i], row)
                    if d < distance(clusters[bestmatch], row): bestmatch = i

                bestmatches[bestmatch].append(j)

            # Stop once the assignments no longer change
            if bestmatches == lastmatches: break
            lastmatches = bestmatches

            # Move the centroids to the average of their members
            for i in range(k):
                avgs = [0.0] * len(rows[0])
                if len(bestmatches[i]) > 0:
                    # print(len(bestmatches[i]))
                    for rowid in bestmatches[i]:
                        for m in range(len(rows[rowid])):
                            avgs[m] += rows[rowid][m]
                    # Divide once, after all members have been summed
                    for j in range(len(avgs)):
                        avgs[j] /= len(bestmatches[i])
                    clusters[i] = avgs

        return bestmatches
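As for getting values rather than indexes: one possible approach, assuming 'bestmatches' has the shape described above (one list of row indexes per cluster), is to look each index up in 'rows'. The helper name `indexes_to_values` and the sample 'bestmatches' below are made up for illustration; they are not the output of an actual run.

```python
rows = [(1, 1), (3, 6), (11, 2), (7, 19), (22, 11), (32, 11)]

# Hypothetical result in the shape kmeans() returns:
# one list of row indexes per cluster.
bestmatches = [[0, 1], [2, 4, 5], [3]]

def indexes_to_values(rows, bestmatches):
    """Replace every row index with the row it refers to."""
    return [[rows[i] for i in cluster] for cluster in bestmatches]

clustered = indexes_to_values(rows, bestmatches)
print(clustered)

# For plotting, split one cluster into its x and y coordinates.
xs, ys = zip(*clustered[1])
print(xs, ys)
```

Each cluster's `xs` and `ys` could then be handed to a plotting call such as matplotlib's `plt.scatter(xs, ys)`, one call per cluster, so that each cluster gets its own color.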

Do not use k-means with Pearson correlation.

This may fail badly, because Pearson correlation and the mean are incompatible, which may prevent the algorithm from converging. Worse, it may yield invalid values.

If you take the two vectors

    1 2 3 4 5
    9 8 7 6 5

then the mean is

    5 5 5 5 5

and the resulting mean cannot be used with Pearson correlation, because it is a constant vector: its variance is zero, so the correlation is undefined.
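The example above can be checked numerically. This is a minimal sketch; `mean_vector` and `pearson_r` are illustrative helper names, not functions from the question, and `pearson_r` returns None where the correlation is undefined.

```python
from math import sqrt

def mean_vector(vectors):
    """Component-wise mean, which is how k-means updates a centroid."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def pearson_r(v1, v2):
    """Pearson correlation, or None when either vector has zero variance."""
    n = len(v1)
    m1, m2 = sum(v1) / n, sum(v2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(v1, v2))
    var1 = sum((a - m1) ** 2 for a in v1)
    var2 = sum((b - m2) ** 2 for b in v2)
    if var1 == 0 or var2 == 0:
        return None  # r would be cov / 0, which is undefined
    return cov / sqrt(var1 * var2)

a = [1, 2, 3, 4, 5]
b = [9, 8, 7, 6, 5]
centroid = mean_vector([a, b])

print(centroid)                # [5.0, 5.0, 5.0, 5.0, 5.0]
print(pearson_r(a, b))         # -1.0
print(pearson_r(centroid, a))  # None
```

Note that the pearson function in the question returns 0 when its denominator is 0, so a constant centroid like this one would appear to be at distance 0 from every row.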

K-means is only correct for Bregman divergences, such as squared Euclidean distance, because it is variance minimization, not distance minimization.
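A small numeric check of the variance-minimization point, using made-up one-dimensional data: the mean is the center that minimizes the sum of squared deviations, but it is not the minimizer for another distance, such as the absolute difference. The function names and the 0.1 search grid are assumptions chosen for illustration.

```python
def sum_squared(points, c):
    """Sum of squared deviations from a candidate center c."""
    return sum((p - c) ** 2 for p in points)

def sum_absolute(points, c):
    """Sum of absolute deviations from a candidate center c."""
    return sum(abs(p - c) for p in points)

points = [0.0, 0.0, 10.0]
mean = sum(points) / len(points)
median = sorted(points)[len(points) // 2]

# Brute-force search over candidate centers on a 0.1 grid from 0 to 10.
candidates = [i / 10 for i in range(0, 101)]
best_sq = min(candidates, key=lambda c: sum_squared(points, c))
best_abs = min(candidates, key=lambda c: sum_absolute(points, c))

print(mean, best_sq)     # the squared-error optimum sits next to the mean
print(median, best_abs)  # the absolute-error optimum is the median instead
```

So the "take the mean" update step only improves the objective when that objective is a sum of squared deviations; for other distances the mean can make the clustering worse.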

K-means cannot be used with arbitrary distances. Use k-medoids (PAM) or other clustering algorithms if you have other distances.
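To illustrate the medoid idea, here is a rough sketch of a simplified alternating k-medoids. It is not the full PAM swap search, and the names `manhattan`, `assign` and `kmedoids` are assumptions made for this example. Because every cluster center is an actual row, no mean is ever computed, so any pairwise distance function can be passed in.

```python
import random

def manhattan(a, b):
    """Example distance; any function of two rows could be used instead."""
    return sum(abs(x - y) for x, y in zip(a, b))

def assign(rows, medoids, distance):
    """Put each row index into the cluster of its nearest medoid."""
    clusters = [[] for _ in medoids]
    for j, row in enumerate(rows):
        nearest = min(range(len(medoids)),
                      key=lambda c: distance(rows[medoids[c]], row))
        clusters[nearest].append(j)
    return clusters

def kmedoids(rows, distance=manhattan, k=3, max_iter=100, seed=0):
    """Simplified alternating k-medoids; medoids are indexes into rows."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(rows)), k)
    for _ in range(max_iter):
        clusters = assign(rows, medoids, distance)
        new_medoids = []
        for c, members in enumerate(clusters):
            if not members:
                new_medoids.append(medoids[c])  # keep the medoid of an empty cluster
                continue
            # New medoid: the member with the smallest total distance to the others.
            best = min(members,
                       key=lambda m: sum(distance(rows[m], rows[o]) for o in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(rows, medoids, distance)

rows = [(1, 1), (3, 6), (11, 2), (7, 19), (22, 11), (32, 11)]
medoids, clusters = kmedoids(rows, k=3)
print(medoids)
print(clusters)
```

Like 'bestmatches' in the question, `clusters` holds row indexes, and `medoids` holds the index of the representative row for each cluster.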

