python - Sorting strings in order of high-frequency terms from the inverted index in Elasticsearch


I'm new to Elasticsearch and wanted to know if the following is possible:

I have a bunch of address strings that I want to sort based on the repetitive terms in the strings.

For example:

1. shop no 1 abc lane city1 - zipcode1
2. shop no 2 efg lane city1 - zipcode2
3. shop no 1 xyz lane city2 - zipcode3
4. shop no 3 abc lane city1 - zipcode1

What I need is to bunch them together based on the common terms in the strings.

So the sorted output for the earlier example should be:

    1. shop no 1 abc lane city1 - zipcode1
    4. shop no 3 abc lane city1 - zipcode1 # because these two have the most words in common.
    2. shop no 2 efg lane city1 - zipcode2 # second-most words in common with 1 and 4.
    3. shop no 1 xyz lane city2 - zipcode3 # not many terms in common with the others.
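To make "common terms" concrete, here is a minimal sketch in plain Python (not Elasticsearch) that builds a tiny inverted index over the four example strings and counts how many distinct terms each pair shares. The function names `build_inverted_index` and `shared_term_counts` are made up for this illustration, and splitting on whitespace is an assumption (so the `-` separator counts as a term too).

```python
from collections import defaultdict
from itertools import combinations

addresses = {
    1: "shop no 1 abc lane city1 - zipcode1",
    2: "shop no 2 efg lane city1 - zipcode2",
    3: "shop no 1 xyz lane city2 - zipcode3",
    4: "shop no 3 abc lane city1 - zipcode1",
}


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():  # assumption: plain whitespace tokenization
            index[term].add(doc_id)
    return index


def shared_term_counts(index):
    """Count, for every pair of documents, how many distinct terms they share."""
    counts = defaultdict(int)
    for doc_ids in index.values():
        for a, b in combinations(sorted(doc_ids), 2):
            counts[(a, b)] += 1
    return counts


index = build_inverted_index(addresses)
pairs = sorted(shared_term_counts(index).items(), key=lambda kv: kv[1], reverse=True)
for (a, b), n in pairs:
    print(a, b, n)
```

With this tokenization, the pair (1, 4) comes out on top with 7 shared terms, which matches the grouping asked for above.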

I have no idea how to go about it. I know I could fire each string as a query and get the results closest to the query being fired, but I have on the order of a hundred thousand such rows, and that doesn't seem like an efficient option at all.

If I could do a matchall() and then sort, with something like a term filter, on the number of recurring terms in every string, that would be helpful.

Can there be a sort on documents that contain the most similar words in the inverted index?

Here's a sample pastebin of how the data looks: sample addresses

Solution

I have used https://stackoverflow.com/a/15174569/61903 to calculate the cosine similarity of two strings (credits to @vpekar) as the base algorithm for similarity. I put the strings into a list, set an index i to 0, and loop as long as i is within the range of the list length. Within the loop I iterate a position p from i+1 to length(list) and find the maximum cosine value between list[i] and list[p]. Both text strings are then put into an out list, so they won't be taken into account in later similarity calculations. Both text strings are also put into the result list along with the cosine value; the data structure for this is VectorResult.

Afterwards the list is sorted by cosine value. We now have unique string pairs in descending order of cosine, a.k.a. similarity, value. HTH.

    import re
    import math
    import timeit

    from collections import Counter

    WORD = re.compile(r'\w+')


    def get_cosine(vec1, vec2):
        # Dot product over the words both vectors share.
        intersection = set(vec1.keys()) & set(vec2.keys())
        numerator = sum([vec1[x] * vec2[x] for x in intersection])

        sum1 = sum([vec1[x] ** 2 for x in vec1.keys()])
        sum2 = sum([vec2[x] ** 2 for x in vec2.keys()])
        denominator = math.sqrt(sum1) * math.sqrt(sum2)

        if not denominator:
            return 0.0
        else:
            return float(numerator) / denominator


    def text_to_vector(text):
        # Word-count vector of a string.
        words = WORD.findall(text)
        return Counter(words)


    class VectorResult(object):
        # A pair of strings together with their cosine; ordered by cosine.
        def __init__(self, cosine, text_1, text_2):
            self.cosine = cosine
            self.text_1 = text_1
            self.text_2 = text_2

        def __eq__(self, other):
            return self.cosine == other.cosine

        def __le__(self, other):
            return self.cosine <= other.cosine

        def __ge__(self, other):
            return self.cosine >= other.cosine

        def __lt__(self, other):
            return self.cosine < other.cosine

        def __gt__(self, other):
            return self.cosine > other.cosine


    def main():
        start = timeit.default_timer()
        texts = []
        with open('data.txt', 'r') as f:
            texts = f.readlines()

        cosmap = []
        i = 0
        out = []  # strings that already belong to a pair
        while i < len(texts):
            max_cosine = 0.0
            current = None
            for p in range(i + 1, len(texts)):
                if texts[i] in out or texts[p] in out:
                    continue
                vector1 = text_to_vector(texts[i])
                vector2 = text_to_vector(texts[p])
                cosine = get_cosine(vector1, vector2)
                if cosine > max_cosine:
                    current = VectorResult(cosine, texts[i], texts[p])
                    max_cosine = cosine
            if current:
                out.extend([current.text_1, current.text_2])
                cosmap.append(current)
            i += 1

        cosmap = sorted(cosmap)

        # Print the most similar pairs first.
        for item in reversed(cosmap):
            print(item.cosine, item.text_1, item.text_2)

        end = timeit.default_timer()

        print("similarity sorting of {} strings lasted {} s.".format(len(texts), end - start))


    if __name__ == '__main__':
        main()
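As a quick check of what the cosine measure does, here is a compact sketch applied to the example strings from the question. The `cosine` helper is a made-up, condensed restatement for illustration, not a replacement for the code above. With the `\w+` tokenizer each example string has 7 distinct words, so the cosine is simply the number of shared words divided by 7.

```python
import math
import re
from collections import Counter

WORD = re.compile(r'\w+')


def cosine(text_a, text_b):
    """Cosine similarity of the word-count vectors of two strings."""
    vec_a = Counter(WORD.findall(text_a))
    vec_b = Counter(WORD.findall(text_b))
    # Dot product over the words that occur in both strings.
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm = math.sqrt(sum(c * c for c in vec_a.values())) * math.sqrt(
        sum(c * c for c in vec_b.values())
    )
    return dot / norm if norm else 0.0


s1 = "shop no 1 abc lane city1 - zipcode1"
s2 = "shop no 2 efg lane city1 - zipcode2"
s4 = "shop no 3 abc lane city1 - zipcode1"

print(cosine(s1, s4))  # 6 of 7 words shared, so roughly 6/7
print(cosine(s1, s2))  # 4 of 7 words shared, so roughly 4/7
```

Strings 1 and 4 score higher than strings 1 and 2, which is the ordering the question asks for.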

Results

I used the sample addresses at http://pastebin.com/hyskz4pn as test data:

    1.0000000000000002 no 15& 16 1st floor,2nd main road,khb colony,gandinagar yelahanka
     no 15& 16 1st floor,2nd main road,khb colony,gandinagar yelahanka
    1.0 # 51/3 agrahara yelahanka
     #51/3 agrahara yelahanka
    0.9999999999999999 # c m c road,yalahanka
     # c m c road,yalahanka
    0.8728715609439696 # 1002/b b b road,yelahanka
     0,b b road,yelahanka
    0.8432740427115678 # lakshmi complex c m c road,yalahanka
     # sri lakshman complex c m c road,yalahanka
    0.8333333333333335 # 85/1 b b m p office road,kogilu yelahanka
     #85/1 b b m p office near kogilu yalahanka
    0.8249579113843053 # 689 3rd cross sheshadripuram callege opp yelahanka
     # 715 3rd cross sectur sheshadripuram callege opp yelahanka
    0.8249579113843053 # 10 ramaiaia complex b b road,yalahanka
     # jamati complex b b road,yalahanka
    [ snipped ]

    similarity sorting of 702 strings lasted 8.955146235887025 s.
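A fair share of that runtime probably goes into re-tokenizing the same strings in the inner loop and into membership checks on a list. That is an assumption, since nothing here was profiled. One possible variation, sketched below with a hypothetical `pair_by_similarity` function, tokenizes each string once and keeps the already paired strings in a set. It follows the same greedy pairing idea, but it is still quadratic in the number of strings, so it would not by itself make a hundred thousand rows practical.

```python
import math
import re
from collections import Counter

WORD = re.compile(r'\w+')


def get_cosine(vec1, vec2):
    """Cosine similarity of two word-count vectors."""
    common = vec1.keys() & vec2.keys()
    numerator = sum(vec1[w] * vec2[w] for w in common)
    denominator = math.sqrt(sum(c * c for c in vec1.values())) * math.sqrt(
        sum(c * c for c in vec2.values())
    )
    return numerator / denominator if denominator else 0.0


def pair_by_similarity(texts):
    """Greedily pair each string with its most similar later string.

    Returns (cosine, text_1, text_2) tuples, most similar pair first.
    """
    vectors = [Counter(WORD.findall(t)) for t in texts]  # tokenize once
    used = set()  # strings that already belong to a pair
    pairs = []
    for i, text in enumerate(texts):
        if text in used:
            continue
        best = None
        for p in range(i + 1, len(texts)):
            if texts[p] in used:
                continue
            cosine = get_cosine(vectors[i], vectors[p])
            # Keep the first strictly best match; ignore zero similarity.
            if cosine > 0.0 and (best is None or cosine > best[0]):
                best = (cosine, text, texts[p])
        if best:
            used.update(best[1:])
            pairs.append(best)
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


addresses = [
    "shop no 1 abc lane city1 - zipcode1",
    "shop no 2 efg lane city1 - zipcode2",
    "shop no 1 xyz lane city2 - zipcode3",
    "shop no 3 abc lane city1 - zipcode1",
]
for cosine, first, second in pair_by_similarity(addresses):
    print(round(cosine, 3), first, "|", second)
```

On the four example strings from the question this pairs string 1 with string 4 first, then string 2 with string 3. The order of pairs with equal cosine may differ from the original script.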
