I'm new to Elasticsearch and wanted to know if the following is possible:
I have a bunch of address strings that I want to sort on the repetitive terms in the strings.
For example:
1. shop no 1 abc lane city1 - zipcode1
2. shop no 2 efg lane city1 - zipcode2
3. shop no 1 xyz lane city2 - zipcode3
4. shop no 3 abc lane city1 - zipcode1
What I need is to bunch them together based on the common terms in the strings.
So the sorted output for the earlier example should be:
1. shop no 1 abc lane city1 - zipcode1
4. shop no 3 abc lane city1 - zipcode1   # these two have the most words in common
2. shop no 2 efg lane city1 - zipcode2   # second most words in common with 1 and 4
3. shop no 1 xyz lane city2 - zipcode3   # not many terms in common with the others
I have no idea how to go about it. I know I could fire each string as a query, which would give me the results closest to the query being fired. But I have a hundred thousand rows of such strings, so that doesn't seem like an efficient option at all.
If I could do a matchAll() and sort with a term filter on the number of recurring terms in every string, that would be helpful.
Can there be a sort on the documents that contain the most similar words in the inverted index?
Here's a sample pastebin showing how the data looks: sample addresses
Solution
I used https://stackoverflow.com/a/15174569/61903 to calculate the cosine similarity of two strings (credits to @vpekar) as the base algorithm for similarity. I put all the strings into a list, set an index i to 0, and loop for as long as i is within the list length. Within the loop I iterate over a position p from i+1 to length(list) and find the maximum cosine value between list[i] and list[p]. Both text strings are then put into an out list so they won't be taken into account in later similarity calculations. Both text strings are also put into the result list along with the cosine value, in the data structure VectorResult.
Afterwards the list is sorted by cosine value. Now we have unique string pairs in descending order of cosine, a.k.a. similarity value. HTH.
import re
import math
import timeit
from collections import Counter

WORD = re.compile(r'\w+')


def get_cosine(vec1, vec2):
    # Cosine similarity of two word-count vectors (Counter objects).
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])
    sum1 = sum([vec1[x] ** 2 for x in vec1.keys()])
    sum2 = sum([vec2[x] ** 2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator


def text_to_vector(text):
    # Split the text into words and count each one.
    words = WORD.findall(text)
    return Counter(words)


class VectorResult(object):
    # A pair of texts plus their cosine similarity; compares by cosine.
    def __init__(self, cosine, text_1, text_2):
        self.cosine = cosine
        self.text_1 = text_1
        self.text_2 = text_2

    def __eq__(self, other):
        return self.cosine == other.cosine

    def __le__(self, other):
        return self.cosine <= other.cosine

    def __ge__(self, other):
        return self.cosine >= other.cosine

    def __lt__(self, other):
        return self.cosine < other.cosine

    def __gt__(self, other):
        return self.cosine > other.cosine


def main():
    start = timeit.default_timer()
    texts = []
    with open('data.txt', 'r') as f:
        texts = f.readlines()
    cosmap = []  # resulting pairs
    i = 0
    out = []     # texts that have already been paired
    while i < len(texts):
        max_cosine = 0.0
        current = None
        for p in range(i + 1, len(texts)):
            if texts[i] in out or texts[p] in out:
                continue
            vector1 = text_to_vector(texts[i])
            vector2 = text_to_vector(texts[p])
            cosine = get_cosine(vector1, vector2)
            if cosine > max_cosine:
                current = VectorResult(cosine, texts[i], texts[p])
                max_cosine = cosine
        if current:
            out.extend([current.text_1, current.text_2])
            cosmap.append(current)
        i += 1
    cosmap = sorted(cosmap)
    # Print the most similar pairs first.
    for item in reversed(cosmap):
        print(item.cosine, item.text_1, item.text_2)
    end = timeit.default_timer()
    print("similarity sorting of {} strings lasted {} s.".format(len(texts), end - start))


if __name__ == '__main__':
    main()
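To get a feel for the numbers this produces, here is a small self-contained sketch that works through the cosine for addresses 1 and 4 from the question. It is only an illustration of the same word-count cosine, not part of the script above; the variable names are my own.

```python
import math
import re
from collections import Counter

# Word counts for addresses 1 and 4 from the question.
a = Counter(re.findall(r"\w+", "shop no 1 abc lane city1 - zipcode1"))
b = Counter(re.findall(r"\w+", "shop no 3 abc lane city1 - zipcode1"))

# Only words present in both strings contribute to the dot product.
shared = a.keys() & b.keys()
dot = sum(a[w] * b[w] for w in shared)

# Length of each count vector.
norm_a = math.sqrt(sum(n * n for n in a.values()))
norm_b = math.sqrt(sum(n * n for n in b.values()))

cosine = dot / (norm_a * norm_b)
print(sorted(shared))
print(cosine)
```

Each string has 7 distinct words (the "-" is not matched by \w+) and they share 6 of them, so the cosine comes out at 6/7, roughly 0.857.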
Results
I used the sample addresses at http://pastebin.com/hyskz4pn as test data:
1.0000000000000002 no 15& 16 1st floor,2nd main road,khb colony,gandinagar yelahanka
 no 15& 16 1st floor,2nd main road,khb colony,gandinagar yelahanka
1.0 # 51/3 agrahara yelahanka
 #51/3 agrahara yelahanka
0.9999999999999999 # c m c road,yalahanka
 # c m c road,yalahanka
0.8728715609439696 # 1002/b b b road,yelahanka
 0,b b road,yelahanka
0.8432740427115678 # lakshmi complex c m c road,yalahanka
 # sri lakshman complex c m c road,yalahanka
0.8333333333333335 # 85/1 b b m p office road,kogilu yelahanka
 #85/1 b b m p office near kogilu yalahanka
0.8249579113843053 # 689 3rd cross sheshadripuram callege opp yelahanka
 # 715 3rd cross sectur sheshadripuram callege opp yelahanka
0.8249579113843053 # 10 ramaiaia complex b b road,yalahanka
 # jamati complex b b road,yalahanka
[ snipped ]
similarity sorting of 702 strings lasted 8.955146235887025 s.
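Since the question mentions a hundred thousand rows, the timing above is worth a second look: the loop tokenizes both strings again for every comparison and checks membership in a list. A possible variation, which I have not benchmarked, tokenizes each string once and keeps the already-paired strings in a set. The function name pair_by_similarity is made up for this sketch, and the number of comparisons still grows quadratically, so it only trims the constant factor.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def pair_by_similarity(texts):
    """Greedily pair each text with its most similar later text."""
    # Tokenize every text once instead of once per comparison.
    vectors = [Counter(WORD.findall(t)) for t in texts]
    norms = [math.sqrt(sum(c * c for c in v.values())) for v in vectors]

    used = set()  # texts that are already part of a pair
    pairs = []
    for i, text in enumerate(texts):
        if text in used:
            continue
        best, best_p = 0.0, None
        for p in range(i + 1, len(texts)):
            if texts[p] in used or not norms[i] or not norms[p]:
                continue
            shared = vectors[i].keys() & vectors[p].keys()
            dot = sum(vectors[i][w] * vectors[p][w] for w in shared)
            cosine = dot / (norms[i] * norms[p])
            if cosine > best:
                best, best_p = cosine, p
        if best_p is not None:
            used.update((text, texts[best_p]))
            pairs.append((best, text, texts[best_p]))
    # Most similar pairs first.
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs


# The four example addresses from the question.
addresses = [
    "shop no 1 abc lane city1 - zipcode1",
    "shop no 2 efg lane city1 - zipcode2",
    "shop no 1 xyz lane city2 - zipcode3",
    "shop no 3 abc lane city1 - zipcode1",
]
for cosine, first, second in pair_by_similarity(addresses):
    print(cosine, first, "|", second)
```

On the four example addresses this pairs 1 with 4 first and then the two leftovers, 2 with 3. Pairs with equal cosine may come out in a different order than in the script above.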