
How To Compare Two Strings By Meaning?

I want the user of my node.js application to write down ideas, which then get stored in a database. So far so good, but I don't want redundant entries in that table, so I decided to

Solution 1:

Comparing the meaning of two strings is still an area of ongoing research. If you really want to solve the problem (or to get really good performance out of your language model), you should consider getting a PhD.

For an out-of-the-box solution at the time of writing: I found this GitHub repo that implements Google's BERT model and uses it to get the embeddings of two sentences. In theory, two sentences share the same meaning if their embeddings are similar.

https://github.com/UKPLab/sentence-transformers

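# the following is simplified from their README.md
import scipy.spatial.distance
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('bert-base-nli-mean-tokens')

# Two single-sentence lists to compare
S1 = ['A man is eating a food.']
S2 = ['A man is eating pasta.']

# encode() returns one embedding row per input sentence
s1_embedding = embedder.encode(S1)
s2_embedding = embedder.encode(S2)

# cosine distance = 1 - cosine similarity; smaller means closer in meaning
dist = scipy.spatial.distance.cdist(s1_embedding, s2_embedding, "cosine")[0]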
Example output (copied from their README.md)

Query: A man is eating pasta.
Top 5 most similar sentences in corpus:
A man is eating a piece of bread. (Score: 0.8518)
A man is eating a food. (Score: 0.8020)
A monkey is playing drums. (Score: 0.4167)
A man is riding a horse. (Score: 0.2621)
A man is riding a white horse on an enclosed ground. (Score: 0.2379)
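
Tying this back to the question's goal of avoiding redundant entries, one possible way to use such embeddings is a simple threshold check before inserting a new idea. The following is only a sketch: the vectors are made-up 3-dimensional stand-ins for real sentence embeddings, and the `find_duplicate` helper and the 0.8 threshold are hypothetical choices for illustration, not part of sentence-transformers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def find_duplicate(new_vec, stored, threshold=0.8):
    """Return the id of the most similar stored idea at or above threshold, else None."""
    best_id, best_score = None, threshold
    for idea_id, vec in stored.items():
        score = cosine_similarity(new_vec, vec)
        if score >= best_score:
            best_id, best_score = idea_id, score
    return best_id


# Toy vectors (assumed values); real embeddings would come from the encoder.
stored = {
    "idea-1": [0.9, 0.1, 0.0],
    "idea-2": [0.0, 0.2, 0.9],
}
new_idea = [0.8, 0.2, 0.1]

match = find_duplicate(new_idea, stored)
print("possible duplicate of:", match)
```

The threshold would have to be tuned on real data; a value that is too low rejects distinct ideas, and one that is too high lets paraphrases through.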

Solution 2:

To compare two strings by meaning, the strings first need to be converted to tensors (embeddings); the distance or similarity between those tensors can then be evaluated. Many algorithms can convert strings to tensors, and the right choice depends on the domain of interest. The Universal Sentence Encoder, however, is a general-purpose sentence encoder that projects any text into a single shared vector space. Cosine similarity can then be used to see how close two words are in meaning.
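
For reference, cosine similarity between two embedding vectors u and v, and the matching cosine distance, are usually defined as follows (these are the standard definitions, not something stated in the answer):

```latex
\[
\operatorname{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert},
\qquad
d_{\cos}(u, v) = 1 - \operatorname{sim}(u, v)
\]
```

A similarity near 1 (distance near 0) means the two vectors point in almost the same direction, which is read as "close in meaning".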

Example

Though king and kind are close in Hamming distance (they differ by only one character), they are very different in meaning. Queen and king, on the other hand, look unrelated at the character level (nearly all of their characters differ) but are close in meaning. Therefore the distance (in meaning) between king and queen should be smaller than between king and kind, as demonstrated in the following snippet.

<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/universal-sentence-encoder"></script>
<script>

(async() => {

const model = await use.load();
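// embed() returns one row per input string; unstack() splits the rows into separate tensors
const embeddings = (await model.embed(['queen', 'king', 'kind'])).unstack()
// cosine distance: a smaller value means closer in meaning
tf.losses.cosineDistance(embeddings[0], embeddings[1], 0).print() // queen vs king: 0.39812755584716797
tf.losses.cosineDistance(embeddings[1], embeddings[2], 0).print() // king vs kind: 0.5585797429084778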

})()  
</script>
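
The character-level half of this comparison can be checked without any model. The answer mentions Hamming distance, which is only defined for strings of equal length; since king and queen differ in length, this sketch swaps in Levenshtein edit distance instead. It is written in Python rather than JavaScript, and `edit_distance` is an illustrative helper, not a library function.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]


spelling_king_kind = edit_distance("king", "kind")
spelling_king_queen = edit_distance("king", "queen")

# king/kind are neighbours in spelling, king/queen are not,
# which is the opposite of the ordering the embeddings give.
print(spelling_king_kind, spelling_king_queen)
```

Spelling-based distances like this are cheap and useful for catching typos, but on their own they say nothing about meaning, which is why the answer reaches for embeddings.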
