
[Image: architecture diagram]

I want to identify the best-matching sentence using some kind of pattern. That is, I want a Java algorithm that produces an identifying value for each sentence: every sentence passed to the algorithm should yield such a value. How can I develop this? Do you know any websites I can refer to, or what sort of sites I should look for? To clarify: if I give, for example, 5 sentences to the algorithm, it should generate 5 values. I then compare those values with previously generated values (which I store in my database) and compute the gap between each new value and the stored values. The sentence with the smallest gap is selected as the most suitable one.

I'm using this for my machine translation tool. As an example, suppose my rule-based translation model generates 2 sentences: 1. I want eat an apple. 2. I want eat a house. Suppose my corpus contains many more sentences and I store a value for each of them in my database. (I don't yet know how to assign the values.) I want to create a Java algorithm that assigns a value to each sentence. For example, suppose sentence 1 has the value 250.8 and sentence 2 has the value 290.5.

The database contains the values 248, 400 and 800. I then compute the differences. The smallest difference is obtained for 250.8, so the most suitable sentence is sentence 1.
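
The selection step described above can be sketched in Java as follows. This only covers the comparison; how a sentence's value would be produced is the open part of the question, so the values are simply hard-coded from the example. The class and method names (`NearestValueSelector`, `minGap`, `bestCandidate`) are hypothetical.

```java
final class NearestValueSelector {

    /** Smallest absolute gap between one candidate value and any stored value. */
    static double minGap(double candidate, double[] stored) {
        double best = Double.POSITIVE_INFINITY;
        for (double s : stored) {
            best = Math.min(best, Math.abs(candidate - s));
        }
        return best;
    }

    /** Index of the candidate whose value lies closest to some stored value, or -1 if none. */
    static int bestCandidate(double[] candidates, double[] stored) {
        int bestIndex = -1;
        double bestGap = Double.POSITIVE_INFINITY;
        for (int i = 0; i < candidates.length; i++) {
            double gap = minGap(candidates[i], stored);
            if (gap < bestGap) {
                bestGap = gap;
                bestIndex = i;
            }
        }
        return bestIndex;
    }

    public static void main(String[] args) {
        double[] candidates = {250.8, 290.5}; // assumed values for sentences 1 and 2
        double[] stored = {248, 400, 800};    // values from the database in the example
        System.out.println(bestCandidate(candidates, stored)); // prints 0, i.e. sentence 1
    }
}
```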

user3149
  • I think this will give you most information you need: http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed – thorsten müller Mar 16 '14 at 06:24
  • http://meta.programmers.stackexchange.com/questions/6483/why-was-my-question-closed-or-down-voted/6487#6487 – gnat Mar 16 '14 at 11:51
  • Please fix your English. It is not clear what you really want. – Euphoric Mar 16 '14 at 12:22
  • Yes, I am not good at English. Sorry for that. I have edited the question again and added an example as well, Dr. @Euphoric – user3149 Mar 16 '14 at 12:50
  • If I understand correctly this question is asking for a hash that preserves (or introduces) a metric. I would be very interested in a hash that preserves a metric, even imperfectly. That is, inputs that are nearby according to some metric give hashes which are nearby. In this particular question it seems that the measure of nearness between sentences is to be based on their meaning, rather than simply their character content. This makes it a far less straightforward question. – trichoplax is on Codidact now Mar 16 '14 at 13:04
  • Since English is not getting the meaning across in this question, would pseudocode be a good common language with which to explain what is required? – trichoplax is on Codidact now Mar 16 '14 at 13:08
  • Thank you @githubphagocyte. Can you tell me whether there is any way to generate a numerical value for a given sentence? – user3149 Mar 16 '14 at 13:24
  • What I mean is, can you include pseudocode in your question so that we can see what you want? For example, do you want: 1. Which sentence has most similar **meaning**? 2. Which sentence has most similar **characters**? – trichoplax is on Codidact now Mar 16 '14 at 13:41
  • I have added a picture of my architecture. In the sentence finalization step, 5, 6 or 7 sentences may be generated. Therefore I have to identify which one is most correct, and I have to do that using my corpus. That's why I thought of assigning a value to each generated sentence, comparing it with the database values generated from the corpus sentences, and selecting the most suitable one. Otherwise I have to go through each and every sentence in my corpus. Is this a good method, or is there another useful way? – user3149 Mar 16 '14 at 14:08
  • What are you measuring about the words and the sentence? Hypothetically, how would these sentences score on similarity: I ate an apple. I'm so hungry I could eat a horse. My horse ate my apple. I rode a horse to Apple. –  Mar 16 '14 at 14:57
  • @MichaelT Dr., when translating, sentences with odd meanings can be generated. That's why I want to select the most suitable one using the database, and why I want to create a numerical value for each sentence. For example, when "eat" is used, the difference between the database value and my newly created value should be small. Then we can get the most suitable one, "I could eat apple", not "I could eat horse". – user3149 Mar 16 '14 at 17:02
  • @user3149 note that "I could eat a horse" *is* idiomatic English ( http://en.wiktionary.org/wiki/I_could_eat_a_horse ). I believe there are more dimensions here to the sentence that you are trying to capture than can easily be depicted in a simple number that has a useful meaning. Mapping a sentence to a number isn't something simple like [soundex](http://en.wikipedia.org/wiki/Soundex). –  Mar 16 '14 at 17:07
  • I have to do something for my project. Even without a useful meaning, does any of you know a method to create a value for a sentence? I have to show something this week. Please help me. – user3149 Mar 16 '14 at 17:23
  • The thing is, I *can't* imagine how to store an arbitrary sentence in a number in a meaningful way that lets you say "this sentence has a value of 250, the closest one to it in the database is 248 so use that". That's a one dimensional path and similarity between sentences isn't so limited. You need to do a *lot* more work describing the input, the acceptable language restrictions, the output, and the application (what *is* sentence 248 in the database? why is it similar to sentence 250?) –  Mar 16 '14 at 18:24
  • Sentence 248 means a sentence that may be similar to the sentence with value 250. I assumed that because the two values (250, 248) differ by little, they may be the same kind of sentence. If sentences have the same kind of content, the algorithm should generate values with a small difference. I assumed that an algorithm can fulfill that condition. @MichaelT In any case, how can I generate a numerical value for a sentence? That alone is enough for now. Thank you so much. – user3149 Mar 16 '14 at 18:38
  • Similar in ***what*** way? How do you intend to measure similarity? I can't see any reasonable way to do this with unrestricted English. At all. Nada. You haven't described any restrictions on the language. You haven't described two sentences that *are* similar and how they are. You haven't described two sentences that *aren't* similar. What is in sentence #248 that makes it more similar to "I want to eat an apple"? I believe you are trying to do the impossible and you haven't described anything that gives anyone enough information to help you either understand this or solve it. –  Mar 16 '14 at 19:12
  • 248 is the value the algorithm generates for a sentence. I don't know how to generate a numerical value for a sentence; that value-generating part is what I actually want to solve. Similarity means having similar words or a similar meaning. For example, a database value of 250 would belong to a similar sentence like "I want to eat a ..." banana, mango, etc. That's why I thought the values would be close. @MichaelT Thank you a lot for your explanations, but I still have no clear idea how to solve this. – user3149 Mar 17 '14 at 03:38

1 Answer

If you want to create a unique value for each and every sentence, then don't even try: there are more possible sentences than possible values of any fixed-size identifier, so by the pigeonhole principle you are guaranteed to get collisions and the identifiers won't be unique. You could limit the input space accordingly, but then the algorithm loses its purpose.
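
As a small illustration of such a collision, here is a sketch using Java's built-in 32-bit `String.hashCode()`; the class and method names are made up for the example. The strings `"Aa"` and `"BB"` are different but hash to the same value.

```java
final class HashCollisionDemo {

    /** True if the two strings differ but share the same String.hashCode(). */
    static boolean collide(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        // 'A'*31 + 'a' = 65*31 + 97 = 2112 and 'B'*31 + 'B' = 66*31 + 66 = 2112
        System.out.println(collide("Aa", "BB")); // prints true
    }
}
```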

If you are looking for a way to create an identifier that might indicate that sentences are equal, but does not guarantee it, then hashing was practically created for this purpose. It allows you to check whether the sentences might be equal, but you still have to run the equality function to guarantee it.
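
A minimal sketch of that two-step check, assuming the sentences are kept in memory and grouped by `String.hashCode()`; the `SentenceIndex` class is hypothetical and only illustrates the idea. The hash narrows the search to a few candidates, and `equals` confirms the match.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SentenceIndex {

    // Sentences grouped by hash value; colliding sentences share one bucket.
    private final Map<Integer, List<String>> buckets = new HashMap<>();

    void add(String sentence) {
        buckets.computeIfAbsent(sentence.hashCode(), k -> new ArrayList<>()).add(sentence);
    }

    boolean contains(String sentence) {
        // Step 1: an equal hash only means the sentence *might* be stored.
        List<String> candidates = buckets.get(sentence.hashCode());
        if (candidates == null) {
            return false;
        }
        // Step 2: the equality check is what guarantees the match.
        for (String candidate : candidates) {
            if (candidate.equals(sentence)) {
                return true;
            }
        }
        return false;
    }
}
```

Note that this only finds identical sentences. It says nothing about how close two different sentences are, since nearby inputs generally produce unrelated hash values.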

Euphoric
  • I use a pattern matching technique in my translation tool. For it I have created a corpus, that is, a large number of sentences. First I create candidate sentences by translating an English sentence into my language. Then I want to select the most suitable candidate. Therefore I decided to store my corpus sentences together with a value generated by some kind of Java algorithm. Then I can easily compare those values with my candidate sentence's value and select the most suitable one. I have no clear idea how to create a value or identifier for a sentence. – user3149 Mar 16 '14 at 09:14
  • So I think you are looking for more than just a unique identifier for each sentence. You are looking for the distance from one value to another to indicate distance to the target sentence as well. Can you update the question to reflect this extra requirement? – Encaitar Mar 16 '14 at 09:21
  • I have edited it. Please tell me if anything is still unclear. Thank you so much for your help. – user3149 Mar 16 '14 at 09:29