Homework 5:
Adding Spell Checking and AutoComplete to Your Search Engine
Objectives
o Gain experience using a third-party spelling correction program
o Develop efficient methods for accomplishing autocomplete
In the previous document (AutocompleteInSolr.pdf) you saw how to enhance the Solr program
with spelling correction and an autocomplete (suggest) function. In this exercise you are asked
to use an external spelling correction program in conjunction with Solr and to enhance the
autocomplete functionality of Solr. For spelling correction, you may use an existing third-party
program adapted to your downloaded files. In the case of autocomplete you will need to enhance
your client program that communicates with Solr to deliver autocomplete suggestions to the web
interface you created in an earlier homework.
Description of the Exercise
Spelling Correction: in the class lecture you saw a complete spelling correction program
developed by Peter Norvig. The program was written in Python. For this exercise you are
welcome to use whatever third-party spelling program you wish, or you may even write your
own. Since most of you wrote your previous homework client using PHP, you may want to adopt
a version of Norvig’s spelling program written in PHP and run it on your server. You can download
the PHP version of Norvig’s spelling corrector from here:
http://www.phpclasses.org/package/4859-PHP-Suggest-corrected-spelling-text-in-pure-PHP.html#download
(you will have to register at the site before you can download the software; registration is free)
If you prefer to use Norvig’s program in a different language, a wide variety of implementations
can be found at the bottom of this page, http://norvig.com/spell-correct.html
You should make sure to enhance your spelling correction program with a set of terms that are
specific to the news website that you are responsible for. You should make sure that common
terms such as climate, election, etc., and the terms used in the queries of homework #4 are
handled. Norvig’s spell correction program uses a text file (“big.txt”) as its source of known
words and their frequencies; candidate corrections are the known words within a small edit
distance of the input. For this you should create your own “big.txt” for your specified news
website. You can use any parser (we suggest Apache Tika); instructions on using Apache Tika
for this purpose can be found here: https://tika.apache.org/1.5/gettingstarted.html
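To make the role of big.txt concrete, here is a minimal Python sketch in the style of Norvig's corrector. It is not the PHP package linked above and it is simplified: it only considers candidates at edit distance 1, and SAMPLE_TEXT is a made-up stand-in for the text you would extract from your own news pages.

```python
import re
from collections import Counter

# Assumption: stand-in for the contents of your own big.txt.
SAMPLE_TEXT = "climate change and the election dominate the news; the election is near"

def words(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Word frequencies: duplicates in the corpus are what make these counts useful.
WORD_COUNTS = Counter(words(SAMPLE_TEXT))

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(candidates):
    """Keep only the candidates that occur in the corpus."""
    return {w for w in candidates if w in WORD_COUNTS}

def correction(word):
    """Return the most frequent known candidate, or the word itself if none is known."""
    word = word.lower()
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=lambda w: WORD_COUNTS[w])
```

With this sample corpus, correction("electon") returns "election", while a word that is already in the corpus is returned unchanged.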
Autocomplete: for the autocomplete portion of the exercise, you will have to modify your
client program, so it accepts single character insertions to the text box and returns a list of
completions/suggestions.
There are several ways to implement the autocomplete functionality while using Solr. One
possible way is to use the FuzzyLookupFactory
(https://solr.apache.org/guide/7_7/suggester.html) feature of Solr/Lucene. The
FuzzyLookupFactory creates suggestions for misspelled words in fields. It assumes that what
you’re sending as the suggest.q parameter is the beginning of the suggestion, and it matches
terms in your index that start with the provided characters. So, if the query is “ca” it will return
words starting with “ca”, e.g. “california” and “carolina”. Autocomplete suggestions should
already appear after the first and second characters are entered.
For this to work you need to enable the suggest component as described in the tutorial but add
some options.
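As a rough illustration of the client side, the following Python sketch builds a suggest request and pulls the terms out of a JSON response. The core name "mycore", the "/suggest" handler path, and the dictionary name "suggest" are assumptions that must match your own solrconfig.xml, and the response layout shown reflects the suggester's usual JSON output rather than anything specified in this handout.

```python
from urllib.parse import urlencode

def build_suggest_url(base_url, typed, dictionary="suggest"):
    """Build the URL for one suggest lookup (handler path and dictionary are assumed names)."""
    params = {
        "suggest": "true",
        "suggest.dictionary": dictionary,
        "suggest.q": typed,
        "wt": "json",
    }
    return f"{base_url}/suggest?{urlencode(params)}"

def extract_terms(response, dictionary, typed):
    """Return the suggested terms for the typed prefix, or an empty list if none."""
    entry = response.get("suggest", {}).get(dictionary, {}).get(typed, {})
    return [s["term"] for s in entry.get("suggestions", [])]

# Hand-written sample response, used only to exercise extract_terms.
sample_response = {
    "suggest": {
        "suggest": {
            "ca": {
                "numFound": 2,
                "suggestions": [
                    {"term": "california", "weight": 10, "payload": ""},
                    {"term": "carolina", "weight": 4, "payload": ""},
                ],
            }
        }
    }
}

url = build_suggest_url("http://localhost:8983/solr/mycore", "ca")
terms = extract_terms(sample_response, "suggest", "ca")
```

Your web interface would issue such a request on every keystroke and render the returned terms as the dropdown.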
Note: with respect to specific issues about how spelling corrections are displayed or how
autocomplete corrections are displayed you should imitate the way Google handles both. For
example, while typing in the search box, the top suggestions should automatically appear and be
updated as the user keeps typing. The spellcheck suggestion should appear at the top of the
retrieved results. If the word typed is correct no suggestion should appear at the top.
Submission Instructions
You need to place the YouTube URL in your CSCI572/HW5 folder. Uploading a .txt file containing
the link to your HW5 YouTube video to your CSCI572/HW5 Google Drive folder is
acceptable. Please refer to GuidelinesVideoRecordingHW5 for more information on how to create
the YouTube video.
Suggested config change for making ‘AND’ the default instead of ‘OR’ for multi-word queries in Solr
Solr’s default Boolean model uses OR instead of AND. So, if your query is “Elon Musk”, the
result will match all pages that contain either “Elon” OR “Musk”, not only pages that contain the
entire query “Elon Musk”. To solve this problem, set up the standard Query Parser parameters
as follows:
In solrconfig.xml add this line:
within this tag:
and Inside
Remember to reload after editing.
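Independently of the config change, the standard query parser also accepts the default operator as a request parameter, q.op. The Python sketch below shows that per-request variant only; the core name "mycore" is an assumption, and it is not a substitute for the solrconfig.xml setting if the assignment expects that.

```python
from urllib.parse import urlencode

def build_select_url(base_url, query, default_operator="AND"):
    """Build a /select URL that asks the standard query parser to AND the query terms."""
    params = {"q": query, "q.op": default_operator, "wt": "json"}
    return f"{base_url}/select?{urlencode(params)}"

# Assumed local core; adjust the host, port, and core name to your installation.
select_url = build_select_url("http://localhost:8983/solr/mycore", "Elon Musk")
```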
Q1. Can we use default spell checker for HW5?
A. You are not supposed to use the default spell checker for HW5.
Q2. How to handle multi word queries?
A. Handle queries with two words as follows.
E.g.: handle the typed text the same way as if you were typing it for suggest in the Solr UI.
For example:
If you type ‘new’, you may get ‘news’.
If you type ‘new ‘, you should not get ‘news’
One of the ways you could replicate this behavior:
User types ‘n’ => query for ‘n’ and display suggestions
User types ‘ne’ => query for ‘ne’ and display suggestions
User types ‘new’ => query for ‘new’ and display suggestions
User types ‘new ‘ => query for ‘new ‘ and display suggestions (suggestions should be same as
the ones from previous step)
User types ‘new y’ => query for ‘y’, append each suggestion to ‘new’, and display resulting
suggestions
User types ‘new yo’ => query for ‘yo’, append each suggestion to ‘new’, and display resulting
suggestions.
Another way to explain it:
If you type “new ”, you may get “new”, “new book”, “new york” or “new years”.
If you type “new”, you may get “news”, “newspaper”.
After you type the first word “new”, keep this word as the first word in your list, then
query for the second word and append each result after “new”.
If you type “new new”, you may get “new news”, “new newspaper” (because typing “new”
gives “news”, “newspaper”).
When the space after new is typed, the autocomplete behavior should take this under
consideration as well. Hence you have to handle “new “ and not just “new”.
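The step-by-step approach listed first in this answer can be sketched in Python as follows. The function names are hypothetical, and the suggestion lists passed to combine stand in for whatever Solr returns for the queried term.

```python
def split_for_suggest(text):
    """Split the typed text into (words to keep, term to send to the suggester).

    With no space, or with only a trailing space, the text is queried as typed.
    Otherwise only the last word is queried and the earlier words are kept.
    """
    head, sep, tail = text.rpartition(" ")
    if not sep or not tail:
        return "", text
    return head, tail

def combine(head, suggestions):
    """Prepend the kept words to each suggestion returned for the last word."""
    if not head:
        return list(suggestions)
    return [f"{head} {s}" for s in suggestions]

# 'new y' => query for 'y', then append each suggestion to 'new'.
head, term = split_for_suggest("new y")
shown = combine(head, ["york", "years"])
```

Here shown holds "new york" and "new years", which is what the dropdown would display.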
Q3. Can we use solr’s inbuilt auto-complete features?
Q4. What should working spell correction and autocomplete look like?
A. Imitate Google’s autocomplete and spell correction; your result should look like that.
Q5. When using the PHP corrector and loading big.txt, the error log says allowed memory size
exhausted.
A. Add the following code, <?php ini_set('memory_limit', '1024M'); ?> at the start of your php
file. This should solve it. If it still doesn’t, change the code to php ini_set ('memory_limit', -
Q6. Do we need to store user’s history for suggestions?
A. No need to store any user history to get this functionality.
Q7. Do we need to extract the data from all the 17000+ files into big.txt and do we have to avoid
duplicates?
A. Yes, you need to extract the content from all your files. Please do not remove duplicates;
read how Norvig’s spell correction works and you will see why you need to keep them.
Q8. In hw4 it tells us to set the ‘text’ to have a type of "text general". However, in hw5 it says
"text_en_splitting".
A. You can leave it as it is. It works just fine.
Q9. Solr exception: Java.lang.String cannot be cast to Java.lang.String
A. Please check whether you added the suggest component as stated in the document (in the
right place and with the right tags).
Q10. What does big.txt look like?
A. You need to parse content of html files into big.txt. Please refer to http://norvig.com/big.txt.
Q11. Do we need to extract the data from all the 17000+ files into big.txt and do we have to avoid
duplicates?
A. Yes, you need to extract the content from all your files. Please do not remove duplicates;
read how Norvig’s spell correction works. If you de-duplicate, you will defeat the purpose
of using word frequency to estimate P(c), where c is the corrected spelling.
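A small Python illustration of this point, using a made-up four-word corpus: P(c) is estimated as the count of c divided by the total number of words, so removing duplicates flattens every estimate to the same value.

```python
from collections import Counter

def probabilities(word_list):
    """Estimate P(word) as count(word) / total number of words."""
    counts = Counter(word_list)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Made-up corpus: "election" is far more common than "electron".
corpus = ["election", "election", "election", "electron"]

with_duplicates = probabilities(corpus)            # election: 0.75, electron: 0.25
deduplicated = probabilities(sorted(set(corpus)))  # election: 0.5,  electron: 0.5
```

With duplicates kept, a misspelling such as "electin" that is one edit away from both words resolves to the more frequent "election"; after de-duplication the two candidates are tied and the frequency signal is gone.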
Q12. The document "SpellcheckandAutocompletioninSolr.pdf" is only for reference? For both
spellcheck and autocompletion we don't use the solr internal functions? We both use external
A. #1 for spell check: You use external program
#2 for Auto completion: You use the one inbuilt in solr.
Q13. If we search " ", when we put "Donad", the spell check should show "Donald",
what if we put " ", what spell check should show? Should we combine each spell check
result, like " ", or just show "Donad Trump", only correct the correct word?
A. Please check for each word separately when you have multiple words in a query. Our queries
will only be one or two words at most. If the query is , then your result should be
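Checking each word separately could look like the Python sketch below. correct_word is a hypothetical callable standing in for your spelling corrector, and the stub used here knows a single fix purely for illustration.

```python
def correct_query(query, correct_word):
    """Correct each word on its own; return None when nothing changed."""
    original = query.split()
    corrected = [correct_word(word) for word in original]
    suggestion = " ".join(corrected)
    if suggestion.lower() == " ".join(original).lower():
        return None  # every word was already fine, so show no suggestion
    return suggestion

# Stub corrector (assumption): a real one would consult big.txt.
KNOWN_FIXES = {"donad": "donald"}

def stub_corrector(word):
    return KNOWN_FIXES.get(word.lower(), word.lower())

suggestion = correct_query("Donad Trump", stub_corrector)
```

Returning None for an already-correct query matches the requirement that no suggestion appears when the typed words are spelled correctly.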
Q14. Can we use the big.txt provided on the Norvig’s website and add query terms from hw4 into
it, or do we have to generate ourselves with Tika?
A. Please generate it. It doesn't make sense to load all the words from "The Adventures of
Sherlock Holmes" into memory. You might not get correct results either.
Q15. Can we remove the radio button and functionality of page rank here for hw5?
A. No. You can leave it as it was for HW4.
Q16. I did not do the pagination function on my page. Can we just do top ten results?
A. Pagination is NOT a requirement for this exercise. Top 10 should suffice.
Q17. How should the UI look for auto-suggest feature?
A. Try to imitate the way google works. There should be a dropdown of suggestions when a
character is entered. Also, the suggestions should be clickable. Once a suggestion is clicked, it
should replace the text in the text box.
Q18. Mimicking Google Spell Correction
A. Google handles spelling correction in 2 ways. You can follow either of the 2
approaches below.
1. Show result for misspelled word. Just below the text box, you can display the correct
spelling which is clickable. Upon clicking the correct word, it should perform a search and
display the new results.
2. Show results for the spell corrected term. Just below the text box, display the spell
corrected term and the initial misspelled term. Make the misspelled term clickable. Upon
clicking, it should perform a search and display the new results.
Q19. Duplicate suggestions for autocomplete
A. It is okay to get duplicates. You don't need to handle them. You should always show whatever
solr is suggesting for that input character/word.
Q20. Not able to record with any of the suggested screen recording software.
A. You can use any other screen recording software which is not listed by us, as long as the
recorded content is clear. If you are not able to, then use your phone, just make sure it is clear.
If you are using MacOS, you can follow this link https://support.apple.com/en-us/HT208721
Q21. I am getting underscore or dot in autocomplete suggestion. Is that ok?
A. Yes, it is acceptable. You should always show whatever solr is suggesting for that input
character/word.
Q22. Undefined index error while trying SpellCorrector.php.
A. Change file permissions:
sudo chmod 777 /var/www/html/big.txt
sudo chmod 777 /var/www/html/SpellCorrector.php
B. Try the function isset() to avoid this error
Q23. What fields should be shown in the search results? Are snippets to be included or not?
A. Snippets are not required for HW5.
Autocomplete and Spelling correction need to be implemented for HW5.
Q24. What should I do if I get 0 results?
A. If you get 0 results then there should be some indication on the screen. Eg. You can display
the text “No results found” if you get 0 results.
Q25. When I try to type "disclaimer" one character at a time, my name comes up as one of the
suggestions after I type in the first character (since it starts with a 'D'). Also, as I type in more
characters, I get suggestions like dc:title and display:block. Is this okay or is there something
wrong with my configuration?
A. It is acceptable
Q26. Permission is being denied while creating a file with option 'w', getting following warnings.
A. "sudo chmod 777
Or this issue may be solved by putting the file in the same folder as your php script.
Q27. How to create our own big.txt?
A. What you are supposed to do initially is to parse all the HTML files that were provided to you
in the last assignment, parse the contents of these HTML files and store all the words found
in a file called big.txt. This file will be an input to your spell corrector program. As per the
document, you can use Apache Tika for parsing purposes.
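If you want to see the shape of the task before setting up Tika, the Python sketch below extracts visible text with the standard library's html.parser instead. This is a swapped-in technique, not Tika, and it is much cruder; the output file name and the idea of appending one line per page are assumptions.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    """Return the visible text of one HTML document as a single string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

def append_to_big_txt(html_documents, output_path="big.txt"):
    """Append the text of each document to big.txt, keeping duplicate words."""
    with open(output_path, "a", encoding="utf-8") as out:
        for html in html_documents:
            out.write(html_to_text(html) + "\n")

sample_html = ("<html><head><style>p{color:red}</style></head>"
               "<body><p>Climate election</p><script>var x=1;</script></body></html>")
sample_text = html_to_text(sample_html)
```

In practice you would read each downloaded HTML file from disk and pass its contents to append_to_big_txt.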
Q28. Solr will automatically concat the multi-words in the result, like “newyork”,”newyorktimes”.
Is that acceptable?
A. No, in the video you give us, the result words should be separated.
Q29. Unable to call the suggest query with the Solr PHP client. I’m having trouble calling
“suggest” instead of the default “select”: the Solr call I make goes to “select” rather than “suggest.”
A. Try following this website -> https://skipperkongen.dk/2011/01/11/solr-with-jsonp-with-
Q30. Error 404 while searching
A. A 404 response from your server most likely means you need to adjust the path to match
your Solr installation. Check your path.
Q31. Request was blocked due to MIME type (“text/plain”) mismatch (X-Content-Type-Options:
A. Try changing `text/plain` to `application/javascript` as it’s the correct response type for
jsonp. -> https://stackoverflow.com/a/39228881