Indexing HTML files in Solr
This tutorial explains how to index HTML files in Solr using the built-in post tool, which leverages Apache Tika to automatically extract content from HTML files. You should have already downloaded and installed Solr; see
https://solr.apache.org/guide/7_7/installing-solr.html
Apache Tika is a library used for document type detection and content extraction from various file formats. Internally, Tika uses existing document parsers and document type detection techniques to detect and extract data. For example, in the case of HTML pages, Tika uses its HtmlParser to strip out all HTML tags and store only the content of the pages. Tika is a powerful tool and is very useful when you have crawled various types of documents, e.g. PDFs, images, and videos. Tika is included with the Solr installation.
Using the post tool and Tika
1. Start the Solr server: cd into the solr-7.x.x folder and enter the command: bin/solr start
2. Create a new core; in this tutorial the core name is myexample. Use the command: bin/solr create -c myexample. After the core is created you should see the message "Created new core 'myexample'". Verify that your new core was created in solr-7.x.x/server/solr/ with the name myexample.
3. Let us examine the directory structure of this folder. cd into the myexample folder in server/solr. There will be a conf folder, a data folder, and a core.properties file. The conf folder contains the managed-schema file and the solrconfig.xml file; these are the files that have to be modified for field-level analysis during indexing and querying. For example, you can specify that one field from your document is tokenized on whitespace while another field is tokenized into n-grams, depending on how you want to index and later query these fields. You can specify stop words in the stopwords.txt file, which enables Solr to eliminate those words during indexing. The data folder contains the indexed data and logs generated by Solr. The core.properties file contains metadata about the core, e.g., the core name and the directory where the core's data is stored.
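For reference, the core's directory layout should look roughly like this (exact contents vary slightly between Solr 7.x releases):

    server/solr/myexample/
        core.properties      <- core metadata (name, data directory, ...)
        conf/
            managed-schema   <- field definitions and analysis chains
            solrconfig.xml   <- request handlers and core configuration
            stopwords.txt    <- stop words removed during indexing
        data/
            index/           <- the Lucene index files
            tlog/            <- transaction logs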
4. Solr inherently uses Tika for extracting content from the documents that will be indexed. Tika uses the TagSoup library to support virtually any kind of HTML found on the web. The output of the HtmlParser class is used as the streamed content for indexing into Solr. Before indexing HTML files, we have to edit the schema file to make sure that all the text content extracted from the HTML pages by Tika is mapped correctly. To do this, go to the conf folder of the core "myexample". You will see a file named "managed-schema". Open this XML file to edit it. Below is a portion of the managed-schema file.
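For orientation, here is roughly what the relevant field definitions look like in the default Solr 7.x managed-schema (your copy may differ slightly):

    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="_version_" type="plong" indexed="false" stored="false"/>
    <field name="_root_" type="string" indexed="true" stored="false" docValues="false" />
    <field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>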
This tutorial was authored by (with a little help from Prof. Horowitz).
ATTENTION: It has come to our attention that Solr version 8.11 has instituted several security patches which unfortunately prevent HW#4 from being fully completed. As a result, we recommend that all students instead download version 7.7.3, which can be found at https://solr.apache.org/downloads.html, or directly from https://www.apache.org/dyn/closer.lua/lucene/solr/7.7.3/solr-7.7.3.zip?action=download
Uncomment the copyField line near the bottom of the file so that it looks like the snippet below.
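In the Solr 7.x default schema this line ships commented out; after uncommenting, it should read:

    <copyField source="*" dest="_text_"/>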
Fields are defined in field elements of managed-schema, as shown in the excerpt above. These are field definitions for the documents that you will be indexing, and each document that gets indexed will have these fields. You can define various properties for the fields: if "indexed" is set to true, the content of the field will be indexed; if "stored" is set, the content of the field is stored in Solr and can be retrieved in query responses; "required" indicates whether the field is mandatory in a document for indexing; and "multiValued" set to true means the field can appear multiple times in a document. In the current managed-schema, the "id", "_version_", and "_root_" fields are generated automatically while indexing; they may not necessarily be part of your original documents.
Solr has a mechanism for making copies of fields so that you can apply several distinct field types to a single piece of incoming information. You can see that the copyField element copies all the fields into the destination field "_text_". The "_text_" field is only indexed, not stored, because there is no need to keep redundant data. What we are accomplishing here is combining the information from all the different fields into a single field, "_text_".
5. Save your changes and go back to the Solr home directory. Let us try to index some HTML pages. In this example, I have crawled webpages from losangeles.eventful.com and stored the crawled pages in a folder named "crawl_data". You should use the crawl data folder for the specific news website you are responsible for (based on your USC ID, as given in the Solr Exercise document). Indexing is performed using the command:
bin/post -c <core_name> -filetypes html <path_to_folder>
The filetypes option specified above means we are indexing only HTML files. If we are indexing various document types together, we can omit the filetypes option, and the command becomes simply:
bin/post -c <core_name> <path_to_folder>
The crawl folder in this example is "crawl_data", the core to which I am indexing is "myexample", and I am indexing only HTML files, so the command is:
bin/post -c myexample -filetypes html crawl_data/
You should see output similar to the one below if the command was successful:
NOTE – If you are getting an error similar to "SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException: http://localhost:8983/solr/myexample/update/extract?resource.name=%2Fhome%2Fhw4%2Fnytimes%2F39f07542-ee59-4254-ade2-0d393aa7e360.html&literal.id=%2Fhome%2Fhw4%2Fnytimes%2F39f07542-ee59-4254-ade2-0d393aa7e360.html" for almost all of the HTML files (if the error occurs for only a few URLs, that is fine), then you can either downgrade to Solr version 7.7.2 or follow the steps below.
➢ Add the following code to solrconfig.xml:
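A definition along the following lines, adapted from the Solr 7.x reference guide (the exact snippet may differ from the one originally shown in the handout), registers the /update/extract handler and loads the Solr Cell jars:

    <!-- Load the Solr Cell (content extraction) jars -->
    <lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
    <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />

    <!-- Register the extracting request handler used by bin/post for rich documents -->
    <requestHandler name="/update/extract"
                    startup="lazy"
                    class="solr.extraction.ExtractingRequestHandler">
      <lst name="defaults">
        <!-- lowercase extracted field names; map extracted body text into _text_ -->
        <str name="lowernames">true</str>
        <str name="fmap.content">_text_</str>
      </lst>
    </requestHandler>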
After making the above changes, save the file. Restart Solr with the "bin/solr restart" command and rerun the previous command: bin/post -c myexample -filetypes html crawl_data/
After all the files are indexed, Solr will auto-commit the index and you should see the following output.
6. Now we can check the Solr UI to see whether the files have been indexed. Open a browser and go to http://localhost:8983/solr/. Select the core "myexample" from the dropdown. You should see statistics similar to the figure below.
7. Let us see how the Tika parser parsed the HTML pages and what fields were created. Select the Query option and submit the default query (*:*). Below is a snapshot of a portion of one of the web pages from losangeles.eventful.com, followed by the results produced by the Tika parser. Some notable fields extracted here are title, latitude, longitude, description, etc. The HTML page source from which the fields were extracted can be seen below.
The following figure is snipped from the response to the default query in the Solr UI. Notice that the response uses the JSON format and that all meta properties in the HTML are preserved. Also notice that some key:value pairs are autogenerated by Tika, e.g. content_encoding, content_type, x_parsed_by, stream_size, and stream_content_type.
Tika extracted some fields automatically from these pages.
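The same default query can also be issued directly over HTTP, which is handy for inspecting the raw JSON response (this assumes Solr is running on the default port 8983 and the core is named myexample):

    curl "http://localhost:8983/solr/myexample/select?q=*:*&rows=1"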
8. We can also query using the hidden field "_text_". In the following example I am going to search the _text_ field for "French".
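In the Solr UI this amounts to entering _text_:French in the q box; the equivalent HTTP request would be:

    curl "http://localhost:8983/solr/myexample/select?q=_text_:French"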
The query returned the following results. Looking at the JSON, we see that 55 documents were matched and returned.
9. We can configure Solr to query the field _text_ by default by defining the default field in the requestHandler in the solrconfig.xml file.
Uncomment the str element with name "df" and replace its value so that the default query field is "_text_".
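After the edit, the defaults section of the /select request handler in solrconfig.xml should look something like this:

    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="echoParams">explicit</str>
        <int name="rows">10</int>
        <!-- default field searched when the query names no field -->
        <str name="df">_text_</str>
      </lst>
    </requestHandler>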
Reload your core to query with the new configuration: go to the Solr Dashboard UI (http://localhost:8983/solr/) -> Core Admin and click the "Reload" button.
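Alternatively, the core can be reloaded through Solr's CoreAdmin API:

    curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=myexample"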
Note: in practice we would want to build a more appropriate user interface, one in which the query results are properly displayed. Our new user interface would make AJAX calls against the Solr index rather than using the Solr UI.