
Setup

  • You have to define a user in GX WebManager (in the authorization panel) and add it to a group with sufficient rights to reach every page.
  • The username and password of this user have to be added to the credentials.xml of the WebManager Search Engine.
  • Configure the schedule in the crontab.txt (cron format).
  • George has to be started.
  • Set development_mode in the /web/setup tool to false (uncheck the checkbox)
  • Now the site can be indexed.

In the example below, replace the two values 'georgeuser' and 'FillHereThePasswordForGeorgeUser' with the username and password you added.

If there are more webinitiatives, the credential-pattern must be something like <credential pattern=".*-redactie.devel.gx.nl.*" ...

Search engine fails after upgrade or rebuild

Make sure the right credentials.xml and properties.txt are available in both the \webmanager-searchengine\src and \webmanager-searchengine\target directories, not only in the target directory. Otherwise the copy in the target directory will be overwritten during an upgrade or rebuild.
Also, since WebManager 9.3 the sourceformurl parameter must be supplied and must point to the URL that hosts the editor login form. It is required because WebManager 9.3 requires a correct form signature, which can be obtained from the login page.

Further, it's crucial that the searchengine version matches the WebManager version. The searchengine jar files and binaries are not installed automatically, even if they are available in the deploy. The update of the searchengine has to be requested from the system administrator.




Troubleshooting

Basedir error

When you get an error such as 'Unable to set error log to 'C:SVNWM93WM9.3.1/webmanager-searchengine/target/classes/logs/error.log'', the webmanager.basedir in your settings.xml file is incorrect. You probably used backslashes instead of slashes. So instead of C:\SVN\WM93\WM9.3.1 you should use C:/SVN/WM93/WM9.3.1
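
As a minimal sketch of the fix described above, the following Python snippet rewrites a Windows path with forward slashes before you paste it into webmanager.basedir. The function name is hypothetical and not part of GX WebManager.

```python
def to_forward_slashes(path: str) -> str:
    """Return the path with every backslash replaced by a forward slash."""
    return path.replace("\\", "/")


# The raw string keeps the backslashes literal in this example.
print(to_forward_slashes(r"C:\SVN\WM93\WM9.3.1"))  # prints C:/SVN/WM93/WM9.3.1
```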

properties.txt

Possible problems in properties.txt:

  • metaUrl should use a "?" sign, not a slash, in front of the id parameter. If this is not a "?" sign, the meta information (e.g. date) in the indexer may be incorrect. A good example would be: metaUrl=http://localhost:8080/web/webmanager?id=39016
  • If the indexer page (id=39016) is too large, a timeout occurs (SocketTimeoutException). The timeout can be increased by modifying the downloadtimeout setting.
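
The first point above can be checked mechanically. The sketch below is only an illustration in Python, not a tool shipped with the search engine; the function name is hypothetical. It tests whether the id is passed as a query-string parameter (after the "?") rather than as part of the path.

```python
from urllib.parse import parse_qs, urlsplit


def meta_url_uses_query(meta_url: str) -> bool:
    """True when the id is supplied in the query string, i.e. after a '?'."""
    query = urlsplit(meta_url).query
    return "id" in parse_qs(query)


print(meta_url_uses_query("http://localhost:8080/web/webmanager?id=39016"))  # True
print(meta_url_uses_query("http://localhost:8080/web/webmanager/id=39016"))  # False
```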

Notes

Below are some notes concerning the search engine.

Indexing of media repository articles

Articles in the media repository are added to the standard INDEXER output using a stored procedure. By default this stored procedure (wjGetContentForIndexerWithDisplayOn) retrieves only articles of the last 5 days. Reindexing a complete site therefore requires changing the stored procedure to return all articles; otherwise all older articles disappear from the search results.

If the crontab.txt starts with Fullindex, the index is emptied before indexing (in one transaction), so on websites with a media repository only articles of the last 5 days can be found with the search function. Sites with a media repository should have a crontab.txt starting with 'index'.

Do not forget to change back the stored procedure afterwards!

Page removal

Pages that are removed in GX WebManager are not automatically removed from the search index. To achieve removal, the page should be offered empty (e.g. non-published) to the search engine. This will cause all references to the page to be removed from the search index.

Restart

In which cases should the GX WebManager search engine be restarted?

- Changes to the properties.txt require a restart

- The files parser.txt, meta.txt, credentials.txt, crontab.txt are reread every minute 

 

FAQ

Q: Is it possible to filter the search results based on media item terms?
A: Yes, by using metadata and prefix querystring addition. More info here: GX WebManager search engine filtering on metadata and terms

Q: Is it possible to find search results of other sites?
A: Yes. You can add the URLs of other sites to the "crontab.txt" to make the search engine index the pages of that site (make sure the owner of the site allows you to index it). To be able to view search results from that site, create entries in the "meta.txt" as follows:

Every line in the "meta.txt" is in the form of "<URL pattern><tab><meta name><tab><value>". When matching search results, GX WebManager filters for results that belong to the website. This is done by adding "(webid:26098^0) AND " to the search query. The instruction in the "meta.txt" tells the indexer to add that meta info to each indexed URL, thereby making a match possible.
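
To make the line format and the query prefix above concrete, here is a small Python sketch. It only illustrates the text; the function names and the example.com URL pattern are hypothetical, and it is an assumption that the meta name behind the "(webid:26098^0)" filter is "webid" with the web id as its value.

```python
def meta_line(url_pattern: str, meta_name: str, value: str) -> str:
    """Build one meta.txt line: <URL pattern><tab><meta name><tab><value>."""
    return "\t".join((url_pattern, meta_name, value))


def restrict_to_site(query: str, web_id: int) -> str:
    """Prefix a search query with the site filter described in the text."""
    return f"(webid:{web_id}^0) AND {query}"


# Hypothetical external site; 26098 is the web id used in the text.
print(meta_line("http://www.example.com/.*", "webid", "26098"))
print(restrict_to_site("news", 26098))  # prints (webid:26098^0) AND news
```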

Q: Is it possible to influence the order of search results?

A: Yes. Words are matched in fields, adding up to a certain score. The search engine allows you to set weights for fields in order to determine the importance of a field. The default value of a factor is 1. If you want to change the importance of fields, edit the file "properties.txt" and add lines like:

Q: Is it possible to exclude certain URLs from being indexed?

A: Yes. If you are trying to exclude a page, simply untick the checkbox "Include in Search Engine" at the top of the page in edit mode.

However, in some rare cases this will not work, for example when your page outputs CSV data instead of HTML. In such a case, you will have to configure the WebManager Search Engine to ignore the content of the offending URL.

The configuration file "parser.txt" determines which parser will be used for what URL-pattern. By sending the content for any URL through a parser, the WebManager Search Engine will get readable content that can be indexed. The special notation "-" in this case means "no parser", which in turn means "discard the content, do not index".

For example, let us assume we want to exclude the following URL: http://www.gxdeveloperweb.com/Excluded/Do-Not-Index.htm

We now tell the WebManager Search Engine to use no parser ("-") for this URL, for all content types (".*"). This can be done by adding the URL at the top of the "parser.txt" configuration file, like below:

Q: Why is my page not shown in the search results on the site even though it shows up in the search results of the search tools?

A: In this case the reason lies in the meta tags. To check which meta tags your page is indexed with, go to the indexer object (for example http://localhost:8080/web/webmanager?id=39016) and pass your page to it as the document parameter:

     http://localhost:8080/web/webmanager?id=39016&document=http%3A%2F%2Flocalhost%3A8080%2Fweb%2Fshow%2Fid%3D26111

     Make sure you pass the page URL in the document parameter URL-encoded. You can URL-encode it here: http://www.albionresearch.com/misc/urlencode.php

     An easy way to check which meta tags will be searched for is to use xmldebug or to check the searchengine out.log, where each search request is logged.
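
The URL-encoding step above can also be done locally instead of with an online tool. The following Python sketch builds the check URL from the example values in this answer; the function name is hypothetical, and the indexer id 39016 is just the example id used on this page.

```python
from urllib.parse import quote

# Example indexer object from the text; replace it with your own indexer page.
INDEXER_URL = "http://localhost:8080/web/webmanager?id=39016"


def indexer_check_url(page_url: str, indexer_url: str = INDEXER_URL) -> str:
    """Append the URL-encoded page URL as the 'document' parameter."""
    # safe="" makes quote() encode "/" as well, which it leaves alone by default.
    return f"{indexer_url}&document={quote(page_url, safe='')}"


print(indexer_check_url("http://localhost:8080/web/show/id=26111"))
# prints http://localhost:8080/web/webmanager?id=39016&document=http%3A%2F%2Flocalhost%3A8080%2Fweb%2Fshow%2Fid%3D26111
```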