Hi.
I am currently using a search solution called LucidWorks Enterprise which has an integrated web crawler and uses Java Regular Expressions.
Description
With that in mind, I want to crawl a large web site but only index URLs that match a certain pattern.

To limit crawls to the URL specified in the data source definition, you could enter the following in the Included URLs field:
http://www\.lucidimagination\.com/.*
To limit crawls to a path that is relative to the URL specified, you could enter the following in the Included URLs field:
.*/Relative-path/
To limit a crawl of Wikipedia to topics only (not other pages such as history or info), you could enter the following in the Included URLs field:
http://en\.wikipedia\.org/wiki/[^/?]+
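I don't know for certain how the Included URLs field applies these patterns, but assuming it does a whole-URL match, they can be sanity-checked locally with plain java.util.regex (the test URLs below are just made up for illustration):

import java.util.regex.Pattern;

public class IncludeRuleTest {
    public static void main(String[] args) {
        String topicOnly = "http://en\\.wikipedia\\.org/wiki/[^/?]+";
        // A topic page: no further slashes or query string, so it matches
        System.out.println(Pattern.matches(topicOnly,
            "http://en.wikipedia.org/wiki/Regular_expression")); // true
        // A page with a query string: ? is excluded by [^/?]+, so it fails
        System.out.println(Pattern.matches(topicOnly,
            "http://en.wikipedia.org/wiki/Special:Export?pages=Foo")); // false
    }
}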
URL Examples
http://www.example.com/insanely-twisted-shadow-planet/61-27375/reviews/
http://www.example.com/catherine/61-32367/reviews/
Also, these URLs are in the category:
http://www.example.com/reviews/
From this category reviews are further broken down into additional pages:
http://www.example.com/reviews/?page=2
http://www.example.com/reviews/?page=3
So I set the crawler to index:
http://www.example.com/
With the included path:
.*/reviews/
I get about 50 reviews indexed, so it appears the crawler is only indexing pages found on...
http://www.example.com/reviews/
...with /reviews/ in the URL. If I go ahead and add the additional path:
.*/reviews/*
...I get all reviews indexed. So far so good. The problem is that pages I do not want indexed also get indexed...
http://www.example.com/reviews/?page=2
http://www.example.com/reviews/?page=3
http://www.example.com/reviews/?page=4
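I suspect part of the problem is that * in a regex applies to the preceding token, so .*/reviews/* means "/reviews followed by zero or more slashes", not a shell-style wildcard. It also seems to matter whether the crawler checks the whole URL (matches()) or just looks for the pattern anywhere in it (find()); I don't know which one LucidWorks uses, but the difference shows up clearly in plain Java:

import java.util.regex.Pattern;

public class StarTest {
    static void check(String regex, String url) {
        Pattern p = Pattern.compile(regex);
        System.out.printf("matches=%-5b find=%-5b %s%n",
            p.matcher(url).matches(), p.matcher(url).find(), url);
    }

    public static void main(String[] args) {
        check(".*/reviews/*", "http://www.example.com/catherine/61-32367/reviews/");
        // matches=true  find=true
        check(".*/reviews/*", "http://www.example.com/reviews/?page=2");
        // matches=false find=true  <- slips through under substring matching
    }
}

I'm not sure that fully explains what I'm seeing, but it does show the pagination URLs can slip through if the crawler only does substring matching.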
I do want the crawler to visit:
http://www.example.com/reviews/?page=2
http://www.example.com/reviews/?page=3
http://www.example.com/reviews/?page=4
...and just index reviews found on each of those pages, but not index the actual index pages leading to the reviews:
http://www.example.com/reviews/?page=2
...because it makes search results messy and inflates database size.
I realise this isn't a LucidWorks web site; my question is about Java Regular Expressions. Is there an expression I can use that will tell the crawler to visit all pages in:
/reviews/
...but only index the URL if it ends with /reviews/, while not indexing any page that has ?page=x in the URL?
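For what it's worth, here is the kind of pattern I have been testing locally. It is only a guess on my part: it assumes the field does a whole-URL match and that every review URL has exactly two path segments (a slug and an id) before /reviews/:

import java.util.regex.Pattern;

public class IndexRuleTest {
    public static void main(String[] args) {
        // Two non-slash, non-query segments, then /reviews/ at the very end
        String candidate = "http://www\\.example\\.com/[^/?]+/[^/?]+/reviews/";

        String[] urls = {
            "http://www.example.com/insanely-twisted-shadow-planet/61-27375/reviews/", // true
            "http://www.example.com/catherine/61-32367/reviews/",                      // true
            "http://www.example.com/reviews/",         // listing page: false
            "http://www.example.com/reviews/?page=2"   // pagination page: false
        };
        for (String url : urls) {
            System.out.println(Pattern.matches(candidate, url) + "  " + url);
        }
    }
}

That keeps the listing and pagination pages out, but as far as I can tell it only helps if visiting and indexing are controlled by separate rules; a single include pattern that blocks ?page=x would presumably also stop the crawler from visiting those pages and discovering the reviews linked from them.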
If anyone has a better idea of how I can achieve this, please let me know; I would really appreciate it.
Thanks for reading this long rambling post.
Sorry if this is the wrong forum for this type of question.