Hi.
I am currently using a search solution called LucidWorks Enterprise which has an integrated web crawler and uses Java Regular Expressions.
Description
With that in mind, I want to crawl a large web site but only index URLs that match a certain pattern.

To limit crawls to the URL specified in the data source definition, you could enter the following in the Included URLs field:
http://www\.lucidimagination\.com/.*
To limit crawls to a path that is relative to the URL specified, you could enter the following in the Included URLs field:
.*/Relative-path/
To limit a crawl of Wikipedia to topics only (not other pages such as history or info), you could enter the following in the Included URLs field:
http://en\.wikipedia\.org/wiki/[^/?]+
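I don't know for certain how the Included URLs field applies these patterns, but assuming it does a whole-URL match, they can be sanity-checked locally with plain java.util.regex (the test URLs below are just made up for illustration):

import java.util.regex.Pattern;

public class IncludeRuleTest {
    public static void main(String[] args) {
        String topicOnly = "http://en\\.wikipedia\\.org/wiki/[^/?]+";
        // A topic page: no further slashes or query string, so it matches
        System.out.println(Pattern.matches(topicOnly,
            "http://en.wikipedia.org/wiki/Regular_expression")); // true
        // A page with a query string: ? is excluded by [^/?]+, so it fails
        System.out.println(Pattern.matches(topicOnly,
            "http://en.wikipedia.org/wiki/Special:Export?pages=Foo")); // false
    }
}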
URL Examples
http://www.example.com/insanely-twisted-shadow-planet/61-27375/reviews/
http://www.example.com/catherine/61-32367/reviews/
Also, these URLs are in the category:
http://www.example.com/reviews/
From this category reviews are further broken down into additional pages:
http://www.example.com/reviews/?page=2
http://www.example.com/reviews/?page=3
So I set the crawler to index:
http://www.example.com/
With the included path:
.*/reviews/
I get about 50 reviews indexed, so it appears the crawler is only indexing pages found on...
http://www.example.com/reviews/
...with /reviews/ in the URL. If I go ahead and add the additional path:
.*/reviews/*
...I get all reviews indexed. So far so good. The problem is that pages I do not want indexed also get indexed...
http://www.example.com/reviews/?page=2
http://www.example.com/reviews/?page=3
http://www.example.com/reviews/?page=4
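I suspect part of the problem is that * in a regex applies to the preceding token, so .*/reviews/* means "/reviews followed by zero or more slashes", not a shell-style wildcard. It also seems to matter whether the crawler checks the whole URL (matches()) or just looks for the pattern anywhere in it (find()); I don't know which one LucidWorks uses, but the difference shows up clearly in plain Java:

import java.util.regex.Pattern;

public class StarTest {
    static void check(String regex, String url) {
        Pattern p = Pattern.compile(regex);
        System.out.printf("matches=%-5b find=%-5b %s%n",
            p.matcher(url).matches(), p.matcher(url).find(), url);
    }

    public static void main(String[] args) {
        check(".*/reviews/*", "http://www.example.com/catherine/61-32367/reviews/");
        // matches=true  find=true
        check(".*/reviews/*", "http://www.example.com/reviews/?page=2");
        // matches=false find=true  <- slips through under substring matching
    }
}

I'm not sure that fully explains what I'm seeing, but it does show the pagination URLs can slip through if the crawler only does substring matching.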
I do want the crawler to visit:
http://www.example.com/reviews/?page=2
http://www.example.com/reviews/?page=3
http://www.example.com/reviews/?page=4
...and just index reviews found on each of those pages, but not index the actual index pages leading to the reviews:
http://www.example.com/reviews/?page=2
...because it makes search results messy and inflates database size.
I realise this isn't a LucidWorks web site; my question is about Java Regular Expressions. Is there an expression I can use that will tell the crawler to visit all pages in:
/reviews/
...but only index the URL if it ends with /reviews/, while not indexing any page that has ?page=x in the URL?
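For what it's worth, here is the kind of pattern I have been testing locally. It is only a guess on my part: it assumes the field does a whole-URL match and that every review URL has exactly two path segments (a slug and an id) before /reviews/:

import java.util.regex.Pattern;

public class IndexRuleTest {
    public static void main(String[] args) {
        // Two non-slash, non-query segments, then /reviews/ at the very end
        String candidate = "http://www\\.example\\.com/[^/?]+/[^/?]+/reviews/";

        String[] urls = {
            "http://www.example.com/insanely-twisted-shadow-planet/61-27375/reviews/", // true
            "http://www.example.com/catherine/61-32367/reviews/",                      // true
            "http://www.example.com/reviews/",         // listing page: false
            "http://www.example.com/reviews/?page=2"   // pagination page: false
        };
        for (String url : urls) {
            System.out.println(Pattern.matches(candidate, url) + "  " + url);
        }
    }
}

That keeps the listing and pagination pages out, but as far as I can tell it only helps if visiting and indexing are controlled by separate rules; a single include pattern that blocks ?page=x would presumably also stop the crawler from visiting those pages and discovering the reviews linked from them.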
If anyone has a better idea of how I can achieve this, please let me know; I would really appreciate it.
Thanks for reading this long rambling post.
Sorry if this is the wrong forum for this type of question.