Hi all,
My question in detail:
I am working on a web-crawler project in which we fetch web pages using the HttpClient4.0 beta2.jar API.
It is a multithreaded application, and we follow a client-server approach in the project.
We fetch the URLs from the database and execute each URL with the HttpClient.execute() method.
The code snippet is as follows:
DefaultHttpClient client = new DefaultHttpClient();
URL url = new URL(html.GetURL().trim());
html.Sethost(url.getHost());
// get1 was declared but never initialized in the original snippet;
// it must be given the request URL before execute() is called.
HttpGet get1 = new HttpGet(url.toString());
HttpResponse preresponse = client.execute(get1);
StringBuffer source = new StringBuffer();
HttpEntity entity = preresponse.getEntity();
if (entity != null) {
    InputStream is = entity.getContent();
    BufferedReader br = new BufferedReader(new InputStreamReader(is));
    String line = br.readLine();
    while (line != null) {
        source.append(line).append("\n");
        line = br.readLine();
    }
    br.close();
    is.close();
}
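One detail worth noting (my assumption, not confirmed anywhere in the snippet): the client is created with default parameters, so no connect or read timeout is set, and a read from an unresponsive server can block forever. In HttpClient 4.0 these timeouts can be set through HttpConnectionParams; a minimal sketch with illustrative values:

```java
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.params.HttpConnectionParams;
import org.apache.http.params.HttpParams;

DefaultHttpClient client = new DefaultHttpClient();
HttpParams params = client.getParams();
// Give up on a stalled connection attempt after 10 seconds (illustrative value).
HttpConnectionParams.setConnectionTimeout(params, 10000);
// Give up on a read that receives no data for 30 seconds (SO_TIMEOUT, illustrative value).
HttpConnectionParams.setSoTimeout(params, 30000);
```

This is a configuration fragment only; the rest of the crawler code would stay as above.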
When we start the application, all threads crawl the web URLs correctly, but after some time (4 to 5 hours or more) some threads get blocked after executing the statement

    StringBuffer source = new StringBuffer();

and they do not crawl any further URLs. We also log the status of every thread every 5 minutes, and the blocked threads are still reported in the RUNNABLE state. In addition, we catch all exceptions from the java.io, java.util, and org.apache.http.client packages and write each exception to the log file with the time and the URL, but no exceptions appear in the log.
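For context on the RUNNABLE observation: a thread blocked in native socket I/O is reported as RUNNABLE in a Java thread dump, and a read with no SO_TIMEOUT never throws, so neither the status log nor the exception log would show anything. The following self-contained sketch (class and names are mine, not from the crawler) reproduces this with plain sockets and shows how an SO_TIMEOUT turns the indefinite read into a catchable SocketTimeoutException:

```java
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class StuckReadDemo {
    public static void main(String[] args) throws Exception {
        // Server that accepts a connection but never writes anything,
        // simulating an unresponsive web server.
        final ServerSocket server = new ServerSocket(0);
        Thread acceptor = new Thread(new Runnable() {
            public void run() {
                try {
                    Socket s = server.accept();
                    Thread.sleep(60000); // hold the connection open, send no data
                    s.close();
                } catch (Exception ignored) { }
            }
        });
        acceptor.setDaemon(true);
        acceptor.start();

        Socket client = new Socket("localhost", server.getLocalPort());
        client.setSoTimeout(500); // without this line, read() blocks indefinitely
        InputStream in = client.getInputStream();
        try {
            in.read();
            System.out.println("read returned");
        } catch (SocketTimeoutException e) {
            System.out.println("read timed out");
        } finally {
            client.close();
            server.close();
        }
    }
}
```

With the setSoTimeout line removed, the reading thread hangs on read() while a thread dump still shows it as RUNNABLE, which matches the behaviour described above.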
Please help me.
Thanks & Regards
Rakesh Yadav