Thread: Scanning and displaying every word from a website source code Java

  1. #1
    Junior Member (joined Feb 2014, 7 posts)

    I'm a new Computer Science student and I have been given a task to scan the contents of a website's source code and use delimiters to extract all hyperlinks from the site and display them. We haven't been told anything about how to do this, so after some looking around online, this is what I have so far:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.Scanner;

    public class HyperlinkMain {
        public static void main(String[] args) {
            // Read the address of the page to scan from standard input.
            Scanner in = new Scanner(System.in);
            String address = in.next();

            // try-with-resources closes the reader even if reading fails.
            try (BufferedReader input = new BufferedReader(
                    new InputStreamReader(new URL(address).openStream()))) {
                String inputLine;
                while ((inputLine = input.readLine()) != null) {
                    // Process each line.
                    System.out.println(inputLine);
                }
            } catch (MalformedURLException me) {
                System.out.println(me);
            } catch (IOException ioe) {
                System.out.println(ioe);
            }
        }
    }

    So my program can extract each line from the source code of a website and display it, but what I really want is for it to extract each WORD from the source code rather than each line. I've looked around online, but I don't really know how it's done, because I keep getting errors when I use input.read();

    Could anyone help me understand how to make it extract each word from the source code? Any help would be highly appreciated.


  2. #2
    Norm, Super Moderator (joined May 2010, Eastern Florida, 25,162 posts)

    Can you post a small sample page and the "words" you want to extract from it?

    The project sounds like it should read the lines of html text from the site and then parse the lines to extract the desired words.
    If you don't understand my answer, don't ignore it, ask a question.

  3. #3
    Junior Member

    Basically, I will be given a web page and my program should be able to extract all the hyperlinks from that page.
    They said to use delimiters to extract the links; however, I assume I need to extract every 'word' from the source code, then check using delimiters whether the 'word' is a link.

  4. #4
    Norm, Super Moderator

    need to extract every 'word'
    What is a "word"?
    Can you post a small sample page and the "words" you want to extract from it?

  5. #5
    Junior Member

    Say the website was something like this:
    <html>
    <body>
    <a href="www.google.com">GoogleLink</a>
    </body>
    </html>
    How would I use a delimiter to detect the start of the link (<a href=") and the end of the link (")?

    I've changed my code around a bit:

    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.Scanner;

    public class HyperlinkMain {
        public static void main(String[] args) {
            // Read the address of the page to scan from standard input.
            Scanner in = new Scanner(System.in);
            String address = in.next();

            // try-with-resources closes the page scanner even if reading fails.
            try (Scanner inWebsite = new Scanner(new URL(address).openStream())) {
                while (inWebsite.hasNext()) {
                    // Process each 'word' (whitespace-separated token).
                    System.out.println(inWebsite.next());
                }
            } catch (MalformedURLException me) {
                System.out.println(me);
            } catch (IOException ioe) {
                System.out.println(ioe);
            }
        }
    }

    So all I need to do now is work out how to use a delimiter to find the start and end of the link.

  6. #6
    Norm, Super Moderator

    If you are sure that there are no embedded spaces in the HTML, then the String class has methods (such as indexOf) to locate one String: "<a href=\"" within another String. The same method could be used to find the ending "\"".
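
A minimal sketch of the String-method approach described above, not a definitive solution. It assumes every link is written exactly as `<a href="` with a single space and lowercase letters, as in the sample page from this thread. The class name `IndexOfLinkFinder` and the method `extractLinks` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class IndexOfLinkFinder {
    // Assumption: links always start with exactly this text.
    private static final String START = "<a href=\"";

    // Returns the text between each START marker and the next double quote.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = html.indexOf(START, from);
            if (start < 0) {
                break; // no more opening markers
            }
            int linkStart = start + START.length();
            int linkEnd = html.indexOf('"', linkStart);
            if (linkEnd < 0) {
                break; // opening marker without a closing quote
            }
            links.add(html.substring(linkStart, linkEnd));
            from = linkEnd + 1; // continue searching after this link
        }
        return links;
    }

    public static void main(String[] args) {
        // The sample page posted earlier in the thread.
        String html = "<html>\n<body>\n"
                + "<a href=\"www.google.com\">GoogleLink</a>\n"
                + "</body>\n</html>";
        System.out.println(extractLinks(html)); // prints [www.google.com]
    }
}
```

Because it searches for one fixed string, this version misses tags such as `<A HREF=` or `<a  href =`; handling those variations needs a pattern rather than a literal.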

    How would I use a delimiter to detect the start
    Can you explain what that means? What is the delimiter you are talking about? Is this String: "<a href=\"" a delimiter?

  7. #7
    Junior Member

    I'm really new to programming, so this isn't easy for me to explain or understand, but as the scanner goes through the source code, if it sees [<a href="] then it scans until it finds [">] and outputs what should be between the '=' and the '>'.

    But I've been told to do this by setting delimiters, and I can't find much help on that.

  8. #8
    Norm, Super Moderator

    Are you talking about the delimiters used by the Scanner class? They have to do with the Pattern class and regular expressions. I'm not very good with regular expressions.
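
For reference, a rough sketch of one way Scanner delimiters could be applied to this task. It assumes the delimiter is the double-quote character, so the Scanner hands back the text between quotes as separate tokens, and that links are written exactly as `<a href="`. The class name `DelimiterLinkScanner` and the method `extractLinks` are hypothetical, and this may not be what the assignment intends.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class DelimiterLinkScanner {
    // Splits the input on double quotes; when a token ends with <a href=
    // the following token is the quoted link target.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        try (Scanner scanner = new Scanner(html)) {
            // useDelimiter takes a regular expression; a lone quote matches itself.
            scanner.useDelimiter("\"");
            while (scanner.hasNext()) {
                String token = scanner.next();
                if (token.endsWith("<a href=") && scanner.hasNext()) {
                    links.add(scanner.next());
                }
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // The sample page posted earlier in the thread.
        String html = "<html>\n<body>\n"
                + "<a href=\"www.google.com\">GoogleLink</a>\n"
                + "</body>\n</html>";
        System.out.println(extractLinks(html)); // prints [www.google.com]
    }
}
```

The same method works on a page read from the network by passing the stream to the Scanner constructor instead of a String. Quoted attributes that come before `href` in a tag (for example a `class` attribute) are not handled by this sketch.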

  9. #9
    Junior Member

    Yes, I believe that's along the right lines, using the Scanner class. I understand that not many people use regular expressions, which is why I'm finding it so hard to get help.

  10. #10
    Norm, Super Moderator

    Would you be looking for something like this:
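    Given a stream of characters, skip over characters until a match on the link tag, then match what follows up to the ending " character. The link tag has "<a" or "<A", one or more spaces, "href", optional spaces, "=", optional spaces and ". (The spaces around "=" have to be optional because the sample page has none there.)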
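
The description above can be written as a regular expression. This is a sketch under stated assumptions: `href` is the first attribute of the tag, the value is in double quotes, and the whitespace around `=` is treated as optional because the sample page has none there. The class name `RegexLinkFinder` and the method `extractLinks` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexLinkFinder {
    // <a, one or more whitespace characters, href, optional whitespace, =,
    // optional whitespace, then a double-quoted value captured as group 1.
    // CASE_INSENSITIVE lets the same pattern match <A HREF=...> as well.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the quoted href value of every matching tag, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = LINK.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // The sample page posted earlier in the thread.
        String html = "<html>\n<body>\n"
                + "<a href=\"www.google.com\">GoogleLink</a>\n"
                + "</body>\n</html>";
        System.out.println(extractLinks(html)); // prints [www.google.com]
    }
}
```

Regular expressions are fine for a constrained exercise like this one, but they do not cope with everything real HTML allows (single-quoted values, other attributes before `href`, and so on).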

  11. #11
    Junior Member

    That sounds like it could work. Say the Scanner used the .next() method: it would check each token, and if it finds <a href=" it then finds where the link closes at "> and prints the link that's inside.

  12. #12
    Norm, Super Moderator

    Now find someone that can write a regex to do that.

  13. #13
    Junior Member

    Thanks for trying to help me out on this. I appreciate your time.
