Does the Scanner class use an internal buffer?

**Fazan** · October 8th, 2012, 06:00 AM

This is actually a pretty simple question that I'm surprised I'm unable to find the answer to. Recently, I was advised to use a Scanner to read a large file in tokens, rather than using a BufferedReader and then splitting the Strings it puts out. While this worked wonders for my code, I've found the Scanner to work much more slowly and I suspect the problem is its lack of a buffer... Which it may or may not have. I've found conflicting evidence on the 'net.

So here's my question: Does Scanner use a buffer, if so how large and how can I mess with it? If not, is there any way to get the performance of a BufferedReader with the per-token reading of a Scanner? Because at this point, it seems faster to use a BufferedReader and do String.split than it is to use the Scanner for basically the same task.

For context, the code loading the file looks like this:

private void LoadComparisonList(File comparisonFile)
	{
		if(!comparisonFile.exists() || !comparisonFile.isFile() || !comparisonFile.canRead()
				|| comparisonFile.length() == 0)
		{
			this.errorcode = 2;
			return;
		}

		try
		{
			Scanner scanner = new Scanner(comparisonFile);
			scanner.nextLine(); // skip the header line (column names)

			// check before reading, so a file with only a header line does not throw
			while(scanner.hasNext())
			{
				// fresh array per row; reusing one array makes every list entry point at the same data
				String[] readArray = new String[2];
				readArray[0] = scanner.next();
				readArray[1] = scanner.nextLine();
				this.comparisonList.add(readArray);
			}

			scanner.close();
		}
		catch (IOException exception)
		{
			this.errorcode = 2;
		}
	}

This reads and discards the file's header line, which is just column names, and then reads the first token of each line to save as a header and the rest of the line (another five tokens) as a footer, saving everything into an ArrayList<String[]>. It works just fine, it just takes ~27 seconds for a 90MB, ~5 000 000 line file, where virtually the same action done through a BufferedReader completes in around 4-5 seconds, if I remember correctly.
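
For comparison, here is a minimal sketch of the BufferedReader variant described above: each line is read whole and cut at the first whitespace character, without any tokenizer. The names `ComparisonLoader`, `load` and `firstWhitespace` are made up for illustration, and the sketch assumes the first column never contains whitespace.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ComparisonLoader
{
	// Reads "key rest-of-line" rows, skipping the first (column names) line.
	public static List<String[]> load(Reader source) throws IOException
	{
		List<String[]> rows = new ArrayList<String[]>();
		BufferedReader reader = new BufferedReader(source);
		try
		{
			reader.readLine(); // discard the header line
			String line;
			while ((line = reader.readLine()) != null)
			{
				line = line.trim();
				if (line.isEmpty())
				{
					continue; // ignore blank lines
				}
				int cut = firstWhitespace(line);
				// a fresh array per row, so rows do not share storage
				String[] row = new String[2];
				row[0] = cut < 0 ? line : line.substring(0, cut);
				row[1] = cut < 0 ? "" : line.substring(cut);
				rows.add(row);
			}
		}
		finally
		{
			reader.close();
		}
		return rows;
	}

	// Index of the first whitespace character in s, or -1 if there is none.
	private static int firstWhitespace(String s)
	{
		for (int i = 0; i < s.length(); ++i)
		{
			if (Character.isWhitespace(s.charAt(i)))
			{
				return i;
			}
		}
		return -1;
	}

	public static void main(String[] args) throws IOException
	{
		String data = "name a b c d e\nfoo 1 2 3 4 5\nbar 6 7 8 9 10\n";
		for (String[] row : load(new StringReader(data)))
		{
			System.out.println(row[0] + " |" + row[1]);
		}
	}
}
```

Like `scanner.next()` followed by `scanner.nextLine()`, this keeps the separator at the start of the second element. For a real file, pass a `FileReader` instead of the `StringReader` used in `main`.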

**curmudgeon** · October 8th, 2012, 07:33 AM

I don't know the answer, but if it isn't specified in the API or JLS, it may be undefined and JVM dependent.

Edit:
Google has given me more info:

Does a Java Scanner implicitly create a buffer even if you do not pass it one? - Stack Overflow

**KevinWorkman** · October 8th, 2012, 08:05 AM

Your jdk folder should contain src.zip. Check that out, find the Scanner class, and you can see for yourself exactly what it does. I'd be curious to see what you find.

**Fazan** · October 8th, 2012, 10:58 AM

I can't seem to find that on my system, I'm afraid. I checked the Java folder and the Eclipse folder but I can't find a source file. I'm sure it has to exist somewhere for my system to know what's in the individual classes, I just can't find it. Even Eclipse doesn't seem able to find it, since every time it uses one of the base classes when debugging, I just get a "Source code not available" screen. It's why I had to weed out those steps when debugging. I'm not sure I'd be able to understand what the classes do even if I did find the source, to be honest. I'm not that good at Java.

**KevinWorkman** · October 8th, 2012, 11:07 AM

It depends on how your system is set up, but for example my JDK folder is:

C:\Program Files\Java\jdk1.7.0_07

In that directory, I have a src.zip, and inside that I have a Scanner.java file (inside java/util within the zip).

But I will say that the short answer to your question, from reading the source, is that Scanner uses a CharBuffer internally.

**Fazan** · October 9th, 2012, 09:12 AM

Yeah, I checked my Java folder, but all it has is JRE folders for Java 6 and 7 (I'm still using 6, by the way, legacy stuff). No JDK. As far as Windows 7 can be trusted, I ran a search just in case I wasn't aware of where my Java folders were, but no src.zip turned up.

**KevinWorkman** · October 9th, 2012, 09:14 AM

Strange. What jdk are you using?

Either way, you can download the source either as part of the full JDK or standalone from this page: Java SE Downloads

**Fazan** · October 9th, 2012, 10:40 AM

At the risk of revealing that I'm not entirely certain I know what a Java Development Kit is, I use Eclipse for code-writing purposes and that's about the extent of it. Beyond that, my only other Java-related software are the JRE packages, and even then I only have 6 and 7.

Either way, thank you for the link. I'll look into it.

**KevinWorkman** · October 9th, 2012, 11:58 AM

Okay, gotcha. A JDK is simply the JRE plus the tools that compile your Java code into bytecode (namely the javac tool). Compare that to the JRE, which is the set of tools that run compiled bytecode. Eclipse hides the difference because it ships with its own built-in Java compiler, which is why it can work with nothing but a JRE installed. Most people install a JDK like I have above so that they can compile from the command prompt. The source (src.zip) comes with the full JDK, which is another reason it's a good thing to have.
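
One way to check from code whether the Java installation you are running on includes a compiler (that is, whether it is a JDK and not just a JRE) is `javax.tools.ToolProvider.getSystemJavaCompiler()`, which returns null when no compiler is available. A small sketch; the class name `JdkCheck` and the wording of the messages are made up for illustration.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

public class JdkCheck
{
	// Describes whether the current runtime can see a system Java compiler.
	public static String describeRuntime()
	{
		JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
		String home = System.getProperty("java.home");
		if (compiler == null)
		{
			return "No system compiler found (JRE only): " + home;
		}
		return "System compiler available (JDK): " + home;
	}

	public static void main(String[] args)
	{
		System.out.println(describeRuntime());
	}
}
```

Run from inside Eclipse, this reports on whichever runtime the launch configuration points at, which is a quick way to see what Eclipse is actually using.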

**helloworld922** · October 9th, 2012, 12:20 PM

As far as I know the Scanner class is a convenient Regex-wrapper for any incoming stream. What stream you pass to it will determine if the file is buffered in memory or not.

Scanner buffed_reader = new Scanner(new BufferedReader(new FileReader("file.txt"))); // file gets buffered into memory, usually faster
Scanner unbuffed_reader = new Scanner(new FileReader("file.txt")); // file isn't buffered into memory

That being said, I don't know what the behavior is if you pass a file to the Scanner object directly (though it sounds like it isn't).

It's also possible that it's the Regex side which is slowing your application down. If all you need is basic read-line stuff, Scanners may be overkill. Scanners work best when you're parsing the data while you're reading it in (such as reading in a file of numbers).

**KevinWorkman** · October 9th, 2012, 12:47 PM

From the source, the Scanner(File) constructor wraps the File in a FileInputStream:

    public Scanner(File source) throws FileNotFoundException {
        this((ReadableByteChannel)(new FileInputStream(source).getChannel()));
    }

Which is converted into a Readable:

    public Scanner(ReadableByteChannel source) {
        this(makeReadable(Objects.requireNonNull(source, "source")),
             WHITESPACE_PATTERN);
    }

And finally passed into the main constructor:

    private Scanner(Readable source, Pattern pattern) {
        assert source != null : "source should not be null";
        assert pattern != null : "pattern should not be null";
        this.source = source;
        delimPattern = pattern;
        buf = CharBuffer.allocate(BUFFER_SIZE);
        buf.limit(0);
        matcher = delimPattern.matcher(buf);
        matcher.useTransparentBounds(true);
        matcher.useAnchoringBounds(false);
        useLocale(Locale.getDefault(Locale.Category.FORMAT));
    }

...which creates the CharBuffer. The CharBuffer is used in the base method for reading input:

    private void readInput() {
        if (buf.limit() == buf.capacity())
            makeSpace();
 
        // Prepare to receive data
        int p = buf.position();
        buf.position(buf.limit());
        buf.limit(buf.capacity());
 
        int n = 0;
        try {
            n = source.read(buf);
        } catch (IOException ioe) {
            lastException = ioe;
            n = -1;
        }
 
        if (n == -1) {
            sourceClosed = true;
            needInput = false;
        }
 
        if (n > 0)
            needInput = false;
 
        // Restore current position and limit for reading
        buf.limit(buf.position());
        buf.position(p);
    }
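
The read-ahead can also be observed without reading the source. The sketch below wraps the input in a counting Reader and asks the Scanner for a single token; the count shows how many characters the Scanner pulled into its buffer to produce it. `ScannerReadAhead`, `CountingReader` and `charsPulledForFirstToken` are hypothetical names, and the value of `BUFFER_SIZE` is not shown in the excerpts above, so the example only prints what it observes.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Scanner;

public class ScannerReadAhead
{
	// Hypothetical helper: counts how many chars have been pulled from the wrapped Reader.
	static class CountingReader extends FilterReader
	{
		long charsRead = 0;

		CountingReader(Reader in)
		{
			super(in);
		}

		@Override
		public int read() throws IOException
		{
			int c = super.read();
			if (c != -1)
			{
				++charsRead;
			}
			return c;
		}

		@Override
		public int read(char[] cbuf, int off, int len) throws IOException
		{
			int n = super.read(cbuf, off, len);
			if (n > 0)
			{
				charsRead += n;
			}
			return n;
		}
	}

	// Returns how many chars the Scanner took from the source to produce its first token.
	public static long charsPulledForFirstToken(String text)
	{
		CountingReader counter = new CountingReader(new StringReader(text));
		Scanner scanner = new Scanner(counter);
		scanner.next();
		long pulled = counter.charsRead;
		scanner.close();
		return pulled;
	}

	public static void main(String[] args)
	{
		StringBuilder text = new StringBuilder();
		for (int i = 0; i < 1000; ++i)
		{
			text.append("token").append(i).append(' ');
		}
		long pulled = charsPulledForFirstToken(text.toString());
		System.out.println("chars pulled for one token: " + pulled + " of " + text.length());
	}
}
```

If the Scanner read only what it needed, the count would be about the length of the first token. A much larger count means the Scanner filled its own buffer, whether or not the source was wrapped in a BufferedReader.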

**helloworld922** · October 9th, 2012, 01:42 PM

I suppose I should have performed benchmarks before making assumptions.

import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintStream;
import java.util.Scanner;
 
public class StreamTest
{
	public static void gen_file(String file_name, int lines) throws FileNotFoundException
	{
		PrintStream out = new PrintStream(new BufferedOutputStream(new FileOutputStream(file_name)));
		for (int i = 0; i < lines; ++i)
		{
			out.print("header,");
			for (int j = 0; j < 5; ++j)
			{
				out.print(i * 0.33f + j + ",");
			}
			out.println();
		}
		out.close();
	}
 
	public static void main(String[] args) throws IOException
	{
		String file_name = "data.txt";
		gen_file(file_name, 100000);
		Scanner scan;
		BufferedReader read;
 
		int times = 10;
		long scan_times[] = new long[times];
		long buff_times[] = new long[times];
		System.out.println("scanner test");
		for (int i = 0; i < times; ++i)
		{
			scan = new Scanner(new BufferedReader(new FileReader(file_name)));
			scan.useDelimiter(",");
			long start_time = System.currentTimeMillis();
			while (scan.hasNextLine())
			{
				// skip header
				scan.next();
				// read 5 doubles
				scan.nextDouble();
				scan.nextDouble();
				scan.nextDouble();
				scan.nextDouble();
				scan.nextDouble();
				// finish line
				scan.nextLine();
			}
			long end_time = System.currentTimeMillis();
			scan_times[i] = end_time - start_time;
			System.out.println(scan_times[i]);
			scan.close();
		}
 
		System.out.println("buffered reader test");
		for (int i = 0; i < times; ++i)
		{
			read = new BufferedReader(new FileReader(file_name));
			long start_time = System.currentTimeMillis();
			String line;
			while ((line = read.readLine()) != null)
			{
				String[] split = line.split(",");
				// convert items to doubles
				Double.parseDouble(split[1]);
				Double.parseDouble(split[2]);
				Double.parseDouble(split[3]);
				Double.parseDouble(split[4]);
				Double.parseDouble(split[5]);
			}
			long end_time = System.currentTimeMillis();
			buff_times[i] = end_time - start_time;
			System.out.println(buff_times[i]);
			read.close();
		}
	}
}

This code creates each kind of reader and reads lines that consist of a header and 5 doubles, all separated by commas. It also generates the file to run the test with.

Results (milliseconds per run, 10 runs each):

scanner test: 3162, 2855, 2787, 2800, 2799, 2797, 2760, 2755, 2779, 2756

buffered reader test: 209, 128, 112, 108, 110, 108, 108, 108, 107, 107

So it looks like it is indeed true that Scanners are significantly slower than BufferedReaders (roughly 25 times slower in this test), even when the Scanner reads from a BufferedReader.

**Fazan** · October 9th, 2012, 04:07 PM

Originally Posted by KevinWorkman

Okay, gotcha. A JDK is simply the JRE plus the tools that compile your Java code into bytecode (namely the javac tool). Compare that to the JRE, which is the set of tools that run compiled bytecode. Eclipse hides the difference because it ships with its own built-in Java compiler, which is why it can work with nothing but a JRE installed. Most people install a JDK like I have above so that they can compile from the command prompt. The source (src.zip) comes with the full JDK, which is another reason it's a good thing to have.

Thank you. I suspected as much, but it's been a while since I've used precise terminology (I'm not actually a native English speaker), so I'm not sure if I was ever clear on this. I'll see about getting the actual JDK, though I still prefer Eclipse. The ease of use of the editor's functions is too good to pass up. At this point I'm hooked on code-complete as a means of avoiding spelling errors (sloppy typist here).

Originally Posted by helloworld922

As far as I know the Scanner class is a convenient Regex-wrapper for any incoming stream. What stream you pass to it will determine if the file is buffered in memory or not.

Scanner buffed_reader = new Scanner(new BufferedReader(new FileReader("file.txt"))); // file gets buffered into memory, usually faster
Scanner unbuffed_reader = new Scanner(new FileReader("file.txt")); // file isn't buffered into memory

That being said, I don't know what the behavior is if you pass a file to the Scanner object directly (though it sounds like it isn't).

Wait, I can pass a BufferedReader to a Scanner? But I thought its constructor expected a Stream child, not a Reader child? Yes, if I can pass a BufferedReader, then I'd expect the Scanner's behaviour to build on that of the underlying reader object, so I'd expect it to use said reader's buffer. I didn't know I could do that, though.
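
For reference, Scanner has a `Scanner(Readable)` constructor, and every `java.io.Reader` (BufferedReader included) implements `Readable`, so passing a Reader compiles without any adapter. A minimal sketch; `ScannerOverReader` and `firstTwoTokens` are made-up names, and a StringReader stands in for a file.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.Scanner;

public class ScannerOverReader
{
	// Reads the first two whitespace-separated tokens from any Reader.
	public static String[] firstTwoTokens(Reader source)
	{
		// Reader implements Readable, so this resolves to the Scanner(Readable) constructor
		Scanner scanner = new Scanner(new BufferedReader(source));
		try
		{
			String[] tokens = new String[2];
			tokens[0] = scanner.next();
			tokens[1] = scanner.next();
			return tokens;
		}
		finally
		{
			scanner.close(); // also closes the underlying reader
		}
	}

	public static void main(String[] args)
	{
		String[] tokens = firstTwoTokens(new StringReader("header 1.5 2.5 3.5"));
		System.out.println(tokens[0] + " / " + tokens[1]);
	}
}
```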

Originally Posted by helloworld922

It's also possible that it's the Regex side which is slowing your application down. If all you need is basic read-line stuff, Scanners may be overkill. Scanners work best when you're parsing the data while you're reading it in (such as reading in a file of numbers).

I'm not just reading the data, I need to split it into tokens. Each row comes in six parts: one header and five number columns. Originally, I was using a BufferedReader and then splitting and converting the incoming Strings afterwards, which really isn't a good idea. Having my post moderated in another thread gave me the idea of using a StringTokenizer instead. I haven't read up enough on it to know for certain, but this seems like an easier way to get tokens out of a file reader than a Scanner, which does indeed seem to want to parse my data. I suspect that might be faster, though I don't know whether the StringTokenizer itself will reintroduce some slowdown.

In any event, the previous memory problem and the Scanner's slowdown forced me to admit defeat and move the software to our server so it can use more operating memory. It's either that or take ages to accomplish anything.
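
The StringTokenizer idea mentioned above could look roughly like this: a BufferedReader supplies whole lines and a StringTokenizer cuts each one at whitespace, with no regular expressions involved. This is only a sketch under the assumption that columns are separated by spaces or tabs; `TokenizedRows` and `read` are hypothetical names, and no timing claims are made for it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizedRows
{
	// Splits every non-empty line into whitespace-separated tokens.
	public static List<String[]> read(Reader source) throws IOException
	{
		List<String[]> rows = new ArrayList<String[]>();
		BufferedReader reader = new BufferedReader(source);
		try
		{
			String line;
			while ((line = reader.readLine()) != null)
			{
				// default delimiters: space, tab, newline, carriage return, form feed
				StringTokenizer tokens = new StringTokenizer(line);
				int count = tokens.countTokens();
				if (count == 0)
				{
					continue; // skip blank lines
				}
				String[] row = new String[count];
				for (int i = 0; i < count; ++i)
				{
					row[i] = tokens.nextToken();
				}
				rows.add(row);
			}
		}
		finally
		{
			reader.close();
		}
		return rows;
	}

	public static void main(String[] args) throws IOException
	{
		String data = "h1 1 2 3 4 5\nh2 6 7 8 9 10\n";
		for (String[] row : read(new StringReader(data)))
		{
			System.out.println(row[0] + " has " + (row.length - 1) + " values");
		}
	}
}
```

Note that the StringTokenizer Javadoc describes it as a legacy class and recommends `String.split` or `java.util.regex` for new code, so it is worth measuring both on the real file before committing to either.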
