How to use a Set
Put simply, a Set is a Collection that contains no duplicate elements. Sets are a fantastic utility to have in the arsenal, allowing one to find, sort, and iterate over unique values quickly and easily.
A little background: a Collection is an interface which is used to groups object. The interfaces Set and List both extend from Collection, and thus the most commonly used Collection implementation - ArrayList (or its interface List) - is somewhat related to a Set, with mutual functions inherited from the Collection interface. That being said, a Set and List differ in some key ways.
On to a demonstration:
public static void main(String[] args){ String[] strings = {"1","2", "3", "4" , "1", "2"}; //create the Collection objects java.util.List<String> list = new ArrayList<String>(); Set<String> set = new HashSet<String>(); //Add the same values to each object for ( String s : strings ){ list.add(s); set.add(s); } //print out the contents of each collection System.out.println("List Contents:"); printCollection(list); System.out.println("Set Contents:"); printCollection(set); } /** *Prints the values of the parameter collection to the command line. */ public static void printCollection(Collection<String> c){ Iterator<String> it = c.iterator(); while ( it.hasNext() ){ System.out.println(it.next()); } }
Results:
This code adds the same values to both a List and a Set, and outputs the values contained within those Collections. Two important things to note:List Contents: 1 2 3 4 1 2 Set Contents: 3 2 1 4
1) as defined by the implementation the Set has no duplicate values, as opposed to the List which contains all values placed within the Collection.
2) the order of the Set output does not equal the order of its input. Based inherently upon how it is implemented, a Set (at least a HashSet - see below) has no guarantee of the order when the values are retrieved (see below).
Guarantee Set Order:
One advantage to using a List lies in the fact that order is maintained. The basic implementation of a Set - the HashSet - cannot guarantee any order. This limitation is fixed with a LinkedHashSet, an implementation of the Set interface backed by a LinkedList. This object guarantees no duplicates, whose iterator iterates over its objects in the order of entry. I will leave it up to the reader to test the demonstration of a LinkedHashSet by substituting the HashSet in the example above with a LinkedHashSet.
How to define equality:
Equality in a Set is defined by two values: the equals() and hashCode() methods of an Object. These methods could be a subject of an entirely separate tutorial, suffice it to say these two methods define object equality. Typically for custom made object, these methods default to those defined in object - this behavior defining equality based upon object references - and not their values. The following example extends off the previous, using a custom object as a demonstration:
public static void main(String[] args){ Tester[] strings = {new Tester("1"),new Tester("2"), new Tester("3"), new Tester("4") , new Tester("1"), new Tester("2")}; //create the Collection objects java.util.List<Tester> list = new ArrayList<Tester>(); Set<Tester> set = new LinkedHashSet<Tester>(); //Add the same values to each object for ( Tester s : strings ){ list.add(s); set.add(s); } //print out the contents of each collection System.out.println("List Contents:"); printCollection(list); System.out.println("Set Contents:"); printCollection(set); } /** *Prints the values of the parameter collection to the command line. */ public static void printCollection(Collection<Tester> c){ Iterator<Tester> it = c.iterator(); while ( it.hasNext() ){ Tester s = it.next(); System.out.println(s.value); } } /** *A Custom class with only a single variable for demonstration. */ private static class Tester { final String value; public Tester(String s){ value = s; } public String getValue(){ return value; } }
Results:
First, notice the Set contains no unique values (relative to the value variable in Tester). This is because the the hashCode()/equals() is defaulting to the implementation in Object - testing for unique instances, and not the values of those instances. To solve this, one can override these methods to define equality:List Contents: 1 2 3 4 1 2 Set Contents: 1 2 3 4 1 2
This addition results in unique values in the Set, based upon the value variable in the class Tester.@Override public int hashCode() { final int prime = 31; int result = 1; result = prime * result + ((value == null) ? 0 : value.hashCode()); return result; } @Override public boolean equals(Object obj) { if (this == obj) return true; if (obj == null) return false; if (getClass() != obj.getClass()) return false; Tester other = (Tester) obj; if (value == null) { if (other.value != null) return false; } else if (!value.equals(other.value)) return false; return true; }
If I want a unique collection, why not test if a List contains the object before adding it again?
you mean like this?
List<String> list = new ArrayList<String>(); for ( String s : Strings){ if ( list.contains(s) ){ continue; } list.add(s); }
Yes, but lets push this to the limit a bit. In the example below, identical values are placed in both a List and a Set. These Collections are then searched for values with time for both operations being tallied and printed.
final int MAX = 10000; java.util.List<Integer> list = new ArrayList<Integer>(); Set<Integer> set = new HashSet<Integer>(); for ( int i = 0; i < MAX; i++ ){ list.add(i); set.add(i); } long start = System.currentTimeMillis(); //calculate the time for a list search for ( int i = 0; i < MAX; i++ ){ if ( list.contains(i) ){ //do nothing for the demo } } long end = System.currentTimeMillis(); System.out.println("Time for List: " + (end-start)); start = System.currentTimeMillis(); //calculate the time for the set search for ( int i = 0; i < MAX; i++ ){ if ( set.contains(i) ){ //do nothing for the demo } } end = System.currentTimeMillis(); System.out.println("Time for Set: " + (end-start));
System dependent of course, but a quick output on my current system:
Time for List: 161 Time for Set: 4
The reasoning behind the differences in time values is based the differences in how values are stored. While the details of this is beyond the scope of this tutorial, but suffice it to say the lookup in a Set is MUCH faster. A simple analogy (and only an analogy) to the difference: searching an array list is like looping over an array to find a value, whereas searching a set is similar to inherently knowing the index of the value in the array (based upon its 'hash' - eg the hashCode() method). For more information, see WIKI - HashTable
Yes this is pushing things to a limit, but more than once I have noticed huge benefits in runtime performance by using a Set when it is more appropriate to do so.
Uses
Unique lines in a file, unique url's in a webpage, unique values in a dataset. Both Set's and List's have their own uses in different situations: Set's for unique values, and Lists for sequential, and duplicate values.