Sets
HashSet, LinkedHashSet, TreeSet — and what "no duplicates" actually means for your data
A set is a collection of unique elements. There is no notion of
position — only "is this thing in here or not?" The interface is
called Set<E>, and the three production-grade implementations cover
the three useful orderings: no order, insertion order, and
sorted order.
The Set interface
Set<E> extends Collection<E> and adds zero new methods. The
difference is purely in the contract: add(x) returns false (and
does nothing) if x is already present.
The three implementations you actually use:
| Class | Backed by | Ordering | Cost of add/contains |
|---|---|---|---|
HashSet<E> | HashMap<E,Object> | None | O(1) average |
LinkedHashSet<E> | hash table + linked list | Insertion order | O(1) average |
TreeSet<E> | Red-black tree | Sorted by natural order or Comparator | O(log n) |
HashSet
HashSet is the default: very fast, no order. Internally each
element becomes a key in a hidden HashMap, mapped to a sentinel
value.
Notice:
add(...)returnstrueonly the first time. That alone is one of the most useful primitives in the framework (it gives you "have I seen this before?" in one line).- The iteration order is unspecified. Do not rely on it.
HashSet relies entirely on equals and hashCode — review the
foundations chapter if you need to put your own classes in a set.
LinkedHashSet
LinkedHashSet is HashSet plus a doubly-linked list threaded
through the entries to remember insertion order.
The cost is a little extra memory (two pointers per entry) — not much. The benefit is that iteration gives you elements in the order you added them, which is invaluable when uniqueness and ordering both matter (e.g. "the unique URLs visited, in visit order").
Output is /home, /about, /contact — in the order they first
appeared.
TreeSet
TreeSet keeps elements in sorted order, using either the
elements' natural order (Comparable) or a supplied Comparator.
Backed by a red-black tree, every operation is O(log n).
TreeSet is also a NavigableSet, which gives you higher(x),
lower(x), ceiling(x), floor(x), headSet(x), tailSet(x) —
all the range operations a sorted structure can give you for free.
Reach for it when "what is the next/previous element near x?" is a
real question.
Set algebra: union, intersection, difference
The Collection API gives you the three classical set operations
through bulk methods. They mutate the receiver — which surprises
people once and then never again.
We wrap the printout in a TreeSet only to get a stable order for
display — the underlying sets are unordered.
A pitfall: putting mutable objects in a set
We covered this in the foundations chapter, and it's worth saying
again here in context: a HashSet indexes by hashCode at insertion
time. If you mutate a field that participates in hashCode, the
element silently becomes unreachable in the set. The same hazard
exists with TreeSet if you mutate a field that affects compareTo.
Use immutable value types as set elements.
A multi-file example: unique visitors
This puts Set to work in a recognizably real role: deduplicating an
incoming stream while preserving first-seen order.
A single class — three methods — and it solves both "unique" and "in arrival order" because we picked the right container.
Practice
Write a class Distinct with a static method List<String> ofWords(String text) that splits the text on whitespace and returns each distinct word, in the order it first appeared.
- Tokenize by splitting on one or more whitespace characters.
- "Distinct" is case-sensitive here (treat "The" and "the" as different).
- The return type must be a
List<String>so the caller can index.
Expected output:
[the, quick, brown, fox, jumps, over, lazy, dog]
Test your understanding
Which set implementation should you reach for when you need both uniqueness and iteration in insertion order?
HashSet
TreeSet
LinkedHashSet
ArrayList
A method receives a Set<User>. The same User object is added twice. What happens?
The set throws IllegalStateException
The size becomes 2
The second add returns false and the set is unchanged
The first User is overwritten
Why does mutating a field of an element already inside a HashSet (one that participates in hashCode) cause bugs?
It triggers ConcurrentModificationException
The element's bucket was chosen from the old hash; after mutation the set looks for it in a different bucket and reports "not present"
It silently clears the set
It throws UnsupportedOperationException