Data Modeling with Collections
Picking the right shape — list, set, map, map-of-list, map-of-set — for real-world domains
Most code, in the end, is some collection nested inside some other collection. The shape you choose is a modeling decision, not a performance one — pick wrongly and your code will be full of awkward loops and edge cases; pick well and most queries become one method call.
This chapter is about the shape of your data.
The "shape" mental model
Every modeling problem has three questions:
- Are duplicates allowed? If no, you want a
Set. - Is order meaningful? If yes, you want a
List(or an ordered set/map). - Is there a key by which we look things up? If yes, you want a
Map.
Combine these and you get the four canonical shapes:
| Question | Shape |
|---|---|
| "Ordered bag of things" | List<T> |
| "Unique things, lookup by identity" | Set<T> |
| "Lookup by key" | Map<K, V> |
| "Each key maps to many values" | Map<K, List<V>> or Map<K, Set<V>> |
Worked example: a music library
Let's design the data of a tiny music library. Domain:
- A library has many tracks.
- Each track belongs to one album and one artist.
- A track has a list of genres.
- Listeners can like any track.
- We need to be able to:
- List all tracks by a given artist.
- List all unique genres in the library.
- Show the top-N most liked tracks.
First pass: lists everywhere
A naive design might use lists for everything:
record Track(String title, String artist, String album, List<String> genres, int likes) {}
class Library {
private final List<Track> tracks = new ArrayList<>();
// tracksByArtist(name): for-loop over tracks, filtering by artist
// allGenres(): for-loop with a duplicate-checking inner loop
// topLiked(n): sort and take
}This works, but every query is O(n), every "is this duplicate?"
check is hand-written, and the intent — "lookup by artist", "set
of unique genres" — is hidden inside loop bodies.
Second pass: shape per query
If we let the questions drive the shape, we get:
- "List all tracks by a given artist" →
Map<String, List<Track>> tracksByArtist - "List all unique genres in the library" →
Set<String> allGenres - "Top-N most liked tracks" → keep a
List<Track>and sort on demand
Notice what happened: tracksByArtist(name) went from an O(n)
scan to an O(1) map lookup, not because we changed any
algorithm, but because we stored the data in the shape we needed
to query it.
Modeling relationships
Most "real" data is graph-shaped. You'll mainly model two flavors of relationship with collections:
1:N — one parent, many children
Map<ParentKey, List<Child>> (ordered) or
Map<ParentKey, Set<Child>> (unique).
Example: orders per customer.
Map<CustomerId, List<Order>> ordersByCustomer = new HashMap<>();N:M — many on both sides
Two complementary maps form a bidirectional index:
Map<UserId, Set<TagId>> tagsOfUser = new HashMap<>();
Map<TagId, Set<UserId>> usersOfTag = new HashMap<>();Whenever you tag a user, update both. The cost is a tiny bit of
discipline on every write; the benefit is that both questions
("what tags does this user have?" and "which users have this tag?")
become O(1).
Typed value objects beat strings
A subtle but huge win: don't let your maps and lists be keyed by
loose primitives. Map<String, Order> is fine — until you also have
Map<String, Customer> and accidentally pass a customer's email
where an order id was expected.
record CustomerId(String value) {} solves it: the compiler now
catches the mix-up, and you can attach validation to the record's
constructor.
This is type-driven data modeling — letting the type system carry information that comments and convention used to carry.
Practice
Implement OrderIndex.byCustomer(List<Order>) returning a Map<String, List<Order>> keyed by customerId, with each customer's orders in input order.
Expected output:
alice: [Order[id=1, total=100], Order[id=3, total=70]]
bob: [Order[id=2, total=40]]
carol: [Order[id=4, total=200]]
Test your understanding
You need to answer "all messages from a given user" instantly, where messages arrive over time and the per-user list grows. What is the natural shape?
A flat List<Message> that you scan each time
A Set<Message>
A Map<UserId, List<Message>> updated on every arrival
A Map<UserId, Set<Message>>
You're modeling a tagging system: each post has many tags, each tag belongs to many posts. Which is a good design?
One Map<PostId, List<TagId>>
One Map<TagId, List<PostId>>
Two indices: Map<PostId, Set<TagId>> and Map<TagId, Set<PostId>>, kept in sync on every change
A List<Pair<PostId, TagId>>
Why prefer record CustomerId(String value) {} over a bare String for keys?
It uses less memory
It's faster
The compiler now refuses to mix up customer ids with other strings (order ids, emails, etc.), and you can put validation in the record's constructor
It's required by the Collections API