Dataslope logoDataslope

Data Modeling with Collections

Picking the right shape — list, set, map, map-of-list, map-of-set — for real-world domains

Most code, in the end, is some collection nested inside some other collection. The shape you choose is a modeling decision, not a performance one — pick wrongly and your code will be full of awkward loops and edge cases; pick well and most queries become one method call.

This chapter is about the shape of your data.

The "shape" mental model

Every modeling problem has three questions:

  1. Are duplicates allowed? If no, you want a Set.
  2. Is order meaningful? If yes, you want a List (or an ordered set/map).
  3. Is there a key by which we look things up? If yes, you want a Map.

Combine these and you get the four canonical shapes:

QuestionShape
"Ordered bag of things"List<T>
"Unique things, lookup by identity"Set<T>
"Lookup by key"Map<K, V>
"Each key maps to many values"Map<K, List<V>> or Map<K, Set<V>>

Worked example: a music library

Let's design the data of a tiny music library. Domain:

  • A library has many tracks.
  • Each track belongs to one album and one artist.
  • A track has a list of genres.
  • Listeners can like any track.
  • We need to be able to:
    • List all tracks by a given artist.
    • List all unique genres in the library.
    • Show the top-N most liked tracks.

First pass: lists everywhere

A naive design might use lists for everything:

record Track(String title, String artist, String album, List<String> genres, int likes) {}

class Library {
    private final List<Track> tracks = new ArrayList<>();
    // tracksByArtist(name): for-loop over tracks, filtering by artist
    // allGenres(): for-loop with a duplicate-checking inner loop
    // topLiked(n): sort and take
}

This works, but every query is O(n), every "is this duplicate?" check is hand-written, and the intent — "lookup by artist", "set of unique genres" — is hidden inside loop bodies.

Second pass: shape per query

If we let the questions drive the shape, we get:

  • "List all tracks by a given artist" → Map<String, List<Track>> tracksByArtist
  • "List all unique genres in the library" → Set<String> allGenres
  • "Top-N most liked tracks" → keep a List<Track> and sort on demand
Code Block
Java 8 (Update 492)

Notice what happened: tracksByArtist(name) went from an O(n) scan to an O(1) map lookup, not because we changed any algorithm, but because we stored the data in the shape we needed to query it.

Modeling relationships

Most "real" data is graph-shaped. You'll mainly model two flavors of relationship with collections:

1:N — one parent, many children

Map<ParentKey, List<Child>> (ordered) or Map<ParentKey, Set<Child>> (unique).

Example: orders per customer.

Map<CustomerId, List<Order>> ordersByCustomer = new HashMap<>();

N:M — many on both sides

Two complementary maps form a bidirectional index:

Map<UserId,    Set<TagId>>  tagsOfUser  = new HashMap<>();
Map<TagId,     Set<UserId>> usersOfTag  = new HashMap<>();

Whenever you tag a user, update both. The cost is a tiny bit of discipline on every write; the benefit is that both questions ("what tags does this user have?" and "which users have this tag?") become O(1).

Typed value objects beat strings

A subtle but huge win: don't let your maps and lists be keyed by loose primitives. Map<String, Order> is fine — until you also have Map<String, Customer> and accidentally pass a customer's email where an order id was expected.

record CustomerId(String value) {} solves it: the compiler now catches the mix-up, and you can attach validation to the record's constructor.

Code Block
Java 8 (Update 492)

This is type-driven data modeling — letting the type system carry information that comments and convention used to carry.

Practice

Challenge
Java 8 (Update 492)
Group orders by customer

Implement OrderIndex.byCustomer(List<Order>) returning a Map<String, List<Order>> keyed by customerId, with each customer's orders in input order.

Expected output:

alice: [Order[id=1, total=100], Order[id=3, total=70]]
bob: [Order[id=2, total=40]]
carol: [Order[id=4, total=200]]

Test your understanding

QuestionSelect one

You need to answer "all messages from a given user" instantly, where messages arrive over time and the per-user list grows. What is the natural shape?

A flat List<Message> that you scan each time

A Set<Message>

A Map<UserId, List<Message>> updated on every arrival

A Map<UserId, Set<Message>>

QuestionSelect one

You're modeling a tagging system: each post has many tags, each tag belongs to many posts. Which is a good design?

One Map<PostId, List<TagId>>

One Map<TagId, List<PostId>>

Two indices: Map<PostId, Set<TagId>> and Map<TagId, Set<PostId>>, kept in sync on every change

A List<Pair<PostId, TagId>>

QuestionSelect one

Why prefer record CustomerId(String value) {} over a bare String for keys?

It uses less memory

It's faster

The compiler now refuses to mix up customer ids with other strings (order ids, emails, etc.), and you can put validation in the record's constructor

It's required by the Collections API

On this page