← API

Pagination

Context and Problem Statement

Banno does not have any policies around paging through our API result sets. Absent this, each team decides how to proceed whenever a new endpoint is requested based on their own experiences (or lack thereof). Due to this, our API does not offer consumers a unified and consistent methodology around paginating results.

Decision Drivers

  • Unified methodology to assist API design for developers and simplification of consumer code.
  • Ability to apply pagination in various disparate circumstances
  • Simplicity of pagination implementations

Decision Outcome

Chosen option: Support “opaque cursor” pagination where possible, with “limit plus offset” as an alternative

Considered Options

  • Support limit plus offset based pagination only
  • Support opaque cursor based pagination only
  • Support “opaque cursor” pagination where possible, with “limit plus offset” as an alternative
  • Bypass the need for pagination with streamed endpoints

Pros and Cons of the Options

Support limit plus offset based pagination only

This involves the use of 2 primary parameters, “limit” and “offset”. “limit” here refers to the maximum number of records that can be returned in any one request, while “offset” refers to where, in the result set, should the records start from.

  • Good, simple to implement
  • Good, allows navigation to any page in the result set
  • Bad, requires consumers to meddle in the implementation details of calculating offsets
  • Bad, does not map naturally to third party integrations that only support cursors or other non-limit/offset based pagination schemes
  • Bad, cannot easily handle modifications to the result set while consumer is paging through results

Support opaque cursor based pagination only

Generally cursor based pagination involves the use of 2 primary parameters, “limit” and “cursor”. “limit” indicates the maximum number of records that can be returned in any one request, while “cursor” is a reference that can be directly used to return the next page in the result set. In the opaque cursor scheme, we also require the cursor to an opaque string. This allows the server to adapt the implementation of the cursor to the needs of the result set being paginated without exposing implementation details.

  • Good, simple to use for consumers
  • Good, adaptable to disparate use cases
  • Good, can change underlying implementation without breaking contracts
  • Good, used by FDX API so is an industry standard
  • Bad, more complex to implement
  • Bad, typically does not allow navigation to any page of the result set

Support “opaque cursor” pagination where possible, with “limit plus offset” as an alternative

  • Good, handles the widest variety of use cases
  • Good, provides maximum implmentation flexibility
  • Bad, increases API complexity

Bypass the need for pagination with streamed endpoints

Use chunk encoded http responses to stream back result sets to consumers without the need for pagination.

  • Good, easy to implement
  • Bad, is not condusive to partial consumption of result sets

Research

While there are a considerable number of ways to paginate through a dataset, most fall into three major categories:

  • Limit plus Offset
  • Cursors
  • Not paged / streaming

Here at Banno, we also witness the use of all 3 categories, which are described in further detail below.

Limit plus Offset

This involves the use of 2 parameters used during all web service requests, “limit” and “offset”. Of course, the names may be substituted with some other term, such as “count” and “start”.

In this situation, “limit” refers to the maximum number of records that can be returned in any one request, while “offset” refers to where, in the resultset, should the records start when the web service is called.

For instance, if the consumer wants 10 records per page, they would specify a “limit” of 10. For the first page, they would omit “offset” or set it to 1. For the second page, “offset” would be 11.

There are a couple of general offshoots to this method based on its simplicity:

  • Instead of “offset”, allow the consumer to specify “page”. This is roughly “offset divided by limit plus 1”, but is nice to use if the front-end has page navigation.
  • Allow navigation using “last” (or “first”). This requires that the service know how many total records exist.

Many times “Limit plus Offset” is the natural choice for developers when the results are being collected from a database which supports LIMIT and OFFSET clauses in their query language (such as PostgreSQL). Using this methodology greatly simplifies the implementation of the service.

Limitations

This simplicity does come at a cost:

  • Large resultsets do not perform well on many databases when using LIMIT and OFFSET. This is because the database still queries for all records, running through them to apply the clauses.

  • This scheme cannot handle modifications to the results occurring while a consumer is paging through the records. In the worst-case scenario, records can be missed completely. This can happen, for example, in the following scenario:

    • The consumer is using a LIMIT size of 10
    • The consumer reads the first page (i.e.: records 1 through 10).
    • While reading the first page, record #9 is deleted.
    • The consumer requests page 2 (or an OFFSET of 11).
    • At this point, record #11 before the deletion has become record #10 and will no longer appear on page 2 (but it also had not appeared originally on page 1).
  • To prevent the problem described in the previous point, a results-caching layer can be introduced to ensure that changes to the resultset are controlled in a manner that makes sense for both the developers and the consumer’s expectations. However, this significantly complicates the solution.

Cursors

Using a cursor involves “marking” a place in the results and asking the web service to continue the next page starting with that mark. Most implementations of cursors are not nearly as straightforward as using Limit plus Offset, but this methodology does offer significant protections against missing a record.

Natural Cursors

Cursors that are based on a “natural mark” (my term) are the simplest to implement and afford the same benefits of Limit plus Offset with the added benefit of guaranteeing that you do not miss any records or see any duplicate entries.

A “natural mark” is a field on every record in the resultset that both (a) uniquely identifies that entry; and (b) is the sort-order of the results.

In those cases, the API would allow the consumer to specify a “limit” (or page size) and “start”, where “start” indicates the mark. The web service would simply structure the query to make sure it did not return any records whose natural mark is less than the given “start” mark.

Unfortunately, most APIs do not work against a dataset with natural marks.

Date-Based Cursors

This is similar to Natural Marks with the exception of the following:

  • Multiple records could have the same date stamp
  • Depending on the purpose of the date, the results could be shuffled around, allowing duplicates to be shown.

Using dates (such as “Created Date”) as the sort key is, in many cases, an acceptable means of returning results. However, this can open up edge cases where records are seen more than once. For instance, if “Updated Date” is used and a record is updated between pages, that record will be seen at least a second time when the consumer navigates to the end of the resultset.

For instance, if backed by a SQL database, a simple query such as the following can be used to power Date-Based cursors:

SELECT …
FROM …
WHERE (created_date >= ${Token.CreatedDate})
ORDER BY created_date

Date-Based cursors cannot be used if the date is sorted in descending order and an entry can have this date modified, since this again will expose the consumer to the risk that a record is missed.

The other significant drawback is that this method will not work if too many records have the same value in the date field (note: this is true of all cursors, not just date fields). If there are more records with the same value than there are records in a page, this method will not work at all. Even when there are less records with the same value as there are records in a page, the behavior will be sub-optimal if the page ends with one of these types of records. This is because when the next page is requested, the consumer will see all of the records with that same date value again, regardless of whether those records were on the previous page.

To improve on Date-Based Cursors (or other similar cursor types), a Continuation Token Cursor can be used which aims to solve some of the problems outlined above.

Continuation-Token Cursors

Continuation tokens specify multiple fields to define where to continue paging. They are almost exclusively used with date-based sorting, but can be used with other naturally sorted resultsets as well. In contrast with “Date-Based Cursors”, Continuation tokens add a field which, when combined with the first cursor field, forms a primary key on the record. The sorting of the resultset must include both fields.

Using this type of token eliminates the problem of having too many records with the same cursor value, but does not solve the issue that the results must be sorted primarily by these two fields.

For instance, if a User record has a “Created Date”, a continuation token can be created which combines “Created Date” and “User ID”. The query to produce the results can be something similar to the following:

SELECT …
FROM …
WHERE (created_date > ${Token.CreatedDate}) OR (created_date = ${Token.CreatedDate} AND user_id > ${Token.UserID})
ORDER BY created_date, user_id

Opaque Cursors

Slack has a great blog post on pagination. They use a hybrid approach, whereby the cursor structure is not a field (or a few fields), but instead a general object that can be decoded by the endpoint that produced it. With this setup, an endpoint that wants to use simple “Limit plus Offset” values can encode those two fields into a single cursor. Other endpoints can choose alternative items to encode. (Note: this object is encoded normally in a Base64 string, or something similar, to allow use on URI query strings).

For this to work, the API must return more data back with each response to allow the consumer to easily use the hybrid cursor (see next section).

Results Payload

Regardless of which pagination method is used above, it is strongly recommended to return metadata about the results with each payload. In keeping with the “HATEOAS” concept of RESTful services, this metadata should describe the options (i.e. URIs) that the consumer of the API can follow after processing the result.

At a minimum, all results should return a field that specifies the URI to request the next page of the results. This can be expanded to return the URIs for the previous, first, and last pages, for instance. For “Limit plus Offset” pagination schemes, this can include URIs for each page in the resultset, the total number of pages, and the total number of records.

There is no generally accepted method of including the metadata. Many APIs, including here at Banno, add these types of URIs to the JSON payload as a top-level element (like “nextPage”). If many URIs are to be provided, adding a top-level object, such as “_links”, may work better.

RFC-5988 is a standard around including metadata links in API payloads. This is, in fact, how GitHub manages pagination in its API. This standard indicates that this information should be placed into the HTTP headers. However, there are two issues with this approach: (1) The consumer now has to parse both the payload and headers in order to work with the results, adding to client complexity, and (2) Other metadata (such as “number of pages”) do not technically adhere to the purpose of RFC-5988. If additional metadata were going to be included in the results, they would need to be placed elsewhere, causing more confusion.