A unified data model for token-based authentication and credential authentication

Michael Seifert, 2018-07-28, Updated 2020-10-01

Read about how the authentication system in Ameto has evolved over time. I evaluate different approaches for modeling user data for token-based and user-password authentication and present a secure approach for treating them uniformly in the data model.

Authentication is the process of determining a user's identity.

Authorization is the process of determining whether a user has permission to execute an operation. Note that there is also an HTTP header field named "Authorization", which, despite its name, carries authentication credentials.

Token-based authentication

When starting out with Ameto, the only use case was to check whether an incoming API request was sent by a registered user. There was almost no authorization in place, so a user could perform any operation as soon as they had authenticated correctly. Consequently, the code for authentication was very simple. API users were given a token that they had to send in the Authorization header field ^[1] of their HTTP request.

Authorization: Bearer S3cr3tT0k3n

Tokens were distributed manually and consisted of a number of (secure) random bytes.

import secrets
# Generate a secure token with 32 random bytes
token = secrets.token_hex(32)
user = User(token)

Since a user might want to have different tokens for different purposes, users could have more than one access token. The service that handles the incoming API request parses the request header and looks up whether the token is present in the database.
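
The header parsing and lookup described above can be sketched as follows. This is a minimal illustration, not Ameto's actual code: the `TOKENS` dictionary stands in for the database, and the names `issue_token` and `authenticate` are hypothetical.

```python
import secrets
from typing import Dict, Optional

# Hypothetical in-memory stand-in for the database: token -> user name.
TOKENS: Dict[str, str] = {}

def issue_token(user_name: str) -> str:
    """Create a new token for the user; a user may hold several."""
    token = secrets.token_hex(32)
    TOKENS[token] = user_name
    return token

def authenticate(authorization_header: Optional[str]) -> Optional[str]:
    """Return the user name for a valid 'Bearer <token>' header, else None."""
    if not authorization_header:
        return None
    # Split "Bearer S3cr3tT0k3n" into the scheme and the token.
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return None
    return TOKENS.get(token.strip())
```

Because the token itself is the lookup key, issuing a second token to the same user simply adds another entry that maps to the same user name.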

The upside of this approach is its simplicity. If you choose a token that is long enough, the risk of two generated tokens being identical is close to zero and the tokens are hard to guess.
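
To put a rough number on "close to zero", the birthday bound gives an estimate of the collision risk. The sketch below assumes uniformly random tokens and uses the approximation p ≈ n² / 2^(bits+1), which is only meaningful while the result is small; the function name is hypothetical.

```python
def collision_probability(num_tokens: int, token_bytes: int) -> float:
    """Birthday-bound estimate that any two of num_tokens random tokens collide.

    Uses p ~ n^2 / 2^(bits + 1); the result is capped at 1.0 because the
    approximation stops being meaningful once it grows large.
    """
    bits = 8 * token_bytes
    return min(1.0, num_tokens ** 2 / 2 ** (bits + 1))

# One billion issued tokens of 32 random bytes each.
estimate = collision_probability(10 ** 9, 32)
```

With 32 random bytes, even a billion issued tokens leave the estimate many orders of magnitude below anything that matters in practice, whereas 4-byte tokens would be expected to collide at that scale.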

The request always has to be sent via a secure connection (that is, HTTPS). Otherwise, an attacker can intercept the access token via a Man-In-The-Middle attack.

From token-based authentication to user-password credentials

Soon there was a need to authenticate in a web browser. Browsers do not allow us to arbitrarily set the Authorization header. Also, how is the user supposed to remember their 32-byte access token? For this reason, we had to implement authentication with a login and password. The server side has to check whether the user exists, verify the password, and set a session cookie on success^[2]. On subsequent requests, the server can check for the existence of such a cookie to authenticate the user.
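
One way to issue and check such a session cookie without server-side state is to sign its content. The following sketch assumes the password check has already succeeded and uses an HMAC signature; `SERVER_SECRET` and the function names are hypothetical, the post does not say how Ameto builds its cookies, and a real cookie would also need an expiration.

```python
import hashlib
import hmac
import secrets
from http.cookies import SimpleCookie
from typing import Optional

# Hypothetical server-side secret used to sign cookies; it must stay private.
SERVER_SECRET = secrets.token_bytes(32)

def sign(login: str) -> str:
    """Return a hex HMAC-SHA256 signature of the login."""
    return hmac.new(SERVER_SECRET, login.encode(), hashlib.sha256).hexdigest()

def make_session_cookie(login: str) -> str:
    """Build the Set-Cookie header value issued after a successful login."""
    cookie = SimpleCookie()
    cookie["session"] = f"{login}.{sign(login)}"
    cookie["session"]["httponly"] = True  # not readable from JavaScript
    cookie["session"]["secure"] = True    # only sent over HTTPS
    return cookie["session"].OutputString()

def login_from_cookie(cookie_header: str) -> Optional[str]:
    """Return the login if the cookie carries a valid signature, else None."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    if "session" not in cookie:
        return None
    login, _, signature = cookie["session"].value.rpartition(".")
    if login and hmac.compare_digest(signature.encode(), sign(login).encode()):
        return login
    return None
```

The signature prevents a client from changing the login inside the cookie, since a valid signature cannot be produced without the server's secret.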

This leaves us with two different processes for authenticating users: sending an API token via the Authorization header, and issuing a session cookie after a successful login with a password. This means that we have to change the data model of our user, which can now have either a set of tokens or a pair of credentials. Or is it possible for a user to have both?

The emerging data model for users felt overly complicated, so we decided to reconsider what a "user" actually is. We came to the conclusion that a user can be either a piece of software that authenticates via an access token or a human being that accesses the services via a login and a password. However, we did not want customers to create a separate account for each user. This is why we introduced the concept of a tenant. A tenant can be an organization or a person. Each tenant can have an arbitrary number of users, some of which are programmatic users using token-based authentication.

A tenant is an entity that has a contract with the service provider.
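
The relationship between tenants and users can be sketched with two small classes. The field names below (`name`, `users`, `is_programmatic`) are illustrative assumptions and not taken from Ameto's code.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class User:
    login: str
    is_programmatic: bool  # True for software clients using an access token

@dataclass
class Tenant:
    name: str  # the organization or person holding the contract
    users: List[User] = field(default_factory=list)

# One tenant with a human user and a programmatic user.
tenant = Tenant("Example Corp")
tenant.users.append(User("alice", is_programmatic=False))
tenant.users.append(User("ci-pipeline", is_programmatic=True))
```

The contract and any personal information belong to the tenant, while each user only carries what is needed to authenticate.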

Now that the notion of a user is clear, our user object stores an API token for programmatic users and a login and a password for human users.

import secrets
token = secrets.token_hex(32)
login = provided_login
password = provided_password
user = User(token, login, password)

However, some of these fields may be empty, since the user can use either form of authentication, not both. This led us to split an access token into two parts, a login part and a password part, both of which consist of random bytes. If the user is human and provides a login and password, we use the provided credentials. Otherwise, the credentials are randomly generated and the concatenation of the login and password forms the actual token.

import secrets
from typing import NamedTuple

class User(NamedTuple):
    login: str
    password: str

    @property
    def token(self):
        return f'{self.login}{self.password}'

if provided_login and provided_password:
    login = provided_login
    password = provided_password
else:
    login = secrets.token_hex(16)
    password = secrets.token_hex(16)
user = User(login, password)
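
On the receiving side, a token can be split back into its two parts so that both authentication methods share one credential check. This is a sketch under assumptions: it relies on the generated login having a fixed length of 32 hex characters (as produced by `secrets.token_hex(16)`), the helper names are hypothetical, and passwords are compared in plain text only to keep the example short.

```python
import secrets
from typing import Dict, NamedTuple, Optional, Tuple

LOGIN_HEX_LENGTH = 32  # secrets.token_hex(16) yields 32 hex characters

class User(NamedTuple):
    login: str
    password: str  # plain text for brevity only; store a hash in practice

    @property
    def token(self) -> str:
        return f"{self.login}{self.password}"

def split_token(token: str) -> Tuple[str, str]:
    """Split a generated token into its login and password parts."""
    return token[:LOGIN_HEX_LENGTH], token[LOGIN_HEX_LENGTH:]

def authenticate(users: Dict[str, User], login: str, password: str) -> Optional[User]:
    """Single credential check shared by both authentication methods."""
    user = users.get(login)
    if user is None:
        return None
    # Constant-time comparison of the supplied and stored password.
    if secrets.compare_digest(user.password.encode(), password.encode()):
        return user
    return None
```

A request carrying a token goes through `authenticate(users, *split_token(token))`, while a login form passes its two fields directly.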

A nice side effect is that authentication data (user) is now stored separately from personal information (tenant). This narrows down the number of services that have access to the personal data of the tenant and makes it easier to protect.

Never forget to hash the password before storing it in the database. I recommend using a flavor of the Argon2 password hash family, which came out on top of its contenders in the Password Hashing Competition 2015.
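
The hash-then-store pattern looks roughly like the sketch below. Argon2 is not part of the Python standard library, so this example swaps in `hashlib.scrypt` as a stand-in to stay self-contained; with a third-party Argon2 binding, the two hashing calls would be replaced accordingly. The cost parameters and the `salt$hash` storage format are assumptions chosen for illustration.

```python
import hashlib
import hmac
import secrets

def hash_password(password: str) -> str:
    """Return 'salt$hash' (both hex) for storage; scrypt stands in for Argon2."""
    salt = secrets.token_bytes(16)  # fresh random salt per password
    digest = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return f"{salt.hex()}${digest.hex()}"

def verify_password(stored: str, password: str) -> bool:
    """Recompute the hash with the stored salt and compare in constant time."""
    salt_hex, _, digest_hex = stored.partition("$")
    digest = hashlib.scrypt(
        password.encode(), salt=bytes.fromhex(salt_hex), n=2**14, r=8, p=1
    )
    return hmac.compare_digest(digest, bytes.fromhex(digest_hex))
```

Only the output of `hash_password` is written to the database, so a leaked table does not directly reveal the passwords.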

The resulting token is still very limited in what it does. For example, it is not possible to encode any information into it, such as an expiration date. Such attributes either have to be managed on the server side or the token has to be encoded in a different format, such as JWT.
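
Managing an expiration date on the server side could look like the following sketch, in which the expiry is stored next to the token and checked on every request. The `TokenRecord` class and the `issue` function are hypothetical names used for illustration.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class TokenRecord:
    token: str
    expires_at: datetime  # kept on the server because the token cannot carry it

    def is_valid(self, now: datetime) -> bool:
        """A token is valid strictly before its expiration time."""
        return now < self.expires_at

def issue(lifetime: timedelta, now: datetime) -> TokenRecord:
    """Create a random token that expires after the given lifetime."""
    return TokenRecord(secrets.token_hex(32), now + lifetime)

# Example: a token issued on 2020-10-01 that is valid for 30 days.
issued_at = datetime(2020, 10, 1, tzinfo=timezone.utc)
record = issue(timedelta(days=30), issued_at)
```

The trade-off is that every attribute added this way requires a database lookup, which a self-describing format such as JWT avoids.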

[1] MDN web docs – Authorization
[2] It is also possible to manage sessions on the server side: Upon successful login, the server generates a session ID, stores it in a database, and transmits it to the client. The client then uses the session ID for subsequent requests, e.g. as a URL parameter. This is how PHP's built-in session management works, for example. Server-side sessions work fine as long as there is only one instance that is issuing and authenticating user sessions. Let's call this the "session service". In a distributed system with more than one session service instance, we need to ensure consistency between these instances. For example, a session could have been created by instance A, but a load balancer redirects subsequent requests of the same client to instance B. Now we have to take measures so that instance B knows about the session that was issued by instance A. We can do this by using a strongly consistent database, for example. In any case, server-side sessions incur additional programmatic (higher load on backend services) and operational (possibly an additional database) costs. Therefore, we will not discuss server-side session management further in this post.