This is my personal thought and development bubble, so let’s look at Pydantic! Since I first used FastAPI I’m a huge fan of Pydantic. Basically it’s about data validation. As a developer it’s often about recieving data from somewhere, doing something with it and passing it on to somewhere else. When recieving data, I like to know if it follows the structure I expect. When sending stuff elsewhere, I’d like to make sure everything is following a structure. So pydantic was a handy way to ensure these i/o patterns.
TL;DR: pydantic models enable handy, standardized data wrapping and validation!
Basics
First of the following examples might include one or more of the folloing imports:
from datetime import datetime
from typing import Union, Optional, List, Dict, Literal
from uuid import UUID
from pydantic import BaseModel
Pydantic models work pretty straight forward.
- Define a model with standard python type hinting (define data shape)
- Load data in a model
- Validate data passed to the model (this is done by pydantic)
- Enjoy your easily implemented data shape 🎉
A simple model could look like this:
import uuid
from pydantic import BaseModel
class RawSentence(BaseModel):
id: uuid.UUID # An id has to match a UUID
text: str # The text has to match a normal str
So far so good, We created a python class inheriting from BaseModel, which gives us pydantics validator powers. Such that whenever data doesn’t pass our models spec, pydantic will raise proper errors for us. It’ll look like so:
if __name__ == "__main__":
data_point = {
"id": "4a3f61a9-8e75-4341-b3a0-3e64e0b60fb6",
"text": "Hello I'm a raw sentence",
}
model = RawSentence(**data_point)
print(model)
# id=UUID('4a3f61a9-8e75-4341-b3a0-3e64e0b60fb6') text='Hello Im a raw sentence'
print(model.id)
# 4a3f61a9-8e75-4341-b3a0-3e64e0b60fb6
model = RawSentence(**{"id": "", "text": "Hello Im a raw sentence"})
# model = RawSentence(**{"id": "", "text": 15})
# File "pydantic/main.py", line 406, in pydantic.main.BaseModel.__init__
# pydantic.error_wrappers.ValidationError: 1 validation error for RawSentence
# id
# value is not a valid uuid (type=type_error.uuid)
Like with every other class we can define function for this one too. This comes especially
handy with larger classe, that have mutple sub models, but this is a later topic. Here is
a class function that works with our text
variable:
import typing
import uuid
from pydantic import BaseModel
class RawSentence(BaseModel):
id: uuid.UUID # An id has to match a UUID
text: str # The text has to match a normal str
def filter_text_by_token_length(self, min_lenth: int) -> typing.List[str]:
return [
token
for token in self.text.split()
if len(token) >= min_lenth
]
if __name__ == "__main__":
data_point = {
"id": "4a3f61a9-8e75-4341-b3a0-3e64e0b60fb6",
"text": "Hello I'm a raw sentence",
}
model = RawSentence(**data_point)
filterd_tokens = model.filter_text_by_token_length(min_lenth=2)
print(filterd_tokens)
# ['Hello', "I'm", 'raw', 'sentence']
More typing
One can also work with import from typing
. I personally often use the following:
Union
: On of the passed types must be fulfilledOptional
: Not requiredList
: is a list of passed typesDict
: is a dictionaryLiteral
: One of the following strings is allowed
In case of Dict
one can also build a new pydantic model and pass it as a type to the
parent model. Like this it’s possible to validate children as well since **
would be
implicit in these cases.
import typing
from pydantic import BaseModel
class Related(BaseModel):
name: str
class Product(BaseModel):
price: typing.Union[int, float]
flag: typing.Optional[str]
tags: typing.List[str]
related: typing.Optional[typing.Dict]
model: typing.Literal["A", "B", "C", "D"]
related_model: Related
You can run this code as is, which can be used like so:
if __name__ == "__main__":
product = Product(
**{
"price": 1,
"tags": ["awesome"],
"model": "A",
"related_model": {"name": "ChildName"},
}
)
print(product)
# price=1 flag=None tags=['awesome'] related=None model='A' related_model=Related(name='ChildName')
print(product.related_model)
# name='ChildName'
Custom validation
In case we have some non-standard data type we can also use custom data validation.
For example, we could from pydantic import HttpUrl
. This allows us to parse and validate
something like https://arrrrrmin.netlify.com/
. If we’d have a response from some
aws service recieved by FastAPI, we should be able to validate something like
s3://bucket/folder/file.json
, which fails with HttpUrl
. The following validator can
handle this:
# Write a validator function that raises a ValueError (which will be caught inside
# pydantic and returned as a ValidationError, layouting exaktly what went wrong, while
# trying to create the model).
import typing
from pydantic import BaseModel, root_validator, validator
def validate_s3_scheme(s3_uri: str) -> str:
prefix: str = "s3://"
if not s3_uri.startswith(prefix):
raise ValueError("S3 URIs must start with {0}".format(prefix))
return s3_uri
def validate_non_key_file_string(v: str):
if len(v) < 1 or "." in v:
raise ValueError(
"Length provided should support at least 1 character and it's "
"not allowed to contain '.'"
)
return v
class S3Uri(BaseModel):
uri: str
scheme: str = "s3" # prefix has to be s3://
bucket: typing.Optional[str] # the host bucket
path: typing.Optional[str] # suffix after bucket
key: typing.Optional[str] # file key
# Use @validator defined below
_validate_bucket = validator("bucket", allow_reuse=True)(validate_non_key_file_string)
_validate_path = validator("path", allow_reuse=True)(validate_non_key_file_string)
# Please note: root validators are always executed on model creation
@root_validator
def validate_root_uri(cls, values): # noqa: U100
# Most of time you won't use cls, so make sure it doesn't break linting
uri: str = validate_s3_scheme(values.get("uri"))
uri_parts: typing.List[str] = uri.split("/")
# Expect: uri_parts = ["s3:", "", "bucket_name", "folder_name", "key_file.json"]
uri_length: int = len(uri_parts)
if uri_length < 3 and len(uri_parts[2]) > 0:
raise ValueError(
"S3 URIs must provide at least a bucket. Failed with uri: {0}".format(uri)
)
values["bucket"] = uri_parts[2]
if uri_length > 3:
values["path"] = "/".join(uri_parts[3:])
if "." in uri_parts[-1]:
values["key"] = uri_parts[-1]
return values
@validator("key")
def validate_non_key_file_string(cls, v: str): # noqa: U100
# Here we always need a "." in key file string
if len(v) < 1 or "." not in v:
raise ValueError(
"Length provided should support at least 1 character and has to contain "
"file suffix with '.'"
)
You can run this code as is, which can be used like so:
return v
if __name__ == "__main__":
s3_uri = S3Uri(uri="s3://bucket/folder/key.json")
print(s3_uri.uri)
# uri='s3://bucket/folder/key.json' scheme='s3' bucket='bucket' path='folder/key.json' key='key.json'
Ok, let’s break it down a bit. First we declared what we need and took advantage of
typing.Optional
.
class S3Uri(BaseModel):
uri: str
scheme: str = "s3" # prefix has to be s3://
bucket: typing.Optional[str] # the host bucket
path: typing.Optional[str] # suffix after bucket
key: typing.Optional[str] # file key
Our parameters bucket
, path
and key
are optional at first, but will be filled when
their values become available in root_validator
, which is the first validator executed,
by pydantic. With validate_root_uri
we inspect all values
recieved by S3Uri
. We
break it into pieces and fill our optional parameters. By nature an S3Uri needs a bucket,
which we’ll enforce here:
# Most of time you won't use cls, so make sure it doesn't break linting
uri: str = validate_s3_scheme(values.get("uri"))
uri_parts: typing.List[str] = uri.split("/")
# Expect: uri_parts = ["s3:", "", "bucket_name", "folder_name", "key_file.json"]
Everything else below is just to see wether or not our parsed vales fullfill what we expect, such that these are prompted like pydantic intends to.
Aliases
In some cases, for example working with FastAPI it’s useful to work with aliases. Aliases will help to have PEP8 (snake-case) compatible names, by also support camel-case naming, when creating the model. Here is an example:
import uuid
from pydantic import BaseModel
def to_camel(string: str) -> str:
return "".join(word.capitalize() for word in string.split("_"))
class Scheme(BaseModel):
class Config:
alias_generator = to_camel
class ResponseScheme(Scheme):
user_id: uuid.UUID
document_id: uuid.UUID
if __name__ == "__main__":
response_scheme = ResponseScheme(
**{"UserId": uuid.uuid4(), "DocumentId": uuid.uuid4(),}
)
print(response_scheme.json())
# {"user_id": "489bee33-7f7e-4b09-b83b-2d44ebeffde6", "document_id": "5e47e30d-d175-457d-8737-84cd06eae3f0"}
Like this it’s possible to use snake- and camel-case naming when passing data. There is a lot more to explore at Pydantic documentation.
Configs
Another nice feature: BaseSettings
. Configurations can be added and validated using
pydantics BaseSettings
. In addition one can install pip install pydantic[dotenv]
, to
read settings from a .env
file. This comes in especially handy when building a fastapi.
import os
from typing import Optional
from pydantic import BaseSettings
class S3Settings(BaseSettings):
REGION: Optional[str] = os.getenv("REGION") or "eu-central-1"
MAIN_BUCKET: Optional[str] = os.getenv("MAIN_BUCKET") or None
USER_BUCKET: Optional[str] = os.getenv("USER_BUCKET") or None
class Config:
env_file = ".env.dev.s3"
env_file_encoding = "utf-8"
Secrets
Building APIs often includes Secrets, which under no circumstances should be leaked. Therefore
pydantic provides SecretStr
class, which is nice to be sure that secret values are only
viewable if really needed. The following example reads a secret from AWS Secretsmanager, using
boto3
in a root_validator
and holds it in a BaseSetting
class.
import base64
import os
from typing import Optional
import boto3
from botocore.exceptions import ClientError
from pydantic import BaseSettings, SecretStr, root_validator
def read_public_key(secret_name: str, secrets_region_name: str) -> SecretStr | None:
# See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/secrets-manager.html
session = boto3.session.Session()
client = session.client(
service_name="secretsmanager", region_name=secrets_region_name
)
secret = None
try:
get_secret_value_response = client.get_secret_value(SecretId=secret_name)
except ClientError as e:
if e.response["Error"]["Code"] == "DecryptionFailureException":
raise e
# Catch other errors here ...
else:
if "SecretString" in get_secret_value_response:
secret = get_secret_value_response["SecretString"]
else:
secret = base64.b64decode(get_secret_value_response["SecretBinary"])
# Trying to pass None into SecretStr will result in a ValueError, so validation can happen
return SecretStr(secret)
class SecretSettings(BaseSettings):
SECRET_REGION: Optional[str] = os.getenv("SECRET_REGION") or "eu-central-1"
PRIVATE_KEY: Optional[SecretStr] = None
PUBLIC_KEY: Optional[SecretStr] = None
@root_validator
def val_private_key(cls, values: Dict): # noqa
private_key_secret = (
read_public_key(
secret_name="SECRETS_NAME_PRIVATE_KEY",
secrets_region_name=values["SECRET_REGION"],
)
or None # Used to trigger ValueError
)
public_key_id_secret = (
read_public_key(
secret_name="SECRETS_NAME_PUBLIC_KEY",
secrets_region_name=values["SECRET_REGION"],
)
or None # Used to trigger ValueError
)
return {
"PRIVATE_KEY": private_key_secret,
"PUBLIC_KEY": public_key_id_secret,
}
class Config:
env_file = ".env.dev.cf"
env_file_encoding = "utf-8"
settings = SecretSettings()
print(settings.PRIVATE_KEY)
# SecretStr('**********')
print(settings.PRIVATE_KEY.get_secret_value())
# -----BEGIN RSA PRIVATE KEY-----<...>==-----END RSA PRIVATE KEY-----