Tired of text-to-image models that can't spell or handle fonts and typography correctly? The secret seems to lie in multilingual, tokenization-free, character-aware transformer encoders such as ByT5 and CANINE-c.
*text-to-image pipeline*

As part of the Hugging Face JAX Diffusers Sprint, we will replace CLIP's tokenizer and encoder with ByT5's in Hugging Face's JAX/Flax text-to-image pre-training code and run it on the TPU resources sponsored by Google for the event.
More specifically, here are the main tasks we will try to accomplish during the sprint:
Pre-training dataset preparation: we are NOT going to train on lambdalabs/pokemon-blip-captions. So what is it going to be, and what are the options? Does anything in here or here take your fancy? Or maybe DiffusionDB? Or a savant mix of many datasets? We will probably need to combine several datasets, as we are looking to cover these requirements:
We should use the Hugging Face Datasets library as much as possible, since it supports JAX out of the box. For simplicity's sake we will limit ourselves to concatenated Hugging Face datasets such as LAION2B EN, MULTI and NOLANG. We shall, however, pre-load, pre-process and cache the dataset on disk before training on it.
Improvements to the original code:
- Use `jnp` (instead of `np`), and `jit`, `grad`, `vmap`, `pmap`, `pjit` everywhere! We should also make sure we do not miss any optimization made in the sprint code.
- Get `FlaxStableDiffusionSafetyChecker` out of the way.
- Replace CLIP with ByT5 in the original code:
  - Replace `CLIPTokenizer` with `ByT5Tokenizer`. Since this will run on the CPUs, there is no need for JAX/Flax unless there is hope for huge performance improvements. This should be trivial.
  - Replace `FlaxCLIPTextModel` with `FlaxT5EncoderModel`. This might be almost as easy as replacing the tokenizer.
  - Adapt `CLIPImageProcessor` for ByT5. This is still under investigation; it is unclear how hard it will be.
  - Adapt `FlaxAutoencoderKL` and `FlaxUNet2DConditionModel` for ByT5 if necessary.

Secondly, we will integrate a Hugging Face JAX/Flax ControlNet implementation into the above for better typographic control over the generated images. On top of the orthographically-enhanced Stable Diffusion above, and as per Peter von Platen's suggestion, we also introduce the idea of a typographic ControlNet trained on a synthetic dataset of images paired with multilingual specifications of the textual content, font taxonomy, weight, kerning, leading, slant and any other typographic attribute supported by the CSS3 Text, Fonts and Writing Modes modules, as implemented by the latest version of Chromium.
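To see why swapping in `ByT5Tokenizer` should be trivial: ByT5 is tokenization-free, so "tokenizing" is just mapping UTF-8 bytes to ids with a small offset for the special tokens. The sketch below is my own minimal re-implementation of that mapping (the real class is `transformers.ByT5Tokenizer`); the offset of 3 matches its special tokens `<pad>`=0, `</s>`=1, `<unk>`=2.

```python
# Illustrative re-implementation of ByT5-style byte-level "tokenization".
# Not the transformers code: a sketch of the id scheme it uses.
PAD_ID, EOS_ID, UNK_ID = 0, 1, 2
NUM_SPECIAL = 3  # byte value b maps to token id b + NUM_SPECIAL

def byt5_encode(text: str) -> list[int]:
    """Map a string to ByT5-style ids: UTF-8 bytes shifted by 3, plus EOS."""
    return [b + NUM_SPECIAL for b in text.encode("utf-8")] + [EOS_ID]

def byt5_decode(ids: list[int]) -> str:
    """Invert the mapping, skipping special-token ids."""
    return bytes(i - NUM_SPECIAL for i in ids if i >= NUM_SPECIAL).decode("utf-8")

print(byt5_encode("hi"))  # [107, 108, 1]
```

No vocabulary file, no merges, no per-language rules: this is also why the encoder sees every character of every script, which is exactly the property we want for spelling and typography.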