3.1.3. Cross-modal Generation

The first figure illustrates how a particular image and its content, here the sun shining above a mountain, can result in a specific font style.
The second figure illustrates how another image, this time with lightning above the mountain, results in a different font style.

Multimodal, or cross-modal, generation covers approaches that produce output in one modality from input in another, for instance text-to-image, text-to-speech, or image-to-text. In the font domain, this could mean text-to-font or image-to-font.

Chen et al. (2020) proposed a GAN-based framework in which the generated fonts reflect human emotional information derived from scanned faces. Since the input is scanned images and the output is generated fonts, this approach can be considered image-to-font. Kang et al. (2022) aimed to capture the correlations between impressions and font styles through a shared latent space in which a font and its impressions are embedded close to each other. Because the input is an impression described in text and the output is a generated font, this approach can be considered text-to-font. Matsuda, Kimura, and Uchida (2022) aimed to generate fonts with specific impressions using a font dataset with impression labels; like the previous work, it is text-to-font, as the input is an impression described in text and the output is a generated font. Ueda, Kimura, and Uchida (2022) analysed the impressions given to fonts by training a Transformer network; they did not generate fonts. This approach can be considered font-to-text, as the input is a font and the output is text describing its impression.
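To make the shared-latent-space idea more concrete, the sketch below pairs a convolutional encoder for glyph images with a simple encoder for impression words and trains both with a symmetric contrastive objective, so that a font and its impression words land near each other in the same space. This is a minimal illustration in PyTorch, not the implementation of any of the cited works; the module names, the 64×64 glyph size, the 128-dimensional latent space, and the InfoNCE-style loss are assumptions made for the example.

    # Minimal sketch of a shared latent space for font glyphs and
    # impression words. Illustrative only; not the cited authors' code.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FontEncoder(nn.Module):
        """Encodes a 64x64 grayscale glyph image into a latent vector."""
        def __init__(self, latent_dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(), # 16 -> 8
                nn.Flatten(),
                nn.Linear(128 * 8 * 8, latent_dim),
            )

        def forward(self, glyphs):
            # Unit-normalise so similarity reduces to a dot product.
            return F.normalize(self.net(glyphs), dim=-1)

    class ImpressionEncoder(nn.Module):
        """Encodes a bag of impression-word indices into the same space."""
        def __init__(self, vocab_size, latent_dim=128):
            super().__init__()
            self.embedding = nn.EmbeddingBag(vocab_size, latent_dim, mode="mean")
            self.proj = nn.Linear(latent_dim, latent_dim)

        def forward(self, word_indices, offsets):
            return F.normalize(self.proj(self.embedding(word_indices, offsets)), dim=-1)

    def contrastive_loss(font_z, imp_z, temperature=0.07):
        """Symmetric InfoNCE loss: matching font/impression pairs are
        pulled together, mismatched pairs within the batch are pushed apart."""
        logits = font_z @ imp_z.t() / temperature
        targets = torch.arange(font_z.size(0), device=font_z.device)
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2

In such a space, font-to-text amounts to retrieving the nearest impression embeddings for a given font, while text-to-font generation can be obtained by conditioning a decoder or a GAN generator on the impression embedding.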

More details about multimodal representation techniques are provided in the representations chapter.

Chen, Lu, Feifei Lee, Hanqing Chen, Wei Yao, Jiawei Cai, and Qiu Chen. 2020. “Automatic Chinese Font Generation System Reflecting Emotions Based on Generative Adversarial Network.” Applied Sciences 10 (17): 5976. https://doi.org/10.3390/app10175976.
Kang, Jihun, Daichi Haraguchi, Seiya Matsuda, Akisato Kimura, and Seiichi Uchida. 2022. “Shared Latent Space of Font Shapes and Their Noisy Impressions.” In MultiMedia Modeling, edited by Björn Þór Jónsson, Cathal Gurrin, Minh-Triet Tran, Duc-Tien Dang-Nguyen, Anita Min-Chun Hu, Binh Huynh Thi Thanh, and Benoit Huet, 146–57. Lecture Notes in Computer Science. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-98355-0_13.
Matsuda, Seiya, Akisato Kimura, and Seiichi Uchida. 2022. “Font Generation with Missing Impression Labels.” arXiv, June 2, 2022. https://doi.org/10.48550/arXiv.2203.10348.
Ueda, Masaya, Akisato Kimura, and Seiichi Uchida. 2022. “Font Shape-to-Impression Translation.” arXiv, March 28, 2022. https://doi.org/10.48550/arXiv.2203.05808.

Citation

If this work is useful for your research, please cite it as:

@phdthesis{paldia2025generative,
  title={Research and development of generative neural networks for type design},
  author={Paldia, Filip},
  year={2025},
  school={Academy of Fine Arts and Design in Bratislava},
  address={Bratislava, Slovakia},
  type={Doctoral thesis},
  url={https://lttrface.com/doctoral-thesis/},
  note={Department of Visual Communication, Studio Typo}
}