There's No Tiananmen Square In the New Chinese Image-Making AI (technologyreview.com)
The ERNIE-ViLG model is part of Wenxin, a large-scale natural-language-processing project from China's leading AI company, Baidu. It was trained on a data set of 145 million image-text pairs and contains 10 billion parameters -- the values a neural network adjusts as it learns, which the AI uses to discern subtle differences between concepts and art styles. That means ERNIE-ViLG has a smaller training data set than DALL-E 2 (650 million pairs) and Stable Diffusion (2.3 billion pairs) but more parameters than either (DALL-E 2 has 3.5 billion parameters and Stable Diffusion has 890 million). Baidu released a demo version on its own platform in late August and later on Hugging Face, the popular international AI community. The main difference between ERNIE-ViLG and Western models is that the Baidu-developed one understands prompts written in Chinese and is less likely to make mistakes with culturally specific words.
But ERNIE-ViLG will be defined, as the other models are, by what it allows. Unlike DALL-E 2 or Stable Diffusion, ERNIE-ViLG has no published explanation of its content moderation policy, and Baidu declined to comment for this story. When the ERNIE-ViLG demo was first released on Hugging Face, users who input certain words received the message "Sensitive words found. Please enter again (...)," a surprisingly honest admission about the filtering mechanism. However, since at least September 12, the message has read "The content entered doesn't meet relevant rules. Please try again after adjusting it. (...)"

In a test of the demo by MIT Technology Review, a number of Chinese words were blocked: names of high-profile Chinese political leaders like Xi Jinping and Mao Zedong; terms that can be considered politically sensitive, like "revolution" and "climb walls" (a metaphor for using a VPN service in China); and the name of Baidu's founder and CEO, Yanhong (Robin) Li. While words like "democracy" and "government" are allowed on their own, prompts combining them with other words, like "democracy Middle East" or "British government," are blocked. Tiananmen Square in Beijing also can't be found in ERNIE-ViLG, likely because of its association with the Tiananmen Massacre, references to which are heavily censored in China. Giada Pistilli, a principal ethicist at Hugging Face, says it could be helpful for ERNIE-ViLG's developer to release a document explaining its moderation decisions. "Is it censored because it's the law that's telling them to do so? Are they doing that because they believe it's wrong? It always helps to explain our arguments, our choices," says Pistilli.
"Despite the built-in censorship, ERNIE-ViLG will still be an important player in the development of large-scale text-to-image AIs," concludes the report. "The emergence of AI models trained on specific language data sets makes up for some of the limitations of English-based mainstream models. It will particularly help users who need an AI that understands the Chinese language and can generate accurate images accordingly."
"Just as Chinese social media platforms have thrived in spite of rigorous censorship, ERNIE-ViLG and other Chinese AI models may eventually experience the same: they're too useful to give up."