Google Bard | Blog Update

Do You Know Which Websites Have Been Used For Training The New Google Bard AI Tool?

With everyone raving about OpenAI’s ChatGPT, it was only a matter of time before Google gave a fitting reply. Enter Google Bard, the new, controversial, and experimental AI chatbot service from Google.


Google Bard was designed to be similar to ChatGPT. The main difference is that it draws on information from the Internet. This ChatGPT rival is powered by LaMDA (Language Model for Dialogue Applications), Google’s in-house language model.



How is Bard different from ChatGPT?


Bard made its debut on 6th February with a statement from Sundar Pichai, Alphabet’s CEO. The language model it uses, LaMDA, had been unveiled two years earlier. This AI-powered chatbot can respond to queries in a conversational way, just like its rival, ChatGPT.


Google states that this tool uses online data to provide more accurate and original replies. LaMDA is built on Transformer, the neural network architecture that was released as open source in 2017. ChatGPT, built on the GPT-3 family of language models, is also based on Transformer.




The earliest versions of Bard will use a lighter version of LaMDA. This is beneficial because it needs less computing power, so it can be scaled up to serve a greater number of users. Besides LaMDA, the tool will use web information when giving responses. The LaMDA model is trained on a dataset of web content known as Infiniset. However, not much is known about where that data was obtained or how.


Which sites have been used to train the Bard AI?


According to the LaMDA research paper, several types of datasets were used to train the model behind Bard. Only 12.5% is derived from a public dataset of crawled web content, while another 12.5% is derived from Wikipedia.


Google remains tight-lipped about where the rest of the data comes from. However, there are some indications of which sites may have ended up in the datasets. LaMDA was trained on Infiniset, a mix of internet content deliberately chosen to enhance the model’s ability to hold a dialog.


According to the research paper, this composition was chosen to achieve more robust performance on dialog tasks. In total, LaMDA was trained on 1.56 trillion words of internet text and public dialog data. The dataset comprises 50% dialogs data from public forums, 12.5% C4-based data, 12.5% code documents from sites such as Q&As and tutorials, 12.5% Wikipedia, 6.25% English web documents, and 6.25% non-English web documents.
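To get a feel for the scale of each slice, the percentages above can be turned into rough word counts. This is only a back-of-the-envelope sketch: it assumes the shares apply directly to the 1.56 trillion word total, which is not stated explicitly here, and the names `INFINISET_SHARES` and `estimated_words` are made up for the illustration.

```python
from fractions import Fraction

# Total size quoted above: 1.56 trillion words.
TOTAL_WORDS = 1_560_000_000_000

# Shares as quoted above, stored as exact fractions to avoid rounding drift.
INFINISET_SHARES = {
    "dialogs data from public forums": Fraction(50, 100),
    "C4-based data": Fraction(125, 1000),
    "code documents (Q&As, tutorials)": Fraction(125, 1000),
    "Wikipedia": Fraction(125, 1000),
    "English web documents": Fraction(625, 10000),
    "non-English web documents": Fraction(625, 10000),
}


def estimated_words(shares, total):
    """Estimate words per slice, ASSUMING shares are measured in words."""
    return {name: int(share * total) for name, share in shares.items()}


# Everything except C4 and Wikipedia has no clearly stated origin.
UNDISCLOSED_SHARE = (
    1 - INFINISET_SHARES["C4-based data"] - INFINISET_SHARES["Wikipedia"]
)

if __name__ == "__main__":
    for name, words in estimated_words(INFINISET_SHARES, TOTAL_WORDS).items():
        print(f"{name:35s} {words / 1e9:7.1f} billion words")
    print(f"Share without a clear origin: {float(UNDISCLOSED_SHARE):.0%}")
```

Under that assumption, the forum dialogs alone would account for roughly as many words as all the other slices combined.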






The first two parts of Infiniset, Wikipedia and C4, consist of data whose origin is known. C4 is a specially filtered version of the Common Crawl dataset. The remaining data, which makes up the bulk of Infiniset, comprises words scraped from the Internet. But you won’t find any information on how the data was obtained or which sites were involved. Google simply uses vague terms like “non-English web documents”. “Murky” is a fitting word for this 75% of the data, whose origin is not clearly explained.


The C4 (Colossal Clean Crawled Corpus) dataset was created by Google in 2020 and is open-source. It is built from Common Crawl data. Common Crawl is a non-profit organization that crawls the web to create free datasets for everyone. The organization is run by people who have worked for the Wikimedia Foundation, former Google employees, and advisors. The raw data collected by Common Crawl is cleaned by removing things like obscene words, thin content, and navigational menus, and by deduplicating the text, so that the dataset is compact and contains just the main content. The idea behind the filtering is to do away with gibberish while retaining natural English.
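The cleaning steps described above can be illustrated with a toy filter. This is not the actual C4 pipeline: the blocklist, the thresholds, and the function names `clean_page` and `clean_corpus` are all assumptions chosen for the sketch, and real pipelines are considerably more involved.

```python
import re

# Hypothetical placeholder blocklist; a real one would be much longer.
BLOCKED_WORDS = {"badword"}
MIN_WORDS_PER_LINE = 5  # assumed cutoff: shorter lines look like menu entries
MIN_LINES_PER_PAGE = 2  # assumed cutoff: pages with less text count as "thin"


def clean_page(text):
    """Return the cleaned page text, or None if the page should be dropped."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    if words & BLOCKED_WORDS:
        return None  # page contains a blocked word
    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Navigation menus and buttons rarely form full sentences.
        if len(line.split()) < MIN_WORDS_PER_LINE:
            continue
        # Keep only lines that end like a sentence.
        if not line.endswith((".", "!", "?", '"')):
            continue
        kept.append(line)
    if len(kept) < MIN_LINES_PER_PAGE:
        return None  # thin content
    return "\n".join(kept)


def clean_corpus(pages):
    """Clean every page and drop exact duplicates of pages already kept."""
    seen = set()
    result = []
    for page in pages:
        cleaned = clean_page(page)
        if cleaned is None or cleaned in seen:
            continue
        seen.add(cleaned)
        result.append(cleaned)
    return result
```

Feeding the same page in twice returns it once, with short menu-like lines such as "Home" or "About us" stripped out. Pages that contain a blocked word, or that have too little sentence-like text left, are dropped entirely.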


Some of the 25 sites included in C4 are en.m. [...], etc.


Now, 50% of the training data is from public forums. One can assume these are sites like Reddit or communities like StackOverflow. Reddit, for example, has been used in several key datasets, including ones developed by OpenAI and Google’s own WebText-like dataset. Before the release of the LaMDA paper, Google also published details of another dataset of public dialog websites, known as MassiveWeb. While it isn’t known for certain whether MassiveWeb was used to train Bard, it shows what Google has used for another language model.


MassiveWeb was built by DeepMind, which is owned by Google, and was used to create Gopher, another language model. MassiveWeb draws on dialog web sources beyond Reddit to avoid a bias toward Reddit-influenced data. So it also uses data scraped from other websites, such as Quora, Facebook, YouTube, Medium, and StackOverflow. While there’s no evidence that LaMDA was trained on these sites, there is a fair chance that Google used them, because another dataset Google developed around the same time did.


The remaining 37.5% consists of Wikipedia (12.5%), English web documents (6.25%), non-English web documents (6.25%), and code documents (12.5%) derived from sites such as tutorials and Q&As. One can only speculate about which tutorial and Q&A websites were crawled. Everyone knows Wikipedia, but the English and non-English web documents are not clearly specified.


Some publishers fear that having their websites used to train AI systems may backfire: once these systems take over, their sites could become outdated and irrelevant. Whether that will happen isn’t yet known, but it is a matter of real concern. Google continues to be vague about the sites and data used for training LaMDA.