Overcoming Baidu’s Block: How to Access Content for AI Training Despite Restrictions on Google and Bing

In Short

  • Baidu updates Baike service to block Google and Bing from scraping content.
  • Change aims to protect data used for training AI, following industry trends.
  • Chinese Wikipedia remains accessible, while Baidu Baike entries still appear in searches.
  • AI industry’s demand for high-quality content leads to new data-sharing policies.

Summary of Baidu’s Recent Update

Baidu, the prominent Chinese internet search provider, has updated its Baike service to block the major search engines Google and Microsoft Bing from scraping its content. The move, detected through changes in the Baidu Baike robots.txt file, reflects a broader industry trend: companies are increasingly protective of their data as demand surges for large datasets to train artificial intelligence models.

The update, which the Wayback Machine dates to August 8, blocks the Googlebot and Bingbot crawlers from indexing Baidu Baike’s repository, which they were previously allowed to crawl. Baidu Baike has nearly 30 million entries, and while some subdomains were already restricted, this is a more comprehensive measure.
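To illustrate how a robots.txt rule of this kind is read by a crawler, here is a minimal sketch using Python’s standard-library `urllib.robotparser`. The robots.txt text, the `ExampleBot` user agent, the `example.com` URL, and the `crawl_permissions` helper are all hypothetical and written only for illustration; they are not taken from the actual Baidu Baike file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
# This is NOT a copy of the real Baidu Baike robots.txt file.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: Bingbot
Disallow: /

User-agent: ExampleBot
Allow: /
"""


def crawl_permissions(robots_txt: str, user_agents: list[str], url: str) -> dict[str, bool]:
    """Return, for each user agent, whether the rules permit fetching `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in user_agents}


if __name__ == "__main__":
    permissions = crawl_permissions(
        SAMPLE_ROBOTS_TXT,
        ["Googlebot", "Bingbot", "ExampleBot"],
        "https://example.com/item/1",  # placeholder URL
    )
    for agent, allowed in permissions.items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Under these assumed rules, Googlebot and Bingbot are reported as blocked while the hypothetical ExampleBot is allowed. Note that robots.txt is a voluntary convention: it tells well-behaved crawlers what the site permits, but does not technically enforce it.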

Other platforms have taken similar steps. Reddit also limited search engine access to its content, with the exception of Google, which has a financial agreement for data access. Microsoft has likewise considered restricting competitors’ access to its internet-search data, particularly competitors that use the data for chatbots and generative AI services.

Despite Baidu’s restrictions, the Chinese Wikipedia, with over 1.43 million entries, remains open to search engine crawlers. Searches currently still show Baidu Baike entries, likely because search engines are serving cached content.

The industry is shifting towards strategic partnerships for content access, as seen in OpenAI’s agreements with Time magazine and the Financial Times. Baidu’s decision underscores the rising value of curated datasets in AI development and the resulting changes in online content management policies.

As the AI sector evolves, more companies are expected to reevaluate their data-sharing strategies, potentially leading to further shifts in how information on the internet is indexed and accessed.
