Wikipedia's Not So Little Sister Is Finding Its Own Way

Wikidata is arguably one of Wikipedia's most successful sister projects. It had a profound impact on Wikipedia in just a few years.

Published on Jul 08, 2019

In 2012, Wikipedia had grown and achieved so much in over a decade of creating an encyclopedia. But it was also at a point where fundamental change was needed: The world around Wikipedia was changing and Wikimedia had to find ways to make its content more accessible and support its editors in maintaining an ever increasing body of content in over 250 languages. The vision of a world in which every single human being can freely share in the sum of all knowledge was not achievable in this scattered way.

Ever since the very first Wikimania, Wikimedia’s annual conference, in 2005, one idea kept coming up: to make Wikipedia semantic and thus make its content accessible to machines. Machine-readability would enable intelligent machines to answer questions based on the content and make the content easier to reuse and remix. For example, it was not possible to easily answer the question of which are the biggest cities with a female mayor, because the necessary data was distributed over many articles and not machine-readable. Denny Vrandečić and Markus Krötzsch kept working on this idea and created Semantic MediaWiki, learning a lot about how to represent knowledge in a wiki along the way. Others had also started extracting content from Wikipedia, with varying degrees of success, and making the information available in machine-readable form.

So when the first line of code for the software that came to power Wikidata was written in 2012, it was an idea whose time had come. Wikidata was to be a free and open knowledge base for Wikipedia, its sister projects and the world that helps give more people more access to more knowledge. Today, it provides the underlying data for a lot of technology you use and the Wikipedia articles you read every day.

Being able to influence the world around you is an important and empowering thing, and yet we are losing this ability a bit more everywhere every day. More and more of our daily lives depends on data, so let’s make sure it stays open, free and editable for everyone in a world where we put people before data. Wikipedia showed how it can be done, and now its sister Wikidata joins to contribute a new set of strengths.

Growing Up

Wikidata always had bigger ambitions, but it started out by focusing on supporting Wikipedia. There were nearly 300 different language versions of Wikipedia, all covering overlapping (but not identical) topics without being able to share even basic data about these topics. Considering that most of these language versions had only a handful of editors, this was a problem. Small language versions were not able to keep up with the ever-changing world and, depending on which language you could read, a vast amount of Wikipedia content was inaccessible to you. Perhaps someone famous had died? That information was usually available quickly on the largest Wikipedias but took a long time to be added to the smaller ones, if they even had an article about the person. Wikidata was meant to help fix this problem by being a central place that stores general-purpose data related to the millions of concepts covered in Wikipedia articles (like the data found in the “infoboxes” on Wikipedia, such as the number of inhabitants of a city or the names of the actors in a movie).

To start this knowledge base, Wikidata began by solving a simple but long-standing problem for Wikipedians: the headache of links between different language versions of an article. Each article contained links to all other language versions covering the same topic, but this was a lot of redundancy and caused synchronisation issues. Wikidata’s first contribution was to store these links centrally and thereby eliminate needless duplication. With this first simple step, Wikidata has helped eliminate over 240 million lines of unnecessary wikitext from Wikipedia and at the same time created pages for millions of concepts on Wikidata, providing the basis for the next stage. Once the initial set of concepts was created and connected to Wikipedia articles, it was time for the actual data and the ability to make statements about the concepts (e.g. Berlin is the capital of Germany). Last but not least came the capability to use this data in Wikipedia articles. Now Wikipedia editors could enrich their infoboxes automatically with data coming from Wikidata.
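The redundancy of the old approach can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes a single topic covered in every one of roughly 300 language versions, and the function names are made up for this example.

```python
def links_stored_per_article(n_languages: int) -> int:
    """Old model: each language version lists links to all the others."""
    return n_languages * (n_languages - 1)


def links_stored_centrally(n_languages: int) -> int:
    """Central model: one shared record holds one link per language version."""
    return n_languages


# Assumption for illustration: one topic with an article in all 300 languages.
n = 300
print(links_stored_per_article(n))  # 89700
print(links_stored_centrally(n))    # 300
```

Under these assumptions a single topic needs 300 centrally stored links instead of 89,700 scattered ones, and a new language version adds one link rather than touching every existing article.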

Along the way a fantastic community maintaining that data developed, much faster than the development team could have dreamed of. This new community included new people who had never contributed to a Wikimedia project before and were now becoming interested because Wikidata was a good fit for them. It also included contributors from adjacent Wikimedia projects who were more interested in structuring information than writing encyclopedic articles and found their calling in Wikidata.

The number of concepts represented in Wikidata Items.

The number of editors on Wikidata since its start. The circles indicate the beginning and end of the mass-import of interwiki links.

Later, Wikidata’s scope expanded to support the other Wikimedia projects as well, such as Wikivoyage, Wikisource and Wikimedia Commons, since they can benefit from the same kind of centralized knowledge base as Wikipedia.

As it evolved, Wikidata became an attractive source for Wikimedia projects and for those who used to scrape data from Wikipedia infoboxes. External websites, apps, and visualisations used this information as a basic ingredient: from a website for browsing artwork, to a book inventory manager, to history teaching tools, to digital personal assistants. Now, Wikidata is used in countless places without most users even being aware of it.

Most recently, it became clear that we need to think beyond Wikidata and towards a large network of knowledge bases that run the same software (Wikibase) to publish data in an open and collaborative way: the Wikibase ecosystem. In this ecosystem many different institutions, activists and companies are opening up their data and making it accessible to the world by connecting it with Wikidata and with each other. Wikidata doesn’t need to be and shouldn’t be the only place where collaborative open data happens.

At the time of writing this chapter, Wikidata provides data about more than fifty-five million concepts. It includes data about such things as movies, people, scientific papers and genes. Additionally, it provides links to over 4,000 external databases, projects and catalogs, making even more data accessible. This data is added and maintained by more than 20,000 people every month and used in over half of all articles in Wikimedia projects.

Helping People (and Machines) Come Together

Just like Wikipedia is not like any other encyclopedia, Wikidata is not like any other knowledge base. There are a number of things that set Wikidata apart. They are a result of striving to be a global knowledge base and covering a multitude of topics in a machine-readable way.

The most important differentiator is probably the acknowledgement that the world is complex and can’t easily be pressed into simple data. Did you know that there is a woman who married the Eiffel Tower? That the Earth is not a perfect sphere? A lot of technology today is trying to simplify the world by hiding necessary complexity and nuance. Conflicting worldviews need to be surfaced. Otherwise we take away people’s ability to talk about, understand, and ultimately resolve their differences. Wikidata is striving to change that by not trying to force one truth but by collecting different points of view with their sources and context intact. This additional context can, for example, include which official body disputes or supports which view on a territorial dispute. Without this focus on verifiability instead of truth, and this refusal to force agreement, it would be impossible to bring together a community from different languages and cultures. For the same reason, Wikidata doesn’t have an enforced schema that restricts the data but, rather, a system of editor-defined constraints that highlight potential problems.
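To make the idea of "many sourced views plus soft constraints" more tangible, here is a minimal sketch. It is not Wikidata's actual data model or API: the class names, the range check and all values and sources are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    value: float
    source: str                                     # who supports this value
    qualifiers: dict = field(default_factory=dict)  # extra context


@dataclass
class Statement:
    property_name: str
    claims: list = field(default_factory=list)      # several views may coexist


def check_range(statement: Statement, low: float, high: float) -> list:
    """Soft constraint: report suspicious claims, never reject or delete them."""
    return [
        f"{statement.property_name}: {c.value} from {c.source} "
        f"is outside [{low}, {high}]"
        for c in statement.claims
        if not (low <= c.value <= high)
    ]


# Hypothetical data: two bodies disagree, and a third value looks like a typo.
area = Statement("area (km²)", [
    Claim(1200.0, "hypothetical body A"),
    Claim(1850.0, "hypothetical body B", {"includes": "disputed region"}),
    Claim(-5.0, "hypothetical bulk import"),
])

warnings = check_range(area, 0, 10_000)
print(len(area.claims))  # 3
print(warnings)
```

Both competing values stay in the data with their sources attached; the constraint only points editors at the one entry that looks wrong.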

Being able to cover different points of view and nuance is not enough, however, for a truly global project. The data also needs to be accessible to everyone in their language without privileging any particular language by design. Because of this, every concept in Wikidata is identified by a unique ID instead of an English name. Q5, for instance, is the identifier for the concept of a human. It is then given labels in the different languages: “human” in English, “người” in Vietnamese and “ihminen” in Finnish. This way the underlying data is language-independent and everyone can see the data in their language when viewing or editing it. This of course does not eliminate the language issue, but it goes a long way towards more equity in contributing to Wikimedia’s content.
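A small sketch shows how an ID with per-language labels might look in code. The `Item` class and its `label` method are hypothetical names for this example, and falling back to the bare ID when a label is missing is an assumption made here, not a description of how Wikidata itself handles missing labels.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    item_id: str                                # language-independent identifier
    labels: dict = field(default_factory=dict)  # language code -> label

    def label(self, language: str) -> str:
        # Assumption: show the bare ID if no label exists in that language.
        return self.labels.get(language, self.item_id)


human = Item("Q5", {"en": "human", "vi": "người", "fi": "ihminen"})

print(human.label("en"))  # human
print(human.label("fi"))  # ihminen
print(human.label("xx"))  # Q5
```

Everything that refers to the concept stores only `Q5`; the label is looked up at display time, so no language is built into the data itself.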

Besides fabulous people, Wikidata’s ultimate secret sauce is its connections. All concepts in Wikidata are connected to each other through statements. The statement “Iron Man -> member of -> Avengers”, for example, tells us that Iron Man is a member of the Avengers. That one connection alone does not tell us much yet. But if you take a number of other similar connections you can easily get a list of all Avengers, and then make a list of the movies they first appeared in and the actors they were portrayed by. A lot of simple individual connections taken together are powerful. If you add on top of that the wide range of topics Wikidata covers, it becomes even more powerful because you can make connections that have not been made before. How about a list of species named after politicians? Now possible, thanks to these simple connections! And those are just the connections inside Wikidata itself; Wikidata also connects to a large number of external databases, catalogs and projects that make even more data available. Since Wikidata has such a large number of links to external resources, it can act as a hub so that you, and even more importantly any machine, can find a vast amount of additional information based on a single piece of data. If the ISBN of a book is known, then knowing its entry in the relevant national library is just a hop away. There might not be a direct link from an artist’s entry in the Louvre’s catalog to their entry in the Rijksmuseum’s catalog, but with Wikidata this connection is easily made, opening up yet more options for discovering knowledge.
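The way simple connections add up can be sketched with a handful of statements stored as (subject, property, value) triples. This is an in-memory toy, not a query against Wikidata: the first triple comes from the text above, the others are sample data added for illustration, and the helper names and the "performer" property label are made up for this example.

```python
# Each statement is a (subject, property, value) triple.
STATEMENTS = [
    ("Iron Man", "member of", "Avengers"),   # from the text above
    ("Thor", "member of", "Avengers"),       # illustrative sample data
    ("Iron Man", "performer", "Robert Downey Jr."),
    ("Thor", "performer", "Chris Hemsworth"),
    ("Loki", "performer", "Tom Hiddleston"),
]


def subjects(prop: str, value: str) -> list:
    """All subjects that have the given property pointing at the given value."""
    return sorted(s for s, p, v in STATEMENTS if p == prop and v == value)


def values(subject: str, prop: str) -> list:
    """All values of the given property for the given subject."""
    return sorted(v for s, p, v in STATEMENTS if s == subject and p == prop)


# Step 1: one kind of connection gives the list of members.
avengers = subjects("member of", "Avengers")

# Step 2: following a second kind of connection gives their actors.
cast = {hero: values(hero, "performer") for hero in avengers}

print(avengers)  # ['Iron Man', 'Thor']
print(cast)      # {'Iron Man': ['Robert Downey Jr.'], 'Thor': ['Chris Hemsworth']}
```

Neither lookup is interesting on its own; the list of actors only appears once the two kinds of connection are chained, which is the point the paragraph above makes.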