Google Scholar (GS) is a freely accessible academic search engine that indexes academic literature from a wide range of disciplines, document types, and languages. Unlike Web of Science (WoS) and Scopus, which have a selective approach to document indexing (they only index documents published in certain venues), GS follows an inclusive approach. Apart from being the tool most frequently used by researchers to find scholarly information, what makes GS stand out is that it builds its own citation graph by processing the references at the end of each document and matching them to documents already identified in its index. These citation counts are now widely consulted by researchers, in part because until GS was launched (2004), the main citation index (WoS) was only accessible via subscription. In subsequent years, GS launched several services based on data from its document base: Google Scholar Citations (GSC, an author profile service), Google Scholar Metrics (GSM, a journal ranking service), and Google Scholar's Classic Papers (GSCP, a short-lived service that listed highly-cited documents).
Despite its opacity (not much information on the coverage is available officially) and lack of native data exporting capabilities, many studies have tried to analyse the main characteristics of GS, and compare it to WoS and Scopus. These studies show that GS has a much more comprehensive coverage, especially in Arts, Humanities, and Social Sciences (AHSS), although GS also presents errors and limitations that other citation indexes do not have.
The general goal of this thesis was to explore whether it is feasible and sustainable to re-use data available in GS to generate data products or tools of a bibliometric nature that provide functionalities that GS does not provide. In order to do this, we followed two approaches that ran side by side.
In our first approach, we carried out studies that analysed the general characteristics of Google Scholar as a source of data: its strengths and weaknesses in terms of size, coverage, errors, and bibliometric indicators. In order to do this, we analysed the characteristics of various samples of GS data (the largest samples of GS data analysed to date), in some cases benchmarking them against the data available in WoS or Scopus.
The studies that resulted from this first approach show that GS has an extensive coverage of academic documents. Its coverage includes most of the documents covered in the multidisciplinary citation databases WoS and Scopus, as well as theses, dissertations, books, conference papers, and other unpublished materials (preprints, reports). Spearman correlation coefficients of citation counts between GS and WoS, or between GS and Scopus, are generally very high. Thus, if GS data were used in research evaluations, the results would be unlikely to change substantially, despite the additional citations found: documents would be ranked in much the same order. It is also shown that, despite the limited control over which documents are returned for a query, it is possible to identify in GS the most highly-cited documents in a given discipline, because GS generally presents documents with high citation counts first. Even when considering only highly-cited documents, GS appears to have a more extensive coverage than WoS or Scopus, especially in the areas of AHSS.
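The reasoning behind the Spearman result can be illustrated with a minimal sketch. It is not the code used in the thesis: the function names and the citation counts below are hypothetical, chosen only to show that a source reporting consistently higher counts can still rank documents in nearly the same order as another source.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical citation counts for the same six documents in two sources.
# The second source finds more citations for every document.
wos_counts = [120, 45, 45, 10, 3, 0]
gs_counts = [310, 98, 120, 31, 9, 2]

rho = spearman(wos_counts, gs_counts)
print(f"Spearman rho: {rho:.3f}")
```

Because the coefficient depends only on ranks, the larger absolute counts in the second list do not lower it; only changes in the relative order of documents do.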
In our second approach, we tested the knowledge obtained in the previous studies in practical real-life situations. These projects took the form of tailored web applications built for a variety of purposes, and open to everyone. The applications display data extracted from Google Scholar (and sometimes also other services) in ways that the native Google Scholar, Google Scholar Citations, and Google Scholar Metrics interfaces do not, thus expanding the range of ways in which users can interact with this information. Three different types of prototype applications were developed and are presented here. The first application presents journal-level bibliometric indicators for a large collection of journals in the Arts, Humanities, and Social Sciences: Journal Scholar Metrics (http://www.journal-scholar-metrics.infoec3.es). The second application presents data from a specific academic community at various levels of aggregation (author-, document-, journal-, and publisher-level), combining data not only from GS but also from other sources: Scholar Mirrors (http://www.scholar-mirrors.infoec3.es). In the third application, a large sample of data from GS is used to analyse Open Access levels by country, subject category, journal, and publication year. Lastly, we describe the work carried out so far on a fourth, more ambitious application capable of displaying information about all researchers working in Spain with a public GS profile.
For the second approach, a new methodology was developed which allowed us to combine information from several scholarly sources: the MADAP method (Multifaceted Analysis of Disciplines through Academic Profiles). The data extracted using this method allowed us to compare a large number of author-level bibliometric indicators from various sources. Author-level indicators in GS (all based on citations) correlated well with other production- and citation-based indicators from ResearchGate and ResearcherID, and also with Mendeley’s “Reader” indicator. On the other hand, GS indicators did not correlate well with connectivity-based metrics (followers).
The results of this thesis consistently show that GS data, and especially its citation data, can be useful for bibliometric analyses. Nevertheless, throughout all the analyses that have been performed, it has also become clear that there are important limitations that have to be considered when deciding whether to use data from GS for these purposes. Many of these limitations arise from the desire to use this tool for a purpose that falls outside the original scope intended by its creators, and the errors derive from the completely automated processing of documents from a great variety of sources and in a great variety of formats.