Master’s theses and open scholarship: a case study

Purpose – This paper aims to show how Master’s theses can contribute to open scholarship and to give reasons why they should.
Design/methodology/approach – The paper provides an overview of published studies and, based on the experience at the University of Lille (France), describes some essential aspects of the processing and valorization of these documents in the academic cloud, as a contribution to open scholarship.
Findings – Because of their number and diversity, collections of Master’s theses in open repositories could be an excellent showcase for the universities’ Master programs and research. They could also offer interesting and large samples for content analysis, citation analysis and text and data mining (TDM). However, some issues need attention, above all intellectual property, quality and preservation. Quality is crucial, and the paper describes how the Lille project proceeds to assure sufficient quality and rights clearance, and why the project shifted from students’ self-archiving to a digital library collection in the academic cloud, run by faculty and information professionals. The paper also presents some usage statistics to illustrate the potential global impact of such a collection.
Practical implications – The paper provides helpful empirical evidence and insight for those who want to develop the dissemination of Master’s theses via open repositories.
Originality/value – In the context of open scholarship, only a few studies deal with Master’s theses, and this paper is the only recent reference that brings together a review of other papers and a case study with empirical evidence.


Introduction
Open scholarship has become the new paradigm of scientific communication. The term refers to "teaching and research practices that espouse openness (i.e.) a collection of emergent scholarly practices that espouse openness and sharing", including open access, open publishing and open education (Veletsianos and Kimmons, 2012, p. 167). An ever-increasing number of books, journals and repositories provide free and unrestricted access to scientific content from all disciplines. Two thirds of the open institutional repositories contain collections of electronic theses and dissertations (ETDs) of varying size, mostly PhD theses, which have been described as the most useful kinds of invisible scholarship and the most invisible kinds of useful scholarship (Suber, 2012).
The situation is quite different for Master's theses. Although their number is several times higher 1, they are far less well represented in open repositories than PhD theses. Less than 1 per cent of all repositories clearly indicate that they contain Master's theses, and freely available Master's theses represent less than 0.5 per cent of the scientific items retrieved by the Bielefeld Academic Search Engine. However, Master's theses can be an excellent showcase for the academic Master programs and research. They are original works of the mind and, at least for the best of them, they contribute to scientific and professional knowledge. The following paper provides an overview of the nature, publishing and interest of Master's theses and, based on our project at the University of Lille (France), describes some essential aspects of the processing and valorization of these documents in the academic cloud, as a contribution to open scholarship.

Master's theses: an overview
The Master's degree dates back to the origin of European universities. Today, it is usually a second-cycle academic degree requiring previous study at the Bachelor's level (or equivalent). It is awarded by universities or colleges upon completion of a one- or two-year program demonstrating advanced disciplinary knowledge or professional skills. Master's degree names and the structure and duration of Master programs vary according to the country and even the university; diversity is the rule.
In general, the granting of a Master's degree requires some kind of thesis or dissertation on a research or a professional project (internship), where the students have to demonstrate their capacity for independent and rigorous thinking and for complex problem solving, along with field-specific methodology and knowledge. A recent report issued by the US National Academies of Sciences, Engineering, and Medicine (2017) highlights the potential scientific quality of Master's theses based on substantial undergraduate research experiences defined as "an inquiry or investigation conducted by an undergraduate student that makes an original intellectual or creative contribution to the discipline" 2.
Alongside research capacities and originality, quality is assessed in terms of innovation and discovery, refinement of experimental design, new research techniques, teamwork and writing skills. "As Master's theses aim to provide evidence of the skills [...]" (Seliger, 2015, p. 131). Their quality is impacted by the Master program and, in particular, by the importance of research. Master's theses are a "complex genre", considering both the set of problems related to its configuration (structure, language, norms of reference) and the factors that constrain its production (methodological procedures, student/supervisor relationship, time management etc.) (Carvalho et al., 2017).
1 In France, for instance, Master students represent 35% of all university students, compared to 4% PhD students; in other words, the relation is 10 to 1.
Institutional repositories: Some institutions disseminate their undergraduate students' theses and dissertations in their own institutional repository, as part of their intellectual output and along with articles, communications, working papers etc. For instance, in the University of Amsterdam Scripties repository, Master's theses represent 70 per cent of the nearly 50,000 undergraduate works 3.
General aggregating repositories: These multi-institutional or consortial repositories aggregate files and metadata from different document types and institutions. The Latin American portal LA Referencia, for instance, provides access to nearly 1.5 million publications, one third of them Master's theses, mostly from Brazil 4.
ETD repositories: As a specific variant of the multi-institutional repositories mentioned above, they only contain ETDs from different institutions, like the OhioLINK Electronic Thesis and Dissertation Center with over 90,000 undergraduate, Master's and doctoral theses and dissertations from 32 universities and colleges 5 and the Scandinavian DiVA portal for academic publications and student papers produced at 44 universities, colleges of higher education, research institutes and museums, with more than 250,000 student papers from BA, Master and other diplomas of which 72 per cent are in open access 6 .
Because of their number, diversity and quality, Master's theses prove their usefulness for studies on the evolution and distribution of research subjects and methodologies. A synthesis of published research with samples of Master's theses reveals three different approaches (for more details and further reading, see annex).
Content analysis: Nearly 20 recent papers from Iran, Slovenia, Brazil, China, Zambia, Turkey and other countries share an exploratory-descriptive approach and aim at assessing research paradigms, the diversity and trends of research topics, writing skills and information literacy, the students' background and the choice of research methods like qualitative or action research. They also share the basic assumption that Master's theses generally display sufficient quality and relevance for content analysis and scientometrics.
TDM: Some studies, especially from China, apply TDM techniques to Master's theses, such as quantitative analysis of keyword frequency, co-word matrices, co-occurrence network analysis, factorial or cluster analysis, strategy coordinate analysis and multidimensional scaling analysis. Their objective is the assessment of research topics ("hot issues"), methodologies or funding trends through the automatic exploitation of Master's theses.
Citation analysis: Citation analysis constitutes a third approach to Master's theses. These studies are based on the references produced by the Master students and assess the use of scientific literature in terms of document types, preferred journals, geographic origin of cited sources, and so on.
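Two of the TDM techniques named above, keyword frequency and co-word counting, can be illustrated with a minimal sketch. The three keyword lists and the function names are invented for illustration and do not come from any of the cited studies; real analyses work on much larger corpora and usually add clustering or network analysis on top of the co-occurrence counts.

```python
from collections import Counter
from itertools import combinations

# Hypothetical author keywords for three theses (illustrative data only).
theses_keywords = [
    ["open access", "repository", "metadata"],
    ["open access", "metadata", "text mining"],
    ["repository", "preservation"],
]

def keyword_frequencies(records):
    """Count in how many theses each keyword occurs."""
    counts = Counter()
    for keywords in records:
        counts.update(set(keywords))  # a keyword counts once per thesis
    return counts

def coword_counts(records):
    """Count how often each unordered keyword pair occurs in the same thesis."""
    pairs = Counter()
    for keywords in records:
        # Sorting makes ("a", "b") and ("b", "a") the same key.
        for first, second in combinations(sorted(set(keywords)), 2):
            pairs[(first, second)] += 1
    return pairs

frequencies = keyword_frequencies(theses_keywords)
cowords = coword_counts(theses_keywords)
print(frequencies["open access"])              # theses using this keyword
print(cowords[("metadata", "open access")])    # theses using both keywords
```

The pair counts form the upper triangle of a co-word matrix; feeding them into a graph or clustering library would be the next step in a fuller analysis.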
Along with the potential for scientometrics and TDM, there are other reasons for open access to Master's theses, such as enhancing accessibility, increasing the impact in scientific journals and conferences (Gillet et al., 2013), preventing plagiarism, strengthening visibility for Master's institutions (Xia and Opperman, 2010), contributing to the curriculum quality and fostering the potential use of outcomes especially from less known programs and disciplines (Martens, 2011).
Quality is essential for the interest and potential of Master's theses. Not all Master's theses reflect substantial undergraduate research experiences or professional projects. Poor quality of abstracts can limit TDM. Evaluation can help to control the quality and to select the best papers. Even if the criteria are neither homogeneous nor applied in the same way everywhere, with a large diversity of written guidelines and evaluation grids, of the supervisors' role and of oral exams to adjust the final mark (Bøhn and Hasselgreen, 2011), the evaluation generally includes some similar features, such as background knowledge, description of methods, data and findings, discussion of the findings and the use of references. Sometimes, the Master assessment includes research skills and the relation with the supervising research team. And even if their quality may vary and should be controlled, their diversity and richness make them interesting and useful resources for citation analyses, for content mining and scientometric assessment. But this potential requires open accessibility and reusability. The Lille project MemorySID provides some insight into how to proceed.

The Lille project MemorySID
The following section describes the project MemorySID at the University of Lille (France), which, with nearly 70,000 students, is the largest French university and one of the largest French-speaking universities worldwide. The project started in 2008, to contribute to open scholarship through open access publishing of Master's theses in library and information sciences, via a national infrastructure in the public academic cloud. Over the years, the project rationale shifted from self-archiving to a two-layer approach with two digital libraries, one in the intranet and the other in the cloud.

Phase one (2008-2015): self-archiving
From 2008 on, the Department of Information and Documentation Sciences (SID) at the University of Lille (France) encouraged its Master students to deposit their theses in mémSIC 7, a national repository for Master's theses in library and information sciences. The objective was to publish a selection of good and non-confidential works in open access and to guarantee long-term findability, accessibility and preservation through a national OA infrastructure, to valorize the students' work and to increase the Master programs' visibility (Chauvin et al., 2010; Mann, 2010). The three guiding principles for this first project phase were:
• validation by the jury (faculty);
• double authorization by the student and by the project tutor (internship); and
• self-archiving by the students.
The academic cloud was preferred to a local, institutional solution to optimize visibility and usage. The success was mixed. Between 2008 and 2015, 70 theses were deposited on mémSIC, with scores equivalent to the US academic A and B grades (for French readers: 14/20 or more). The impact in terms of usage was more than satisfying, with statistics going up to several thousand downloads per paper (see below). However, the main problem was the lack of motivation for self-archiving and, in some cases, the lack of quality of metadata and files. Another problem was that mémSIC does not allow the creation of an institutional collection or portal, thus reducing the impact for the Master program.

Phase two (from 2016 on): digital library
To resolve these issues, a follow-up project was launched in 2016 called MemorySID (Vanacker, 2017). For this second phase, the priorities were different:
• retrodigitization of all print Master's theses still in the academic library and department holdings;
• integration of all digitized Master's theses into the campus-wide Nuxeo document management system (DMS), accessible for students and scholars via their intranet workspace; and
• creation of a new collection on the national repository DUMAS, with a selection of the best theses ("showcase").
In other words, we favored a two-layered approach with two digital libraries, one as a folder in the academic intranet and the other as a portal in the cloud, on the national ETD repository DUMAS (Figure 1). DUMAS is hosted by the French Center for Direct Scientific Communication 8, which also operates the French national open repository HAL, and it contains more than 19,000 Master's theses. Only library and faculty can deposit files and create metadata. There is no self-archiving. Each deposit needs validation by the repository's administrator. In this respect, DUMAS is a kind of crossover between an open repository and a digital library. After removing duplicates, 765 print theses were digitized, 583 from the library holdings (1996-2004) and 227 from the SID department (2005-2016). The text files (PDF) were uploaded to the DMS and indexed in a Dublin Core (DC) compliant metadata format. To the 765 print works from 1996-2016, 9 native digital Master's theses were added in 2017. Of these 774 Master's theses, 715 are searchable and available for all students and scholars on the campus, via the academic virtual workspace, while the others remain in the "dark" part of the intranet because of their confidential content. The digital library (folder) is administered by the faculty (SID Department), while the content of the DMS is maintained and backed up by the university IT Department.
Of the 715 Master's theses in the public DMS, 87 per cent have a sufficient score (grade A or B) and are authorized for online dissemination, according to the administrative records (i.e. defense minutes). So far (July 2018), 523 theses from the DMS have been deposited on the DUMAS repository and are online. Additionally, the metadata from 64 self-archived theses on mémSIC (first project phase) were transferred to DUMAS. This means that at present, the new Lille LIS Master's theses collection on the DUMAS repository provides unrestricted access to 587 theses from 1996 to 2017.
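The counts reported in this section can be cross-checked with a few lines of arithmetic. The sketch uses only figures stated above; the variable names are ours, and the number of eligible theses is an approximation because the 87 per cent share is itself rounded.

```python
# Figures as stated in the text above.
digitized_print = 765          # print theses digitized after removing duplicates
native_digital = 9             # native digital theses added in 2017
searchable_on_campus = 715     # theses visible in the intranet DMS
deposited_from_dms = 523       # DMS theses deposited on DUMAS (July 2018)
transferred_from_memsic = 64   # records transferred from mémSIC

total_in_dms = digitized_print + native_digital
not_searchable = total_in_dms - searchable_on_campus
online_on_dumas = deposited_from_dms + transferred_from_memsic

# Approximation: 87 per cent of the searchable theses are eligible for
# online dissemination (rounded share, so the count is indicative only).
eligible = round(0.87 * searchable_on_campus)

print(total_in_dms, not_searchable, online_on_dumas, eligible)
# → 774 59 587 622
```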

Usage of the DUMAS collection
The following usage statistics are from July 19, 2018. They assess the download figures for all 587 Master's theses available on DUMAS, for the period from January 2015 to June 2018.
Cumulative figures: up to now, the 587 items have been downloaded 296,074 times (Figure 2). The median download statistics for the last 12 months are 3,855 per month, and they are continuously increasing.
Usage statistics per item: the analysis of the usage statistics reveals that all deposited Master's theses, without exception, have been accessed and downloaded (Figure 3). Figure 3 shows that usage differs considerably among the Master's theses, ranging from 10 to more than 60,000 downloads, with a median of 43. At least two factors may explain these differences:
2. The topics: theses on "hot issues" are probably downloaded more often. However, the real impact of the content on usage statistics is difficult to assess and is surely affected by other variables such as the quality of indexing and referencing.
Information about interesting and recent deposits is published via social media, e.g. Twitter, Facebook and LinkedIn; metadata, links and the full text can be discovered on sites like Scoop.it, CiteULike or portals for students. However, without Digital Object Identifier minting and a more sophisticated altmetrics tool, it is impossible to assess the real impact of the deposits inside and outside of the academic communities.
Where do the successful download requests come from? The DUMAS repository identifies the country of nearly 80 per cent of the downloads.
In total, 30 per cent of the successful download requests come from France, and another 13 per cent come from the three French-speaking Maghreb countries Morocco, Algeria and Tunisia. The USA represents 3 per cent, Germany and Canada 2 per cent each, and the UK, South Africa and China 1 per cent each. The member states of the so-called "Francophonia" represent 60 per cent of the download requests, i.e. 75 per cent of all geographically identified downloads. However, the collection's outreach is global and not limited to the French-speaking world: the complete download list contains 186

Discussion
MemorySID is a faculty-based project in social sciences and humanities, launched and undertaken by the SID department staff to increase the outreach and impact of the Lille LIS Master program. From the beginning, it raised questions and problems that had to be fixed by the faculty, in partnership with the academic library, the IT department and the host of the national open repository. Four issues appear crucial for the future development of open and unrestricted dissemination of Master's theses.

IP clearance
Intellectual property (IP) is an important issue. Open access to Master's theses via open repositories means uploading and publishing, and both need authorization from the rights holder and, at least in France where dissertations are considered as administrative proof for the diploma, also from the institution.
Once the student has left the Master program, this IP clearance is difficult to obtain and the risk is high that time produces an increasing number of "orphan theses". The best moment is at the defense (viva), when the student presents the results of her or his research work and internship. This is the opportunity for the student and the jury to sign the authorization for open access publishing, as part of the minutes of the thesis defense. The signed minutes are archived by the department and serve as proof for the dissemination via the DMS and the open repository.
But this is not enough. Some theses contain sensitive information or personal data, while others include material protected by third-party rights (photos, maps, etc.). It may also be appropriate to request a third authorization from the professional tutor of the internship. If the student cannot guarantee full rights clearance for included material, it will be prudent either to drop this material or to exclude the thesis from open access.
At present, of all theses preserved in the DMS, four out of five are freely shared on the internet, while 12 per cent are restricted to on-campus access via the intranet and 8 per cent are in the DMS but not available at all (Figure 4).
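The three access levels can be read as the outcome of a simple decision rule. The sketch below is a simplified reading of the criteria described in this paper (score of at least 14/20, signed authorizations, no confidential content, cleared third-party material), not the actual MemorySID workflow; the field and function names are hypothetical.

```python
from dataclasses import dataclass

SCORE_THRESHOLD = 14  # marks out of 20, as stated for the Lille project

@dataclass
class Thesis:
    score: float               # final mark out of 20
    student_authorized: bool   # student signed the open access authorization
    jury_authorized: bool      # jury approved dissemination at the defense
    confidential: bool         # contains confidential or sensitive content
    third_party_cleared: bool  # rights cleared for included third-party material

def access_level(thesis: Thesis) -> str:
    """Return 'confidential', 'intranet' or 'open access' for one thesis."""
    if thesis.confidential:
        return "confidential"
    cleared = (
        thesis.student_authorized
        and thesis.jury_authorized
        and thesis.third_party_cleared
    )
    if cleared and thesis.score >= SCORE_THRESHOLD:
        return "open access"
    # Everything else stays available on campus only.
    return "intranet"

print(access_level(Thesis(15, True, True, False, True)))   # open access
print(access_level(Thesis(12, True, True, False, True)))   # intranet
```

In practice the final decision is taken by the jury, so a rule of this kind can only pre-sort candidates for review.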

Figure 4: Open access 80%, Intranet 12%, Confidential 8%
The key factor for the IP-related process is the faculty support, i.e. the personal and professional commitment to open scholarship which implies two "upstream actions":
1. A clear and explicit communication about open repositories, open scholarship and sharing.
2. Guidelines about how to write a Master's thesis, how to use protected third-party material, how to handle confidential and sensitive information, personal data (privacy issues), etc.
Sometimes, fears are expressed that unrestricted dissemination of Master's theses may increase the risk of plagiarism. Indeed, this "collateral damage" cannot be excluded but on the other hand, open access to Master's theses will improve the performance of anti-plagiarism software.

Quality, metadata and identifiers
The second issue is quality. While all Master's theses should be preserved in the intranet DMS, only the best ones should be disseminated in open access, for two reasons:
1. Insofar as the collection provides an excellent showcase and high visibility for the Master program, the dissemination of poor papers should be avoided.
2. To foster the potential for TDM, particular attention should be paid to the quality of metadata.
The quality issue has a paradoxical effect: if the deposit is limited to the best ("outstanding") Master's theses, only a small part will be available, reducing their diversity and representativeness. On the other hand, a selective "institutional quality label" will increase their value for the human reader and the machine.
At Lille, the threshold for online dissemination is set at 14 marks out of 20, which is similar to the US A and B grades. This score is the result of collegiate evaluation by at least two scholars, applying the usual criteria, e.g. quality of the analysis, independent thinking, presentation, structure, language and style and references. Nevertheless, the score is only one criterion and the final decision is taken at the defense (viva), by the jury.
The format and quality of metadata remain a challenge. Standard metadata improve findability and interoperability. Both systems, the DMS (Nuxeo) as well as DUMAS, make use of the Dublin Core (DC) elements. To limit inconsistent indexing, some elements were refined, such as rights (level of availability), source (Master program, level, date of the viva), description (author's abstract) and filename.
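To make the refined elements more concrete, here is a minimal sketch of a flat Dublin Core record built with the Python standard library. The element names title, creator, date, description, source and rights belong to the Dublin Core element set; the wrapper element `record`, the function name and all values are hypothetical and do not reproduce the actual Nuxeo or DUMAS schema.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)

def build_dc_record(fields):
    """Build a flat Dublin Core record from a dict of element name -> value."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        element = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        element.text = value
    return record

# Hypothetical values; the refinements follow the description in the text.
example = build_dc_record({
    "title": "An example Master's thesis",
    "creator": "Doe, Jane",
    "date": "2016",
    "description": "Author's abstract goes here.",
    "source": "Master program; level; date of the viva",
    "rights": "open access",  # level of availability
})

print(ET.tostring(example, encoding="unicode"))
```

Restricting elements such as rights to a short controlled list of values is one way to limit the inconsistent indexing mentioned above.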
The DUMAS metadata schema follows the Text Encoding Initiative format of the French national HAL repository with some specific elements for Master's theses, e.g. the name of the Master program with its subdivisions, specialties and options. The problem is that the DMS and DUMAS follow different schemas, with different refinements and values, and that their metadata silos are not connected. This means the creation of two different metadata sets for the same document, one for the DMS and the other for the repository. The lack of interconnection and interoperability is a source of inconsistency and errors, causes double workload and requires more post-deposit curating than necessary. A three-level routine reduces the metadata heterogeneity and error level:
1. Inconsistencies and anomalies are rejected by the DUMAS administrator during the validation.
2. The index of authors and directors' names is periodically controlled by the local administrator.
3. The tag cloud is checked from time to time for synonymy, variants etc.
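Part of the third check can be supported by a small script. The sketch below groups tags that differ only by case, accents, hyphens or spacing; it is an illustration with invented tags and hypothetical function names, not the tool used at Lille, and it cannot detect true synonyms, which still need a human reader or a thesaurus.

```python
import unicodedata
from collections import defaultdict

def normalize_tag(tag):
    """Lowercase, strip accents and collapse spaces and hyphens."""
    decomposed = unicodedata.normalize("NFKD", tag)
    without_accents = "".join(
        char for char in decomposed if not unicodedata.combining(char)
    )
    return " ".join(without_accents.lower().replace("-", " ").split())

def find_variants(tags):
    """Group tags that share the same normalized form."""
    groups = defaultdict(set)
    for tag in tags:
        groups[normalize_tag(tag)].add(tag)
    # Keep only normalized forms that hide more than one spelling.
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}

# Hypothetical tags, for illustration only.
tags = ["Open Access", "open-access", "Bibliothèque", "bibliotheque", "metadata"]
variants = find_variants(tags)
print(variants)
```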
A last comment on unique identifiers: DUMAS systematically attributes a specific HAL identifier to each new deposit. To track impact on social media (altmetrics), it may be interesting to consider additionally digital object identifier minting.

Text and data mining
TDM tools and methods are already applied to Master's theses, especially to assess research topics, methodologies and funding trends. In Lille, a research team started to mine PhD and Master's theses in exploratory research projects, to identify specific topics in different disciplines like Agronomics, Cultural Studies or Information Sciences, and geographical information related to the concept of territory (Kergosien et al., 2018).
These studies confirm that the diversity and volume of Master's theses bear interesting potential for TDM and can produce relevant new knowledge about research topics, methods, people, etc. As the output of TDM is conditioned by the quality of the corpus, special attention should be paid to the quality of titles, abstracts and key words, and to structure, annotation, indexing and, whenever possible, open formats.
In an uncertain legal environment of TDM like in France, it would be appropriate and helpful to publish Master's theses under a liberal open license (CC-BY or CC-BY-SA) to foster their reuse and exploitation; however, this remains an exception up to now.

Preservation
Master's theses are part of what is called gray literature, not easy to identify, not always well controlled by catalogs and databases and often at risk of disappearance (Schöpfel and Farace, 2010). In France, Master's theses must be conserved for at least five years, as a proof of graduation. MemorySID adopts a three-level strategy to guarantee long-term preservation at least for the best Master's theses:
• Remaining print copies are preserved by the faculty administration during the obligatory five-year period, in the facilities of the SID Department.
• All digital copies are preserved in the university DMS and maintained by the IT Department for at least 5-10 years, the usual lifespan of documents on the intranet.
• As for the selection of the best and non-confidential Master's theses, they are preserved in the DUMAS open repository. DUMAS is part of the French national open scholarship infrastructure HAL, which is backed up by the French National Computing Center for Higher Education 9, specialized in long-term preservation of digital library resources, research data etc.
Thus, the DUMAS solution includes outsourcing of the long-term availability of Master's theses, in a dark archive in the public academic cloud which can be used as a failsafe during disaster recovery.
9 CINES https://www.cines.fr/en/

Conclusion
The environment of open scholarship provides excellent opportunities for a new approach to the management of Master's theses. Based on the Lille project and other studies, the following matrix sums up some key factors for the contribution of Master's theses to open scholarship:
• Strength: their diversity, richness and representativeness make Master's theses interesting for reading, content mining, citation analysis, etc., which is an important feature of the emerging landscape of open science. Also, they contribute to the visibility and promotion of the institution and the Master program.
• Weakness: the main problem is related to their unequal quality. Sensitive content (personal data, corporate information etc.) is another problem.
• Opportunity: the European and French open science policy creates a favorable environment for the dissemination of Master's theses on open repositories; this is reinforced by the institutions' interest in promoting and valorizing their Master programs.
• Threat: similar to PhD dissertations (Schöpfel and Prost, 2013), the major threat is access restriction because of intellectual property, i.e. no authorization by students. Restrictive policies of institutional repositories ("scientists only") and lack of interest in undergraduate work are other barriers.
A collection of Master's theses in the public academic cloud is a unique showcase for the academic excellence of Master programs, of teaching and research. The download statistics show that its impact and visibility are real and significant. It can also produce a rich and representative corpus for TDM and scientometrics. However, this potential is conditioned by a couple of key variables, such as a selective approach, long-term preservation and free dissemination on open repositories. It is only when these conditions are met that Master's theses fully contribute to open scholarship.