ChinaXiv.org, Chinese Academy of Sciences Scientific Paper Preprint Platform

9 results in total

Selected filter: Max Planck Computing and Data Facility, Gießenbachstraße 2, 85748 Garching, Germany

1. ChinaXiv:202211.00433
Evaluation of Application Possibilities for Packaging Technologies in Canonical Workflows

Category: Computer Science >> Integration Theory of Computer Science; Submitted: 2022-11-28; Partner journal: Data Intelligence (English)

Thomas Jejkal, Sabrine Chelbi, Andreas Pfeil, Peter Wittenburg

Abstract: In the Canonical Workflow Framework for Research (CWFR), packages are relevant in two different directions. In data science, workflows are generally executed on a set of files that have been aggregated for a specific purpose, such as training a deep learning model. We call this type of package a data collection; its aggregation and metadata description are motivated by research interests. The other type of package relevant for CWFR represents workflows in a self-describing and self-contained way for later execution. In this paper, we review different packaging technologies and investigate their usability in the context of CWFR. For this purpose, we draw on an exemplary use case and show how packaging technologies can support its realization. We conclude that packaging technologies of different flavors help to provide inputs and outputs for workflow steps in a machine-readable way, and to represent a workflow and all its artifacts in a self-describing and self-contained way.

Views: 1510; Downloads: 228
2. ChinaXiv:202211.00403
Not Ready for Convergence in Data Infrastructures

Category: Computer Science >> Integration Theory of Computer Science; Submitted: 2022-11-27; Partner journal: Data Intelligence (English)

Keith Jeffery, Peter Wittenburg, Larry Lannom, George Strawn, Claudia Biniossek, Dirk Betz, Christophe Blanchi

Abstract: Much research is dependent on Information and Communication Technologies (ICT). Researchers in different research domains have set up their own ICT systems (data labs) to support their research, from data collection (observation, experiment, simulation) through analysis (analytics, visualisation) to publication. However, too frequently the Digital Objects (DOs) upon which the research results are based are not curated and thus are available neither for reproduction of the research nor for utilization in other (e.g., multidisciplinary) research. The key to curation is rich metadata recording not only a description of the DO and the conditions of its use but also its provenance, i.e., the trail of actions performed on the DO along the research workflow. There are increasing real-world requirements for multidisciplinary research. With DOs held in domain-specific ICT systems (silos), commonly with inadequate metadata, such research is hindered. Despite wide agreement on principles for achieving FAIR (findable, accessible, interoperable, and reusable) utilization of research data, current practices fall short. FAIR DOs offer a way forward. The paradoxes, barriers and possible solutions are examined. The key is persuading the researcher to adopt best practices, which implies decreasing the cost (easy-to-use autonomic tools) and increasing the benefit (incentives such as acknowledgement and citation) while maintaining researcher independence and flexibility.

Views: 399; Downloads: 114
3. ChinaXiv:202211.00407
Open Science and Data Science

Category: Computer Science >> Integration Theory of Computer Science; Submitted: 2022-11-27; Partner journal: Data Intelligence (English)

Peter Wittenburg

Abstract: Data Science (DS) as defined by Jim Gray is an emerging paradigm in all research areas that helps to find non-obvious patterns of relevance in large distributed data collections. Open Science by Design (OSD), i.e., making artefacts such as data, metadata, models, and algorithms available and re-usable to peers and beyond as early as possible, is a prerequisite for a flourishing DS landscape. However, a few major aspects can be identified that hamper a fast transition: (1) The classical Open Science by Publication (OSP) is no longer sufficient, since it serves different functions, leads to unacceptable delays and is associated with high curation costs. Changing data lab practices towards OSD requires more fundamental changes than OSP. (2) The classical publication-oriented models for metrics, mainly informed by citations, will no longer work, since the roles of contributors are more difficult to assess and will often change, i.e., other ways of assigning incentives and recognition need to be found. (3) The huge investments in developing DS skills and capacities by some global companies and strong countries are leading to imbalances and to fears among different stakeholders, hampering the acceptance of Open Science (OS). (4) Finally, OSD will depend on the availability of a global infrastructure fostering an integrated and interoperable data domain ("one data-domain", as George Strawn calls it), which is still not visible due to differences about the technological key pillars. OS is therefore a necessity for DS, but it will take much more time to implement than we may have expected.

Views: 429; Downloads: 133
4. ChinaXiv:202211.00413
Comments to Jean-Claude Burgelman’s article Politics and Open Science: How the European Open Science Cloud Became Reality (the Untold Story)

Category: Computer Science >> Integration Theory of Computer Science; Submitted: 2022-11-27; Partner journal: Data Intelligence (English)

Peter Wittenburg

Abstract: Coming from an institute that was devoted from its beginning to analysing data streams of different sorts in order to understand how the human brain processes language and how language supports cognition, I saw building efficient data infrastructures of different scope as a key to research excellence. While local infrastructures were sufficient at first, it became apparent in the 90s that local data would no longer satisfy all research needs. It was a logical step to first take responsibility for setting up the specific DOBES (Dokumentation bedrohter Sprachen) infrastructure focussing on languages of the world, then the community-wide CLARIN RI (European Research Infrastructure for Language Resources and Technology) and later the cross-disciplinary EUDAT data infrastructure [1,2,3]. Realising the huge heterogeneity in data practices, it was also a logical step to start the Research Data Alliance (RDA) [4] as a truly bottom-up initiative to discuss harmonisation across disciplines and across borders. Against this background, determined by always looking for concrete results, the European Open Science Cloud (EOSC) process had Kafkaesque characteristics to me, despite the many interactions I had with EOSC key persons and other colleagues involved. Talking at a level where the technological core remained widely absent was difficult for me. Thanks to Jean-Claude Burgelman's (JCB) excellent paper I finally understood that excluding the discussions about the core was the only chance to get EOSC accepted. Of course, the discussions about the EOSC core would have to happen at a certain moment, and obviously eternal types of disputes would determine these discussions. Therefore, the fallback on the analogy with Greek tragedies was an excellent idea by JCB.

Views: 232; Downloads: 87
5. ChinaXiv:202211.00339
From Persistent Identifiers to Digital Objects to Make Data Science More Efficient

Category: Computer Science >> Integration Theory of Computer Science; Submitted: 2022-11-25; Partner journal: Data Intelligence (English)

Peter Wittenburg

Abstract: Data-intensive science is a reality in large scientific organizations such as the Max Planck Society, but due to the inefficiency of our data practices when it comes to integrating data from different sources, many projects cannot be carried out and many researchers are excluded. Since, according to surveys, about 80% of the time in data-intensive projects is wasted, we must conclude that we are not fit for the challenges that will come with the billions of smart devices producing continuous streams of data: our methods do not scale. Therefore experts worldwide are looking for strategies and methods that have potential for the future. The first steps have been made, since there is now wide agreement, from the Research Data Alliance to the FAIR principles, that data should be associated with persistent identifiers (PIDs) and metadata (MD). In fact, after 20 years of experience we can claim that there are trustworthy PID systems already in broad use. It is argued, however, that assigning PIDs is just the first step. If we agree to assign PIDs and also use the PID to store important relationships, such as pointers to locations where the bit sequences or different metadata can be accessed, we are close to defining Digital Objects (DOs), which could indeed point to a solution for some of the basic problems in data management and processing. In addition to standardizing the way we assign PIDs, metadata and other state information, we could also define a Digital Object Access Protocol as a universal exchange protocol for DOs stored in repositories using different data models and data organizations. We could also associate with each DO a type and a set of operations allowed to work on its content, which would pave the way to automatic processing, identified as the major step towards scalability in data science and data industry. A globally connected group of experts is now working on establishing testbeds for a DO-based data infrastructure.

Views: 548; Downloads: 142
6. ChinaXiv:202211.00214
FAIR Practices in Europe

Category: Computer Science >> Integration Theory of Computer Science; Submitted: 2022-11-18; Partner journal: Data Intelligence (English)

Peter Wittenburg, Michael Lautenschlager, Hannes Thiemann, Carsten Baldauf, Paul Trilsbeek

Abstract: Institutions driving fundamental research at the cutting edge, such as those of the Max Planck Society (MPS), have taken steps to optimize data management and stewardship to be able to address new scientific questions. In this paper we select three MPS institutes, from the humanities, environmental sciences and natural sciences, as examples of the efforts to integrate large amounts of data from collaborators worldwide into a data space that is ready to be exploited for new insights based on data-intensive science methods. For this integration, the typical challenges of fragmentation, bad quality and social differences had to be overcome. In all three cases, the core pillars were well-managed repositories driven by scientific needs and harmonization principles agreed upon in the community. It is not surprising that these principles are very much aligned with what have now become the FAIR principles. The FAIR principles confirm the correctness of earlier decisions, and their clear formulation has identified the gaps which the projects need to address.

Views: 380; Downloads: 124
7. ChinaXiv:202211.00218
State of FAIRness in ESFRI Projects

Category: Computer Science >> Integration Theory of Computer Science; Submitted: 2022-11-18; Partner journal: Data Intelligence (English)

Peter Wittenburg, Franciska de Jong, Dieter van Uytvanck, Massimo Cocco, Keith Jeffery, Michael Lautenschlager, Hannes Thiemann, Margareta Hellstroem, Ari Asmi, Petr Holub

Abstract: Since 2009, initiatives that were selected for the roadmap of the European Strategy Forum on Research Infrastructures have been working to build research infrastructures for a wide range of research disciplines. An important result of the strategic discussions was that distributed infrastructure scenarios were now seen as complex research facilities in addition to traditional centralised infrastructures such as CERN. In this paper we look at five typical examples of such distributed infrastructures, where many researchers working in different centres contribute data, tools/services and knowledge, and where the major task of the research infrastructure initiative is to create a virtually integrated suite of resources allowing researchers to carry out state-of-the-art research. Careful analysis shows that most of these research infrastructures worked on the Findability, Accessibility, Interoperability and Reusability dimensions before the term FAIR was actually coined. The definition of the FAIR principles and their wide acceptance can be seen as a confirmation of what these initiatives were doing, and it gives new impetus to close the gaps that still exist. These initiatives also seem ready to take up the next steps, which will emerge from the definition of FAIR maturity indicators. Experts from these infrastructures should bring their ten years of experience into this definition process.

Views: 531; Downloads: 133
8. ChinaXiv:202211.00165
FAIR Principles: Interpretations and Implementation Considerations

Category: Computer Science >> Integration Theory of Computer Science; Submitted: 2022-11-16; Partner journal: Data Intelligence (English)

Annika Jacobsen, Ricardo De Miranda Azevedo, Nick Juty, Dominique Batista, Simon Coles, Ronald Cornet, Melanie Courtot, Merce Crosas, Michel Dumontier, Chris T. Evelo, Carole Goble, Giancarlo Guizzardi, Karsten Kryger Hansen, Ali Hasnain, Kristina Hettne, Jaap Heringa, Rob W. W. Hooft, Melanie Imming, Keith G. Jeffery, Rajaram Kaliyaperumal

Abstract: The FAIR principles have been widely cited, endorsed and adopted by a broad range of stakeholders since their publication in 2016. By intention, the 15 FAIR guiding principles do not dictate specific technological implementations, but provide guidance for improving Findability, Accessibility, Interoperability and Reusability of digital resources. This has likely contributed to the broad adoption of the FAIR principles, because individual stakeholder communities can implement their own FAIR solutions. However, it has also resulted in inconsistent interpretations that carry the risk of leading to incompatible implementations. Thus, while the FAIR principles are formulated on a high level and may be interpreted and implemented in different ways, for true interoperability we need to support convergence in implementation choices that are widely accessible and (re)-usable. We introduce the concept of FAIR implementation considerations to assist accelerated global participation and convergence towards accessible, robust, widespread and consistent FAIR implementations. Any self-identified stakeholder community may either choose to reuse solutions from existing implementations or, when they spot a gap, accept the challenge to create the needed solution, which, ideally, can be used again by other communities in the future. Here, we provide interpretations and implementation considerations (choices and challenges) for each FAIR principle.

Views: 720; Downloads: 191
9. ChinaXiv:202211.00178
FAIR Convergence Matrix: Optimizing the Reuse of Existing FAIR-Related Resources

Category: Computer Science >> Integration Theory of Computer Science; Submitted: 2022-11-16; Partner journal: Data Intelligence (English)

Hana Pergl Sustkova, Kristina Maria Hettne, Peter Wittenburg, Annika Jacobsen, Tobias Kuhn, Robert Pergl, Jan Slifka, Peter McQuilton, Barbara Magagna, Susanna-Assunta Sansone, Markus Stocker, Melanie Imming, Larry Lannom, Mark Musen, Erik Schultes

Abstract: The FAIR principles articulate the behaviors expected from digital artifacts that are Findable, Accessible, Interoperable and Reusable by machines and by people. Although by now widely accepted, the FAIR Principles by design do not explicitly consider actual implementation choices enabling FAIR behaviors. As different communities have their own, often well-established implementation preferences and priorities for data reuse, coordinating a broadly accepted, widely used FAIR implementation approach remains a global challenge. In an effort to accelerate broad community convergence on FAIR implementation options, the GO FAIR community has launched the development of the FAIR Convergence Matrix. The Matrix is a platform that compiles, for any community of practice, an inventory of their self-declared FAIR implementation choices and challenges. The Convergence Matrix is itself a FAIR resource, openly available, and encourages voluntary participation by any self-identified community of practice (not only the GO FAIR Implementation Networks). Based on patterns of use and reuse of existing resources, the Convergence Matrix supports the transparent derivation of strategies that optimally coordinate convergence on standards and technologies in the emerging Internet of FAIR Data and Services.

Views: 548; Downloads: 152