Open access scientific resources: digital databases

B A S I C   L E V E L

Open access (OA) is a set of principles and a range of practices through which research outputs are distributed online, free of cost or other access barriers.


Open access scientific resources

An introduction to “Open Access” resources

Open access (OA) is a set of principles and a range of practices through which research outputs are distributed online, free of cost or other access barriers. With open access strictly defined (according to the 2001 definition), or libre open access, barriers to copying or reuse are also reduced or removed by applying an open license for copyright.

The main focus of the open access movement is “peer reviewed research literature”. Historically, this has centered mainly on print-based academic journals. Whereas conventional (non-open access) journals cover publishing costs through access tolls, such as subscriptions, site licenses or pay-per-view charges, open-access journals are characterized by funding models which do not require the reader to pay to read the journal’s contents. Open access can be applied to all forms of published research output, including peer-reviewed and non-peer-reviewed academic journal articles, conference papers, theses, book chapters, monographs and images.

However, when it comes to defining “free” access, one has to distinguish “gratis” from “libre”.

In order to reflect real-world differences in the degree of open access, the distinction between gratis open access and libre open access was added in 2006 by Peter Suber and Stevan Harnad, two of the co-drafters of the original Budapest Open Access Initiative (BOAI) definition of open access publishing. Gratis open access refers to online access free of charge and libre open access refers to online access free of charge plus some additional re-use rights. Libre open access is equivalent to the definition of open access in the BOAI, the Bethesda Statement on Open Access Publishing and the Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities. The re-use rights of libre OA are often specified by various specific Creative Commons licenses; these almost all require attribution of authorship to the original authors.

The document released in February 2002 by the BOAI contains the following very widely used definition:

  • By “open access” to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.

In light of the information above, the use of open access scientific resources must follow the commonly adopted rules. Publishers of open access scientific resources must also clearly state whether they are libre or gratis, and the resources must be attributed to the original authors.

An introduction to data (basic level)

What are “data”

According to the Merriam-Webster dictionary, there are three different definitions of data:

  1. Factual information, such as measurements or statistics, used as a basis for reasoning, discussion, or calculation
  2. Information in digital form that can be transmitted or processed
  3. Information output by a sensing device or organ that includes both useful and irrelevant or redundant information and must be processed to be meaningful

In this document, we will cover aspects of all three definitions.

A brief history of data

Since humans began to communicate, they have experienced the need to retain information for the long term. Keeping information was necessary for our ancestors to ensure their survival. Transmitting information across generations allowed them to keep track of potential dangers, but also to have an inventory of the best places to collect food, the best spots for fishing, the most interesting animals to hunt and where to find the best shelters. All of this information was transmitted orally. With the evolution of knowledge and the invention of writing, they began to store information on indelible media.

Without going into detail about the evolution of the representation of information, some significant examples will be provided that helped structure thought and led to the development of the computer tools we use daily.

Data prior to the invention of computers

As human societies emerged, collective motivations for the development of writing were driven by pragmatic exigencies. These include organizing and governing societies through the formation of legal systems, contracts, deeds of ownership, taxation, trade agreements, treaties, census records, keeping history, maintaining culture, keeping track of scientific discoveries, codifying knowledge through curricula and lists of texts that are artistically exceptional or deemed to contain foundational knowledge, and many other needs.

Figure 1: Cuneiform writing

For example, around the 4th millennium BC, the complexity of trade and administration in Mesopotamia outgrew human memory, and writing became a more dependable method of recording and presenting transactions in a permanent form.

Cuneiform was one of the earliest systems of writing, invented by Sumerians in ancient Mesopotamia. It is distinguished by its wedge-shaped marks on clay tablets, made by means of a blunt reed for a stylus, as demonstrated in Fig. 1.

Over time, with the development of knowledge, the multiplication of information and the limitations of human memory, writing down and keeping records of huge quantities of information became essential. However, despite keeping records of almost every kind of information or data on various media, it became more and more complex to retrieve them in a simple manner. One had to read tens of reports and books to be able to synthesize knowledge on a subject.

Data in the modern age

Today, the quantity of data produced every year and kept digitally, e.g., to-do lists, recipes, reminders, logbooks, maps, photos, e-mails, scientific data, political reports, videos, etc., is growing so rapidly that it creates the need to structure the way we retrieve these phenomenal quantities.

Computers gained popularity and became cost-effective for individuals and private companies in the early 1980s. However, the 1960s can be considered the start of a new era in the field of databases. The introduction of the term “database” coincided with the availability of direct-access storage (DAS) from the mid-1960s onward. This new technology contrasted with the earlier punch-card and tape-based systems, allowing shared interactive use rather than daily batch processing. Two main data models were developed: the network model, “CODASYL” (Conference on Data Systems Languages), and the hierarchical model, “IMS” (Information Management System).

The first generation of database systems was “navigational”, in contrast to the sequential access imposed by the previous technologies used to store data, i.e. tapes and punch cards. Applications typically accessed data by following pointers from one record to another. Storage details depended on the type of data to be stored.

Adding an extra field to a database required rewriting the underlying access/modification scheme. Emphasis was on records to be processed, not the overall structure of the system. A user would need to know the physical structure of the database in order to query for information. One database that proved to be a commercial success was the “SABRE” system that was used by IBM to help American Airlines manage its reservations data. This system is still utilized by the major travel services for their reservation systems.

In modern information technology, users have often confused databases with internet search engines accessed through browsers. A database usually contains structured data, in contrast to the World Wide Web (www), which usually contains unstructured data. Even if retrieving information from databases and from the web seems equally seamless and similar, the content and the way queries are handled are completely different. Structured and unstructured data will be explained later in this document.

Understanding the basic vocabulary

Terminology

Like any other science, computer science has its own language. In order to fully comprehend the information that will be provided in this document, it is essential to become familiar with the vocabulary related to this topic.

Moreover, it will ease communication with a DBA (Database Administrator). When biochemists have to express their needs in terms of structuring or managing data in a database, they will be tempted to use their own technical language. The DBA then has to understand the request and translate it into a computer language that the database system can process.

What are “Data” in the computer age

Figure 2: A bit can be 0 or 1

As mentioned in section 2.1, according to the domain that is being referred to, data might have different meanings. In the case of computing and databases, data is defined as any sequence of one or more symbols. Data requires interpretation to become information. In information technology the “bit” is the smallest quantity of data. A bit is binary. Binary numbers are a representation of numbers using only two digits, 0 and 1 (Fig. 2). It is a base-2 numeral system, i.e.:

  • 0 0 0 1 = numerical value 2⁰
  • 0 0 1 0 = numerical value 2¹
  • 0 1 0 0 = numerical value 2²
  • 1 0 0 0 = numerical value 2³

Table 1

A sequence of “bits” constitutes a “byte”. Bytes are made of a multiple of 4 bits (a group of 4 bits is called a nibble), as in the example above. Today, the byte is a unit of digital information that most commonly consists of eight bits. Historically, the byte was the number of bits used to encode a single character of text in a computer. A byte of eight bits can represent 256 distinct values, from 0 to 255. The byte is also the unit of computer information or data-storage capacity used to measure the quantity of data (Table 1).
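
To make these numbers tangible, here is a minimal sketch in Python (an illustration only, not tied to any database system) that converts the binary patterns above to decimal values and shows the range of an eight-bit byte:

```python
# Binary strings interpreted as base-2 numbers: each position is a power of two.
for bits in ["0001", "0010", "0100", "1000"]:
    print(bits, "=", int(bits, 2))           # prints 1, 2, 4, 8

# An eight-bit byte can hold 2**8 = 256 distinct values, from 0 to 255.
print(2 ** 8)                                 # 256 possible values
print(int("11111111", 2))                     # 255, the largest value one byte can hold
```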

Table 2: ASCII Table

An example of usage is the ASCII (American Standard Code for Information Interchange) table, commonly used to encode alphabetical characters (Table 2). The first 32 characters are called control characters. Initially, they were not designed to represent printable information, but to control devices that use ASCII code, such as printers, or to provide meta-information about data streams, e.g., those stored on magnetic tape.
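
This mapping between numbers and characters can be explored directly with Python's built-in ord and chr functions; a minimal illustration:

```python
# Printable ASCII characters start at code 32 (space); 65 is 'A', 97 is 'a', 48 is '0'.
print(ord("A"), ord("a"), ord("0"))   # 65 97 48
print(chr(66), chr(98))               # B b

# Codes 0-31 are control characters, originally meant to drive devices such as
# printers: 9 is TAB, 10 is LINE FEED, 13 is CARRIAGE RETURN.
for code in (9, 10, 13):
    print(code, repr(chr(code)))
```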

What is “Metadata”

Figure 3: Picture taken in Greece

Metadata, or, put simply, meta-information, refers to data about the data. Simply having data is not enough to put them online: data are not usable until they are described in a manner that both humans and computers can process.

Metadata may be implied, specified or given. It includes data relating to physical events or processes, and will also have a temporal component. In almost all cases this temporal component is implied. The notion may be slightly tricky to understand; however, the following example provides a clearer explanation of the term.

Figure 4: Metadata of the picture

Imagine that you are traveling with your favorite smartphone on some paradisiacal island. You start taking pictures (Fig. 3) to keep nice records of your trip. A week later, your trip reaches its end and you have to go back home.

Back home, you invite your best friends for a party and want to share with them the beauties you have seen during your trip. You start showing the pictures, but you cannot recall which day, at what time and where some of them were taken. This is where the metadata of the pictures can help. In a few words, it is the description of the data. In this example, the picture is the data and the description of the picture is the metadata (Fig. 4).
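
Picture metadata of this kind can also be read programmatically. The sketch below uses the Pillow imaging library, assuming it is installed; the file name "beach.jpg" is a hypothetical placeholder, and the exact tags returned depend on the device that took the picture:

```python
# A minimal sketch, assuming the Pillow library is available (pip install Pillow)
# and that "beach.jpg" is a photo from a smartphone (hypothetical file name).
from PIL import Image
from PIL.ExifTags import TAGS

with Image.open("beach.jpg") as img:
    exif = img.getexif()                      # the metadata embedded in the picture
    for tag_id, value in exif.items():
        name = TAGS.get(tag_id, tag_id)       # translate numeric tag IDs into readable names
        print(f"{name}: {value}")             # e.g. DateTime, Make, Model, GPS information
```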

In biotechnology, one must understand that metadata are by far more important than data. It is very simple to understand why metadata are a crucial component directly related to data. Imagine an experiment that will lead to a specific result. This experiment, to be valid, must be documented. This documentation should include all the conditions under which the experiment was conducted: the kind of raw material used, its source, the conditions under which it was collected, the machines used to process the experiment, temperature, date, time, etc. For the result of this experiment to be comparable to the results of similar experiments, all the conditions must be similar. Raw data without metadata are useless.

The biggest challenge in Biotech, and any other science, is to standardize metadata. In most of the Biotech databases, this is not respected. One must absolutely be conscious of this phenomenon and thoroughly respect the standards.

What is a “Database”

Figure 5: Database partial structure

In general, a database is defined as a collection of data items, such as phone books, price lists, inventory lists, customers’ addresses, etc. In technical terms, however, a database is referred to as “a self-describing collection of integrated records”. It implies computer technology, complemented by a specific computer language, such as SQL (Structured Query Language).

A database consists of multiple tables (Fig. 5) and of both data and metadata. Metadata is the data that describes the structure of the data within a database. If you know how your data is arranged, then you can retrieve it. Since the database contains a description of its own structure, it is referred to as self-describing. The database is integrated because it includes not only data items but also the relationships among them.

The database stores metadata in an area called the data dictionary, which describes the tables, columns, indexes, constraints and other items that make up the database.

Because a flat file system, e.g. a spreadsheet, has no metadata, applications written to work with flat files must contain the equivalent of the metadata as part of the application program.
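
To make the idea of a self-describing database concrete, here is a minimal sketch using Python's built-in sqlite3 module; the patient table and its columns are invented purely for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")          # a throw-away example database
con.execute("CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT, birth_date TEXT)")

# The database describes its own structure: SQLite keeps its data dictionary
# in the built-in sqlite_master table.
for row in con.execute("SELECT name, sql FROM sqlite_master"):
    print(row)

# Column-level metadata for one table: name, declared type, constraints.
for col in con.execute("PRAGMA table_info(patient)"):
    print(col)
```

The first query reads the data dictionary; the second lists the metadata of a single table.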

What are “Tables” in a database

A table is a collection of related data held in a table format composed of columns and rows within a database. It resembles a spreadsheet (Fig. 6).

What are “Columns” in a database

Figure 6: Table with rows and columns

A column is a set of data values, all of a single type, in a table. Columns define the data in a table. Most databases allow columns to contain complex data like images, whole documents or even video clips. Therefore, a column allowing data values of a single type does not necessarily mean it only has simple text values. Some databases go even further and allow the data to be stored as a file on the operating system, while the column data only contains a pointer or link to the actual file. This is done for the purpose of keeping the overall database size manageable – a smaller database size means less time taken for backups and less time required to search for data within the database.

In a table, each column is typically assigned a data type and other constraints, which determine the type of value that can be stored in that column. For example, one column might accept email addresses, and another might accept phone numbers with a constraint of 10 digits.

What is a “Record”

A record is a representation of a physical or conceptual object. Say, for example, that you want to keep track of the customers of a business. You assign a record for each customer. Each record has multiple attributes, such as name, address, and telephone number. Individual names, addresses and so on are the data.

What are “Indexes”

Figure 7: Example of index

Structured data are stored in the form of records in a database. Every record has a key field, which allows it to be identified uniquely, e.g. the ID of a patient. No other patient can have the same ID number, whereas another patient may well have the same first name and last name.

Indexing a database is a technique to efficiently retrieve records from the database files, based on some attributes on which the indexing has been performed. To make it simple, indexing in database systems is similar to what we usually see in books. At the beginning or the end of a book, an index may be found (which is different from a table of contents), which provides all the page numbers for a specific topic. For example, an Atlas may be divided into chapters containing maps, chapters containing data on population and chapters dedicated to countries’ production or agricultural data. If you are looking for a specific country and you would like to have an overview of all the data regarding this specific country, the index might be very helpful as it will show you the page related to that country in each chapter (Fig. 7).
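
The same idea can be sketched on a toy table with Python's sqlite3 module; the table, column and index names below are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE patient (id INTEGER PRIMARY KEY, last_name TEXT, first_name TEXT)")
con.executemany("INSERT INTO patient VALUES (?, ?, ?)",
                [(1, "Doe", "Jane"), (2, "Doe", "John")])

# An index on last_name lets the engine find matching records without scanning
# every row, just as a book index points straight to the relevant pages.
con.execute("CREATE INDEX idx_patient_last_name ON patient(last_name)")

# EXPLAIN QUERY PLAN shows that the query below now uses the index.
for row in con.execute("EXPLAIN QUERY PLAN SELECT * FROM patient WHERE last_name = 'Doe'"):
    print(row)
```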

What is an “Object”

In computer science, an object can be a variable, a data structure, a function, or a method and, as such, is a value in memory referenced by an identifier. In the relational model of database management, an object can be a table or column, or an association between data and a database entity, such as relating a person’s age to a specific person.

Structured data

Figure 8: Structured data

According to SNIA (Storage Networking Industry Association), structured data is defined as:

“Data that is organized and formatted in a known and fixed way.

The format and organization are customarily defined in a schema. The term structured data is usually taken to mean data generated and maintained by databases and business applications.”

Three conditions are needed to describe data as structured:

  • It must conform to a data model,
  • It must have a well-defined structure,
  • It must follow a consistent order and be easily accessible and usable by a person or a computer program.

Structured data is usually stored in well-defined schemas such as Databases. It is generally tabular with columns and rows that clearly define its attributes (Fig. 8).

SQL (Structured Query Language) is often used to manage structured data stored in databases.
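
As a small, hypothetical illustration of querying structured data, the sketch below stores tabular records with Python's sqlite3 module and retrieves a subset with a SQL query; the table and values are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sample (id INTEGER PRIMARY KEY, organism TEXT, gene TEXT, expression REAL)")
con.executemany(
    "INSERT INTO sample (organism, gene, expression) VALUES (?, ?, ?)",
    [("E. coli", "lacZ", 12.5), ("E. coli", "recA", 3.1), ("S. cerevisiae", "GAL4", 7.8)],
)

# Because every row follows the same schema, a declarative query is enough
# to pull out exactly the attributes and rows we need.
for row in con.execute("SELECT gene, expression FROM sample WHERE organism = 'E. coli'"):
    print(row)
```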

Unstructured data

Figure 9: Extract of a PDF file

Information that is not organized in a predefined model is called unstructured data or unstructured information. In computer science, files like text files, photos, video files, audio files and presentations are considered unstructured files. Typically, a PDF file contains unstructured data (Fig. 9).

It is estimated that 80 to 90% of the world’s total digital data is unstructured. Usual query algorithms are unable to simply and efficiently extract the required information from an unstructured file, such as the example in Fig. 9. If the same information were stored in a structured database, it could easily be retrieved with a query. However, today, unstructured data analytics tools powered by artificial intelligence (AI) are available, which were specifically created to access the insights contained in unstructured data (see 3.1.12 Analytics).

Big data

According to SNIA (Storage Networking Industry Association), big data is defined as:

“A characterization of datasets that are too large to be efficiently processed in their entirety by the most powerful standard computational platforms available.”

In other words, Big Data refers to huge quantities of structured or unstructured data that cannot be processed by usual software, such as traditional database query languages or any other kind of fetching engine.

Confusion exists concerning the current usage of the terms Big Data and Analytics. Big Data is the information, while Analytics is the way to extract the desired information from huge quantities of available information.

Analytics

In computer technology, Analytics is a method to extract value from big data.

In the field of healthcare, Big Data Analytics has led to many improvements by enabling personalized medicine and predictive analytics. As the volume of data increases dramatically, traditional databases and search engines will not be able to handle it and retrieve specific information. Patient data is generated by MRIs, X-rays, blood test machines, monitoring sensors and many other sources that are complex to process. Extensive information in healthcare is now in electronic form; it fits under the big data umbrella as most of it is unstructured and difficult to use.

Big data in health research is particularly promising in terms of exploratory biomedical research, as data-driven analysis can move forward more quickly than hypothesis-driven research. Subsequently, trends seen in data analysis can be tested in traditional, hypothesis-driven follow-up biological research and eventually clinical research.

Repository

A data repository or data warehouse is a centralized place to store and maintain data. A data repository can consist of one or more structured data files, such as databases or unstructured data files, which can be distributed over a network and preserved over the long-term.

Basic structure of a database

This section is dedicated to the overview of the main building blocks constitutive of a database.

Introduction

Since the invention of computers, the amount of data stored and managed electronically has increased drastically. It is estimated that the quantity of data will reach 175 zettabytes (1 ZB = 10²¹ bytes) by 2025, growing from a few petabytes (1 PB = 10¹⁵ bytes) in the year 2000. One common way of simplifying the lives of users and making the most of their resources is to store and retrieve data more efficiently. For example, while a flat file works just fine for storing your personal data, such as an address book or some recipes, it is not as suitable for storing a city phone directory or, more to the point, the genomic data used in the biotech field. In addition, if you want to store several species’ worth of genomic data, it is very difficult to search and retrieve them from a flat file. Databases offer a solution to this problem by making the storage, handling and retrieval of data much easier.

The software used to manage a database is called a database management system (DBMS). This specialized software acts as a go-between that helps end users access the database. Usually, users do not interact directly with a database, because this may result in its disorganization. Instead, they use a DBMS that reads data from or writes data to the database.

The growing complexity of large quantities of data led many companies to use data management tools based on the relational model, the classic RDBMS (Relational Database Management System). Nevertheless, major internet companies, such as Google, Yahoo and Amazon, as well as the popular social media platforms, each faced challenges in dealing with huge quantities of data in real time, something that conventional RDBMS solutions could not cope with. That explains the soaring popularity of the NoSQL database systems that sprang up alongside them.

NoSQL systems are distributed, non-relational databases designed for large-scale data storage and for massively-parallel, high-performance data processing across a large number of commodity servers. They arose out of a need for agility, performance and scale, and can support a wide set of use cases, including exploratory and predictive analytics in real-time. Built by top internet companies to keep pace with the data deluge, NoSQL databases scale horizontally and are designed to scale to hundreds of millions and even billions of users performing updates as well as reads.

Some of the common applications of NoSQL databases are social media, large scale e-mail providers and governmental healthcare systems.

Usually, a social application can scale from zero to millions of users in a few weeks and to better manage this growth, one needs a DB that can manage a massive number of users and data, but can also easily scale horizontally.

In this course, we will focus on DBMS and RDBMS only. These are the two kinds of databases commonly used in the biotech world to date.

Overview of a database architecture

Figure 10: Hardware architecture for a database

Databases can store all kinds of information, from numbers and text to e-mails, web content, phone records, biological and geographical data, etc. Databases are officially classified according to how they store this data. Relational databases store data in tables. Object-oriented databases store data in object classes and subclasses. We are going to focus on relational databases, as they are the most commonly used. However, most basic database topologies need backend servers to host the database management system, a storage system attached to the servers to store the structure and the data of the database and, of course, computers, laptops, desktops or terminals as an interface to allow users to access the database, its management system and its content. Also required are a network to exchange data between all the hardware components and a cloud attachment to allow remote users to access the database. Fig. 10 summarizes in a simple way the minimum required to run a database.

Another basic way to describe it is to show the three-level architecture of a database, a conceptual view of the layers necessary to make a database function properly. Fig. 11 demonstrates the three-level view architecture, called the ANSI-SPARC model. Although this model never became a formal standard, it introduced the idea of logical data independence, which has been widely adopted.

Information stored inside a relational database is contained within tables. These tables are composed of rows of data and each row contains fields or columns. In a well-designed database definition, called a schema, only similar data is stored within each table and duplication of columns is kept to a minimum. Developers can connect, or join, data from two tables to link different types of information to each other.

Figure 11: Three-level view architecture

Indexes can be created on fields in the database table to make it easier for the DBMS to retrieve data. Indexes are usually configured for frequently searched columns, like a person’s name or a date value. The drawback to using indexes is that they take up storage disk space and can slow things down, if too many of them are maintained, because every time a row in the database is updated, the index also has to be updated.

Most databases support Structured Query Language (SQL), a standard language for interacting with information contained in a database. SQL allows users and applications to interact with specific subsets of data from one or more tables using statements such as SELECT, INSERT, UPDATE and DELETE.
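
A compact, illustrative round trip through these four statements, again using Python's sqlite3 module with an invented table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

con.execute("INSERT INTO person (name, city) VALUES ('Ada', 'Geneva')")   # INSERT adds a row
con.execute("UPDATE person SET city = 'Basel' WHERE name = 'Ada'")        # UPDATE changes it
print(con.execute("SELECT id, name, city FROM person").fetchall())        # SELECT reads it back
con.execute("DELETE FROM person WHERE name = 'Ada'")                      # DELETE removes it
print(con.execute("SELECT COUNT(*) FROM person").fetchone())              # (0,) rows remain
```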

Relational databases also provide a layered approach to storage, allowing the definition of what database objects reside in specific data files and where those data files are placed within the operating system’s file structure. On top of managing the physical storage location of database objects, many database systems give some control over how the data is stored within the data files.

Common database terms

Certain database terms derive from ways that databases automate write actions. Database developers often automate writing to certain fields or other tables, such as writing a copy of the row being inserted – along with a timestamp or username – to a history or audit table. Most DBMS systems provide several ways to automatically manage database write actions.

Database triggers are the most common method of taking action on data as it is being written to the database. Triggers are usually associated with a particular table and configured to execute at a certain point during a specific write action, such as before or after an update, or after a row is inserted. Triggers can be used to format data, populate a column with data derived from existing information, or even write to another table based on the row being inserted or updated.
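
Below is a hedged sketch of the audit-table pattern just described, written as a SQLite trigger driven from Python; the table names are hypothetical and the exact trigger syntax varies slightly between database platforms:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE result       (id INTEGER PRIMARY KEY, sample TEXT, value REAL);
CREATE TABLE result_audit (result_id INTEGER, value REAL, changed_at TEXT);

-- After every insert, copy the new row and a timestamp into the audit table.
CREATE TRIGGER trg_result_insert AFTER INSERT ON result
BEGIN
    INSERT INTO result_audit VALUES (NEW.id, NEW.value, datetime('now'));
END;
""")

con.execute("INSERT INTO result (sample, value) VALUES ('S-001', 4.2)")
print(con.execute("SELECT * FROM result_audit").fetchall())   # the audit row written by the trigger
```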

A stored procedure is another way of interacting with a relational database. Stored procedures are more complex than triggers and are not tied to a single specific table. Typically created by a developer, they use a combination of SQL and a programming language, such as Java or a procedural SQL dialect (depending on the database platform). Stored procedures provide developers with a lot of control over how data is validated or massaged by an application. A stored procedure could be used to manage how a user logs in to an application. The procedure might first validate the username and password, then log the success or failure of the attempt to another table, along with other information, including the computer name and a timestamp. An alert could even be sent to the user informing them that their password has expired and must be changed.

Functions are simpler than stored procedures and can sometimes even be used from within SQL queries. Functions are usually used in a database to perform a set of actions that return one or more values, such as calculating the sum of a column for rows that match a certain condition. While these actions can be performed using SQL, building them into a function can make them easier to use in other code. Both functions and stored procedures can perform common actions in a streamlined and consistent manner, easing the workload for database administrators and developers.
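
As a minimal illustration of this, SQL's built-in SUM function collapses the matching rows into a single value; a sqlite3 sketch with invented data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE expense (category TEXT, amount REAL)")
con.executemany("INSERT INTO expense VALUES (?, ?)",
                [("reagents", 120.0), ("reagents", 80.0), ("travel", 300.0)])

# SUM is a built-in function that reduces all matching rows to one value.
total = con.execute("SELECT SUM(amount) FROM expense WHERE category = 'reagents'").fetchone()[0]
print(total)   # 200.0
```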

What is the difference between major DBMS systems?

The DBMS is generally driven by what the user applications need to support. That said, here is a brief comparison of the three most widely used platforms.

Microsoft SQL Server is widely used in enterprise applications and integrates easily with other Microsoft tools. Microsoft SQL Server 2019 Express is the latest version of Microsoft’s free offering and is often bundled with applications that use SQL Server.

MySQL has been a favorite for open source developers for the better part of two decades. Often used as a back end for open-source blog or content management systems, MySQL has a massive installed base across the globe. In 2008, MySQL AB was acquired by Sun Microsystems, which was itself acquired by Oracle Corp. in 2009, bringing MySQL under the umbrella of one of its largest competitors. However, the MySQL Community Edition remains free and is well supported by the community. MySQL is available for numerous operating systems, including Linux, UNIX, Mac OS X and Windows.

Oracle Database is considered by many to be the standard in enterprise-level database platforms and supports numerous enterprise applications. Oracle Database Express Edition is available free of charge and is also free to distribute (though it is not technically free software), making it another popular option for developers or hobbyists on Windows or Linux.

Now that you have learned the fundamental database terms and concepts, you are that much closer to speaking the same language as your organization’s database developers.

Databases in the scientific world

This part deals with the basics of databases used in the scientific world.

Introduction to existing databases dedicated to science

This section is dedicated to the overview of the most common open access databases used in science.

Figure 12: Example of names of Biotech databases

Continuous developments in the fields of biotechnology and information technology have led to the exponential growth of data. Studies conducted by researchers at the European Bioinformatics Institute (EMBL-EBI) have shown that the amount of information is approximately doubling every year. These extensive amounts of data are stored, organized and constantly updated in scientific databases, where they are readily available for scientists, including biologists and bio-informaticians, to use for research purposes. The information available in biological databases is obtained from a range of scientific fields, including metabolomics, microarray gene expression and proteomics. Apart from storing, organizing and sharing huge volumes of data, the main goal of biological databases is to offer web application programming interfaces (APIs) for computers to exchange and integrate data from many different database resources via an automated method.

Biological databases can be defined as data collections structured in such a way that their contents are easy to explore, handle and update. Examples of such databases are presented in Fig. 12. In 1972, the first protein structure database, known as the Protein Data Bank (PDB), was created. This database originally contained only 10 entries and has since expanded to more than 10,000 entries, signifying the rapid growth of biological data. A biological database may contain several types of data, including protein sequences, textual descriptions, attributes and tabular data. Generally, biological databases can be divided into primary, secondary and composite databases. Primary databases include data about the sequence or structure alone, whereas secondary databases include data derived from primary databases. Data such as the conserved sequences and active-site residues of protein families can be found in secondary databases. Furthermore, entries of the PDB, which is a primary database, can be found in secondary structure databases, stored in an organized way.

Broadly speaking, biological databases can be categorized into sequence, structure, and pathway databases:

  • Sequence databases: The most commonly used biological databases. These include protein and nucleotide sequence databases, which contain wet lab results and are the main source for experimental results. GenBank and EMBL are examples of sequence databases.
  • Structure databases: These databases contain information regarding protein structure and molecular interactions. PDB is an example of a structure database.
  • Pathway databases: These databases are based on data derived from the comparative study of metabolic pathways. The Kyoto Encyclopedia of Genes and Genomes (KEGG) and Biocyc are two indicative pathway databases.

A typical search in a nucleotide sequence database may, for example, return the scientific name of the source organism from which the sequence was isolated, a contact name, the input sequence with details of the molecule type and, frequently, literature citations related to the sequence.
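
As a hedged sketch of such a search performed programmatically, the example below uses Biopython's Entrez interface to NCBI, assuming the biopython package is installed and network access is available; the accession number and e-mail address are placeholders chosen for illustration:

```python
# A minimal sketch, assuming Biopython is installed (pip install biopython).
from Bio import Entrez, SeqIO

Entrez.email = "your.name@example.org"         # NCBI asks users to identify themselves
handle = Entrez.efetch(db="nucleotide", id="NM_000546", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

print(record.id)                                # accession of the entry
print(record.annotations.get("organism"))       # scientific name of the source organism
print(record.description)                       # textual description of the sequence
```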

Certain tools have been developed to assist scientists with data processing and retrieval from biological databases. These tools, which are termed bioinformatics tools, are software programs created for the extraction of meaningful data from the vast number of biological databases and for conducting sequence or structural analysis. Bioinformatics tools are used to obtain data from genomic sequence databases and for the visualization, analysis and retrieval of data from proteomic databases. These tools are largely divided into:

  • Homology and similarity tools: These tools are used to detect similarities between a query sequence of unknown structure and function and database sequences whose function and structure are already known.
  • Protein function analysis tools: Programs applied for the comparison of a protein sequence to secondary (or derived) protein databases, which permits the estimation of the biochemical function of a query protein.
  • Structural analysis tools: These tools allow the comparison of structures with the known structure databases and the establishment of the 2D/3D structure of a protein.
  • Sequence analysis tools: Programs used for the additional, more comprehensive assessment of a query sequence, involving evolutionary analysis and identification of mutations.

Biological databases may also be categorized, based on the scope of data coverage, into:

  • Comprehensive databases: These databases comprise various types of data from a number of species. Examples of comprehensive databases are GenBank and EMBL.
  • Specialized databases: These databases include particular types of data or data from particular organisms. An example of specialized databases is WormBase, which contains information on nematode biology and genomics.

In relation to the level of biocuration, which is defined as the activity of organizing, representing and making biological information readily available to both humans and computers, biological databases are classified as primary and secondary or derivative databases. Primary databases consist of raw data as archival repository, while secondary or derivative databases consist of curated information as added value. In regard to the method employed for curating the data, biological databases may be further classified as expert-curated databases or community-curated databases, which are curated in a co-operative way by numerous researchers.

Additional categorization of biological databases can also be made based on data type. The data types that accordingly classify databases include DNA, RNA, protein, expression, pathway, disease, nomenclature, literature and standard and ontology. Some of the most important and widely used biological databases are the following: GenBank, the UCSC Genome Browser and Ensembl, which are sequence databases/portals; WormBase and The Arabidopsis Information Resource (TAIR), which are model organism databases; and the PDB, Online Mendelian Inheritance in Man (OMIM), MetaCyc and KEGG, which are characterized as non-sequence-centric databases.

Data manipulation is an essential part of the experimental process of all studies, regardless of their scale. The online availability of biological data combined with the decreasing costs of automated genome sequencers have made it possible for small biology laboratories to become big-data generators. Even if a laboratory is not equipped with such instruments, it can still become a big-data user by gaining access to public repositories containing biological data, such as the US National Center for Biotechnology Information in Bethesda. A large part of the construction in big-data biology is virtual, based on cloud computing, in which data and software are located in massive, off-site centers that can be accessed on demand. Therefore, it is not necessary for users to purchase their own hardware. The cloud computing system allows potential users to create virtual spaces for data, software and results that are freely accessible by everyone, or to keep spaces locked up behind a firewall permitting access to a chosen group of collaborators.

The use of biological databases can be advantageous in several research areas. For example, databases may aid experimental design by allowing the automatic analysis and easy processing of experimental data and making the examination of experimental results simple and quick. Drug discovery is another area that may be simplified by using databases. In this specific area, databases can be scanned in order to find new candidates for drugs by training a classifier on a dataset where functioning and non-functioning drugs have been identified. Moreover, machine learning techniques may be applied to design virtual assays that are able to identify promising new drugs, which can subsequently be analyzed in a laboratory setting (REF. 4). Most importantly, new scientific experiments can be carried out and new results generated by analyzing existing data sets.

Without the existence of databases, sharing and integration of large quantities of data would be virtually impossible. Although many life scientists have advanced computational skills, a large percentage are not familiar with developing or adapting the relevant software. Nonetheless, the involvement of life scientists in this process is crucial, since they can provide feedback to computer science specialists focusing on different needs and approaches to science. The ability to have access to the actual data sets originally used in a specific study provides researchers with the opportunity to reproduce and expand on such study. This is why it is important for data to become freely available to scientists at any time without restrictions, a notion supported by Open Science and numerous related initiatives. One of these initiatives is known as ELIXIR, a project designed to help scientists across Europe safeguard and share their data and to reinforce current resources, including databases and computing facilities, in individual countries.

Although the creation of biological databases has brought about many benefits, such as the promotion of scientific quality production enabled by networking, they still require improvement in terms of knowledge optimization. It is crucial to manage transdisciplinary knowledge in a way that leads to an increase in its quality and quantity. Data heterogeneity is another common issue faced in biological data integration. In the field of biology, several different methods exist for the representation of similar data. This complicates data integration and processing, which, in turn, makes it harder to acquire unified views of such data. An example of this problem is the use of various alternative names when referring to genes, despite the existence of full guidelines issued in 1979 proposing the adoption of a gene nomenclature standard, which leads to difficulties in data sharing. The implementation of standards enables the re-use of data; their absence, however, causes significant loss of productivity and contributes to a decrease in the data accessible to researchers. Therefore, it is imperative to find a solution to this matter in order to eliminate the challenges faced by scientists when using biological databases to conduct their research.

Final thoughts

Dealing with data demands strict discipline in order to maintain long-term access to the stored information. Technology evolves, which means that the hardware and software used today will not be the standard of tomorrow. To be able to read data written today, we will therefore have to carry out two different kinds of migrations: a logical migration and a technological migration. Logical migration concerns the format in which the data is stored. Technological migration concerns the kind of hardware used. As an example, if you try to open a Word file written in 1993 with Word version 6 using the latest version, Word 2019, it will not work. This example shows a lack of logical compatibility. To avoid this issue and preserve compatibility, the file should have been migrated over time so that it remains readable with the latest versions of the software.

The same thing applies to hardware, i.e. servers, storage, networks, etc. Another example is the kind of server and operating system used to run a database. If you decide to change your hardware and migrate from, let’s say, Windows to UNIX, a different kind of hardware will be needed to run UNIX and a different version of the database to run on it. In this example, Windows runs on Intel-based (and Intel-compatible) platforms and the UNIX system runs on SPARC-based platforms, which means that you will have to migrate to a UNIX/SPARC-compatible version of the database.

Keeping in mind this constant evolution of hardware, operating systems, software and formats, performing the appropriate logical and technological migrations on time could save you a lot of time and trouble.

Last but not least, it is important to keep backing up your data. Once every three to six months, perform a restore test to see if you are capable of retrieving your backups. This is crucial for two reasons:

  • It will keep you up to date on how to restore your data

  • It is the best testing method to see if your data was properly backed up


References

  • Baxevanis AD, Bateman A. 2015. The importance of biological databases in biological discovery. Curr Protoc Bioinformatics., 50(1):1.1.1-1.1.8.
  • Benson DA, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. 2014. GenBank. Nucleic Acids Res., 42:D32–D37.
  • Brooksbank C, Bergman MT, Apweiler R, Birney E, Thornton J. 2014. The European Bioinformatics Institute’s data resources 2014. Nucleic Acids Res., 42:D18–D25.
  • Caspi R, Billington R, Ferrer L, Foerster H, Fulcher CA, Keseler IM, et al. 2016. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res., 44(D1):D471-80.
  • Figueiredo MSN, Pereira AM. 2017. Managing knowledge – the importance of databases in the scientific production. Procedia Manuf., 12:166–73.
  • Harris TW, Baran J, Bieri T, Cabunoc A, Chan J, Chen WJ. 2014. WormBase 2014: new views of curated biology. Nucleic Acids Res., 42:D789–D793.
  • Howe D, Costanzo M, Fey P, Gojobori T, Hannick L, Hide W, et al. 2008. Big data: The future of biocuration: Big data. Nature., 455(7209):47–50.
  • Kanehisa M, Furumichi M, Sato Y, Ishiguro-Watanabe M, Tanabe M. 2021. KEGG: integrating viruses and cellular organisms. Nucleic Acids Res., 49(D1): D545–51.
  • Karp PD, Billington R, Caspi R, Fulcher CA, Latendresse M, Kothari A, et al. 2019. The BioCyc collection of microbial genomes and metabolic pathways. Brief Bioinform., 20(4):1085–93.
  • Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. 2002. The human genome browser at UCSC. Genome Res., 12(6):996-1006.
  • Lapatas V, Stefanidakis M, Jimenez RC, Via A, Schneider MV. 2015. Data integration in biological research: an overview. J Biol Res (Thessalon)., 22(1):9.
  • Marx V. 2013. Biology: The big challenges of big data: Biology. Nature., 498(7453):255–60.
  • Nature Structural Biology. 2003. 10:980. doi: 10.1038/nsb1203-980
  • Oliveira AL. 2019. Biotechnology, big data and artificial intelligence. Biotechnol J., 14(8):e1800613.
  • Razvi SRH, Rampogu S. 2016. Bioinformatics in the present day. MOJ proteom bioinform [Internet]., 3(1):11–2. Available from: http://dx.doi.org/10.15406/mojpb.2016.03.00073
  • Toomula N, Kumar A, Kumar D S, Bheemidi VS. 2012. Biological databases- integration of life science data. J Comput Sci Syst Biol., 04(05):087-092. Available from: http://dx.doi.org/10.4172/jcsb.1000081
  • Yates AD, Achuthan P, Akanni W, Allen J, Allen J, Alvarez-Jarreta J, et al. 2020. Ensembl 2020. Nucleic Acids Res., 48(D1): D682–8.
  • Zou D, Ma L, Yu J, Zhang Z. 2015. Biological databases for human research. Genomics Proteomics Bioinformatics., 13(1):55–63.

Open access scientific resources: Digital databases

A D V A N C E D   L E V E L

This part deals with the advanced design of a Database. It explains the structure of a Database and how to make relations between Database tables.


Open access scientific resources

Advanced structure of a Database

This part deals with the advanced design of a Database. It explains the structure of a Database and how to make relations between Database tables. It also presents the specific language used to make queries (SQL) to retrieve data from a Database.

Database Management Systems

A modern database can be defined as a structured collection of information (data) that is representative of the real world. Database Management Systems (DBMS) are used for the creation, management and querying of databases. At present, relational database management systems (RDBMS) are the most mature and widely operated database systems in production. Almost all online transactions and most online content management systems (e.g. blogs and social networks) rely on these types of database systems, which are central to the world’s application infrastructure. The focal point of a DBMS is the collection of services that provide the persistence of data in the database and the functionality to ensure that the data is correct and consistent and that transactions follow the ACID properties. ACID refers to four essential properties of a transaction:

  • Atomicity
  • Consistency
  • Isolation
  • Durability
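
To make these properties more tangible, here is a minimal sketch of atomicity using a transaction in Python's sqlite3 module: either both updates take effect, or neither does. The account table and its constraint are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance REAL NOT NULL CHECK (balance >= 0))")
con.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100.0), ("bob", 10.0)])
con.commit()

try:
    with con:   # the connection acts as a transaction: commit on success, rollback on error
        con.execute("UPDATE account SET balance = balance - 200 WHERE name = 'alice'")
        con.execute("UPDATE account SET balance = balance + 200 WHERE name = 'bob'")
except sqlite3.IntegrityError:
    print("transfer rejected, nothing was changed")        # atomicity: all or nothing

print(con.execute("SELECT * FROM account ORDER BY name").fetchall())
# [('alice', 100.0), ('bob', 10.0)]  the balances are unchanged and still consistent
```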

Database models’ languages

All database models have a language for the specification of the database’s structure and content. The specification is known as the schema design and represents the logical view of information that will be managed by a certain DBMS. This database specification language needs to be flexible so as to be useful and lasting. The most visible element of a database, which is identifiable by database professionals and application developers, is the data manipulation language. It can exhibit many forms, with the most common being a programming-language-like interface. Today, the textual and procedural languages, including Structured Query Language (SQL) and Object Query Language (OQL), remain the most widespread forms of data manipulation language.

Database characteristics

A database can be characterised as coherent, logical and internally consistent. It can also be characterised as self-describing, as it includes metadata, which define and describe the data and relationships between tables in the database. It is designed to contain data for a specific purpose. Each data item is stored in a field; a combination of fields is referred to as a table. A number of tables may exist in a database.

In contrast to the file-based system, in database systems the data structure is stored in the system catalogue and not in the application programs. This separation between the programs and data is named program-data independence.

The architecture of a database system is composed of a set of services that are constructed on top of basic operating system services, system file storage services and primary memory buffer management services. This set of services is comprised of the following: catalogue management, integrity management, transaction management, concurrency control, lock management, deadlock management, recovery management, security management, query processing, communications management and log management.

Database model types

Data models can be divided into two types:

  • High-level conceptual data models
  • Record-based logical data models

High-level conceptual data models propose concepts for presentation of data in ways that are similar to how people perceive data. An example of this data model is the entity-relationship (ER) model, which is based on concepts, such as entities, attributes and relationships. An entity corresponds to a real-world object, attributes represent properties of the entity and a relationship indicates an association among entities.

Record-based logical data models propose concepts that users can comprehend, but are similar to the way data is stored in the computer. Relational data models, network data models and hierarchical data models are three of the most prevalent record-based logical data models.

  • In the relational model, data are represented in the form of relations, or tables.
  • In the network model, data are represented as record types. Also represented by this model is a set type, defined as a limited type of one-to-many relationships.

  • In the hierarchical model, data are represented as a hierarchical tree structure, each branch of which is representative of a number of related records.

Phases of database design

Data modelling constitutes the first step of database design. This step is at times thought of as a high-level and abstract design phase, known as the conceptual design. This phase aims to describe the following:

  • The data present in the database
  • The relationships between data items
  • The constraints on data

At this initial phase of the database design process, information-requirement analysis is essential. It is the most important phase because the overall effectiveness of the system relies on how accurately the information requirements and user views are specified in the beginning. The specifications about information requirements made at this stage affect the final form and content of the database system.

After the specifications have been determined and developed, they must be structured into an integrated, cohesive system, a procedure called logical design. Logical design includes the following steps:

  1. developing a data model for each user view
  2. integrating the entities, attributes and relationships into a composite logical schema that describes the database for that module in terms that are not related to the software package being used
  3. transforming the logical schema into a software schema expressed in the language of the chosen database management package

The final step of designing a database is physical design. This step is required in order to change the software schema into a form that can be implemented with the specific hardware, operating system and database management system of an organisation. Involved in physical design is the implementation of integrity and security requirements and the design of navigation paths.

Degree of abstraction

Data abstraction signifies the concealing of certain details of the way data are stored and maintained. In terms of their degree of abstraction, database models can be divided into three levels, which are:

  • The external or view level, which is the highest level of abstraction and represents only part of the entire database
  • The logical level, which describes what data are stored in the entire database

  • The physical level, which is the lowest level of abstraction and describes how the data are stored in the database

Database schemas

The database schema can be defined as the early-stage database description that is not expected to frequently change. Numerous schemas exist in a database system. The database architecture consists of three levels of schemas.

External level

This is the highest level of schemas. The external level data view is concentrated on specific data-processing applications or user views. It contains several views and represents a fragment of the actual database. Each view is offered for a user or group of users so that it helps make the interaction between the user and the system simpler.

Conceptual level

This level describes the logical structure of the entire database, which is, in turn, described by simple logical concepts, including objects, their properties and relationships. Therefore, the intricacy of the implementation details of the data will not be visible to the users. Only one conceptual level view is maintained in the database. In order for entities or attributes to be referred to in the database system, they must first be defined in the conceptual level view, formally described as the logical schema. This level view has to be highly stable, since it is considered to be the basis for the development of external and internal level views.

Internal level

The way the data are stored and the way to access the data are described in this schema. The internal level represents the internal or physical state of the database. Its objective is to increase the efficiency of the database system, while fulfilling the required needs.

Data independence

Data independence refers to the ability of user applications to remain unaffected by changes made in the definition and organisation of data. Two types of data independence exist: logical and physical.

Logical data independence is the ability to alter the logical (conceptual) schema without affecting the external schema or user view. Adjustments to the logical schema, such as changes to the database structure like adding tables, should not have an effect on the function of the application (external views).

Physical data independence is the ability of the conceptual level schema to remain unaffected by changes made to the internal schema. Alterations to file organisation or storage structures, storage devices or indexing strategy do not bring about changes in the conceptual level.
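A minimal sketch of logical data independence follows, using hypothetical table and view names: the application queries a view (its external schema), so the conceptual schema underneath can later be reorganised by redefining the view, without changing the application's query.

    -- Conceptual schema (hypothetical)
    CREATE TABLE protein (
        accession  VARCHAR(20) PRIMARY KEY,
        name       VARCHAR(200),
        species    VARCHAR(100)
    );

    -- External schema: the view the application actually queries
    CREATE VIEW human_proteins AS
        SELECT accession, name
        FROM protein
        WHERE species = 'Homo sapiens';

    -- If the protein table is later split or renamed, only the view definition
    -- needs to change; the application's query stays the same:
    SELECT name FROM human_proteins WHERE accession = 'P69905';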

The Relational Data Model

The relational data model was developed by Dr. Edgar F. Codd in 1970. It represents data in a tabular form, a way of representing data that is familiar to many people. The logical simplicity of flat file structures is maintained in this model. The relational model is based on set theory, which provides the foundation for several of the operations that are performed on relations. It offers the most flexible access to data and is, thus, useful in dynamic decision-making environments.

SQL is a relational transformation language; it offers ways to form relations and manipulate the data. The outcome of a transformation operation is always another relation, which may contain just one row and column.

Basic Elements of a Relational Data Model

Table 1. Basic components of a relational data model.

  • Table: includes columns and rows; a subset of the Cartesian product of a list of domains, characterised by a name
  • Columns: the main storage units; they contain the basic elements of data into which the content can be divided
  • Rows: contain associated columns; together with columns they form the basis of all databases
  • Domain: a set of acceptable values that can be included in a column
  • Degree: the number of columns present in a table

A relation, which is also called a table or file, can be characterised as a two-dimensional table that consists of data regarding an entity class or the relationships between entity classes. Each row of a table contains data referring to a specific entity, and each column contains a specific attribute. The rows, or records, of a relation can be referred to as tuples. A record within a table represents an instance of an entity. The number of rows in a relation is indicative of its cardinality. The number of columns, also known as fields or attributes, in a relation corresponds to the degree of the relation. The basic elements of a relational data model are described in Table 1. A unary relation consists of only one attribute; a binary relation consists of only two attributes; a ternary relation consists of only three attributes.
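The hedged sketch below, using a hypothetical gene table, ties these terms to SQL: each column declaration restricts a domain through its data type, the table has a degree of four (four attributes), and its cardinality is simply the number of rows it currently holds.

    CREATE TABLE gene (
        gene_id     INTEGER PRIMARY KEY,   -- attribute drawn from an integer domain
        symbol      VARCHAR(20) NOT NULL,  -- attribute drawn from a character-string domain
        chromosome  VARCHAR(5),
        start_pos   BIGINT
    );  -- degree = 4 (four columns)

    INSERT INTO gene VALUES (1, 'HBA1', '16', 176680);
    INSERT INTO gene VALUES (2, 'MB', '22', 35606000);
    -- cardinality = 2 (two tuples/rows); values are illustrative only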

Characteristics of a Table

  • Each table in a database has a unique name
  • No duplicate rows exist; each row is distinct
  • Each column has a distinct name
  • The sequence of rows and columns is not important
  • Entries in a column are drawn from the same domain according to the column's data type, including: date, logical (true/false), character (string) and number (numeric, integer, float, …)

Differentiating features of the Relational Database Model

Essentiality: A data structure is considered essential if removing it results in a loss of information from the database.

Integrity Rules: These ensure that the database content remains accurate and consistent. There are two types of integrity:

  1. Entity integrity: Allows the unique identification of each entity in the relational database. This ability ensures access to all data. Requires that no primary key has a null value.
  2. Referential Integrity: Allows the reference of tuples using foreign keys. Requires that the values assumed by a foreign key either match a primary key that is present in the database or are completely null.

Data manipulation: A method to manipulate the data; principal approach for creating information for decision making.
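As a hedged sketch of the two integrity rules, the hypothetical tables below declare a primary key (entity integrity: unique and never null) and a foreign key (referential integrity: its value must match an existing primary key or be null); a few typical data-manipulation statements follow.

    CREATE TABLE species (
        species_id  INTEGER PRIMARY KEY,        -- entity integrity: unique, not null
        name        VARCHAR(100) NOT NULL
    );

    CREATE TABLE protein (
        accession   VARCHAR(20) PRIMARY KEY,
        name        VARCHAR(200),
        species_id  INTEGER,                    -- may be null
        FOREIGN KEY (species_id) REFERENCES species (species_id)  -- referential integrity
    );

    INSERT INTO species VALUES (1, 'Homo sapiens');
    INSERT INTO protein VALUES ('P69905', 'Haemoglobin subunit alpha', 1);   -- accepted
    -- INSERT INTO protein VALUES ('P02185', 'Myoglobin', 99);  -- rejected: no species 99
    UPDATE protein SET species_id = NULL WHERE accession = 'P69905';         -- allowed: null foreign key
    DELETE FROM protein WHERE accession = 'P69905';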

The Entity-Relationship Model

The entity-relationship (ER) data model has been available for more than 35 years. It is relatively abstract and easy to explain. ER models are readily translated to relations and are represented by ER diagrams. Relationships and entities are the fundamentals of this model. An entity may be an object that exists physically or has conceptual existence. If an entity is existence-dependent, that is, it cannot exist without its associated entities, it is characterised as weak. Conversely, if it can exist separately from all of its associated entities, an entity is referred to as strong.

Different kinds of entities exist:

  • Independent entities or kernels: The building blocks of the database. They are strong entities. The primary key is not a foreign key and can be simple or composite. The different types of keys are described in Table 2.
  • Dependent or derived entities: They are existence-dependent on two or more tables. They are used to bring two kernels together and may include other attributes. Each related table is identified by the foreign key. Three options are available for the primary key: i) use a composite of foreign keys of related tables, if unique, ii) use a composite of foreign keys and a qualifying column, or iii) create a new simple primary key.
  • Characteristic entities: These entities offer additional information about another table. They describe other entities and are representative of multivalued attributes. The foreign key is used for further identification of the characterised entity. Two options are available for the primary key: i) use a composite of foreign keys and a qualifying column, or ii) create a new simple primary key.

Table 2. Types of keys.

  • Candidate key: a simple or composite key that is unique, because no two rows in a table can have the same value at any time, and minimal, since every column is needed to achieve uniqueness
  • Composite key: composed of two or more attributes; must be minimal
  • Primary key: the candidate key chosen by the database designer as the identifying mechanism for the whole entity set; must uniquely identify tuples in a table and must not be null; indicated in the ER model by underlining the attribute
  • Secondary key: an attribute used strictly for retrieval purposes; can be composite
  • Alternate key: any candidate key not selected as the primary key
  • Foreign key: an attribute in a table that references the primary key in another table, or can be null

Null values: Different from zero or blank values; do not depend on data type. A null value means that either the actual value is unknown or that the attribute is not applicable.

Examples of entity types and relationships in biological databases

An entity type describes the characteristics that are shared by a collection of entities in a domain. For example, Protein can be considered as an entity type, with attributes, which include sequence, name, molecular weight, species and accession number. A single entity type will likely have several instances, each of which provides values to the attributes that are specified in the corresponding type. For example, the names of two instances of the entity type Protein are human α-haemoglobin and whale myoglobin. The values of their attribute species would be human and whale, respectively.

Relationships indicate that two or more entity types are associated. For example, a Protein may interact with many other Proteins, or may be a member of a family. Different categories of relationship may describe the nature of the relationship. For example, one entity type could be represented as a part of another (e.g. a Beta strand is part of a Sheet in the secondary structure of a Protein) or as a kind of another (e.g. an Enzyme is a kind of Protein).
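A hedged SQL sketch of how the entity type Protein and an "interacts with" relationship might be translated into relations is shown below; the table and column names, and the inserted values, are illustrative and not taken from any particular biological database.

    CREATE TABLE protein (
        accession         VARCHAR(20) PRIMARY KEY,
        name              VARCHAR(200),
        sequence          TEXT,
        molecular_weight  NUMERIC(12, 2),
        species           VARCHAR(100)
    );

    -- Many-to-many relationship: a protein may interact with many other proteins
    CREATE TABLE protein_interaction (
        protein_a  VARCHAR(20) REFERENCES protein (accession),
        protein_b  VARCHAR(20) REFERENCES protein (accession),
        PRIMARY KEY (protein_a, protein_b)     -- composite primary key
    );

    INSERT INTO protein
    VALUES ('P69905', 'Haemoglobin subunit alpha', 'MVLSPADKTNVKAAW', 15258.0, 'Homo sapiens');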

Modification Anomalies

Unintentional mistakes may occur in a database during the processes of insertion, deletion or modification of data. If the mistake is a result of the database design, then this is called a modification anomaly.

There are three types of modification anomalies:

  1. Deletion anomaly: the removal of one logical entity that leads to loss of information about an unrelated logical entity
  2. Insertion anomaly: the insertion of data about one logical entity that necessitates the insertion of data about an unrelated logical entity
  3. Update anomaly: the alteration of the information for one logical entity that necessitates more than one alteration to a relation.
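To make the three anomalies concrete, the hedged sketch below keeps protein and laboratory details in one unnormalised table (hypothetical names and columns); splitting it into two relations removes the anomalies.

    -- Unnormalised: laboratory details are repeated on every protein row
    CREATE TABLE protein_lab (
        accession  VARCHAR(20) PRIMARY KEY,
        name       VARCHAR(200),
        lab_name   VARCHAR(100),
        lab_city   VARCHAR(100)
    );
    -- Update anomaly: if a laboratory moves, lab_city must change on many rows.
    -- Insertion anomaly: a new laboratory cannot be recorded until it has a protein.
    -- Deletion anomaly: deleting a laboratory's last protein also deletes its details.

    -- Normalised alternative
    CREATE TABLE laboratory (
        lab_id    INTEGER PRIMARY KEY,
        lab_name  VARCHAR(100),
        lab_city  VARCHAR(100)
    );

    CREATE TABLE protein_entry (
        accession  VARCHAR(20) PRIMARY KEY,
        name       VARCHAR(200),
        lab_id     INTEGER REFERENCES laboratory (lab_id)
    );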

Key Definitions

Centralised database system: data in this system is stored at a single site

Distributed database system: database and DBMS software are distributed in different sites connected by a computer network.

Database: a shared collection of associated data to be used for supporting the activities of organisations.

Data Definition Language (DDL): used to define the conceptual and internal schemas

Database Management System (DBMS): computer programs used for the creation, management and query of databases

Data Model: a collection of concepts used for the description of the database structure

Data redundancy: storage of the same data piece in two or more places in the database system

Normalisation: a method that structures data in such a way that problems are decreased or avoided

Recovery: the procedure of using logs and backup copies to recreate a database that has been damaged

Structured Query Language (SQL)

SQL stands for Structured Query Language, a computer language for storing, manipulating and retrieving data stored in a relational database. It is the most widely used database language. It offers ways to construct relations and manipulate data. SQL is the standard language for relational database systems. All Relational Database Management Systems (RDBMS), such as MySQL, MS Access, Oracle, Sybase, Informix, Postgres and SQL Server, use SQL as their standard database language, although they use different “dialects”:

  • MS SQL Server uses T-SQL
  • Oracle uses PL/SQL
  • MS Access uses a version of SQL called JET SQL (native format) etc.

SQL Commands List

A list of SQL commands that covers all the necessary actions with SQL databases follows. However, as previously mentioned, there might be some differences between different types of databases, including the use of different “dialects”. Each SQL command is provided with its syntax and description.

The commands in SQL are called queries, and they are of two types:

  1. Data definition queries: statements that define the structure of a database, create tables, and specify keys, indexes and so on
  2. Data manipulation queries: statements used to manipulate the data themselves, i.e. to insert, update, delete and retrieve data
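A brief, hedged illustration of the two query types follows, using a hypothetical sample table: the first statement is a data definition query, the remaining ones are data manipulation queries.

    -- Data definition query: defines structure
    CREATE TABLE sample (
        sample_id  INTEGER PRIMARY KEY,
        tissue     VARCHAR(50),
        collected  DATE
    );

    -- Data manipulation queries: add, change and retrieve data
    INSERT INTO sample VALUES (1, 'liver', '2021-01-15');
    UPDATE sample SET tissue = 'kidney' WHERE sample_id = 1;
    SELECT tissue, collected FROM sample WHERE sample_id = 1;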

SQL Commands List:

Command | Syntax | Description
ALTER TABLE | ALTER TABLE table_name ADD column_name datatype; | It is used to add columns to a table in a database
AND | SELECT column_name(s) FROM table_name WHERE column_1 = value_1 AND column_2 = value_2; | It is an operator used to combine two conditions
AS | SELECT column_name AS 'Alias' FROM table_name; | It is a keyword used to rename a column or table using an alias name
AVG | SELECT AVG(column_name) FROM table_name; | It is used to aggregate a numeric column and return its average
BETWEEN | SELECT column_name(s) FROM table_name WHERE column_name BETWEEN value_1 AND value_2; | It is an operator used to filter the result set within a range of values
CASE | SELECT column_name, CASE WHEN condition THEN 'Result_1' ELSE 'Result_2' END FROM table_name; | It is a statement used to return different outputs within a SELECT statement depending on conditions
COUNT | SELECT COUNT(column_name) FROM table_name; | It is a function that returns the number of rows matching a specified criterion
CREATE TABLE | CREATE TABLE table_name (column_1 datatype, column_2 datatype); | It is used to create a new table in a database and define its columns
DELETE | DELETE FROM table_name WHERE some_column = some_value; | It is used to remove rows from a table
GROUP BY | SELECT column_name, COUNT(*) FROM table_name GROUP BY column_name; | It is a clause used with aggregate functions to group the result set by one or more columns
HAVING | SELECT column_name, COUNT(*) FROM table_name GROUP BY column_name HAVING COUNT(*) > value; | It is used instead of WHERE to filter groups produced by aggregate functions
INNER JOIN | SELECT column_name(s) FROM table_1 INNER JOIN table_2 ON table_1.column_name = table_2.column_name; | It is used to combine rows from two tables when the join condition is met
INSERT | INSERT INTO table_name (column_1, column_2) VALUES (value_1, value_2); | It is used to add new rows to a table
IS NULL / IS NOT NULL | SELECT column_name(s) FROM table_name WHERE column_name IS NULL; | They are operators used to test for empty (NULL) or non-empty values
LIKE | SELECT column_name(s) FROM table_name WHERE column_name LIKE pattern; | It is an operator used in a WHERE clause to search for a specific pattern in a column
LIMIT | SELECT column_name(s) FROM table_name LIMIT number; | It is a clause that specifies the maximum number of rows the result set will contain
MAX | SELECT MAX(column_name) FROM table_name; | It is a function that returns the largest value of the selected column
MIN | SELECT MIN(column_name) FROM table_name; | It is a function that returns the smallest value of the selected column
OR | SELECT column_name FROM table_name WHERE column_name = value_1 OR column_name = value_2; | It is an operator used to filter the result set when either of two conditions is true
ORDER BY | SELECT column_name FROM table_name ORDER BY column_name ASC (or DESC); | It is used to sort the result set by a particular column, numerically or alphabetically
OUTER JOIN | SELECT column_name(s) FROM table_1 LEFT (or RIGHT / FULL) OUTER JOIN table_2 ON table_1.column_name = table_2.column_name; | It is used to combine rows from two tables, also returning rows that have no match in the other table
ROUND | SELECT ROUND(column_name, integer) FROM table_name; | It is a function that rounds the values in a numeric column to the specified number of decimal places
SELECT | SELECT column_name FROM table_name; | It is used to fetch data from a database
SELECT DISTINCT | SELECT DISTINCT column_name FROM table_name; | It is used to return only the distinct (different) values of the specified columns
SUM | SELECT SUM(column_name) FROM table_name; | It is a function that returns the sum of the values in a numeric column
UPDATE | UPDATE table_name SET some_column = some_value WHERE some_column = some_value; | It is used to edit rows in a table
WHERE | SELECT column_name(s) FROM table_name WHERE column_name operator value; | It is a clause used to filter the result set to rows that meet a specified condition
WITH | WITH temporary_name AS (SELECT * FROM table_name) SELECT * FROM temporary_name WHERE column_name operator value; | It is used to store the result of a particular query in a temporary table using an alias

Commands and syntax for querying data from a single table or from multiple tables:

Single table

  • SELECT c1 FROM t; (select the data in column c1 from table t)
  • SELECT * FROM t; (select all rows and columns from table t)
  • SELECT c1 FROM t WHERE c1 = 'test'; (select the data in column c1 from table t where c1 equals 'test')
  • SELECT c1 FROM t ORDER BY c1 ASC (DESC); (select the data in column c1 from table t in ascending or descending order)
  • SELECT c1 FROM t ORDER BY c1 LIMIT n OFFSET offset; (skip the first offset rows and return the next n rows)
  • SELECT c1, aggregate(c2) FROM t GROUP BY c1; (group rows using an aggregate function)
  • SELECT c1, aggregate(c2) FROM t GROUP BY c1 HAVING condition; (group rows using an aggregate function and filter the groups using the HAVING clause)

Multiple tables

  • SELECT c1, c2 FROM t1 INNER JOIN t2 ON condition; (select columns c1 and c2 from table t1 and perform an inner join between t1 and t2)
  • SELECT c1, c2 FROM t1 LEFT JOIN t2 ON condition; (select columns c1 and c2 from table t1 and perform a left join between t1 and t2)
  • SELECT c1, c2 FROM t1 RIGHT JOIN t2 ON condition; (select columns c1 and c2 from table t1 and perform a right join between t1 and t2)
  • SELECT c1, c2 FROM t1 FULL OUTER JOIN t2 ON condition; (select columns c1 and c2 from table t1 and perform a full outer join between t1 and t2)
  • SELECT c1, c2 FROM t1 CROSS JOIN t2; (produce the Cartesian product of the rows of t1 and t2)
  • SELECT c1, c2 FROM t1, t2; (produce the Cartesian product of the rows of t1 and t2)
  • SELECT c1, c2 FROM t1 A INNER JOIN t1 B ON condition; (join table t1 to itself, a self-join, using the INNER JOIN clause)
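The hedged sketch below works through the most common join types against two hypothetical tables, patients and admissions; the table definitions are included only so that the queries are self-contained.

    -- Hypothetical tables
    CREATE TABLE patients (
        patient_id  INTEGER PRIMARY KEY,
        name        VARCHAR(200)
    );

    CREATE TABLE admissions (
        admission_id  INTEGER PRIMARY KEY,
        patient_id    INTEGER REFERENCES patients (patient_id),
        admit_date    DATE
    );

    -- Inner join: only patients with at least one admission
    SELECT p.patient_id, p.name, a.admit_date
    FROM patients p
    INNER JOIN admissions a ON a.patient_id = p.patient_id;

    -- Left join: all patients, with NULL admit_date where no admission exists
    SELECT p.patient_id, p.name, a.admit_date
    FROM patients p
    LEFT JOIN admissions a ON a.patient_id = p.patient_id;

    -- Self-join: pair each admission with an earlier admission of the same patient
    SELECT a1.admission_id, a1.admit_date, a2.admit_date AS earlier_admit
    FROM admissions a1
    INNER JOIN admissions a2
            ON a2.patient_id = a1.patient_id
           AND a2.admit_date < a1.admit_date;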

Commercial and Free Databases used in the real world

Figure 1: Non-exhaustive list of available databases

This part deals with the common databases found on the market, whether they are free or proprietary. However, there are so many databases available (Figure 1) that we cannot mention all of them. A choice had to be made, and the ones presented below are among the most popular and most frequently used.

Commercial Databases

From the vast number of databases available on the market, we chose to present three commercial databases commonly used by the major companies and organisations.

SAP HANA

This database is designed by the European company SAP SE, founded in Germany. SAP HANA is a database engine that is column-oriented and can handle SAP and non-SAP data. The engine is designed to save and retrieve data from applications and other sources across multiple tiers of storage. SAP HANA can be deployed on-premises or in the cloud from a number of cloud service providers. This database is usually chosen by organizations that are pulling data from applications and are not under a terribly constrained budget.

Its main features are:

  • It supports SQL, OLTP and OLAP.
  • The engine reduces resource requirements through compression.
  • Data is stored in memory, reducing access times, in some cases, significantly.
  • Real-time reporting and inventory management are available.
  • It can interface with a number of other applications.

As of January 2021, the currently supported hardware platforms for SAP HANA are:

  • Intel-based hardware platforms
  • IBM Power Systems

As of January 2021, the currently supported operating systems for SAP HANA are:

  • SUSE Linux Enterprise Server
  • Red Hat Enterprise Linux

IBM Db2 Database

IBM Db2 database traces its roots back to the beginning of the 1970s, when Edgar F. Codd, a researcher working for the company, described the theory of relational databases and, in June 1970, published his relational model of data. Today, it is a database engine with NoSQL capabilities that can read JSON and XML files.

The current version of DB2 is LUW 11.1, which offers a variety of improvements. One, in particular, was an improvement of BLU Acceleration (BLink Ultra or Big Data, Lightning fast and Ultra-easy), which is designed to make this database engine work faster through data skipping technology. Data skipping is designed to improve the speed of systems with more data than can fit into memory. The latest version of Db2 also provides improved disaster recovery functions, compatibility and analytics.

Its main features are:

  • BLU Acceleration can make the most of available resources for enormous databases.
  • It can be hosted from the cloud, a physical server or both at the same time.
  • Multiple jobs can be run at once using the Task Scheduler.
  • Error codes and exit codes can determine which jobs are run via the Task Scheduler.

The currently supported hardware platforms as of January 2021 for IBM Db2 are:

  • IBM z/Architecture mainframe
  • Intel-based hardware platforms

The currently supported operating systems as of January 2021 for IBM Db2 are:

  • z/OS
  • Unix
  • Linux
  • Windows

Oracle Database

Oracle Database is commonly used for running online transaction processing (OLTP) or data warehousing (DW). It can also mix OLTP and DW database workloads. Oracle Database is available on-premises, on-cloud or as hybrid cloud installation. It may be run on third-party servers, as well as on Oracle Exadata hardware on-premises, on Oracle Cloud or on a private Cloud at customer premises.

The first version was released in 1979 and its development was influenced by the research of Edgar F. Codd on relational database design.

Its main features are:

  • It is a cross-platform database. It can run on various hardware across operating systems, including Windows Server, Unix and various distributions of GNU/Linux.
  • It has its own networking stack that allows applications on different platforms to communicate smoothly with the Oracle Database, e.g. applications running on Windows can connect to an Oracle Database running on Unix.
  • It is an ACID-compliant database that helps maintain data integrity and reliability.

The currently supported hardware platforms are:

  • Proprietary Oracle Database Appliance
  • Sparc
  • IBM Power Systems
  • X64-based hardware platforms

The currently supported operating systems are:

  • Unix
  • Linux
  • Windows

Free Databases

If a database is free, this does not necessarily mean that no fees are charged to the user. Some of the following databases are indeed entirely free; however, some developers choose to limit certain features and charge a fee to unlock those features (refer to the first unit of the Basic Level).

MySQL

MySQL is an open-source relational database, which runs on a number of different platforms, including Windows, Linux and macOS. A cloud version is also available. MySQL can be used for packaged software, business-critical systems and high-volume websites.

Its main features are:

  • It provides scalability and flexibility
  • The tool has web and data warehouse strengths
  • It provides high performance

  • It has robust transactional support

PostgreSQL

PostgreSQL is an enterprise-class open source database management system. It supports both SQL for relational and JSON for non-relational queries. It is backed by an experienced community of developers who have made tremendous contributions to making it highly reliable database management software. It runs on three different platforms, namely Windows, Linux and macOS. A cloud version is not available. PostgreSQL enables the creation of custom data types and a range of query methods. Stored procedures can be written in different programming languages.

Its main features are:

  • It is compatible with various platforms using all major languages and middleware
  • Standby server and high availability
  • The tool has mature server-side programming functionality
  • Log-based and trigger-based replication, and SSL support
  • It offers sophisticated locking mechanisms
  • Support for multi-version concurrency control
  • It provides support for client-server network architecture
  • The tool is Object-oriented and ANSI-SQL2008 compatible

PostgreSQL can be linked with other data stores, such as NoSQL systems, and act as a federated hub for polyglot databases.

Microsoft SQL

SQL Server is an RDBMS developed by Microsoft. It supports ANSI SQL, the standard SQL (Structured Query Language). However, SQL Server also comes with its own implementation of the language, T-SQL (Transact-SQL). It runs on Docker Engine, Ubuntu, SUSE Linux Enterprise Server and Red Hat Enterprise Linux. A cloud version is available.

Its main features are:

  • It provides integration of structured and unstructured data with the power of SQL Server and Spark.
  • The tool offers scalability, performance and availability for mission-critical, intelligent applications, data warehouses and data lakes.
  • It offers advanced security features to protect your data.

  • It provides access to rich, interactive Power BI reports, enabling faster and better decisions.

MariaDB

MariaDB is a fork of the MySQL database management system, created by the original developers of MySQL. This DBMS tool provides data processing capabilities for both small and enterprise tasks. It runs on three platforms, namely Windows, Linux and macOS. A cloud version is available. MariaDB is an alternative to MySQL and provides high scalability through easy integration.

Its main features are:

  • It operates under GPL, BSD or LGPL licenses.
  • It comes with many storage engines, including the high-performance ones that can be integrated with other relational database management systems.
  • It provides the Galera cluster technology.

MariaDB can run on different operating systems and it supports numerous programming languages.

Oracle

Oracle is a self-repairing, self-securing and self-driving database designed to eliminate manual data management. It is an intelligent, secure and highly available database in the cloud that helps businesses grow. It runs on two platforms, namely Windows and Linux. A cloud version is also available.

Its main features are:

  • Oracle Cloud is optimized for high-performance database workloads, streaming workloads and hyperscale big data.
  • You can easily migrate to the Cloud.

  • It provides services based on how you prefer to operate, including the option to run Oracle Cloud in your own data centre.

Firebirdsql

Firebird is an open source SQL RDBMS that runs on Microsoft Windows, macOS, Linux and several Unix platforms, including HP-UX, Solaris and AIX. A cloud version is available. Firebird has development-friendly language support, stored procedures and triggers.

Its main features are:

  • Firebird allows you to build a custom version.
  • It is free to download, register and deploy.
  • The tool has enhanced multi-platform RDBMS.

  • It offers a range of funding options, from Firebird memberships to sponsorship commitments.

Databases in the scientific world (advanced module)

This section is dedicated to further exploring the open access databases used in science and how to use and benefit from the existing knowledge.

Overview of databases in the scientific world

Existing databases dedicated to science and how to use them

As previously mentioned, sharing, integrating and annotating data is a crucial part of biological research as it allows researchers to reproduce the examination and interpretation of experimental findings. Although it is thought that bioinformaticians and computer scientists are responsible for these actions, life scientists have an equal role in promoting data integration, since they are the ones that generate these types of data and are usually the end users.

Data integration is defined as the process of combining data from different sources in order to offer users a unified view of such data. In computational sciences the theoretical frameworks for data integration have been categorised, based on the method used to integrate the data, into “eager” and “lazy”. According to the eager method, which is also known as warehousing, the data are copied over to a global schema and stored in a central data warehouse. The term “schema” refers to an organised and “queryable” approach for storing data. In the lazy method, the data are located in distributed sources and are integrated on demand in accordance with a global schema used for mapping the data between sources. The volume of data, the owner of the data and the existing infrastructure are the main factors that ultimately determine which of the two methods will be used for data integration. Moreover, in biological sciences, these methods may be applied in various different ways and used at a range of levels. As a result, six distinct and widely used schemata have been formulated for the integration of data:

  • Data centralisation: The data reside in centralised resources. UniProt and GenBank are two examples of databases following this method.
  • Data warehousing: data from a variety of sources reside on one central repository. Pathway commons is a database that follows this approach to integrate data.
  • Dataset integration: In-house workflows access databases that are distributed and download data to a local repository.
  • Hyperlinks: This approach enables users to access databases and tools in different fields of life science, thus promoting interoperability. ExPASy is an indicative example of a portal that is based on this data integration methodology.
  • Federated databases: A translation layer is required in order for data to be integrated among heterogeneous databases. This means that data from each database are transformed into a commonly accepted format so that they can be interpreted in the same way by a mapping service. The Distributed Annotation System (DAS), which is a client-server system, is an indicative example.
  • Linked data: A network of interconnected data accessible online. Graphical user interfaces (GUIs) provide hyperlinks that connect associated data from numerous data providers and, hence, form a large system of Linked Data. Bio2RDF is an indicative example of a database that uses this approach as the basis for data integration.

Data centralisation, data warehousing and dataset integration are based on the “eager” theoretical framework, whereas hyperlinks, federated databases and linked data are based on the “lazy” theoretical framework regarding the way that is selected for data integration.

Data formats are described as an organised way of representing data and metadata in a file. Scientists began to store biological data in formatted files because the exponential growth of data created the need to analyse them using computer systems and databases. A problem that has arisen in relation to file formatting is the emergence of various formats, even for the representation of the same type of data. In some cases, more than one format class may be used to represent the data and metadata in a single file. Moreover, research has demonstrated that the most widely used format classes are: i) tables, ii) FASTA-like, iii) tag-structured, and iv) GenBank-like. The ideal solution to this matter would be for scientists to agree upon a limited number of specific formats so as to simplify the process of data integration. The design of converters able to translate between the different classes of formats would also provide a helpful solution.

Currently, over 1,700 databases including data of biological interest are in use, according to the non-exhaustive list curated by the Nucleic Acids Research journal. In order to be deemed valuable for a specific purpose, all datasets that are present in a database have to be integrated and structured. The existing biological databases are comprised of information about a vast range of biology research topics, such as non-vertebrate genomics, protein sequence, human genes and diseases, DNA nucleotide sequence, cell biology, immunology, metabolic and signalling pathways, proteomics, etc.

As previously mentioned in the Basic Level, the classification of biological databases is dependent upon several factors, including the scope of data coverage and the level of biocuration. Nevertheless, their classification according to the type of data is one of the simplest and most comprehensive ways of categorising biological databases. Therefore, in the following section, these will be described as DNA, RNA, protein, disease, expression and pathway databases.

DNA databases

DNA databases focus on handling DNA data from numerous or a few particular species. The main purpose of human DNA databases is to establish the reference genome, to conduct profiling of human genetic variation, to associate genotype with phenotype and to identify human microbiome metagenomes. A DNA database example is GenBank, a publicly available collection of all the studied DNA sequences. As of February 2021, over 776 billion nucleotide bases in over 226 million sequences are available in GenBank (http://www.ncbi.nlm.nih.gov/genbank/statistics).

RNA databases

These databases include information about non-coding RNAs (ncRNAs), such as microRNAs and long non-coding RNAs (lncRNAs), which do not encode proteins. The purpose of RNA databases is to decode ncRNAs, of which lncRNAs are the most commonly studied, and to describe their functions and interactions. An RNA database example is RNAcentral, which consists of a unified view of ncRNA sequence data derived from a number of databases, some of which are Rfam, miRBase and lncRNAdb.

Protein databases

Protein databases were developed in order to create a vast compilation of universal proteins, identify protein families and domains, reconstruct phylogenetic trees and conduct profiling of protein structures. PDB, which consists of thousands of structures of biological macromolecules, is an indicative example of protein databases.

Disease databases

By definition disease databases include information about different types of diseases, but are mostly focused on providing data concerning various types of cancer. One of the most important cancer projects that has been developed is The Cancer Genome Atlas (TCGA), the objective of which is to gather a broad range of omics data, such as mRNA, SNP and methylation, for over twenty different forms of human cancer.

Expression databases

Expression databases may be utilised for a number of tasks, such as studying tissue-specific gene expression and regulation, storing expression data, detecting differential and baseline expression, and examining and reviewing expression information obtained from RNA and protein data. As an expression database, the Human Protein Atlas incorporates expression profiles for a significant percentage of human protein-coding genes derived from RNA and protein data.

Pathway databases

Pathway databases include data about biological pathways that can be utilised by researchers for the analysis of metabolic, regulatory and signalling pathways. A characteristic example of pathway databases is KEGG PATHWAY, which contains information regarding molecular interaction and reaction networks.

The National Center for Biotechnology Information (NCBI), part of the US National Library of Medicine at the National Institutes of Health, has developed Entrez, an integrated database retrieval system which offers access to 34 different databases collectively containing 3.0 billion records. The global search page of Entrez (https://www.ncbi.nlm.nih.gov/search/) provides links to the web portal for each of the 34 databases. The Entrez system is easy to use because it allows users to download data in a variety of formats and to perform text searching using simple Boolean queries. Records are linked between databases on the basis of asserted relationships; these records can be represented in various formats. Moreover, users of Entrez have the option to download single records or batches of records. Some of the 34 databases that are part of Entrez are the following: PubMed (https://pubmed.ncbi.nlm.nih.gov), which contains scientific and medical abstracts/citations; BioSample (https://www.ncbi.nlm.nih.gov/biosample), which comprises descriptions of biological source materials; GEO Profiles (https://www.ncbi.nlm.nih.gov/geoprofiles), which includes gene expression and molecular abundance profiles; and dbVar (https://www.ncbi.nlm.nih.gov/dbvar), which contains data from genome structural variation studies.

The data submitted to NCBI are derived from three sources: i) directly from researchers, ii) national and international partnerships or agreements with data providers and research consortia, and iii) internal curation efforts. Of note, NCBI is responsible for the management of the GenBank database and partakes in the International Nucleotide Sequence Database Collaboration (INSDC) in cooperation with the EMBL-EBI European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ).

As databases have proven to be a useful tool in many scientific fields, their use is steadily gaining ground in the healthcare sector. Nowadays, technological advancements in the area of data science have enabled healthcare professionals to compile, process and analyse health-related data, leading to the improvement of not only the delivery of care, but also the safety of patients and consumers. In order for these improvements to take place, relevant data must be gathered, stored and analysed in an efficient and secure manner, and exchanged across the different service levels present in a healthcare system. This has led to the development of Electronic Health Records (EHRs), databases which store patient data that can be accessed and utilised by healthcare professionals.

EHRs can be defined as medical databases that offer users, who in this case are healthcare professionals and administrative staff, access to health records. The most distinct types of EHRs are the Electronic Medical Record (EMR) and the Personal Health Record (PHR). EMRs consist of information submitted by a single hospital department, an entire hospital or parts of a hospital, and they can also contain information from a number of hospitals. Information in this type of EHR is normally added only by hospital staff. In contrast, PHRs are managed by the patients, who are able to enter information themselves. PHRs are described as electronic applications that provide a secure platform for patients to control and share their health data. The main difference between the two types of EHR systems is that, in PHRs, health records have to be presented in a way that is understandable by the patient, whereas, in EMRs, the presentation of health records resembles paper health records, since they are accessed only by healthcare providers.

The first EHR system became available in the 1960s mainly due to the build-up of unstructured and unused patient information over a period of several decades. Large organisations started to set up database systems in order to store and structure data into central repositories. These databases allowed the organisation and collection of data from many different sources, including pharmacies, laboratories, clinical studies, and constituents of clinical care, such as medication administration records. Currently, the implementation of EHR systems is mostly observed in high income countries. For example, the Health Information Technology for Economic and Clinical Health Act (HITECH Act of 2009) prompted the digitization of the healthcare delivery system in the US and the subsequent development of the Medicare and Medicaid EHR Incentive Programs.

The primary purpose for the creation of EHRs was the need to archive and structure patients’ records. They were later designated for billing and quality improvement reasons. As technological advancements took place, over the years EHRs became more inclusive, dynamic and interconnected. Nonetheless, compared to other industries, big data have not been used to best advantage in the medical industry. This has happened predominantly due to the poor quality of collected data and poorly structured datasets. Prior to the development of EHRs, medical research was based on disease registries or chronic disease management systems (CDMS). These repositories have significant limitations, since they consist of collections of data that are often related to only one particular disease. Furthermore, they cannot perform translation of the data or conclusions to other diseases and may include information from a group of patients in a specific geographic area. On the other hand, EHR data is largely varied, thus facilitating the analysis of complex clinical interactions and decisions.

The components of EHRs are different types of medical data, ranging from health records to raw sensory data. Medical data can be categorised into sensitive and non-sensitive data. Sensitive data include patient information or data that can be associated with a patient. Non-sensitive data include sensory data, which are also called measurement data because they consist only of samples from sensors, such as samples of an EEG measurement. Information describing the data stored in a medical database is referred to as metadata. The most common type of database used for storing medical data is the relational database, which presents data in the form of tables comprised of rows and a set number of columns. Some databases may include patient information, such as the medical history of a patient, or anonymised data that can be utilised in studies.

Medical data can be divided into several categories as described below:

  • Medical and laboratory data: Healthcare workers can submit orders for medication or laboratory studies into a physician order entry system, which are subsequently carried out by laboratory or nursing staff. Examples of this category of data are prescriptions for medication and microbiology results.
  • Billing data: This category of medical data is comprised of codes used by hospitals to file claims with their insurance providers. The International Classification of Diseases, constructed by the WHO, and the Current Procedural Terminology, sustained by the American Medical Association, are the most popular coding systems.
  • Images: These may be radiographic images resulting from x-rays, echocardiograms and computed tomography (CT) scans.
  • Notes and reports: These may be associated with the progress of patients. Discharge summaries also belong to this category. Findings from imaging studies are usually described in operative reports. Notes have to be partially structured using a templating system.
  • Physiological data: This category of medical data contains vital signs, such as heart rate and blood pressure, as well as ECG and EEG waveforms.

Relational databases are most frequently used for medical data management and storage. They can be referred to as a collection of tables that are connected by shared keys. A database schema determines how the tables will be structured and their relationships. A simple medical database may contain four tables:

  • Table 1: a patient list
  • Table 2: a hospital admissions log
  • Table 3: a list with vital sign measurements
  • Table 4: a dictionary of vital sign codes and associated labels

Primary and foreign keys can be used in order to link the four tables.
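A hedged SQL sketch of this simple four-table schema follows; the table and column names and data types are illustrative, not those of any specific EHR product.

    CREATE TABLE patient (
        patient_id  INTEGER PRIMARY KEY,
        name        VARCHAR(200),
        birth_date  DATE
    );

    CREATE TABLE admission (
        admission_id    INTEGER PRIMARY KEY,
        patient_id      INTEGER REFERENCES patient (patient_id),   -- foreign key to Table 1
        admit_time      TIMESTAMP,
        discharge_time  TIMESTAMP
    );

    CREATE TABLE vital_sign_code (
        code   VARCHAR(20) PRIMARY KEY,   -- dictionary of vital sign codes (Table 4)
        label  VARCHAR(100)               -- e.g. 'Heart rate', 'Systolic blood pressure'
    );

    CREATE TABLE vital_sign (
        admission_id  INTEGER REFERENCES admission (admission_id),
        code          VARCHAR(20) REFERENCES vital_sign_code (code),
        measured_at   TIMESTAMP,
        value         NUMERIC(10, 2)
    );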

The preponderance of healthcare databases provides limited access to data for various reasons, including privacy concerns and plans to monetise the data. Nonetheless, a number of open access health databases are available for public use, some of which are described below.

The Medical Information Mart for Intensive Care (MIMIC) database

The MIMIC database (http://mimic.physionet.org) was created in 2003 as the result of a collaboration between MIT, Philips Medical Systems and the Beth Israel Deaconess Medical Center (BIDMC). The data entered into this database were sourced from medical and surgical patients admitted to all intensive care units at BIDMC. It consists of detailed physiological and clinical information from over forty thousand patients and is de-identified and openly accessible to researchers. Two types of data are present in this database: clinical data derived from EHRs, which are stored in a relational database comprised of approximately 50 tables, and bedside monitor waveforms stored in flat binary files. The goal of this collaboration is to produce and assess advanced ICU patient monitoring and decision support systems that will ultimately make decision-making in critical care more efficient, quicker and more accurate.

PCORnet

PCORnet, the National Patient-Centered Clinical Research Network, is an initiative that began in 2013 with the objective to integrate data from several Clinical Data Research Networks and Patient-Powered Research Networks. It contains 29 networks that will facilitate access to vast amounts of research. It collects data from routine patient visits and data that are shared by individual patients via personal health records or community networks with other patients.

Open NHS

The National Health Service (NHS England) maintains one of the largest repositories in the world containing data related to people's health. Open NHS is an open source database that provides access to information made available to the public by the government or other public bodies. This project was established in order to increase transparency and monitor the efficiency of the British healthcare sector. Patients, healthcare workers and commissioners are given the opportunity to compare the quality of care in various locations of the country by simply accessing the available data in the specially designed database.

Database de-identification

One of the primary steps in building an EHR database is de-identification. Before a database becomes available for use by researchers and applications, it is essential that measures are taken to ensure that privacy policies and regulations are followed. For structured data, such as the columns of a table, de-identification is based on the categorisation of data and the subsequent deletion or encryption of those flagged as protected. For unstructured data, such as discharge summaries, different natural language processing techniques are used, from simple regular expressions to complex neural networks, which try to find all protected information throughout the free text in order to delete or encrypt it.
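For the structured case, de-identification can be as simple as copying the data while leaving out or transforming the protected columns. The hedged sketch below assumes a hypothetical identified patient table and stands in for whatever pseudonymisation approach a real project would adopt.

    -- Hypothetical identified source table
    CREATE TABLE patient_identified (
        patient_id  INTEGER PRIMARY KEY,   -- assumed to be a surrogate key, not a medical record number
        name        VARCHAR(200),
        address     VARCHAR(300),
        birth_date  DATE,
        sex         CHAR(1)
    );

    -- De-identified research copy: direct identifiers are simply not selected,
    -- and the date of birth is coarsened to a year
    CREATE TABLE patient_research AS
    SELECT
        patient_id                     AS pseudo_id,
        EXTRACT(YEAR FROM birth_date)  AS birth_year,
        sex
    FROM patient_identified;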

The application of blockchain in Digital health

Blockchain technology is based on the concept of a decentralised system for data storage, in which a copy of the ledger of performed transactions is provided to each participant (node). This makes it unfeasible for someone to modify the data without the other participants being informed. Domains that currently rely on strong centralised entities stand to benefit from the application of blockchain, and the applications of Digital health greatly depend on centralised systems. Therefore, blockchain has the potential to transform Digital health by altering the way data are stored and secured. Various areas have been proposed for its application, including supply chains, drug verification, claims reimbursement, access control and clinical trials.

Medical data have been found to be the most highly valued data by hackers, as recent studies have estimated that a single health record may sell for up to 400 USD. This means that keeping the data in medical databases secure is of the utmost importance. Blockchain can provide a solution to this issue by ensuring data privacy, integrity, authentication and authorisation. Blockchain data are encrypted, and if data need to be deleted or rendered useless, this can be achieved through a key destruction mechanism, in which the key originally used to encrypt the data is destroyed or made useless. Afterwards, the data stored in the blockchain can no longer be read.

Blockchain is able to fulfil two essential needs regarding data sharing: integrity and non-repudiation. Integrity means that the query and retrieved data cannot be altered, once the retrieval operation has been performed. Non-repudiation means that the knowledge retrieval service does not possess the ability to deny that the specific data have been delivered by the service as a response to a given query at a particular time. Blockchain can be defined as a distributed transaction management system that cannot be corrupted. It can be applied for EHR integration, sharing and access control, preservation and management.

A theoretical blockchain-based query notary service may be comprised of three computational layers:

  • a data consumer front-end
  • an interface for communicating with biomedical database interfaces, and
  • the contract engine, which organises the query and returns the retrieved results to the consumer, performs and prepares transactions, and manages contracts and their metadata

Two different schemes may be employed to apply the notary service: the basic scheme and the versioning scheme. The basic scheme applies a query-response ledger by which the user receives a sealed proof verifying that, at a specific time, a particular query was placed in a biomedical database, which returned specific results. This scheme may be employed to ensure the integrity and non-repudiation of a query when a vital biomedical task relies on that query. The versioning scheme permits the non-repudiable versioning of data retrieved from a dynamically evolving biomedical database on numerous occasions over time, always using the same query. This scheme may be applied to confirm different versions of changing medical evidence as retrieved from a biomedical database whose content is frequently updated.

The incorporation of blockchain technology in pharmaceutical or life science applications has the capacity to decentralise the interface and data sharing, leading to increased efficiency, higher speeds and unlimited scalability. Blockchain renders data immutable, which would be useful in clinical trials for ensuring that clinical data cannot be manipulated by researchers at a later time. It can also be utilised in the process of drug identification, tracing and verification. There are certain risks associated with the implementation of blockchain, such as privacy concerns, off-chain transactions and doubts about this technology due to lack of adoption. Nonetheless, the benefits of blockchain technology far outweigh possible drawbacks and could have a significant role in limiting the methods used for illegal activities.


References

  • Agha-Mir-Salim L, Sarmiento RF. 2020. Health information technology as premise for data science in global health: A discussion of opportunities and challenges. In: Leveraging Data Science for Global Health. Cham: Springer International Publishing, 3–15.
  • Amid C, Alako BTF, Balavenkataraman Kadhirvelu V, Burdett T, Burgin J, Fan J, Harrison PW, Holt S, Hussein A, Ivanov E et al. 2020. The European nucleotide archive in 2019. Nucleic Acids Res., 48:D70–76.
  • Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, et al. 2004. Uniprot: the universal protein knowledgebase. Nucleic Acids Res., 32 (Suppl 1):115–9. doi: 10.1093/nar/gkh131.
  • Artimo P, Jonnalagedda M, Arnold K, Baratin D, Csardi G, de Castro E, et al. 2012. ExPASy: SIB bioinformatics resource portal. Nucleic Acids Res., 40(Web Server issue):597–603. doi: 10.1093/nar/gks400.
  • Belleau F, Nolin MA, Tourigny N, Rigault P, Morissette J. 2008. Bio2RDF: towards a mashup to build bioinformatics knowledge systems. J Biomed Inform., 41(5):706–16.
  • Benson DA, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. 2014. GenBank. Nucleic Acids Res., 42:D32–D37.
  • Bornberg-Bauer E, Paton NW. 2002. Conceptual data modelling for bioinformatics. Brief Bioinform., 3(2):166–80.
  • Bulgarelli L, Núñez-Reiz A, Deliberato RO. 2020. Building electronic health record databases for research. In: Leveraging Data Science for Global Health. Cham: Springer International Publishing, 55–64.
  • Burge SW, Daub J, Eberhardt R , Tate J, Barquist L, Nawrocki EP, et al. 2013. Rfam 11.0: 10 years of RNA families, Nucleic Acids Res., 41: D226-232.
  • Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, Shaw KR, Ozenberger BA, et al. 2013. The Cancer Genome Atlas Pan-Cancer analysis project, Nat Genet., 45: 1113-1120.
  • Cerami EG, Gross BE, Demir E, Rodchenkov I, Babur O, Anwar N, et al. 2011. Pathway Commons, a web resource for biological pathway data. Nucleic Acids Res.; 39(Database issue): 685–90.
  • Chavali LN, Prashanti NL, Sujatha K, Rajasheker G, Kavi Kishor PB. 2018. The Emergence of Blockchain Technology and its Impact in Biotechnology, Pharmacy and Life Sciences. Current Trends in Biotechnology and Pharmacy., 12(3):304–10.
  • Courtney JF, Paradice DB, Brewer KL, Graham JC. 2010. Database Systems for Management. 3rd edition. The Global Text Project.
  • Dowell RD, Jokerst RM, Day A, Eddy SR, Stein L. 2001. The distributed annotation system. BMC Bioinformatics., 2:7.
  • Edgar F. Codd https://en.wikipedia.org/wiki/Edgar_F._Codd
  • Fleurence RL, Curtis LH, Califf RM, Platt R, Selby JV, Brown JS. 2014. Launching PCORnet, a national patient-centered clinical research network. J Am Med Inform Assoc JAMIA., 21(4):578–582.
  • Fortier PJ, Michel HE. 2003. Computer Data Processing Hardware Architecture. In: Computer Systems Performance Evaluation and Prediction. Elsevier, p. 39–106.
  • Hellerstein JM, Stonebraker M, Hamilton J. 2007. Architecture of a database system. Found Tren Databases., 1(2):141–259.
  • Johnson A, Pollard T, Shen L et al. 2016. MIMIC-III, a freely accessible critical care database. Sci Data 3., 160035.
  • Karsch-Mizrachi I, Takagi T, Cochrane G. 2018. International Nucleotide Sequence Database, C The international nucleotide sequence database collaboration. Nucleic Acids Res., 46:D48–51.
  • Kleinaki A-S, Mytis-Gkometh P, Drosatos G, Efraimidis PS, Kaldoudi E. 2018. A blockchain-based notarization service for biomedical knowledge retrieval. Comput Struct Biotechnol J., 16:288–97.
  • Kozomara A, Griffiths-Jones S. 2014. MiRBase: annotating high confidence microRNAs using deep sequencing data, Nucleic Acids Res., 42: D68-73.
  • Lapatas V, Stefanidakis M, Jimenez RC, Via A, Schneider MV. 2015. Data integration in biological research: an overview. J Biol Res (Thessalon)., 22(1):9.
  • Lastdrager E. 2011. Securing Patient Information in Medical Databases [Internet]. University of Twente. Available from: https://essay.utwente.nl/61035/1/MSc_E_Lastdrager_DIES_CTIT.pdf
  • Marshall J, Chahin A, Rush B. 2016. Review of clinical databases. In: Secondary Analysis of Electronic Health Records. Cham: Springer International Publishing, 9–16.
  • Nguyen KA. Database System Concepts. OpenStax CNX; 2009 [cited 2021 Jan 29]. Available from: http://cnx.org/contents/b57b8760-6898-469d-a0f7-06e0537f6817@1
  • Ogasawara O, Kodama Y, Mashima J, Kosuge T, Fujisawa T. 2020. DDBJ database updates and computational infrastructure enhancement. Nucleic Acids Res., 48:D45–50.
  • Okuda S, Yamada T, Hamajima M, Itoh M, Katayama T, Bork P, et al. 2008. KEGG Atlas mapping for global analysis of metabolic pathways, Nucleic Acids Res., 36: W423-426.
  • Oliveira AL. 2019. Biotechnology, big data and artificial intelligence. Biotechnol J., 14(8):e1800613.
  • Pollard T, Dernoncourt F, Finlayson S, Velasquez A. 2016. Data Preparation. In: Secondary Analysis of Electronic Health Records. Cham: Springer International Publishing, 101–14.
  • Ponten F, Schwenk JM, Asplund A, Edqvist PH. 2011. The Human Protein Atlas as a proteomic resource for biomarker discovery, J Intern Med., 270: 428-446.
  • Quek XC, Thomson DW, Maag JL, Bartonicek N, Signal B, Clark MB, et al. 2015. lncRNAdb v2.0: expanding the reference database for functional long noncoding RNAs, Nucleic Acids Res., 43, D168-173.
  • Rose PW, Beran B, Bi C, Bluhm WF, Dimitropoulos D, Goodsell DS, et al. 2011. The RCSB Protein Data Bank: redesigned web site and web services, Nucleic Acids Res., 39: D392-401.
  • Sayers EW, Beck J, Bolton EE, Bourexis D, Brister JR, Canese K, et al. 2021. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 49(D1):D10–7.
  • Schuler G.D., Epstein J.A., Ohkawa H., Kans J.A. 1996. Entrez: molecular biology database and retrieval system. Methods Enzymol., 266:141–162.
  • The RNAcentral Consortium, RNAcentral: an international database of ncRNA sequences. 2015. Nucleic Acids Res., 43: D123-129.
  • Watt A, Eng N. Types of Data Models. In: Watt A, Eng N, editors. Database Design – 2nd edition. BCcampus; 2014 [cited 2021 Jan 29]. Available from: https://opentextbc.ca/dbdesign01
  • Watt A. Characteristics and Benefits of a Database. In: Watt A, Eng N, editors. Database Design – 2nd edition. BCcampus; 2014 [cited 2021 Jan 29]. Available from: https://opentextbc.ca/dbdesign01/
  • Watt A. Data Modelling. In: Watt A, Eng N, editors. Database Design – 2nd edition. BCcampus; 2014 [cited 2021 Jan 29]. Available from: https://opentextbc.ca/dbdesign01
  • Watt A. The Entity Relationship Data Model. In: Watt A, Eng N, editors. Database Design – 2nd edition. BCcampus; 2014 [cited 2021 Jan 29]. Available from: https://opentextbc.ca/dbdesign01
  • Watt A. The Relational Data Model. In: Watt A, Nelson E, editors. Database Design – 2nd edition. BCcampus; 2014 [cited 2021 Jan 29]. Available from: https://opentextbc.ca/dbdesign01
  • Zou D, Ma L, Yu J, Zhang Z. 2015. Biological databases for human research. Genomics Proteomics Bioinformatics., 13(1):55–63.
  • Zuniga PCC, Zuniga RAC, Mendoza MJ-A, Cariaga AA, Sarmiento RF, Marcelo AB. 2020. Workshop on Blockchain Use Cases in Digital Health. In: Leveraging Data Science for Global Health. Cham: Springer International Publishing;, 99–107.
  • Agha-Mir-Salim L, Sarmiento RF. 2020. Health information technology as premise for data science in global health: A discussion of opportunities and challenges. In: Leveraging Data Science for Global Health. Cham: Springer International Publishing, 3–15.
  • Amid C, Alako BTF, Balavenkataraman Kadhirvelu V, Burdett T, Burgin J, Fan J, Harrison PW, Holt S, Hussein A, Ivanov E et al. 2020. The European nucleotide archive in 2019. Nucleic Acids Res., 48:D70–76.
  • Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, et al. 2004. UniProt: the universal protein knowledgebase. Nucleic Acids Res., 32(Suppl 1):115–9. doi: 10.1093/nar/gkh131.
  • Artimo P, Jonnalagedda M, Arnold K, Baratin D, Csardi G, de Castro E, et al. 2012. ExPASy: SIB bioinformatics resource portal. Nucleic Acids Res., 40(Web Server issue):597–603. doi: 10.1093/nar/gks400.
  • Belleau F, Nolin MA, Tourigny N, Rigault P, Morissette J. 2008. Bio2RDF: towards a mashup to build bioinformatics knowledge systems. J Biomed Inform., 41(5):706–16.
  • Benson DA, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. 2014. GenBank. Nucleic Acids Res., 42:D32–D37.
  • Bornberg-Bauer E, Paton NW. 2002. Conceptual data modelling for bioinformatics. Brief Bioinform., 3(2):166–80.
  • Bulgarelli L, Núñez-Reiz A, Deliberato RO. 2020. Building electronic health record databases for research. In: Leveraging Data Science for Global Health. Cham: Springer International Publishing, 55–64.
  • Burge SW, Daub J, Eberhardt R, Tate J, Barquist L, Nawrocki EP, et al. 2013. Rfam 11.0: 10 years of RNA families. Nucleic Acids Res., 41:D226–232.
  • Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, Shaw KR, Ozenberger BA, et al. 2013. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet., 45:1113–1120.
  • Cerami EG, Gross BE, Demir E, Rodchenkov I, Babur O, Anwar N, et al. 2011. Pathway Commons, a web resource for biological pathway data. Nucleic Acids Res., 39(Database issue):685–90.
  • Chavali LN, Prashanti NL, Sujatha K, Rajasheker G, Kavi Kishor PB. 2018. The Emergence of Blockchain Technology and its Impact in Biotechnology, Pharmacy and Life Sciences. Current Trends in Biotechnology and Pharmacy., 12(3):304–10.
  • Courtney JF, Paradice DB, Brewer KL, Graham JC. 2010. Database Systems for Management. 3rd edition. The Global Text Project.
  • Dowell RD, Jokerst RM, Day A, Eddy SR, Stein L. 2001. The distributed annotation system. BMC Bioinformatics., 2:7.
  • Edgar F. Codd [Internet]. Wikipedia. Available from: https://en.wikipedia.org/wiki/Edgar_F._Codd
  • Fleurence RL, Curtis LH, Califf RM, Platt R, Selby JV, Brown JS. 2014. Launching PCORnet, a national patient-centered clinical research network. J Am Med Inform Assoc JAMIA., 21(4):578–582.
  • Fortier PJ, Michel HE. 2003. Computer Data Processing Hardware Architecture. In: Computer Systems Performance Evaluation and Prediction. Elsevier, p. 39–106.
  • Hellerstein JM, Stonebraker M, Hamilton J. 2007. Architecture of a database system. Found Trends Databases., 1(2):141–259.
  • Johnson A, Pollard T, Shen L, et al. 2016. MIMIC-III, a freely accessible critical care database. Sci Data., 3:160035.
  • Karsch-Mizrachi I, Takagi T, Cochrane G; International Nucleotide Sequence Database Collaboration. 2018. The International Nucleotide Sequence Database Collaboration. Nucleic Acids Res., 46:D48–51.
  • Kleinaki A-S, Mytis-Gkometh P, Drosatos G, Efraimidis PS, Kaldoudi E. 2018. A blockchain-based notarization service for biomedical knowledge retrieval. Comput Struct Biotechnol J., 16:288–97.
  • Kozomara A, Griffiths-Jones S. 2014. miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Res., 42:D68–73.
  • Lapatas V, Stefanidakis M, Jimenez RC, Via A, Schneider MV. 2015. Data integration in biological research: an overview. J Biol Res (Thessalon)., 22(1):9.
  • Lastdrager E. 2011. Securing Patient Information in Medical Databases [Internet]. University of Twente. Available from: https://essay.utwente.nl/61035/1/MSc_E_Lastdrager_DIES_CTIT.pdf
  • Marshall J, Chahin A, Rush B. 2016. Review of clinical databases. In: Secondary Analysis of Electronic Health Records. Cham: Springer International Publishing, 9–16.
  • Nguyen KA. Database System Concepts. OpenStax CNX; 2009 [cited 2021 Jan 29]. Available from: http://cnx.org/contents/b57b8760-6898-469d-a0f7-06e0537f6817@1
  • Ogasawara O, Kodama Y, Mashima J, Kosuge T, Fujisawa T. 2020. DDBJ database updates and computational infrastructure enhancement. Nucleic Acids Res., 48:D45–50.
  • Okuda S, Yamada T, Hamajima M, Itoh M, Katayama T, Bork P, et al. 2008. KEGG Atlas mapping for global analysis of metabolic pathways. Nucleic Acids Res., 36:W423–426.
  • Oliveira AL. 2019. Biotechnology, big data and artificial intelligence. Biotechnol J., 14(8):e1800613.
  • Pollard T, Dernoncourt F, Finlayson S, Velasquez A. 2016. Data Preparation. In: Secondary Analysis of Electronic Health Records. Cham: Springer International Publishing, 101–14.
  • Ponten F, Schwenk JM, Asplund A, Edqvist PH. 2011. The Human Protein Atlas as a proteomic resource for biomarker discovery. J Intern Med., 270:428–446.
  • Quek XC, Thomson DW, Maag JL, Bartonicek N, Signal B, Clark MB, et al. 2015. lncRNAdb v2.0: expanding the reference database for functional long noncoding RNAs. Nucleic Acids Res., 43:D168–173.
  • Rose PW, Beran B, Bi C, Bluhm WF, Dimitropoulos D, Goodsell DS, et al. 2011. The RCSB Protein Data Bank: redesigned web site and web services. Nucleic Acids Res., 39:D392–401.
  • Sayers EW, Beck J, Bolton EE, Bourexis D, Brister JR, Canese K, et al. 2021. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 49(D1):D10–7.
  • Schuler GD, Epstein JA, Ohkawa H, Kans JA. 1996. Entrez: molecular biology database and retrieval system. Methods Enzymol., 266:141–162.
  • The RNAcentral Consortium. 2015. RNAcentral: an international database of ncRNA sequences. Nucleic Acids Res., 43:D123–129.
  • Watt A, Eng N. Types of Data Models. In: Watt A, Eng N, editors. Database Design – 2nd edition. BCcampus; 2014 [cited 2021 Jan 29]. Available from: https://opentextbc.ca/dbdesign01
  • Watt A. Characteristics and Benefits of a Database. In: Watt A, Eng N, editors. Database Design – 2nd edition. BCcampus; 2014 [cited 2021 Jan 29]. Available from: https://opentextbc.ca/dbdesign01/
  • Watt A. Data Modelling. In: Watt A, Eng N, editors. Database Design – 2nd edition. BCcampus; 2014 [cited 2021 Jan 29]. Available from: https://opentextbc.ca/dbdesign01
  • Watt A. The Entity Relationship Data Model. In: Watt A, Eng N, editors. Database Design – 2nd edition. BCcampus; 2014 [cited 2021 Jan 29]. Available from: https://opentextbc.ca/dbdesign01
  • Watt A. The Relational Data Model. In: Watt A, Eng N, editors. Database Design – 2nd edition. BCcampus; 2014 [cited 2021 Jan 29]. Available from: https://opentextbc.ca/dbdesign01
  • Zou D, Ma L, Yu J, Zhang Z. 2015. Biological databases for human research. Genomics Proteomics Bioinformatics., 13(1):55–63.
  • Zuniga PCC, Zuniga RAC, Mendoza MJ-A, Cariaga AA, Sarmiento RF, Marcelo AB. 2020. Workshop on Blockchain Use Cases in Digital Health. In: Leveraging Data Science for Global Health. Cham: Springer International Publishing, 99–107.

1 Source: https://intellipaat.com/blog/tutorial/sql-tutorial/sql-commands-cheat-sheet/

2 Source: https://intellipaat.com/blog/tutorial/sql-tutorial/sql-commands-cheat-sheet/

3 Source: SAP SE, https://help.sap.com/viewer/eb3777d5495d46c5b2fa773206bbfb46/2.0.01/en-US/d3d1cf20bb5710149b57fd794c827a4e.html

4 For more information about supported operating systems for SAP HANA, see SAP Note 2235581 – SAP HANA: https://service.sap.com/sap/support/notes/2235581

5 JavaScript Object Notation (JSON) is an open standard file format, like XML, and is treated here as unstructured (more precisely, semi-structured) data.

6 XML is an open standard file format, like JSON, and is likewise treated here as unstructured (more precisely, semi-structured) data.
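
To make the difference between these two formats concrete, the following minimal Python sketch (the record and its field names are purely illustrative and not taken from this document) serializes the same record first as JSON and then as an equivalent XML element:

    # Purely illustrative sketch: the same hypothetical record serialized as
    # JSON and as XML, to show that both are open, text-based exchange formats.
    import json
    import xml.etree.ElementTree as ET

    record = {"patient_id": "P001", "age": 54, "diagnosis": "hypertension"}

    # JSON: key/value pairs, with native types for numbers and strings
    print(json.dumps(record, indent=2))

    # XML: the same fields expressed as nested elements under a root tag
    root = ET.Element("patient")
    for field, value in record.items():
        ET.SubElement(root, field).text = str(value)
    print(ET.tostring(root, encoding="unicode"))

Both outputs carry the same information; choosing between them is mainly a question of tooling and of how strictly the structure needs to be validated.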

7 Source: IBM Support, https://www.ibm.com/support/pages/system-requirements-ibm-db2-linux-unix-and-windows#1155S

8 Source: https://support.oracle.com/knowledge/Oracle%20Database%20Products/1369107_1.html

9 Source: https://www.guru99.com/free-database-software.html (updated 2021)

10 Open data at the NHS. Available from: http://www.england.nhs.uk/ourwork/tsd/data-info/open-data/