Penn State University Libraries

Empirical research in the social sciences and education.

  • What is Empirical Research and How to Read It
  • Finding Empirical Research in Library Databases
  • Designing Empirical Research
  • Ethics, Cultural Responsiveness, and Anti-Racism in Research
  • Citing, Writing, and Presenting Your Work

Contact the Librarian at your campus for more help!

Ellysa Cahoy

Introduction: What is Empirical Research?

Empirical research is based on observed and measured phenomena and derives knowledge from actual experience rather than from theory or belief. 

How do you know if a study is empirical? Read the subheadings within the article, book, or report and look for a description of the research "methodology."  Ask yourself: Could I recreate this study and test these results?

Key characteristics to look for:

  • Specific research questions to be answered
  • Definition of the population, behavior, or phenomena being studied
  • Description of the process used to study this population or phenomena, including selection criteria, controls, and testing instruments (such as surveys)

Another hint: some scholarly journals use a specific layout, called the "IMRaD" format, to communicate empirical research findings. Such articles typically have 4 components:

  • Introduction: sometimes called "literature review" -- what is currently known about the topic -- usually includes a theoretical framework and/or discussion of previous studies
  • Methodology: sometimes called "research design" -- how to recreate the study -- usually describes the population, research process, and analytical tools used in the present study
  • Results: sometimes called "findings" -- what was learned through the study -- usually appears as statistical data or as substantial quotations from research participants
  • Discussion: sometimes called "conclusion" or "implications" -- why the study is important -- usually describes how the research results influence professional practices or future studies

Reading and Evaluating Scholarly Materials

Reading research can be a challenge. However, the tutorials and videos below can help. They explain what scholarly articles look like, how to read them, and how to evaluate them:

  • CRAAP Checklist: A frequently used checklist that helps you examine the currency, relevance, authority, accuracy, and purpose of an information source.
  • IF I APPLY: A newer model of evaluating sources which encourages you to think about your own biases as a reader, as well as concerns about the item you are reading.
  • Credo Video: How to Read Scholarly Materials (4 min.)
  • Credo Tutorial: How to Read Scholarly Materials
  • Credo Tutorial: Evaluating Information
  • Credo Video: Evaluating Statistics (4 min.)
  • Last Updated: Jan 5, 2024 5:11 PM
  • URL: https://guides.libraries.psu.edu/emp

Purdue University


Research: Overview & Approaches

  • Getting Started with Undergraduate Research
  • Planning & Getting Started
  • Building Your Knowledge Base
  • Locating Sources
  • Reading Scholarly Articles
  • Creating a Literature Review
  • Productivity & Organizing Research
  • Scholarly and Professional Relationships

Introduction to Empirical Research

Databases for finding empirical research, guided search, Google Scholar, examples of empirical research, sources and further reading.

  • Interpretive Research
  • Action-Based Research
  • Creative & Experimental Approaches


  • Introductory Video: This video covers what empirical research is, what kinds of questions and methods empirical researchers use, and some tips for finding empirical research articles in your discipline.

Help Resources

  • Guided Search: Finding Empirical Research Articles. A hands-on tutorial that will allow you to use your own search terms to find resources.

Google Scholar Search

  • Study on radiation transfer in human skin for cosmetics
  • Long-Term Mobile Phone Use and the Risk of Vestibular Schwannoma: A Danish Nationwide Cohort Study
  • Emissions Impacts and Benefits of Plug-In Hybrid Electric Vehicles and Vehicle-to-Grid Services
  • Review of design considerations and technological challenges for successful development and deployment of plug-in hybrid electric vehicles
  • Endocrine disrupters and human health: could oestrogenic chemicals in body care cosmetics adversely affect breast cancer incidence in women?


  • Last Updated: Nov 10, 2023 3:32 PM
  • URL: https://guides.lib.purdue.edu/research_approaches


  • Meriam Library

SWRK 330 - Social Work Research Methods

  • Literature Reviews and Empirical Research
  • Databases and Search Tips
  • Article Citations
  • Scholarly Journal Evaluation
  • Statistical Sources
  • Books and eBooks

What is a Literature Review?

A literature review summarizes and discusses previous publications on a topic.

It should also:

  • explore past research and its strengths and weaknesses.
  • be used to validate the target and methods you have chosen for your proposed research.
  • consist of books and scholarly journals that provide research examples of populations or settings similar to your own, as well as community resources to document the need for your proposed research.
  • be completed in the correct citation format requested by your professor (see the Citations tab).

The literature review does not present new primary scholarship.

Access Purdue OWL's Social Work Literature Review Guidelines here.

Empirical research is research that is based on experimentation or observation, i.e., evidence. Such research is often conducted to answer a specific question or to test a hypothesis (educated guess).

How do you know if a study is empirical? Read the subheadings within the article, book, or report and look for a description of the research "methodology."  Ask yourself: Could I recreate this study and test these results?

These are some key features to look for when identifying empirical research.

NOTE: Not all of these features will appear in every empirical research article; some may be excluded. Use this list only as a guide.

  • Statement of methodology
  • Research questions are clear and measurable
  • The individuals, groups, or subjects being studied are identified/defined
  • Data is presented regarding the findings
  • Controls or instruments, such as surveys or tests, were used
  • There is a literature review
  • A discussion of the results is included
  • Citations/references are included

See also Empirical Research Guide

  • Last Updated: Feb 6, 2024 8:38 AM
  • URL: https://libguides.csuchico.edu/SWRK330

Meriam Library | CSU, Chico


Module 2 Chapter 4: Reviewing Empirical Articles

After carefully reviewing the source you have located, it is time to critically review the piece itself. In this chapter, you will read about:

  • Steps in reviewing different sections of an empirical article
  • The importance of maintaining a critical perspective on what you are reading—being an “active reader”

The seven steps considered in this chapter relate to the structure of typical journal articles published in social work and allied discipline journals. The structure is familiar to anyone who works with the American Psychological Association (APA) guide to how journal articles are written and structured, the Publication Manual of the American Psychological Association (APA, 2009 for the sixth edition).

Step 1. What is in a Name? Reviewing the title

Authors vary tremendously in their approach to titling their work, much as parents differ markedly in their approach to naming their babies. Ideally, a title is sufficiently precise and specific to tell a reader what the article is about. Sometimes titles have attention catching phrases added at the front or back end. Ideally, a title is also not overly elaborate and lengthy.

For example, the following article titles clearly and succinctly communicate what each article is about.

  • “Race and ethnic differences in early childhood maltreatment in the United States” (Lanier, Maguire-Jack, Walsh, & Hubel, 2014).
  • “Tracking the when, where, and with whom of alcohol use: Integrating ecological momentary assessment and geospatial data to examine risk for alcohol-related problems” (Freisthler, Lipperman-Kreda, Bersamin, & Gruenewald, 2014).
  • “Meeting them where they are: An exploration of technology use and help seeking behaviors among adolescents and young adults” (Cash & Bridge, 2012).
  • “A systematic review of the relationship between internet use, self-harm and suicidal behaviour in young people: The good, the bad and the unknown” (Marchant et al, 2017). [Note the word “behavior” is spelled “behaviour” in the United Kingdom, but without the “u” in the United States.]

Examples of (hypothetical) article titles that are non-communicative or miscommunicate include:

  • “The problem of drugs in America.” This title is not sufficiently specific for an empirical article. A reader does not know whether it is about how the pharmaceutical industry manufactures drugs (quality control), an epidemiology report about the scope of the problem, an etiology report about theories related to the causes of the problem, or a test of a sociological, psychological, or biological theory. It is more likely the title of an opinion piece, a general book chapter, or even an entire book.
  • “The effectiveness of PFI and MI in AOD treatment.” This title is swimming in jargon, making it difficult to interpret without looking into what each acronym means, and also making the article difficult to locate in the literature. (PFI is personal feedback intervention, MI is motivational interviewing, and AOD refers to alcohol and other drugs.) In addition, acronyms can be ambiguous: BPD could refer to bronchopulmonary dysplasia (a lung complication common among prematurely born infants), borderline personality disorder, or bipolar disorder (the latter two are mental disorders).
  • “The effects of self-esteem on high school student retention and drop-out.” This title is only good if the study design and methods actually allow for a causal inference. If the study design only allows for conclusions about the existence of a relationship between these two variables, the title is a poor choice—the word “effects” implies causality and misrepresents the study.

Step 2. What is it about? Reviewing the abstract

Authors present a summary of their manuscript in a brief abstract that appears at the start of a published article. Abstracts are also published in a number of indexing and abstracting resources, making them relatively easy to access. Journals limit the length of abstracts, usually to somewhere between 150 and 250 words depending on the journal. This makes it challenging to explain the important aspects of the manuscript with enough detail to be clear, but without the luxury of unlimited space to present nuances. An abstract should address the following points:

  • Study’s purpose, research aims, questions, and/or hypotheses
  • Study approach
  • Study design and methods (including study participants and measures)
  • Data analysis and key results
  • Key implications of the study results.

As a reader, the abstract should provide enough information for you to determine whether it is relevant for you to pursue the full article. An article abstract is not sufficient information for you to evaluate the evidence, even the evidence that appears in the description of results! To evaluate the evidence, you need to acquire and review the full article.

This point is so important that it warrants repeating:

An article’s abstract is not sufficient information for you to evaluate the evidence, even the evidence that appears in the description of results! To evaluate the evidence, you need to acquire and review the full article.

A Note about research abstracts. An excellent search of literature will often turn up published research abstracts. In this case, there will not be a full article to locate. The abstract describes a conference presentation, and these are the precursors to publishing an article about a research study. Many professional organizations publish these abstracts in a special journal issue, sometimes a “supplement” to the journal. The next step for your search, if the title and abstract seem relevant, will be to determine whether a paper was ever published based on the study described in the abstract. This is a place where you would search by author name(s) rather than by subject or topic alone.

Step 3. What is the rationale and background knowledge? Reviewing the introduction

Once you have acquired an article that seems interesting and relevant (based on its title and abstract), you will next encounter its introduction. The purpose of an introduction is to provide the reader with a background orientation to the study that was conducted. This might include background information regarding the scope of the problem being addressed by the research. It should certainly provide the reader with a review of literature related to the topic and research questions. This might include an overview of the theory or theories related to the research that was conducted. In the end, the reader should understand the following:

  • Why was the study undertaken, why was it important, why does it matter?
  • What was known from the literature that informed the study’s development?
  • What knowledge gap or gaps did the study aim to fill, or what did the study aim to contribute to the body of knowledge?
  • What research questions did the investigators address in their study?

The introduction often also informs readers about the study’s approach (e.g., qualitative, quantitative, mixed methods) and type of study that was implemented (e.g., exploratory, descriptive, experimental). After reviewing the introduction, you should have an even better idea of whether the article is relevant for your purposes.

Step 4. What happened? Reviewing the methods

If it was not made clear in the introduction, the study approach and type of study should be made explicit in the methods section of the article. The methods section, at a minimum, needs to explain who participated in the study and how data were collected. In a quantitative study, the study design is also described in the methods section; in a qualitative study, the type of study is described. There are three basic sub-sections in the methods section of an empirical article: study participants, study measures, and study procedures.

Study participants.  A methods sub-section describes who actually participated in the study, including numbers and characteristics of the study participants, as well as the pool from which these participants were drawn. The purpose of this sub-section in describing a quantitative study is to inform readers about generalizability of the study’s results and inform other investigators about what they would need to do to replicate the study to determine if they achieve similar results. Authors may present some of the description material in the form of tables with information about numbers and proportions reflecting categorical variables (like gender or race/ethnicity) and distribution on scale/continuous variables (like age). The method of selecting these participants should be clear along with any inclusion or exclusion criteria that were applied. The participant response rate might also be calculated as the number of participants enrolled in the study divided by the number of persons eligible to be enrolled (the “pool”), multiplied by 100%. Very low response rates make a study vulnerable to selection bias—the few persons who elected to participate might not represent the general population. In a qualitative study, the study participants section again describes how the individuals were selected for participation, and details describing these individuals are provided. Generalizability is not a goal in qualitative studies, but information about study participants should provide an indication to a reader of how robust the results might be. Robust descriptions come from participants who exhibit a range of defining and/or experiential characteristics. Finally, regardless of study approach, authors typically make evident that the study was reviewed by an Institutional Review Board for the inclusion of human participants.
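As an illustration of the response-rate calculation described above, here is a minimal sketch in Python; the enrollment numbers are hypothetical, not from any particular study.

    def response_rate(enrolled: int, eligible: int) -> float:
        # Participant response rate: enrolled divided by the eligible pool, as a percentage.
        if eligible <= 0:
            raise ValueError("The eligible pool must be a positive number.")
        return (enrolled / eligible) * 100

    # Hypothetical example: 150 people enrolled out of 600 eligible.
    print(f"Response rate: {response_rate(150, 600):.1f}%")  # prints: Response rate: 25.0%

A reader would weigh a figure like this 25% rate when judging how vulnerable the study is to the selection bias described above.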

Study Measures. Another methods sub-section explains how the data were collected. For quantitative studies, the data collection instruments used to measure each study variable are described. If the tools used for data collection were previously published, authors cite the sources of those tools and published literature about them—their reliability and validity, for example. The authors may also summarize literature concerning how the measures are known to perform with the specific type of study participants involved in the study—for example, different ages, diagnoses, races/ethnicities, or other characteristics. For qualitative studies the interview protocols or questions asked of participants are described in detail. In observational data collection studies, the approach to recording and scoring/coding observed behavior are described. In any case, the approach to data collection or measurement is described in sufficient detail for a reader to critically appraise the adequacy of the data collection approach and conclusions that can be drawn from the data collection process, and for other investigators to be able to replicate the study should they wish to confirm the results.

Study procedures. Sometimes study procedures is a separate methods sub-section and sometimes this content is incorporated into the participants and measures sub-sections. This sub-section includes information about activities in which the study participants engaged during the study. In a quantitative, experimental study, the methods utilized to assign study participants to different experimental conditions might be described here (i.e., the randomization approach used). Additionally, procedures used in handling data are usually described. In a quantitative study, investigators may report how they scored certain measures and what evidence from the literature informs their scoring approach. In a qualitative study, details about how data were coded are reported here. Procedures for ensuring inter-observer or inter-rater reliability and agreements will also be reported for either type of study. Regardless of the study’s research approach, a reader should come away with a detailed understanding of how the study was executed. As a result, the reader should be sufficiently informed about the study’s execution to be able to critically analyze the strength of the evidence developed from the methods that were applied.

Step 5. What was found? Reviewing the results

The results section is where investigators describe the data they collected, how it was analyzed, and what was observed in the data. The structure and format of the results section varies markedly for different research approaches.

 Qualitative methods results description. The nature of qualitative research questions and methods leads to data that are richly descriptive. The results derived from the data are, therefore, generally descriptive in nature. There may be a great deal of direct quotes, presenting information in participants’ own words. Descriptions may include thematic or concept maps constructed by the study investigators as a means of “sense making” from the data. If statistics are included, they tend to be of a descriptive nature—perhaps demonstrating the frequency with which certain results were observed in the data. The results may include tables or figures representing results. Ideally, a reader can identify the way that reported results relate to the research aims or questions originally asked by the study.

Quantitative methods results description.  The nature of quantitative research questions and methods leads to numeric data that can be summarized using various forms of statistical analyses. The results section of a quantitative study report will describe which statistical analyses were utilized, and should indicate the rationale for selecting those analyses, as well as discussing how well the data were suited to those analyses. Analyses that involve hypothesis testing will indicate the statistical support for conclusions drawn from the data (i.e., descriptive statistics, test statistics, significance levels, and confidence intervals). The results may be presented in a combination of text descriptions, tables, and figures. An informed reader should be able to determine the appropriateness of the statistical approaches used in the analyses and the conclusions drawn from those analyses. Ideally, study results are presented in association with each study question or hypothesis as it is answered. Problems encountered with any specific analyses are also reported here, such as when data were not suitably distributed, sample sizes were inadequate, or assumptions underlying specific types of analytic approaches were violated.
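To make the statistical reporting described above concrete, here is a minimal sketch in Python of a simple two-group comparison; the scores are hypothetical, and the independent-samples t-test stands in for whatever analysis a given study actually uses. It prints the kinds of values a quantitative results section typically reports: descriptive statistics, the test statistic, the significance level, and a confidence interval.

    import numpy as np
    from scipy import stats

    # Hypothetical scores for two groups (e.g., an intervention group and a comparison group).
    group_a = np.array([12, 15, 14, 10, 13, 16, 11, 14], dtype=float)
    group_b = np.array([9, 11, 10, 8, 12, 10, 9, 11], dtype=float)

    # Descriptive statistics.
    print(f"Group A: M = {group_a.mean():.2f}, SD = {group_a.std(ddof=1):.2f}")
    print(f"Group B: M = {group_b.mean():.2f}, SD = {group_b.std(ddof=1):.2f}")

    # Independent-samples t-test: test statistic and significance level.
    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

    # 95% confidence interval for the difference in means (pooled-variance formula).
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)
    se_diff = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    margin = stats.t.ppf(0.975, n_a + n_b - 2) * se_diff
    diff = group_a.mean() - group_b.mean()
    print(f"95% CI for the mean difference: [{diff - margin:.2f}, {diff + margin:.2f}]")

As the paragraph above notes, an informed reader checks that the chosen test matches the data and that the conclusions drawn follow from these reported values.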

Mixed methods results description. As mixed methods approaches continue evolving, so do creative ways of presenting the results of studies that integrate qualitative and quantitative approaches. Results for a study employing mixed methods are often presented question-by-question. Where qualitative questions were addressed, the descriptive results will be presented as outlined above. Where quantitative questions were addressed, the numeric and statistical results will be presented as outlined above.

Regardless of study approach, a reader should have a clear understanding of the way data were analyzed and results of those analyses. This is not a place where authors have drawn conclusions about the implications of those results—that belongs in the article’s discussion section.

Step 6. What was concluded? Reviewing the Discussion

In the end, the authors will offer their interpretation of the evidence described in their Results section. This discussion should include several elements:

  • A brief overview summary of the key results.
  • Discussion of how each key result relates to the study aims, questions, and/or hypotheses.
  • Discussion of how the observed results relate to the previous existing literature (are they mutually confirming or contradicting), if the study results were completely new contributions, or if they were ambiguous and no conclusions could be drawn.
  • Discussion of the study’s methodological or analysis/results limitations.
  • Implications of the study results for practice and for future research.

It is important to remember that the discussion is the authors’ own interpretation of the results. This, again, is a place where readers must apply their own critical analysis to the study implications. For example, sometimes authors get a bit carried away with their interpretation and make suggestions that are not supported by the evidence in their studies. Or, they may not have gone far enough, and you see potential implications that they did not.

Step 7. Where are other relevant pieces? Reviewing the reference list

As you search for relevant literature, you might want to review the reference list of an article that you found to be relevant. Sometimes your own search methods and search terms might have missed some important items that the article’s authors were able to identify. This review will provide you with titles to consider, and possibly you will recognize the names of key scholars in the topic area. You can then pursue these background resources as part of your own search.

Interactive Excel Workbook Activities

Complete the following Workbook Activities:

  • Workbook Introduction and Steps to Install Microsoft Office 365
  • SWK 3401.2-4.1 Getting Started: Excel and Data Analysis ToolPak Access

Social Work 3401 Coursebook Copyright © by Dr. Audrey Begun is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License , except where otherwise noted.



San Diego State University Library

Peer Review, Scholarly Sources, and Empirical Studies

How to Tell If a Journal is Peer Reviewed

Many library databases, including those owned by EBSCO and ProQuest, give you the option to limit your search results to only those that are peer reviewed. Look for the option to limit your results either on the search page or after the results are returned as a way to refine your search.

If you are still unsure whether an article has been peer reviewed, you can try the following:

  • Find the journal’s website. Look on the website for information about the editorial policy, submission process, or requirements for author submissions. This section of the website will often give insight into whether or not the journal has a peer review process.
  • If you still cannot determine whether it is peer reviewed, feel free to call, text, or email the reference librarians; someone will find out for you and get back to you as soon as possible.

Distinguishing Characteristics of Scholarly Journals, Popular Magazines, and Trade Publications (chart created by SDSU Library & Information Access)

What Are Empirical Studies?

"Empirical studies are reports of original research. These include secondary analyses that test hypotheses by presenting novel analyses of data not considered or addressed in previous reports. They typically consist of distinct sections that reflect the stages in the research process and that appear in the following sequence:

-introduction: development of the problem under investigation, including its historical antecedents, and statement of the purpose of the investigation;

-method: description of the procedures used to conduct the investigation;

-results: report of the findings and analyses; and

-discussion: summary, interpretation, and implications of the results."

(This is an excerpt from the  Publication Manual of the American Psychological Association.  Washington, DC: American Psychological Association. 6th edition. 2009, and is intended for educational use only.)

What Doesn't Count as an Empirical Study?

  • If an article does not discuss research methods, it is unlikely to be an empirical study.

Empirical studies are unlikely to be found in popular magazines or newspapers, and if they are, they are unlikely to be reported with sufficient detail. 

Essays, textbooks, reviews of existing research or of what is known in the field, and practitioner articles are all useful, but they are not empirical research.

How to find an Empirical Study

You can find empirical studies reported in journals, in dissertations, and in government document sources like ERIC. Here is a video tutorial demonstrating some search techniques to use to find these. [LINK TO VIDEO]

  • Last Updated: Feb 15, 2024 10:59 AM
  • URL: https://libguides.sdsu.edu/Education

Empirical Research: Definition, Methods, Types and Examples

What is Empirical Research

Content Index

  • Empirical research: Definition
  • Empirical research: Origin
  • Quantitative research methods
  • Qualitative research methods
  • Steps for conducting empirical research
  • Empirical research methodology cycle
  • Advantages of empirical research
  • Disadvantages of empirical research
  • Why is there a need for empirical research?

Empirical research is defined as any research in which the conclusions of the study are drawn strictly from concrete, and therefore "verifiable," empirical evidence.

This empirical evidence can be gathered using quantitative market research and  qualitative market research  methods.

For example: a study is conducted to find out whether listening to happy music in the workplace promotes creativity. An experiment is set up, using a music website survey, in which one set of participants is exposed to happy music while working and another set listens to no music at all, and the subjects are then observed. The results derived from such a study provide empirical evidence of whether happy music does or does not promote creativity.


You must have heard the quote "I will not believe it unless I see it." This came from the ancient empiricists, a fundamental understanding that powered the emergence of medieval science during the Renaissance period and laid the foundation of modern science as we know it today. The word itself has its roots in Greek: it is derived from the Greek word empeirikos, which means "experienced."

In today's world, the word empirical refers to the collection of data using evidence gathered through observation or experience, or by using calibrated scientific instruments. All of the above origins have one thing in common: a dependence on observation and experiments to collect data and test it in order to come up with conclusions.


Types and methodologies of empirical research

Empirical research can be conducted and analysed using qualitative or quantitative methods.

  • Quantitative research: Quantitative research methods are used to gather information through numerical data. They are used to quantify opinions, behaviors, or other defined variables. These methods are predetermined and follow a more structured format. Some of the commonly used methods are surveys, longitudinal studies, polls, etc.
  • Qualitative research: Qualitative research methods are used to gather non-numerical data. They are used to find meanings, opinions, or the underlying reasons from subjects. These methods are unstructured or semi-structured. The sample size for such research is usually small, and these are conversational methods that provide more insight or in-depth information about the problem. Some of the most popular methods are focus groups, experiments, interviews, etc.

Data collected from these methods will need to be analysed. Empirical evidence can be analysed either quantitatively or qualitatively. Using this, the researcher can answer empirical questions, which have to be clearly defined and answerable with the findings obtained. The type of research design used will vary depending on the field in which it is going to be used. Many researchers might choose a combined approach involving quantitative and qualitative methods to better answer questions which cannot be studied in a laboratory setting.


Quantitative research methods aid in analyzing the empirical evidence gathered. By using these, a researcher can find out whether the hypothesis is supported or not.

  • Survey research: Survey research generally involves a large audience to collect a large amount of data. It is a quantitative method with a predetermined set of closed questions which are fairly easy to answer. Because of the simplicity of such a method, high response rates are achieved. It is one of the most commonly used methods for all kinds of research in today's world.

Previously, surveys were taken face to face only, perhaps with a recorder. However, with advancements in technology and for ease of use, new mediums such as email or social media have emerged.

For example: Depletion of energy resources is a growing concern and hence there is a need for awareness about renewable energy. According to recent studies, fossil fuels still account for around 80% of energy consumption in the United States. Even though there is a rise in the use of green energy every year, there are certain parameters because of which the general population is still not opting for green energy. In order to understand why, a survey can be conducted to gather opinions of the general population about green energy and the factors that influence their choice of switching to renewable energy. Such a survey can help institutions or governing bodies to promote appropriate awareness and incentive schemes to push the use of greener energy.


  • Experimental research: In experimental research, an experiment is set up and a hypothesis is tested by creating a situation in which one of the variables is manipulated. This is also used to check cause and effect: the manipulated variable is removed or altered to see what happens to the outcome. The process for such a method usually involves proposing a hypothesis, experimenting on it, analyzing the findings, and reporting the findings to understand whether they support the theory or not.

For example: A particular product company is trying to find what is the reason for them to not be able to capture the market. So the organisation makes changes in each one of the processes like manufacturing, marketing, sales and operations. Through the experiment they understand that sales training directly impacts the market coverage for their product. If the person is trained well, then the product will have better coverage.

  • Correlational research: Correlational research is used to find the relationship between two sets of variables. Regression analysis is generally used to predict outcomes from such a method. The correlation can be positive, negative, or neutral (no correlation).


For example: higher-educated individuals tend to get higher-paying jobs. In other words, higher education is associated with higher-paying jobs, and less education with lower-paying jobs.
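A minimal sketch of such a correlational analysis in Python, using hypothetical numbers for years of education and income (the variable names and values are illustrative only, not real data):

    from scipy import stats

    # Hypothetical data: years of education and annual income (in thousands).
    years_of_education = [10, 12, 12, 14, 16, 16, 18, 20]
    annual_income_k = [28, 35, 33, 42, 55, 51, 62, 70]

    # Pearson correlation: direction (positive/negative) and strength of the relationship.
    r, p_value = stats.pearsonr(years_of_education, annual_income_k)
    print(f"r = {r:.2f}, p = {p_value:.4f}")

    # Simple linear regression, commonly used to predict outcomes from the correlated variable.
    fit = stats.linregress(years_of_education, annual_income_k)
    print(f"Predicted income at 15 years of education: {fit.intercept + fit.slope * 15:.1f}k")

Note that even a strong correlation here describes only the relationship between the two sets of variables; on its own it does not establish that education causes higher pay.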

  • Longitudinal study: Longitudinal study is used to understand the traits or behavior of a subject under observation after repeatedly testing the subject over a period of time. Data collected from such a method can be qualitative or quantitative in nature.

For example: a study to find out the benefits of exercise. The participants are asked to exercise every day for a particular period of time, and the results show higher endurance, stamina, and muscle growth. This supports the hypothesis that exercise benefits the body.

  • Cross-sectional: A cross-sectional study is an observational type of method in which a set of people is observed at a given point in time. The set of people is chosen to be similar in all variables except the one being researched. This type does not enable the researcher to establish a cause-and-effect relationship, as the subjects are not observed over a continuous time period. It is mainly used in the healthcare sector and the retail industry.

For example: a medical study to find the prevalence of under-nutrition disorders in kids of a given population. This will involve looking at a wide range of parameters like age, ethnicity, location, income, and social background. If a significant number of kids coming from poor families show under-nutrition disorders, the researcher can investigate it further. Usually a cross-sectional study is followed by a longitudinal study to find out the exact reason.

  • Causal-comparative research: This method is based on comparison. It is mainly used to find out a cause-and-effect relationship between two or more variables.

For example: a researcher measured the productivity of employees in a company that gave its employees breaks during work and compared it to that of employees in a company that did not give breaks at all.


Some research questions need to be analysed qualitatively, as quantitative methods are not applicable there. In many cases, in-depth information is needed, or a researcher may need to observe the behavior of a target audience; hence the results needed are in a descriptive form. Qualitative research results will be descriptive rather than predictive. Qualitative work enables the researcher to build or support theories for future potential quantitative research. In such situations, qualitative research methods are used to derive a conclusion that supports the theory or hypothesis being studied.


  • Case study: The case study method is used to find more information by carefully analyzing existing cases. It is very often used for business research or to gather empirical evidence for investigation purposes. It is a method for investigating a problem within its real-life context through existing cases. The researcher has to analyse the cases carefully, making sure the parameters and variables in the existing case are the same as in the case being investigated. Using the findings from the case study, conclusions can be drawn regarding the topic being studied.

For example: a report describing the solution provided by a company to its client, the challenges faced during initiation and deployment, the findings of the case, and the solutions offered for the problems. Such case studies are used by most companies, as they form empirical evidence the company can promote in order to get more business.

  • Observational method: The observational method is a process of observing and gathering data from a target. Since it is a qualitative method, it is time consuming and very personal. It can be said that the observational research method is a part of ethnographic research, which is also used to gather empirical evidence. This is usually a qualitative form of research; however, in some cases it can be quantitative as well, depending on what is being studied.

For example: setting up a study to observe a particular animal in the Amazon rainforest. Such research usually takes a lot of time, as observation has to be done for a set amount of time to study the patterns or behavior of the subject. Another example used widely nowadays is observing people shopping in a mall to figure out the buying behavior of consumers.

  • One-on-one interview: Such a method is purely qualitative and one of the most widely used. The reason is that it enables a researcher to get precise, meaningful data if the right questions are asked. It is a conversational method in which in-depth data can be gathered depending on where the conversation leads.

For example: A one-on-one interview with the finance minister to gather data on financial policies of the country and its implications on the public.

  • Focus groups: Focus groups are used when a researcher wants to find answers to why, what and how questions. A small group is generally chosen for such a method and it is not necessary to interact with the group in person. A moderator is generally needed in case the group is being addressed in person. This is widely used by product companies to collect data about their brands and the product.

For example: a mobile phone manufacturer wanting feedback on the dimensions of one of its models that is yet to be launched. Such studies help the company meet customer demand and position the model appropriately in the market.

  • Text analysis: The text analysis method is relatively new compared to the other types. It is used to analyse social life by going through the images or words used by individuals. In today's world, with social media playing a major part in everyone's life, such a method enables the researcher to follow patterns that relate to the study.

For example: a lot of companies ask customers for detailed feedback on how satisfied they are with the customer support team. Such data enables the researcher to make appropriate decisions to improve the support team.
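As a minimal sketch of the kind of text analysis described above, the Python snippet below counts the most frequent words in a few hypothetical customer-feedback comments; a real analysis would go further (stop-word removal, sentiment scoring, and so on):

    from collections import Counter
    import re

    # Hypothetical customer feedback about a support team.
    feedback = [
        "The support team was quick and friendly.",
        "Slow response, but the support agent was friendly.",
        "Very slow support. I waited two days for a response.",
    ]

    # Tokenize into lowercase words and count the most frequent terms.
    words = re.findall(r"[a-z']+", " ".join(feedback).lower())
    print(Counter(words).most_common(5))

Patterns in such counts (for example, how often "slow" appears) give the researcher a starting point for the decisions mentioned above.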

Sometimes a combination of the methods is also needed for some questions that cannot be answered using only one type of method especially when a researcher needs to gain a complete understanding of complex subject matter.


Since empirical research is based on observation and capturing experiences, it is important to plan the steps to conduct the experiment and how to analyse it. This will enable the researcher to resolve problems or obstacles which can occur during the experiment.

Step #1: Define the purpose of the research

This is the step where the researcher has to answer questions like: What exactly do I want to find out? What is the problem statement? Are there any issues in terms of the availability of knowledge, data, time, or resources? Will this research be more beneficial than what it will cost?

Before going ahead, a researcher has to clearly define his purpose for the research and set up a plan to carry out further tasks.

Step #2: Supporting theories and relevant literature

The researcher needs to find out whether there are theories which can be linked to the research problem, and to figure out whether any theory can help support the findings. All kinds of relevant literature will help the researcher find out whether others have researched the topic before and what problems were faced during that research. The researcher will also have to set up assumptions and find out whether there is any history regarding the research problem.

Step #3: Creation of Hypothesis and measurement

Before beginning the actual research, the researcher needs a working hypothesis, or a guess at what the probable result will be. The researcher has to set up variables, decide the environment for the research, and figure out how the variables relate to each other.

The researcher will also need to define the units of measurement and the tolerable degree of error, and find out whether the chosen measurement will be acceptable to others.

Step #4: Methodology, research design and data collection

In this step, the researcher has to define a strategy for conducting the research. Experiments have to be set up to collect the data that will allow the hypothesis to be tested. The researcher will decide whether an experimental or non-experimental method is needed for conducting the research. The type of research design will vary depending on the field in which the research is being conducted. Last but not least, the researcher will have to find out the parameters that will affect the validity of the research design. Data collection will need to be done by choosing appropriate samples depending on the research question. To carry out the research, the researcher can use one of the many sampling techniques. Once data collection is complete, the researcher will have empirical data which needs to be analysed.
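One of the many sampling techniques referred to above is simple random sampling. Here is a minimal sketch in Python, assuming a purely hypothetical sampling frame of participant IDs:

    import random

    # Hypothetical sampling frame: 500 eligible participant IDs.
    sampling_frame = [f"participant_{i:03d}" for i in range(500)]

    # Draw a simple random sample of 50 participants without replacement.
    random.seed(42)  # fixed seed so the same draw can be reproduced
    sample = random.sample(sampling_frame, k=50)
    print(sample[:5])

Other techniques (stratified, cluster, or convenience sampling, for example) would be chosen instead when the research question or the population calls for them.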


Step #5: Data Analysis and result

Data analysis can be done in two ways: qualitatively and quantitatively. The researcher will need to determine whether a qualitative method, a quantitative method, or a combination of both is needed. Depending on the analysis of the data, the researcher will know whether the hypothesis is supported or rejected. Analyzing this data is the most important part of supporting the hypothesis.

Step #6: Conclusion

A report will need to be made with the findings of the research. The researcher can cite the theories and literature that support the research, and can make suggestions or recommendations for further research on the topic.

Empirical research methodology cycle

A.D. de Groot, a famous Dutch psychologist and chess expert, conducted some of the most notable experiments using chess in the 1940s. During his study, he came up with a cycle which is consistent and now widely used to conduct empirical research. It consists of five phases, with each phase being as important as the next. The empirical cycle captures the process of coming up with hypotheses about how certain subjects work or behave and then testing these hypotheses against empirical data in a systematic and rigorous way. It can be said that it characterizes the deductive approach to science. The empirical cycle is as follows.

  • Observation: At this phase, an idea is sparked for proposing a hypothesis. During this phase, empirical data is gathered using observation. For example: a particular species of flower blooms in a different color only during a specific season.
  • Induction: Inductive reasoning is then carried out to form a general conclusion from the data gathered through observation. For example: as stated above, it is observed that the species of flower blooms in a different color during a specific season. A researcher may ask the question, "Does the temperature in the season cause the color change in the flower?" The researcher can assume that is the case; however, it is mere conjecture, and hence an experiment needs to be set up to support the hypothesis. So the researcher tags a few sets of flowers kept at different temperatures and observes whether they still change color.
  • Deduction: This phase helps the researcher deduce a conclusion from the experiment. It has to be based on logic and rationality to come up with specific, unbiased results. For example: in the experiment, if the tagged flowers in a different temperature environment do not change color, then it can be concluded that temperature plays a role in changing the color of the bloom.
  • Testing: This phase involves the researcher returning to empirical methods to put the hypothesis to the test. The researcher now needs to make sense of the data and hence needs to use a statistical analysis plan to determine the temperature and bloom color relationship. If the researcher finds that most flowers bloom a different color when exposed to a certain temperature and the others do not when the temperature is different, there is support for the hypothesis. Note that this is not proof, but merely support for the hypothesis. (A minimal sketch of such an analysis appears after this list.)
  • Evaluation: This phase is generally forgotten by most, but it is an important one for continuing to gain knowledge. During this phase, the researcher puts forth the data collected, the supporting argument, and the conclusion. The researcher also states the limitations of the experiment and the hypothesis, and suggests tips for others to pick it up and continue more in-depth research in the future.

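To make the "Testing" phase above concrete, here is a minimal sketch in Python, using entirely hypothetical counts of tagged flowers that did or did not change color under the usual seasonal temperature versus a controlled different temperature. A chi-square test of independence is one simple way to examine the temperature and bloom-color relationship; the statistical analysis plan in a real study could well differ.

    from scipy.stats import chi2_contingency

    # Hypothetical counts: rows are temperature conditions,
    # columns are [changed color, did not change color].
    observed = [
        [40, 10],  # usual seasonal temperature
        [12, 38],  # controlled different temperature
    ]

    chi2, p_value, dof, expected = chi2_contingency(observed)
    print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")

    # A small p-value would count as support for (not proof of) the hypothesis
    # that temperature and bloom color are related.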

There is a reason why empirical research is one of the most widely used methods: it has a few advantages associated with it. Following are a few of them.

  • It is used to authenticate traditional research through various experiments and observations.
  • This research methodology makes the research being conducted more competent and authentic.
  • It enables a researcher to understand the dynamic changes that can happen and adjust the strategy accordingly.
  • The level of control in such a research is high so the researcher can control multiple variables.
  • It plays a vital role in increasing internal validity .

Even though empirical research makes the research more competent and authentic, it does have a few disadvantages. Following are a few of them.

  • Such research needs patience, as it can be very time consuming. The researcher has to collect data from multiple sources, and the parameters involved are numerous, which makes the research time consuming.
  • Most of the time, a researcher will need to conduct research at different locations or in different environments, which can make it an expensive undertaking.
  • There are rules governing how experiments can be performed, and hence permissions are needed. Many times, it is very difficult to get the permissions required to carry out different methods of this research.
  • Collection of data can be a problem sometimes, as it has to be collected from a variety of sources through different methods.


Empirical research is important in today's world because most people believe in something only if they can see, hear, or experience it. It is used to validate multiple hypotheses, increase human knowledge, and keep advancing various fields.

For example: pharmaceutical companies use empirical research to try out a specific drug on controlled or random groups to study cause and effect. This way, they support certain theories they had proposed for the specific drug. Such research is very important, as sometimes it can lead to finding a cure for a disease that has existed for many years. It is useful in science and many other fields like history, the social sciences, business, etc.


With the advancements in today's world, empirical research has become critical and a norm in many fields, used to support hypotheses and gain more knowledge. The methods mentioned above are very useful for carrying out such research. However, new methods will keep coming up as the nature of new investigative questions keeps changing.




  • UTEP Library
  • UTEP Library Research Guides

PSYC 3101: General Experimental Methods

Empirical articles vs. review articles.

  • Use Library Resources from Anywhere Off-Campus
  • Search Tips
  • Article Anatomy
  • Use RefWorks This link opens in a new window
  • University Writing Center
  • What is a Peer-Reviewed Article Anyway?

Empirical Articles

Review Articles

What is a Peer Reviewed Article?

Peer review is a process that many, but not all, journals use. Article manuscripts submitted to peer-reviewed journals are not automatically accepted and published.

In peer review, a panel of experts in the given field review the manuscript to determine aspects such as the quality of research, appropriateness for the journal, and relevance to the field. One of three decisions is made: accept, reject, or revise based on commentary from reviewers.

The process of peer review is thought to help ensure that high quality articles appear in journals.

Another term for peer-reviewed is  refereed . Peer-reviewed journals may also be called  scholarly.

Remember that magazines, Internet sources, and books are not the same as peer-reviewed journals.

In psychology, articles that report on original/new research studies may be referred to as primary sources or empirical.

  • Common sections in a research/empirical article include introduction, literature review, methods/process, data, results, discussion, conclusion / suggestions for further study, and references.
  • If the article is not divided into sections, it does not automatically mean it is not an empirical article.
  • You cannot assume that an article is empirical just because it is divided into sections.
  • In the methods section [which may be called something similar], or otherwise usually toward the beginning or middle of the article [if it does not have sections]: The authors will describe how they actively conducted new or original research -- such as an experiment or survey. Examples of what would likely be explained: How they identified participants, that they received Institutional Review Board (IRB) approval, control vs. experimental groups, and so forth about their research.

Articles that either interpret or analyze empirical articles are considered review articles. Such articles are often referred to as secondary sources or secondary research.

  • An entire article that is purely a literature review [usually a review of select other articles considered to be the best support for a research question/topic]
  • Systematic review
  • Meta-analysis / meta-analyses
  • Meta-synthesis / meta-syntheses
  • If an article is a review article, it is likely [but not always] to have the words literature review, systematic review, integrative review, meta-analysis, meta-synthesis, or other mention of review in the title.
  • The main indicator of a review article is if authors are just interpreting, analyzing, and/or comparing the results of empirical articles. So, in comparison to an empirical article, the authors of a review article do not describe an experiment or survey they conducted.
  • If there is a methods section, it will usually describe how the authors searched for other articles [which databases they searched, what search terms they used] and decided the criteria for articles to include and exclude as part of their review. Again, they will not be describing how they conducted new or original research, such as an experiment or survey.
  • In a literature review, for example, the authors might point out what they believe to be the most pertinent/applicable research articles.
  • Last Updated: Feb 15, 2024 11:47 AM
  • URL: https://libguides.utep.edu/psyc3101


Sociology 245: Sociology of Health (Polonijo)

  • Research Frameworks (PICO, etc.)
  • Reference and Handbooks
  • Empirical & Review Articles

Empirical Articles

Similarities & differences, types of scholarly articles, review articles.


Empirical articles are based on an experiment or study.  The authors will report the purpose of the study, the research methodology, and results. This is a familiar structure for empirical articles (IMRAD):

  • introduction
  • methods
  • results
  • discussion

In the introduction, when describing the purpose of their study, the authors present a mini literature review to discuss how previous research led up to their original research project.

Also called:

  • primary research article/source
  • primary literature article
  • original research article

Example: The prevalence of sleep disorders in college students: Impact on academic performance

Both empirical articles & literature reviews are:

  • published in journals
  • often peer-reviewed
  • written by experts in the field

They are different in one important way:

Empirical articles report the findings of a research study, while review articles assess the findings of a variety of studies on a topic.

Review articles summarize or synthesize content from earlier published research and are useful for surveying the literature on a specific research area. Review articles can lead you to empirical articles.

There are several types.

  • narrative: a literature review that describes and discusses the state of the science of a specific topic or theme.
  • systematic: a comprehensive review of all relevant studies on a particular topic/question. The systematic review is created by following an explicit methodology for identifying/selecting the studies to include and evaluating their results.
  • meta-analysis: the statistical procedure for combining data from multiple studies. This is usually, but not always, presented with a systematic review.

Example: Irwin, M. R. (2015). Why sleep is important for health: A psychoneuroimmunology perspective. Annual Review of Psychology, 66, 143–172.




Project Chapter Two: Literature Review and Steps to Writing Empirical Review

Writing an Empirical Review


A literature review chapter (project chapter two) typically comprises:

  • Conceptual review
  • Theoretical review
  • Empirical review (review of empirical works of literature/studies), and lastly
  • Conclusion or summary of the literature reviewed.

Steps to writing an empirical review:

  • Decide on a topic.
  • Highlight the studies/literature that you will review in the empirical review.
  • Analyze the works of literature separately.
  • Summarize the literature in table or concept map format.
  • Synthesize the literature and then proceed to write your empirical review.




How to empirically review the literature?

Empirical research, according to Penn State University, is based on "observed and measured phenomenon. It derives the knowledge from actual experience rather than from theory or belief". The empirical review is structured to answer specific research questions within a research paper. It therefore enables the researcher to answer questions such as: What is the problem? What methodology was used to study it? What was found? What do the findings mean?

Components of empirical review

The key components of an empirical review are the author(s), objective, methodology, findings, and implications. Each of these components is explained with the help of examples below.

When an empirical review has to cover more than 10-15 studies, it is important to present each study crisply, in a minimum number of words. The following example reflects this point precisely.

Nearly two decades ago, Kalleberg and Leicht (1991) (Authors) conducted a comparative study using longitudinal data (Methodology) in the United States to identify the factors affecting the survival and success of small businesses started by men and women (Objective). At that time, the concept of women's entrepreneurship was at a very nascent stage in India. The research findings revealed that the likelihood of success for women's businesses was the same as for men's (Findings). This is contrary to the general belief that women are inferior when it comes to entrepreneurship. The factors affecting the survival and success of a business also behaved in similar ways for both men and women (Kalleberg and Leicht, 1991). Therefore, we can conclude that there is no difference in success between entrepreneurship by the two genders (Implications).

When each study has to be presented individually and in more detail, the following presentation approach can be adopted.

“Impact of FDI on Indian Economy” (Title) by Devajit (2012) (Author)

This study examines how FDI is seen as an important economic catalyst of Indian economic growth, by stimulating domestic investment, increasing human capital formation, and facilitating technology transfers. The main purpose of the study is to investigate the impact of FDI on economic growth in India.

Methodology

An empirical review of previous studies in the period of 2008-2011.

Findings

Foreign Direct Investment (FDI), as a strategic component of investment, is needed by India for sustained economic growth and development through the creation of jobs, the expansion of existing manufacturing industries, and short- and long-term projects in healthcare, education, and research and development (R&D). The government should design FDI policy in such a way that FDI inflows can be utilized as a means of enhancing domestic production, savings, and exports through equitable distribution among states, by giving the states enough freedom to attract FDI inflows at their own level. FDI can help raise output, productivity, and exports at the sectoral level of the Indian economy. However, the observed effects on sectoral-level output, productivity, and exports are minimal, owing to the low flow of FDI into India at both the macro and the sectoral level.

Implications

Therefore, for further opening up of the Indian economy, it is advisable to open up the export-oriented sectors, since higher growth of the economy could be achieved through the growth of these sectors.

Key points to keep in mind when writing an empirical review

  • The studies should be discussed in chronological order, so that the progress of research over a specific period of time is evident.
  • There should be a link between studies that are discussed one after the other; unless you form a link between studies, there will be no flow in your writing. The link can take the form of agreement or argument (disagreement).

Example 1: In agreement

Furthermore, Lall & Sahai (2008) conducted a study of the issues and challenges faced by women entrepreneurs, using data collected from women entrepreneurs in Lucknow, India. Various psychographic variables were identified, including the degree of commitment, challenges in entrepreneurship, and future expansion plans. The characteristics of entrepreneurship identified were self-esteem, self-image, entrepreneurial passion, and the ability to handle future operational and expansion problems.

According to the study, an increasing number of women work in family-owned businesses, but with low status and greater challenges. Another, similar study (in agreement) conducted by Gupta (2008) on women entrepreneurs across the country highlighted the constraints, including lack of finance, lack of family support, and male dominance in society, that were constricting women's entry into entrepreneurship in India.

Example 2: In argument

Nearly three years later, Surthi and Sarupriya (2003) conducted research on women entrepreneurs in India to study the psychological factors that affect them. The findings showed that demographic factors such as marital status and type of family, and the way women cope with stress, affected women entrepreneurs. In addition, women living in a joint family experienced less stress than those living in a nuclear family, because those in joint families were able to share their problems with family members. (In argument) While this study identified factors affecting women's entrepreneurship, a study conducted by Mohiuddin (2006) set out to determine the reasons why women opt for entrepreneurship in India. The reasons identified were: (1) economic need; (2) personality needs; (3) utilization of knowledge gained through education; (4) family occupation; and (5) to pass leisure time.

  • In addition, the variables identified in each empirical study have to be used to form the conceptual framework at the end of the Literature Review chapter.
  • Similarly, to refine your empirical review further, a meta-analysis can be conducted: statistical tools are applied to combine the results of the different studies, typically when the number of studies to be analyzed is more than 25 (a minimal illustration of the pooling step follows this list).
  • Penn State University (n.d.). Empirical Research. Retrieved from https://www.libraries.psu.edu/psul/researchguides/edupsych/empirical.html [Accessed 7 Sep 2015].
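To make the pooling step concrete, here is a minimal, hypothetical Python sketch (not part of the original article): the simplest fixed-effect approach weights each study's effect estimate by the inverse of its variance. The effect sizes and standard errors are invented purely for illustration.

```python
# A minimal sketch of fixed-effect meta-analysis by inverse-variance weighting:
# each study contributes an effect estimate and its standard error; the pooled
# effect weights studies by 1 / SE^2.
import numpy as np

# Hypothetical effect sizes and standard errors from five studies.
effects = np.array([0.30, 0.12, 0.45, 0.22, 0.05])
std_errors = np.array([0.10, 0.08, 0.20, 0.15, 0.05])

weights = 1.0 / std_errors**2                  # inverse-variance weights
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))     # standard error of the pooled estimate

print(f"Pooled effect: {pooled:.3f} (SE = {pooled_se:.3f})")
print(f"95% CI: [{pooled - 1.96*pooled_se:.3f}, {pooled + 1.96*pooled_se:.3f}]")
```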


HOW TO WRITE EMPIRICAL REVIEW FOR UNDERGRADUATE PROJECT TOPICS


An empirical review is very important in undergraduate project writing. A good empirical review has many benefits, and every undergraduate project student is expected to understand the technicalities involved in writing one.

        The empirical review is usually the last section of chapter two of an undergraduate project. However, some undergraduate project topics do not require an empirical review; these include topics in departments such as:

  • mathematics
  • mechanical engineering
  • civil engineering
  • chemical engineering
  • electrical electronics engineering
  • agricultural science and engineering, etc.

Outside the departments listed above, departments such as the following do require an empirical review:

  • adult education
  • business education
  • business administration
  • chemistry education
  • banking and finance
  • human resource management etc

In short, all undergraduate project topics in any social science department use an empirical review. You can now see how important the empirical review is.

        One more benefit of the empirical review: it is very helpful for identifying the gap in existing research. Any good researcher will agree that a good research hypothesis comes from the literature review.

        Students who are new to research start writing their projects from chapter one, but once you become experienced in research project writing, you will end up starting your project from chapter two.

        If you want to write an empirical review, look out for research works by previous authors. For each piece of literature, note the author's name, the year the research was carried out, the topic studied, the findings (which could be percentages or other figures), and the conclusion.

        Let us take a practical approach. Consider the topic "Effect of taxes on the economic development of Nigeria". Below is an example of an empirical review for this topic; as you go through it, compare it with the elements listed above.

        uniprojectmaterials and research (2020), in their study of taxes and the economic growth of the U.S. economy, considered a large sample of countries and documented differences of 0.2 to 0.3 percentage points in growth rates in response to a major tax reform. They stated that such small effects could have a large cumulative impact on living standards.

        uniprojectmaterials (2011), using simple regression analysis and descriptive statistics, found that the ratio of VAT to GDP in Nigeria averaged 1.3%, compared to 4.5% in Indonesia, even though VAT revenue accounts for as much as 95% of significant variation in Nigeria's GDP.

        uniprojectmaterials (2013), in a survey-based study, administered a questionnaire to 40 respondents to generate data, which was measured by a simple majority or percentage of opinions. The study found that greater tax compliance is significantly associated with adequate campaigns and the judicious utilization of tax funds.



Prosocial Behavior and Well-Being: An Empirical Review of the Role of Basic Psychological Need Satisfaction

Affiliation.

  • 1 University of Rochester.
  • PMID: 38358728
  • DOI: 10.1080/00223980.2024.2307377

Although prosocial behavior is positively associated with one's well-being, researchers have yet to reach a consensus on the role played by basic psychological need satisfaction (BPNS) in this association. A systematic review of the existing empirical literature is conducted in this article to summarize and synthesize the relationship between prosocial behavior and well-being, with a special emphasis on the multifaceted role of BPNS (i.e., mediation, moderation, and concurrent mediation and moderation). Nineteen articles were identified that met the criteria of matching the research focus and being empirical and peer-reviewed. Results suggest that BPNS can act as a mediator, as a moderator, and in combined mediation and moderation roles. Prosocial behavior can both individually and jointly satisfy the three needs for autonomy, competence, and relatedness, thus enhancing well-being. Moreover, the positive correlation between prosocial behavior and well-being can be augmented by a high level of satisfaction of one or multiple needs. Furthermore, those who have higher satisfaction of autonomy, competence, or relatedness display a greater increase in well-being after engaging in prosocial behavior, an effect that can be mediated by BPNS. Drawing on these findings, the current body of work is evaluated in terms of its strengths and weaknesses, and potential future directions are explored.

Keywords: Prosocial behavior; basic psychological need satisfaction; empirical review; well-being.


  • Open access
  • Published: 07 February 2024

Genomic prediction using machine learning: a comparison of the performance of regularized regression, ensemble, instance-based and deep learning methods on synthetic and empirical data

  • Vanda M. Lourenço 1 ,
  • Joseph O. Ogutu 2 ,
  • Rui A.P. Rodrigues 1 ,
  • Alexandra Posekany 3 &
  • Hans-Peter Piepho 2  

BMC Genomics volume  25 , Article number:  152 ( 2024 ) Cite this article


The accurate prediction of genomic breeding values is central to genomic selection in both plant and animal breeding studies. Genomic prediction involves the use of thousands of molecular markers spanning the entire genome and therefore requires methods able to efficiently handle high dimensional data. Not surprisingly, machine learning methods are becoming widely advocated for and used in genomic prediction studies. These methods encompass different groups of supervised and unsupervised learning methods. Although several studies have compared the predictive performances of individual methods, studies comparing the predictive performance of different groups of methods are rare. However, such studies are crucial for identifying (i) groups of methods with superior genomic predictive performance and assessing (ii) the merits and demerits of such groups of methods relative to each other and to the established classical methods. Here, we comparatively evaluate the genomic predictive performance and informally assess the computational cost of several groups of supervised machine learning methods, specifically, regularized regression methods, deep , ensemble and instance-based learning algorithms, using one simulated animal breeding dataset and three empirical maize breeding datasets obtained from a commercial breeding program.

Our results show that the relative predictive performance and computational expense of the groups of machine learning methods depend upon both the data and target traits and that for classical regularized methods, increasing model complexity can incur huge computational costs but does not necessarily always improve predictive accuracy. Thus, despite their greater complexity and computational burden, neither the adaptive nor the group regularized methods clearly improved upon the results of their simple regularized counterparts. This rules out selection of one procedure among machine learning methods for routine use in genomic prediction. The results also show that, because of their competitive predictive performance, computational efficiency, simplicity and therefore relatively few tuning parameters, the classical linear mixed model and regularized regression methods are likely to remain strong contenders for genomic prediction.

Conclusions

The dependence of predictive performance and computational burden on the target datasets and traits calls for increasing investment in enhancing the computational efficiency of machine learning algorithms and in computing resources.

Background

Rapid advances in genotyping and phenotyping technologies have enabled widespread and growing use of genomic prediction (GP). The very high dimensional nature of both genotypic and phenotypic data, however, is increasingly limiting the utility of the classical statistical methods. As a result, machine learning (ML) methods able to efficiently handle high dimensional data are becoming widely used in GP. This is especially so because, compared to many other methods used in GP, ML methods possess the significant advantage of being able to model nonlinear relationships between the response and the predictors and complex interactions among predictor variables. However, this often comes at the price of a very high computational burden. Often, however, computational cost is less likely to present serious challenges if the number of SNPs in a dataset is relatively modest but it can become increasingly debilitating as the number of markers grows to millions or even tens of millions. Future advances in computational efficiencies of machine learning algorithms or using high-performance or more efficient programming languages may progressively ameliorate this limitation. Given their growing utility and popularity, it is important to establish the relative predictive performance of different groups of ML methods in GP. Even so, the formal comparative evaluation of the predictive performance of groups of ML methods has attracted relatively little attention. The rising importance of ML methods in plant and animal breeding research and practice, increases both the urgency and importance of evaluating the relative predictive performance of groups of ML methods relative to each other and to classical methods. This can facilitate identification of groups of ML methods that balance high predictive accuracy with low computational cost for routine use with high dimensional phenotypic and genomic data, such as for GP, say.

ML is perhaps one of the most widely used branches of contemporary artificial intelligence. Using ML methods facilitates automation of model building, learning and efficient and accurate predictions. ML algorithms can be subdivided into two major classes: supervised and unsupervised learning algorithms. Supervised regression ML methods encompass regularized regression methods, deep, ensemble and instance-based learning algorithms. Supervised ML methods have been successfully used to predict genomic breeding values for unphenotyped genotypes, a crucial step in genome-enabled selection [ 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 ]. Furthermore, several studies have assessed the relative predictive performance of supervised ML methods in GP, including two ensemble methods and one instance-based method [ 5 ]; four regularized and two adaptive regularized methods [ 6 ]; three regularized and five regularized group methods [ 9 ] and several deep learning methods [ 1 , 2 , 3 , 4 , 8 ]. However, no study has comprehensively evaluated the comparative predictive performance of all these groups of methods relative to each other or to the classical regularized regression methods. We therefore rigorously evaluate the comparative predictive performance as well as the computational complexity or cost of three groups of popular and state-of-the-art ML methods for GP using one simulated animal dataset and three empirical datasets obtained from a commercial maize breeding program. We additionally offer brief overviews of the mathematical properties of the methods with emphasis on their salient properties, strengths and weaknesses and relationships with each other and with the classical regularization methods. While we offer a somewhat comprehensive review of genomic prediction methods with a specific emphasis on ML, our contribution extends to showcasing novel findings derived from comparative assessments of ML techniques across both real and simulated datasets.

Besides ML methods, Bayesian methods are also becoming widely used for genomic prediction [ 3 , 8 , 10 ]. So, even though our goal is not to provide an exhaustive review of all genomic prediction methods, we offer two Bayesian methods for benchmarking the performance of the ML methods.

The rest of the paper is organized as follows. First we present the synthetic and real datasets. Second, we detail the methods compared in this study. Next, the results from the comparative analyses of the data are presented. Finally, a discussion of the results and closing remarks follow.

Simulated (animal) data

We consider one simulated dataset [ 9 ], an animal breeding outbred population simulated for the 16-th QTLMAS Workshop 2012 (Additional file 1 ). The simulation models used to generate the data are described in detail in [ 11 ] and are therefore not reproduced here. The dataset consists of 4020 individuals genotyped for 9969 SNP markers. Out of these, 3000 individuals were phenotyped for three quantitative milk traits and the remaining 1020 were not phenotyped (see [ 9 ] for details). The goal of the analysis of the simulated dataset is to predict the genomic breeding values (PGBVs) for the 1020 unphenotyped individuals using the available genomic information. The simulated dataset also provides true genomic breeding values (TGBVs) for the 1020 genotypes for all the traits.

As in [ 9 ], to enable model fitting for the grouping methods, markers were grouped by assigning consecutive SNP markers systematically to groups of sizes 10, 20, ..., 100 separately for each of the five chromosomes. Typically, the last group of each grouping scheme has fewer SNPs than the prescribed group size. Table 1 summarizes the simulated phenotypic data and highlights differences in the magnitudes of the three simulated quantitative traits \(T_1\) , \(T_2\) and \(T_3\) .
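To illustrate the grouping scheme, the short Python sketch below (an assumed implementation, not the authors' code) assigns consecutive markers within each chromosome to groups of a fixed size; as in the paper, the last group on a chromosome may contain fewer SNPs than the prescribed group size.

```python
# A minimal sketch of grouping consecutive SNP markers within each chromosome.
import numpy as np

def group_markers(chromosomes, group_size):
    """Assign consecutive markers within each chromosome to groups of
    `group_size` (markers are assumed sorted by chromosome and position)."""
    chromosomes = np.asarray(chromosomes)
    labels = np.empty(len(chromosomes), dtype=object)
    for chrom in np.unique(chromosomes):
        idx = np.where(chromosomes == chrom)[0]
        within = np.arange(len(idx)) // group_size        # 0,0,...,1,1,...
        labels[idx] = [f"chr{chrom}_g{g}" for g in within]
    return labels

# Hypothetical example: 25 markers on 2 chromosomes, groups of size 10
# -> groups of 10 and 3 markers on chromosome 1, and 10 and 2 on chromosome 2.
chroms = [1] * 13 + [2] * 12
print(group_markers(chroms, group_size=10))
```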

Real (plant) data

For the application to empirical data sets, we use three empirical maize breeding datasets produced by KWS (breeding company) for the Synbreed project during 2010, 2011 and 2012. We first performed separate phenotypic analyses of yield for each of the three real maize data sets to derive the adjusted means used in genomic prediction using a single stage mixed model assuming that genotypes are uncorrelated (Additional file 4 , S1 Text). The fixed effect in the mixed model comprised a tester (Tester) with two levels, genotypic group (GRP) with three levels, Tester \(\times\) GRP and Tester \(\times\) GRP \(\times\) G (G=genotype). The random factors were location (LOC), trial (TRIAL) nested within location, replicate (REP) nested within trial and block (BLOCK) nested within replicate. The fitted random effects were LOC, LOC \(\times\) TRIAL, LOC \(\times\) TRIAL \(\times\) REP, LOC \(\times\) TRIAL \(\times\) REP \(\times\) BLOCK, Tester \(\times\) GRP \(\times\) SWITCH2 \(\times\) G1 and Tester \(\times\) GRP \(\times\) SWITCH1 \(\times\) G2. SWITCH1 and SWITCH2 in the last two effects are operators defined and explained briefly in the supplementary materials (Additional file 4 , S1 text; and Additional file 5 , Section 1) and in greater detail in [ 12 , 13 ]. All the three maize datasets involved two testers and three genotypic groups. Accordingly, prior to genomic prediction, we accounted for and removed the effect of the tester \(\times\) genotypic group (GRP) effect from the adjusted means (lsmeans) of maize yield (dt/ha) by computing the arithmetic mean of the lsmeans for the interaction of testers with GRP for the genotyped lines. This mean was then subtracted from the lsmeans for each tester \(\times\) GRP interaction term. The resulting deviations were subtracted from the lsmeans of the individual genotypes corresponding to each Tester \(\times\) GRP interaction. This enabled us not to consider the Tester \(\times\) GRP effect in the genomic prediction model.

For all the years, every line was genotyped for 32217 SNP markers. A subset of the SNP markers with non-zero variances was split into groups of sizes 10, 20, 30, 40, 50, 60, 70, 80, 90 and 100. Groups were defined by systematically grouping consecutive and spatially adjacent markers, separately for each of 10 chromosomes (Additional file 4 , S2 Text). All the checks (standard varieties) and check markers were deleted prior to genomic prediction. More details specific to the three datasets follow (Table 2 summarizes the number of genotypes in the training and validation datasets). The true breeding values are not known in this case.

For each of the 2010, 2011 and 2012 datasets, the genotypes or test crosses were genotyped for 32217 SNPs and randomly split into 5 parts (folds) for 5-fold cross-validation (Additional file 4 , S3 Text & S4 Text). The random splitting procedure was repeated 10 times to yield 10 replicates per dataset. The total number of genotypes and the number of individuals assigned to the training and validation sets for each dataset are provided in Table 2 .

Table 3 summarizes the KWS phenotypic data for 2010, 2011 and 2012. Each data split for each year (2010, 2011 and 2012) contained approximately 20% of the phenotypic observations and was obtained using stratified random sampling using the algorithm of [ 14 ]. The strata were defined by the combinations of the two testers and three genotypic groups.
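The cross-validation layout described above can be sketched as follows. This is a hedged illustration using scikit-learn; the sample size, labels and random seeds are invented, and the authors' actual stratified-sampling algorithm [ 14 ] is not reproduced here.

```python
# A minimal sketch of repeated stratified 5-fold cross-validation: splits are
# stratified by the combination of tester and genotypic group, and the random
# splitting is repeated 10 times.
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

rng = np.random.default_rng(0)
n = 900                                        # hypothetical number of genotypes
tester = rng.choice(["T1", "T2"], size=n)      # two testers
grp = rng.choice(["G1", "G2", "G3"], size=n)   # three genotypic groups
strata = np.char.add(tester, grp)              # tester x genotypic-group combinations

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
for train_idx, val_idx in cv.split(np.zeros((n, 1)), strata):
    # roughly 80% of genotypes train the model, 20% are held out for validation
    pass  # fit a genomic prediction model on train_idx, predict on val_idx
print(cv.get_n_splits(), "train/validation splits (5 folds x 10 repeats)")
```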

In this section we describe the four groups of supervised ML methods.

Regularized regression methods

Consider the general linear regression model

$$y_i=\beta _0+\sum _{j=1}^{p}x_{ij}\beta _j+\varepsilon _i,\qquad i=1,\dots ,n, \qquad \qquad (1)$$

where \(y_i\) is the i-th observation of the response variable, \(x_{ij}\) is the i-th observation of the j-th covariate (p is the number of all covariates), \(\beta _j\) are the regression coefficients (unknown fixed parameters), \(\varepsilon _i\) are i.i.d. random error terms with \(E(\varepsilon _i)=0\) and \(var(\varepsilon _i)=\sigma ^2_e\), where \(\sigma ^2_e\) is an unknown random variance, and n is the sample size. The ordinary least squares estimator of \(\varvec{\beta }=(\beta _0,\dots ,\beta _p)'\), which is unbiased, is obtained by minimizing the residual sum of squares (RSS), i.e.,

$$\widehat{\varvec{\beta }}_{ols}=\underset{\varvec{\beta }}{\arg \min }\ \text{RSS}(\varvec{\beta })=\underset{\varvec{\beta }}{\arg \min }\sum _{i=1}^{n}\Big (y_i-\beta _0-\sum _{j=1}^{p}x_{ij}\beta _j\Big )^2.$$

This estimator is typically not suitable when the design matrix \(\textbf{X}\) is less than full rank (\(\textbf{X}\) has a full rank if the number of its linearly independent rows or columns is \(k=\min (p,n)\)) or is close to collinearity (i.e., the covariates are close to being linear combinations of one another) [ 15 ]; problems that are frequently associated with \(p>>n\).

In genomic prediction (GP) one is interested in estimating the p regression coefficients \(\beta _j\) so that genomic breeding values of non-phenotyped genotypes can be predicted from the fitted model. The response variable \(\textbf{y}\) is often some quantitative trait and the \(\beta _j\) ’s are the coefficients of molecular markers spanning the whole genome, usually Single Nucleotide Polymorphisms (SNPs). Because in GP typically \(p>>n\) , the ordinary least squares (OLS) estimator breaks down and thus other methods for estimating \(\varvec{\beta }\) in ( 1 ) must be sought. Indeed, the increasingly high dimensional nature of high-throughput SNP-marker datasets has prompted increasing use of the power and versatility of regularization methods in genomic prediction to simultaneously select and estimate important markers and account for multicollinearity [ 5 , 6 ].

Without loss of generality, we assume, consistent with the standard practice in regularized estimation where a distance-based metric is used for prediction, that the response variable is mean-centered whereas the covariates in ( 1 ) are standardized, so that

$$\sum _{i=1}^{n}y_i=0,\qquad \sum _{i=1}^{n}x_{ij}=0\quad \text{and}\quad \frac{1}{n}\sum _{i=1}^{n}x_{ij}^2=1,\qquad j=1,\dots ,p.$$

Regularized regression methods minimize a non-negative loss function (RSS or other) plus a non-negative penalty function. Standardizing the covariates prior to model fitting ensures that the penalty is applied evenly to all covariates. Mean-centering the response and the covariates is usually done for notational simplicity but also eliminates the need to estimate the intercept \(\beta _0\) .
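As a concrete illustration, a minimal Python sketch of this pre-processing step might look as follows; the marker matrix and phenotype are simulated placeholders, not data from the paper.

```python
# A minimal sketch of mean-centring the response and standardizing each marker
# column so that the penalty is applied evenly to all covariates.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 1000                                   # hypothetical n << p marker data
X = rng.choice([0.0, 1.0, 2.0], size=(n, p))       # biallelic SNPs coded 0/1/2
y = rng.normal(size=n)                             # placeholder phenotype

y_c = y - y.mean()                                 # mean-centre the response
X_c = X - X.mean(axis=0)                           # mean-centre each covariate
s = np.sqrt((X_c ** 2).mean(axis=0))               # s_j = sqrt(mean of squares)
s[s == 0] = 1.0                                    # guard against monomorphic markers
X_std = X_c / s                                    # standardized covariates
```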

After the penalized models have been fit, the final estimates are obtained by back transformation to the original scale by re-introducing an intercept (\(\beta _0\)). In particular, for a mean-centered response \(\textbf{y}\) and standardized predictor \(\textbf{X}^{\varvec{*}}\), predictions are obtained by

$$\widehat{\textbf{y}}=\beta _0+\textbf{X}^{\varvec{*}}\widehat{\varvec{\beta }}^*,$$

with \(\widehat{\varvec{\beta }}^*=(\widehat{\beta }^*_1,\dots ,\widehat{\beta }^*_p)\), the regression coefficients from the model fit with the mean-centered response \(\textbf{y}\) and standardized covariates \(\textbf{X}^{\varvec{*}}\), \({\textbf{X}}^*_j=(x_{1j},\dots ,x_{nj})'\) the j-th covariate and \(\beta _0=\bar{\textbf{y}}\). One can also choose to predict using the original predictor \(\textbf{X}\) without standardization. In that case one should back transform the \(\widehat{\beta }^*_j\) to the original scale and consider

$$\widehat{\textbf{y}}=\beta _0+\textbf{X}\widehat{\varvec{\beta }},$$

with \(\widehat{\beta }_j=\widehat{\beta }^*_j/s_j\), \(s_j=\sqrt{n^{-1}\sum \limits _{i=1}^nx_{ij}^2}\) the standard deviation of the j-th covariate \({\textbf{X}}^*_j\) and \(\beta _0 = \bar{\textbf{y}}- {\tilde{\textbf{X}}} \widehat{\varvec{\beta }}\), where \({\tilde{\textbf{X}}}_j=(m_j,\dots ,m_j)'\) is a vector of size n with \(m_j\) being the mean of the j-th covariate \({\textbf{X}}^*_j\).

The primary goal of regularization methods is to reduce model complexity resulting from high dimensionality by reducing the number of predictors in the model. This is achieved by either shrinking some coefficients to become exactly zero, and so drop out of the model, or shrinking all coefficients to be close to zero and each other but not exactly zero. Ideally, a desirable estimator of \(\varvec{\beta }\) should (i) correctly select the nonzero coefficients with probability converging to 1 (i.e. with near certainty; selection consistency ) and (ii) yield estimators of the nonzero coefficients that are asymptotically normal with the same means and covariances that they would have if the zero coefficients were known exactly in advance ( asymptotic normality ). An estimator satisfying these two conditions is said to possess the oracle property [ 16 , 17 ].

For the remainder of the paper, we assume that \({\textbf {X}}\) is an \(n\times p\) marker matrix (e.g., with the genotypes \(\{aa,Aa,AA\}\) coded as \(\{0,1,2\}\) or \(\{-1,0,1\}\) for p biallelic SNPs under an additive model) with \({\textbf {X}}_j\) denoting the j-th SNP covariate and \(\varvec{\beta }=(\beta _1,\dots ,\beta _p)\) denoting the unknown vector of marker effects. Table 4 (upper half) summarizes the methods discussed in this sub-section.

Bridge-type estimators

The most popular regularization methods in genomic prediction include ridge regression (RR; [ 18 ]), the least absolute shrinkage and selection operator (LASSO; [ 19 ]) and the elastic net (ENET; [ 20 ]). All these methods are special cases of the bridge estimator [ 15 , 21 ] given by

$$\widehat{\varvec{\beta }}_{bridge}=\underset{\varvec{\beta }}{\arg \min }\Big \{\sum _{i=1}^{n}\Big (y_i-\sum _{j=1}^{p}x_{ij}\beta _j\Big )^2+\lambda \sum _{j=1}^{p}\vert \beta _j\vert ^{\gamma }\Big \},\qquad \lambda \ge 0,\ \gamma \ge 0,$$

where the regularization parameter \(\lambda\) balances the goodness-of-fit against model complexity and the shrinkage parameter \(\gamma\) determines the order of the penalty function. The optimal combination of \(\lambda\) and \(\gamma\) can be selected adaptively for each dataset by grid search using cross-validation (CV; if the focus is on predictive performance) or by information criteria (e.g., AIC or BIC; if the focus is on model fit). Bridge regression automatically selects relevant predictors when \(0<\gamma \le 1\), shrinks the coefficients when \(\gamma >1\) and reduces to subset selection when \(\gamma =0\). The bridge estimator reduces to the LASSO estimator when \(\gamma =1\) and to the ridge estimator when \(\gamma =2\). Specifically,

$$\widehat{\varvec{\beta }}_{lasso}=\underset{\varvec{\beta }}{\arg \min }\Big \{\Vert \textbf{y}-\textbf{X}\varvec{\beta }\Vert _2^2+\lambda \Vert \varvec{\beta }\Vert _1\Big \},$$

where \(\Vert . \Vert _1\) is the \(\ell _1\)-norm, and

$$\widehat{\varvec{\beta }}_{ridge}=\underset{\varvec{\beta }}{\arg \min }\Big \{\Vert \textbf{y}-\textbf{X}\varvec{\beta }\Vert _2^2+\lambda \Vert \varvec{\beta }\Vert _2^2\Big \},$$

where \(\Vert . \Vert _2\) is the \(\ell _2\)-norm.

The bridge estimator also enjoys several other useful and interesting properties (see [ 22 , 23 ] for more details). We summarize these salient properties with emphasis on the special cases of the LASSO ( \(\gamma =1\) ) and the ridge estimators ( \(\gamma =2\) ).
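For illustration, the ridge and LASSO special cases can be fitted with off-the-shelf software, with \(\lambda\) chosen by cross-validated grid search as described above. The sketch below uses scikit-learn (where \(\lambda\) is called alpha) on simulated placeholder data; it is not the authors' code.

```python
# A minimal sketch of fitting the LASSO and ridge special cases of the bridge
# estimator, with the regularization parameter chosen by cross-validation.
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(2)
n, p = 200, 2000                                   # hypothetical p >> n marker data
X = rng.choice([0.0, 1.0, 2.0], size=(n, p))
beta_true = np.zeros(p)
beta_true[:20] = rng.normal(size=20)               # 20 causal markers (for illustration)
y = X @ beta_true + rng.normal(scale=2.0, size=n)

# LASSO: penalty weight selected by 5-fold cross-validation over a grid.
lasso = LassoCV(cv=5, n_alphas=50, max_iter=50_000).fit(X, y)
print("LASSO: alpha =", lasso.alpha_, "non-zero effects =", np.sum(lasso.coef_ != 0))

# Ridge: penalty weight selected by cross-validation over an explicit grid (RR-CV).
ridge = RidgeCV(alphas=np.logspace(-2, 4, 25), cv=5).fit(X, y)
print("Ridge: alpha =", ridge.alpha_)
```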

The asymptotic properties of bridge estimators have been studied in detail by [ 22 ]. In particular, where \(p<n\) , with p increasing to infinity as n grows, and under appropriate regularity conditions, bridge estimators enjoy the oracle property for \(0<\gamma <1\) . This implies that neither the LASSO nor the ridge estimator possesses the oracle property [ 16 , 17 ]. If \(p>>n\) and no assumptions are imposed on the covariate matrix, then the regression parameters are generally non-identifiable. However, if a suitable structure is assumed for the covariate matrix, then bridge estimators achieve consistent variable selection and estimation [ 22 ].

Although the LASSO estimator performs automatic variable selection, it is a biased and inconsistent estimator [ 24 , 25 ]. Moreover, it is unstable with high-dimensional data because it

  • cannot select a larger number of predictors p than the sample size n if \(p>>n\);
  • arbitrarily selects one member of a set of pairwise highly correlated predictors and ignores the others.

The ridge estimator performs well when there are many predictors, each with a small effect, but it cannot shrink the coefficients to become exactly zero. Moreover, the ridge estimator

  • prevents coefficients of linear regression models with many correlated variables from being poorly determined and exhibiting high variance;
  • shrinks coefficients of correlated predictors equally towards zero and towards each other;
  • retains all predictor variables in the model, leading to complex and less interpretable models.

In addition, RR has close connections with marker-based best linear unbiased prediction (BLUP) and genomic best linear unbiased prediction (GBLUP) [ 26 ], which we clarify in what follows. The ridge estimator is given by

$$\widehat{\varvec{\beta }}_{ridge}=({\textbf{X}}'{\textbf{X}}+\lambda {\textbf{I}})^{-1}{\textbf{X}}'{\textbf{y}},$$

where, if \(\lambda\) is estimated by cross-validation as suggested above, then the ridge estimator may be denoted by RR-CV. Another way of looking at the ridge estimator is to assume in ( 1 ) that \(\varvec{\beta }\sim N({\textbf {0}},{\textbf {I}}\sigma ^2_{\beta })\) is a random vector of unknown marker effects and that \(\varvec{\varepsilon }\sim N({\textbf {0}},{\textbf {I}}\sigma ^2_{e})\) is an unknown random error term, where \(\sigma ^2_{\beta }\) and \(\sigma ^2_{e}\) are the unknown marker-effect and error variances, respectively. Model ( 1 ), written in matrix form as

$${\textbf{y}}={\textbf{X}}\varvec{\beta }+\varvec{\varepsilon },\qquad \qquad (5)$$

is now a linear mixed model and hence, the variances can be estimated via the restricted maximum likelihood (REML) method. Observing that \({\textbf{y}}\sim N({\varvec{0}},{\textbf{K}}\sigma ^2_{\beta }+{\textbf{I}}\sigma ^2_{\varepsilon })\), where \({\textbf{K}}={\textbf{X}}{\textbf{X}}'\) is the kinship or genomic relationship matrix, the BLUP solution for the marker effects under model ( 5 ) is given by ([ 27 ]; p.270)

$$\widehat{\varvec{\beta }}_{BLUP}=\sigma ^2_{\beta }\,{\textbf{X}}'\big ({\textbf{K}}\sigma ^2_{\beta }+{\textbf{I}}\sigma ^2_{\varepsilon }\big )^{-1}{\textbf{y}}.$$

Now defining \({\textbf {H}}={\textbf {I}} \frac{\sigma ^2_{\varepsilon }}{\sigma ^2_{\beta }}\) to simplify the notation, so that \(\widehat{\varvec{\beta }}_{BLUP}={\textbf{X}}'({\textbf{K}}+{\textbf{H}})^{-1}{\textbf{y}}\), and pre-multiplying \(\widehat{\varvec{\beta }}_{BLUP}\) with \(({\textbf {X}}'{\textbf {X}}+{\textbf {H}})^{-1}{\textbf {X}}'({\textbf {K}}+{\textbf {H}}){\textbf {K}}^{-1}{\textbf {X}}\) we obtain

$$({\textbf {X}}'{\textbf {X}}+{\textbf {H}})^{-1}{\textbf {X}}'({\textbf {K}}+{\textbf {H}}){\textbf {K}}^{-1}{\textbf {X}}\,\widehat{\varvec{\beta }}_{BLUP}=({\textbf {X}}'{\textbf {X}}+{\textbf {H}})^{-1}{\textbf {X}}'{\textbf {y}}.$$

Finally, observing that \(({\textbf {X}}'{\textbf {X}}+{\textbf {H}})^{-1}{\textbf {X}}'({\textbf {K}}+{\textbf {H}}){\textbf {K}}^{-1}{\textbf {X}}={\textbf {X}}'{\textbf {K}}^{-1}{\textbf {X}}\) (see Appendix ) and that \({\textbf {X}}'{\textbf {K}}^{-1}{\textbf {X}}{\textbf {X}}'={\textbf {X}}'\) we find that

$$\widehat{\varvec{\beta }}_{BLUP}=({\textbf {X}}'{\textbf {X}}+{\textbf {H}})^{-1}{\textbf {X}}'{\textbf {y}},$$

establishing the equivalence of BLUP and RR [ 28 , 29 ] and that one can actually estimate the ridge parameter \(\lambda\) by \(\widehat{\lambda }=\frac{\widehat{\sigma }^2_{e}}{\widehat{\sigma }^2_{\beta }}\). Because we use REML to estimate the two variance components in \(\widehat{\varvec{\beta }}_{BLUP}\), we refer to this RR approach as RR-REML. Our basic regression model ( 5 ) can be written as

$${\textbf{y}}={\textbf{g}}+\varvec{\varepsilon },$$

where \({\textbf {g}}={\textbf {X}}\varvec{\beta }\). Making the same assumptions as for RR-REML, i.e., assuming that \(\varvec{\beta }\sim N({\textbf {0}},{\textbf {I}}\sigma ^2_{\beta })\) and \(\varvec{\varepsilon }\sim N({\textbf {0}},{\textbf {I}}\sigma ^2_{e})\), we have that \({\textbf {g}}\sim N({\textbf {0}},{\textbf {K}}\sigma ^2_{\beta })\). The BLUP of \({\textbf {g}}\), also known as genomic estimated breeding values (GEBV) or gBLUP, under this model is ([ 27 ]; p.270)

$$\widehat{{\textbf {g}}}_{BLUP}={\textbf {K}}({\textbf {K}}+{\textbf {H}})^{-1}{\textbf {y}}.$$

Now pre-multiplying \(\widehat{{\textbf {g}}}_{BLUP}\) with \({\textbf {X}}({\textbf {X}}'{\textbf {X}}+{\textbf {H}})^{-1}{\textbf {X}}'({\textbf {K}}+{\textbf {H}}){\textbf {K}}^{-1}\) we obtain

$${\textbf {X}}({\textbf {X}}'{\textbf {X}}+{\textbf {H}})^{-1}{\textbf {X}}'({\textbf {K}}+{\textbf {H}}){\textbf {K}}^{-1}\,\widehat{{\textbf {g}}}_{BLUP}={\textbf {X}}({\textbf {X}}'{\textbf {X}}+{\textbf {H}})^{-1}{\textbf {X}}'{\textbf {y}}={\textbf {X}}\widehat{\varvec{\beta }}_{BLUP}.$$

Finally, observing that \({\textbf {X}}({\textbf {X}}'{\textbf {X}}+{\textbf {H}})^{-1}{\textbf {X}}'({\textbf {K}}+{\textbf {H}}){\textbf {K}}^{-1}={\textbf {I}}\) (see Appendix ), we find that \(\widehat{{\textbf {g}}}_{BLUP}={\textbf {X}}\widehat{\varvec{\beta }}_{BLUP}\), establishing the equivalence of RR-REML and gBLUP [ 30 , 31 ].
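The two equivalences can be checked numerically. The following minimal sketch (simulated placeholder data, not from the paper) verifies that the ridge solution, the marker-BLUP solution and gBLUP coincide when \(\lambda =\sigma ^2_{e}/\sigma ^2_{\beta }\).

```python
# A minimal numerical check of the stated identities:
#   beta_ridge = (X'X + lam I)^{-1} X' y  ==  X'(K + lam I)^{-1} y  (marker BLUP)
#   g_blup     = K (K + lam I)^{-1} y     ==  X beta_ridge          (gBLUP)
# with K = X X' and lam playing the role of sigma2_e / sigma2_beta.
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 500                                  # hypothetical n < p marker matrix
X = rng.choice([-1.0, 0.0, 1.0], size=(n, p))
y = rng.normal(size=n)
lam = 2.5                                       # assumed variance ratio

K = X @ X.T                                     # n x n genomic relationship matrix
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)   # p x p system
beta_blup = X.T @ np.linalg.solve(K + lam * np.eye(n), y)          # n x n system
g_blup = K @ np.linalg.solve(K + lam * np.eye(n), y)

print(np.allclose(beta_ridge, beta_blup))       # True: ridge == marker BLUP
print(np.allclose(g_blup, X @ beta_ridge))      # True: gBLUP == X * beta_hat
```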

Due to the nature of the \(\ell _1\) penalty, particularly for high values of \(\lambda\) , the LASSO estimator will shrink many coefficients to exactly zero, something that never happens with the ridge estimator.

Elastic net estimator

The elastic net estimator blends two bridge-type estimators, the LASSO and the ridge, to produce a composite estimator that reduces to the LASSO when \(\lambda _2=0\) and to the ridge when \(\lambda _1=0\). Specifically, the elastic net estimator is specified by

$$\widehat{\varvec{\beta }}_{enet}=k\,\underset{\varvec{\beta }}{\arg \min }\Big \{\Vert \textbf{y}-\textbf{X}\varvec{\beta }\Vert _2^2+\lambda _2\Vert \varvec{\beta }\Vert _2^2+\lambda _1\Vert \varvec{\beta }\Vert _1\Big \},$$

with \(k=1+\lambda _2\) if the predictors are standardized (as we assume) or \(k=1+\lambda _2/n\) otherwise. Even when \(\lambda _1,\lambda _2\ne 0\) , the elastic net estimator behaves much like the LASSO but with the added advantage of being robust to extreme correlations among predictors. Moreover, the elastic net estimator is able to select more than n predictors when \(p>>n\) . Model sparsity occurs as a consequence of the \(\ell _1\) penalty term. Mazumder et al. [ 32 ] proposed an estimation procedure based on sparse principal components analysis (PCA), which produces an even more sparse model than the original formulation of the elastic net estimator [ 20 ]. Because it blends two bridge-type estimators, neither of which enjoys the oracle property, the ENET also lacks the oracle property.
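For illustration, the elastic net can be fitted with scikit-learn, which parameterizes the two penalties through a single alpha and an l1_ratio mixing weight rather than \((\lambda _1,\lambda _2)\) directly. The sketch below, on simulated placeholder data, is an assumption about tooling, not the authors' code.

```python
# A minimal sketch of the elastic net with both tuning parameters chosen by CV.
# scikit-learn's penalty is alpha * (l1_ratio * ||b||_1 + 0.5 * (1 - l1_ratio) * ||b||_2^2);
# l1_ratio = 1 recovers the LASSO and small l1_ratio approaches the ridge.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(4)
n, p = 200, 2000
X = rng.choice([0.0, 1.0, 2.0], size=(n, p))
beta = np.zeros(p)
beta[:30] = rng.normal(size=30)
y = X @ beta + rng.normal(scale=2.0, size=n)

enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5, n_alphas=30,
                    max_iter=50_000).fit(X, y)
print("selected l1_ratio:", enet.l1_ratio_, "alpha:", enet.alpha_)
print("non-zero marker effects:", np.sum(enet.coef_ != 0))
```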

Other competitive regularization methods that are asymptotically oracle efficient ( \(p<n\) with p increasing to infinity with n ), which do not fall into the category of bridge-type estimators, are the smoothly clipped absolute deviations (SCAD [ 17 , 33 ]) and the minimax concave penalty (MCP [ 25 , 34 ]) methods. Details of the penalty functions and other important properties of both methods can be found elsewhere [ 9 , 35 ].

Adaptive regularized regression methods

The adaptive regularization methods are extensions of the regularized regression methods that allow the resulting estimators to achieve the oracle property under certain regularity conditions. Table 4 (lower half) summarizes the adaptive methods considered here.

Adaptive bridge-type estimators

Adaptive bridge estimators extend the bridge estimators by introducing weights in the penalty term. More precisely,

$$\widehat{\varvec{\beta }}_{abridge}=\underset{\varvec{\beta }}{\arg \min }\Big \{\sum _{i=1}^{n}\Big (y_i-\sum _{j=1}^{p}x_{ij}\beta _j\Big )^2+\lambda \sum _{j=1}^{p}{w}_j\vert \beta _j\vert ^{\gamma }\Big \},$$

where \(\{{w}_j\}_{j=1}^p\) are adaptive data-driven weights. As with the bridge-type estimator, the adaptive bridge estimator simplifies to the adaptive LASSO ( a LASSO) estimator when \(\gamma =1\) and to the adaptive ridge estimator when \(\gamma =2\) . Chen et al. [ 36 ] studied the properties of adaptive bridge estimators for the particular case when \(p<n\) (with p increasing to infinity with n ), \(0<\gamma <2\) and \({w}_j=(\vert \widehat{\beta }_j^{init}\vert )^{-1}\) with \(\widehat{\varvec{\beta }}^{init}=\widehat{\varvec{\beta }}_{ols}\) . They showed that for \(0<\gamma <1\) , and under additional model assumptions, adaptive bridge estimators enjoy the oracle property. For \(p>>n\) , \(\widehat{\varvec{\beta }}_{ols}\) cannot be computed and thus other initial estimates, such as \(\widehat{\varvec{\beta }}_{ridge}\) , have to be used. Theoretical properties of the adaptive bridge estimator for \(p>>n\) do not seem to have been well studied thus far.

The adaptive LASSO estimator was proposed by [ 37 ] to remedy the problem of the lack of the oracle property of the LASSO estimator [ 16 , 17 ]. The penalty for the adaptive LASSO is given by (adaptive bridge estimator with \(\gamma =1\))

$$\lambda \sum _{j=1}^{p}{w}_j\vert \beta _j\vert ,$$

where the adaptive data-driven weights \(\{{w}_j\}_{j=1}^p\) can be computed as \({w}_j=(\vert \widehat{\beta }_j^{init}\vert )^{-\nu }\) with \(\widehat{\varvec{\beta }}^{init}\) an initial root-n consistent estimate of \(\varvec{\beta }\) obtained through least squares (or ridge regression if multicollinearity is important) and \(\nu\) is a positive constant. Consequently,

$$\widehat{\varvec{\beta }}_{alasso}=\underset{\varvec{\beta }}{\arg \min }\Big \{\sum _{i=1}^{n}\Big (y_i-\sum _{j=1}^{p}x_{ij}\beta _j\Big )^2+\lambda \sum _{j=1}^{p}{w}_j\vert \beta _j\vert \Big \},$$

with \(\nu\) chosen appropriately, performs as well as the oracle, i.e., the adaptive LASSO achieves the oracle property. Nevertheless, this estimator still inherits the LASSO’s instability with high dimensional data. The values of \(\lambda\) and \(\nu\) can be simultaneously selected from a grid of values, with values of \(\nu\) selected from \(\{0.5,1,2\}\) , using two-dimensional cross-validation [ 37 ].
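A common way to implement the adaptive LASSO in practice is the reweighting trick: rescale each column by \(1/w_j\), fit an ordinary LASSO, and rescale the coefficients back. The sketch below (simulated data, ridge initial estimates because \(p>>n\), and an arbitrary \(\nu =1\)) is one possible implementation, not the authors'.

```python
# A minimal sketch of the adaptive LASSO via column reweighting.
import numpy as np
from sklearn.linear_model import LassoCV, Ridge

rng = np.random.default_rng(5)
n, p = 200, 1000
X = rng.choice([0.0, 1.0, 2.0], size=(n, p))
beta = np.zeros(p)
beta[:15] = rng.normal(size=15)
y = X @ beta + rng.normal(size=n)

nu = 1.0
beta_init = Ridge(alpha=1.0).fit(X, y).coef_          # initial (ridge) estimates
w = (np.abs(beta_init) + 1e-8) ** (-nu)               # adaptive weights w_j
X_w = X / w                                           # column j scaled by 1 / w_j
lasso = LassoCV(cv=5, max_iter=50_000).fit(X_w, y)    # ordinary LASSO on scaled data
beta_alasso = lasso.coef_ / w                         # map back to the original scale
print("non-zero adaptive-LASSO effects:", np.sum(beta_alasso != 0))
```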

Grandvalet [ 38 ] shows that the adaptive ridge estimator (adaptive bridge estimator with \(\gamma =2\) ) is equivalent to the LASSO in the sense that both produce the same estimate and thus the adaptive ridge is not considered further.

Adaptive elastic-net

The adaptive elastic-net ( a ENET) combines the ridge and a LASSO penalties to achieve the oracle property [ 39 ] while at the same time alleviating the instability of the a LASSO with high dimensional data. The method first computes \(\widehat{\varvec{\beta }}_{enet}\) as described above for the elastic net estimator, then constructs the adaptive weights as \(\widehat{w}_j=(|\widehat{\beta }_{j,enet}|)^{-\nu }\) , where \(\nu\) is a positive constant, and then solves

\(\widehat{\varvec{\beta }}_{aenet}=k\cdot \underset{\varvec{\beta }}{\arg \min }\Big \{\textrm{RSS}+\lambda _2\Vert \varvec{\beta }\Vert _2^2+\lambda _1\sum \limits _{j=1}^p \widehat{w}_j\vert \beta _j\vert \Big \},\)

where \(k=1+\lambda _2\) if the predictors are standardized (as we assume) or \(k=1+\lambda _2/n\) otherwise. In particular, when \(\lambda _2=0\) the adaptive elastic-net reduces to the a LASSO estimator. This is also the case when the design matrix is orthogonal regardless of the value of \(\lambda _2\) [ 20 , 37 , 39 ].

Other adaptive regularization methods are the multi-step adaptive ENET ( ma ENET), the adaptive smoothly clipped absolute deviations ( a SCAD) and the adaptive minimax concave penalty ( a MCP) methods. Details of the penalty functions and noteworthy properties of the latter three methods are summarized elsewhere [ 6 , 40 ].

Regularized group regression methods

Regularized regression methods that select individual predictors do not exploit information on potential grouping structure among markers, such as that arising from the association of markers with particular Quantitative Trait Loci (QTL) on a chromosome or haplotype blocks, to enhance the accuracy of genomic prediction. The nearby SNP markers in such groups are linked, producing highly correlated predictors. If such grouping structure is present but is ignored by using models that select individual predictors only, then such models may be inefficient or even inappropriate, reducing the accuracy of genomic prediction [ 9 ]. Regularized group regression methods are regularized regression methods with penalty functions that enable the selection of the important groups of covariates and include group bridge ( g bridge), group LASSO ( g LASSO), group SCAD ( g SCAD) and group MCP ( g MCP) methods (see [ 9 , 41 , 42 , 43 , 44 , 45 , 46 ] for detailed reviews). Some grouping methods such as the group bridge, sparse group LASSO ( sg LASSO) and group MCP, besides allowing for group selection, also select the important members of each group [ 43 ] and are therefore said to perform bi-level selection, i.e., group-wise and within-group variable selection. Bi-level selection is appropriate if predictors are not distinct but have a common underlying grouping structure.

Estimators and penalty functions for the regularized grouped methods can be formulated as follows. Consider subsets \(A_1,\ldots ,A_L\) of \(\{1,\dots ,p\}\) ( L being the total number of covariate groups), representing known covariate groupings of design vectors, which may or may not overlap. Let \(\varvec{\beta }_{A_l}=(\beta _k , k \in A_l)\) be the regression coefficients in the l -th group and \(p_l\) the cardinality of the l -th group (i.e., the number of unique elements in \(A_l\) ). Regularized group regression methods estimate \(\varvec{\beta }=(\varvec{\beta }_{A_1},...,\varvec{\beta }_{A_L})'\) by minimizing

\(\sum \limits _{i=1}^n\Big (y_i-\sum \limits _{l=1}^L {\textbf{X}}_{il}\varvec{\beta }_{A_l}\Big )^2+\texttt{p}_{\lambda }(\varvec{\beta }), \qquad \qquad (10)\)

where \({\textbf{X}}_{.l}\) is a matrix with columns corresponding to the predictors in group l and \({\textbf{X}}_{il}\) denotes its i -th row.

Because \(\sum \limits _{i=1}^n\Big (y_i-\sum \limits _{l=1}^L {\textbf{X}}_{il}\varvec{\beta }_{A_l}\Big )^2\) in ( 10 ) is equivalent to RSS some authors use the RSS formulation directly. It is assumed that all the covariates belong to at least one of the groups. Table 5 summarizes the methods described in this section.

Group bridge-type estimators

Group bridge-type estimators use in ( 10 ) the penalty term \(p_{\lambda }(\varvec{\beta })=\lambda \sum \limits _{l=1}^L c_l\Vert \varvec{\beta }_{A_l}\Vert _1^{\gamma }\) with \(c_l\) constants that adjust for the different sizes of the groups. The group bridge-type estimators are thus obtained as

\(\widehat{\varvec{\beta }}_{gbridge}=\underset{\varvec{\beta }}{\arg \min }\Big \{\sum \limits _{i=1}^n\Big (y_i-\sum \limits _{l=1}^L {\textbf{X}}_{il}\varvec{\beta }_{A_l}\Big )^2+\lambda \sum \limits _{l=1}^L c_l\Vert \varvec{\beta }_{A_l}\Vert _1^{\gamma }\Big \}.\)

A simple and usual choice for the \(c_l\) constants is to take each \(c_l\propto p_l^{1-\gamma }\) . When \(0<\gamma <1\) group bridge can be used simultaneously for group and individual variable selection. Also, note that under these assumptions, the group bridge estimator correctly selects groups with nonzero coefficients with probability converging to one under reasonable regularity conditions, i.e., it enjoys the oracle group selection property (see [ 47 ] for details). When the group sizes are all equal to one, i.e., \(p_l=1 \ \forall \ 1\le l \le L\) , then group bridge estimators reduce to the bridge estimators.

Group LASSO and sparse group LASSO

Group LASSO regression uses in ( 10 ) the penalty function \(\texttt{p}_{\lambda }(\varvec{\beta })=\lambda \sum \limits _{l=1}^L\sqrt{p_l}||\varvec{\beta }_{A_l}||_2\) . The group LASSO estimator is thus given by

\(\widehat{\varvec{\beta }}_{glasso}=\underset{\varvec{\beta }}{\arg \min }\Big \{\sum \limits _{i=1}^n\Big (y_i-\sum \limits _{l=1}^L {\textbf{X}}_{il}\varvec{\beta }_{A_l}\Big )^2+\lambda \sum \limits _{l=1}^L\sqrt{p_l}\Vert \varvec{\beta }_{A_l}\Vert _2\Big \}.\)

Unlike the group bridge estimator ( \(0<\gamma <1\) ), g LASSO is designed for group selection, but does not select individual variables within the groups. Indeed, its formulation is more akin to that of the adaptive ridge estimator [ 47 ]. As with the group-bridge estimator, when the group sizes are all equal to one, i.e., \(p_l=1 \ \forall \ 1\le l \le L\) , the g LASSO estimator reduces to the LASSO estimator.
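The block soft-thresholding that drives group selection can be sketched with a few lines of proximal gradient descent. This is only an illustrative NumPy implementation of the g LASSO objective above (the paper fits the group methods with the R packages listed in Table S1); the group labels, \(\lambda\) , step size and number of iterations are arbitrary illustrative choices.

```python
import numpy as np

def group_lasso_prox_grad(X, y, groups, lam, n_iter=500):
    """Proximal gradient descent for 0.5*||y - X beta||^2 + lam * sum_l sqrt(p_l)*||beta_l||_2."""
    n, p = X.shape
    beta = np.zeros(p)
    step = 1.0 / np.linalg.norm(X, 2) ** 2        # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)               # gradient of the least-squares term
        z = beta - step * grad
        for g in np.unique(groups):
            idx = groups == g
            thresh = step * lam * np.sqrt(idx.sum())
            norm = np.linalg.norm(z[idx])
            # Block soft-thresholding: shrink the whole group, or zero it out entirely.
            beta[idx] = 0.0 if norm <= thresh else (1 - thresh / norm) * z[idx]
    return beta

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 40))
groups = np.repeat(np.arange(8), 5)               # 8 groups of 5 adjacent markers (illustrative)
beta_true = np.zeros(40)
beta_true[:5] = 1.5
beta_true[10:15] = -1.0
y = X @ beta_true + rng.normal(size=150)
beta_hat = group_lasso_prox_grad(X, y, groups, lam=20.0)
print("groups kept:", sorted(set(groups[beta_hat != 0])))
```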

Because the g LASSO does not yield sparsity within a group (it either discards or retains a whole group of covariates) the sparse group lasso ( sg LASSO), which blends the LASSO and the g LASSO penalties, was proposed [ 48 , 49 ]. Specifically, the sg LASSO estimator is given by

where \(\alpha \in [0,1]\) provides a convex combination of the lasso and group lasso penalties ( \(\alpha =0\) gives the g LASSO fit, \(\alpha =1\) gives the LASSO fit). The g LASSO is superior to the standard LASSO under the strong group sparsity and certain other conditions, including a group sparse eigenvalue condition [ 50 ]. Because the sgLASSO lacks the oracle property, the adaptive sparse group LASSO was recently proposed to remedy this drawback [ 51 ].

Note that there are two types of sparsity, i.e., (i) “groupwise sparsity”, which refers to the number of groups with at least one nonzero coefficient, and (ii) “within group sparsity” that refers to the number of nonzero coefficients within each nonzero group. The “overall sparsity” usually refers to the total number of non-zero coefficients regardless of grouping.

Other group regularization methods are the hierarchical group LASSO ( h LASSO), the group smoothly clipped absolute deviations ( g SCAD) and the group minimax concave penalty ( g MCP) methods. Details of the penalty functions and salient properties of these methods can be found in [ 9 , 52 , 53 , 54 , 55 ].

Bayesian regularized estimators

The two Bayesian methods we consider are based on the Bayesian basic linear regression model [ 10 ]. They assume a continuous response \({\textbf{y}}=(y_1, \ldots , y_n)\) so that the regression equation can be represented as \(y_i = \eta _i + \varepsilon _i\) , where \(\eta _i\) is a linear predictor (the expected value of \(y_i\) given predictors) and \(\varepsilon _i\) are independent normal model residuals with mean zero and variance \(w_i^2\sigma ^2_{\varepsilon }\) , with \(w_i\) representing user defined weights and \(\sigma ^2_{\varepsilon }\) is a residual variance parameter. The model structure for the linear predictor \(\varvec{\eta }\) is constructed as follows

with an intercept \(\mu\) (equivalent to \(\beta _0\) in equation ( 1 )), design \(n\times p\) matrix \({\textbf{X}}\) for predictor vectors \({\textbf{X}}_j = (x_{ij})\) and fixed effects vectors \(\varvec{\beta }_j\) associated with the the predictors \({\textbf{X}}_j\) .

The likelihood function of the data has the following conditional distribution:

with the general parameter vector \(\varvec{\theta }\) representing the vector of all unknowns, such as the intercept, all the regression coefficients and random effects, the residual variance as well as parameters and hyper-parameters subject to inference in the hierarchical Bayesian model.

The prior distribution factorises as follows:

In the basic form of the model the following prior settings are typically chosen:

The intercept is assigned a vague Gaussian prior \(p(\mu ) = \frac{1}{\sqrt{2 \cdot \pi } \sigma _M} e^{-\frac{\mu ^2}{2 \cdot \sigma _M^2}}\) with the prior hyper-parameter \(\sigma _M^2\) chosen to be very large so that the prior is effectively flat.

The residual variance is assigned a scaled-inverse \(\chi ^2\) density \(p(\sigma ^2_{\varepsilon }) = \chi ^{-2}(S_{\varepsilon }\vert \text {df}_{\varepsilon })\) with degrees of freedom parameter \(\text {df}_{\varepsilon }\) (> 0) and scale parameter \(\text {S}_{\varepsilon }\) (> 0).

The priors for the regression coefficients \(\beta _{jk}\) can be chosen in different ways, for example, as flat priors similar to the intercept, which is considered an uninformative choice. Choosing informative priors not only provides a chance to introduce information on the coefficients known from previous runs of the study, but also allows performing penalized or regularized regression, such as Ridge regression or the LASSO through the choice of suitable priors.

Those coefficients utilizing flat priors are called “fixed” effects, as the estimation of the posterior is based only on information contained in the data itself, encoded by the likelihood. This is the reference model for regularised Bayesian models.

Choosing a Gaussian prior, according to [ 18 ], yields Ridge regression shrinkage estimation. Similar to [ 10 ] we call this approach the Bayesian ridge regression. Choosing double-exponential priors corresponds to the Bayesian LASSO model [ 10 ].
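The paper fits these models with the BGLR R package [ 10 ] via MCMC. As a rough Python analogue for the Gaussian-prior case only, scikit-learn's BayesianRidge estimates ridge-type shrinkage by marginal-likelihood maximization rather than sampling; the double-exponential (Bayesian LASSO) prior has no direct scikit-learn counterpart, so the sketch below illustrates the idea rather than reimplementing the method. The simulated data are illustrative.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 80))
y = X[:, :6] @ rng.normal(size=6) + rng.normal(size=120)

# Gaussian priors on the coefficients with gamma hyperpriors on the precisions;
# this gives ridge-type shrinkage estimated by evidence maximization, i.e. an
# analogue of Bayesian ridge regression, not a reimplementation of BGLR.
model = BayesianRidge().fit(X, y)
print("posterior mean of first 6 coefficients:", np.round(model.coef_[:6], 2))
```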

Ensemble methods

Ensemble methods build multiple models using a given learning algorithm and then combine their predictions to produce an optimal estimate. The two most commonly used approaches are bagging (or bragging) and boosting . Whereas bagging combines the predictions of multiple models (e.g., classification or regression trees), each fitted to a bootstrap sample of the data, into an average prediction, boosting is a stagewise process in which each stage attempts to improve the predictions of the previous stage by up-weighting poorly predicted values. Below, we briefly discuss two popular ensemble methods, namely random forests, an extension of bagging, and gradient boosting algorithms. Note that, although variable scaling (centering or standardizing) might accelerate convergence of the learning algorithms, the ensemble methods do not require it; the collection of partition rules used by tree-based ensembles does not change with scaling.

Random forests (RF)

The random forests algorithm is an ensemble algorithm that combines unpruned decision (classification or regression) trees, each grown using a bootstrap sample of the training data and randomly selected (without replacement) subsets of the predictor variables (features) as candidates for splitting tree nodes. The randomness introduced by bootstrapping and selecting a random subset of the predictors reduces the variance of the random forest estimator, often at the cost of a slight increase in bias. The RF regression prediction for a new observation \(y_i\) , say \(\widehat{y}_i^B\) , is made by averaging the output of the ensemble of B trees \(\{T(y_i,\Psi _b)\}_{b=1,...,B}\) as [ 56 ]

\(\widehat{y}_i^B=\frac{1}{B}\sum \limits _{b=1}^B T(y_i,\Psi _b),\)

where \(\Psi _b\) characterizes the b -th RF tree in terms of split variables, cut points at each node, and terminal node values. Recommendations on how to select the number of trees to grow, the number of covariates to be randomly chosen at each tree node and the minimum size of terminal nodes of trees, below which no split is attempted, are provided by [ 57 , 58 ]. We refer to [ 56 , 57 , 58 ] for further details on RF regression.
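A minimal random forest regression sketch in Python follows (the paper uses the randomForest R package [ 57 ]); the number of trees, max_features and node-size settings are illustrative defaults rather than the tuned values used in the study, and the data are simulated.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 200))
y = X[:, 0] * 2 + X[:, 1] ** 2 + rng.normal(size=300)

# B = 500 bootstrap trees; max_features controls how many markers are tried at each
# split (sqrt(p) is a common default, not the paper's tuned value).
rf = RandomForestRegressor(n_estimators=500, max_features="sqrt",
                           min_samples_leaf=5, oob_score=True, random_state=0)
rf.fit(X, y)
print("out-of-bag R^2:", round(rf.oob_score_, 3))
y_new = rf.predict(X[:5])                 # average of the 500 tree predictions
```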

Stochastic gradient boosting (SGB)

Boosting enhances the predictive performance of base learners such as classification or regression trees, each of which performs only slightly better than random guessing, to become arbitrarily strong [ 56 ]. As with RF, boosting algorithms can handle interactions and nonlinear relationships, automatically select variables, and are robust to outliers, missing data and numerous correlated and irrelevant variables. In regression, boosting is an additive expansion of the form

\(f({\textbf {X}})=\sum \limits _{m=1}^M\beta _m h({\textbf {X}};\gamma _m),\)

where \(\beta _1,\dots ,\beta _M\) are the expansion coefficients and the basis functions \(h({\textbf {X}};\gamma )\) , the base learners, are functions of the multivariate argument \({\textbf {X}}\) , characterized by a set of parameters \(\gamma =(\gamma _1,\dots ,\gamma _M)\) . Typically these models are fit by minimizing a loss function L (e.g., the squared-error loss) averaged over the training data

\(\underset{\{\beta _m,\gamma _m\}_{m=1}^{M}}{\min }\ \sum \limits _{i=1}^n L\Big (y_i,\sum \limits _{m=1}^M\beta _m h({\textbf {x}}_i;\gamma _m)\Big ).\)

We used regression trees as basis functions, in which the parameters \(\gamma _m\) are the splitting variables, split points at the internal nodes, and the predictions at the terminal nodes. Boosting regression trees involves generating a sequence of trees, each grown on the residuals of the previous tree. Prediction is accomplished by weighting the ensemble outputs of all the regression trees. We refer to [ 49 , 56 , 59 ] for further details on SGB (see, e.g., [ 59 ] for the interpretation of boosting in terms of regression for a continuous, normally distributed response variable).
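A corresponding stochastic gradient boosting sketch with scikit-learn is given below; shallow regression trees as base learners, the default squared-error loss and subsample < 1 mirror the description above, but all settings are illustrative (the paper uses the gbm R package and tunes these by cross-validation), and the data are simulated.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 200))
y = X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=300)

# Default squared-error loss; subsample=0.5 draws a random half of the training data
# at each boosting stage, which is what makes the procedure "stochastic".
sgb = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                max_depth=3, subsample=0.5, random_state=0)
sgb.fit(X, y)
print("training MSE:", round(np.mean((sgb.predict(X) - y) ** 2), 3))
```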

Instance-based methods

For the instance-based methods, scaling before applying the method is crucially important. Scaling the variables (features) prior to model fitting prevents possible numerical difficulties in the intermediate calculations and keeps numeric variables with large magnitude and range from dominating those with smaller magnitude and range.

Support vector machines

Support vector machines (SVM) is a popular supervised learning technique for classification and regression of a quantitative response y on a set of predictors, in which case the method is called support vector regression or SVR [ 60 ]. In particular, SVR uses the model

with \({\textbf {x}}_i=(x_{i1},\dots ,x_{ip})'\) and where the approximating function \(f({\textbf {x}}_i)\) is a linear combination of basis functions \(h({\textbf {x}}_i)^T\) , which can be linear (or nonlinear) transformations of \({\textbf {x}}_i\) . The goal of SVR is to find a function f such that \(f({\textbf {x}}_i)\) deviates from \(y_i\) by a value no greater than \(\varepsilon\) for each training point \({\textbf {x}}_i\) , and at the same time is as flat as possible. This so-called \(\varepsilon\) -insensitive SVR, or simply \(\varepsilon\) -SVR, thus fits model ( 14 ) by ignoring residuals that are smaller in absolute value than \(\varepsilon\) and applying a linear loss function to larger residuals. The choice of the loss function (e.g., linear, quadratic, Huber) usually depends on the noise distribution of the data, the desired level of sparsity and the computational complexity.

If Eq. ( 14 ) is the usual linear regression model, i.e., \(y_i=f({\textbf {x}}_i)=\beta _0+{\textbf {x}}_i^T\varvec{\beta }\) , one considers the following minimization problem

where \(\lambda\) is the regularization parameter (cost) that controls the trade-off between flatness and error tolerance, \(\Vert .\Vert\) refers to the norm under a Hilbert space (i.e., \(\Vert \textbf{x} \Vert = \sqrt{\langle \textbf{x}{,} \textbf{x}\rangle }\) with \(\textbf{x}\) a \(p\ge 1\) dimensional vector) and

is an \(\varepsilon\) -insensitive linear loss. Given the minimizers of ( 15 ) \(\hat{\beta }_0\) and \(\hat{\varvec{\beta }}\) , the solution function has the form

where \(\hat{\alpha }^*_i, \ \hat{\alpha }_i\) are positive weights given to each observation (i.e., to the vector \({\textbf{x}}_i\) ) estimated from the data. Typically only a subset of the differences \((\hat{\alpha }_i^*-\hat{\alpha }_i)\) are non-zero; the observations associated with them are the so-called support vectors , hence the name of the method. More details on SVM can be found in [ 56 ].
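A short \(\varepsilon\) -SVR sketch with scikit-learn follows (the paper uses the e1071 R package); the standardization step reflects the scaling advice above, while C, \(\varepsilon\) , the linear kernel and the simulated data are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(6)
X = rng.normal(size=(250, 100))
y = X[:, :4] @ np.ones(4) + rng.normal(size=250)

# epsilon sets the width of the insensitive tube; C plays the role of the inverse
# regularization parameter, trading off flatness against error tolerance.
X_std = StandardScaler().fit_transform(X)
svr = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(X_std, y)
print("number of support vectors:", len(svr.support_))
```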

Deep learning methods

Deep learning (DL) algorithms are implemented through neural networks, which encompass an assortment of architectures (e.g., convolutional, recurrent and densely connected neural networks) and depend on many parameters and hyperparameters whose careful optimization is crucial to enhancing predictive accuracy and minimizing overfitting (see [ 8 , 61 , 62 , 63 , 64 , 65 ] for further insights into DL architectures and other particulars and the supplementary materials https://github.com/miguelperezenciso/DLpipeline of [ 8 ] for a list of the main DL hyperparameters, their role and related optimization issues). It can be very challenging to achieve great improvements in predictive accuracy in genomic prediction studies with DL because hyperparameter optimization can be extremely demanding and also because DL requires very large training datasets which might not always be available [ 1 , 2 , 3 , 4 ].

After selecting a DL architecture there is usually a large set of parameters to be set in order to minimize some fitting criterion such as least squares or some measure of entropy from some training data (network training). Therefore, an optimization method must also be selected. The three top ranked optimizers for neural networks are mini-batch gradient descent, gradient descent with momentum and adaptive moment estimation (ADAM; [ 66 ]). Among the three, the mini-batch gradient descent and Adam are usually preferred, because they perform well most of the time. In terms of convergence speed, ADAM is often clearly the winner and thus a natural choice [ 67 ].

Next, we offer a few more details on the feed-forward and convolutional neural networks, which, besides being some of the most popular DL architectures, are well suited for regression problems. These models can be represented graphically as a set of inputs linked to the outputs through one or more hidden layers. Figure 1 a represents such a model (either FFNN or CNN) with a single hidden layer.

Figure 1. Graphical representation of a a feed-forward neural network (FFNN) with one hidden layer; and b a convolution of a filter \((v_1,v_2,v_3)\) , with stride=2, on the Input Channel \((x_1,x_2,\dots )\) . The result is in the Output Channel \((y_1,y_2,\dots )\)

Further details on neural networks in general and FFNN and CNN in particular can be found in [ 1 , 2 , 3 , 4 , 8 , 56 ]. Note that, to avoid potential numerical difficulties, it is recommended that both the target (response variable; here assumed to be continuous and normally distributed), and the features (covariates) are standardized prior to training the network [ 8 ].

Feed-forward neural network (FFNN)

A feed-forward neural network (FFNN), also known in the literature as a multi-layer perceptron (MLP), is a neural network that does not assume a specific structure in the input features (i.e., in the covariates). This neural network consists of an input layer, an output layer and multiple hidden layers between the input and output layers.

The model for a FFNN with one hidden layer expressed as a multiple linear regression model ( 1 ) is given by

\(y_i=\alpha +\sum \limits _{h}w_h\,\phi \Big (\alpha _h+\sum \limits _{j=1}^p w_{jh}x_{ij}\Big )+e_i,\)

where the \(y_i\) (output) and \(x_{ij}\) (input) are defined as in model ( 1 ), \(\alpha\) is the output bias, h runs over the units of the hidden layer, \(\alpha _h\) refers to the bias of the h -th unit of the hidden layer, \(w_{jh}\) refer to the weights between the inputs and the hidden layer, \(w_h\) refer to the weights between the hidden layer and the output, \(\phi\) is the activation function of the hidden layer, and \(e_i\) is a residual error term as in ( 1 ). The model parameters \(\alpha\) , \(\alpha _h\) , \(w_h\) and \(w_{jh}\) are unknown network parameters that need to be estimated in the network training process.
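As an illustration of this single-hidden-layer model, the Keras/TensorFlow sketch below fits an FFNN with the ADAM optimizer and a mean squared error loss; the layer size, number of epochs and simulated data are assumptions for illustration, not the calibrated settings reported in Tables S6-S7.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 300)).astype("float32")   # standardized marker matrix (simulated)
y = (X[:, :8] @ rng.normal(size=8)).astype("float32")

# One hidden layer: Dense(32) holds the weights w_jh and biases alpha_h with
# activation phi = ReLU; the final Dense(1) gives alpha + sum_h w_h * (.).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")          # ADAM, as recommended in the text
model.fit(X, y, epochs=20, batch_size=32, validation_split=0.2, verbose=0)
```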

Convolutional neural network (CNN)

A convolution neural network (CNN) is a neural network that contains one or more convolution layers, which are defined by a set of filters. Although a CNN generally refers to a 2-dimensional neural network, which is used for image analysis, in this study we consider a 1-dimensional (1D) CNN. Here, the input to the 1D convolution layer is a vector \({\textbf{x}}=(x_1,\dots ,x_p)\) equal to one row of the \(n\times p\) marker matrix \(\textbf{X}\) . The 1D convolution filter is defined by a vector \({\textbf{v}}=(v_1,\dots ,v_d)\) where \(d<p\) . The convolution of a filter \({\textbf{v}}\) with \({\textbf{x}}\) , which is called a channel , is a vector \({\textbf{y}}=(y_1,y_2,\dots )\) satisfying

\(y_k=\sum \limits _{j=1}^d v_j\,x_{(k-1)s+j},\qquad k=1,2,\dots ,\)

where s , i.e., the stride length, is the shift displacement of the filter across the input data. An activation function is applied after each convolution to produce an output. Figure 1 b depicts a 1D convolution of a filter \((v_1,v_2,v_3)\) on the input vector \((x_1,x_2,\dots ,x_9,\dots )\) , considering a stride of length \(s=2\) , which results in the output channel \((y_1,y_2,\dots )\) . Filter values \(v_1,\dots , v_d\) are model parameters that are estimated in the neural network training process.
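A matching 1D-CNN sketch in Keras/TensorFlow follows: Conv1D with kernel_size=3 and strides=2 mirrors the filter \((v_1,v_2,v_3)\) with stride \(s=2\) of Fig. 1b, while the number of filters, epochs and simulated data are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(8)
p = 300
X = rng.normal(size=(500, p, 1)).astype("float32")   # each marker row treated as a 1-channel sequence
y = (X[:, :8, 0] @ rng.normal(size=8)).astype("float32")

# Conv1D learns the filter values v_1,...,v_d; strides=2 is the shift displacement s.
model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(filters=16, kernel_size=3, strides=2, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=10, batch_size=32, verbose=0)
```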

Performance assessment

For the simulated dataset, we assessed predictive performance using predictive accuracy (PA), the Pearson correlation between the predicted (PGBVs) and the simulated true (TGBVs) breeding values. For all the three KWS empirical data sets, predictive performance was expressed as predictive ability (PA), the Pearson correlation between the PGBVs and the observed (adjusted means estimated from phenotypic analysis) genomic breeding values (OGBVs), also calculated using cross validation. The simulated true breeding values are specified in the simulation model and therefore are known exactly. In contrast, for empirical data, the true breeding values are unknown and are approximated by the observed breeding values estimated as adjusted means during phenotypic analysis. The higher the PA, the better is the relative predictive performance of a method. We additionally assessed the predictive performance of the methods using the out-of-sample mean squared prediction error (MSPE) and the mean absolute prediction error (MAPE). Specifically,

\(\textrm{PA}=\frac{\sum \limits _{i=1}^{n}(y_i-\bar{y})(\hat{y}_i-\bar{\hat{y}})}{\sqrt{\sum \limits _{i=1}^{n}(y_i-\bar{y})^2\sum \limits _{i=1}^{n}(\hat{y}_i-\bar{\hat{y}})^2}},\qquad \textrm{MSPE}=\frac{1}{n}\sum \limits _{i=1}^{n}(y_i-\hat{y}_i)^2,\qquad \textrm{MAPE}=\frac{1}{n}\sum \limits _{i=1}^{n}\vert y_i-\hat{y}_i\vert ,\)

where the \(y_i\) and \(\bar{y}\) are, respectively, the TGBVs and mean TGBVs for the single simulated dataset, but the OGBVs and mean OGBVs for the empirical datasets, and the \(\hat{y}_i\) and \(\bar{\hat{y}}\) are, respectively, the PGBVs and mean PGBVs. 10-fold CV is used to assess the PA for each method for the simulated dataset, in contrast to the 5-fold CV used with the three empirical maize datasets. Although we report both the prediction errors and the PA, breeders are primarily interested in the final ordering of the genotypes, which the PA captures better than the prediction errors.

For the cross validation, we aimed to have at least 150 individuals per fold. Accordingly, each phenotypic dataset was randomly split into k approximately equal parts. The breeding values for each of the k folds were predicted by training the model on the \(k-1\) remaining folds and a CV error (CVE) computed for each of the k folds. The method with the smallest CVE was selected to predict the breeding values for the unphenotyped genotypes for the simulated dataset, and the phenotyped genotypes in the validation sets for each of the three empirical maize datasets.
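The following Python sketch shows how PA, MSPE and MAPE can be computed within a k-fold cross-validation loop of the kind described above; the ridge stand-in model, the 5 folds of 150 individuals and the simulated data are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def pa_mspe_mape(y_obs, y_pred):
    """Predictive accuracy/ability (Pearson r), MSPE and MAPE."""
    pa = np.corrcoef(y_obs, y_pred)[0, 1]
    mspe = np.mean((y_obs - y_pred) ** 2)
    mape = np.mean(np.abs(y_obs - y_pred))
    return pa, mspe, mape

rng = np.random.default_rng(9)
X = rng.normal(size=(750, 400))
y = X[:, :10] @ rng.normal(size=10) + rng.normal(size=750)

# 5-fold CV with ~150 individuals per fold; each fold is predicted by a model
# trained on the remaining folds, and the fold-wise metrics are averaged.
cv_scores = []
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = RidgeCV().fit(X[train], y[train])
    cv_scores.append(pa_mspe_mape(y[test], model.predict(X[test])))
print("mean PA, MSPE, MAPE across folds:", np.round(np.mean(cv_scores, axis=0), 3))
```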

All the methods are implemented in the R software and are available in various R packages [ 10 , 32 , 40 , 43 , 48 , 54 , 58 , 68 , 69 , 70 , 71 , 72 , 73 ]. Table S1 (Additional file 5 , Section 3) lists the R packages we used to analyse the synthetic and real datasets. For the deep learning methods, and because of fine tuning requirements, we used the Python software and packages Numpy, Pandas and Tensorflow [ 74 , 75 ]. All R and Python codes referring to the simulated data are provided in Additional files 2 & 3 .

Noteworthy details of model fitting are available in the supplementary materials (Additional file 5 , Section 2).

Although we did not fully quantify the computational costs of the different methods, the computational burden increased strikingly from the simple regularized through the adaptive to the grouped methods. A similar trend was also apparent from the ensemble, through the instance-based to the deep learning methods. Computational time may be reduced greatly by parallelizing the estimation or optimization algorithms, but this strategy may not always be available and can be challenging to implement for some methods.

The relative performances of the various methods on the simulated data varied with the target trait and with whether performance was assessed in terms of predictive accuracy or prediction error. Performance also varied in terms of computational cost with some methods requiring considerably more time than others. Results of genomic prediction accuracy for the simulated data are displayed in Figs. 2 , 3 and 4 and Tables S2-S5 (Additional file 5 , Section 3). Tables S6 & S7 (Additional file 5 , Section 3) report the calibration details for the fitted feed-forward and convolutional neural networks.

Figure 2. Prediction accuracy (PA) of the regularized, adaptive regularized and Bayesian regularized methods, computed as the Pearson correlation coefficient between the true breeding values (TBVs) and the predicted breeding values (PBVs), for the simulated dataset, where \(T_1-T_3\) refer to three quantitative milk traits. The choice of \(\lambda\) , where applicable, was based on the 10-fold CV. The mean squared and absolute prediction errors are also provided. See Table S 2 for details

Figure 3. Prediction accuracy (PA) of the group regularized methods (mean and range values of PA across the different groupings), computed as the Pearson correlation coefficient between the true breeding values (TBVs) and the predicted breeding values (PBVs), for the simulated dataset, where \(T_1-T_3\) refer to three quantitative milk traits. Choice of \(\lambda\) was based on the 10-fold CV. Display refers to the mean, max and min values of PA across all the 10 grouping schemes. The mean squared and absolute prediction errors are also provided. See Table S 3 for details

Figure 4. Prediction accuracy (PA) of the ensemble, instance-based and deep learning methods, computed as the Pearson correlation coefficient between the true breeding values (TBVs) and the predicted breeding values (PBVs), for the simulated dataset, where \(T_1-T_3\) refer to three quantitative milk traits. See Tables S 4 -S 5 for details

Table 6 displays the range of the observed predictive accuracies across all the classes of the regularized methods for traits \(T_1-T_3\) . Neither the adaptive, group, nor Bayesian regularized methods seem to improve upon the results of their regularized counterparts, although group regularized methods do provide some slight improvement upon the results of the adaptive regularized methods. Even though all the regularized regression methods had comparable overall performance, the best compromise between high PA ( \(\ge 0.77\) for \(T_1\) , 0.82 for \(T_2\) and 0.81 for \(T_3\) ) and small prediction errors was achieved by the LASSO, ENET, sENET and SCAD (Fig.  2 and Table S 2 ; first half). Within the class of adaptive regularized methods, the best compromise was achieved by aLASSO and aENET (Fig.  2 and Table S 2 ; second half; PA \(\ge 0.72\) for \(T_1\) , 0.78 for \(T_2\) and 0.80 for \(T_3\) ). For the group regularized methods, a good compromise was achieved by the gLASSO and gSCAD (Fig.  3 and Table S 3 ; mean PA values \(\ge 0.76\) for \(T_1\) , 0.82 for \(T_2\) and 0.81 for \(T_3\) ). Whereas the worst performing group regularized methods in terms of the estimated PAs were the cMCP and gel for \(T_1\) (PA \(<0.7\) ), sgLASSO and gel for \(T_2\) (PA \(<0.8\) ) and hLASSO and gel for \(T_3\) (PA \(<0.8\) ), the worst performing methods in terms of prediction errors were the gel ( \(T_1\) & \(T_2\) only) and sgLASSO ( \(T_3\) only). Of all the group regularized methods, the most time consuming were the sgLASSO and hLASSO, with sgLASSO requiring several more months to compute results for trait \(T_1\) than for traits \(T_2\) or \(T_3\) . In the comparison between the two Bayesian regularized methods, Lasso Bayes consistently outperformed the Ridge Bayes method across all the three traits, demonstrating superior predictive accuracy and generally smaller prediction errors.

The ensemble, instance-based and deep learning methods did not improve upon the results of the regularized, the group or the Bayesian regularized methods (Fig.  4 and Tables S 4 & S 5 ). Among the ensemble and instance-based groups of methods, RF provided the best compromise between high PA and small prediction errors. For the deep learning methods, the FFNN provided consistently higher PA values than CNN across all the three traits from the simulated data.

Predictive performance varied not only among the methods but also with the target quantitative traits. Specifically, trait \(T_3\) had the highest predictive accuracies for the adaptive methods, whereas trait \(T_2\) was generally top ranked across all the remaining methods.

The ridge regression methods, plus the overall best performing methods (high PA values and low prediction errors) within each class of methods in the analysis of the simulated dataset, were applied to each of the three KWS empirical maize datasets. The specific methods fitted to the KWS maize datasets comprised RR-CV, RR-REML, sENET, aENET (enet penalty), gLASSO, RF, FFNN and lBayes.

Results are displayed in Fig.  5 and Table S8 (Additional file  5 , Section 3). Across the three real maize datasets, the highest predictive abilities were obtained for the 2010 dataset. The estimated predictive abilities (PA) are under 0.7 for the 2010 dataset but under 0.6 for the 2011 dataset and generally under 0.6 for the 2012 dataset (RR-REML and lBayes excluded, with estimated PAs of 0.616 and 0.624, respectively), regardless of the method used. The lBayes and RR-REML (2011 & 2012 datasets) and RF, RR-REML and lBayes (2010 dataset) are evidently the best performing methods (higher PA values and lower prediction errors). On the other hand, aENET\(^e\) (2010 & 2011 datasets) and RF (2012 dataset) are the worst performing methods (lower PA and higher prediction errors). Interestingly, the RF performed both the best (2010 dataset) and the worst (2012 dataset), clearly emphasizing that the relative performance of the methods is strongly data dependent.

Figure 5. Predictive ability (PA; mean and range values computed across the 5-fold validation datasets and 10 replicates) of the regularized and adaptive regularized methods, computed as the Pearson correlation coefficient between the observed breeding values (OBVs) and the predicted breeding values (PBVs), for the KWS datasets. The choice of \(\lambda\) , where applicable, was based on 4-fold CV. See Table S 8 for details

We have investigated the predictive performance of several state-of-the-art machine learning methods in genomic prediction using one simulated and three real datasets. All the methods showed predictive performance high enough for most practical selection decisions. However, the relative performance of the methods was both data and target-trait dependent, precluding an omnibus comparative evaluation and thus ruling out the selection of a single procedure for routine use in genomic prediction. These results broaden the findings of earlier studies (e.g. [ 9 ]) to a wider range of groups of methods. If reproducibility of results and low computational cost and time are important considerations, the regularized regression methods can be highly recommended: at relatively low computational cost and computing time, they consistently produced accurate predictions that were competitive with the other groups of methods for both the simulated and the three real datasets. Even among the regularized regression methods, increasing model complexity from the simple through the adaptive to the grouped or Bayesian regularized methods generally only increased computing time without clearly improving predictive performance.

The ensemble, instance-based and deep-learning ML methods require the tuning of numerous hyperparameters and therefore considerable computing time to explore the hyperparameter space adequately. In many applications this will not be possible because of limited time and computational resources, leading to potentially suboptimal results, and it may well partly explain why these methods did not clearly outperform the classical ML methods. Indeed, the computational costs of the ensemble, instance-based and deep learning methods can quickly become prohibitive if all the parameters are tuned by searching over an often large grid of values; doing so typically requires not only proficiency in programming, algorithm parallelization and optimization, but also excellent computing resources. These constraints, plus the growing size of phenotypic and genomic data, make it difficult to identify methods for routine use in genomic prediction and call for greater focus on, and investment in, enhancing the computational efficiency of algorithms and computing resources.

We have considered only well tested and established off-the-shelf machine learning methods and one simulated and three real datasets. We are extending this work to cover the following four objectives. (1) Comparing the performance of methods that use advanced techniques for feature selection or dimensionality reduction on multiple synthetic datasets simulated using different configurations or scenarios. (2) Exploring how the methods generalize based on different training/test splits across simulations/real-world datasets, individuals/samples, or chromosomes. (3) Evaluating the sensitivity of the different methods to hyperparameter selection. (4) Assessing the training and testing complexity for the different methods.

Machine learning methods are well suited for efficiently handling high dimensional data. In particular, supervised machine learning methods have been used successfully in genomic prediction or genome-enabled selection. However, their comparative predictive accuracy is still poorly understood, even though this is a critical issue in plant and animal breeding studies, given that increasing methodological complexity can substantially increase computational complexity and cost. Here, we showed that predictive performance is both data and target-trait dependent, ruling out the selection of a single method for routine use in genomic prediction. We also showed that, for this reason and because of their relatively low computational complexity and competitive predictive performance, the classical linear mixed model approach and the regularized regression methods remain strong contenders for genomic prediction.

Availability of data and materials

The simulated animal data from the QTLMAS workshop 2012 are provided in the supplementary materials together with the annotated R and Python codes used to analyse these data. The KWS data are proprietary and cannot be shared publicly for confidentiality reasons; they can only be shared upon reasonable request and with KWS' express consent. Notwithstanding this, we provide a synthetic dataset that mimics the KWS data, which can be used with our codes to illustrate the implementation of the ML methods.

Abbreviations

ADAM: Adaptive moment estimation

BLUP: Best linear unbiased prediction

CV: Cross-validation

DL: Deep learning

ENET: Elastic net

FFNN: Feed-forward neural network

GP: Genomic prediction

GS: Genomic selection

LASSO: Least absolute shrinkage and selection operator

MAPE: Mean absolute prediction error

MCP: Minimax concave penalty

ML: Machine learning

MLP: Multi-layer perceptron

MSPE: Mean squared prediction error

OLS: Ordinary least squares

PA: Predictive accuracy/ability

PCA: Principal component analysis

PGBV: Predicted genomic breeding value

QTL: Quantitative trait loci

REML: Restricted maximum likelihood

RF: Random forests

RR: Ridge regression

RSS: Residual sum of squares

SCAD: Smoothly clipped absolute deviation

SGB: Stochastic gradient boosting

SNP: Single nucleotide polymorphism

TGBV: True genomic breeding value

SVM: Support vector machine

SVR: Support vector regression

Montesinos-López A, Montesinos-López OA, Gianola D, Crossa J, Hernández-Suárez CM. Multi-environment genomic prediction of plant traits using deep learners with dense architecture. G3 Genes Genomes Genet. 2018;8(12):3813–3828.

Montesinos-López OA, Montesinos-López A, Crossa J, Gianola D, Hernández-Suárez CM, Martín-Vallejo J. Multi-trait, multi-environment deep learning modeling for genomic-enabled prediction of plant traits. G3 Genes Genomes Genet. 2018;8(12):3829–3840.

Montesinos-López OA, Martín-Vallejo J, Crossa J, Gianola D, Hernández-Suárez CM, Montesinos-López A, Philomin J, Singh R. A benchmarking between deep learning, support vector machine and Bayesian threshold best linear unbiased prediction for predicting ordinal traits in plant breeding. G3 Genes Genomes Genet. 2019;9(2):601–618.

Montesinos-López OA, Martín-Vallejo J, Crossa J, Gianola D, Hernández-Suárez CM, Montesinos-López A, Juliana P, Singh R. New deep learning genomic-based prediction model for multiple traits with binary, ordinal, and continuous phenotypes. G3 Genes Genomes Genet. 2019;9(5):1545–1556.

Ogutu JO, Piepho H-P, Schultz-Streeck T. A comparison of random forests, boosting and support vector machines for genomic selection. BMC Proc. 2011;5(3):1-5.

Ogutu JO, Schulz-Streeck T, Piepho H-P. Genomic selection using regularized linear regression models: ridge regression, lasso, elastic net and their extensions. BMC Proc. 2012;6(2):1-6.

Heslot N, Yang HP, Sorrells ME, Jannink JL. Genomic selection in plant breeding: a comparison of models. Crop Sci. 2012;52:146–60.


Pérez-Enciso M, Zingaretti LM. A Guide on Deep Learning for Complex Trait Genomic Prediction. Genes. 2019;10(7):553.


Ogutu JO, Piepho H-P. Regularized group regression methods for genomic prediction: Bridge, MCP, SCAD, group bridge, group lasso, sparse group lasso, group MCP and group SCAD. BMC Proc. 2014;8(5):1-9.

Pérez P, de los Campos G. Genome-wide regression and prediction with the BGLR statistical package. Genetics. 2014;198:483–495.

Usai MG, Gaspa G, Macciotta NP, Carta A, Casu S. XVIth QTLMAS: simulated dataset and comparative analysis of submitted results for QTL mapping and genomic evaluation. BMC Proc. 2014;8(5):1–9.

Estaghvirou SBO, Ogutu JO, Schulz-Streeck T, Knaak C, Ouzunova M, Gordillo A, Piepho HP. Evaluation of approaches for estimating the accuracy of genomic prediction in plant breeding. BMC Genomics. 2013;14(1):1–21.


Estaghvirou SBO, Ogutu JO, Piepho HP. How genetic variance and number of genotypes and markers influence estimates of genomic prediction accuracy in plant breeding. Crop Sci. 2015;55(5):1911–24.


Xie L. Randomly split SAS data set exactly according to a given probability Vector. 2009. https://silo.tips/download/randomly-split-sas-data-set-exactly-according-to-a-given-probability-vector . Accessed 15 Mar 2021.

Frank IE, Friedman JH. A statistical view of some chemometrics regression tools (with discussion). Technometrics. 1993;35:109–48.

Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. 2001;96:1348–60.


Fan J, Peng H. Nonconcave penalized likelihood with a diverging number of parameters. Ann Stat. 2004;32:928–61.

Hoerl AE, Kennard RW. Ridge regression: biased estimation for non-orthogonal problems. Technometrics. 1970;12:55–67.

Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc B. 1996;58:267–88.


Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Assoc B. 2005;67:301–20.

Fu WJ. Penalized regressions: The bridge versus the lasso. J Comput Graph Stat. 1998;7:397–416.

Huang J, Horowitz JL, Ma S. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann Stat. 2008;36:587–613.

Knight K, Fu W. Asymptotics for Lasso-type estimators. Ann Stat. 2000;28:356–1378.

Zhang C-H, Huang J. The sparsity and bias of the lasso selection in high-dimensional linear regression. Ann Stat. 2008;36:1567–94.

Zhang C-H. Nearly unbiased variable selection under minimax concave penalty. Ann Stat. 2010;38:894–942.

Meuwissen TH, Hayes BJ, Goddard M. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001;157(4):1819–29.


Searle SR, Casella G, McCulloch CE. Variance components. New York: Wiley; 1992.


Piepho H-P, Ogutu JO, Schulz-Streeck T, Estaghvirou B, Gordillo A, Technow F. Efficient computation of ridge-regression best linear unbiased prediction in genomic selection in plant breeding. Crop Sci. 2012;52:1093–104.

Ruppert D, Wand MP, Carroll RJ. Semiparametric regression. Cambridge: Cambridge University Press; 2003.

Hayes BJ, Visscher PM, Goddard ME. Increased accuracy of artificial selection by using the realized relationship matrix. Genet Res. 2009;91(1):47–60.

Piepho H-P. Ridge regression and extensions for genomewide selection in maize. Crop Sci. 2009;49:1165–76.

Mazumder R, Friedman JH, Hastie T. Sparsenet: Coordinate descent with nonconvex penalties. J Am Stat Assoc. 2011;106(495):1125–38.


Kim Y, Choi H, Oh HS. Smoothly clipped absolute deviation on high dimensions. J Am Stat Assoc. 2008;103(484):1665–73.


Zhang C-H. Penalized linear unbiased selection. Department of Statistics and Bioinformatics, Rutgers University, Technical Report #2007-003. 2007.

Breheny P, Huang J. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann Appl Stat. 2011;5:232–53.


Chen Z, Zhu Y, Zhu C. Adaptive bridge estimation for high-dimensional regression models. J Inequalities Appl. 2016;1:258.

Zou H. The adaptive lasso and its oracle properties. J Am Stat Assoc. 2006;101:1418–29.

Grandvalet Y. Least absolute shrinkage is equivalent to quadratic penalization. International Conference on Artificial Neural Networks. London: Springer; 1998. p. 201–206.

Zou H, Zhang HH. On the adaptive elastic-net with a diverging number of parameters. Ann Stat. 2009;37(4):1733–51.

Xiao N, Xu QS. Multi-step adaptive elastic-net: reducing false positives in high-dimensional variable selection. J Stat Comput Simul. 2015;85(18):3755–65.

Huang J, Breheny P, Ma S. A Selective Review of Group Selection in High-Dimensional Models. Stat Sci. 2012;27(4). https://doi.org/10.1214/12-STS392 .

Bach F. Consistency of the group lasso and multiple kernel learning. J Mach Learn. 2008;9:1179–225.

Breheny P, Huang J. Penalized methods for bi-level variable selection. Stat Interface. 2009;2:369–80.

Park C, Yoon YJ. Bridge regression: adaptivity and group selection. J Stat Plan Infer. 2011;141:3506–19.

Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J R Stat Soc B. 2006;68:49–67.

Breheny P, Huang J. Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Stat Comput. 2015;25(2):173–87.


Huang J, Ma S, Xie H, Zhang C-H. A group bridge approach for variable selection. Biometrika. 2009;96:339–55.

Simon N, Friedman J, Hastie T, Tibshirani R. A sparse-group lasso. J Comput Graph Stat. 2013;22:231–45.  https://doi.org/10.1080/10618600.2012.681250 .

Friedman J, Hastie T, Tibshirani R. A note on the group lasso and sparse group lasso. 2010. arXiv preprint arXiv:1001.0736.

Huang J, Zhang T. The benefit of group sparsity. Ann Stat. 2010;38:1978–2004.

Poignard B. Asymptotic theory of the adaptive Sparse Group Lasso. Ann Inst Stat Math. 2020;72(1):297–328.

Percival D. Theoretical properties of the overlapping groups lasso. Electron J Stat. 2011;6:269–88.

Zhou N, Zhu J. Group variable selection via a hierarchical lasso and its oracle property. Stat Interface. 2010;3:557–74.

Lim M, Hastie T. Learning interactions via hierarchical group-lasso regularization. J Comput Graph Stat. 2015;24(3):627–54.

Bien J, Taylor J, Tibshirani R. A lasso for hierarchical interactions. Ann Stat. 2013;41:1111–41.

Hastie TJ, Tibshirani R, Friedman J. The elements of statistical learning. 2nd ed. New York: Springer; 2009.

Liaw A, Wiener M. Classification and regression by randomForest. R News. 2002;2:18–22.

Breiman L. Random forests. Mach Learn. 2001;45:5–32.

Schonlau M. Boosted regression (boosting): An introductory tutorial and a Stata plugin. Stata J. 2005;5(3):330–54.

Vapnik V. The Nature of Statistical Learning Theory. New York: Springer; 1995.

Min S, Lee B, Yoon S. Deep learning in bioinformatics. Brief Bioinforma. 2017;18(5):851–69.  https://doi.org/10.1093/bib/bbw068 .

Yue T, Wang H. Deep learning for genomics: A concise overview. 2018. arXiv preprint arXiv:1802.00810.

Bengio Y. Practical recommendations for gradient-based training of deep architectures. In: Neural Networks: Tricks of the trade. Berlin, Heidelberg: Springer; 2012. p. 437–478.

Eraslan G, Avsec Z̆, Gagneur J, Theis FJ. Deep learning: new computational modelling techniques for genomics. Nat Rev Genet. 2019;20(7):389–403.

Zou J, Huss M, Abid A, Mohammadi P, Torkamani A, Telenti A. A primer on deep learning in genomics. Nat Genet. 2019;51(1):12–8.  https://doi.org/10.1038/s41588-018-0295-5 .

Kingma DP, Ba JL. Adam: A method for stochastic optimization. 2014. arXiv preprint arXiv:1412.6980.  https://arxiv.org/pdf/1412.6980.pdf .

Ruder S. An overview of gradient descent optimization algorithms. 2016. arXiv preprint arXiv:1609.04747.

Breheny P. The group exponential lasso for bi‐level variable selection. Biometrics. 2015;71(3):731-40.

Endelman JB. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome. 2011;4(3):250–55.

Friedman J. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29:1189–232.

Friedman J, Hastie T, Tibshirani R, Narasimhan B, Tay K, Simon N, Qian J. Package ‘glmnet’. J Stat Softw. 2022;2010a:33(1).

Greenwell B, Boehmke B, Cunningham J. Package ‘gbm’. R package version. 2019;2(5).

Dimitriadou E, Hornik K, Leisch F, Meyer D, Weingessel A. Package ‘e1071’. R software package. 2009. Available at https://cran.r-project.org/web/packages/e1071/index.html .

Agrawal A, et al. TensorFlow Eager: A multi-stage, Python-embedded DSL for machine learning. Proc Mach Learn Syst. 2019;1:178–89.

McKinney W. Python for data analysis: Data wrangling with Pandas, NumPy, and IPython. California: O’Reilly Media, Inc.; 2012.


Acknowledgements

We thank KWS for providing the maize datasets. We thank the Centre for Mathematical Analysis, Geometry, and Dynamical Systems, from Instituto Superior Técnico (IST) of the University of Lisbon, for granting access to their computing resources to run the Deep Learning Models.

Funding

Open Access funding enabled and organized by Projekt DEAL. This work is funded by national funds through the FCT - Fundação para a Ciência e a Tecnologia, I.P., under the scope of the projects UIDB/00297/2020 and UIDP/00297/2020 (Center for Mathematics and Applications). The German Federal Ministry of Education and Research (BMBF) funded this research within the AgroClustEr “Synbreed - Synergistic plant and animal breeding” (Grant ID: 0315526). JOO was additionally supported by the German Research Foundation (DFG, Grant # 257734638). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author information

Authors and affiliations.

Center for Mathematics and Applications (NOVA Math) and Department of Mathematics, NOVA SST, 2829-516, Caparica, Portugal

Vanda M. Lourenço & Rui A.P. Rodrigues

Institute of Crop Science, Biostatistics Unit, University of Hohenheim, Fruwirthstrasse 23, 70599, Stuttgart, Germany

Joseph O. Ogutu & Hans-Peter Piepho

Research Unit of Computational Statistics, Vienna University of Technology, Wiedner Hauptstr. 8-10, 1040, Vienna, Austria

Alexandra Posekany


Contributions

VML, JOO and HPP conceived the project. RAPR wrote the Python code, selected and trained the deep learning models. AP selected and programmed the Bayesian models and wrote the corresponding theory. VML and JOO wrote the R code, performed the simulations and all the other analyses. VML wrote the initial draft of the manuscript. JOO, RAPR and HPP contributed to writing and revising the manuscript. All authors read and approved the final version of the manuscript.

Authors’ information

The authors declare no conflict of interests.

Corresponding authors

Correspondence to Vanda M. Lourenço or Joseph O. Ogutu .

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Simulated (animal breeding) dataset. Includes four txt files: one for the grouping schemes, one for the QTLMAS prediction data, one for the QTLMAS training data, and one for the validation trait values.

Additional file 2.

R codes used to fit the ML algorithms to the simulated (animal breeding) dataset. Includes six R files: one for the simple regularized methods, one for the adaptive regularized methods, one for the group regularized methods, one for the Bayesian regularized methods, one for the ensemble methods, and one for the instance-based methods.

Additional file 3.

Python codes used to fit the deep learning (FFNN & CNN) algorithms to the simulated (animal breeding) dataset. Includes six py and three npz files: three of the py files refer to the FFNN fits and the other three to the CNN fits; each of the three npz files includes six npy files referring to the training of the FFNNs for traits 1, 2 & 3, respectively.

Additional file 4.

Includes SAS code for (i) the phenotypic data analysis (S1 Text.doc); (ii) SNP grouping schemes (S2 Text.doc); and (iii) the 5-fold data split (S3 Text.doc & S4 Text.doc) for the KWS \(2010-2012\) data sets.

Additional file 5.

Includes the RR-BLUP model used to estimate variance components for the KWS real maize data (Section 1), the Noteworthy details of model fitting (Section 2) plus the additional Tables of results (Section 3). Table S1. List of R and Python packages used in this paper. Table S2. Prediction accuracy (PA) of the regularized, adaptive regularized and Bayesian regularized methods, computed as the Pearson correlation coefficient between the true breeding values (TBVs) and the predicted breeding values (PBVs), for the simulated dataset, where \(T_1-T_3\) refer to three quantitative milk traits. The choice of \(\lambda\) , where applicable, was based on the 10-fold CV. The mean squared and absolute prediction errors are also provided. Table S3. Prediction accuracy (PA) of the group regularized methods (mean and range values of PA across the different groupings), computed as the Pearson correlation coefficient between the true breeding values (TBVs) and the predicted breeding values (PBVs), for the simulated dataset, where \(T_1-T_3\) refer to three quantitative milk traits. Choice of \(\lambda\) was based on the 10-fold CV. Display refers to the mean, max and min values of PA across all the 10 grouping schemes. The mean squared and absolute prediction errors are also provided. Table S4. Prediction accuracy (PA) of the ensemble and instance-based methods, computed as the Pearson correlation coefficient between the true breeding values (TBVs) and the predicted breeding values (PBVs), for the simulated dataset, where \(T_1-T_3\) refer to three quantitative milk traits. Table S5. Prediction accuracy (PA) of the deep learning methods, computed as the Pearson correlation coefficient between the true breeding values (TBVs) and the predicted breeding values (PBVs), for the simulated dataset, where \(T_1-T_3\) refer to three quantitative milk traits. Table S6. Best FFNN model calibration parameters selected for each of the three quantitative milk traits \(T_1-T_3\) . Table S7. Best CNN model calibration parameters (Number of epochs/Learning rate) selected for each of the three quantitative milk traits \(T_1-T_3\) . Table S8. Predictive ability (PA; mean and range values computed across the 5-fold validation datasets and 10 replicates) of the regularized, adaptive regularized, group regularized, Bayesian regularized, ensemble, instance-based and deep learning methods, computed as the Pearson correlation coefficient between the observed breeding values (OBVs) and the predicted breeding values (PBVs), for the KWS datasets. The choice of \(\lambda\) , where applicable, was based on 4-fold CV.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Lourenço, V., Ogutu, J., Rodrigues, R. et al. Genomic prediction using machine learning: a comparison of the performance of regularized regression, ensemble, instance-based and deep learning methods on synthetic and empirical data. BMC Genomics 25 , 152 (2024). https://doi.org/10.1186/s12864-023-09933-x


Received : 21 March 2023

Accepted : 20 December 2023

Published : 07 February 2024

DOI : https://doi.org/10.1186/s12864-023-09933-x


Keywords

  • Breeding value
  • Predictive accuracy
  • Predictive ability
  • High-dimensional data
  • Supervised machine learning methods



Do Empirical Spirits Serve Any Purpose Beyond Fueling Viral Headlines?

words: Aaron Goldfarb

Published: February 8, 2024

illustration: Danielle Grinberg

Even from the get-go Empirical seemed like a brand designed to be a viral sensation. One of the earliest articles I can find about the company — then known as Empirical Spirits — is a Medium post written by one of its investors. In “ Why This Danish Startup Is My First Official ‘Pre-Seed’ Investment ,” J.R. Johnson writes generically of the brand’s “innovative approach” and “intellectual property” and, of course, mentions that co-founders Lars Williams and Mark Emil Hermansen “met while working at the famed restaurant Noma, which has been voted #1 restaurant in the world four times and changed the way the world views Nordic cuisine.”

It’s all very tantalizing, esoteric, somewhat mysterious stuff and it’s no wonder so many people (myself included) were eager to read more about this far-flung company that was making neither whiskey nor gin nor vodka but, instead, “freeform spirits.”

That same month, September 2017, Vice Munchies was perhaps the first alcohol industry publication to write about Empirical in an article titled “This Once-Abandoned Warehouse Might Contain the Future of Booze.”

As a fellow journalist reading these articles, I couldn’t help but wish it was me breaking the news on this exciting, upstart company. As an adventurous drinker, I likewise couldn’t help but wish I could actually get my hands on some of these oddball releases to try them myself. Releases with names like Easy Tiger, infused with Douglas fir; Fallen Pony, produced from quince tea kombucha; and Charlene McGee, smoked on juniper wood and rested in sherry casks.

The only issue then, not much different from today, was that Empirical products weren’t exactly easy to buy, sample, or even taste as part of any cocktail, even in New York, where I live.

Early on, however, that didn’t seem to matter for getting press. That’s because Empirical inherently offered countless things for journalists to write about. Like the distillery’s custom-built vacuum still that looked “like half Formula 1 engine, half Breaking Bad lab equipment” (according to Vice Munchies), or its rotary evaporator, designed to “gently extract the distillate without destroying the delicate flavours,” as Wired U.K. explained.

The brand also had tons of bona fides for journalists to seize on, like the fact that Michelin-starred restaurants like Noma and top cocktail bars like London’s (now shuttered) Dandelyan were already stocking the releases.

There were likewise tons of great images available to accompany any story, especially those of the handsome, tattooed Williams straddling stainless-steel equipment or sitting on oddball machinery.

This wasn’t anything the spirits world had ever seen before, and it seemingly didn’t even matter how the releases tasted, nor that average readers would never be able to easily taste them.

“We started Empirical with a few simple ideas: question everything, and anything was possible. That inception led us down a somewhat different path than a traditional distillery, and I imagine that attitude was what piqued people’s interest,” says co-founder Williams. “We never really strove to be ‘viral,’ but the realization of a hypothetical question, ‘What would we make if we’d never heard of or tasted spirits?’ seems to be something that resonates with people.”

And eventually, Empirical finally got press for an actual release. Well, sorta.

Fuck Trump, Go Viral

In the late summer of 2018, two years into his presidency and only one year after the company’s founding, Empirical again made viral headlines with a release bluntly dubbed “Fuck Trump and His Stupid Fucking Wall.” The 27 percent ABV spirit was, like all Empirical releases, uncategorizable. It was made from a base of barley (soaked and injected with Aspergillus oryzae fungi to ferment into koji) and Belgian saison yeast, then infused with habanero peppers and habanero vinegar. Like all Empirical releases, it was packaged in a simple clear bottle with a small, printed-out, all-text sticker label.

If that sounded intriguing, it didn’t really matter, as virtually no journalists ever actually tasted it (myself included) and it mostly wasn’t even available in American retail or bars. (It sold out online in 45 minutes.) Yet I was stunned by how many writers and online influencers — again, many in America — positively covered this release.

“[T]he bottling became an instant phenomenon — a lightning rod shared across social media that encapsulated a moment of fury within the spirits industry and the world at large,” wrote Kara Newman. It even crossed over to political writers, with The Daily Wire angrily deriding it.

Was a childish insult and a couple of curse words really enough to go viral in the booze world? And, if so, why didn’t Wild Fucking Turkey just change its name?

Kat Kinsman was the rare journalist who actually tasted it before writing about it; she regarded it favorably in an article for Extra Crispy titled “The Best New Spirit Is Called F*** Trump and His Stupid F***ing Wall.” Kinsman wrote that the spirit was a “smooth, warm, vegetal liquor that is simultaneously familiar and elusive, and endlessly sippable,” while noting it was not ideal for cocktails.

A second incarnation of Fuck Trump would arrive in 2019, and by the time Joe Biden took office in 2021, one final batch was released, receiving yet more glowing press in the process. (In total, 12 different batches would be created during Trump’s four-year term.)

In 2022, on the heels of Russia’s invasion of Ukraine, a Fuck Putin and His Stupid Fucking War bottled cocktail would, believe it or not, go viral as well.

Getting Started

Over the next few years, after reading about them endlessly, I finally got my hands on Empirical bottles, and have occasionally seen Empirical spirits appear in cocktails at top bars. When London’s acclaimed Lyaness did a pop-up in Manhattan in 2019, I found myself really enjoying a Daiquiri riff made with Onyx, a spirit produced from koji, maple, birch, kombucha, and hops, custom-made by Empirical specifically to be used in Lyaness cocktails.

In fact, I praised the cocktail so effusively that bar owner Ryan “Mr. Lyan” Chetiyawardana even sent me my own bottle of Onyx. But over the next four years I barely touched it and it gathered dust on my spirits shelf. It didn’t quite work for me as a neat sipper and I was unable to recreate any cocktail magic at home. I’m a decent bartender, but no Mr. Lyan.

I’d try other new Empirical releases seemingly every year when I went to Bar Convent Brooklyn, the country’s top trade show. These spirits, like The Plum, I Suppose, made of distilled marigold kombucha and plum stones, were always interesting, never bad, and sometimes even good, but I never exactly knew what to do with any of them. I never liked any enough to spring for a bottle that might run $75 or more. Nor could I ever visualize how I might use any of them for a cocktail prepared on my kitchen counter.

Perhaps the problems I faced are the same ones that Empirical continues to face when trying to sell its spirits to home consumers.

“Explaining our story, ethos, the why and how of what we are doing that makes us different, that was easy (in a sense) because it was simple to us and just the story of what we were doing,” says Williams. “But getting people to actually understand that on a visceral level has proved extremely difficult, and we are still always trying to be better at the conveying of what we do.”

Williams thinks that part of the problem is that most people outside the industry don’t even know how spirits are made. Perhaps they don’t even know what, for example, technically defines a bourbon or a tequila, and thus why Empirical is different.

Fair enough, but in some ways Empirical itself doesn’t seem entirely clear on what its products are or how to use them. One of the four links in the top navigation bar of the brand’s website is “Getting Started,” as if the bottles of esoteric spirits were instead pieces of technology the average consumer wouldn’t be able to figure out how to set up and turn on without a manual.

Other top bartenders likewise seem to prefer Empirical spirits in more baroque, complex cocktails, where a straightforward gin or tequila simply won’t do.

“We know — diving into the world of uncategorized spirits can be a bit daunting,” reads the webpage. “Since you made it this far, it must be because you’re a little curious. And curiosity is the first critical step.”

The advice goes on to mostly tell the home user to either try Empirical spirits with tonic or turn them into well-worn classics, recommending a Tom Collins, Old Fashioned, and Mojito to start.

One of the brand’s greatest challenges, Williams says, is getting people to figure out how to “Make it Empirical.” “Luckily when people taste Empirical they get it; the proof is in the pudding,” he says.

But only a small portion of people are even able to purchase Empirical in the first place. If it’s hard to sell these products online, it’s even harder to sell them in smaller markets, where the company can’t deploy a door-to-door, boots-on-the-ground salesforce ready and willing to explain the left-field spirits. Outside of spirits-focused communities in New York and Northern California, it seems, few people have even heard of the company.

Yet Empirical continues to get gobs of press.

Brain Farts

More and more, I am encountering Empirical spirits at top New York bars these days. They are rarely offered as sipping spirits and, more often, have found their way into signature cocktails. At Superbueno — VinePair’s Next Wave Awards Bar Program of the Year — one such Empirical offering appears in the bar’s Salted Plum & Tamarind Milk Punch.

“The whole inspiration is a tamarind and plum candy you get in Mexico called saladitos,” says co-owner Nacho Jimenez.

To create the sweet, tart, and spicy flavor profile, head bartender Kip Moffit cooks red plums, tamarind, coriander, lime peels, and sugar at a very low temperature for two hours before adding the mixture to an Ecuadorian tea blend called Horchata Lojana. That syrup is then mixed with charanda, a Mexican cane spirit, as well as Empirical Ayuuk, a purple wheat and pilsner malt spirit macerated with smoky Pasilla Mixe chile and matured in oloroso casks, before the entire concoction is milk-washed.

It’s a very unusual cocktail, quite savory and a bit spicy from the Ayuuk, though good and balanced. Perhaps a drink like this could only be made with something as unusual as an Empirical spirit. At the least, it’s not exactly a cocktail a home consumer would typically make.

At Double Chicken Please, voted the No. 1 bar in North America, co-owner GN Chan uses a variety of Empirical products in his cocktails. Chan deploys Soka, a sorghum distillate, with curry, pumpkin, and coffee in a cocktail called Brain Fart. For a cocktail called Little Fucking Brain, he combines Symphony 6 (a distillate made with lemon leaf, tangerine, fig, coffee, vetiver, ambrette seeds, black currant buds, citric acid, and carmine) with banana, tana (Japanese blue honeysuckle berry), walnut, and Riesling.

“Symphony 6 is a very unique spirit, … bright and citrusy with a hint of musk, which creates an interesting and inspiring dynamic to play with,” Chan says. “We’ve utilized the product on a drink that falls somewhere between a Cosmopolitan and apple Martini.”

Jonathan Adler of Shinji’s Bar, a Japanese-style cocktail bar in the Flatiron District of Manhattan, likes using The Plum, I Suppose, in place of Luxardo, a maraschino liqueur he finds the Empirical spirit reminiscent of. This allows him to make more interesting riffs on Last Words and Tuxedo No. 2s.

“This is why it is so exciting to use the distillates that Empirical is producing; [they are] yet another tool in our toolbox that opens up more possibilities when creating drinks,” says Adler, who also uses Soka in a carbonated tropical drink, Soka Punch, and Ayuuk in a Latte Martini.

“What Empirical is doing makes our jobs as bartenders and hospitality professionals even easier, in my opinion, as the best guest experience we can give is one where they are opened to new experiences and possibilities,” he says. “It’s an amazing synergy between distillery and consumer!”

Doritos Locos

In mid-December, Empirical yet again got viral press, perhaps more than ever before, with some 13 pages of Google News returns and well over 100 unique articles — many from major outlets. The centerpiece of this virality was Empirical’s latest spirit, a collaboration with Doritos, distilled from real nacho cheese chips.

The Washington Post declared: “Doritos nacho cheese liquor sounds like a stunt, but it’s actually good,” while USA Today more matter-of-factly noted “Doritos releases nacho cheese-flavored liquor that tastes just like the chip.”

The spirit likewise got a spot on NPR, and on the “Today” show, Hoda and Jenna cautiously sipped it.

At a certain point, even we had to cover it.

Once again, Empirical had managed to bend the press to its will, with not a single journalist so far as I can tell — save, perhaps, Hoda! — offering any sort of critical analysis of the release.

“That collaboration was about two companies who both spend a lot of time thinking about flavor getting together and creating something novel between them,” says Williams. “One of the two companies is a bit more well known, and that ubiquity was certainly the main reason for the buzz.”

But plenty of junk-food-flavored spirits collabs have been released over the last few years — Eggo Waffle liqueur, Arby’s Curly Fry Vodka, and Taco Bell Jalapeno Noir wine, to name a few.

So why did this one go so massively viral? And why did no one even question Williams’ supposed origin story for creating the spirit, in which he claims he was looking at a Doritos bag during lunch one day and “curiosity led me to turn this snack into a spirit”?

Maybe it’s because being critical does no good in the viral news industrial complex. When Mike Vacheresse of Travel Bar, a laid-back whiskey joint in the Carroll Gardens neighborhood of Brooklyn, posted on Instagram that he had bottles of the “Nacho Cheese Spirit” in stock and ready to pour, the post got more engagement than is typical for the neighborhood bar. When I posted the bottle in my Instagram stories, I likewise received dozens of responses.

Oddly, or perhaps not, literally two days after this insane 24 hours of virality came the news that Empirical had just filed for bankruptcy in Copenhagen.

Only one publication covered the story.
