STAT 4380 Final Project

Timeline

Project Brainstorm due Thursday, February 5, 11:59pm

Choose project teams by Tuesday, February 10, 4pm

Proposal due Saturday, February 21, 11:59pm

Data cleaning & preliminary EDA due Thursday, March 12, 11:59pm

Preliminary Data Story due Tuesday, April 7, 11:59pm

Presentation rough draft due Thursday, April 16, 11:59pm

Written report rough draft due Tuesday, April 21, 4pm

Final presentations: April 28 & 30 (class time) + Monday, May 11, 2:30 - 5pm

Final written report due Monday, May 4, 11:59pm

Introduction & grading summary

TL;DR: Analyze data for social good. That is your final project.

The purpose of the final project is to apply what you’ve learned throughout the semester to investigate an interesting data-driven question about a social issue you care about.

The project will be completed in self-assigned teams of 2. You should choose a dataset for your project based on your group’s interests. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class (and beyond, if you like!) and apply them to a dataset to analyze it in a meaningful way.

Choosing a topic (Project Brainstorm)

Below are some reflection questions to help you identify a topic you might want to explore.

  • What are the biggest challenges you and your peers are facing in your lives?
  • What needs are your neighbors facing? What barriers prevent them from flourishing? Consider physical, financial, educational, health, nutritional, transportation, legal, relational, employment, and spiritual needs. Consider the Greater Philadelphia area, Pennsylvania, the U.S., your hometown, and/or another place of significance to you.
  • What’s broken in the world that you would like to see healed?
  • What social or political issue do you want to understand better?
  • What is a cause you feel inspired to volunteer your time for or donate money towards?
  • What’s one dream you have for contributing good to the world?

You will submit three potential topics along with a brief reflection on Blackboard by Thursday, February 5. This can help you identify areas of common interest when forming your project team.

Logistics

You should sign up for a team of 2 on Blackboard no later than Tuesday, February 10.

The three primary deliverables for the final project are

  • A written, reproducible report detailing your analysis
  • An RStudio project repository corresponding to your report
  • An oral presentation

Grading summary

The grade breakdown is as follows:

Total 100 pts
Project brainstorm 1 pt
Project proposal 3 pts
Preliminary EDA & Cleaning 3 pts
Preliminary Data Story 3 pts
Presentation Rough Draft 3 pts
Written Rough Draft 3 pts
Peer feedback 5 pts
Written report 35 pts
Project repo & reproducibility 15 pts
Oral presentation 30 pts

Note: No late projects are accepted.

Data sources

In order for you to have the greatest chance of success with this project it is important that you choose a manageable dataset. This means that the data should be readily accessible and large enough that multiple relationships can be explored. Your dataset must have at least 200 observations and at least 8 variables. At least 6 of the variables must be useful and unique explanatory variables.

  • Identifier variables such as “name”, “social security number”, etc. are not useful explanatory variables.
  • If you have multiple columns with the same information (e.g. “state abbreviation” and “state name”), then they are not unique explanatory variables.

If you set your hearts on a dataset that has fewer observations or variables than what’s suggested here, that might still be ok; use these numbers as guidance for a successful proposal, not as minimum requirements.

Data sets that can’t be used:

  • Data sets that have been used for class examples or assignments.
  • Data sets analyzed in another course.

No two groups can analyze the same dataset, so I encourage you to be creative!

Some resources that may be helpful:

All analyses must be done in RStudio, and your final written report and analysis must be reproducible. This means that you must create a Quarto document, attached to an RStudio project repository, that reproduces your written report exactly when rendered.

Project proposal

There are two main purposes of the project proposal:

  • To help you think about the project early, so you can get a head start on finding data, reading relevant literature, thinking about the questions you wish to answer, etc.
  • To ensure that the data you wish to analyze, methods you plan to use, and the scope of your analysis are feasible and will help you be successful for this project.

Choose three substantially different datasets you are interested in analyzing. For each dataset, include the following 3 sections in the proposal:

Data description & background

  • Identify the source of the data (including URL to where you accessed it) and who originally collected or curated it
  • Answer the three Ws (Table 2.2 of Communicating Data)
    • Who: Who are the subjects of the study?
    • When: When were the data collected, e.g. when did the subjects participate in the study?
    • Where: Where were the data collected, e.g., where were the subjects under study located?
  • Provide a brief description of what it contains:
    • How many observations
    • How many variables
    • Summary of what information the variables provide about the observational units

Research questions

  • Describe the research topic along with 2-3 research questions that could be explored with the data
    • make sure to indicate which variable(s) in the data you could use as response variables and explanatory variables to answer the research question

Data glimpse

  • Use the glimpse function to provide a glimpse of the data set.

  • Place the file containing your data in the data folder of the project repo.
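
As a minimal sketch of the glimpse step, assuming the dplyr package is installed: the built-in airquality data set stands in for your own data here, and the object name proposal_data is purely illustrative. In your proposal you would instead read your file from the data folder.

```r
# Load dplyr, which provides glimpse()
library(dplyr)

# airquality ships with R and is used here only as a stand-in data set
proposal_data <- airquality

# Print one row per variable: its name, type, and first few values
glimpse(proposal_data)

# Number of observations and variables, for the data description section
n_obs <- nrow(proposal_data)
n_vars <- ncol(proposal_data)
```

The two counts at the end can be quoted directly in the "Data description & background" section.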

Submission

Submit the PDF of your proposal on Blackboard by 11:59pm on Saturday, February 21. I will provide feedback on your proposal to help you determine a data set to use for the project.

Notes

  • Project proposals should have no more than 1-2 pages of text (not including the output from glimpse). That is, be concise!
  • You must use one of the data sets in the proposal for the final project, unless instructed otherwise by Dr. Fitz.

Proposal grading

The project proposal will be graded as follows:

Total 3 pts
Data description/background 1 pt
Research questions 1 pt
Data glimpse 1 pt

Data cleaning & Preliminary EDA

The purpose of this step is to help scaffold steady progress and give you an opportunity to get early feedback on your analysis plan. The cleaning-eda.qmd file in your RStudio project should include reproducible data cleaning steps that begin with reading in your original data file and end with outputting the data file to be used for analysis. You should also include preliminary exploratory data analysis and a brief description of your analysis plan. You should submit a zip file with the .Rproj file, the .qmd file, and any relevant data or other supplementary files required to Render your Quarto report.

At the top of your document, you should copy and paste the relevant information from your proposal for the final dataset you chose. Additionally, provide the following information about Data Provenance (see Section 3.1 of book):

  • What procedures were used to select subjects and collect measurements?
  • Link to the codebook / documentation

Your cleaning & EDA steps should include:

  • Reading in raw data

  • Cleaning variable names

  • Visualizing each variable individually (or at least 10 identified for analysis if you have a large number), noting any issues or notable features

  • Converting to appropriate data types (factors, dates, etc)

  • Re-ordering and/or re-labeling factor levels as needed

  • Investigating and handling missing data - report how much is missing for each variable and how you intend to handle it

  • Documenting any cases excluded from the analyses (including filtering code and justification)

  • Joining multiple datasets as needed

  • Describing your analysis plan for answering your research questions

    • What are your primary outcomes of interest? What explanatory variables will you use to understand these outcomes?

    • What types of visualizations or summary tables will you create?

    • What types of models or inference procedures will you run?

  • You should save the clean data as a .RDS file.
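
The steps above can be sketched in base R as follows. This is only an illustration under stated assumptions: the built-in airquality data stands in for your raw file, the object names (raw, clean, missing_counts, out_path) are hypothetical, and the output goes to a temporary path rather than your data folder.

```r
# airquality ships with R; it stands in for a raw file you would read
# from your data folder (for example with read.csv())
raw <- airquality

# Clean variable names: lower case, non-alphanumeric runs replaced by "_"
clean <- raw
names(clean) <- tolower(gsub("[^A-Za-z0-9]+", "_", names(clean)))

# Convert to appropriate data types: month becomes a labeled factor
clean$month <- factor(clean$month, levels = 5:9,
                      labels = c("May", "Jun", "Jul", "Aug", "Sep"))

# Report how much is missing for each variable
missing_counts <- colSums(is.na(clean))
print(missing_counts)

# Document excluded cases: here, rows with a missing ozone reading
n_excluded <- sum(is.na(clean$ozone))
clean <- clean[!is.na(clean$ozone), ]

# Save the clean data as an .RDS file (temporary path used in this sketch;
# in your project this would point into the data folder)
out_path <- file.path(tempdir(), "airquality-clean.rds")
saveRDS(clean, out_path)

# Read the file back to confirm the saved data matches what was written
reloaded <- readRDS(out_path)
```

Dropping rows is only one way to handle missing values; whichever approach you choose, the filtering code and its justification belong in cleaning-eda.qmd.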

Your zip file submitted on Blackboard should contain:

  • .Rproj file
  • cleaning-eda.qmd
  • data folder with original and clean data
  • proposal files
  • any other supplementary files needed to Render your .qmd reproducibly

This part of the project will be graded as follows:

Total 3 pts
Original & cleaned datasets included 1 pt
Reproducible data cleaning 0.5 pts
EDA 1 pt
Future analysis plan 0.5 pt

Preliminary Data Story

This portion of the project loosely follows the “storyboarding” process described in Chapter 6 of Communicating with Data. Detailed instructions regarding the process and deliverables will be provided in class.

Rough Draft report

The purpose of the rough draft and peer review is to give you an opportunity to get feedback on your analysis before the final product.

Your team will write the rough draft in a written-report.qmd file in your project repo and upload the PDF to the shared Google Drive (link provided at the bottom of the Blackboard homepage).

See “Written Report” section below for a description of each of the sections expected in your report.

As you work on the draft, the focus should be on the analysis and less on perfecting the presentation of the final report. Your draft must include a reasonable attempt at each analysis component - exploratory data analysis, inference or modeling, and deriving initial results and conclusions.

This part of the project will be graded as follows:

Total 5 pts
Introduction 1 pt
Methodology 1 pt
Results 2 pts
Neatness & Organization 1 pt

Peer review

Critically reviewing others’ work is a crucial part of the scientific process, and giving constructive feedback is an important skill that must be practiced. The process can enhance your ability to self-assess and improve your own work as well.

You will be assigned a team to review. Time will be spent on peer review in class in Week 14, and your team will have until Saturday in Week 14 to provide a detailed critique about the written report and data analysis. This feedback is intended to help you create a high quality final project, as well as give you experience reading and constructively critiquing the work of others.

Peer feedback will be graded on the extent to which it comprehensively and constructively addresses the components of the partner team’s report: the research context and motivation, exploratory data analysis, and any inference, modeling, or conclusions.

You will also engage in (ungraded) peer review of poster drafts and oral presentation materials in class in Week 14, in preparation for the final presentations during Week 15 & finals week.

Written report

Your final report must be written using Quarto. All team members must contribute meaningfully to the analysis and are responsible for what’s contained in the final report. Before you finalize your report, make sure the printing of code chunks is turned off with the option #| echo: false.

Submit the final report on Blackboard under the Final Report & Repo assignment. The mandatory components of the report are below. You are free to add additional sections as necessary. The report, including visualizations, should be no more than 10 pages long. There is no minimum page requirement; however, you must comprehensively address all of the aspects mentioned below.

The written report is worth 45 points, broken down as follows:

Total 45 pts
Introduction & Research Questions 5 pts
Methodology 10 pts
Results 15 pts
Discussion 10 pts
Formatting 5 pts

Introduction (AKA Background & Significance)

The introduction provides motivation and context for your research. Describe your topic (citing sources) and provide a concise, clear statement of your research question and hypotheses.

In this section you are providing the background of the research area and arguing why it is interesting and significant. This section relies heavily on literature review (prior research done in this area and facts that argue why the research is important). This whole section should provide the necessary background leading up to a presentation (in the last few sentences of this section) of the research questions that you will be investigating in your analysis. Well-accepted facts and/or referenced statements should serve as the majority of content of this section. Typically, the background and significance section starts very broad and moves towards the specific area/hypotheses you are testing.

Assessment:

  • Does the background and significance have a logical organization? Does it move from the general to the specific?
  • Has sufficient background been provided to understand the paper? How does this work relate to what else is known about this topic?
  • Has a reasonable explanation been given for why the analysis was done? Why is the work important? Why is it relevant?
  • Does this section end with statements about the research questions/goals of the paper?

Methodology

Data collection: Identify the source of the data, when and how it was originally collected, and how you obtained it. State what the observational units are. Additionally, you should provide information on the units that were included to assess representativeness. Non-response rates and other relevant data collection details should be mentioned here if they are an issue. However, you should not discuss the impact of these issues here—save that for the limitations section.

Variable description / creation: Detail the variables in your analysis and how they are defined (if necessary). If you created a combined or transformed variable you should describe how. This section should also include visualizations and summary statistics of key variables relevant to your research question.

Analytic Methods: Explain the statistical procedures that will be used to analyze your data. E.g. Boxplots are used to illustrate differences in GPA across gender and class standing. Correlations are used to assess the impacts of gender and class standing on GPA.

Assessment:

  • Could the analysis be repeated based on the information given here? Is the material organized into logical categories (like the ones above)?

Results

Showcase how you arrived at answers to your research question using the techniques we have learned in class (and beyond, if you’re feeling adventurous).

Provide only the main results from your analysis. The goal is not to do an exhaustive data analysis (calculate every possible statistic and perform every possible procedure for all variables). Rather, you should demonstrate that you are proficient at asking meaningful questions and answering them using data, that you are skilled in interpreting and presenting results, and that you can accomplish these tasks using R. More is not necessarily better.

Typically, results sections start with descriptive statistics, e.g. what percent of the sample is male/female, what is the mean GPA overall, in the different groups, etc. Figures can be nice to illustrate these differences! However, information presented must be relevant in helping to answer the research question(s) of interest. Typically, inferential (e.g. hypothesis tests, confidence intervals) statistics come next. Tables can often be helpful for results from multiple regression. Do not give computer output here! This should look like a peer-reviewed journal article results section. Tables and figures should be labeled, embedded in the text, and referenced appropriately. The results section typically makes for fairly dry reading. It does not explain the impact of findings, it merely highlights and reports statistical information.
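
As a rough illustration of the "descriptive first, inferential second" ordering, here is a base R sketch. It assumes the built-in mtcars data as a stand-in and compares fuel economy between transmission types; the object names are illustrative only, and the raw output would still need to be turned into labeled tables or figures before going into a report.

```r
# mtcars ships with R; am is 0 for automatic and 1 for manual transmission
cars <- mtcars
cars$transmission <- factor(cars$am, levels = c(0, 1),
                            labels = c("Automatic", "Manual"))

# Descriptive statistics: share of each group and mean fuel economy by group
group_share <- prop.table(table(cars$transmission))
group_means <- tapply(cars$mpg, cars$transmission, mean)

# Inferential statistics: Welch two-sample t-test, which also returns
# a 95% confidence interval for the difference in group means
test_result <- t.test(mpg ~ transmission, data = cars)
ci <- test_result$conf.int
```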

Assessment:

  • Is the content appropriate for a results section? Is there a clear description of the results?
  • Are the results/data analyzed well? Given the data in each figure/table is the interpretation accurate and logical? Is the analysis of the data thorough (anything ignored?)
  • Are the figures/tables appropriate for the data being discussed? Are the figure legends and titles clear and concise?

Discussion

This section is a conclusion and discussion. This will require a summary of what you have learned about your research question along with statistical arguments supporting your conclusions. Restate your objective and draw connections between your analyses and objective. In other words, how did (or didn’t) you answer/address your objective? Place all of this in the larger scope of previous research on your topic (i.e. what you found in the literature review): how do your findings help the field move forward? You should critique your own methods and discuss the limitations of your findings. Any potential issues pertaining to the reliability and validity of your data and the appropriateness of the statistical analyses should also be discussed. Provide a brief paragraph with suggestions for future research to better investigate your research question.

Assessment:

  • Do the authors clearly state whether the results answer the question (e.g. support or disprove the hypothesis)?
  • Were specific data cited from the results to support each interpretation? Does the author clearly articulate the basis for each conclusion they draw?
  • Does the author adequately relate the results of the current work to what is previously known about the topic?

Formatting

This is an assessment of the overall presentation and formatting of the written report.

Slides + Oral Presentation

Slides

In addition to the write-up, your team must also create slides to summarize and showcase your project. Introduce your research question and dataset, showcase visualizations, and provide some conclusions. These slides should serve as a visual “elevator pitch” to accompany your write-up and will be graded for content and quality. The slides are due on Blackboard no later than the assigned time of your presentation.

Here is a suggested outline as you think through what should be included in your slides; you do not have to use these exact categories:

  • Title / Catchy summary statement of what you found
  • Background / motivation
  • Research questions investigated
  • The data
  • (Visual) highlights from EDA (if it contributes to your overall story)
  • (Visual) highlights of inference / modeling results
  • Conclusions + future work

You will be expected to design your slides using effective communication best practices learned in class.

Oral presentation

You will sign up for a time slot to present during Week 15 or Finals Week. Details will be provided during the semester.

Project repository

All written work (with the exception of slides) should be reproducible, and the RStudio project repo should be neatly organized and submitted as a .zip file.

The repo should have the following structure:

  • README: Short project description and data dictionary
  • .Rproj file
  • written-report.qmd & written-report.pdf
  • presentation.pptx, presentation.pdf, or other similar slide format
  • project-proposal.qmd & project-proposal.pdf
  • cleaning-eda.qmd & cleaning-eda.html
  • /data: Folder that contains the data set for the final project.

Points for reproducibility + organization will be based on the reproducibility of the written report and the organization of the project repo. The repo should be neatly organized as described above, with no extraneous files, and all text in the README should be easily readable.

Peer teamwork evaluation

You will be asked to fill out a survey in which you provide feedback on your team dynamic and your team members’ contributions, and self-assess your own contributions.

Grading details

Grading of the project will take into account the following:

  • Content - What is the quality of research and/or policy question and relevancy of data to those questions?
  • Correctness - Are statistical procedures carried out and explained correctly?
  • Writing and Presentation - What is the quality of the statistical presentation, writing, and explanations?
  • Creativity and Critical Thought - Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project?

A general breakdown of scoring is as follows:

  • 90%-100%: Outstanding effort. Students understand how to apply all statistical concepts, can put the results into a cogent argument, can identify weaknesses in the argument, and can clearly communicate the results to others.
  • 80%-89%: Good effort. Students understand most of the concepts, put together an adequate argument, identify some weaknesses of their argument, and communicate most results clearly to others.
  • 70%-79%: Passing effort. Students have misunderstanding of concepts in several areas, have some trouble putting results together in a cogent argument, and communication of results is sometimes unclear.
  • 60%-69%: Struggling effort. Students are making some effort, but have misunderstanding of many concepts and are unable to put together a cogent argument. Communication of results is unclear.
  • Below 60%: Students are not making a sufficient effort.

Late work policy

There is no late work accepted on this project. Be sure to turn in your work early to avoid any technological mishaps.

Additional notes and tips

The project is very open ended. For instance, in creating a compelling visualization(s) of your data in R, there is no limit on what tools or packages you may use. You do not need to visualize all of the data at once. A single high quality visualization will receive a much higher grade than a large number of poor quality visualizations.

Tips

  • Ask questions if any of the expectations are unclear.

  • Code: In your write up your code should be hidden (echo = FALSE) so that your document is neat and easy to read. However, your document should include all your code such that if I re-render your .qmd file I should be able to obtain the results you presented.

    • Exception: If you want to highlight something specific about a piece of code, you’re welcome to show that portion.
  • Make sure each team member is contributing, both in terms of quality and quantity of contribution.

  • All team members are expected to contribute equally to the completion of this assignment, and group assessments will be given at its completion; anyone judged not to have contributed sufficiently to the final product will have their grade adjusted accordingly. While different team members may have different backgrounds and strengths, it is the responsibility of every team member to understand how and why all code and approaches in the assignment work.

  • Finally, pay attention to details in your write-up and presentation. Neatness, coherency, and clarity will count.

Formatting + communication

Suppress Code, Warnings, & Messages

  • Include the following code in a code chunk at the top of your .qmd file to suppress all code, warnings, and other messages. Use the code chunk header {r set-up, include = FALSE} to suppress this set up code.
knitr::opts_chunk$set(echo = FALSE,
                      warning = FALSE, 
                      message = FALSE)

Headers

  • Use headers to clearly label each section. Make sure there is a space between the last # and the title, so the header renders correctly. For example, ###Section Title will not render as a header, but ### Section Title will.

References

  • Include all references in a section called “References” at the end of the report.
  • This course does not have specific requirements for formatting citations and references.
  • See Section 4.5 of the R Markdown Cookbook to learn about the citation functionality in R Markdown and Quarto.

Appendix

  • If you have additional work that does not fit or does not belong in the body of the report, you may put it at the end of the document in a section called “Appendix”.
  • The items in the appendix should be properly labeled.
  • The appendix should only be for additional material. The reader should be able to fully understand your report without viewing content in the appendix.

Resize figures

  • Resize plots and figures, so you have more space for the narrative.
    • Resize individual figures: Use the code chunk header {r plot1, fig.height = 3, fig.width = 5}, replacing plot1 with a meaningful label and the height and width with values appropriate for your write up.
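    • Resize all figures: Include the fig-width and fig-height options under format in your YAML header as shown below (Quarto reads the format key; the R Markdown output: pdf_document key is not used when rendering a .qmd file):
---
title: "Your Title"
author: "Team Name + Group Members"
format:
  pdf:
    fig-width: 5
    fig-height: 3
---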

Replace the height and width values with values appropriate for your write up.

Arranging plots

Arrange plots in a grid, instead of one after the other. This is especially useful when displaying plots for exploratory data analysis and to check assumptions.

  • If you’re using ggplot2 functions, the patchwork package makes it easy to arrange plots in a grid. See the patchwork package documentation for examples.

  • If you’re using base R plotting functions (e.g. the emplogit functions), put the code par(mfrow = c(rows,columns)) before the code that makes the plots. For example, par(mfrow = c(2,3)) will arrange plots in a grid with 2 rows and 3 columns.
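
A small base R sketch of the par(mfrow = ...) approach, assuming the built-in mtcars data as example data. The axis labels are illustrative descriptions of the mtcars columns, and old_par and grid_dims are hypothetical names used only here.

```r
# Descriptive labels for six mtcars variables (example data only)
var_labels <- c(mpg  = "Fuel economy (miles per gallon)",
                hp   = "Horsepower",
                wt   = "Weight (1000 lbs)",
                disp = "Displacement (cubic inches)",
                qsec = "Quarter-mile time (seconds)",
                drat = "Rear axle ratio")

# Set a grid of 2 rows and 3 columns; par() returns the previous settings
old_par <- par(mfrow = c(2, 3))
grid_dims <- par("mfrow")

# Each histogram fills the next cell of the grid, row by row
for (v in names(var_labels)) {
  hist(mtcars[[v]], main = var_labels[[v]], xlab = var_labels[[v]])
}

# Restore the previous layout so later plots are not affected
par(old_par)
```

Restoring the old settings at the end keeps the grid from leaking into plots drawn later in the same session.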

Plot titles and axis labels

Be sure all plot titles and axis labels are visible and easy to read.

  • Use informative titles, not variable names, for titles and axis labels.
  • Use coord_flip() to flip the x and y axes on the plot. This is useful if you have a bar plot with an x-axis that is difficult to read due to overlapping text.
  • Put in the extra effort to make your plots look more professional.
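
For instance, a bar plot with many categories might be handled as in the sketch below. It assumes the ggplot2 package is installed and uses the mpg data set bundled with ggplot2 purely as example data; the title and axis labels are illustrative.

```r
library(ggplot2)

# mpg ships with ggplot2; it has one row per car model
car_plot <- ggplot(mpg, aes(x = manufacturer)) +
  geom_bar() +
  # Flip the axes so the manufacturer names are listed down the side
  coord_flip() +
  # Informative title and axis labels instead of raw variable names
  labs(title = "Number of car models by manufacturer",
       x = "Manufacturer",
       y = "Number of models")

car_plot
```

Note that labs() is given the labels in terms of the original aesthetics: x still refers to manufacturer even though it is drawn on the vertical axis after the flip.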

Tables and model output

  • Use the kable function from the knitr package to neatly output all tables and model output. This will also ensure all model coefficients are displayed.
    • Use the digits argument to round numeric columns to 3 or 4 decimal places.
    • Use the caption argument to add captions to your table.
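
A minimal sketch of a model table, assuming the knitr package is installed and using the built-in mtcars data as a stand-in. The model, caption, and object names are illustrative only. The digits value is passed to round(), so it controls the number of decimal places.

```r
library(knitr)

# Fit a simple linear model on the built-in mtcars data (example data only)
fit <- lm(mpg ~ hp + wt, data = mtcars)

# Extract the coefficient table: estimate, std. error, t value, p-value
coef_table <- summary(fit)$coefficients

# Render the table with rounded values and a caption
model_table <- kable(coef_table,
                     digits = 3,
                     caption = "Linear model of fuel economy on horsepower and weight")
model_table
```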

Guidelines for communicating results

  • Don’t use variable names in your narrative! Use descriptive terms, so the reader understands your narrative without relying on the codebook.
    • ❌ There is a negative linear relationship between mpg and hp.
    • ✅ There is a negative linear relationship between a car’s fuel economy (in miles per gallon) and its horsepower.
  • Know your audience: Your report should be written for a general audience who has an understanding of statistics at the level of STAT 4380.
  • Avoid subject matter jargon: Don’t assume the audience knows all of the specific terminology related to your subject area. If you must use jargon, include a brief definition the first time you introduce a term.
  • Tell the “so what”: Your report and presentation should be more than a list of interpretations and technical definitions. Focus on what the results mean, i.e. what you want the audience to know about your topic after reading your report or viewing your presentation.
  • Tell a story: All visualizations, tables, model output, and narrative should tell a cohesive story!
  • Use one voice: Though multiple people are writing the report, it should read as if it’s from a single author. At least one team member should read through the report before submission to ensure it reads like a cohesive document.

Additional resources