Stats learning Week 2

Linxichen
Mar 31, 2021


In week 2, we learned about several useful R packages, for example ggplot2, dplyr, readr, stringr, janitor, forcats, and the tidyverse, which bundles several of them. These packages are very useful for data wrangling and data visualization. One of the important points in this week’s study is the principle of tidy data, which has the following three rules:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.
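A minimal sketch of what these rules mean in practice, assuming the tidyr package is installed. The table, its column names, and its numbers are made up for illustration. The wide table stores values of a "year" variable in its column headers, so it breaks rule 1; pivot_longer() reshapes it so that each variable has its own column.

```r
library(tidyr)

# Hypothetical wide table: the headers "2019" and "2020" are really values
# of a "year" variable, so the table is not tidy.
wide <- data.frame(
  country = c("A", "B"),
  `2019`  = c(10, 20),
  `2020`  = c(11, 22),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Move the year headers into a "year" column and the counts into "cases".
tidy <- pivot_longer(
  wide,
  cols      = c("2019", "2020"),
  names_to  = "year",
  values_to = "cases"
)

print(tidy)
```

After reshaping, every row is one observation (one country in one year) and every cell holds a single value.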

The following picture makes the rules of tidy data easier to understand.

In data wrangling, we transform the original data frames into other forms that make it easier to answer particular questions; the package “dplyr” is generally used for this. In data visualization, the package “ggplot2” is generally used.

The pipe operator (%>%) takes the output of one function and passes it to the next function as its first argument, which allows us to link a sequence of analysis steps.

In class, we also learned some useful functions for the data wrangling process, for example mutate(), arrange(), ifelse(), and filter(), which help us prepare data for plotting graphs.
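Here is a small sketch of how these functions can be chained with the pipe, assuming dplyr is installed. The scores data frame, the 50-point pass mark, and the column names are hypothetical and not from the class material.

```r
library(dplyr)

# Hypothetical data: four students and their scores.
scores <- data.frame(
  student = c("Ann", "Ben", "Cai", "Dee"),
  score   = c(82, 47, 65, 91),
  stringsAsFactors = FALSE
)

result <- scores %>%
  mutate(passed = ifelse(score >= 50, "yes", "no")) %>%  # add a new column
  filter(passed == "yes") %>%                            # keep passing rows
  arrange(desc(score))                                   # sort high to low

print(result)
```

Each step receives the data frame produced by the step before it, so the code reads top to bottom in the same order as the analysis.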

In this week’s reading, we read an article about misconceptions commonly encountered when learning statistics. The article, “Common misconceptions about data analysis and statistics” by Harvey J. Motulsky, showed me several common misunderstandings in working with data and statistics and gave me deeper insight into doing experiments and analyzing data.

The first misconception mentioned in the article is about p-hacking, which made a deep impression on me because, in a hypothesis test, the P-value is an important indicator for deciding whether or not to reject the null hypothesis. P-hacking occurs when researchers carry out several experiments and then selectively report those that produce the significant results they expected. To reduce p-hacking, Motulsky advises us to state clearly whether or not the methods we used to analyze the data were those planned in the experimental protocol, and to label the conclusions as “preliminary” when any form of p-hacking was used.
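To see why selective reporting is a problem, here is a rough calculation of my own (it is not taken from the article). It assumes every experiment is independent, every null hypothesis is actually true, and the significance level is 0.05. Under those assumptions, the chance that at least one of k experiments looks "significant" by luck alone is 1 - (1 - 0.05)^k.

```r
# Assumptions: independent tests, all null hypotheses true, alpha = 0.05.
alpha <- 0.05
k     <- c(1, 5, 20)          # number of experiments tried

# Probability that at least one test gives P < alpha purely by chance.
p_any <- 1 - (1 - alpha)^k

print(round(p_any, 2))        # 0.05 0.23 0.64
```

So with 20 tries, a "significant" result is more likely than not even when there is no real effect, which is why reporting only the significant ones is misleading.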

The second misconception that impressed me is the belief that the standard error of the mean quantifies variability. The standard error of the mean is a term I used frequently when writing reports in course MAT302; it is a crucial indicator of the precision of the estimate of the population mean. As Motulsky states in the article, although the standard deviation and the standard error of the mean sound similar, they are totally different things: the standard deviation describes the scatter of the data, while the standard error of the mean describes how precisely the mean is known. Sometimes it is better to use graphs and plots to show the variability of the raw data and to use confidence intervals to convey the precision of the estimates.
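The difference can be checked with a short simulation in base R. This is only a sketch with made-up settings (a population with mean 100 and standard deviation 15, and sample sizes of 25 and 2500); sem() is a helper I define here, not a built-in function.

```r
set.seed(1)

# Two samples from the same hypothetical population, differing only in size.
small <- rnorm(25,   mean = 100, sd = 15)
large <- rnorm(2500, mean = 100, sd = 15)

# Standard error of the mean: SD divided by the square root of n.
sem <- function(x) sd(x) / sqrt(length(x))

# The SD estimates the scatter of the data, so it stays near 15 for both.
print(c(sd_small = sd(small), sd_large = sd(large)))

# The SEM shrinks as n grows, because the mean is known more precisely.
print(c(sem_small = sem(small), sem_large = sem(large)))

# A 95% confidence interval for the mean of the small sample.
ci_small <- mean(small) + c(-1, 1) * qt(0.975, df = length(small) - 1) * sem(small)
print(ci_small)
```

A small SEM therefore says nothing about how spread out the raw data are; it mostly reflects how many observations were collected.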

This article provided me with many useful suggestions and reminded me to watch out for these misconceptions when carrying out an experiment and analyzing data.
