Reshaping though frequently required in data analysis, so often it remains confusing even if you are frequent user ofreshape
function. Here I have provided an simple example to elaborate more on each argument ofreshape
.
Reshape data from wide to long
Your data in wide form if the multiple observations of item, place or person (i.e. units) has been recorded in single row (but in multiple column). This multiple observations may be of repeated measure type (observation are made repeatedly at different time point) or multiple characteristics of some unit (eg. height, length and width of square).
Let’s take an example of wide data of repeated measure type. Person A is visiting clinic every month for their blood pressure check, and nurse note down the value for each month in single row which belong to this specific person A. See example below,
# wide data
df <- data.frame(matrix(data = NA, nrow = 3, ncol = 5, dimnames = list(NULL, paste0(c("id","name","jan","feb","march")))))
df[1,] <- c(1,"A",123,120,125)
df[2,] <- c(2,"B",140,150,155)
df[3,] <- c(3,"C",96,86,97)
Our “wide” data look like
knitr::kable(df, caption = "wide data")
Table 1:wide data1 | A | 123 | 120 | 125 |
2 | B | 140 | 150 | 155 |
3 | C | 96 | 86 | 97 |
We want a data in which each new observation in new row but within same column i.e. long data. To achieve that we need to use standardreshape
command inR
. Thereshape
has following argument:
idvar
: unique identifier for person,place or object on which observations(measurements) are made at different time points or repeatedly. ExampleCase ID
varying
: if observation for specific individuals are made at different time points, in which columns values are recorded i.e time-varying columns. ExampleJan
,Feb
,March
timevar
: what will be the name of column once the time-varying columns above has been staked in rows. ExampleMonth
times
: what will be the values (of time) once the time-varying columns above has been staked in rows. ExampleJan
,Feb
,March
v.names
: what will be the values (of observations) once the time-varying columns above has been staked in rows. ExampleBP
direction
: data needs to converted from wide tolong
format.
df_long <- reshape(df,
idvar = "id",
#[unique identifier for person,place or object on which observations(measurments) are made at different time points or repeatedly]
varying = c("jan","feb","march"),
# [if observation for specific individulas are made at different time points, in which columns values are recorded i.e time-varying columns ]
timevar = "month",
# [what will be the name of column once the timevarying columns above has been staked in rows]
times = c("jan","feb","march"),
# [what will be the values (of time) once the timevarying columns above has been staked in rows]
v.names = "BP",
# [what will be the values (of observations) once the timevarying columns above has been staked in rows]
direction = "long")
# [we want to convert wide df into long one])
Our “long” data look like
df_long <- arrange(df_long, id)
kable(df_long,format = "pandoc", caption = "long data")
Table 2:long data1 | A | jan | 123 |
1 | A | feb | 120 |
1 | A | march | 125 |
2 | B | jan | 140 |
2 | B | feb | 150 |
2 | B | march | 155 |
3 | C | jan | 96 |
3 | C | feb | 86 |
3 | C | march | 97 |
Sometime, not only one type of measurement (BP) but also other types (such as heart rate-HR) are measured and recorded row wise. For example,
# wide data
df <- data.frame(matrix(data = NA, nrow = 3, ncol = 8, dimnames = list(NULL, paste0(c("id","name","BP_jan","BP_feb","BP_march","HR_jan","HR_feb","HR_march")))))
df[1,] <- c(1,"A",123,120,125,72,70,71)
df[2,] <- c(2,"B",140,150,155,85,82,86)
df[3,] <- c(3,"C",96,86,97,65,52,59)
kable(df,format = "pandoc", caption = "wide data- multiple category")
Table 3:wide data- multiple category1 | A | 123 | 120 | 125 | 72 | 70 | 71 |
2 | B | 140 | 150 | 155 | 85 | 82 | 86 |
3 | C | 96 | 86 | 97 | 65 | 52 | 59 |
This data can be converted into “long” by usinglist
for group of time-varying columns forvarying
df_long <- reshape(df,
idvar = "id",
varying = list(c("BP_jan","BP_feb","BP_march"),c("HR_jan","HR_feb","HR_march") ),
timevar = "month",
times = c("jan","feb","march"),
v.names = c("BP","HR"),
direction = "long")
df_long <- arrange(df_long, id)
kable(df_long,format = "pandoc", caption = "long data")
Table 4:long data1 | A | jan | 123 | 72 |
1 | A | feb | 120 | 70 |
1 | A | march | 125 | 71 |
2 | B | jan | 140 | 85 |
2 | B | feb | 150 | 82 |
2 | B | march | 155 | 86 |
3 | C | jan | 96 | 65 |
3 | C | feb | 86 | 52 |
3 | C | march | 97 | 59 |
Reshape data from long to wide
To make data “wide” from long, thereshape
function will need only two main arguments
idvar
: unique identifier of unit on which measurement are made
timevar
: which column representthe timingof the observations ( so thatreshape
function associate it with the value for given time for each ID )
If you do not specify above two arguments, function will drop an error-
Error in [.data.frame (data, , idvar) : undefined columns selected
If you read above error carefully, it already specifying which arguments were missing. Here in above case missing argument wasidvar
.
You can optionally provide,
v.names
: which column representvaluesof the observations in long data (so thatreshape
function can transform these values into rows for each ID)
sep
: column names in wide format are going to be created using value oftimesvar
andintegers
. Specify how both will be seperated in column names.
Here is the example
df_wide <- reshape(df_long,
idvar = "id",
# unique identifier
timevar = "month",
# the column represent the timing of the observations
v.names = c("BP","HR"),
# the columns represent the value of the observation (BP,HR)
direction = "wide",
sep = "_"
)
Here is the wide data
Table 5:wide data1 | 1 | A | 123 | 72 | 120 | 70 | 125 | 71 |
4 | 2 | B | 140 | 85 | 150 | 82 | 155 | 86 |
7 | 3 | C | 96 | 65 | 86 | 52 | 97 | 59 |