Writing Your Own Functions
In the field of data analysis, functions play a crucial role. This section is dedicated to exploring how to write functions in R.
The main goal of writing functions in R is to improve the efficiency, maintainability, and clarity of your code. Here are the primary objectives:
Modularity: Functions allow you to break your code into smaller, more manageable parts. Each function focuses on a specific task, which makes the code easier to understand and maintain.
Reusability: Once created, functions can be reused in different parts of your analysis or even in future projects. This reduces redundant code and promotes consistency.
Readability: Functions with clear names and proper documentation serve as self-contained units that enhance the overall readability of your code.
Efficiency: By isolating specific operations into functions, you can optimize code execution and avoid redundant calculations.
Prototype for Writing a Function
Here is the typical prototype for creating a function in R:
function_name <- function(argument1, argument2, ...) {
# Start of the function body
instruction 1 that uses (or not) argument1, argument2, or...
instruction 2 that uses (or not) argument1, argument2, or...
...
return(result)
# End of the function body
}
Explanations:
function_name: Replace this with the name you want to give to your function.
argument1, argument2, …: These are the arguments that your function accepts as input. You can have zero or more arguments. Specify them by separating them with commas. You can also give them default values (e.g., argument1=NULL, argument2=10…).
The function body: This is where you place the code that performs the desired task. This code will use the variables argument1, argument2, (…).
return(result): Use the
return()
function to send the result back to the user. The result can be a single value or a list of values. If it’s a list, you should return a list in the form:list(result1, result2...)
.
What does ‘return a result’ mean?
When a function “returns a result,” it means that the result
can be assigned to an object (below
an_object
).
an_object <- a_function_returning_a_value(arguments)
Only functions that contain a return()
in their
definition (i.e., in the function body) allow returning a result. A
function that cannot return a result returns NULL
by
default. This is the case, for example, with cat()
, which
prints a message to the screen or a file. It is not designed to return a
result (cat()
does not contain a return()
function). So, below, even though cat()
printed ‘toto’ on
the screen, the value of an_object
is
NULL
.
an_object <- cat("toto\n")
toto
print(an_object)
NULL
Example: Centering a Variable
Wikipedia: “Centering a variable involves subtracting its mean from each of its original values, that is, subtracting the mean from each data point (this is called centering). It is simply a change of origin, placing the mean of the distribution at the zero point on the x-axis. […] A centered variable has a mean of zero.”
This results in data that are independent of the chosen unit or scale. This operation is very useful in statistical analysis where you are dealing with variables that have very different scales.
We would like to create a function that allows centering a given vector. We will implement the function as follows:
centering <- function(x){return(x - mean(x))}
- Calculate the mean of x and store it in the object
mean_x_before
. - Print x.
- Use the
centering()
function to center the variablex
(overwrite the variable x). - Print x. Have the values of x changed?
- Check if the variable is centered by calculating the mean and
storing it in the object
mean_x_after
.
set.seed(123)
x <- rnorm(20, mean=10)
set.seed(123)
x <- rnorm(20, mean=10)
print(x)
mean_x_before <- mean(x)
x <- centering(x)
print(x)
mean_x_after <- mean(x)
Adding Arguments
Trimming the Mean
One issue with the mean is that it is sensitive to extreme values. We
could therefore choose to center the data by calculating a trimmed mean.
Luckily, the mean()
function in R has an argument dedicated
to this.
Below, we declare that for our centering()
function,
there is a new argument mean_trim
, and its default value
(i.e., if not specified) is 0 (we use all values to calculate
the mean).
centering <- function(x, mean_trim=0){return(x - mean(x, trim=mean_trim))}
- Calculate the mean of x and store it in the object
mean_x_before
. - Print x.
- Use the
centering()
function withmean_trim=0.2
to center the variablex
(overwrite the variable x). - Print x. Have the values of x changed?
- Check if the variable is centered by calculating the mean and
storing it in the object
mean_x_after
.
set.seed(123)
x <- c(rnorm(20, mean=10), 1000)
set.seed(123)
x <- c(rnorm(20, mean=10), 1000)
print(x)
mean_x_before <- mean(x)
x <- centering(x, mean_trim=0.2)
print(x)
mean_x_after <- mean(x)
Testing Argument Values
Although we can center by using the mean or the trimmed mean, it is possible to center a variable by subtracting the median. Indeed, the median is less sensitive to extreme values, and can therefore be considered a robust estimator of position.
We could thus add an argument centering_method
that
would take the values “mean” or “median”. If “mean” is chosen, the
mean()
function is called, and the mean is subtracted.
Otherwise, the median()
function is called, and the median
is subtracted.
One problem is that we need to ensure the user chooses correctly
between “mean” or “median”. In R, the principle is to directly offer the
possible values in the function declaration and retrieve the user’s
selected value using match.arg()
. Additionally,
match.arg()
will raise an error if the value of
centering_method
is neither “mean” nor “median”.
Furthermore, if the user does not specify centering_method
,
the first value in the vector c("mean", "median")
will be
used by default.
centering <- function(x,
centering_method=c("mean", "median"),
mean_trim=0
){
centering_method <- match.arg(centering_method)
if(centering_method == "mean"){
return(x - mean(x, trim=mean_trim))
}else{
return(x - median(x))
}
}
Exercises
Standardized Variable
Wikipedia: In probability theory and statistics, a standardized variable is the transformed version of a random variable by an affine transformation, such that its mean is zero and its standard deviation is one.
The general method for calculating for a vector v consists of: (i) calculating the mean of v, (ii) subtracting the mean from each element of v, then, after subtracting the mean, (iii) calculating the standard deviation, and (iv) dividing each of the centered values by the standard deviation.
In Anglo-Saxon countries, the term standardization is commonly used, and in some contexts, this transformation is referred to as the z-score.
\[z = \frac{x - \bar{x}}{\sigma}\]
With \(x\) as the vector, \(\bar{x} = \text{average}(x)\) is the mean of the vector, and \(\sigma\) is the standard deviation.
A standardized variable has:
- A mean of zero;
- A variance of 1;
- A standard deviation of 1.
Thus, we obtain:
- Data independent of the chosen unit or scale;
- Variables with the same mean and dispersion.
This makes it easier to compare variations. Standardizing variables is very useful in data analysis. It is equivalent to a change of unit and has no impact on variation profiles. The correlation coefficients between standardized variables remain the same as they were before the centering and scaling operation.
Implement a standardization()
function that
centers and scales a variable.
- The function should have a
centering_method
argument (see previous exercises). - The function should have a
disp_method
argument. This argument will control how the dispersion is calculated: “sd” for standard deviation (usingsd()
), “mad” for median absolute deviation (usingmad()
), or “iqr” for interquartile range (usingIQR()
).
# To be modified and completed
standardization <- function(x,
centering_method=c("mean", "median"),
mean_trim=0
){
centering_method <- match.arg(centering_method)
if(centering_method == "mean"){
return(x - mean(x, trim=mean_trim))
}else{
return(x - median(x))
}
}
standardization <- function(x,
centering_method=c("mean", "median"),
disp_method=c("sd", "mad", "iqr"),
mean_trim=0
){
centering_method <- match.arg(centering_method)
disp_method <- match.arg(disp_method)
if(centering_method == "mean"){
x <- x - mean(x, trim=mean_trim)
}else{
x <- x - median(x)
}
if(disp_method == "sd"){
x <- x / sd(x)
}else if(disp_method == "mad") {
x <- x / mad(x)
}else{
x <- x / IQR(x)
}
return(x)
}
- Given the
standardization()
function developed earlier, evaluate its effect on the columns of theiris
matrix using theboxplot
function. Test several parameters for thestandardization()
function.
data(iris)
iris <- iris[, 1:4]
boxplot(iris, horizontal=TRUE, las=1)
iris_std <- apply(iris, 2, standardization, centering_method="mean", disp_method="sd")
boxplot(iris_std, horizontal=TRUE, las=1)
par(mfrow=c(2,3))
data(iris)
iris <- iris[, 1:4]
boxplot(iris, horizontal=TRUE, las=1)
iris_std <- apply(iris, 2, standardization, centering_method="mean", disp_method="sd")
boxplot(iris_std, horizontal=TRUE, las=1, main="mean(x) / sd(x)")
iris_std <- apply(iris, 2, standardization, centering_method="mean", disp_method="mad")
boxplot(iris_std, horizontal=TRUE, las=1, main="mean(x) / mad(x)")
iris_std <- apply(iris, 2, standardization, centering_method="mean", disp_method="iqr")
boxplot(iris_std, horizontal=TRUE, las=1, main="mean(x) / iqr(x)")
iris_std <- apply(iris, 2, standardization, centering_method="median", disp_method="mad")
boxplot(iris_std, horizontal=TRUE, las=1, main="median(x) / mad(x)")
iris_std <- apply(iris, 2, standardization, centering_method="median", disp_method="iqr")
boxplot(iris_std, horizontal=TRUE, las=1, main="median(x) / iqr(x)")
End of the section
Thank you for following this tutorial.