1-04: Vectors and Data Frames

1 Purpose

  • Introducing vectors

  • Opening a CSV file and saving data to a data frame

  • Pulling columns from a data frame

  • Pulling individual values from a column

  • Pulling multiple values for a column

 

If you have any questions about the material in this lesson, feel free to email them to the instructor, Charlie Belinsky, at belinsky@msu.edu.

1.1 Files for this lesson

From this point on most lessons will have script and/or data files to download.

 

The script for this lesson is here

  • Save the linked script file to the scripts folder in your Class Project
  • The script file contains the code in this lesson in the order it appears in this lesson. Source the script to see the results from this lesson.

 

The data used in this lesson, twoWeekWeatherData.csv

  • Save the linked data file to the data folder in your Class Project

    • If you save the data file to the wrong folder, the script file will not be able to read it.
  • Trap: Opening CSV files in Excel

2 Standalone scripts vs. scripts within a Project

In the last two lessons, everything we did was contained in one script file – no outside file was needed.  The scripts in those lessons were standalone scripts and could be executed inside or outside your RStudio Project with the same results.

 

Standalone scripts are rare and are generally only created for testing purposes.

 

The script for this lesson will not execute correctly if it is treated as a standalone script (e.g., it is executed outside of the RStudio Project). This is because the script reads a data file and needs to know where to find that file.

2.1 Set Working Directory (a cautionary statement)

Many people still treat all R scripts as standalone scripts (i.e., they do not use RStudio Projects). This causes all sorts of problems when script files look for other files.  The old way to get around this is to set the Working Directory using setwd().  This is a mess because Working Directories change when files are moved between computers or even within a computer.  Scripts that set working directories are much harder to share and maintain.  This author strongly suggests that you use RStudio Projects and avoid the working directory mess.

3 Vector Variables

Vector variables, or vectors for short, are variables that hold multiple related values.  For instance, if we have temperature values for 5 days, we could create 5 variables and assign each one a temperature value:

tempHigh1 = 65;
tempHigh2 = 57;
tempHigh3 = 61;
tempHigh4 = 64;
tempHigh5 = 59;

And, in the last lesson’s application, we averaged the 5 values like this:

tempAvg = (tempHigh1 + tempHigh2 + tempHigh3 +
           tempHigh4 + tempHigh5) / 5;

3.1 Grouping values

But, in programming we want to group related values together.  So, instead of creating 5 different variables, we create one vector that has 5 values:

tempHigh = c(65, 57, 61, 64, 59);

One of the main reasons to create a vector is that the values in the vector are all associated and we can perform mathematical operations on the whole vector.  For instance, the vector might consist of multiple temperatures in Fahrenheit and we want to convert all the temperatures to Celsius:

tempHighCel = (5/9) * (tempHigh - 32)

Or, we can use one of R’s many, many statistical functions, mean(), on the vector:

tempMean = mean(tempHigh); 

After executing the code in this section, we can see in the Environment tab that tempHighCel has 5 converted temperatures and that tempMean is the same as tempAvg:

tempAvg:     61.2
tempHigh:    num [1:5] 65 57 61 64 59
tempHighCel: num [1:5] 18.3 13.9 16.1 17.8 15
tempMean:    61.2
Figure 1: View of temperature values in Environment
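
As a quick check on the claim that tempMean matches the hand-computed average, the sketch below recomputes the average with sum() and length(), two base R functions this lesson does not otherwise use, and compares it with mean().  The name tempAvgCheck is made up for this example.

```r
# High temperatures (in Fahrenheit) from this section
tempHigh = c(65, 57, 61, 64, 59);

# Vector math: the same formula is applied to every value in the vector
tempHighCel = (5/9) * (tempHigh - 32);

# Average computed by hand: sum of the values divided by how many there are
tempAvgCheck = sum(tempHigh) / length(tempHigh);

# Average computed with mean()
tempMean = mean(tempHigh);

print(round(tempHighCel, 1));   # Celsius values rounded to one decimal place
print(tempAvgCheck);
print(tempMean);
```

Both averages come out to 61.2, and tempHighCel has one converted value for each of the 5 values in tempHigh.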

Extension: Statistical functions on vectors

3.2 Indexing values in a vector

Putting values into vectors also allows us to associate values between vectors.  If we have a vector of high temperature values, then we could also have a vector of low temperature values.

tempLow = c(45, 40, 55, 37, 42);

We can directly subtract tempLow values from tempHigh to get the difference in temperature for all values:

tempDiff = tempHigh - tempLow;

We can index individual values.  You could think of the vector as 5 days.  And if you want, you can find the change in temperature for any day using the index 1 through 5:

tempDiff_3 = tempHigh[3] - tempLow[3];
tempDiff_5 = tempHigh[5] - tempLow[5];

The Environment shows:

tempHigh:    num [1:5] 65 57 61 64 59
tempLow:     num [1:5] 45 40 55 37 42
tempDiff:    num [1:5] 20 17 6 27 17
tempDiff_3:  6
tempDiff_5:  17
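
To see that subtracting two vectors lines the values up by position, the sketch below checks that the third value of tempDiff is the same number as tempDiff_3.  The comparison operator ( == ) has not been introduced in this class yet; here it simply answers TRUE or FALSE.

```r
tempHigh = c(65, 57, 61, 64, 59);
tempLow = c(45, 40, 55, 37, 42);

# Subtract the whole vectors: day 1 from day 1, day 2 from day 2, and so on
tempDiff = tempHigh - tempLow;

# Subtract just the values for day 3
tempDiff_3 = tempHigh[3] - tempLow[3];

# Index the result vector: the 3rd value of tempDiff is the difference for day 3
print(tempDiff[3]);
print(tempDiff[3] == tempDiff_3);   # TRUE
```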

3.3 Vectors and columns in a data frame

Data frames are made up of vectors that all have the same length.  The vectors are the columns in the data frame and the number of values in the vector is determined by the number of rows in the data frame.

 

In Figure 2 there is a weather data frame (note: we are using this data frame in future lessons, not this one) where each column has a measured daily weather value (e.g., minTemp, maxTemp, relHum).  Each row gives the measured weather values for a specific day (e.g., row 1 is Jan 1, row 2 is Jan 2).

Figure 2: A data frame viewed in RStudio with 23 variables (i.e., the 23 columns, which are vector variables)
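
As a small illustration of columns being vectors, the sketch below builds a data frame from the two temperature vectors used earlier.  It relies on data.frame(), nrow(), and ncol(), base R functions that this lesson does not cover, and the name tempData is made up for this example.

```r
tempHigh = c(65, 57, 61, 64, 59);
tempLow = c(45, 40, 55, 37, 42);

# Each vector becomes one column; both vectors have 5 values, so there are 5 rows
tempData = data.frame(tempHigh, tempLow);

print(tempData);
print(nrow(tempData));   # number of rows: 5
print(ncol(tempData));   # number of columns: 2
```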

4 CSV files

In the first couple of lessons we were using data from inside the script but most of the data you will use will come from outside the script – often from a data frame. The data frames themselves often come from a CSV file. CSV stands for Comma Separated Values, meaning that the values in the file are separated by commas.

 

Note: a CSV file is really just a text file that can be read in any text editor, and CSV files do not have to use commas as separators.

 

Let’s look at the small data set we downloaded in Section 1.1, twoWeekWeatherData.csv, which contains two weeks of weather data from Lansing, Michigan.

4.1 Opening a CSV file

Note: CSV files are often opened by default in Excel and this can be problematic: Trap: Opening CSV files in Excel

 

CSV files are just text files and can be opened in any text editor (e.g., Notepad, Notepad++, TextEdit).

 

RStudio is also a text editor so we can open and edit twoWeekWeatherData.csv in RStudio by either:

  • Clicking File -> Open File…, navigating to twoWeekWeatherData.csv, and pressing OK.

  • Using the Files tab: navigate to the data folder, click on twoWeekWeatherData.csv, choose View File (Figure 3)

Figure 3: The twoWeekWeatherData.csv file opened in RStudio

 

The CSV file has 15 lines (rows).  The first row contains the 5 column headers and the other 14 rows contain the 5 values for each date – all separated by a comma.

4.2 Reading in a CSV file

We can see that the CSV file has a header row and each row has five values representing: date, highTemp, lowTemp, precipitation, noonCondition.

 

To use the data in a script we need to read the data into a script file – and for that we will use the read.csv() function.

# Get the data from the twoWeekWeatherData CSV file
weatherData = read.csv(file="data/twoWeekWeatherData.csv",
                       sep=",",
                       header=TRUE);

read.csv() reads in the data from the CSV file – and saves the data to a variable called weatherData.

 

weatherData is a data frame variable and it appears in the Environment tab under Data as 14 obs. of 5 variables (i.e., 14 rows and 5 columns), which matches what we saw in the CSV file (Figure 3).

  • Double-clicking on weatherData opens the data frame in the File Viewer window.  This is a convenient way to visualize the data frame.

  • Clicking on the blue arrow to the left of weatherData provides information about each of the columns

Figure 4: Viewing the data frame in the File Viewer

4.3 Column value type

In the Environment, when you click on the arrow to the left of weatherData (Figure 4) you see some information about the columns in the data frame. The first piece of information is the name of the column (i.e., the header – date, highTemp, etc.).  The second is the type of value the column contains.

 

In weatherData there are three types of values:

  1. int (e.g., highTemp and lowTemp): integer – or non-decimal numbers

  2. num (e.g., precipitation): numeric, or numbers with decimals

  3. chr (e.g., noonCondition): this stands for character, and character is really short for string of characters, and this in turn is typically shortened to string.  Whatever you call it, it is the catch-all category for values R cannot recognize as anything else.

 

Note: the date column has date values but it is not formatted properly so R does not recognize it as a Date – something we will cover much later.
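
To see how R decides on column types without needing the data file, the sketch below reads three rows from a string instead of a file.  The rows and the date format are made up for illustration (they are not the real weather data), and the example assumes the text argument of read.csv(), which reads from a string, and str(), which prints column information much like the Environment tab.  stringsAsFactors=FALSE is included so that text columns come back as chr on older versions of R as well.

```r
# Three made-up rows in the same layout as the lesson's CSV file
# (illustration only; these are not the real values)
csvText = "date,highTemp,lowTemp,precipitation,noonCondition
03-27,57,45,0.01,Cloudy
03-28,50,43,0.25,Rain
03-29,54,42,0.00,Sunny";

# text= tells read.csv() to read from the string instead of a file
sampleData = read.csv(text=csvText,
                      sep=",",
                      header=TRUE,
                      stringsAsFactors=FALSE);

# str() lists each column with its type (int, num, or chr)
str(sampleData);
```

In this sketch, highTemp and lowTemp are read as int, precipitation as num, and both date and noonCondition as chr.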

5 Functions and arguments

read.csv() is a function that tells R to open a file and treat it like a CSV file. A function is a set of code separate from your main script that performs a common task (e.g., plotting values, finding a standard deviation, reading in a file).

 

There is a lot going on in our first line of code and that is because there are a lot of options, called arguments, when opening up a CSV file!  We will talk a little bit about these arguments here and save the rest of the discussion for the next lesson.

 

Inside the function read.csv(), I included three arguments (file, sep, and header).  Arguments are like the knobs and dials for a function.  The function reads a CSV file and the arguments tweak how the CSV file is read.

5.1 file

file = "data/twoWeekWeatherData.csv" means that the script opens the file called twoWeekWeatherData.csv located in the data folder of your RStudio Project.

 

This only works if you are in the RStudio Project because the RStudio Project sets the Working Directory to your Project Folder.

 

If you execute the script file outside your RStudio Project (Figure 5), you will get a cannot open file error in the Console because the script does not know where to find the CSV file.

Figure 5: You get a cannot open file error if you try to execute the script outside of your RStudio Project
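
If you are not sure whether R can find the data file, a quick check like the sketch below can help.  It uses two base R functions that this lesson does not otherwise cover: getwd(), which reports the current Working Directory, and file.exists(), which reports whether a path points to an existing file.  The name dataPath is made up for this example.

```r
# The relative path used in this lesson (relative to the Working Directory)
dataPath = "data/twoWeekWeatherData.csv";

# The folder R is currently using as the starting point for relative paths
print(getwd());

# TRUE if R can find the file from that folder, FALSE otherwise
print(file.exists(dataPath));
```

Inside the RStudio Project, with the data file saved in the data folder, the second line should print TRUE; a FALSE usually means the script is being run outside the Project or the file was saved somewhere else.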

5.2 sep

sep = "," means that the values in the file are separated by commas.  This is not always true – even for CSV files.

 

The most common alternate separator values are spaces ( sep = " " ) and dots ( sep = "." ).

5.3 header

header = TRUE means that the first row of the file contains the column names (the headers) rather than data.

5.4 Default values for arguments

By default, read.csv() assumes sep="," and header=TRUE. For the data in this example, sep and header are not needed and this code would work:

weatherData = read.csv(file="data/twoWeekWeatherData.csv");

But I like to keep sep and header in my code because you often do need to change them (e.g., you have data without a header or spaces between values instead of commas) and it is easier to deal with if you already have sep and header when copying code from one script to another.
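
As a sketch of what changing these two arguments does, the example below reads two made-up rows that have no header row and use spaces between values.  The data and the names spaceText and spaceData are invented for illustration, and the text argument (which reads from a string instead of a file) is not used elsewhere in this lesson.

```r
# Two made-up rows with no header row and spaces between the values
spaceText = "65 45 Sunny
57 40 Rain";

spaceData = read.csv(text=spaceText,
                     sep=" ",
                     header=FALSE,
                     stringsAsFactors=FALSE);

# With header=FALSE, R names the columns V1, V2, V3, ...
print(names(spaceData));
print(spaceData$V1);
```

Because there is no header row, R makes up the column names itself, so the first column is accessed as V1.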

6 Getting data

We now have a data frame with weather information stored in a variable called weatherData.  But, we usually do not want to deal with the whole data frame at once.  So, our next trick will be to extract parts of the data from the data frame.  We might want to work with one column in the data frame (e.g., highTemp or precipitation).  We can see all the columns in the data frame by expanding the data frame in the Environment or looking at the headers in the File Viewer (both views are in Figure 4). 

6.1 Pulling out a column

Columns in a data frame are vectors (i.e., a variable with multiple values) and we can save the columns to a vector.  We can access the columns from a data frame using the $ operator.

 

We start with weatherData, use the dollar sign operator ( $ ) to access the highTemp column, and save the results to the vector called highTemps:

highTemps = weatherData$highTemp;

highTemps appears in the Environment as a vector:

  • In the Environment, square brackets ( [ ] ) indicate the variable is a vector

  • highTemps has 14 ( [1:14] ) integer (int) values, representing the high temperature for each of the 14 days

Figure 6: The highTemps vector variable in the Environment tab

6.2 Pulling out (subsetting) individual values

Perhaps we want just one value – specifically, we want to know the high temperature on April 2 (the seventh day in weatherData).

 

There are multiple ways to access that individual value.  I will present three, but for the rest of this lesson, we are going to use the first method.
 

We could use the subset operator ( [ ] ) to pull the 7th value from highTemps:

highTempDay7a = highTemps[7];

We could go back to the original data frame (weatherData) and subset the 7th value from the column highTemp:

highTempDay7b = weatherData$highTemp[7];

Or, we can use a row/column (e.g., X/Y) operation on weatherData to get the value:

highTempDay7c = weatherData[7, "highTemp"];  # think of this as [X,Y] notation

The last method is functionally the same as the previous two – but the X,Y notation is the most robust as it can best handle more complex subsetting situations (e.g., subsetting multiple rows and columns), which will be discussed in a future lesson.
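
The sketch below runs the three methods side by side on a small made-up data frame (sampleData stands in for weatherData, and its values are invented), so it can be executed without the data file.  data.frame() is a base R function not covered in this lesson.

```r
# A small made-up data frame standing in for weatherData
sampleData = data.frame(highTemp = c(57, 50, 54, 40),
                        lowTemp  = c(45, 43, 42, 31));

# Save the highTemp column to a vector
highTemps = sampleData$highTemp;

day3a = highTemps[3];                # 3rd value of the saved vector
day3b = sampleData$highTemp[3];      # 3rd value of the column in the data frame
day3c = sampleData[3, "highTemp"];   # row 3, column highTemp

print(c(day3a, day3b, day3c));       # all three are 54
```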

6.3 The Environment and the L value

I will be referring quite often to the Environment tab as it looks after the script has been sourced (as in Figure 6) .  Instead of taking a screenshot of the Environment tab (which is hard to maintain), I will display the variables I am focused on like this:

highTempDay7a:  60L
highTempDay7b:  60L
highTempDay7c:  60L

The Environment tab shows that highTempDay7a, highTempDay7b, and highTempDay7c are all equal to 60L.

 

The L indicates that the number is explicitly an integer, not just a numeric value that happens to have all zeros after the decimal point.
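
A short sketch of the difference, using class() (a base R function that reports a variable's type and is not otherwise covered in this lesson).  The names wholeNum and intNum are made up for this example.

```r
wholeNum = 60;     # numeric: stored with a decimal part (which happens to be zero)
intNum = 60L;      # integer: the L tells R to store the value as an integer

print(class(wholeNum));      # "numeric"
print(class(intNum));        # "integer"
print(wholeNum == intNum);   # TRUE: the values are equal, only the type differs
```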

7 Pulling out multiple values

We have pulled out a column from a data frame and we have pulled out individual values.  In this section, we will show different ways to pull out multiple values with:

  1. An index vector created using c()

  2. A regular (or, simple) sequence using the colon ( : ) operator

  3. A more complex sequence using the seq() function

7.1 Using a numeric vector to pull multiple values

highTemps is a vector with 14 values and each of these values can be accessed by their numeric placement:

val_01 = highTemps[1];
val_05 = highTemps[5];
val_12 = highTemps[12];

Or, we might want to combine the temperatures from these three days into one variable.  c() creates a vector of the three values (1, 5, and 12) and that vector is used to subset highTemps:

valComb = highTemps[c(1,5,12)];  # a vector with 3 values

valComb is a vector with 3 values, the high temperature for days 1, 5, and 12:

val_01:    57L
val_05:    39L
val_12:    54L
valComb:   int [1:3] 57 39 54
Figure 7: The three values we pulled from highTemps individually saved and combined (valComb)
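
The index vector can also be saved to its own variable first, and the values come back in the order the indexes are listed.  The sketch below shows this on a short made-up vector (sampleTemps and dayIndex are names invented for this example).

```r
# A short made-up vector of temperatures
sampleTemps = c(57, 50, 54, 40, 39, 58);

# Save the index vector to a variable, then use it to subset
dayIndex = c(1, 5, 3);
picked = sampleTemps[dayIndex];

print(picked);                # 57 39 54: values 1, 5, and 3, in that order
print(sampleTemps[c(5, 1)]);  # 39 57: the order follows the index vector
```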

7.2 Pulling out consecutive values

If we want to get the high temperature for consecutive days, we can use the simple sequencing ( : ) operator – simple refers to the fact that we are sequencing by 1:

consecVals = highTemps[3:11];  # same as highTemps[c(3,4,5,6,7,8,9,10,11)]

consecVals is a vector with 9 values – the high temperatures from days 3 through 11.

Note: 9 values, not 8 – we include both the 3 and the 11.

 

The ( : ) operator also works in reverse:

consecValsRev = highTemps[11:3];

consecValsRev is a vector with 9 values – the high temperatures from days 11 down to 3.

consecVals:     int [1:9] 54 40 39 ...
consecValsRev:  int [1:9] 39 44 55 ...
Figure 8: int [1:9] means 9 values are in each vector – the first few values are shown
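
The ( : ) operator can also be looked at on its own, without a vector to subset.  The sketch below counts the values in 3:11 and compares the result with the typed-out version; length(), all(), rev(), and the == comparison are base R features not covered in this lesson, and the names forward and backward are made up for this example.

```r
forward = 3:11;    # 3 4 5 6 7 8 9 10 11
backward = 11:3;   # 11 10 9 8 7 6 5 4 3

print(forward);
print(backward);
print(length(forward));                                     # 9 values, not 8

# Same numbers as typing out every index with c()
print(all(forward == c(3, 4, 5, 6, 7, 8, 9, 10, 11)));      # TRUE

# The reverse sequence is the forward sequence flipped
print(all(backward == rev(forward)));                       # TRUE
```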

7.3 Using the Console to view values in a vector

The Environment tab is good for quickly looking at values, but it only shows a few values from each vector.  If you want to see all the values then you can use the Console tab (bottom-left corner).  In the Console tab, type in the variable you want to see and press Enter:

Figure 9: Using the Console tab to view variables

 

These two variables, consecVals and consecValsRev, are shown in the Console window in Figure 9. In the future I will present the Console like this:

> consecVals 
[1] 54 40 39 58 60 53 55 44 39 
> consecValsRev 
[1] 39 44 55 53 60 58 39 40 54

7.4 Using a sequence to get values

We can use the seq() function to create a more complex sequence to subset a vector.

 

This is a two step process:

  1. Create a sequence. 

  2. Use the sequence to subset the vector. 

7.4.1 Creating a sequence

seq() is a function with three arguments:

  • from: the starting point

  • to: the ending point

  • by: the increment (step size) between values

 

We can create a sequence that lists every other number from 1 to 14:

seq1 = seq(from=1, to=14, by=2);

We can reverse the direction of the sequence by swapping from and to and making by negative:

seq2 = seq(from=14, to=1, by=-2);

We can get every 3rd number for the first 10 numbers:

seq3 = seq(from=1, to=10, by=3);

The 3 sequences we created are just vectors of numbers:

seq1:   1 3 5 7 9 11 13
seq2:   14 12 10 8 6 4 2
seq3:   1 4 7 10
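
One detail worth seeing in action: the sequence only includes the to value if the steps land on it exactly.  The sketch below compares three sequences (seqA, seqB, and seqC are names made up for this example).

```r
seqA = seq(from=1, to=14, by=2);   # 1 3 5 7 9 11 13 (14 is never reached)
seqB = seq(from=1, to=9, by=3);    # 1 4 7 (the next value, 10, would pass 9)
seqC = seq(from=1, to=10, by=3);   # 1 4 7 10 (10 is reached exactly)

print(seqA);
print(seqB);
print(seqC);
print(length(seqA));               # 7 values
```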

7.4.2 Using the sequences as indexes

We can use the 3 sequences above as index vectors to index the highTemps vector:

 

Get every other high temperature:

highTempSeq1 = highTemps[seq1];

Get every other high temperature in reverse:

highTempSeq2 = highTemps[seq2];

Get every 3rd high temperature value in the first 10 values:

highTempSeq3 = highTemps[seq3];

highTempSeq1:  int [1:7] 57 54 39 60 ...
highTempSeq2:  int [1:7] 75 54 44 53 ...
highTempSeq3:  int [1:4] 57 40 60 44
Figure 10: There are 7 values in the first two sequences and 4 in the third

7.4.3 Combining the above 2 steps

You can combine the above two steps in 1 command.

 

The following commands will give you the same high temperature values:

highTempSeq1b = highTemps[seq(from=1, to=14, by=2)];  
highTempSeq2b = highTemps[seq(from=14, to=1, by=-2)];  
highTempSeq3b = highTemps[seq(from=1, to=10, by=3)];
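
To confirm that the two-step and one-step versions really give the same answer, the sketch below runs both on a made-up vector of 14 values (10, 20, ..., 140) so it can be executed without the data file.  The names sampleVals, twoStep, and oneStep are invented for this example, and identical() is a base R function that reports whether two variables hold exactly the same values.

```r
# A made-up vector with 14 values: 10, 20, 30, ..., 140
sampleVals = (1:14) * 10;

# Two steps: create the sequence, then use it to subset
everyOther = seq(from=1, to=14, by=2);
twoStep = sampleVals[everyOther];

# One step: create the sequence inside the square brackets
oneStep = sampleVals[seq(from=1, to=14, by=2)];

print(twoStep);                      # 10 30 50 70 90 110 130
print(identical(twoStep, oneStep));  # TRUE
```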

Extension: Square brackets [ ] vs. Round Brackets ( )

7.4.4 Note about combining steps in R

While it might seem like combining steps is a good idea because it reduces the amount of code, this needs to be balanced with the fact that combining code usually makes it harder to read.  One of the big problems I see when looking at people’s code is that they cram too much code into one command and it becomes really difficult to debug if anything goes wrong.  If you are going to err on one side then this author recommends you err on the side of breaking up code rather than combining it.

8 Application

A) Create a file named app1-04.r in your scripts folder inside your RStudio Project and save the information from twoWeekWeatherData.csv to a variable named weatherData (just like in this lesson).

 

B) Sequencing

Use the seq() function to create this sequence: (13, 9, 5, 1, -3, -7). 

Note: this sequence could not be used to index a vector because it mixes positive and negative numbers. 

C) Subsetting:

  1. Pull out two columns precipitation and lowTemp from weatherData and save each to a variable

  2. Save the 4th value of precipitation to a variable

    • Repeat the previous step for the other two methods given in Section 6.2
  3. Save the 9th, 12th, 5th, and 1st values of lowTemp to one variable (in that order)

  4. Create a variable that holds the precipitation for the last 5 days (of the 14 days).

  5. Create a variable that holds the lowTemp for the days 12 down to 6.

  6. Create a variable that holds the first 3 days and last 3 days of precipitation

 

D) Negative subsetting:

Subset precipitation using the index vector c(-1,-3,-5). Explain in comments what happened.

 

E) Vector Math:

  • Convert the precipitation column from inches to centimeters (2.54cm = 1in).

 

F) Opening Data Frames:

  1. Save the file DowJones1980_2012.csv to the data folder in your RStudio Project

    • To save: right-click on link and choose Save link as…
  2. Using read.csv(), save the data in DowJones1980_2012.csv to a data frame named dowJonesData

    •  there is no header in this file and the separator is a space
  3. Plot the data using the line: plot(dowJonesData$V2)

    • If you opened the file correctly the plot will be in the Plots tab and look like this:

      Figure 11: Plot of the Dow Jones data

 

 

Save the script as app1-04.r in your scripts folder and  email your Project Folder to Charlie Belinsky at belinsky@msu.edu.

 

Instructions for zipping the Project Folder are here.

 

If you have any questions regarding this application, feel free to email them to Charlie Belinsky at belinsky@msu.edu.

8.1 Questions to answer

Answer the following in comments inside your application script:

  1. What was your level of comfort with the lesson/application?

  2. What areas of the lesson/application confused or still confuse you?

  3. What are some things you would like to know more about that are related to, but not covered in, this lesson?

9 Trap: Opening CSV files in Excel

If you have Excel on your computer, then it is probable that Excel is the default program for opening CSV files.  This means double-clicking on a CSV file will open it in Excel.  The danger with opening CSV files in Excel is that Excel will sometimes reformat the data (especially date and money columns) and then give a message saying how the data needs to be saved to be preserved.  Do not save the file – it will change the format of your data.  If you are editing a CSV file in Excel then I would save the results to a test file first to make sure Excel did not do anything untoward.

Figure 12: Opening a CSV file in Excel – the possible data loss warning is not something you want to heed

10 Extension: Long and Short Integers

L stands for long integer.  In R, the L simply marks a number as an integer rather than a numeric; the letter is a holdover from older languages, where a long integer could hold larger numbers than a short integer.  In the old days, when space on computers was VERY expensive, programmers would save space by having a short integer and long integer designation.  So, if a programmer knew a number was small, they would assign it to a short integer to take up less space on the computer.  Since temperatures in Lansing are always small numbers (a short integer is typically 16 bits and can hold values up to 32,767), it would be safe to assign temperatures to short integers.

 

Nowadays, space is so cheap that we rarely care that we are using a tiny bit more memory than we need – so the short/long distinction has become mostly a relic of the past, and in R the L only tells you that the value is stored as an integer.

11 Extension: Statistical functions on vectors

One huge advantage to using vectors is that most statistical functions take vectors.  So, you can easily perform statistical functions on a vector:

tempHigh = c(65, 57, 61, 64, 59);

tempMean = mean(tempHigh);
tempSD = sd(tempHigh);
tempVar = var(tempHigh);

tempMean:   61.2
tempSD:     3.346...
tempVar:    11.2

We will be doing more with this in later lessons…
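
A few more base R statistical functions work the same way.  The sketch below adds median(), min(), and max() (none of which are covered in this lesson) and checks that the standard deviation squared matches the variance; the names tempMedian, tempMin, and tempMax are made up for this example.

```r
tempHigh = c(65, 57, 61, 64, 59);

tempMedian = median(tempHigh);   # middle value when sorted: 61
tempMin = min(tempHigh);         # smallest value: 57
tempMax = max(tempHigh);         # largest value: 65

tempSD = sd(tempHigh);
tempVar = var(tempHigh);

print(c(tempMedian, tempMin, tempMax));

# The variance is the standard deviation squared
# (all.equal() compares numbers while allowing for tiny rounding differences)
print(all.equal(tempSD^2, tempVar));   # TRUE
```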

12 Extension: Square brackets [ ] vs. Round Brackets ( )

Square brackets are used when you are subsetting or indexing a variable.

Round brackets are used when you are calling a function and contain the arguments for that function.

… and the two types of brackets are often used in the same command

lowTemps [seq(from=12, to=2, by=-2)]

says you are going to subset lowTemps using the function seq()

> lowTemps [seq(from=12, to=2, by=-2)]
[1] 43 40 50 45 38 43

seq() is a function and produces a vector of numbers:

> seq(from=12, to=2, by=-2)
[1] 12 10 8 6 4 2

c() is also a function – c stands, unintuitively, for concatenate and will also produce a vector of numbers:

> c(12,10,8,6,4,2)
[1] 12 10 8 6 4 2

lowTemps [c(12,10,8,6,4,2)] and lowTemps [seq(from=12, to=2, by=-2)] produce the same results.

 

You can also subset lowTemps without calling a function:

> lowTemps[5]
[1] 36
> lowTemps[4:8]
[1] 38 36 45 32 50

Technically, 4:8 is still an operation that generates a sequence (the same numbers as seq(from=4, to=8, by=1)) – but that is really getting into the weeds!

 

read.csv(), ggplot(), plot(), and rm() are all functions.  There will be a lot more about functions in this class starting with the next lesson.