2-05: Which and Substrings
0.1 To-do
1 Purpose
use which() on string values
Regular Expressions to find substrings within a string
union() and intersect() and setDiff() to allow for complex conditions
2 Script and data for this lesson
The script for this lesson can be downloaded here
The data for this lesson can be downloaded here
3 Which on string data
Like last lesson we are going to use the the data from Lansing2016NOAA-3.csv:
= read.csv(file="data/Lansing2016NOAA-3.csv"); weatherData
In the last lesson we used which() to find numeric values in a vector that met a specific condition:
= which(precipNum > 1); # precip greater than 1" highPrecip
And there were 8 days with precipitation greater than 1 inch:
> highPrecip
1] 73 91 225 228 229 273 290 307 [
3.1 Identifying values using which()
In this lesson we will be searching for values that meet a condition within string (chr) columns.
In weatherData, the string column weatherType contains information about the type of weather that occurred during that day (e.g., Fog, Snow, Rain, etc.) :

We can use which() to find which values in weatherType are equal to a specific string, for instance, which values are equal to snow, “SN”?
= which(weatherData$weatherType == "SN"); snowyDays
The above code:
Cycles through each value in the weatherType column
Check if the value is equal to “SN”
Create a vector that has the index of all the values in the vector equal to “SN”
Save that vector to snowyDays
The result is a vector, snowyDays, with 11 values, so 11 days exclusively had snow:
: int [1:11] 3 16 17 29 ... snowyDays
There were many more days with snow but many were not counted in the search because those days had other weather condition. For instance, the first and fourth rows in Figure 1 had snow (SN) and breeze (BR). We will deal with those situations later in this lesson.
4 Using the indexed value generated by which()
We can use the vector of indexed values (snowyDays) to index other vectors like avgTemp and avgWind.
Let’s find the temperatures and wind speeds on days that it was snowy
= weatherData$avgTemp[snowyDays];
snowyDayTemps = weatherData$windSpeed[snowyDays]; snowyDayWinds
since snowyDays = c(3, 16, 17, 29, 39, 45, 136, 351, 354, 364, 365)
avgTemp[snowyDays] is equivalent to avgTemp[c(3, 16, 17, 29, 39, 45, 136, 351, 354, 364, 365)];
This gives the temperature on the 11 days indexed in the snowyDays vector
avgWind[snowyDays] is equivalent to avgWind[c(3, 16, 17, 29, 39, 45, 136, 351, 354, 364, 365)];
This gives the winds on the 11 days indexed in the snowyDays vector
The average temperatures and average wind speeds for the 11 days with snow were:
> snowyDayTemps:
1] 29 30 18 24 34 9 40 12 11 34 28
[> snowyDayWinds
1] 10.0 13.8 15.5 10.7 6.8 8.1 10.9 9.4 11.5 16.7 14.0 [
4.1 Multiple conditions on string values
Just like with if-else conditional statements, we can create more complex conditions by using the and ( & ), or ( | ), not ( ! ), and not equal (!=) operators.
Let’s find all days that has rain or snow using the or operator ( | ):
= which(weatherData$weatherType == "SN" |
rainOrSnow $weatherType == "RA"); weatherData
There were 39 days that had rain or snow. Since we know 11 of those days had snow, the other 28 must be rain:
> rainOrSnow
1] 3 16 17 29 39 45 69 77 82 87 90 110 120 134 136 137 150 158 165 175
[21] 178 193 197 211 224 258 264 271 272 281 282 294 304 313 337 351 354 364 365 [
Note: these results also do not include the days that had rain along with other conditions.
4.2 Inverting logic: the not operator
We can reverse the condition to look for days that did not have “RA” nor “SN” using the not ( ! ) operator.
The original condition was:
$weatherType == "SN" | weatherData$weatherType == "RA" weatherData
You can invert a condition using the not operator ( ! ) and parentheses around the original condition:
!(»weatherData$weatherType == "SN" | weatherData$weatherType == "RA"«)» «
And then put the inverted condition into the which() statement. Here we are saying, “which days did not have (snow or rain)”
= which(«!(»weatherData$weatherType == "SN" |
noRainNorSnow $weatherType == "RA"«)»); weatherData
And we see that noRainNorSnow has 327 values, which is 366-39:
: int [1:327] 1 2 4 5 6 7 ...
noRainNorSnow: int [1:39] 3 16 17 29 39 ... rainOrSnow
An equivalent way to reverse the logic is to say, “which days that did not have snow and days that did not have rain”
= which(weatherData$weatherType «!=» "SN" «&»
noRainNorSnow2 $weatherType «!=» "RA"); weatherData
The result is the same:
: int [1:327] 1 2 4 5 6 7 ... noRainNorSnow2
4.3 Incorrectly invert logic
In this section, we will look at two common ways people incorrectly invert logic.
4.3.1 Forgetting the parenthesis:
If you take this condition:
$weatherType == "SN" | weatherData$weatherType == "RA" weatherData
And put a not operator in front without parentheses:
!weatherData$weatherType == "SN" | weatherData$weatherType == "RA"
Then you have the logic: not SN or rain.
So, you will get all days that:
did not have snow or
did have rain
4.3.2 Forgetting to flip the or/and operator
If you choose to use the not equal operators on each component of the condition, then you need to flip the operator from or to and (or vice versa).
Here we did not flip the operator and kept the conditional operator or:
= which(weatherData$weatherType != "SN" |
noRainNorSnow_bad $weatherType != "RA"); weatherData
And the result is that all 366 values meet the condition:
: int [1:366] 1 2 3 4 5 6 7 ... noRainNorSnow_bad
In the application for this lesson, I will let you explain why saying “not snow or not rain” gives you every value in the column.
4.4 Conditions from different columns
We also can combine conditions from two different columns. For examples, let’s get all days that were raining and 70 degrees or more:
= which(weatherData$weatherType == "RA" &
rainAndHighTemps $avgTemp >= 70); weatherData
There are 7 days where both these condition (rain and temperatures 70 or more) occurred:
> rainAndHighTemps
1] 150 178 193 197 211 224 264 [
5 Multiple conditions on multiple variables
Similar to if(), we can use which() to create more complex condition. The trick is to make sure you have the parentheses correct, because, like Algebra, parentheses determine the order of operations.
So, let’s say we want days where there was rain or snow and temperatures between 25 and 40:
= which( (weatherData$weatherType == "RA" |
complexCond $weatherType == "SN") &
weatherData$avgTemp >= 25 &
(weatherData$avgTemp <= 40) ); weatherData
There were 7 days that meet all the conditions above:
: int[1:7] 3 16 39 136 337 364 365 complexCond
5.1 Combining conditional statements
For the above code, we have the inner parentheses with these two conditional statements:
$weatherType == "RA" | weatherData$weatherType == "SN")
(weatherData$avgTemp >= 25 & weatherData$avgTemp <= 40) (weatherData
And an outer parentheses that ands the two conditional statements together:
$weatherType == "RA" | weatherData$weatherType == "SN") «&»
«(» (weatherData$avgTemp >= 25 & weatherData$avgTemp <= 40) «)» (weatherData
In the Console we can break down the steps:
> which(weatherData$weatherType == "RA" | weatherData$weatherType == "SN")
1] «3» «16» 17 29 «39» 45 69 77 82 87 90 110 120 134 «136» 137 150
[18] 158 165 175 178 193 197 211 224 258 264 271 272 281 282 294 304 313
[35] «337» 351 354 «364» «365»
[> which(weatherData$avgTemp >= 25 & weatherData$avgTemp <= 40)
1] 1 2 «3» 6 7 8 9 10 14 15 «16» 25 26 27 28 30 32
[18] 33 34 35 36 37 38 «39» 40 47 52 53 54 55 56 57 58 60
[35] 65 66 78 79 80 81 83 84 85 86 88 89 93 94 95 96 97
[52] 98 99 100 101 103 «136» 300 301 317 324 325 326 327 328 330 331 332
[69] 336 «337» 338 339 340 341 342 343 344 346 347 356 357 358 359 360 362
[86] 363 «364» «365» 366 [
And see that 3, 16, 39, 136, 337, 364, and 365 are the only indexs values in both conditional statements.
6 Finding patterns within strings
The elephant in the room is that we are only capturing days that exclusively had snow or rain using which().
But we know from looking at the weatherType column (Figure 1) that many values in weatherType have “SN” or “RA” with other weather conditions – and these are not being captured because which() only checks for values that are exactly “SN”.
We really want every value that contains “SN”. This cannot be easily done using which() but it can be done easily using grep().
6.1 A quick grep() example
grep() is a very powerful tool for finding patterns within strings (the process is known as Regular Expressions, which might become another lesson in this class). In this lesson, we are only using the most basic functionality in grep() which is finding a substring within a string.
Let’s start with this vector:
= c("one fish", "two fish", "one llama", "two llama",
grepTestVec "red fish", "blue fish");
grep() return the indices of the vectors value that contain the substring specified:
= grep("two", grepTestVec); # values that have the substring "two"
test1 = grep("lla", grepTestVec); # values that have the substring "lla"
test2 = grep("fish", grepTestVec); # values that have the substring "fish" test3
And we can match the index values below with the values in grepTestVec:
: int [1:2] 2 4
test1: int [1:2] 3 4
test2: int [1:2] 1 2 5 6 test3
6.2 Using grep() to find substrings within a dataframe
We are going to use grep() to find the substring (“SN”) contained within a vector of string values (weatherType):
= grep(weatherData$weatherType, pattern="SN"); snowyDaysGrep
grep(),like which(), produces a vector of the index values that meets the condition (i.e., index of the weatherType values that contain “SN”).
And we can see there were actually 65 days that had snow – a much more reasonable result for Lansing, Michigan!
: int [1:65] 1 3 4 10 11 12 13 14 16 17... snowyDaysGrep
6.3 Subsetting vectors
snowyDaysGrep is a vector with 65 indexed values, representing the 65 days in Lansing that had snow.
Just like before, we can use the results of a grep() to index another column. We will use snowyDaysGrep to index the avgTemp and windSpeed columns in weatherData:
= weatherData$avgTemp[snowyDaysGrep];
snowDaysGrepTemp = weatherData$windSpeed[snowyDaysGrep]; snowDaysGrepWind
And we can see the average temperature and wind speeds for the 65 days that had snow:
> snowDaysGrepTemp
1] 26 29 17 27 13 18 15 26 30 18 11 13 17 19 22 36 26 30 24 38 26 26 34 28
[25] 16 14 9 20 27 20 30 29 34 22 19 18 20 28 34 32 36 28 35 33 28 34 40 40
[49] 34 33 28 26 23 26 28 18 15 10 12 21 14 11 36 34 28
[> snowDaysGrepWind
1] 15.5 10.0 7.7 17.8 12.2 18.2 13.2 5.6 13.8 15.5 13.1 7.6 4.4 5.5
[15] 6.3 18.2 13.3 14.1 10.7 13.5 7.9 7.3 6.8 12.9 13.8 14.5 8.1 6.9
[29] 3.8 5.9 18.9 16.5 16.2 13.1 9.7 3.7 2.2 5.4 10.2 10.4 10.9 11.6
[43] 10.7 7.9 9.4 13.8 10.9 19.2 5.9 8.6 16.7 10.2 8.6 10.3 13.8 10.8
[57] 16.7 11.3 9.4 4.1 8.1 11.5 7.3 16.7 14.0 [
note: if you want to sort the values, you can use sort(snowDaysGrepWind)
7 grep() with the or condition:
Similar to which() statements, we can use the or ( | ) operator to find all values in weatherType that contain either SN or RN:
= grep(weatherData$weatherType, pattern="RA|SN"); daysWithPrecipGrep
And we get a vector with 179 values in it, representing the 179 days that had rain or snow (or both):
: int[1:179] 1 3 4 8 9 10 ... daysWithPrecipGrep
An equivalent result can be obtained by getting the grep() results for RA and SN separately and then combining them using union():
= grep("RA", weatherData$weatherType);
rainyDaysGrep = grep("SN", weatherData$weatherType);
snowyDaysGrep = union(rainyDaysGrep, snowyDaysGrep); daysWithPrecipUnion
Looking at the days with rain, snow, and the combination of the two:
: int[1:124] 8 9 10 15 25 26 ...
rainyDaysGrep: int[1:65] 1 3 4 10 11 12 13 ...
snowyDaysGrep: int[1:179] 1 3 4 8 9 10 ... daysWithPrecipUnion
Note: the rainy days and snowy days adds up to 189 whereas the union has only 179 days. This is because there were 10 days that had both rain and snow… like day 10.
7.1 More complex union() scenario
Using union() seems more complicated than just using the or operator ( | ) in grep(), but union() allows for more complex situations.
For instance if we want to combine results from different columns like days that were hazy (HZ) or had temperatures above 85 (maxTemp > 85):
= grep(weatherData$weatherType, pattern="HZ");
hazyDays = which(weatherData$maxTemp > 85);
hotDays = union(hazyDays, hotDays); hazyOrHotDays
And we see there were 44 hazy days, 40 hot days, and 82 days that were hazy or hot (note: the union of hazy and hot days is not quite the addition of the two):
1:44] 12 13 14 19 20...
hazyDays int[: int[1:40] 149 162 163 167...
hotDays: int[1:82] 12 13 14 19... hazyOrHotDays
7.2 An alternative to & when using grep()
With grep() we can find days with rain or snow using the or operator ( | ) but we cannot use grep() to return days with both rain and snow (i.e., there is no & operator in grep())
#### the and operator does not work in grep()
# daysWithRainAndSnow = grep(weatherData$weatherType, pattern="RN&SN");
The reason for this requires a lengthy explanation of how grep() works (and regular expressions, in general), which is beyond the scope of this class.
What we want is the index values that are in both rainyDaysGrep and snowyDaysGrep.
There is a function for that called intersect(). intersect() is the sibling function to union(), intersect() takes two vectors and produces a vector of values that occur in both the original vectors.
In this case we will use intersect() on the vectors rainyDaysGrep and snowDaysGrep:
= intersect(rainyDaysGrep, snowyDaysGrep); daysWithRainAndSnow
daysWithRainAndSnow has 10 values, representing the 10 values that are in both rainyDaysGrep and snowyDaysGrep
> daysWithRainAndSnow
1] 10 26 34 35 84 94 98 101 352 359 [
8 setdiff() – the closest to a not operator
union() is functionally similar to the or operator, intersect() is functionally similar to the and operator.
The equivalent to a not operator is a bit more complex – the function to use is setdiff().
setdiff() takes two vectors as argument:
a vector with all possible values (so, if there are 366 values, then a vector from 1:366)
the vector of values to remove from the first vector
setdiff() returns the the values in the first vector that are not in the second vector effectively subtracting out values from the second vector that are in the first vector.
8.1 quick example
If the are 10 values in a vector and 4 of them (3,5,6,9) meet a condition:
= c(3,5,6,9);
meetsCondition = setdiff(1:10, meetsCondition); invertCondition
And we see that invertCondition is all values from 1 to 10 not represented in meetsCondition:
: int[1:6] 1 2 4 7 8 10 invertCondition
note: values in meetsCondition that are not in 1:10 (e.g., 11,12) are ignored.
8.2 Example using a column
If we want to find the days without rain or snow, we can take the days with rain or snow (daysWithPrecipGrep) and use setDiff() to find the difference between that and a vector with all 366 values:
= setdiff(1:366, daysWithPrecipGrep); daysWithoutPrecip
setdiff() removes from 1:366 all the index values in daysWithPrecipGrep:
: int [1:179] 1 3 4 8 9 10 ...
daysWithPrecipGrep: int [1:187] 2 5 6 7 22 24 ... daysWithoutPrecip
Adding up the daysWithPrecipGrep (179) and daysWithoutPrecip (187), we get 366 days.
9 Application
A) In comments at the top of your script answer the following:
Why do all values in the weatherType column meet the condition of not rain or not snow (Figure 3)?
How many days are both hazy and hot (Figure 6)?
B) Give two ways to change this logic (Figure 4) to rain and temperatures less than 70:
one by changing the conditional operator on the temperature
one by adding a not operator
Note: the answer for both will have the same 21 values
C) Use which() to:
find days that were RA, SN, or HZ with temperatures between 50 and 70
find days that were RA or SN, with temperatures between 50 and 70, and wind speed between 10 and 30
D) Use grep() (and some of union/intersect/setdiff) to find days with snow (SN) but not blowing snow (BLSN)
E) Use grep() (and some of union/intersect/setdiff) to find days that meets both these conditions:
at least one of: Fog (FG) or Haze (HZ)
at least one of: snow (SN) or Rain (RA)
Challenge: Use grep() on the precip column in weatherType to (note: precip is a string column):
find all precipitation values greater than or equal to 1.0 (hint: what is a unique substring for values >= 1)
find all precipitation values greater than or equal to 0.8
find all days that have precipitation values greater than 0.8 and have maximum temperatures between 70-80
- use which() to find the temperatures
Note: if you are looking for characters that are not alphanumeric then you might need to set the argument fixed in grep() to TRUE.
Save the script as app2-05.r in your scripts folder and email your Project Folder to Charlie Belinsky at belinsky@msu.edu.
Instructions for zipping the Project Folder are here.
If you have any questions regarding this application, feel free to email them to Charlie Belinsky at belinsky@msu.edu.
9.1 Questions to answer
Answer the following in comments inside your application script:
What was your level of comfort with the lesson/application?
What areas of the lesson/application confused or still confuses you?
What are some things you would like to know more about that is related to, but not covered in, this lesson?