Homework 3

Question #1

Find: \s{2,}
Replace: ,

In this problem I needed to remove a series of spaces, not just all spaces, and replace them with a comma. The regular expression find spaces, and adding {2,} found 2 or more consecutive spaces.

Question #2

Find: ([A-Za-z]+), \s([A-Za-a]+), \s(.+)
Replace: \2\1(\3)

To remove the commas I needed to identify the groups of words separately from the commas themselves. In order to achieve this I placed the commas outside the parentheses to ensure they were not a part of my group. The command ([A-Za-z]+) identifies a consecutive string of letter characters containing capitalized and lowercase letters. (.+) Identified every character except a new line, grouping the university titles together regardless of word count, allowing me to put parenthenses around that group in the replace function.

Question #3

Find: (?<=.mp3)(\s)
Replace: \n

The initial text was a single string of text that needed to be broken into lines using the regular expression to create a new line. To identify all characters before “.mp3” I used ?<=.and created the new line following the chunk of text .mp3

Question #4

Find: (\d{4})([^.mp3]*)
Replace:\2_\1

Similar to the last problem I needed to identify .mp3, but this time I wanted to exclude everything before that chunk of text using ^. The *outside of the brackets identified everything following. () made a group of the first four digits in the line of text. I then placed the second group before the first group identified in the regular expression and put a _ between them to achieve the desired result.

Question #5

Find: (\w)(\w+),(\w+),.*?,(\d)
Replace: \1_\3, \4

The objective here was to creates groups that identified the first character of the first word () then the rest of the first word before the comma (+), the second word in its entirety (+), any single character with a period and * means zero or more occurrences of any character and the ? identifies as few characters as possible since some values had more digits than others… compiling to .*? before the next comma, and the the group of digits (. I then needed to remove the second group, leaving only the first character of the first word, add an underscore before the third group, omit the string of random digits (which is why they werent in a group), and then place the fourth group after a comma.

Question #6

Find: (\w)\w+,(.{4})\w+,\d+.\d+,(\d+)
Replace: \1_\2,\3

Similarly to the last problem I had to identify specific characters in the string of plain text. In this problem however I needed to identify the first four letter in the second word using (.{4}). The period indicates any single character and the {4} limits those characters to a total of 4. Again I needed to remove the rest of the first word, the rest of the second word, and random digits so I didn’t put them in a group.

Question #7

Find: (.{3})\w+,(.{3})\w+,(\d+.\d+),(\d+)
Replace: \1\2, \4, \3

Using the orginial code I needed to identify the first three letters of the two words using (.{3})+. The first three letters were inside parentheses to identify them as a group I wanted to keep. Unlike the last two problems, I want to keep all of the digits in this regular expression so I placed them both in groups using parentheses. I then reordered the groups to achieve the desired plain text in the new order.

Question #8

Pathogen Binary

In the “pathogen_binary” column I would expect to see only binary data (1s and 0s), however there are entries that are NA indicating that the data was not collected for whatever reason so I will leave it as it. If I were to replace it I would do:

Find what: \bNA\b
Replace with: 0

bombus_spp and host_plant

There are excess characters in this column that are not part of the species name. I used ^ to identify all characters that are not a single character, space, or line break. Then we needed to remove the underscores because they are not considered special characters

Find: [^\w\s\r\n]
Replace: nothing

Find:_
Replace: nothing

bee_caste

The main issue with this column is that a male bee is a “drone” whereas a female is a “worker”, so the data is technically just inconsistent with it’s nomenclature.

Find: male
Replace: drone