Tools Deep Dive: Regex
Previously we talked about how there are several tools of the trade when it comes to cybersecurity. Many of these are free, built in, and extremely versatile. In fact, every tool I cover in this deep dive series is open source.
One of these Swiss army knives is regular expressions, also known as regex.
We’ll go over several ways in which you can utilize regex in security use cases.
Data Cleaning
One area where regex is particularly useful is in text data cleaning.
For example, say you have a dataset of text that contains unwanted characters, such as extra spaces or line breaks, or that needs to be standardized in some specific way.
With regex, you can write patterns to identify and clean up these unwanted characters.
Here's an example of a regex pattern that matches leading and trailing spaces, so they can be removed by replacing each match with nothing:
^ +| +$
To catch tabs as well as spaces, we can use the following.
^[ \t]+|[ \t]+$
The idea is you have data that needs small changes, but across the board. Regex takes care of this.
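As a rough sketch of how the trimming pattern above could be applied across many lines at once, here is one way to do it in Python. The helper name strip_edges and the sample text are made up for illustration; the only real dependency is the built-in re module.

```python
import re

# Runs of spaces or tabs at the start or end of a line.
# re.MULTILINE makes ^ and $ match at every line boundary, not just the ends of the string.
EDGE_WHITESPACE = re.compile(r"^[ \t]+|[ \t]+$", re.MULTILINE)


def strip_edges(text: str) -> str:
    """Hypothetical helper: remove leading/trailing spaces and tabs from each line."""
    return EDGE_WHITESPACE.sub("", text)


sample = "  user=admin \t\n\taction=login  "
print(repr(strip_edges(sample)))
```

Spaces inside a line are left alone, since the pattern only matches next to a line boundary.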
For context, I previously worked on a project that involved the same idea on a bigger scale to normalize fields in our logs. In other words, it pays to learn this stuff.
You can see more about normalizing logs in this post by Julie Sparks.
Text Extraction
Another area where regex can be useful is in text extraction.
Here’s an example using regex101.
Regex101 is pretty much the de facto platform for testing your regex patterns.
Let’s say you have a dataset that contains data you want to extract and analyze, such as dates, phone numbers, or email addresses.
With regex, you can write patterns to identify and extract this information, making it much easier to analyze and work with. (This beats Ctrl+F/Cmd+F.)
Here's an example of a regex pattern that matches a phone number:
^\d{3}-\d{3}-\d{4}$
Note that this works with phone numbers in the format of 3 digits (area code), 3 digits, 4 digits. So something like 305-555-5555. The ^ and $ anchors require the number to be the entire line; to pull a number out of the middle of a longer string, drop the anchors.
The same pattern without the anchors, wrapped in a bit of Python code, could look like the following:
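import re

text = "string with a phone number: 555-555-4321"
# Raw string so the backslashes reach the regex engine unchanged
phone_num = re.search(r"\d{3}-\d{3}-\d{4}", text)
if phone_num:  # re.search returns None when nothing matches
    print(phone_num.group(0))
# Output: 555-555-4321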
This utilizes the re module, which allows you to leverage regular expressions in Python.
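If a string may hold more than one phone number, re.findall returns all of them instead of just the first. The following is a minimal sketch, assuming the same NNN-NNN-NNNN format as above; the function name and the sample log line are hypothetical.

```python
import re

# Same digit layout as the pattern above. The \b word boundaries are an added
# assumption so that a match cannot start or end in the middle of a longer number.
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def extract_phone_numbers(text: str) -> list[str]:
    """Hypothetical helper: return every phone-number-shaped substring, in order."""
    return PHONE.findall(text)


log_line = "caller 305-555-1234 transferred to 555-555-4321 at 10:42"
print(extract_phone_numbers(log_line))
```

When nothing matches, findall returns an empty list, so there is no None to check for.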
These examples showed text extraction. In addition to these use cases, regex can be used to perform many other text processing tasks such as search and replace, and text manipulation.
Search and Replace
sed reads the specified files, or standard input, and modifies the input as directed. It supports regular expressions, which makes it that much more powerful.
sed 's/"msg"/"message"/' file.log > fixed_file.log
What this is saying is that, in the log, we want to replace “msg” with “message”. This is the substitute command in sed, and it replaces the first occurrence of “msg” in each line with “message”. The double quotes inside the command are part of the pattern, so only a quoted “msg” is matched. To take this further and replace all occurrences of “msg” with “message”, we would do the following.
sed 's/"msg"/"message"/g' file.log > fixed_file.log
The trailing g means global replace, so this will replace ALL occurrences of “msg” with “message” on every line of that specific file.
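For comparison, the same first-match versus every-match distinction can be sketched in Python with re.sub and its count argument. The sample line is invented for illustration. One difference to keep in mind: sed applies the substitution line by line, while re.sub works on whatever string it is given.

```python
import re

line = '"msg": "login failed", "msg": "retry"'

# count=1 replaces only the first match, similar to sed without the g flag.
first_only = re.sub(r'"msg"', '"message"', line, count=1)

# The default count=0 replaces every match, similar to sed with the g flag.
all_matches = re.sub(r'"msg"', '"message"', line)

print(first_only)
print(all_matches)
```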
For a good reference on what each character represents take a look at this cheat sheet from The Linux Foundation.
Final Thoughts
In conclusion, regex is a powerful tool as shown in these examples.
It can be used for text data cleaning, data extraction, validation, search and replace, and many other tasks.
Now you can start using regex to make your work much more efficient and effective.