Regex: A Deep Dive
Previously, we talked about the tools of the trade in cybersecurity. Many of them are free, built in, and extremely versatile. One of these Swiss Army knives is regular expressions, or regex.
We’ll go over several ways you can use regex in security workflows.
Data Cleaning
One area where regex is particularly useful is in text data cleaning.
For example, you may have a dataset of text that contains unwanted characters, such as extra spaces or line breaks, or that needs to be standardized in a specific way. With regex, you can write patterns to identify and clean up these unwanted characters.
Here's an example of a regex pattern that matches leading and trailing spaces, so that replacing each match with an empty string trims the text:
"^ +| +$"
To match tabs as well as spaces, we can use the following.
"^[ \t]+|[ \t]+$"
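To see the pattern in action, here is a minimal Python sketch. The function name clean_line and the sample string are made up for illustration, and collapsing interior whitespace is an extra step assumed here rather than something the pattern above does on its own.

```python
import re

# Pattern from the text: runs of spaces/tabs at the start or end of a string.
EDGE_WHITESPACE = re.compile(r"^[ \t]+|[ \t]+$")
# Assumption for illustration: interior runs of spaces/tabs become one space.
INNER_WHITESPACE = re.compile(r"[ \t]+")


def clean_line(line: str) -> str:
    """Trim leading/trailing whitespace, then collapse interior runs."""
    trimmed = EDGE_WHITESPACE.sub("", line)
    return INNER_WHITESPACE.sub(" ", trimmed)


print(clean_line("  \tlogin   failed  for\troot \t"))
# Output: login failed for root
```

Replacing each match with an empty string is what turns a "find" pattern into a cleaning step; the pattern by itself only locates the unwanted characters.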
For context, I worked on a project that applied the same idea on a larger scale to normalize our logs. In other words, it pays to learn this stuff.
Text Extraction
Another area where regex can be useful is in text data extraction.
Let’s say you have a dataset that contains information you want to extract and analyze, such as dates, phone numbers, or email addresses. With regex, you can write patterns to identify and extract this information, making it much easier to analyze and work with.
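Here's an example of a regex pattern that matches phone numbers:
"^\d{3}-\d{3}-\d{4}$"
Note that this matches phone numbers in the format of 3 digits (area code), 3 digits, then 4 digits, so something like 305-555-5555. Because of the ^ and $ anchors, it matches only when the entire string (or, in multiline mode, the entire line) is a phone number; to pull a number out of a longer string, drop the anchors.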
The same pattern, without the anchors, wrapped in a bit of Python code could look like the following:
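import re
text = "string with a phone number: 555-555-4321"
# Raw string so the backslashes reach the regex engine unchanged
phone_num = re.search(r"\d{3}-\d{3}-\d{4}", text)
# re.search returns None when nothing matches, so check before calling group()
if phone_num:
    print(phone_num.group(0))
# Output: 555-555-4321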
This uses Python's built-in re module, which provides regular expression support.
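re.search stops at the first match. When a piece of text may contain several phone numbers, re.findall returns all of them. The sketch below is one way to do that; the sample text and the function name extract_phone_numbers are hypothetical, and the \b word boundaries are an added assumption to avoid matching digits inside longer numbers.

```python
import re

# \b keeps the pattern from matching inside a longer run of digits.
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def extract_phone_numbers(text: str) -> list[str]:
    """Return every NNN-NNN-NNNN style number found in text."""
    return PHONE.findall(text)


sample = "Call 305-555-5555 or 555-555-4321; ticket 1234-567-89012 is not a phone number."
print(extract_phone_numbers(sample))
# Output: ['305-555-5555', '555-555-4321']
```

findall returns an empty list when nothing matches, so there is no None case to guard against.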
These examples showed text cleaning and text extraction. In addition to these use cases, regex can be used to perform many other text processing tasks such as search and replace, and text manipulation.
Search and Replace
sed reads the specified files, or standard input, and modifies the input as instructed. It supports regular expressions, which makes it that much more powerful.
sed 's/"msg"/"message"/' file.log > fixed_file.log
This tells sed to replace the quoted word “msg” with “message” in the log. The s (substitute) command replaces only the first occurrence of “msg” on each line with “message”. To go further and replace every occurrence of “msg” with “message”, we would do the following.
sed 's/"msg"/"message"/g' file.log > fixed_file.log
The trailing g flag means global replace, so this will replace ALL occurrences of “msg” with “message” on each line.
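If you would rather stay in Python, re.sub can do the same substitution. This is a rough equivalent and not a drop-in replacement for sed; the sample log line and the function name replace_key are invented for illustration. The count argument plays the role of the g flag: count=1 replaces only the first match, while the default count=0 replaces every match.

```python
import re


def replace_key(line: str, first_only: bool = False) -> str:
    """Rename the quoted "msg" key to "message" in a single log line."""
    # count=1 replaces only the first match; count=0 (the default) replaces all.
    return re.sub(r'"msg"', '"message"', line, count=1 if first_only else 0)


# Hypothetical log line with two "msg" keys.
line = '{"msg": "disk full", "prev": {"msg": "disk at 90%"}}'

print(replace_key(line, first_only=True))
# Output: {"message": "disk full", "prev": {"msg": "disk at 90%"}}
print(replace_key(line))
# Output: {"message": "disk full", "prev": {"message": "disk at 90%"}}
```

To process a whole file this way, you would apply replace_key to each line as you read it, which mirrors how sed works line by line.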
Final Thoughts
In conclusion, regex is a powerful tool, as shown here and in the previous Regex series.
It can be used for text data cleaning, data extraction, validation, and many other tasks. For a good reference on what each character represents, take a look at this cheat sheet from the Linux Foundation.
With the right knowledge and practice, you can start using regex to make your work much more efficient and effective.