Applications of Regex and Python in data transformation for masking of sensitive information and extraction of date details from free text
Introduction
There are many useful applications of Regex. In this article, I would like to cover two of them commonly used for my projects in Singapore. They are
- Masking of sensitive Singapore National ID information
- Extraction of relevant date details from a text field
The full set of codes in a Jupyter Notebook format can be found in the following link https://github.com/ZS-Weng/Data_Engineering/tree/main/Regex
The full code for the various functions can also be found at the end of the article.
1. Masking of Singapore National ID information
In Singapore, the National ID starts with one of the alphabets, S,T,F or G followed by 7 digits and ends with another alphabet e.g. S1234567A. With a simple Regex function, we can detect the occurrence of text matching the pattern and mask the information automatically to a format such as SXXXX567A.
import re
def mask_nric(text):
= re.compile('([STFG])\d{4}(\d{3}[A-Z])')
nric_regex return nric_regex.sub(r'\1XXXX\2', text)
Below is the example of the original text and processed text where the sensitive information have been masked:
2. Extraction of date details from Service Text Field
There are several formats which can represent date and time and I will be highlighting the common scenarios for date representation. The code can be customized to the different date formats accordingly.
2a. Extraction of date details where month is represented by abbreviated or long month name
In this section, the date pattern to be extracted is where the month is an abbreviated or full month name e.g. 22 Dec 2021, 18 October 22 etc. The Regex Pattern for matching the date and output with some additional processing are shown below.
= re.compile(r'''(\d{1,2})?[\s-]?((?:(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)))[\w]?['\s-]?['\s-]?(20\d{2}|\d{2})(?:\s|[A-Za-z]|[\(\)\.\]]|[-_/]|$)''', re.VERBOSE | re.IGNORECASE) string_date_pattern
2b. Extraction of date details where the date is represented in numeric format
Dates are also commonly represented entire in numbers and there are various formats where the date can start with the year, month or day e.g. YYYY-MM-DD, MM.DD.YY, DD/MM/YY etc. For the example, I am using a day-first date pattern and the Regex code and sample output are as below.
= re.compile(r'''([0-3]?[0-9])[./]([01]?[0-9])[./](20\d{2}|\d{2})(?:\s|[A-Za-z]|[\(\)\.\]]|[-_]|$)''', re.VERBOSE | re.IGNORECASE) day_first_pattern
2c. Extraction of date details where month is represented by short and long spelling
There are other instances where date are represented in a more abstract level, in this example by Quarter and Year. In this case, the first date of the quarter is used to represent the date extracted from the Quarter Year format. The Regex code and sample output are as shown below.
= re.compile(r'''Q([1-4])[\s-]?(20\d{2}|\d{2})''', re.VERBOSE | re.IGNORECASE) q_year_pattern
Thanks for reading and hope the information was useful in some way!
Full Codes including Regex Pattern and Functions
### String Date Pattern Regex
= re.compile(r'''
string_date_pattern (\d{1,2})?
[\s-]?
((?:(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)))
[\w]?
['\s-]?
['\s-]?
(20\d{2}|\d{2})
(?:\s|[A-Za-z]|[\(\)\.\]]|[-_/]|$)
''', re.VERBOSE | re.IGNORECASE)
### Function to apply String Date Pattern Regex
def check_string_date(compiled_pattern, text):
# List to contain output date format
= []
list_output_date
# Use text matching to match details
= compiled_pattern.findall(text)
list_dates
for record in list_dates:
= record
day, month, year = (len(day)==0)
flag_blank_day #Putting the default date to 1 if only Month and Year details are present
if flag_blank_day:
= "01"
day if len(year) == 2:
= '20' + year
year = month.capitalize()
month
= ' '.join([day,month,year])
str_date #Handle Scenario where the month is is 3 Letter Short Form
if len(month) == 3:
try:
"%d %b %Y"))
list_output_date.append(datetime.strptime(str_date, #In the event that the date is keyed in out of range
except ValueError:
= ' '.join(["01",month,year])
str_date "%d %b %Y"))
list_output_date.append(datetime.strptime(str_date,
#Handle Scenario where the month is is in long Form
else:
try:
"%d %B %Y"))
list_output_date.append(datetime.strptime(str_date, #In the event that the date is keyed in out of range
except ValueError:
= ' '.join(["01",month,year])
str_date "%d %B %Y"))
list_output_date.append(datetime.strptime(str_date,
if len(list_output_date) > 0:
return (list_output_date)
else:
return [pd.NaT]
### Numeric Date Pattern Regex
= re.compile(r'''
day_first_pattern ([0-3]?[0-9])
[./]
([01]?[0-9])
[./]
(20\d{2}|\d{2})
(?:\s|[A-Za-z]|[\(\)\.\]]|[-_]|$)
''', re.VERBOSE | re.IGNORECASE)
### Function to apply Numeric Date Pattern Regex
def check_numeric_date(compiled_pattern, text, match_type='day_first'):
# List to contain output date format
= []
list_output_date
# Use text matching to match details
= compiled_pattern.findall(text)
list_dates
for record in list_dates:
if match_type == 'day_first':
= record
day, month, year = int(day)
day = int(month)
month if len(year) == 2:
= int('20' + year)
year else:
= int(year)
year
try:
list_output_date.append(datetime(year,month,day))#In the event that the date is keyed in out of range
except ValueError:
print(f"Invalid date: {record}")
if match_type == 'month_first':
= record
day, month, year = int(day)
day = int(month)
month if len(year) == 2:
= int('20' + year)
year else:
= int(year)
year
try:
list_output_date.append(datetime(year,month,day))#In the event that the date is keyed in out of range
except ValueError:
print(f"Invalid date: {record}")
if len(list_output_date) > 0:
return (list_output_date)
else:
return [pd.NaT]
### Quarter Year Date Pattern Regex
= re.compile(r'''
q_year_pattern Q
([1-4])
[\s-]?
(20\d{2}|\d{2})
''', re.VERBOSE | re.IGNORECASE)
### Function to apply Quarter Year Date Pattern Regex
def check_quarter_year(compiled_pattern, text):
# List to contain output date format
= []
list_output_date
# Use text matching to match details
= compiled_pattern.findall(text)
list_dates
for record in list_dates:
= record
quarter, year = int(quarter)
quarter #Get the starting month of the quarter
= 1 + (quarter-1) * 3
month if len(year) == 2:
= int('20' + year)
year else:
= int(year)
year
1))
list_output_date.append(datetime(year,month,
if len(list_output_date) > 0:
return (list_output_date)
else:
return [pd.NaT]
Recommended Resource to learn more about Regex
Out of the different resources I have used to learn about Regex, I found the materials from Al Sweigart most engaging. You can find the free chapter covering Regex from his book here: https://automatetheboringstuff.com/2e/chapter7/