Python basics (15)-regular expressions

re.match()

re.match(pattern, string) tries to match the regular expression at the beginning of the string. If it matches, it returns a match object (Match); otherwise it returns None (note: None, not an empty string "").

import re
print(re.match('hello', 'hello world'))
print(re.match('world', 'hello world'))
# <re.Match object; span=(0, 5), match='hello'>  (Python 3.7+; older versions print _sre.SRE_Match)
# None
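
Because a failed match returns None, calling .group() on the result without checking it raises AttributeError. A minimal sketch of the usual guard follows; the helper name match_text is hypothetical and only used for illustration.

```python
import re


def match_text(pattern, text):
    """Return the matched text, or None if the pattern does not match at the start."""
    m = re.match(pattern, text)
    # re.match returns None on failure, so check before calling .group()
    return m.group() if m else None


print(match_text("hello", "hello world"))  # hello
print(match_text("world", "hello world"))  # None
```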

Single character match

Character    Description
.            Match any single character except a newline
[]           Match any one character listed in the brackets
[^]          Match any one character not listed in the brackets
\s           Match any whitespace character
\S           Match any non-whitespace character
\d           Match a digit, 0-9
\D           Match any non-digit character
\w           Match a word character: a letter, digit, or underscore ([A-Za-z0-9_] for ASCII text)
\W           Match any non-word character
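
The character classes above can be checked one by one. This is a small sketch in which the sample characters are chosen arbitrarily; each pattern is tested against a single character with re.match.

```python
import re

# pattern -> a sample character that the pattern should accept
samples = {
    r"\d": "7",
    r"\D": "x",
    r"\s": " ",
    r"\S": "x",
    r"\w": "_",
    r"\W": "-",
    r"[abc]": "b",
    r"[^abc]": "z",
}

results = {p: bool(re.match(p, ch)) for p, ch in samples.items()}
for pattern, ch in samples.items():
    print(pattern, repr(ch), results[pattern])

# Counter-examples: a hyphen is not a word character,
# and [^abc] rejects the characters it lists
print(re.match(r"\w", "-"))      # None
print(re.match(r"[^abc]", "a"))  # None
```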

Dot match

import re

result = re.match("A.C", "ABC")
print(result.group())
# ABC
result = re.match("A.C", "A1C")
print(result.group())
# A1C

Bracket match

import re

result = re.match("A[B1D]C", "ABC")
print(result.group())
result = re.match("A[BCD]C", "AGC")
if result:
    print(result.group())

Vertical line matching

import re

text = "123test"
regexStr = r"(123|txt)"
matchObj = re.match(regexStr, text)
if matchObj:
    print(matchObj.group(1))
# 123

Start and end match

Character    Description
^            Match at the start of the string; the pattern after the caret must come first
$            Match at the end of the string; the pattern before the dollar sign must come last

import re

text = "test123"
regexStr = r"^t.*3$"
if re.match(regexStr, text):
    print("match")
# match
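
Since re.match already starts at the beginning of the string, the effect of ^ and $ is easier to see with re.search, which scans the whole string. The sketch below uses re.search for that reason; the variable names are illustrative.

```python
import re

text = "test123"

# re.match is anchored at the start anyway, so "^" is optional here
print(bool(re.match(r"t.*3", text)))  # True

# re.search scans the whole string; the anchors restrict where a match may sit
starts_with_t = bool(re.search(r"^t", text))
ends_with_3 = bool(re.search(r"3$", text))
starts_with_e = bool(re.search(r"^e", text))  # "e" occurs, but not at the start
print(starts_with_t, ends_with_3, starts_with_e)  # True True False
```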

Multiple characters match

Character    Description
*            Match the preceding character zero or more times (it may be absent)
+            Match the preceding character one or more times (at least once)
?            Match the preceding character zero or one time; placed after another quantifier, it makes that quantifier non-greedy
{m}          Match the preceding character exactly m times
{m,n}        Match the preceding character from m to n times

import re

regStr = r'[A-Za-z_]+\w*'
result = re.match(regStr, "test_1")
if result:
    print(result.group())
# test_1
result = re.match(regStr, "1_test")  # starts with a digit, so there is no match
if result:
    print(result.group())
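
To make the quantifier table concrete, here is a short sketch that runs one pattern per quantifier. The patterns and test strings are made up for illustration.

```python
import re

# (pattern, text) pairs, each tried with re.match
cases = [
    (r"ab*", "a"),          # * allows zero "b" characters
    (r"ab+", "a"),          # + needs at least one "b"
    (r"colou?r", "color"),  # ? makes the "u" optional
    (r"\d{3}", "12"),       # {3} needs exactly three digits
    (r"\d{2,4}", "12345"),  # {2,4} takes at most four digits
]

matched = []
for pattern, text in cases:
    m = re.match(pattern, text)
    matched.append(m.group() if m else None)

print(matched)
# ['a', None, 'color', None, '1234']
```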

Match grouping

Character     Description
|             Match either the expression on the left or the one on the right
(ab)          Treat the characters in the parentheses as one group
\num          Reference the string matched by group number num
(?P<name>)    Give the group a name (alias)
(?P=name)     Reference the string matched by the group called name

import re

regStr = r'\w{4,20}@(163|qq|gmail)\.com'
result = re.match(regStr, "test@qq.com")
if result:
    print(result.group())

The r prefix creates a raw string, which stops Python from processing backslash escapes. For example, if '\t' appears in a path without the r prefix, it is turned into a tab character; with the prefix, r'\t' keeps the backslash and the t exactly as written.
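
The difference matters most for sequences that mean one thing to Python and another to the regex engine. A small sketch, using \b (backspace in a normal string, word boundary in a regex) as the example:

```python
import re

# Without the r prefix, Python turns "\t" into a single tab character
print(len("\t"), len(r"\t"))  # 1 2

# "\b" is a backspace character in a normal string,
# but a word boundary once it reaches the regex engine unchanged
plain = re.findall("\bcat\b", "cat concat cat")
raw = re.findall(r"\bcat\b", "cat concat cat")
print(plain)  # []
print(raw)    # ['cat', 'cat']
```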

import re

regStr = r'<(\w*)><(\w*)>.*</\2></\1>'
result = re.match(regStr, "<html><head>test</head></html>")
if result:
    print(result.group())
result = re.match(regStr, "<html><head>test</body></html>")
if result:
    print(result.group())

Group alias

import re

regStr = r'<(?P<label1>\w*)><(?P<label2>\w*)>.*</(?P=label2)></(?P=label1)>'
result = re.match(regStr, "<html><head>test</head></html>")
if result:
    print(result.group())
result = re.match(regStr, "<html><head>test</body></html>")
if result:
    print(result.group())
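
Named groups can also be read back by name after a match. The sketch below reuses the pattern from the example above and shows group() with a name, group() with a number, and groupdict().

```python
import re

pattern = r"<(?P<label1>\w*)><(?P<label2>\w*)>.*</(?P=label2)></(?P=label1)>"
result = re.match(pattern, "<html><head>test</head></html>")
if result:
    print(result.group("label1"))  # html
    print(result.group(2))         # head (named groups are still numbered)
    print(result.groupdict())      # {'label1': 'html', 'label2': 'head'}
```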

Match sub-expression

Use parentheses () to enclose the content you want to extract.

import re
content ='hello 12345 python'
result = re.match(r'^h\w{4}\s(\d+)\s\w+', content)
print(result.group())
print(result.group(1))
print(result.span())

Greedy match

The dot (.) matches any character except a newline. The asterisk (*) matches the preceding character zero or more times. Together, dot-star (.*) matches any run of characters, and it takes as many characters as possible, which is called greedy matching. In the greedy expression ^h.*(\d+)\s\w+, group(1) captures only the digit 5: the .* consumes as much as it can, swallowing 1234 and leaving just 5 for (\d+).

import re
content ='hello 12345 python'
result = re.match(r'^h.*(\d+)\s\w+', content)
print(result.group())
print(result.group(1))
print(result.span())
# hello 12345 python
# 5
# (0, 18)

Non-greedy match

Non-greedy matching puts a question mark after the quantifier: .*? matches as few characters as possible, so in the example below (\d+) captures all of 12345.

import re
content ='hello 12345 python'
result = re.match(r'^h.*?(\d+)\s\w+', content)
print(result.group())
print(result.group(1))
print(result.span())
# hello 12345 python
# 12345
# (0, 18)
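
The same contrast shows up whenever a delimiter repeats in the text. A minimal sketch with a made-up HTML fragment:

```python
import re

html = "<b>one</b><b>two</b>"

# Greedy: .* runs to the last </b> in the string
greedy = re.match(r"<b>(.*)</b>", html).group(1)
# Non-greedy: .*? stops at the first </b>
lazy = re.match(r"<b>(.*?)</b>", html).group(1)

print(greedy)  # one</b><b>two
print(lazy)    # one
```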

Match modifier

Modifier    Description
re.I        Case-insensitive matching
re.L        Locale-dependent matching
re.M        Multi-line matching; ^ and $ also match at the start and end of each line
re.S        Make . match any character, including a newline

import re
content ='''hello 12345
python'''
result = re.match(r'^h.*?(\d+).*?n$', content, re.S)
print(result.group())
print(result.group(1))
print(result.span())
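
Flags can be combined with the | operator. The sketch below illustrates re.I and re.M on a made-up three-line string; it uses re.findall (described later in this article) so that every matching line is reported.

```python
import re

text = "Hello\nhello\nHELLO"

# re.I: ignore case
print(bool(re.match(r"hello", text, re.I)))  # True

# Without re.M, ^ and $ anchor only at the start and end of the whole string
no_flag = re.findall(r"^hello$", text)
# With re.M they also anchor at every line; flags are combined with |
multi = re.findall(r"^hello$", text, re.I | re.M)

print(no_flag)  # []
print(multi)    # ['Hello', 'hello', 'HELLO']
```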

search

re.search scans the entire string and returns a match object for the first position where the pattern matches. If no match is found anywhere in the string, it returns None.

import re

result = re.search(r'\d+', "Search number: 1245")
if result:
    print(result.group())
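
To contrast the two functions directly, the sketch below runs re.match and re.search with the same pattern on the same string.

```python
import re

text = "Search number: 1245"

# re.match only tries position 0, and the string does not start with a digit
at_start = re.match(r"\d+", text)
print(at_start)  # None

# re.search keeps scanning and stops at the first match
found = re.search(r"\d+", text)
print(found.group(), found.span())  # 1245 (15, 19)
```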

findall

Find every non-overlapping match of the pattern and return the results as a list, in the order they appear in the string.

import re

rList = re.findall(r'\d+', "Find the number: 1 in 11112 when registering 1245")
for r in rList:
    print(r)
# 1
# 11112
# 1245
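
What findall returns depends on how many groups the pattern contains. A short sketch with a made-up input string:

```python
import re

text = "width=10, height=25"

# No group: the list holds the whole matches
numbers = re.findall(r"\d+", text)
# One group: the list holds only that group's text
names = re.findall(r"(\w+)=\d+", text)
# Several groups: the list holds one tuple per match
pairs = re.findall(r"(\w+)=(\d+)", text)

print(numbers)  # ['10', '25']
print(names)    # ['width', 'height']
print(pairs)    # [('width', '10'), ('height', '25')]
```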

sub

re.sub(pattern, repl, string, count=0, flags=0) replaces each match of pattern in string with repl and returns the resulting string. When repl is a string, it can reference groups with \id, \g<id>, or \g<name>; \0 is not a group reference, so use \g<0> for the whole match. When repl is a function, it must accept a single argument (the Match object) and return the replacement string; group references in the returned string are not expanded. count sets the maximum number of replacements; the default of 0 replaces every match.

import re
content ='hello12345python'
result = re.sub(r'\d', "", content)
print(result)
# hellopython

import re

def func(matchObj):
    if matchObj:
        return "python"

print(re.sub(r"\d+", func,'hello 123'))
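
The group references and the count parameter described above can be sketched as follows. The input strings and the function name double are made up for illustration.

```python
import re

# \g<name> references a named group inside a string replacement
swapped = re.sub(r"(?P<first>\w+) (?P<second>\w+)", r"\g<second> \g<first>", "hello world")
print(swapped)  # world hello

# \g<0> stands for the whole match
wrapped = re.sub(r"\d+", r"[\g<0>]", "a1b22")
print(wrapped)  # a[1]b[22]

# count limits the number of replacements
limited = re.sub(r"\d", "#", "a1b2c3", count=2)
print(limited)  # a#b#c3


def double(match):
    """Receive the Match object and return the replacement text."""
    return str(int(match.group()) * 2)


doubled = re.sub(r"\d+", double, "3 apples and 10 pears")
print(doubled)  # 6 apples and 20 pears
```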

split

Cut the string according to the match and return a list

import re

rList = re.split(r':| ', "Find the number: 1245 1 in 11112 when registered")  # split on a colon or a space
for r in rList:
    print(r)
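
Two further options of re.split are worth a quick sketch: the maxsplit parameter, and a capturing group in the pattern, which keeps the separators in the result. The sample strings are made up.

```python
import re

text = "a, b;c  d"

# Split on any run of commas, semicolons, or whitespace
parts = re.split(r"[,;\s]+", text)
print(parts)  # ['a', 'b', 'c', 'd']

# maxsplit=1 stops after the first split
limited = re.split(r"[,;\s]+", text, maxsplit=1)
print(limited)  # ['a', 'b;c  d']

# A capturing group keeps the separators in the returned list
kept = re.split(r"([,;])", "a,b;c")
print(kept)  # ['a', ',', 'b', ';', 'c']
```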

Reference: Python Basics (15)-Regular Expressions, Tencent Cloud Community, https://cloud.tencent.com/developer/article/1437127