7. Quantifiers and Repetition in Regular Expressions

范先生 · 2000/1/7 · about 4 min read

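Quantifiers are special characters in a regular expression that specify how many times the preceding element of a pattern may repeat. This article covers each quantifier and how to use it; the examples are written in JavaScript.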

Basic quantifiers

`*` Asterisk

The asterisk matches the preceding expression 0 or more times:

const regex = /ab*c/;

console.log(regex.test('ac')); // true - 'b' appears 0 times
console.log(regex.test('abc')); // true - 'b' appears 1 time
console.log(regex.test('abbc')); // true - 'b' appears 2 times
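Because `*` allows zero repetitions, a pattern that consists only of a starred element can succeed without consuming any characters. The sketch below illustrates this with a made-up sample string; the variable names are not from the original article.

```javascript
// b* may match zero characters, so it succeeds at index 0 with an empty match.
const emptyMatch = 'abbbc'.match(/b*/);
console.log(JSON.stringify(emptyMatch[0])); // ""
console.log(emptyMatch.index); // 0

// Requiring the preceding 'a' lets the star consume the whole run of 'b'.
const runMatch = 'abbbc'.match(/ab*/);
console.log(runMatch[0]); // 'abbb'
```

This is one reason to prefer `+` when at least one occurrence is actually required.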

`+` Plus sign

The plus sign matches the preceding expression 1 or more times:

const regex = /ab+c/;

console.log(regex.test('ac')); // false - 'b' must appear at least once
console.log(regex.test('abc')); // true - 'b' appears 1 time
console.log(regex.test('abbc')); // true - 'b' appears 2 times

`?` Question mark

The question mark matches the preceding expression 0 or 1 times:

const regex = /colou?r/; // matches 'color' or 'colour'

console.log(regex.test('color')); // true - 'u' appears 0 times
console.log(regex.test('colour')); // true - 'u' appears 1 time
console.log(regex.test('colouur')); // false - 'u' appears more than once
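A quantifier applies only to the single element directly before it. To make a longer sequence optional, that sequence can be wrapped in a group. The following sketch uses a non-capturing group `(?:...)`; the patterns and sample strings are illustrative assumptions.

```javascript
// ? binds only to the element right before it: here just the 's'.
const scheme = /^https?:\/\//;
console.log(scheme.test('http://example.com')); // true
console.log(scheme.test('https://example.com')); // true
console.log(scheme.test('httpss://example.com')); // false

// A non-capturing group makes the whole sequence 'ember' optional.
const month = /^Nov(?:ember)?$/;
console.log(month.test('Nov')); // true
console.log(month.test('November')); // true
console.log(month.test('Novem')); // false
```

The same rule holds for `*`, `+`, and the brace quantifiers.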

Specifying exact counts

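`{n}` Match exactly n times

const regex = /a{3}/; // matches exactly 3 consecutive 'a' characters

console.log(regex.test('aa')); // false - only 2 'a' characters
console.log(regex.test('aaa')); // true - exactly 3 'a' characters
console.log(regex.test('aaaa')); // true - the unanchored pattern finds 3 consecutive 'a' characters inside the string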
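The last result above is true because `test` only asks whether the pattern occurs somewhere in the string. To require that the whole string consists of exactly n repetitions, the pattern can be anchored with `^` and `$`. A minimal sketch, with illustrative variable names:

```javascript
// Anchors force the three 'a' characters to span the entire string.
const exactlyThree = /^a{3}$/;
console.log(exactlyThree.test('aaa')); // true
console.log(exactlyThree.test('aaaa')); // false

// Placed after a group, {n} repeats the whole group rather than one character.
const threePairs = /^(?:ab){3}$/;
console.log(threePairs.test('ababab')); // true
console.log(threePairs.test('abab')); // false
```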

`{n,}` Match at least n times

const regex = /a{2,}/; // matches at least 2 consecutive 'a' characters

console.log(regex.test('a')); // false - only 1 'a'
console.log(regex.test('aa')); // true - 2 'a' characters
console.log(regex.test('aaa')); // true - 3 'a' characters

`{n,m}` Match between n and m times

const regex = /a{2,4}/; // matches 2 to 4 consecutive 'a' characters

console.log(regex.test('a')); // false - only 1 'a'
console.log(regex.test('aa')); // true - 2 'a' characters
console.log(regex.test('aaaa')); // true - 4 'a' characters
console.log(regex.test('aaaaa')); // true - the unanchored pattern finds 4 consecutive 'a' characters inside the string
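Looking at the matched text, rather than only the boolean from `test`, shows how the upper bound behaves. The sketch below also shows that a lower bound of 0 has to be written explicitly: in a JavaScript pattern without the `u` flag, `{,3}` is not a quantifier and is read as literal characters. Variable names here are illustrative.

```javascript
// The greedy quantifier takes as many 'a' characters as the upper bound allows.
const rangeMatch = 'aaaaa'.match(/a{2,4}/);
console.log(rangeMatch[0]); // 'aaaa'

// With anchors, a string longer than the upper bound is rejected.
const anchoredRange = /^a{2,4}$/;
console.log(anchoredRange.test('aaaaa')); // false

// "At most 3" needs an explicit lower bound of 0.
const upToThree = /^a{0,3}$/;
console.log(upToThree.test('')); // true
console.log(upToThree.test('aaaa')); // false

// Without the u flag, {,3} is treated as the literal text "{,3}".
const notAQuantifier = /^a{,3}$/;
console.log(notAQuantifier.test('aaa')); // false
console.log(notAQuantifier.test('a{,3}')); // true
```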

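Greedy and non-greedy matching

By default, quantifiers are greedy: they match as many characters as possible:

const text = '<div>content</div>';
const greedyRegex = /<.*>/; // greedy match

console.log(text.match(greedyRegex)[0]); // '<div>content</div>' - matches the entire string

Adding ? after a quantifier makes it non-greedy (also called lazy), so that it matches as few characters as possible:

const text = '<div>content</div>';
const lazyRegex = /<.*?>/; // non-greedy match

console.log(text.match(lazyRegex)[0]); // '<div>' - matches only the first HTML tag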
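A common alternative to the lazy quantifier in this situation is a negated character class, which states directly that the match may not run past the first '>'. The sketch below is illustrative only; the variable names are not from the original article.

```javascript
const tagText = '<div>content</div>';

// [^>]* can never consume a '>', so even the greedy form stops at the first tag.
const negatedClass = /<[^>]*>/;
console.log(tagText.match(negatedClass)[0]); // '<div>'

// With the g flag, both approaches list every tag in the string.
const allNegated = tagText.match(/<[^>]*>/g);
const allLazy = tagText.match(/<.*?>/g);
console.log(allNegated); // ['<div>', '</div>']
console.log(allLazy); // ['<div>', '</div>']
```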

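Every quantifier has a corresponding non-greedy version:

*? - non-greedy version of *
+? - non-greedy version of +
?? - non-greedy version of ?
{n}? - non-greedy version of {n} (valid syntax, but it behaves the same as {n} because the count is fixed)
{n,}? - non-greedy version of {n,}
{n,m}? - non-greedy version of {n,m}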
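The difference between each greedy quantifier and its lazy counterpart can be seen by running both against the same input. This is a small sketch with a made-up sample string; `sample` and `results` are illustrative names.

```javascript
const sample = 'aaaa';

const results = {
  greedyStar: sample.match(/a*/)[0],       // 'aaaa'
  lazyStar: sample.match(/a*?/)[0],        // ''
  greedyPlus: sample.match(/a+/)[0],       // 'aaaa'
  lazyPlus: sample.match(/a+?/)[0],        // 'a'
  greedyOptional: sample.match(/a?/)[0],   // 'a'
  lazyOptional: sample.match(/a??/)[0],    // ''
  greedyAtLeast: sample.match(/a{2,}/)[0], // 'aaaa'
  lazyAtLeast: sample.match(/a{2,}?/)[0],  // 'aa'
  greedyRange: sample.match(/a{2,3}/)[0],  // 'aaa'
  lazyRange: sample.match(/a{2,3}?/)[0],   // 'aa'
};
console.log(results);

// A lazy quantifier still expands when the rest of the pattern requires it.
const forcedExpansion = 'aaaab'.match(/a+?b/)[0];
console.log(forcedExpansion); // 'aaaab'
```

Lazy means "try the shortest option first", not "always match the minimum": the engine extends the match as far as needed for the overall pattern to succeed.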

Common use cases

Phone number validation

// Matches US phone numbers of the form (123) 456-7890
const phoneRegex = /^\(\d{3}\) \d{3}-\d{4}$/;

console.log(phoneRegex.test('(123) 456-7890')); // true
console.log(phoneRegex.test('123-456-7890')); // false - format does not match
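If both formats above should be accepted, one possible approach is an alternation for the area-code part while keeping the same counted quantifiers. This is a hypothetical relaxed variant, not a complete phone number validator; `flexiblePhoneRegex` is an illustrative name.

```javascript
// Accept either "(123) 456-7890" or "123-456-7890".
// The alternation keeps the parentheses paired: "(123" without ")" is rejected.
const flexiblePhoneRegex = /^(?:\(\d{3}\) |\d{3}-)\d{3}-\d{4}$/;

console.log(flexiblePhoneRegex.test('(123) 456-7890')); // true
console.log(flexiblePhoneRegex.test('123-456-7890')); // true
console.log(flexiblePhoneRegex.test('(123-456-7890')); // false
console.log(flexiblePhoneRegex.test('123-456-78901')); // false
```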

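Password strength check

// The password must contain at least one digit, one lowercase letter, and one uppercase letter, and be at least 8 characters long
const passwordRegex = /^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,}$/;

console.log(passwordRegex.test('Password123')); // true
console.log(passwordRegex.test('password')); // false - missing a digit and an uppercase letter
console.log(passwordRegex.test('Pass1')); // false - too short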
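The single pattern above only reports pass or fail. When the caller needs to know which requirement was missed, the same rules can be checked one at a time. The sketch below assumes the four rules stated in the comment above; `passwordRules` and `failedPasswordRules` are hypothetical names introduced for illustration.

```javascript
// Each rule is a separate pattern, so a failure can be reported by name.
const passwordRules = [
  { label: 'at least one digit', pattern: /\d/ },
  { label: 'at least one lowercase letter', pattern: /[a-z]/ },
  { label: 'at least one uppercase letter', pattern: /[A-Z]/ },
  { label: 'at least 8 characters', pattern: /^.{8,}$/ },
];

// Returns the labels of every rule the password does not satisfy.
function failedPasswordRules(password) {
  return passwordRules
    .filter((rule) => !rule.pattern.test(password))
    .map((rule) => rule.label);
}

console.log(failedPasswordRules('Password123')); // []
console.log(failedPasswordRules('password')); // ['at least one digit', 'at least one uppercase letter']
console.log(failedPasswordRules('Pass1')); // ['at least 8 characters']
```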

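Extracting the content of HTML tags

const html = '<p>First paragraph</p><p>Second paragraph</p>';
const regex = /<p>(.*?)<\/p>/g;
let match;
const contents = [];

// With the g flag, each exec call resumes from the end of the previous match
while ((match = regex.exec(html)) !== null) {
  contents.push(match[1]);
}

console.log(contents); // ['First paragraph', 'Second paragraph']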
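The same extraction can be written more compactly with `String.prototype.matchAll`, which returns an iterator over all matches of a global pattern. The sketch below also runs the greedy form for contrast, to show why the lazy quantifier matters here. Variable names are illustrative.

```javascript
const sampleHtml = '<p>First paragraph</p><p>Second paragraph</p>';

// Lazy: each match stops at the nearest closing tag.
const lazyContents = Array.from(sampleHtml.matchAll(/<p>(.*?)<\/p>/g), (m) => m[1]);
console.log(lazyContents); // ['First paragraph', 'Second paragraph']

// Greedy: one match runs from the first <p> to the last </p>.
const greedyContents = Array.from(sampleHtml.matchAll(/<p>(.*)<\/p>/g), (m) => m[1]);
console.log(greedyContents); // ['First paragraph</p><p>Second paragraph']
```

Regular expressions like this are only suitable for simple, well-formed fragments; nested or irregular markup is better handled by an HTML parser.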

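Performance considerations for quantifiers

Keep the following performance issues in mind when using quantifiers:

Backtracking: overusing greedy quantifiers can cause heavy backtracking and hurt performance. For example, /a*a*a*a*a*b/ can backtrack heavily on a long run of 'a' that contains no 'b', because the engine tries many ways of dividing the run among the five a* terms before giving up.
Avoid nested quantifiers: a pattern such as /(a+)+b/ can cause exponential backtracking.
Use atomic groups: in engines that support them (for example PCRE, Java, and .NET), an atomic group (?>...) prevents backtracking into the group. JavaScript regular expressions do not support this syntax.
Use non-greedy quantifiers judiciously: they can improve performance in some cases, but not in all.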
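The points above can be sketched in JavaScript as follows. The nested pattern is compared with an equivalent flat pattern, and with an emulation of an atomic group built from a lookahead plus a backreference, a commonly used workaround that relies on the engine not backtracking into a lookahead once it has succeeded. The failing input is deliberately short so the nested pattern still finishes quickly, and timings will vary by engine and machine. All variable names are illustrative.

```javascript
// Nested quantifier: many ways to split a run of 'a' between the inner and outer +.
const nestedQuantifier = /(a+)+b/;
// Accepts the same strings, but there is only one way to consume the run of 'a'.
const flatQuantifier = /a+b/;
// Emulated atomic group: the lookahead captures the longest run of 'a',
// then \1 consumes exactly that text with nothing left to backtrack into.
const emulatedAtomic = /(?:(?=(a+))\1)+b/;

const matchingInput = 'aaaab';
const failingInput = 'a'.repeat(18); // no 'b', so every pattern must fail

const outcomes = [];
for (const pattern of [nestedQuantifier, flatQuantifier, emulatedAtomic]) {
  const start = Date.now();
  const outcome = [pattern.test(matchingInput), pattern.test(failingInput)];
  outcomes.push(outcome);
  console.log(pattern.source, outcome, `${Date.now() - start} ms`);
}

// Atomic behaviour can change what matches: the run of 'a' is never given back.
const ordinary = /a+a/.test('aaa');
const atomicLike = /(?=(a+))\1a/.test('aaa');
console.log(ordinary); // true
console.log(atomicLike); // false
```

Increasing the length of `failingInput` by a few characters at a time makes the growth of the nested pattern visible, while the other two stay fast.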

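Quantifiers are a powerful feature of regular expressions that let you match complex patterns with concise expressions. The next article covers grouping and capturing in regular expressions.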