5. Grouping and Capturing in Regular Expressions

Mr. Fan (范先生), 2000/1/5, about a 4-minute read

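Grouping is a core concept in regular expressions. It lets you treat several characters as a single unit, so that you can apply a quantifier to the whole unit or extract a specific part of the match. This article walks through the grouping and capturing techniques available in regular expressions.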

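Basic capture groups

Wrapping part of a pattern in parentheses () creates a capture group. A capture group serves two main purposes:

It treats the pattern inside the parentheses as one unit, so a quantifier can be applied to the whole group
It captures the matched text so that it can be used later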

const regex = /(\w+) (\w+)/;
const text = 'John Doe';

const matches = regex.exec(text);
console.log(matches[0]); // 'John Doe' - full match
console.log(matches[1]); // 'John' - first capture group
console.log(matches[2]); // 'Doe' - second capture group
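
The example above shows capturing, but not the first purpose, applying a quantifier to a whole group. The following is a minimal sketch of that case; the variable names are illustrative and not part of the original article.

```javascript
// Without a group, the quantifier applies only to the preceding character.
const withoutGroup = /^ab+$/; // 'a' followed by one or more 'b'
// With a group, the quantifier applies to the whole unit 'ab'.
const withGroup = /^(ab)+$/;  // one or more repetitions of 'ab'

console.log(withoutGroup.test('abbb'));   // true
console.log(withoutGroup.test('ababab')); // false
console.log(withGroup.test('ababab'));    // true
console.log(withGroup.test('abbb'));      // false

// A capture group that repeats keeps only the text of its last iteration.
const repeated = /^(\d{2})+$/.exec('123456');
console.log(repeated[1]); // '56'
```

Note the last line: a quantified capture group does not collect every repetition, only the final one.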

Named capture groups

ES2018 introduced named capture groups, written with the (?<name>...) syntax:

const regex = /(?<firstName>\w+) (?<lastName>\w+)/;
const text = 'John Doe';

const matches = regex.exec(text);
console.log(matches.groups.firstName); // 'John'
console.log(matches.groups.lastName); // 'Doe'

Named capture groups make code easier to read, especially when a pattern contains several capture groups.
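
As a small sketch of how named groups are often consumed in practice, the snippet below destructures the groups object and guards against a failed match. The names nameRegex, nameMatch, and missing are assumptions made for this illustration.

```javascript
const nameRegex = /(?<firstName>\w+) (?<lastName>\w+)/;
const nameMatch = nameRegex.exec('John Doe');

// exec() returns null when nothing matches, so guard before reading groups.
if (nameMatch !== null) {
  const { firstName, lastName } = nameMatch.groups;
  console.log(firstName, lastName); // John Doe
  // Named groups are still numbered, so numeric indexes keep working.
  console.log(nameMatch[1] === firstName); // true
}

// For an input that does not match, optional chaining avoids a TypeError.
const missing = nameRegex.exec('John');
console.log(missing?.groups?.firstName); // undefined
```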

Non-capturing groups

Sometimes you only want to group part of a pattern without capturing it. In that case, use a non-capturing group (?:...):

const regex = /(?:\d{3})-(\d{3})-(\d{4})/;
const text = '123-456-7890';

const matches = regex.exec(text);
console.log(matches[0]); // '123-456-7890' - full match
console.log(matches[1]); // '456' - first capture group
console.log(matches[2]); // '7890' - second capture group
// Note: the area code '123' is not captured

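The main advantages of non-capturing groups are that they can be marginally faster (the regex engine does not have to store the matched text) and that they keep the numbering of the remaining capture groups predictable.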
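
The phone number example does not strictly need its group, so here is a sketch where grouping is required (for alternation) and the choice between capturing and non-capturing changes the indexes. The title/surname pattern is a made-up example, not taken from the original article.

```javascript
// Capturing version: the alternation takes index 1 and pushes the surname to index 2.
const capturing = /(Mr|Ms|Dr)\. (\w+)/.exec('Dr. Smith');
console.log(capturing[1]); // 'Dr'
console.log(capturing[2]); // 'Smith'

// Non-capturing version: the alternation is still grouped, but the surname is index 1.
const nonCapturing = /(?:Mr|Ms|Dr)\. (\w+)/.exec('Dr. Smith');
console.log(nonCapturing[1]);     // 'Smith'
console.log(nonCapturing.length); // 2 (full match plus one capture group)
```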

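Referencing capture groups

1. Using backreferences inside the regular expression

Within the same regular expression, you can write a backslash followed by a group's number (\1, \2, and so on) to refer back to the text an earlier capture group matched: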

const regex = /<(\w+)>.*?<\/\1>/; // \1 refers to the first capture group
const text = '<div>content</div>';

console.log(regex.test(text)); // true - opening and closing tags match
console.log(regex.test('<div>content</span>')); // false - opening and closing tags differ

A backreference to a named capture group uses \k<name>:

const regex = /<(?<tag>\w+)>.*?<\/\k<tag>>/;
const text = '<div>content</div>';

console.log(regex.test(text)); // true
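
Another common use of backreferences is spotting accidentally doubled words. The following is a rough sketch under the assumption that words are runs of \w characters separated by whitespace; the sample sentences are invented for illustration.

```javascript
// A whole word, whitespace, then the same word again (case-insensitive).
const doubledWord = /\b(\w+)\s+\1\b/i;
console.log(doubledWord.test('This is is a typo')); // true
console.log(doubledWord.test('This is fine'));      // false

// Same idea with a named group, used globally to collapse each doubled word.
const doubledWordAll = /\b(?<word>\w+)\s+\k<word>\b/gi;
const collapsed = 'It was the the best of of times'.replace(doubledWordAll, '$<word>');
console.log(collapsed); // 'It was the best of times'
```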

2. Using references in the replacement string

In String.prototype.replace(), you can use $n in the replacement string to refer to capture group n:

const text = 'John Doe';

// Swap the first name and the last name
const swapped = text.replace(/(\w+) (\w+)/, '$2, $1');
console.log(swapped); // 'Doe, John'

With named capture groups, use $<name>:

const text = 'John Doe';

const swapped = text.replace(/(?<firstName>\w+) (?<lastName>\w+)/, '$<lastName>, $<firstName>');
console.log(swapped); // 'Doe, John'
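
The $<name> syntax only works in a replacement string. When the replacement is a function, and the pattern contains named groups, the groups object is passed as the last argument. Here is a minimal sketch of that form; the input and the formatting choice are assumptions for illustration.

```javascript
const fullName = 'john doe';

const formatted = fullName.replace(
  /(?<firstName>\w+) (?<lastName>\w+)/,
  (...args) => {
    // With named groups in the pattern, the last argument is the groups object.
    const { firstName, lastName } = args[args.length - 1];
    return `${lastName.toUpperCase()}, ${firstName}`;
  }
);
console.log(formatted); // 'DOE, john'
```

A function is worth the extra lines whenever the replacement needs logic, such as changing case, that a plain replacement string cannot express.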

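Nested groups

Capture groups can be nested. Groups are numbered by the position of their opening parenthesis, counted from left to right: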

const regex = /((a+)b+)c/;
const text = 'aabbc';

const matches = regex.exec(text);
console.log(matches[0]); // 'aabbc' - full match
console.log(matches[1]); // 'aabb' - first capture group
console.log(matches[2]); // 'aa' - second capture group
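
To make the numbering rule more concrete, here is a sketch with a deeper nesting and an optional inner group. The time format and the name timePattern are hypothetical, chosen only to show how indexes line up and what an unused group yields.

```javascript
// Numbering by opening parenthesis:
//   1: the whole time   2: hours   3: minutes
//   4: ':' plus seconds (optional)   5: seconds
const timePattern = /^((\d{2}):(\d{2})(:(\d{2}))?)$/;

const full = timePattern.exec('12:30:45');
console.log(full.slice(1)); // [ '12:30:45', '12', '30', ':45', '45' ]

// Groups that did not take part in the match are undefined, not ''.
const short = timePattern.exec('12:30');
console.log(short.slice(1)); // [ '12:30', '12', '30', undefined, undefined ]
```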

Practical examples

1. Parsing a URL

const urlRegex = /^(?<protocol>https?:\/\/)?(?<domain>[\w\.-]+)(?<port>:\d+)?(?<path>\/.*)?$/;
const url = 'https://example.com:8080/path/to/resource';

const matches = urlRegex.exec(url);
console.log(matches.groups.protocol); // 'https://'
console.log(matches.groups.domain); // 'example.com'
console.log(matches.groups.port); // ':8080'
console.log(matches.groups.path); // '/path/to/resource'
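
Because protocol, port, and path are optional in this pattern, their groups are undefined when the URL omits them. The sketch below wraps the same pattern in a hypothetical helper, parseUrl, that fills in defaults. The default values (null and '/') are assumptions for illustration, not something the original article prescribes.

```javascript
const urlPattern = /^(?<protocol>https?:\/\/)?(?<domain>[\w.-]+)(?<port>:\d+)?(?<path>\/.*)?$/;

// Hypothetical helper: returns null for non-matching input, otherwise
// a plain object with defaults for the optional parts.
function parseUrl(url) {
  const m = urlPattern.exec(url);
  if (m === null) return null;
  const { protocol, domain, port, path } = m.groups;
  return {
    protocol: protocol ?? null,
    domain,
    // Strip the leading ':' and convert to a number when a port is present.
    port: port === undefined ? null : Number(port.slice(1)),
    path: path ?? '/',
  };
}

console.log(parseUrl('example.com/docs'));
// { protocol: null, domain: 'example.com', port: null, path: '/docs' }
console.log(parseUrl('https://example.com:8080'));
// { protocol: 'https://', domain: 'example.com', port: 8080, path: '/' }
```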

2. Converting date formats

// Convert MM/DD/YYYY to YYYY-MM-DD
const dateRegex = /(?<month>\d{1,2})\/(?<day>\d{1,2})\/(?<year>\d{4})/;
const date = '9/23/2025';

const formattedDate = date.replace(dateRegex, '$<year>-$<month>-$<day>');
console.log(formattedDate); // '2025-9-23'

// Add leading zeros
const paddedDate = date.replace(dateRegex, (match, month, day, year) => {
  return `${year}-${month.padStart(2, '0')}-${day.padStart(2, '0')}`;
});
console.log(paddedDate); // '2025-09-23'

3. Extracting HTML attributes

const html = '<a href="https://example.com" target="_blank" rel="noopener">link</a>';
const attrRegex = /(?<attribute>[\w-]+)="(?<value>[^"]*)"/g;

let match;
while ((match = attrRegex.exec(html)) !== null) {
  console.log(`${match.groups.attribute}: ${match.groups.value}`);
}
// Output:
// href: https://example.com
// target: _blank
// rel: noopener
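
The exec() loop above relies on the regex object's lastIndex state. As an alternative sketch, String.prototype.matchAll() (added in ES2020) iterates over all matches without a manual loop; it requires the g flag and throws a TypeError otherwise. The names anchorHtml, attrPattern, and attributes are illustrative.

```javascript
const anchorHtml = '<a href="https://example.com" target="_blank" rel="noopener">link</a>';
const attrPattern = /(?<attribute>[\w-]+)="(?<value>[^"]*)"/g;

// Collect every attribute/value pair into a plain object.
const attributes = {};
for (const m of anchorHtml.matchAll(attrPattern)) {
  attributes[m.groups.attribute] = m.groups.value;
}

console.log(attributes);
// { href: 'https://example.com', target: '_blank', rel: 'noopener' }
```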

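Caveats and best practices

Use capture groups in moderation: too many capture groups hurt both performance and readability. If you do not need the captured text, use a non-capturing group.
Prefer named capture groups: names are more readable and self-documenting than numeric indexes.
Avoid excessive nesting: deeply nested groups make a regular expression hard to understand. Consider splitting a complex pattern into several steps.
Watch the group indexes: editing a regular expression can shift the indexes of its capture groups, which can break code that relies on them.
Be wary of RegExp.$1 through RegExp.$9: these deprecated static properties reflect the most recent regex operation, so they are easily affected by the order in which code runs. It is best to avoid them.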
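
As one possible reading of the advice to split a complex pattern into steps, the sketch below parses a made-up log line in two passes instead of one large nested expression. The log format, logLine, and the field names are assumptions for illustration only.

```javascript
const logLine = '2025-09-23 level=warn user=alice';

// Step 1: separate the leading date from the rest of the line.
const lineMatch = /^(?<date>\d{4}-\d{2}-\d{2}) (?<rest>.*)$/.exec(logLine);

// Step 2: pull key=value pairs out of the remainder with a small, flat pattern.
const fields = {};
if (lineMatch !== null) {
  for (const m of lineMatch.groups.rest.matchAll(/(?<key>\w+)=(?<value>\S+)/g)) {
    fields[m.groups.key] = m.groups.value;
  }
}

console.log(lineMatch.groups.date); // '2025-09-23'
console.log(fields);                // { level: 'warn', user: 'alice' }
```

Each pattern stays short, uses named groups, and has no nesting, so changing one step does not shift group indexes in the other.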

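Mastering grouping and capturing makes regular expressions considerably more powerful and flexible, and lets you handle more complex text analysis and transformation tasks. The next article covers performance optimization techniques for regular expressions.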