C++变位词问题分析_C/C++

在《编程珠玑》一书的第二章提到了一个变位词问题，变位词指的是一个单词可以通过改变其他单词中字母的顺序来得到，也叫做兄弟单词，如army->mary。由变位词可以引申出几个算法问题，包括字符串包含问题，比较两个字符串是否是变位词，以及找出字典中变位词集合的问题。

一、字符串包含问题

(1) 问题描述：存在字符串1和字符串2，假设字符串2相对较短，如何快速地判定字符串2中的字符都存在于字符串1里（假定字符串只包含字母）？

(2) 举例：字符串1为ABCDEFGHIJK，字符串2为ABCDE，则字符串1包含字符串2，因为字符串2中包含的字母在字符串1中也都有。

(3) 解决方案：

思路一

最直接的思路就是针对字符串2中的每个字符，轮询字符串1进行比较，看是否在字符串1里面。很明显这种的时间效率为O(n*m)。

									/************************************************************************* 

									  > File Name: test.cpp 

									  > Author: SongLee 

									 ************************************************************************/

									#include<iostream> 

									#include<string> 

									using namespace std; 

									void Compare(string long_str, string short_str) 

									{ 

									  int i,j; 

									  for(i=0; i<short_str.size(); ++i) 

									  { 

									    for(j=0; j<long_str.size(); ++j) 

									    { 

									      if(long_str[j] == short_str[i]) 

									      { 

									        break; 

									      } 

									    } 

									    if(j == long_str.size()) 

									    { 

									      cout << "false" << endl; 

									      return; 

									    } 

									  } 

									  cout << "true" << endl; 

									  return; 

									} 

									int main() 

									{ 

									  string l = "ABCDEFGHIJK"; 

									  string s = "ABCDEF"; 

									  Compare(l, s); 

									  return 0; 

									}

思路二

这里由于假定了字符串只包含字母，所以我们可以用一个额外的数组flag[26]作为26个字符标识位，先遍历长字符串将对应的标识位置1，再遍历短字符串，如果对应的标示位都是1，则包含；否则不包含。这种方法的时间复杂度为O(n+m)，为了提高空间效率，这里不使用数组而使用26个bit位来作为标示位（bitset容器）。

									/************************************************************************* 

									  > File Name: test1.cpp 

									  > Author: SongLee 

									 ************************************************************************/

									#include<iostream> 

									#include<bitset> 

									#include<string> 

									using namespace std; 

									bool Compare(string long_str, string short_str) 

									{ 

									  bitset<26> flag; 

									  for(int i=0; i<long_str.size(); ++i) 

									  { 

									    // flag.set(n)置第n位为1 

									    flag.set(long_str[i]-'A'); 

									  } 

									  for(int i=0; i<short_str.size(); ++i) 

									  { 

									    // flag.test(n)判断第n位是否为1 

									    if(!flag.test(short_str[i]-'A')) 

									      return false; 

									  } 

									  return true; 

									} 

									int main() 

									{ 

									  string l = "ABCDEFGHIJK"; 

									  string s = "ABCDEZ"; 

									  if(Compare(l, s)) 

									    cout << "true" << endl; 

									  else

									    cout << "false" << endl; 

									  return 0; 

									}

这种方法还可以进行优化，例如如果长字串的前缀就为短字串，那么我们可以不需要n+m次，而只需要2m次。具体实现请自己思考。

思路三

给每个字母分配一个素数，从2开始到3，5，7...遍历长字串，求得每个字符对应素数的乘积。然后遍历短字串，判断该乘积能否被短字符串中的字符对应的素数整除，如果除的结果存在余数，则说明有不匹配的字母；如果整个过程都没有余数，则说明短字串中的字母在长字串里都有。这种方法的时间复杂度也是O(n+m)，需要26个额外空间存素数，但是这种方法有一个缺点就是需要跟大整数打交道，因为乘积可能非常大。（这里我们使用<cstdint>头文件中定义的int64_t和uint64_t）

									/************************************************************************* 

									  > File Name: test2.cpp 

									  > Author: SongLee 

									 ************************************************************************/

									#include<iostream> 

									#include<string> 

									#include<stdint.h> 

									//#include<cstdint> // C++11 

									using namespace std; 

									bool Compare(string long_str, string short_str) 

									{ 

									  unsigned int primeNum[26] = {2,3,5,7,11,13,17,19,23,29,31,37,41,43,47, 

									            53,59,61,67,71,73,79,83,89,97,101}; 

									  /* int64_t和uint64_t分别表示64位的有符号和无符号整形数 */

									  /* 在不同位数机器的平台下通用，都是64位 */

									  uint64_t ch = 1; 

									  for(int i=0; i<long_str.size(); ++i) 

									  { 

									    ch = ch*primeNum[long_str[i]-'A']; 

									  } 

									  for(int i=0; i<short_str.size(); ++i) 

									  { 

									    if(ch%primeNum[short_str[i]-'A'] != 0) 

									      return false; 

									  } 

									  return true; 

									} 

									int main() 

									{ 

									  string l = "ABCDEFGHIJK"; 

									  string s = "ABCDEK"; 

									  if(Compare(l, s)) 

									    cout << "true" << endl; 

									  else

									    cout << "false" << endl; 

									  return 0; 

									}

二、比较两个字符串是否为变位词

(1) 问题描述：如果两个字符串的字符一样，但是顺序不一样，被认为是兄弟字符串，问如何在迅速匹配兄弟字符串（如，bad和adb就是兄弟字符串）。

(2) 注意：第一点中讨论了字符串包含问题，但是不要以为两个字符串互相包含就是（变位词）兄弟字符串了，例如aabcde和edcba互相包含，但它们不是变位词。

(3) 解决方案：

思路一

给每个字母分配一个素数，可以通过判断两个字符串的素数乘积是否相等。跟上述素数法一样，时间复杂度也是O(n+m)，需要跟大整数打交道。

									/************************************************************************* 

									  > File Name: test3.cpp 

									  > Author: SongLee 

									 ************************************************************************/

									#include<iostream> 

									#include<string> 

									#include<stdint.h> 

									//#include<cstdint> // C++11 

									using namespace std; 

									bool Compare(string s1, string s2) 

									{ 

									  unsigned int primeNum[26] = {2,3,5,7,11,13,17,19,23,29,31,37,41,43,47, 

									            53,59,61,67,71,73,79,83,89,97,101}; 

									  uint64_t ch = 1; 

									  for(int i=0; i<s1.size(); ++i) 

									  { 

									    ch = ch*primeNum[s1[i]-'a']; 

									  } 

									  for(int i=0; i<s2.size(); ++i) 

									  { 

									    ch = ch/primeNum[s2[i]-'a']; 

									  } 

									  if(ch == 1) 

									    return true; 

									  else

									    return false; 

									} 

									int main() 

									{ 

									  string s1 = "abandon"; 

									  string s2 = "banadon"; 

									  if(Compare(s1, s2)) 

									    cout << "They are brother words!" << endl; 

									  else

									    cout << "They aren't brother words!" << endl; 

									  return 0; 

									}

思路二

将两个字符串按照字母表顺序排序，看排序后的字符串是否相等，如果相等则是兄弟字符串（变位词）。这种方法的时间效率根据你使用的排序算法不同而不同。当然，你可以自己写排序算法，这里我们使用C++的STL中的sort()函数对字符串进行排序。

									/************************************************************************* 

									  > File Name: test4.cpp 

									  > Author: SongLee 

									 ************************************************************************/

									#include<iostream> 

									#include<algorithm> 

									#include<string> 

									using namespace std; 

									// 自定义序函数（二元谓词） 

									bool myfunction(char i, char j) 

									{ 

									  return i > j; 

									} 

									bool Compare(string s1, string s2) 

									{ 

									  // 采用泛型算法对s1,s2排序，sort()采用的是快速排序算法 

									  sort(s1.begin(), s1.end(), myfunction); 

									  sort(s2.begin(), s2.end(), myfunction); 

									  if(!s1.compare(s2)) // 相等返回0 

									    return true; 

									  else

									    return false; 

									} 

									int main() 

									{ 

									  string s1 = "abandon"; 

									  string s2 = "banadon"; 

									  if(Compare(s1, s2)) 

									    cout << "They are brother words!" << endl; 

									  else

									    cout << "They aren't brother words!" << endl; 

									  return 0; 

									}

三、字典找出所有变位词集合（重点）

(1) 问题描述：给定一个英语字典，找出其中的所有变位词集合。

(2) 解决方案：

思路一

对于这个问题，最快想到的最直接的方法就是针对每一个单词跟字典中的其他单词进行比较。然而，假设一次比较至少花费1微秒的时间，则拥有二十万单词的字典将花费：200000单词 x 200000比较/单词 x 1微秒/比较 = 40000x10^6秒 = 40000秒 ≈ 11.1小时。比较的次数太多了，导致效率低下，我们需要找出效率更高的方法。

思路二

标识字典中的每一个单词，使得在相同变位词类中的单词具有相同的的标识，然后集中具有相同标识的单词。将每个单词按照字母表排序，排序后得到的字符串作为该单词的标识。那么对于该问题的解题过程可以分为三步：第一步，读入字典文件，对单词进行排序得到标识；第二步，将所有的单词按照其标识的顺序排序；第三步，将同一个变位词类中的各个单词放到同一行中。

这里出现了标识-单词（key-value）对，我们很容易想到C++中的关联容器map，使用map的好处就是：

动态管理内存，容器大小动态改变；
单词与它的标识一一对应，对于相同标识(key)的单词直接加在值(value)后面；
无需根据标识排序，因为map会自动按关键字有序（默认升序）。
所以，在将每个单词及其标识存入map以后，就可以直接遍历输出了，每一个map元素就是一个变位词集合。

C++实现代码如下：

									/************************************************************************* 

									  > File Name: test5.cpp 

									  > Author: SongLee 

									 ************************************************************************/

									#include<iostream> 

									#include<fstream>  // file I/O 

									#include<map>    // map 

									#include<string>   // string 

									#include<algorithm> // sort 

									using namespace std; 

									/* 

									 *map是C++中的关联容器 

									 *   按关键字有序 

									 *   关键字不可重复 

									 */

									map<string, string> word; 

									/* 自定义比较函数（用于排序） */

									bool myfunction(char i, char j) 

									{ 

									  return i < j; 

									} 

									/* 

									 *对每个单词排序 

									 *排序后字符串作为关键字，原单词作为值 

									 *存入map中 

									 */

									void sign_sort(const char* dic) 

									{ 

									  // 文件流 

									  ifstream in(dic); 

									  if(!in) 

									  { 

									    cout << "Couldn't open file: " + string(dic) << endl; 

									    return; 

									  } 

									  string aword; 

									  string asign; 

									  while(in >> aword) 

									  { 

									    asign = aword; 

									    sort(asign.begin(), asign.end(), myfunction); 

									    // 若标识不存在，创建一个新map元素，若存在，加在值后面 

									    word[asign] += aword + " "; 

									  } 

									  in.close(); 

									} 

									/* 

									 *写入输出文件 

									 */

									void write_file(const char* file) 

									{ 

									  ofstream out(file); 

									  if(!out) 

									  { 

									    cout << "Couldn't create file: " + string(file) << endl; 

									    return; 

									  } 

									  map<string, string>::iterator begin = word.begin(); 

									  map<string, string>::iterator end = word.end(); 

									  while(begin != end) 

									  { 

									    out << begin->second << "\n"; 

									    ++begin; 

									  } 

									  out.close(); 

									} 

									int main() 

									{ 

									  string dic;  

									  string outfile; 

									  cout << "Please input dictionary name: "; 

									  cin >> dic; 

									  cout << "Please input output filename: "; 

									  cin >> outfile; 

									  sign_sort(dic.c_str()); 

									  write_file(outfile.c_str()); 

									  return 0; 

									}