我需要将.text文件中的文本复制到另一个.txt文件中。我唯一的限制是，如果有任何重复，我只会把这个词放在新文件中一次。分隔原始文件中单词的唯一值是空格。我正在考虑将整个文本复制到一个字符串中，然后检查重复，但我不知道如何检查，如果是一个好主意。你们能帮我一些想法吗？复制文件中的单词

2015-07-20 Marco

使二叉树（或类似线索，地图（或设置）） – BLUEPIXY

战略

我们需要的数据结构，可容纳单词的数量不受限制，可以称之为SET。

这个结构必须有一个操作来添加一个单词，让它叫做SET :: ADD。
这种操作的返回值必须表明单词是否已添加或是否有错误。

的SET结构，就是一个字只能添加一次，所以有通过SET返回了两个错误值的特点::地址：GENERIC_ERROR用于内部实现错误和DUPLICATE为尝试插入已经存在的单词。

的SET结构也有一个操作对其进行初始化（SET :: INIT）和一个释放它（SET ::免费）。

然后我们从输入文件中读取单词，每次只读一个单词，然后我们将每个单词添加到SET结构中。
如果插入成功，我们将这样一个单词写入输出文件，否则我们跳过这一步。

伪算法

1. S = SET::INIT 
2. FIN = OpenFile(input_file) 
3. FOUT = OpenFile(output_file); 
4. FOR EACH WORD IN FIN 
4.1 IF SET::ADD(S, WORD) == SUCCESS THEN 
4.1.1 WriteToFile(FOUT, WORD) 
4.2 END IF 
5. CloseFile(FIN); 
6. CloseFile(FOUT); 
7. SET::FREE(S);

的战术

真正艰苦的工作在这里正在实施SET结构。
数据结构由可在其上执行的操作以及此操作的前后条件定义。

所以理论上我们只需要做到这一点实现SET当简单的事情::地址：搜索，如果这个词已经存在，添加它：

1. FUNCTION SET::ADD(S, W) 
1.1 FOR EACH WORD IN S 
1.1.1 IF WORD == W THEN 
1.1.1.1 RETURN DUPLICATE 
1.1.2 END IF 
1.2 ADD W TO S

这两个步骤严重依赖实施。
对于具有这种需求的数据结构有很多实现，例如非常天真的实现可以使用固定大小的指针阵列。然而这具有严重的缺点。
更好的实现可能是链接列表，这让我们插入无限数量的单词，但需要线性时间来插入和搜索！

所以我们进入时间复杂的领域。
现在我会告诉你，这个结构可以用分期固定的时间来实现。但让我们从头开始。

从链表的下一个逻辑步骤是：用二叉树，使用和哈希表。
两者都可以同时进行插入和搜索，即两个操作都可以合并。
第一个采取ø（为log N）来插入和搜索，第二次在Ö（）。
然而，第一个更结构化让我们不仅做搜索，还做有序搜索。这个特性对这个问题没有用，所以我们应该选择哈希表。

我选择了树。这是因为二叉树让你用指针和递归练习比散列表更好，这是一个Well Well Type（当你参加一个类型理论课你会感谢我！）。

如果你不熟悉二叉树，现在就开始吧！询问Google或您的老师！

实施

我们的树的节点应该包含

它存储
的链接文字（即指针）到它的左子树
的链接（即指针）其右子树

最后两个很简单，第一个可以实现为poi或者作为固定大小的结构内阵列。
我选择了后者，因为它使得节点只需要一个呼叫到malloc，这也得到了一个事实的支持，即由于我们用fscanf来读取它，所以我们有一个词的最大尺寸。

的一些注意事项，以代码

我用指针的指针来实现add_word，这使溶液优雅，但保持铅笔和纸在手！
此外，我（应该）已确保通过使用strncpy,snprintf并动态地将fscanf格式说明符与右len一起使用，不会发生缓冲区溢出。 这是很好的做法 -
我已经使用了assert检查是否为格式说明缓冲区分配的大小不够大，这是因为正确的大小可以通过编程来计算，一旦代码被编译它保持固定，所以不需要重运行检查。
与fscanf使用的格式说明的形式为％S其中是MAX_WORD_SIZE，因此，例如它原来是“40多岁的％”。
我使用递归获得乐趣，将三个从叶子释放到根。

#include <stdio.h> 
#include <string.h> 
#include <stdlib.h> 
#include <assert.h> 


/* Values returned by add_word */ 
/* The word has been added to the tree  */ 
#define AW_ADDED   0 
/* The word cannot be added as malloc failed*/   
#define AW_ERROR  -1 
/* The word is a duplicate     */ 
#define AW_DUPLICATE  1  

/* Maximum size of a word     */ 
#define MAX_WORD_SIZE 40   


/* Structure of the binary tree node  */ 
typedef struct node 
{ 
    struct node* left;    /* Ptr to left (less than) branch  */ 
    struct node* right;    /* Ptr to right (greater than) branch */ 
    char word[MAX_WORD_SIZE+1];  /* Word stored in the node    */ 
} node; 


/* 
    Add a word to the tree identified by the root pointer. 
    This function allocate all the memory itself, the root of a tree is a pointer to a node structure. 

    root is a pointer to the root (a ptr to ptr since the root may be updated) 
    word is the word to add 
*/ 
int add_word(node** root, const char* word) 
{ 
    int compare; 

    /* Traverse the tree until you find a null pointer (beware, we work with ptr to ptr) */ 
    while (*root) 
    { 
     /* Compare the current node word with the given one */ 
     compare = strcmp(word, (*root)->word); 

     /* They are equal? Easy, just return the appropriate value */ 
     if (!compare) 
      return AW_DUPLICATE; 

     /* Move to the left of right based on the sign of the comparison */ 
     root = compare < 0 ? &((*root)->left) : &((*root)->right); 
    } 

    /* We found a null ptr to update with a ptr to a new node */ 

    /* Allocate memory */ 
    if (!(*root = malloc(sizeof(node)))) 
     return AW_ERROR; 

    /* Init the new node */ 
    (*root)->left = (*root)->right = 0; 

    /* Copy the given word, up to MAX_WORD_SIZE byte*/ 
    strncpy((*root)->word, word, MAX_WORD_SIZE); 

    /* Set a null terminator on the last byte in the case the word is exactly MAX_WORD_SIZE char*/ 
    (*root)->word[MAX_WORD_SIZE] = 0; 

    return AW_ADDED; 
} 

/* 
    Free the memory used by the tree 
    Set the pointers to NULL. 
    Use recursion for didactic purpose, an iterative solution would consume less resources as 
    this is NOT tail recursion. 
*/ 
void free_tree(node** root) 
{ 
    if (*root) 
    { 
     /* Go to children nodes */ 
     free_tree(&((*root)->left)); 
     free_tree(&((*root)->right)); 

     /* Free current node */ 
     free(*root); 
     *root = NULL; 
    } 

} 

int main() 
{ 
    /* Open the files */ 
    FILE* fin = fopen("in.txt", "r"); 
    FILE* fout = fopen("out.txt", "w"); 

    /* Check the descriptors */ 
    if (!fin) 
     return printf("Cannot open input file\n"), 1; 

    if (!fout) 
     return printf("Cannot open output file\n"), 2; 

    /* This is out tree */ 
    node* root = NULL; 
    /* This is the buffer for reading word from fin*/ 
    char new_word[MAX_WORD_SIZE+1]; 
    /* This is the buffer for creating fscanf format specifier*/ 
    char format[32]; 

    /* Create the fscanf format specifier */ 
    int char_used = snprintf(format, sizeof(format), "%%%ds", MAX_WORD_SIZE); 
    assert(char_used + 1 <= sizeof(format)); 

    /* Read the file until the end */ 
    while (!feof(fin)) 
    { 
     /* Read a word and add it to the tree, if it is added, write it to new file */ 
     if (fscanf(fin, format, new_word) && add_word(&root, new_word) == AW_ADDED) 
      fprintf(fout, "%s ", new_word);        
    } 

    /* Close and exit */ 
    fclose(fin); 
    fclose(fout); 

    free_tree(&root); 

    return 0; 

}

来源

2015-07-20 23:22:54

您可以尝试使用MD5或类似的东西散列字符串，然后将它们存储在二叉树。应具有相当低的平均时间复杂度。我不确定MD5的速度有多快;对于小字符串可能会有更好的散列算法。

你可以只存储在阵列中的所有字符串和使用的strcmp在每次你拿起一个新的字符串时所有的人，如果你不关心效率虽然。

来源

2015-07-20 23:13:45

复制文件中的单词

回答

战略

的战术

实施

相关问题