
Find duplicate files in a file system

Difficulty: algorithms · Asked at Dropbox

Question Explain

In this question, the interviewer is asking you to propose an algorithm that identifies duplicate files within a file system. The first thing to pin down is what constitutes a 'duplicate' file: it can mean files with the same name, the same size, or the same content. Be prepared for the interviewer to specify any of these scenarios.

Here are some guidance points:

  1. Be clear about which criterion you are using to identify duplicate files: the file name, the file size, or the file content.
  2. Decide which data structure is best suited to store and look up the file data. Hashmaps or sets may be useful here.
  3. Remember to consider the scalability of your approach; the file system could contain a huge number of files (a size-based pre-filter is sketched below).
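On the scalability point, a common optimization is to group files by size first and only compare files that share a size, since files of different sizes cannot have identical content. A minimal Python sketch of that pre-filter, assuming a local root directory (the function name `group_files_by_size` is mine, not from the question):

```python
import os
from collections import defaultdict

def group_files_by_size(root):
    """Group file paths by size; only same-size files can be content duplicates."""
    groups = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                groups[os.path.getsize(path)].append(path)
            except OSError:
                continue  # skip unreadable files (broken symlinks, permissions)
    # Only groups with more than one file need any further comparison.
    return {size: paths for size, paths in groups.items() if len(paths) > 1}
```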

Answer Example 1

Assuming we're looking for files with the same content, we need to compare files based on their content, not their names. We can use a hashmap where the key is a file's content hash and the value is a list of paths of all files with that hash.

Here is how I would do it (a Python sketch follows the list):

  1. Walk the directory tree, for instance with a depth-first search (DFS).
  2. For each file, compute a hash of its content (for instance with MD5 or SHA-256). Note that hashes are not strictly unique; if correctness matters, candidate duplicates can be confirmed with a byte-by-byte comparison.
  3. If the hash is not in the hashmap, add it with the file's path.
  4. If the hash is already in the hashmap, we've found a duplicate; append the path to that hash's list.
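A sketch of this approach in Python, reading files in chunks so large files never have to fit in memory. The names `file_hash` and `find_duplicates_by_content` are illustrative, and `os.walk` stands in for the DFS traversal:

```python
import hashlib
import os
from collections import defaultdict

def file_hash(path, chunk_size=1 << 20):
    """Compute a SHA-256 digest of a file's content, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates_by_content(root):
    """Return groups of paths whose files have identical content hashes."""
    by_hash = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                by_hash[file_hash(path)].append(path)
            except OSError:
                continue  # skip unreadable files
    # Any hash that maps to more than one path marks a set of duplicates.
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Calling `find_duplicates_by_content("/some/dir")` would return each group of duplicates as a list of paths, which is usually more useful than reporting only the second occurrence.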

Answer Example 2

If the interviewer specifies another scenario where files with the same name are considered duplicates, a simple modification of the previous approach works (a sketch follows the list):

  1. Use a depth-first search (DFS) or breadth-first search (BFS) to traverse the directory tree.
  2. Store each file's name in a hashset.
  3. If the filename is already in the hashset, this is a duplicate file.
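A sketch of this name-based variant, again assuming a local root directory (the function name `find_duplicates_by_name` is mine):

```python
import os

def find_duplicates_by_name(root):
    """Return paths whose base filename has already been seen elsewhere."""
    seen = set()
    duplicates = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name in seen:
                duplicates.append(os.path.join(dirpath, name))
            else:
                seen.add(name)
    return duplicates
```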
