Search for files
midas9087 last edited by
I’m trying to do a pretty simple search: I want to find files that contain two words. The words are not contiguous. For example, if I input “iPad iPhone” in the search field, it should identify a file that contains the text “iPads are bigger than iPhones”. The default appears to be a search that matches the query string exactly. Is there a way to do this? Do I use regex?
technoway last edited by technoway
It would be complicated to use a regular expression to do this.
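For comparison, a regular expression can express "contains all of these words, in any order" with one lookahead assertion per word. This is only a rough sketch, assuming the whole file has already been read into a single string; the helper name `contains_all_words` is made up for illustration.

```python
import re

def contains_all_words(text, words):
    """Return True if every word occurs somewhere in text (substring match)."""
    # One lookahead per word: (?=.*word) succeeds if the word appears anywhere.
    pattern = ''.join('(?=.*' + re.escape(word) + ')' for word in words)
    # re.DOTALL lets '.' cross newlines so the whole text is scanned.
    return re.compile(pattern, re.DOTALL).match(text) is not None
```

The pattern grows with the number of words, which is part of why the plain string checks discussed in the rest of this thread are usually easier to read.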
If the files are not too large, you can read the file into memory, and check the entire file at once. It's a better general solution to read the lines sequentially though.
If you read the lines sequentially, you might have to read two lines at a time and merge them the first time, and then read one line at a time after that, merging each new line with the last line, so you can handle cases like this:
This is line one of the file that ends in iPho-
ne and this is line two that contains the word iPad.
You would always keep the last line around, remove any hyphen at the end, and append the next line, while keeping the original line to do this again after you read the next line. Then you would check this merged line. You'd only need to merge a line if the last line ended with a hyphen character.
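The merge step described above might look roughly like the generator below. It is an untested-in-production sketch: the name `merged_lines` is hypothetical, and it assumes a trailing hyphen always means a wrapped word.

```python
def merged_lines(lines):
    """Yield each line, joined to the previous line when that line ended with a hyphen."""
    previous = None
    for line in lines:
        line = line.rstrip('\r\n')
        if previous is not None and previous.endswith('-'):
            # Drop the hyphen from the original previous line and append this line.
            yield previous[:-1] + line
        else:
            yield line
        # Keep the original (unmerged) line for the next iteration.
        previous = line
```

Each yielded string can then be checked for the search words in place of the raw line.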
Below is code that reads the lines sequentially, but without the line-merge feature, so this code will not handle words that wrap from one line to the next. I used a generator to obtain each line to make it easier to modify this code to add that feature. I just typed this code into an editor, I have not run this code at all, so I don't even know if it's syntactically correct, but it should be very close to what you need.
If you don't need to handle words wrapping between lines, you can remove the generator, and just use a regular for-loop that iterates over the file handle. In that case, check if `found_word_count == len_word_list` inside of the loop, and `break` when that expression is true.
```python
def make_iterable_item_generator(iterable_item):
    """ Make a generator for any iterable item. """
    for item in iterable_item:
        yield item

def file_contains_words(infh, word_list):
    """ Return True iff file contains all words in the passed
        word list. There must be no duplicate words in word_list.
        infh is an open file handle to the file to test.
    """
    found_word_count = 0
    # Create a list of boolean flags, one flag for each word in word_list.
    len_word_list = len(word_list)
    word_not_found_list = [True for i in range(0, len_word_list)]
    # Create a generator to read each line of the file.
    ingen = make_iterable_item_generator(infh)
    try:
        # Stop checking when all words are found.
        while found_word_count < len_word_list:
            # next(ingen) works on both Python 2.6+ and Python 3.
            line = next(ingen)
            for i in range(0, len_word_list):
                if word_not_found_list[i]:
                    if word_list[i] in line:
                        word_not_found_list[i] = False
                        found_word_count += 1
    except StopIteration:
        pass
    return found_word_count == len_word_list
```
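If wrapped words don't matter, the plain for-loop variant mentioned above could be sketched like this. The function name `file_contains_words_simple` is just a placeholder, and like the code above it does substring matching and assumes no duplicate words.

```python
def file_contains_words_simple(infh, word_list):
    """Return True iff the open file handle infh contains every word in word_list."""
    len_word_list = len(word_list)
    word_not_found_list = [True] * len_word_list
    found_word_count = 0
    for line in infh:
        for i in range(len_word_list):
            if word_not_found_list[i] and word_list[i] in line:
                word_not_found_list[i] = False
                found_word_count += 1
        # Stop reading as soon as every word has been seen.
        if found_word_count == len_word_list:
            break
    return found_word_count == len_word_list
```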
ccc last edited by ccc
When you want to both count and get the element, then enumerate() is your friend.
```python
# Instead of
for i in range(0, len_word_list):
    if word_not_found_list[i]:
        if word_list[i] in line:
            word_not_found_list[i] = False
            found_word_count += 1

# you could write:
for i, word in enumerate(word_list):
    if word_not_found_list[i] and word in line:
        word_not_found_list[i] = False
        found_word_count += 1
```
I also believe that your return value could be: `return not any(word_not_found_list)` (each flag stays True until its word is found), but being sure of that would require a bit more testing.
Using set may simplify the code.
```python
def search(filename, wordlist):
    return set(wordlist.split()).issubset(open(filename).read().split())
```
ccc last edited by
Great one!! but leaves a file handle open on some Python implementations.
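One way to close the handle deterministically is a `with` block. A minimal sketch of the same set-based search, otherwise unchanged:

```python
def search(filename, wordlist):
    """Return True if every whitespace-separated word in wordlist is a token in the file."""
    # The with block closes the file as soon as the check is done.
    with open(filename) as fp:
        return set(wordlist.split()).issubset(fp.read().split())
```

Note that this compares whole whitespace-separated tokens, so "iPad" would not match "iPads", unlike the substring test in the earlier code.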
ok. I just added code to do proper searching. I need to test this well and need to make it a proper editorial workflow.
```python
import os
import fnmatch

def search_files(wordlist, directory, include_pattern=None, exclude_pattern=None):
    # Collect the paths of all files under directory that contain every word.
    fnlist = []
    for dirpath, dirs, files in os.walk(directory):
        for filename in files:
            fname = os.path.join(dirpath, filename)
            to_include = True
            if exclude_pattern:
                if fnmatch.fnmatch(fname, exclude_pattern):
                    to_include = False
            if include_pattern and to_include:
                if not fnmatch.fnmatch(fname, include_pattern):
                    to_include = False
            if to_include:
                if are_all_words_in_file(fname, wordlist):
                    fnlist.append(fname)
    return fnlist

def get_words(filename):
    # Yield the whitespace-separated words of the file, one line at a time.
    with open(filename) as fp:
        for line in fp:
            for word in line.split():
                yield word

def are_all_words_in_file(filename, wordlist):
    return set(wordlist.split()).issubset(get_words(filename))

print(search_files('os class', '.', include_pattern="*.py"))
```
Here is the Editorial workflow. I hope it helps.
ccc last edited by
```python
if ignore_case == '':
    ignore_case = True
else:
    if ignore_case == 'ON':
        ignore_case = True
    else:
        ignore_case = False

# can be rewritten as:
ignore_case = ignore_case in ('', 'ON')
```
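A quick way to convince yourself the two forms agree is to run both over a few inputs. The function names and the sample values other than '' and 'ON' are assumptions made up for this check:

```python
def ignore_case_nested(value):
    """The original nested-if logic, wrapped in a function for comparison."""
    if value == '':
        return True
    else:
        if value == 'ON':
            return True
        else:
            return False

def ignore_case_compact(value):
    """The one-line membership test."""
    return value in ('', 'ON')

# Both versions should give the same answer for every sample input.
samples = ['', 'ON', 'OFF', 'on', 'yes']
results_match = all(ignore_case_nested(v) == ignore_case_compact(v) for v in samples)
```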
Thanks @ccc. Replacing if statements (particularly nested if statements) with a single simple statement is always better.