omz:forum

    Search for files

    Editorial
    • midas9087
      midas9087 last edited by

      I’m trying to do a pretty simple search: I want to find files that contain two words. The words are not contiguous. For example, if I input “iPad iPhone” in the search field, it should identify a file that contains the text “iPads are bigger than iPhones”. The default appears to be a search that matches the query string exactly. Is there a way to do this? Do I use regex?

      • technoway
        technoway last edited by technoway

        It would be complicated to use a regular expression to do this.
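For comparison, a regular expression can express this test with one lookahead per word. This is only a sketch: the helper names `build_all_words_pattern` and `text_contains_words` are made up for illustration, and it assumes the whole text fits in memory. The thread does not confirm whether Editorial's own search field accepts such a pattern.

```python
import re

def build_all_words_pattern(words):
    """Build a pattern that matches only if every word occurs somewhere in the text."""
    # One lookahead per word; re.escape keeps special characters literal.
    lookaheads = "".join("(?=.*{})".format(re.escape(word)) for word in words)
    # DOTALL lets '.' cross newlines so the words may sit on different lines.
    return re.compile(lookaheads, re.DOTALL)

def text_contains_words(text, words):
    """Return True if every word appears as a substring of text."""
    return build_all_words_pattern(words).match(text) is not None
```

Because this matches substrings, "iPad" is found inside "iPads", which is what the original question asks for.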

        If the files are not too large, you can read the file into memory, and check the entire file at once. It's a better general solution to read the lines sequentially though.

        If you read the lines sequentially, you might have to read two lines and merge them the first time, and after that read one line at a time, merging each new line with the previous one, so you can handle cases like this:

        This is line one of the file that ends in iPho-
        ne and this is line two that contains the word iPad.
        

        You would always keep the last line around, remove any hyphen at the end, and append the next line, while keeping the original line to do this again after you read the next line. Then you would check this merged line. You'd only need to merge a line if the last line ended with a hyphen character.
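The line-merge idea described above could be sketched roughly as follows. The names `merged_lines` and `contains_all_words` are hypothetical, and the sketch assumes that a trailing hyphen always marks a wrapped word, so a genuine hyphen at the end of a line (as in "well-" / "known") would be joined incorrectly.

```python
def merged_lines(lines):
    """Yield lines, joining any line that ends in '-' with the line after it."""
    pending = ""
    for raw in lines:
        line = pending + raw.rstrip("\n")
        if line.endswith("-"):
            # Drop the hyphen and wait for the continuation line.
            pending = line[:-1]
        else:
            pending = ""
            yield line
    if pending:
        # The input ended on a hyphenated line; emit what we have.
        yield pending

def contains_all_words(lines, words):
    """Return True if every word occurs as a substring of some merged line."""
    remaining = set(words)
    for line in merged_lines(lines):
        remaining = {word for word in remaining if word not in line}
        if not remaining:
            return True
    return not remaining
```

Passing an open file handle as `lines` works because iterating over a file yields one line at a time.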

        Below is code that reads the lines sequentially, but without the line-merge feature, so it will not handle words that wrap from one line to the next. I used a generator to obtain each line to make it easier to add that feature later. I just typed this code into an editor and have not run it, so I don't even know if it's syntactically correct, but it should be very close to what you need.

        If you don't need to handle words wrapping between lines, you can remove the generator, and just use a regular for-loop that iterates over the file handle. In that case, check if found_word_count == len_word_list inside of the loop, and break when that expression is true.

        def make_iterable_item_generator(iterable_item):
            """ Make a generator for any iterable item. """
            for item in iterable_item:
                yield item
        
        def file_contains_words(infh, word_list):
            """ Return True iff file contains all words in the passed word list.
                There must be no duplicate words in word_list.
                infh is an open file handle to the file to test.
            """
            found_word_count = 0
            # Create a list of boolean flags, one flag for each word in word_list.
            len_word_list = len(word_list)
            word_not_found_list = [True] * len_word_list
            # Create a generator to read each line of the file.
            ingen = make_iterable_item_generator(infh)
            try:
                # Stop checking when all words are found.
                while found_word_count < len_word_list:
                    line = next(ingen)  # built-in next() works on Python 2.6+ and 3
                    for i in range(0, len_word_list):
                        if word_not_found_list[i]:
                            if word_list[i] in line:
                                word_not_found_list[i] = False
                                found_word_count += 1
            except StopIteration:
                pass
            return found_word_count == len_word_list
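The simpler variant mentioned above, a plain for-loop over the file handle with an early break, might look something like this. The name `file_contains_words_simple` is made up here, and a set replaces the list of flags, so duplicate words in `word_list` are harmless.

```python
def file_contains_words_simple(infh, word_list):
    """Return True if every word in word_list occurs as a substring of some line.

    infh is any iterable of lines, such as an open file handle.
    Words that wrap from one line to the next are not handled.
    """
    remaining = set(word_list)
    for line in infh:
        # Discard every word that this line contains.
        remaining -= {word for word in remaining if word in line}
        if not remaining:
            break  # all words found, no need to read further
    return not remaining
```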
        
        • ccc
          ccc last edited by ccc

          When you want both the index and the element, then enumerate() is your friend.

          # Instead of
                      for i in range(0, len_word_list):
                          if word_not_found_list[i]:
                              if word_list[i] in line:
                                  word_not_found_list[i] = False
                                  found_word_count += 1
          
          # you could write:
                      for i, word in enumerate(word_list):
                          if word_not_found_list[i] and word in line:
                              word_not_found_list[i] = False
                              found_word_count += 1
          

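          I also believe that your return value could be: return not any(word_not_found_list), because each flag stays True until its word is found (so all() on this list would give the wrong answer), but being sure of that would require a bit more testing.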

          • https://docs.python.org/3/library/functions.html#all
          • enceladus
            enceladus last edited by

            Using set may simplify the code.

            def search(filename, wordlist):
                return set(wordlist.split()).issubset(open(filename).read().split())
            
            
            • ccc
              ccc last edited by

              Great one!! but leaves a file handle open on some Python implementations.
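One way to avoid the open file handle is a with block, sketched below. The helper `words_in_text` is a hypothetical name introduced only to keep the matching logic separate from the file handling. Note that splitting on whitespace compares whole tokens, so "iPad" does not match "iPads" under this approach.

```python
def words_in_text(text, wordlist):
    """Return True if every whitespace-separated word in wordlist is a token of text."""
    return set(wordlist.split()).issubset(text.split())

def search(filename, wordlist):
    """Same test as the one-liner above, but the file is closed on exit."""
    with open(filename) as fp:
        return words_in_text(fp.read(), wordlist)
```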

              • enceladus
                enceladus last edited by

                OK. I just added code to do proper searching. I still need to test it well and turn it into a proper Editorial workflow.

                import os
                import fnmatch
                
                def search_files(wordlist, directory, 
                      include_pattern=None, exclude_pattern=None):
                    fnlist = []
                    for dirpath, dirs, files in os.walk(directory):
                        for filename in files:
                            fname = os.path.join(dirpath, filename)
                            to_include = True
                            if exclude_pattern:
                                if fnmatch.fnmatch(fname, exclude_pattern):
                                    to_include = False
                            if include_pattern and to_include:
                                if not fnmatch.fnmatch(fname, include_pattern):
                                    to_include = False
                            if to_include:
                                if are_all_words_in_file(fname, wordlist):
                                    fnlist.append(fname)
                    return fnlist
                
                def get_words(filename):
                    with open(filename) as fp:
                        for line in fp:
                            for word in line.split():
                                yield word
                
                def are_all_words_in_file(filename, wordlist):
                    return set(wordlist.split()).issubset(get_words(filename))
                
                print (search_files('os class', '.', include_pattern="*.py"))
                
                
                • enceladus
                  enceladus last edited by

                  Here is the Editorial workflow. I hope it helps.
                  http://www.editorial-workflows.com/workflow/5872578073722880/kmRKF8RqYvQ

                  • ccc
                    ccc last edited by

                    	if ignore_case == '':
                    		ignore_case = True
                    	else:
                    		if ignore_case == 'ON':
                    			ignore_case = True
                    		else:
                    			ignore_case = False
                    
                    # can be rewritten as:
                    
                    	ignore_case = ignore_case in ('', 'ON')
                    
                    • enceladus
                      enceladus last edited by

                      Thanks @ccc. Replacing if statements (particularly nested ones) with a single expression is always better.
