ruby - Fastest way to find if an array contains any member of another array? -
i'm trying detect whether file, read in string, either:
- text (of type of single-byte encoding).
 - a multi-byte encoding, or binary, etc.
 
i have "blacklist" array of characters/bytes normally never occur in "text":
bad_bytes = [0, 1, 2, 3, 4, 5, 6, 11, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 127]   and my_bytes = file.binread('some_file').bytes.
i can think of:
(my_bytes & bad_bytes).empty?, andmy_bytes == (my_bytes - bad_bytes)
both produce correct result, , intuition latter might little faster. or, maybe they're equivalent? both seem rather inefficient me, @ purpose. don't need find complete intersection, or remove each instance of second array first - finding one common element sufficient.
am missing method exists this? there faster technique? if not, of above faster? or approaching wrong?
also, bonus points: there math/computer science/fancy term i'm trying here?
gentlemen! start engines!
below benchmark comparison of 3 answers given date. main reason doing evaluate relative efficiency of @stefan's solution, uses regex. had impression regex's relatively inefficient, can see results below, not case here.
@uri's , solution show how improvement made converting array of bad characters set, , reading file byte byte. apologies, @uri, if didn't read file array way have.
i see more benchmarking of answers done. it's not difficult or time-consuming, , can provide useful insights. find of time preparing test cases. notice i've put methods tested in module, if method benchmarked, need add method module-i don't have touch of other code.
methods compared
module methods   require 'set'    bad_bytes_pattern = /[\x00-\x06\x0b\x0e-\x1a\x1c-\x1f\x7f]/n   bad_bytes = [*0..6, 11, *14..26, *28..31, 127]   bad_chars = bad_bytes.map(&:chr)   bad_bytes_set = set[*bad_bytes]   bad_chars_set = set[*bad_chars]    def stefan(fname)     file.read(fname)[bad_bytes_pattern]   end    def uri_with_array(fname)     !file.read(fname).each_char.map(&:ord).none? { |b|       bad_bytes.include? b }   end    def uri_with_set(fname)     !file.read(fname).each_char.map(&:ord).none? { |b|       bad_bytes_set.include? b }   end    def cary(fname)     f = file.new fname     f.each_char.any? { |c| bad_chars_set.include?(c) }   end end   include module
include methods @methods = methods.instance_methods(false)   #=> [:stefan, :uri_with_array, :uri_with_set, :cary]   create test files
def make_test_files(prefix, nbr_files, file_size, prob_bad_byte)   nbr_bad_bytes = bad_bytes.size   nbr_files.times.with_object([]) |i, fnames|     str = 'x'*file_size     str[rand(file_size)] = bad_chars[rand(nbr_bad_bytes)] if       rand < prob_bad_byte     fname = "#{prefix}.#{i}"     file.write(fname, str)     fnames << fname   end end  n = 50 m = 100_000 prob_bad_byte = 0.5  @test_files = make_test_files('test', n, m, prob_bad_byte)   create helper method
invoke method m process test files , return true/false array, true if bad byte found in given file:
def compute(m)   @test_files.each_with_object([]) { |fname,arr|     arr << (send(m, fname) ? true : false) } end   write test header
puts "#{n} files of size #{m}.\n" +   "each file contains 0 or 1 bad characters, probability of " +   "latter being #{prob_bad_byte}. if bad character present, @ " +   "a random location in file.\n\n"   confirm methods being tested return same values test data
unless @methods.map { |m| compute(m) }.uniq.size == 1   print "not methods agree"   exit end   write benchmark
require 'benchmark'  @indent = methods.map { |m| m.to_s.size }.max  benchmark.bm(@indent) |bm|   @methods.each |m|     bm.report m.to_s       compute(m)     end   end end   clean after
@test_files.each { |fname| file.delete fname }   results hand-coded test parameters
50 files of size 10000. each file contains 0 or 1 bad characters, probability of latter being 0.5. if bad character present, @ random location in file.
                                 user     system      total        real stefan                       0.000000   0.000000   0.000000 (  0.003874) uri_with_array               0.560000   0.000000   0.560000 (  0.565312) uri_with_set                 0.170000   0.010000   0.180000 (  0.173694) cary                         0.100000   0.000000   0.100000 (  0.100730)   50 files of size 100000. each file contains 0 or 1 bad characters, probability of latter being 0.5. if bad character present, @ random location in file.
                                 user     system      total        real stefan                       0.030000   0.000000   0.030000 (  0.027062) uri_with_array               5.340000   0.040000   5.380000 (  5.387314) uri_with_set                 1.640000   0.040000   1.680000 (  1.683844) cary                         0.930000   0.010000   0.940000 (  0.929722)   50 files of size 100000. each file contains 0 or 1 bad characters, probability of latter being 1.0. if bad character present, @ random location in file.
                                 user     system      total        real stefan                       0.020000   0.010000   0.030000 (  0.022462) uri_with_array               4.410000   0.030000   4.440000 (  4.447397) uri_with_set                 1.520000   0.040000   1.560000 (  1.560788) cary                         0.740000   0.010000   0.750000 (  0.747580)      
Comments
Post a Comment