ruby - Fastest way to find if an array contains any member of another array? -


i'm trying detect whether file, read in string, either:

  1. text (of type of single-byte encoding).
  2. a multi-byte encoding, or binary, etc.

i have "blacklist" array of characters/bytes normally never occur in "text":

bad_bytes = [0, 1, 2, 3, 4, 5, 6, 11, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 127] 

and my_bytes = file.binread('some_file').bytes.

i can think of:

  • (my_bytes & bad_bytes).empty?, and
  • my_bytes == (my_bytes - bad_bytes)

both produce correct result, , intuition latter might little faster. or, maybe they're equivalent? both seem rather inefficient me, @ purpose. don't need find complete intersection, or remove each instance of second array first - finding one common element sufficient.

am missing method exists this? there faster technique? if not, of above faster? or approaching wrong?

also, bonus points: there math/computer science/fancy term i'm trying here?

gentlemen! start engines!

below benchmark comparison of 3 answers given date. main reason doing evaluate relative efficiency of @stefan's solution, uses regex. had impression regex's relatively inefficient, can see results below, not case here.

@uri's , solution show how improvement made converting array of bad characters set, , reading file byte byte. apologies, @uri, if didn't read file array way have.

i see more benchmarking of answers done. it's not difficult or time-consuming, , can provide useful insights. find of time preparing test cases. notice i've put methods tested in module, if method benchmarked, need add method module-i don't have touch of other code.

methods compared

module methods   require 'set'    bad_bytes_pattern = /[\x00-\x06\x0b\x0e-\x1a\x1c-\x1f\x7f]/n   bad_bytes = [*0..6, 11, *14..26, *28..31, 127]   bad_chars = bad_bytes.map(&:chr)   bad_bytes_set = set[*bad_bytes]   bad_chars_set = set[*bad_chars]    def stefan(fname)     file.read(fname)[bad_bytes_pattern]   end    def uri_with_array(fname)     !file.read(fname).each_char.map(&:ord).none? { |b|       bad_bytes.include? b }   end    def uri_with_set(fname)     !file.read(fname).each_char.map(&:ord).none? { |b|       bad_bytes_set.include? b }   end    def cary(fname)     f = file.new fname     f.each_char.any? { |c| bad_chars_set.include?(c) }   end end 

include module

include methods @methods = methods.instance_methods(false)   #=> [:stefan, :uri_with_array, :uri_with_set, :cary] 

create test files

def make_test_files(prefix, nbr_files, file_size, prob_bad_byte)   nbr_bad_bytes = bad_bytes.size   nbr_files.times.with_object([]) |i, fnames|     str = 'x'*file_size     str[rand(file_size)] = bad_chars[rand(nbr_bad_bytes)] if       rand < prob_bad_byte     fname = "#{prefix}.#{i}"     file.write(fname, str)     fnames << fname   end end  n = 50 m = 100_000 prob_bad_byte = 0.5  @test_files = make_test_files('test', n, m, prob_bad_byte) 

create helper method

invoke method m process test files , return true/false array, true if bad byte found in given file:

def compute(m)   @test_files.each_with_object([]) { |fname,arr|     arr << (send(m, fname) ? true : false) } end 

write test header

puts "#{n} files of size #{m}.\n" +   "each file contains 0 or 1 bad characters, probability of " +   "latter being #{prob_bad_byte}. if bad character present, @ " +   "a random location in file.\n\n" 

confirm methods being tested return same values test data

unless @methods.map { |m| compute(m) }.uniq.size == 1   print "not methods agree"   exit end 

write benchmark

require 'benchmark'  @indent = methods.map { |m| m.to_s.size }.max  benchmark.bm(@indent) |bm|   @methods.each |m|     bm.report m.to_s       compute(m)     end   end end 

clean after

@test_files.each { |fname| file.delete fname } 

results hand-coded test parameters

50 files of size 10000. each file contains 0 or 1 bad characters, probability of latter being 0.5. if bad character present, @ random location in file.

                                 user     system      total        real stefan                       0.000000   0.000000   0.000000 (  0.003874) uri_with_array               0.560000   0.000000   0.560000 (  0.565312) uri_with_set                 0.170000   0.010000   0.180000 (  0.173694) cary                         0.100000   0.000000   0.100000 (  0.100730) 

50 files of size 100000. each file contains 0 or 1 bad characters, probability of latter being 0.5. if bad character present, @ random location in file.

                                 user     system      total        real stefan                       0.030000   0.000000   0.030000 (  0.027062) uri_with_array               5.340000   0.040000   5.380000 (  5.387314) uri_with_set                 1.640000   0.040000   1.680000 (  1.683844) cary                         0.930000   0.010000   0.940000 (  0.929722) 

50 files of size 100000. each file contains 0 or 1 bad characters, probability of latter being 1.0. if bad character present, @ random location in file.

                                 user     system      total        real stefan                       0.020000   0.010000   0.030000 (  0.022462) uri_with_array               4.410000   0.030000   4.440000 (  4.447397) uri_with_set                 1.520000   0.040000   1.560000 (  1.560788) cary                         0.740000   0.010000   0.750000 (  0.747580) 

Comments

Popular posts from this blog

c++ - OpenMP unpredictable overhead -

ruby on rails - RuntimeError: Circular dependency detected while autoloading constant - ActiveAdmin.register Role -

javascript - Wordpress slider, not displayed 100% width -