ruby - Fastest way to find if an array contains any member of another array? -
i'm trying detect whether file, read in string, either:
- text (of type of single-byte encoding).
- a multi-byte encoding, or binary, etc.
i have "blacklist" array of characters/bytes normally never occur in "text":
bad_bytes = [0, 1, 2, 3, 4, 5, 6, 11, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 127]
and my_bytes = file.binread('some_file').bytes
.
i can think of:
(my_bytes & bad_bytes).empty?
, andmy_bytes == (my_bytes - bad_bytes)
both produce correct result, , intuition latter might little faster. or, maybe they're equivalent? both seem rather inefficient me, @ purpose. don't need find complete intersection, or remove each instance of second array first - finding one common element sufficient.
am missing method exists this? there faster technique? if not, of above faster? or approaching wrong?
also, bonus points: there math/computer science/fancy term i'm trying here?
gentlemen! start engines!
below benchmark comparison of 3 answers given date. main reason doing evaluate relative efficiency of @stefan's solution, uses regex. had impression regex's relatively inefficient, can see results below, not case here.
@uri's , solution show how improvement made converting array of bad characters set, , reading file byte byte. apologies, @uri, if didn't read file array way have.
i see more benchmarking of answers done. it's not difficult or time-consuming, , can provide useful insights. find of time preparing test cases. notice i've put methods tested in module, if method benchmarked, need add method module-i don't have touch of other code.
methods compared
module methods require 'set' bad_bytes_pattern = /[\x00-\x06\x0b\x0e-\x1a\x1c-\x1f\x7f]/n bad_bytes = [*0..6, 11, *14..26, *28..31, 127] bad_chars = bad_bytes.map(&:chr) bad_bytes_set = set[*bad_bytes] bad_chars_set = set[*bad_chars] def stefan(fname) file.read(fname)[bad_bytes_pattern] end def uri_with_array(fname) !file.read(fname).each_char.map(&:ord).none? { |b| bad_bytes.include? b } end def uri_with_set(fname) !file.read(fname).each_char.map(&:ord).none? { |b| bad_bytes_set.include? b } end def cary(fname) f = file.new fname f.each_char.any? { |c| bad_chars_set.include?(c) } end end
include module
include methods @methods = methods.instance_methods(false) #=> [:stefan, :uri_with_array, :uri_with_set, :cary]
create test files
def make_test_files(prefix, nbr_files, file_size, prob_bad_byte) nbr_bad_bytes = bad_bytes.size nbr_files.times.with_object([]) |i, fnames| str = 'x'*file_size str[rand(file_size)] = bad_chars[rand(nbr_bad_bytes)] if rand < prob_bad_byte fname = "#{prefix}.#{i}" file.write(fname, str) fnames << fname end end n = 50 m = 100_000 prob_bad_byte = 0.5 @test_files = make_test_files('test', n, m, prob_bad_byte)
create helper method
invoke method m
process test files , return true/false array, true if bad byte found in given file:
def compute(m) @test_files.each_with_object([]) { |fname,arr| arr << (send(m, fname) ? true : false) } end
write test header
puts "#{n} files of size #{m}.\n" + "each file contains 0 or 1 bad characters, probability of " + "latter being #{prob_bad_byte}. if bad character present, @ " + "a random location in file.\n\n"
confirm methods being tested return same values test data
unless @methods.map { |m| compute(m) }.uniq.size == 1 print "not methods agree" exit end
write benchmark
require 'benchmark' @indent = methods.map { |m| m.to_s.size }.max benchmark.bm(@indent) |bm| @methods.each |m| bm.report m.to_s compute(m) end end end
clean after
@test_files.each { |fname| file.delete fname }
results hand-coded test parameters
50 files of size 10000. each file contains 0 or 1 bad characters, probability of latter being 0.5. if bad character present, @ random location in file.
user system total real stefan 0.000000 0.000000 0.000000 ( 0.003874) uri_with_array 0.560000 0.000000 0.560000 ( 0.565312) uri_with_set 0.170000 0.010000 0.180000 ( 0.173694) cary 0.100000 0.000000 0.100000 ( 0.100730)
50 files of size 100000. each file contains 0 or 1 bad characters, probability of latter being 0.5. if bad character present, @ random location in file.
user system total real stefan 0.030000 0.000000 0.030000 ( 0.027062) uri_with_array 5.340000 0.040000 5.380000 ( 5.387314) uri_with_set 1.640000 0.040000 1.680000 ( 1.683844) cary 0.930000 0.010000 0.940000 ( 0.929722)
50 files of size 100000. each file contains 0 or 1 bad characters, probability of latter being 1.0. if bad character present, @ random location in file.
user system total real stefan 0.020000 0.010000 0.030000 ( 0.022462) uri_with_array 4.410000 0.030000 4.440000 ( 4.447397) uri_with_set 1.520000 0.040000 1.560000 ( 1.560788) cary 0.740000 0.010000 0.750000 ( 0.747580)
Comments
Post a Comment