Performance problem with Perl regexes - #121480 (General Perl discussion)

topeg

2009-05-11 19:07

For a reasonably usable test basis:

Code:

perl -e '$such="--TEST--"; for(0..(200*1024*1024)){ print chr(33+rand(90)); print "\n" if(rand(40)<2); print $such if(rand(1000)<5)}' > test.random.txt

This should produce a file of random text, roughly 200 MB in size, that contains a number of occurrences of the search string.

Quote
>$ ls -lh test.random.txt
-rw-rw---- 1 topeg topeg 210M 11. Mai 18:26 test.random.txt
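For readers who find the one-liner hard to follow, here is the same idea spelled out as a small script. It is only a sketch: the helper name write_random_file, the output file name and the small default size are my own choices, not part of the original test.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): writes $count random
# printable characters to $path, sprinkling in newlines and the search string.
sub write_random_file {
    my ($path, $count, $needle) = @_;
    open(my $out, '>', $path) or die "cannot open $path: $!\n";
    for (1 .. $count) {
        print {$out} chr(33 + int(rand(90)));      # one character from '!' to 'z'
        print {$out} "\n"    if rand(40)   < 2;    # newline with probability 1/20
        print {$out} $needle if rand(1000) < 5;    # search string with probability 1/200
    }
    close($out) or die "cannot close $path: $!\n";
    return;
}

# Small size so a trial run finishes quickly;
# use 200*1024*1024 to get a file comparable to the one above.
my $count = 100_000;
write_random_file('test.random.small.txt', $count, '--TEST--');
```

The extra newlines and search strings are written on top of the random characters, which is why the file above ends up at 210M rather than exactly 200M.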

With this code:

Code (perl):

#!/usr/bin/perl

use strict;
use warnings;

my $file='test.random.txt';

# precompile the regexp
my $regexp=qr/--TEST--/o;

# sensible error message
open(TRACEFILE, '<',  $file ) or die "cannot open $file $!\n";

# declare before the loop;
# redeclaring inside the loop slows things down
my $found=0;
my $in_line;

while ($in_line = <TRACEFILE>)
{
  $found++ while($in_line =~ m/$regexp/gc);
  #$found++ if($in_line =~ m/$regexp/);
}
print "Anzahl Treffer: $found\n";

I get:

with "$found++ if($in_line =~ m/$regexp/)" (counts at most one hit per line):

Quote
>$ time ./regexpfind.pl
Anzahl Treffer: 922896

real 0m15.530s
user 0m14.361s
sys 0m0.312s

with "$found++ while($in_line =~ m/$regexp/gc);" (counts every occurrence):

Quote
>$ time ./regexpfind.pl
Anzahl Treffer: 1010759

real 0m19.092s
user 0m17.133s
sys 0m0.392s
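The two runs report different totals because "if" counts each matching line once, while "while" with /g counts every occurrence within a line. A minimal illustration, using three made-up sample lines:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $regexp = qr/--TEST--/;

# Made-up sample lines, for illustration only.
my @lines = (
    "abc--TEST--def\n",
    "--TEST--xyz--TEST--\n",
    "no hit here\n",
);

my ($matching_lines, $occurrences) = (0, 0);
for my $line (@lines) {
    $matching_lines++ if $line =~ $regexp;         # at most one count per line
    $occurrences++ while $line =~ /$regexp/g;      # one count per occurrence
}

print "lines with a hit: $matching_lines\n";   # 2
print "total hits: $occurrences\n";            # 3
```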

With this code, which reads the file in 10 MB chunks instead of line by line:

Code (perl):

#!/usr/bin/perl
use strict;
use warnings;

my $shared=100; # 100 characters of overlap
my $chuncksize=10*1024*1024;

my $file='test.random.txt';
my $regexp=qr/--TEST--/o;

open(TRACEFILE, '<',  $file ) or die "cannot open $file $!\n";

my $found=0;
my $chunk;
my $old="";

while (read(TRACEFILE, $chunk, $chuncksize))
{
  $chunk=$old.$chunk;
  $found++ while($chunk =~ m/$regexp/gsc);
  $old = substr($chunk,-$shared,$shared);
  $old =~ s/$regexp//gs;
}

print "anzahl treffer: $found\n";

I get:

Quote
>$ time ./regexpfind2.pl
anzahl treffer: 1010759

real 0m3.076s
user 0m2.260s
sys 0m0.708s
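The chunked script carries 100 characters over to the next chunk and strips hits that were already counted out of that overlap. For a fixed literal search string, a variation (not what the script above does) is to carry over at most length-1 characters, starting after the last counted hit, so the overlap can never contain a complete hit. The following is only a sketch under that assumption; the helper name count_in_chunks and the sample data are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: counts non-overlapping occurrences of the literal
# string $needle in the stream $fh, reading $chunk_size characters at a time.
sub count_in_chunks {
    my ($fh, $needle, $chunk_size) = @_;
    die "needle must not be empty\n" unless length $needle;
    my $regexp = qr/\Q$needle\E/;
    my $keep   = length($needle) - 1;   # too short to hold a complete hit
    my ($found, $tail) = (0, '');

    while (1) {
        my $got = read($fh, my $chunk, $chunk_size);
        die "read failed: $!\n" unless defined $got;
        last if $got == 0;                # end of input

        my $buf = $tail . $chunk;
        my $end = 0;                      # offset just past the last hit
        while ($buf =~ /$regexp/g) {
            $found++;
            $end = pos($buf);
        }
        # Carry over the last $keep characters, but never text that
        # has already been consumed by a counted hit.
        my $from = length($buf) - $keep;
        $from = $end if $from < $end;
        $tail = substr($buf, $from);
    }
    return $found;
}

# Usage with an in-memory filehandle and a deliberately tiny chunk size,
# so that hits straddle chunk boundaries.
my $data = "xx--TEST--yy--TEST----TEST--zz";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!\n";
print count_in_chunks($fh, '--TEST--', 5), "\n";   # 3
```

This only works when every hit has the same known length; for a general regexp the overlap has to be chosen by hand, as with $shared above.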

EDIT:
Oh, and a few details about my machine.
"lshw" reports:

Code:

  *-core
       description: Motherboard
       product: MS-6570
       vendor: MICRO-STAR INTERNATIONAL CO., LTD
       physical id: 0
       slot: External Cache
     *-cpu
          description: CPU
          product: AMD Athlon(tm) XP 2700+
          vendor: Advanced Micro Devices [AMD]
          physical id: 4
          bus info: cpu@0
          version: 6.10.0
          slot: Socket A
          size: 2GHz
          capacity: 2200MHz
          width: 32 bits
          clock: 166MHz
     *-memory
          description: System Memory
          physical id: 1b
          slot: System board or motherboard
          size: 1GiB
          capacity: 1536MiB

EDIT2:
This version, which uses "forks", should run faster on multiprocessor machines:

Code (perl):

#!/usr/bin/perl

use forks;
# or "use threads";
# I think forks gives better multiprocessor support here.

use strict;
use warnings;

my $shared=100; # 100 characters of overlap
my $chuncksize=10*1024*1024; # 10 MB

my $file='/home/topeg/test.random.txt';
my $regexp=qr/--TEST--/o;

# at most 4 processes; at 10 MB per process that makes 40 MB...
my $threads=4;

open(TRACEFILE, '<',  $file ) or die "cannot open $file $!\n";

my $found=0;
my $chunk;
my $old="";

my @running;
my $pos=0;
while (read(TRACEFILE, $chunk, $chuncksize))
{
  $chunk=$old.$chunk;

  # first create all the processes
  if(@running < $threads)
  { push(@running,get_thread($chunk)); }
  else
  {
    # wait for one process ...
    $found+=$running[$pos]->join();
    # create a new one ...
    $running[$pos]=get_thread($chunk);
    # move on to the next slot
    $pos++;
    # start again at the front of the list
    $pos=0 if($pos >= $threads);
  }
  $old = substr($chunk,-$shared,$shared);
  $old =~ s/$regexp//gs;
}

# wait for the remaining ones
# (for small files fewer than $threads may have been started)
foreach my $thread (@running)
{
  $found+=$thread->join();
}

print "anzahl treffer: $found\n";

exit(0);
###############################################
# create a thread/process
sub get_thread
{
  my $thread=threads->create(\&parse, shift);
  die "error create thread" unless(defined($thread));
  return $thread;
}

# do the actual work
sub parse
{
  my $found=0;
  my $chunk=shift;
  $found++ while($chunk =~ m/$regexp/gsc);
  $chunk="";
  return $found;
}

On my machine it is somewhat slower (no surprise with only one processor :-) ):

Quote
>$ time ./regexpfind3.pl
anzahl treffer: 1010759

real 0m5.603s
user 0m3.844s
sys 0m1.596s

Last edited: 2009-05-11 23:50:12 +0200 (CEST)