Extracting links (Web frameworks, other questions about web programming with Perl)


Extracting links


Mary

2006-09-22 21:45

User since
2006-06-25
17 posts
User

Hello,

I would like my Perl program to extract all URLs from an HTML file. Here is part of my script:

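Code:

# Fetch the page; $url is set in a part of the script that is not shown.
$ua = LWP::UserAgent->new();
$request = HTTP::Request->new('GET', $url);
$response = $ua->request($request);
if ($response->is_success) {
    # At the moment the whole response body goes to OUT unfiltered.
    print OUT $response->content();
}
else {
    print STDERR $response->status_line, "\n";
}
}   # presumably closes an enclosing block outside this excerpt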

I would like the program to "filter" the content before it is written to the file, so that only links of the form

Code:

<a class="res" href=....>

are extracted and written to the file.

Does anyone have an idea? :)

Best regards
Mary

renee

2006-09-22 23:17

User since
2003-08-04
14371 posts
Moderator

Yes:

Article

OTRS extensions (http://feature-addons.de/)
Frankfurt Perlmongers (http://frankfurt.pm/)
--

OTRS Workshop 2012 materials: http://otrs.perl-services.de/workshop.html
Perl development: http://perl-services.de/

Mary

2006-09-23 14:15

User since
2006-06-25
17 posts
User

Thanks! The example from the article is already helpful:

Code:

#! /usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

my @links;
my $file = "my_file.html";

my $p = HTML::Parser->new();
# Call start_handler for every start tag, passing tag name, attribute hash and parser.
$p->handler(start => \&start_handler,"tagname,attr,self");
$p->parse_file($file);

foreach my $link(@links){
  print "Linktext: ",$link->[1],"\tURL: ",$link->[0],"\n";
}

sub start_handler{
  return if(shift ne 'a');      # only <a> tags are of interest
  my ($url) = shift->{href};    # the link target
  my $self = shift;
  my $text;
  # Remember the text seen inside the link ...
  $self->handler(text => sub{$text = shift;},"dtext");
  # ... and store URL and text once the closing </a> is reached.
  $self->handler(end => sub{push(@links,[$url,$text]) if(shift eq 'a')},"tagname");
}

And how can I adapt this script so that it extracts links like

Code:

<a class="res" href="http://...">

? Unfortunately, I can't work it out myself.

Regards
Mary

Mary

2006-09-24 12:33

User since
2006-06-25
17 posts
User

To clarify: I need ONLY the links that have the attribute

Code:

class="res"

Regards
Mary

renee

2006-09-24 13:17

User since
2003-08-04
14371 posts
Moderator

Code:

#! /usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

my @links;
my $file = "my_test.html";

my $p = HTML::Parser->new();
$p->handler(start => \&start_handler,"tagname,attr,self");
$p->parse_file($file);

foreach my $link(@links){
  print "Linktext: ",$link->[1],"\tURL: ",$link->[0],"\n";
}

sub start_handler{
    my ($tagname,$attr,$self) = @_;
    return unless($tagname eq 'a' and defined $attr);
    my ($url,$class) = @{$attr}{qw/href class/};
    my $text = '';
    # Append rather than assign: the parser may deliver link text in several chunks.
    $self->handler(text => sub{$text .= shift;},"dtext");
    $self->handler(end => sub{
                             my ($tag) = @_;
                             # Keep the link only if it carries class="res".
                             if($tag eq 'a' and defined $class and $class eq 'res'){
                                 push(@links,[$url,$text]);
                             }
                          },"tagname");
}
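
To tie this back to the original question (filtering fetched content before it is written out), here is a rough sketch that parses an HTML string instead of a file. The helper name extract_links, the sample HTML and the example.com URLs are made up for illustration and do not come from the thread. It also treats class as a whitespace-separated list, so class="res big" would match as well, which is an assumption about what is wanted.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not from the thread): returns [url, text] pairs for
# every <a href=...> whose class attribute contains the token $wanted.
sub extract_links {
    my ($html, $wanted) = @_;
    my @links;
    my $current;    # link being collected; undef outside a wanted <a>

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tagname, $attr) = @_;
            return unless $tagname eq 'a' and defined $attr->{href};
            my $class = defined $attr->{class} ? $attr->{class} : '';
            # class may hold several space-separated tokens
            return unless grep { $_ eq $wanted } split ' ', $class;
            $current = { url => $attr->{href}, text => '' };
        }, 'tagname,attr' ],
        text_h => [ sub {
            # text can arrive in several chunks, so append
            $current->{text} .= shift if $current;
        }, 'dtext' ],
        end_h => [ sub {
            my ($tagname) = @_;
            return unless $tagname eq 'a' and $current;
            push @links, [ $current->{url}, $current->{text} ];
            undef $current;
        }, 'tagname' ],
    );
    $p->parse($html);   # parse a string instead of a file
    $p->eof;            # signal end of input so buffered text is flushed
    return @links;
}

# Made-up sample input for illustration.
my $html = <<'HTML';
<p><a class="res" href="http://example.com/a">First</a>
<a href="http://example.com/b">Skipped</a>
<a class="res big" href="http://example.com/c">Third</a></p>
HTML

for my $link (extract_links($html, 'res')) {
    print "Linktext: $link->[1]\tURL: $link->[0]\n";
}
```

In the script from the first post, one would presumably pass $response->content() to such a helper and print the returned URLs to OUT instead of printing the raw content.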

Mary

2006-09-24 13:28

User since
2006-06-25
17 posts
User

It works! Thank you very much!

Regards
Mary
