TableParser: Reading data from tables (General Perl topics)


TomBombadil

2007-05-30 18:52


Hello everyone,

I only started with Perl today, but I have already been handed a hefty task and would be glad if someone could help. I would like to read values from an HTML table (http://www.securityfocus.com/bid/4183) and save them to a text file. I have seen that there are these modules:
- TableContentParser
- TableParser
- TableExtract
Which of them should I use? I would appreciate a code example for the first entry, i.e. for "Bugtraq ID:" and "4183". I would also like to know how I can then read and save further IDs.

Many thanks and regards, Tom[B]

MisterL

2007-05-30 19:12


Hello,

going straight from 0 to 100 like this won't work without solid Perl knowledge. The modules you mention are on CPAN and are, for example, written in an object-oriented style. As an example, see the code for TableExtract.
And there is no "must" when it comes to using a particular module. There are only good and bad solutions.
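
For the "Bugtraq ID:" / "4183" pair asked about above, a minimal sketch with HTML::TableExtract could look like the following. It parses a small inline HTML string rather than the live page, so the markup, the second row and the helper name table_pairs are assumptions made for illustration; the real page layout may differ.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical stand-in for the real page; the live markup may differ.
my $html = <<'HTML';
<table>
  <tr><td>Bugtraq ID:</td><td>4183</td></tr>
  <tr><td>Class:</td><td>example value</td></tr>
</table>
HTML

# Turn the first two cells of every table row into a [label, value] pair.
sub table_pairs {
    my ($source) = @_;
    my $te = HTML::TableExtract->new;    # no constraints: every table matches
    $te->parse($source);
    $te->eof;
    my @pairs;
    for my $ts ($te->tables) {
        for my $row ($ts->rows) {
            # Empty cells come back as undef, so replace them with ''.
            my ($label, $value) = map { defined $_ ? $_ : '' } @{$row}[0, 1];
            s/^\s+|\s+$//g for $label, $value;    # trim surrounding whitespace
            push @pairs, [$label, $value];
        }
    }
    return @pairs;
}

print "$_->[0] $_->[1]\n" for table_pairs($html);
```

On a real page the constructor would probably need constraints such as depth and count to pick out the one table of interest.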

Regards, MisterL

“Perl is the only language that looks the same before and after RSA encryption.”

TomBombadil

2007-05-31 17:49


I have to agree with you there. I read up a bit today and typed up the following code. I want the script to walk through the securityfocus pages (for all IDs < 30k) and give me the one table (depth = 1, count = 0) from each as a text file... something is not yet working the way I want. What am I doing wrong? Many thanks in advance for taking a look!

Code:

#!c:\xampp\perl\bin\perl.exe -w
# Provides the path to the perl interpreter

# Purpose: Script for parsing html, especially information in tables
# Version: 0.1

use strict;
use warnings;
use HTML::TableExtract;

my $html_file = "http://www.securityfocus.com/bid";
my $row = "row";
my $te = "table extract";
my $ts = "table search";

my $bid = "bugtraq id"
for (my $bid = 1; $bid <= 30000; $bid++) {
print "<table border=1 cellspacing=0>";
}

 $te = HTML::TableExtract->new( depth => 1, count => 0 );
 $te->parse_file($html_file);
 foreach $ts ($te->tables) {
    print "Table found at ", join(',', $ts->coords), ":\n";
    foreach $row ($ts->rows) {
       print "   ", join(',', @$row), "\n";
    }
}

---
Mod edit GwenDragon: PLEASE use code tags!
---

vayu

2007-05-31 19:21


First of all ...

Code:

my $te = "table extract";
my $ts = "table search";

my $bid = "bugtraq id"

Writing some text into the variables beforehand achieves nothing :) That is normally done with comments:

Code:

my $te; #table extract
my $ts;  #table search

my $bid; #bugtraq id

Then:

Code:

for (my $bid = 1; $bid <= 30000; $bid++) {
print "<table border=1 cellspacing=0>";
}

Here you print exactly this text, "<table border=1 cellspacing=0>", to the console 30,000 times in a row.

And at the very bottom you simply process only this one page:
"http://www.securityfocus.com/bid"

But you presumably want to go through the pages
"http://www.securityfocus.com/bid/1"
to
"http://www.securityfocus.com/bid/30000"

Maybe you should first read through the Perl tutorial ...

In our wiki you will find pretty much everything important for beginners, in German :)

perlintro

Oh, and ...

Quote
something is not yet working the way I want

is always a really bad description, because you may know what happens and what goes wrong, but _we_ don't.

In principle you should do something like this:

Code:

for(1 .. 30000) {
   my $table = $html_file."/".$_;
   $te = HTML::TableExtract->new( depth => 1, count => 0 );
   $te->parse_file($table);
}

But how exactly you then have to read out the tables is something you will need to puzzle out yourself; I have a lecture now and need to pay at least a little attention :)
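
Two caveats about the loop above may be worth spelling out. parse_file, which HTML::TableExtract inherits from HTML::Parser, expects a local file name or file handle rather than a URL, so the page has to be fetched first. And because a fresh parser is created on every pass, the tables have to be read inside the loop. The sketch below is one possible arrangement, not a tested solution for the real site: collect_rows, live_fetch and fake_fetch are hypothetical names, and the canned page is an assumption so that the example runs offline.

```perl
use strict;
use warnings;
use HTML::TableExtract;

my $base_url = "http://www.securityfocus.com/bid";

# Fetch one page over HTTP. LWP::Simple is loaded only if this is called.
sub live_fetch {
    my ($url) = @_;
    require LWP::Simple;
    return LWP::Simple::get($url);    # undef on failure
}

# Hypothetical canned page so the sketch runs offline: the table of
# interest sits inside an outer table, i.e. at depth 1, count 0.
sub fake_fetch {
    my ($url) = @_;
    my ($id) = $url =~ m{/(\d+)$};
    return "<table><tr><td><table>"
         . "<tr><td>Bugtraq ID:</td><td>$id</td></tr>"
         . "</table></td></tr></table>";
}

# Fetch each ID's page and return a hash reference { id => [ rows ] }.
sub collect_rows {
    my ($fetch, @ids) = @_;
    my %rows_for;
    for my $id (@ids) {
        my $html = $fetch->("$base_url/$id");
        next unless defined $html;    # skip pages that failed to load
        my $te = HTML::TableExtract->new(depth => 1, count => 0);
        $te->parse($html);
        $te->eof;
        # Read the tables now; the next pass creates a fresh parser.
        $rows_for{$id} = [ map { $_->rows } $te->tables ];
    }
    return \%rows_for;
}

my $result = collect_rows(\&fake_fetch, 1 .. 3);
for my $id (sort { $a <=> $b } keys %$result) {
    for my $row (@{ $result->{$id} }) {
        print join(',', map { defined $_ ? $_ : '' } @$row), "\n";
    }
}
```

Passing \&live_fetch instead of \&fake_fetch would request the real pages; with 30,000 IDs that means 30,000 HTTP requests, so a long run time is to be expected.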

"imitation is the sincerest form of flattery."
- Lee Anthony Iacocca

nepos

2007-05-31 19:52


And in general, please use the code tags. That makes your code much more readable here on the board. Thank you :)

GwenDragon

2007-05-31 21:44


Brave to attempt something like this as a one-day Perl beginner. But nothing learned, nothing gained.

Gwen the dragon

TomBombadil

2007-06-04 18:59


Quote
Thanks for the help so far. I have now tried to redirect the output into a file. When I run the script via xampp, it never finishes and I don't get a file named bid.txt either. Why?

Code:

#!C:\Dokumente und Einstellungen\testuser\Desktop\xampp\xampp\perl\bin\perl.exe -w

print "Content-type: text/html\n\n";

use CGI::Carp qw(fatalsToBrowser);
use strict;
use HTML::TableExtract;

my $table;                                                 # table of interest
my $html_file = "http://www.securityfocus.com/bid";        # url of web site
my $te;                                                    # table extract
my $ts;                                                    # table search
my $row;                                                   # row of table of interest
my @securityfocus;                                         # array

# Depth represents how deeply a table resides in other tables. The depth of a top-level
# table in the document is 0. A table within a top-level table has a depth of 1, and so
# on. Each depth can be thought of as a layer; tables sharing the same depth are on the
# same layer. Within each of these layers, Count represents the order in which a table
# was seen at that depth, starting with 0. Providing both a depth and a count will
# uniquely specify a table within a document -> the table of interest is on the second
# level (depth = 1), the first one (count = 0).

for(1..30000) {
  my $table = $html_file."/".$_;
  $te = HTML::TableExtract->new( depth => 1, count => 0 );
  $te->parse_file($table);
}

foreach $ts ($te->tables) {
   print "Table found at ", join(',', $ts->coords), ":\n";
   foreach $row ($ts->rows) {
       print "   ", join(',', @$row), "\n";
    }
}

@securityfocus=("Bugtraq ID: \n","Class: \n","CVE: \n","Remote: \n","Local: \n",
"Published: \n","Updated: \n","Credit: \n","Vulnerable: \n","Not Vulnerable: \n");
open(OUTPUTFILE,">c:/bid.txt");
print OUTPUTFILE @securityfocus;
close(OUTPUTFILE);

open(OUTPUTFILE,"c:/bid.txt");
while (<OUTPUTFILE>)
{
chomp;
print " $_ \n";
}
close(OUTPUTFILE);

MisterL

2007-06-04 19:15


Several remarks on the design (without actually solving the problem, though):

Code:

#!C:\perl\bin\perl.exe -w

instead of going through the Apache folder

Code:

 c:\bid.txt

without c:\

Incidentally, after some number crunching, bid.txt contains the following:
Bugtraq ID:
Class:
CVE:
Remote:
Local:
Published:
Updated:
Credit:
Vulnerable:
Not Vulnerable:


Guest

2007-06-04 19:30

The content of bid.txt is good -> see also www.securityfocus.com/bid/20000 ... now I just need to get the content after the colon into separate files for the different IDs...
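
Getting one file per ID could be sketched roughly as follows, assuming the label/value rows for an ID have already been extracted. The helper write_bid_file, the bid_<id>.txt naming scheme and the sample rows are hypothetical. The sketch uses the three-argument form of open with a lexical file handle and checks for errors, so a failed write does not go unnoticed.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Write one "Label: value" line per row into <dir>/bid_<id>.txt
# and return the path of the file that was written.
sub write_bid_file {
    my ($dir, $id, @rows) = @_;
    my $path = File::Spec->catfile($dir, "bid_$id.txt");
    open(my $out, '>', $path) or die "Cannot write $path: $!";
    for my $row (@rows) {
        my ($label, $value) = @$row;
        print {$out} "$label $value\n";
    }
    close($out) or die "Cannot close $path: $!";
    return $path;
}

# Hypothetical sample rows, written to a temporary directory that is
# removed again when the script exits.
my $dir  = tempdir(CLEANUP => 1);
my $path = write_bid_file($dir, 4183,
    ['Bugtraq ID:', '4183'],
    ['Class:',      'example value'],
);
print "wrote $path\n";
```

In a real run, $dir would be a permanent output directory and write_bid_file would be called once per ID inside the loop over the pages.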

TomBombadil

2007-06-06 10:45


With your input I have now typed up the following code, including something for the output. However, it doesn't create any txt files for me, and the script doesn't seem to stop at all when run :-( Does the solution perhaps lie in connecting the $ts/$te blocks with the OUTPUTFILE block? Hmm...

Code:

#!C:\perl\bin\perl.exe -w

# Purpose: Script for parsing html, especially information in tables
# Created by: Tom Bombadil, June 6, 2007
# Version: 0.2

print "Content-type: text/html\n\n";

use CGI::Carp qw(fatalsToBrowser);
use strict;
use HTML::TableExtract;

my $table;                                                 # table of interest
my $html_file = "http://www.securityfocus.com/bid";        # url of web site
my $te;                                                    # table extract
my $ts;                                                    # table search
my $row;                                                   # row of table of interest
my @securityfocus;                                         # array


@securityfocus=("Bugtraq ID: \n","Class: \n","CVE: \n","Remote: \n","Local: \n",
"Published: \n","Updated: \n","Credit: \n","Vulnerable: \n","Not Vulnerable: \n");
open(OUTPUTFILE,">bid.txt");
print OUTPUTFILE @securityfocus;
close(OUTPUTFILE);

open(OUTPUTFILE,"bid.txt");
while (<OUTPUTFILE>)
{
chomp;
print " $_ \n";
}
close(OUTPUTFILE);

# Depth represents how deeply a table resides in other tables. The depth of a top-level
# table in the document is 0. A table within a top-level table has a depth of 1, and so
# on. Each depth can be thought of as a layer; tables sharing the same depth are on the
# same layer. Within each of these layers, Count represents the order in which a table
# was seen at that depth, starting with 0. Providing both a depth and a count will
# uniquely specify a table within a document -> the table of interest is on the second
# level (depth = 1), the first one (count = 0).

for(1..30000) {
  my $table = $html_file."/".$_;
  $te = HTML::TableExtract->new( depth => 1, count => 0 );
  $te->parse_file($table);
}

foreach $ts ($te->tables) {
   print "Table found at ", join(',', $ts->coords), ":\n";
   foreach $row ($ts->rows) {
       print "   ", join(',', @$row), "\n";
    }
}
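
The depth/count comment in the script above can be checked on a small nested document before pointing the parser at a real page. The following sketch lists the coordinates of every table it finds; the sample HTML and the helper name table_coords are assumptions made for illustration.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical page: one outer table containing two inner tables.
my $html = <<'HTML';
<table>
  <tr><td>
    <table><tr><td>first inner</td></tr></table>
  </td><td>
    <table><tr><td>second inner</td></tr></table>
  </td></tr>
</table>
HTML

# Return a "depth,count" string for every table in the document.
sub table_coords {
    my ($source) = @_;
    my $te = HTML::TableExtract->new;    # no constraints: every table matches
    $te->parse($source);
    $te->eof;
    return map { join(',', $_->coords) } $te->tables;
}

print "$_\n" for table_coords($html);
```

The outer table is the one at depth 0, and the two inner tables share depth 1 with counts 0 and 1, so depth => 1, count => 0 would select the first inner table.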
