Thread: Need a Perl script for evaluating .pdf files
(14 answers)
Opened by ClaudiaRohmeier at 2013-03-06 15:09
Here is a small script to experiment with or build on:
Code (perl):
```perl
#!/usr/bin/perl
use 5.014;    # the s///r modifier used below requires Perl 5.14
use warnings;

use Getopt::Long;
use Pod::Usage;
use Text::CSV;

my $out;
my $verb = 1;
my $help = 0;

GetOptions(
  'output|o=s' => \$out,
  'verbose|v+' => \$verb,
  'help|h|?'   => \$help,
) or pod2usage(-exitstatus => 2);

if ($help) {
  pod2usage(-exitstatus => 0, -verbose => $verb);
}

my ($key, $doc) = @ARGV;
unless (defined $key and defined $doc) {
  pod2usage(-exitstatus => 2);
}
unless (defined $out) {
  # Default output name: the document name with its extension replaced by .csv
  $out = $doc =~ s/(?:\.[^.]+)?$/.csv/r;
}

$|++ if ($verb > 1);

say "Reading keywords from '$key' ..." if ($verb > 2);
my @keywords = do {
  open my $in, '<', $key
    or die "Error opening keyword file: $!";
  my %unique;
  # A named loop variable is used because lexical $_ ("my $_")
  # was removed in Perl 5.24.
  while (defined(my $line = <$in>)) {
    chomp $line;
    for my $keyword (split /[\s.:!?,;()]+/, $line) {
      # A line that starts with a separator yields an empty leading field
      next if $keyword eq '';
      $unique{$keyword} = 1;
    }
  }
  keys %unique;
};
say scalar(@keywords), " keywords read" if ($verb > 1);

say "Scanning document '$doc', writing output to '$out' ..."
  if ($verb > 2);

# Detect PDF input by its magic number
my $ispdf = do {
  open my $in, '<', $doc
    or die "Error opening document file: $!";
  read $in, my $magic, 4;
  $magic eq '%PDF';
};

my $src;
if ($ispdf) {
  say "Document seems to be a PDF file" if ($verb > 2);
  open $src, '-|', 'pdftotext', $doc, '-'
    or die "Error opening document stream: $!";
}
else {
  say "Document does not seem to be a PDF file" if ($verb > 2);
  open $src, '<', $doc
    or die "Error opening document file: $!";
}

open my $tgt, '>', $out
  or die "Error opening output file: $!";
my $csv = Text::CSV->new({binary => 1, eol => $/});
$csv->print($tgt, [qw(Page Word Keyword Sentence)]);

my $page = 0;
my $word = 0;
my $sentence = '';
my @hits = ();
my $total = 0;
while (defined(my $line = <$src>)) {
  chomp $line;
  while ($line ne '') {
    if ($line =~ s/^\f//) {
      # Form feed: new page, restart word numbering
      $page += 1;
      $word = 0;
    }
    elsif ($line =~ s/^([^\s.:!?,;()]+)//) {
      # Word: compare against every keyword and extend the sentence
      my $candidate = $1;
      for my $keyword (@keywords) {
        if ($candidate eq $keyword) {
          print "$page,$word ... " if ($verb > 2);
          push @hits, [$page, $word, $keyword];
        }
      }

      $sentence .= ' ' if ($sentence ne '');
      $sentence .= $candidate;
      $word += 1;
    }
    elsif ($line =~ s/^([.:!?,;()])//) {
      # Separator: the sentence is complete, write out all pending hits
      $sentence .= $1;
      for my $hit (@hits) {
        push @$hit, $sentence;
        $csv->print($tgt, $hit);
      }
      $total += @hits;

      $sentence = '';
      @hits = ();
    }
    else {
      $line =~ s/^\s+//;
    }
  }
}
# Note: hits collected after the last separator in the document
# are not written out.
say "Done" if ($verb > 2);
say "$total matches found" if ($verb > 1);

close $src
  or die "Failed to close document stream: $!";
close $tgt
  or die "Failed to close output stream: $!";

__END__

=head1 NAME

keywords - Find keywords in PDF or text files

=head1 SYNOPSIS

keywords [OPTION ...] KEYWORDS DOCUMENT

=head1 OPTIONS

=over 4

=item B<--output=FILE>

=item B<-o FILE>

Write output to the given file. If no such option is given, the output
filename is constructed by replacing the extension of the input
document by C<.csv>.

=item B<--verbose>

=item B<-v>

Increases the verbosity of program output. Up to two instances of this
option currently make sense.

=item B<--help>

=item B<-h>

=item B<-?>

Shows documentation about the program. Combine with B<--verbose> to
view the entire manual page.

=back

=head1 DESCRIPTION

This program reads a list of keywords from a file and scans another
file for occurrences of those keywords. Both the keyword and document
file are split into words separated by whitespace or any of the
sentence separator characters C<.:!?,;()>.

If the document file is not plain text but a PDF file, it is
automatically filtered through the program C<pdftotext> and the output
is scanned instead.

While scanning the document, each occurrence of a keyword is printed to
the output in CSV format. The fields printed are

=over 4

=item the current page number, determined by counting form feeds;

=item the number of the word counting from the start of the page;

=item the matched keyword and

=item the sentence in which the keyword occurred.

=back

=head1 LICENSE

Copyright (c) 2013 by Thomas Chust L<mailto:chust@web.de>

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.

=cut
```
When C++ is your hammer, every problem looks like your thumb.
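To try the scanning rules from the DESCRIPTION without a PDF or `pdftotext` at hand, the tokenizer can be exercised on an in-memory string. What follows is only a minimal sketch: the helper name `find_hits` and the sample text are made up for illustration and are not part of the script above, and a hash lookup is swapped in for the script's loop over the keyword list.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): scans a string with
# the same three rules as the main loop and returns one
# [page, word, keyword, sentence] row per keyword occurrence.
sub find_hits {
    my ($text, @keywords) = @_;
    my %is_keyword = map { $_ => 1 } @keywords;
    my ($page, $word, $sentence) = (0, 0, '');
    my (@pending, @rows);

    for my $line (split /\n/, $text) {
        while ($line ne '') {
            if ($line =~ s/^\f//) {
                # A form feed starts a new page and resets the word counter.
                $page += 1;
                $word = 0;
            }
            elsif ($line =~ s/^([^\s.:!?,;()]+)//) {
                # A word: remember it if it is a keyword, extend the sentence.
                my $candidate = $1;
                push @pending, [$page, $word, $candidate]
                    if $is_keyword{$candidate};
                $sentence .= ' ' if $sentence ne '';
                $sentence .= $candidate;
                $word += 1;
            }
            elsif ($line =~ s/^([.:!?,;()])//) {
                # A separator closes the sentence and completes pending hits.
                $sentence .= $1;
                push @rows, [@$_, $sentence] for @pending;
                $sentence = '';
                @pending  = ();
            }
            else {
                $line =~ s/^\s+//;
            }
        }
    }
    return @rows;
}

# Made-up sample input: two "pages" separated by a form feed.
my @rows = find_hits("Perl is fun.\fWe use Perl daily!", 'Perl');
print join('|', @$_), "\n" for @rows;
# prints:
# 0|0|Perl|Perl is fun.
# 1|2|Perl|We use Perl daily!
```

Pages and words are counted from zero, and any of the characters `.:!?,;()` ends a "sentence", so a comma or a parenthesis also flushes the pending hits.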