아님말고
Extracting text from a PDF with PDFBox
Required jars: PDFBox-0.7.3.jar, fontbox-0.1.0-dev.jar
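import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class PDFTest {

    public static void main(String[] args) {
        // Path to the PDF to read; adjust for your environment.
        String src = "D:\\study\\data\\test.pdf";
        InputStream is = null;
        PDDocument doc = null;
        try {
            is = new FileInputStream(new File(src));
            // Parse the raw file into a low-level COSDocument, then wrap it
            // in a PDDocument so the text stripper can walk its pages.
            doc = new PDDocument(parseDocument(is));
            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(doc);
            System.out.println(text);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Release the document and the file handle even if parsing failed.
            try { if (doc != null) doc.close(); } catch (IOException ignored) { }
            try { if (is != null) is.close(); } catch (IOException ignored) { }
        }
    }

    // Runs the PDFBox parser over the stream and returns the parsed document.
    private static COSDocument parseDocument(InputStream is) throws IOException {
        PDFParser parser = new PDFParser(is);
        parser.parse();
        return parser.getDocument();
    }
}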
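Before handing a file to the parser, it can be worth checking that it looks like a PDF at all, since PDF files are expected to start with the ASCII marker "%PDF-". The following is only a rough sketch using the standard library; the class name PdfHeaderCheck and the method looksLikePdf are hypothetical helpers for illustration and are not part of PDFBox.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
public class PdfHeaderCheck {

    // PDF files are expected to begin with this ASCII marker.
    private static final byte[] MAGIC = { '%', 'P', 'D', 'F', '-' };

    // Returns true if the first bytes of the stream match the PDF marker.
    // Note: this consumes up to MAGIC.length bytes from the stream.
    public static boolean looksLikePdf(InputStream in) throws IOException {
        for (int i = 0; i < MAGIC.length; i++) {
            if (in.read() != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // In-memory sample standing in for the start of a real file.
        byte[] sample = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(new ByteArrayInputStream(sample)));
    }
}
```

Because the check reads from the stream, open a fresh InputStream for the actual parse rather than reusing the one passed to looksLikePdf. Some lenient readers also accept files with a few stray bytes before the marker, which this strict check would reject.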