Converting Doc and Docx documents to HTML; counting the characters, paragraphs, and sentences in a Word file and reading its content

Published: 2024/9/27, Programming Q&A, 豆豆


Code for converting Doc and Docx documents to HTML

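The Maven dependencies are listed below. Note that the utility class that follows also imports org.apache.commons.lang.StringUtils and org.apache.log4j.Logger, which are not among these dependencies and have to be supplied separately.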

<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.4</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-examples</artifactId>
    <version>3.9</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>3.9</version>
</dependency>
<dependency>
    <groupId>fr.opensagres.xdocreport</groupId>
    <artifactId>org.apache.poi.xwpf.converter.xhtml</artifactId>
    <version>1.0.4</version>
</dependency>
<dependency>
    <groupId>fr.opensagres.xdocreport</groupId>
    <artifactId>org.apache.poi.xwpf.converter.core</artifactId>
    <version>1.0.4</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.9</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>3.9</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml-schemas</artifactId>
    <version>3.9</version>
</dependency>
<dependency>
    <groupId>org.apache.xmlbeans</groupId>
    <artifactId>xmlbeans</artifactId>
    <version>2.3.0</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>ooxml-schemas</artifactId>
    <version>1.1</version>
</dependency>

The utility class that reads Word files and converts them to HTML:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;
import org.apache.commons.io.output.ByteArrayOutputStream;
import org.apache.commons.lang.StringUtils;
import org.apache.log4j.Logger;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.converter.PicturesManager;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.hwpf.usermodel.Picture;
import org.apache.poi.hwpf.usermodel.PictureType;
import org.apache.poi.xwpf.converter.core.BasicURIResolver;
import org.apache.poi.xwpf.converter.core.FileImageExtractor;
import org.apache.poi.xwpf.converter.xhtml.XHTMLConverter;
import org.apache.poi.xwpf.converter.xhtml.XHTMLOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;
import org.w3c.dom.Document;

import cn.com.hbny.docdetection.entity.ResourcesWord;
import cn.com.hbny.docdetection.server.ExtendedServerConfig;

/**
 * @brief ReadWordUtils.java: utility class for processing Word documents
 * @author toto
 * @date 2017-03-03
 * @note created by 涂作權 on 2017-03-03
 */
public final class ReadWordUtils {

    private static Logger logger = Logger.getLogger(ReadWordUtils.class);

    private static final String CHARSET_UTF8 = "UTF-8";

    /**
     * Reads a .docx file and collects its text together with paragraph,
     * sentence, and character counts.
     *
     * @throws Exception
     */
    public static ResourcesWord readDocx(String path) throws Exception {
        int paragNum = 0;    // number of paragraphs
        int sentenceNum = 0; // number of sentences
        int wordNum = 0;     // number of characters
        StringBuffer content = new StringBuffer();
        ResourcesWord resourcesWord = new ResourcesWord();

        InputStream is = new FileInputStream(path);
        try {
            XWPFDocument doc = new XWPFDocument(is);
            for (XWPFParagraph para : doc.getParagraphs()) {
                String text = para.getText();
                // Only non-empty paragraphs are counted.
                if (!StringUtils.isEmpty(text)) {
                    paragNum++;
                    // A sentence is whatever sits between two "。" characters.
                    sentenceNum += text.replace("\r\n", "").trim().split("。").length;
                    content.append(text);
                }
            }

            // Append the text of every table cell in the document.
            for (XWPFTable table : doc.getTables()) {
                for (XWPFTableRow row : table.getRows()) {
                    for (XWPFTableCell cell : row.getTableCells()) {
                        content.append(cell.getText());
                    }
                }
            }
        } finally {
            // Close the stream even when parsing fails.
            close(is);
        }

        /*
         * Optional persistence to MongoDB, left disabled by the author:
         * MongoDBUtils mongoDb = new MongoDBUtils("javadb");
         * DBObject dbs = new BasicDBObject();
         * dbs.put("name", "創新性");  // category
         * dbs.put("major", "醫療");  // discipline
         * dbs.put("content", content.toString().trim());
         * dbs.put("paragNum", paragNum);
         * dbs.put("sentenceNum", sentenceNum);
         * dbs.put("wordNum", wordNum);
         * mongoDb.insert(dbs, "javadb");
         */

        // Character count of the whole content, table cells included.
        wordNum += content.toString().trim().length();
        resourcesWord.setContent(content.toString());
        resourcesWord.setParagNum(paragNum);
        resourcesWord.setSentenceNum(sentenceNum);
        resourcesWord.setWordNum(wordNum);
        return resourcesWord;
    }

    /**
     * Reads a .doc file and collects its text together with paragraph,
     * sentence, and character counts.
     *
     * @throws IOException
     */
    public static ResourcesWord readDoc(String path) throws IOException {
        int paragNum = 0;    // number of paragraphs
        int sentenceNum = 0; // number of sentences
        int wordNum = 0;     // number of characters
        ResourcesWord resourcesWord = new ResourcesWord();
        StringBuffer content = new StringBuffer();

        InputStream is = null;
        try {
            is = new FileInputStream(new File(path));
            WordExtractor ex = new WordExtractor(is);
            String[] paragraphs = ex.getParagraphText();
            for (int i = 0; i < paragraphs.length; i++) {
                String text = paragraphs[i].trim();
                // Skip empty paragraphs so the counts match readDocx; the original
                // counted each of them as one paragraph and one sentence.
                if (text.isEmpty()) {
                    continue;
                }
                paragNum++;
                logger.debug("Paragraph " + (i + 1) + " : " + text);
                sentenceNum += text.replace("\r\n", "").split("。").length;
                wordNum += text.length();
                content.append(text);
            }
            logger.info("Paragraphs: " + paragNum);
            logger.info("Sentences: " + sentenceNum);
            logger.info("Characters: " + wordNum);

            resourcesWord.setContent(content.toString());
            resourcesWord.setParagNum(paragNum);
            resourcesWord.setSentenceNum(sentenceNum);
            resourcesWord.setWordNum(wordNum);
            // The same optional MongoDB persistence as in readDocx was disabled here.
        } catch (Exception e) {
            logger.error("Failed to read " + path, e);
        } finally {
            close(is);
        }
        return resourcesWord;
    }

    /**
     * \brief Converts a .doc file to HTML and returns the relative path of the output.
     * @param filePath       : the .doc file to convert
     * @param outPutFilePath : where the HTML file is written (must end in ".html")
     * @author toto
     * @date 2017-02-27
     * @note created by 涂作權 on 2017-02-27
     */
    public static String doc2Html(String filePath, final String outPutFilePath)
            throws TransformerException, IOException, ParserConfigurationException {
        HWPFDocument wordDocument;
        InputStream in = new FileInputStream(filePath);
        try {
            wordDocument = new HWPFDocument(in);
        } finally {
            close(in);
        }

        // Images go into a folder named after the HTML file, next to it.
        // A local variable replaces the shared static field of the original,
        // which was not safe when two conversions ran at the same time.
        final File imageFolder = imageFolderFor(outPutFilePath);

        WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
        wordToHtmlConverter.setPicturesManager(new PicturesManager() {
            public String savePicture(byte[] content, PictureType pictureType, String suggestedName,
                    float widthInches, float heightInches) {
                if (!imageFolder.exists()) {
                    try {
                        FileUtils.forceMkdir(imageFolder);
                    } catch (IOException e) {
                        logger.error("Cannot create " + imageFolder, e);
                    }
                }
                // The img src is relative to the HTML file: <folder name>/<picture name>
                return imageFolder.getName() + File.separator + suggestedName;
            }
        });
        wordToHtmlConverter.processDocument(wordDocument);

        // Write the pictures themselves into the image folder.
        List<Picture> pics = wordDocument.getPicturesTable().getAllPictures();
        if (pics != null) {
            for (Picture pic : pics) {
                if (!imageFolder.exists()) {
                    imageFolder.mkdirs();
                }
                OutputStream picOut = null;
                try {
                    picOut = new FileOutputStream(new File(imageFolder, pic.suggestFullFileName()));
                    pic.writeImageContent(picOut);
                } catch (FileNotFoundException e) {
                    logger.error("Cannot write picture " + pic.suggestFullFileName(), e);
                } finally {
                    IOUtils.closeQuietly(picOut);
                }
            }
        }

        // Serialize the DOM produced by the converter as UTF-8 HTML.
        Document htmlDocument = wordToHtmlConverter.getDocument();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
        serializer.setOutputProperty(OutputKeys.INDENT, "yes");
        serializer.setOutputProperty(OutputKeys.METHOD, "html");
        serializer.transform(new DOMSource(htmlDocument), new StreamResult(out));
        out.close();

        // Decode with the charset used for serialization; the original relied on
        // the platform default, which corrupts Chinese text on non-UTF-8 systems.
        writeFile(new String(out.toByteArray(), CHARSET_UTF8), outPutFilePath);
        return gainRelativePathByOutputPath(outPutFilePath);
    }

    /**
     * Converts a .docx file to HTML.
     *
     * @param filePath       path of the source .docx file
     * @param outPutFilePath path of the HTML output file (must end in ".html")
     * @return the relative path of the output
     */
    public static String docx2Html(String filePath, final String outPutFilePath)
            throws TransformerException, IOException, ParserConfigurationException {
        InputStream in = new FileInputStream(filePath);
        OutputStream out = null;
        try {
            XWPFDocument wordDocument = new XWPFDocument(in);
            XHTMLOptions options = XHTMLOptions.create().indent(4);

            // Export the images.
            Map<String, String> imageInfoMap = gainTempImagePath(outPutFilePath);
            File imageFolder = new File(imageInfoMap.get("imageStoredPath"));
            options.setExtractor(new FileImageExtractor(imageFolder));

            // new FileURIResolver(imageFolder) would put absolute image paths into the
            // HTML; BasicURIResolver makes the img src relative to the HTML file.
            options.URIResolver(new BasicURIResolver(imageInfoMap.get("imageFolder")));

            File outFile = new File(outPutFilePath);
            outFile.getParentFile().mkdirs();
            out = new FileOutputStream(outFile);
            XHTMLConverter.getInstance().convert(wordDocument, out, options);
        } finally {
            IOUtils.closeQuietly(out);
            close(in);
        }
        return gainRelativePathByOutputPath(outPutFilePath);
    }

    /**
     * \brief Writes the content to the given path.
     * @param docContent : document content
     * @param path       : final storage path of the file
     * @author toto
     * @date 2017-02-27
     * @note modified by 涂作權 on 2017-02-27: changed the output file name
     */
    public static void writeFile(String docContent, String path) {
        FileOutputStream outDocFos = null;
        try {
            // Do nothing when no path is given.
            if (StringUtils.isNotBlank(path)) {
                File file = new File(path);
                if (!file.exists()) {
                    FileUtils.forceMkdir(file.getParentFile());
                }
                outDocFos = new FileOutputStream(path);
                IOUtils.write(docContent, outDocFos, CHARSET_UTF8);
            }
        } catch (IOException ioe) {
            logger.error("Cannot write " + path, ioe);
        } finally {
            IOUtils.closeQuietly(outDocFos);
        }
    }

    /**
     * Closes an input stream, ignoring null.
     */
    private static void close(InputStream is) {
        if (is != null) {
            try {
                is.close();
            } catch (IOException e) {
                logger.error("Cannot close stream", e);
            }
        }
    }

    /**
     * Returns the image folder for an HTML output path: the same path without
     * its ".html" suffix.
     */
    private static File imageFolderFor(String outPutFilePath) {
        return new File(outPutFilePath.substring(0, outPutFilePath.lastIndexOf(".html")));
    }

    /**
     * \brief Derives the image storage path from the document output path.
     * @param outPutFilePath : document output path
     * @return a map with "imageStoredPath" (full folder path) and "imageFolder" (folder name only)
     * @author toto
     * @date 2017-02-28
     */
    private static Map<String, String> gainTempImagePath(String outPutFilePath) {
        Map<String, String> imageInfoMap = new HashMap<String, String>();
        try {
            File imageFolder = imageFolderFor(outPutFilePath);
            if (!imageFolder.exists()) {
                FileUtils.forceMkdir(imageFolder);
            }
            imageInfoMap.put("imageStoredPath", imageFolder.getPath());
            imageInfoMap.put("imageFolder", imageFolder.getName());
            return imageInfoMap;
        } catch (Exception e) {
            logger.error("Cannot prepare image folder for " + outPutFilePath, e);
        }
        return null;
    }

    private static String gainRelativePathByOutputPath(String outPutFilePath) {
        // Prefix of the preview path; everything after it is the relative path.
        String docsPreviewPath = ExtendedServerConfig.getInstance().getStringProp("DOCS_PREVIEW_PREFIX");
        return outPutFilePath.split(docsPreviewPath)[1];
    }

    /**
     * \brief Replaces every match of a regular expression in a string.
     * @param orgStr    : the original string
     * @param regEx     : the regular expression
     * @param targetStr : the replacement
     * @return the resulting string, or null when orgStr is null or blank
     * @author toto
     * @date 2017-03-04
     * @note created by 涂作權 on 2017-03-04
     */
    public static String replaceStr(String orgStr, String regEx, String targetStr) {
        if (null != orgStr && !"".equals(orgStr.trim())) {
            // Example of a punctuation-stripping pattern:
            // "[\\s~·`!！@#￥$%^……&*（()）\\-——\\-_=+【\\[\\]】｛{}｝\\|、\\\\；;：:‘'“”\"，,《<。.》>、/？?]"
            Pattern p = Pattern.compile(regEx);
            Matcher m = p.matcher(orgStr);
            return m.replaceAll(targetStr);
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Disabled examples from the author:
        // String uploadFile = ExtendedServerConfig.getInstance().getStringProperty("UPLOAD_PATH");
        // String docsTempPath = ExtendedServerConfig.getInstance().getStringProperty("DOCS_TEMP_PATH");
        // String docsOutputPath = ExtendedServerConfig.getInstance().getStringProp("DOCS_OUTPUT_PATH");
        // ResourcesWord readDocx = ReadWordUtils.readDoc(uploadFile + "/大學生創新創業項目申報書.doc");
        // logger.info(readDocx.getContent());
        // logger.info(readDocx.getParagNum());
        // ReadWordUtils.doc2Html(uploadFile + "/大學生創新創業項目申報書.doc", docsOutputPath + "/大學生創新創業項目申報書.html");
        // ReadWordUtils.docx2Html(uploadFile + "/大學生創新創業項目申報書副本.docx", docsOutputPath + "/大學生創新創業項目申報書副本.html");

        // Normalize a path: backslashes to "/", collapse repeated "/", drop spaces.
        String newStr = replaceStr("afdas//\\as dfasd a//asd\\\\\\asd\\/", "[\\\\]", "/");
        newStr = replaceStr(newStr, "(/){1,}", "/");
        newStr = replaceStr(newStr, "[ ]", "");
        System.out.println(newStr);
    }
}
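The counting rule used by readDocx and readDoc can be tried without POI. The following is a minimal sketch rather than the author's code: the class name TextStats is hypothetical, and it assumes that "！" and "？" should end a sentence in addition to "。", which the original code does not do. Empty paragraphs are skipped, and empty fragments left by the split are not counted as sentences.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original project.
final class TextStats {
    final int paragraphs;
    final int sentences;
    final int characters;

    private TextStats(int paragraphs, int sentences, int characters) {
        this.paragraphs = paragraphs;
        this.sentences = sentences;
        this.characters = characters;
    }

    /** Counts non-empty paragraphs, sentences, and characters in the given paragraph texts. */
    static TextStats of(List<String> paragraphTexts) {
        int paragraphs = 0;
        int sentences = 0;
        int characters = 0;
        for (String raw : paragraphTexts) {
            String text = raw == null ? "" : raw.replace("\r", "").replace("\n", "").trim();
            if (text.isEmpty()) {
                continue; // an empty paragraph contributes nothing
            }
            paragraphs++;
            characters += text.length();
            // Assumption: "。", "！" and "？" all terminate a sentence.
            for (String part : text.split("[。！？]")) {
                if (!part.trim().isEmpty()) {
                    sentences++;
                }
            }
        }
        return new TextStats(paragraphs, sentences, characters);
    }

    public static void main(String[] args) {
        TextStats s = TextStats.of(Arrays.asList("你好。世界！", "", "  第三段没有句号  "));
        // prints "2 paragraphs, 3 sentences, 13 characters"
        System.out.println(s.paragraphs + " paragraphs, " + s.sentences + " sentences, "
                + s.characters + " characters");
    }
}
```

With POI, the list passed to TextStats.of would be the paragraph texts returned by XWPFParagraph.getText() or WordExtractor.getParagraphText().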

An example of calling the utility class from a service:

import java.io.File;

import org.apache.log4j.Logger;
import org.springframework.stereotype.Service;

import cn.com.hbny.docdetection.mongodb.beans.DocInfo;
import cn.com.hbny.docdetection.server.ExtendedServerConfig;
import cn.com.hbny.docdetection.service.base.impl.BaseServiceImpl;
import cn.com.hbny.docdetection.service.docInfoHandler.DocInfoHandlerService;
import cn.com.hbny.docdetection.utils.Pinyin4jUtils;
import cn.com.hbny.docdetection.utils.Pinyin4jUtils.PinyinType;
import cn.com.hbny.docdetection.utils.ReadWordUtils;
import cn.com.hbny.docdetection.utils.UUIDGenerator;

/**
 * @brief DocInfoHandlerServiceImpl.java: service that processes documents for detection
 * @author toto
 * @date 2017-03-02
 * @note created by 涂作權 on 2017-03-02
 */
@Service(value = "docInfoHandlerService")
public class DocInfoHandlerServiceImpl extends BaseServiceImpl implements DocInfoHandlerService {

    private static Logger logger = Logger.getLogger(DocInfoHandlerServiceImpl.class);

    /**
     * Converts a single uploaded Word document to HTML.
     *
     * @param docLibrayId       : id of the document library
     * @param originalDocPath   : location of the original document
     * @param uploadPath        : document upload path
     * @param outPutFolderPath  : final output folder for the document
     * @param docsPreviewPrefix : prefix of the document preview path
     * @return the document info, or null when the file is missing or conversion fails
     */
    public DocInfo handlerSingleDocInfo(String docLibrayId, String originalDocPath, String uploadPath,
            String outPutFolderPath, String docsPreviewPrefix) {
        try {
            // Return immediately when the file does not exist.
            File file = new File(originalDocPath);
            if (!file.exists()) {
                return null;
            }

            DocInfo docInfo = new DocInfo();
            docInfo.setId(UUIDGenerator.generate());
            docInfo.setDocLibrayId(docLibrayId);

            // Document name without the .doc/.docx extension.
            String fileName = file.getName();
            docInfo.setOriginalFileName(fileName.substring(0, fileName.toLowerCase().indexOf(".doc")));

            // The part of the path that follows the upload directory.
            String fileRelativePath = originalDocPath.substring(uploadPath.length());
            docInfo.setOriginalDocPath(fileRelativePath);

            // Pick the converter by file extension; anything that is not .doc is treated as .docx.
            boolean isDoc = fileName.endsWith(".doc");
            String extension = isDoc ? ".doc" : ".docx";

            // Build the HTML output path: pinyin file name, forward slashes only, no spaces.
            String outPutFilePath = Pinyin4jUtils.toPinYin(
                    outPutFolderPath + fileRelativePath.replace(extension, ".html"), PinyinType.LOWERCASE);
            outPutFilePath = ReadWordUtils.replaceStr(outPutFilePath, "[\\\\]", "/");
            outPutFilePath = ReadWordUtils.replaceStr(outPutFilePath, "(/){1,}", "/");
            outPutFilePath = ReadWordUtils.replaceStr(outPutFilePath, "[ ]", "");

            // Convert and record where the HTML ended up.
            String filePathAfterHandled = isDoc
                    ? ReadWordUtils.doc2Html(originalDocPath, outPutFilePath)
                    : ReadWordUtils.docx2Html(originalDocPath, outPutFilePath);
            docInfo.setHtmlDocPath(filePathAfterHandled);

            // The original returned null here, which discarded the populated DocInfo.
            return docInfo;
        } catch (Exception e) {
            logger.error("Failed to process " + originalDocPath, e);
        }
        return null;
    }

    public static void main(String[] args) {
        String uploadPath = ExtendedServerConfig.getInstance().getStringProperty("UPLOAD_PATH");
        String outPutFolderPath = ExtendedServerConfig.getInstance().getStringProperty("DOCS_OUTPUT_PATH");
        String docsPreviewPrefix = ExtendedServerConfig.getInstance().getStringProperty("DOCS_PREVIEW_PREFIX");

        // Disabled example for a .doc file:
        // new DocInfoHandlerServiceImpl().handlerSingleDocInfo(
        //         UUIDGenerator.generate(),
        //         uploadPath + "/雙創項目申報書20170301/國家級大學生創新創業訓練計劃立項申請書上海電力學院.doc",
        //         uploadPath,
        //         outPutFolderPath,
        //         docsPreviewPrefix);

        new DocInfoHandlerServiceImpl().handlerSingleDocInfo(
                UUIDGenerator.generate(),
                uploadPath + "/雙創項目申報書20170301/專題產品需求規格說明書.docx",
                uploadPath,
                outPutFolderPath,
                docsPreviewPrefix);
    }
}
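The three replaceStr calls that clean up the output path can be checked in isolation. The sketch below uses only String.replaceAll from the JDK; the class name PathNormalizer is hypothetical, and the pattern "/+" is an equivalent spelling of the "(/){1,}" used above.

```java
// Hypothetical helper, not part of the original project.
final class PathNormalizer {

    /**
     * Turns backslashes into "/", collapses runs of "/" into one, and removes spaces.
     * Returns null for a null or blank input, mirroring replaceStr.
     */
    static String normalize(String path) {
        if (path == null || path.trim().isEmpty()) {
            return null;
        }
        return path
                .replaceAll("[\\\\]", "/") // regex [\\] matches one backslash
                .replaceAll("/+", "/")     // one or more slashes become a single slash
                .replaceAll("[ ]", "");    // drop every space
    }

    public static void main(String[] args) {
        // prints "D:/docs/output/a.html"
        System.out.println(normalize("D:\\docs//out put\\a.html"));
        // The string from ReadWordUtils.main; prints "afdas/asdfasda/asd/asd/"
        System.out.println(normalize("afdas//\\as dfasd a//asd\\\\\\asd\\/"));
    }
}
```

The order of the steps matters: backslashes are converted first so that a mixed run such as "\/" collapses to a single slash in the second step.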
All of the configuration parameters used are listed below:

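# Storage location for uploaded files; do not add a trailing slash
UPLOAD_PATH=D:/installed/apache-tomcat-7.0.47/webapps/upload
## Output location for processed documents; do not add a trailing slash
DOCS_OUTPUT_PATH=D:/installed/apache-tomcat-7.0.47/webapps/docs-output-path
## Document preview path; do not add a trailing slash
DOCS_PREVIEW_PREFIX=/docs-output-path
## Temporary storage path for images generated while processing documents; do not add a trailing slash
DOCS_TEMP_PATH=D:/installed/apache-tomcat-7.0.47/webapps/temp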
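DOCS_PREVIEW_PREFIX is what gainRelativePathByOutputPath uses to turn an absolute output path into the relative path stored on the document record. The utility class does this with String.split, which treats the prefix as a regular expression. The sketch below shows a plain-string alternative under that reading; the class and method names are hypothetical, and the file name in the example is made up for illustration.

```java
// Hypothetical helper, not part of the original project.
final class PreviewPaths {

    /**
     * Returns the part of outputPath that follows the first occurrence of previewPrefix.
     * The prefix is matched literally, so characters such as "." or "+" in it are safe.
     */
    static String relativeTo(String outputPath, String previewPrefix) {
        int idx = outputPath.indexOf(previewPrefix);
        if (idx < 0) {
            throw new IllegalArgumentException(
                    "Output path does not contain the preview prefix: " + previewPrefix);
        }
        return outputPath.substring(idx + previewPrefix.length());
    }

    public static void main(String[] args) {
        String outputPath =
                "D:/installed/apache-tomcat-7.0.47/webapps/docs-output-path/shuangchuang/a.html";
        // prints "/shuangchuang/a.html"
        System.out.println(relativeTo(outputPath, "/docs-output-path"));
    }
}
```

Failing loudly when the prefix is missing is a design choice of this sketch; the split-based version would instead throw an ArrayIndexOutOfBoundsException in that case.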
