Apache Tika Source Code Study (Part 7)

Published: 2025/3/17


How does Tika load its Parser implementations, and how does it pick the right Parser for a document's MIME type? This post continues the analysis.

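First, an overview of Tika's parser-related interfaces and classes:

[Figure: UML model of Tika's parser interfaces and classes]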

The source of the Parser interface is as follows:

    /**
     * Tika parser interface.
     */
    public interface Parser extends Serializable {

        /**
         * Returns the set of media types supported by this parser when used
         * with the given parse context.
         *
         * @since Apache Tika 0.7
         * @param context parse context
         * @return immutable set of media types
         */
        Set<MediaType> getSupportedTypes(ParseContext context);

        /**
         * Parses a document stream into a sequence of XHTML SAX events.
         * Fills in related document metadata in the given metadata object.
         * <p>
         * The given document stream is consumed but not closed by this method.
         * The responsibility to close the stream remains on the caller.
         * <p>
         * Information about the parsing context can be passed in the context
         * parameter. See the parser implementations for the kinds of context
         * information they expect.
         *
         * @since Apache Tika 0.5
         * @param stream the document stream (input)
         * @param handler handler for the XHTML SAX events (output)
         * @param metadata document metadata (input and output)
         * @param context parse context
         * @throws IOException if the document stream could not be read
         * @throws SAXException if the SAX events could not be processed
         * @throws TikaException if the document could not be parsed
         */
        void parse(
                InputStream stream, ContentHandler handler,
                Metadata metadata, ParseContext context)
                throws IOException, SAXException, TikaException;

    }

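Apart from getSupportedTypes, which reports the media types a parser handles, the interface exposes a single method for other classes to call: void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context).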

The abstract class AbstractParser implements the Parser interface. Its source is as follows:

    /**
     * Abstract base class for new parsers. This method implements the old
     * deprecated parse method so subclasses won't have to.
     *
     * @since Apache Tika 0.10
     */
    public abstract class AbstractParser implements Parser {

        /**
         * Serial version UID.
         */
        private static final long serialVersionUID = 7186985395903074255L;

        /**
         * Calls the
         * {@link Parser#parse(InputStream, ContentHandler, Metadata, ParseContext)}
         * method with an empty {@link ParseContext}. This method exists as a
         * leftover from Tika 0.x when the three-argument parse() method still
         * existed in the {@link Parser} interface. No new code should call this
         * method anymore, it's only here for backwards compatibility.
         *
         * @deprecated use the {@link Parser#parse(InputStream, ContentHandler, Metadata, ParseContext)} method instead
         */
        public void parse(
                InputStream stream, ContentHandler handler, Metadata metadata)
                throws IOException, SAXException, TikaException {
            parse(stream, handler, metadata, new ParseContext());
        }

    }

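It adds a three-argument void parse(InputStream stream, ContentHandler handler, Metadata metadata) method in the style of a template method: the overload is implemented once in the base class and simply calls the four-argument parse method, which subclasses supply, with an empty ParseContext. As the Javadoc says, it is deprecated and kept only for backwards compatibility.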
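The shape of that overload can be shown with a small stand-alone sketch. The names below (`SketchParser`, `UpperCaseParser`, and a plain `Map` standing in for `ParseContext`) are hypothetical and are not Tika classes; they only mirror how the convenience overload forwards to the method that subclasses implement.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class OverloadDemo {
    public static void main(String[] args) {
        // The caller uses the short overload; the base class supplies the context.
        System.out.println(new UpperCaseParser().parse("tika"));
    }
}

// Hypothetical stand-in for AbstractParser: one abstract method that takes a
// context, plus a convenience overload that supplies an empty one.
abstract class SketchParser {
    // The entry point that subclasses implement.
    abstract String parse(String input, Map<String, Object> context);

    // Convenience overload: delegates with a fresh, empty context.
    String parse(String input) {
        return parse(input, new HashMap<String, Object>());
    }
}

// Hypothetical concrete parser used only for this illustration.
class UpperCaseParser extends SketchParser {
    @Override
    String parse(String input, Map<String, Object> context) {
        // Report how many context entries arrived alongside the result.
        return input.toUpperCase(Locale.ROOT)
                + " (context entries: " + context.size() + ")";
    }
}
```

Subclasses only ever write the method that takes a context, which is the same division of labour AbstractParser sets up.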

Next is the source of CompositeParser, which extends the abstract class AbstractParser:

    /**
     * Composite parser that delegates parsing tasks to a component parser
     * based on the declared content type of the incoming document. A fallback
     * parser is defined for cases where a parser for the given content type is
     * not available.
     */
    public class CompositeParser extends AbstractParser {

        /** Serial version UID */
        private static final long serialVersionUID = 2192845797749627824L;

        /**
         * Media type registry.
         */
        private MediaTypeRegistry registry;

        /**
         * List of component parsers.
         */
        private List<Parser> parsers;

        /**
         * The fallback parser, used when no better parser is available.
         */
        private Parser fallback = new EmptyParser();

        public CompositeParser(MediaTypeRegistry registry, List<Parser> parsers) {
            this.parsers = parsers;
            this.registry = registry;
        }

        public CompositeParser(MediaTypeRegistry registry, Parser... parsers) {
            this(registry, Arrays.asList(parsers));
        }

        public CompositeParser() {
            this(new MediaTypeRegistry());
        }

        public Map<MediaType, Parser> getParsers(ParseContext context) {
            Map<MediaType, Parser> map = new HashMap<MediaType, Parser>();
            for (Parser parser : parsers) {
                for (MediaType type : parser.getSupportedTypes(context)) {
                    map.put(registry.normalize(type), parser);
                }
            }
            return map;
        }

        /**
         * Utility method that goes through all the component parsers and finds
         * all media types for which more than one parser declares support. This
         * is useful in tracking down conflicting parser definitions.
         *
         * @since Apache Tika 0.10
         * @see <a href="https://issues.apache.org/jira/browse/TIKA-660">TIKA-660</a>
         * @param context parsing context
         * @return media types that are supported by at least two component parsers
         */
        public Map<MediaType, List<Parser>> findDuplicateParsers(
                ParseContext context) {
            Map<MediaType, Parser> types = new HashMap<MediaType, Parser>();
            Map<MediaType, List<Parser>> duplicates =
                new HashMap<MediaType, List<Parser>>();
            for (Parser parser : parsers) {
                for (MediaType type : parser.getSupportedTypes(context)) {
                    MediaType canonicalType = registry.normalize(type);
                    if (types.containsKey(canonicalType)) {
                        List<Parser> list = duplicates.get(canonicalType);
                        if (list == null) {
                            list = new ArrayList<Parser>();
                            list.add(types.get(canonicalType));
                            duplicates.put(canonicalType, list);
                        }
                        list.add(parser);
                    } else {
                        types.put(canonicalType, parser);
                    }
                }
            }
            return duplicates;
        }

        /**
         * Returns the media type registry used to infer type relationships.
         *
         * @since Apache Tika 0.8
         * @return media type registry
         */
        public MediaTypeRegistry getMediaTypeRegistry() {
            return registry;
        }

        /**
         * Sets the media type registry used to infer type relationships.
         *
         * @since Apache Tika 0.8
         * @param registry media type registry
         */
        public void setMediaTypeRegistry(MediaTypeRegistry registry) {
            this.registry = registry;
        }

        /**
         * Returns the component parsers.
         *
         * @return component parsers, keyed by media type
         */
        public Map<MediaType, Parser> getParsers() {
            return getParsers(new ParseContext());
        }

        /**
         * Sets the component parsers.
         *
         * @param parsers component parsers, keyed by media type
         */
        public void setParsers(Map<MediaType, Parser> parsers) {
            this.parsers = new ArrayList<Parser>(parsers.size());
            for (Map.Entry<MediaType, Parser> entry : parsers.entrySet()) {
                this.parsers.add(ParserDecorator.withTypes(
                        entry.getValue(), Collections.singleton(entry.getKey())));
            }
        }

        /**
         * Returns the fallback parser.
         *
         * @return fallback parser
         */
        public Parser getFallback() {
            return fallback;
        }

        /**
         * Sets the fallback parser.
         *
         * @param fallback fallback parser
         */
        public void setFallback(Parser fallback) {
            this.fallback = fallback;
        }

        /**
         * Returns the parser that best matches the given metadata. By default
         * looks for a parser that matches the content type metadata property,
         * and uses the fallback parser if a better match is not found. The
         * type hierarchy information included in the configured media type
         * registry is used when looking for a matching parser instance.
         * <p>
         * Subclasses can override this method to provide more accurate
         * parser resolution.
         *
         * @param metadata document metadata
         * @return matching parser
         */
        protected Parser getParser(Metadata metadata) {
            return getParser(metadata, new ParseContext());
        }

        protected Parser getParser(Metadata metadata, ParseContext context) {
            Map<MediaType, Parser> map = getParsers(context);
            MediaType type = MediaType.parse(metadata.get(Metadata.CONTENT_TYPE));
            if (type != null) {
                // We always work on the normalised, canonical form
                type = registry.normalize(type);
            }
            while (type != null) {
                // Try finding a parser for the type
                Parser parser = map.get(type);
                if (parser != null) {
                    return parser;
                }
                // Failing that, try for the parent of the type
                type = registry.getSupertype(type);
            }
            return fallback;
        }

        public Set<MediaType> getSupportedTypes(ParseContext context) {
            return getParsers(context).keySet();
        }

        /**
         * Delegates the call to the matching component parser.
         * <p>
         * Potential {@link RuntimeException}s, {@link IOException}s and
         * {@link SAXException}s unrelated to the given input stream and content
         * handler are automatically wrapped into {@link TikaException}s to better
         * honor the {@link Parser} contract.
         */
        public void parse(
                InputStream stream, ContentHandler handler,
                Metadata metadata, ParseContext context)
                throws IOException, SAXException, TikaException {
            Parser parser = getParser(metadata);
            TemporaryResources tmp = new TemporaryResources();
            try {
                TikaInputStream taggedStream = TikaInputStream.get(stream, tmp);
                TaggedContentHandler taggedHandler = new TaggedContentHandler(handler);
                try {
                    parser.parse(taggedStream, taggedHandler, metadata, context);
                } catch (RuntimeException e) {
                    throw new TikaException(
                            "Unexpected RuntimeException from " + parser, e);
                } catch (IOException e) {
                    taggedStream.throwIfCauseOf(e);
                    throw new TikaException(
                            "TIKA-198: Illegal IOException from " + parser, e);
                } catch (SAXException e) {
                    taggedHandler.throwIfCauseOf(e);
                    throw new TikaException(
                            "TIKA-237: Illegal SAXException from " + parser, e);
                }
            } finally {
                tmp.dispose();
            }
        }

    }

The class comment is clear: CompositeParser delegates the parsing task to component parsers, while its own parse method is what other classes call.

Let's look at how CompositeParser hands the work over to a component parser. The key is this line in its parse method: Parser parser = getParser(metadata);

That call goes to the following methods:

    protected Parser getParser(Metadata metadata) {
        return getParser(metadata, new ParseContext());
    }

    protected Parser getParser(Metadata metadata, ParseContext context) {
        Map<MediaType, Parser> map = getParsers(context);
        MediaType type = MediaType.parse(metadata.get(Metadata.CONTENT_TYPE));
        if (type != null) {
            // We always work on the normalised, canonical form
            type = registry.normalize(type);
        }
        while (type != null) {
            // Try finding a parser for the type
            Parser parser = map.get(type);
            if (parser != null) {
                return parser;
            }
            // Failing that, try for the parent of the type
            type = registry.getSupertype(type);
        }
        return fallback;
    }

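The flow is: first build the Map<MediaType, Parser> that maps each MIME type to its Parser implementation; then read the Metadata.CONTENT_TYPE property from the Metadata and parse it into a MediaType; finally look that type up in the map. If the map has no entry for the type itself, the loop retries with the supertype reported by the registry, and the fallback parser is returned only once that chain is exhausted.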
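The supertype walk is easier to see in isolation. The following is a minimal sketch, assuming plain strings in place of MediaType and Parser and a hand-written supertype table in place of MediaTypeRegistry; the class name `SupertypeLookupSketch` and the small hierarchy in `main` are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

class SupertypeLookupSketch {
    // Walks from the given type up through its supertypes until a parser is found.
    static String resolve(Map<String, String> parsers,
                          Map<String, String> supertypes,
                          String type,
                          String fallback) {
        while (type != null) {
            String parser = parsers.get(type);
            if (parser != null) {
                return parser;
            }
            // Becomes null once there is no further supertype to try.
            type = supertypes.get(type);
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Illustrative hierarchy, written by hand for this example.
        Map<String, String> supertypes = new HashMap<>();
        supertypes.put("image/svg+xml", "application/xml");
        supertypes.put("application/xml", "text/plain");

        Map<String, String> parsers = new HashMap<>();
        parsers.put("application/xml", "XmlParser");
        parsers.put("text/plain", "TextParser");

        // No direct entry for SVG, so the XML parser one level up is chosen.
        System.out.println(resolve(parsers, supertypes, "image/svg+xml", "EmptyParser"));
        // Unknown type with no supertype entry: the fallback is returned.
        System.out.println(resolve(parsers, supertypes, "video/mp4", "EmptyParser"));
    }
}
```

The nearest ancestor with a registered parser wins, which is why a parser registered for a general type also serves its more specific subtypes.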

The call Map<MediaType, Parser> map = getParsers(context) in the code above builds the Map<MediaType, Parser> mapping:

    public Map<MediaType, Parser> getParsers(ParseContext context) {
        Map<MediaType, Parser> map = new HashMap<MediaType, Parser>();
        for (Parser parser : parsers) {
            for (MediaType type : parser.getSupportedTypes(context)) {
                map.put(registry.normalize(type), parser);
            }
        }
        return map;
    }
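One consequence of building the map with put() in list order is that a later parser replaces an earlier one for the same type. A minimal sketch of that behaviour, assuming strings in place of MediaType and a hypothetical `NamedParser` holder in place of Parser (type normalization is left out):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LastWinsIndexSketch {
    // Hypothetical stand-in for a component parser: a name plus the types it claims.
    static final class NamedParser {
        final String name;
        final List<String> types;

        NamedParser(String name, String... types) {
            this.name = name;
            this.types = Arrays.asList(types);
        }
    }

    // Same loop shape as getParsers(context): iterate in list order and put()
    // each type, so a later parser overwrites an earlier one for the same key.
    static Map<String, String> buildIndex(List<NamedParser> parsers) {
        Map<String, String> index = new HashMap<>();
        for (NamedParser parser : parsers) {
            for (String type : parser.types) {
                index.put(type, parser.name);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> index = buildIndex(Arrays.asList(
                new NamedParser("BuiltInHtmlParser", "text/html", "application/xhtml+xml"),
                new NamedParser("CustomHtmlParser", "text/html")));
        System.out.println(index.get("text/html"));             // CustomHtmlParser
        System.out.println(index.get("application/xhtml+xml")); // BuiltInHtmlParser
    }
}
```

This makes the order of the parsers list significant, and findDuplicateParsers exists to report which types are claimed more than once.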

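The map is built from the List<Parser> parsers collection that was set up in the constructor. Note that if one of the members of that collection is itself a CompositeParser, the MIME types that member reports come in turn from its own collection of component parsers. (This is probably where the name CompositeParser comes from: it is the Composite pattern.) We can see this in its Set<MediaType> getSupportedTypes(ParseContext context) method: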

    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return getParsers(context).keySet();
    }
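The recursion can be sketched without any Tika classes. In the hypothetical model below, `TypeSource` plays the role of the Parser interface (reduced to the supported-types query), `Leaf` is a concrete parser and `Composite` is a parser made of other parsers; all three names are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class NestedCompositeSketch {
    // Hypothetical minimal component interface: it only reports supported types.
    interface TypeSource {
        Set<String> supportedTypes();
    }

    // Leaf: a concrete parser with a fixed set of types.
    static final class Leaf implements TypeSource {
        private final Set<String> types;

        Leaf(String... types) {
            this.types = new LinkedHashSet<>(Arrays.asList(types));
        }

        @Override
        public Set<String> supportedTypes() {
            return types;
        }
    }

    // Composite: its types are the union of whatever its children report,
    // and a child may itself be another Composite.
    static final class Composite implements TypeSource {
        private final List<TypeSource> children;

        Composite(TypeSource... children) {
            this.children = Arrays.asList(children);
        }

        @Override
        public Set<String> supportedTypes() {
            Set<String> all = new LinkedHashSet<>();
            for (TypeSource child : children) {
                all.addAll(child.supportedTypes());
            }
            return all;
        }
    }

    public static void main(String[] args) {
        TypeSource inner = new Composite(new Leaf("application/pdf"), new Leaf("text/html"));
        TypeSource outer = new Composite(inner, new Leaf("image/png"));
        System.out.println(outer.supportedTypes());
    }
}
```

The outer composite never needs to know whether a child is a leaf or another composite; it asks the same question of both.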

[Figure: simplified UML model of the Composite pattern]

Next we look at the source of DefaultParser. It extends CompositeParser and is responsible for initializing CompositeParser's member fields.

    /**
     * A composite parser based on all the {@link Parser} implementations
     * available through the
     * {@link javax.imageio.spi.ServiceRegistry service provider mechanism}.
     *
     * @since Apache Tika 0.8
     */
    public class DefaultParser extends CompositeParser {

        /** Serial version UID */
        private static final long serialVersionUID = 3612324825403757520L;

        /**
         * Finds all statically loadable parsers and sort the list by name,
         * rather than discovery order. CompositeParser takes the last
         * parser for any given media type, so put the Tika parsers first
         * so that non-Tika (user supplied) parsers can take precedence.
         *
         * @param loader service loader
         * @return ordered list of statically loadable parsers
         */
        private static List<Parser> getDefaultParsers(ServiceLoader loader) {
            List<Parser> parsers =
                    loader.loadStaticServiceProviders(Parser.class);
            Collections.sort(parsers, new Comparator<Parser>() {
                public int compare(Parser p1, Parser p2) {
                    String n1 = p1.getClass().getName();
                    String n2 = p2.getClass().getName();
                    boolean t1 = n1.startsWith("org.apache.tika.");
                    boolean t2 = n2.startsWith("org.apache.tika.");
                    if (t1 == t2) {
                        return n1.compareTo(n2);
                    } else if (t1) {
                        return -1;
                    } else {
                        return 1;
                    }
                }
            });
            return parsers;
        }

        private transient final ServiceLoader loader;

        public DefaultParser(MediaTypeRegistry registry, ServiceLoader loader) {
            super(registry, getDefaultParsers(loader));
            this.loader = loader;
        }

        public DefaultParser(MediaTypeRegistry registry, ClassLoader loader) {
            this(registry, new ServiceLoader(loader));
        }

        public DefaultParser(ClassLoader loader) {
            this(MediaTypeRegistry.getDefaultRegistry(), new ServiceLoader(loader));
        }

        public DefaultParser(MediaTypeRegistry registry) {
            this(registry, new ServiceLoader());
        }

        public DefaultParser() {
            this(MediaTypeRegistry.getDefaultRegistry());
        }

        @Override
        public Map<MediaType, Parser> getParsers(ParseContext context) {
            Map<MediaType, Parser> map = super.getParsers(context);

            if (loader != null) {
                // Add dynamic parser service (they always override static ones)
                MediaTypeRegistry registry = getMediaTypeRegistry();
                for (Parser parser
                        : loader.loadDynamicServiceProviders(Parser.class)) {
                    for (MediaType type : parser.getSupportedTypes(context)) {
                        map.put(registry.normalize(type), parser);
                    }
                }
            }

            return map;
        }

    }

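This class mainly initializes the base class's MediaTypeRegistry registry and List<Parser> parsers members. It overrides getParsers(ParseContext) to layer dynamically loaded parsers on top of the static ones, but it does not override void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context), so the base-class method runs and dispatches to the concrete parser's parse method according to the MIME type.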
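The ordering rule in getDefaultParsers can be tried out on plain class names. The sketch below applies the same comparison to strings rather than Parser instances; `ParserOrderSketch` is a hypothetical name and `com.example.MyPdfParser` is a made-up user-supplied class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class ParserOrderSketch {
    // Same rule as the comparator in getDefaultParsers, applied to class names:
    // names under org.apache.tika. sort first, ties are broken alphabetically.
    static final Comparator<String> TIKA_FIRST = new Comparator<String>() {
        @Override
        public int compare(String n1, String n2) {
            boolean t1 = n1.startsWith("org.apache.tika.");
            boolean t2 = n2.startsWith("org.apache.tika.");
            if (t1 == t2) {
                return n1.compareTo(n2);
            }
            return t1 ? -1 : 1;
        }
    };

    // Returns a sorted copy and leaves the input list untouched.
    static List<String> sorted(List<String> classNames) {
        List<String> copy = new ArrayList<>(classNames);
        Collections.sort(copy, TIKA_FIRST);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sorted(Arrays.asList(
                "com.example.MyPdfParser",
                "org.apache.tika.parser.pdf.PDFParser",
                "org.apache.tika.parser.html.HtmlParser")));
    }
}
```

Because the map built by CompositeParser keeps the last parser registered for a type, sorting Tika's own parsers to the front is what lets a user-supplied parser for the same type take precedence.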

Finally, the source of AutoDetectParser, which also extends CompositeParser:

    public class AutoDetectParser extends CompositeParser {

        /** Serial version UID */
        private static final long serialVersionUID = 6110455808615143122L;

        /**
         * The type detector used by this parser to auto-detect the type
         * of a document.
         */
        private Detector detector; // always set in the constructor

        /**
         * Creates an auto-detecting parser instance using the default Tika
         * configuration.
         */
        public AutoDetectParser() {
            this(TikaConfig.getDefaultConfig());
        }

        public AutoDetectParser(Detector detector) {
            this(TikaConfig.getDefaultConfig());
            setDetector(detector);
        }

        /**
         * Creates an auto-detecting parser instance using the specified set of parser.
         * This allows one to create a Tika configuration where only a subset of the
         * available parsers have their 3rd party jars included, as otherwise the
         * use of the default TikaConfig will throw various "ClassNotFound" exceptions.
         *
         * @param detector Detector to use
         * @param parsers
         */
        public AutoDetectParser(Parser...parsers) {
            this(new DefaultDetector(), parsers);
        }

        public AutoDetectParser(Detector detector, Parser...parsers) {
            super(MediaTypeRegistry.getDefaultRegistry(), parsers);
            setDetector(detector);
        }

        public AutoDetectParser(TikaConfig config) {
            super(config.getMediaTypeRegistry(), config.getParser());
            setDetector(config.getDetector());
        }

        /**
         * Returns the type detector used by this parser to auto-detect the type
         * of a document.
         *
         * @return type detector
         * @since Apache Tika 0.4
         */
        public Detector getDetector() {
            return detector;
        }

        /**
         * Sets the type detector used by this parser to auto-detect the type
         * of a document.
         *
         * @param detector type detector
         * @since Apache Tika 0.4
         */
        public void setDetector(Detector detector) {
            this.detector = detector;
        }

        public void parse(
                InputStream stream, ContentHandler handler,
                Metadata metadata, ParseContext context)
                throws IOException, SAXException, TikaException {
            TemporaryResources tmp = new TemporaryResources();
            try {
                TikaInputStream tis = TikaInputStream.get(stream, tmp);

                // Automatically detect the MIME type of the document
                MediaType type = detector.detect(tis, metadata);
                metadata.set(Metadata.CONTENT_TYPE, type.toString());

                // TIKA-216: Zip bomb prevention
                SecureContentHandler sch = new SecureContentHandler(handler, tis);
                try {
                    // Parse the document
                    super.parse(tis, sch, metadata, context);
                } catch (SAXException e) {
                    // Convert zip bomb exceptions to TikaExceptions
                    sch.throwIfCauseOf(e);
                    throw e;
                }
            } finally {
                tmp.dispose();
            }
        }

        public void parse(
                InputStream stream, ContentHandler handler, Metadata metadata)
                throws IOException, SAXException, TikaException {
            ParseContext context = new ParseContext();
            context.set(Parser.class, this);
            parse(stream, handler, metadata, context);
        }

    }

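This class also initializes the base class's MediaTypeRegistry registry and List<Parser> parsers members, but here the List<Parser> parsers member is supplied by the TikaConfig class, whose default Parser implementation is DefaultParser.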

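Its void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) method first detects the document's MIME type and records it in the metadata, then delegates the actual parsing to the CompositeParser base class. AutoDetectParser itself is the entry point exposed to callers.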
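The "detect first, then dispatch on the recorded type" sequence can be sketched with the standard library alone. This is a toy model under stated assumptions: the detector only recognises the `%PDF` signature, a `Map` stands in for Metadata, and parser names are plain strings. None of it is Tika's detection logic.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class DetectThenDelegateSketch {
    // Toy detector: peeks at the first bytes, then rewinds so that a parser
    // would still see the whole stream. Requires a stream with mark support.
    static String detect(InputStream in) throws IOException {
        byte[] magic = new byte[4];
        in.mark(magic.length);
        int n = in.read(magic);
        in.reset();
        boolean pdf = n == 4
                && new String(magic, StandardCharsets.US_ASCII).equals("%PDF");
        return pdf ? "application/pdf" : "application/octet-stream";
    }

    // Step 1: detect the type and record it in the metadata map.
    // Step 2: choose a parser by looking that recorded type up.
    static String parse(InputStream raw, Map<String, String> metadata,
                        Map<String, String> parsersByType) {
        InputStream in = new BufferedInputStream(raw); // guarantees mark/reset
        String type;
        try {
            type = detect(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        metadata.put("Content-Type", type);
        String parser = parsersByType.get(metadata.get("Content-Type"));
        return parser != null ? parser : "EmptyParser";
    }

    public static void main(String[] args) {
        Map<String, String> parsers = new HashMap<>();
        parsers.put("application/pdf", "PdfParser");
        Map<String, String> metadata = new HashMap<>();
        byte[] doc = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(parse(new ByteArrayInputStream(doc), metadata, parsers));
        System.out.println(metadata.get("Content-Type"));
    }
}
```

The point of writing the detected type into the metadata is that the dispatch step needs no knowledge of detection; it only reads the content type, exactly as it would if the caller had supplied one.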

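So the overall parsing flow is:

1. AutoDetectParser's parse method: it first detects the file's MIME type, then hands the parsing task to the parse method of its base class, CompositeParser.

2. The parse method of CompositeParser, as the base class of AutoDetectParser: using the MIME type carried in the metadata argument, it obtains the parser DefaultParser (which supports every registered MIME type, as provided by its List<Parser> parsers member).

3. DefaultParser's parse method is called. DefaultParser runs the inherited method, so this is CompositeParser's parse method again, which uses the MIME type to obtain the concrete parser.

4. Finally, the concrete parser's parse method is executed.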

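In the first pass, AutoDetectParser initializes its base class CompositeParser with a component collection consisting of DefaultParser, so CompositeParser's parse method delegates to DefaultParser.

In the second pass, DefaultParser initializes its base class CompositeParser with the collection of concrete parser implementations, so CompositeParser's parse method delegates to a concrete parser.

This is the Composite pattern at work.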
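The two-level hand-off can be traced with a miniature model. In the sketch below, `MiniParser`, `Leaf` and `Composite` are hypothetical stand-ins, and the labels "autodetect" and "default" merely name the two composite levels; a trace list records which object handled the call at each step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class DelegationChainSketch {
    // Hypothetical miniature of the Parser contract.
    interface MiniParser {
        Set<String> types();
        void parse(String type, List<String> trace);
    }

    // Concrete parser: handles exactly one type and ends the chain.
    static final class Leaf implements MiniParser {
        private final String name;
        private final String type;

        Leaf(String name, String type) {
            this.name = name;
            this.type = type;
        }

        public Set<String> types() {
            return Collections.singleton(type);
        }

        public void parse(String type, List<String> trace) {
            trace.add(name);
        }
    }

    // Composite: records itself, then forwards to the child claiming the type.
    static final class Composite implements MiniParser {
        private final String name;
        private final List<MiniParser> children;

        Composite(String name, MiniParser... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }

        public Set<String> types() {
            Set<String> all = new LinkedHashSet<>();
            for (MiniParser child : children) {
                all.addAll(child.types());
            }
            return all;
        }

        public void parse(String type, List<String> trace) {
            trace.add(name);
            MiniParser match = null;
            for (MiniParser child : children) {
                if (child.types().contains(type)) {
                    match = child; // the last child claiming the type wins
                }
            }
            if (match != null) {
                match.parse(type, trace);
            }
        }
    }

    // Builds outer -> inner -> leaves and returns the path taken for a type.
    static List<String> run(String type) {
        MiniParser inner = new Composite("default",
                new Leaf("pdf", "application/pdf"), new Leaf("html", "text/html"));
        MiniParser outer = new Composite("autodetect", inner);
        List<String> trace = new ArrayList<>();
        outer.parse(type, trace);
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(run("text/html")); // [autodetect, default, html]
    }
}
```

The same parse logic runs at both composite levels; only the contents of the child list differ, which is the essence of the two passes described above.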
