PDF Crawler and Engine configuration
About WorkZone PDF Crawler and Engine
The WorkZone PDF module is used to convert existing documents to PDF documents. The WorkZone PDF module consists of two sub-modules:
- WorkZone PDF Engine: A stateless web service that performs real-time conversions of files to PDF documents.
- WorkZone PDF Crawler: Used for deferred, asynchronous conversions of selected documents, either manually or with policies.
If an organization uses WorkZone PDF with policies, WorkZone PDF Crawler searches for documents that have not been converted and that match one of the defined policies. When a document that matches a policy is found, it is converted and then saved back to the WorkZone Content Server as a PDF document.
For more detailed information, see WorkZone PDF Administrator Guide.
Configure WorkZone PDF
- On the start page, click PDF.
- Select the Engine configuration or Crawler configuration tab depending on your needs.
- Apply your changes.
Engine configuration and crawler configuration parameters
Field | Description | Notes |
---|---|---|
Conversion | | |
Bypass page bounds verification for files | Specify which file types are to be bypassed when the boundaries of objects in the files are checked during PDF conversion. The default file extension is PDF, which means that the check for out-of-bounds objects in PDF documents is bypassed. | |
Convert documents with their attachments | Enables conversion of document attachments together with the original document. | |
Suppress content that exceeds page bounds | If deactivated, the document is converted even if the content exceeds the bounds of a document. If enabled, the content fails to convert. Note: Documents with the states UL (Locked), ARK (Archived), and AFS (Closed) are not checked for content that is out of bounds, even if the toggle button is enabled. Documents with these states are locked in their final state and disregard content that exceeds the page bounds. | |
Document processing retries | Specifies how many times WorkZone PDF Crawler tries to convert a document. | This setting is only available for WorkZone PDF Crawler. |
Document processing time-out | Specifies a time-out after which WorkZone PDF Crawler stops waiting for the conversion to finish. If the conversion is not finished within the specified time-out, an error message is written to the dvs_render_info table. | This setting is only available for WorkZone PDF Crawler. |
For PDF documents | | |
PDF forms: | Select whether to include PDF forms as regular content or to show them as review edits when converting PDF documents. Flatten: PDF forms are included in the final document as regular content (text and images). Show: PDF forms are shown in the final document as review edits (and remain editable). | |
Annotations: | Select whether to include annotations as regular content, show them as edits, or hide them completely when converting PDF documents. Flatten: Annotations are included in the final document as regular content (text and images). Show: Annotations are shown in the final document as review edits. Hide: Annotations are hidden (excluded) from the final document. | |
For Word documents | | |
Show comments | Enable to show comments when converting Word documents. | |
Use document culture for date format fields | If the Use document culture for date format fields setting is disabled, the culture of the first page in the document is applied to all date format fields in the document when generating a PDF rendition of the document. A document with multiple culture settings will generate a PDF document with the same date format for all pages. The Use document culture for date format fields setting is disabled by default. | |
Revisions: | Select whether to accept, show, or reject existing revisions when converting Word documents. | |
For Excel documents | | |
Show comments | Enable to show comments when converting Excel documents. | |
Revisions: | Select whether to accept, show, or reject existing revisions when converting Excel documents. | |
For PowerPoint documents | | |
Show annotations | Enable to show annotations when converting PowerPoint documents. | |
Show notes | Enable to show notes when converting PowerPoint documents. | |
Content settings | | |
Header | Define the content of the header. | Custom header text example: Field code header example: <setting name="Header" serializeAs="String"> <value> {page}</value> </setting> |
Header styles | Define the default style of the header that will be used if the style is not specified in the request. This parameter must be in JSON format. You can specify the formatting of the header, for example, bold or italic, as well as which font to use. All parameters are optional. | Example of a header in bold using Arial as font: { "Bold": true, "Font": "Arial"} |
Watermark | Specify the text to print as a watermark on each page. | |
Watermark styles | Define the default style of the watermark. This parameter must be in JSON format. You can specify color, transparency, and font. All parameters are optional. You specify them as follows: Color: Any valid HTML color string, such as a standard color name (e.g. red), a hex value (e.g. #FF0000), or an RGB color (e.g. 255,0,0). Transparency: Transparency ranges from 0 to 100 where 100 represents full opacity. Font: A font name. Note: You do not need to specify font size. The watermark will be sized to fit the page automatically. | Example of a watermark with red as the font color, medium transparency, and using Verdana as the font: { "Color": "Red", "Transparency": "55", "Font": "Verdana"} |
Footer | Define the content of the footer. You can specify the footer content as normal text or use the following Microsoft field codes: {Title}: The document title. {Date}: The current date based on defined culture settings. {Page}: The current page number in the document. {NumPages}: The total number of pages in the document. | Custom footer text example: <setting name="Footer" serializeAs="String"> <value> 'My Custom Footer'</value> </setting> Field code footer example: <setting name="Footer" serializeAs="String"> <value> " Page {page} of {NumPages}"</value> </setting> |
Footer styles | Define the default style of the footer that will be used if the style is not specified in the request. This parameter must be in JSON format. You can specify the formatting of the footer, for example, bold or italic, as well as which font to use. All parameters are optional. | Example of a footer in bold using Arial as font: { "Bold": true, "Font": "Arial"} |
Output settings | | |
PDF format: | Select the default PDF format that will be used: | |
Compress PDF output | Enable to reduce the size of the PDF output. This parameter is particularly important because large documents may take a long time to download from a server. | |
Optimize PDF output for the web | Enable to optimize the PDF output for the web. This parameter is particularly important for large documents that may take a long time to download from a server. | |
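The style parameters in the table take JSON strings. The following Python sketch shows one way to build such strings before pasting them into the configuration. It is not part of WorkZone: the helper names `make_text_style` and `make_watermark_style` are hypothetical, and only the keys that appear in the documented examples (`Bold`, `Font`, `Color`, `Transparency`) are used.

```python
import json


def make_text_style(bold=None, font=None):
    """Build a header or footer style string. All keys are optional."""
    style = {}
    if bold is not None:
        style["Bold"] = bool(bold)
    if font is not None:
        style["Font"] = font
    return json.dumps(style)


def make_watermark_style(color=None, transparency=None, font=None):
    """Build a watermark style string. All keys are optional."""
    style = {}
    if color is not None:
        style["Color"] = color
    if transparency is not None:
        value = int(transparency)
        # The documented range for Transparency is 0 to 100.
        if not 0 <= value <= 100:
            raise ValueError("Transparency must be between 0 and 100")
        # The documented example passes Transparency as a string.
        style["Transparency"] = str(value)
    if font is not None:
        style["Font"] = font
    return json.dumps(style)


if __name__ == "__main__":
    print(make_text_style(bold=True, font="Arial"))
    print(make_watermark_style(color="Red", transparency=55, font="Verdana"))
```

Generating the string with `json.dumps` avoids hand-written quoting mistakes, which matter here because the parameters must be valid JSON.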
Crawler specific parameters
The parameters below are specific to the configuration of WorkZone PDF Crawler. You can use the parameters to optimize the performance of the Crawler service. The suggested settings should be adjusted to your organization's needs in terms of worker count and document processing interval.
Field | Description | Notes |
---|---|---|
Document processing retries | Specifies how many times WorkZone PDF Crawler tries to convert a document. The parameter is used when the document conversion exceeds the document processing time-out, or when you have installed multiple instances of WorkZone PDF Crawler and two crawlers try to convert the same document at the same time. In this case, WorkZone PDF Crawler may pause and then try to convert the document again. | The default value is 3. |
Document processing timeout | Specifies a time-out for conversion. When the time-out is exceeded, an error message is written to the DVS_RENDER_MESSAGE table. | Default time-out: 5 minutes. |
Maximum batch size | Specify the maximum number of documents that will be fetched for processing in each iteration. Enter a number in the range of 1-100. | Default number: 50. |
Document processing interval | Specify a time interval between each iteration. Enter a time between 00:00:01 and 23:59:59. | Default time interval: 00:00:30. |
Worker thread count | Specify how many worker threads should run simultaneously. This parameter configures the service to start the specified number of high priority threads and the same number of background threads. Enter a number in the range of 1-256. | Default threads: 2. |
Maximum memory threshold (MB) | Specify the maximum expected memory usage in megabytes. If the threshold is exceeded, the Crawler service will execute an additional memory cleanup. Enter a memory usage in the range of 256-32768. | Default memory usage: 8192. |
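The documented ranges can be checked before values are entered in the configuration page. The following Python sketch is an illustration only: the dictionary keys and the `validate` function are hypothetical names, not WorkZone settings, and the ranges and defaults are copied from the table above.

```python
from datetime import timedelta

# Allowed ranges taken from the table (inclusive).
LIMITS = {
    "max_batch_size": (1, 100),
    "worker_thread_count": (1, 256),
    "max_memory_threshold_mb": (256, 32768),
}
INTERVAL_MIN = timedelta(seconds=1)
INTERVAL_MAX = timedelta(hours=23, minutes=59, seconds=59)

# Default values taken from the table.
DEFAULTS = {
    "max_batch_size": 50,
    "worker_thread_count": 2,
    "max_memory_threshold_mb": 8192,
    "processing_interval": "00:00:30",
}


def parse_interval(text):
    """Parse an hh:mm:ss string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def validate(settings):
    """Return a list of problems. An empty list means all values are in range."""
    problems = []
    for name, (low, high) in LIMITS.items():
        value = settings[name]
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside {low}-{high}")
    interval = parse_interval(settings["processing_interval"])
    if not INTERVAL_MIN <= interval <= INTERVAL_MAX:
        problems.append("processing_interval is outside 00:00:01-23:59:59")
    return problems


if __name__ == "__main__":
    print(validate(DEFAULTS))
    print(validate({**DEFAULTS, "max_batch_size": 500}))
```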
Performance optimization
PDF Crawler instances
We recommend that you only use one WorkZone PDF Crawler instance.
Parameter settings
You can use the Worker thread count, Document processing interval, and Maximum batch size parameters to enhance the performance of the WorkZone PDF Crawler service. If your organization processes a high volume of documents daily, such as 40000 documents, we recommend the following setup:
- Document processing interval and Maximum batch size: These parameters define how many documents are selected for processing by WorkZone PDF Crawler during the day. If you use the default values (50 documents every 30 seconds), WorkZone PDF Crawler can process up to 48000 documents during an 8-hour working day. If you need to handle a different number of documents, you can set the parameters according to this formula:
Maximum batch size=Expected number of documents/(Working hours*3600/Document processing interval(in sec))
Examples:
If you want to process 48000 documents in an 8-hour working day, you can set the Document processing interval parameter to 30 seconds and the Maximum batch size parameter to 48000/(8*3600/30)=50.
If you want to process 24000 documents in an 8-hour working day, you can set the Document processing interval parameter to 30 seconds and the Maximum batch size parameter to 24000/(8*3600/30)=25.
You may also change the Document processing interval parameter to 15 seconds, and set the Maximum batch size parameter to 25. This way, WorkZone PDF Crawler can also process 48000 documents in an 8 hour working day, but it fetches new documents from WorkZone Content Server more often. Be aware that setting the Document processing interval parameter to a lower value may cause more load on WorkZone Content Server.
- Worker thread count: Set it to 2, or to less than half of the number of logical processors on the machine, with a minimum of 1. If documents are not processed fast enough, and the CPU and memory on the server where WorkZone PDF Crawler is installed are not fully used, you can increase the Worker thread count parameter slightly.
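The batch size formula above can be checked with a short calculation. The following Python sketch is an illustration, not a WorkZone tool: the function names are hypothetical, rounding up to the next whole document is an assumption, and the result is clamped to the documented 1-100 range for Maximum batch size.

```python
import math


def iterations_per_day(working_hours, interval_seconds):
    """Number of crawler iterations in one working day."""
    return working_hours * 3600 / interval_seconds


def max_batch_size(expected_documents, working_hours, interval_seconds):
    """Apply the formula, round up (an assumption), and clamp to 1-100."""
    size = math.ceil(
        expected_documents / iterations_per_day(working_hours, interval_seconds)
    )
    return min(max(size, 1), 100)


def daily_capacity(batch_size, working_hours, interval_seconds):
    """Upper bound on documents fetched in one working day."""
    return int(batch_size * iterations_per_day(working_hours, interval_seconds))


if __name__ == "__main__":
    print(max_batch_size(48000, 8, 30))  # 48000 documents, 30-second interval
    print(max_batch_size(24000, 8, 30))  # 24000 documents, 30-second interval
    print(max_batch_size(48000, 8, 15))  # 48000 documents, 15-second interval
    print(daily_capacity(50, 8, 30))     # capacity with the default values
```

Because the result is clamped at 100, a target that needs a larger batch size than 100 can only be reached by shortening the Document processing interval, which increases the load on WorkZone Content Server.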