January 9, 2023

Cleaning up the Allwinner H3 datasheet

I have always been annoyed by the big grey "confidential" watermark pasted across every page of the 614-page datasheet for the Allwinner H3. It is silly, given that the datasheet is available at a multitude of locations on the internet.

Remove it using pdftk

This fellow describes how he uses pdftk to remove watermarks.

Get PDFTK

The first trick is to obtain pdftk. On my Fedora system I see "pdftk-java.noarch" available as a package. This is a port of the original C++ code to Java, and the reasonable thing is to install that and see if it will do the job. The "real thing" is "pdftk-server". They offer RPM files for Red Hat Enterprise Linux and CentOS (and also a source RPM). I will try the x86_64 RPM for RHEL-6. Downloading it gives me:
pdftk-2.02-1.el6.x86_64.rpm
It needs something called "libgcj", and there is no such thing in the Fedora package set. Apparently "gcj" is the GNU Compiler for Java. This seems strange if pdftk is coded in C++. After some searching, I learn that GCJ was abandoned around 2017.

I take the easy route, delete this RPM, and do:

dnf install pdftk-java
(1/2): pdftk-java-3.3.3-1.fc37.noarch.rpm       1.0 MB/s | 977 kB     00:00
(2/2): bouncycastle-1.70-6.fc37.noarch.rpm      3.1 MB/s | 4.4 MB     00:01
It pulls in a cute dependency "bouncycastle", which is apparently a cryptography API.
Now I can just type "pdftk" to run this. So far so good.

Give it a try

mkdir Clean
cd Clean
cp ../datasheet.pdf .
pdftk datasheet.pdf output xyz.pdf uncompress
The file "xyz.pdf" is another PDF file, but much bigger (41M instead of 7M). And, by golly, I can just open it using "vim". The fellow (Tyler Davis) says that from here it is a trial-and-error process of finding and deleting the PDF "obj" objects that contain the watermark.

I search for "confid" and find it in many places. Another useful string to search for is "atermark". As an example I find and delete this block of stuff:

1044 0 obj
<<
/LastModified (D:20150528172744+08'00')
/OC 2532 0 R
/Subtype /Form
/Matrix [1.0 0.0 0.0 1.0 0.0 0.0]
/PieceInfo
<<
/ADBE_CompoundType
<<
/LastModified (D:20150528172744+08'00')
/DocSettings 2533 0 R
/Private /Watermark
>>
>>
/Resources 2534 0 R
/BBox [0.0 -24.0 115.958 2.40015]
/Length 119
>>
stream
0 g 0 G 0 i 0 J []0 d 0 j 1 w 10 M 0 Tc 0 Tw 100 Tz 0 TL 0 Tr 0 Ts
BT
/Calibri 24 Tf
0 g
0 -18 Td
(confidential) Tj
ET

endstream
endobj
I merrily delete it and save the result as new.pdf. Interestingly, this removes the watermark from chapters 1, 2, and 3, but not from the rest of the chapters, and not from the table of contents.
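Before deleting anything else, it can help to see which objects still carry the watermark text. Here is a minimal sketch of how that search could be scripted in Python. It assumes the uncompressed file lays out each object the way the block above does, with "N G obj" and "endobj" each starting a line. The names OBJ_RE and find_marked_objects are made up for this illustration.

```python
import re

# One "N G obj ... endobj" block, as laid out in a pdftk-uncompressed file.
# Assumption: the "obj" and "endobj" keywords each begin a line.
OBJ_RE = re.compile(rb"^(\d+) (\d+) obj\b.*?^endobj\b", re.DOTALL | re.MULTILINE)


def find_marked_objects(data, marker=b"onfiden"):
    """Return the object numbers of every obj block that contains marker."""
    return [int(m.group(1)) for m in OBJ_RE.finditer(data) if marker in m.group(0)]


def find_in_file(path, marker=b"onfiden"):
    """Same search, reading the PDF from disk as raw bytes."""
    with open(path, "rb") as f:
        return find_marked_objects(f.read(), marker)
```

Calling find_in_file("new.pdf") would list the object numbers that still need attention. The file is read as bytes because the uncompressed PDF can still contain binary stream data.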

I continue searching for the string "onfiden" and deleting the entire "obj" block that encloses it. (The block runs from the line with "obj" to the line with "endobj".) There seem to be dozens of these, but I delete all that I can find.
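With dozens of blocks to delete, the hand editing could plausibly be replaced by a short script. The following is only a sketch under the same layout assumption ("N G obj" and "endobj" each starting a line), not the method described above. The function names are hypothetical. It deletes every object that contains the marker, so a page whose real text includes the word "confidential" would be removed too, and it is worth reviewing what matches first.

```python
import re

# A whole object block, including the rest of the "endobj" line.
# Assumption: "N G obj" and "endobj" each begin a line in the uncompressed file.
OBJ_RE = re.compile(rb"^\d+ \d+ obj\b.*?^endobj\b[^\n]*\n?", re.DOTALL | re.MULTILINE)


def strip_marked_objects(data, marker=b"onfiden"):
    """Delete every obj...endobj block containing marker; keep all other bytes."""
    return OBJ_RE.sub(lambda m: b"" if marker in m.group(0) else m.group(0), data)


def clean_file(src, dst, marker=b"onfiden"):
    """Read src as raw bytes, strip the marked objects, and write dst."""
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(strip_marked_objects(data, marker))
```

For example, clean_file("xyz.pdf", "new.pdf") would produce the edited file in one step. Like the hand edit, this leaves the PDF's cross-reference table out of date, so the result should still be passed through pdftk afterwards.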

And voila! I seem to have a clean document!! It is still big (41M), but perhaps I can use pdftk to compress as well as uncompress. I try:

pdftk new.pdf output clean.pdf compress
It takes quite a while (almost a minute) but it works!! I see:
-rw-r--r-- 1 tom tom 17525531 Jan  9 13:27 clean.pdf
-rw-r--r-- 1 tom tom  7408830 Jan  9 12:58 datasheet.pdf
-rw-r--r-- 1 tom tom 41047374 Jan  9 13:22 new.pdf
-rw-r--r-- 1 tom tom 41065029 Jan  9 13:02 xyz.pdf
So the cleaned-up document is 17M (as compared to the original 7M), but an extra 10M is a small price to pay not to be endlessly annoyed by the doggone watermark.


Have any comments? Questions? Drop me a line!

Tom's electronics pages / tom@mmto.org