I have a substantial pile of books I’ve read that will injure me in an earthquake. I ought to write perspicacious pithy reviews of them. I could write them on Amazon, but why should Amazon own and profit from my words? I could write them on https://lib.reviews/ “a free, open and not-for-profit platform for reviewing absolutely anything, in any language,” but it seems a bit moribund. Instead I have this web site! Putting book reviews here will ensure they live forever in complete obscurity.
Oh no, not the semantic web again!
A long time ago I simply wrote a definition list in HTML in Blogger with each book title followed by a paragraph underneath. Then the idea of a semantic web came along: the web page should unambiguously tell machines that a chunk of writing is a review of a particular book rather than me advertising some books for sale, or writing about the author. And it should tell the machines it’s a review by skierpage, of a book with a particular title and ISBN, who gives it a rating of 3 out of 5 stars, etc.
Why bother?
Disclaimer: all the semantic web work below is probably irrelevant. If your web page is important according to Google’s PageRank algorithm, then Google will devote AI to figuring out what it says, even if it has no, or incorrect, semantic markup. So most of those making the effort to do this semantic markup are shady SEO (search engine optimization) sites, trying to convince you that if you jump through all these hoops or pay them to do it, then your site on topic X will somehow rise in search results; from utter obscurity on the 20th page of search results to mostly ignored on the 4th page.
hReview microformat
Back in 2011 the leading implementation of this idea for plain web pages was microformats: you probably already have the relevant pieces of text in your human-readable book review, so put additional markup (the ‘M’ in Hypertext Markup Language) around them identifying the bit that’s the rating, the summary, etc. using invisible HTML attributes like class=reviewer, class=rating, class=summary, etc. So I wrote a few reviews using an online tool to generate the necessary HTML, which I pasted into WordPress.
So many schemas
The hReview microformat is still going and supposedly Google still parses it when it crawls web pages. Some big guns of Web 2.0 (Google, Microsoft, Yahoo, and Yandex) came up with their own standard for structured data, similar but different, at the poorly named schema.org: “a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond.” This got more detailed and complicated than microformats: there are separate related schemas for a review by the person skierpage about a book authored by another person. And there are three ways you can put the machine-readable information into your web pages (two too many!).
Google provides a structured data markup helper to guide me in creating this markup, and then its structured data testing tool to see if I got it right. (There was another schema generator at tools.seochat.com now defunct, and other checkers at linter.structured-data.org/ , https://jsonschemalint.com/ , etc.) If you choose to put invisible markup in the page surrounding the text of your review (schema.org calls this “microdata,” different from “microformat”), the HTML looks something like:
<!-- Microdata markup added by Google Structured Data Markup Helper. -->
<div itemscope itemtype="https://schema.org/Book" id="hreview-Sprawling,-very-good!">
  <meta itemprop="isbn" content="03-5091234-034">
  <meta itemprop="genre" content="Science Fiction">
  <meta itemprop="datePublished" content="2017-06-04">
  <h3>Sprawling, very good!</h3>
  <p>
    <img itemprop="image" class="photo" src="https://ecx.images-amazon.com/images/I/51Gvu3UlqGL.jpg" width="167" height="250" alt="cover of 'River of Gods'" align="left" style="margin-right: 1em"/>
  </p>
  <div class="item">
    <a title="paperback at Amazon" href="https://www.amazon.com/River-Gods-Ian-McDonald/dp/1591025958" class="fn url">
      <span itemprop="name">River of Gods</span>
    </a>
    by
    <a href="https://en.wikipedia.org/wiki/Ian_McDonald_%28British_author%29">
      <span itemprop="author" itemscope itemtype="http://schema.org/Person">
        <span itemprop="name">Ian McDonald</span>
      </span>
    </a>
  </div>
  <p itemprop="review" itemscope itemtype="https://schema.org/Review" class="description">
    <abbr itemprop="reviewRating" itemscope itemtype="https://schema.org/Rating" class="rating" title="4">
      <span itemprop="ratingValue">4</span> 5
    </abbr>
    <span itemprop="reviewBody">This does a fantastic job of presenting the foreign culture of ... !</span>
    <meta itemprop="datePublished" content="2007-08-01">
    <span itemprop="author" itemscope itemtype="https://schema.org/Person">
      <meta itemprop="name" content="skierpage">
      <meta itemprop="sameAs" content="https://www.skierpage.com/about/">
    </span>
  </p>
</div>
The problem is, if I copy and paste this complicated HTML into WordPress’s post editor, it throws away much of the HTML markup, for example all the <meta> tags for information I don’t want to display, like <meta itemprop="datePublished" content="2007-08-01">. There are any number of dubious plug-ins to WordPress that support parts of schema.org schemas and want money for a professional version from desperate non-technical web site owners who see their traffic dropping and will clutch at straws hoping to appear higher in Google search results, but I don’t understand what these plug-ins do or don’t do.
Another representation for this structured data is JSON-LD, a completely separate representation of the semantic information that you stick in your web page where the reader never sees it; see A Guide to JSON-LD for Beginners. So maybe just sticking in a block of JSON-LD will work better (a guide to supporting it in WordPress is in the section “Implementing Structured Data Using JSON-LD” in the schema article at torquemag.io). Hmmm…, instead of copying and pasting twice, can I put this inside WordPress myself? Maybe try the Markup (JSON-LD) Structure in schema.org plug-in for WordPress? A wpengine article lists JSON-LD generators, but they’re not much good:
- Webcode.tools has a comprehensive generator tool but its review type is too generic
- Microdatagenerator.org has a markup generator tool but it has nothing for reviews.
- Hall Analysis has created a step-by-step tool but it has nothing for reviews.
Tracking data
The problem with JSON-LD is I have to put the same information into the web page twice, first as HTML to display to human readers, and then again in this invisible block of data. Instead I could use Handlebars or something to spit out both the block of JSON and the HTML. A spreadsheet may be best to track most of this information. It sucks for entering formatted text, but probably OK just for a pithy two-sentence review. Let’s try it!
Generated HTML
Each book review in the spreadsheet should generate both the JSON-LD that web crawlers should read, and a human-readable book review. In the latter, I want things to link to something useful.
The ISBN should probably link to https://en.wikipedia.org/wiki/Special:BookSources/{ISBN}, e.g. https://en.wikipedia.org/wiki/Special:BookSources/0060932902. Or I could accept that Jeff Bezos owns us and have it link to Amazon’s ASIN? Wikipedia’s Special:BookSources above creates a query https://www.amazon.com/s?k=0060932902; note that the dashes are removed in the query, otherwise it doesn’t work. Spam-filled https://kindlepreneur.com/amazon-search-url-isbn-ref/ says you can use a 10-digit ISBN in place of an ASIN, e.g. https://www.amazon.com/dp/0060932902, but you still have to remove the dashes.
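Here is a minimal sketch of how the script could build these links from an ISBN cell. The function names and the dictionary keys are my own invention for illustration, and the hyphen placement in the example ISBN is arbitrary; the only facts it relies on are the URL shapes quoted above.

```python
import re


def normalize_isbn(isbn: str) -> str:
    """Strip dashes and whitespace so the ISBN works inside a URL."""
    return re.sub(r"[\s-]", "", isbn)


def book_links(isbn: str) -> dict:
    """Build candidate links for a book from its ISBN (hypothetical helper)."""
    bare = normalize_isbn(isbn)
    links = {
        "booksources": f"https://en.wikipedia.org/wiki/Special:BookSources/{bare}",
        "amazon_search": f"https://www.amazon.com/s?k={bare}",
    }
    # Only a 10-character ISBN can stand in for an ASIN in a /dp/ URL.
    if len(bare) == 10:
        links["amazon_dp"] = f"https://www.amazon.com/dp/{bare}"
    return links


print(book_links("0-06-093290-2"))
```

A 13-digit ISBN simply gets no "amazon_dp" entry, so the template can fall back to the BookSources link.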
For the cover, sometimes you can link to a cover image on English Wikipedia or Wikimedia Commons. You can mess around with an Amazon image URL, but for some reason images on ecx.images-amazon.com can’t be accessed using https; Firefox complains about “SSL_ERROR_BAD_CERT_DOMAIN.” The Internet Archive runs (hosts?) the Open Library Covers Repository.
Other items in the review, like the author name and book title, should link to Wikipedia pages if available. There’s no easy way to know that Ian McDonald’s English Wikipedia page is at https://en.wikipedia.org/wiki/Ian_McDonald_(British_author), so the spreadsheet needs to have columns for Author URL and Book URL. (The alternative would be to store the Wikidata ‘Q’ numbers for each of these and work backwards from the wikidata info to the English Wikipedia pages, if any, for them.)
Coding it
Uh, scripting… Python? I quickly found a library, pyexcel-ods, to read a spreadsheet, and everyone seems to use jinja2 for HTML templating in Python. Adding these libraries means dealing with all the ways to manage the Python libraries in a project; I have used pip and virtualenv in the past, but now teh hotness is pipenv, so install that and then add pyexcel-ods and jinja2. I’m rocking! In two hours I’ve read a line of my book reviews spreadsheet and generated some HTML.
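Roughly, reading the spreadsheet comes down to turning rows into dictionaries keyed by the header row. This is a sketch rather than my actual script: rows_to_dicts and load_books are hypothetical names, and padding short rows is a precaution in case trailing empty cells are dropped. The one library call it assumes is pyexcel-ods’s get_data(), which returns a mapping of sheet name to a list of rows.

```python
def rows_to_dicts(rows):
    """Turn [[header...], [values...], ...] into a list of dicts keyed by header."""
    if not rows:
        return []
    header = [str(h).strip() for h in rows[0]]
    books = []
    for row in rows[1:]:
        if not any(str(cell).strip() for cell in row):
            continue  # skip completely blank rows
        # Pad short rows so every header gets a value.
        padded = list(row) + [""] * (len(header) - len(row))
        books.append(dict(zip(header, padded)))
    return books


def load_books(path, sheet_name):
    """Read one sheet of an .ods file into a list of book dicts."""
    # Imported here so rows_to_dicts() can be tried without the library installed.
    from pyexcel_ods import get_data

    return rows_to_dicts(get_data(path)[sheet_name])


sample = [["Name", "Author", "ISBN"], ["River of Gods", "Ian McDonald"], []]
print(rows_to_dicts(sample))
```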
Then I upgraded to Fedora 32, and nothing works because its Python is now at version 3.8, so I have to coerce pipenv to rebuild everything. Guessing what to do, I run pipenv check and it tells me “In order to get an API Key you need a monthly subscription on pyup.io, starting at $14.99”. Guess I won’t run that command then.
HTML generation
For now my script plus template just generates a big HTML file of every book review in the spreadsheet. I’ll want to create blog posts about related books, such as “Interesting science”, which means selecting a few chunks from the generated HTML and pasting them into WordPress. WordPress accepts HTML but really wants you to use its Gutenberg WYSIWYG blog post editor. Fortunately, it seems I can choose Gutenberg’s “Custom HTML” block and paste in all my generated HTML, including <script> tags containing JSON-LD. Finally, something easy! Part of me wants to make the HTML resemble Gutenberg’s blocks for WYSIWYG editing, but in theory I should go back into the spreadsheet to fix any errors.
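A pasteable chunk for the “Custom HTML” block could be produced along these lines. This sketch uses only the standard library (plain f-strings standing in for a jinja2 template), and jsonld_script and review_block are hypothetical names. The one subtle step is escaping any "</" inside the JSON so a stray "</script>" in a review can’t end the element early; "\/" is a legal JSON escape for "/", so the data still parses the same.

```python
import html
import json


def jsonld_script(data: dict) -> str:
    """Wrap a JSON-LD dictionary in a <script> element."""
    payload = json.dumps(data, indent=2, ensure_ascii=False)
    # Keep a literal "</" in the data from closing the script element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


def review_block(title: str, body: str, data: dict) -> str:
    """Human-readable review HTML followed by its machine-readable twin."""
    return (
        f"<h3>{html.escape(title)}</h3>\n"
        f"<p>{html.escape(body)}</p>\n"
        f"{jsonld_script(data)}"
    )


print(review_block("Sprawling, very good!", "A nice cover & more.", {"@type": "Book", "name": "River of Gods"}))
```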
Designing the JSON-LD
JSON (JavaScript Object Notation) is a simple file and data format to represent data. JSON-LD takes this and makes it slightly more complicated to represent Linked Data: on this Web page a person authored this review of a book which has its own author, another person(s). The details quickly degenerate into semantic triples, contexts, more three-letter acronyms like RDF, etc.
A person, a name, a friend-of-a-friend
Schema.org has fairly simple examples of JSON-LD for a review, but they leave it unclear if just writing "author": "skierpage" is enough for computers to figure out that the person writing the review is the person who runs this web site.
Update 2021-11: Google Search Console has started objecting to a plain "author": "skierpage", now complaining:
‘Review snippets issues detected … Invalid object type for field “author”’
So it seems I must go to a more complicated nested structure for myself:
"author": [
  {
    "@type": "Person",
    "name": "skierpage",
    // Somehow point to some of the existing info about me all over my site!
  }
],
Identifying a person on the web has been a concern for almost two decades. I cobbled together a FOAF (friend-of-a-friend) record for myself and my public key in 2012 back when it seemed you could tell the world “I’m skierpage dammit, use my web site to prove it” using OpenID, Persona, etc. to authenticate yourself on every web site instead of having to screw around with a hundred usernames and logins. Like most other initiatives going up against incumbent all-powerful social networks, it all mostly died, so people were forced to give up and log in using Facebook or Google+ to provide Mark Fuckerberg with yet more information about their unrelated activities for no good reason. Should I update to hReview hCard? A WebID? An instance on a pod running on Tim Berners-Lee’s dream of a better Web, “Solid”? Arghhh.
It’s safest to use the Person schema information from the same schema.org that defines a Review. But I don’t want to have to duplicate my Person info as the author info in each review on each page. I should be able to point the author in all my reviews to a single Person chunk of data, which I’ve created at https://www.skierpage.com/people/skierpage/. Tantalizingly, the Review schema says “Please note that author is special in that HTML 5 provides a special mechanism for indicating authorship via the rel tag. That is equivalent to this and may be used interchangeably.” But in my experiments with Google’s Rich Results Test, Google ignores an <a rel="author"> link to this chunk in the HTML of the page, and complains that author is missing from the Review. So it seems I must put all of
"author": {
  "@type": "Person",
  "name": "skierpage",
  "@id": "https://www.skierpage.com/people/skierpage/#person",
  "url": "https://www.skierpage.com/people/skierpage/"
},
into each Review. I can’t even leave out "name", and it’s hella confusing whether the URL should be the web page or an identifier for me with a dummy #person hash fragment on the end, or whether I should include both "url" and "@id". The page is not the thing it describes. It seems even if the page with a Review passes Google’s test, Google doesn’t bother looking up my Person info anyway!
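Since the same Person chunk has to go into every Review, the generating script can at least keep it in one place. A small sketch, with author_person and with_author as hypothetical helper names; it includes both "url" and "@id", which is a guess on my part rather than something Google has confirmed it wants.

```python
PERSON_PAGE = "https://www.skierpage.com/people/skierpage/"


def author_person(name="skierpage", page=PERSON_PAGE):
    """The Person object repeated in every Review."""
    return {
        "@type": "Person",
        "name": name,
        # The identifier for the person, as distinct from the page about them.
        "@id": page + "#person",
        "url": page,
    }


def with_author(review: dict) -> dict:
    """Return a copy of a Review dict with the shared author attached."""
    return {**review, "author": author_person()}


print(with_author({"@type": "Review", "reviewBody": "The book has a nice cover."}))
```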
A graph of reviews, a person with lots of reviews, a list of products with reviews ??
To have multiple book reviews on a web page, you can output a separate JSON-LD <script> block along with each review’s chunk of HTML. This results in a lot of duplication of the reviewer (me) in the page. There are much fancier ways to organize this: you can output a single JSON-LD block containing all the reviews by putting them into a top-level “@graph” object, which isn’t mentioned on schema.org but is part of JSON-LD (or maybe use schema.org’s ItemList… when you’re designing a set of linked objects there’s always more than one way to do it). What’s unclear is if the JSON-LD should have a graph of books, each with a single review, or a graph of reviews, each of a single itemReviewed that’s a book:
{
  "@context": "http://schema.org/",
  "@graph": [{
    "@type": "Review",
    "author": "skierpage",
    "datePublished": "2011-04-01",
    "reviewBody": "The book has a nice cover.",
    "itemReviewed": {
      "@type": "Book",
      "name": "River of Gods",
      "isbn": "03-5091234-0344",
      "author": "Ian McDonald"
    },
    "reviewRating": {
      "@type": "Rating",
      "ratingValue": 4,
      "worstRating": 1,
      "bestRating": 5
    }
  },
  {
    ... another review
  }]
}
Google’s Rich Results Test doesn’t like the above; it complains the review is missing a description, publisher, and url. Isn’t this all obvious from the web page?
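For the graph-of-reviews shape, the Python side might look something like this. It is only a sketch: review_node and reviews_graph are hypothetical names, the reviewer is written as a nested Person rather than a plain string, and it makes no attempt to add the description, publisher, or url that the Rich Results Test asks for.

```python
import json


def review_node(book, body, rating, date_published):
    """One Review node whose itemReviewed is a Book."""
    return {
        "@type": "Review",
        "author": {"@type": "Person", "name": "skierpage"},
        "datePublished": date_published,
        "reviewBody": body,
        "itemReviewed": {
            "@type": "Book",
            "name": book["Name"],
            "isbn": book["ISBN"],
            "author": book["Author"],
        },
        "reviewRating": {
            "@type": "Rating",
            "ratingValue": rating,
            "worstRating": 1,
            "bestRating": 5,
        },
    }


def reviews_graph(nodes):
    """Collect Review nodes under a single top-level @graph."""
    return {"@context": "https://schema.org/", "@graph": list(nodes)}


book = {"Name": "River of Gods", "ISBN": "03-5091234-0344", "Author": "Ian McDonald"}
graph = reviews_graph([review_node(book, "The book has a nice cover.", 4, "2011-04-01")])
print(json.dumps(graph, indent=2))
```

Flipping this into a graph of books, each carrying its own review, would only mean swapping which dictionary nests inside the other.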
Maybe I don’t need author: https://schema.org/Review says “Please note that author is special in that HTML 5 provides a special mechanism for indicating authorship via the rel tag. That is equivalent to this and may be used interchangeably.” However, WordPress doesn’t add rel="author" to its Posted by skierpage link.
Actually writing out the JSON-LD
There is a fancy pyld Python module that outputs JSON-LD but I’m not clear what it does over simply printing json.dumps(reviewJSON). So I just build up reviewJSON as a Python dictionary object:
reviewJSON = {
    "@context": "https://schema.org",
    "@type": "Book",
    "author": bookDict["Author"],
    "isbn": bookDict["ISBN"],
    "name": bookDict["Name"],
    "review": {
        "@type": "Review",
        "author": "skierpage",  ## TODO: can this be derived/inferred from the page?
        "datePublished": TODAY,  ## TODO: can this be derived/inferred from the page?
        ...
https://json-ld.org/playground/ lets you test the generated markup.
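For something concrete to paste into the playground, here is a self-contained sketch of the book-with-nested-review shape. It is not my actual script: book_with_review is a hypothetical name, the reviewBody and reviewRating fields and the default date are assumptions, and only the "Author", "ISBN", and "Name" keys come from the dictionary above.

```python
import datetime
import json


def book_with_review(book_dict, review_body, rating, today=None):
    """Build a Book with one nested Review, ready for json.dumps()."""
    today = today or datetime.date.today().isoformat()
    data = {
        "@context": "https://schema.org",
        "@type": "Book",
        "author": book_dict.get("Author"),
        "isbn": book_dict.get("ISBN"),
        "name": book_dict.get("Name"),
        "review": {
            "@type": "Review",
            "author": {"@type": "Person", "name": "skierpage"},
            "datePublished": today,
            "reviewBody": review_body,
            "reviewRating": {"@type": "Rating", "ratingValue": rating, "bestRating": 5},
        },
    }
    # Drop top-level keys whose spreadsheet cell was empty.
    return {k: v for k, v in data.items() if v not in (None, "")}


row = {"Name": "River of Gods", "Author": "Ian McDonald", "ISBN": ""}
print(json.dumps(book_with_review(row, "Sprawling, very good!", 4), indent=2))
```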
Summary: in early 2021 I got this pretty much working! E.g. View > Source of William Gibson flashes of excellence. I don’t think I’ll bother going back to re-publish older book reviews I hand-edited using hReview markup, e.g. bad SF.