Commit d2a0c72

Trimmed CLI version of HTTPreserve Workbench
1 parent 47c99b7 commit d2a0c72

13 files changed: 846 additions and 2 deletions

README.md

Lines changed: 133 additions & 2 deletions
@@ -1,2 +1,133 @@
1-
# linktest
2-
Pure CLI implementation of httpreserve
1+
<div>
2+
<p align="center">
3+
<img id="logo" src="https://github.com/httpreserve/httpreserve/raw/master/src/images/httpreserve-logo.png" alt="httpreserve"/>
4+
</p>
5+
</div>
6+
7+
# linkstat
8+
9+
CLI implementation of httpreserve that can test links and retrieve Internet
10+
Archive replacements. The tool can output the result of individual links, or
11+
take a CSV list to output collected information in JSON, BoltDB, or CSV format.
12+
13+
## Usage
14+
```bash
15+
Usage: linkstat [Optional -link] [Optional -label]
16+
[Optional -list] [Optional -json]
17+
[Optional -bolt]
18+
[Optional -csv]
19+
[Optional -version -v]
20+
21+
Output: [Json]
22+
Output: [CSV]
23+
Output: [BoltDB]
24+
Output: [Version] 'exponentialDK-httpreserve/0.0.9 ...'
25+
26+
Usage of ./linkstat:
27+
-bolt
28+
Output to static BoltDB.
29+
-csv
30+
Output to CSV.
31+
-json
32+
Output to JSON.
33+
-label string
34+
Annotate single URL check response with label.
35+
-link string
36+
Seek the status of a single URL: JSON
37+
-list string
38+
Provide a list of URLs to test against in CSV format.
39+
-v Return httpreserve version.
40+
-version
41+
Return httpreserve version.
42+
```
43+
44+
## Examples
45+
46+
#### Example combining [tikalinkextract][httpreserve-1]
47+
48+
Inspired by Harvard Innovation Labs, this example was first used to test the
49+
capabilities of httpreserve-workbench. This CLI version is a simplification of
50+
that work but should still produce useful results. See the HTTPreserve
51+
[Million Dollar Webpage Project][httpreserve-2].
52+
53+
[httpreserve-1]: https://github.com/httpreserve/tikalinkextract
54+
[httpreserve-2]: https://github.com/httpreserve/million-dollar-webpage
55+
56+
#### CSV input
57+
58+
An input CSV `example.csv` might look as follows:
59+
```csv
60+
"BBC News", "http://www.bbc.co.uk/news"
61+
"BBC Home", "http://www.bbc.co.uk/"
62+
"BBC Radio", "http://www.bbc.co.uk/radio"
63+
"Google", "http://www.google.com"
64+
"exponentialdecay.co.uk", "http://www.exponentialdecay.co.uk"
65+
"Internet Archive", "http://www.archive.org"
66+
"perma.cc", "http://perma.cc"
67+
"wikipedia.org", "http://wikipedia.org"
68+
"The Million Dollar Homepage", "http://www.getpixel.net"
69+
```
70+
71+
To output a CSV collecting all of the linkstat results, you can run a command
72+
as follows:
73+
```bash
74+
$ ./linkstat -csv --list example.csv > output.csv
75+
```
76+
77+
And the output looks as follows:
78+
```
79+
"id","filename","link","response code","response text","title","content-type","archived","internet archive response code","internet archive response text","wayback earliest date","internet archive earliest","wayback latest date","internet archive latest","internet archive save link","protocol error","protocol error","analysis version number","analysis version text","stats creation time"
80+
"1651a00b16a12ba06fc6c6b049c7cf7c","BBC News","https://www.bbc.co.uk/news","200","OK","home - bbc news","text/html;charset=utf-8","true","302","Found","09 October 1997","http://web.archive.org/web/19971009011901/http://www.bbc.co.uk/news/","19 March 2019","http://web.archive.org/web/20190319173721/https://www.bbc.co.uk/news","http://web.archive.org/save/https://www.bbc.co.uk/news","","","0.0.9","exponentialDK-httpreserve/0.0.9","1.574649021s"
81+
"57ab6349a47b53b982a939fb1da54fef","BBC Radio","https://www.bbc.co.uk/sounds","200","OK","bbc sounds - music. radio. podcasts","text/html; charset=utf-8","true","302","Found","19 March 2008","http://web.archive.org/web/20080319074038/http://www.bbc.co.uk/sounds","18 March 2019","http://web.archive.org/web/20190318211158/https://www.bbc.co.uk/sounds","http://web.archive.org/save/https://www.bbc.co.uk/sounds","","","0.0.9","exponentialDK-httpreserve/0.0.9","1.660729358s"
82+
"c85da5e372ffe2200e46527b74537ba3","BBC Home","https://www.bbc.co.uk/","200","OK","bbc - home","text/html; charset=utf-8","true","302","Found","21 December 1996","http://web.archive.org/web/19961221203254/http://www0.bbc.co.uk/","19 March 2019","http://web.archive.org/web/20190319141018/https://www.bbc.co.uk/","http://web.archive.org/save/https://www.bbc.co.uk/","","","0.0.9","exponentialDK-httpreserve/0.0.9","1.95442772s"
83+
"b3bd672c1014e07e87ef4a357a161528","exponentialdecay.co.uk","http://www.exponentialdecay.co.uk","206","Partial Content","ross spencer, digital preservation, archives, python developer, golang developer, uk, nz","text/html","true","302","Found","17 September 2008","http://web.archive.org/web/20080917054811/http://www.exponentialdecay.co.uk/","13 November 2018","http://web.archive.org/web/20181113021338/http://exponentialdecay.co.uk/","http://web.archive.org/save/http://www.exponentialdecay.co.uk","","","0.0.9","exponentialDK-httpreserve/0.0.9","425.368183ms"
84+
```
85+
86+
#### An individual link
87+
88+
The command: `./linkstat -link https://github.com/ -label "GitHub"` will
89+
output:
90+
```json
91+
{
92+
"FileName": "GitHub",
93+
"AnalysisVersionNumber": "0.0.9",
94+
"AnalysisVersionText": "exponentialDK-httpreserve/0.0.9",
95+
"SimpleRequestVersion": "httpreserve-simplerequest/0.0.4",
96+
"Link": "https://github.com/",
97+
"Title": "the world’s leading software development platform · github",
98+
"ContentType": "text/html; charset=utf-8",
99+
"ResponseCode": 200,
100+
"ResponseText": "OK",
101+
"ScreenShot": "snapshots are not currently enabled",
102+
"InternetArchiveLinkLatest": "http://web.archive.org/web/20190319223453/https://github.com/",
103+
"InternetArchiveLinkEarliest": "http://web.archive.org/web/20080514210148/http://github.com/",
104+
"InternetArchiveSaveLink": "http://web.archive.org/save/https://github.com/",
105+
"InternetArchiveResponseCode": 302,
106+
"InternetArchiveResponseText": "Found",
107+
"Archived": true,
108+
"Error": false,
109+
"ErrorMessage": "",
110+
"StatsCreationTime": "4.295493892s"
111+
}
112+
```
113+
114+
## Archiving Weblinks
115+
116+
* [Find and Connect Project:][linkstat-1] Nicola Laurent on the impact of
117+
broken links.
118+
* [Binary Trees? Automatically Identifying the links between born digital records:][linkstat-2]
119+
I write about hyperlinks as a public record in their own right when submitted as part
120+
of a documentary heritage.
121+
* [HiberActive Pilot][linkstat-3] A scholarly publishing tool that extracts
122+
URLs and returns both the original URL and a perma-link.
123+
* [IIPC Awesome List][linkstat-4] A list of web-archiving links that invites
124+
contributions from the community to keep it up-to-date.
125+
126+
[linkstat-1]: http://www.findandconnectwrblog.info/2016/11/broken-links-broken-trust/
127+
[linkstat-2]: https://www.youtube.com/watch?v=Ked9GRmKlRw
128+
[linkstat-3]: https://www.era.lib.ed.ac.uk/handle/1842/23366
129+
[linkstat-4]: https://github.com/iipc/awesome-web-archiving
130+
131+
## License
132+
133+
GNU General Public License Version 3. [Full Text](LICENSE)

bolthandler.go

Lines changed: 173 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,173 @@
1+
package main
2+
3+
import (
4+
"encoding/json"
5+
"fmt"
6+
"github.com/httpreserve/httpreserve"
7+
kval "github.com/kval-access-language/kval-bbolt"
8+
"github.com/speps/go-hashids"
9+
"log"
10+
"os"
11+
"time"
12+
)
13+
14+
// values to use to create hashid
15+
var salt = "httpreserve"
16+
var namelen = 8
17+
18+
// bucket constants
19+
const linkIndex = "link index"
20+
21+
//const fnameIndex = "filename index"
22+
const hashIndex = "hash index"
23+
24+
// location for bolt databases. Names are random at present because
25+
// I'm unsure how they're going to be used in future so may write naming
26+
// functions and flags at a later date.
27+
const boltdir = "db/"
28+
29+
// boltoutput holds the name of the database so it can be reported on stdout
30+
var boltoutput string
31+
32+
// getNewDBName provides three integers based on the time at
33+
// which we run the code to help us create a hashid name for the db.
34+
func getNewDBName() []int {
35+
t := time.Now()
36+
i1 := t.Minute()
37+
i2 := t.Second()
38+
i3 := t.Nanosecond()
39+
return []int{i1, i2, i3}
40+
}
41+
42+
// configureHashID will create a hashids name for our database
43+
func configureHashID() string {
44+
45+
name := getNewDBName()
46+
47+
//hashdata
48+
hd := hashids.NewData()
49+
hd.Salt = salt
50+
hd.MinLength = namelen
51+
52+
//hash
53+
h, _ := hashids.NewWithData(hd)
54+
e, _ := h.Encode(name)
55+
return e
56+
}
57+
58+
// makeIDIndex will write rows to the BoltDB based on an MD5 hash value
59+
// associated with the lmap passed to the function (a deconstructed LinkStats)
60+
func makeIDIndex(kb kval.Kvalboltdb, lmap map[string]interface{}) {
61+
for k, v := range lmap {
62+
_, err := kval.Query(kb, "INS "+convertInterface(lmap["response code"])+">>"+convertInterface(lmap["link"])+" >>>> "+k+" :: "+convertInterface(v))
63+
if err != nil {
64+
fmt.Fprintln(os.Stderr, err)
65+
}
66+
}
67+
}
68+
69+
// makeBoltDir will create a directory for all BoltDB files generated
70+
// if the directory doesn't already exist.
71+
func makeBoltDir() {
72+
if _, err := os.Stat(boltdir); os.IsNotExist(err) {
73+
err := os.Mkdir(boltdir, 0700)
74+
if err != nil {
75+
fmt.Fprintln(os.Stderr, err)
76+
os.Exit(1)
77+
}
78+
}
79+
}
80+
81+
// boltGetResultContainers returns the names of all top level buckets
82+
// n.b. these functions are heavily linked to the database schema and
83+
// could be made more generic with more effort.
84+
func boltGetResultContainers(kb kval.Kvalboltdb) []string {
85+
var buckets []string
86+
q := "GET " + hashIndex
87+
res, _ := kval.Query(kb, q)
88+
for k := range res.Result {
89+
buckets = append(buckets, k)
90+
}
91+
return buckets
92+
}
93+
94+
// boltGetSingleRecord will return a single record for a given md5 key
95+
// n.b. these functions are heavily linked to the database schema and
96+
// could be made more generic with more effort.
97+
func boltGetSingleRecord(kb kval.Kvalboltdb, md5Key string) map[string]string {
98+
records := make(map[string]string)
99+
q := "GET " + hashIndex + " >> " + md5Key
100+
res, _ := kval.Query(kb, q)
101+
for k, v := range res.Result {
102+
records[k] = v
103+
}
104+
return records
105+
}
106+
107+
// boltGetAllRecords returns all records in all top level buckets in the
108+
// database.
109+
// n.b. these functions are heavily linked to the database schema and
110+
// could be made more generic with more effort.
111+
func boltGetAllRecords(kb kval.Kvalboltdb) []map[string]string {
112+
var records []map[string]string
113+
keys := boltGetResultContainers(kb)
114+
for _, v := range keys {
115+
records = append(records, boltGetSingleRecord(kb, v))
116+
}
117+
return records
118+
}
119+
120+
var kb kval.Kvalboltdb
121+
122+
func openKVALBolt() {
123+
var err error
124+
boltname := configureHashID()
125+
makeBoltDir()
126+
127+
boltoutput = boltdir + "HP_" + boltname + ".bolt"
128+
kb, err = kval.Connect(boltoutput)
129+
if err != nil {
130+
fmt.Fprintf(os.Stderr, "Error opening bolt database: %+v\n", err)
131+
os.Exit(1)
132+
}
133+
}
134+
135+
func closeKVALBolt() {
136+
kval.Disconnect(kb)
137+
}
138+
139+
var id []string
140+
141+
// boltdbHandler is the primary handler for writing to a BoltDB
142+
// from our httpreserve results rsets.
143+
func boltdbHandler(js string) {
144+
145+
var ls httpreserve.LinkStats
146+
147+
err := json.Unmarshal([]byte(js), &ls)
148+
if err != nil {
149+
fmt.Fprintln(os.Stderr, "problem unmarshalling data.", err)
// do not store a record built from data that failed to decode
return
150+
}
151+
152+
var add = true
153+
154+
// retrieve a map from the structure and write it out to the
155+
// bolt db.
156+
lmap := storeStruct(ls, js)
157+
if len(lmap) > 0 {
158+
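// the record is written after the duplicate check below, so ids that
// have already been seen are skipped rather than written again.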
159+
160+
lmapid := convertInterface(lmap["id"])
161+
for x := range id {
162+
if lmapid == id[x] {
163+
add = false
164+
log.Println("Already seen:", lmap["filename"], lmap["title"])
165+
break
166+
}
167+
}
168+
if add {
169+
makeIDIndex(kb, lmap)
170+
id = append(id, lmapid)
171+
}
172+
}
173+
}
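
The handler above keeps seen ids in a slice and scans it for every record, which grows more expensive as the list of links gets longer. A minimal sketch of the same check using a Go map as a set is shown below. The `seenSet` type and its `add` method are hypothetical names and are not part of this commit.

```go
package main

import "fmt"

// seenSet is a hypothetical helper that remembers ids already handled.
type seenSet map[string]struct{}

// add records id and reports whether it was new to the set.
func (s seenSet) add(id string) bool {
	if _, ok := s[id]; ok {
		return false
	}
	s[id] = struct{}{}
	return true
}

func main() {
	seen := make(seenSet)
	for _, id := range []string{"1651a00b", "57ab6349", "1651a00b"} {
		if seen.add(id) {
			fmt.Println("store:", id)
		} else {
			fmt.Println("already seen:", id)
		}
	}
}
```

A map lookup replaces the loop over `id`, so the cost of each check no longer depends on how many records have been processed.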

bolthandler_test.go

Lines changed: 44 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,44 @@
1+
package main
2+
3+
/* Example JSON:
4+
{
5+
"FileName": "bbc news",
6+
"AnalysisVersionNumber": "0.0.0",
7+
"AnalysisVersionText": "exponentialDK-httpreserve/0.0.0",
8+
"Link": "http://www.bbc.co.uk/news",
9+
"ResponseCode": 200,
10+
"ResponseText": "OK",
11+
"ScreenShot": "",
12+
"InternetArchiveLinkLatest": "http://web.archive.org/web/20170328040059/http://www.bbc.co.uk/news/",
13+
"InternetArchiveLinkEarliest": "http://web.archive.org/web/19971009011901/http://www.bbc.co.uk/news/",
14+
"InternetArchiveSaveLink": "http://web.archive.org/save/http://www.bbc.co.uk/news",
15+
"InternetArchiveResponseCode": 200,
16+
"InternetArchiveResponseText": "OK",
17+
"Archived": true,
18+
"ProtocolError": false,
19+
"ProtocolErrorMessage": ""
20+
},
21+
*/
22+
23+
/*
24+
filename:bbc home
25+
id:891609239375c54fe326a2e23a8c5397
26+
27+
filename:bbc radio
28+
id:1d15698856a2487bade7d8994d21d30c
29+
30+
filename:tna
31+
id:3357924f215d974f627690dd6382076c
32+
33+
id:43d4a499caa590e912c8a059f7ab8323
34+
filename:bbc news */
35+
36+
/*
37+
//for now, for testing...
38+
var linkmap = map[string]string{
39+
"http://www.bbc.co.uk/news": "bbc news",
40+
"http://www.bbc.co.uk/": "bbc home",
41+
"http://www.bbc.co.uk/radio": "bbc radio",
42+
"http://www.nationalarchives.gov.uk/": "tna",
43+
}
44+
*/
