
User:SJK/Year in Review database notes

From Wikipedia, the free encyclopedia

2001-11-09 04:25 UTC: Okay, now I've extracted almost all of the Year-in-Review entries into a separate database. I ended up with 2520 facts in my database. There are a few facts which didn't get included, because the format they were in was too nonstandard to extract using a perl script. There are also several entries which aren't marked as to what type of entry they are, because they didn't appear under an "Events", etc., heading. I will put a copy of the database up somewhere just in case anyone wants to see it. Now all that remains to be done is to write a script to let people view and edit the database via the WWW. -- SJK


2001-11-09 09:55 UTC: I have just completed downloading all the year entries up to 2001 for "Year in Review". (I will do the date entries later). I did it using a perl script; if you want to see it, it is User:SJK/yrget perl script. It requires an "ENTRIES" file, which contains a list of all the year entries (I got one from downloading the index, and then using sed and grep with appropriate regexps).

It stores the downloaded files in the data/ directory, under the page's title (with spaces replaced with underscores, etc.). Each file contains the page's Wiki source (what you see when you edit). It inserts as the first line of each page the following command "#YEAR ''name of page'' REV=latest revision of page". Next I am going to analyse them, and try converting them to a database. I am not entirely sure what I will use, though it will probably be some combination of Perl, SICSTUS or GNU Prolog and Unix shell utilities.
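The naming and header convention described above could be sketched roughly as follows. This is in Python rather than the Perl of the actual yrget script, and the function names are hypothetical. Only the space-to-underscore rule is implemented, since the other replacements covered by "etc." are not spelled out here.

```python
def page_filename(title: str) -> str:
    """Map a page title to a path under data/ (spaces become underscores).

    Assumption: any further character replacements done by the real
    script are unknown and therefore not reproduced here.
    """
    return "data/" + title.strip().replace(" ", "_")


def with_header(title: str, revision: int, wiki_source: str) -> str:
    """Prepend the '#YEAR <page name> REV=<revision>' line to the page source."""
    return f"#YEAR {title} REV={revision}\n{wiki_source}"


def parse_header(stored: str):
    """Recover (title, revision) from the first line of a stored file."""
    first_line = stored.split("\n", 1)[0]
    body = first_line[len("#YEAR "):]
    title, _, rev = body.rpartition(" REV=")
    return title, int(rev)
```

For example, page_filename("100 BC") gives "data/100_BC", and parse_header reads back whatever with_header wrote.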

These are some preliminary statistics on the structure of the entries:

  • years present: 1032
  • contain "Events" section: 988
  • contain "Births" section: 967
  • contain "Deaths" section: 967
  • none of the abovementioned sections: 65

The above statistics are probably not 100% accurate, but they would be close...
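A tally like the one above might be produced along these lines. This is only a Python sketch: it assumes a section heading is a line holding nothing but the section name, optionally wrapped in apostrophe or equals-sign markup, which is a guess about the page format and one reason such counts would be approximate.

```python
import re
from collections import Counter

# Assumed heading shape: the bare section name, optionally surrounded by
# apostrophes, '=' signs or spaces, with an optional trailing colon.
HEADING = re.compile(r"^[\s'=]*(Events|Births|Deaths)[\s'=:]*$")


def sections_in(source: str) -> set:
    """Return the set of recognised section names found in one page."""
    found = set()
    for line in source.splitlines():
        match = HEADING.match(line)
        if match:
            found.add(match.group(1))
    return found


def tally(pages: dict) -> Counter:
    """Count, over all pages, how many contain each section (or none)."""
    counts = Counter()
    for source in pages.values():
        found = sections_in(source)
        counts.update(found)          # adds 1 per section present
        if not found:
            counts["none"] += 1
    counts["years present"] = len(pages)
    return counts
```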

-- SJK


My goal here: to replace all the hard to maintain lists and other organizational features of Wikipedia with databases. I am planning to start with Year in Review and work from there. Comments on how to best do this, and how I propose to do it below, are more than welcome. (But if you object to the proposal in principle, forget about that until later -- once I have written the code then we will of course discuss whether to install it...) I plan to begin writing code after the end of my exams (Nov. 27).

We will use PHP to write this script, so it can be integrated into Magnus' PHP wiki.

The main YIR table in the database will have the following format:

 Year|Month|Day|EventType|Text

Where EventType is one of Birth, Death, Event or a Nobel Prize, and Text is standard Wiki text. Note that it is possible for Month or Day to have null values.

NobelPrize is of course Nobel Prize Physics, Nobel Prize Chemistry, etc... Maybe this belongs in a separate NOBELPRIZE table:

 Year|Month|Day|Field|Awardee|Comment

Then we can use the NOBELPRIZE table to generate subpages of Nobel Prize
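As a rough illustration, the two tables could be declared like this. The sketch uses Python's built-in sqlite3 module purely for convenience (the plan here is PHP, and the database engine is not decided); column types and constraints are assumptions, and it takes the option of keeping Nobel Prizes out of the main table.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the YIR and NOBELPRIZE tables (types/constraints are assumed)."""
    conn.executescript(
        """
        CREATE TABLE yir (
            year       INTEGER NOT NULL,
            month      INTEGER,            -- may be NULL
            day        INTEGER,            -- may be NULL
            event_type TEXT NOT NULL
                CHECK (event_type IN ('Birth', 'Death', 'Event')),
            text       TEXT NOT NULL       -- standard Wiki text
        );
        CREATE TABLE nobelprize (
            year    INTEGER NOT NULL,
            month   INTEGER,
            day     INTEGER,
            field   TEXT NOT NULL,         -- e.g. Physics, Chemistry
            awardee TEXT NOT NULL,
            comment TEXT
        );
        """
    )
```

Calling create_schema(sqlite3.connect(":memory:")) gives a throwaway database to experiment with. If Nobel Prizes stay in the main table instead, the CHECK list would need an extra value.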

Every year and month/day page will have subpages "/Intro" and "/Extra". These subpages will be automatically incorporated into the article at the appropriate points.

Eventually, the "Birth", "Death" eventtypes will be automatically generated from the Biographical Database.

I will write routines to use to:

1. download all Year-In-Review entries
2. extract data into database

We will generate (at this stage) two different kinds of reports: a "what happened in that year" report, and a "what happened on that day" report.

We will produce the following output for the "what happened in that year" report:


Centuries: Year in Review ''current-century''

''prev-century'' - ''current-century'' - ''next-century''

Decades: for every decade D from last decade of previous century to first decade of next century

convert D to text form T
if D is current decade, output bold T
else output T
if not last iteration print ' - '

endfor

for every year Y from current year-5 to current year+5

if Y not current year output "Y "
else output bold "Y "

endfor
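The year loop above might look like this in Python. It is a sketch only: rendering the other years as [[links]] and the current year with ''' bold markup is an assumption, and the BC/AD boundary (there is no year 0) is ignored.

```python
def year_nav(current: int, span: int = 5) -> str:
    """Build the row of years from current-span to current+span.

    Assumptions: the current year is bold ('''Y'''), every other year is
    a wiki link ([[Y]]), and years are plain integers.
    """
    parts = []
    for year in range(current - span, current + span + 1):
        if year == current:
            parts.append(f"'''{year}'''")
        else:
            parts.append(f"[[{year}]]")
    return " ".join(parts)
```

With the default span this yields eleven entries, the middle one being the page's own year.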



(script will insert /Intro data here...)

Births
select year, month, day, eventtype, text from YEARINREVIEW

where eventtype = Birth and year = current year

for each row in resultset

if month, day not null then
convert month, day to string DateString
write "*DateString - text"
else
write "* text"

endfor
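For illustration, the Births query and loop could be written as follows in Python with sqlite3 (the real script is meant to be PHP). The function name, the English month names and the ordering (dated entries first, undated last) are assumptions not stated in the pseudocode; the table and column names follow the query above.

```python
import sqlite3

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def birth_lines(conn: sqlite3.Connection, year: int) -> list:
    """Return the wiki list lines for the Births section of one year."""
    rows = conn.execute(
        "SELECT month, day, text FROM yearinreview "
        "WHERE eventtype = 'Birth' AND year = ? "
        "ORDER BY month IS NULL, month, day",   # assumed ordering
        (year,),
    )
    lines = []
    for month, day, text in rows:
        if month is not None and day is not None:
            date_string = f"{MONTHS[month - 1]} {day}"
            lines.append(f"*{date_string} - {text}")
        else:
            lines.append(f"* {text}")
    return lines
```

The same function would serve for Deaths and Events if the event type were passed in as a second query parameter.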

Deaths
same code as for births, mutatis mutandis

Events
same code as for births, mutatis mutandis

Nobel Prizes
SQL query to generate output below (too tired to explain in detail, should be obvious to someone less tired):

(I will add support for Nobel Prizes as a special event type...)

(script will insert /Extra info here...)

e.g. Technology or Films sections found in some YIR entries


I will also have a report for each date: ....


And an edit dialogue for year/date report, replacing the standard edit dialogue:

List of Events (list generated by SQL query based on criteria above)

Year|Month|Day|EventType|edit entry
Text

Introductory information:

blah blah blah

edit introductory information -- link to action=edit&id=YEAR/Intro

Additional information: edit additional information -- link to action=edit&id=YEAR/Extra


Issues:

  • edit conflicts -- edits to intro & extra easy -- they are pages, and current code can mostly be reused
  • edit conflicts -- individual events -- in principle no different from above, just shorter... code would have to be different, since we'd be using different SQL tables; but can probably use principles of pre-existing code...

Let me point out that the year in review entries are FAR from standard. Several changes have occurred to the format and the multiple working points mean that there are many, many formats out there! --MichaelTinkler

Well, based on the ones I've looked at, they seem pretty standard. There are a few minor variations, and some sections present in some but not others (e.g. Film, Technology) -- but my planned database will support any nonstandard sections through the "additional information" part, which can contain anything. Can you point to any particular ones which are very nonstandard? -- SJK

There are whole centuries where there are lots of years in which none of the information ABOVE the births/deaths/events info is present yet. The formatting is shifty - some people have been putting 3 centuries in review on the top line (preceding, current, following) and some have been putting only current. Some have been doing

century in review
centuries: preceding current following
decades:
years:

The whole transition of years to listing the current year +/- five instead of beginning and ending with current decade only is far from complete. There are lots of minor variations like that. I'd say go ahead and do it, because then we'll find out eventually what's not standard and make it so, I suppose. --MichaelTinkler

Those sort of things were what I was referring to as the "minor variations"... me the optimist :) -- SJK

See also : Simon J Kissane