Welcome toVigges Developer Community-Open, Learning,Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
764 views
in Technique[技术] by (71.8m points)

rdf - Querying Open Data Communities Data with SPARQL

I'm trying to get some information from the Lower Layer Super Output Areas (LSOAs) and UK Postcodes datasets.

I need the postal code and lsoa information in a data dump for excel use.

Notation and Label of type 'Lower Layer Super Output Area'. http://opendatacommunities.org/doc/geography/lsoa/E01009437

E.g. 'lsoa' per each type 'Postcode Unit' http://opendatacommunities.org/resource?uri=http%3A%2F%2Fdata.ordnancesurvey.co.uk%2Fid%2Fpostcodeunit%2FB721NB

I have no idea how to use the SPARQL engine on the site to get this information, or how to extract the information from the N-Triples file I downloaded…

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

There are two main options for retrieving the data you want. In some cases, it is possible to query the data using a publicly available SPARQL endpoint. This is probably the most convenient approach, and the one to take unless there's some definite reason that you need the data locally. There are limitations to this approach, however, and in those cases, it makes sense to download the dataset and query against it locally. I'll describe the remote endpoint solution first, and then the solution using local queries. The limitations on the SPARQL endpoint (e.g., hard timeouts) mean that the first approach isn't sufficient for this particular task, so the specific answer to this question is the second approach.

I wasn't familiar with these particular datasets and ontologies before this question, so the first approach also walks though the "getting familiar with the data" process.

Using the SPARQL endpoint

There is a Open Data Communities SPARQL endpoint against which you can run queries and get some data out. I haven't looked at this data before, so rather than just posting the final answer, I'll walk through the process that I used to figure out what sort of query to run.

One of the pages you linked to, B72 1NB, mentions that the resource has type PostcodeUnit, which has the URI

http://data.ordnancesurvey.co.uk/ontology/postcode/PostcodeUnit

Based on this, the first thing I tried was a SPARQL query to try to retrieve some postcode units, so I used the following query in the endpoint above. (If you copy and paste it in there, you'll need to remove any leading space before SELECT. I had to do that, anyhow.)

SELECT * WHERE { 
  ?postcodeUnit a <http://data.ordnancesurvey.co.uk/ontology/postcode/PostcodeUnit>
}
LIMIT 10

SPARQL results

in the endpoint linked above. (The LIMIT helps ensure that the results come back in a timely manner, and that we're not asking the server to do too much.) This produces results like

--------------------------------------------------------------
| postcodeUnit                                               |
==============================================================
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA219HB> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF109DS> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY256SA> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY147HR> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF107BZ> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY134LH> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA202HF> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY44QZ>  |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA116SS> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY209DR> |
--------------------------------------------------------------

The B72 1NB page shows its lsoa as Birmingham 006C. The IRI for the lsoa property is (and you can see this in the data you downloaded)

http://opendatacommunities.org/def/geography#lsoa

so we extend the SPARQL query to

SELECT * WHERE { 
  ?postcodeUnit
    a <http://data.ordnancesurvey.co.uk/ontology/postcode/PostcodeUnit> ;
    <http://opendatacommunities.org/def/geography#lsoa> ?lsoa .
}
LIMIT 10

SPARQL results

The results are like this:

-----------------------------------------------------------------------------------------------------------------------------
| postcodeUnit                                               | lsoa                                                         |
=============================================================================================================================
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA219HB> | <http://opendatacommunities.org/id/geography/lsoa/E01029309> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF109DS> | <http://opendatacommunities.org/id/geography/lsoa/E01029706> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY147HR> | <http://opendatacommunities.org/id/geography/lsoa/E01018373> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF107BZ> | <http://opendatacommunities.org/id/geography/lsoa/E01014172> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY134LH> | <http://opendatacommunities.org/id/geography/lsoa/E01018514> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA202HF> | <http://opendatacommunities.org/id/geography/lsoa/E01029175> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY44QZ>  | <http://opendatacommunities.org/id/geography/lsoa/E01014204> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA116SS> | <http://opendatacommunities.org/id/geography/lsoa/E01029225> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/SW65TP>  | <http://opendatacommunities.org/id/geography/lsoa/E01001950> |
| <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF15AX>  | <http://opendatacommunities.org/id/geography/lsoa/E01014155> |
-----------------------------------------------------------------------------------------------------------------------------

You can use prefixes in your query if you want to make it a bit more readable and concise:

PREFIX pc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>
PREFIX geo: <http://opendatacommunities.org/def/geography#>
SELECT * WHERE { 
  ?postcodeUnit
    a pc:PostcodeUnit ;
    geo:lsoa ?lsoa .
}
LIMIT 10

SPARQL results

The results will be the same, of course. At the bottom of each of those results pages, you can download the results in a number of other formats. One of the formats is CSV, and you might have luck importing that directly into a spreadsheet (you said you wanted to use the data in Excel).

Discussion in the comments pointed out that the sheer number of PostcodeUnits makes the result set very large. The UK Postcodes dataset contains four types of resources, in order of increasing size: Postcode Units, Postcode Sectors, Postcode Districts, and Postcode Areas. There are 1686911, 10833, 2087, and 120 resources of these types, respectively. As I understand the clarification in the comments, the idea is to associate these with Lower Layer Super Output Areas (LSOAs), e.g., Birmingham 006C. Individual Postcode Units are associated with LSOAs, but the higher level postcode regions are not. Each Postcode Unit is within its sector, district, and area. For instance, TA21 9HB is within TA, TA21 9, and TA21. Using this information, we can ask for postcode units and their corresponding district (or sector, or area), as well as their LSOA, and report just the district and the LSOA, ignoring the unit itself. For instance:

PREFIX pc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>
PREFIX geo: <http://opendatacommunities.org/def/geography#>
PREFIX sr: <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/>
SELECT DISTINCT ?district ?lsoa 
WHERE { 
  ?postcodeunit a pc:PostcodeUnit ;
                geo:lsoa ?lsoa ;
                sr:within ?district .
  ?district a pc:PostcodeDistrict .
}
LIMIT 10 

SPARQL results

Now, there are 34378 LSOAs in the dataset, so there's still lots of data to be selected, and trying to pull down the text results for all distinct losa/district mappings still results in a timeout. In fact, since every LSOA is associated (I expect) with some district, there are probably as many results in the output as there are LSOAs.

It looks like this is the point where we start to hit response size limits and timeouts for the SPARQL endpoint, and need to start accessing the data locally. The postcode data alone is 5.6 GB though, so this isn't a wonderful solution.

But, if you're willing to take a representative LSOA for each district, we can use SPARQL subqueries to pull these out, as in the following query which first retrieves all the postcode districts, and then for each one, finds a single LSOA that some postcode unit in the district has. I don't know whether this is an acceptable result, but you end up with an LSOA for each district, and the results are small enough (there are 2087 rows, the same as the number of districts) that they can be pulled down in any of the results formats (including CSV).

PREFIX pc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>
PREFIX geo: <http://opendatacommunities.org/def/geography#>
PREFIX sr: <http://data.ordnancesurvey.co.uk/ontolo

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to Vigges Developer Community for programmer and developer-Open, Learning and Share
...