There's a little tab at the top of this site called "Investigations". For the longest time it only really had a single analysis (on Yelp reviews from over a year ago). So far, any investigations I've done have been part of my job, on private data. Not anymore. From here on out, investigations will be a core component of the Union & Grant company. The only self-imposed rule that I'll make is the following: Only things that are actionable will make it on the site.
For the first batch, I'll take a look at Oakland crime data.
It's available. It's a relevant problem. I was born there and lived there for a brief time. It's across the bay and we're thinking of moving there. They're incredibly understaffed. Take your pick.
Where to get it
Crime data is supplied through CrimeWatch, Oakland's community crime mapping website. From there, you can look at crime by location and type.
The online mapping is provided by a third-party group called, somewhat ominously, The Omega Group. According to their FAQ, they always present a rolling 180 days' worth of crime data.
The city of Oakland also provides the same data but in a tabular format. A few caveats from the source:
While the City of Oakland strives to keep the data current and accurate, the data in the file is not official record and the City will not be responsible for any inaccuracies that may be encountered. Use this data at your own risk.
Cleaning the data
What a mess. The first thing to understand, and perhaps the most important thing I learned, is that not every row is a crime. The data is a mix of different types of cases and charges on cases. And it's not consistent between years (2013 had no arrests, 2012 did).
There were four types of case numbers in the data: crime reports, on-the-spot arrests, traffic stops, and a fourth type I haven't deciphered yet (of the format FCYY-XXXXX). Of these, only crime reports had descriptions. And a crime report could appear multiple times if there were multiple charges in that report. In one case, a kidnapping had 9 separate charges associated with it. So I needed to de-duplicate.
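Here's a rough sketch of that de-duplication step. I did the actual work in R, so treat this Python version as an illustration only: the column names CaseNumber and Description and the sample case numbers are made up for the example, not taken from the real files.

```python
import csv
import io

# Hypothetical sample rows. The column names and case-number values are
# assumptions for illustration; only the FCYY-XXXXX shape comes from the data.
SAMPLE = """CaseNumber,Description
12-000001,KIDNAPPING
12-000001,ASSAULT
12-000002,BURGLARY
FC12-00001,
"""

def dedupe_reports(rows):
    """Keep one row per case number, skipping rows without a description."""
    first_seen = {}
    for row in rows:
        if not row["Description"].strip():
            continue  # only crime reports carry a description
        case = row["CaseNumber"].strip()
        first_seen.setdefault(case, row)  # keep the first charge seen per case
    return list(first_seen.values())

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
reports = dedupe_reports(rows)
```

Keeping the first charge per case is an arbitrary choice. If the charge that best describes the incident matters, you'd want a rule for picking it instead.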
I noticed something odd about the size of the data when I loaded each year into R. The file called CrimeData2012.csv had about 106,000 rows, yet the file called CrimeData2013.csv had about 145,000 rows. Crime's bad in Oakland but this makes it look like it's a nightmare. A 36% y/y increase in crime?
The explanation is pretty simple: the 2013 file contains almost all of the 2012 data, with the arrests and traffic stops removed. I have no idea why they overloaded it. I also noticed that the 2012 data was uploaded in 2014, after 2013 was uploaded. Likely they had some sort of schema change, but then why are arrests and traffic stops missing from 2013? CrimeWatch isn't very transparent about their schema or decision making.
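One way to spot this kind of overlap is to compare the sets of case numbers in the two files. This is a minimal sketch in Python rather than the R I actually used, and the case numbers are invented for the example.

```python
def overlap_share(cases_a, cases_b):
    """Fraction of distinct case numbers in cases_a that also appear in cases_b."""
    a, b = set(cases_a), set(cases_b)
    if not a:
        return 0.0
    return len(a & b) / len(a)

# Hypothetical case numbers standing in for the two yearly files.
cases_2012 = ["12-000001", "12-000002", "12-000003", "12-000004"]
cases_2013 = ["12-000001", "12-000002", "12-000003", "13-000001"]

share = overlap_share(cases_2012, cases_2013)
```

A share close to 1 would suggest that one file is mostly contained in the other, which is worth checking before comparing row counts across years.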
One more thing I learned that relates to date ranges: Each case report represents a closed case. From the FAQ (emphasis mine):
Each incident has to be confirmed and entered as a report by the law enforcement agency before it can be uploaded to our website. For some incidents it may take some time for this process to be completed so it would not immediately appear on the website. If the case is still an open investigation, it will not appear until the case is closed.
In other words, some 2012 crimes won't be in the 2012 data table if they weren't closed until 2013. Similarly, I've included what they had for 2014, but the numbers are very low because many cases are still open (we'll get into the implications of this in the next post).
Location, location, location
Despite being Oakland crime data, I noticed a lot of other Bay Area cities in the dataset. Not sure why the OPD takes a robbery report from Macy's Walnut Creek, but whatever. I filtered them out because they weren't too frequent.
Finally, I also decided to remove the nonsensical police beats. Oakland has around 60 police beats, formatted like NNX or NNY (where NN is an integer up to 35). I can tell this data was entered by hand, because there were clear typos and parsing errors (about 60 incidents total).
There were also about 1,535 and 1,176 reports in the 77x and 99x beats, respectively. These beats don't exist in Oakland. From what I can gather, they're what the OPD uses for reports outside of the city (or when the location isn't known). I removed these as well.
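The beat rules above boil down to a pattern check. Here's a hedged sketch in Python (not the R I used). It assumes beats are zero-padded to two digits and start at 01, which I haven't confirmed, and the category labels are my own.

```python
import re

# Assumed format: two digits from 01 to 35 followed by X or Y.
VALID_BEAT = re.compile(r"(0[1-9]|[12][0-9]|3[0-5])[XY]")
# Beats that appear to be used for reports outside the city or with no known location.
OUTSIDE_BEAT = re.compile(r"(77|99)X")

def classify_beat(raw):
    """Label a beat string as 'valid', 'outside', or 'malformed'."""
    beat = raw.strip().upper()
    if VALID_BEAT.fullmatch(beat):
        return "valid"
    if OUTSIDE_BEAT.fullmatch(beat):
        return "outside"
    return "malformed"

labels = [classify_beat(b) for b in ["04X", "35y", "77x", "99X", "36X", "4X", ""]]
```

Only rows labeled valid would survive the filter. The outside and malformed rows get dropped, which matches what I did by hand.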
Now that I have a clean data set, it's time to dig in. I'm also planning on emailing CrimeWatch to understand the process of data collection. I'll let you know what I find out. In the meantime, let me know if there are any data-related questions you might have (about Oakland crime or whatever).