R语言大数据分析纽约市的311万条投诉统计可视化与时间序列分析
原文链接:http://tecdat.cn/?p=9800
介绍
本文并不表示R在数据分析方面比Python更好或更快速,我本人每天都使用两种语言。这篇文章只是提供了比较这两种语言的机会。
本文中的 数据 每天都会更新,我的文件版本更大,为4.63 GB。
CSV文件包含纽约市的311条投诉。它是纽约市开放数据门户网站中最受欢迎的数据集。
数据工作流程
install.packages("devtools")library("devtools")install_github("ropensci/plotly")
library(plotly)
需要创建一个帐户以连接到plotly API。或者,可以只使用默认的ggplot2图形。
set_credentials_file("DemoAccount", "lr1c37zw81") ## Replace contents with your API Key
使用dplyr在R中进行分析
假设已安装sqlite3(因此可通过终端访问)。
$ sqlite3 data.db # Create your database$.databases # Show databases to make sure it works$.mode csv $.import <filename> <tablename># Where filename is the name of the csv & tablename is the name of the new database table$.quit
将数据加载到内存中。
library(readr)# data.table, selecting a subset of columnstime_data.table <- system.time(fread('/users/ryankelly/NYC_data.csv', select = c('Agency', 'Created Date','Closed Date', 'Complaint Type', 'Descriptor', 'City'), showProgress = T))
kable(data.frame(rbind(time_data.table, time_data.table_full, time_readr)))
user.self |
sys.self |
elapsed |
user.child |
sys.child |
|
---|---|---|---|---|---|
time_data.table |
63.588 |
1.952 |
65.633 |
0 |
0 |
time_data.table_full |
205.571 |
3.124 |
208.880 |
0 |
0 |
time_readr |
277.720 |
5.018 |
283.029 |
0 |
0 |
我将使用data.table读取数据。该 fread
函数大大提高了读取速度。
关于dplyr
默认情况下,dplyr查询只会从数据库中提取前10行。
library(dplyr) ## Will be used for pandas replacement# Connect to the databasedb <- src_sqlite('/users/ryankelly/data.db')db
数据处理的两个最佳选择(除了R之外)是:
- 数据表
- dplyr
预览数据
# Wrapped in a function for display purposeshead_ <- function(x, n = 5) kable(head(x, n))head_(data)
Agency |
CreatedDate |
ClosedDate |
ComplaintType |
Descriptor |
City |
---|---|---|---|---|---|
NYPD |
04/11/2015 02:13:04 AM |
Noise - Street/Sidewalk |
Loud Music/Party |
BROOKLYN |
|
DFTA |
04/11/2015 02:12:05 AM |
Senior Center Complaint |
N/A |
ELMHURST |
|
NYPD |
04/11/2015 02:11:46 AM |
Noise - Commercial |
Loud Music/Party |
JAMAICA |
|
NYPD |
04/11/2015 02:11:02 AM |
Noise - Street/Sidewalk |
Loud Talking |
BROOKLYN |
|
NYPD |
04/11/2015 02:10:45 AM |
Noise - Street/Sidewalk |
Loud Music/Party |
NEW YORK |
选择几列
ComplaintType |
Descriptor |
Agency |
---|---|---|
Noise - Street/Sidewalk |
Loud Music/Party |
NYPD |
Senior Center Complaint |
N/A |
DFTA |
Noise - Commercial |
Loud Music/Party |
NYPD |
Noise - Street/Sidewalk |
Loud Talking |
NYPD |
Noise - Street/Sidewalk |
Loud Music/Party |
NYPD |
ComplaintType |
Descriptor |
Agency |
---|---|---|
Noise - Street/Sidewalk |
Loud Music/Party |
NYPD |
Senior Center Complaint |
N/A |
DFTA |
Noise - Commercial |
Loud Music/Party |
NYPD |
Noise - Street/Sidewalk |
Loud Talking |
NYPD |
Noise - Street/Sidewalk |
Loud Music/Party |
NYPD |
Noise - Street/Sidewalk |
Loud Talking |
NYPD |
Noise - Commercial |
Loud Music/Party |
NYPD |
HPD Literature Request |
The ABCs of Housing - Spanish |
HPD |
Noise - Street/Sidewalk |
Loud Talking |
NYPD |
Street Condition |
Plate Condition - Noisy |
DOT |
使用WHERE过滤行
ComplaintType |
Descriptor |
Agency |
---|---|---|
Noise - Street/Sidewalk |
Loud Music/Party |
NYPD |
Noise - Commercial |
Loud Music/Party |
NYPD |
Noise - Street/Sidewalk |
Loud Talking |
NYPD |
Noise - Street/Sidewalk |
Loud Music/Party |
NYPD |
Noise - Street/Sidewalk |
Loud Talking |
NYPD |
使用WHERE和IN过滤列中的多个值
ComplaintType |
Descriptor |
Agency |
---|---|---|
Noise - Street/Sidewalk |
Loud Music/Party |
NYPD |
Noise - Commercial |
Loud Music/Party |
NYPD |
Noise - Street/Sidewalk |
Loud Talking |
NYPD |
Noise - Street/Sidewalk |
Loud Music/Party |
NYPD |
Noise - Street/Sidewalk |
Loud Talking |
NYPD |
在DISTINCT列中查找唯一值
## City## 1 BROOKLYN## 2 ELMHURST## 3 JAMAICA## 4 NEW YORK## 5 ## 6 BAYSIDE
使用COUNT(*)和GROUP BY查询值计数
# dt[, .(No.Complaints = .N), Agency]#setkey(dt, No.Complaints) # setkey index's the dataq <- data %>% select(Agency) %>% group_by(Agency) %>% summarise(No.Complaints = n())head_(q)
Agency |
No.Complaints |
---|---|
3-1-1 |
22499 |
ACS |
3 |
AJC |
7 |
ART |
3 |
CAU |
8 |
使用ORDER和-排序结果
数据库中有多少个城市?
# dt[, unique(City)]q <- data %>% select(City) %>% distinct() %>% summarise(Number.of.Cities = n())head(q)
## Number.of.Cities## 1 1818
让我们来绘制10个最受关注的城市
City |
No.Complaints |
---|---|
BROOKLYN |
2671085 |
NEW YORK |
1692514 |
BRONX |
1624292 |
766378 |
|
STATEN ISLAND |
437395 |
JAMAICA |
147133 |
FLUSHING |
117669 |
ASTORIA |
90570 |
Jamaica |
67083 |
RIDGEWOOD |
66411 |
- 用
UPPER
转换CITY格式。
CITY |
No.Complaints |
---|---|
BROOKLYN |
2671085 |
NEW YORK |
1692514 |
BRONX |
1624292 |
766378 |
|
STATEN ISLAND |
437395 |
JAMAICA |
147133 |
FLUSHING |
117669 |
ASTORIA |
90570 |
JAMAICA |
67083 |
RIDGEWOOD |
66411 |
投诉类型(按城市)
# Plot resultplt <- ggplot(q_f, aes(ComplaintType, No.Complaints, fill = CITY)) + geom_bar(stat = 'identity') + theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))plt
第2部分时间序列运算
提供的数据不适合SQLite的标准日期格式。
在SQL数据库中创建一个新列,然后使用格式化的date语句重新插入数据 创建一个新表并将格式化日期插入原始列名。
使用时间戳字符串过滤SQLite行:YYYY-MM-DD hh:mm:ss
# dt[CreatedDate < '2014-11-26 23:47:00' & CreatedDate > '2014-09-16 23:45:00', # .(ComplaintType, CreatedDate, City)]q <- data %>% filter(CreatedDate < "2014-11-26 23:47:00", CreatedDate > "2014-09-16 23:45:00") %>% select(ComplaintType, CreatedDate, City)head_(q)
ComplaintType |
CreatedDate |
City |
---|---|---|
Noise - Street/Sidewalk |
2014-11-12 11:59:56 |
BRONX |
Taxi Complaint |
2014-11-12 11:59:40 |
BROOKLYN |
Noise - Commercial |
2014-11-12 11:58:53 |
BROOKLYN |
Noise - Commercial |
2014-11-12 11:58:26 |
NEW YORK |
Noise - Street/Sidewalk |
2014-11-12 11:58:14 |
NEW YORK |
使用strftime从时间戳中拉出小时单位
# dt[, hour := strftime('%H', CreatedDate), .(ComplaintType, CreatedDate, City)]q <- data %>% mutate(hour = strftime('%H', CreatedDate)) %>% select(ComplaintType, CreatedDate, City, hour)head_(q)
ComplaintType |
CreatedDate |
City |
hour |
---|---|---|---|
Noise - Street/Sidewalk |
2015-11-04 02:13:04 |
BROOKLYN |
02 |
Senior Center Complaint |
2015-11-04 02:12:05 |
ELMHURST |
02 |
Noise - Commercial |
2015-11-04 02:11:46 |
JAMAICA |
02 |
Noise - Street/Sidewalk |
2015-11-04 02:11:02 |
BROOKLYN |
02 |
Noise - Street/Sidewalk |
2015-11-04 02:10:45 |
NEW YORK |
02 |
汇总时间序列
首先,创建一个时间戳记四舍五入到前15分钟间隔的新列
# Using lubridate::new_period()# dt[, interval := CreatedDate - new_period(900, 'seconds')][, .(CreatedDate, interval)]q <- data %>% mutate(interval = sql("datetime((strftime('%s', CreatedDate) / 900) * 900, 'unixepoch')")) %>% select(CreatedDate, interval)head_(q, 10)
CreatedDate |
interval |
---|---|
2015-11-04 02:13:04 |
2015-11-04 02:00:00 |
2015-11-04 02:12:05 |
2015-11-04 02:00:00 |
2015-11-04 02:11:46 |
2015-11-04 02:00:00 |
2015-11-04 02:11:02 |
2015-11-04 02:00:00 |
2015-11-04 02:10:45 |
2015-11-04 02:00:00 |
2015-11-04 02:09:07 |
2015-11-04 02:00:00 |
2015-11-04 02:05:47 |
2015-11-04 02:00:00 |
2015-11-04 02:03:43 |
2015-11-04 02:00:00 |
2015-11-04 02:03:29 |
2015-11-04 02:00:00 |
2015-11-04 02:02:17 |
2015-11-04 02:00:00 |
绘制2003年的结果
- JavaScript 教程
- JavaScript 编辑工具
- JavaScript 与HTML
- JavaScript 与Java
- JavaScript 数据结构
- JavaScript 基本数据类型
- JavaScript 特殊数据类型
- JavaScript 运算符
- JavaScript typeof 运算符
- JavaScript 表达式
- JavaScript 类型转换
- JavaScript 基本语法
- JavaScript 注释
- Javascript 基本处理流程
- Javascript 选择结构
- Javascript if 语句
- Javascript if 语句的嵌套
- Javascript switch 语句
- Javascript 循环结构
- Javascript 循环结构实例
- Javascript 跳转语句
- Javascript 控制语句总结
- Javascript 函数介绍
- Javascript 函数的定义
- Javascript 函数调用
- Javascript 几种特殊的函数
- JavaScript 内置函数简介
- Javascript eval() 函数
- Javascript isFinite() 函数
- Javascript isNaN() 函数
- parseInt() 与 parseFloat()
- escape() 与 unescape()
- Javascript 字符串介绍
- Javascript length属性
- javascript 字符串函数
- Javascript 日期对象简介
- Javascript 日期对象用途
- Date 对象属性和方法
- Javascript 数组是什么
- Javascript 创建数组
- Javascript 数组赋值与取值
- Javascript 数组属性和方法
- 【leetcode两题选手】MySQL类题目(一)
- 【LeetCode每日一题】(8.11)被围绕的区域
- 二叉树的前中后序遍历(迭代法)(带动画)
- 【LeetCode两题选手】算法类题目(8.8)
- 【LeetCode每日一题】(8.9)复原IP地址(回溯)
- 【回溯算法】N叉树相关技巧
- 【回溯算法】回溯,从入门到入土,七道试题精选、精讲、精练
- 数据结构练手小项目(AVL树、哈希表、循环链表、MySQL数据库)
- 【LeetCode】每日一题(8.2)二叉树展开为链表
- 【小技巧】argc和argv的用法
- 全面分析redis持久化机制
- 【奇技淫巧】-- 接雨水
- 【奇技淫巧】-- 最长连续序列
- 【redis】跟我一起动手玩玩redis主从复制和哨兵模式
- 【C++】八大排序算法 :GIF + 亲测代码 +专项练习平台