Repost | Case study: feature selection based on a greedy algorithm
Date: 2022-06-22
Reposted from R语言中文社区 (the R Language Chinese Community).
————————————————————————————————————————————————————————
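Greedy Algorithm Feature Selection
A greedy algorithm always makes the choice that looks best at the current step of solving a problem. In other words, it does not consider global optimality; what it produces is a locally optimal solution in some sense. A greedy algorithm does not yield the globally optimal solution for every problem. The key is the choice of greedy strategy: the chosen strategy must have no after-effect, meaning that the process before a given state does not affect later states, which depend only on the current state.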
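Algorithm design:
- Initialize the target value of the problem.
- while (the constraints of the optimization objective are satisfied) { use the selection strategy to find one feasible solution in the solution space }
- Combine all feasible solutions into the target solution space.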
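The three steps above can be sketched as a generic loop. The following is a minimal illustration in base R, not the article's implementation: `greedySelect`, `candidates`, `scoreFn` and the toy score function are hypothetical names invented for this example.

```r
# Generic greedy selection: repeatedly add the candidate that gives the
# best score, and stop as soon as no candidate improves it.
# greedySelect, candidates and scoreFn are hypothetical names for this sketch.
greedySelect = function(candidates, scoreFn) {
  selected = c()
  bestScore = -Inf                       # step 1: initialise the target value
  repeat {
    remaining = setdiff(candidates, selected)
    if (length(remaining) == 0) break
    # step 2: score every feasible one-item extension of the current solution
    scores = sapply(remaining, function(x) scoreFn(c(selected, x)))
    if (max(scores) <= bestScore) break  # no local improvement: stop
    bestScore = max(scores)
    selected = c(selected, remaining[which.max(scores)])
  }
  # step 3: the accumulated choices form the solution
  list(selected = selected, score = bestScore)
}

# Toy score: reward items "a" and "c", penalise every chosen item by 0.1
toyScore = function(items) sum(items %in% c("a", "c")) - 0.1 * length(items)
result = greedySelect(c("a", "b", "c", "d"), toyScore)
print(result$selected)
```

Each round only compares one-step extensions of the current selection, which is what makes the procedure greedy: an item that was added earlier is never reconsidered.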
options(warn = -1)
require(magrittr)
require(dplyr)
require(glmnet)
require(pROC)  # provides auc(), which GreedyAlgorithm calls
# Greedy Algorithm
GreedyAlgorithm = function(dataSet) {
  # Greedy feature selection based on logistic regression,
  # using AUC as the evaluation metric.
  #
  # Args:
  #   dataSet: A data frame that contains a column "label"
  #
  # Returns:
  #   A vector of selected features
  features = setdiff(colnames(dataSet), "label")  # all features except "label"
  featureSelect = c("label")  # init the vector of selected columns
  scoreBefore = data.frame()  # stores the best (feature, aucValue) pair from each iteration
  # Continue while the last round improved the AUC and unselected features remain.
  # nrow() is required here: length() of a data frame is its number of columns.
  while ((nrow(scoreBefore) < 2 ||
          scoreBefore[nrow(scoreBefore), 2] > scoreBefore[nrow(scoreBefore) - 1, 2]) &&
         nrow(scoreBefore) < length(features)) {
    score = data.frame()
    for (feature in features) {
      if (!(feature %in% featureSelect)) {
        trainData = dataSet[, c(featureSelect, feature)]
        model = glm(label ~ ., family = "binomial", data = trainData, epsilon = 1e-10)
        prediction = predict(model, trainData)
        aucValue = as.numeric(pROC::auc(trainData$label, prediction))
        score = rbind(score, data.frame(feature = feature, aucValue = aucValue))
      }
    }
    best = score[which.max(score$aucValue), ]
    featureSelect = c(featureSelect, as.character(best$feature))
    scoreBefore = rbind(scoreBefore, best)
  }
  # Drop the last feature only if it failed to improve the AUC
  n = nrow(scoreBefore)
  if (n >= 2 && scoreBefore[n, 2] <= scoreBefore[n - 1, 2]) {
    featureSelect = head(featureSelect, -1)
  }
  return(featureSelect[-1])  # return the selected features without "label"
}
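To see the selection loop in action without the article's data file, here is a small self-contained sketch on simulated data. It is an illustration under assumptions, not the author's code: the data set `toy`, the helper names `rankAuc` and `forwardSelect`, and the coefficients used to simulate the label are all made up, and the AUC is computed with the rank (Mann-Whitney) formula so that only base R is needed.

```r
# Simulated data (hypothetical): x1 is a strong signal, x2 a weaker one,
# and noise is unrelated to the label.
set.seed(1)
n = 500
x1 = rnorm(n); x2 = rnorm(n); noise = rnorm(n)
label = rbinom(n, 1, plogis(2 * x1 + x2))
toy = data.frame(label, x1, x2, noise)

# Rank-based AUC (Mann-Whitney statistic), base R only
rankAuc = function(label, score) {
  r = rank(score)
  nPos = sum(label == 1)
  nNeg = sum(label == 0)
  (sum(r[label == 1]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

# Greedy forward selection: add the feature with the best in-sample AUC
# until no remaining feature improves it.
forwardSelect = function(dataSet) {
  remaining = setdiff(colnames(dataSet), "label")
  selected = c()
  bestAuc = 0
  while (length(remaining) > 0) {
    aucs = sapply(remaining, function(f) {
      fit = glm(label ~ ., family = "binomial",
                data = dataSet[, c("label", selected, f)])
      rankAuc(dataSet$label, predict(fit, dataSet))
    })
    if (max(aucs) <= bestAuc) break
    bestAuc = max(aucs)
    selected = c(selected, remaining[which.max(aucs)])
    remaining = setdiff(remaining, selected)
  }
  selected
}

sel = forwardSelect(toy)
print(sel)
```

Because the AUC here is measured on the same data the model was fitted on, adding a feature rarely lowers it, so this stopping rule can admit weak features. Scoring on held-out data is one way to make it stricter.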
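The KS value characterizes a model's ability to separate positive examples from negative ones. The larger the value, the better the model's predictive accuracy. In general, KS > 0.3 is taken to indicate reasonably good predictive accuracy.
How the KS value is computed:
Sort all samples by predicted score from low to high and split them evenly into N groups. For each of the N groups, compute the actual number of good samples, the actual number of bad samples, the cumulative number of good samples, the cumulative number of bad samples, the cumulative share of good samples, the cumulative share of bad samples, and the difference. Here the actual good and bad counts are the counts within the group; the cumulative counts are the totals accumulated up to and including that group; the cumulative shares are the cumulative counts divided by the total number of good or bad samples; and the difference is the cumulative bad share minus the cumulative good share. The KS statistic is the maximum absolute value of the difference.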
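The procedure above can be illustrated with a small worked example. This is a sketch in base R under stated assumptions: the helper name `ksByGroup`, the convention that label 1 marks a bad sample, and the eight toy scores are all made up for illustration and do not come from the article's data.

```r
# KS from actual labels and predicted scores, following the grouping
# procedure described above. ksByGroup is a hypothetical helper name;
# label == 1 is assumed to mark a "bad" sample and 0 a "good" one.
ksByGroup = function(score, label, n) {
  label = label[order(score)]            # sort samples by score, low to high
  # assign the sorted samples to n groups of (nearly) equal size
  group = ceiling(seq_along(label) * n / length(label))
  bad  = as.vector(tapply(label == 1, group, sum))  # bad samples per group
  good = as.vector(tapply(label == 0, group, sum))  # good samples per group
  cumBadRate  = cumsum(bad)  / sum(bad)
  cumGoodRate = cumsum(good) / sum(good)
  gap = cumBadRate - cumGoodRate
  list(table = data.frame(group = seq_along(bad), good, bad,
                          cumGoodRate, cumBadRate, gap),
       ks = max(abs(gap)))
}

score = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)
label = c(0,   0,   0,   1,   0,   1,   1,   1)
res = ksByGroup(score, label, n = 4)
print(res$table)
print(res$ks)
```

In this toy data the first three groups hold three of the four good samples but only two of the four bad ones, so the gap between the cumulative shares is largest in magnitude there, and its absolute value is reported as the KS value.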
# ksValue
KsValue = function(prediction, n) {
  # Compute the KS curve of a model from its predictions
  #
  # Args:
  #   prediction: A vector of predicted probabilities from a model
  #   n: The number of groups
  #
  # Returns:
  #   A vector of the absolute differences between the cumulative bad-sample
  #   rate and the cumulative good-sample rate (the first element is 0)
  #
  # Note: the actual labels are not passed in. The sum of the predicted
  # probabilities in a group is used as the expected number of positive
  # samples in that group, and the rest of the group counts as negative.
  dataResult = sort(as.vector(prediction), decreasing = TRUE)
  # Assign the sorted samples to n groups of (nearly) equal size, so that
  # every sample is counted once even when length(dataResult) %% n != 0
  group = ceiling(seq_along(dataResult) * n / length(dataResult))
  a = c(0, tapply(dataResult, group, sum))          # expected positives per group
  b = c(0, tapply(dataResult, group, length)) - a   # expected negatives per group
  c = abs(cumsum(a) / sum(a) - cumsum(b) / sum(b))
  return(unname(c))
}
require(caret)
require(pROC)
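data = read.csv("/data/workspace/Rworkspace/data_test.csv",encoding = "UTF-8")
# Binary label: 1 if target > 30, otherwise 0
data %<>%
  mutate(label = ifelse(target>30,1,0))
# Drop the first column
data = data[,-1]
# Replace missing values with the column median
data = data.frame(apply(data, 2, function(x) ifelse(is.na(x),median(x,na.rm = T),x)))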
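# Remove near-constant variables
# feature1 = nearZeroVar(data)
# data = data[,-feature1]
# Remove highly correlated independent variables
# dataCor = cor(data)
# highcor = findCorrelation(dataCor,0.8)
# data = data[,-highcor]
# Feature selection with the greedy algorithm
# feature = GreedyAlgorithm(dataSet = data)
load("/data/workspace/Rworkspace/featureSelect.RData") # The data set is large and the algorithm is slow while rendering the HTML, so the features already selected during testing are loaded directly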
set.seed(521)
ind = base::sample(2,nrow(data),replace=T,prob=c(0.7,0.3))
trainData = data[ind==1,]
testData = data[ind==2,]
model = cv.glmnet(as.matrix(trainData[,feature]),trainData[,"label"],
                  family = "binomial",type.measure = "auc",alpha = 0,
                  lambda.min.ratio = 0.0001)
prediction = predict(model,as.matrix(testData[,feature]),s="lambda.min",type="response")
# compute ksValue
ksValue = KsValue(prediction,10)
par(mfrow = c(2,1))
plot(density(ksValue),type = 'l',main = "ksValue Plot",xlab = "cutPoint",ylab = "density_ks")
ks_value = max(ksValue)
text(.2,1.0,paste("ksValue = ",ks_value))
roc(testData$label, as.vector(prediction), auc = T,plot = T,print.auc=T)
##
## Call:
## roc.default(response = testData$label, predictor = as.vector(prediction), auc = T, plot = T, print.auc = T)
##
## Data: as.vector(prediction) in 5130 controls (testData$label 0) < 429 cases (testData$label 1).
## Area under the curve: 0.7385
par(mfrow=c(1,1))