DataFrames in Julia is very similar to pandas in Python, so anyone who has used pandas should find it easy to get started.
DataFrame definition
Create a new DataFrame and add 4 columns of content
using DataFrames
df1 = DataFrame()
df1[:clo1] = Array([1.0, 2.0, 3.0])
df1[:clo2] = Array([4.0, 5.0, 6.0])
df1[:clo3] = Array([7.0, 8.0, 9.0])
df1[:ID] = Array(['a', 'b', 'c'])
show(df1)

>> 3×4 DataFrame
│ Row │ clo1    │ clo2    │ clo3    │ ID   │
│     │ Float64 │ Float64 │ Float64 │ Char │
├─────┼─────────┼─────────┼─────────┼──────┤
│ 1   │ 1.0     │ 4.0     │ 7.0     │ 'a'  │
│ 2   │ 2.0     │ 5.0     │ 8.0     │ 'b'  │
│ 3   │ 3.0     │ 6.0     │ 9.0     │ 'c'  │
If no column names are specified, the defaults are x1, x2, ...
df2 = DataFrame(rand(5,6))
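The resulting column names can be checked, for example (a small sketch using the names function from DataFrames):

names(df2)   # should return the default column names :x1 through :x6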
Specify the content directly when the DataFrame is defined
df3 = DataFrame([collect(1:3), collect(4:6)], [:A, :B])

>>  A      B
    Int64  Int64
1   1      4
2   2      5
3   3      6
DataFrame merge
df2 = DataFrame()
df2[:clo1] = Array([11.0, 12.0, 13.0])
df2[:clo2] = Array([14.0, 15.0, 16.0])
df2[:clo3] = Array([17.0, 18.0, 19.0])
df2[:ID] = Array(['a', 'b', 'c'])
show(df2)

>> 3×4 DataFrame
│ Row │ clo1    │ clo2    │ clo3    │ ID   │
│     │ Float64 │ Float64 │ Float64 │ Char │
├─────┼─────────┼─────────┼─────────┼──────┤
│ 1   │ 11.0    │ 14.0    │ 17.0    │ 'a'  │
│ 2   │ 12.0    │ 15.0    │ 18.0    │ 'b'  │
│ 3   │ 13.0    │ 16.0    │ 19.0    │ 'c'  │
Combine df1 and df2
df3 = join(df1, df2, on=:ID)
Column rename
rename!(df1, :clo1 => :cool1)
Indexing
df = DataFrame(rand(5,6))
show(df)

5×6 DataFrame
│ Row │ x1        │ x2       │ x3       │ x4       │ x5       │ x6        │
│     │ Float64   │ Float64  │ Float64  │ Float64  │ Float64  │ Float64   │
├─────┼───────────┼──────────┼──────────┼──────────┼──────────┼───────────┤
│ 1   │ 0.678851  │ 0.514764 │ 0.829835 │ 0.996105 │ 0.427939 │ 0.163515  │
│ 2   │ 0.366487  │ 0.834486 │ 0.588859 │ 0.398066 │ 0.45738  │ 0.577453  │
│ 3   │ 0.0945619 │ 0.231903 │ 0.715664 │ 0.538696 │ 0.813534 │ 0.100683  │
│ 4   │ 0.102987  │ 0.269429 │ 0.292984 │ 0.719276 │ 0.756749 │ 0.705312  │
│ 5   │ 0.328692  │ 0.420872 │ 0.545552 │ 0.793303 │ 0.670403 │ 0.0619233 │
Use numeric indices to select columns 2 through 4
df[2:4]

>>  x2        x3        x4
    Float64   Float64   Float64
1   0.514764  0.829835  0.996105
2   0.834486  0.588859  0.398066
3   0.231903  0.715664  0.538696
4   0.269429  0.292984  0.719276
5   0.420872  0.545552  0.793303
Select a column by name
df.x1

>> 5-element Array{Float64,1}:
 0.6788506862994854
 0.366486647281848
 0.09456191069734388
 0.10298681965872736
 0.32869200293154477
Select rows 2 through 4 and the columns x1 and x3
df[2:4, [:x1, :x3]]

>>  x1         x3
    Float64    Float64
1   0.366487   0.588859
2   0.0945619  0.715664
3   0.102987   0.292984
Iterate over rows with a for loop
for row in eachrow(df)
    println(row)
end
View summary statistics of df
describe(df)
Sort
sort!(df, cols = [:x1])
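In newer DataFrames versions the cols keyword has been removed and the columns are passed directly; a hedged equivalent:

sort!(df, [:x1])              # sort by x1, ascending
sort!(df, [:x1], rev=true)    # rev keyword (assumed): sort descending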
using CSV
df = DataFrame(rand(5,6))
show(df)

>> 5×6 DataFrame
│ Row │ x1       │ x2       │ x3        │ x4       │ x5        │ x6        │
│     │ Float64  │ Float64  │ Float64   │ Float64  │ Float64   │ Float64   │
├─────┼──────────┼──────────┼───────────┼──────────┼───────────┼───────────┤
│ 1   │ 0.173904 │ 0.981004 │ 0.997001  │ 0.742527 │ 0.816245  │ 0.950086  │
│ 2   │ 0.946748 │ 0.661713 │ 0.0734147 │ 0.181932 │ 0.175604  │ 0.0550761 │
│ 3   │ 0.638618 │ 0.30026  │ 0.926336  │ 0.427942 │ 0.803738  │ 0.481783  │
│ 4   │ 0.156693 │ 0.943436 │ 0.0614211 │ 0.35279  │ 0.0692527 │ 0.417888  │
│ 5   │ 0.351843 │ 0.64204  │ 0.934611  │ 0.910804 │ 0.715309  │ 0.3677    │
Write file
CSV.write("df123.csv",df)
Read file
data = CSV.read("df123.csv")
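Note that recent CSV.jl versions require a sink type such as DataFrame, and keyword arguments cover the common cases; a hedged sketch:

# hedged example: sink type plus the delim and header keywords of CSV.jl
data = CSV.read("df123.csv", DataFrame; delim=',', header=true)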
@where
using DataFramesMeta
@where(df, :x1 .> 0.2)
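@where returns the rows of df for which the condition holds. Several conditions can be passed at once and are combined elementwise; a minimal sketch:

# rows where x1 > 0.2 and x2 < 0.8
@where(df, :x1 .> 0.2, :x2 .< 0.8)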
@with
@with(df, :x2 .+ 1)

>> 5-element Array{Float64,1}:
 1.9810036040861598
 1.6617132284630372
 1.3002596213068984
 1.9434355174941438
 1.6420398463078156
@select
@select(df, :x3)

>>  x3
    Float64
1   0.997001
2   0.0734147
3   0.926336
4   0.0614211
5   0.934611
@select(df, p = 2 * :x1, :x2)

>>  p         x2
    Float64   Float64
1   0.347808  0.981004
2   1.8935    0.661713
3   1.27724   0.30026
4   0.313385  0.943436
5   0.703686  0.64204
RDatasets is a Julia package that bundles many public datasets useful for learning and experimentation, including the iris dataset.
Introduction to the iris dataset
In the field of machine learning there are a large number of public datasets, and iris is one of the most important.
The iris dataset has a long history. It first appeared in 1936 in the famous British statistician and biologist Ronald Fisher's paper "The use of multiple measurements in taxonomic problems", where it was used to introduce linear discriminant analysis. The dataset covers three different species of iris: Iris setosa, Iris versicolour, and Iris virginica. Fifty samples were collected for each species, so the dataset contains 150 samples in total.
The dataset records 4 features for each of the 150 samples: sepal length (SepalLength), sepal width (SepalWidth), petal length (PetalLength), and petal width (PetalWidth).
The units of the above four characteristics are all centimeters (cm).
Usually m denotes the number of samples and n the number of features per sample.
using RDatasets
iris = dataset("datasets", "iris")
features = convert(Array, iris[:, 1:4])
labels = convert(Array, iris[:, 5])
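A quick check of the shapes described above (m = 150 samples, n = 4 features); the expected values are noted as comments:

size(features)   # expected (150, 4): m rows, n columns
length(labels)   # expected 150
unique(labels)   # the three species names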
using Gadfly
p = plot(iris, x=:SepalLength, y=:SepalWidth, Geom.point)
Write the plot to a file
img = SVG("iris_plot.SVG", 10cm, 8cm)
draw(img, p)
Connect the points with lines
p = plot(iris, x=:SepalLength, y=:SepalWidth, Geom.point, Geom.line)
Color the points by Species
p = plot(iris, x=:SepalLength, y=:SepalWidth, Geom.point, color=:Species)
Add labels to the plot
p = plot(iris, x=:SepalLength, y=:SepalWidth, label=:Species, color=:Species, Geom.point, Geom.label)
Histogram
fig1a = plot(iris, x="SepalLength", y="SepalWidth", Geom.point)
fig1b = plot(iris, x="SepalWidth", Geom.bar)
fig1 = hstack(fig1a, fig1b)
using Distributions
Generate a normal distribution model with a mean of 20 and a standard deviation of 2.0
n1 = Normal(20, 2.0)
params(n1)
>> (20, 2.0)
fieldnames(typeof(n1))
>> (:μ, :σ)
The type of Normal is UnionAll
typeof(Normal)
>> UnionAll
Generate random numbers that satisfy the n1 model
rand(n1,100)
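A quick, hedged check that the samples follow the model (using the Statistics standard library):

using Statistics
x = rand(n1, 1000)
mean(x), std(x)   # should be close to 20 and 2.0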
Binomial distribution
b = Binomial(20, 0.8)
rand(b, 1000)
Probability Density
n = Normal()
pdf(n, 0.4)
>> 0.36827014030332333
pdf(n, 0)   # the probability density of the standard normal distribution at x=0 is 0.3989...
>> 0.3989422804014327
Cumulative distribution function
cdf(n, 0.4)
>> 0.6554217416103242
Quantile
quantile.(n, [0.25, 0.5, 0.95])
>> 3-element Array{Float64,1}:
 -0.6744897501960818
  0.0
  1.6448536269514717
Fit a Normal distribution to 10 random numbers from rand(10), estimating its parameters
fit(Normal, rand(10))
>> Normal{Float64}(μ=0.41375573240875224, σ=0.25565343114046984)
StatsBase library: it also contains commonly used statistical functions.
using StatsBase
a = collect(1:6)
b = collect(4:9)
countne(a, b)   # compare elementwise: 1 != 4, 2 != 5, ..., 6 != 9
>> 6
a = [1, 2, 3, 4, 5]
b = [4, 1, 3, 2, 5]
counteq(a, b)   # count the number of elementwise-equal entries in the two vectors
>> 2
L1dist(a, b)    # abs(a[1]-b[1]) + ... + abs(a[n]-b[n])
>> 6.0
L2dist(a, b)    # sqrt((a[1]-b[1])^2 + ... + (a[n]-b[n])^2)
meanad(a, b)    # mean absolute deviation: L1dist(a,b)/n
>> 1.2
sample(a) # Sample data once from a
sample(a, 3)   # sample 3 values from a; returns a 1-dimensional Array
>> 3-element Array{Int64,1}:
 3
 2
 3
a1 = [1, 10, 20, 30]
a2 = [100, 200, 300]
sample!(a1, a2)   # sample length(a2) values from a1 and write them into a2
>> 3-element Array{Int64,1}:
 20
 10
 30
a1 = [1, 10, 20, 30]
a2 = [100.0, 200.0, 300.0]
sample!(a1, a2)   # the sampled values are converted to a2's element type
>> 3-element Array{Float64,1}:
 1.0
 30.0
 1.0
Other commonly used functions
x = [1.0, 2.0, 3.0, 4.0]
autocov(x)           # autocovariance
autocov(x, [2, 3])   # autocovariance at lags 2 and 3
autocor(x)           # autocorrelation
autocor(x, [2, 3])
y = [3., 4., 5., 6.]
crosscov(x, y)           # cross-covariance of x and y
crosscov(x, y, [2, 3])
using TimeSeries
Create a range of Date values
using Dates
dates = collect(Date(2018, 1, 1):Day(1):Date(2018, 12, 31))
dates[1:20]
New TimeArray
ta = TimeArray(dates, rand(length(dates), 2))

>> 365×2 TimeArray{Float64,2,Date,Array{Float64,2}} 2018-01-01 to 2018-12-31
│            │ A      │ B      │
├────────────┼────────┼────────┤
│ 2018-01-01 │ 0.7933 │ 0.4359 │
│ 2018-01-02 │ 0.6347 │ 0.7234 │
│ 2018-01-03 │ 0.2454 │ 0.6826 │
│ 2018-01-04 │ 0.767  │ 0.9007 │
│ 2018-01-05 │ 0.1884 │ 0.5429 │
│ 2018-01-06 │ 0.7809 │ 0.2292 │
│ 2018-01-07 │ 0.9329 │ 0.0098 │
│ 2018-01-08 │ 0.0032 │ 0.7566 │
│ 2018-01-09 │ 0.1355 │ 0.8986 │
│ 2018-01-10 │ 0.9274 │ 0.6022 │
│ 2018-01-11 │ 0.6483 │ 0.53   │
│ 2018-01-12 │ 0.8594 │ 0.0237 │
⋮
│ 2018-12-21 │ 0.7245 │ 0.5286 │
│ 2018-12-22 │ 0.9374 │ 0.8508 │
│ 2018-12-23 │ 0.0528 │ 0.1361 │
│ 2018-12-24 │ 0.1184 │ 0.6156 │
│ 2018-12-25 │ 0.3645 │ 0.7896 │
│ 2018-12-26 │ 0.023  │ 0.8259 │
│ 2018-12-27 │ 0.0942 │ 0.9326 │
│ 2018-12-28 │ 0.6124 │ 0.3102 │
│ 2018-12-29 │ 0.5602 │ 0.3305 │
│ 2018-12-30 │ 0.273  │ 0.7611 │
│ 2018-12-31 │ 0.7595 │ 0.1093 │
Get the timestamps
timestamp(ta)
Get the values
values(ta)
Get the column names
colnames(ta)
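Columns and date ranges can also be selected; a minimal sketch (assuming TimeSeries' column indexing and its from/to functions):

ta[:A]                        # select column A
from(ta, Date(2018, 6, 1))    # rows on or after 2018-06-01
to(ta, Date(2018, 1, 31))     # rows on or before 2018-01-31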
using MLBase
MLBase is a basic machine-learning library that does not contain any machine learning algorithms itself; instead it provides many of the tools needed for machine learning, such as cross-validation.
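For example, the Kfold iterator splits sample indices into folds for cross-validation; a minimal sketch:

# split 10 samples into 3 folds; each iteration yields the training indices of one fold
for train_inds in Kfold(10, 3)
    println(train_inds)
end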
Let's first look at a few simple data processing functions in MLBase
repeach(1:3, 2)
>> 6-element Array{Int64,1}:
 1
 1
 2
 2
 3
 3

repeach(["a", "b", "c"], 2)
>> 6-element Array{String,1}:
 "a"
 "a"
 "b"
 "b"
 "c"
 "c"

repeach(["a", "b", "c"], [1, 2, 3])
>> 6-element Array{String,1}:
 "a"
 "b"
 "b"
 "c"
 "c"
 "c"
A = reshape(collect(1:6), 2, 3)
repeachcol(A, 2)
>> 2×6 Array{Int64,2}:
 1  1  3  3  5  5
 2  2  4  4  6  6
repeachrow(A, 2)
>> 4×3 Array{Int64,2}:
 1  3  5
 1  3  5
 2  4  6
 2  4  6
labels = labelmap(['a', 'b', 'b', 'c', 'c', 'c'])
>> LabelMap (with 3 labels):
[1] a
[2] b
[3] c
labelencode(labels, ['b', 'c', 'c', 'a'])
>> 4-element Array{Int64,1}:
 2
 3
 3
 1
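The encoding can be reversed with labeldecode (a hedged example):

labeldecode(labels, [2, 3, 3, 1])   # back to the original labels: 'b', 'c', 'c', 'a'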
Use iris dataset
using RDatasets
iris = dataset("datasets", "iris")
features = convert(Array, iris[:, 1:4])
labels = convert(Array, iris[:, 5])
Use the DecisionTree package
using DecisionTree
This library contains the following models: a decision tree classifier, a random forest classifier, an adaptive-boosted (AdaBoost) decision stumps classifier, a regression tree, and a regression forest.
Here we only introduce the classification models, i.e. the first three.
model = build_tree(labels, features)
>> Decision Tree
Leaves: 9
Depth:  5
model = prune_tree(model, 0.9)   # prune: merge leaves having >= 90% combined purity
>> Decision Tree
Leaves: 8
Depth:  5
print_tree(model, 5)
>> Feature 3, Threshold 2.45
L-> setosa : 50/50
R-> Feature 4, Threshold 1.75
    L-> Feature 3, Threshold 4.95
        L-> versicolor : 47/48
        R-> Feature 4, Threshold 1.55
            L-> virginica : 3/3
            R-> Feature 3, Threshold 5.449999999999999
                L-> versicolor : 2/2
                R-> virginica : 1/1
    R-> Feature 3, Threshold 4.85
        L-> Feature 2, Threshold 3.1
            L-> virginica : 2/2
            R-> versicolor : 1/1
        R-> virginica : 43/43
print_tree(model, 3)
>> Feature 3, Threshold 2.45
L-> setosa : 50/50
R-> Feature 4, Threshold 1.75
    L-> Feature 3, Threshold 4.95
        L-> versicolor : 47/48
        R->
    R-> Feature 3, Threshold 4.85
        L->
        R-> virginica : 43/43
Use the decision tree model to make predictions in the following way
apply_tree(model, [5.9, 3.0, 5.1, 1.9])
>> "virginica"
apply_tree(model, [1.9, 2.0, 2.1, 1.9])
>> "setosa"
n-fold cross-validation for the pruned tree
n_folds = 3
accuracy = nfoldCV_tree(labels, features, n_folds, 0.9)

>> Fold 1
Classes:  ["setosa", "versicolor", "virginica"]
Matrix:   3×3 Array{Int64,2}:
 24   0   0
  0  13   0
  0   0  13
Accuracy: 1.0
Kappa:    1.0

Fold 2
Classes:  ["setosa", "versicolor", "virginica"]
Matrix:   3×3 Array{Int64,2}:
 12   0   0
  0  17   0
  0   0  21
Accuracy: 1.0
Kappa:    1.0

Fold 3
Classes:  ["setosa", "versicolor", "virginica"]
Matrix:   3×3 Array{Int64,2}:
 14   0   0
  0  20   0
  0   1  15
Accuracy: 0.98
Kappa:    0.9695863746958637

Mean Accuracy: 0.9933333333333333

3-element Array{Float64,1}:
 1.0
 1.0
 0.98
Construct adaboost_stumps model
model, coeffs = build_adaboost_stumps(labels, features, 10)
>> (Ensemble of Decision Trees
Trees:      10
Avg Leaves: 2.0
Avg Depth:  1.0, [0.346574, 0.255413, 0.264577, 0.335749, 0.288846, 0.258287, 0.356882, 0.298766, 0.242485, 0.349976])
Use the model to make a prediction
apply_adaboost_stumps(model, coeffs, [5.9, 3.0, 5.1, 1.9])
>> "virginica"
adaboost nfold cross-validation
# 3 folds, 8 iterations
accuracy = nfoldCV_stumps(labels, features, 3, 8)
Construct a random forest model
# 2 random features, 10 trees, 0.5 portion of samples per tree
model = build_forest(labels, features, 2, 10, 0.5)
>> Ensemble of Decision Trees
Trees:      10
Avg Leaves: 6.4
Avg Depth:  4.4
Use the model to make a prediction
apply_forest(model, [5.9, 3.0, 5.1, 1.9])
>> "virginica"
forest nfold cross-validation
accuracy = nfoldCV_forest(labels, features, 3, 2)

>> Fold 1
Classes:  ["setosa", "versicolor", "virginica"]
Matrix:   3×3 Array{Int64,2}:
 12   0   0
  0  15   0
  0   1  22
Accuracy: 0.98
Kappa:    0.9689440993788819

Fold 2
Classes:  ["setosa", "versicolor", "virginica"]
Matrix:   3×3 Array{Int64,2}:
 18   0   0
  0  13   2
  0   1  16
Accuracy: 0.94
Kappa:    0.9096385542168673

Fold 3
Classes:  ["setosa", "versicolor", "virginica"]
Matrix:   3×3 Array{Int64,2}:
 20   0   0
  0  20   0
  0   0  10
Accuracy: 1.0
Kappa:    1.0

Mean Accuracy: 0.9733333333333333

3-element Array{Float64,1}:
 0.98
 0.94
 1.0
PCA (Principal Component Analysis) is a commonly used data analysis method. Through a linear transformation, PCA maps the original data onto a set of linearly independent components, which can be used to extract the main features of the data; it is often used for dimensionality reduction of high-dimensional data.
using MultivariateStats, RDatasets

# load iris dataset
iris = dataset("datasets", "iris")

# split half to training set
Xtr = convert(Array, (iris[1:2:end, 1:4]))'
Xtr_labels = convert(Array, (iris[1:2:end, 5]))

# split other half to testing set
Xte = convert(Array, (iris[2:2:end, 1:4]))'
Xte_labels = convert(Array, (iris[2:2:end, 5]))

# suppose Xtr and Xte are training and testing data matrices,
# with each observation in a column

# train a PCA model, allowing up to 3 dimensions
M = fit(PCA, Xtr; maxoutdim=3)

# apply PCA model to testing set
Yte = transform(M, Xte)

# reconstruct testing observations (approximately)
Xr = reconstruct(M, Yte)

# group results by testing set labels for color coding
setosa = Yte[:, Xte_labels .== "setosa"]
versicolor = Yte[:, Xte_labels .== "versicolor"]
virginica = Yte[:, Xte_labels .== "virginica"]
size(Xte)
>> (4, 75)
size(Yte)
>> (3, 75)
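How much variance the retained components explain can also be checked; a small sketch using functions MultivariateStats provides for PCA models:

principalvars(M)    # variance explained by each principal component
principalratio(M)   # fraction of the total variance retained by the model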
Draw the data after dimensionality reduction
using Plots
p = scatter(setosa[1,:], setosa[2,:], setosa[3,:], marker=:circle, linewidth=0)
scatter!(versicolor[1,:],versicolor[2,:],versicolor[3,:],marker=:circle,linewidth=0)
scatter!(virginica[1,:],virginica[2,:],virginica[3,:],marker=:circle,linewidth=0)
using LIBSVM
using RDatasets
iris = dataset("datasets", "iris")
features = convert(Array, iris[:, 1:4])
labels = convert(Array, iris[:, 5])
features_train, features_test = features[1:2:end, :], features[2:2:end, :]
labels_train, labels_test = labels[1:2:end], labels[2:2:end]
model = svmtrain(features_train', labels_train)
(predicted_labels, decision_values) = svmpredict(model, features_test')
Check the prediction results
using Statistics
mean(predicted_labels .== labels_test) * 1.0
predicted_labels .== labels_test
>> 75-element BitArray{1}:
 true
 true
 true
 true
 true
 true
 true
 true
 true
 true
 true
 true
 true
 ⋮
 true
 true
 true
 true
 true
 true
 true
 true
 true
 true
 true
 true
using GLM, GLMNet, DataFrames
lm model
data = DataFrame(X=[1,2,3], Y=[2,4,7])
ols = lm(@formula(Y ~ X), data)

>> StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: Y ~ 1 + X

Coefficients:
              Estimate  Std.Error   t value  Pr(>|t|)
(Intercept)  -0.666667    0.62361  -1.06904    0.4788
X                  2.5   0.288675   8.66025    0.0732
Standard errors of the linear regression coefficients
stderror(ols)
>> 2-element Array{Float64,1}:
 0.6236095644623245
 0.2886751345948133
Apply the model
newX = DataFrame(X=[2,3,4]);
GLM.predict(ols, newX)
>> 3-element Array{Union{Missing, Float64},1}:
 4.333333333333332
 6.833333333333335
 9.33333333333334
data = DataFrame(X=[1,2,3], Y=[2,4,6])
ols = lm(@formula(Y ~ X), data)
newX = DataFrame(X=[2,3,4])
GLM.predict(ols, newX)
>> 3-element Array{Union{Missing, Float64},1}:
 4.0
 6.0
 8.0
glm model
data = DataFrame(X=[0,1,2,3,4], Y=[0.1296,0.0864,0.0576,0.0384,0.0256])
probit = glm(@formula(Y ~ X), data, Binomial(), ProbitLink())

>> StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Binomial{Float64},ProbitLink},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: Y ~ 1 + X

Coefficients:
              Estimate  Std.Error    z value  Pr(>|z|)
(Intercept)   -1.14184    1.32732  -0.860259    0.3896
X            -0.208282   0.659347  -0.315891    0.7521
newX = DataFrame(X=[1,2,3,4])
GLM.predict(probit, newX)
>> 4-element Array{Union{Missing, Float64},1}:
 0.08848879660655808
 0.05956904945792997
 0.03864063839150974
 0.024136051585243873
Fit a linear regression model to the iris dataset
data = DataFrame();
data[:x] = iris[1:50, :SepalWidth]
data[:y] = iris[1:50, :SepalLength]
model = lm(@formula(y ~ x), data)

>> StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ 1 + x

Coefficients:
             Estimate  Std.Error  t value  Pr(>|t|)
(Intercept)     2.639   0.310014  8.51251    <1e-10
x             0.69049  0.0898989  7.68074     <1e-9
Compare the LM fitted curve with the original curve
using Plots
plot(GLM.predict(model))
plot!(data[:y])
Process iris dataset with GLM
model = glm(@formula(y ~ x), data, Normal(), IdentityLink())

>> StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ 1 + x

Coefficients:
             Estimate  Std.Error  z value  Pr(>|z|)
(Intercept)     2.639   0.310014  8.51251    <1e-16
x             0.69049  0.0898989  7.68074    <1e-13
Compare the curve fitted by GLM with the original curve
using Plots
plot(GLM.predict(model))
plot!(data[:y])
# Estimates of the coefficients
coef(model)
# Standard errors of the coefficients
stderror(model)
# Covariance of the coefficients
vcov(model)
# RSS
deviance(model)
using Clustering
using RDatasets
iris = dataset("datasets", "iris")
features = convert(Array, iris[:, 1:4])
labels = convert(Array, iris[:, 5]);
# choose 3 random starting points
initseeds(:rand, convert(Matrix, features'), 3)
>> 3-element Array{Int64,1}:
 48
 80
 114
# Clustering expects each observation in a column, so pass the transposed feature matrix
result = kmeans(convert(Matrix, features'), 3)
using Gadfly plot(iris,x="PetalLength",y="PetalWidth",color=result.assignments, Geom.point)
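The clustering result can be inspected further, for example (a minimal sketch using accessors provided by Clustering.jl):

assignments(result)   # the cluster index assigned to every sample
counts(result)        # the number of samples in each cluster
result.centers        # cluster centers, one column per cluster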