How can I generate clusters with the K-means algorithm on a given dataset?

8 replies
programming, java
R

I'm a beginner in Java programming and would like help generating clusters with the K-means algorithm on the dataset below:

ExternalReviewer,Reviewer,0.41,0.5,0.57,1.0,1.0,1.0,false
PaperFullVersion,Conference_announcement,0.0,0.32,0.0,0.11,0.51,0.75,false
Conference,Conference,1.0,1.0,1.0,1.0,1.0,1.0,true
Decision,Conference_proceedings,0.0,0.43,0.0,0.1,0.35,0.67,false
Reviewer,Reviewer,1.0,1.0,1.0,1.0,1.0,1.0,false
ProgramCommitteeChair,Chair,0.0,0.45,0.33,1.0,1.0,1.0,false
Review,Review,1.0,1.0,1.0,1.0,1.0,1.0,true
PaperAbstract,Abstract,0.0,0.7,0.64,1.0,1.0,1.0,true
Document,Conference_document,0.4,0.54,0.45,1.0,1.0,1.0,true
Co-author,Contribution_co-author,0.63,0.62,0.57,1.0,1.0,1.0,true
Person,Person,1.0,1.0,1.0,1.0,1.0,1.0,true
Chairman,Chair,0.94,0.71,0.59,0.07,1.0,1.0,true
ExternalReviewer,Extended_abstract,0.77,0.41,0.22,0.05,0.0,0.18,false
Author,Regular_author,0.49,0.42,0.42,1.0,1.0,1.0,true
Rejection,Accepted_contribution,0.6,0.4,0.24,0.08,0.29,0.75,false
Co-author,Contribution_1th-author,0.63,0.62,0.5,0.0,0.0,0.0,false
AuthorNotReviewer,Reviewed_contribution,0.0,0.53,0.24,0.06,0.0,0.21,false


R

First of all:

  • try to format this data so we can read it more easily.
  • have you tried to get started? Have you at least managed to load this data into memory?
  • have you at least tried to declare what the call to this method will look like?

I studied clustering methods quite a lot during my master's and can help you, but you need to at least get started, ok?

R

It also looks like the column names are missing; they are essential for your analysis.
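One way to supply the missing names is to put a header line in front of the rows before handing them to an analysis tool. The sketch below is only an illustration: the class `HeaderSketch` and the method `withHeader` are hypothetical, and the column names are an assumption borrowed from the output format described later in the thread (they fit the eight-column rows, not the rows with a trailing boolean).

```java
import java.util.ArrayList;
import java.util.List;

final class HeaderSketch {

    // Assumed column names, taken from the output format described later in the thread.
    static final String HEADER = "entidade1,entidade2,m1,m2,m3,m4,m5,m6";

    /** Returns a copy of the rows with the header line placed first. */
    static List<String> withHeader(List<String> rows) {
        List<String> out = new ArrayList<>(rows.size() + 1);
        out.add(HEADER);
        out.addAll(rows);
        return out;
    }
}
```

Writing the returned lines to a file gives a CSV whose first row names the columns.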

R

PaperFullVersion,Conference_announcement,0.0,0.32,0.0,0.11,0.51,0.75
Conference,Conference,1.0,1.0,1.0,1.0,1.0,1.0
Decision,Conference_proceedings,0.0,0.43,0.0,0.1,0.35,0.67
Reviewer,Reviewer,1.0,1.0,1.0,1.0,1.0,1.0
ProgramCommitteeChair,Chair,0.0,0.45,0.33,1.0,1.0,1.0
Review,Review,1.0,1.0,1.0,1.0,1.0,1.0
PaperAbstract,Abstract,0.0,0.7,0.64,1.0,1.0,1.0
Document,Conference_document,0.4,0.54,0.45,1.0,1.0,1.0
Co-author,Contribution_co-author,0.63,0.62,0.57,1.0,1.0,1.0
Person,Person,1.0,1.0,1.0,1.0,1.0,1.0
Chairman,Chair,0.94,0.71,0.59,0.07,1.0,1.0
ExternalReviewer,Extended_abstract,0.77,0.41,0.22,0.05,0.0,0.18
Author,Regular_author,0.49,0.42,0.42,1.0,1.0,1.0
Rejection,Accepted_contribution,0.6,0.4,0.24,0.08,0.29,0.75
ConferenceChair,Chair,0.56,0.5,0.5,1.0,1.0,1.0
AuthorNotReviewer,Invited_speaker,0.54,0.25,0.11,0.3,0.78,0.86
ProgramCommitteeMember,Committee_member,0.81,0.62,0.57,1.0,1.0,1.0
Preference,Review_preference,0.6,0.42,0.58,1.0,1.0,1.0
SubjectArea,Track,0.43,0.25,0.0,0.12,0.42,0.71
Administrator,Abstract,0.76,0.42,0.24,0.07,0.0,0.2
Paper,Paper,1.0,1.0,1.0,1.0,1.0,1.0
ConferenceMember,Active_conference_participant,0.57,0.35,0.29,0.23,0.76,0.91

R

First, I created the class Correspondencias as shown below:

import org.apache.commons.math3.stat.descriptive.rank.Median;

public class Correspondencias implements Comparable<Correspondencias> {

	private int cluster;
	private String nome;
	private String entidade1;
	private String entidade2;
	private double m1;
	private double m2;
	private double m3;
	private double m4;
	private double m5;
	private double m6;
	private double[] valores;
	
	public Correspondencias() {}

	public Correspondencias(String entidade1, String entidade2, double m1, double m2, double m3, double m4, double m5,
			double m6, double[] valores) {
		this.entidade1 = entidade1;
		this.entidade2 = entidade2;
		this.m1 = m1;
		this.m2 = m2;
		this.m3 = m3;
		this.m4 = m4;
		this.m5 = m5;
		this.m6 = m6;
		this.valores = valores;
	}

	public Correspondencias(String entidade1, String entidade2, double m1, double m2, double m3, double m4, double m5,
			double m6) {
		this.entidade1 = entidade1;
		this.entidade2 = entidade2;
		this.m1 = m1;
		this.m2 = m2;
		this.m3 = m3;
		this.m4 = m4;
		this.m5 = m5;
		this.m6 = m6;

	}

	public int getCluster() {
		return cluster;
	}
	public String getNome() {
		return nome;
	}


	public String getEntidade1() {
		return entidade1;
	}

	public String getEntidade2() {
		return entidade2;
	}

	public double getM1() {
		return m1;
	}

	public double getM2() {
		return m2;
	}

	public double getM3() {
		return m3;
	}

	public double getM4() {
		return m4;
	}

	public double getM5() {
		return m5;
	}

	public double getM6() {
		return m6;
	}

	public double[] getValores() {
		return valores;
	}

	public double getMediana() {
		// Median of the six metrics m1..m6 (Apache Commons Math)
		double[] metricas = { m1, m2, m3, m4, m5, m6 };
		return new Median().evaluate(metricas);
	}

	@Override
	public String toString() {
		return entidade1 + "," + entidade2 + "," + m1 + "," + m2 + "," + m3 + "," + m4 + "," + m5 + "," + m6;
	}
	public String toString1() {
		return nome + "," + m1 + "," + m2 + "," + m3 + "," + m4 + "," + m5 + "," + m6;

	}

	public int compareTo(Correspondencias outraCorrespondencias) {

		return this.getEntidade1().compareTo(outraCorrespondencias.entidade1);
	}
}
R

Next, I created the class LerCorrespondencia to read the data:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LerCorrespondencia {

public static List<Correspondencias> lerCorrespondencia(String filePath){
    File arq = new File(filePath);

    List<Correspondencias> ccList = new ArrayList<Correspondencias>();

    // try-with-resources closes the reader even if reading fails
    try (BufferedReader bufferedReader = new BufferedReader(new FileReader(arq))) {
        String linha;
        while ((linha = bufferedReader.readLine()) != null) {
            String[] split = linha.split(",");
            ccList.add(new Correspondencias(getEntidade1(split), getEntidade2(split), m1(split),
                    m2(split), m3(split), m4(split), m5(split), m6(split)));
        }
    } catch (IOException e) {
        e.printStackTrace();
    }

    return ccList;
}



private static String getEntidade1(String[] split) {
	if (split == null || split.length < 1 || split[0] == null){
        return null;
    }

    return split[0];
}
private static String getEntidade2(String[] split) {
	if (split == null || split.length < 2 || split[1] == null){
        return null;
    }

    return split[1];
}
// Returns the metric stored in the given column, or -1 when that column is missing
private static double metrica(String[] split, int indice) {
	if (split == null || split.length <= indice || split[indice] == null) {
		return -1;
	}
	return Double.parseDouble(split[indice].trim());
}

// Metrics m1..m6 live in columns 2..7 of each line
private static double m1(String[] split) { return metrica(split, 2); }
private static double m2(String[] split) { return metrica(split, 3); }
private static double m3(String[] split) { return metrica(split, 4); }
private static double m4(String[] split) { return metrica(split, 5); }
private static double m5(String[] split) { return metrica(split, 6); }
private static double m6(String[] split) { return metrica(split, 7); }

public static void main(String[] args) throws Exception{
       List<Correspondencias> leitura = lerCorrespondencia("src/main/java/correspondencia.txt");

       for (Correspondencias data : leitura){
        	
            System.out.println(data.toString());
        }
        
        System.out.println();
        // Sort once by entidade1 before printing
        leitura.sort(Comparator.comparing(Correspondencias::getEntidade1));
        for (Correspondencias data : leitura) {
            System.out.println(data);
        }

}
}

R

My question is how I can generate the clusters with the k-means algorithm via Weka, as shown below:

8 - CLUSTER: 21 = AuthorNotReviewer,Invited_speaker,0.54,0.25,0.11,0.3,0.78,0.86,0.42
9 - CLUSTER: 21 = ConferenceMember,Active_conference_participant,0.57,0.35,0.29,0.23,0.76,0.91,0.46
10 - CLUSTER: 01 = ProgramCommitteeMember,Committee_member,0.81,0.62,0.57,1.0,1.0,1.0,0.725

R

I didn't understand: do you need to write the algorithm yourself or use the Weka library? Another thing that isn't clear is whether the String columns have to enter the analysis somehow or whether you can simply remove them. Also, do you know the pseudo-code of the algorithm? That helps a lot.
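For reference, the core loop of k-means (Lloyd's algorithm) can be sketched in plain Java roughly as follows. This is an illustration under stated assumptions, not Weka's implementation: the class `KMeansSketch` and its method names are hypothetical, only the numeric columns (m1..m6) are used, the caller passes at least k points, and the centroids are naively seeded with the first k points. A real run would normally use random or k-means++ seeding, since duplicate seed rows leave a cluster empty.

```java
import java.util.Arrays;

/** Minimal k-means (Lloyd's algorithm) over fixed-length numeric vectors. */
final class KMeansSketch {

    /** Returns, for each point, the index of the cluster it ends up in. */
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int dims = points[0].length;
        // Assumption: seed the centroids with the first k points (deterministic but naive).
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Assignment step: attach every point to its nearest centroid.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(points[p], centroids[c])
                            < squaredDistance(points[p], centroids[best])) {
                        best = c;
                    }
                }
                if (assignment[p] != best) {
                    assignment[p] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // no point switched cluster, so the result is stable
            }
            // Update step: move each centroid to the mean of its members.
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dims; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dims; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }
}
```

Calling `KMeansSketch.cluster(points, 2, 100)` with one `double[]` of m1..m6 per row returns one cluster index per row, which can then be paired back with the entity names.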

R

It can be either. At first I was thinking of using the Weka library. The String columns need to be part of the K-means analysis.
The k-means output should look like this:
entidade1, entidade2, m1, m2, m3, m4, m5, m6, md
CLUSTER: 21 = AuthorNotReviewer,Invited_speaker,0.54,0.25,0.11,0.3,0.78,0.86,0.42
CLUSTER: 21 = ConferenceMember,Active_conference_participant,0.57,0.35,0.29,0.23,0.76,0.91,0.46
CLUSTER: 01 = ProgramCommitteeMember,Committee_member,0.81,0.62,0.57,1.0,1.0,1.0,0.725
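
Once each row has a cluster index, a line in this format could be produced roughly as sketched below. The sketch assumes that md is the median of m1..m6 (as the getMediana method in the posted class suggests), that the cluster label is printed with two digits, and that md is rounded to three decimals. The class `ClusterOutputSketch` and its methods are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.Locale;

final class ClusterOutputSketch {

    /** Median of the values; for an even count, the mean of the two middle values. */
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Formats one row as "CLUSTER: NN = entidade1,entidade2,m1,...,m6,md". */
    static String formatLine(int cluster, String entidade1, String entidade2, double[] metrics) {
        StringBuilder sb = new StringBuilder();
        sb.append("CLUSTER: ").append(String.format(Locale.ROOT, "%02d", cluster)).append(" = ");
        sb.append(entidade1).append(',').append(entidade2);
        for (double m : metrics) {
            sb.append(',').append(m);
        }
        // Assumption: round md to three decimals to hide floating-point noise.
        double md = Math.round(median(metrics) * 1000.0) / 1000.0;
        return sb.append(',').append(md).toString();
    }
}
```

For example, `formatLine(21, "AuthorNotReviewer", "Invited_speaker", new double[] {0.54, 0.25, 0.11, 0.3, 0.78, 0.86})` yields the first sample line above.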

Created July 4, 2018
Last reply July 5, 2018
Replies 8
Participants 2