Podcast 15: Spark Core, Spark SQL and Spark ML

31-Oct-2018 in Podcasts Core Java

Richard came into the office this morning in a real mood, had a rant, and so we decided to do a podcast today! It's all about Apache Spark this time.

Transcript

Matt:Hello and welcome to podcast number 15. This is an unscheduled extra podcast. We hadn't planned to do this today, but Richard walked into the office this morning, huffing and puffing, in a right mood, and had a real rant, and I thought that was too good an opportunity to not try and record some of it for the benefit of our listeners out there.

Richard:Right. Yes. I am in a bad mood, but I'm also supposed to be on holiday after doing my release. So, I'm in an even worse mood now that you're making me do a podcast. So, I'm going to sit here, and say nothing.

Matt:Well, we'll see how long that lasts, won't we? Anyway, welcome. I'm Matt. That was Richard, who's now sitting here growling at me, and welcome to our latest podcast. I guess we should start, then, with what you were just saying, Richard, which is that we have just released a course. I'm using the royal "we" there.

Richard:Have we?

Matt:So, what have we released?

Richard:Spark SQL.

Matt:And that's now live on the Virtual Pair Programmers website. It's a follow-on to module one, which covered-

Richard:Spark RDD.

Matt:This is going to be hard, right? Richard will be here for the rest of the day. God, we need a bit more. So yeah, it's Spark RDD. I promote this course actually, so I will talk about it if you won't, but tell us a bit about Spark SQL, and what it's used for. What are its great selling points, all that kind of stuff, or don't.

Richard:Well, everybody now says, "Nobody uses Spark RDD, everybody uses Spark SQL." There's a sense, if you read a few places, you'll see people saying, "No, Spark RDD is deprecated and it's not to be used anymore." This sort of thing winds me up.

Matt:I can say you're saying this very calmly, given this is what part of your huff was about.

Richard:I'll build up into something a bit later on, but I should back off a little bit, because we'll make Spark the overall topic, I think, of the podcast. It's going to be the main technical thing we talk about. If anybody's not done Spark before, there's two, there's actually more, which we'll talk about later, but there's two main programming models. The Spark RDD that we covered in the module we released in February, that's the one that people say is the older way of working. That's much more like you might have worked in Hadoop, or much easier than working in Hadoop, it's a high ... You think in terms of MapReduce, still. Much better model than Hadoop, it's much less rigid, and you can chain ops together, but you're still doing kind of low level processing of data.

Matt:I'd say actually that the mindset is a bit more like working with a typical Java collection, or manipulating a collection, when you're working with RDDs. So it's a much easier way to get started than Hadoop.

Richard:Oh, absolutely, a million times easier.
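
To make the "collection mindset" concrete, here is a minimal sketch in plain Java streams. It is not Spark code and does not use the Spark API; it only illustrates the flat-map, map and reduce-by-key style of thinking described above. The class name `CollectionMindset` and the sample sentences are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class CollectionMindset {

    // Split each line into words (a "flat map"), normalise the case (a "map"),
    // then combine equal words into counts (the equivalent of a "reduce by key").
    static Map<String, Long> wordCounts(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(word -> !word.isEmpty())
                .map(String::toLowerCase)
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Spark is fast", "spark is easy");
        // prints {easy=1, fast=1, is=2, spark=2}
        System.out.println(wordCounts(lines));
    }
}
```

An RDD pipeline chains the same kinds of operations, with the difference that the data is partitioned and the work may be spread across a cluster.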

Matt:And you don't have that massive overhead you have in Hadoop, of having to set everything up. It's a much quicker process to start, actually as well.

Richard:It's just a Java programme, you've just got one dependency, well a couple of dependencies in a POM. It's so easy to get started, it's ridiculous. Java RDD, but it is still, most of what you do in that is going to be a map, or a reduce, or a ... There's operations, like sorts and things, but it's still ... You have to do a lot of low level crunching. Whereas, with the SQL programming model, and the first misconception is it's not about databases. It's saying you can use an SQL-style syntax to mine your data, and the data, when you're working in SQL, is more likely to be structured data, some kind of column-type format of input data, rather than just raw strings or whatever.

Matt:Yes.

Richard:But one of the key points we make on the second module is when you're working in Spark SQL you can think much more like a data scientist, think much less about Java code. We actually do a great comparison on the course, where we worked quite hard on a particular SQL statement, and we got a lot of value from that SQL statement. It's one line of code, really. I don't get the viewers to do this, but I've done it off camera, I've implemented the same job in RDDs, and you've got 100 lines of code, compared to the ... That's not fair actually, a lot of that is just helper code.

Richard:You have probably, I would say about 30 lines of code compared to the 1 line of code in the SQL. Now, I'm not claiming, this got lost in the edit, unfortunately, but to the point I was making, I wrote that RDD version at like 8:00 in the morning. Just rolled out of bed, I was in a bad mood, and just knocked it out. I'm sure you could polish it and tailor it and reduce it down to fewer lines of code. I was trying to make the point, I've got a job to do, I've got to get this out quickly. Would I do it with about 35 lines of code that I'm not quite sure are elegant, compared to the one line of SQL that did the same job? It's all about that higher level of thinking, and it's so elegant, Spark SQL.

Matt:That's interesting, because I'd previously done, actually, just as a comparison, it was for a talk I gave. Where I looked at different ways of achieving the same thing. One being writing SQL that would run within a database. One being writing it as Spark code. I think the third must have been as using Java, not within the Spark sort of framework. But the SQL effect, and this wasn't Spark SQL, this is just SQL, yes was very neat and elegant, and it was effectively one statement, but the process to write it took a long time, because there's a lot of trial and error to get it right, and when you look at it three months later, I have no idea what's going on. It's completely unmaintainable.

Richard:The SQL version?

Matt:The SQL version. Would that same concept still be ... Would you be able to look at the Spark SQL and say, "That's as easy to maintain and understand."

Richard:I would hope so. The whole point of SQL is supposed to be ... Is it a fourth generation language? I've lost track. Fourth generation I think. I don't explain this properly on the course I don't think, but I make the point that SQL is declarative. It's actually a mix: it's declarative, and there are procedural elements where necessary, but certainly the SQL you do in Spark is declarative.

Matt:Yes.

Richard:You are saying to an engine, "These are the results I want, go off and do it." And the engine has to work out what lines of code to generate. That's what SQL is always doing. We look at query plans in detail and all that on the course. So, the idea should be that an SQL statement is like an English statement, so I would argue. I don't know how complex, I imagine you did something pretty ...

Matt:Well, yes, there was some complication. In order to be able to get to the end result, I think we had to create a couple of temporary tables, that had been memory tables, and I seem to remember there were some sort of group bys and wheres and havings, all this stuff built in.

Richard:And you reckon your RDD version was easier to ...

Matt:It was certainly, in that one example, easier to maintain, because it was multiple statements. It was clear what each did. I guess, going back to that bit you said right at the start, which is that people say, "RDDs are deprecated, no one's using them."

Richard:You're trying to wind me up aren't you? You're trying to get me angry again. I sound all professional and slick in front of the microphone. Yeah, well, I will get there. I think what we've just identified there actually, is that, in Spark, you've got two completely different models that ultimately lead to the same thing. Whether you're working in SQL or in Spark Core, RDDs are the end result.

Matt:Yes, because the SQL objects, if you like, are an abstraction of RDDs.

Richard:What happens is code generation, and we look in a little bit of detail at what's happening under the hood in this regard. A code generator works from the SQL, and writes on the fly Java code that is the same Java code you would've ... It's not, sorry. Not the same Java code you would've written, but it is the same sorts of things. It actually turns into a large series of map operations on RDDs. It is the same thing, just a different way into it. And the joy of that is ... I guess it depends on whether you're a data scientist or a programmer, or a mix of the two, or whatever. You can use whichever model is best for your job. Certainly, in the example on the course, SQL was by far the best in my opinion, but you can always use ... I think if you're working with unstructured data, you've got a long series of strings or words to count for instance, then RDDs are probably the best choice. You've got both available.

Matt:Yes, and this structured versus unstructured. I want to get to this point about why it's not true to say No one uses RDDs, because obviously people do. And of course you can use ...

Richard:Well, actually, I don't know. I don't have figures for how many programming projects ... I'll tell you where I think this has come from.

Matt:Okay.

Richard:In the Spark ML module, which you're currently working on a training course on that. On the Spark ML module-

Matt:Just for the benefit of listeners, ML stands for Machine Learning.

Richard:Yeah. Now traditionally, you worked in Spark ML using raw RDDs, and recently, I don't know when, have you got a handle on when?

Matt:It's within the last year, I think.

Richard:They have decided not to support that model anymore, so they have deprecated RDD as a programming model in Spark ML.

Matt:Actually, to be a bit more precise, there are two packages where all the different functions sit within Spark Machine Learning, called Spark ML and Spark MLlib. One of those packages uses RDDs, the other one uses Spark SQL.

Richard:And it's ingrained sort of [smogged 00:10:26].

Matt:And the RDD bit, which is now in maintenance mode, and they have said may not be supported in the future.

Richard:Exactly.

Matt:So it is just Machine Learning.

Richard:Exactly. I think people have heard that, and now it's all, "Oh, nobody uses, oh no, you wouldn't switch to RDDs anymore." But I don't know how ... We never get figures on this. It could be anything. It could, indeed, be that everybody's using Spark SQL, but that would be to miss the point. I can guarantee you if I had started my Spark journey by learning Spark SQL, I wouldn't have a clue what was going on at any stage. Maybe I could get away with that.

Matt:The critical thing that is almost impossible to do in Spark SQL, unless you tell me you've worked out how to do it, and it's on this course, is optimization. Because you get so much detail on the execution plan when you work with RDDs, that if you've got a slow running job, you can really pinpoint and find opportunities to improve it.

Richard:It's probably because you get the DAG, you get this beautiful graphical chart showing you every step in the process and everywhere where there's an expensive operation, a new column appears, and you can see where the expense tends to fall. Just for people that haven't done any Spark, where there's a shuffle, you can see that. But that's very easy to do, because basically every operation you perform in Spark RDD, it can just put that on the graph. You've done it all, so it's obvious. The problem when working in Spark SQL is all of those, basically every step in a stage, so everything that doesn't result in a shuffle, is all combined together, and it runs through a code generator. The whole-stage code gen. That is just mind-blowingly complicated and you can't see any of those steps. So if you write an SQL statement with a group by clause, you'll just get a blob, basically. But I would argue that you don't need to look into that, you have to take on faith that in those steps, there's no shuffle in there anyway.

Matt:Yes.

Richard:Well, it's just a lot of maps that have been combined together, really. You don't know or care what's going on.

Matt:No, but the problem is that you might have one, let's call it SQL, statement which generates two or three shuffles, and you don't know.

Richard:You'd see the shuffles, the shuffles would be clear. They wouldn't be ... Whole stage code gen doesn't ...

Matt:But you don't know why those shuffles are necessarily having to happen.

Richard:That's true, it's harder to tell, so we have got a great example, I think, on this course. I'm pleased with this. I hope our competition doesn't have anything as good as this, really. It's quite a simple statement, don't get me wrong. It's a training course, we want to be clear, but we have all the way through the course. By the way, one thing we haven't mentioned ... Oh dear, this is quite complicated, isn't it? Spark SQL, itself has two programming models.

Matt:Yes.

Richard:One of which is SQL, so actually Spark SQL isn't really called Spark SQL, is it? It's called Spark SQL and the Data Frames API, which also has another name. It's just an incredibly ...

Matt:It's a mess.

Richard:They made a mess of the marketing of it, frankly. You get people saying, "Oh, yes, we use Data Frames." There's no class called DataFrame in the API anymore, it's Datasets. Unpicking that was a royal nightmare, and making that clear. You've got two programming models, anyway, in Spark SQL. They're actually the same thing. You can write code in Java that gets you these data frames. Actually, all that's doing, it's still declarative Java is what I'm trying to say. You're not saying, "Do a map, do this, do that." You're still just saying, "Do a select, do a group by," it's just another way of writing the same syntax.

Matt:Yes.

Richard:Theoretically, all that should be happening, when you're using the SQL API, is that there's just going to be a step where it has to parse that SQL and convert it into data frame types. I don't know if this is exactly how they do it, but conceptually, I'd imagine that's a few milliseconds to do that. Then you've got to the same kind of Java that you would've written. But on the course, we observe that, and it's relatively simple, I'm just doing a log file, and I'm counting how many errors, fatals, and warnings there were by month. So, easy. The SQL version was twice as slow.

Matt:Really?

Richard:I meant to say half the speed of using the data frame API. Half the speed.

Matt:Gosh.

Richard:And that was replicable on a cluster as well, so it wasn't just because it was a development machine and there was a start-up time or something. So like a half-hour job on a cluster is an hour, easily, for the SQL version. Fascinating. We do this in the performance chapter, we look at the execution plan, but at first glance I'd say it's much, much, much harder to understand than the RDD version. You wouldn't stand a chance of understanding this, had you not learnt RDDs first. That's what we're trying to get at.
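
For readers who want to picture the log-counting job mentioned above, here is a rough sketch of the imperative side of the comparison, written with plain Java streams rather than Spark. The input format (`LEVEL,yyyy-MM-dd HH:mm:ss`), the class name `LogCounts`, and the table and column names in the SQL comment are all assumptions made for illustration; they are not taken from the course.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class LogCounts {

    // Declarative version, for comparison only (hypothetical table and column names):
    //   select level, date_format(datetime, 'yyyy-MM') as month, count(1) as total
    //   from logging_table group by level, month
    //
    // Hand-written version: parse each line, build a "LEVEL yyyy-MM" key, count per key.
    // Assumed input format (hypothetical): "LEVEL,yyyy-MM-dd HH:mm:ss"
    static Map<String, Long> countByLevelAndMonth(List<String> lines) {
        return lines.stream()
                .map(line -> line.split(",", 2))
                .filter(parts -> parts.length == 2)              // skip lines with no comma
                .map(parts -> new String[] {parts[0].trim(), parts[1].trim()})
                .filter(parts -> parts[1].length() >= 7)         // need at least "yyyy-MM"
                .collect(Collectors.groupingBy(
                        parts -> parts[0] + " " + parts[1].substring(0, 7),
                        TreeMap::new,
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "WARN,2016-12-07 04:19:32",
                "FATAL,2016-12-08 11:02:10",
                "WARN,2016-12-09 23:45:00",
                "ERROR,2017-01-02 08:00:01");
        // prints {ERROR 2017-01=1, FATAL 2016-12=1, WARN 2016-12=2}
        System.out.println(countByLevelAndMonth(lines));
    }
}
```

Even in this toy form, the parsing and key-building steps are spelled out by hand, whereas the SQL statement in the comment only states the result that is wanted.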

Matt:Yes.

Richard:We're being told already, "Nah, I don't want to do the RDD. Nah, just straight to SQL," and I think that is a ... It winds me up, really.

Matt:Yes.

Richard:The problem we have with the course is basically the start of the module says we're assuming you've done module one, or you already know Spark RDDs, and people are complaining about it.

Matt:Which is interesting. Going on to the Machine Learning bit, in a production environment, in my background, in IT, which started off in banking, okay? One of the things that they did was what we'd now call Machine Learning. It wasn't quite called that at this point. For every loan that this bank had, every single month, they wanted to calculate, probably, let's say, whether that was a good or a bad customer, right? A simple process: they were running a formula on a massive amount of data, and that ... That's not the Machine Learning bit, because they've got the formula. The Machine Learning bit is creating the formula. So the run in production is exactly the sort of thing you'd want to use Spark for. You've got an algorithm you want to apply to millions of records, and you want to run it reasonably quickly.

Matt:Perfect use case for Spark, and that's the kind of thing that would work well with RDDs or SQL, as your choice. Then in fact, because the regulator had a requirement that they submitted information by a certain date every month, actually it was time critical, and I suspect if they were doing it now, in Spark ... they weren't using, obviously, Spark in those days, this is 50 years ago. They might well say, "Right, criticality is everything in terms of timing. RDDs are certainly not slower than SQL."

Richard:Oh, no.

Matt:And you have more ability to optimise. They would start off, I am sure, building that as RDDs. This structured-

Richard:By the way, we did compare the performance of, it's complicated. We've now got three: we've got RDDs, we've got the SQL syntax, and we've got Data Frames, which is part of SQL. We compare the performance of all three, and show them. That's quite good fun.

Matt:I'm looking for that, but I haven't seen that yet. So I'm not sure where to get that.

Richard:Although I'm never sure whether they give stuff away on the podcast. I don't want anyone to spoil it for me.

Matt:No, don't spoil it. Don't spoil it for me, I want to watch that.

Richard:I should certainly say it, just in case anybody misquotes what I just said there. The end result on the course is that between Spark SQL and Data Frames there is no appreciable difference in performance. There was a reason for that 50% slower.

Matt:Right.

Richard:I would never have been able to spot that by looking at the code. Not in a million years would I know, "Duh, look what I've done there." I'd made, not a mistake, as such, but it was a sub-optimal way of doing a query.

Matt:Okay.

Richard:I would challenge anybody to look at that and say, "How can I refactor it?" The obvious refactorings all made no difference whatsoever, and it's only by explaining the query plan that you can spot what the difference is. That gives you a clue where to start.

Matt:Right.

Richard:Actually, there's no documentation, as far as I can see, anywhere on this topic.

Matt:Interesting.

Richard:I had to go to Postgres, into their reference material. What's PostgreSQL got to do with Apache Spark? Nothing. It's just they use a similar algorithm. So they've documented this algorithm. I was able to get the information from there, and from looking at the Scala and Java source code. We've gone quite cutting edge with this. We go pretty deep. I hope it's all understandable and explained well. I'm sure we'll get feedback on that. We've got, yeah ...

Matt:Yeah. I want to, if you don't mind, just go back to ... Where I was trying to get to-

Richard:Yeah, sorry, I digressed you there.

Matt:RDDs are absolutely usable and being used in production environments, for the purposes of processing large amounts of data.

Richard:Yes.

Matt:Why have they turned off, if you like, or are turning off the use of RDDs for Spark SQL. I believe the reason is-

Richard:For Spark ML.

Matt:Sorry, for Spark ML, thank you, and I believe the reason is that Spark ML, and once the course is released and people have watched it, they'll hopefully get this, is a very different use case. Machine Learning is about coming up with that algorithm, that formula, that you then want to apply to your large data. Although Spark is a great environment for doing Machine Learning, don't get me wrong, actually Machine Learning is not that sort of regular production job you're running time after time over large amounts of data. Machine Learning is a sort of intuitive one-off process that you may then run quarterly, half-yearly, to check your algorithm's still valid. It's not normally ... Sorry, let me get that sentence right. Normally performance is not your key thing.

Richard:So, presumably, they didn't want to maintain two separate programming models, so it's just easier for them to work.

Matt:Exactly. It's easier to build, using the SQL model, and in actual fact, I don't see why you would want to look at the execution plan. Because actually, how long it takes to run is not what's important. What's important is the goodness of your outcomes.

Richard:Definitely. I wouldn't expect to see a chapter on performance on the Spark ML course. Wouldn't think of it.

Matt:There won't be one.

Richard:Irrelevant.

Matt:The word performance is about the performance of your algorithm. Is your algorithm performing correctly? Spark as an infrastructure is almost irrelevant.

Richard:By performing there, you mean giving the right answers right now, yes.

Matt:Or getting good enough answers, yes. But of course the point of Machine Learning, or what you'll hopefully learn when you do this course, is that you're not going to build your model, build your algorithm, around a huge amount of data. You're going to build it on a small amount of data, and then see if it's a good predictor on a bigger data set, so Machine Learning is less about big data. The reality is you want to use a large amount of data ideally, and Spark allows you to do it, which is why it's a good platform.

Matt:The use case for Machine Learning means that actually, I keep just repeating this, the advantage of RDDs, which is that greater control over performance, is just something that is generally not going to be that relevant. Now with that said, in a production environment, I suspect you would probably implement your algorithm, which tends to be a relatively simple take this figure, add this to it, multiply it by that. Simple maths, you might well do that with RDDs. Of course, we talk about RDDs for unstructured data. You can have an RDD of an object that you've built in Java, of course. You can use-

Richard:But you would have to do the work on any raw data as well, if it was a text file. You've gotta build up the objects yourself.

Matt:To do the importing it. Absolutely. There's some work.

Richard:Whereas with SQL, it will automatically infer the schema from the-

Matt:Well, that's true, and I don't know if you've covered this in the course. Again something you can do is use Spark SQL to import your data from its text format. Then do a bit of manipulation to convert that to an RDD very easily, and then do your work in RDD format. I think that might well be the-

Richard:I don't think we ever bother doing that in the course, it's mentioned as an option, but-

Matt:To be fair, to go from SQL to RDD is pretty much one line of code, going the other way is a bit more complicated.

Richard:Okay.

Matt:RDDs are absolutely not dead. That was the message we wanted to get out there.

Richard:It's only a few people complaining that, "Well, if you don't RDDs first, and you should've ... Oh no." I can say there's no way I would've ... I would never have understood Spark SQL had I not started with RDD.

Matt:Actually, if you're going to do this, if you want to say you are proficient and good at this, just not knowing RDDs wouldn't allow you to. There's boxes, you'd miss so much of-

Richard:I don't want to be rude to customers who've maybe complained, but there is a large sector of our industry that just want to ... I'm pragmatic, by the way, I like to get a job done, and I'm not an academic. Maybe it's because of the industry we're in. We're trainers, we have to unpick some things that in real life we probably wouldn't have concerned ourselves with. We do need to go a bit deeper. I still think this argument, "Oh, you don't need to know that," or "All that's under the hood." We could go all the way back to assembly here, and should a good programmer know assembly, or some assembly language. It's back to that really. The most obvious example I struggle with mainly, with my courses, is the old Spring business. It's happened at least two or three times, with Spring. Most famously with the XML fiasco. Spring used to just have an XML configuration file format. Then they moved to what I think is an even worse format.

Richard:Personally, I think it's worse, but the industry adopted it, and then we looked completely old fashioned and out of date because we were still using ... It's really frustrating to us, as trainers, that we know that's irrelevant. We know it's irrelevant, but we can't convince lots of customers that it's irrelevant. Because you're learning the principles, it doesn't matter if the syntax that you're using is rubbish, it doesn't matter what syntax you're using. It's the dependency injection, where AOP is being used, where the transaction boundaries are in a real system. We cover that in great detail on our Spring Fundamentals course, and yet people still say, "That's all dated, because you're not using ..." Don't get me wrong, by now we've added all of the Java config, and we've added those models. And that goes around and around again. We now have Spring Boot, anybody can use Spring Boot now in two and a half minutes. You're up and running.

Matt:Yes.

Richard:So do you need to know about dependency injection? What a waste of time, learning dependency injection and AOP when I can use Spring Boot. It's that again. It's well, yeah, you can use Spring Boot in two and a half minutes, but you've hit your first problem, and you will, and you don't know what's happening under the hood, you've got to call a mechanic who does know.

Matt:Absolutely.

Richard:You want to be that rubbish developer who's just using APIs?

Matt:I've had this similar conversation actually about exactly Spring Boot, with somebody recently on the phone. One of our customers. I was saying to him, "You could start with Spring Boot, but then go back and study fundamentals and fill in the gaps."

Richard:Absolutely right.

Matt:I'm speaking of MVC as well. We've got a course, it was a bit controversial as to whether we did this course, called Java Web Development Under the Hood. How to build websites in Java, with no frameworks. One of the places we've put this course is on Udemy, right? On Udemy, it doesn't have the world's greatest viewing figures, but it's interesting that the comments on there are saying things like, "Wow, this has really filled in gaps in my knowledge, thank you." And yet the purpose for that course, our thinking behind it, was that if you're using Spring Boot or Struts or any of these frameworks, and things start to go wrong and you can't work out why, actually understanding what's really going on with servlets and JSP, and all that kind of stuff, really can help you figure out the right way around it.

Richard:It's on Virtual Pair as well I'm sure, as Java Web Development. Just has a slightly different name, for a different audience, but no one's missing out there.

Matt:No, absolutely.

Richard:Exactly that. I would like to think that most of our customers get that, we want the background, we want the detail.

Matt:Yes.

Richard:It's just frustrating when the day you release a course, the first comment you get is, "Huh, why have you done that? You should've done this because this is what we're using on my project." I don't want them as customers.

Matt:Don't say that, we need their money.

Richard:I can't?

Matt:No. It's interesting. It's very difficult to get a sense of the massive breadth of people using this kind of technology in massively different ways. One of the ways we do it is we try, whenever we can, to go to meet ups, meet with people who are doing it. See what they're doing, talk to them and find out what they're doing. It's through that you get a sense that if you're just in your little silo, you think this is how we do stuff, therefore, this is the way it is done. And anyone who's suggesting something different, they don't know what they're talking about, which I think we would fit in that scene. We all have those sorts of moments. But there we go ...

Richard:By the same token, we are very weak for example on modern development with JavaScript front ends and REST back ends. We've got very little in the library on that at the minute, and we know that's the most common way. Again, there's no figures, we don't know, but it's what people talk about so that's the most likely way. Relatively few, I think, are doing servlet based and whatever framework. It's more likely to be a JavaScript front end. So we'll come back to that, and we will improve that, we'll get there.

Matt:Yes.

Richard:It's just this to-do list is quite, it's quite twisted isn't it?

Matt:Absolutely. It is.

Richard:Definitely we will bridge that gap soon. It's the people wanting short cuts. We can't help people who want the short cut.

Matt:That's not what we're about. We were actually just chatting over coffee about that. One of the things that makes us different ... There are competitors out there, let's be honest. You will find some of our courses on other sites, not just our own. Where we are, therefore, competing much more directly with some of our competitors. But what we hope makes us stand out is that we go deeper, we explain things better, we cover a lot more of the fundamentals. We're not just giving you the quick win. We don't just try and do a show and tell. That's what people are buying when they buy from us, that much greater in-depth knowledge. Which hopefully means you come away with much more confidence that you can do this with a greater level of efficiency. That's what we're about as a company.

Richard:That's what training's about really, isn't it, yes.

Matt:That's what it should be about. The fact is that there are now these platforms, I've mentioned one already, Udemy, where there are hundreds of people out there, and anyone can, tomorrow, put a training course on Udemy and what the platform people check is the quality of your audio and video, not the quality of your content. Now, okay, if you put rubbish up there, you're likely to get not very good reviews quite quickly and you won't make any money. We have developed a whole process of learning and reviewing what we need to teach in such depth, hopefully, that we can focus on the things that make a difference.

Richard:Sure, yeah. So, I had quite a rant about that this morning. I'm pleased with the course. I'm looking forward to the Machine Learning bit, we've kind of forgot about streaming, so I think I might go away and do a bit on streaming. I think if nothing else, I keep saying this, but even if you're not doing data science or data processing, I think Spark is a fascinating library to work with. It's not really a library, is it? Framework to work with. If only to get some practice at doing functional style Java, which is still a bit unusual. You can do a lot of lambdas on this course, even on the second module.

Richard:It's less lambda-ish, but there's still plenty of lambdas in there. I've found that when we work in other languages we don't worry too much. It's natural and easy, but somehow to me it still feels a bit awkward in Java, but actually it's a pleasure to do it. It's great to see Java right up there with the Scalas and Pythons and being expressive and clean. I love it. If we've alienated anyone who's just on the podcast for Java, we should rename the podcast, really, we never talk about Java.

Matt:Well, everything's Java-related, that's not really related at the end, but I agree with you. Any good developer needs to be able to put something about big data on their CV, and this is the way, one of the ways to do it. If you have some experience with Spark, at least it means that if you're ever talking to people, you can talk with some confidence, right? Because you get one of the challenges around manipulating and working with big data, and actually one of the ways to do it, so it's not that difficult as a programmer to do. One of the beautiful things about Spark, which Hadoop doesn't do, is it lets you forget about the size of the data most of the time and concentrate on the programming.

Richard:You did a seminar on this. Doesn't have to be big data, particularly. We mean multi-terabyte data when we're talking big data.

Matt:Yes.

Richard:It's a multithreaded execution framework, and you can run it on a local, it wouldn't be a local desktop, but you can run it on a single JVM without a cluster, and you're going to get the benefit of true parallelism. If you've got four cores, or whatever, it will distribute the load across them. You could be working with just a gigabyte or something. Do you want to sit there doing threading? No. It's a great framework for that as well.
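
Spark is not part of the Java standard library, so the sketch below is only a loose analogy for the point about not writing threading code yourself. It uses Java's own parallel streams on a single JVM, and the class and method names are invented for illustration. The work is described once and the runtime decides how to split it across the available cores.

```java
import java.util.stream.LongStream;

// Hypothetical example class: an analogy using parallel streams, not Spark.
public class ParallelSum {

    // Sum of squares from 1 to n, computed on the calling thread only.
    static long sequentialSumOfSquares(long n) {
        return LongStream.rangeClosed(1, n).map(i -> i * i).sum();
    }

    // Same computation; parallel() lets the runtime split the range
    // across worker threads without any explicit Thread code here.
    static long parallelSumOfSquares(long n) {
        return LongStream.rangeClosed(1, n).parallel().map(i -> i * i).sum();
    }

    public static void main(String[] args) {
        long n = 1_000_000L;
        System.out.println("cores available: " + Runtime.getRuntime().availableProcessors());
        System.out.println("results match: " + (sequentialSumOfSquares(n) == parallelSumOfSquares(n)));
    }
}
```

The only difference between the two methods is the call to `parallel()`, so no thread management appears in the application code.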

Matt:Also, if performance is not your critical requirement, actually it's a very easy programming environment. If you've got to do some quite complex manipulations on data, it's a much quicker and easier way to write it and maintain it, than ... Well, what would your alternative be? It would be writing SQL statements or writing underlying Java. That's another great use-case, I think.

Richard:I think you're right. That still assumes you'd have to have it in a database. You might have a flat file that you want to work with, and you load it into Spark. Even I've enjoyed doing the course, and it's not really my field. I'm more of a middle tier type person, with a slight leaning towards front end as well, for some reason. It wouldn't normally be my field. There's a lot to enjoy in Spark, even if you only do the course for the performance chapters. I think they're market leading, our performance chapters. It's great.

Matt:You said, "Market leading." I've not seen anybody else covering that kind of material anywhere. It is the market, I think.

Richard:It will be now, though. Once we sell a copy.

Matt:Oh, the copy.

Richard:I don't like it so much. No, it's fine. It's all being Spark and stuff. I'm going to go off and golf and do some material on Istio next. That's my big thing.

Matt:Istio? What's Istio?

Richard:Istio is a framework that sits on top of Kubernetes for better monitoring, in particular, things like tracing. If you've built a complex microservice architecture, and something's going wrong, the real difficulty there is where is it going wrong? You could have a request coming in that then bounces through 170 different microservices and something along the way is failing. How on earth do you ... So it's got things like tracing in it, so you can watch your request and see exactly where it's gone across the entire system, with timings, and it just tickles my geek radar.

Matt:Lovely.

Richard:It's a small subject, so it might be an hour module or something. I always say that, and then it turns into a 55-hour masterpiece.

Matt:We should look forward to streaming first, and then that from you.

Richard:Streaming and Istio. Well, I'm juggling the two. I don't know which order I'm going to do them in.

Matt:I'm currently going to be finishing the Machine Learning, that's definitely going to be my next one, and then I am planning, and I feel brave saying this, because this is going to be months of work, I'm sure. I'm planning a course on Java performance. Whether that will be troubleshooting performance issues in your applications, or whether it will be more from the point of view of thinking about... Well, I think there's going to be a few aspects to it. One will certainly be about JVM tuning, and when to be thinking about it, and what kind of options are out there. There'll be something around the tools that you get with, certainly OpenJDK, possibly the extras that you get with the Oracle JVM, around monitoring and understanding application performance. I'm hoping, and this is the more challenging bit maybe, something around coding changes, and how they affect performance as well. I'm thinking about that at the moment. That, I'm hoping, is going to be my next biggy, but it will be a biggy, so it's going to be months away yet, I'm afraid.

Richard:That's a tough one to do. I would be terrified taking on that. If it's right, if it turns out well, it will be an absolute smash hit. It will be a runaway performer.

Matt:I'm doing it on the back of the fact that one of our best performing courses, from when Virtual Pair Programmers was started, is actually a part of our Java Advanced Topics course. It's the whole section on memory management. We've released that as a standalone, just the memory management part, on a couple of other platforms, where it does very well for us. There clearly is a desire for more of that kind of thing. This will be effectively a follow-on to that. It will be going further. Memory management's not quite the right phrase. It should have been called "How memory works in Java," really, shouldn't it?

Richard:Yes, it's certainly a sloppy title.

Matt:But that, getting into the depths of the JVM and how you can make changes to benefit from the optimizations the JVM can do, is a big topic, is a difficult topic, but I'm hoping to make that my next big one.

Richard:It's one that people seem to get very excited about. I wonder if maybe it's because people, I think you were touching on this this morning, think they're missing out on something. Am I using the wrong garbage collector, or should I just run with the defaults? I wonder if a lot of it is necessary, and that will be the end result of the course, saying where you should be focusing your efforts. It might be that you don't ever need to change any defaults.

Matt:It might be, but I think that there will be ... I have never thought about, in a production environment, what should I be setting my heap size to? Just as a really simple example, I look after a server at the moment that has two JVMs running on it, and I have never done any kind of setting of JVM flags. Yet I know, this is a [pre-cust] server, well, I'll have to go into a bit more detail in the course, that when garbage collection runs, it will attempt to use every bit of processing power it can from your machine, for the shortest time possible.

Matt:That's what garbage collection is set up to do. But if you have two JVMs on the same machine, what happens? That's the kind of thing you want to be putting a little flag on, to say, "Never use more than 50% of the resources," for example. Little thing, but it's the kind of thing that I've never thought about, and I've got this server live, never even worried about it. If I had five JVMs, that might well cause an issue. To be fair, these are actually websites that have very low usage, so I'm not concerned, but these are the kind of things that are worth understanding. That's a flavour of the kind of things we'll be going into, but don't hold your breath for that one, it will be a little while off, I'm afraid.
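
To make the "little flag" idea slightly more concrete, here is a minimal sketch that reports how many cores the JVM can see and which collectors it is running. HotSpot has a real `-XX:ParallelGCThreads=<n>` option that caps the number of parallel GC worker threads. It sets a thread count rather than a percentage, so it is only an approximation of the "50% of the resources" idea. The `gcThreadBudget` helper and the even split between JVMs are assumptions made for this illustration, not a tuning recommendation and not material from the planned course.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class for inspecting the running JVM.
public class GcInfo {

    // Names of the garbage collectors the running JVM reports.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    // Assumed policy for illustration only: share the cores evenly
    // between the JVMs on one host, never going below one thread.
    static int gcThreadBudget(int cores, int jvmsOnHost) {
        return Math.max(1, cores / jvmsOnHost);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("cores visible to this JVM: " + cores);
        System.out.println("collectors in use: " + collectorNames());
        // Prints a flag you could try on the command line when two JVMs share the host.
        System.out.println("example flag: -XX:ParallelGCThreads=" + gcThreadBudget(cores, 2));
    }
}
```

Whether capping GC threads helps at all depends on the collector and the workload, which is presumably the sort of question the course would need to answer with measurements.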

Richard:Good luck with it.

Matt:Well, I might be coming for some help from you, it should save ...

Richard:Blimey, I'm not sure. I'm just looking at your stack of books that you've been working through. That's a scary pile.

Matt:I didn't purposely leave those out. Obviously, when we do courses like this we go and buy every book we can possibly find, and start researching our topics.

Richard:We don't read them, of course. These books look pristine. They've never been opened.

Matt:I must confess, the biggest one I took with me when I went on holiday, and I sat by the pool and I found I slept more restfully that day than I have ever done before. It's been brilliant. Anyway, there we go. That's what we've been up to, that's what's hopefully coming up in the next few weeks and months.

Richard:And as always, Matthew, speaking to you has cheered me up no end. I came in this morning in such a mood. Now I'm cheerful and joyful, and I'm going to carry that through the rest of my day.

Matt:Well, there we go. I just want to say that we mentioned Udemy, and we also have courses now live on a couple of other platforms out there. Which means I'm sure we'll be getting people listening to this podcast for the first time, who aren't our traditional Virtual Pair Programmers customers, so welcome to you guys, if you are listening for the first time, and if you have any thoughts or questions, go through the allthingsjava.io website and I'm pretty sure there's a contact link on there where you can send us any thoughts or feedback. Which we are always glad to see.

Richard:Matthew adores feedback. Any kind of feedback, doesn't matter if it's negative. In fact, negative feedback is even more welcome for Matthew.

Matt:On that positive thought, thank you. And I think we should probably leave it there for today.

Richard:Yes. See you next time.

Matt:See you next time. Thank you for listening.
